Basic Concepts

Name	Explanation
Kubernetes	Kubernetes, commonly abbreviated as K8s, is a portable and scalable open-source platform for managing containerized applications and services. It automates application deployment and scaling, and schedules containers across the machines in a cluster.
Resources and Environment	The AI computing cloud environment uses Kubernetes for management and scheduling. It provides NVIDIA GPUs, an online IDE, integrated parallel file storage for persistent data, TensorBoard for online training charts, and task fault tolerance with automatic retries.
Resource Group	Resources reserved exclusively for a user. A resource group is created in advance, and physical compute nodes are reserved for it. Users can run tasks on these dedicated hosts, which avoids queuing and gives them sole use of the resources, and they can add or remove nodes as needed.
Image Repository	The platform ships built-in container images for common container computing scenarios, including PyTorch, TensorFlow, and Jupyter. It also provides user-defined image repositories, where custom images can be built from public images or Dockerfiles.
Container Instances	Used for algorithm development and model fine-tuning, especially with small training datasets. Users can request single-card or 8-card instances, use local data disks and file storage, and develop algorithms in Jupyter. After training, they can save results to the mounted shared storage or download them, and then release the instance.
Distributed Training Tasks	Provides a quick start for distributed tasks across multiple machines and multiple cards, so users can focus on running large-scale training. The system automatically schedules the required nodes based on the selected instance specification and quantity.
Parallel File Storage	A high-performance, scalable distributed file system designed for parallel computing environments, providing efficient and persistent storage for large-scale computations.
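
The multi-machine, multi-card layout described under Distributed Training Tasks can be illustrated with a small sketch. It assumes the common convention of one worker process per GPU, with global ranks numbered node by node. The names TaskSpec, world_size, and global_rank are hypothetical and are not part of the platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task specification: node count and GPUs (cards) per node."""
    nodes: int
    gpus_per_node: int


def world_size(spec: TaskSpec) -> int:
    """Total number of worker processes, assuming one process per GPU."""
    return spec.nodes * spec.gpus_per_node


def global_rank(spec: TaskSpec, node_rank: int, local_rank: int) -> int:
    """Map (node index, GPU index on that node) to a unique global rank."""
    if not 0 <= node_rank < spec.nodes:
        raise ValueError("node_rank out of range")
    if not 0 <= local_rank < spec.gpus_per_node:
        raise ValueError("local_rank out of range")
    # Ranks are assigned node by node: node 0 holds the first block of ranks.
    return node_rank * spec.gpus_per_node + local_rank


if __name__ == "__main__":
    # Assumed example: two 8-card machines.
    spec = TaskSpec(nodes=2, gpus_per_node=8)
    print("world size:", world_size(spec))
    for node in range(spec.nodes):
        ranks = [global_rank(spec, node, gpu) for gpu in range(spec.gpus_per_node)]
        print(f"node {node}: global ranks {ranks}")
```

Under these assumptions, a task that selects two 8-card machines runs 16 workers, and each worker can derive its own rank from the node index and local GPU index without any central bookkeeping.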