singlefabric
Product Introduction

Basic Concepts

This section defines the key terms used in the AI computing cloud service.

Name | Explanation
--- | ---
Kubernetes | Kubernetes (K8s) is a portable, extensible, open-source platform for managing containerized applications and services. It automates application deployment and scaling, and schedules containers across the machines in a cluster.
Resources and Environment | The AI computing cloud environment uses Kubernetes for management and scheduling. It provides NVIDIA GPUs, an online IDE environment, integrated parallel file storage for persistent storage, TensorBoard for online charts, and task fault tolerance with automatic retries.
Resource Group | A set of resources dedicated to a single user. Resource groups are created in advance, and physical computing nodes are reserved for them. Users can run tasks on their dedicated hosts, which reduces queuing and gives them exclusive use of the resources, and they can add or remove nodes as needed.
Mirror Repository | The platform's container image repository. It includes built-in images for commonly used container computing scenarios, covering frameworks and tools such as PyTorch, TensorFlow, and Jupyter. Users can also create their own image repositories and build custom images from public images or Dockerfiles.
Container Instances | Used for algorithm development and model fine-tuning, especially with small training datasets. Users can request single-card or 8-card instances, use local data disks and file storage, and develop algorithms in Jupyter. After training, users can save the results to mounted shared storage or download them, and then release the instance.
Distributed Training Tasks | Provides a quick way to start distributed training tasks across multiple machines and multiple cards, so that users can focus on running large-scale training. The system automatically schedules the required nodes based on the selected specification and quantity.
Parallel File Storage | A high-performance, scalable distributed file system designed for parallel computing environments. It provides efficient, persistent storage for large-scale computations.
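
To make the "specification and quantity" idea behind Distributed Training Tasks more concrete, the following minimal sketch shows how a selection of cards per node and number of nodes could map to one training process per card. It is an illustration only, not the platform's scheduler: `plan_processes` and `ProcessEnv` are hypothetical names, and the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` variable names are assumed from the convention used by PyTorch's `torchrun` launcher, which this page does not confirm the platform uses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessEnv:
    """Settings for one training process (hypothetical helper type)."""

    node_index: int   # which machine the process runs on
    rank: int         # global index of the process across all machines
    local_rank: int   # index of the card used on that machine
    world_size: int   # total number of processes in the task

    def as_env(self) -> dict[str, str]:
        # Variable names follow the torchrun convention (an assumption here).
        return {
            "RANK": str(self.rank),
            "LOCAL_RANK": str(self.local_rank),
            "WORLD_SIZE": str(self.world_size),
        }


def plan_processes(cards_per_node: int, node_count: int) -> list[ProcessEnv]:
    """Return one ProcessEnv per card for the selected specification and quantity."""
    if cards_per_node < 1 or node_count < 1:
        raise ValueError("cards_per_node and node_count must both be at least 1")

    world_size = cards_per_node * node_count
    return [
        ProcessEnv(
            node_index=node,
            rank=node * cards_per_node + local,
            local_rank=local,
            world_size=world_size,
        )
        for node in range(node_count)
        for local in range(cards_per_node)
    ]


if __name__ == "__main__":
    # Example: two 8-card nodes give 16 processes in total.
    plan = plan_processes(cards_per_node=8, node_count=2)
    print(len(plan))
    print(plan[-1].as_env())
```

With two 8-card nodes, the sketch yields 16 processes, and the last one runs on the second node with local rank 7 and global rank 15. A real launcher would also need a rendezvous address and port, which are omitted here because this page does not describe them.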