View Task Details
Learn how to view task details in the AI Computing Platform and monitor the progress of distributed training tasks.
Introduction
This document describes how to view task details in the AI Computing Platform and monitor the progress of distributed training tasks.
Prerequisites
A distributed training task has been created successfully.
Procedure
- Log in to the management console.
- In the top navigation bar, click Products and Services > AI Computing Platform > AI Computing Platform to go to its overview page.
- In the left navigation bar, select Distributed Training. The distributed training task list page is displayed by default.
- On the distributed training task list page, find the row of the target task and click Task Details in the Operation column on the right to open the task's basic information page.
- On the task basic information page, view the Task Information, Task Running Information, and Billing Resources sections.
- On the task details page, click the Pods tab to view information about the container group (pod) used by the current training task, including:
  - Container Group Name/ID
  - Status
  - Within the container group:
    - Node Name/IP Address
    - Allocated GPU Cards
    - GPU Utilization
    - GPU Memory Utilization
    - CPU Usage
    - Memory Usage
    - Creation and Update Time
    - Monitoring
- On the task details page, click the Log tab to view the log output of the current training task. After the task completes, its pods are no longer displayed.
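When a task runs on several pods, it can help to note the values shown on the Pods tab and compare them side by side, for example to spot a pod whose GPUs are mostly idle. The following is a minimal sketch under stated assumptions, not a platform API: this page does not describe any export format or programmatic interface, so the `PodSnapshot` class, its field names, the `Running` status string, the sample values, and the 20% threshold are all hypothetical and stand in for values copied by hand from the console.

```python
from dataclasses import dataclass


@dataclass
class PodSnapshot:
    """One row noted from the Pods tab (hypothetical field names)."""
    name: str                      # Container Group Name/ID
    status: str                    # Status (the string "Running" is an assumption)
    node: str                      # Node Name/IP Address
    allocated_gpus: int            # Allocated GPU Cards
    gpu_utilization: float         # GPU Utilization, percent (0-100)
    gpu_memory_utilization: float  # GPU Memory Utilization, percent (0-100)


def find_idle_pods(pods, threshold=20.0):
    """Return names of running pods whose GPU utilization is below threshold."""
    return [
        p.name
        for p in pods
        if p.status == "Running" and p.gpu_utilization < threshold
    ]


# Sample values are invented for illustration only.
snapshots = [
    PodSnapshot("worker-0", "Running", "node-a", 8, 92.0, 71.0),
    PodSnapshot("worker-1", "Running", "node-b", 8, 5.0, 12.0),
    PodSnapshot("worker-2", "Pending", "node-c", 8, 0.0, 0.0),
]

print(find_idle_pods(snapshots))  # prints ['worker-1']
```

Because pods are no longer displayed once the task completes, any such notes need to be taken while the task is still running.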