singlefabric

View Task Details

Introduction

This document describes how to view task details in the AI Computing Platform and monitor the progress of distributed training tasks.

Prerequisites

The distributed training task has been created successfully.

Procedure

  1. Log in to the management console.
  2. In the top navigation bar, click Products and Services > AI Computing Platform > AI Computing Platform to go to its overview page.
  3. In the left navigation bar, select Distributed Training. The distributed training task list page is displayed by default.
  4. On the distributed training task list page, find the target task and click Task Details in the Operation column of its row to open the task's basic information page.
  5. On the task basic information page, you can view Task Information, Task Running Information, and Billing Resources information.
  6. On the task details page, click the Pods tab to view information about the container groups (pods) used by the current training task, including:
    • Container Group Name/ID
    • Status
    • Within each container group:
      • Node Name/IP Address
      • Allocated GPU Cards
      • GPU Utilization
      • GPU Memory Utilization
      • CPU Usage
      • Memory Usage
      • Creation and Update Time
      • Monitoring
  7. On the task details page, click the Log tab to view the log output of the current training task. After the task completes, its pods are removed and no longer appear on the task details page.
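
The Pods tab fields listed in step 6 can also be used for a quick health check while a task is running. The following is a minimal Python sketch, not part of the platform: it assumes you have copied the values from the Pods tab by hand into a small record. The `PodSnapshot` class, the `find_pods_needing_attention` function, the status strings "Running" and "Pending", the 10% threshold, and all sample values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodSnapshot:
    """One row of the Pods tab, transcribed by hand (hypothetical structure)."""
    name: str
    status: str                    # assumed status strings, e.g. "Running"
    node: str
    allocated_gpus: int
    gpu_utilization: float         # percent, 0-100
    gpu_memory_utilization: float  # percent, 0-100


def find_pods_needing_attention(pods: List[PodSnapshot],
                                min_gpu_utilization: float = 10.0) -> List[str]:
    """Return names of pods that are not Running, or that hold GPUs but barely use them."""
    flagged = []
    for pod in pods:
        not_running = pod.status != "Running"
        idle_gpu = pod.allocated_gpus > 0 and pod.gpu_utilization < min_gpu_utilization
        if not_running or idle_gpu:
            flagged.append(pod.name)
    return flagged


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    snapshot = [
        PodSnapshot("worker-0", "Running", "node-a", 8, 92.0, 71.5),
        PodSnapshot("worker-1", "Running", "node-b", 8, 3.0, 12.0),
        PodSnapshot("worker-2", "Pending", "node-c", 8, 0.0, 0.0),
    ]
    print(find_pods_needing_attention(snapshot))
```

A pod that has GPUs allocated but shows very low GPU utilization, or that is not in a running state, is often worth a closer look on the Log tab. Because pods are removed once the task completes, such a check only makes sense while the task is still running.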