Monitoring and Logging GPU Utilization in your job

Anyone using NVIDIA GPUs from the command line meets nvidia-smi fairly quickly. It gives a quick overview of the status of the GPUs on a node. When run inside a job, its output is limited to the GPUs which have been allocated to that job.

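If your job is running in batch mode, you first need a shell on the node where it is running. On clusters using a recent version of Slurm this can usually be done by attaching to the job's allocation with srun; the sketch below is only illustrative, and the exact options and job ID depend on your site and your job:

# attach an interactive shell to an already-running job (123456 is a placeholder job ID)
srun --jobid=123456 --overlap --pty bash

# once on the node, list the GPUs the job can see
nvidia-smi -L
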
[brownm12@tc-gpu001 ~]$ nvidia-smi
Wed Feb 23 17:22:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   64C    P0   162W / 400W |   2668MiB / 81251MiB |     45%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   62C    P0   161W / 400W |   2644MiB / 81251MiB |     39%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   55C    P0   137W / 400W |   2668MiB / 81251MiB |     42%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4C:00.0 Off |                    0 |
| N/A   61C    P0   138W / 400W |   2644MiB / 81251MiB |     40%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   38C    P0    60W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   41C    P0    64W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   39C    P0    69W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   51C    P0   122W / 400W |   3802MiB / 81251MiB |     55%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    118062      C   python                           2665MiB |
|    1   N/A  N/A    118062      C   python                           2641MiB |
|    2   N/A  N/A     18210      C   python                           2665MiB |
|    3   N/A  N/A     18210      C   python                           2641MiB |
|    7   N/A  N/A     98012      C   ...onda/envs/test/bin/python     3799MiB |
+-----------------------------------------------------------------------------+

We can see:

  • the visible GPUs (there are 8 here, numbered 0-7)

  • for each GPU: the model, PCIe bus ID, temperature, power draw, memory usage, and GPU utilization percentage

  • the processes currently running on each GPU, with their GPU memory usage

This output is nicely formatted for reading, but it is no good for logging or continuous monitoring; for that, more concise output and repeated refreshes are needed. Here’s how to get started:

nvidia-smi --query-gpu=…

The output can be formatted as comma-separated values (CSV), the query parameters can be customized, and the query can be set to loop or repeat on a regular interval:

nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory --format=csv -l 5
  • -l controls the looping interval (5 seconds here); -lms can also be used to define the looping interval in milliseconds

  • --format=csv specifies CSV formatting for the output

Use man nvidia-smi and scroll to the “GPU ATTRIBUTES” and “UNIT ATTRIBUTES” sections for a list of the attributes which can be queried.
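The attribute list can also be printed by nvidia-smi itself, and for machine parsing the CSV header row and unit strings can be dropped. For example (the attribute selection here is just an illustration):

# print every property accepted by --query-gpu
nvidia-smi --help-query-gpu

# same style of query, but without the header row or the "%" unit strings
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv,noheader,nounits -l 5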

Make a bash function

Instead of typing all this, or copy-pasting repeatedly, you can create a bash function to act as a wrapper for this query.

[brownm12@tc-gpu001 ~]$ gpumon() { nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory --format=csv -lms 1000 ; }
[brownm12@tc-gpu001 ~]$ gpumon
timestamp, name, pci.bus_id, driver_version, temperature.gpu, utilization.gpu [%], utilization.memory [%]
2022/02/23 17:26:54.522, NVIDIA A100-SXM-80GB, 00000000:07:00.0, 470.57.02, 64, 43 %, 4 %
2022/02/23 17:26:54.524, NVIDIA A100-SXM-80GB, 00000000:0B:00.0, 470.57.02, 61, 40 %, 4 %
2022/02/23 17:26:54.526, NVIDIA A100-SXM-80GB, 00000000:48:00.0, 470.57.02, 55, 21 %, 2 %
2022/02/23 17:26:54.527, NVIDIA A100-SXM-80GB, 00000000:4C:00.0, 470.57.02, 61, 34 %, 2 %
2022/02/23 17:26:54.529, NVIDIA A100-SXM-80GB, 00000000:88:00.0, 470.57.02, 38, 0 %, 0 %
2022/02/23 17:26:54.531, NVIDIA A100-SXM-80GB, 00000000:8B:00.0, 470.57.02, 41, 0 %, 0 %
2022/02/23 17:26:54.532, NVIDIA A100-SXM-80GB, 00000000:C8:00.0, 470.57.02, 39, 0 %, 0 %
2022/02/23 17:26:54.534, NVIDIA A100-SXM-80GB, 00000000:CB:00.0, 470.57.02, 52, 56 %, 9 %
2022/02/23 17:26:55.535, NVIDIA A100-SXM-80GB, 00000000:07:00.0, 470.57.02, 64, 43 %, 4 %
2022/02/23 17:26:55.536, NVIDIA A100-SXM-80GB, 00000000:0B:00.0, 470.57.02, 61, 40 %, 4 %
2022/02/23 17:26:55.537, NVIDIA A100-SXM-80GB, 00000000:48:00.0, 470.57.02, 54, 24 %, 2 %
2022/02/23 17:26:55.538, NVIDIA A100-SXM-80GB, 00000000:4C:00.0, 470.57.02, 61, 36 %, 2 %
2022/02/23 17:26:55.538, NVIDIA A100-SXM-80GB, 00000000:88:00.0, 470.57.02, 38, 0 %, 0 %
2022/02/23 17:26:55.539, NVIDIA A100-SXM-80GB, 00000000:8B:00.0, 470.57.02, 41, 0 %, 0 %
2022/02/23 17:26:55.540, NVIDIA A100-SXM-80GB, 00000000:C8:00.0, 470.57.02, 39, 0 %, 0 %
2022/02/23 17:26:55.541, NVIDIA A100-SXM-80GB, 00000000:CB:00.0, 470.57.02, 52, 56 %, 9 %
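
A function defined this way only exists in the current shell session. To keep it available in future logins, it can be appended to your shell startup file (assuming bash and ~/.bashrc here):

# persist the gpumon function for future shells
cat >> ~/.bashrc <<'EOF'
gpumon() { nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,temperature.gpu,utilization.gpu,utilization.memory --format=csv -lms 1000 ; }
EOF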

Show only non-zero utilization and log to a CSV file

# run gpumon, but keep only lines with non-zero utilization and send them to a file instead of the terminal
gpumon | grep -v " 0 %, 0 %" > gpustats.csv
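
To capture statistics for the whole lifetime of a batch job, the monitor can be started in the background before the workload and stopped afterwards. A minimal sketch, assuming a bash job script (python train.py is a placeholder for the real workload; --line-buffered makes grep flush each line to the file as it arrives):

# start logging non-idle GPU samples in the background
gpumon | grep --line-buffered -v " 0 %, 0 %" > gpustats.csv &
MONITOR_PID=$!           # PID of the grep end of the background pipeline

python train.py          # placeholder for the real workload

kill $MONITOR_PID        # nvidia-smi exits soon after, when its output pipe closes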