Set up GPUs on Slurm
Last modified: February 26, 2024
by Yuejia Zhang, April 11, 2022
Run nvidia-smi -L on each node to list its GPUs.
On loginNode we have:
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-082b5928-f417-d18b-2e66-7ac8725d2eef)
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-d4beb8f8-57b9-9e30-757d-dace8ba58a6d)
On bigMem0 we have:
GPU 0: Tesla T4 (UUID: GPU-e39346a5-2c99-b9b3-3990-314a4955c698)
GPU 1: Tesla T4 (UUID: GPU-804f3721-c596-7163-0d12-0e6430083919)
GPU 2: Tesla T4 (UUID: GPU-a713e9f1-3221-e328-0bf4-c5390ef4c54a)
GPU 3: Tesla T4 (UUID: GPU-0f2b1458-1711-050a-8296-e07175c96634)
On bigMem1 we have:
GPU 0: NVIDIA A30 (UUID: GPU-4b5ca539-95f5-1f73-31ac-f88fb31b02d4)
GPU 1: NVIDIA A30 (UUID: GPU-96e0c10d-98af-1ad0-dfc7-aeb3c7b90c7d)
GPU 2: NVIDIA A30 (UUID: GPU-97470ff2-12a9-fc91-fd5f-66d291104940)
GPU 3: NVIDIA A30 (UUID: GPU-14ace18b-5160-afd6-467c-c20f1d706432)
Test the AutoDetect mechanism
Modify /etc/slurm/slurm.conf, setting the slurmd log level to debug2 so that the detection messages below get logged, and add the following lines.
Line 16:
GresTypes=gpu
Lines 148 & 149:
NodeName=loginNode CPUs=24 RealMemory=128546 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:2
NodeName=bigMem[0-1] CPUs=64 RealMemory=1030499 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
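The CPU, socket, and memory figures in these NodeName lines can be read straight off each node; a quick way to double-check them (not part of the original steps) is:
# Print this node's detected hardware as a ready-made slurm.conf NodeName line;
# run it on loginNode, bigMem0 and bigMem1 and compare with the values above.
slurmd -C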
Modify /etc/slurm/gres.conf:
AutoDetect=nvml
Then restart slurmd:
sudo systemctl restart slurmd
/var/log/slurmd.log on loginNode now shows:
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_geforce_gtx_1080_ti
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-082b5928-f417-d18b-2e66-7ac8725d2eef
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:4:0
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:04:00.0
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 0,2,4,6,8,10,12,14,16,18,20,22
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 0-11
[2022-04-11T13:36:09.102] debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
...
[2022-04-11T13:36:09.107] debug: Gres GPU plugin: Final merged gres.conf list:
[2022-04-11T13:36:09.107] debug: GRES[gpu] Type:(null) Count:1 Cores(24):0-11 Links:-1,0 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia0 UniqueId:(null)
[2022-04-11T13:36:09.107] debug: GRES[gpu] Type:(null) Count:1 Cores(24):12-23 Links:0,-1 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia1 UniqueId:(null)
/var/log/slurmd.log on bigMem0 shows:
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: tesla_t4
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-e39346a5-2c99-b9b3-3990-314a4955c698
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:59:0
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:3B:00.0
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 0-15,32-47
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 0-15
[2022-04-11T13:46:20.220] debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
...
[2022-04-11T13:46:21.517] debug: Gres GPU plugin: Final merged gres.conf list:
[2022-04-11T13:46:21.517] debug: GRES[gpu] Type:(null) Count:1 Cores(64):0-15 Links:-1,0,0,0 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia0 UniqueId:(null)
[2022-04-11T13:46:21.517] debug: GRES[gpu] Type:(null) Count:1 Cores(64):16-31 Links:0,-1,0,0 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia1 UniqueId:(null)
[2022-04-11T13:46:21.517] debug: GRES[gpu] Type:(null) Count:1 Cores(64):16-31 Links:0,0,-1,0 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
[2022-04-11T13:46:21.517] debug: GRES[gpu] Type:(null) Count:1 Cores(64):16-31 Links:0,0,0,-1 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
/var/log/slurmd.log on bigMem1 shows:
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: nvidia_a30
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-4b5ca539-95f5-1f73-31ac-f88fb31b02d4
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:24:0
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:18:00.0
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 0-15,32-47
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 0-15
[2022-04-11T13:49:10.275] debug2: gpu/nvml: _get_system_gpu_list_nvml: MIG mode: disabled
...
[2022-04-11T13:49:12.545] debug: Gres GPU plugin: Final merged gres.conf list:
[2022-04-11T13:49:12.545] debug: GRES[gpu] Type:(null) Count:1 Cores(64):0-15 Links:-1,0,0,0 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia0 UniqueId:(null)
[2022-04-11T13:49:12.545] debug: GRES[gpu] Type:(null) Count:1 Cores(64):0-15 Links:0,-1,0,0 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia1 UniqueId:(null)
[2022-04-11T13:49:12.545] debug: GRES[gpu] Type:(null) Count:1 Cores(64):16-31 Links:0,0,-1,0 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia2 UniqueId:(null)
[2022-04-11T13:49:12.545] debug: GRES[gpu] Type:(null) Count:1 Cores(64):16-31 Links:0,0,0,-1 Flags:HAS_FILE,ENV_NVML File:/dev/nvidia3 UniqueId:(null)
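Once all three slurmd daemons have re-registered, the scheduler's view of the auto-detected GPUs can be checked from any node; a quick sanity check (commands only, output omitted):
# Per-node GRES as seen by the scheduler
sinfo -N -o "%N %G"
# Detailed GRES view for a single node, e.g. bigMem1
scontrol show node bigMem1 | grep -i gres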
Although AutoDetect works well, we still added the details by hand. As the official documentation puts it, "This allows gres.conf to serve as an optional sanity check and notifies administrators of any unexpected changes in GPU properties."
Configuration Files:
slurm.conf:
NodeName=loginNode CPUs=24 RealMemory=128546 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:nvidia_geforce_gtx_1080_ti:2
NodeName=bigMem0 CPUs=64 RealMemory=1030499 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:tesla_t4:4
NodeName=bigMem1 CPUs=64 RealMemory=1030499 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:nvidia_a30:4
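With typed GRES declared, jobs can request a specific GPU model by name. A minimal usage sketch (the type strings come from the lines above; partition and other options are omitted):
# Interactive test: ask for one T4 and list what Slurm actually granted
srun --gres=gpu:tesla_t4:1 nvidia-smi -L
# The equivalent directive inside a batch script:
#SBATCH --gres=gpu:nvidia_a30:2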
gres.conf on loginNode:
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Define GPU devices, with AutoDetect sanity checking
##################################################################
AutoDetect=nvml
Name=gpu Type=nvidia_geforce_gtx_1080_ti File=/dev/nvidia0
Name=gpu Type=nvidia_geforce_gtx_1080_ti File=/dev/nvidia1
gres.conf on bigMem0:
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Define GPU devices, with AutoDetect sanity checking
##################################################################
AutoDetect=nvml
Name=gpu Type=tesla_t4 File=/dev/nvidia0
Name=gpu Type=tesla_t4 File=/dev/nvidia1
Name=gpu Type=tesla_t4 File=/dev/nvidia2
Name=gpu Type=tesla_t4 File=/dev/nvidia3
gres.conf on bigMem1:
##################################################################
# Slurm's Generic Resource (GRES) configuration file
# Define GPU devices, with AutoDetect sanity checking
##################################################################
AutoDetect=nvml
Name=gpu Type=nvidia_a30 File=/dev/nvidia0
Name=gpu Type=nvidia_a30 File=/dev/nvidia1
Name=gpu Type=nvidia_a30 File=/dev/nvidia2
Name=gpu Type=nvidia_a30 File=/dev/nvidia3
After the configuration is in place, restart slurmd and run scontrol update if necessary.
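For example, a sketch of those final steps (bigMem0 stands in for whichever node needs updating):
sudo systemctl restart slurmd
# If a node was drained or marked invalid after the config change, bring it back:
sudo scontrol update NodeName=bigMem0 State=RESUME
# Final check: the job should see exactly the GPUs Slurm granted
srun --gres=gpu:1 nvidia-smi -L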