Infer, GPU Cluster

Overview

Infer came online in January of 2021 and provides 18 nodes, each with an Nvidia T4 GPU. The cluster’s name “Infer” alludes to the AI/ML inference capabilities of the T4 GPUs derived from the “tensor cores” on these devices. We think they will also be a great all-purpose resource for researchers who are making their first forays into GPU-enabled computations of any type.

In the spring of 2021, 40 nodes with two Nvidia P100 GPUs each were migrated from an older ARC system that was being decommissioned.

Technical details are below:

	T4 Nodes	P100 Nodes
Vendor	HPE	Dell
Chip	Intel Xeon Gold 6130	Intel Xeon E5-2680v4 2.4GHz
Nodes	18	40
Cores/Node	32	28
GPU Model	Nvidia Tesla T4	Nvidia Tesla P100
GPU/Node	1	2
Memory (GB)/Node	192	512
Total Cores	576	1120
Total Memory (GB)	3,456	20,480
Local Disk	480GB SSD	187GB SSD
Interconnect	EDR-100 IB	Ethernet

Policies

Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications:

	t4_normal_q	t4_dev_q	p100_normal_q	p100_dev_q	v100_normal_q	v100_dev_q
Node Type	T4 GPU	T4 GPU	P100 GPU	P100 GPU	V100 GPU	V100 GPU
Billing Weight	0 (no billing)	0 (no billing)	0 (no billing)	0 (no billing)	0 (no billing)	0 (no billing)
Number of Nodes	16	2	38	2	38	2
MaxRunningJobs (User)	10	2	16	2	16	2
MaxSubmitJobs (User)	100	3	32	4	32	4
MaxRunningJobs (Allocation)	20	3	24	4	24	3
MaxSubmitJobs (Allocation)	200	6	48	8	48	6
MaxNodes (User)	8	2	16	16	16	16
MaxNodes (Allocation)	12	2	24	24	24	24
MaxCPUs (User)	256	64	224	224	256	256
MaxCPUs (Allocation)	384	64	336	336	384	384
MaxGPUs (User)	8	2	16	16	16	16
MaxGPUs (Allocation)	12	2	24	24	32	32
Max Job Duration (hours)	72	4	144	4	144	4
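
As an illustration of staying within these limits, the sketch below writes a minimal batch script for a single T4 node and prints the command to submit it. The allocation name myalloc and the file name t4_job.sh are hypothetical placeholders. The GPU request via --gres=gpu:1 assumes GPUs are exposed as Slurm generic resources on this cluster, and the CPU count and run time are arbitrary values chosen to sit well below the per-user limits in the table.

```shell
#!/bin/bash
# Write a minimal batch script for one T4 node, then show how to submit it.
# "myalloc" is a hypothetical allocation (Slurm account) name; replace it.
cat > t4_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=t4-test
# Hypothetical allocation name:
#SBATCH --account=myalloc
# Queue name taken from the policy table:
#SBATCH --partition=t4_normal_q
# One node (the per-user limit in this queue is 8):
#SBATCH --nodes=1
# One GPU (each T4 node has a single GPU; assumes GRES-based GPU requests):
#SBATCH --gres=gpu:1
# Eight of the 32 cores on the node:
#SBATCH --cpus-per-task=8
# Four hours (the queue maximum is 72):
#SBATCH --time=04:00:00

# Restore the default modules, then load a CUDA-enabled toolchain.
module reset
module load fosscuda/2020b

# Report the GPU assigned to the job.
nvidia-smi
EOF

echo "Submit with: sbatch t4_job.sh"
```

For a short test, the same script could target t4_dev_q instead, provided the requested time is reduced to fit that queue's 4-hour limit.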

Modules

Infer’s module structure is similar to that of TinkerCliffs but differs from previous ARC clusters in that it uses a new application stack and module system based on EasyBuild. A video tutorial of module usage under this paradigm is provided here; a longer class on EasyBuild, including how you can use it to build your own modules, is here.

Key differences between EasyBuild and our legacy paradigm from a user perspective include:

Hierarchies are replaced by toolchains. Right now, there are four:
- foss (“Free Open Source Software”): gcc compilers, OpenBLAS for linear algebra, OpenMPI for MPI, etc.
- fosscuda: foss with CUDA support
- intel: Intel compilers, Intel MKL for linear algebra, Intel MPI
- intelcuda: intel with CUDA support
Instead of loading modules individually (e.g., module load intel mkl impi), a user can just load the toolchain (e.g., module load fosscuda/2020b).
Modules load their dependencies, e.g.,

$ module reset; module load GROMACS/2020.4-fosscuda-2020b; module list
Currently Loaded Modules:
  1) shared                       8) GCCcore/10.2.0                15) numactl/2.0.13-GCCcore-10.2.0     22) GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1  29) FFTW/3.3.8-gompic-2020b
  2) gcc/9.2.0                    9) zlib/1.2.11-GCCcore-10.2.0    16) XZ/5.2.5-GCCcore-10.2.0           23) UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1    30) ScaLAPACK/2.1.0-gompic-2020b
  3) slurm/slurm/19.05.5         10) binutils/2.35-GCCcore-10.2.0  17) libxml2/2.9.10-GCCcore-10.2.0     24) libfabric/1.11.0-GCCcore-10.2.0         31) fosscuda/2020b
  4) apps                        11) GCC/10.2.0                    18) libpciaccess/0.16-GCCcore-10.2.0  25) PMIx/3.1.5-GCCcore-10.2.0               32) GROMACS/2020.4-fosscuda-2020b
  5) site/infer/easybuild/setup  12) CUDAcore/11.1.1               19) hwloc/2.2.0-GCCcore-10.2.0        26) OpenMPI/4.0.5-gcccuda-2020b
  6) useful_scripts              13) CUDA/11.1.1-GCC-10.2.0        20) libevent/2.1.12-GCCcore-10.2.0    27) OpenBLAS/0.3.12-GCC-10.2.0
  7) DefaultModules              14) gcccuda/2020b                 21) Check/0.15.2-GCCcore-10.2.0       28) gompic/2020b

All modules are visible with module avail, so the full listing can be long. In many cases it is better to search with module spider than to print the whole list.
Some key system software, like the Slurm scheduler, is included in default modules. This means that module purge can break important functionality. Use module reset instead.
Lower-level software is included in the module structure (see, e.g., binutils in the GROMACS example above), which should mean less risk of conflicts in adding new versions later.
Environment variables (e.g., $SOFTWARE_LIB) available in our previous module system may not be provided. Instead, EasyBuild typically provides $EBROOTSOFTWARE, where SOFTWARE stands for the software name in upper case, to point to the software installation location. So, for example, to link to CUDA libraries, one might use -L$EBROOTCUDA/lib64 instead of the previous -L$CUDA_LIB.
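
The last point can be sketched as a small shell helper that builds linker flags from the EasyBuild-provided variable. This is only an illustration: the function name cuda_link_flags and the source file my_app.c are hypothetical, and linking against the CUDA runtime (-lcudart) is an assumption about what the application needs.

```shell
# Print linker flags for the CUDA runtime using the EasyBuild root variable.
# Assumes a CUDA-enabled module (e.g., fosscuda/2020b) has been loaded so
# that EBROOTCUDA is set; cuda_link_flags is a hypothetical helper name.
cuda_link_flags() {
    if [ -z "${EBROOTCUDA:-}" ]; then
        echo "EBROOTCUDA is not set; load a CUDA module first" >&2
        return 1
    fi
    echo "-L${EBROOTCUDA}/lib64 -lcudart"
}

# Example usage on the cluster (my_app.c is a hypothetical source file):
#   module reset
#   module load fosscuda/2020b
#   gcc my_app.c -I"$EBROOTCUDA/include" $(cuda_link_flags) -o my_app
```

After loading a toolchain, running env | grep '^EBROOT' lists the installation roots that the loaded modules have defined.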
