Infer, GPU Cluster

Overview

Infer came online in January 2021 and provides 18 nodes, each with an Nvidia T4 GPU. The cluster’s name “Infer” alludes to the AI/ML inference capabilities provided by the tensor cores on these GPUs. We think they will also be a great all-purpose resource for researchers making their first forays into GPU-enabled computations of any type.

In the spring of 2021, 40 nodes, each with two Nvidia P100 GPUs, were migrated to Infer from an older ARC system that was being decommissioned.

Technical details are below:

|                   | T4 Nodes             | P100 Nodes                  |
|-------------------|----------------------|-----------------------------|
| Vendor            | HPE                  | Dell                        |
| Chip              | Intel Xeon Gold 6130 | Intel Xeon E5-2680v4 2.4GHz |
| Nodes             | 18                   | 40                          |
| Cores/Node        | 32                   | 28                          |
| GPU Model         | Nvidia Tesla T4      | Nvidia Tesla P100           |
| GPU/Node          | 1                    | 2                           |
| Memory (GB)/Node  | 192                  | 512                         |
| Total Cores       | 576                  | 1,120                       |
| Total Memory (GB) | 3,456                | 20,480                      |
| Local Disk        | 480GB SSD            | 187GB SSD                   |
| Interconnect      | EDR-100 IB           | Ethernet                    |

Login

ARC users can log into Infer at:

infer1.arc.vt.edu
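
For example, from a terminal with an SSH client (the username below is a placeholder for your own VT username):

$ ssh yourVTusername@infer1.arc.vt.edu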

Policies

Limits are set on the scale and quantity of jobs at the user and allocation (Slurm account) levels to help ensure availability of resources to a broad set of researchers and applications:

|                             | t4_normal_q    | t4_dev_q       | p100_normal_q  | p100_dev_q     | v100_normal_q  | v100_dev_q     |
|-----------------------------|----------------|----------------|----------------|----------------|----------------|----------------|
| Node Type                   | T4 GPU         | T4 GPU         | P100 GPU       | P100 GPU       | V100 GPU       | V100 GPU       |
| Billing Weight              | 0 (no billing) | 0 (no billing) | 0 (no billing) | 0 (no billing) | 0 (no billing) | 0 (no billing) |
| Number of Nodes             | 16             | 2              | 38             | 2              | 38             | 2              |
| MaxRunningJobs (User)       | 10             | 2              | 16             | 2              | 16             | 2              |
| MaxSubmitJobs (User)        | 100            | 3              | 32             | 4              | 32             | 4              |
| MaxRunningJobs (Allocation) | 20             | 3              | 24             | 4              | 24             | 3              |
| MaxSubmitJobs (Allocation)  | 200            | 6              | 48             | 8              | 48             | 6              |
| MaxNodes (User)             | 8              | 2              | 16             | 16             | 16             | 16             |
| MaxNodes (Allocation)       | 12             | 2              | 24             | 24             | 24             | 24             |
| MaxCPUs (User)              | 256            | 64             | 224            | 224            | 256            | 256            |
| MaxCPUs (Allocation)        | 384            | 64             | 336            | 336            | 384            | 384            |
| MaxGPUs (User)              | 8              | 2              | 16             | 16             | 16             | 16             |
| MaxGPUs (Allocation)        | 12             | 2              | 24             | 24             | 32             | 32             |
| Max Job Duration (hours)    | 72             | 4              | 144            | 4              | 144            | 4              |
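
For reference, a minimal Slurm batch script targeting one of these partitions might look like the sketch below. This is illustrative rather than ARC’s prescribed template: the allocation name is a placeholder, and the GPU request uses the common --gres syntax, which may differ from the exact flags ARC recommends.

#!/bin/bash
#SBATCH --job-name=gpu-test          # name shown in the queue
#SBATCH --partition=t4_normal_q      # one of the partitions in the table above
#SBATCH --account=yourallocation     # placeholder: your Slurm allocation (account) name
#SBATCH --nodes=1                    # stays within the MaxNodes limits above
#SBATCH --gres=gpu:1                 # request one T4 GPU (common Slurm syntax; assumption)
#SBATCH --time=02:00:00              # well under the 72-hour limit for t4_normal_q

# Show which GPU(s) the job can see
nvidia-smi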

Modules

Infer’s module structure is similar to that of TinkerCliffs, but differs from previous ARC clusters in that it uses a new application stack/module system based on EasyBuild. A video tutorial of module usage under this paradigm is provided here; a longer class on EasyBuild, including how you can use it to build your own modules, is here.

Key differences between EasyBuild and our legacy paradigm from a user perspective include:

  • Hierarchies are replaced by toolchains. Right now, there are four:

    • foss (“Free Open Source Software”): GCC compilers, OpenBLAS for linear algebra, OpenMPI for MPI, etc.

    • fosscuda: foss with CUDA support

    • intel: Intel compilers, Intel MKL for linear algebra, Intel MPI

    • intelcuda: intel with CUDA support

  • Instead of loading modules individually (e.g., module load intel mkl impi), a user can just load the toolchain (e.g., module load fosscuda/2020b).

  • Modules load their dependencies, e.g.,

$ module reset; module load GROMACS/2020.4-fosscuda-2020b; module list
Currently Loaded Modules:
  1) shared                       8) GCCcore/10.2.0                15) numactl/2.0.13-GCCcore-10.2.0     22) GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1  29) FFTW/3.3.8-gompic-2020b
  2) gcc/9.2.0                    9) zlib/1.2.11-GCCcore-10.2.0    16) XZ/5.2.5-GCCcore-10.2.0           23) UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1    30) ScaLAPACK/2.1.0-gompic-2020b
  3) slurm/slurm/19.05.5         10) binutils/2.35-GCCcore-10.2.0  17) libxml2/2.9.10-GCCcore-10.2.0     24) libfabric/1.11.0-GCCcore-10.2.0         31) fosscuda/2020b
  4) apps                        11) GCC/10.2.0                    18) libpciaccess/0.16-GCCcore-10.2.0  25) PMIx/3.1.5-GCCcore-10.2.0               32) GROMACS/2020.4-fosscuda-2020b
  5) site/infer/easybuild/setup  12) CUDAcore/11.1.1               19) hwloc/2.2.0-GCCcore-10.2.0        26) OpenMPI/4.0.5-gcccuda-2020b
  6) useful_scripts              13) CUDA/11.1.1-GCC-10.2.0        20) libevent/2.1.12-GCCcore-10.2.0    27) OpenBLAS/0.3.12-GCC-10.2.0
  7) DefaultModules              14) gcccuda/2020b                 21) Check/0.15.2-GCCcore-10.2.0       28) gompic/2020b
  • All modules are visible with module avail, so in many cases it is better to search with module spider rather than printing the whole list.

  • Some key system software, like the Slurm scheduler, is included in the default modules. This means that module purge can break important functionality; use module reset instead.

  • Lower-level software is included in the module structure (see, e.g., binutils in the GROMACS example above), which should mean less risk of conflicts when adding new versions later.

  • Environment variables (e.g., $SOFTWARE_LIB) available in our previous module system may not be provided. Instead, EasyBuild typically provides $EBROOTSOFTWARE to point to the software installation location. For example, to link against CUDA libraries, one might use -L$EBROOTCUDA/lib64 instead of the previous -L$CUDA_LIB; a short example follows this list.
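
To make the last few points concrete, a typical session might look like the sketch below. The source file name is hypothetical, and the module names follow the GROMACS example above:

$ module spider CUDA            # search for CUDA-related modules instead of scanning module avail
$ module reset                  # restore the default module set (preferred over module purge)
$ module load fosscuda/2020b    # load the CUDA-enabled toolchain in one step
$ echo $EBROOTCUDA              # EasyBuild’s root variable for the CUDA installation
$ gcc my_app.c -I$EBROOTCUDA/include -L$EBROOTCUDA/lib64 -lcudart -o my_app   # link against CUDA via $EBROOTCUDA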