Falcon - Mid-Range NVIDIA GPU

Falcon has 111 nodes, 4,896 CPU cores, 44 TB RAM, and 307 GPUs (80 NVIDIA L40S GPUs, 128 NVIDIA A30 GPUs, 80 NVIDIA V100 GPUs, and 19 NVIDIA T4 GPUs).

Overview

| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes | Total |
|---|---|---|---|---|---|
| Chip | Intel Xeon Platinum 8462Y+ | Intel Xeon Platinum 8462Y+ | Intel Xeon Gold 6136 | Intel Xeon Gold 6130 | - |
| Architecture | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake | - |
| Slurm features | - | - | - | - | - |
| Nodes | 20 | 32 | 40 | 19 | 111 |
| GPUs | 4x NVIDIA L40s-48G | 4x NVIDIA A30-24G | 2x NVIDIA V100-16G | 1x NVIDIA T4-16G | 307 |
| Cores/Node | 64 | 64 | 24 | 32 | - |
| Memory (GB)/Node | 512 | 512 | 384 | 196 | - |
| Total Cores | 1,280 | 2,048 | 960 | 608 | 4,896 |
| Total Memory (GB) | 10,240 | 16,384 | 15,360 | 3,724 | 45,708 |
| Local Disk | 1.7TB NVMe | 1.7TB NVMe | 669GB SSD | 371GB SSD | - |
| Interconnect | NDR-200 IB | NDR-200 IB | 10 Gbps Ethernet | 10 Gbps Ethernet | - |

Falcon is hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.

An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).

Get Started

Falcon can be accessed via one of the two login nodes using your VT credentials:

  • falcon1.arc.vt.edu

  • falcon2.arc.vt.edu
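For example, from a campus network or the VT VPN you can connect to either login node with ssh. A minimal sketch, assuming your VT username is yourpid (a placeholder):

```
# Connect to a Falcon login node (replace "yourpid" with your VT username)
ssh yourpid@falcon1.arc.vt.edu
```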

For testing purposes, all users are allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.

To create an allocation, log in to the ARC allocation portal at https://coldfront.arc.vt.edu and:

  • Select or create a project

  • Click the “+ Request Resource Allocation” button

  • Choose the “Compute (Free) (Cluster)” allocation type

Usage needs in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.

Partitions

Users submit jobs to partitions of the cluster depending on the type of GPU resources needed. If a job does not specify the amount of memory requested, the parameter DefMemPerCPU automatically determines the memory for the job based on the number of CPU cores requested. If a GPU job does not specify the number of CPU cores, the parameter DefCpuPerGPU automatically determines the number of CPU cores based on the number of GPUs requested. Jobs are billed against the user’s allocation based on the CPU cores, memory, and GPU time used. Consult the Slurm configuration to understand how to specify these parameters for your job; a sample submission is sketched after the table below.

| Partition | l40s_normal_q | l40s_preemptable_q | a30_normal_q | a30_preemptable_q | v100_normal_q | v100_preemptable_q | t4_normal_q | t4_preemptable_q |
|---|---|---|---|---|---|---|---|---|
| Node Type | L40s | L40s | A30 | A30 | V100 | V100 | T4 | T4 |
| Features | - | - | - | - | - | - | - | - |
| Number of Nodes | 20 | 20 | 32 | 32 | 40 | 40 | 19 | 19 |
| DefMemPerCPU (MB) | 7920 | 7920 | 7920 | 7920 | 15720 | 15720 | 5744 | 5744 |
| DefCpuPerGPU | 8 | 8 | 8 | 8 | 6 | 6 | 6 | 6 |
| TRESBillingWeights | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=50 | - | CPU=1,Mem=0.0625G,GRES/gpu=25 | - |
| PreemptMode | OFF | ON | OFF | ON | OFF | ON | OFF | ON |
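As an illustration of how these defaults interact with a request, here is a minimal batch script sketch for the a30_normal_q partition. The account name yourallocation is a placeholder, not part of the cluster configuration; with no --mem request, memory defaults to DefMemPerCPU per allocated core, and with no --cpus-per-task, DefCpuPerGPU supplies 8 cores per GPU.

```
#!/bin/bash
#SBATCH --account=yourallocation     # placeholder: your ARC allocation / Slurm account
#SBATCH --partition=a30_normal_q     # A30 GPU partition from the table above
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                 # one A30 GPU
#SBATCH --cpus-per-task=8            # matches DefCpuPerGPU; omit to accept the default
#SBATCH --time=04:00:00              # well under the 7-day default QoS limit

# With no --mem request, memory defaults to DefMemPerCPU (7920 MB) per allocated core.
nvidia-smi                           # confirm which GPU the job received
```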

Quality of Service (QoS)

The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their jobs’ needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day. ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users. A sample QoS request is sketched after the table below.

| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount |
|---|---|---|---|---|---|
| l40s_normal_q | fal_l40s_normal_base | 1000 | 7-00:00:00 | cpu=410,mem=3220G,gres/gpu=16 | cpu=820,mem=6439G,gres/gpu=32 |
| l40s_normal_q | fal_l40s_normal_long | 500 | 14-00:00:00 | cpu=103,mem=805G,gres/gpu=4 | cpu=205,mem=1610G,gres/gpu=8 |
| l40s_normal_q | fal_l40s_normal_short | 2000 | 1-00:00:00 | cpu=615,mem=4829G,gres/gpu=24 | cpu=1229,mem=9658G,gres/gpu=48 |
| l40s_preemptable_q | fal_l40s_preemptable_base | 0 | 30-00:00:00 | cpu=52,mem=403G,gres/gpu=2 | cpu=103,mem=805G,gres/gpu=4 |
| a30_normal_q | fal_a30_normal_base | 1000 | 7-00:00:00 | cpu=256,mem=2012G,gres/gpu=26 | cpu=512,mem=4024G,gres/gpu=52 |
| a30_normal_q | fal_a30_normal_long | 500 | 14-00:00:00 | cpu=64,mem=503G,gres/gpu=7 | cpu=128,mem=1006G,gres/gpu=13 |
| a30_normal_q | fal_a30_normal_short | 2000 | 1-00:00:00 | cpu=384,mem=3018G,gres/gpu=39 | cpu=768,mem=6036G,gres/gpu=77 |
| a30_preemptable_q | fal_a30_preemptable_base | 0 | 30-00:00:00 | cpu=32,mem=252G,gres/gpu=4 | cpu=64,mem=503G,gres/gpu=7 |
| v100_normal_q | fal_v100_normal_base | 1000 | 7-00:00:00 | cpu=192,mem=3008G,gres/gpu=16 | cpu=384,mem=6016G,gres/gpu=32 |
| v100_normal_q | fal_v100_normal_long | 500 | 14-00:00:00 | cpu=48,mem=752G,gres/gpu=4 | cpu=96,mem=1504G,gres/gpu=8 |
| v100_normal_q | fal_v100_normal_short | 2000 | 1-00:00:00 | cpu=288,mem=4512G,gres/gpu=24 | cpu=576,mem=9024G,gres/gpu=48 |
| v100_preemptable_q | fal_v100_preemptable_base | 0 | 30-00:00:00 | cpu=24,mem=376G,gres/gpu=2 | cpu=48,mem=752G,gres/gpu=4 |
| t4_normal_q | fal_t4_normal_base | 1000 | 7-00:00:00 | cpu=122,mem=711G,gres/gpu=4 | cpu=244,mem=1422G,gres/gpu=8 |
| t4_normal_q | fal_t4_normal_long | 500 | 14-00:00:00 | cpu=31,mem=178G,gres/gpu=1 | cpu=61,mem=356G,gres/gpu=2 |
| t4_normal_q | fal_t4_normal_short | 2000 | 1-00:00:00 | cpu=183,mem=1066G,gres/gpu=6 | cpu=365,mem=2132G,gres/gpu=12 |
| t4_preemptable_q | fal_t4_preemptable_base | 0 | 30-00:00:00 | cpu=16,mem=89G,gres/gpu=1 | cpu=31,mem=178G,gres/gpu=1 |
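For example, to trade resources for a longer wall time on the L40S nodes, a job could request the long QoS from the table above. A minimal sketch, where the account name and executable are placeholders:

```
#!/bin/bash
#SBATCH --account=yourallocation        # placeholder allocation name
#SBATCH --partition=l40s_normal_q
#SBATCH --qos=fal_l40s_normal_long      # up to 14 days, at lower priority and tighter limits
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=10-00:00:00              # 10 days, within the 14-day MaxWall for this QoS

./long_running_job                      # placeholder executable
```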

Optimization

| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes |
|---|---|---|---|---|
| CPU arch | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake |
| Compiler flags | -march=sapphirerapids | -march=sapphirerapids | -march=cascadelake | -march=cascadelake |
| GPU arch | NVIDIA L40s | NVIDIA A30 | NVIDIA V100 | NVIDIA T4 |
| Compute Capability | 8.9 | 8.0 | 7.0 | 7.5 |
| NVCC flags | -gencode=arch=compute_89,code=sm_89 | -gencode=arch=compute_80,code=sm_80 | -gencode=arch=compute_70,code=sm_70 | -gencode=arch=compute_75,code=sm_75 |

See the tuning guides available at https://www.intel.com/content/www/us/en/developer/

  • Cache locality really matters - process pinning can make a big difference in performance.

  • Hybrid programming often pays off - one MPI process per L3 cache with 4 threads is frequently optimal.

  • Use the appropriate -march flag to optimize compiled code, and the appropriate -gencode flag when using the NVCC compiler (see the example after this list).
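For instance, a CPU code for the Sapphire Rapids (L40S/A30) nodes and a CUDA kernel for the L40S GPUs might be compiled as sketched below; the source and output file names are placeholders, and compiler versions available on the cluster should be checked with module avail.

```
# CPU code for the Sapphire Rapids (L40S/A30) nodes; use -march=cascadelake on the V100/T4 nodes
gcc -O3 -march=sapphirerapids -o mycode mycode.c

# CUDA code for the L40S GPUs (compute capability 8.9); adjust the arch for other GPU types
nvcc -O3 -gencode=arch=compute_89,code=sm_89 -o mykernel mykernel.cu
```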

Slurm GPU to CPU bindings

Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:

| GPU device bus ID | GPU device | NUMA node | CPU cores |
|---|---|---|---|
| 4a:00.0 | 0 - /dev/nvidia0 | 1 | 16-31 |
| 61:00.0 | 1 - /dev/nvidia1 | 0 | 0-15 |
| ca:00.0 | 2 - /dev/nvidia2 | 3 | 48-63 |
| e1:00.0 | 3 - /dev/nvidia3 | 2 | 32-47 |
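You can inspect this layout yourself from within a job on a GPU node; the exact output formatting varies, but the CPU-core ranges should match the table above.

```
# Run on an allocated Falcon GPU node:
nvidia-smi topo -m      # GPU <-> CPU core / NUMA affinity matrix
numactl --hardware      # NUMA node to CPU core mapping
```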

If Slurm were not informed of this affinity, nearly all jobs would see reduced performance due to misalignment of allocated cores and GPUs. By default, Slurm prefers these cores when scheduling work with the affiliated GPU device, but other arrangements are possible.

  • Use the option --gres-flags=enforce-binding to require Slurm to allocate the affiliated CPU core(s) with the corresponding GPU device(s).

  • The option --gres-flags=disable-binding is required to allocate more CPU cores than are bound to a device, but this is discouraged because these cores will then be unavailable to their correctly affiliated GPU.

To summarize, these nodes and the Slurm scheduling algorithms operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:

Do this: --gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding, which allocates 1 GPU, the associated 16 CPU cores, and 128GB of system memory.

Do not do this: --gres=gpu:1 --exclusive, which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs.

Do not do this: --gres=gpu:1 --ntasks-per-node=32, which allocates 256GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional CPU cores that have affinity to a different GPU. That other GPU is still available to other jobs, but it can only run with diminished performance.
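Putting the pieces together, here is a hedged sketch of a well-formed single-GPU batch job on an L40S or A30 node; the account name and executable are placeholders.

```
#!/bin/bash
#SBATCH --account=yourallocation          # placeholder allocation name
#SBATCH --partition=l40s_normal_q
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16                # the 16 cores affiliated with one GPU
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding      # keep the allocated cores aligned with the GPU
#SBATCH --time=1-00:00:00

srun ./gpu_program                        # placeholder executable
```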