Falcon - Mid-Range NVIDIA GPU
Falcon has 111 nodes, 4,896 CPU cores, 44 TB RAM, and 307 GPUs (80 NVIDIA L40S GPUs, 128 NVIDIA A30 GPUs, 80 NVIDIA V100 GPUs, and 19 NVIDIA T4 GPUs).
Overview
| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes | Total |
|---|---|---|---|---|---|
| Chip | - | | | | |
| Architecture | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake | - |
| Slurm features | - | - | - | - | - |
| Nodes | 20 | 32 | 40 | 19 | 111 |
| GPUs | 4x NVIDIA L40s-48G | 4x NVIDIA A30-24G | 2x NVIDIA V100-16G | 1x NVIDIA T4-16G | 307 |
| Cores/Node | 64 | 64 | 24 | 32 | - |
| Memory (GB)/Node | 512 | 512 | 384 | 196 | - |
| Total Cores | 1,280 | 2,048 | 960 | 608 | 4,896 |
| Total Memory (GB) | 10,240 | 16,384 | 15,360 | 3,724 | 45,708 |
| Local Disk | 1.7 TB NVMe | 1.7 TB NVMe | 669 GB SSD | 371 GB SSD | - |
| Interconnect | NDR-200 IB | NDR-200 IB | 10 Gbps Ethernet | 10 Gbps Ethernet | - |
Falcon is hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.
An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).
Get Started
Falcon can be accessed via one of the two login nodes using your VT credentials:
falcon1.arc.vt.edu
falcon2.arc.vt.edu
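For example, you can connect from a terminal with SSH; `pid` below is a placeholder for your own VT username:

```bash
# Connect to a Falcon login node with your VT credentials
# ("pid" is a placeholder for your VT username)
ssh pid@falcon1.arc.vt.edu
```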
For testing purposes, all users will be allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.
To create an allocation:
1. Log in to the ARC allocation portal https://coldfront.arc.vt.edu
2. Select or create a project
3. Click the “+ Request Resource Allocation” button
4. Choose the “Compute (Free) (Cluster)” allocation type
Usage in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.
Partitions
Users submit jobs to partitions of the cluster depending on the type of GPU resources needed. If a job does not specify the amount of memory requested, the parameter DefMemPerCPU will automatically determine the memory for the job based on the number of CPU cores requested. If a GPU job does not specify the number of CPU cores, the parameter DefCpuPerGPU will automatically determine the number of CPU cores based on the number of GPUs requested. Jobs are billed against the user’s allocation according to the number of CPU cores, the amount of memory, and the GPU time used. Consult the Slurm configuration to understand how to specify these parameters for your job; an example submission follows the table below.
| Partition | l40s_normal_q | l40s_preemptable_q | a30_normal_q | a30_preemptable_q | v100_normal_q | v100_preemptable_q | t4_normal_q | t4_preemptable_q |
|---|---|---|---|---|---|---|---|---|
| Node Type | L40s | L40s | A30 | A30 | V100 | V100 | T4 | T4 |
| Features | - | - | - | - | - | - | - | - |
| Number of Nodes | 20 | 20 | 32 | 32 | 40 | 40 | 19 | 19 |
| DefMemPerCPU (MB) | 7920 | 7920 | 7920 | 7920 | 15720 | 15720 | 5744 | 5744 |
| DefCpuPerGPU | 8 | 8 | 8 | 8 | 6 | 6 | 6 | 6 |
| TRESBillingWeights | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=50 | - | CPU=1,Mem=0.0625G,GRES/gpu=25 | - |
| PreemptMode | OFF | ON | OFF | ON | OFF | ON | OFF | ON |
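As a minimal sketch, a job on the a30_normal_q partition might be requested as follows; the account name `myproject` is a placeholder for your own allocation, and because no `--mem` is given, DefMemPerCPU sets the memory from the CPU count:

```bash
#!/bin/bash
# Hypothetical example: request 1 A30 GPU and 8 CPU cores for 4 hours.
# "--account=myproject" is a placeholder; use your own allocation name.
#SBATCH --partition=a30_normal_q
#SBATCH --account=myproject
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

module reset
nvidia-smi   # confirm which GPU is visible to the job
```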
Recommended Uses
The nodes selected for this cluster are intended to provide broad utility for a wide range of GPU-enabled applications.
The L40S GPUs deliver excellent AI/ML inference and training capabilities for models that fit within a single GPU’s 48 GB of device memory or can be sharded across multiple GPUs. However, they do not support double‑precision (FP64) arithmetic, making them unsuitable for most traditional HPC workloads that rely on high‑precision computations.
The A30 nodes do support FP64 and are ideal for GPU‑accelerated applications such as computational fluid dynamics, computational chemistry, and multiphysics simulations. They also handle AI/ML inference and training for smaller models. With 24 GB of device memory and NVIDIA’s Ampere‑generation architectural enhancements, they deliver performance for these tasks that is comparable to—or in some cases slightly better than—existing V100 installations.
The V100 GPUs, based on NVIDIA’s Volta architecture, feature 16 GB of HBM2 memory and full FP64 support at up to 7.8 TFLOPS peak per GPU. They strike a strong balance between HPC and deep‑learning workloads, delivering up to 125 TFLOPS of mixed‑precision (FP16) performance and robust double‑precision throughput. V100s remain a reliable workhorse for traditional simulation codes and large‑scale training jobs.
The T4 GPUs, built on the Turing architecture, offer 16 GB of GDDR6 memory and excel at efficient inference and low‑precision training. With specialized Tensor Cores for INT8 and INT4 operations, T4s provide up to 130 TOPS of INT8 throughput while consuming only 70 W of power.
Quality of Service (QoS)
The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their jobs' needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day (see the example after the table below). ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users.
| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount |
|---|---|---|---|---|---|
| l40s_normal_q | fal_l40s_normal_base | 1000 | 7-00:00:00 | cpu=410,mem=3220G,gres/gpu=16 | cpu=820,mem=6439G,gres/gpu=32 |
| l40s_normal_q | fal_l40s_normal_long | 500 | 14-00:00:00 | cpu=103,mem=805G,gres/gpu=4 | cpu=205,mem=1610G,gres/gpu=8 |
| l40s_normal_q | fal_l40s_normal_short | 2000 | 1-00:00:00 | cpu=615,mem=4829G,gres/gpu=24 | cpu=1229,mem=9658G,gres/gpu=48 |
| l40s_preemptable_q | fal_l40s_preemptable_base | 0 | 30-00:00:00 | cpu=52,mem=403G,gres/gpu=2 | cpu=103,mem=805G,gres/gpu=4 |
| a30_normal_q | fal_a30_normal_base | 1000 | 7-00:00:00 | cpu=256,mem=2012G,gres/gpu=26 | cpu=512,mem=4024G,gres/gpu=52 |
| a30_normal_q | fal_a30_normal_long | 500 | 14-00:00:00 | cpu=64,mem=503G,gres/gpu=7 | cpu=128,mem=1006G,gres/gpu=13 |
| a30_normal_q | fal_a30_normal_short | 2000 | 1-00:00:00 | cpu=384,mem=3018G,gres/gpu=39 | cpu=768,mem=6036G,gres/gpu=77 |
| a30_preemptable_q | fal_a30_preemptable_base | 0 | 30-00:00:00 | cpu=32,mem=252G,gres/gpu=4 | cpu=64,mem=503G,gres/gpu=7 |
| v100_normal_q | fal_v100_normal_base | 1000 | 7-00:00:00 | cpu=192,mem=3008G,gres/gpu=16 | cpu=384,mem=6016G,gres/gpu=32 |
| v100_normal_q | fal_v100_normal_long | 500 | 14-00:00:00 | cpu=48,mem=752G,gres/gpu=4 | cpu=96,mem=1504G,gres/gpu=8 |
| v100_normal_q | fal_v100_normal_short | 2000 | 1-00:00:00 | cpu=288,mem=4512G,gres/gpu=24 | cpu=576,mem=9024G,gres/gpu=48 |
| v100_preemptable_q | fal_v100_preemptable_base | 0 | 30-00:00:00 | cpu=24,mem=376G,gres/gpu=2 | cpu=48,mem=752G,gres/gpu=4 |
| t4_normal_q | fal_t4_normal_base | 1000 | 7-00:00:00 | cpu=122,mem=711G,gres/gpu=4 | cpu=244,mem=1422G,gres/gpu=8 |
| t4_normal_q | fal_t4_normal_long | 500 | 14-00:00:00 | cpu=31,mem=178G,gres/gpu=1 | cpu=61,mem=356G,gres/gpu=2 |
| t4_normal_q | fal_t4_normal_short | 2000 | 1-00:00:00 | cpu=183,mem=1066G,gres/gpu=6 | cpu=365,mem=2132G,gres/gpu=12 |
| t4_preemptable_q | fal_t4_preemptable_base | 0 | 30-00:00:00 | cpu=16,mem=89G,gres/gpu=1 | cpu=31,mem=178G,gres/gpu=1 |
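As a sketch, a short, higher-priority run on the V100 nodes could select the short QoS explicitly; the script name `train.sh` below is a placeholder:

```bash
# Hypothetical example: submit a 1-day, higher-priority job by selecting the
# "short" QoS on the v100_normal_q partition ("train.sh" is a placeholder script).
sbatch --partition=v100_normal_q --qos=fal_v100_normal_short \
       --gres=gpu:2 --cpus-per-gpu=6 --time=1-00:00:00 train.sh
```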
Optimization
| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes |
|---|---|---|---|---|
| CPU arch | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake |
| Compiler flags | | | | |
| GPU arch | NVIDIA L40s | NVIDIA A30 | NVIDIA V100 | NVIDIA T4 |
| Compute Capability | 8.9 | 8.0 | 7.0 | 7.5 |
| NVCC flags | | | | |
See the tuning guides available at https://www.intel.com/content/www/us/en/developer/
- Cache locality really matters: process pinning can make a big difference in performance.
- Hybrid programming often pays off: one MPI process per L3 cache with 4 threads is often optimal.
- Use the appropriate `-march` flag to optimize the compiled code and the `-gencode` flag when using the NVCC compiler (see the example below).
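For example, a build on the Sapphire Rapids (L40s/A30) nodes might look like the following sketch; the source file names are placeholders, and the `-march`/`-gencode` values shown are the standard ones for these CPU and GPU generations:

```bash
# Hypothetical build commands; solver.c and kernel.cu are placeholder file names.
# CPU code on Sapphire Rapids nodes (use -march=cascadelake on the V100/T4 nodes):
gcc -O3 -march=sapphirerapids -o solver solver.c

# GPU code for the L40s (compute capability 8.9); substitute compute_80/sm_80 for
# the A30, compute_70/sm_70 for the V100, or compute_75/sm_75 for the T4:
nvcc -O3 -gencode arch=compute_89,code=sm_89 -o kernel kernel.cu
```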
Slurm GPU to CPU bindings
Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:
| GPU device bus ID | GPU device | NUMA node | CPU cores |
|---|---|---|---|
| 4a:00.0 | 0 - /dev/nvidia0 | 1 | 16-31 |
| 61:00.0 | 1 - /dev/nvidia1 | 0 | 0-15 |
| ca:00.0 | 2 - /dev/nvidia2 | 3 | 48-63 |
| e1:00.0 | 3 - /dev/nvidia3 | 2 | 32-47 |
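You can verify this mapping yourself on a compute node with NVIDIA's topology report, which lists the CPU and NUMA affinity of each GPU:

```bash
# Print the GPU/CPU/NUMA topology matrix of the current node
nvidia-smi topo -m
```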
If we do not inform Slurm of this affinity, then nearly all jobs will have reduced performance due to misalignment of allocated cores and GPUs. By default, these cores will be preferred by Slurm for scheduling with the affiliated GPU device, but other arrangements are possible.
- Use the option `--gres-flags=enforce-binding` to require Slurm to allocate the affiliated CPU core(s) with the corresponding GPU device(s).
- The option `--gres-flags=disable-binding` is required to allocate more CPU cores than are bound to a device, but this is discouraged because these cores will then be unavailable to their correctly affiliated GPU.
To summarize, these nodes and the Slurm scheduling algorithms will operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:

Do this: `--gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding`
which allocates 1 GPU, the associated 16 CPU cores, and 128 GB of system memory.

Do not do this: `--gres=gpu:1 --exclusive`
which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs.

Do not do this: `--gres=gpu:1 --ntasks-per-node=32`
which allocates 256 GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional CPUs that have affinity to a different GPU. That isolated GPU is still available to other jobs, but can only run with diminished performance.
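Putting it together, a minimal job script following the recommended binding might look like this sketch; the partition, account, and application names are placeholders:

```bash
#!/bin/bash
# Hypothetical example; "myproject" and "./my_gpu_app" are placeholders.
#SBATCH --partition=l40s_normal_q
#SBATCH --account=myproject
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding   # keep the allocated cores local to the GPU
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

module reset
srun ./my_gpu_app
```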