Falcon - Mid-Range NVIDIA GPU
Falcon has 111 nodes, 4,896 CPU cores, 44 TB RAM, and 307 GPUs (80 NVIDIA L40S GPUs, 128 NVIDIA A30 GPUs, 80 NVIDIA V100 GPUs, and 19 NVIDIA T4 GPUs).
Overview
| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes | Total |
|---|---|---|---|---|---|
| Chip | - | | | | |
| Architecture | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake | - |
| Slurm features | - | - | - | - | - |
| Nodes | 20 | 32 | 40 | 19 | 111 |
| GPUs | 4x NVIDIA L40s-48G | 4x NVIDIA A30-24G | 2x NVIDIA V100-16G | 1x NVIDIA T4-16G | 307 |
| Cores/Node | 64 | 64 | 24 | 32 | - |
| Memory (GB)/Node | 512 | 512 | 384 | 196 | - |
| Total Cores | 1,280 | 2,048 | 960 | 608 | 4,896 |
| Total Memory (GB) | 10,240 | 16,384 | 15,360 | 3,724 | 45,708 |
| Local Disk | 1.7 TB NVMe | 1.7 TB NVMe | 669 GB SSD | 371 GB SSD | - |
| Interconnect | NDR-200 IB | NDR-200 IB | 10 Gbps Ethernet | 10 Gbps Ethernet | - |
Falcon is hosted in the AISB Datacenter at the Corporate Research Center (CRC) in Blacksburg.
An IBM ESS GPFS file system provides /projects for group collaboration, and a VAST file system provides /scratch for high-performance input/output (I/O).
Get Started
Falcon can be accessed via one of the two login nodes using your VT credentials:
falcon1.arc.vt.edu
falcon2.arc.vt.edu
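For example, you can connect from a terminal with SSH; `pid` below is a placeholder for your own VT username:

```bash
# Connect to a Falcon login node with your VT credentials
# ("pid" is a placeholder for your VT username)
ssh pid@falcon1.arc.vt.edu
```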
For testing purposes, all users will be allotted an initial 240 core-hours for 90 days in the “personal” allocation. Researchers at the PI level can request resource allocations in the “free” tier (usage fully subsidized by VT) and can allocate 1,000,000 monthly Service Units among their projects.
To create an allocation:
1. Log in to the ARC allocation portal https://coldfront.arc.vt.edu
2. Select or create a project
3. Click the “+ Request Resource Allocation” button
4. Choose the “Compute (Free) (Cluster)” allocation type
Usage in excess of 1,000,000 monthly Service Units can be purchased via the ARC Cost Center.
Partitions
Users submit jobs to partitions of the cluster depending on the type of GPU resources needed. If a job does not specify the amount of memory requested, the parameter DefMemPerCPU will automatically determine the memory for the job based on the number of CPU cores requested. If a GPU job does not specify the number of CPU cores, the parameter DefCpuPerGPU will automatically determine the number of CPU cores based on the number of GPUs requested. Jobs are billed against the user’s allocation according to the number of CPU cores, the amount of memory, and the GPU time used. Consult the Slurm configuration to understand how to specify these parameters for your job; an example submission follows the table below.
| Partition | l40s_normal_q | l40s_preemptable_q | a30_normal_q | a30_preemptable_q | v100_normal_q | v100_preemptable_q | t4_normal_q | t4_preemptable_q |
|---|---|---|---|---|---|---|---|---|
| Node Type | L40s | L40s | A30 | A30 | V100 | V100 | T4 | T4 |
| Features | - | - | - | - | - | - | - | - |
| Number of Nodes | 20 | 20 | 32 | 32 | 40 | 40 | 19 | 19 |
| DefMemPerCPU (MB) | 7920 | 7920 | 7920 | 7920 | 15720 | 15720 | 5744 | 5744 |
| DefCpuPerGPU | 8 | 8 | 8 | 8 | 6 | 6 | 6 | 6 |
| TRESBillingWeights | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=75 | - | CPU=1,Mem=0.0625G,GRES/gpu=50 | - | CPU=1,Mem=0.0625G,GRES/gpu=25 | - |
| PreemptMode | OFF | ON | OFF | ON | OFF | ON | OFF | ON |
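As a minimal sketch, a job on the a30_normal_q partition might be requested as follows; the account name `myproject` is a placeholder for your own allocation, and because no `--mem` is given, DefMemPerCPU sets the memory from the CPU count:

```bash
#!/bin/bash
# Hypothetical example: request 1 A30 GPU and 8 CPU cores for 4 hours.
# "--account=myproject" is a placeholder; use your own allocation name.
#SBATCH --partition=a30_normal_q
#SBATCH --account=myproject
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

module reset
nvidia-smi   # confirm which GPU is visible to the job
```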
Recommended Uses
The nodes selected for this cluster are intended to provide broad utility for a wide range of GPU-enabled applications.
The L40S GPUs deliver excellent AI/ML inference and training capabilities for models that fit within a single GPU’s 48 GB of device memory or can be sharded across multiple GPUs. However, they do not support double‑precision (FP64) arithmetic, making them unsuitable for most traditional HPC workloads that rely on high‑precision computations.
The A30 nodes do support FP64 and are ideal for GPU‑accelerated applications such as computational fluid dynamics, computational chemistry, and multiphysics simulations. They also handle AI/ML inference and training for smaller models. With 24 GB of device memory and NVIDIA’s Ampere‑generation architectural enhancements, they deliver performance for these tasks that is comparable to—or in some cases slightly better than—existing V100 installations.
The V100 GPUs, based on NVIDIA’s Volta architecture, feature 16 GB of HBM2 memory and full FP64 support at up to 7.8 TFLOPS peak per GPU. They strike a strong balance between HPC and deep‑learning workloads, delivering up to 125 TFLOPS of mixed‑precision (FP16) performance and robust double‑precision throughput. V100s remain a reliable workhorse for traditional simulation codes and large‑scale training jobs.
The T4 GPUs, built on the Turing architecture, offer 16 GB of GDDR6 memory and excel at efficient inference and low‑precision training. With specialized Tensor Cores for INT8 and INT4 operations, T4s provide up to 130 TOPS of INT8 throughput while consuming only 70 W of power.
Quality of Service (QoS)
The QoS associated with a job affects the job in three key ways: scheduling priority, resource limits, and time limits. Each partition has a default QoS named partitionname_base with a default priority, resource limits, and time limits. Users can optionally select a different QoS to increase or decrease the priority, resource limits, and time limits. The goal is to offer users multiple flexible options that adjust to their jobs' needs. The long QoS allows users to run for an extended period of time (up to 14 days) but reduces the total amount of resources that can be allocated to the job. The short QoS allows users to increase the resources for a job but reduces the maximum time to 1 day (see the example after the table below). ARC staff reserve the right to modify the QoS settings at any time to ensure fair and balanced utilization of resources among all users.
| Partition | QoS | Priority | MaxWall | MaxTRESPerUser | MaxTRESPerAccount |
|---|---|---|---|---|---|
| l40s_normal_q | fal_l40s_normal_base | 1000 | 7-00:00:00 | cpu=410,mem=3220G,gres/gpu=16 | cpu=820,mem=6439G,gres/gpu=32 |
| l40s_normal_q | fal_l40s_normal_long | 500 | 14-00:00:00 | cpu=103,mem=805G,gres/gpu=4 | cpu=205,mem=1610G,gres/gpu=8 |
| l40s_normal_q | fal_l40s_normal_short | 2000 | 1-00:00:00 | cpu=615,mem=4829G,gres/gpu=24 | cpu=1229,mem=9658G,gres/gpu=48 |
| l40s_preemptable_q | fal_l40s_preemptable_base | 0 | 30-00:00:00 | cpu=52,mem=403G,gres/gpu=2 | cpu=103,mem=805G,gres/gpu=4 |
| a30_normal_q | fal_a30_normal_base | 1000 | 7-00:00:00 | cpu=256,mem=2012G,gres/gpu=26 | cpu=512,mem=4024G,gres/gpu=52 |
| a30_normal_q | fal_a30_normal_long | 500 | 14-00:00:00 | cpu=64,mem=503G,gres/gpu=7 | cpu=128,mem=1006G,gres/gpu=13 |
| a30_normal_q | fal_a30_normal_short | 2000 | 1-00:00:00 | cpu=384,mem=3018G,gres/gpu=39 | cpu=768,mem=6036G,gres/gpu=77 |
| a30_preemptable_q | fal_a30_preemptable_base | 0 | 30-00:00:00 | cpu=32,mem=252G,gres/gpu=4 | cpu=64,mem=503G,gres/gpu=7 |
| v100_normal_q | fal_v100_normal_base | 1000 | 7-00:00:00 | cpu=192,mem=3008G,gres/gpu=16 | cpu=384,mem=6016G,gres/gpu=32 |
| v100_normal_q | fal_v100_normal_long | 500 | 14-00:00:00 | cpu=48,mem=752G,gres/gpu=4 | cpu=96,mem=1504G,gres/gpu=8 |
| v100_normal_q | fal_v100_normal_short | 2000 | 1-00:00:00 | cpu=288,mem=4512G,gres/gpu=24 | cpu=576,mem=9024G,gres/gpu=48 |
| v100_preemptable_q | fal_v100_preemptable_base | 0 | 30-00:00:00 | cpu=24,mem=376G,gres/gpu=2 | cpu=48,mem=752G,gres/gpu=4 |
| t4_normal_q | fal_t4_normal_base | 1000 | 7-00:00:00 | cpu=122,mem=711G,gres/gpu=4 | cpu=244,mem=1422G,gres/gpu=8 |
| t4_normal_q | fal_t4_normal_long | 500 | 14-00:00:00 | cpu=31,mem=178G,gres/gpu=1 | cpu=61,mem=356G,gres/gpu=2 |
| t4_normal_q | fal_t4_normal_short | 2000 | 1-00:00:00 | cpu=183,mem=1066G,gres/gpu=6 | cpu=365,mem=2132G,gres/gpu=12 |
| t4_preemptable_q | fal_t4_preemptable_base | 0 | 30-00:00:00 | cpu=16,mem=89G,gres/gpu=1 | cpu=31,mem=178G,gres/gpu=1 |
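As a sketch, a short, higher-priority run on the V100 nodes could select the short QoS explicitly; the script name `train.sh` below is a placeholder:

```bash
# Hypothetical example: submit a 1-day, higher-priority job by selecting the
# "short" QoS on the v100_normal_q partition ("train.sh" is a placeholder script).
sbatch --partition=v100_normal_q --qos=fal_v100_normal_short \
       --gres=gpu:2 --cpus-per-gpu=6 --time=1-00:00:00 train.sh
```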
Optimization
| Node Type | L40s Nodes | A30 Nodes | V100 Nodes | T4 Nodes |
|---|---|---|---|---|
| CPU arch | Sapphire Rapids | Sapphire Rapids | Cascade Lake | Cascade Lake |
| Compiler flags | | | | |
| GPU arch | NVIDIA L40s | NVIDIA A30 | NVIDIA V100 | NVIDIA T4 |
| Compute Capability | 8.9 | 8.0 | 7.0 | 7.5 |
| NVCC flags | | | | |
See the tuning guides available at https://www.intel.com/content/www/us/en/developer/
- Cache locality really matters: process pinning can make a big difference in performance.
- Hybrid programming often pays off: one MPI process per L3 cache with 4 threads is often optimal.
- Use the appropriate `-march` flag to optimize the compiled code and the `-gencode` flag when using the NVCC compiler (see the example below).
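For example, a build on the Sapphire Rapids (L40s/A30) nodes might look like the following sketch; the source file names are placeholders, and the `-march`/`-gencode` values shown are the standard ones for these CPU and GPU generations:

```bash
# Hypothetical build commands; solver.c and kernel.cu are placeholder file names.
# CPU code on Sapphire Rapids nodes (use -march=cascadelake on the V100/T4 nodes):
gcc -O3 -march=sapphirerapids -o solver solver.c

# GPU code for the L40s (compute capability 8.9); substitute compute_80/sm_80 for
# the A30, compute_70/sm_70 for the V100, or compute_75/sm_75 for the T4:
nvcc -O3 -gencode arch=compute_89,code=sm_89 -o kernel kernel.cu
```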
Slurm GPU to CPU bindings
Optimal application performance for GPU-accelerated workloads requires that processes launched on the nodes run on the CPU cores topologically closest to the GPU that the process will use. On Falcon, Slurm is aware of which sets of CPU cores and memory locations have the most direct connection to each GPU. The arrangement is slightly unintuitive:
| GPU device bus ID | GPU device | NUMA node | CPU cores |
|---|---|---|---|
| 4a:00.0 | 0 - /dev/nvidia0 | 1 | 16-31 |
| 61:00.0 | 1 - /dev/nvidia1 | 0 | 0-15 |
| ca:00.0 | 2 - /dev/nvidia2 | 3 | 48-63 |
| e1:00.0 | 3 - /dev/nvidia3 | 2 | 32-47 |
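You can verify this mapping yourself on a compute node with NVIDIA's topology report, which lists the CPU and NUMA affinity of each GPU:

```bash
# Print the GPU/CPU/NUMA topology matrix of the current node
nvidia-smi topo -m
```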
If we do not inform Slurm of this affinity, then nearly all jobs will have reduced performance due to misalignment of allocated cores and GPUs. By default, these cores will be preferred by Slurm for scheduling with the affiliated GPU device, but other arrangements are possible.
- Use the option `--gres-flags=enforce-binding` to require Slurm to allocate the affiliated CPU core(s) with the corresponding GPU device(s).
- The option `--gres-flags=disable-binding` is required to allocate more CPU cores than are bound to a device, but this is discouraged because these cores will then be unavailable to their correctly affiliated GPU.
To summarize, these nodes and the Slurm scheduling algorithms will operate most efficiently when jobs consistently request between 1 and 16 cores per GPU device. For example:

Do this: `--gres=gpu:1 --ntasks-per-node=1 --cpus-per-task=16 --gres-flags=enforce-binding`
which allocates 1 GPU, the associated 16 CPU cores, and 128 GB of system memory.

Do not do this: `--gres=gpu:1 --exclusive`
which allocates all the CPU cores and all the system memory to the job, but only one GPU device. The other 3 GPUs will be unavailable to your job and also unavailable to other jobs.

Do not do this: `--gres=gpu:1 --ntasks-per-node=32`
which allocates 256 GB of system memory, one GPU device plus its 16 affiliated CPU cores, AND 16 additional CPUs that have affinity to a different GPU. That isolated GPU is still available to other jobs, but can only run with diminished performance.
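Putting it together, a minimal job script following the recommended binding might look like this sketch; the partition, account, and application names are placeholders:

```bash
#!/bin/bash
# Hypothetical example; "myproject" and "./my_gpu_app" are placeholders.
#SBATCH --partition=l40s_normal_q
#SBATCH --account=myproject
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding   # keep the allocated cores local to the GPU
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

module reset
srun ./my_gpu_app
```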