ARC System Changes: 2026-05

  • Open OnDemand: ood.arc.vt.edu system, firmware, and software updates

  • Hosted LLMs: llm.arc.vt.edu authentication framework change

  • Cluster changes (Tinkercliffs, Owl, Falcon):

    • update to Slurm version 25.11 (requires job queues to be dropped)

    • cluster manager software updates

    • fix bug in sizing of TMPFS

    • updates to correctly auto-drain nodes when GPUs report ECC errors

    • (TBD, in testing) update NVIDIA drivers to 595.71.05 for CUDA 13.2 compatibility on all GPU nodes except V100s

  • Configuration changes:

    • tweak to reduce the number of nodes drained after some jobs time out

    • increase throughput of backfill scheduler cycle

    • introduce scheduling tweaks to reduce wait time for interactive jobs launched via OnDemand or interact

    • update job priority weights to better balance job age, job size, and fairshare, and to improve capacity for backfill scheduling

  • Storage systems:

    • VAST (/scratch, /apps), Qumulo (/home), ESS (/projects) storage system updates
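
For readers wondering how the job priority weights listed under the configuration changes interact, the following is a minimal sketch. It assumes the clusters use Slurm's multifactor priority plugin, where each factor is a value between 0.0 and 1.0 that is multiplied by its configured weight and the products are summed. The weight values, the PriorityWeights class, and the job_priority function are hypothetical illustrations, not ARC's actual settings, and the sketch omits the other terms Slurm can include (partition, QOS, TRES, nice).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorityWeights:
    """Hypothetical stand-ins for PriorityWeightAge, PriorityWeightJobSize,
    and PriorityWeightFairshare. ARC's real values are not published here."""
    age: int
    job_size: int
    fairshare: int


def job_priority(weights: PriorityWeights,
                 age_factor: float,
                 job_size_factor: float,
                 fairshare_factor: float) -> int:
    """Weighted sum of the three factors named in the change notes.

    Each factor is expected to be normalized to the range [0.0, 1.0].
    """
    factors = {
        "age_factor": age_factor,
        "job_size_factor": job_size_factor,
        "fairshare_factor": fairshare_factor,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    total = (weights.age * age_factor
             + weights.job_size * job_size_factor
             + weights.fairshare * fairshare_factor)
    return round(total)


# Assumed example weights: fairshare dominates, job size matters least.
example_weights = PriorityWeights(age=1000, job_size=500, fairshare=2000)

# A job that has waited half the maximum age, is a quarter of the maximum
# size, and belongs to a lightly used account.
example_priority = job_priority(example_weights, 0.5, 0.25, 0.75)
print(example_priority)
```

With these assumed weights, raising the age weight relative to the fairshare weight would let long-waiting jobs overtake jobs from lightly used accounts sooner. On a Slurm system, the sprio command reports the per-factor contributions for pending jobs.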