DragonsTooth, High-Throughput Computing

The DragonsTooth cluster, after due notice, was decommissioned during a maintenance outage January 17-19, 2023. Please consider using the [Tinkercliffs](tinkercliffs) and [Infer](infer) clusters instead.

Overview

DragonsTooth is a 48-node system designed to support general batch HPC jobs. The table below lists the technical details of each DragonsTooth node. Nodes are connected to each other and to storage via 10 gigabit ethernet (10GbE), a communication channel with high bandwidth but higher latency than InfiniBand (IB). As a result, DragonsTooth is better suited to jobs that require less internode communication and/or less I/O interaction with non-local storage than NewRiver, which has similar nodes but a low-latency IB interconnect. To support I/O-intensive jobs, DragonsTooth nodes are each outfitted with nearly 2 TB of solid-state local disk. DragonsTooth was released to the Virginia Tech research community in August 2016. In November 2018, DragonsTooth was reprovisioned with Slurm as its scheduler, replacing Moab/Torque.
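
Jobs that make heavy use of the local SSDs generally follow a stage-in/compute/stage-out pattern. The sketch below illustrates that pattern inside a Slurm batch script; the node-local scratch location (shown via $TMPDIR with a /tmp fallback) and the application name are placeholders for illustration, not DragonsTooth-specific values, so check the ARC storage documentation for the actual path.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=24

# Node-local scratch location (path is an assumption; verify on your system)
LOCAL_SCRATCH=${TMPDIR:-/tmp}

# Stage input data onto the node-local SSD
cp $HOME/project/input.dat "$LOCAL_SCRATCH"/

# Run against the local copy so heavy I/O stays off the 10GbE network
cd "$LOCAL_SCRATCH"
$HOME/project/my_app input.dat > output.dat   # my_app is a placeholder program

# Copy results back to persistent storage before the job ends
cp output.dat $HOME/project/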

Technical Specifications

| Component | Specification |
| --- | --- |
| CPU | 2 x Intel Xeon E5-2680v3 (Haswell) 2.5 GHz, 12-core |
| Memory | 256 GB 2133 MHz DDR4 |
| Local Storage | 4 x 480 GB SSD Drives |
| Theoretical Peak (DP) | 806 GFLOPS |

Policies

Note: DragonsTooth is governed by an allocation manager, meaning that to run most jobs on it, you must be an authorized user of an allocation that has been submitted and approved. For more on allocations, see the ARC allocations documentation.

As described above, communication between nodes, and between nodes and storage, has higher latency on DragonsTooth than on other ARC clusters. For this reason, the queue structure is designed to allow more jobs and longer-running jobs than on other ARC clusters. DragonsTooth has two partitions (queues):

  • normal_q for production (research) runs.

  • dev_q for short testing, debugging, and interactive sessions. dev_q provides slightly elevated job priority to facilitate code development and job testing prior to production runs.
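
For quick tests on dev_q, an interactive session can be requested directly through Slurm. The following is a minimal sketch using generic Slurm commands; the allocation name is a placeholder for your approved allocation:

# Request 1 core on dev_q for 30 minutes and open a shell on the compute node
# (--account value is a placeholder; use the name of your approved allocation)
srun --partition=dev_q --account=yourallocation --nodes=1 --ntasks=1 --time=00:30:00 --pty bash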

The settings for the partitions are:

| Partition | normal_q | dev_q |
| --- | --- | --- |
| Access to | dt003-dt048 | dt003-dt048 |
| Max Jobs | 288 per user, 432 per allocation | 1 per user |
| Max Nodes | 12 per user, 18 per allocation | 12 per user |
| Max Core-Hours* | 34,560 per user, 51,840 per allocation | 96 per user |
| Max Walltime | 30 days | 2 hr |

Other notes:

  • Shared node access: more than one job can run on a node.

*A user cannot, at any one time, have more than this many core-hours allocated across all of their running jobs. So you can run long jobs or large/many jobs, but not both. For illustration, the following table describes how many nodes a user can allocate for a given amount of time:

| Walltime | Max Nodes (per user) | Max Nodes (per allocation) |
| --- | --- | --- |
| 72 hr (3 days) | 12 | 18 |
| 144 hr (6 days) | 10 | 15 |
| 360 hr (15 days) | 4 | 6 |
| 720 hr (30 days) | 2 | 3 |
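
These limits follow from the core-hour cap and the 24 cores per node (2 x 12-core CPUs): for example, 10 nodes x 24 cores x 144 hr = 34,560 core-hours, exactly the per-user cap, and 15 nodes x 24 cores x 144 hr = 51,840 core-hours, the per-allocation cap. (At 72 hr, the 12-node per-user and 18-node per-allocation limits are reached before the core-hour cap.)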

Access

DragonsTooth can be accessed via one of the two login nodes:

  • dragonstooth1.arc.vt.edu

  • dragonstooth2.arc.vt.edu

Users may also use Open OnDemand to access the cluster.
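
For command-line access, a standard SSH client works with either login node; the username below is a placeholder for your own account:

# Connect to a DragonsTooth login node (replace 'yourusername' with your own)
ssh yourusername@dragonstooth1.arc.vt.edu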

Job Submission

Access to all compute nodes is controlled via the Slurm resource manager; see the Slurm documentation for additional usage information. Example resource requests on DragonsTooth include:

#Request exclusive access to all resources on 2 nodes 
#SBATCH --nodes=2 
#SBATCH --exclusive

#Request 4 cores (on any number of nodes)
#SBATCH --ntasks=4

#Request 2 nodes with 12 tasks running on each
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12

#Request 12 tasks with 20GB memory per core
#SBATCH --ntasks=12 
#SBATCH --mem-per-cpu=20G
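
Putting these pieces together, the following is a minimal sketch of a complete batch script for the normal_q partition. The allocation name, module, and program are placeholders; adjust them to your own project and software:

#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --partition=normal_q
#SBATCH --account=yourallocation   # placeholder: your approved allocation
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24       # 24 cores per DragonsTooth node
#SBATCH --time=72:00:00            # must respect the walltime and core-hour limits above

# Load required software (module name is a placeholder)
module load foss

# Launch the program (program name is a placeholder)
srun ./my_mpi_program

Save the script (for example as example_job.sh) and submit it with sbatch example_job.sh.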