Basic data transfer tools:

SCP

Use scp to download a file or directory from ARC

Call the scp command from the shell command line. This is usually available in default installations of Linux, Windows (PowerShell), and macOS. In recent tests (fall 2022), scp significantly outperformed GUI-based tools on Windows systems such as MobaXterm and WinSCP.

The basic syntax is scp <source> <destination>. Both the source and destination can be local or remote. When specifying a remote source or destination, you need to provide the hostname and the full path to the file or directory, separated by a colon, like this:

host.domain.tld:/path/to/file

Example: Pull from Tinkercliffs

In this example we “pull” data from ARC systems onto the local computer (e.g. a laptop, a workstation, or even a shell on another ARC node). So the <source> uses the hostname:filename format and the <destination> is the current working directory, which is referenced with a period “.”.

scp tinkercliffs1.arc.vt.edu:/home/username/filename.zip .

Example: Push to a /projects directory on Tinkercliffs

In this example we push a directory and its contents from the local system to a /projects directory which is mounted on a Tinkercliffs login node:

scp -r dirname tinkercliffs2.arc.vt.edu:/projects/mygroup/

The “-r” option is for a “recursive” transfer, which means the referenced directory and all of its contents will be transferred and the directory structure will be retained at the destination. If “-r” is not specified but the source is a directory, scp will fail with an error like:

cp: omitting directory ‘dirname’

RSYNC

rsync, “a fast, versatile, remote (and local) file-copying tool”, is a standard tool on Linux and Unix systems with a long list of options you can turn on or off to customize the nature of the transfer. It is particularly well-suited for performing a synchronization, where different versions of a data collection reside in two locations, because it can minimize the amount of data transferred and can resume a partially completed transfer. scp or cp, on the other hand, will always perform an entire copy from source to destination, even if files and directories already exist at the destination.
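For example, the following is a minimal sketch (the hostname and paths are placeholders, reusing the scp examples above) that pushes a local directory to a /projects directory, preserving permissions and timestamps, keeping partially transferred files so an interrupted run can be resumed, and skipping anything already up to date at the destination:

rsync -av --partial --progress dirname/ tinkercliffs1.arc.vt.edu:/projects/mygroup/dirname/

Running the same command again after an interruption transfers only the files that are missing or have changed at the destination.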

Best practices for transfers

Package datasets with a large number of files before transferring

If you need to transfer a dataset which has a large number of small files, use tar or zip to package the dataset into a smaller number of larger files. Most tools will process the files in a dataset sequentially, and there is significant overhead from the OS, network, and storage devices when many small files are transferred this way. A single, large-file transfer, on the other hand, will only incur this overhead latency once, and the rest of the time will be spent in high-bandwidth transfers.
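For example, a dataset directory can be packaged into one compressed archive, moved as a single large file, and unpacked at the destination (the dataset name and paths below are placeholders):

# package the directory into a single compressed archive
tar -czf dataset.tar.gz dataset/

# transfer one large file instead of many small ones
scp dataset.tar.gz tinkercliffs1.arc.vt.edu:/projects/mygroup/

# unpack on the ARC side
tar -xzf /projects/mygroup/dataset.tar.gz -C /projects/mygroup/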

For context in these scenarios:

  • “small files” means files smaller than 10MB

  • “large number of files” means thousands (1000+) or more

This is applicable to any transfer of a large number of small files, even intra-cluster. In many cases, it can be very effective to copy a data set (for example, AI/ML training data) to local scratch space on compute nodes. See this example for more detail.
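A minimal sketch of that pattern, assuming the dataset has already been packaged as a tarball under /projects (the paths are placeholders; use the node-local scratch location defined for your node type, e.g. $TMPNVME on the DGX nodes mentioned below):

# inside a job: copy the packaged dataset to node-local scratch and unpack it there
cp /projects/mygroup/dataset.tar.gz $TMPNVME/
tar -xzf $TMPNVME/dataset.tar.gz -C $TMPNVME/
# point the training/analysis job at $TMPNVME/dataset instead of the networked path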

Parallelize data transfers when possible

Most, if not all, of ARC’s networked storage systems (e.g. /home, /fastscratch, /projects) can manage many simultaneous data flows, but they are mounted via protocols whose performance for a single data transfer is much lower than the aggregate performance of several streams running in parallel. Standard tools like cp, mv, scp, and rsync process their source arguments serially, which means only one file is copied at a time. To engage the full bandwidth of the networked storage system, we need to parallelize, i.e. force multiple simultaneous transfers.

In this example benchmark, GNU parallel is used to launch a varying number of simultaneous copies from /fastscratch to $TMPNVME on a DGX compute node. Performance improves dramatically with parallelization, but plateaus at around eight simultaneous copies.

[Figure: copy throughput vs. number of simultaneous copies from /fastscratch to $TMPNVME on a DGX compute node]
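A sketch of that approach with GNU parallel (the source and destination paths are placeholders, and GNU parallel may need to be loaded via a module; -j sets the number of simultaneous copies):

# copy every file in a dataset with up to 8 cp processes at a time,
# recreating the directory structure under the destination
cd /fastscratch/username/dataset
mkdir -p $TMPNVME/dataset
find . -type f | parallel -j 8 cp --parents {} $TMPNVME/dataset/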

rclone

rclone is a command-line tool for transferring data to and from cloud storage providers such as Google Drive. The steps below set it up from an OnDemand Remote Desktop session, which provides a browser for the authentication step:

  • Log in to OnDemand: https://ood.arc.vt.edu

  • Start a Remote Desktop session

  • Start a shell via the link in the job card

List the tmux sessions on the node to find the one for your job:

tmux ls

Attach to the tmux session for your job. For example, for the job with id 447439:

tmux a -t OOD.tmux.447439

Then load the rclone module:

module load rclone/1.42-foss-2020a-amd64

Now follow the Google Drive setup guide: https://rclone.org/drive/

Example: Configure rclone for metfaces

As an example, to download the metfaces dataset (big, so beware), create a new remote (n) named metfaces, select Google Drive as the storage type (option 11 in this version of rclone), leave client_id and client_secret blank, and choose scope 1 (full access):

rclone config
> n
> metfaces
storage> 11
client_id> {blank}
client_secret> {blank}
scope> 1

Next is the folder id. For instance, the metfaces folder is https://drive.google.com/drive/folders/1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC, so the id is the last path component:

root_folder_id> 1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC
service_account_file> {blank}
Y/n> y

The final y answers the auto-config question, and rclone will print an address such as “http://127.0.0.1:53682/auth”. Copy that address.

Go to the Remote Desktop, start Firefox, and head to that web address.

Go back to the rclone config prompt and finish the setup:

Y/n> n
Y/e/d> y
e/n/d/r/c/s/q> q

To start using the rclone remote you just set up, you can, for instance:

Get a listing of files:

rclone ls metfaces:

Download the data from the metfaces Google Drive to the current directory:

rclone copy metfaces: ./