(data-transfer)=
# Basic data transfer tools

## SCP

### Use scp to download a file or directory from ARC

Call the `scp` command from the shell command line. It is usually available in default installations of Linux, Windows (PowerShell), and macOS. In recent tests (fall 2022), SCP significantly outperformed GUI-based tools on Windows systems such as MobaXterm and WinSCP.

The basic syntax is `scp <source> <destination>`. Both the source and destination can be local or remote. When specifying a remote source or destination, provide the hostname and the full path to the file or directory, separated by a colon, like this:

```
host.domain.tld:/path/to/file
```

### Example: Pull from Tinkercliffs

In this example we "pull" data onto the local computer (e.g. a laptop, workstation, or even a shell on another ARC node) from ARC systems. The `<source>` uses the `hostname:filename` format and the `<destination>` is the current working directory, which is referenced with a period "`.`":

```
scp tinkercliffs1.arc.vt.edu:/home/username/filename.zip .
```

### Example: Push to a projects directory on Tinkercliffs

In this example we push a directory and its contents from the local system to a `/projects` directory which is mounted on a Tinkercliffs login node:

```
scp -r dirname tinkercliffs2.arc.vt.edu:/projects/mygroup/
```

The "`-r`" option requests a "recursive" transfer: the referenced directory and all of its contents are transferred, and the directory structure is retained at the destination. If "`-r`" is not specified but the source is a directory, `scp` will fail with an error like:

```
cp: omitting directory ‘dirname’
```

## RSYNC

`rsync`, "a fast, versatile, remote (and local) file-copying tool", is standard on Linux and Unix systems and has a long list of options you can turn on or off to customize the nature of the transfer. It is particularly well-suited to synchronization, where different versions of a data collection reside in two locations, because it minimizes the amount of data transferred and can resume a partially completed transfer. `scp` and `cp`, on the other hand, will always perform an entire copy from source to destination, even if files and directories already exist at the destination.

## Best practices for transfers

(best-practice)=
### Package datasets with a large number of files before transferring

If you need to transfer a dataset which has a large number of small files, use `tar` or `zip` to package the dataset into a smaller number of larger files. Most tools process the files in a dataset sequentially, and there is significant overhead from the OS, network, and storage devices when many small files are transferred this way. A single large-file transfer, on the other hand, incurs this overhead latency only once, and the rest of the time is spent in high-bandwidth transfer.

For context in these scenarios:

- "small files" means files smaller than 10MB
- "large number of files" means thousands or more (1000+)

This applies to any transfer of a large number of small files, even intra-cluster. In many cases it can be very effective to copy a dataset (for example AI/ML training data) to local scratch space on compute nodes. See [this example](scratch) for more detail.

### Parallelize data transfers when possible

Most, if not all, of ARC's networked storage systems (e.g. `/home`, `/fastscratch`, `/projects`) are capable of managing many simultaneous data flows, but they are mounted via a protocol whose performance in a single data transfer is inherently much lower than the aggregate performance of several streams running in parallel. Standard tools like `cp`, `mv`, `scp`, and `rsync` process their source arguments serially, which means only one file is copied at a time. To engage the full bandwidth of the networked storage system, we need to parallelize, that is, force multiple simultaneous transfers.

In this example benchmark, GNU parallel is used to launch a varying number of simultaneous copies from `/fastscratch` to `$TMPNVME` on a DGX compute node. Performance improves dramatically with parallelization, but plateaus at around eight simultaneous copies.

![image](../_assets/img/gnupar_copy_timerate.png)
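As a minimal sketch of this approach (the source path and file pattern below are hypothetical placeholders, not the exact commands used in the benchmark), GNU parallel can fan a copy out across several `cp` processes:

```
# Copy each file with its own cp process, up to 8 at a time;
# in the benchmark above, throughput plateaued around 8 streams.
parallel -j 8 cp {} "$TMPNVME"/ ::: /fastscratch/username/dataset/*
```

Each matching file becomes a single `cp` invocation, and `-j` controls how many of those run concurrently.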
## rclone

- Log in to OnDemand: https://ood.arc.vt.edu
- Start a Remote Desktop session
- Start a shell via the link in the job card

List the running tmux sessions:

```
tmux ls
```

Attach to the tmux session for your job and load the rclone module. For example, for the job with id 447439:

```
tmux a -t OOD.tmux.447439
module load rclone/1.42-foss-2020a-amd64
```

Now follow: https://rclone.org/drive/

### Example: Configure rclone for metfaces

As an example, to download the metfaces dataset (big, so beware):

```
rclone config
> n
> metfaces
storage> 11
client_id> {blank}
client_secret> {blank}
scope> 1
```

Next is the folder id; for instance, for metfaces it comes from https://drive.google.com/drive/folders/1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC

```
root_folder_id> 1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC
service_account_file> {blank}
Y/n> y
```

Now copy the address shown (for instance "http://127.0.0.1:53682/auth"). Go to the Remote Desktop, start Firefox, and head to that web address. Then go back to the rclone config:

```
Y/n> n
Y/e/d> y
e/n/d/r/c/s/q> q
```

To start using the rclone remote you just set up, you can, for instance:

### Get a listing of files

```
rclone ls metfaces:
```

### Download the data in the metfaces Google Drive to the current directory

```
rclone copy metfaces: ./
```
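Following the parallelization advice above, rclone can also run several file transfers at once. As a sketch using standard rclone flags (the destination directory name here is just an example):

```
# Run 8 file transfers in parallel and show live progress
rclone copy --transfers 8 --progress metfaces: ./metfaces/
```

`--transfers` controls how many files are copied concurrently (the default is 4), which helps most when the remote contains many files.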