(data-transfer)=
# Basic data transfer tools

## SCP

### Use scp to download a file or directory from ARC

Call the `scp` command from the shell command line. It is usually available in default installations of Linux, Windows (PowerShell), and macOS. In recent tests (fall 2022), SCP significantly outperformed GUI-based tools on Windows systems such as MobaXterm and WinSCP.

The basic syntax is `scp <source> <destination>`. Both the source and destination can be local or remote. When specifying a remote source or destination, provide the hostname and the full path to the file or directory, separated by a colon, like this:

```
host.domain.tld:/path/to/file
```

### Example: Pull from Tinkercliffs

In this example we "pull" data onto the local computer (e.g. a laptop, workstation, or even a shell on another ARC node) from ARC systems. The `<source>` uses the `hostname:filename` format and the `<destination>` is the current working directory, which is referenced with a period "`.`":

```
scp tinkercliffs1.arc.vt.edu:/home/username/filename.zip .
```

### Example: Push to a projects directory on Tinkercliffs

In this example we push a directory and its contents from the local system to a `/projects` directory which is mounted on a Tinkercliffs login node:

```
scp -r dirname tinkercliffs2.arc.vt.edu:/projects/mygroup/
```

The "`-r`" option requests a "recursive" transfer: the referenced directory and all of its contents are transferred, and the directory structure is retained at the destination. If "`-r`" is not specified but the source is a directory, `scp` will fail with an error like:

```
cp: omitting directory ‘dirname’
```

## RSYNC

`rsync`, "a fast, versatile, remote (and local) file-copying tool", is standard on Linux and Unix systems and has a long list of options you can turn on or off to customize the nature of the transfer. It is particularly well-suited to synchronization, where different versions of a data collection reside in two locations, because it minimizes the amount of data transferred and can resume a partially completed transfer. `scp` and `cp`, on the other hand, will always perform an entire copy from source to destination, even if files and directories already exist at the destination.

## Best practices for transfers

(best-practice)=
### Package datasets with a large number of files before transferring

If you need to transfer a dataset which has a large number of small files, use `tar` or `zip` to package the dataset into a smaller number of larger files. Most tools process the files in a dataset sequentially, and there is significant overhead from the OS, network, and storage devices when many small files are transferred this way. A single large-file transfer, on the other hand, incurs this overhead latency only once, and the rest of the time is spent in high-bandwidth transfer.

For context in these scenarios:

- "small files" means files smaller than 10MB
- "large number of files" means thousands or more (1000+)

This applies to any transfer of a large number of small files, even intra-cluster. In many cases it can be very effective to copy a dataset (for example AI/ML training data) to local scratch space on compute nodes. See [this example](scratch) for more detail.

### Parallelize data transfers when possible

Most, if not all, of ARC's networked storage systems (e.g. `/home`, `/fastscratch`, `/projects`) are capable of managing many simultaneous data flows, but they are mounted via a protocol whose performance in a single data transfer is inherently much lower than the aggregate performance of several streams running in parallel. Standard tools like `cp`, `mv`, `scp`, and `rsync` process their source arguments serially, which means only one file is copied at a time. To engage the full bandwidth of the networked storage system, we need to parallelize, that is, force multiple simultaneous transfers.

In this example benchmark, GNU parallel is used to launch a varying number of simultaneous copies from `/fastscratch` to `$TMPNVME` on a DGX compute node. Performance improves dramatically with parallelization, but plateaus at around eight simultaneous copies.

![image](../_assets/img/gnupar_copy_timerate.png)
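As a minimal sketch of this approach (the source path and file pattern below are hypothetical placeholders, not the exact commands used in the benchmark), GNU parallel can fan a copy out across several `cp` processes:

```
# Copy each file with its own cp process, up to 8 at a time;
# in the benchmark above, throughput plateaued around 8 streams.
parallel -j 8 cp {} "$TMPNVME"/ ::: /fastscratch/username/dataset/*
```

Each matching file becomes a single `cp` invocation, and `-j` controls how many of those run concurrently.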
## rclone

- Log in to OnDemand: https://ood.arc.vt.edu
- Start a Remote Desktop session
- Start a shell via the link in the job card

List the running tmux sessions:

```
tmux ls
```

Attach to the tmux session for your job and load the rclone module. For example, for the job with id 447439:

```
tmux a -t OOD.tmux.447439
module load rclone/1.42-foss-2020a-amd64
```

Now follow: https://rclone.org/drive/

### Example: Configure rclone for metfaces

As an example, to download the metfaces dataset (big, so beware):

```
rclone config
> n
> metfaces
storage> 11
client_id> {blank}
client_secret> {blank}
scope> 1
```

Next is the folder id; for instance, for metfaces it comes from https://drive.google.com/drive/folders/1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC

```
root_folder_id> 1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC
service_account_file> {blank}
Y/n> y
```

Now copy the address shown (for instance "http://127.0.0.1:53682/auth"). Go to the Remote Desktop, start Firefox, and head to that web address. Then go back to the rclone config:

```
Y/n> n
Y/e/d> y
e/n/d/r/c/s/q> q
```

To start using the rclone remote you just set up, you can, for instance:

### Get a listing of files

```
rclone ls metfaces:
```

### Download the data in the metfaces Google Drive to the current directory

```
rclone copy metfaces: ./
```
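Following the parallelization advice above, rclone can also run several file transfers at once. As a sketch using standard rclone flags (the destination directory name here is just an example):

```
# Run 8 file transfers in parallel and show live progress
rclone copy --transfers 8 --progress metfaces: ./metfaces/
```

`--transfers` controls how many files are copied concurrently (the default is 4), which helps most when the remote contains many files.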