Using Anaconda on ARC systems

Use recent versions of Anaconda

ARC will keep older versions of software packages available on clusters, but we recommend using more recent packages where available. This is particularly true of Anaconda because the base code continues to evolve in functionality and integration with package repositories. There are many cases where current/recent environments are impossible to create or update when using older versions of Anaconda.

Use module spider anaconda to search our module system for the most recent Anaconda available on the system you’re using.

Do not run conda init

Running conda init is a convenience for managing Anaconda virtual environments on a single computer, but it does not produce portable results. The principal action of conda init is to add lines like the following to the user's bash startup script ~/.bashrc:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/apps/easybuild/software/tinkercliffs-rome/Anaconda3/2020.11/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/apps/easybuild/software/tinkercliffs-rome/Anaconda3/2020.11/etc/profile.d/conda.sh" ]; then
        . "/apps/easybuild/software/tinkercliffs-rome/Anaconda3/2020.11/etc/profile.d/conda.sh"
    else
        export PATH="/apps/easybuild/software/tinkercliffs-rome/Anaconda3/2020.11/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

Notably, you can see that explicit references are made to paths which are specific to a particular node type on ARC systems, like /apps/easybuild/software/tinkercliffs-rome/.... Such a path exists on only one node type on Tinkercliffs and will fail on a different type of Tinkercliffs node or on any other cluster's nodes. In short, conda init produces non-portable results, so we recommend against using it.
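If conda init has already modified your ~/.bashrc, the marked block can be removed with a single sed range expression that deletes everything between the two marker comments, inclusive. A sketch on a scratch file (the file name bashrc.demo and its contents are illustrative, not from a real system; test on a copy before editing your real ~/.bashrc):

```shell
# Create a scratch file standing in for ~/.bashrc (contents illustrative)
cat > bashrc.demo <<'EOF'
export PATH=$HOME/bin:$PATH
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
eval "$__conda_setup"
# <<< conda initialize <<<
alias ll='ls -l'
EOF

# Delete the whole conda init block, keeping a backup with a .bak suffix
sed -i.bak '/# >>> conda initialize >>>/,/# <<< conda initialize <<</d' bashrc.demo

cat bashrc.demo   # the conda block is gone; surrounding lines are untouched
```

The same sed expression applied to ~/.bashrc undoes what conda init added, without disturbing the rest of the file.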

Use source activate and do not use conda activate

Use of conda activate <envname> requires the Anaconda initialization described above, and so is not designed to work on systems where a single home directory is shared among several different node types. Instead, use source activate <envname> to activate Anaconda virtual environments.

Create a virtual environment specifically for the type of node where it will be used

The Tinkercliffs cluster has at least three different node types, as does Infer. Each node type has a different CPU micro-architecture and slightly different operating system, kernel version, system configuration, and installed packages, all tuned for that node type's particular hardware. These system differences can make Anaconda virtual environments non-portable between node types.

As a result, you should create and build a virtual environment on a node of the type where you will use the environment.

Example 1:

The Tinkercliffs login nodes are essentially identical to normal_q partition nodes. So if you wish to use Anaconda for jobs on normal_q nodes, you can build the environment on the Tinkercliffs login nodes OR on the normal_q nodes. But you should not use an environment which was built on another cluster (e.g. Cascades or Infer). Instead, use conda list to view and document the most important packages and versions in the environment, then build a new environment for the Tinkercliffs normal_q matching those specifications.

Example 2:

If you want to use Anaconda on Tinkercliffs a100_normal_q nodes, then you need to build the environment from a shell on those nodes.

The important commands for this are:

command                          purpose
-------------------------------  -------------------------------------------------------
interact                         get an interactive command line shell on a compute node
module spider                    search for the latest Anaconda module
module load                      load a module
conda create -p $HOME/envname    create a new Anaconda environment at the provided path
source activate $HOME/envname    activate the newly created environment
conda install ...                install packages into the environment

Note

$HOME "expands" in the shell to your home directory, e.g. /home/jdoe2. And envname from above should be a short but meaningful name for the environment. Since environments are particular to the node type, it is recommended to reference the node type in the name. For example, tca100-science or tcnq for Tinkercliffs a100_normal_q nodes or Tinkercliffs normal_q nodes, respectively.

[jdoe2@tinkercliffs2 ~]$ interact --partition=a100_normal_q --nodes=1 --ntasks-per-node=4 --gres=gpu:1 --account=jdoeacct
srun: job 438605 queued and waiting for resources
srun: job 438605 has been allocated resources
[jdoe2@tc-gpu004 ~]$ module spider anaconda
---------------------------------------------------------------------------------------------------------------
  Anaconda3:
---------------------------------------------------------------------------------------------------------------
    Description:
      Built to complement the rich, open source Python community, the Anaconda platform provides an enterprise-ready data analytics platform that empowers companies
      to adopt a modern open data science analytics architecture.

     Versions:
        Anaconda3/2020.07
        Anaconda3/2020.11

---------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "Anaconda3" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider Anaconda3/2020.11
---------------------------------------------------------------------------------------------------------------

[jdoe2@tc-gpu004 ~]$ module load Anaconda3/2020.11
[jdoe2@tc-gpu004 ~]$ conda create -p ~/env/a100_env
Collecting package metadata (current_repodata.json): done
Solving environment: done

Please update conda by running

    $ conda update -n base conda

## Package Plan ##

  environment location: /home/jdoe2/env/a100_env

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

[jdoe2@tc-gpu004 ~]$ source activate /home/jdoe2/env/a100_env/
(/home/jdoe2/env/a100_env) [jdoe2@tc-gpu004 ~]$ conda install python=3.9 pandas
Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.10.3
  latest version: 4.12.0

Please update conda by running

    $ conda update -n base conda

## Package Plan ##

  environment location: /home/jdoe2/env/a100_env

  added / updated specs:
    - pandas
    - python=3.9
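The same module load / source activate pattern applies inside non-interactive batch jobs. A minimal job-script sketch reusing the environment built above (the partition, account name, environment path, and my_script.py are placeholders from the examples, not fixed values):

```shell
#!/bin/bash
#SBATCH --partition=a100_normal_q
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:1
#SBATCH --account=jdoeacct        # replace with your own allocation

# Load the same Anaconda module that was used to build the environment
module load Anaconda3/2020.11

# Activate with 'source activate' (not 'conda activate'; see above)
source activate $HOME/env/a100_env

python my_script.py
```

Because the job runs on the same node type where the environment was built, the activated environment matches the node's architecture.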

GPU - CUDA compatibility

While nvidia-smi will display a CUDA version, this is just the base CUDA on the node and can be overridden by:

  • loading a different CUDA module (search with module spider cuda)

  • activating an Anaconda environment which includes cudatoolkit (check with conda list cudatoolkit)

  • installing a conda package built against a different CUDA (e.g. run conda list tensorflow and check the build string)

A100 GPUs require CUDA 11.0 or greater.

Check CUDA version in Tensorflow

import tensorflow as tf
sys_details = tf.sysconfig.get_build_info()
cuda_version = sys_details["cuda_version"]
print(cuda_version)

Check cuDNN version in TensorFlow

cudnn_version = sys_details["cudnn_version"]
print(cudnn_version)

Check CUDA version in PyTorch

import torch
print(torch.version.cuda)