AlphaFold

Introduction

This open source code provides an implementation of the AlphaFold v2.0 system. It allows users to predict the 3-D structure of arbitrary proteins with unprecedented accuracy. AlphaFold v2.0 is a completely new model that was entered in the CASP14 assessment and published in Nature (Jumper et al. 2021). The package contains source code, trained weights, and an inference script. We have also made available a centralized repository of more than 1.3 TB (as of Jan 2022) of protein databases which AlphaFold references.

AlphaFold v2.2 or later is required for applications involving multimers.

Availability

AlphaFold is currently available only on Tinkercliffs:

  • A100 GPU nodes (a100_normal_q) have AlphaFold v2.0

  • DGX nodes (dgx_normal_q) have AlphaFold v2.2.2

The databases required for AlphaFold to run are downloaded to a central repository local to the cluster (/global/biodatabases/alphafold) so that you do not need to maintain your own copy. ARC can update the databases upon request if newer versions are needed.

License

AlphaFold is licensed under the Apache License 2.0, a permissive license whose main conditions require preservation of copyright and license notices.

Interface

While the AlphaFold documentation is primarily focused on using a containerized implementation of the software, we have installed AlphaFold and all its dependencies using a prescribed build procedure via EasyBuild. The interface for AlphaFold is available via command-line scripts. Most use of AlphaFold will be in batch jobs, where the user provides input files and submits a script to the Slurm scheduler.

The software and all its dependencies are available via a module:

::::{tab-set}

:::{tab-item} v2.2.2 on Tinkercliffs DGX nodes

module load AlphaFold/2.2.2-foss-2021a-CUDA-11.3.1

:::

:::{tab-item} v2.0 on Tinkercliffs A100 nodes

```{code-block}
module load AlphaFold/2.0.0-fosscuda-2020b
```

:::

::::

Job Submission

Duration

Typical runs on a single input have taken 1-2 hours for sequences of 150-1000 residues. Most of that time is spent in heavy I/O while AlphaFold searches the databases in the central repository. Usually less than 5 minutes is spent on the GPU.
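To judge where an input falls relative to the sequence lengths quoted above, it can help to count the sequence characters in each FASTA record before submitting. The following is a minimal sketch; the function name `fasta_lengths` is hypothetical and not part of AlphaFold, and it assumes a plain-text FASTA file with `>` header lines.

```shell
# fasta_lengths (hypothetical helper): print "<record id><TAB><sequence length>"
# for every record in a FASTA file. The id is the first word after ">".
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: flush the previous record, then start counting again.
            if (have) print id "\t" len
            have = 1; id = substr($1, 2); len = 0
            next
        }
        {
            # Sequence line: ignore spaces, tabs and carriage returns.
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END { if (have) print id "\t" len }
    ' "$1"
}

# Usage (path is a placeholder):
#   fasta_lengths my_protein.fasta
```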

Significant time is also spent building the application during the job run. If you have multiple input FASTA files, you can avoid repeating this process by providing them all, comma-separated, in the --fasta_paths specification to the alphafold command.
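One way to build the comma-separated list is sketched below. The helper name `join_fasta_paths` and the example directory are hypothetical; the result is intended for the `FASTA_INPUT` variable that the template job script on this page passes to `--fasta_paths`.

```shell
# join_fasta_paths (hypothetical helper): print all arguments as a single
# comma-separated string. Paths containing commas are not supported.
join_fasta_paths() {
    local IFS=,
    # "$*" joins the arguments using the first character of IFS.
    printf '%s\n' "$*"
}

# Usage (directory is a placeholder):
#   FASTA_INPUT=$(join_fasta_paths /path/to/inputs/*.fasta)
```

Setting `IFS` with `local` keeps the change confined to the function, so the rest of the job script is unaffected.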

Computational resources to request

Since little time is spent on GPU, it would be wasteful to allocate more than a single GPU for the job. However, AlphaFold does appear to make good use of the large GPU memory on the A100-80GB GPUs, with occupancy reaching 80% or higher.

Since these nodes have 8 GPUs, we recommend requesting a proportional amount of CPU (and consequently memory) for the job. The A100 nodes have 128 cores in total, so that amounts to 16 cores/GPU and this also aligns with the memory and PCIe architecture of the nodes. When 16 cores are requested, the memory allocation will be 512GB for the job.
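The proportional core count described above is simple arithmetic, sketched here for reference. The function name `cpus_for_gpus` is hypothetical; the defaults of 128 cores and 8 GPUs per node are the A100 node figures quoted in the text.

```shell
# cpus_for_gpus (hypothetical helper): CPU cores to request for a given number
# of GPUs, keeping the node's core-to-GPU ratio.
# Arguments: <gpus> [cores per node, default 128] [GPUs per node, default 8]
cpus_for_gpus() {
    local gpus=$1 node_cores=${2:-128} node_gpus=${3:-8}
    echo $(( gpus * node_cores / node_gpus ))
}

# Usage:
#   cpus_for_gpus 1    # cores to pair with --gres=gpu:1 on an A100 node
```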

Template from Examples repo for AlphaFold 2.2 multimer

Download the job script and example FASTA file from ARC’s examples repository on GitHub:

wget 'https://raw.githubusercontent.com/AdvancedResearchComputing/examples/master/alphafold/tcdgx_af_mult.sh' -O tcdgx_af_multi.sh
wget 'https://raw.githubusercontent.com/AdvancedResearchComputing/examples/master/alphafold/Melanogaster_GR28BD_tetramer.fasta' -O Melanogaster_GR28BD_tetramer.fasta

Edit the job script tcdgx_af_multi.sh so that it uses your account, then submit the job:

sbatch tcdgx_af_multi.sh
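The account edit mentioned above can also be done non-interactively before submitting. This is a sketch only: it assumes the downloaded script contains a line beginning with `#SBATCH --account=` (as the AlphaFold 2.0 template on this page does), that GNU `sed` is available, and that the account name contains no `|` or `&` characters. The function name `set_account` and the account name in the usage line are hypothetical.

```shell
# set_account (hypothetical helper): rewrite the "#SBATCH --account=" line of a
# job script in place. Requires GNU sed for the -i option.
# Arguments: <job script> <allocation account>
set_account() {
    local script=$1 account=$2
    sed -i "s|^#SBATCH --account=.*|#SBATCH --account=${account}|" "$script"
}

# Usage (account name is a placeholder):
#   set_account tcdgx_af_multi.sh myallocation
```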

Template job script for AlphaFold 2.0

Tinkercliffs A100 nodes, AlphaFold 2.0

Version 2.0 of AlphaFold supports only monomer prediction and does not enable multimer folding inference.
#!/bin/bash
# run-af.sh: template job script for running AlphaFold on Tinkercliffs A100 nodes
# usage: 1. supply input and output paths 2. "sbatch run-af.sh"

#SBATCH --account=<your allocation account>
#SBATCH --partition=a100_normal_q
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:1
#SBATCH --time=3:00:00

module reset
module load AlphaFold/2.0.0-fosscuda-2020b

#Set input and output paths
FASTA_INPUT=<path to fasta input file(s)>
OUTPUT_DIR=<path to desired output directory>

#Set Path to central database repository
export ALPHAFOLD_DATA_DIR=/global/biodatabases/alphafold/

#Run AlphaFold (no characters may follow the trailing backslashes)
alphafold --data_dir "$ALPHAFOLD_DATA_DIR" \
        --output_dir "$OUTPUT_DIR" \
        --model_names model_1 \
        --fasta_paths "$FASTA_INPUT" \
        --max_template_date 2021-11-17