AlphaFold

Introduction

This open-source code provides an implementation of the AlphaFold system. It allows users to predict the 3-D structure of arbitrary proteins with unprecedented accuracy. AlphaFold v2.0 is a completely new model that was entered in the CASP14 assessment and published in Nature (Jumper et al. 2021). The package contains source code, trained weights, and an inference script. We have also made available a centralized repository of more than 1.3 TB of protein databases which AlphaFold references.

Availability

AlphaFold 2.3.2 and 3.0.1 are available on the GPU partitions of TinkerCliffs and Falcon.

The following installed versions are available as modules:

AlphaFold/2.3.2-foss-2023a-CUDA-12.1.1
AlphaFold/3.0.1

The databases required for AlphaFold to run are downloaded to a central repository local to the cluster (/common/data/alphafold) so that you do not need to maintain your own copy. The databases can be updated by ARC upon request if newer versions are needed.

License

AlphaFold versions through 2.3.2 are licensed under the Apache License 2.0, a permissive license whose main conditions require preservation of copyright and license notices.

AlphaFold v3 was initially released as closed-source software that was unavailable to the public. The code has since been released, but access to the necessary trained-model parameters requires a binding legal agreement between Google and Virginia Tech. ARC has signed such an agreement, and users of AlphaFold v3 are required to accept the terms and conditions provided by Google.

Interface

While the AlphaFold documentation is primarily focused on using a containerized implementation of the software, we have installed AlphaFold and all its dependencies using a prescribed build procedure via EasyBuild. The interface for AlphaFold is available via command-line scripts. Most use of AlphaFold will be in batch jobs where the user provides input files and submits a script to the Slurm scheduler to run the job in batch mode.

The software and all its dependencies are available via a module:

module load AlphaFold/2.3.2-foss-2023a-CUDA-12.1.1
or
module load AlphaFold/3.0.1

Job Submission

Duration

Typical runs on a single input have taken 1-2 hours for sequences with 150-1000 residues. The bulk of the time is spent on heavy I/O while AlphaFold reads the databases in the central repository. Usually less than 5 minutes is spent on the GPU.

Significant time is also spent building the application during the job run. If you have multiple input FASTA files, you can avoid repeating this process by providing them all, comma-separated, in the --fasta_paths option of the alphafold command.
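
The comma-separated list of inputs described above can be assembled automatically rather than typed by hand. The following is a minimal sketch, not an ARC-provided tool: the function name join_fasta_paths and the assumption that all inputs sit in one directory with a .fasta extension are illustrative only. It does not handle file names that contain commas or newlines.

```shell
# Hypothetical helper (not part of AlphaFold): print every *.fasta file
# found directly inside a directory as one comma-separated string.
join_fasta_paths() {
    # find lists the files, sort makes the order reproducible,
    # and paste joins the lines with commas.
    find "$1" -maxdepth 1 -type f -name '*.fasta' | sort | paste -sd, -
}

# Example use in a job script (the directory is a placeholder):
# FASTA_INPUT=$(join_fasta_paths "$HOME/af_inputs")
```

The resulting string can be assigned to the FASTA_INPUT variable used in the template job script so that all inputs are processed in a single job.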

Computational resources to request

Since little time is spent on GPU, it would be wasteful to allocate more than a single GPU for the job. However, AlphaFold does appear to make good use of the large GPU memory on the A100-80GB GPUs, with occupancy reaching 80% or higher.

Since these nodes have 8 GPUs, we recommend requesting a proportional amount of CPU (and consequently memory) for the job. The A100 nodes have 128 cores in total, so that amounts to 16 cores/GPU and this also aligns with the memory and PCIe architecture of the nodes. When 16 cores are requested, the memory allocation will be 512GB for the job.
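
The proportional request described above is simple arithmetic, sketched here in shell so it can be adapted if the node layout changes. The variable names are illustrative; the node figures are the ones quoted in the paragraph above.

```shell
# Node figures taken from the text above; adjust if the hardware differs.
cores_per_node=128
gpus_per_node=8
gpus_requested=1   # assumption: a single GPU, as recommended

# Share of CPU cores that corresponds to the requested GPUs.
cores_per_gpu=$(( cores_per_node / gpus_per_node ))
cpus_per_task=$(( cores_per_gpu * gpus_requested ))

# Print the matching Slurm directives for pasting into a job script.
echo "#SBATCH --gres=gpu:${gpus_requested}"
echo "#SBATCH --cpus-per-task=${cpus_per_task}"
```

With the values shown, this yields 16 cores for one GPU, matching the --cpus-per-task=16 setting in the template job script.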

Tips

Make use of the ALPHAFOLD_DATA_DIR we provide, and use the environment variable to simplify the command structure.
If you specify the --pdb_seqres_database_path option on the command line, AlphaFold expects the path to a file (i.e., $ALPHAFOLD_DATA_DIR/pdb_seqres/pdb_seqres.txt), not a directory.
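
Because passing a directory where a file is expected is easy to get wrong, a job script can check the path before starting a long run. This is a minimal sketch; require_file is a hypothetical helper name, not something provided by AlphaFold or by the module.

```shell
# Hypothetical helper (not part of AlphaFold): succeed only if the
# argument is an existing regular file; report why otherwise.
require_file() {
    if [ -d "$1" ]; then
        echo "error: $1 is a directory, expected a file" >&2
        return 1
    elif [ ! -f "$1" ]; then
        echo "error: $1 does not exist" >&2
        return 1
    fi
}

# Example use in a job script, before calling alphafold:
# require_file "${ALPHAFOLD_DATA_DIR}/pdb_seqres/pdb_seqres.txt" || exit 1
```

Failing early this way avoids waiting in the queue only to have the job stop on a path error.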

Template from Examples repo for AlphaFold 2.3 multimer

Download the job script and example FASTA file from ARC’s Examples repo on GitHub:

wget 'https://raw.githubusercontent.com/AdvancedResearchComputing/examples/master/alphafold/tcdgx_af_mult.sh' -O tcdgx_af_multi.sh
wget 'https://raw.githubusercontent.com/AdvancedResearchComputing/examples/master/alphafold/Melanogaster_GR28BD_tetramer.fasta' -O Melanogaster_GR28BD_tetramer.fasta

Edit the job script tcdgx_af_multi.sh so that it uses your account, then submit the job:

sbatch tcdgx_af_multi.sh

Template job script for AlphaFold 2.3.2

#!/bin/bash
# run-af.sh: template job script for running AlphaFold on TinkerCliffs A100 nodes
# usage: 1. supply input and output paths 2. "sbatch run-af.sh"

#SBATCH --account=<your allocation account>
#SBATCH --partition=a100_normal_q
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:1
#SBATCH --time=3:00:00

module reset
module load AlphaFold/2.3.2-foss-2023a-CUDA-12.1.1

#Set input and output paths
FASTA_INPUT=<path to fasta input file(s)>
OUTPUT_DIR=<path to desired output directory>

#Set Path to central database repository
export ALPHAFOLD_DATA_DIR=/common/data/alphafold/2.3.2/
export ALPHAFOLD_HHBLITS_N_CPU=16
export ALPHAFOLD_JACKHMMER_N_CPU=16

#Run AlphaFold
alphafold --data_dir ${ALPHAFOLD_DATA_DIR} \
          --output_dir ${OUTPUT_DIR} \
          --model_preset multimer \
          --fasta_paths ${FASTA_INPUT} \
          --max_template_date=2025-03-01 \
          --pdb70_database_path=${ALPHAFOLD_DATA_DIR}/pdb70 \
          --pdb_seqres_database_path=${ALPHAFOLD_DATA_DIR}/pdb_seqres/pdb_seqres.txt \
          --uniprot_database_path=${ALPHAFOLD_DATA_DIR}/uniprot