
AlphaFold 2



About

AlphaFold 2 predicts the 3-D structure of arbitrary proteins from their amino-acid sequences. It was published in Nature (Jumper et al. 2021).

We have made AlphaFold available on the campus cluster through Singularity, along with helper scripts that run the container against your particular input files. It works best when using GPUs for the computation.


Preparing to run

Here we will show how to load the required modules, create a submission script, and submit the job.


To get started, you need the FASTA file you want to run against. If you just want to try it out, you can grab a FASTA file from public sources.
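For example, assuming you want an entry from the RCSB PDB (here 3DMW, the same structure used in the example below), one way is to download its sequence directly via RCSB's FASTA download endpoint:

wget https://www.rcsb.org/fasta/entry/3DMW -O 3DMW.fasta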


Next, load the environment modules to put the software in your path:

module load singularity/3.8.0 alphafold/2.2.0
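You can confirm the modules loaded correctly with:

module list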


This example assumes you have the FASTA file in a directory in your home directory called fasta_files. It also assumes you are writing the output files to the scratch directory. You can make the directories like this:

mkdir -p ~/fasta_files

mkdir -p /central/scratch/$USER/alphafold/out

There is an example FASTA file at /central/software/alphafold/examples which can be copied to your fasta_files directory:

cp /central/software/alphafold/examples/rcsb_pdb_3DMW-EDS.fasta ~/fasta_files/.
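If you open the file, you should see the standard FASTA layout: a header line starting with >, followed by the amino-acid sequence. A generic illustration (not the actual contents of the example file) looks like this:

>example_protein optional description
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR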


Next you will want to create a submission script. There is an example file available as well which you can copy to your home directory:

cp /central/software/alphafold/examples/alphafold.sub ~/.


The Submission Script
We will go through the script line by line so you understand what it is doing.

The script starts like any normal shell script, typically by calling bash:

#!/bin/bash

Any line that starts with #SBATCH is an instruction to the scheduler. These lines tell the scheduler what resources you need and can set various job options. They will be superseded by anything passed on the command line when submitting.
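For example, submitting the job like this would override the --time value set inside the script:

sbatch --time=2-00:00 alphafold.sub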

Next we give the job a name which will show up in the scheduler:

#SBATCH --job-name=alphafold_run

Then we say how long we want the job to run; the job will be killed when it reaches this limit. We will start with the maximum time of 7 days, but once you are more comfortable with your job runtimes you may want to drop this to something more realistic. A realistic limit keeps jobs that are doing the wrong thing from incurring additional costs, and also lets your jobs get through the queue faster, since they may be able to fit into a backfill slot.

#SBATCH --time=7-00:00
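Once you have run a few jobs, Slurm's accounting tool can show how long they actually took, which helps in picking a realistic limit (by default sacct lists jobs from today; -X shows one line per job):

sacct -X --format=JobID,JobName,Elapsed,State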

The next lines describe the resources your job will use. In this case we will use a single node and not allow other jobs to run on it (exclusive). The job will run one task, but that task can use 28 cores. We are also requesting 4 GPUs on the node and 32 GB of memory:

#SBATCH --nodes=1

#SBATCH --ntasks=1

#SBATCH --gres=gpu:4           # You need to request at least one GPU to run AlphaFold properly

#SBATCH --exclusive

#SBATCH --cpus-per-task=28     # adjust this if you are using parallel commands

#SBATCH --mem=32G              # adjust this according to the memory requirement per node you need


The next two lines have the scheduler keep you informed about when the job starts and ends. Make sure to put your actual email address in; Slurm does not expand $USER inside #SBATCH lines, so replace the placeholder below. You can also leave these out if you prefer not to be emailed.

#SBATCH --mail-user=$USER@caltech.edu

#SBATCH --mail-type=ALL


Next we get to the commands that will actually run on the compute node.

First we set some variables for where your input files are, where to put the output files, and where the AlphaFold data directories are. The download directory only needs to be changed if you are using non-standard data:

DOWNLOAD_DIR=/central/software/alphafold/data/  # Set the appropriate path to your downloaded data

INPUT_DIR=/home/$USER/fasta_files/

OUTPUT_DIR=/central/scratch/$USER/alphafold/out


Next we load the modules again, in case you forgot to load them before submitting.

module load singularity/3.8.0 alphafold/2.2.0

Then we create the output directory if you hadn't already:

mkdir -p $OUTPUT_DIR
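If you want a quick sanity check that the job received the GPUs you requested, you can print the allocated devices before launching AlphaFold (optional):

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi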

Then we run the wrapper script for AlphaFold, which launches the AlphaFold container via Singularity, passes it the options you set, and fills in sensible defaults for anything you didn't set. The time command at the beginning simply reports how long the process took.

time run_alphafold.sh -g True -m multimer -o $OUTPUT_DIR  -f $INPUT_DIR/rcsb_pdb_3DMW-EDS.fasta -t 2010-01-01
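If you have more than one FASTA file in your input directory, you can loop over them, since the wrapper takes a single FASTA file per invocation (a simple sketch using the variables defined above):

for fasta in "$INPUT_DIR"/*.fasta; do
    time run_alphafold.sh -g True -m multimer -o "$OUTPUT_DIR" -f "$fasta" -t 2010-01-01
done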

Wrapper script options

Please make sure all required parameters are given

Usage: /central/software/alphafold/2.2.0/bin/run_alphafold.sh <OPTIONS>

Required Parameters:

-o <output_dir>   Path to a directory that will store the results.

-m <model_names>  Name of model to use <monomer|monomer_casp14|monomer_ptm|multimer>

-f <fasta_path>   Path to a FASTA file containing one sequence

-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets

Optional Parameters:

-b <benchmark>    Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'False')

-d <data_dir>     Path to directory of supporting data

-g <use_gpu>      Enable NVIDIA runtime to run with GPUs (default: True)

-a <gpu_devices>  Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)

-n <number>       How many predictions (each with a different random seed) will be generated per model

-p <preset>       Choose preset model configuration - no ensembling (full_dbs) or 8 model ensemblings (casp14) (default: 'full_dbs')
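As an illustration, a monomer run pinned to the first two GPUs with five predictions per model could look like this (my_protein.fasta is a placeholder; the flags are the ones documented above):

run_alphafold.sh -g True -a 0,1 -m monomer -n 5 -o $OUTPUT_DIR -f $INPUT_DIR/my_protein.fasta -t 2022-01-01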

Submitting your job
Submitting your job is quite simple. Now that you have a submission script with the options you want, submit it with the sbatch command and it will print your job ID:

[naveed@head1 ]$ sbatch alphafold.sub
Submitted batch job 17028039

Once the job is running, there will be a job log file in your working directory (in this case probably your home directory) named something like slurm-17028039.out, and the AlphaFold output files will be in the output directory you set up in your submission script.
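You can check on the job with the usual Slurm commands and follow the log file as it is written:

squeue -u $USER                # list your pending and running jobs
tail -f slurm-17028039.out     # follow the job's log output
scancel 17028039               # cancel the job if something looks wrong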

Troubleshooting

Error: 2022-07-19 14:56:53.454079: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:618] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle. 2022-07-19 14:56:53.487217: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***

Solution: Try unsetting TF_FORCE_UNIFIED_MEMORY, either in an interactive session or in your sbatch file, *and* increase the memory you request by 2-3x as a test. (You can drop the memory back down once you have verified that things are working.)

unset TF_FORCE_UNIFIED_MEMORY
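In the submission script this amounts to two small changes, sketched here against the 32G example above: raise the memory request (this line belongs with the other #SBATCH lines at the top of the script) and unset the variable just before launching AlphaFold.

#SBATCH --mem=96G              # temporarily 2-3x the original 32G request

unset TF_FORCE_UNIFIED_MEMORY
time run_alphafold.sh -g True -m multimer -o $OUTPUT_DIR -f $INPUT_DIR/rcsb_pdb_3DMW-EDS.fasta -t 2010-01-01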