Frequently asked questions

How does AlphaFold work?

 

DeepMind’s 2021 methods paper is the best reference for this. It gives an overview of the most important ideas, and there is a detailed description of all aspects of the system in the Supplementary Information.
Note that the architecture of the system used at CASP14 differs significantly from the version used at CASP13, making it important to refer to the 2021 publication.

Visit our online training course to learn more about AlphaFold.

 


What is AlphaMissense?

 

AlphaMissense is an AI model that builds on Google DeepMind’s AlphaFold2 to categorise ‘missense’ mutations in different proteins as either ‘likely pathogenic’, ‘likely benign’ or ‘uncertain’, producing a score that estimates the likelihood of a variant being pathogenic. AlphaMissense leverages AlphaFold2’s capability to model protein structure, and its capacity to learn evolutionary constraints from related sequences. The implementation is closely aligned with AlphaFold2, with some architectural differences.

AlphaMissense was used to classify the effects of all 216 million possible single amino acid substitutions across the 19,233 canonical human proteins.

Using an amino acid sequence as an input, AlphaMissense:

  • Gives an indication of which mutations are more likely to underlie human diseases - such as rare genetic disorders or developmental diseases - by categorising missense mutations into likely pathogenic or likely benign. Combined with other types of information, it can help to decipher what mutations may be causing a disease.
  • Helps to highlight potential functionally important regions of the protein.

Note that AlphaMissense does not predict the change in protein structure, or biophysical properties such as stability, upon mutation. Instead, it uses related protein sequences and protein structure as contextual information to estimate pathogenicity.

For more information about AlphaMissense, please refer to the paper: Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (2023).

AlphaMissense scores for all human missense variants are available on the Google Cloud Public Dataset.

 


What data can I download from AlphaMissense?

 

We provide downloadable files in csv format to help you explore the data for a specific canonical human protein:

  • Heatmap data. This file contains the scores estimating the likelihood of pathogenicity and classifications for each possible amino acid substitution within the protein. This file is used to visualise the heatmap you see on the entry pages.
    Column am_class categorises the mutation based on the score:
    • LPath: Likely pathogenic
    • LBen: Likely benign
    • Amb: Ambiguous
  • Pathogenicity scores (HG19 and HG38). These files include AlphaMissense scores for all possible missense single nucleotide variants across the human reference genome. Each file corresponds to a specific genome assembly with information on specific genome position.
    • HG19. Corresponds to the GRCh37 human genome reference assembly.
    • HG38. Corresponds to the GRCh38 human genome reference assembly.
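
As a rough illustration of working with the heatmap file, the sketch below counts how many substitutions fall into each am_class category. Only the am_class column and its three codes come from the description above; the other column names (variant, score) and the example rows are made up for illustration and may not match the real file.

```python
import csv
import io
from collections import Counter

# Codes and labels as listed for the am_class column above.
AM_CLASS_LABELS = {
    "LPath": "Likely pathogenic",
    "LBen": "Likely benign",
    "Amb": "Ambiguous",
}


def count_am_classes(csv_file):
    """Count rows per am_class value in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row["am_class"] for row in reader)
    # Translate the short codes into readable labels; unknown codes pass through.
    return {AM_CLASS_LABELS.get(code, code): n for code, n in counts.items()}


# Hypothetical rows: "variant" and "score" are assumed column names.
example = io.StringIO(
    "variant,score,am_class\n"
    "M1A,0.91,LPath\n"
    "M1C,0.12,LBen\n"
    "M1D,0.45,Amb\n"
    "M1E,0.88,LPath\n"
)
summary = count_am_classes(example)
print(summary)
```

For a downloaded file, pass an open file handle (for example from open(path, newline="")) instead of the in-memory example.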
 


What information does AlphaFold use from the Protein Data Bank?

 

AlphaFold is trained on protein chains in the PDB released before 2018-04-30. Predictions can also make use of up to 4 templates released before 2021-02-15. However, templates are not a critical input for AlphaFold to make an accurate prediction; the model can make a strong prediction based on a multiple sequence alignment alone.

Additionally, AlphaFold can ignore templates if they appear unhelpful - it isn’t required to copy their structure. As of this latest release, PDB structures shown to the model are recorded in the prediction mmCIF file.

 


How does AlphaFold compare to other structure prediction tools?

 

The CASP14 assessment compared leading structure prediction methods in detail; the results are available here. AlphaFold was the top-ranked method, with a median GDT (Global Distance Test) score of 92.4 across all targets and 87.0 on the challenging free-modelling category, compared to 72.8 and 61.0 for the next best methods in these categories.

Structural biologists more often express the similarity between two protein structures by first optimally superposing the structures and then calculating the root-mean-square distance (RMSD) between the Cα atoms of equivalent residues. Taking the median RMSD-Cα on the best predicted 95% of residues reduces the effect of flexible tails and crystal-packing artefacts. On this metric AlphaFold’s CASP14 predictions had a median distance of 0.96 Å to the experimental models, compared to 2.83 Å for the next-best method.
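
The idea of scoring only the best 95% of residues can be sketched as follows. This is a simplified illustration, not the evaluation code used for CASP14: it assumes the structures have already been superposed and that you have one Cα-Cα distance per residue, and the function name is hypothetical.

```python
import math


def rmsd_best_percent(distances, percent=95):
    """RMSD over the `percent` of residues with the smallest deviations.

    distances: per-residue Cα-Cα distances (in Å) after superposition.
    """
    if not distances:
        raise ValueError("need at least one distance")
    # Integer arithmetic avoids floating-point surprises when rounding down.
    keep = max(1, (len(distances) * percent) // 100)
    kept = sorted(distances)[:keep]
    return math.sqrt(sum(d * d for d in kept) / len(kept))


# 19 well-predicted residues and one badly placed flexible-tail residue.
example_distances = [1.0] * 19 + [100.0]
trimmed = rmsd_best_percent(example_distances)        # outlier excluded
untrimmed = rmsd_best_percent(example_distances, 100)  # outlier included
print(trimmed, untrimmed)
```

In this example a single outlier dominates the untrimmed RMSD, while the trimmed value reflects the well-predicted residues.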

 


Which proteins are included?

 

AlphaFold DB has grown in several stages:

  • July 2021: included 20 model organism proteomes, with sequences taken from the “one sequence per gene” reference proteomes provided in UniProt 2021_02.
  • December 2021: added Swiss-Prot, from UniProt 2021_04.
  • January 2022: added a collection of proteomes relevant to global health, taken from priority lists compiled by the World Health Organization. Sequences were again taken from the “one sequence per gene” reference proteomes provided in UniProt 2021_04.
  • July 2022: added most of the remaining UniProt 2021_04. As part of this release we have also included an additional tar file on the AFDB download page and FTP, containing predictions in MANE select.
  • November 2022: updated a set of structures affected by a temporary numerical bug (miscompilation) in the previous July release (list of affected accessions, N.B. 160 MiB). This temporary issue resulted in low accuracy predictions with correspondingly low pLDDT for ~4% of the total structure predictions available in the database. This release includes:
    • Updated coordinates for affected structures. You can still access all old coordinates as v3 files, and easily compare v3 and v4 coordinates.
    • Minor metadata changes in the mmCIF files for the rest of the structures (these files are released as v4). Please refer to our changelog for more details. Note that as part of this release we’ve also removed predictions with Cα-Cα > 10 Å.

The wider UniProt predictions are the output of a single model, while Swiss-Prot / proteomes entries represent the most confident prediction out of 5 model runs. Internal benchmarking on CASP14 shows that the model used for UniProt (“model_2_ptm”) is not significantly less accurate (-1 GDT versus five models), and that there is a slight bias toward lower confidence (-1 pLDDT) due to the effect of using one model rather than selecting from 5.

Not all sequences are covered; the most common reasons for a missing sequence are:

  • It is outside our length range. The minimum length is 16 amino acids, while the maximum is 2,700 for proteomes / Swiss-Prot and 1,280 for the rest of UniProt. For the human proteome only, our download includes longer proteins segmented into fragments.
  • It contains non-standard amino acids (e.g. X).
  • It is not in the UniProt reference proteome “one sequence per gene” Fasta file.
  • It has been added or modified by UniProt in a more recent release.
  • It is a viral protein. These are currently excluded, pending improved support for polyproteins.
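
The sequence-level criteria above (length range and standard amino acids) can be checked with a short script before searching. This is a minimal sketch: the function name and messages are hypothetical, and the remaining criteria (reference proteome membership, UniProt release, viral origin) cannot be determined from the sequence alone, so they are not covered.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 16
MAX_LENGTH_PROTEOMES_SWISSPROT = 2700
MAX_LENGTH_OTHER_UNIPROT = 1280


def coverage_problems(sequence, in_proteome_or_swissprot):
    """Return the sequence-level reasons a protein may be missing (empty if none)."""
    problems = []
    max_length = (
        MAX_LENGTH_PROTEOMES_SWISSPROT
        if in_proteome_or_swissprot
        else MAX_LENGTH_OTHER_UNIPROT
    )
    if len(sequence) < MIN_LENGTH:
        problems.append(f"shorter than {MIN_LENGTH} residues")
    elif len(sequence) > max_length:
        problems.append(f"longer than {max_length} residues")
    # Anything outside the 20 standard one-letter codes (e.g. X) is flagged.
    non_standard = sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)
    if non_standard:
        problems.append("non-standard residues: " + "".join(non_standard))
    return problems


print(coverage_problems("ACDEFGHIKLMNPQRSTVWX", True))
```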

We plan to continue updating the database. In the meantime, if your sequence(s) aren’t included, you can generate your own AlphaFold predictions using Google DeepMind’s Colab notebook and open source code, which also support multimer predictions.

 


How many proteins are there in the database?

 

There are 214,683,839 structures available on the AlphaFold DB website, including 48 complete proteomes available for bulk download. An additional 3,095 structures are included in the human proteome download, covering sequences longer than our usual length limit split into fragments.

 


What use cases does AlphaFold not support?

 

AlphaFold DB currently focuses on the use case validated in CASP14: predicting the structure of a single protein chain with a naturally occurring sequence. Many other use cases remain active areas of research, for example:

  • The version of AlphaFold used to construct this database does not output multi-chain predictions (complexes). In some cases the single-chain prediction may correspond to the structure adopted in complex. In other cases (especially where the chain is structured only on binding to partner molecules) the missing context from surrounding molecules may lead to an uninformative prediction. A separate version of AlphaFold was trained for complex prediction (AlphaFold Multimer). You can find the open source code on GitHub and make multimer predictions using Google DeepMind’s Colab.
  • For regions that are intrinsically disordered or unstructured in isolation AlphaFold is expected to produce a low-confidence prediction (pLDDT < 50) and the predicted structure will have a ribbon-like appearance. AlphaFold may be of use in identifying such regions, but the prediction makes no statement about the relative likelihood of different conformations (it is not a sample from the Boltzmann distribution).
  • AlphaFold has not been validated for predicting the effect of mutations. In particular, AlphaFold is not expected to produce an unfolded protein structure given a sequence containing a destabilising point mutation.
  • Where a protein is known to have multiple conformations, AlphaFold usually only produces one of them. The output conformation cannot be reliably controlled.
  • AlphaFold does not predict the positions of any non-protein components found in experimental structures (such as cofactors, metals, ligands, ions, DNA/RNA, or post-translational modifications). However, AlphaFold is trained to predict the structure of proteins as they might appear in the PDB. Therefore backbone and side chain coordinates are frequently consistent with the expected structure in the presence of ions (e.g. for zinc-binding sites) or cofactors (e.g. side chain geometry consistent with heme binding).
 


How do I search the database?

 

The search bar at the top of the page accepts queries based on protein name (e.g. Free fatty acid receptor 2), gene name (e.g. At1g58602), UniProt accession (e.g. Q5VSL9), or organism name (e.g. E. coli).

 


How does sequence-based search work?

 

AlphaFold Database sequence-based similarity search is implemented using the Protein Basic Local Alignment Search Tool (BLASTP, see further information here: https://github.com/ncbi/docker/blob/master/blast/README.md). This tool compares a protein sequence query to the sequences of the predictions in the AlphaFold Database and returns a list of AlphaFold predictions with similar sequences to the one the user provided. 

The search process can take up to 10 minutes to complete. To revisit, review or share your results, you can copy or bookmark the url to the results page. Note that the query must be at least 20 amino acids long and only standard residues are accepted.

We display a pairwise sequence alignment, where the top row is the input sequence, the middle row is the matching amino acid positions between the input sequence query and the target sequence from the database, and the bottom row shows the target sequence. 
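
To make the three-row layout concrete, here is a small sketch that builds a middle row from two already-aligned sequences. It is an illustration only, not the code behind the website: it marks identical positions with the residue letter and leaves everything else blank, whereas BLAST-style output may also mark conservative substitutions with "+". The example sequences are invented.

```python
def match_row(query_aligned, target_aligned):
    """Build the middle row: the residue letter where both rows agree, else a space."""
    if len(query_aligned) != len(target_aligned):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        q if q == t and q != "-" else " "
        for q, t in zip(query_aligned, target_aligned)
    )


query = "MKT-AYIAK"    # input sequence, "-" marks a gap
target = "MKSLAYLAK"   # target sequence from the database
middle = match_row(query, target)
identical = sum(c != " " for c in middle)

print(query)
print(middle)
print(target)
```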

 


What if I can’t find the protein I’m interested in?

 

If you can’t find the structure you’re looking for, here are some suggestions to improve your search results:

  • Try searching by protein or gene name rather than specific UniProt accession.
  • If running a sequence search, the input query should contain at least 20 amino acids, and only standard amino acids are accepted.
  • For human proteins longer than 2,700 amino acids, check the whole proteome download. This contains longer proteins predicted as overlapping fragments. For example, Titin has predicted fragment structures named as Q8WZ42-F1 (residues 1–1400), Q8WZ42-F2 (residues 201–1600), etc. 
  • Check that the protein isn’t excluded by any of the criteria covered in a previous FAQ.
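
The Titin example suggests fragments of 1,400 residues that start every 200 residues. The sketch below generates fragment names and ranges under that assumption. Only the first two ranges (F1 and F2) come from the example above; the window and step sizes for later fragments, and especially the handling of the final fragment, are assumptions, so check the actual files in the proteome download.

```python
def fragment_ranges(length, window=1400, step=200):
    """Return (fragment name, start, end) tuples with 1-based inclusive ranges.

    Assumes a fixed window and step; the last fragment is simply truncated at
    the end of the sequence, which may differ from the real download.
    """
    ranges = []
    start = 1
    index = 1
    while True:
        end = min(start + window - 1, length)
        ranges.append((f"F{index}", start, end))
        if end == length:
            break
        start += step
        index += 1
    return ranges


# Hypothetical 3,000-residue protein.
fragments = fragment_ranges(3000)
for name, start, end in fragments[:2]:
    print(name, start, end)
```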

The AlphaFold source code and Colab notebook can be used to predict the structures of proteins not in AlphaFold DB. Both resources have been updated to support predicting multimer structures.

If you experience any issues with search, please contact afdbhelp@ebi.ac.uk. 


What is included on a structure page?

 

Structure pages show basic information about the protein (drawn from UniProt), and three separate outputs from AlphaFold.

The first is the 3D coordinates (including side chains if you click on the sequence in the viewer).

The second output is a per-residue confidence metric called pLDDT, which is used to colour the residues of the prediction. Note that model confidence can vary greatly along a chain so it is important to consult the confidence estimates when interpreting structural features. The lower confidence bands may be associated with disorder (see our publication).

The third output is Predicted Aligned Error, which is necessary to assess confidence in the domain packing and large-scale topology of the protein. See the FAQ below on how to interpret relative domain positions.

 


How can I download a structure prediction?

 

Coordinate files can be downloaded from the menu in the top right of the structure page in mmCIF or PDB format. These formats are widely accepted by 3D structure viewing software, such as PyMOL and Chimera.

 


How confident should I be in a prediction?

 

AlphaFold produces a per-residue estimate of its confidence on a scale from 0 to 100. This confidence measure is called pLDDT and corresponds to the model’s predicted score on the lDDT-Cα metric. It is stored in the B-factor fields of the mmCIF and PDB files available for download (although unlike a B-factor, higher pLDDT is better). pLDDT is also used to colour-code the residues of the model in the 3D structure viewer. The following rules of thumb provide guidance on the expected reliability of a given region:

  • Regions with pLDDT > 90 are expected to be modelled to high accuracy. These should be suitable for any application that benefits from high accuracy (e.g. characterising binding sites).
  • Regions with pLDDT between 70 and 90 are expected to be modelled well (a generally good backbone prediction).
  • Regions with pLDDT between 50 and 70 are low confidence and should be treated with caution.
  • The 3D coordinates of regions with pLDDT < 50 often have a ribbon-like appearance and should not be interpreted. We show in our paper that pLDDT < 50 is a reasonably strong predictor of disorder, i.e. it suggests such a region is either unstructured in physiological conditions or only structured as part of a complex. (Note: this relationship has typically been tested in the context of well-studied proteins, which may have more evolutionarily-related sequences available than a randomly chosen UniProt entry.)
  • Structured domains with many inter-residue contacts are likely to be more reliable than extended linkers or isolated long helices.
  • Unphysical bond lengths and clashes do not usually appear in confident regions. Any part of a structure with several of these should be disregarded.

Note that the PDB and mmCIF files contain coordinates for all regions, regardless of their pLDDT score. It is up to the user to interpret the model judiciously, in accordance with the guidance above.
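
As a rough sketch of how the guidance above might be applied programmatically, the code below reads pLDDT from the B-factor columns of Cα atoms in PDB-format lines and assigns each residue to one of the bands. The band labels and the treatment of values exactly at 50, 70 or 90 are our own assumptions, and the example records are synthetic (built by a hypothetical helper) rather than taken from a real prediction.

```python
def plddt_band(plddt):
    """Map a pLDDT value to a band based on the rules of thumb above."""
    if plddt > 90:
        return "high accuracy"
    if plddt >= 70:
        return "modelled well"
    if plddt >= 50:
        return "low confidence"
    return "do not interpret"


def ca_plddt_from_pdb(lines):
    """Map (chain, residue number) -> pLDDT from Cα ATOM records.

    Uses the fixed PDB columns: atom name 13-16, chain 22,
    residue number 23-26 and B-factor 61-66 (1-based).
    """
    result = {}
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            res_seq = int(line[22:26])
            result[(chain, res_seq)] = float(line[60:66])
    return result


def example_ca_line(serial, res_seq, plddt):
    """Hypothetical helper: format a synthetic Cα ATOM record at the origin."""
    return (
        f"ATOM  {serial:>5}  CA  ALA A{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


example_lines = [
    example_ca_line(1, 1, 95.2),
    example_ca_line(2, 2, 75.0),
    example_ca_line(3, 3, 42.1),
]
per_residue = ca_plddt_from_pdb(example_lines)
bands = {key: plddt_band(value) for key, value in per_residue.items()}
print(bands)
```

For a downloaded PDB file, pass the lines of the file instead of the synthetic records.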

 


How should I interpret the relative positions of domains?

 

Independent of the 3D structure, AlphaFold produces an output called “Predicted Aligned Error” (PAE). This is shown at the bottom of structure pages as an interactive 2D plot.

  • The colour at (x, y) indicates AlphaFold’s expected position error at residue x if the predicted and true structures were aligned on residue y.
  • If the PAE is generally low for residue pairs x, y from two different domains, it indicates that AlphaFold predicts well-defined relative positions and orientations for them.
  • If the PAE is generally high for residue pairs x, y from two different domains, then the relative positions and/or orientations of these domains in the 3D structure are uncertain and should not be interpreted.

AlphaFold produces a useful inter-domain prediction in some cases. However, in CASP14 intra-domain prediction accuracy was more extensively validated and is therefore expected to be more reliable.
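
One simple way to summarise the PAE between two domains is to average it over all residue pairs that span them. The sketch below does this for a PAE matrix stored as a list of lists. The function name, the domain boundaries and the toy matrix are hypothetical, and no particular cut-off for "low" or "high" PAE is implied.

```python
def mean_pae_between(pae, domain_a, domain_b):
    """Mean PAE over residue pairs from two domains, in both directions.

    pae: square matrix as a list of lists.
    domain_a, domain_b: (start, end) residue ranges, 1-based and inclusive.
    """
    (a_start, a_end), (b_start, b_end) = domain_a, domain_b
    values = []
    for i in range(a_start - 1, a_end):
        for j in range(b_start - 1, b_end):
            # PAE is not symmetric, so include both (i, j) and (j, i).
            values.append(pae[i][j])
            values.append(pae[j][i])
    return sum(values) / len(values)


# Toy 4-residue example: residues 1-2 form one domain, residues 3-4 another.
toy_pae = [
    [0, 1, 20, 22],
    [1, 0, 21, 23],
    [19, 20, 0, 2],
    [22, 21, 2, 0],
]
inter_domain = mean_pae_between(toy_pae, (1, 2), (3, 4))
intra_domain = mean_pae_between(toy_pae, (1, 2), (1, 2))
print(inter_domain, intra_domain)
```

In the toy matrix the inter-domain average is much higher than the intra-domain one, which is the pattern that indicates uncertain relative domain placement.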

 


How can I download and use the Predicted Aligned Error (PAE) file?

 

The PAE is displayed as an image for each of the structure predictions. If you need the raw data with PAE for all residue pairs, you can download the PAE as a JSON file using the button at the top of the structure page.

This file is in a custom format and it isn't supported by any existing software – you will have to use Python or another programming language to analyse or plot the information that is contained in it.

For a protein of length num_res, the JSON file has the following structure:
[
    {
        "predicted_aligned_error": [[0, 1, 4, 7, 9, ...], ...], # Shape: (num_res, num_res).
        "max_predicted_aligned_error": 31.75
    }
]


The fields in the JSON file are:

  • predicted_aligned_error: The PAE value of the residue pair, rounded to the closest integer. For the PAE value at position (i, j), i is the residue on which the structure is aligned, j is the residue on which the error is predicted.
  • max_predicted_aligned_error: A number that denotes the maximum possible value of PAE. The smallest possible value of PAE is 0.

We updated the PAE JSON file format on 28th July 2022 to reduce file size by 4x. Please ensure you read the 2D matrix of PAE values from the predicted_aligned_error field instead of the removed 1D "distance" field and avoid using the old "residue1" and "residue2" fields.

If you are using a script or third party tool to read the PAE JSON file programmatically and you are seeing errors (e.g. missing field "distance"), check with the author of the program whether the latest PAE JSON format is supported.
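
A minimal sketch of reading the current format with Python's standard json module is shown below. The function name and the tiny two-residue example are hypothetical; the field names are the ones described above.

```python
import json


def load_pae(json_text):
    """Return (PAE matrix, maximum possible PAE) from the PAE JSON text."""
    data = json.loads(json_text)
    entry = data[0]  # the file is a list containing a single object
    if "predicted_aligned_error" not in entry:
        raise ValueError("predicted_aligned_error missing: old PAE JSON format?")
    matrix = entry["predicted_aligned_error"]
    num_res = len(matrix)
    if any(len(row) != num_res for row in matrix):
        raise ValueError("PAE matrix is not square")
    return matrix, entry["max_predicted_aligned_error"]


# Hypothetical two-residue example in the current format.
example_json = (
    '[{"predicted_aligned_error": [[0, 3], [4, 0]], '
    '"max_predicted_aligned_error": 31.75}]'
)
pae, max_pae = load_pae(example_json)

# pae[i][j]: structure aligned on residue i + 1, error predicted at residue j + 1.
print(pae[0][1], max_pae)
```

For a downloaded file, read its contents first (for example with open(path).read()) and pass the text to load_pae.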

 


How can I bulk download the data?

 

Predictions for individual proteomes and for Swiss-Prot are available via our downloads page and via the FTP site:
https://ftp.ebi.ac.uk/pub/databases/alphafold.

The full dataset containing all predictions is available at no cost and under a CC-BY-4.0 licence from Google Cloud Public Datasets. The size is ~23TiB, and we expect that most users will be better served by downloading only a subset of the files relevant to their use case. Please refer to our readme for more details on working with the full dataset.

 


How can I search for similar structures in the AlphaFold Database?

 

Foldseek has been integrated into the AlphaFold Database, enabling easy access to similar structures across both experimentally determined structures from the PDB (Protein Data Bank) and predicted structures from the clustered AlphaFold Database (AFDB50).

Search results are organised into two tabs:

  • PDB structures: These are structures determined using experimental methods. Please note that NMR (Nuclear Magnetic Resonance) structures do not have a resolution value associated with them.
  • AFDB50 structures: These are predicted structures available from the AlphaFold Database, clustered at 50% sequence identity to reduce redundancy.

Each row includes: a pairwise sequence alignment, residue range indicating the specific range of amino acids in the alignment, E-value to show the statistical significance of the alignment, sequence identity, resolution (for PDB structures), and average pLDDT (for AlphaFold2 structures).

You can select which structures to superimpose in the 3D viewer. The structural alignment, along with RMSD (Root Mean Square Deviation), is reported based on the superposition of Cα atoms. Once the structures are aligned, you can download the aligned coordinates in mmCIF format.

 


What is Foldseek?

 

Foldseek is a tool for fast and sensitive protein structure search. It compares protein structures by representing them as sequences over a 3Di alphabet that describes the local tertiary interactions between residues in the structure. This allows Foldseek to efficiently search vast protein structure databases to find similar structures.

For technical details on Foldseek, please refer to the original publication: Fast and accurate protein structure search with Foldseek. Nature (2023).

 


How do cluster members work?

 

Collaborating with the Steinegger lab, we grouped structurally similar proteins into clusters using Foldseek Cluster. Our clustering approach was a two-phase process:

The first phase, AFDB50/MMseqs2, involved clustering the UniProtKB protein sequences from the AlphaFold Database using MMseqs2, with a maximum of 50% sequence identity and a 90% sequence overlap. This reduced dataset was called AFDB50. In the second phase, AFDB/Foldseek, proteins were grouped based on structural similarity. Representatives of each cluster from the first phase were chosen according to the highest pLDDT score. Structural clustering was performed with Foldseek Cluster, using a minimum structural alignment overlap of 90% and an E-value cutoff of 0.01.

For more details about the clustering process, please refer to the following paper: Clustering predicted structures at the scale of the known protein universe. Nature (2023).

 


How should I cite this resource?

 

EMBL-EBI expects attribution (e.g. in publications, services or products) for any of its online services, databases or software in accordance with good scientific practice.

If you use an AlphaFold DB prediction in your work, please cite the following papers:
Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).

Varadi, M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research (2021).

 


What licence applies to the predictions?

 

All of the data provided is freely available for both academic and commercial use under Creative Commons Attribution 4.0 (CC-BY 4.0) licence terms.

 


Where can I deposit AlphaFold structure predictions that are not in the database?

 

We do not currently have functionality to deposit structure predictions. If you have generated AlphaFold structure predictions that you would like to make available to the community, you can take a look at the Research Data Management kit being developed by the ELIXIR 3D-BioInfo community that describes guidelines on how to make models and the relevant metadata available according to FAIR principles. 

 


Who should I contact with enquiries?

 

For questions and feedback about the AlphaFold DB website, please contact afdbhelp@ebi.ac.uk.

For feedback on structure predictions, or for other questions about AlphaFold not directly related to the database, please contact the AlphaFold team at alphafold@deepmind.com. We may not be able to respond to every query and there may be some delay before we can get back to you. Please do not share anything confidential with Google DeepMind.

For press enquiries, please contact press@deepmind.com or comms@ebi.ac.uk.

 


How can I get in touch about my experience with the AlphaFold DB?

 

We would love to hear your feedback and understand how the database has been useful in your research. Share your stories with us at alphafold@deepmind.com.