Frequently asked questions
How does AlphaFold work?
DeepMind’s 2021 methods paper is the best reference for this. It gives an overview of the most important ideas, and there is a detailed description of all aspects of the system in the Supplementary Information. Visit our online training course to learn more about AlphaFold.
Note that the architecture of the system used at CASP14 differs significantly from the version used at CASP13, making it important to refer to the 2021 publication.
What is AlphaMissense?
AlphaMissense is an AI model that builds on Google DeepMind’s AlphaFold2 to categorise ‘missense’ mutations in different proteins as either ‘likely pathogenic’, ‘likely benign’ or ‘uncertain’, producing a score that estimates the likelihood of a variant being pathogenic. AlphaMissense leverages AlphaFold2’s capability to model protein structure, and its capacity to learn evolutionary constraints from related sequences. The implementation is closely aligned with AlphaFold2, with some architectural differences. AlphaMissense was used to classify the effects of all possible 216 million single amino acid sequence substitutions across the 19,233 canonical human proteins. Using an amino acid sequence as an input, AlphaMissense: Note that AlphaMissense does not predict the change in protein structure, or biophysical properties such as stability, upon mutation. Instead, it uses related protein sequences and protein structure as contextual information to estimate pathogenicity. For more information about AlphaMissense, please refer to the paper: Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science (2023). AlphaMissense scores for all human missense variants are available on the Google Cloud Public Dataset.
What data can I download from AlphaMissense?
We provide downloadable files in csv format to help you explore the data for a specific canonical human protein:
Column am_class categorises the mutation based on the score:
What information does AlphaFold use from the Protein Data Bank?
AlphaFold is trained on protein chains in the PDB released before 2018-04-30.
Predictions can also make use of up to 4 templates released before 2021-02-15. However, templates are not a critical input for AlphaFold to make an accurate prediction; the model can make a strong prediction based on a multiple sequence alignment alone. Additionally, AlphaFold can ignore templates if they appear unhelpful; it isn’t required to copy their structure. As of this latest release, PDB structures shown to the model are recorded in the prediction mmCIF file.
How does AlphaFold compare to other structure prediction tools?
The CASP14 assessment compared leading structure prediction methods in detail; the results are available here. AlphaFold was the top-ranked method, with a median GDT (Global Distance Test) score of 92.4 across all targets and 87.0 on the challenging free-modelling category, compared to 72.8 and 61.0 for the next best methods in these categories. Structural biologists more often express the similarity between two protein structures by first optimally superposing the structures and then calculating the root-mean-square distance (RMSD) between the Cα atoms of equivalent residues. Taking the median RMSD-Cα on the best predicted 95% of residues reduces the effect of flexible tails and crystal-packing artefacts. On this metric AlphaFold’s CASP14 predictions had a median distance of 0.96 Å to the experimental models, compared to 2.83 Å for the next-best method.
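As a rough illustration of the RMSD-Cα metric described above, the following Python sketch computes the RMSD over the best-fitting 95% of residues. It assumes the two sets of Cα coordinates have already been optimally superposed (the superposition step itself is not shown), and the function name rmsd_best_fraction is hypothetical rather than part of any AlphaFold or CASP tooling.

```python
import math


def rmsd_best_fraction(pred, ref, fraction=0.95):
    """RMSD over the best-fitting `fraction` of residues.

    `pred` and `ref` are equal-length lists of (x, y, z) C-alpha
    coordinates, ASSUMED to be already optimally superposed.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Squared distance between equivalent C-alpha atoms, smallest first.
    squared = sorted(
        sum((a - b) ** 2 for a, b in zip(p, r)) for p, r in zip(pred, ref)
    )
    # Keep the best-fitting residues; the small epsilon guards against
    # floating-point round-off in fraction * n.
    keep = max(1, int(fraction * len(squared) + 1e-9))
    return math.sqrt(sum(squared[:keep]) / keep)
```

Dropping the worst 5% of residues means that a single badly placed tail residue no longer dominates the value. A per-target value like this could then be summarised across targets with statistics.median.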
Which proteins are included?
AlphaFold DB has grown in several stages: The wider UniProt predictions are the output of a single model, while Swiss-Prot / proteomes entries represent the most confident prediction out of 5 model runs. Internal benchmarking on CASP14 shows that the model used for UniProt (“model_2_ptm”) is only slightly less accurate (about 1 GDT lower than selecting from five models), and that there is a slight bias toward lower confidence (about 1 pLDDT lower) due to the effect of using one model rather than selecting from 5. Not all sequences are covered; the most common reasons for a missing sequence are: We plan to continue updating the database. In the meantime, if your sequence(s) aren’t included, you can generate your own AlphaFold predictions using Google DeepMind’s Colab notebook and open source code, which also support multimer predictions.
How many proteins are there in the database?
There are 214,683,839 structures available on the AlphaFold DB website, including 48 complete proteomes available for bulk download. An additional 3,095 structures are included in the human proteome download, covering sequences longer than our usual length limit split into fragments.
What use cases does AlphaFold not support?
AlphaFold DB currently focuses on the use case validated in CASP14: predicting the structure of a single protein chain with a naturally occurring sequence. Many other use cases remain active areas of research, for example:
How do I search the database?
The search bar at the top of the page accepts queries based on protein name (e.g. Free fatty acid receptor 2), gene name (e.g. At1g58602), UniProt accession (e.g. Q5VSL9), or organism name (e.g. E. coli).
How does sequence-based search work?
AlphaFold Database sequence-based similarity search is implemented using the Protein Basic Local Alignment Search Tool (BLASTP, see further information here: https://github.com/ncbi/docker/blob/master/blast/README.md). This tool compares a protein sequence query to the sequences of the predictions in the AlphaFold Database and returns a list of AlphaFold predictions with similar sequences to the one the user provided. The search process can take up to 10 minutes to complete. To revisit, review or share your results, you can copy or bookmark the URL of the results page. Note that the query must be at least 20 amino acids long and only standard residues are accepted. We display a pairwise sequence alignment, where the top row is the input sequence, the middle row shows the amino acid positions that match between the input sequence and the target sequence from the database, and the bottom row is the target sequence.
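Because a search can take several minutes, it may be worth checking a query locally before submitting it. The Python sketch below applies the two constraints stated above. It assumes that "standard residues" means the 20 canonical one-letter amino acid codes; the server's actual validation may differ, and the function name check_query is hypothetical.

```python
# ASSUMPTION: "standard residues" = the 20 canonical one-letter codes.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")
MIN_QUERY_LENGTH = 20  # minimum query length stated in the FAQ


def check_query(sequence):
    """Return a list of problems with a query sequence (empty if none)."""
    # Remove whitespace and line breaks, and normalise to upper case.
    seq = "".join(sequence.split()).upper()
    problems = []
    if len(seq) < MIN_QUERY_LENGTH:
        problems.append(
            f"too short: {len(seq)} residues, need at least {MIN_QUERY_LENGTH}"
        )
    unexpected = sorted(set(seq) - STANDARD_RESIDUES)
    if unexpected:
        problems.append("non-standard residues: " + ", ".join(unexpected))
    return problems
```

An empty list means the query satisfies both constraints under these assumptions.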
What if I can’t find the protein I’m interested in?
If you can’t find the structure you’re looking for, here are some suggestions to improve your search results:
The AlphaFold source code and Colab notebook can be used to predict the structures of proteins not in AlphaFold DB. Both resources have been updated to support predicting multimer structures.
If you experience any issues with search, please contact afdbhelp@ebi.ac.uk.
What is included on a structure page?
Structure pages show basic information about the protein (drawn from UniProt), and three separate outputs from AlphaFold.
The first is the 3D coordinates (including side chains if you click on the sequence in the viewer).
The second output is a per-residue confidence metric called pLDDT, which is used to colour the residues of the prediction. Note that model confidence can vary greatly along a chain so it is important to consult the confidence estimates when interpreting structural features. The lower confidence bands may be associated with disorder (see our publication).
The third output is Predicted Aligned Error, which is necessary to assess confidence in the domain packing and large-scale topology of the protein. See the FAQ below on how to interpret relative domain positions.
How can I download a structure prediction?
Coordinate files can be downloaded from the menu in the top right of the structure page in mmCIF or PDB format. These formats are widely accepted by 3D structure viewing software, such as PyMOL and Chimera.
How confident should I be in a prediction?
AlphaFold produces a per-residue estimate of its confidence on a scale from 0 - 100. This confidence measure is called pLDDT and corresponds to the model’s predicted score on the lDDT-Cα metric. It is stored in the B-factor fields of the mmCIF and PDB files available for download (although unlike a B-factor, higher pLDDT is better). pLDDT is also used to colour-code the residues of the model in the 3D structure viewer. The following rules of thumb provide guidance on the expected reliability of a given region: Note that the PDB and mmCIF files contain coordinates for all regions, regardless of their pLDDT score. It is up to the user to interpret the model judiciously, in accordance with the guidance above.
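Since pLDDT is stored in the B-factor field, it can be recovered from a downloaded PDB file with a few lines of Python. The sketch below relies only on the fixed-column layout of PDB ATOM records (atom name in columns 13-16, residue number in columns 23-26, B-factor in columns 61-66) and assumes a single-chain file; the function name plddt_per_residue is hypothetical.

```python
def plddt_per_residue(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor column.

    Only C-alpha atoms are inspected, giving one value per residue.
    ASSUMPTION: the file holds a single chain, so residue numbers are unique.
    """
    plddt = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB layout: atom name is columns 13-16 (0-based 12:16).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])  # columns 23-26
            plddt[residue_number] = float(line[60:66])  # columns 61-66
    return plddt
```

Typical use would be plddt_per_residue(open(path).read()), where path points to a downloaded PDB file; the resulting values can then be compared against the rules of thumb for each region.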
How should I interpret the relative positions of domains?
Independent of the 3D structure, AlphaFold produces an output called “Predicted Aligned Error” (PAE). This is shown at the bottom of structure pages as an interactive 2D plot. AlphaFold produces a useful inter-domain prediction in some cases. However, in CASP14 intra-domain prediction accuracy was more extensively validated and is therefore expected to be more reliable.
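One way to put a number on inter-domain confidence is to average the PAE over the block of the matrix that pairs residues from two domains. The Python sketch below does this for a PAE matrix given as a list of lists. The domain boundaries are inputs the user must supply, and the function name mean_pae_between is hypothetical.

```python
def mean_pae_between(pae, rows, cols):
    """Mean PAE over the block pae[i][j] for i in rows and j in cols.

    `pae` is a square matrix (list of lists); `rows` and `cols` are
    iterables of 0-based residue indices, for example two domains.
    """
    values = [pae[i][j] for i in rows for j in cols]
    if not values:
        raise ValueError("both residue ranges must be non-empty")
    return sum(values) / len(values)


# Toy 4-residue example with two hypothetical 2-residue domains.
toy_pae = [
    [0, 1, 20, 22],
    [1, 0, 21, 23],
    [19, 20, 0, 2],
    [22, 21, 2, 0],
]
domain_a, domain_b = range(0, 2), range(2, 4)
within_a = mean_pae_between(toy_pae, domain_a, domain_a)
a_vs_b = mean_pae_between(toy_pae, domain_a, domain_b)
b_vs_a = mean_pae_between(toy_pae, domain_b, domain_a)
```

The PAE matrix is not necessarily symmetric, so it is worth looking at both off-diagonal blocks. Low values within each domain combined with high values between them suggest that the relative placement of the two domains should be treated with caution.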
How can I download and use the Predicted Aligned Error (PAE) file?
The PAE is displayed as an image for each of the structure predictions. If you need the raw data with PAE for all residue pairs, you can download the PAE as a JSON file using the button at the top of the structure page. This file is in a custom format that isn't supported by any existing software; you will have to use Python or another programming language to analyse or plot the information it contains. The fields in the JSON file are:
[
{
"predicted_aligned_error": [[0, 1, 4, 7, 9, ...], ...], # Shape: (num_res, num_res).
"max_predicted_aligned_error": 31.75
}
]
We updated the PAE JSON file format on 28th July 2022 to reduce file size by 4x. Please ensure you read the 2D matrix of PAE values from the predicted_aligned_error field instead of the removed 1D "distance" field, and avoid using the old "residue1" and "residue2" fields. If you are using a script or third-party tool to read the PAE JSON file programmatically and you are seeing errors (e.g. missing field "distance"), check with the author of the program whether the latest PAE JSON format is supported.
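The format described above can be read with Python's standard json module. The following is a minimal sketch rather than an official reader: the function name load_pae is hypothetical, and the error message for files in the older format is our own wording.

```python
import json


def load_pae(json_text):
    """Return (pae_matrix, max_pae) from the text of a PAE JSON file."""
    data = json.loads(json_text)
    entry = data[0]  # the file is a list holding a single object
    if "predicted_aligned_error" not in entry:
        raise KeyError(
            "no 'predicted_aligned_error' field; "
            "the file may use the format from before 28th July 2022"
        )
    pae = entry["predicted_aligned_error"]
    # Sanity check: the matrix should have shape (num_res, num_res).
    num_res = len(pae)
    if any(len(row) != num_res for row in pae):
        raise ValueError("PAE matrix is not square")
    return pae, entry["max_predicted_aligned_error"]
```

Typical use would be pae, max_pae = load_pae(open(path).read()), where path points to a downloaded PAE file. The returned matrix is a plain list of lists, which most plotting libraries accept directly as a 2D image.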
How can I bulk download the data?
Predictions for individual proteomes and for Swiss-Prot are available via our downloads page and via the FTP site: https://ftp.ebi.ac.uk/pub/databases/alphafold
The full dataset containing all predictions is available at no cost and under a CC-BY-4.0 licence from Google Cloud Public Datasets. The size is ~23 TiB, and we expect that most users will be better served by downloading only a subset of the files relevant to their use case. Please refer to our readme for more details on working with the full dataset.
How can I search for similar structures in the AlphaFold Database?
Foldseek has been integrated into the AlphaFold Database, enabling easy access to similar structures across both experimentally determined structures from the PDB (Protein Data Bank) and predicted structures from the clustered AlphaFold Database (AFDB50). Search results are organised into two tabs: Each row includes: a pairwise sequence alignment, residue range indicating the specific range of amino acids in the alignment, E-value to show the statistical significance of the alignment, sequence identity, resolution (for PDB structures), and average pLDDT (for AlphaFold2 structures). You can select which structures to superimpose in the 3D viewer. The structural alignment, along with the RMSD (root-mean-square deviation), is reported based on the superposition of Cα atoms. Once the structures are aligned, you can download the aligned coordinates in mmCIF format.
What is Foldseek?
Foldseek is a tool for fast and sensitive protein structure search. It compares protein structures by representing them as sequences over a 3Di alphabet that describes the local tertiary interactions between residues in the structure. This allows Foldseek to efficiently search vast protein structure databases to find similar structures.
For technical details on Foldseek, please refer to the original publication: Fast and accurate protein structure search with Foldseek. Nature (2023).
How do cluster members work?
In collaboration with the Steinegger lab, we grouped structurally similar proteins into clusters using Foldseek Cluster. Our clustering approach was a two-phase process. The first phase, AFDB50/MMseqs2, involved clustering the UniProtKB protein sequences from the AlphaFold Database using MMseqs2, with a maximum of 50% sequence identity and a 90% sequence overlap. This reduced dataset was called AFDB50. In the second phase, AFDB/Foldseek, proteins were grouped based on structural similarity. Representatives of each cluster from the first phase were chosen according to the highest pLDDT score. Structural clustering was performed with Foldseek Cluster, using a minimum structural alignment overlap of 90% and an E-value cutoff of 0.01. For more details about the clustering process, please refer to the following paper: Clustering predicted structures at the scale of the known protein universe. Nature (2023).
How should I cite this resource?
EMBL-EBI expects attribution (e.g. in publications, services or products) for any of its online services, databases or software in accordance with good scientific practice.
If you use an AlphaFold DB prediction in your work, please cite the following papers:
Jumper, J et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021).
What licence applies to the predictions?
All of the data provided is freely available for both academic and commercial use under Creative Commons Attribution 4.0 (CC-BY 4.0) licence terms.
Where can I deposit AlphaFold structure predictions that are not in the database?
We do not currently have functionality to deposit structure predictions. If you have generated AlphaFold structure predictions that you would like to make available to the community, you can take a look at the Research Data Management kit being developed by the ELIXIR 3D-BioInfo community that describes guidelines on how to make models and the relevant metadata available according to FAIR principles.
Who should I contact with enquiries?
For questions and feedback about the AlphaFold DB website, please contact afdbhelp@ebi.ac.uk. For sharing feedback on structure predictions or for other questions about AlphaFold not directly related to the database, please contact the AlphaFold team at alphafold@deepmind.com. We may not be able to respond to every query and there may be some delay before we can get back to you. Please do not share anything confidential with Google DeepMind. For press enquiries, please contact press@deepmind.com or comms@ebi.ac.uk.
How can I get in touch about my experience with the AlphaFold DB?
We would love to hear your feedback and understand how the database has been useful in your research. Share your stories with us at alphafold@deepmind.com.