AlphaFold Database Adds 1.7 Million Protein Complexes

The AlphaFold Database now includes predictions of how proteins interact with each other — not just their individual shapes. A collaboration between Google DeepMind, EMBL’s European Bioinformatics Institute, NVIDIA, and Seoul National University has added 1.7 million high-confidence “homodimer” predictions, showing how pairs of identical protein molecules intertwine to form functional structures.

Another 18 million lower-confidence homodimers are available for bulk download. In total, the partnership calculated around 30 million protein complex structures.

Why This Matters

The original AlphaFold database contained 200 million individual protein structures — a resource that transformed how biologists understand molecular shapes. But most proteins don’t work alone. They function as assemblies, binding together to carry out cellular tasks. A single-protein prediction tells you the shape of a component but not how the machine fits together.

Understanding these interactions is fundamental to drug discovery. Many diseases involve proteins that bind when they should not, or that fail to bind when they should. Without knowing how proteins pair up, researchers work partially blind.

“By making predicted protein complexes accessible at an unprecedented scale, we are illuminating an unseen landscape of molecular interactions,” said Martin Steinegger from Seoul National University.

What’s Included

The release focuses on proteins from 20 of the most-studied species, among them humans, mice, yeast, and bacteria that cause disease in humans, such as Mycobacterium tuberculosis and organisms on the WHO priority pathogens list.

These aren’t random predictions. The consortium prioritized proteins relevant to human health and disease — the ones most likely to be useful for researchers developing treatments.

Homodimers are the simplest type of complex: two identical proteins binding together. Future releases will likely include heterodimers (two different proteins) and larger multi-protein assemblies, though those predictions are computationally harder and less reliable.

The Compute Problem

Generating these structures required significant computing resources. The partnership estimates that doing this work independently would have taken researchers roughly 17 million GPU hours.

By running the calculations centrally and releasing the results publicly, the consortium eliminates that barrier. Any researcher with internet access can now investigate protein interactions across organisms without needing their own computing infrastructure.

“This release is a great example of how AI infrastructure and software can uniquely enable new scales of biological understanding,” said Anthony Costa, NVIDIA’s Director of Digital Biology.

The Open-Science Approach

All 1.7 million high-confidence structures are freely available through the AlphaFold Database with no restrictions on use. The data can be downloaded, modified, and integrated into other research projects.

“By making this foundational protein complex dataset openly available to the world, we’re inviting researchers to test, refine, and build on it,” said Jo McEntyre, EMBL-EBI’s interim director.

This continues a pattern. DeepMind made AlphaFold’s code open source in 2021 and released the full 200-million-structure database in 2022. Making predictions available without commercial restrictions — rather than monetizing access — has accelerated adoption across academic and pharmaceutical research.

The Fine Print

These are predictions, not experimentally verified structures. AlphaFold’s predictions are highly accurate for individual proteins, but predicting how proteins fit together in a complex is harder. The 1.7 million high-confidence structures passed reliability thresholds; the 18 million lower-confidence predictions did not and should be used with more caution.

Researchers should treat these as hypotheses to guide experiments rather than definitive answers. The database provides a starting point — suggesting which protein interactions are worth investigating — but laboratory validation remains necessary for any application involving human health.
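In practice, treating predictions as hypotheses often means ranking them by confidence before deciding what to test in the lab. The sketch below is a rough illustration of that triage step, not the consortium's method. The record fields, the score scale, and the 0.8 cutoff are all hypothetical; the article does not state which metrics or thresholds the database actually uses.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; the article does not state
# the thresholds the consortium actually applied.
HIGH_CONFIDENCE_CUTOFF = 0.8


@dataclass
class PredictedHomodimer:
    protein_id: str    # hypothetical identifier
    confidence: float  # hypothetical confidence score in [0, 1]


def triage(predictions, cutoff=HIGH_CONFIDENCE_CUTOFF):
    """Split predictions into high- and lower-confidence groups, best first."""
    ranked = sorted(predictions, key=lambda p: p.confidence, reverse=True)
    high = [p for p in ranked if p.confidence >= cutoff]
    lower = [p for p in ranked if p.confidence < cutoff]
    return high, lower


# Made-up example records, not real database entries.
examples = [
    PredictedHomodimer("protein_A", 0.92),
    PredictedHomodimer("protein_B", 0.41),
    PredictedHomodimer("protein_C", 0.85),
]
high, lower = triage(examples)
print([p.protein_id for p in high])   # ['protein_A', 'protein_C']
print([p.protein_id for p in lower])  # ['protein_B']
```

Either group still needs experimental confirmation; the ranking only suggests where to look first.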

The release also covers only homodimers. Many biologically important complexes involve different proteins binding together, or larger assemblies of multiple components. Those predictions will require future work.

Still, for researchers studying protein interactions in model organisms and human disease, 1.7 million high-quality predictions represent a substantial resource that didn’t exist last week.