DeepMind's AlphaGenome Decodes the 98% of DNA Nobody Understood

Only 2% of human DNA codes for proteins. The other 98% — long dismissed as “junk DNA” — contains the regulatory machinery that controls when, where, and how much each gene gets turned on. Mutations in these non-coding regions cause disease, but figuring out which mutations matter has been nearly impossible.

Google DeepMind’s AlphaGenome takes on this problem directly. The AI model can read up to a million base pairs of DNA sequence and predict how variations affect gene regulation. In benchmarks, it matched or outperformed the best existing tools in 25 of 26 variant-effect evaluations, and it’s already being used in hackathons to hunt for the genetic causes of rare diseases that have evaded diagnosis.

What AlphaGenome Predicts

Most genetic tests focus on the protein-coding 2% of the genome because that’s where the effects of mutations are clearest. A misspelling in a gene’s instructions often produces a broken or missing protein. The consequences are visible.

Non-coding DNA is harder. These regions contain promoters, enhancers, silencers, and other regulatory elements that control gene activity. A mutation here might cause a gene to be expressed in the wrong tissue, at the wrong time, or in the wrong amount. The effects are real but subtle.

AlphaGenome predicts thousands of molecular properties from raw sequence, including:

Gene expression levels — how much RNA each gene produces in different cell types
Chromatin accessibility — whether regions of DNA are open for reading
Transcription factor binding — where proteins that control gene activity attach to DNA
RNA splicing — how the raw gene transcript gets cut and reassembled
3D chromatin structure — how DNA folds to bring distant regions into contact

Previous models could predict some of these properties individually. AlphaGenome handles them all in a unified architecture, and it does so at single-base-pair resolution across sequence windows 50 times longer than earlier tools.

Why Splicing Matters

The explicit modeling of RNA splicing is particularly useful for rare disease. When a gene’s instructions are transcribed into RNA, the transcript gets edited: certain sections (introns) are cut out, and the remaining pieces (exons) are spliced together to form the final message.

Mutations that disrupt splicing can cause disease even when they don’t directly affect the protein-coding sequence. Spinal muscular atrophy and some forms of cystic fibrosis work this way. AlphaGenome can predict both where splicing occurs and how efficiently it happens, providing direct evidence for whether a mutation might cause problems.
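
To make the cut-and-join step concrete, here is a minimal Python sketch of splicing on a toy transcript. The sequence and coordinates are made up for illustration, and the code only shows what splicing does to a transcript. It says nothing about how AlphaGenome predicts splice sites.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) in order, dropping the introns between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy transcript (made up for illustration): exon 1, one intron, exon 2.
exon1 = "AUGGCC"
intron = "GUAAGUUUCAG"  # starts with GU and ends with AG, like most human introns
exon2 = "GAAUAA"
pre_mrna = exon1 + intron + exon2

# Normal splicing: keep both exons, discard the intron.
exon_coords = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]
mature = splice(pre_mrna, exon_coords)

# Disrupted splicing: the intron is not removed and stays in the message.
retained = splice(pre_mrna, [(0, len(pre_mrna))])

print(mature)  # AUGGCCGAAUAA
print(len(retained) - len(mature), "extra bases when the intron is retained")
```

In the retained-intron case the protein-coding exons are untouched, yet the final message is 11 bases longer and no longer reads the same way. That is the kind of change a splicing mutation can cause without altering a single coding base.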

Hackathons for Rare Disease

Researchers are already using AlphaGenome in hackathons to solve rare disease cases that have stumped human geneticists. The model can take variants of uncertain significance — mutations found in patient genomes that might or might not cause disease — and predict their functional consequences.

Researchers now have a tool that can provide evidence about non-coding variants at scale. Previous approaches required extensive laboratory work to test each variant individually. AlphaGenome generates predictions computationally, allowing researchers to prioritize which variants deserve experimental follow-up.
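
The prioritization idea can be sketched in a few lines of Python. Everything here is hypothetical: the variant names, the numbers, and the scoring rule (largest absolute log2 fold change between predictions for the reference and the altered sequence) are illustrative assumptions, not AlphaGenome's API or its actual scoring method.

```python
import math


def effect_score(ref_track, alt_track, eps=1e-6):
    """Largest absolute log2 fold change between alt and ref predictions.

    ref_track and alt_track are hypothetical per-position predicted signals
    (for example, expression or accessibility) for the two sequences.
    """
    return max(
        abs(math.log2((alt + eps) / (ref + eps)))
        for ref, alt in zip(ref_track, alt_track)
    )


def prioritize(variants):
    """Sort variant records by effect score, largest predicted change first."""
    return sorted(
        variants,
        key=lambda v: effect_score(v["ref"], v["alt"]),
        reverse=True,
    )


# Made-up predictions for three variants of uncertain significance.
variants = [
    {"id": "var_A", "ref": [1.0, 2.0, 4.0], "alt": [1.0, 2.1, 4.0]},
    {"id": "var_B", "ref": [1.0, 2.0, 4.0], "alt": [1.0, 8.0, 4.0]},
    {"id": "var_C", "ref": [1.0, 2.0, 4.0], "alt": [0.5, 2.0, 4.0]},
]

ranked = [v["id"] for v in prioritize(variants)]
print(ranked)  # ['var_B', 'var_C', 'var_A']
```

The variants at the top of such a list would be the first candidates for laboratory testing. A high score is a reason to look closer, not proof that the variant causes disease.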

DeepMind demonstrated the model's capabilities on a cancer-associated mutation in T-cell acute lymphoblastic leukemia. AlphaGenome predicted that the mutation creates a binding site for the MYB transcription factor, and that MYB binding at this new site switches on the nearby TAL1 oncogene. This mechanistic explanation matched what laboratory studies had previously found through painstaking work.
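
What it means for a mutation to "create a binding site" can be shown with a toy motif scan. The sketch below assumes a simplified MYB consensus pattern (YAACNG, where Y is C or T) and uses made-up sequences rather than the real TAL1 locus. AlphaGenome itself learns such patterns from data instead of matching a fixed string.

```python
import re

# Simplified MYB consensus (assumption for illustration): YAACNG, where Y is C or T.
# The lookahead lets overlapping matches be found as well.
MYB_MOTIF = re.compile(r"(?=([CT]AAC[ACGT]G))")


def motif_hits(seq: str) -> list[int]:
    """Return 0-based start positions of motif matches on the forward strand."""
    return [m.start() for m in MYB_MOTIF.finditer(seq.upper())]


ref = "GATTCTGGAGTC"              # made-up reference sequence
alt = ref[:6] + "AACG" + ref[6:]  # made-up 4-base insertion

print(motif_hits(ref))  # []
print(motif_hits(alt))  # [5]
```

The reference sequence has no match, while the inserted bases complete the pattern at position 5. A real analysis would also scan the reverse strand and weigh how strongly each match is expected to bind.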

The 98% Problem

The broader significance lies in completing the picture of human genetics. For decades, researchers focused on the protein-coding regions because those were tractable. Non-coding variants were flagged as “uncertain” and often ignored.

That approach misses a substantial portion of genetic disease. Genome-wide association studies have found that most disease-linked variants lie in non-coding regions. Without tools to interpret them, those discoveries have been difficult to translate into an understanding of how disease arises.

AlphaGenome doesn’t solve this problem completely. The model performs best on variants with large effects — the kind most likely to cause rare Mendelian disorders. Complex traits involving many small-effect variants remain challenging. Very distant regulatory elements, more than 100,000 base pairs from their target genes, are also harder to capture accurately.

But for the families waiting years for a diagnosis, any expansion of what can be tested computationally is meaningful. The model is freely available on GitHub for academic research, and thousands of scientists are already using it.

The Fine Print

AlphaGenome predicts functional consequences, not clinical outcomes. A variant that affects gene expression might or might not cause disease depending on the gene, the tissue, the degree of change, and countless other factors. The model provides evidence, not diagnoses.

The training data came from bulk cell populations, which may obscure effects that only appear in specific cell types or developmental stages. Complex traits involving environmental factors are outside the model’s scope.

Still, for the 98% of the genome that has been mostly uninterpretable, having any systematic prediction tool represents a meaningful shift. The protein-coding 2% has been studied intensively for decades. Now the rest of the genome is finally becoming readable.