AlphaFold changed biology. But there’s a problem nobody talks about much: how do you know when to trust what it tells you?
Researchers at the University of Missouri have built a solution: a database of 1.4 million protein structure models, each independently assessed by experts, designed to train AI systems to judge the accuracy of their own predictions.
The Trust Problem
AI protein structure prediction has advanced rapidly since DeepMind’s AlphaFold won CASP14 in 2020. Tools like AlphaFold, Boltz-1, and Chai-1 can now predict how proteins fold with remarkable accuracy - sometimes.
The catch is that no single tool maintains consistent accuracy across all protein types. Some proteins are harder to predict than others, and the models don’t reliably flag their own uncertainty. A researcher using these predictions for drug development or disease research needs to know: can I trust this particular structure, or is the model guessing?
“We need AI methods to assess the quality of predicted protein models and decide if they can be trusted,” said Jianlin “Jack” Cheng, the Curators’ Distinguished Professor at Missouri who led the project.
What PSBench Contains
PSBench draws from the Critical Assessment of protein Structure Prediction (CASP), an international competition that has evaluated computational protein prediction methods for over 30 years. The database incorporates verified assessments from CASP15 and CASP16, comprising four large-scale labeled datasets.
Each of the 1.4 million protein structure models includes quality annotations from independent experts - ground-truth labels that AI systems can learn from. This lets researchers train quality assessment algorithms that can evaluate new predictions.
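The idea of learning quality assessment from expert-labeled models can be sketched in miniature. The snippet below is purely illustrative - the features (a confidence score and a clash measure), the synthetic data, and the function names are all invented for this example, not drawn from PSBench itself - but it shows the basic recipe: pair per-model features with expert quality labels, fit an estimator, then score unseen predictions.

```python
import random

random.seed(0)

# Synthetic stand-in for a labeled dataset: each entry pairs simple
# per-model features (mean predicted confidence, a clash measure)
# with an expert-assigned quality score in [0, 1]. Real quality
# assessment uses far richer features; this is a toy.
def make_example():
    confidence = random.uniform(0.3, 1.0)
    clashes = random.uniform(0.0, 0.5)
    # True quality depends on both features, plus a little noise.
    quality = 0.8 * confidence - 0.4 * clashes + random.gauss(0, 0.02)
    return (confidence, clashes), max(0.0, min(1.0, quality))

train = [make_example() for _ in range(2000)]

# Fit a linear estimator by plain gradient descent on squared error.
w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(500):
    gw = [0.0, 0.0]
    gb = 0.0
    for (x1, x2), y in train:
        err = w[0] * x1 + w[1] * x2 + b - y
        gw[0] += err * x1
        gw[1] += err * x2
        gb += err
    n = len(train)
    w[0] -= lr * gw[0] / n
    w[1] -= lr * gw[1] / n
    b -= lr * gb / n

def estimate_quality(confidence, clashes):
    """Predict a quality score for a new structure model."""
    return w[0] * confidence + w[1] * clashes + b

# A confident, clash-free model should outrank a shaky, clash-heavy one.
print(estimate_quality(0.95, 0.05) > estimate_quality(0.45, 0.40))
```

The training loop is the whole point: with enough labeled examples, the estimator learns which signals track real accuracy, which is exactly the kind of learning that a large verified dataset makes possible at scale.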
The team presented PSBench at NeurIPS 2025 in San Diego, positioning it alongside other major AI contributions at one of the field’s most competitive conferences.
Why This Matters for Medicine
Protein structure prediction feeds directly into drug development. If you’re designing a molecule to bind to a specific protein, you need to know the protein’s exact shape. Errors in predicted structures can send entire research programs in the wrong direction.
Diseases like Alzheimer’s and cancer involve complex protein interactions. Better quality assessment means researchers can have appropriate confidence in their structural predictions, or know when to seek experimental validation.
The Bigger Picture
PSBench represents a pattern we’re seeing more often in AI research: as models get more powerful, the community needs better tools to evaluate them. Benchmark saturation is a real problem. When models score 99% on existing tests, those tests stop being useful for distinguishing good predictions from great ones.
By providing 1.4 million verified examples, PSBench gives the field room to develop more sophisticated quality assessment methods. It’s the kind of infrastructure that doesn’t make headlines but enables better science downstream.
The database and associated resources are available to the research community.