LW - I'm a bit skeptical of AlphaFold 3 by Oleg Trott

The Nonlinear Library: LessWrong

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I'm a bit skeptical of AlphaFold 3, published by Oleg Trott on June 25, 2024 on LessWrong.

(also on https://olegtrott.substack.com)

So this happened: DeepMind (with 48 authors, including a new member of the British nobility) decided to compete with me. Or rather, with some of my work from 10+ years ago.

Apparently, AlphaFold 3 can now predict how a given drug-like molecule will bind to its target protein. And it does so better than AutoDock Vina (the most cited molecular docking program, which I built at Scripps Research). On top of this, it doesn't even need a 3D structure of the target. It predicts it too!

But I'm a bit skeptical. I'll try to explain why.

Consider a hypothetical scientific dataset where all data is duplicated. Perhaps the scientists had trust issues and tried to check each other's work. Suppose you split this data randomly into training and test subsets at a ratio of 3-to-1, as is often done. Now, if all your "learning" algorithm does is memorize the training data, it will be very easy for it to do well on 75% of the test data, because 75% of the test data will have copies in the training data.

Scientists mistrusting each other are only one source of data redundancy, by the way. Different proteins can also be related to each other. Even when the sequence similarity between two proteins is low, because of evolutionary pressures, this similarity tends to be concentrated where it matters, which is the binding site.

Lastly, scientists typically don't just take random proteins and random drug-like molecules, and try to determine their combined structures. Oftentimes, they take baby steps, choosing to study drug-like molecules similar to the ones already discovered for the same or related targets.
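
The 75% figure in the duplicated-dataset thought experiment above can be checked with a small simulation. This is only a sketch: the function name, the number of record pairs, and the seed are illustrative assumptions, not anything from the AlphaFold 3 paper or from the author.

```python
import random

def twin_in_train_fraction(n_pairs=10_000, test_fraction=0.25, seed=0):
    """Fraction of test records whose duplicate landed in the training set."""
    rng = random.Random(seed)
    # Hypothetical dataset: every measurement appears exactly twice,
    # stored as (measurement_id, copy_index).
    records = [(i, c) for i in range(n_pairs) for c in (0, 1)]
    rng.shuffle(records)

    # Random 3-to-1 split: the first quarter is "test", the rest is "train".
    n_test = int(len(records) * test_fraction)
    test, train = records[:n_test], records[n_test:]

    # A test record is "leaked" if its twin is in the training set.
    train_ids = {i for i, _ in train}
    leaked = sum(1 for i, _ in test if i in train_ids)
    return leaked / len(test)

print(round(twin_in_train_fraction(), 3))
```

The printed fraction should come out close to 0.75, so a pure memorizer would look right on roughly three quarters of this test set without having learned anything.
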
So there can be lots of redundancy and near-redundancy in the public 3D data of drug-like molecules and proteins bound together.

Long ago, when I was a PhD student at Columbia, I trained a neural network to predict protein flexibility. The dataset I had was tiny, but it had interrelated proteins already. With a larger dataset, due to the Birthday Paradox, the interrelatedness would have probably been a much bigger concern.

Back then, I decided that using a random train-test split would have been wrong. So I made sure that related proteins were never in both "train" and "test" subsets at the same time. With my model, I was essentially saying "Give me a protein, and (even) if it's unrelated to the ones in my training data, I can predict …"

The authors don't seem to do that. Their analysis reports that most of the proteins in the test dataset had kin in the training dataset with sequence identity in the 95-100 range. Some had sequence identity below 30, but I wonder if this should really be called "low".

This makes it hard to interpret. Maybe the results tell us something about the model's ability to learn how molecules interact. Or maybe they tell us something about the redundancy of 3D data that people tend to deposit? Or some combination?

Docking software is used to scan millions and billions of drug-like molecules looking for new potential binders. So it needs to be able to generalize, rather than just memorize.

But the following bit makes me really uneasy. The authors say:

"The second class of stereochemical violations is a tendency of the model to occasionally produce overlapping (clashing) atoms in the predictions. This sometimes manifests as extreme violations in homomers in which entire chains have been observed to overlap (Fig. 5e)."

If AlphaFold 3 is actually learning any non-obvious insights from data, about how molecules interact, why is it missing possibly the most obvious one of them all, which is that interpenetrating atoms are bad?
On the other hand, if most of what it does is memorize and regurgitate data (when it can), this would explain such fail...
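
The alternative to a random split mentioned earlier, keeping related proteins out of both "train" and "test" at the same time, can be sketched as a group-aware split. This is not the author's actual procedure, which the text does not describe. The protein names, the family labels, and the `grouped_split` helper are all hypothetical, and in practice the groups might come from something like sequence clustering.

```python
import random
from collections import defaultdict

def grouped_split(items, group_of, test_fraction=0.25, seed=0):
    """Split items so that no group appears in both train and test."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Shuffle whole groups, not individual items.
    group_ids = sorted(groups)
    rng.shuffle(group_ids)

    target = test_fraction * len(items)
    train, test = [], []
    for g in group_ids:
        # Whole groups go to test until it reaches the target size,
        # then every remaining group goes to train.
        (test if len(test) < target else train).extend(groups[g])
    return train, test

# Hypothetical toy data: (protein_name, family_label).
proteins = [("p1", "kinase"), ("p2", "kinase"),
            ("p3", "protease"), ("p4", "protease"),
            ("p5", "gpcr"), ("p6", "gpcr"),
            ("p7", "nuclear_receptor"), ("p8", "nuclear_receptor")]

train, test = grouped_split(proteins, group_of=lambda p: p[1])
train_families = {fam for _, fam in train}
test_families = {fam for _, fam in test}
print(train_families & test_families)  # prints set(): no family is shared
```

With a split like this, a good test score can no longer come from a test protein having a close relative in the training data, at least at the level of whatever grouping is used.
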
