Thoughts on AlphaFold @ CASP13
Dec. 21st, 2018 12:55 pm

Based on reading this post by Harvard biologist Mohammed AlQuraishi ("MaQ" below).
What was the direct scientific benefit of AlphaFold's folding results? Did they, as Kelsey says, 'break the problem open'?
The CASP contest happens every two years, and over the past several contests improvements on the GDT_TS accuracy metric have been fairly constant. Comparing this year's GDT_TS to the previous CASP's, AlphaFold delivered about 1.8x the average improvement; the second-place team was at just about the average. The metric went from ~40% to ~58%, meaning only ~2 more improvements of comparable size would take it to 100% (i.e., every residue placed accurately to within 1 angstrom). MaQ notes that those last improvements could require some revolutionary new insights, though.
Further, the GDT_HA (high accuracy) metric again shows AlphaFold's improvement at ~2x the usual size, but with much further to go: AlphaFold took it from 27% to 40%.
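(Quick arithmetic check on those claims, using the post's rounded numbers: the GDT_TS jump was 58 - 40 = 18 points, and (100 - 58) / 18 ≈ 2.3 comparable jumps remain; for GDT_HA, (100 - 40) / (40 - 27) ≈ 4.6 jumps remain, hence "a lot further to go".)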
What about the underlying methods/insights?
Here's my understanding of MaQ's "AlphaFold" section. AlphaFold used MSAs (multiple sequence alignments: sets of evolutionarily related sequences, here proteins, lined up position by position) as training data. Pairs of protein residues (amino acids) were considered coupled if their positions tended to mutate together across the alignment. A neural net mapped these couplings to a discrete probability distribution over residue-residue distances (binned into a few ranges). Smoothing and some physics knowledge ('reference potentials') turned those distributions into pairwise potentials, and they then ran gradient descent on the potentials. [I assume there were constraints to ensure physicality. Maybe pairwise constraints on nearby residues are enough to keep the protein chain realistic, if your at-a-distance potentials are reasonable? Probably some van der Waals and dipole terms too, to penalize too-close residues.] I think their training involved backpropping through the gradient-descent steps, presumably taking the (error) gradient of the potential-function elements at each step.
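To make the 'gradient descent on learned pairwise potentials' step concrete, here's a toy sketch in Python. To be clear, this is my illustration, not AlphaFold's code: I collapse each predicted distance distribution into a quadratic well around its mean, whereas AlphaFold reportedly fits smooth potentials to the full distributions and adds real physics terms; every constant and name below is made up.

```python
# Toy sketch of "fold by gradient descent on a learned pairwise potential".
# NOT AlphaFold's actual method: each predicted distance distribution is
# collapsed to a quadratic well around its mean, standing in for the
# smooth negative-log-probability potentials the real system uses.
import numpy as np

rng = np.random.default_rng(0)
N = 20                                    # residues in a toy chain
bins = np.linspace(2.0, 20.0, 10)         # distance-bin centers (angstroms)

# Stand-in for the network's output: for every residue pair, a probability
# distribution over binned distances (in reality, predicted from the MSA).
logits = rng.normal(size=(N, N, bins.size))
logits = (logits + logits.transpose(1, 0, 2)) / 2          # symmetrize
p = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Quadratic well per pair: target = mean distance, stiffness = 1/variance
# (a sharply peaked prediction acts like a stiff spring).
target = (p * bins).sum(-1)
stiff = 1.0 / ((p * bins**2).sum(-1) - target**2 + 1e-6)
np.fill_diagonal(stiff, 0.0)

# Chain connectivity: consecutive residues sit ~3.8 angstroms apart.
i = np.arange(N - 1)
target[i, i + 1] = target[i + 1, i] = 3.8
stiff[i, i + 1] = stiff[i + 1, i] = 10.0

def energy_and_grad(x):
    """E(x) = 0.5 * sum_ij stiff_ij * (|x_i - x_j| - target_ij)^2,
    summed over ordered pairs (each unordered pair counted twice)."""
    diff = x[:, None, :] - x[None, :, :]                   # (N, N, 3)
    d = np.sqrt((diff ** 2).sum(-1) + 1e-9)
    err = d - target
    E = 0.5 * (stiff * err ** 2).sum()
    # dE/dx_k = 2 * sum_j stiff_kj * err_kj * (x_k - x_j) / d_kj
    grad = 2.0 * ((stiff * err / d)[:, :, None] * diff).sum(axis=1)
    return E, grad

x = rng.normal(scale=5.0, size=(N, 3))                     # random start
for step in range(1000):
    E, g = energy_and_grad(x)
    x -= 0.005 * g                                         # gradient descent
print(f"final energy: {E:.2f}")
```

The point is just that once the potential is a differentiable function of the coordinates, 'folding' reduces to ordinary optimization; the real pipeline presumably also restarts from many initializations and scores candidates with physics-based terms.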
This seems straightforward, and MaQ says it seems that way to biologists too. The use of pairwise potentials rather than binary 'in contact / out of contact' indicators is somewhat new, but another team (Xu) was also doing this. Meanwhile, the second-place Zhang team used binary contact data but a better protein folding model. MaQ says "AlphaFold's preliminary results indicate" they wouldn't gain much from using a better folder; maybe they tried this and didn't see much benefit? If so, maybe the field would have gotten about as far by combining Xu and Zhang labs' results after the contest this year. (Note that it would have to be after the contest because protein-folding labs are highly competitive and secretive in the leadup to the contest, so the field only has global updates every two years. MaQ attributes these norms to the needs of a much less developed field, and thinks they should change now that the field is more sophisticated about training/test splits, has some public benchmark data, and has enough secular progress that they don't have to worry about charlatans faking it.) MaQ claims AlphaFold's methods are more elegant, though, which might be true.
One thing worth noting is that AlphaFold's 'energy functions' aren't constrained by real physics; they're 'the function such that, on the training set, gradient descent yielded structures with the highest GDT_TS'. And from what MaQ says, they're not universal, but specific to the set of proteins whose info is given in a single MSA. [One thing I don't get: how do they analyze a protein that falls outside all the structural families they have MSAs for? I should revisit this when the AlphaFold paper is up.] In fact I suppose there's no reason they have to act as a pairwise-additive force, though it sounds like they do work this way; I guess it's easier to calculate, plus it's suggested by the physics.
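(To spell out 'pairwise additive' in my own notation, not theirs: the learned energy decomposes as a sum over residue pairs, each term depending only on that one pair's distance,

E(x) = Σ_{i<j} U_ij(|x_i − x_j|),

so nothing in the score couples three or more residues directly; higher-order structure emerges only through the shared coordinates x.)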
What are the implications?
MaQ says it's unclear. If high accuracy is needed for useful folding predictions, many new insights might be necessary. And unlike in other ML fields, dataset size isn't growing much over time, so scale can't be relied on to feed algorithmic improvement. But there's a new player in town, and it's possible protein folding will eventually get eaten by ML.
He does see AlphaFold's success here as an indictment of protein folding academic culture, and especially of pharma lab culture, which (he says) talks a big game about fundamental research but is almost entirely focused on incremental practical work.
He highlights the further subproblems that still need to be figured out. AlphaFold-type methods are fairly specific to protein 'families' and struggle with novel structures (mutations, engineered proteins). Even within families, sub-angstrom precision matters for designing small targeted drugs, and such precision is still far beyond us. And there's the 'holy grail' of actual folding dynamics, which would be really useful for predicting protein function. So there's a long way to go, and protein-folding researchers should think about which problems they can best approach from a high-insight, lower-compute angle.
--
This was a neat blog post. It's nice to read things by people honest enough to admit things like "we were all really worried at first that DeepMind had made some amazing insight and we'd all be out of a job, then read their preprint, breathed a sigh of relief, and started sneering about how incremental their advances were". It's a little inaccessible in places--MaQ seems to be writing for people somewhat outside the field, yet doesn't explain what an MSA is--so I've probably missed some things, but quick lookups gave me enough context to feel like I get it. It's also nice to hear about molecular modeling-y things from time to time; they're like missives from a life I'm glad I don't live, but enjoy visiting.