A Tale of Talents

The Problem: The Complexity of Combinatorics and Charge Interactions

Imagine a thread strung with pearls of 20 different types, each able to take any of 9 rotational positions around the thread. Each pearl has distinct properties: some are positively charged, some are negatively charged, others are oily, and some are water-loving. The pearls are threaded together into a long chain of roughly 300 to 500. The challenge is to predict how this string of pearls will fold into a specific three-dimensional structure when immersed in water. Oppositely charged pearls attract each other and oily pearls cluster together, and these forces twist, turn, bend, and fold the chain into its particular three-dimensional structure. Proteins are polymers of amino acids; in our analogy, the pearls are the amino acids of a protein chain. The analogy gives a sense of the immense complexity of protein folding, which baffled scientists for decades.

Proteins are the molecular machines of life. They are nanorobots performing various functions in the body, from accelerating reactions and converting signals to providing structural support to cells. How is this function determined? By the specific shape of the protein. Yet, predicting this 3D shape from its amino acid sequence (the threaded pearls)—referred to as the “protein folding problem”—has been one of biology’s most challenging open questions. To solve this problem, several approaches have emerged over the decades, culminating in breakthroughs like the AI tool AlphaFold.

More than a matter of curiosity, determining the correct structure has immense practical applications, such as drug discovery and the design of new biological materials. The number of Nobel Prizes awarded for work on protein structure underscores the importance of the field. Max Perutz and John Kendrew received the 1962 Nobel Prize for determining the crystal structures of hemoglobin and myoglobin, and in 1972 Christian Anfinsen was honored for his work connecting a protein’s amino acid sequence to its folded form. The 2009 prize went to Venkatraman Ramakrishnan, Thomas Steitz, and Ada Yonath for their work on the structure of the ribosome; the 2012 prize to Robert Lefkowitz and Brian Kobilka for their studies of G-protein-coupled receptors (GPCRs), including Kobilka’s structural work; and the 2024 prize to David Baker for computational protein design and to Demis Hassabis and John Jumper of Google DeepMind for protein structure prediction.

Approaches to Determine the 3D Structure

The classic approach to solving protein structures is a technique called X-ray crystallography. The protein is first crystallized, and the crystal is then exposed to a beam of high-energy X-rays. Electrons in the protein diffract the X-rays, and the diffracted beams are recorded on a detector. From these diffraction patterns, researchers can calculate the positions of the atoms within the protein. The method is difficult, however, because crystallization is highly unpredictable. As any crystallographer will tell you, a crystal sometimes appears in the time it takes to finish a coffee break, and sometimes obtaining high-quality crystals is an endeavor that takes years. Over the decades, other powerful techniques such as nuclear magnetic resonance (NMR) spectroscopy and cryo-electron microscopy have also been used to solve protein structures. Even though these methods do not require crystallizing the protein, X-ray crystallography remains the gold standard.

Anfinsen’s Dogma and the Levinthal Paradox

In the 1960s, Christian Anfinsen established a groundbreaking idea known as Anfinsen’s Dogma: the three-dimensional structure of a protein is determined entirely by its amino acid sequence. He showed that when a protein is unfolded back into a linear chain (its primary structure), the chain can refold into the original three-dimensional structure in the test tube. The insight was that the information for folding into the correct structure is encoded in the sequence itself. Revolutionary as it was, this idea raised a critical question, now known as the Levinthal Paradox: given the vast number of possible conformations (each a distinct set of atomic coordinates), how can proteins fold into their native structures so quickly? If folding required exploring every possible conformation, it would take an impossibly long time.

Cyrus Levinthal of MIT calculated that the number of possible conformations for a protein of just 100 amino acids would be astronomically high, on the order of 10^95. Yet many proteins fold in a fraction of a second, which suggests that nature has found a far more efficient route that bypasses the need to explore every possibility. The numbers become unimaginably larger for titin, the largest protein in our body, which has ~34,000 amino acids. To put this into perspective, even today’s fastest computers, which perform about 10^18 calculations per second, would need far longer than the age of the universe to enumerate all possible conformations. In reality, many of these conformations are physically impossible because bulky atoms would have to occupy the same space. Other properties of amino acids rule out still more conformations. But even after all the forbidden conformations are excluded, an unfathomable number remain.
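The order of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below assumes 9 states per amino acid (borrowed from the pearl analogy; Levinthal's own assumptions may have differed) and, generously, that a computer checks one whole conformation per calculation. The function name and parameters are illustrative only.

```python
import math

def levinthal_estimate(n_residues, states_per_residue=9, rate_per_second=1e18):
    """Return (log10 of conformation count, log10 of years needed to enumerate them).

    Works in log10 space because the raw numbers overflow ordinary floats
    for long chains.
    """
    log10_conformations = n_residues * math.log10(states_per_residue)
    seconds_per_year = 365.25 * 24 * 3600
    log10_years = (log10_conformations
                   - math.log10(rate_per_second)
                   - math.log10(seconds_per_year))
    return log10_conformations, log10_years

# A 100-residue chain, as in the text.
log_conf, log_years = levinthal_estimate(100)
print(f"conformations ~ 10^{log_conf:.1f}")
print(f"years to enumerate ~ 10^{log_years:.1f}")

# For comparison, the age of the universe is roughly 1.4e10 years.
log_age_universe = math.log10(1.4e10)
print(f"age of universe ~ 10^{log_age_universe:.1f} years")
```

Even under these generous assumptions, brute-force enumeration is out of reach by dozens of orders of magnitude, which is the heart of the paradox.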

Traditional Computational Methods

With the advent of modern computers by the 1980s, researchers turned to predicting the folding process algorithmically. Given the sheer complexity of the problem, they applied heuristic methods, which seek ‘good enough’ solutions rather than the best possible one. The foundational principle is that the stable, native structure of a molecule is the one with the lowest energy.

One important concept in protein folding is the folding funnel. It envisions a multi-dimensional energy landscape in which the unfolded protein starts in a high-energy state. As folding progresses, the protein “rolls downhill” toward the bottom of the funnel, which represents the lowest-energy state. The funnel shape guides the protein to its native, stable shape whichever of the many possible folding pathways it takes. The concept removes the need to explore every configuration and so reconciles the speed of protein folding in nature with the vast number of possible conformations described in the Levinthal Paradox.

When implementing this prediction on computers, the underlying idea is to find the atomic coordinates that give the minimum total energy of the protein molecule. Force-field equations are used to compute the energy of a particular conformation, and the atomic coordinates are updated step by step in the direction of lower energy. Traditional computational approaches such as molecular dynamics (MD) simulations and Monte Carlo simulations have been used to model protein folding. MD simulations use Newton’s equations of motion to model the movement of atoms over time, while Monte Carlo methods rely on random sampling to explore possible atomic configurations. In other words, both methods look for ways to roll down the protein’s energy landscape.
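To make "rolling down the energy landscape" concrete, here is a minimal sketch of the Monte Carlo flavor using the Metropolis acceptance rule. It is a toy, not a protein simulation: the one-dimensional `energy` function (a broad bowl with small bumps), the step size, and the temperature are all assumptions chosen for illustration, standing in for a real force field over thousands of atomic coordinates.

```python
import math
import random

def energy(x):
    """Toy 1D 'funnel': a broad bowl (x^2) with small bumps (sin) on its walls."""
    return x * x + math.sin(5.0 * x)

def metropolis(x0, n_steps=5000, step_size=0.5, temperature=0.5, seed=0):
    """Random-walk search that tracks the lowest-energy point visited."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        # Propose a small random move.
        x_new = x + rng.uniform(-step_size, step_size)
        e_new = energy(x_new)
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-dE / T) so the walk can escape bumps.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

start = 4.0
best_x, best_e = metropolis(start)
print(f"start energy: {energy(start):.3f}")
print(f"best energy found: {best_e:.3f} at x = {best_x:.3f}")
```

Occasionally accepting uphill moves is the design choice that matters here: a search that only ever moved downhill would stall in the first small dip on the funnel wall.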

These techniques, while valuable, are computationally expensive and often require simplifications to be feasible. MD simulations, for example, model the physical interactions between atoms explicitly, but the process can be extremely slow, especially for large proteins. Monte Carlo simulations search through possible configurations by random sampling, but because they do not follow the actual motion of the atoms over time, they are limited in how well they capture folding dynamics. Another popular method is homology modeling, which rests on the principle that similar sequences fold in similar ways.
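Homology modeling starts by asking how similar a target sequence is to a protein whose structure is already known. As a rough sketch of that first step, the function below computes percent identity between two sequences that are assumed to be aligned already (gaps written as "-"). The two sequences are made up for illustration, and real pipelines use dedicated alignment tools and more refined similarity scores.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical, pre-aligned target and template sequences.
target = "MKTAYIAKQR"
template = "MKTAHIAK-R"
identity = percent_identity(target, template)
print(f"identity: {identity:.1f}%")
```

The higher the identity to a template of known structure, the more reasonable it is to borrow that template's fold as a starting model.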

AlphaFold: AI in Protein Structure Prediction

The development of AlphaFold by Google DeepMind marks a major shift in protein structure prediction. Unlike traditional computational methods, AlphaFold uses a class of artificial intelligence techniques known as neural networks to predict protein structures. The key difference between physics-based models and AI tools is that, instead of computing the energy of a molecule and searching for a stable shape, AI models learn patterns from known 3D structures and evolutionarily related sequences.

AlphaFold leverages evolutionary data to improve its predictions. It takes as input multiple sequence alignments (MSAs), which capture the variation in the amino acid sequence of the same protein across different species. Evolutionary conservation matters because conserved amino acids often play essential roles in maintaining the protein’s structure. The MSA is fed to the neural network, which detects patterns in the evolutionary history of the protein by identifying conserved regions and the relationships between them. The network is also trained on a vast dataset of experimentally determined protein structures from the Protein Data Bank (PDB), from which it learns patterns in the pairwise distances between amino acids.
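To illustrate what "conservation" in an MSA means, the sketch below scores each alignment column by how little its amino acids vary, using Shannon entropy. This is not AlphaFold's actual processing, which feeds the alignment to a neural network; it is only a hand-computed signal of the kind such a network can pick up. The four-sequence alignment is invented for the example.

```python
import math
from collections import Counter

def column_conservation(msa):
    """Return one score in [0, 1] per alignment column (1 = fully conserved)."""
    n_cols = len(msa[0])
    if any(len(seq) != n_cols for seq in msa):
        raise ValueError("all sequences in an MSA must have the same length")
    max_entropy = math.log2(20)  # 20 amino acid types, all equally likely
    scores = []
    for col in range(n_cols):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # column of gaps carries no information
            continue
        total = len(residues)
        entropy = 0.0
        for count in Counter(residues).values():
            p = count / total
            entropy -= p * math.log2(p)
        scores.append(1.0 - entropy / max_entropy)
    return scores

# Hypothetical alignment of the "same" protein from four species.
toy_msa = [
    "MKVLA",
    "MKILA",
    "MRVLG",
    "MKLLA",
]
scores = column_conservation(toy_msa)
for position, score in enumerate(scores, start=1):
    print(f"column {position}: conservation {score:.2f}")
```

Columns 1 and 4 never change across the four sequences and score 1.0, while column 3, which shows three different amino acids, scores lowest.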

By combining these data, AlphaFold predicts not only local interactions between neighboring amino acids but also long-range interactions between distant ones, which are crucial for the protein’s final 3D structure. This approach, which integrates evolutionary and structural data in a deep learning model, enables AlphaFold to outperform earlier computational methods.

AlphaFold has opened a new chapter in protein structure prediction and has revolutionized the field. Standing on the shoulders of the giants of past generations, we now have a powerful tool to accelerate discoveries in drug design, molecular biology, and protein engineering. Yet, despite AlphaFold’s impressive ability to predict the structure of a protein from its linear sequence, it is not a solution from first principles. That is to say, protein folding remains an open question from a physics perspective, and another Nobel Prize likely awaits whoever solves it.
