Directed Evolution Goes Digital: How AI Uses Nature to Design the Perfect Protein
Every time we step out of our houses, we’re surrounded by thousands of years of human-engineered life. From domesticated dogs to bright red strawberries, a huge portion of our daily experience is shaped by the subtle bioengineering practice of selective breeding. We used this method because our historical understanding of living things was extremely crude: we could observe that big strawberries created more big strawberries, so we learned how to exploit that principle to our benefit. Today, we understand biology more mechanistically, so we can reach for bigger and more ambitious ideas than ever before. Knowing about DNA lets us imagine targeted cures for genetic diseases, while advanced understanding of proteins lets us imagine growing food without animals. And yet, despite how much we know, we’re still far from regularly realizing our highest visions. We struggle to edit afflicted DNA with inefficient enzymes, or to bind cancer-associated proteins with antibodies that are ill-fitting puzzle pieces. The parts we’re working with are small and squishy, and they change as the system they interact with breathes and grows. This complexity causes modern biology projects to take far longer and cost far more than expected. Many fail altogether.
In the late 80s, scientists had an idea: what if we approached the engineering of tiny, complex biological parts like selective breeding? We enlarged strawberries by observing many of them, picking the biggest ones, using them to make a new generation, and repeating. Could you do this with individual proteins? We knew how to mutate proteins, so as long as we tied their quality (we call this fitness) to something measurable, we could evolve better versions of that protein: If the protein works well, the cell will glow brightly. If the protein works well, the cell will grow fast. By using clever molecular biology tricks to tie the activity we cared about to these selectable traits, we could treat the improvement of protein function like selective breeding. This method was called directed evolution, and it won Frances Arnold a share of the Nobel Prize in Chemistry in 2018.
But there’s a catch: improving proteins by random mutation is a slow, incremental process. You start with the sequence of a protein that works pretty well in your application, then you tweak it randomly and test the change. In doing so, you meander about the fitness landscape of your starting protein until you find something a little better. It can take millions of variants to find the improvement you seek. Worse, you might start with a protein that can't improve much at all, when you should have started with a different natural sequence with more room to grow. Better mutants are rare, so you have to search through a really big haystack to find needles, but if your starting point is bad, you might be looking through the wrong haystack to begin with. On top of all that, you can’t even see the other haystacks.
AI offers researchers a new kind of guide through the landscape of possible sequences, learning from the language of all proteins to suggest routes for evolving new functions in your protein of interest. The model serves as a map of where all the haystacks are in relation to each other, and a navigator for how to go about digging. “In random mutagenesis, you’re more or less bound by, first, the rate of random mutagenesis, and second, your starting template,” says Kaiyi Jiang, an assistant professor of biological engineering at Princeton University and a pioneer in using a protein language model called ESM to guide directed evolution. “Computers can cover a much larger design space than random mutagenesis can.” Instead of testing millions of proteins to find an improvement, you can design 20 or 50 variants, tell your ESM-based model which ones worked best, and let it generate a new batch of designs for you.
A NEW APPROACH
Protein engineering can be tricky because it’s function-driven and the starting point isn’t fixed. We choose the task of breeding bigger strawberries specifically because we like strawberries, but problems in protein engineering are open-ended, and we care more about the destination than where we start. Imagine you’re given the prompt: Design the fastest swimming animal. It doesn't matter if the final creature is a fish or a mammal or even a bird, but these animals differ wildly from one another, and you need to account for all that natural diversity before you can really dig into the task.
In this example, you would begin by compiling a database of all the swimming animals in the world. Computers don’t perceive whole animals the way we do, but using a large language model, you can chunk the features of these animals into smaller and smaller pieces, making contextual observations: This type of tail shape is usually seen in animals of size X. Animals with Y fins usually have them spaced Z apart. Thousands of observations get chopped into discrete pieces by a mathematical device called a tokenizer, and the model transforms those pieces into vectors of numbers, called embeddings. These vectors correspond to points in multidimensional space, each one representing a different animal, which cluster by similarity and are easier for computers to analyze en masse.
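To make "cluster by similarity" concrete, here is a toy sketch of what happens once features become vectors. The feature values below are invented purely for illustration; real embeddings have hundreds of dimensions and are learned by the model, not written by hand.

```python
# Toy illustration of embeddings: each animal becomes a vector of numbers,
# and similar animals end up close together in that vector space.
import numpy as np

# Hand-made "embeddings": [tail aspect ratio, fin count, elongation, skin smoothness]
embeddings = {
    "goldfish": np.array([0.4, 7.0, 0.30, 0.1]),
    "dolphin":  np.array([0.9, 3.0, 0.60, 0.8]),
    "eel":      np.array([0.2, 2.0, 0.95, 0.2]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name in ("dolphin", "eel"):
    print(name, round(cosine_similarity(embeddings["goldfish"], embeddings[name]), 3))
```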
Despite thoroughly cataloguing animals, you don't know how fast they swim. You can fix this issue by collecting data from something you’re familiar with– say, a goldfish. The next step resembles conventional directed evolution: mutate the goldfish’s eggs and then race all of the new baby fish against each other, finding mutations that offer slightly faster swimming speeds. And now, inform the model with data for the mutants you tested. If the best mutant had a tail that was flipped 90 degrees, its mathematical embeddings might cluster a little closer to a dolphin, which also moves its tail up-and-down. In the next round of design, the model may sample other traits from that cluster of species like rubbery skin or a narrow nose. When we repeat the cycle again, perhaps the even faster narrow-nosed goldphin has a head shape that reminds the model of an eel, leading it to borrow traits from that lineage in the next iteration. And so on.
Protein language models like ESM see proteins by embedding features of protein sequences into a high-dimensional mathematical representation of the design space. "Embeddings are a way to simplify information so you can relate two things to each other and to learn from that," said Jonathan Gootenberg, who together with Omar Abudayyeh runs a lab that develops tools for programmable biology at Harvard Medical School. “If you embed two things that have a very similar structural fold, but very different sequence, they embed in similar space. And in that space you can see other features—this is the fitness of that protein, or this is the localization.” Embeddings cluster to describe different dimensions of a biomolecule. Some of these dimensions we can understand, like a protein’s hydrophobicity: every hydrophobic protein may share a pattern in its vast matrix of numbers. But other embedding patterns represent traits beyond our understanding. The model never needs to verbalize a trait like hydrophobicity to predict whether a given protein has it. Patterns in embeddings instead emerge from training on millions of proteins, seeing the “right” and “wrong” sequences over and over.
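In practice, getting an embedding out of ESM takes only a few lines. This minimal sketch follows the published fair-esm usage recipe for ESM-2; the sequence shown is an arbitrary placeholder, and mean-pooling the per-residue vectors into one per-protein vector is a common convention, not the only option.

```python
import torch
import esm  # pip install fair-esm

# Load pretrained ESM-2 (650M parameters) and its alphabet/tokenizer.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# An arbitrary placeholder sequence; substitute your protein of interest.
data = [("my_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Per-residue vectors from the final layer: shape (1, seq_len + 2, 1280).
per_residue = out["representations"][33]
# Mean-pool over residues (skipping the BOS/EOS tokens) to get one
# 1280-dimensional embedding for the whole protein.
protein_embedding = per_residue[0, 1 : len(strs[0]) + 1].mean(0)
```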
As a graduate student in the AbuGroot lab, Jiang developed AI tools built on ESM to guide iterative rounds of directed evolution, an approach known as active learning. With active learning, the model searches for evolutionary context for each high-performing mutant, digging through the patterns found in every corner of nature. As with our mutated goldfish, ESM asks “where have I seen this before?” for each winning protein feature, parsing through millions of embeddings for similarities. Though structure determines function, evolution occurs at the level of DNA sequence, so ESM can use nature’s many sequence-encoded blueprints in a way that structural models cannot. When a structural model suggests a design, it takes on the tricky task of predicting how the design will perform at a biochemical level, whereas ESM simply creates new sequences based on what has already worked. Vast sequence databases are compact compared to structural data, and flagging “good” and “bad” sequences within them bridges distant parts of the database cheaply, with modest computational power.
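That “where have I seen this before?” step is, at its core, a nearest-neighbor search in embedding space. A minimal sketch, assuming you have already computed an embedding matrix for a sequence database (for example, with the ESM snippet above):

```python
import numpy as np

def nearest_neighbors(query_emb, database_embs, k=5):
    """Cosine similarity between one query embedding and a matrix of
    precomputed database embeddings (one row per protein).
    Returns indices and scores of the k most similar entries."""
    db = database_embs / np.linalg.norm(database_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = db @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]
```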
Jiang has seen how this focused design methodology can yield higher success rates in screening. He led the development of EVOLVEpro, a new active learning model built upon ESM-2, which unifies small batches of labwork with large protein sequence databases to produce improved proteins that borrow characteristics from across the fitness landscape. Testing 12 diverse protein classes with broad biotech and pharma relevance, from antibodies to CRISPR editors to RNA polymerase, the authors found that, on average, with just 16 mutants per round of testing (compared to the millions or billions required for traditional directed evolution), ~60% of new mutants displayed high activity after only five rounds of testing. By the tenth round, this number jumped to roughly 90% of mutants. “With a protein language model you get the possibility that you don’t have to do a super large library. You can do a small library and ask the model to tell you where to concentrate,” Jiang said. Scientists can use this focused screening to seek the richest and most meaningful lab data for their use case, and let that data guide further cycles of development.
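EVOLVEpro couples ESM-2 embeddings to a small regression model that is refit after each round of lab data. The sketch below captures the general shape of one such round in generic scikit-learn terms, not the paper's exact implementation; `embed` is a hypothetical helper mapping a sequence to its protein-language-model embedding, and a random forest stands in for whatever regressor a given implementation uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def next_round(tested_seqs, measured_activity, candidate_seqs, embed, batch_size=16):
    """One active-learning round: fit a regressor on embeddings of the
    variants measured so far, then nominate the candidates with the
    highest predicted activity for the next wet-lab batch."""
    X = np.stack([embed(s) for s in tested_seqs])
    surrogate = RandomForestRegressor(n_estimators=100).fit(X, measured_activity)
    preds = surrogate.predict(np.stack([embed(s) for s in candidate_seqs]))
    ranked = np.argsort(-preds)[:batch_size]
    return [candidate_seqs[i] for i in ranked], surrogate
```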
"Protein language models can sample the functional landscape of a protein a little smarter than other methods,” Jiang explained. “If you have a local view on certain areas, then you can make some educated guesses around what other parts of the landscape look like." With active learning, every mutation is an educated guess instead of a dart throw. The vast majority of random mutations decrease function, so traditional directed evolution requires slogging through millions of bad or neutral mutations and often never finding the right path to better function. With ESM embeddings, AI can design a much smaller set of mutations that are much more likely to succeed.
Smaller screens translate to faster, cheaper experiments, which can be a major accelerant for researchers working to develop new medicines or biotechnologies. "This is what will make a company succeed or fail,” Gootenberg added. “AI will be able to help you move faster and move cheaper, where instead of doing a huge campaign you can screen a smaller library. There’s a lot of ways that it can make the process less painful, fewer iteration cycles, higher success rates, smaller scale experiments. All of that translates to speed and cost, which is everything you need."
THE POWER OF SMALL DATA
Beyond the sheer time and cost required to make and screen more mutants, the smaller, focused libraries used in active learning solve a wide array of other challenges. Designing, executing, and analyzing data from very large-scale screens requires specialized equipment and niche expertise. There are practical challenges too: Large mutant libraries built to find extremely rare outliers can overwhelm the machines for screening them; looking for needles in a haystack that’s too large tests the upper limits of cell sorters and DNA sequencers alike. And when your successful mutant is really rare, every inefficiency in building your library risks losing the needle: DNA assembly, transformation, growth, and DNA extraction all lose bits of the genetic library prior to screening, meaning rare winning variants can vanish before discovery. Even worse, many valuable traits and functions can’t be adapted to a massive screen at all.
This gets to the heart of a longstanding paradox in biotechnology and drug development: generating new leads often requires screening through tons of candidates, but high-throughput tests rarely resemble therapeutic or industrial contexts. Many variants that perform well in experiments that can be done at large scales in test tubes fail to treat the disease in animal models or clinical trials; many strains producing high product yield in a flask don’t function as well in a bioreactor.
Mutations that benefit a feature that is easy to measure in a high-throughput context might force tradeoffs with harder-to-measure factors that are equally critical to success. Does our antibody just bind, or does it also produce the cellular response we care about? Is it immunogenic? Stable? Manufacturable at high yield? With EVOLVEpro, Jiang and his team evolved several antibodies to optimize binding to their target as well as production efficiency. These two parameters are in tension with each other: for most of the antibodies tested, there was a tradeoff between mutations that improve binding and those that improve yield. But even in the first round, there were some mutations that improved along both axes, letting the team zero in on areas of the design space to pursue in future rounds. Others have also shown that using ESM to guide active learning for antibody evolution can improve binding as well as thermostability and neutralizing activity, using small amounts of experimental data.
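One simple way to operationalize that two-axis selection is to keep only the variants that no other variant beats on both measurements at once, the so-called Pareto front. A minimal sketch; the tuples here are hypothetical assay readouts, not data from the paper.

```python
def pareto_front(variants):
    """Keep variants not dominated on (binding, yield): a variant survives
    if no other variant is at least as good on both axes and strictly
    better on at least one. `variants` holds (name, binding, yield) tuples."""
    front = []
    for name, b, y in variants:
        dominated = any(
            b2 >= b and y2 >= y and (b2 > b or y2 > y)
            for _, b2, y2 in variants
        )
        if not dominated:
            front.append((name, b, y))
    return front

# Hypothetical round-one readouts; higher is better on both axes.
print(pareto_front([("wt", 1.0, 1.0), ("m1", 2.1, 0.6), ("m2", 1.4, 1.3)]))
# -> [('m1', 2.1, 0.6), ('m2', 1.4, 1.3)]; "wt" is dominated by "m2".
```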
Abudayyeh suggested the potential to go even further, where with active learning “you can even evolve in vivo since the low scale of it is amenable to animal experiments.” With ESM to focus and navigate the genetic design space, scientists could potentially generate therapeutic leads using successive small rounds of animal studies paired with active learning. EVOLVEpro improved proteins in rounds of just 16 mutants, a scale previously unimaginable using conventional directed evolution, but within the reach of animal studies. Researchers could demonstrate the clinical relevance of their leads from the get-go, and drug candidates exiting this process would likely have a higher chance of progressing further in clinical trials and turning into successful therapeutics. This means more scientists could use fewer resources to find more promising solutions for illnesses that currently evade treatment.
This benefit is translatable across many industries. Is my engineered bacterium that produces an insect-repelling protein in a 96-well plate going to work when it colonizes a plant leaf? Well, if you have a trusted leaf assay for testing leads from your plate-based experiment, you can instead use it to evolve the function you seek. All my attempts to improve my flavor-producing yeast are failing, and I want to do directed evolution but can’t link the flavor production to anything I can select for in a million-mutant screen. Well, how about using analytical chemistry to detect the flavor molecule and EVOLVEpro to design a batch of enzyme variants small enough for the mass spectrometer to screen? With ESM, scientists can avoid difficult rational protein design by performing directed evolution, while also using the assays that give the most relevant data. This lowers the experimental barrier for targeting a wide variety of challenging applications.
In instances where our assays actually are easy and scalable, models like ESM can finally close the automation loop in directed evolution applications. Researchers at Zhejiang University in Hangzhou combined automated cloning, transformation, and testing pipelines with a learning model built upon ESM-2 to create a continuous protein evolution system. The system generated new enzyme designs, robotically created strains to test those enzymes, and funneled the activity measurements into an active learning model that generated the next round of designs. They could repeat this process as many times as necessary for the enzyme activity to increase. In Nature Communications, they described improving the function of a tRNA synthetase by 2.4X in four learning cycles and just 10 days. They made a real cloud lab: a myth of yore where you send instructions to a computer, and the computer does the lab work and spits out data. Until recently it was a fairy tale, largely because successful experiments usually require sophisticated strategy, and the human thinking involved usually takes far more time than automated assays save. Directed evolution is strategically non-strategic, but data analysis for traditional libraries with millions of members requires a lot of custom treatment, and the mandatory human intervention can sprawl out of control. By making the actual evolution portion of directed evolution virtual, and doing sporadic testing in a format with trivial data analysis, like a plate-based biochemical assay, the whole experiment can happen in a robotic workcell with a brain driven by ESM.
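The control logic of such a closed loop can be surprisingly small. This sketch is not the Zhejiang group's software; `designer` and `workcell` are hypothetical stand-ins for the model-driven design step and the robotic build-and-test pipeline, included only to show the shape of the loop.

```python
def evolve(designer, workcell, wild_type, rounds=10, batch_size=16):
    """Shape of a closed-loop campaign. `designer` (model-driven design)
    and `workcell` (robotic cloning, expression, and assay) are
    hypothetical objects, not a real instrument API."""
    tested = [wild_type]
    activities = list(workcell.run_assay([wild_type]))
    for _ in range(rounds):
        batch = designer.propose(tested, activities, n=batch_size)  # in silico step
        results = workcell.run_assay(batch)                         # wet-lab step
        tested += list(batch)
        activities += list(results)
    return max(zip(tested, activities), key=lambda pair: pair[1])   # best variant
```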
When this in silico brain built upon ESM generates validated hits, the insights can be audited to reveal undiscovered elements of protein design. Over many learning cycles, when the model zooms in on a group of embeddings that are truly important, researchers can ask why. What qualities do these embeddings describe, and is there anything we can learn about this protein? Is it possible that we are being pointed to a larger truth about protein engineering? The patterns that trained ESM were formed over millions of years of gradual change, survival, death, and repetition, and when some of those patterns emerge as meaningful, they can point to living mechanisms we haven’t observed or described before.
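Asking "why" can start small. If the active-learning surrogate is a tree-based regressor like the hypothetical one sketched earlier, its feature importances flag which embedding dimensions carried predictive weight; figuring out what a flagged dimension means is the human part.

```python
import numpy as np

def top_embedding_dims(surrogate, k=10):
    """Rank embedding dimensions by the predictive weight the fitted
    tree-based surrogate assigned them. Interpreting *what* a flagged
    dimension encodes still takes follow-up experiments."""
    importances = surrogate.feature_importances_
    top = np.argsort(-importances)[:k]
    return list(zip(top.tolist(), importances[top].tolist()))
```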
These screening efficiencies are even more powerful considered within the context of how the tool explores protein design space. ESM-based active learning methods crawl into corners of the fitness landscape that may otherwise be inaccessible, enable screening with more relevant assays that simulate the final system the researchers wish to engineer, and use the data to dynamically explore more of the landscape with higher efficiency in the next round. This means groups can engineer better proteins faster, with less work and higher confidence that they approached the theoretical maximum function. They can exit the engineering process more certain that their final protein will function as intended in its final application, allowing a sigh of relief: there is one less point of vulnerability in the project. And they can use active learning to quickly explore important problems they otherwise would have viewed as too tricky, high-risk, and time-consuming. “As more people appreciate the power of these approaches, they'll start to think about new applications, and that's going to be a big unlock. Any researcher with a problem could ask, what if I use these models to figure this out?” mused Abudayyeh on how active learning might allow biologists to change their approach to tricky projects. Jiang added, “Hopefully in a few years, this sort of PLM will become the linear regression of the protein engineer. It will just be default to use protein language models. Not saying that they'll always work, but this is– of all the options you have– probably the best.”
Active learning can reduce the calculated risk for projects big and small, for users far and wide. Scientists will succeed more often in the near term because of this risk reduction, and will tolerate more complex projects in the long term as a result. It may pave the way for a new generation of directed evolution that doesn’t grasp so blindly in the dark. And with PLMs we may evolve a new generation of biology—an era where our imagination isn’t limited by what we know, and where our tools for solving our problems teach us more about our world.
###
