AI/ML tools and startups for drug design
Computer modeling techniques and structure-based drug design
Drug Design 🖌️
Traditional methods of drug discovery—known also as forward pharmacology—rely on high-throughput screening of chemical libraries (and now AI) to match a certain cellular modulation to a specific drug treatment (I covered this in an earlier newsletter: source, source). The opposite approach is rational drug design—also called reverse pharmacology or simply drug design—which is based on the hypothesis that a “designed” molecule can induce a specific modulation of a biological target.
In particular, drug design is the process of finding drug candidates during drug discovery based on the knowledge of a biological target, and involves the design of molecules that complement the shape and charge of a target—most commonly a protein or a nucleic acid—hoping that the molecule will be able to interact, bind and modulate the target in a way that produces therapeutic value.
From an experimental point of view, both traditional drug discovery and rational drug design are equally important, so the pharmaceutical industry depends on the combined efforts of both of them.
In the first case, the chemical libraries required during screening can be simply purchased and then tested in vitro to identify the targets that are modulated. For example, Terray Therapeutics explores molecules and targets with a sophisticated integration of ultra-high throughput experimentation, generative AI, biology, medicinal chemistry, automation and nanotechnology that enables them to add millions of new library molecules per week to a collection of 60M+, screen 2M molecules against a target in four minutes and convert 25 TB of image data into binding affinities daily, allowing them to rapidly identify potent and selective molecules.
In the second case—when it comes to rational drug design—two kinds of knowledge are required before a biomolecule can be selected as a target:
knowledge that the modulation of the selected target will be disease-modifying, and
knowledge that the target is druggable, namely that a small molecule, a peptide or a therapeutic antibody can alter its activity.
Accordingly, since drug design frequently relies on computer modeling techniques—that is, the representation of the 3D structure of chemical and biological molecules—it is referred to as computer-aided drug design. An example of computer molecular modeling is Datamol.io by Valence Labs (Recursion’s new ML research center). Datamol is an elegant, open-source, RDKit-powered Python library for performing computational tasks on molecules of the chemical space, built on top of powerful numerical and cheminformatics libraries such as pandas, RDKit, NumPy and Matplotlib. Datamol represents molecules as mathematical objects (Morgan fingerprints) so that ML techniques can be applied. After generating the fingerprints, the molecules can be clustered using the Butina algorithm, with the Tanimoto similarity index for distance computations. Ultimately, this makes it possible to quickly learn how structural changes affect activity while efficiently exploring a large chemical space in search of the right drug candidate to target a protein.
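The fingerprint-and-cluster workflow described above can be sketched in a few lines of pure Python (a toy illustration, not Datamol’s actual API: fingerprints are modeled as sets of “on” bit indices, and the `tanimoto` and `butina_cluster` helpers are written just for this example):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints (sets of 'on' bit indices)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def butina_cluster(fps, cutoff=0.35):
    """Butina clustering: distance = 1 - Tanimoto; greedily pick the molecule
    with the most unassigned neighbors as the next cluster centroid."""
    n = len(fps)
    neighbors = [
        {j for j in range(n) if j != i and 1 - tanimoto(fps[i], fps[j]) <= cutoff}
        for i in range(n)
    ]
    unassigned, clusters = set(range(n)), []
    while unassigned:
        centroid = max(unassigned, key=lambda i: len(neighbors[i] & unassigned))
        members = {centroid} | (neighbors[centroid] & unassigned)
        clusters.append(sorted(members))
        unassigned -= members
    return clusters
```

In practice the fingerprints would come from RDKit’s Morgan fingerprint generator and the clustering from its Butina implementation, but the underlying logic is the same.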
Moreover, when drug design relies on the knowledge of the 3D structure of the biomolecular target it is known as structure-based drug design (SBDD).
Usually, SBDD includes:
Structure determination of the target protein, cavity identification, ligand database construction, ligand docking and lead discovery. Software commonly used for SBDD includes SWISS-MODEL, MODELLER, Phyre and Phyre2, CASTp, the Active Site Prediction tool, AutoDock Vina and Schrödinger.
A computational virtual screening of large chemical libraries. Docking-based virtual screening—which identifies the location of the binding site of the drug candidate within the protein target—has traditionally been done using different classes of methods, such as (link):
Template-based methods (firestar, 3DLigandSite and Libra),
Geometry-based methods (CurPocket, Surfnet and SiteMap),
Energy-based methods (FTMap and Q-SiteFinder),
As well as ML methods (DeepSite, Kalasanty and DeepCSeqSite being some of the newer approaches).
In addition to the location of the binding site, the evaluation of its potential druggability can be done for example using DoGSiteScorer, a web server that supports the prediction of potential pockets and gives a druggability estimation.
To make a long story short, the increasing sophistication of computational technologies has driven exponential improvements in AI tools, and the process of rational drug design has become more and more reliant on computer-aided drug design using AI modeling techniques. These modeling techniques include:
homology modeling—also known as comparative modeling of targets—a commonly used computational method that predicts a protein’s 3D structure from its amino acid sequence, helping us design new molecules to target the protein,
molecular modeling that describes the generation, manipulation or representation of a 3D structure of chemical and biological molecules, using quantum and classical physics to simulate the shape and stability of molecules. It can also be combined with the homology modeling tools when studying protein complexes and protein–protein interaction studies,
de novo modeling: an algorithmic process by which a protein’s 3D structure is predicted from its primary sequence; it is much more computationally intensive than comparative modeling. This type of modeling is advantageous when no initial protein template structure has been identified—the critical step in homology modeling—and is limited to small protein sequences (<100–150 residues) because of the massive computational resources and conformational space required. Overall, de novo is the more general term for the category of methods that do not use templates as prediction tools; instead they produce a series of possible candidate structures (called ‘decoys’) guided by scoring functions and sequence-dependent biases. And
Ab initio structure prediction tools that classically refer to structure prediction using nothing more than first-principles (i.e. physics).
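The decoy-and-score loop at the heart of template-free methods can be caricatured in a few lines (a deliberately simplified 1-D sketch: real decoys are full 3D conformations, and real scoring functions model physical energies rather than this hypothetical quadratic):

```python
def sample_decoys(n, low=0.0, high=10.0):
    """Generate n candidate 'conformations' on a deterministic 1-D grid.
    A real method would sample 3D backbone conformations instead."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]

def pick_best(decoys, score):
    """Keep the decoy with the lowest (best) score, mimicking how a
    scoring function ranks candidate structures by estimated energy."""
    return min(decoys, key=score)

# A hypothetical scoring function whose 'native state' sits at x = 3
best = pick_best(sample_decoys(11), score=lambda x: (x - 3) ** 2)
```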
After this brief introduction, we get now to the core of this newsletter by reviewing the latest advancements of some of the different AI/ML tools, as well as startups, for modeling during rational drug design.
AI algorithms for Protein Predictions
🛟AlphaFold
AlphaFold, developed by DeepMind, is considered the gold standard for AI protein structure prediction. AlphaFold was trained on publicly available data consisting of ~170,000 protein structures from the Protein Data Bank (PDB), together with large databases at UniProt containing protein sequences of unknown structure.
Initially, the first version of AlphaFold was used in 2018 to predict the 3D structure of proteins at the biennial Critical Assessment of Structure Prediction protein-folding challenge—known as CASP, a worldwide community effort that has historically recorded progress in the field—and performed really well, reaching a Global Distance Test (GDT) score of around 60.
GDT is the main metric used to assess the success of computational prediction models (ranging from 0 to 100) and can be thought of as the percentage of amino acid residues in the in-silico model that lie within a certain distance of their correct positions in the experimental structures used as controls. If a prediction model gets a GDT score of around 90, it is considered comparable to wet-lab experimental methods, such as x-ray crystallography and cryo–electron microscopy, which are the major methods of atomic-resolution structure determination.
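The commonly used GDT_TS (“total score”) variant averages the fraction of residues falling within 1, 2, 4 and 8 Å of their experimental positions; a minimal sketch:

```python
def gdt_ts(distances):
    """GDT_TS from per-residue distances (in Å) between predicted and
    experimental positions, assuming a single fixed superposition."""
    n = len(distances)
    # fraction of residues within each distance cutoff
    fractions = [sum(d <= cutoff for d in distances) / n for cutoff in (1, 2, 4, 8)]
    return 100 * sum(fractions) / 4
```

In practice GDT takes the best result over many alternative superpositions of model and experiment; this sketch assumes the superposition has already been done.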
Subsequently, the second version of AlphaFold was used in 2020 at the 14th CASP and achieved a GDT of 92.4, performing amazingly well. AlphaFold 2.0 relies exclusively on pattern recognition and is an attention-based neural network architecture combined with a deep learning (DL) framework. By contrast, the original AlphaFold combined local physics with pattern recognition, and would often overestimate the effect of interactions between nearby residues.
In artificial neural networks, an attention mechanism is a technique that is meant to mimic cognitive attention. It attempts to enhance some parts of the input data while diminishing other parts. By doing so, a neural network can focus on smaller but more important parts of the data.
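A minimal sketch of (scaled dot-product) attention, the mechanism referred to above—here in pure Python with toy 2-D vectors, purely for illustration:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)        # enhance some parts, diminish others
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

Here the query “attends” most strongly to the key it is most similar to, so the matching value dominates the output—exactly the enhance/diminish behavior described above.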
Let’s see now what a neural network is and how it differs from DL.
A basic neural network, a subset of ML, is inspired by the human brain and has interconnected nodes—which are functionally equivalent to neurons—in three layers:
an input layer: that processes and analyzes information before sending it to the next layer,
a hidden layer: that receives data from the input layer and further processes and analyzes the data by applying a set of mathematical operations to transform and extract relevant features from the input data, and
an output layer: that delivers the final information using the extracted features.
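The three layers above amount to a single forward pass; a minimal sketch with one hidden layer (toy weights, pure Python):

```python
def relu(x):
    """A common hidden-layer nonlinearity."""
    return max(0.0, x)

def forward(x, w1, b1, w2, b2):
    # hidden layer: weighted sums of the inputs, passed through a nonlinearity
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    # output layer: weighted sums of the extracted features
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]
```

A deep network simply stacks many such hidden layers between input and output.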
DL, in turn, is a subcategory of ML that involves training neural networks to learn and improve from data without being explicitly programmed to do so. Between the input and output layers of a DL model there are many hidden layers, which allows the network to perform highly complex operations and to continuously learn as the data representations are processed across multiple layers.
Though often used interchangeably, neural networks and DL algorithms differ in various ways, including:
Layers: the neural networks are usually made up of an input, hidden and output layer, while the DL models have several layers of neural networks.
Scope: the applications of neural networks include more straightforward tasks such as pattern recognition, face identification, machine translation and sequence recognition, while DL networks are usually used for tasks such as customer relationship management, speech and language processing, image restoration and drug discovery.
Extraction of Features: a neural network requires human intervention and engineers must manually determine the hierarchy of features. On the other hand, DL models can automatically determine the hierarchy of features using labeled datasets and unstructured raw data.
Performance: a basic neural network takes less time to train but achieves lower accuracy; DL models are more complex and require large amounts of labeled data to train, but reach higher accuracy.
Computation: neural networks can be trained using smaller datasets with fewer computational resources, while DL is a complex neural network that can classify and interpret raw data but requires more computational resources.
Going back to 3D protein prediction tools, the latest AlphaFold model can now generate predictions for nearly all molecules in the PDB—frequently reaching atomic accuracy—unlocking new understanding and improving accuracy for multiple key biomolecule classes, including ligands (small molecules), proteins, nucleic acids (DNA and RNA) and those containing post-translational modifications.
Moreover, at Isomorphic Labs—launched from Alphabet’s DeepMind to build on the success of AlphaFold2—they believe there may be a common underlying structure between biology and information science, namely an isomorphism, a structure-preserving mapping between two structures of the same type that can be reversed by an inverse mapping! In other words, they are searching for a “template” or a “prototype” inside biology’s code.
For more: PERSPECTIVE: THE RISE OF “WET” ARTIFICIAL INTELLIGENCE
AlphaFold doesn’t do well at predicting protein–protein interactions (an important consideration for designing drugs) or the structures of proteins that span cell membranes (like key proteins that affect toxicity), which are more difficult to predict because relatively few have actually been solved.
And these important problems won’t go away without new data.
The Solution? 👇
The marrying of wet-lab techniques with AI—“wet” AI models—is the opposite of “dry” AI models like ChatGPT. They will be small (not large), special-purpose (not general), built from specially created proprietary datasets (rather than pre-existing public ones), and many will be trivially cheap to train—not costing between $30M and $100M like many LLMs. Although they will be small and private, they are on track to have enormous impact.
🛟ESMFold
In 2022, Meta AI researchers launched a breakthrough model for protein structure prediction called Evolutionary Scale Modeling, or ESMFold, claiming it can make predictions 60 times faster than other state-of-the-art systems for short protein sequences. The tool uses AI to learn to read patterns in protein sequences and focuses on proteins found in microbes in the soil, deep in the ocean and inside humans. In total, its dataset comprises roughly 600 million predicted proteins, which you can explore in the ESM Metagenomic Atlas.
ESMFold works by using machine-readable forms of proteins (known as representations) from the ESM2 large language model to generate accurate structure predictions. This allows it to infer new structural information of a protein by recognizing specific patterns in the amino acid sequence.
Language models work by using a probability distribution over words or word sequences as a basis for learning. This means that DL models for text-based data must incorporate natural language processing (NLP) methods. NLP methods help decipher text (among other forms of input) using concepts such as attention and transformers. In this specific case, the language models are trained on large ensembles of protein sequences in order to capture long-range dependencies within a protein sequence. Very briefly, the positions of the amino acids in the primary structure of a protein can be separated into long- and short-range dependencies. The long-range interactions (such as the physicochemical interactions between residues far apart in the linear 1D arrangement of the amino acids) play a distinct role in determining the 3D structure of a protein, while the short-range interactions contribute to the formation of secondary structure.
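As a toy illustration of “a probability distribution over sequences,” here is a hypothetical bigram model over amino-acid letters (real protein language models like ESM2 are transformers capturing far longer-range context, but the probabilistic idea is the same):

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Estimate P(next residue | current residue) from example sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    # normalize counts into conditional probabilities
    return {cur: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
            for cur, ctr in counts.items()}
```

A bigram model only “sees” one residue back—exactly the short-range limitation that attention-based transformers remove.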
More specifically, the attention concept used is applied by teaching the model to draw connections between any parts of the protein sequence, such as the long-range dependencies. So, in the transformer networks—that exclusively use attention blocks—long-range dependencies have the same likelihood of being taken into account as any other short-range dependencies.
However, last summer Meta laid off the team that built the revolutionary ESMFold, a move indicating that Meta is shifting its focus to generating revenue from commercial AI.
Besides AlphaFold and ESMFold, there are also other protein prediction models to quickly and accurately predict protein structures in as little as ten minutes, like ColabFold, RoseTTAFold, IntFOLD and RaptorX (2022 was The Year of Protein Folding Models. Wait, What?).
🛟RoseTTAFold
RoseTTAFold is a “three-track” neural network that simultaneously considers patterns in the protein sequence, how a protein’s amino acids interact with one another and a protein’s possible 3D structure. In this neural network, sequence, distance and the 3D information flow back and forth, allowing the network to collectively reason about the relationship between a protein’s chemical parts and its folded structure.
RoseTTAFold can solve challenging x-ray crystallography and cryo–electron microscopy modeling problems and is one of the most renowned tools for template-free protein prediction—the most challenging kind, since structures are predicted without relying on existing template structures. Note, however, that both AlphaFold2 and RoseTTAFold rely on Multiple Sequence Alignments (MSAs) as inputs to their models; MSAs map the evolutionary relationships between corresponding residues of genetically related sequences and are derived from large, public, genome-wide sequencing databases that have grown exponentially since the emergence of next-generation sequencing.
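To give a feel for what an MSA encodes, a minimal sketch of per-column conservation over a toy alignment (real pipelines go further and extract coevolution signals between pairs of columns):

```python
from collections import Counter

def column_conservation(msa):
    """For each alignment column, the fraction of sequences sharing the
    most common residue: highly conserved columns hint at structural or
    functional importance."""
    ncols = len(msa[0])
    return [Counter(seq[i] for seq in msa).most_common(1)[0][1] / len(msa)
            for i in range(ncols)]
```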
The Rosetta molecular modeling software package is used right now by OutpaceBio, creating cell therapies with curative potential through protein design and cellular engineering.
🛟OmegaFold
On July 20, 2022, the Chinese biotech firm Helixon launched OmegaFold, which combines a protein language model (PLM) that makes predictions from single sequences—eliminating the need for MSAs—with Geoformer, a new geometry-inspired transformer neural network that further distills the structural and physical pairwise relationships between amino acids.
So far, OmegaFold claims to outperform RoseTTAFold and to achieve prediction accuracy similar to AlphaFold2, alongside other MSA-free models such as HelixFold-Single and ESMFold. Because it does not require MSAs as input, it is believed to have higher potential for predicting the structures of orphan proteins and antibodies.
Read also how Democratizing Protein Language Models with Parameter-Efficient Fine-Tuning allows many research groups to overcome the computational and memory footprint of fine-tuning large language models.
AI tools for molecular modeling of the drugs themselves
📌 Molecular property prediction applies to both high-throughput screening and molecule optimization and depends on the representation form of the molecules: recurrent neural networks (RNNs) for SMILES strings, feed-forward networks (FFNs) for molecule fingerprints and graph neural networks (GNNs) for molecule graphs. However, these DL-based approaches require a large amount of training data to be effective. For this reason, the MIT-IBM Watson AI Lab proposed a data-efficient property predictor that utilizes a learnable hierarchical molecular grammar able to generate molecules from grammar production rules: Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction.
Such a grammar induces an explicit geometry on the space of molecular graphs, which provides an informative prior on molecular structural similarity. The property prediction is performed using graph neural diffusion over the grammar-induced geometry. On both small and large datasets, the evaluation showed that this approach outperforms a wide spectrum of baselines, including supervised and pre-trained graph neural networks. A further analysis showed its effectiveness in cases with extremely limited data (only ∼100 samples), as well as its extension to molecular generation.
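The graph-neural-network view of a molecule—atoms as nodes, bonds as edges—can be sketched with a single unweighted message-passing round (a toy caricature; real GNNs apply learned transformations at each step):

```python
def message_pass(node_feats, adjacency, rounds=1):
    """Each round, every atom aggregates its neighbors' features into its own."""
    feats = list(node_feats)
    for _ in range(rounds):
        feats = [feats[i] + sum(feats[j] for j in adjacency[i])
                 for i in range(len(feats))]
    return feats

def readout(feats):
    """Pool node features into one molecule-level representation, which a
    predictor head would then map to a property value."""
    return sum(feats)
```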
📌 PAMNet is a universal framework for accurately and efficiently learning the representations of 3D molecules of varying sizes and types in any molecular system: A universal framework for accurate and efficient geometric deep learning of molecular systems. PAMNet induces a physics-informed bias to explicitly model local and non-local interactions and their combined effects. As a result, PAMNet can avoid expensive operations, making it time- and memory-efficient. In extensive benchmark studies, PAMNet outperforms state-of-the-art baselines in both accuracy and efficiency across three diverse learning tasks: small-molecule properties, RNA 3D structures and protein–ligand binding affinities.
AI Startups for rational drug design
✔️ NVIDIA and Evozyne created a generative AI model for proteins, to generate predictions for proteins whose structure is unknown. Evozyne used NVIDIA’s implementation of ProtT5, a transformer model that’s part of NVIDIA BioNeMo—a software framework and service for creating AI models for healthcare—and created two proteins with significant potential in healthcare and clean energy. On September 27, 2023, Evozyne announced the closing of an $81M Series B investment round (a total of $144.4M).
✔️ The DeepChem open-source project by Dr. Ramsundar—founder and CEO of Deep Forest Sciences—is the most popular open-source framework for drug discovery: a Python-based AI tool for various drug discovery prediction tasks, such as predicting the solubility of small molecules, predicting the binding affinity of small molecules to protein targets, predicting physical properties of simple materials, analyzing protein structures and extracting useful descriptors.
✔️ Celeris Therapeutics is using ML to predict biomolecular interactions and generate new chemical entities. By using the Xanthos platform for design, they start with a protein target sequence, determine 3D structures, predict ligand binding, generate linkers, predict ternary complexes and refine proximity-inducing compounds selection through several layers of filters including molecular dynamics and QSAR. Last year, Celeris Therapeutics and Boehringer Ingelheim announced a collaboration to develop next-generation targeted protein degraders.
✔️ Codexis is a leading enzyme engineering company that has a proprietary platform, the CodeEvolver, that provides in silico, high-throughput assay screening. By using powerful ML tools, sophisticated molecular, cellular and bioanalytical workflows, at Codexis they can design and screen libraries of thousands of enzyme variants in high throughput, then sequence every variant and correlate its sequence with its performance in a highly application-relevant screen. Among Codexis' partners you can find Merck, GSK, Novartis, Nestle, Takeda and many more.
✔️ Innophore uses AI-guided point-cloud technology that not only analyzes a protein’s 3D structure but also includes extended surface properties (HALOS) and volumetric cavities (Catalophores) to predict a target’s characteristics and reactivity (AI virtual screening, simulations). Their AI-driven strategy for designing novel therapeutic enzymes combines the Catalophore technology, a corpus of prepared protein structural data (CATALObase), and search algorithms and patterns tailored to specific needs.
✔️ Antiverse is using ML to engineer novel antibodies against difficult targets such as GPCRs and ion channels (AI de novo discovery). Just a month ago, Antiverse and GlobalBio—an antibody engineering company developing methods to engineer improved and more developable therapeutic antibodies—announced they will be extending their collaboration, leveraging Antiverse’s platform alongside GlobalBio’s ALTHEA semisynthetic libraries for the discovery and optimization of antibody-based therapeutics in order to advance immune checkpoint inhibitors in cancer therapy.
✔️ Congruence Therapeutics deploys its platform Revenir to capture the biophysical features of functional proteins and their pathogenic counterparts in order to discover functional allosteric and cryptic pockets that can lead to small-molecule hits at unprecedented speed. On March 6, 2023, Congruence Therapeutics announced the close of an extension to its Series A financing, bringing the total amount raised to over $65M.
Until next time 🧥🧣👢,