Weekly TechBio News: AI/ML tools and startups for rational drug design

Transforming drug design with AI and computer modeling techniques 🧑‍💻🧑‍🏫

Dec 12, 2024

“To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment.”
By Ralph Waldo Emerson

Rational Drug Design

Traditional drug discovery, also known as forward pharmacology, relies on high-throughput screening of chemical libraries to match a certain cellular modulation to a specific drug treatment. The opposite approach is rational drug design, also called reverse pharmacology or simply drug design, which is based on the hypothesis that a “designed” molecule can induce a specific modulation of a biological target. In other words, drug design 🎨🖌️ is the process of finding drug candidates during drug discovery based on knowledge of a biological target. It involves designing molecules that complement the shape and charge of the target (most commonly a protein or a nucleic acid) so that they interact with, bind to and modulate it in a way that produces therapeutic value. From an experimental point of view, traditional drug discovery and rational drug design are equally important, and the pharmaceutical industry depends on the combined efforts of both.

In rational drug design, selecting a biomolecule as a target requires two kinds of knowledge: ➡️ knowledge that modulating the selected target will be disease-modifying, and ➡️ knowledge that the target is druggable.

Since drug design frequently relies on computer modeling techniques (that is, the representation of the 3D structure of chemical and biological molecules), it is also referred to as computer-aided drug design. Moreover, drug design based on knowledge of the 3D structure of a biological target is known as structure-based drug design (SBDD).

Usually, SBDD 👓 includes:

👉 Structure determination of the target protein, cavity identification, ligand database construction, ligand docking and lead discovery. Software used for SBDD includes SWISS-MODEL, MODELLER, Phyre and Phyre2, CASTp, Active site prediction tool, AutoDock Vina and Schrödinger.
- Maestro is Schrödinger’s streamlined portal for access to state-of-the-art predictive computational modeling and ML workflows for molecular discovery.
- A recently presented resource for SBDD is MISATO, a machine learning dataset of protein–ligand complexes for structure-based drug discovery.
- Another recently introduced method is SurfDock, a surface-informed diffusion generative model for reliable and accurate protein–ligand complex prediction.
👉 Another thing to consider during SBDD is computational virtual screening of large chemical libraries. Docking-based virtual screening depends on identifying the location of the binding site of the drug candidate within the protein target, which has traditionally been done using different methods such as:
- Template-based methods (firestar, 3DLigandSite and Libra),
- Geometry-based methods (CurPocket, Surfnet and SiteMap),
- Energy-based methods (FTMap and Q-SiteFinder),
- As well as ML methods (DeepSite, which is freely available at www.playmolecule.org; Kalasanty; and DeepCSeqSite).
- For more: In Silico Methods for Identification of Potential Active Sites of Therapeutic Targets: MetaPocket 2.0, COACH, PockDrug, FTMap, SiteMap.
👉 Finally, in addition to locating the binding site of the drug candidate within the protein target, you have to evaluate its potential druggability. This can be done, for example, with DoGSiteScorer, a web server that predicts potential pockets and gives a druggability estimation.

Another example is Datamol, an open-source, RDKit-powered Python library for performing computational tasks on molecules in chemical space. It represents molecules as mathematical objects (the Morgan fingerprints) so that ML techniques can be applied to them. Once the fingerprints are generated, the molecules can be clustered using the Butina algorithm, with the Tanimoto similarity index used for the distance computations. Ultimately, this lets us quickly learn about the effect of structural changes on activity while efficiently exploring a large chemical space in silico, searching for the right drug candidate to target a protein.
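
To make the clustering step concrete, here is a minimal pure-Python sketch of Tanimoto similarity and Butina-style clustering. It is not Datamol's or RDKit's API: the fingerprints are toy sets of on-bit indices standing in for Morgan fingerprints, and the names `tanimoto` and `butina_cluster`, the cutoff value and the tie-breaking rule are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Toy stand-in for a Morgan fingerprint: the set of bit positions that are "on".
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    shared = len(a & b)
    total = len(a) + len(b) - shared
    # Convention chosen for this sketch: two empty fingerprints score 0.
    return shared / total if total else 0.0


def butina_cluster(
    fps: Sequence[Fingerprint], cutoff: float = 0.35
) -> List[Tuple[int, ...]]:
    """Butina-style clustering; each tuple lists member indices, centroid first."""
    n = len(fps)
    # Step 1: build neighbor lists, where distance = 1 - Tanimoto similarity.
    neighbors: List[List[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fps[i], fps[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    # Step 2: visit molecules from most to fewest neighbors
    # (ties broken by lower index, an assumption of this sketch).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    # Step 3: each still-unassigned molecule becomes a centroid and
    # claims its still-unassigned neighbors.
    assigned = [False] * n
    clusters: List[Tuple[int, ...]] = []
    for centroid in order:
        if assigned[centroid]:
            continue
        members = [centroid] + [m for m in neighbors[centroid] if not assigned[m]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters


if __name__ == "__main__":
    # Hypothetical fingerprints: three similar molecules and three outliers.
    toy_fps = [
        frozenset({1, 2, 3, 4}),
        frozenset({1, 2, 3, 5}),
        frozenset({1, 2, 3, 4, 5}),
        frozenset({10, 11, 12}),
        frozenset({10, 11, 13}),
        frozenset({20, 21}),
    ]
    print(butina_cluster(toy_fps, cutoff=0.35))
```

In practice you would generate real Morgan fingerprints with RDKit or Datamol and rely on their clustering utilities rather than this quadratic loop. The sketch only shows what the algorithm does with the pairwise similarities.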

Datamol is part of Valence Labs. In May 2023, Valence Discovery joined forces with Recursion Pharmaceuticals to create Valence Labs, an AI research engine within Recursion dedicated to industrializing scientific discovery. So far, they have made exciting progress with LOWE, an LLM-Orchestrated Workflow Engine for executing complex drug discovery and drug design workflows using natural language.

LOWE is an LLM agent that represents the next evolution of the Recursion OS. It can navigate and assess relationships within Recursion’s proprietary PhenoMap data, use MatchMaker to identify drug-target interactions and deploy DL-based generative chemistry methods. This integration enables LOWE to perform critical, multi-step tasks in drug discovery, such as identifying new therapeutic targets, designing novel compounds and libraries, and predicting ADMET properties. LOWE allows scientists to easily query Recursion’s massive proprietary biological and chemical datasets (60+ petabytes) to identify new therapeutic targets of interest, immediately test them and develop new drug candidates. Other foundation models trained on public data and transferred to the Recursion OS are: Phenom, which translates all of their phenomics data into a digital representation; MolPhenix, which takes phenomics data and integrates it with chemistry data; and MolGPS, a property prediction foundation model.
On March 18, 2024, Valence Labs introduced MolGPS: A Foundational GNN for Molecular Property Prediction. MolGPS was trained on the LargeMix dataset, which consists of 5 million molecules grouped into 5 different tasks, each with multiple labels. LargeMix contains datasets such as L1000_VCAP and L1000_MCF7 (transcriptomics), PCBA_1328 (bioassays), and PCQM4M_G25 and PCQM4M_N4 (DFT simulations). They also added a classification dataset built from a subset of Recursion's phenomics dataset: a pre-trained masked auto-encoder was used to cluster the phenomics images into 6,000 different classes, which are then used for binary classification. MolGPS was first pre-trained using a common multi-task supervised learning strategy and was then fine-tuned on various molecular property prediction tasks to evaluate performance. Finally, they benchmarked MolGPS on the Therapeutics Data Commons (TDC), MoleculeNet and Polaris benchmarks.
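
For readers unfamiliar with multi-task supervised learning, the sketch below shows one common way to combine losses from several tasks when not every molecule has a label for every task. It is not the MolGPS training code: the task names, the equal weighting of tasks and the use of `None` for missing labels are assumptions made purely for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple


def bce(pred: float, label: float) -> float:
    """Binary cross-entropy for one predicted probability."""
    eps = 1e-7
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def mse(pred: float, label: float) -> float:
    """Squared error for one regression prediction."""
    return (pred - label) ** 2


def multitask_loss(
    preds: Dict[str, List[float]],
    labels: Dict[str, List[Optional[float]]],
    task_kinds: Dict[str, str],
) -> Tuple[float, Dict[str, float]]:
    """Average each task's loss over its labeled molecules, then average tasks."""
    per_task: Dict[str, float] = {}
    for task, kind in task_kinds.items():
        loss_fn = bce if kind == "classification" else mse
        # Skip molecules that have no label for this task.
        losses = [
            loss_fn(p, y)
            for p, y in zip(preds[task], labels[task])
            if y is not None
        ]
        if losses:
            per_task[task] = sum(losses) / len(losses)
    # Equal task weighting is an assumption of this sketch.
    total = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return total, per_task


if __name__ == "__main__":
    # Hypothetical tasks: one bioassay classification, one DFT regression.
    kinds = {"bioassay": "classification", "dft": "regression"}
    preds = {"bioassay": [0.5, 0.9], "dft": [1.0, 2.0]}
    labels = {"bioassay": [1.0, None], "dft": [1.5, 2.0]}
    print(multitask_loss(preds, labels, kinds))
```

The key point is the masking: a molecule measured in a bioassay but absent from the DFT data still contributes to the shared model through the task it does have labels for.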
For more about the pipeline of Recursion Pharmaceuticals Inc (NASDAQ: RXRX), which just released the publicly available model OpenPhenom-S/16 in Google Cloud’s Model Garden (setting a new industry standard for microscopy data analysis and outperforming CellProfiler) and just announced the first patient dosed in a Phase 1/2 clinical study of REC-1245, see:
AI-powered drug discovery: update (VIII), by Marina T Alamanou, August 29, 2024
Finally, on November 13, 2024, Recursion introduced MolE: A New Model for Predicting Molecular Properties for AI Drug Design and Beyond.

Beyond publishing primary research, Valence Labs also strives to lead within the ML community through the open-science and open-source efforts Valence has become known for. These include:

- Portal, the home of the TechBio community, where you’ll find their reading groups, community blogs and more.
- Open-source software, such as Datamol, Molfeat, Graphium (an open-source library for training molecular GNNs at scale), Medchem and more.

According to the Valence Portal community, Graphium stands out because it is a deep learning library designed for graph representation learning on real-world chemistry tasks. It has rich and expressive built-in molecular featurizers and provides access to SOTA GNN architectures via an extensible API. With Graphium, you can implement the best and most recent GNN models via a configuration file, with the degree of flexibility necessary for research.

Let’s move on to something different now: ▶️ template-based protein design.

Raygun is a template-guided protein design framework that unlocks efficient miniaturization, modification and augmentation of existing proteins (Miniaturizing, Modifying, and Augmenting Nature’s Proteins with Raygun). Using a novel probabilistic encoding of protein sequences constructed from language model embeddings, Raygun is able to generate diverse candidates with deletions, insertions and substitutions while maintaining core structural elements. Raygun can shrink proteins by 10-25% (sometimes over 50%) while preserving predicted structural integrity and fidelity, introduce extensive sequence diversity while preserving functional sites, and even expand proteins beyond their natural size. Raygun’s conceptual innovations in template-based protein design open new avenues for protein engineering, potentially catalyzing the development of more efficient molecular tools and therapeutics.

To make a long story short, the increasing sophistication of computational technologies has led to rapid improvements in AI tools, so the process of rational drug design is becoming more and more reliant on computer-aided drug design using AI modeling techniques. These modeling techniques refer to:
