AI/ML tools for planning and execution of chemical synthesis

An overview of AI/ML tools 🛠️ and startups 🚀 transforming retrosynthesis, and some random latest news on TechBio

Nov 16, 2023

Given the importance of AI/ML in drug development, in my two previous newsletters (source, source) I highlighted some of the different AI/ML tools and startups for ADMET prediction during drug discovery that generally involves:

screening hits (source, source) for seeking a new drug candidate in a chemical library with high throughput screening and secondary assays,
lead optimization to reduce potential drug side effects (in vitro & in vivo ADME, initial animal efficacy studies),
in silico studies that in combination with cellular functional tests are used to improve the functional properties of the drug candidates, and
design and synthesis of the lead compound.

As we continue our journey through this amazing and emergent AI/ML drug discovery landscape, in this newsletter I will highlight some of the AI/ML tools and startups for planning and execution of chemical synthesis during drug discovery.

➡️ AI in planning of chemical synthesis

From a drug development perspective, chemical synthesis planning begins after we have identified a suitable lead compound during drug screening. At this point—since many of the drug candidates exist only in chemical libraries and do not have a reliable or optimal chemical synthesis pathway—we can perform what is known as retrosynthetic analysis or retrosynthesis, namely tracing the steps in reverse in order to synthesize the lead compound.

Retrosynthesis is considered a cornerstone of organic chemistry, and involves planning a complete synthesis pathway in order to create complex organic molecules found in nature using only simple precursors. For that reason, the lead compound is first reduced into a sequence of progressively simpler structures, with the ultimate goal of identifying a simple or commercially available starting material to work with.
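The recursive idea described above can be sketched in a few lines of Python. This is only a toy illustration: the `DISCONNECTIONS` table, the `PURCHASABLE` catalogue and all molecule names are hypothetical placeholders, not real chemistry or any tool's actual data.

```python
from typing import Dict, List, Optional, Set

# Hypothetical one-step disconnections: product -> alternative precursor sets.
DISCONNECTIONS: Dict[str, List[List[str]]] = {
    "lead": [["intermediate_A", "intermediate_B"]],
    "intermediate_A": [["block_1", "block_2"]],
    "intermediate_B": [["block_3"], ["exotic_reagent"]],
}

# Hypothetical catalogue of commercially available starting materials.
PURCHASABLE: Set[str] = {"block_1", "block_2", "block_3"}


def plan_route(target: str, depth: int = 0, max_depth: int = 5) -> Optional[List[str]]:
    """Return the reaction steps that build `target` from purchasable
    materials, or None if no route is found within `max_depth`."""
    if target in PURCHASABLE:
        return []  # nothing to synthesize: the compound can be bought
    if depth >= max_depth:
        return None
    for precursors in DISCONNECTIONS.get(target, []):
        steps: List[str] = []
        for precursor in precursors:
            sub_route = plan_route(precursor, depth + 1, max_depth)
            if sub_route is None:
                break  # this disconnection contains a dead-end precursor
            steps.extend(sub_route)
        else:
            # Every precursor was solved; add the step that makes the target.
            steps.append(" + ".join(precursors) + " -> " + target)
            return steps
    return None


route = plan_route("lead")
for step in route:
    print(step)
```

The search works backwards from the lead compound, but the returned steps are listed in the forward order in which a chemist would run them.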

But the reality is that retrosynthetic analysis is highly challenging. For example, 60% of all FDA-approved drugs are natural products (NPs) or their derivatives, yet complete biosynthetic pathways for most of them remain unknown. This is where AI and computer-aided tools come in handy. For instance, by using existing knowledge about NPs (cataloged in libraries such as the Dictionary of Natural Products) we can develop chemical synthesis pathways for new drugs using a technique called Monte Carlo tree search (MCTS).

MCTS is an algorithm in the field of AI that figures out the best move out of a set of moves to generate a final solution. Given a set of inputs (in this case, chemical building blocks), MCTS can assess the possible pathways of putting them together to generate the desired compound. Each intermediate state along such a pathway is a node, and the ‘tree’ gradually expands as more nodes are added. As more searches pass through the same nodes, the tree grows in size as well as in knowledge, because it accumulates statistics about which branches succeed, so repeating the search gives better and more probable outcomes.

MCTS is probabilistic and heuristic; that is, in its basic form it is based on statistics alone and does not require any strategic or tactical knowledge about the given domain to make reasonable decisions. This form of search is helpful for sequential decision problems such as retrosynthesis, and it combines classic tree search with the ML principles of reinforcement learning (RL).
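To make the four phases of MCTS (selection, expansion, simulation, backpropagation) concrete, here is a minimal sketch of the textbook UCT variant applied to a toy retrosynthesis problem. The disconnection table, molecule names and the exploration constant `c = 1.4` are assumptions chosen for illustration; this is not the implementation used by any of the tools discussed here.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

State = Tuple[str, ...]  # molecules that still have to be made

# Hypothetical disconnections (placeholders, not real chemistry).
DISCONNECTIONS: Dict[str, List[Tuple[str, ...]]] = {
    "target": [("a", "b"), ("c",)],
    "a": [("buy_1",)],
    "b": [("dead_end",)],
    "c": [("buy_2", "buy_3")],
}
PURCHASABLE = {"buy_1", "buy_2", "buy_3"}


def children(state: State) -> List[State]:
    """Apply every known disconnection to the first open molecule."""
    if not state:
        return []
    head, rest = state[0], state[1:]
    result = []
    for precursors in DISCONNECTIONS.get(head, []):
        still_open = tuple(p for p in precursors if p not in PURCHASABLE)
        result.append(still_open + rest)
    return result


class Node:
    def __init__(self, state: State, parent: Optional["Node"] = None):
        self.state = state
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value = 0.0

    def uct(self, c: float = 1.4) -> float:
        """Upper confidence bound: average reward plus an exploration bonus."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def rollout(state: State, rng: random.Random) -> float:
    """Random playout: 1.0 if everything becomes purchasable, else 0.0."""
    while state:
        options = children(state)
        if not options:
            return 0.0
        state = rng.choice(options)
    return 1.0


def mcts(root_state: State, iterations: int = 200, seed: int = 0) -> Node:
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:  # 1. selection
            node = max(node.children, key=lambda n: n.uct())
        node.children = [Node(s, node) for s in children(node.state)]  # 2. expansion
        if node.children:
            node = node.children[0]
        reward = rollout(node.state, rng)  # 3. simulation
        while node is not None:  # 4. backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return root


root = mcts(("target",))
best = max(root.children, key=lambda n: n.visits)
print("most visited first disconnection leads to:", best.state)
```

In this toy tree the branch through `c` ends in purchasable materials while the branch through `a` and `b` hits a dead end, so the visit counts concentrate on the former as the search is repeated.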

RL is an area of ML concerned with how intelligent agents ought to take actions in an environment to maximize the cumulative reward. Essentially, RL is based on interactions between an AI system and its environment and helps to assess if an algorithm is producing the correct answer. RL is one of the three basic ML paradigms, alongside supervised learning and unsupervised learning.
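As a small worked example of what "cumulative reward" means, the sketch below sums a sequence of rewards with a discount factor, which is a common (but here assumed) way of making earlier rewards count more than later ones. The reward sequences and the value `gamma = 0.9` are hypothetical.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Cumulative reward G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Hypothetical agent that is rewarded only when a route is completed.
short_route = [0.0, 0.0, 1.0]                 # success after 3 steps
long_route = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # success after 6 steps

print(discounted_return(short_route))
print(discounted_return(long_route))
```

With a discount factor below 1, the shorter route earns the higher return, so an agent maximizing this quantity would prefer it.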

Furthermore, RL can be implemented using neural networks, giving the system the ability to assess outcomes using existing knowledge. For example, by supplying known retrosynthesis information, we can train neural networks to accurately evaluate the viability of a chemical synthesis pathway.

To give an example, a method known as 3N-MCTS combines MCTS with three neural networks (an expansion policy, an in-scope filter and a rollout policy) and is reported to be able to generate retrosynthesis pathways and evaluate their feasibility. Artificial neural networks are computing systems inspired by biological neural networks; in their simplest form they consist of three layers: an input layer, a hidden layer and an output layer.

The 3N-MCTS method offers a workflow for computer-assisted retrosynthesis by evaluating the feasibility of a chemical transformation, having been trained using 12.4 million reactions from the existing organic synthesis literature. Additionally, the speed of retrosynthesis prediction per molecule with 3N-MCTS is, on average, 20-fold faster than with the traditional MCTS method.

Let’s see now what the ReTReK multi-step retrosynthesis method is.

In the early days of computer-aided retrosynthesis, computer-assisted synthesis planning (CASP) programs were developed that could take a molecular structure as an input and give a list of detailed reaction schemes as an output. Each reaction scheme also listed purchasable starting materials that could create the target molecule through (supposedly) chemically feasible reaction steps. But these programs failed to gain wide popularity among chemists, since they suffered from infeasible suggestions and from bias (human input was necessary).

More recently, breakthroughs in ML techniques like deep neural networks have significantly improved data-driven synthetic route design without human intervention. These ‘deep’ networks have multiple hidden layers between the input and output layers, which lets them model more complex relationships. Building on this, a group of scientists developed “ReTReK” (which stands for “Retrosynthesis planning application using retrosynthesis knowledge”), a CASP application that integrates various kinds of retrosynthesis knowledge. Essentially, ReTReK is a data-driven framework in which retrosynthetic predictions are made by deep learning (DL) and the path search is performed by MCTS.

DL is an ML technique that teaches computers to do what comes naturally to humans, and is essentially a neural network with three or more layers. While many traditional ML algorithms map an input to an outcome fairly directly, DL algorithms are stacked in a hierarchy of increasing complexity and abstraction. DL comes in different architectures, including deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks, convolutional neural networks and transformers.
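To illustrate what "stacked layers" means, here is a minimal forward pass through a tiny network with an input layer, two hidden layers and an output layer, written in plain Python. The weights are hand-picked, hypothetical values (a real network would learn them from data), and the choice of ReLU and sigmoid activations is an assumption.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """One fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: Vector) -> Vector:
    """Non-linearity applied between layers."""
    return [max(0.0, a) for a in v]


def sigmoid(a: float) -> float:
    """Squash a score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical, hand-picked parameters for illustration only.
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]   # input  -> hidden 1
W2, b2 = [[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.5]   # hidden 1 -> hidden 2
W3, b3 = [[1.0, -1.0]], [0.0]                    # hidden 2 -> output


def forward(x: Vector) -> float:
    h1 = relu(dense(x, W1, b1))
    h2 = relu(dense(h1, W2, b2))
    return sigmoid(dense(h2, W3, b3)[0])


score = forward([2.0, 1.0])
print(round(score, 4))
```

Each layer transforms the output of the previous one, which is where the hierarchy of increasing abstraction comes from.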

To make a long story short, ReTReK incorporates retrosynthesis knowledge into the DL-guided search as adjustable “knowledge concept” parameters. As a result, it was able to search for and provide more favorable synthetic routes: the routes searched with the knowledge concept were preferred to those searched without it (rule-based approach).

In general, the knowledge-based retrosynthesis approach identifies possible biosynthesis routes according to existing reaction databases such as MetaCyc (a curated database of experimentally elucidated metabolic pathways from all domains of life) and KEGG (a collection of manually drawn pathway maps representing our knowledge of molecular interaction, reaction and relation networks). Unfortunately, when it comes to complex NPs, these knowledge-based approaches are often not applicable, since the reactions of their biosynthetic pathways might not be included in those databases.

On the other hand, rule-based retrosynthesis approaches (e.g. RetroPathRL, an MCTS reinforcement learning method guided by chemical similarity) can match the target compound to a collection of reaction rules and make predictions. These rules are either summarized manually by scientists or extracted automatically from reaction databases. Although these rule-based methods have led to promising results, they also have limitations: formulating the rules is complicated and time-consuming, the degree of specificity of the rules can lead to invalid or incomplete proposals, and they can't predict reactions beyond the rule databases.

In contrast to these classical rule-based models, we now have the entirely data-driven toolkit BioNavi-NP, which can predict the biosynthetic pathways for both NPs and NP-like compounds. First, a single-step retrosynthesis prediction model is trained on both general organic and biosynthetic reactions using an end-to-end transformer neural network, a state-of-the-art architecture from natural language processing that solves sequence-to-sequence tasks while handling long-range dependencies with ease.

In particular, this model uses data augmentation (increasing the amount of data by adding slightly modified copies of existing data or synthetic data created from existing data) and ensemble learning (combining multiple learning algorithms to obtain better predictive performance) to achieve a top-10 prediction accuracy of 60.6% on a single-step retrosynthesis test, which is 1.7 times more accurate than the previous rule-based models.
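For readers unfamiliar with the metric, a top-k accuracy counts a prediction as correct when the true precursor set appears anywhere among the model's k highest-ranked suggestions. The sketch below shows the calculation on made-up data; the letters stand in for precursor sets and are not taken from any benchmark.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of test cases whose true answer is among the first k guesses."""
    hits = sum(1 for preds, truth in zip(ranked_predictions, truths)
               if truth in preds[:k])
    return hits / len(truths)


# Hypothetical ranked suggestions for four target molecules.
predictions = [
    ["A", "B", "C"],
    ["D", "E", "F"],
    ["G", "H", "I"],
    ["J", "K", "L"],
]
truths = ["A", "E", "X", "L"]  # "X" is never suggested

for k in (1, 2, 3):
    print(k, top_k_accuracy(predictions, truths, k))
```

Accuracy can only stay the same or rise as k grows, which is why top-10 figures are always at least as high as top-1 figures for the same model.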

Moreover, in February 2023 a group of researchers reported a proof-of-concept that could dramatically speed up the planning of future chemical syntheses. By combining the SYNTHIA Retrosynthesis Software (owned by Merck) with an algorithm they developed to curate all the data, these researchers identified the critical steps in the synthetic pathway of a complex alkaloid found in nature and achieved its synthesis in just three steps, whereas previous syntheses took between seven and 26 steps.

Furthermore, in May 2023 a group of researchers at the Ohio State University published a paper (G²Retro as a two-step graph generative models for retrosynthesis prediction) that documented a new AI framework called G²Retro to automatically generate reactions for any given molecule. This new AI framework was able to cover an enormous range of possible chemical reactions as well as accurately and quickly discern which reactions might work best to create a given drug molecule. G²Retro is a one-step retrosynthesis prediction tool that imitates the reversed logic of synthetic reactions.

“Our generative AI method G2Retro is able to supply multiple different synthesis routes and options, as well as a way to rank different options for each molecule. This is not going to replace current lab-based experiments, but it will offer more and better drug options so experiments can be prioritized and focused much faster.”
By Xia Ning, PhD, associate professor of computer science and engineering at Ohio State University

Other AI tools for retrosynthesis prediction out there are:

📌 Graph2Edits based on a graph generative architecture (Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing),
📌 RetroRanker that improves upon existing models by mitigating frequency bias through re-ranking (RetroRanker: leveraging reaction changes to improve retrosynthesis prediction through re-ranking), and
📌 a single-step retrosynthesis prediction tool using SMILES grammar-based representations in a neural machine translation framework (Retrosynthesis prediction using grammar-based neural machine translation: An information-theoretic approach). A SMILES (simplified molecular-input line-entry system) sequence is a 1D representation of a molecule: a notation system that represents chemical structures in plain text, using a sequence of characters to describe the arrangement of atoms, bonds and rings within a molecule. In addition to encoding molecules as SMILES strings, it is possible to naturally represent molecules as 2D graphs (where atoms are represented by nodes and bonds by edges) or as molecular fingerprints (which convert a molecular structure into a bit string).
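Because translation-style models read a molecule as a sequence of tokens, a SMILES string is usually split into atoms, bonds, ring-closure digits and brackets before being fed to the model. The sketch below uses a deliberately simplified regular expression (an assumption made for illustration; production tokenizers cover more of the SMILES grammar) and does not validate the chemistry.

```python
import re
from typing import List

# Simplified token pattern: bracket atoms, two-letter halogens, two-digit
# ring closures, single letters, digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-/\\.:@*]")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, failing on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognized characters in SMILES: " + smiles)
    return tokens


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenize(aspirin))
print(tokenize("[Na+].[Cl-]"))
```

Note that `Cl` and bracketed atoms such as `[Na+]` are kept as single tokens, so the model sees one symbol per atom rather than one per character.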

For more about AI/ML retrosynthesis tools, see here and here.

Now, a few words about automated synthesis.

➡️ AI in automation of chemical synthesis

Fully automated chemical synthesis using AI robots, instead of humans, is a future megatrend of Industry 4.0. To give an example, IBM’s RoboRXN is a cloud-based AI-driven lab that efficiently automates the majority of the initial groundwork in materials synthesis. Behind RXN is a state-of-the-art neural machine translation method that predicts the most likely outcome of a chemical reaction.

In particular, the architecture translates the language of chemistry, converting reactants and reagents to products using the SMILES representation. Interestingly, the RXN robot has also integrated a retrosynthetic architecture that uses the Molecular Transformer as its prediction model, an ML model inspired by language translation that accurately predicts the outcomes of organic reactions and estimates the confidence of its own predictions. For more, see How artificial intelligence and robotics are changing chemical research.
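The "translation" framing can be made concrete with reaction SMILES, which write a reaction as `reactants>agents>products` with molecules separated by dots. The sketch below splits such a string and builds the source and target sequences for the two directions; the function names and the esterification example are illustrative assumptions, not part of the RXN API.

```python
from typing import List, NamedTuple, Tuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of molecule SMILES.
    Caveat: splitting on '.' also separates the ions of a salt."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return Reaction(reactants, agents, products)


def translation_pair(rxn: str, direction: str = "forward") -> Tuple[str, str]:
    """Return (source, target) text for a sequence-to-sequence model."""
    r = parse_reaction_smiles(rxn)
    if direction == "forward":   # reactants and reagents -> products
        return ".".join(r.reactants + r.agents), ".".join(r.products)
    if direction == "retro":     # product -> reactants
        return ".".join(r.products), ".".join(r.reactants)
    raise ValueError("direction must be 'forward' or 'retro'")


# Acetic acid + ethanol, sulfuric acid as agent, giving ethyl acetate.
esterification = "CC(=O)O.CCO>OS(=O)(=O)O>CC(=O)OCC"
print(translation_pair(esterification, "forward"))
print(translation_pair(esterification, "retro"))
```

Reaction prediction and retrosynthesis are then the same kind of task with source and target swapped.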

Industry 4.0, the industrial change blurring the lines between the physical, digital and biological worlds, has only just begun. So although the necessary hardware units for automated synthesis are commercially available, cost, standardization and efficiency issues can still make scaling up difficult.

➡️ Databases for Retrosynthesis ⚗

The most popular databases for retrosynthesis are:

the United States Patent and Trademark Office (USPTO) collection, the predominant open-source reaction database in the field of ML, containing 1,939,253 reactions that were extracted by text-mining from U.S. patents published between 1976 and 2016:
- USPTO50K (a high-quality dataset containing about 50,000 reactions with accurate atom mappings between products and reactants),
- USPTO-MIT (about 400,000 reactions in the training set, 30,000 in the validation set and 40,000 in the test set), and
- USPTOFULL (a much larger dataset consisting of about 1,000,000 reactions),
ChEMU, a manually annotated database of organic reaction texts from 1,500 patents,
Reaxys, a major commercial subscription reaction database operated by Elsevier,
the Chemical Abstracts Service (CAS), which, similar to Reaxys, is a huge database of both organic and inorganic reactions, covering detailed information on over 145 million single-step and multi-step reactions since 1840,
Pistachio, another subscription database that, similar to the USPTO database, is based on a text-mining approach that extracts reaction information from patents,
eMolecules, which consists of 231 million available molecules and usually serves as the building block library for researchers, and
ZINC, also a reliable database of available materials, which contains about 1.3 billion purchasable molecules.

Source: Recent advances in artificial intelligence for retrosynthesis.

➡️ AI/ML chemical synthesis startups

👉 Chemify, spun out of 15 years of research at the University of Glasgow, UK, is a new company that aims to digitize chemistry and produce solutions to run chemical code for drug discovery, chemical synthesis and materials discovery.

They have an extendable chemical execution architecture for chemical synthesis that can automatically read the literature, leading to a universal autonomous workflow. The robotic synthesis code can be corrected in natural language without any programming knowledge, and is hardware independent. This chemical code can then be combined with a graph describing the hardware modules and compiled into platform-specific, low-level robotic instructions for execution (A universal system for digitization and automatic execution of the chemical synthesis literature).
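The idea of combining hardware-independent chemical code with a hardware graph can be illustrated with a toy "compiler". Everything in this sketch is hypothetical: the step format, the module names and the `MOVE`/`STIR` instruction strings are invented for illustration and are not Chemify's actual language or instruction set.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical hardware graph: which modules are physically connected.
HARDWARE: Dict[str, List[str]] = {
    "flask_water": ["pump"],
    "flask_reagent": ["pump"],
    "pump": ["flask_water", "flask_reagent", "reactor"],
    "reactor": ["pump"],
}


def find_path(graph: Dict[str, List[str]], start: str, goal: str) -> List[str]:
    """Breadth-first search for the shortest connection between two modules."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError("no connection from " + start + " to " + goal)


def compile_step(step: Tuple, graph: Dict[str, List[str]]) -> List[str]:
    """Turn one abstract step into low-level instructions for this hardware."""
    if step[0] == "add":
        _, source, dest, volume_ml = step
        path = find_path(graph, source, dest)
        return ["MOVE {} mL {} -> {}".format(volume_ml, a, b)
                for a, b in zip(path, path[1:])]
    if step[0] == "stir":
        _, vessel, minutes = step
        return ["STIR {} {} min".format(vessel, minutes)]
    raise ValueError("unknown step: " + str(step[0]))


def compile_procedure(procedure: List[Tuple], graph: Dict[str, List[str]]) -> List[str]:
    instructions: List[str] = []
    for step in procedure:
        instructions.extend(compile_step(step, graph))
    return instructions


# Hardware-independent procedure: it names vessels, not pumps or valves.
procedure = [("add", "flask_reagent", "reactor", 10), ("stir", "reactor", 30)]
for line in compile_procedure(procedure, HARDWARE):
    print(line)
```

The same procedure could be compiled against a different graph (for example one with an extra valve between pump and reactor) without changing the chemical steps, which is the sense in which such code is hardware independent.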

On August 2, 2023, Chemify announced funding of $43M including a Series A led by Triatomic Capital, joined by new investors including Hong Kong-based Horizon Ventures, US-based Rocketship Ventures, Possible Ventures, Alix Ventures, Scotland-based Eos, and the UK Government Innovation Accelerators program. Existing investor BlueYard Capital also participated in the round. And on September 13, 2023, Dewpoint Therapeutics announced a collaboration with Chemify to expedite the discovery of molecules that target cancer and neurodegeneration.

👉 Molecule.one utilizes AI to predict chemical reactions with unprecedented accuracy. They offer RetroM1—an AI-powered retrosynthesis pathway planner—and M1RetroSAS—a tool to screen tens of thousands of compounds for synthetic accessibility. On August 11, 2023, CAS—a division of the American Chemical Society specializing in scientific information solutions—and Molecule.one announced a strategic collaboration focused on the joint development of computer-aided synthesis design technologies to accelerate scientific breakthroughs in early-stage drug discovery and aid chemists in the discovery of novel small molecules.

👉 Chemical.ai in Shanghai, China, offers the ChemFamily products, including ChemAIRS (an AI-empowered retrosynthesis platform), ChemAIOS (an intelligent laboratory management system), ChemAIoT (a human-machine collaboration platform) and ChemAILab (an automated chemical synthesis service), based on its proprietary retrosynthesis algorithm as a standard closed data loop to enhance chemical synthesis efficiency. Chemical.ai is already working with Pfizer to implement its ChemAIRS AI-aided synthetic route-design system and evaluate its potential applications for drug discovery and development, and has announced strategic collaborations with:

Livzon Medicine, a leading pharmaceutical group integrating R&D, production and sales in China,
XuanZhu BioPharm in China, which owns a number of biology companies covering drug discovery, preclinical research, clinical development and sales,
Chemaxon headquartered in Budapest, a leading chemical and biological software development company that provides solutions for biotech and pharma,
Scilligence in the US, an innovation leader in web-based cheminformatics and bioinformatics solutions for life sciences R&D,
CarbonSilicon AI in China, to provide customers with drug design and retrosynthesis software products, and
many more.

Chemical.ai announced the completion of a Series B round of nearly $14M (100 million RMB) in 2022.

➡️ News 🗞️ on TechBio

💡 Octant in California is a therapeutics company whose platform has four components (high throughput synthetic biology, high throughput chemistry, a multiplexed assay platform and computation), enabling experiments at the massive scale of computational simulations but with gold-standard real-life experiments in human cells, to investigate drugs, proteins and signaling pathways at unprecedented scales. Octant was co-founded by Sri Kosuri, a former UCLA professor of chemistry and biochemistry, and Ramsey Homsany, a veteran of Google and Dropbox. They are building the Octant Navigator platform to advance drug discovery by driving high throughput chemistries through engineered cellular systems, to scale up multiplexing capabilities for assay optimization, broad target screening and deep mutational scanning. In April 2022, Octant secured $80M in a Series B round ($115M in total) that included Bristol Myers Squibb (BMS).

💡 Data Intuitive, founded in 2011, is a Belgium-based bioinformatics company that develops intuitive tools and custom biotech data pipelines to help clients gain valuable insights and make data-driven decisions. They offer:

Viash, that simplifies the creation of data pipelines in biotech R&D by allowing researchers to turn their scripts into modular, script language-independent building blocks that can be easily integrated into any data pipeline, and
ComPass, a powerful application for querying and analyzing Connectivity Map data (also known as L1000 data), which could be extended to other types of data as well.

For example, Viash is being used in OpenProblems to streamline the development, execution and sharing of methods, metrics and dataset loaders. OpenProblems is an open-source platform for benchmarking and formalizing computational tasks in single-cell analysis, by defining mathematical interpretations of tasks, providing publicly available gold-standard datasets in standardized formats, defining quantitative metrics to evaluate success, and ranking state-of-the-art methods in continuously updated leaderboards. OpenProblems is hosted on GitHub and evaluated using AWS with support from the Chan Zuckerberg Initiative. And by using Viash, OpenProblems can provide a flexible and scalable platform for benchmarking single-cell analysis methods.

💡 Thibault Geoui, leading an innovation team to reimagine life sciences and drug R&D with (FAIR) data and AI technology at Charles River Laboratories, has a weekly newsletter on LinkedIn, the GPThibault Pulse, with insider tips and news on Generative AI and Life Sciences.

For more, see AI/ML chemical synthesis startups.

Until next time 🧣,

MetaphysicalCells