Biomedical data mining: AI/ML tools and startups
AI transforming dissemination of biomedical science
In computer science, “Data Mining” is also referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data Mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data stored in databases. —GeeksforGeeks
When it comes to producing biomedical knowledge (research, research papers, omics, databases and lead generators for pharma), the best way to describe the relationship between academia and pharma is a tango: a dance that requires two partners moving in relation to each other, sometimes in tandem and sometimes in opposition. That being the case, the academia-pharma tango—like real tango 💃, itself a combination of Waltz, Polka, Mazurka, Schottische, Habanera, Candombe and Milonga—is a combination of university researchers, professors, nerds, medical doctors with PhDs, corporate scientists and corporate white-collar executives.
In this academia-pharma tango, everything starts as early-stage research (drug-biomarker discovery) in a university lab, a university spinoff or a small biotechnology company, sponsored by the government, by pharma or by both. After that, through a complicated and elaborate process, tons of data are produced and kept hidden behind a firewall (positive/negative and/or hidden results), while polished preliminary results are seen by only a few at conferences (abstracts, posters and PowerPoint presentations).
At the end of this process—on average after 2-5 years—the flattering, exclusively positive results of the early-stage research are published as papers and presented to the public after going through the peer review process (Biomedical data mining, Biomedical Data and Artificial Intelligence). Once published, these papers are usually treated as pharma lead generators for choosing future drug candidates for further drug-biomarker development.
Now imagine a hypothetical scenario in which two researchers, one from pharma and one from academia—let's call them Rose and Jack—decide to work together (in tango) on the same hypothetical drug candidate, called Titanic. (For many, the ship 🚢 Titanic—emblematic of wealth and privilege, yet doomed to sink—is a metaphor for the crumbling of complacent power structures, something we have all seen happening when VIP drug development projects fail miserably.)
Early- to mid-stage research for project Titanic—after target and lead identification and validation (screening, virtual screening, drug design, ADMET, retrosynthesis, chemical synthesis)—starts by testing Titanic on cell lines (in vitro studies), which marks the beginning of the preclinical studies. Assuming Rose and Jack are working in a state-of-the-art laboratory, they will most likely have their first results after six months: some negative and some positive.
Consequently—after obtaining their first positive and negative results, ideally with a P-value of less than 0.01—Rose and Jack will decide to present them at conferences, congresses, seminars and meetings, and to that end they prepare posters, abstracts, videos, notes, graphs, data sets and PowerPoint presentations.
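To make the statistics concrete: here is a minimal sketch of the kind of significance test behind such results, using SciPy's two-sample t-test on hypothetical cell-viability readouts (all numbers are invented for illustration):

```python
# Hypothetical in vitro readouts: viability of control vs. treated cells.
# A two-sample t-test asks whether the observed difference could plausibly
# be due to chance alone.
from scipy import stats

control = [0.92, 0.88, 0.95, 0.91, 0.90, 0.93]  # untreated cell viability
treated = [0.55, 0.60, 0.52, 0.58, 0.57, 0.54]  # viability after "Titanic"

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")

if p_value < 0.01:  # the threshold Rose and Jack are aiming for
    print("Difference is significant at the 0.01 level")
```

In this toy example the P-value falls well below 0.01; real analyses would also check the test's assumptions (normality, equal variances) and correct for multiple comparisons.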
Interestingly, when these results are presented for the first time—in the form of posters, abstracts and PowerPoint presentations—that is the closest we have to REAL-TIME results (6 months after the beginning of the studies; let's call this t1 = t0 + 6 months). Unfortunately, until very recently, these preliminary data couldn't be found online.
Moreover, Rose's and Jack's results at t1 = t0 + 6 months reflect the technological and scientific state of the art at that precise time. If Rose and Jack have to wait another 3 years to complete all their preclinical studies and publish all their results, then by the time their paper is ready to be published (t2 = t0 + 3 years) their results will probably be “old”, no longer reflecting the technological and scientific advances at t2, given that other couples around the globe have been dancing tango as well, pushing the field forward.
Let's now go back to t1 = t0 + 6 months.
After Rose and Jack have finally presented their first positive and negative results at a conference, their next step is to continue by testing Titanic on mouse models (in vivo studies); the combined in vitro and in vivo experiments constitute the preclinical phase of Titanic's discovery. In this hypothetical scenario I am not going to consider the development of a biomarker for Titanic—a biomarker is something like a ship radar 📡—since it would only complicate things and add more pressure to the development timeline.
And here comes the best part: if they don't test Titanic in vivo and decide to publish only their initial in vitro studies, their findings will most likely not be accepted by a high-impact journal—where blockbuster research is featured—and sometimes these results might simply remain positive and negative results in an abstract, getting lost in a huge fire- and humidity-resistant archive or on a forgotten hard disk.
However, even if they do decide to test Titanic in vivo and wait a minimum of 2.5 years—if everything goes well—to complete all the preclinical studies, they will then have to prepare themselves for a publishing journey of:
flagrant conflicts of interest and power bias,
fashionable trends of dubious importance,
journal shopping 🛍️, a process where scientists submit first to the most prestigious journals in their field and then work down the hierarchy of impact factors,
reviewers who sometimes turn out to be fake, overworked, under-prepared, inconsistent and rarely paid,
ghostwriters 👻,
citation manipulations,
agencies that “handle the peer review process” for authors, and
publication bias—a process where negative results go unpublished—along with bad science, small sample sizes, tiny effects and invalid exploratory analyses.
This creates an endless cycle of submission, rejection, review, re-review and re-re-review that can eat up months, even years, of Rose and Jack's lives—interfering with their research and slowing down the dissemination of biomedical scientific knowledge—while Titanic is literally sinking, and only Rose from pharma will survive in the end, because Jack from academia is just poor. Basically, what started as a beautiful courtship dance between Rose and Jack might end up as a disaster without life jackets 🛟 and lifeboats 🚣♂️.
Fortunately, we now have AI/ML solutions to tame the data iceberg threatening Titanic: solutions that promote the aggregation and synthesis of biomedical information, generate data and models, analyze real-world evidence throughout the entire drug-biomarker development process and make the peer review process more navigable, inaugurating a new era for biomedical research. A new era of dirty dancing 🕺 between Baby from bio and Johnny from tech.
“Nobody puts Baby in a corner.”—Johnny Castle
What is happening right now between those two is going to change forever the way we do research: by rewriting our SOPs (standard operating procedures) in the lab, by adding a brand-new chapter to the “Materials and Methods” section, by creating “before AI data” and “after AI data” PubMed sections, by giving a new meaning to statistical analysis while doing experiments, and eventually by tearing down the wall of scientific separatism.
“Look, spaghetti arms. This is my dance space. This is your dance space. I don't go into yours, you don't go into mine. You gotta hold the frame.” —Johnny Castle
AI-Based Literature Review Tools
Artificial Intelligence Review Assistant (AIRA) is a platform that supports editors, reviewers and authors in evaluating the quality of manuscripts, helping to meet the global demand for high-quality, objective peer review in publishing.
StatReviewer is an automated reviewer that checks scientific manuscripts for statistical errors and reporting integrity.
Semantic Scholar is a free AI-powered research tool for scientific literature, covering topics such as computer science, geoscience and neuroscience.
Elicit uses language models to help you automate research workflows, like summarizing papers, extracting data and synthesizing your findings.
Consensus is a search engine that uses AI to find insights in research papers, and it now uses data annotations from Centaur Labs, a medical data labeling company covering 📌Text (unstructured clinical notes, scientific text, chatbots and more), 📌Audio (heart auscultation, lung auscultation, artery auscultation and more), 📌Images (ultrasound, external images, X-ray and more), 📌Video (surgery, clinical sessions and more) and 📌Waveforms (EEG, ECG and more). Centaur Labs has raised a total of $15.9M.
Scite is an award-winning platform for discovering and evaluating scientific articles via Smart Citations that allow users to see how a publication has been cited, by providing the context of the citation and a classification describing whether it provides supporting or contrasting evidence for the cited claim.
Scite allows researchers to assess the reliability of references in any particular context, helping to evaluate the quality and impact of research. It also provides better visualizations and metrics for understanding the citation landscape of a particular paper or topic.
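As a toy illustration of the supporting/contrasting idea behind Smart Citations (Scite's actual system uses trained ML models; the keyword rule and cue lists below are purely hypothetical):

```python
# Toy stance tagger for citation contexts: classify a citing sentence as
# supporting, contrasting, or merely mentioning the cited claim.
SUPPORT_CUES = {"confirms", "consistent with", "replicates", "in agreement with"}
CONTRAST_CUES = {"contradicts", "in contrast to", "fails to replicate", "disputes"}

def classify_citation(context: str) -> str:
    """Return a stance label for one citing sentence."""
    text = context.lower()
    if any(cue in text for cue in CONTRAST_CUES):
        return "contrasting"
    if any(cue in text for cue in SUPPORT_CUES):
        return "supporting"
    return "mentioning"

print(classify_citation("Our data confirms the findings of Smith et al."))
```

A real classifier would be trained on labeled citation contexts rather than relying on a fixed cue list, but the input/output shape is the same.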
SciSpace uses AI to simplify the publication of research: submitting, evaluating and publishing manuscripts.
Big Data & Analytics Tools
Of course, data mining and knowledge extraction from biomedical data
➡️ is not only about Peer-Reviewed Literature and the relevant:
💡 Omics
proteomics, transcriptomics, genomics, metabolomics, lipidomics and epigenomics—which correspond to global analyses of proteins, RNA, genes, metabolites, lipids and methylated DNA or modified histone proteins in chromosomes, respectively—phenomics—an emerging transdiscipline defined as the study of variations in an organism's phenotype over its life span—and metagenomics—the study of the structure and function of entire nucleotide sequences isolated and analyzed from all the organisms (typically microbes) in a bulk sample.
For example, a branch of transcriptomics is concerned with the sequencing and analysis of the transcriptome (mRNA, rRNA, tRNA and other non-coding RNA). In 2021, LatchBio emerged from stealth mode providing almost code-free biocomputing solutions on the cloud that can be accessed from anywhere via a browser to simplify biological data analysis. Using their platform, researchers can upload files and access dozens of bioinformatics pipelines and data visualisation tools from analysing RNA sequencing data to designing CRISPR edits and even running the AlphaFold software just from their laptop. LatchBio has raised a total of $33.2M. For more follow them on Substack: LatchBio
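As a concrete taste of what such RNA-seq pipelines do first, here is a minimal sketch of counts-per-million (CPM) normalization, with hypothetical gene names and read counts:

```python
# Hypothetical raw read counts per gene from one RNA-seq sample.
# CPM rescales counts by library size so that samples sequenced at
# different depths become comparable.
raw_counts = {"GENE_A": 1500, "GENE_B": 300, "GENE_C": 8200}

total_reads = sum(raw_counts.values())        # library size
cpm = {gene: count / total_reads * 1_000_000  # counts per million
       for gene, count in raw_counts.items()}

for gene, value in cpm.items():
    print(f"{gene}: {value:,.0f} CPM")
```

Production pipelines (the kind hosted on platforms like LatchBio's) typically go further, with gene-length-aware measures (TPM/FPKM) and model-based normalization for differential expression.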
Another example comes from a branch of genomics concerned with the sequencing and analysis of the genome of an individual. Genialis is developing next-generation patient classifiers using ML and high-throughput omics data. They offer Genialis ResponderID, a biomarker discovery platform, and the Genialis Expressions software, which enables ML-driven biomarker discovery by aggregating consistently analyzed and annotated data. The Genialis Expressions software is built on FAIR (findability, accessibility, interoperability and reusability) data management principles, in order to analyze sequencing data across numerous NGS platforms. Genialis has raised a total of $15.5M.
Aimed Analytics in Germany provides big data, analytics and ML solutions for analyzing medical data, drawing on multiple data sources such as transcriptomics, epigenomics, proteomics and multi-omics data to provide research analytics for drug development. For example, for proteomics analysis focusing on flow cytometry, mass cytometry, imaging mass cytometry and mass spectrometry, they offer the following modules: dimensionality reduction, clustering, marker molecule identification, cell-type annotation, perturbation analysis, differential expression analysis, transcriptional regulator prediction, integration of clinical data, patient classification and many more.
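Two of the modules listed above, dimensionality reduction and clustering, can be sketched on synthetic data with scikit-learn (a toy illustration under invented data, not Aimed Analytics' actual pipeline):

```python
# Toy omics-style workflow: reduce high-dimensional "expression" profiles
# with PCA, then cluster the cells into groups.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two synthetic cell populations, 50 cells each, 100 "markers" per cell,
# separated by a large mean shift so the structure is easy to recover.
pop_a = rng.normal(loc=0.0, scale=1.0, size=(50, 100))
pop_b = rng.normal(loc=3.0, scale=1.0, size=(50, 100))
cells = np.vstack([pop_a, pop_b])

reduced = PCA(n_components=2).fit_transform(cells)      # dimensionality reduction
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(reduced)    # clustering

print("cluster sizes:", np.bincount(labels))
```

Real single-cell analyses add the other modules on top: marker identification per cluster, cell-type annotation against reference atlases, and differential expression between conditions.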
After omics, another important dataset comes from imaging.
💡Imaging data
bioimaging—refers to technologies for viewing (with microscopy) biological substances that have been fixed/prepared for monitoring. For example, Euro-BioImaging is an EU-funded project hosted by EMBL that offers open access to imaging technologies, training and data services in biological and biomedical imaging. Euro-BioImaging consists of imaging facilities, called Nodes, that have opened their doors to all life science researchers. And the BioImage Model Zoo—created by the AI4LIFE consortium—is where researchers can share their trained AI models and tools for life science imaging data,
histo-pathology—refers to the examination of a biopsy or surgical specimen by a pathologist—digital pathology—which includes the acquisition, management, sharing and interpretation of pathology information (including slides)—and
AI medical imaging for cardiovascular imaging, breast imaging, lung imaging etc.
For example, Owkin develops ML to connect medical researchers with high-quality datasets from leading academic research centres around the world and applies AI to research cohorts and scientific questions. By implementing a causal approach to AI, Owkin is able to discover new treatments while simultaneously identifying the subgroups of patients who would benefit most from them. On January 19, 2023, Nature Medicine published breakthrough Owkin research on the first ever use of federated learning to train DL models on multiple hospitals' histopathology data. On June 14, 2023, Owkin successfully validated its MSIntuit™ CRC AI solution for colorectal cancer screening, a technology that is now integrated into clinical workflows via France's largest network of pathologists, Medipath. Owkin has raised a total of $304.1M.
And of course, another important dataset to analyze comes from patents.
💡Patents ⚖️
Euretos uses natural language processing to interpret research papers—2-2.5 million new scientific papers are published each year in about 28,100 active scholarly peer-reviewed journals—but this is secondary to the 200-plus biomedical-data repositories it integrates. In particular, they provide biological knowledge graphs that semantically harmonise public and proprietary data, literature and patents, and they customise these to create client-specific knowledge graphs with domain-specific and/or proprietary data. Their ML models are driven by multi-omics data, minimising publication bias, and integrate predictions from different types of multi-omics networks to provide biological insight.
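At its core, a knowledge graph like the ones described above can be thought of as subject–predicate–object triples; the entities, relations and query helper below are hypothetical:

```python
# Minimal knowledge-graph sketch: biomedical facts stored as
# (subject, predicate, object) triples, queried by pattern matching.
triples = [
    ("DrugX",    "inhibits",      "ProteinY"),
    ("ProteinY", "implicated_in", "DiseaseZ"),
    ("PaperA",   "supports",      "DrugX inhibits ProteinY"),
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given (possibly partial) pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Which entities does DrugX act on?
print(query(subject="DrugX"))
```

Production knowledge graphs use triple stores and ontologies (e.g. RDF with standard biomedical vocabularies) rather than an in-memory list, but the query pattern is the same idea.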
PatSnap is the leading Connected Innovation Intelligence platform. In 2022, PatSnap launched Eureka, an AI-powered innovation solutions platform, designed to make intellectual property (IP) accessible for R&D professionals, by translating the legal language of IP into the technical language of R&D. PatSnap has raised a total of $351.6M in funding over 6 rounds.
➡️ but biomedical data mining is also about:
💡Surveys (a health survey is a tool used to gather information on the behavior of a specific group of people from a determined area).
💡Medical 🏥 Records—used to track events and transactions between patients and health care providers, offering information on diagnoses, procedures and lab tests: computer tomography and magnetic resonance imaging scans, signals from electroencephalograms, laboratory data from blood and specimen analysis, and clinical data from patients and other services—💡Electronic Medical Records, EMRs—an electronic version of a patient's medical history—and 💡Claims Data—a bill that healthcare providers submit to a patient's insurance provider.
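A toy sketch of what linking EMR and claims data can look like in practice (patient IDs, diagnoses, procedures and amounts are all invented):

```python
# Link EMR diagnoses with claims records by patient ID, the basic join
# behind many predictive-analytics pipelines on health data.
emr = [
    {"patient_id": "P1", "diagnosis": "type 2 diabetes"},
    {"patient_id": "P2", "diagnosis": "hypertension"},
]
claims = [
    {"patient_id": "P1", "procedure": "HbA1c test", "billed": 40.0},
    {"patient_id": "P1", "procedure": "eye exam",   "billed": 120.0},
]

# Index claims by patient, then attach them to each EMR record.
by_patient = {}
for claim in claims:
    by_patient.setdefault(claim["patient_id"], []).append(claim["procedure"])

linked = [{**rec, "procedures": by_patient.get(rec["patient_id"], [])}
          for rec in emr]
print(linked)
```

Real-world record linkage is harder: identifiers must be de-identified and matched privacy-preservingly (via tokenization, as companies like Datavant do) rather than joined on a plain patient ID.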
For example, Datavant, which specializes in breaking down silos and analyzing health data securely and privately, acquired Swellbox to enable patients to request their medical records seamlessly. Swellbox also enables patient authorisation for record retrieval for clinical trial recruitment, long-term surveillance, registry creation and other use cases. On December 15, 2022, Datavant announced a partnership integrating Syntegra's synthetic data capabilities (Syntegra Synthetic Data API) into the Datavant Switchboard, a neutral, trusted and ubiquitous infrastructure for the exchange of privacy-preserved health data. Moreover, on January 17, 2023, Socially Determined—the social risk analytics and data company that is empowering health care organisations to manage risk, improve outcomes and advance equity at scale—announced a partnership with Datavant that will enable Socially Determined to provide curated, de-identified and linkable social risk data at the patient level. Datavant has raised a total of $80.5M in funding over 2 rounds.
Nference, Inc is a science-first software company that partners with medical centers to turn decades of rich and predominantly unstructured data captured in electronic medical records into powerful software solutions that enable scientists to discover and develop the next-generation of personalized diagnostics and treatments. Nference just made it on FastCompany’s list of 10 most innovative companies in data science for 2023 and has raised a total of $152.7M.
Predicta Med in Israel is using DL and AI-based medical decision support for the early detection and treatment of undiagnosed autoimmune diseases, by aggregating and analyzing EMR and claims data to provide predictive disease analytics. Predicta Med’s platform also integrates with EMR systems and care managers’ software to assist healthcare providers. Predicta Med has raised $3.2M.
Kapsule in Rwanda uses big data and analytics to capture healthcare data—including EMRs and medical supply chain data—to deliver critical insight analytics. Kapsule's dashboard, integrated with the EMR system, also enables healthcare providers to track key performance indicators for patients.
💡Electronic Health Records (EHR)—the systematized collection of patient and population health information, stored electronically in a digital format.
Forefacts in the US has FactsIn, a data activation and population health management platform that simplifies risk stratification of patients and resource utilization in facilities. The startup's other solution, FactsCare, improves patient engagement through personalized communication and care plans. Additionally, FastCloud is their health cloud software suite that features mobile EHR, teleconsultation and more.
💡Vital Records—are records of life events kept under governmental authority, including birth certificates, marriage licenses or marriage certificates, separation agreements, divorce certificates and death certificates. In some jurisdictions, vital records may also include records of civil unions or domestic partnerships.
💡Clinical Data—a collection of data related to patient diagnosis, demographics, exposures, laboratory tests and family relationships—and data from clinical trials.
💡Surveillance—medical surveillance is the analysis of health information to look for problems that may be occurring in the workplace that require targeted prevention—and
💡Unpublished Proprietary Lab Data (observations and lab notes).
ELN Adoption in Research Labs (@the aliquot).
Data Lakes and data warehouses
Data lakes and data warehouses are centralized repositories that allow you to store ALL your structured and unstructured data at any scale.
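The "schema on read" idea behind a data lake can be sketched in a few lines: land raw records of any shape first, and impose structure only when querying (the file layout and record fields here are hypothetical):

```python
# Toy data-lake pattern: write heterogeneous raw records as JSON files,
# then apply structure only when reading ("schema on read").
import json
import pathlib
import tempfile

lake = pathlib.Path(tempfile.mkdtemp()) / "lake" / "raw"
lake.mkdir(parents=True)

records = [
    {"type": "lab_result", "patient": "P001", "test": "HbA1c", "value": 6.1},
    {"type": "clinical_note", "patient": "P001",
     "text": "Patient reports fatigue."},
]
for i, record in enumerate(records):
    (lake / f"record_{i}.json").write_text(json.dumps(record))

# Schema on read: pull out only the structured lab results.
lab_results = [json.loads(p.read_text()) for p in sorted(lake.glob("*.json"))]
lab_results = [r for r in lab_results if r["type"] == "lab_result"]
print(lab_results)
```

Platforms like the ones below replace the JSON files with columnar formats (e.g. Parquet/Delta), add governance and indexing, and scale the "read with a schema" step out to SQL and ML workloads.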
👉 Databricks is offering Lakehouse for Healthcare and Life Sciences. In particular, they offer a single platform that brings together all data (structured and unstructured data, patient, R&D and operations) and analytics workloads—with applications ranging from managing hospital bed capacity to optimizing the manufacturing and distribution of pharmaceuticals—to enable transformative innovations in patient care and drug R&D. On December 27, 2022, Quantori, LLC—a leading global provider of data science and digital transformation solutions for life science and healthcare organisations—announced a partnership with Databricks to power innovation across the entire drug lifecycle by unifying data, analytics and AI on a simple and open multi-cloud platform. Databricks paid $1.3 billion to acquire MosaicML—an open source startup with neural networks expertise that has built a platform for organisations to train large language models and deploy generative AI tools based on them. Databricks has raised a total of $3.5B in funding over 9 rounds.
👉 Snowflake’s Healthcare & Life Sciences Data Cloud allows companies to eliminate data marts, break down silos, capitalize on near-unlimited performance and create a single source of truth by bringing diverse data together and granting governed access for all users and applications. Snowflake has raised a total of $2B in funding over 10 rounds. Their latest funding was raised on Apr 19, 2022 from a Post-IPO Equity round.
Veeva Systems is a global leader in cloud software for the life sciences. On March 28, 2023, Veeva announced that more than 100 life sciences companies are using Veeva CRM Events Management to plan and execute in-person, virtual and hybrid events worldwide. Veeva has raised a total of $7M in funding over 2 rounds.
Kyndi is a global natural language processing company whose AI-powered platform is being deployed in areas such as supply chain management, manufacturing, healthcare, medical research and financial services. On March 21, 2023, the Kyndi Natural Language Platform was named a 2023 CUSTOMER magazine Product of the Year Award winner 🏆 by the global integrated media company TMC. Kyndi has raised a total of $42.8M in funding over 8 rounds.
For more: Biomedical data mining: AI/ML tools and startups (2nd part)
Until next time,