Biomedical data mining ⛏️
From the peer review process and the reproducibility crisis, to great data mining startups and AI/ML tools for biomedical science
Hi 👋 everyone and welcome back to another edition of MetaphysicalCells on AI Drug Discovery and Biomedical Data. We start with the peer review process, which today faces a veritable armageddon of data.
The value of global big data analytics in the healthcare market is estimated to reach $101.07 billion by 2031.
The global big data market is forecasted to earn $103 billion in revenue by 2027.
It is estimated that the world will create and store over 180 zettabytes of data by 2025. One zettabyte is equal to one trillion gigabytes or one sextillion bytes.
Peer review and reproducibility crisis: a snake 🐍 eating its tail
The academic peer review took its first steps 🪜 in 1665 in order to ensure that: “the honor of X author’s invention will be inviolably preserved to all posterity”. For that reason it was determined that: “the Y article in the Society’s Science Transactions should be first reviewed by some of the members of the same (reviewers)”.
This system at the heart of all science has remained essentially unchanged since 1665 and nowadays it is the method by which:
papers are published for dissemination of biomedical knowledge,
grants are allocated,
academics are promoted and
Nobel prizes won.
However, back in 2005, the legendary Greek-American Stanford epidemiologist John Ioannidis wrote a paper—which has become the most widely cited paper ever published in the journal PLoS Medicine—examining how issues ingrained in the scientific publishing process and its peer review might indicate that at present:
🗣️“Most published findings are likely to be incorrect”.
A decade after Ioannidis, Richard Horton, the UK-based editor-in-chief of The Lancet, put it only slightly more mildly:
🗣️“Much of the scientific literature, perhaps half (5️⃣0️⃣%), may simply be untrue”.
Furthermore, according to Richard Smith—a British medical doctor, editor, businessman and chief executive of the BMJ Publishing Group for 13 years—and Christopher Tancock—Editor-in-Chief of Elsevier—the peer review process is:
slow 🐌 and expensive,
inconsistent,
with reviewers who sometimes turn out to be fake, and who are often overworked, under-prepared and rarely paid,
with agencies that “handle the peer review process” for authors,
with journal shopping, a process where scientists submit first to the most prestigious journals in their field and then work down the hierarchy of impact factors,
with citation manipulations,
with ghostwriters 👻,
with flagrant conflicts of interest and power bias,
with publication bias: a process where negative results go unpublished, together with small sample sizes, with tiny effects and invalid exploratory analyses, and
with an obsession for pursuing fashionable trends of dubious importance,
that collectively have allowed science to take a turn towards darkness.
To make a long story short, all of the above have produced the replication (or reproducibility) crisis in scientific publishing, with enormous implications for the drug discovery process, in terms of both money and human resources: early-stage research, where novel hypotheses for lead drug candidates are formulated, passes through the “bottleneck of the peer review process” and comes out carrying a reproducibility problem that can lead everyone down the wrong research path during novel drug development.
How AI is changing dissemination of biomedical knowledge
On 20 March 2023, the Web of Science database—the world's leading scientific citation search and analytical information platform, providing access to multiple databases of academic journals, conference proceedings and other documents across academic disciplines—announced that it had removed 50 journals from its list (“Web of Science delists some 50 journals, including one of the world’s largest”). Furthermore, Clarivate—the company that calculates journals' impact factors using data from the Web of Science—said it is continuing to review 450 more journals with the help of an AI tool.
The impact factor of an academic journal is a scientometric index—scientometrics is a subfield of informetrics concerned with measuring and analysing scholarly literature—calculated by Clarivate. It reflects the yearly mean number of citations received in a given year by articles the journal published in the two preceding years, as indexed by Clarivate's Web of Science.
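A minimal sketch of that two-year calculation, with made-up numbers (not any real journal's figures):

```python
def impact_factor(citations_in_year, articles_prev_two_years):
    """Two-year journal impact factor: citations received in year Y to
    items the journal published in years Y-1 and Y-2, divided by the
    number of citable items it published in those two years."""
    return citations_in_year / articles_prev_two_years

# Hypothetical journal: 600 citations in 2023 to its 2021-2022 articles,
# of which it published 200 citable items in that window.
print(round(impact_factor(600, 200), 2))  # → 3.0
```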
Losing the Web of Science impact factor is bad news for authors—because the metric is widely used in hiring, tenure and promotion decisions as a proxy for quality, despite criticism that impact factors are methodologically flawed—but at the same time it is also good news for the dissemination of well-founded science!
But how many scientific journals do we have?
It is estimated that there are currently more than 30,000 academic journals, a number growing by about 5%-7% per year, and roughly 2,000 academic publishers globally. Moreover, more than 2 million journal articles are published yearly, a figure expected to increase by 2%-4% annually.
To put that in perspective, if you decide to study breast cancer you will have to read something like 415,124 papers on PubMed (as of 20 August 2020). Let’s now hypothesise that you need only 1 hour to read a scientific article and that you can read on average 12 hours per day. You would then need roughly 95 years just to read everything published about breast cancer on PubMed, and then eventually—assuming you are still alive—design a strategy for your drug discovery project.
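That back-of-the-envelope estimate checks out:

```python
papers = 415_124      # breast cancer papers on PubMed (as of 20 Aug 2020)
hours_per_paper = 1   # optimistic reading speed
hours_per_day = 12    # heroic daily reading schedule

days = papers * hours_per_paper / hours_per_day
years = days / 365
print(f"{years:.0f} years")  # → 95 years
```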
Obviously, this scenario borders on mission 🤯 impossible. And it gets even more complicated when you factor in the impact that the replication crisis has already had on those 415,124 published breast cancer articles.
Fortunately, we already have several AI/ML tools that can assist us in areas such as plagiarism prevention, requirements compliance checks and reviewer-manuscript matching.
For example,
Artificial Intelligence Review Assistant (AIRA) is a platform to support editors, reviewers and authors to evaluate the quality of manuscripts and to help meet global demand for high-quality, objective peer-review in publishing,
UNSILO uses a corpus-based concept extraction tool to identify hundreds of concepts (key phrases that distinguish each article from all the others in the corpus) in a submission and rank them in order of relevance to that paper. It then matches the resulting cluster of concepts against the 29 million articles and abstracts in the PubMed corpus,
Statcheck is an open-source tool, written in the statistical programming language R, designed to detect statistical reporting errors in peer-reviewed psychology articles,
Penelope.ai is an online tool that automatically checks whether scientific manuscripts meet journal requirements (such as references and the structure of a manuscript),
StatReviewer is an automated reviewer of statistical errors and reports integrity for scientific manuscripts.
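The corpus-based concept extraction performed by tools like UNSILO above can be illustrated with a toy TF-IDF ranking: terms frequent in a paper but rare in the rest of the corpus are the ones that distinguish it. This is a deliberately simplified sketch, not any vendor's actual pipeline:

```python
import math
from collections import Counter

def tfidf_concepts(doc, corpus, top_n=3):
    """Rank a document's words by TF-IDF against a small corpus:
    terms frequent in the document but rare elsewhere score highest."""
    words = doc.lower().split()
    tf = Counter(words)
    n_docs = len(corpus) + 1
    scores = {}
    for term, count in tf.items():
        # document frequency: how many corpus docs contain the term
        df = 1 + sum(term in d.lower().split() for d in corpus)
        scores[term] = (count / len(words)) * math.log(n_docs / df)
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]]

corpus = ["gene expression in yeast", "protein folding dynamics",
          "clinical trial design"]
print(tfidf_concepts(
    "tumour suppressor gene mutations in breast tumour tissue", corpus))
# 'tumour' ranks first: frequent in the document, absent from the corpus
```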
Other AI-Based Literature Review Tools (from a guide by Texas A&M University Libraries 📚) are the following:
SEMANTIC SCHOLAR is a free AI-powered research tool for scientific literature, for the topics of computer 🖥️ science, geoscience 👴👵 and neuroscience 🧠,
ELICIT.ORG uses language models to help you automate research workflows, like parts of literature review,
CONSENSUS.APP is a search engine that uses AI to find insights in research papers, and now uses data annotations from Centaur Labs (more on Centaur Labs in a minute),
Scite.Ai is an award-winning platform for discovering and evaluating scientific articles via Smart Citations that allow users to see how a publication has been cited, by providing the context of the citation and a classification describing whether it provides supporting or contrasting evidence for the cited claim,
ChatGPT has potential applications for teaching, learning and doing literature reviews,
Bing AI builds on GPT-4 technology, connects to the Internet and works best with the Microsoft Edge browser, and
Bing AI Sidebar, which integrates GPT-4 into the MS Edge browser for free.
However, the opposite is also true, since some AI/ML tools can make researchers’ lives even harder!
For example, the AI models DALL-E, Stable Diffusion and Midjourney already produce realistic pictures of human faces, objects and scenes, so it is only a matter of time before they also produce convincing scientific images, as pointed out in the article “Thanks to generative AI, catching fraud science is going to be this much harder” by The Register.
In particular, the author Katyanna Quach highlighted how some image analysts, while scrutinising 🔍 data in scientific papers, came across a strange set of images that appeared in 17 biochemistry-related studies and ALL had the same background (“Digital magic, or the dark arts of the 21st century—how can journals and peer reviewers detect manuscripts and publications from paper mills?”).
These analysts concluded that the suspicious-looking images of western blots under investigation were most likely computer-generated, produced as part of a “paper mill operation”, namely “an effort to mass produce bio-chemical papers using faked data and get them peer reviewed and published”.
However, the fact that AI can be a double-edged sword ⚔️ for the peer review process (and its result-fabricating business) should not scare us.
Whether we like it or not, deception, AI-generated or not, is a form of creativity. It might be negative creativity, but creativity it is. And creativity is part of human evolution and human learning, on a planet where yin and yang ☯️, or good and bad forces, have always co-existed in an endless feedback loop ➰. In other words, we have to co-exist with deception and even learn from this “art of deception creature” in order to become better.
Accordingly, thanks to these paper mills―shadowy organisations that provide ghostwritten or fabricated manuscripts and submission services, with or without AI tools―we as a society have to continue to train our future biologists on real lab benches 🧪🥽🥼👩🔬 and not only on “in-silico benches”, so that they will be able to recognise fake science based on real scientific experience, and eventually use the appropriate forensic life science tools to detect AI-generated scientific spam.
Of course, data mining and knowledge extraction from biomedical data is not only about peer-reviewed literature and the relevant omics and imaging data, but also about
Surveys,
Medical 🏥 Records (computed tomography and magnetic resonance imaging scans, signals from electroencephalograms, laboratory data from blood, specimen analysis and clinical data from patients),
Claims Data,
Vital Records,
Surveillance,
Quizzes and
Unpublished Proprietary Lab Data (observations and lab notes)
making biomedical data a real beast to tame!
“Data Mining, also known as Knowledge Discovery in Databases, refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data stored in databases.
And involves: Data Cleaning 🧹, Data Integration 🔗, Data Selection, Data Transformation 📊, Data Mining ⛏️, Pattern Evaluation 🧐 and Knowledge Extraction 🔂.”
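A deliberately tiny sketch of those steps on fabricated blood-pressure records (illustrative only; real biomedical pipelines use dedicated tooling):

```python
# Toy records from two hypothetical labs (sbp = systolic blood pressure)
lab_a = [{"id": 1, "sbp": 120}, {"id": 2, "sbp": None}, {"id": 3, "sbp": 160}]
lab_b = [{"id": 4, "sbp": 150}, {"id": 5, "sbp": 118}]

# 1. Cleaning 🧹: drop records with missing values
clean = [r for r in lab_a if r["sbp"] is not None]
# 2. Integration 🔗: merge the two sources
merged = clean + lab_b
# 3. Selection / 4. Transformation 📊: keep the measurement, label it
labelled = [(r["sbp"], "high" if r["sbp"] >= 140 else "normal") for r in merged]
# 5. Mining ⛏️: a trivial "pattern", the share of high readings
high_share = sum(1 for _, lab in labelled if lab == "high") / len(labelled)
# 6. Pattern Evaluation 🧐 / 7. Knowledge Extraction 🔂: report it
print(f"high blood pressure in {high_share:.0%} of records")  # → 50%
```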
AI/ML companies for aggregation and synthesis of biomedical data, for generating data and models and for analysing real world evidence
ScienceIO has an AI platform that enables real-time transformation of unstructured healthcare data into structured data, which can then be leveraged for search and analysis. It uses a single line of code to identify and extract over 9 million healthcare concepts, clinical variables and medical codes from text, linked to 20+ industry-standard ontologies. In a healthcare data setting, an ontology is a standardised way of identifying different types of healthcare information across the entire industry and around the globe.
ScienceIO’s Knowledge Graph categorises each of the supported ontologies as primary when it uses them to map data. For example, primary ontologies include:
UMLS, the Unified Medical Language System, a compendium of many controlled vocabularies in the biomedical sciences,
ChEMBL, a manually curated database of bioactive molecules with drug-like properties and ChEBI or Chemical Entities of Biological Interest, a freely available dictionary of molecular entities focused on small molecules,
dbSNP, the Single Nucleotide Polymorphism Database, a free public archive for genetic variation within and across different species,
CLO (Cell Line Ontology), a community-based ontology of cell lines,
GeneID, a program to predict genes, exons, splice sites and other signals along a DNA sequence,
ClinVar, a public archive that aggregates information about relationships between human genetic variation and health, and
NCBI Taxonomy ID, a curated set of names and classifications for all of the organisms that are represented in GenBank.
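The idea of ontology-linked extraction can be sketched as a simple lexicon lookup. The codes below are placeholders, not real UMLS or ChEBI identifiers, and this is not ScienceIO's actual API:

```python
# Hypothetical lexicon mapping text mentions to ontology identifiers.
# The codes are made-up placeholders for illustration only.
LEXICON = {
    "breast cancer": ("UMLS", "C_EXAMPLE_1"),
    "aspirin": ("ChEBI", "CHEBI_EXAMPLE_2"),
}

def extract_concepts(text):
    """Return (mention, ontology, code) for every lexicon term found."""
    found = []
    for mention, (ontology, code) in LEXICON.items():
        if mention in text.lower():
            found.append((mention, ontology, code))
    return found

note = "Patient with breast cancer, currently on aspirin."
print(extract_concepts(note))
```

Real systems replace the dictionary with learned models that handle synonyms, abbreviations and context, but the output shape (mention linked to a standard code) is the same.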
ScienceIO, founded by Gaurav Kaushik (a bioengineer) together with Will Manidis (a former Thiel fellow and managing partner of Dorm Room Fund) emerged from stealth with big AI ambitions in 2021 and with $8 million in seed funding over 1 round from 11 investors.
meMR Health is an AI-automated medical record retrieval and analysis platform that auto-categorises and transforms records into high-yield data (using the Advanced Encryption Standard, or AES, to shield all patient health information). meMR Health claims it is not just an API (application programming interface) company, since it takes things a step further with a comprehensive approach that collects 90% of a patient's data, leaving fewer gaps and no guesswork. On June 13, 2022, meMR Health raised a $400,000 pre-seed round from Entrepreneurs Roundtable Accelerator. A similar company to meMR Health, simplifying access to real-world medical imaging data, is Segmed, with something like 100 million imaging studies in its data network! Segmed has raised a total of $10M so far.
Cascade MD enables healthcare providers to dictate and capture the entire patient visit details with their mobile devices, by using Cascade MD's voice-to-text and AI inferencing engine to capture all important information in real-time and integrate it into an EMR.
CascadeMD announced this year the general availability of its cloud-based clinical documentation solution that uses several proprietary technologies including multi-lingual Speech to Text, AI, ML and Natural Language Processing to automate the process of populating data fields in an EMR from voice dictation. And this summer, CascadeMD announced its partnership with PointClickCare, the leading cloud-based software provider for the senior care market.
Unlearn develops generative ML methods to predict individual health outcomes and accelerate clinical innovation, using an AI-powered digital twin for every patient. A patient’s digital twin (in TwinRCTs or TwinRCTs II) is computationally created with generative AI (it is not a matched patient from an external cohort) and provides a probabilistic forecast of that patient's specific health outcomes, though it cannot be used as if it were data from a new patient.
Unlearn's technology is regulatory-qualified and used by leading global pharmaceutical companies to run AI-powered clinical trials. For example, on June 27, 2023 QurAlis Corporation and Unlearn announced a collaboration to accelerate and optimise QurAlis' clinical program in amyotrophic lateral sclerosis with Unlearn's advanced AI technology.
Unlearn.AI has raised a total of $84.9M. For more follow them on Substack:
Centaur Labs is a medical data labeling company for 📌Text (Unstructured clinical notes, Scientific text, Chatbots and more), 📌Audio (Heart auscultation, Lung auscultation, Artery auscultation and more), 📌Images (Ultrasound, External images, X-ray and more), 📌Video (Surgery, Clinical sessions and more) and 📌Waves (EEG, ECG and more); Centaur Labs can be used by the following industries: Medical Devices, Life Sciences, Insurance, Wellness and Research.
In particular, Consensus.app (mentioned earlier) improves its scientific search algorithm with high-quality data annotations from Centaur Labs.
“Annotations that would have taken me 3 months to complete with our prior data annotation system, Centaur Labs completed in only 2 weeks,” said Eric Olson, cofounder and CEO of Consensus
Centaur Labs has raised a total of $15.9M.
Rhino Health uses edge computing and federated learning to create large, distributed datasets, making it possible for developers and researchers to collaborate across the healthcare ecosystem, including researchers, healthcare organisations and industry, without ever moving data, transferring ownership or risking patient privacy 🔏.
Rhino Health partners with medical researchers and AI developers throughout the full lifecycle of healthcare AI:
Data registry and data discovery
Data analytics and quality assessment
Predictive modeling creation and validation
Deployment, monitoring and continuous learning
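The federated learning at the heart of this approach can be illustrated with federated averaging (FedAvg), where only locally trained model weights, never patient records, leave each site. This is a minimal sketch of the general technique, not Rhino Health's actual implementation:

```python
def fed_avg(site_weights, site_sizes):
    """Federated averaging: each site trains the same model locally and
    sends only its weights to the server, which combines them weighted
    by local dataset size. Patient data never leaves the site."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * s for w, s in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical hospitals trained the same 2-parameter model locally:
hospital_models = [[0.2, 1.0], [0.6, 2.0]]
hospital_sizes = [100, 300]  # records held at each site
print(fed_avg(hospital_models, hospital_sizes))  # → [0.5, 1.75]
```

The larger site pulls the average toward its weights, reflecting the amount of evidence each site contributes.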
Rhino Health has raised $13.95M.
LatchBio emerged from stealth mode in 2021 providing almost code-free biocomputing solutions on the cloud that can be accessed from anywhere via a browser to simplify biological data analysis. Using their platform, researchers can upload files and access dozens of bioinformatics pipelines and data visualisation tools from analysing RNA sequencing data to designing CRISPR edits and even running the AlphaFold software just from their laptop. LatchBio has raised a total of $33.2M. For more follow them on Substack:
Datavant, which specialises in breaking down silos 🧱 and analysing health data securely and privately, acquired Swellbox to enable patients to request their medical records seamlessly. Swellbox also enables patient authorisation for record retrieval for clinical trial recruitment, long-term surveillance, registry creation and other use cases. On December 15, 2022, Datavant announced a partnership integrating Syntegra’s synthetic data capabilities (Syntegra Synthetic Data API) into the Datavant Switchboard, a neutral, trusted and ubiquitous infrastructure for the exchange of privacy-preserved health data.
Datavant has raised a total of $80.5M while Syntegra has raised a total of $5.6M.
Genialis is developing Genialis ResponderID, a biomarker discovery platform, and the Genialis Expressions software, that enables ML driven biomarker discovery by aggregating consistently analysed and annotated data. The Genialis Expressions software is built on FAIR (findability, accessibility, interoperability and reusability) data management principles, in order to analyse sequencing data across numerous platforms. Genialis has raised a total of $15.5M.
Owkin develops ML to connect medical researchers with high-quality datasets from leading academic research centers around the world and applies AI to research cohorts and scientific questions. On June 14, 2023, Owkin announced the successful validation of its MSIntuit™ CRC AI solution for colorectal cancer screening, a technology now integrated into clinical workflows via Medipath, France's largest network of pathologists. Owkin has raised a total of $304.1M.
In 2022, PatSnap launched Eureka, an AI-powered innovation solutions platform designed to make intellectual property accessible to R&D professionals by translating the legal language of IP into the technical language of R&D. PatSnap has raised a total of $351.6M.
Nference, Inc is a science-first software company that partners with medical centers to turn decades of rich and predominantly unstructured data captured in electronic medical records into powerful software solutions that enable scientists to discover and develop the next-generation of personalised diagnostics and treatments. Nference just made it on FastCompany’s list of 10 most innovative companies in data science for 2023 and has raised a total of $152.7M.
Snowflake’s Healthcare & Life Sciences Data Cloud allows companies to eliminate data marts, break down silos, capitalise on near-unlimited performance and create a single source of truth by bringing diverse data together and granting governed access for all users and applications.
Databricks is offering Lakehouse for Healthcare and Life Sciences, a single platform that brings together all data (structured and unstructured data — patient, R&D and operations) and analytics workloads to enable transformative innovations in patient care and drug R&D. Databricks just announced it will pay $1.3 billion to acquire MosaicML, an open source startup with neural networks expertise that has built a platform for organisations to train large language models and deploy generative AI tools based on them.
OneThree Biotech utilises AI to integrate and analyse over 30 types of chemical, biological and clinical data, allowing researchers to generate new insights during drug development. The company is collaborating with Poolbeg Pharma, which is applying OneThree Biotech’s ATLANTIS platform to identify novel drug targets and signatures driving respiratory syncytial virus infection. OneThree Biotech has raised a total of $2.5M.
Until next time 🎡,
Another informative edition!