Weekly TechBio News I: Data Mining
Mining Tools For Omics, Imaging, Patents, Surveys, Medical Records, Electronic Medical Records, Electronic Health Records, Claims Data, Clinical Data and Unpublished Data
Hi everyone, and welcome back to another edition of MetaphysicalCells, this time on data mining.
“Nature doesn’t separate creativity from function; it innovates by integrating them. Polyintelligence embraces this approach by combining human creativity, machine precision, and nature’s deep reservoir of adaptive strategies.”
Polyintelligence and the Art of Connected Ideas
✴️ Big Data & Analytics Tools
Data mining and knowledge extraction (also described as data or pattern analysis, data archaeology and data dredging) aim to uncover patterns, trends and relationships in large biomedical data sets. The work is not limited to peer-reviewed literature (papers): it also covers the relevant omics, imaging, patents, surveys, medical records, electronic medical records, claims data, clinical data and unpublished data. Biomedical big data also includes health monitoring data collected through sensors and Internet of Things (IoT) technologies.
Approximately 30% of the world’s data volume is generated by the healthcare industry. By 2025, the compound annual growth rate of healthcare data is expected to reach 36%. That’s 6 percentage points faster than manufacturing, 10 points faster than financial services, and 11 points faster than media & entertainment (The healthcare data explosion).
To manage this volume of biomedical data effectively, database technology and data management methods play a critical role. For example:
Big Data Platforms such as Apache Hadoop (an open-source framework for the distributed storage and processing of large datasets across clusters of computers using simple programming models) and Apache Spark (an open-source distributed processing system for big data workloads) enable distributed storage and processing of large-scale genomic, clinical and imaging data, making it easier to extract meaningful insights.
Cloud Computing providers such as Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform offer a range of services and tools for storing, processing and analyzing biomedical data. Snowflake’s Healthcare & Life Sciences Data Cloud allows companies to eliminate data marts, break down silos, capitalize on near-unlimited performance and create a single source of truth by bringing diverse data together and granting governed access to all users and applications. Veeva Systems is a global leader in cloud software for the life sciences.
On December 17, 2024, eClinical Solutions collaborated with Snowflake to streamline data for life sciences.
On January 23, 2025, Veeva and Zifo partnered to accelerate quality control modernization in life sciences.
NoSQL Databases such as MongoDB (a source-available, cross-platform, document-oriented database program), Apache Cassandra (a free and open-source database management system designed to handle large volumes of data across multiple commodity servers) and Apache HBase (an open-source, non-relational distributed database modeled after Google's Bigtable and written in Java) are used to store and manage large-scale genomic and clinical datasets.
Data Integration and Interoperability Platforms such as OMOP and SMART Health IT are both used to integrate and harmonize data from multiple sources, enabling researchers and clinicians to access and analyze data seamlessly.
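To make the "simple programming models" behind Hadoop-style processing more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop or Spark code; the records, gene names and function names are all made up for illustration, and a real cluster would run each phase on different machines.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy records: (sample_id, gene, variant). Made-up values.
RECORDS = [
    ("s1", "GENE_A", "v1"),
    ("s1", "GENE_B", "v2"),
    ("s2", "GENE_A", "v1"),
    ("s2", "GENE_C", "v3"),
    ("s3", "GENE_A", "v4"),
    ("s3", "GENE_B", "v2"),
]


def split_into_partitions(records, n):
    """Simulate the blocks of a distributed file with a round-robin split."""
    return [records[i::n] for i in range(n)]


def map_phase(partition):
    """Emit one (gene, 1) pair for every variant record in a partition."""
    return [(gene, 1) for _sample, gene, _variant in partition]


def shuffle(mapped_partitions):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_partitions):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_variants_per_gene(records, n_partitions=2):
    partitions = split_into_partitions(records, n_partitions)
    mapped = [map_phase(p) for p in partitions]  # would run in parallel
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(count_variants_per_gene(RECORDS))
```

The result does not depend on how many partitions are used, which is what lets this kind of job scale out across a cluster.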
The core of data flow in such a data management system is the Translational Data Warehouse (TDW), which is based on the Informatics for Integrating Biology & the Bedside (i2b2) core technology and aggregates data from pre-clinical and clinical sources. The i2b2 framework is a scalable informatics platform that organizes and transforms patient-oriented clinical data in a way that is optimized for clinical research, and it supports the integration and analysis of heterogeneous biomedical data. It uses ETL (extract, transform and load) processes to extract data from sources such as electronic health records, imaging systems and genomic databases, transform it into a standard format and load it into the data warehouse. On top of the warehouse, i2b2 provides analytical tools for researchers and clinicians, including visualizations, statistical analysis tools and ML algorithms (Biomedical Big Data Technologies, Applications, and Challenges for Precision Medicine: A Review).
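As a rough illustration of the extract, transform and load steps, the sketch below reads a small CSV export, harmonizes units and loads the result into an in-memory SQLite table. Everything here is an assumption made for the example: the CSV content is invented, the `observation` table is a hypothetical stand-in rather than the actual i2b2 schema, and the glucose conversion factor is approximate.

```python
import csv
import io
import sqlite3

# Hypothetical source extract standing in for an EHR export (made-up values).
EHR_CSV = """patient_id,test,value,unit
P001,glucose,5.4,mmol/L
P002,glucose,99,mg/dL
P003,glucose,,mg/dL
"""

# Approximate factor: 1 mmol/L of glucose is roughly 18 mg/dL.
MGDL_PER_MMOLL = 18.0


def extract(text):
    """Extract: parse the raw export into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with missing values and convert to mmol/L."""
    cleaned = []
    for row in rows:
        if not row["value"]:
            continue  # skip records with no measurement
        value = float(row["value"])
        if row["unit"] == "mg/dL":
            value = value / MGDL_PER_MMOLL
        cleaned.append((row["patient_id"], row["test"], round(value, 2), "mmol/L"))
    return cleaned


def load(conn, rows):
    """Load: write the harmonized rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation "
        "(patient_id TEXT, test TEXT, value REAL, unit TEXT)"
    )
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, text=EHR_CSV):
    load(conn, transform(extract(text)))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_etl(connection)
    for record in connection.execute("SELECT * FROM observation ORDER BY patient_id"):
        print(record)
```

Keeping the three steps as separate functions mirrors how a new source can usually be added by writing another extract step while reusing the same target format.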
Databricks offers Lakehouse for Healthcare and Life Sciences, a single platform that brings together all data (structured and unstructured; patient, R&D and operations) and analytics workloads to enable innovations in patient care and drug R&D. Applications range from managing hospital bed capacity to optimizing the manufacturing and distribution of pharmaceuticals. On December 27, 2022, Quantori, LLC, a global provider of data science and digital transformation solutions for life science and healthcare organisations, announced a partnership with Databricks to power innovation across the entire drug lifecycle by unifying data, analytics and AI on a simple and open multi-cloud platform. Databricks also paid $1.3 billion to acquire MosaicML, an open-source startup with neural network expertise that built a platform for organisations to train large language models and deploy generative AI tools based on them.
Most recently (January 22, 2025), Databricks closed a $15.3 billion financing at a $62 billion valuation, with Meta joining as a "strategic investor".
To make a long story short, the global big data in healthcare market is estimated to be worth USD 50.74 billion in 2024 and is projected to grow from USD 61.26 billion in 2025 to USD 145.42 billion by 2033, a CAGR of 11.41% over the forecast period (2025-2033).
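For readers who want to check the growth figure, the compound annual growth rate can be recomputed from the 2025 and 2033 values quoted above. The sketch below assumes the forecast period spans eight compounding years (2025 to 2033); the function name is just for illustration.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # Market size in USD billions, as quoted in the text.
    rate = cagr(61.26, 145.42, 2033 - 2025)
    print(f"Implied CAGR 2025-2033: {rate:.2%}")
```

The result agrees with the quoted 11.41% to two decimal places, so the forecast figures are internally consistent.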
The first part of today's newsletter is dedicated to TechBio startups working with omics and imaging biomedical data.
✴️ Mining Omics Data
Omics data are generated by high-throughput technologies used to study the various "omes" of an organism. They include proteomics, transcriptomics (including spatial transcriptomics), genomics, metabolomics, lipidomics and epigenomics, which correspond to global analyses of proteins, RNA, genes, metabolites, lipids, and methylated DNA or modified histone proteins in chromosomes, respectively. They also include phenomics, an emerging transdiscipline concerned with the changes in an organism that produce variations in its phenotype over its life span; metagenomics, the study of the structure and function of the nucleotide sequences isolated and analyzed from all the organisms (typically microbes) in a bulk sample; and microbiome data, meaning shotgun sequencing, amplicon sequencing, metatranscriptomic, metabolomic and metaproteomic data from the collection of all microbes (such as bacteria, fungi and viruses) and their genes that naturally live on and inside our bodies.
Emerging TechBio startups working with omics data include the following:
💈 Biognosys (Biognosys Inc)
Biognosys, a spin-off from the lab of proteomics pioneer Ruedi Aebersold at ETH Zurich, offers a range of proteomics solutions for biomarker discovery and drug development. For biomarker discovery, their TrueDiscovery® platform uses Hyper Reaction Monitoring (HRM™), which enables specific and unbiased discovery through the quantification of complete proteomes and phospho-proteomes. The platform also leverages their flagship software, Spectronaut®, for AI/ML-enhanced proteomics data analysis. In addition, they provide highly multiplexed targeted proteomics with absolute quantification for customized panels of proteins.
In January 2023, Bruker Corporation, an American manufacturer of scientific instruments for molecular and materials research as well as for industrial and applied analysis, made a majority-ownership investment in Biognosys and has acted as a strategic investor ever since. On April 4, 2024, Biognosys and Alamar Biosciences forged a strategic partnership in proteomics to advance biopharma and precision medicine research. On June 5, 2024, Thermo Fisher Scientific and Biognosys announced a co-marketing agreement. On September 18, 2024, Biognosys entered a reselling agreement for Spectronaut. On December 5, 2024, IonOpticks, a provider of chromatography solutions, and Biognosys announced a long-term supply agreement.