Leveraging The Power of Public Datasets for Omics Research: The Whys, Whats, And Hows You Should Know – Part I

The explosion of omics-based research has opened up new possibilities for investigating cellular and molecular processes in greater depth. The vast volume of data generated from these specialized omics fields has driven the development of computational tools designed to analyze and interpret them effectively.

However, two stubborn roadblocks persist: cost and clock. The high expense of omics technologies, combined with lengthy processing timelines, keeps many researchers from undertaking large-scale investigations.

This is where public datasets, freely available as open-source resources, step in to fill the gap. These resources offer a way to work around such constraints, opening doors to broader research possibilities. In this multi-part blog series, we will look into how these datasets power reproducible research, explore key repositories you should know about, and share practical advice for making the most of them.

What are open-source omics data, and where can they be found?

Open-source omics data refer to publicly available datasets generated by high-throughput technologies in fields such as genomics, transcriptomics, proteomics, and metabolomics, which measure biological molecules (DNA, RNA, proteins, and metabolites) at a large scale. These datasets are of high quality, come with open licenses, are freely available, and are well-structured and machine-readable. They are typically maintained through public or private funding programs.

Several prominent repositories host open-source omics data, each tailored to specific omics disciplines and equipped with tools for data access and analysis. These repositories provide raw and processed data, standardized metadata, and analytical tools, ensuring researchers can efficiently leverage these resources to address diverse scientific questions.

Having established what open-source omics datasets are, we now turn to how they are fundamentally transforming modern research practices.

Why are public datasets a game changer in omics research?

Beyond addressing cost and time limitations, public omics datasets open up powerful opportunities for accelerating scientific progress. They promote collaboration, expand access to advanced research, and enable data-driven discoveries at scale. With free access to standardized, diverse, and well-curated datasets, researchers can conduct more robust analyses, integrate multi-omics data, and foster interdisciplinary approaches. Importantly, they democratize access to critical resources, allowing institutions of all sizes to contribute to areas like rare disease research, cancer biology, and precision medicine.

In the sections below, we highlight the most impactful opportunities they present and why they’re becoming essential to modern biomedical research.

(i) Enhancing statistical power

Public datasets offer a significant opportunity to overcome the limitations of small sample sizes, which often reduce the statistical power and generalizability of omics studies, particularly for rare diseases. Repositories like The Cancer Genome Atlas (TCGA) provide access to vast cohorts, with more than 20,000 primary cancer and matched normal samples spanning 33 cancer types, enabling researchers to perform analyses with greater statistical robustness. A sample size of this scale allows for the detection of subtle patterns, biomarkers, or disease subtypes that might be missed in smaller studies.
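To make the link between cohort size and statistical power concrete, here is a minimal sketch using a normal approximation for a two-sided comparison of two group means. The function name, the effect size of 0.3, and the group sizes are our own illustrative assumptions and are not taken from TCGA or any specific study.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    effect_size is the standardized mean difference (Cohen's d).
    Uses a normal approximation and ignores the very small chance of
    rejecting the null hypothesis in the wrong direction.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return NormalDist().cdf(noncentrality - z_crit)


# Illustrative only: a modest effect (d = 0.3) at growing group sizes.
for n in (10, 50, 200, 1000):
    print(f"n per group = {n:4d}  approx. power = {approx_power(0.3, n):.3f}")
```

Under these assumptions, a small pilot cohort has little chance of detecting a modest effect, while a cohort of several hundred samples per group detects it most of the time. Real omics analyses test thousands of features at once, so the significance threshold would usually be far stricter than 0.05.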

Meta-analyses that combine data from multiple public sources can amplify statistical power even further, leading to more reliable findings. Through the use of these extensive datasets, researchers can accelerate discoveries in fields like oncology and rare disease research, ensuring results are both significant and broadly applicable.
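One common way to combine results across studies is inverse-variance (fixed-effect) pooling, sketched below. The three effect estimates and standard errors are invented for illustration, and the fixed-effect model assumes every study measures the same underlying effect, which may not hold across platforms or cohorts.

```python
from math import sqrt


def fixed_effect_meta(estimates, std_errors):
    """Pool per-study effect estimates using inverse-variance weights.

    Returns the pooled estimate and its standard error.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical log fold changes for one gene reported by three studies.
study_estimates = [0.42, 0.30, 0.55]
study_std_errors = [0.20, 0.15, 0.25]

pooled, pooled_se = fixed_effect_meta(study_estimates, study_std_errors)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is smaller than that of any single study, which is the sense in which pooling adds power. When studies differ substantially, a random-effects model and explicit batch correction are usually more appropriate.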

(ii) Simplifying multi-omics integration

Combining genomics, transcriptomics, proteomics, and metabolomics data is difficult because formats and experimental protocols vary from study to study. Public datasets ease this integration. Platforms like the Omics Discovery Index aggregate and standardize datasets across these disciplines, providing unified formats and metadata that simplify the process. This enables researchers to correlate patterns across different omics types to gain a holistic understanding of biological systems.

For instance, integrating transcriptomic and proteomic data can reveal how gene expression influences protein function in diseases like cancer, opening new avenues for therapeutic development. As a result, researchers are better equipped to explore complex biological phenomena through integrated, data-driven insights.
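As a rough illustration of correlating two omics layers, the sketch below matches samples by shared ID and computes a Spearman rank correlation between transcript and protein abundance for one gene. The sample IDs and values are made up, and the helper names are our own; real pipelines would also normalize each layer and correct for batch effects first.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_across_shared_samples(mrna, protein):
    """Spearman correlation over the samples present in both layers."""
    shared = sorted(set(mrna) & set(protein))
    if len(shared) < 3:
        raise ValueError("too few shared samples to correlate")
    x = [mrna[s] for s in shared]
    y = [protein[s] for s in shared]
    return pearson(average_ranks(x), average_ranks(y)), shared


# Hypothetical abundances for one gene, keyed by sample ID.
mrna_levels = {"S1": 5.1, "S2": 7.3, "S3": 6.0, "S4": 9.2, "S5": 4.4}
protein_levels = {"S2": 120.0, "S3": 95.0, "S4": 180.0, "S5": 60.0, "S6": 75.0}

rho, used = spearman_across_shared_samples(mrna_levels, protein_levels)
print(f"samples used: {used}, Spearman rho = {rho:.2f}")
```

Matching on shared sample IDs is the step that standardized metadata makes possible: only samples measured in both layers contribute to the correlation.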

(iii) Improving reproducibility

Public datasets enhance the reproducibility of omics research, a critical factor for building scientific credibility and advancing knowledge. Variations in experimental conditions, such as sequencing platforms or sample preparation methods, can lead to inconsistent results across studies. Public repositories like TCGA provide standardized, high-quality datasets that serve as reference points for validation. For example, TCGA data has been widely used to confirm cancer biomarkers identified in smaller studies, ensuring findings are not artifacts of specific experimental setups. This reproducibility strengthens the reliability of research outcomes and accelerates the development of diagnostic and therapeutic tools by providing a consistent foundation for cross-study comparisons.

(iv) Training machine learning models

Public datasets are a cornerstone for training machine learning (ML) models in omics research, where large and diverse datasets are essential for developing accurate predictive models. Individual studies often lack the scale needed to train sophisticated ML algorithms, particularly deep learning models that require vast amounts of data. Public data repositories provide the necessary volume and variety of datasets, enabling researchers to train models for tasks such as disease classification, biomarker discovery, and outcome prediction.

With open access to such large-scale datasets, public repositories empower researchers to harness modern ML tools, accelerating progress in precision medicine and individualized therapies.

(v) Promoting inclusive research

Public datasets promote inclusivity by providing access to data from diverse demographic groups, addressing the issue of population bias in research. Many studies focus on specific populations, limiting the applicability of findings to other groups. Public data repositories include samples from patients of various ethnicities and geographical locations, offering a more representative view of human physiology and disease. For instance, TCGA’s diverse cancer patient data has supported the identification of population-specific biomarkers, ensuring that research outcomes are relevant to a broader population. This inclusivity is critical for developing treatments that are effective across diverse groups and for understanding diseases that disproportionately affect certain populations.

(vi) Streamlining ethical compliance

Public datasets simplify ethical and legal compliance, which can be a significant challenge in omics research involving human data. Ensuring adherence to regulations like the General Data Protection Regulation (GDPR) or the Health Insurance Portability and Accountability Act (HIPAA) is complex and time-consuming. Public repositories are typically curated to meet these standards, providing datasets that are de-identified and compliant with ethical and privacy guidelines, so researchers can often use them without navigating complex data-sharing agreements. This streamlined compliance process reduces administrative burdens, enabling researchers to focus on analysis and interpretation.

(vii) Accelerating rare disease research

Public datasets are transformative for rare disease research, where small patient populations often make data collection a major challenge. They aggregate data from multiple sources and help generate sufficient sample sizes for meaningful analysis. In rare cancers, for instance, this approach can uncover genetic drivers of disease and potential therapeutic targets that would be difficult to identify in isolation. Such aggregation enables researchers to detect shared patterns and distinct features across rare diseases, paving the way for novel diagnostics and treatments. Ultimately, public datasets provide the scale and accessibility needed to advance research in historically neglected areas.

(viii) Fostering collaboration in research and education

Public datasets foster cross-disciplinary collaboration, which is essential for addressing the multifaceted nature of omics research. Omics studies often require expertise from biology, computer science, statistics, and medicine, but coordinating such diverse skill sets can be challenging. Repositories like Gene Expression Omnibus (GEO) and Encyclopedia of DNA Elements (ENCODE) provide a common data foundation that researchers from different fields can access and analyze. For example, a biologist might contribute biological insights, while a computational scientist develops analysis pipelines, and a clinician interprets results for medical applications.

Furthermore, generating large-scale omics data is often beyond the scope of educational settings, but public data repositories offer a wealth of data for teaching purposes. These datasets are used in university courses and online training programs to teach skills ranging from basic data analysis to advanced machine learning applications. For instance, students can use GEO data to practice analyzing gene expression profiles, gaining hands-on experience with authentic biological data.
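A first classroom exercise of this kind might be computing log2 fold changes between two groups of samples. The sketch below uses made-up values and placeholder gene names rather than real GEO records, and the pseudocount of 1.0 is an arbitrary choice for illustration.

```python
from math import log2
from statistics import mean


def log2_fold_change(case_values, control_values, pseudocount=1.0):
    """log2 ratio of mean expression in cases versus controls.

    The pseudocount keeps the ratio finite when a gene has zero signal.
    """
    return log2(
        (mean(case_values) + pseudocount) / (mean(control_values) + pseudocount)
    )


# Made-up expression values for three placeholder genes (not real GEO data).
expression = {
    "GENE_A": {"case": [7.0, 7.0, 7.0], "control": [3.0, 3.0, 3.0]},
    "GENE_B": {"case": [5.0, 5.0, 5.0], "control": [5.0, 5.0, 5.0]},
    "GENE_C": {"case": [1.0, 1.0, 1.0], "control": [15.0, 15.0, 15.0]},
}

fold_changes = {
    gene: log2_fold_change(groups["case"], groups["control"])
    for gene, groups in expression.items()
}

# Rank genes by the size of the change, regardless of direction.
ranked = sorted(fold_changes, key=lambda g: abs(fold_changes[g]), reverse=True)
for gene in ranked:
    print(gene, round(fold_changes[gene], 2))
```

A fold change alone says nothing about variability between samples, so a real exercise would pair it with a statistical test and a multiple-testing correction.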

Thus, public datasets, through their accessibility, standardization, and built-in tools, support both effective cross-disciplinary research and the advancement of future-ready scientists capable of driving meaningful discoveries.

Let’s now look at how these advantages translate into real research breakthroughs by examining concrete examples from the field.

Public datasets in action: Real-life examples

Whether it’s repurposing existing drugs or mapping out intricate cellular landscapes, public datasets are at the heart of some of today’s most groundbreaking scientific work. These data-driven approaches are transforming research in oncology, neurodegenerative disease, and precision medicine. The examples below highlight how these datasets are being put to powerful use, unlocking insights that might otherwise remain hidden.

(i) Drug repositioning with Connectivity Map (CMap)

Public datasets from the GEO database have been pivotal in driving drug repositioning efforts through the Connectivity Map (CMap), a powerful tool that uses gene expression profiles to identify novel therapeutic uses for existing drugs. In a comprehensive study, researchers used GEO datasets to analyze gene expression changes in cancer cells treated with histone deacetylase (HDAC) inhibitors like vorinostat (SAHA) and romidepsin (FK228). When the differentially expressed genes were queried against the L1000-based CMap database (featuring compound profiles from over 8,870 drugs across nine cancer cell lines), the analysis revealed potential HDAC inhibitors, including KM-00927 and BRD-K75081836.

Further integration with the L1000 Fireworks Display (L1000FWD) tool confirmed their similarity to known HDAC inhibitors, while experimental assays validated the ability of KM-00927 to induce histone acetylation and inhibit cancer cell growth. The study also repurposed mitomycin C as a topoisomerase IIB inhibitor, demonstrating its ability to trap topoisomerase-DNA complexes, a mechanism that enhances its anti-cancer efficacy.

CMap demonstrates the power of integrating GEO’s transcriptomic data for systematic approaches to polypharmacology and drug repurposing using open-access resources.
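The idea behind querying a signature against compound profiles can be sketched in a few lines. This is a deliberately simplified score (mean profile value of the query's up-regulated genes minus that of its down-regulated genes), not the enrichment statistic that CMap itself uses, and the gene names, compound names, and values are all hypothetical.

```python
def simple_connectivity(up_genes, down_genes, profile):
    """Mean profile score of query-up genes minus that of query-down genes.

    A positive score means the compound shifts expression in the same
    direction as the query signature; a negative score means the opposite.
    """
    up = [profile[g] for g in up_genes if g in profile]
    down = [profile[g] for g in down_genes if g in profile]
    if not up or not down:
        raise ValueError("query genes are missing from the compound profile")
    return sum(up) / len(up) - sum(down) / len(down)


def rank_compounds(up_genes, down_genes, profiles):
    """Sort compounds from most similar to most opposite to the query."""
    scores = {
        name: simple_connectivity(up_genes, down_genes, profile)
        for name, profile in profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical query signature from a reference drug treatment.
query_up = ["GENE_A", "GENE_B"]
query_down = ["GENE_C", "GENE_D"]

# Hypothetical differential expression scores per compound.
compound_profiles = {
    "compound_mimic": {"GENE_A": 2.0, "GENE_B": 1.5, "GENE_C": -1.8, "GENE_D": -2.2},
    "compound_neutral": {"GENE_A": 0.1, "GENE_B": -0.1, "GENE_C": 0.0, "GENE_D": 0.2},
    "compound_reverser": {"GENE_A": -1.9, "GENE_B": -1.2, "GENE_C": 1.6, "GENE_D": 2.0},
}

ranking = rank_compounds(query_up, query_down, compound_profiles)
for name, score in ranking:
    print(f"{name:18s} {score:+.2f}")
```

In a search for new members of a drug class, as in the HDAC inhibitor study, the compounds of interest are those at the top of the ranking, whose profiles resemble the query. Searches for compounds that counteract a disease signature look at the bottom instead.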

(ii) Discovery of cancer biomarkers and therapeutic targets

Public proteomic datasets, such as those from the Human Proteome Map, have been instrumental in advancing cancer research by enabling the identification of biomarkers and therapeutic targets across multiple cancer types.

In a comprehensive study, researchers analyzed 232 tissue samples from 16 major human cancers, including breast, lung, colon, and prostate, using a data-independent acquisition (DIA) mass spectrometry approach. Using DIA data alongside publicly available data-dependent acquisition (DDA) datasets from the Human Proteome Map, the researchers built a spectral library that identified 8,527 proteins, of which 7,947 were quantifiable. This analysis revealed 2,384 universally expressed housekeeping proteins, 2,458 tissue-enriched proteins, and 6,835 cancer-associated proteins, including 40 proteins, linked to tumor progression, that were significantly upregulated in over 40% of cancer types.

The study also identified 1,139 druggable proteins, including 464 potential therapeutic targets like EGFR and PARP1, corresponding to FDA-approved drugs, and 21 cancer/testis antigens, such as MAGEA4, with potential for immunotherapy vaccine development. Leveraging public proteomic datasets, this research highlights the power of large-scale proteomics in uncovering cancer-specific molecular signatures, accelerating biomarker discovery, and guiding the development of targeted therapies and immunotherapies for various cancers.

(iii) Identification of novel biomarkers and patient subgroups in inflammatory bowel disease (IBD)

Public multi-omics datasets, generated from studies like the Study of a Prospective Adult Research Cohort with IBD (SPARC IBD), have significantly advanced our understanding of IBD. Researchers analyzed genomics, transcriptomics, and proteomics data from 603 patients in the SPARC IBD cohort, collected across 1,184 longitudinal time points, to train an XGBoost machine learning model that effectively distinguished Ulcerative Colitis (UC) from Crohn’s Disease (CD).

Key predictive features included known IBD-associated genes and proteins alongside novel candidates like RPS26, TMEM25, and ANGPTL3, offering potential diagnostic biomarkers for indeterminate colitis.

Using Multi-Omics Factor Analysis (MOFA), the study identified UC patient subgroups correlated with disease severity (R²=0.40, indicating a moderate relationship), marked by biomarkers like IL17A, TGFA, DLD, and others linked to interleukin signaling and metabolic pathways. For CD, the study uncovered two distinct patient populations with colon inflammation: (i) one characterized by upregulated HLA genes (involved in immune system regulation) and TSBP1-AS1 genetic variants, indicating a strong adaptive immune response; and (ii) another defined by molecular signatures of innate immune pathways.

Leveraging the comprehensive SPARC IBD dataset, this study highlights how multi-omics integration can uncover novel biomarkers and define patient subgroups, laying the foundation for precision medicine in IBD care.

(iv) Cellular reference atlases

Public single-cell RNA-seq (scRNA-seq) datasets have revolutionized the creation of reference atlases, serving as foundational resources for understanding cellular diversity and tissue-specific biology. Two prominent examples are the Human Cell Atlas (HCA) and the Tabula Sapiens project, both of which rely on public datasets to map cellular landscapes with unprecedented resolution.

Human Cell Atlas (HCA): The HCA has utilized publicly available scRNA-seq data to map the cellular composition of human tissues, including tumor microenvironments, leading to significant advancements in cancer research. The open-access data available in HCA allows researchers worldwide to explore cellular heterogeneity without generating their own single-cell datasets, enabling high-resolution studies that reveal critical insights into immune dynamics and therapeutic targets. This resource has become a cornerstone for studying cellular diversity, supporting applications in oncology and beyond.

Tabula Sapiens: The Tabula Sapiens project leveraged publicly available single-cell RNA-seq data to create a comprehensive atlas of human cell types across 24 organs, providing a standardized reference for studying tissue-specific gene expression. This atlas has been used to investigate diseases like diabetes, revealing unique gene expression profiles in pancreatic cells that shed light on disease mechanisms and potential therapeutic targets.

Through the integration of heterogeneous datasets, Tabula Sapiens establishes a consistent and reproducible reference framework, facilitating cross-study comparisons and robust biological inference. The availability of such public datasets enables broad research applications, from understanding organ-specific biology to developing targeted therapies, demonstrating their value as foundational tools for omics research.

These cases highlight what is possible when researchers embrace the use of publicly available omics datasets. But how can you do it? What resources should you know about? Which tools can actually help? We will explore all of this in upcoming posts. Stay tuned!
