Genomics

Large-scale DNA datasets for building machine learning models.

Introduction

The field of genomics is revolutionizing our understanding of health and disease by providing comprehensive insights into genetic variations and their implications. Genomics datasets are essential for the development of AI models that can analyze complex genetic data, identify patterns, and make accurate predictions. These datasets contain extensive genetic information, including DNA sequences, gene expression data, and associated clinical annotations. Leveraging these datasets allows researchers and developers to create advanced AI algorithms that can enhance genomic research and improve healthcare outcomes. Below is a curated list of some of the largest and most comprehensive genomics datasets available, selected for their size, diversity, and relevance to current research needs.
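As a rough illustration of how raw DNA sequences from datasets like these might be prepared as model input, the sketch below one-hot encodes a sequence using only the Python standard library. The function name `one_hot_encode` and the choice to map unknown bases (such as `N`) to an all-zero row are assumptions made for this example; none of the listed projects prescribes this encoding.

```python
# Hypothetical helper (not part of any listed dataset's tooling):
# one-hot encode a DNA sequence so it can be fed to a numeric model.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element row per base, in A, C, G, T order.

    Assumption for this sketch: any character outside A/C/G/T
    (for example the ambiguity code N) becomes an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

In practice, a real pipeline would likely read sequences from standard file formats and use an array library rather than nested lists; the nested-list form is kept here only to stay dependency-free.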

Datasets

| Dataset | Description | Size |
| --- | --- | --- |
| The Cancer Genome Atlas (TCGA) | A comprehensive dataset comprising genomic, epigenomic, transcriptomic, and proteomic data from multiple cancer types. | Over 11,000 patients across 33 cancer types |
| 1000 Genomes Project | A detailed catalog of human genetic variation, including deep sequencing of 2,504 individuals from 26 populations. | 2,504 whole genomes |
| Genotype-Tissue Expression (GTEx) Project | A dataset that provides comprehensive data on gene expression and regulation across multiple human tissues. | Over 20,000 tissue samples from 900 individuals |
| UK Biobank | A large-scale biomedical database containing in-depth genetic and health information from half a million UK participants. | 500,000 participants |
| Genome Aggregation Database (gnomAD) | A resource that aggregates and harmonizes exome and genome sequencing data from multiple large-scale sequencing projects. | Over 140,000 exomes and 15,000 genomes |
| All of Us Research Program | An extensive dataset aimed at gathering genetic data from diverse populations across the United States to advance precision medicine. | Over 1 million participants (target) |
| Human Genome Diversity Project (HGDP) | A dataset providing detailed genomic data from diverse human populations, used to study human genetic diversity. | Over 1,000 individuals from 51 populations |
| International Cancer Genome Consortium (ICGC) | A dataset that includes comprehensive genomic data from multiple cancer types, aimed at understanding cancer genomics. | Over 25,000 cancer genomes |
| Personal Genome Project (PGP) | A public resource of human genomic, environmental, and trait data from individuals who have consented to public data release. | Thousands of whole genomes |
| ENCODE (Encyclopedia of DNA Elements) Project | A project aimed at identifying all functional elements in the human genome, providing a comprehensive resource of genomic and epigenomic data. | Thousands of datasets |