Genomics

Large-scale DNA datasets for building machine learning models.

Introduction

Genomics is changing how we understand health and disease by linking genetic variation to its biological and clinical consequences. AI models that analyze genetic data, identify patterns, and make predictions depend on large genomics datasets, which combine DNA sequences, gene expression data, and associated clinical annotations. Below is a curated list of some of the largest and most comprehensive genomics datasets available, selected for their size, diversity, and relevance to current research needs.
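As a rough illustration of how the DNA sequences in such datasets can become model inputs, the sketch below one-hot encodes a sequence string. The function name `one_hot_encode`, the `ACGT` column order, and the choice to map ambiguous bases such as `N` to all-zero rows are assumptions made for this example; none of the datasets listed here prescribes them.

```python
BASES = "ACGT"  # column order of the encoding (an assumption of this sketch)


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        column = index.get(base)
        if column is not None:
            row[column] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

The resulting rows can be stacked into a numeric array for whichever modeling library is in use. Real pipelines usually also have to handle variable sequence lengths and quality filtering, which this sketch leaves out.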

Datasets

| Dataset | Description | Size |
| --- | --- | --- |
| The Cancer Genome Atlas (TCGA) | A comprehensive dataset comprising genomic, epigenomic, transcriptomic, and proteomic data from multiple cancer types. | Over 11,000 patients across 33 cancer types |
| 1000 Genomes Project | A detailed catalog of human genetic variation, based on whole-genome sequencing of 2,504 individuals from 26 populations. | 2,504 whole genomes |
| Genotype-Tissue Expression (GTEx) Project | A dataset that provides comprehensive data on gene expression and regulation across multiple human tissues. | Over 20,000 tissue samples from 900 individuals |
| UK Biobank | A large-scale biomedical database containing in-depth genetic and health information from half a million UK participants. | 500,000 participants |
| Genome Aggregation Database (gnomAD) | A resource that aggregates and harmonizes exome and genome sequencing data from multiple large-scale sequencing projects. | 125,748 exomes and 15,708 genomes (v2.1 release) |
| All of Us Research Program | An extensive dataset aimed at gathering genetic data from diverse populations across the United States to advance precision medicine. | Over 1 million participants (target) |
| Human Genome Diversity Project (HGDP) | A dataset providing detailed genomic data from diverse human populations, used to study human genetic diversity. | Over 1,000 individuals from 51 populations |
| International Cancer Genome Consortium (ICGC) | A dataset that includes comprehensive genomic data from multiple cancer types, aimed at understanding cancer genomics. | Over 25,000 cancer genomes |
| Personal Genome Project (PGP) | A public resource of human genomic, environmental, and trait data from individuals who have consented to public data release. | Thousands of whole genomes |
| ENCODE (Encyclopedia of DNA Elements) Project | A project aimed at identifying all functional elements in the human genome, providing a comprehensive resource of genomic and epigenomic data. | Thousands of datasets |