1. Clinical Omics Benchmark Datasets
To systematically evaluate computational tools for clinical molecular diagnostics, we propose to construct a gold-standard benchmark dataset by integrating multiple high-quality public cancer genomics resources. As illustrated in the figure, diverse datasets (including TCGA, CCLE, COLO829, ICGC, CPTAC, HCMI, CGIB, and GIAB) serve as foundational “training equipment,” collectively supporting the development and validation of robust analytical workflows.
These datasets span several sequencing modalities (WES, WGS, and RNA-seq) and variant types, including single nucleotide variants (SNVs), insertions and deletions (indels), structural variants (SVs), copy number variations (CNVs), and gene fusions. By leveraging their complementary characteristics, such as tumor-normal paired samples, deeply characterized cell lines, and high-confidence reference genomes, we aim to curate a harmonized dataset with well-defined ground truth annotations.
A standardized quality control and data processing pipeline will be applied to ensure consistency across datasets, including alignment, variant calling, filtering, and orthogonal validation where applicable. The resulting benchmark dataset will enable systematic evaluation of tool performance in terms of sensitivity, specificity, precision, and robustness across diverse genomic contexts.
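The evaluation metrics named above can be illustrated with a minimal sketch. It assumes that truth and call sets have already been normalized so that variants can be matched exactly on chromosome, position, reference allele, and alternate allele; the names `Variant`, `Metrics`, and `compare_calls` are hypothetical and not part of any tool listed in this document. Specificity is left out because it requires a count of true negatives, which depends on how the assessed region is defined.

```python
from typing import NamedTuple, Set, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    sensitivity: float  # also called recall
    precision: float
    f1: float


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compare_calls(truth: Set[Variant], calls: Set[Variant]) -> Metrics:
    """Compare a call set against a truth set by exact variant matching."""
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but not called
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return Metrics(tp, fp, fn, sensitivity, precision, f1)


if __name__ == "__main__":
    truth = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "T", "TA")}
    calls = {("chr1", 100, "A", "T"), ("chr2", 50, "T", "TA"), ("chr3", 10, "C", "G")}
    print(compare_calls(truth, calls))
```

Exact matching is a simplification: equivalent indels can be represented in more than one way, so a real comparison would normalize both sets first.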
Ultimately, this resource is intended to serve as a community reference for benchmarking clinical genomics tools, facilitating fair comparison between methods and promoting the development of more accurate and reliable computational approaches for precision oncology applications. This document collects publicly available datasets for benchmarking NGS-based tumor detection tools.
1.1. Supported sequencing types
Whole Exome Sequencing (WES)
Whole Genome Sequencing (WGS)
RNA Sequencing (RNA-seq)
1.2. Supported variant types
SNV / Indel
Structural Variants (SV)
Copy Number Variations (CNV)
Gene Fusions
1.3. Dataset Summary
| Dataset Name | Data Type | Variant Type | Sample Type | Ground Truth | Access | Notes |
|---|---|---|---|---|---|---|
| TCGA | WES/WGS/RNA | SNV, CNV, Fusion | Tumor | Partial | Controlled | Large cohort |
| ICGC | WGS/WES | SNV, SV, CNV | Tumor | Yes | Controlled | International |
| GIAB | WGS | SNV, Indel | Germline | High-confidence | Open | Gold standard |
| CGIB | WGS/WES/long-read/Hi-C | SNV, Indel | Somatic | High-confidence | Open | Gold standard |
| SEQC2 | WES/WGS | SNV, CNV | Synthetic | Yes | Open | Benchmark focused |
| PCAWG | WGS | SNV, SV, CNV | Tumor | Yes | Controlled | Deep annotation |
| CCLE | WES/RNA | SNV, CNV, Fusion | Cell line | Partial | Open | Cancer cell lines |
| COLO829 | WES/WGS/RNA | SNV, CNV, SV | Cell line | Partial | Open | Cancer cell lines |
| Somatic reference standards from BostonGene | WES/WGS/RNA | SNV, CNV, SV | Cell line | Partial | Open | Cancer cell lines |
1.4. Dataset Details
1.4.1. The Cancer Genome Atlas
Website: https://portal.gdc.cancer.gov/
Data Type: WES, WGS (limited), RNA-seq
Variant Types: SNV, Indel, CNV, Fusion
Sample Type: Tumor / Normal pairs
Ground Truth: Partial (validated subsets)
1.4.1.1. Access
Open and controlled tiers (controlled-access data requires dbGaP authorization)
GDC Data Portal / API
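As an illustration of API access, the sketch below builds a query for the GDC `files` endpoint that lists open-access files for one project. The filter structure follows the GDC API's JSON filter syntax, but the specific field names, the `TCGA-BRCA` project, and the `MAF` format are example choices and should be checked against the current GDC API documentation. The function names `build_file_query` and `fetch_files` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_file_query(project_id, data_format="MAF", access="open", size=10):
    """Build query parameters for the GDC files endpoint (no network access)."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id",
                                     "value": [project_id]}},
            {"op": "in", "content": {"field": "data_format",
                                     "value": [data_format]}},
            {"op": "in", "content": {"field": "access",
                                     "value": [access]}},
        ],
    }
    return {
        "filters": json.dumps(filters),   # filters are passed as a JSON string
        "fields": "file_id,file_name,data_format,access",
        "format": "JSON",
        "size": str(size),
    }


def fetch_files(params):
    """Send the query and return the list of matching file records."""
    url = f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"
    with urlopen(url, timeout=30) as response:
        return json.load(response)["data"]["hits"]


if __name__ == "__main__":
    # Requires network access; prints one line per open-access file found.
    for record in fetch_files(build_file_query("TCGA-BRCA")):
        print(record["file_id"], record["file_name"])
```

Controlled-access files additionally need an authentication token obtained after dbGaP authorization, which this sketch does not cover.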
1.4.1.2. Recommended Usage
SNV/Indel benchmarking
RNA fusion detection
CNV analysis
1.4.1.3. Notes
Not a strict ground truth dataset
Requires harmonization
1.4.2. International Cancer Genome Consortium
Website: https://dcc.icgc.org/
Data Type: WGS, WES
Variant Types: SNV, SV, CNV
Sample Type: Tumor
Ground Truth: Curated calls
1.4.2.1. Access
Controlled access (DACO approval)
1.4.2.2. Recommended Usage
Structural variant benchmarking
Cross-cohort validation
1.4.3. Genome in a Bottle
Website: https://www.nist.gov/programs-projects/genome-bottle
Data Type: WGS
Variant Types: SNV, Indel
Sample Type: Germline reference
Ground Truth: High-confidence regions
1.4.3.1. Download
1.4.3.2. Recommended Usage
SNV/Indel caller benchmarking
Precision/recall evaluation
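Because the ground truth is defined only inside high-confidence regions, precision and recall are normally computed on variants that fall within those regions. The sketch below shows one way to apply such a restriction, assuming the regions come from a BED file (0-based start, end-exclusive) and variant positions are 1-based as in VCF. The function names are hypothetical, and only the variant's start position is tested, which is a simplification for indels that span a region boundary.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # BED record: chrom, 0-based start, exclusive end


def index_regions(regions: Iterable[Interval]) -> Dict[str, List[Tuple[int, int]]]:
    """Group intervals by chromosome, sort them, and merge overlaps."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:           # overlapping or adjacent: extend
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_confident_region(index: Dict[str, List[Tuple[int, int]]],
                        chrom: str, pos_1based: int) -> bool:
    """Return True if a 1-based position lies inside any indexed interval."""
    spans = index.get(chrom, [])
    pos0 = pos_1based - 1                   # convert VCF position to BED coordinates
    # Locate the last interval whose start is <= pos0.
    i = bisect_right(spans, (pos0, float("inf"))) - 1
    return i >= 0 and spans[i][0] <= pos0 < spans[i][1]


if __name__ == "__main__":
    index = index_regions([("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 10)])
    variants = [("chr1", 101), ("chr1", 100), ("chr1", 301), ("chr3", 5)]
    kept = [v for v in variants if in_confident_region(index, *v)]
    print(kept)
```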
1.4.3.3. Notes
Not tumor data
Limited SV/CNV truth sets
1.4.4. Cancer Genome in a Bottle
Website: https://www.nist.gov/programs-projects/cancer-genome-bottle
Data Type: WGS, WES, Hi-C
Variant Types: SNV, Indel
Sample Type: Tumor-normal
Ground Truth: High-confidence regions
1.4.4.1. Download
1.4.4.2. Recommended Usage
SNV/Indel caller benchmarking
Precision/recall evaluation
1.4.4.3. Notes
Real-world tumor-normal paired data
Limited SV/CNV truth sets
1.4.5. MAQC Consortium (SEQC2)
Website: https://www.fda.gov/science-research/bioinformatics-tools/seqc2
Data Type: WES, WGS
Variant Types: SNV, CNV
Sample Type: Synthetic / reference mixtures
Ground Truth: Known spike-ins
1.4.5.1. Download
1.4.5.2. Recommended Usage
Sensitivity benchmarking
Low VAF detection
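Low-VAF detection is usually assessed by stratifying sensitivity by variant allele fraction (VAF), where VAF is the number of alternate-allele reads divided by total reads at the site. The sketch below is one way to do this, assuming each truth variant carries an expected VAF; the bin edges, identifiers, and function names are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Optional, Sequence, Set, Tuple


def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele fraction: alternate-allele reads over total reads."""
    return alt_reads / total_reads if total_reads else 0.0


def sensitivity_by_vaf_bin(
    truth: Iterable[Tuple[str, float]],
    detected: Set[str],
    bin_edges: Sequence[float] = (0.0, 0.05, 0.10, 0.25, 1.0),
) -> List[Tuple[Tuple[float, float], int, int, Optional[float]]]:
    """Return (bin, truth count, detected count, sensitivity) for each VAF bin.

    Bins are half-open [low, high), except the last, which includes its upper edge.
    Sensitivity is None for bins that contain no truth variants.
    """
    bins = list(zip(bin_edges[:-1], bin_edges[1:]))
    totals = [0] * len(bins)
    found = [0] * len(bins)

    for variant_id, expected_vaf in truth:
        for i, (low, high) in enumerate(bins):
            is_last = i == len(bins) - 1
            if low <= expected_vaf < high or (is_last and expected_vaf == high):
                totals[i] += 1
                if variant_id in detected:
                    found[i] += 1
                break

    rows = []
    for bounds, total, hit in zip(bins, totals, found):
        rows.append((bounds, total, hit, hit / total if total else None))
    return rows


if __name__ == "__main__":
    truth = [("v1", 0.02), ("v2", 0.04), ("v3", 0.08), ("v4", 0.5), ("v5", 1.0)]
    detected = {"v2", "v3", "v4", "v5"}
    for bounds, total, hit, rate in sensitivity_by_vaf_bin(truth, detected):
        print(bounds, total, hit, rate)
```

Reporting the truth count next to each rate makes it clear when a bin holds too few variants for its sensitivity to be meaningful.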
1.4.6. Pan-Cancer Analysis of Whole Genomes
Website: https://dcc.icgc.org/pcawg
Data Type: WGS
Variant Types: SNV, SV, CNV
Sample Type: Tumor / Normal
Ground Truth: Consensus calls
1.4.6.1. Access
Controlled access (ICGC portal)
1.4.6.2. Recommended Usage
SV detection benchmarking
Pan-cancer analysis
1.4.7. Cancer Cell Line Encyclopedia
Website: https://depmap.org/portal/
Data Type: WES, RNA-seq
Variant Types: SNV, CNV, Fusion
Sample Type: Cell lines
Ground Truth: Partial
1.4.7.1. Download
Open access via DepMap
1.4.7.2. Recommended Usage
Fusion detection
CNV benchmarking
Reproducibility testing
1.4.8. BostonGene
Data Type: WES, RNA-seq
Variant Types: SNV, CNV, Fusion
Sample Type: Cell lines
Ground Truth: Partial
1.4.8.1. Download
Open access via SRA
1.4.8.2. Recommended Usage
Fusion detection
CNV benchmarking
Reproducibility testing
1.5. Specialized Benchmark Datasets
1.5.1. DREAM Challenge Datasets
Website: https://www.synapse.org/
Focus: Somatic mutation calling
Ground Truth: Simulated and validated
1.5.2. Synthetic Datasets
| Dataset | Description | Link |
|---|---|---|
| BAMSurgeon | Spike-in mutations | https://github.com/adamewing/bamsurgeon |
| VarSim | Variant simulation | https://github.com/bioinform/varsim |
1.6. Metadata Schema
Each dataset entry should follow the schema below:
Dataset Name:
Data Type:
Variant Type:
Sequencing Platform:
Read Length:
Coverage:
Sample Size:
Tumor Type:
Matched Normal:
Ground Truth Type:
Download Link:
Access Type:
License:
Contact:
Last Updated:
Contributor:
Validation Status:
Benchmark Category:
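The schema above could be captured as a typed record so that entries can be checked automatically. The following is a minimal sketch, not a prescribed format: the Python field types, the split between required and optional fields, and the allowed values (taken from sections 1.1 and 1.2 and from the Access column of the summary table) are all assumptions.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# Allowed values, assumed from the lists in sections 1.1 and 1.2.
SUPPORTED_DATA_TYPES = {"WES", "WGS", "RNA-seq"}
SUPPORTED_VARIANT_TYPES = {"SNV", "Indel", "SV", "CNV", "Fusion"}
ACCESS_TYPES = {"Open", "Controlled"}


@dataclass
class DatasetEntry:
    # Fields treated as required in this sketch.
    dataset_name: str
    data_type: List[str]
    variant_type: List[str]
    access_type: str
    ground_truth_type: str
    # Remaining schema fields, optional until curated.
    download_link: Optional[str] = None
    sequencing_platform: Optional[str] = None
    read_length: Optional[int] = None
    coverage: Optional[float] = None
    sample_size: Optional[int] = None
    tumor_type: Optional[str] = None
    matched_normal: Optional[bool] = None
    license: Optional[str] = None
    contact: Optional[str] = None
    last_updated: Optional[str] = None
    contributor: Optional[str] = None
    validation_status: Optional[str] = None
    benchmark_category: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty if none are found."""
        problems = []
        if not self.dataset_name.strip():
            problems.append("dataset_name is empty")
        for value in self.data_type:
            if value not in SUPPORTED_DATA_TYPES:
                problems.append(f"unrecognized data type: {value}")
        for value in self.variant_type:
            if value not in SUPPORTED_VARIANT_TYPES:
                problems.append(f"unrecognized variant type: {value}")
        if self.access_type not in ACCESS_TYPES:
            problems.append(f"unrecognized access type: {self.access_type}")
        return problems


if __name__ == "__main__":
    # Values taken from the GIAB row of the summary table.
    giab = DatasetEntry(
        dataset_name="GIAB",
        data_type=["WGS"],
        variant_type=["SNV", "Indel"],
        access_type="Open",
        ground_truth_type="High-confidence",
    )
    print(len(fields(DatasetEntry)), giab.validate())
```

The validator reports unrecognized values instead of raising an error, so entries with additional data types such as long-read or Hi-C can still be recorded and reviewed.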