3. Use case III: Whole genome sequencing of COLO829 cell line#

3.1. Background#

Whole-genome sequencing (WGS) is increasingly being adopted in clinical oncology, providing a comprehensive view of the tumor genetic landscape. Compared to whole-exome sequencing (WES), this approach enables the detection of a broader range of disease-associated genetic alterations, including mutations in non-coding regions and structural variants (SVs). Such comprehensive profiling can facilitate deeper insights into cancer evolution and inform the development of personalized therapeutic strategies. However, WGS generates extensive datasets, necessitating robust bioinformatics pipelines for their analysis. To demonstrate how high-quality, validated somatic variants can be identified from WGS data, we employed Clindet to analyze a metastatic melanoma cell line (COLO829) and its matched normal lymphoblastoid cell line (COLO829BL), for which a “truth set” of copy number variants (CNVs) and structural variants (SVs) had been established.

COLO829 cell line

COLO829

3.2. Setup a project folder#

Note

Before starting the analysis, please ensure that you have set up the analysis environment using the build_conda_env.sh script.

Create a folder named project/WGS in your home directory and activate the Clindet conda environment.

mkdir -p ~/projects/COLO829_WGS
cd ~/projects/COLO829_WGS
conda activate clindet

3.3. Download data and setup a samplesheet.csv#

Your can download WGS fastq file (~249G) from HMFtools Resources

cd ~/projects/COLO829_WGS
mkdir -p data && cd data
conda activate gsutil
gsutil -m cp \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L001_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L001_R2_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L002_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L002_R2_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L003_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L003_R2_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L004_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003R_AHHKYHDSXX_S13_L004_R2_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L001_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L001_R2_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L002_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L002_R2_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L003_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L003_R2_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L004_R1_001.fastq.gz" \
  "gs://hmf-public/HMFtools-Resources/test_data/COLO829v003T/fastq/COLO829v003T_AHHKYHDSXX_S12_L004_R2_001.fastq.gz" \
  .

then you need merge those fastq.gz files (eg. cat) to :

  • COLO829v003R_R1.fastq.gz (~26G)

  • COLO829v003R_R2.fastq.gz (~27G)

  • COLO829v003T_R1.fastq.gz (~96G)

  • COLO829v003T_R2.fastq.gz (~100G)

Next, create a CSV file named pipe_wgs.csv in the ~/projects/COLO829_WGS directory with the following content:

Tumor_R1_file_path,Tumor_R2_file_path,Normal_R1_file_path,Normal_R2_file_path,Sample_name,Project
/AbsoPath/of/projects/COLO829_WGS/data/COLO829v003T_R1.fastq.gz,/AbsoPath/of/projects/COLO829_WGS/data/COLO829v003T_R2.fastq.gz,/AbsoPath/of/projects/COLO829_WGS/data/COLO829v003R_R1.fastq.gz,/AbsoPath/of/projects/COLO829_WGS/data/COLO829v003R_R2.fastq.gz,COL0829,WGS

3.4. Write an Snakemake file from template#

For this project, modify the sample sheet and create a new Snakemake file named snake_wgs.smk (see below). Set the following parameters in the Snakemake file:

  1. configfile (str): config file for softwares and resource parameters.

  2. stage (list): analysis steps. avaiable options:['conpair','report']

  3. sample_csv (str): sample_info.csv path

  4. genome_version (str): genome version, can be genome version which setup in cofig.yaml eg. b37

  5. recal (Boolean): use GATK BaseRecalibrator for Base Quality Score Recalibration? this is a time-consume task but can slightly improved calling accuracy. True or False. default True.

  6. caller_list (list): somatic mutation calling softwares used. avaiable options:['sage','HaplotypeCaller','strelkasomaticmanta','cgppindel_filter','caveman','muse','deepvariant','Mutect2_filter']

  7. germ_call_list (list): germline mutation calling softwares used. avaiable options: ['strelkamanta','caveman']

  8. somatic_cnv_list (list): Somatic Copy Number variant calling softwares. avaiable options: ['purple','ASCAT','sequenza','freec','exomedepth']

  9. recall_pon (Boolean): call panel-of-normal mutect2_pon.vcf from cohort normal samples. default False, use public available resource.

  10. custome_pon_db (Boolean): use public available resource. default False.

  11. recall_pon_pindel (Boolean): call panel-of-normal pindel_pon.vcf from cohort normal samples. default False, use public available resource.

3.5. write Snakemake file#

For this project, we need change the sample sheet info.

3.6. Run clindet#

There is two way you can run clindet

  1. run on a local server

  2. submit to HPC through slurm

3.6.1. Run on local node#

nohup snakemake -j 30 --printshellcmds -s snake_wgs.smk \
--use-singularity --singularity-args "--bind /your/home/path:/your/home/path" \
--latency-wait 300 --use-conda >> wgs.log

3.6.2. Submit to HPC use slurm#

we provide a slurm config.yaml under clindet/workflow/config_slurm folder.

nohup snakemake --profile workflow/config_slurm \
-j 30 --printshellcmds -s snake_wgs.smk --use-singularity \
--singularity-args "--bind /your/home/path:/your/home/path" \
--latency-wait 300 --use-conda >> wgs.log

3.7. Results#

After successful execution, you will see the following directory structure. The cnv folder contains the copy number variants detection results, the sv folder contains the structural variants detection results, and the vcf/maf folders contain mutaions (annotated and raw) results.

3.7.1. Overview of outputs#

~/projects/WGS/b37/results
├── cnv ** Copy Number variants results**
│   └── paired ** For tumor-normal paired sample**
│       ├── ascat
│       └── purple
├── dedup ** deduplication BAM files**
│   └── paired
├── maf ** annotation somatic mutation MAF files**
│   └── paired
│       └── COL0829
├── qc ** QC results for fastp conpair and so on**
│   └── dedup
│       └── paired
├── recal ** BAM files after base recalibration**
│   └── paired
├── sv ** Structural Variant results**
│   └── paired
│       ├── BRASS
│       ├── DELLY
│       ├── gridss
│       ├── linx
│       └── svaba
├── vcf
   └── paired
       ├── COL0829
       └── HG008

3.7.2. Copy number variants of COLO829#

To assess the consistency among different software tools in representing the genomic content of the COLO829 cancer cell line, we evaluated the presence of CNVs and SVs. Regarding copy number variation, the four software tools generated similar estimates of tumor purity and ploidy.

The karyotype of the COLO829 cell line

COLO820 CNV karyotype

Copy number results of COLO829

COLO820 CNV karyotype

Furthermore, low-resolution copy number alteration (CNA) analysis revealed highly consistent copy number profiles across external “true set” software tools except for Facets, with correlation coefficients of 0.76–0.98 among different datasets.

COLO820 CNV compare

3.7.3. Structural variants of COLO829#

Due to the absence of established benchmarks and best-practice protocols for somatic SV detection, the primary aim of this study was to establish an analytical workflow rather than to benchmark SV calling tools. Consequently, we selected optimal mapping and SV calling tools based on current best practices and available knowledge. SV calling parameters were optimized for high sensitivity rather than maximum precision to minimize the risk of missing genuine events. Compared to the previously established “truth set” from XX et al., the candidate somatic SV calls produced by individual software tools showed considerable variability, ranging from 115 breakpoints detected by GRIDSS to 27,285 by DELLY, resulting in a combined total of 28,233 merged SV calls. Among these tools, DELLY generated a relatively high number of false-positive calls. The results indicated that integrating outputs from multiple SV detection tools substantially reduced the number of false-positive predictions.

COLO820 SV compare