1. ClinDet RNA-seq workflow

The RNA data analysis module of ClinDet comprises three key components: (1) quantification of gene or transcript isoform expression levels; (2) fusion gene detection and immune repertoire analysis; and (3) RNA-based mutation detection.

1.1. QC and preprocessing

In the pre-processing step, ClinDet processes FASTQ files in accordance with GATK best practices. Fastp is used to trim adapters and generate a sequencing report for each FASTQ file. The trimmed reads are then aligned to the reference genome using BWA-MEM, followed by deduplication and base quality score recalibration with GATK. For tumor-normal paired samples, ConPair is used to verify that both samples originate from the same individual, and quality control statistics files are generated with GATK.
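The sequence of steps above can be sketched as a list of command lines. This is a minimal illustration, not ClinDet's actual implementation: the function name, file naming scheme, thread count, and read-group values are assumptions, and only commonly used options of each tool are shown. ConPair is left out because the source does not describe how it is invoked.

```python
from pathlib import Path


def preprocessing_commands(sample, fq1, fq2, reference, known_sites,
                           outdir="work", threads=4):
    """Build the pre-processing steps as argument lists; nothing is executed."""
    out = Path(outdir)
    # Hypothetical output names, chosen only for this sketch.
    trim1 = str(out / f"{sample}_R1.trimmed.fastq.gz")
    trim2 = str(out / f"{sample}_R2.trimmed.fastq.gz")
    sam = str(out / f"{sample}.sam")
    sorted_bam = str(out / f"{sample}.sorted.bam")
    dedup_bam = str(out / f"{sample}.dedup.bam")
    metrics = str(out / f"{sample}.dup_metrics.txt")
    recal_table = str(out / f"{sample}.recal.table")
    final_bam = str(out / f"{sample}.recal.bam")
    # GATK tools expect a read group; bwa expands the literal "\t" itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # 1. Adapter trimming and per-sample QC report.
        ["fastp", "-i", fq1, "-I", fq2, "-o", trim1, "-O", trim2,
         "--json", str(out / f"{sample}.fastp.json"),
         "--html", str(out / f"{sample}.fastp.html")],
        # 2. Alignment to the reference genome.
        ["bwa", "mem", "-t", str(threads), "-R", read_group,
         "-o", sam, reference, trim1, trim2],
        # 3. Coordinate sorting.
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, sam],
        # 4. Duplicate marking.
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
         "-M", metrics],
        # 5-6. Base quality score recalibration.
        ["gatk", "BaseRecalibrator", "-I", dedup_bam, "-R", reference,
         "--known-sites", known_sites, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-I", dedup_bam, "-R", reference,
         "--bqsr-recal-file", recal_table, "-O", final_bam],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` in order; returning the commands instead of running them keeps the plan easy to inspect or log first.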

1.2. RNA-seq (Gene expression)

ClinDet employs three software tools, each based on a distinct algorithm, to quantify gene and transcript expression levels. Kallisto and Salmon offer rapid processing, whereas RSEM is comparatively slower but yields more accurate results; users may select the strategy that best fits their requirements.
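Kallisto, Salmon, and RSEM all report expression in transcripts per million (TPM) alongside estimated counts. The sketch below shows the standard TPM normalization from estimated counts and effective lengths; it is an illustration of the unit, not code taken from ClinDet or from any of the three tools, and the function name is hypothetical.

```python
def tpm(counts, effective_lengths):
    """Convert estimated read counts to transcripts per million (TPM)."""
    if len(counts) != len(effective_lengths):
        raise ValueError("counts and effective_lengths must have equal length")
    if any(length <= 0 for length in effective_lengths):
        raise ValueError("effective lengths must be positive")
    # Length-normalized rate for each transcript.
    rates = [c / length for c, length in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that all values sum to one million.
    return [rate / total * 1e6 for rate in rates]
```

For example, two transcripts of equal effective length with 10 and 20 estimated reads receive TPM values in a 1:2 ratio that sum to one million.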

1.3. RNA-seq (Fusion gene detection)

For fusion gene detection in tumor samples, we selected Arriba, which demonstrated the highest accuracy in the DREAM SMC-RNA Challenge. To enhance computational efficiency, ClinDet optimizes the two-pass mode of STAR to reduce sequence alignment runtime by manually removing rare junctions. Additionally, ClinDet utilizes TRUST4 to analyze the clonality of immune cells (T cells and B cells) in RNA-seq data; this approach facilitates the identification of monoclonality in certain tumor samples, such as those from multiple myeloma, thereby enabling assessments of tumor purity.
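The idea of removing rare junctions between the two STAR passes can be sketched as a filter over the first-pass `SJ.out.tab` file. The source does not state ClinDet's actual criteria, so the rule used here (keep annotated junctions, and novel junctions supported by at least `min_unique_reads` uniquely mapped reads) and the default threshold are assumptions for illustration only.

```python
def filter_junctions(lines, min_unique_reads=3):
    """Drop rarely supported novel junctions from STAR SJ.out.tab lines.

    SJ.out.tab columns used here (tab-separated):
      6th column: 1 if the junction is annotated, 0 otherwise
      7th column: number of uniquely mapped reads crossing the junction
    """
    kept = []
    for line in lines:
        record = line.rstrip("\n")
        fields = record.split("\t")
        if len(fields) < 9:
            continue  # skip malformed or empty lines
        annotated = fields[5] == "1"
        unique_reads = int(fields[6])
        if annotated or unique_reads >= min_unique_reads:
            kept.append(record)
    return kept
```

The filtered junction list could then be supplied to the second alignment pass, for example through STAR's `--sjdbFileChrStartEnd` option, so that fewer junctions have to be inserted into the index.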

1.4. RNA-seq-based variant calling (SNVs/Indels)

In the mutation detection module, FASTQ files are aligned to the reference genome using STAR with parameters adapted from the published paper. Subsequent preprocessing follows GATK best practices for RNA mutations (with optional acceleration via Sentieon). Mutations are then detected from the STAR-generated BAM files using three software tools in tumor-only mode. ClinDet annotates all RNA editing sites cataloged in relevant databases, allowing users to decide whether to exclude them in downstream analyses. Finally, ClinDet uses vcf2maf for variant annotation and produces a consensus mutation detection MAF file. Although numerous studies have explored mutation detection from RNA-seq, its standalone utility remains controversial because of potential false positives introduced by RNA editing, post-transcriptional modifications, and elevated reverse transcription error rates. We therefore recommend that users integrate DNA-seq results from the same sample for joint mutation analysis, or prioritize candidate variants at hotspot positions in cancer driver genes.
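Annotating known RNA editing sites without discarding them can be sketched as a positional lookup. This is a simplified illustration, not ClinDet's code: the function name, the `known_rna_editing` field, and the representation of a database as a set of `(chrom, pos)` pairs are all assumptions.

```python
def annotate_rna_editing(variants, editing_sites):
    """Flag variants located at known RNA editing positions.

    variants: iterable of dicts with at least 'chrom' and 'pos' keys.
    editing_sites: set of (chrom, pos) tuples loaded from an editing database.
    Variants are only flagged, never removed, so filtering stays optional.
    """
    annotated = []
    for variant in variants:
        record = dict(variant)  # copy so the input is left unchanged
        key = (variant["chrom"], variant["pos"])
        record["known_rna_editing"] = key in editing_sites
        annotated.append(record)
    return annotated
```

A downstream step that wants to exclude editing sites can then keep only the records where `known_rna_editing` is false.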

1.4.1. GATK processing

1.4.2. SNV, MNV, and indel calling

HaplotypeCaller, Mutect2, Strelka2, VarScan, Pindel, LoFreq, VarDict

  • HaplotypeCaller, part of the Genome Analysis Toolkit (GATK), is utilized for variant calling to identify single-nucleotide polymorphisms (SNPs) and indels through local de-novo assembly of haplotypes in active genomic regions. It processes aligned reads in BAM or CRAM formats to produce VCF files, often as part of best-practice workflows following base quality score recalibration and prior to variant filtration, emphasizing accuracy in complex variation regions.

  • Mutect2, part of the Genome Analysis Toolkit (GATK) version 4 and later, is a somatic variant caller that uses a Bayesian model to detect SNVs and small indels in tumor samples, either by comparison with a matched normal sample or in tumor-only mode. It is designed for high sensitivity and specificity in cancer genomics.

  • Strelka2 is a small variant caller optimized for detecting germline variation in small cohorts and somatic variation in tumor-normal pairs. Its germline caller uses a tiered haplotype model, and indel error rates are estimated with a mixture model for improved accuracy. It accepts BAM or CRAM inputs and outputs VCF 4.1 files, with features such as read-backed phasing and empirical variant re-scoring; for best somatic indel performance, pairing it with the Manta structural variant caller is recommended.

  • VarScan is a platform-independent tool for variant detection in next-generation sequencing data, compatible with platforms like Illumina and Roche/454, for targeted, exome, or whole-genome resequencing. It applies heuristic and statistical thresholds for read depth, base quality, and allele frequency to identify variants in complex samples, including those with contamination, and is developed in Java for broad operating system compatibility.

  • Pindel, as packaged in the Cancer Genome Project’s cgpPindel pipeline, is designed for detecting insertions, deletions, and structural variants from tumor and normal BAM alignments. The pipeline converts Pindel text output to VCF and applies filters, with CGP-specific modifications for enhanced analysis.

  • LoFreq is a sensitive and fast variant caller designed for detecting low-frequency single-nucleotide variants (SNVs) and indels in heterogeneous samples, such as viral populations or cancer genomes, using a Poisson-binomial distribution to model sequencing errors. It excels in identifying rare variants below typical detection thresholds, processes BAM files efficiently without requiring matched normals, and includes features like strand-bias filtering and parallelization for scalability in high-throughput analyses.

  • VarDict is a variant discovery program, initially developed in Perl and ported to Java, designed for sensitive variant calling from BAM files in next-generation sequencing, particularly for cancer genomics. Its primary purpose is to detect single nucleotide variants (SNVs), insertions, deletions, and structural variants in both single and paired sample analyses. Key features include amplicon bias awareness for targeted sequencing, rescue of long indels through realignment of soft-clipped reads, and improved scalability, with the Java port being approximately 10x faster than the original Perl implementation. It supports various modes such as single sample, paired sample, and amplicon-based calling, utilizing inputs like reference genomes in FASTA format, aligned reads in BAM format, and target regions in BED format. Applications are prominent in cancer research, facilitating the identification of somatic mutations and other genomic alterations.
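To make the Poisson-binomial error model mentioned for LoFreq more concrete, the sketch below computes the probability of seeing at least `k` erroneous bases in a pileup column when every base has its own error probability derived from its Phred quality. It illustrates the statistical idea only; the function names are hypothetical, and the real tool is considerably more elaborate.

```python
def phred_to_prob(quality):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-quality / 10)


def poisson_binomial_tail(error_probs, k):
    """Return P(X >= k), where X counts errors among independent bases.

    error_probs: per-base error probabilities (they may all differ).
    Uses the exact dynamic-programming recurrence over the bases.
    """
    # dist[j] = probability of exactly j errors among the bases seen so far
    dist = [1.0]
    for p in error_probs:
        updated = [0.0] * (len(dist) + 1)
        for j, prob in enumerate(dist):
            updated[j] += prob * (1.0 - p)   # this base is correct
            updated[j + 1] += prob * p       # this base is an error
        dist = updated
    return sum(dist[k:])
```

A small tail probability for the observed number of non-reference bases means sequencing error alone is an unlikely explanation, which is the kind of evidence a low-frequency caller needs before reporting a variant.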

1.4.3. Annotation of mutations

The variant calls (in VCF format) are filtered and annotated with vcf2maf to produce MAF files. Because vcf2maf relies on VEP for annotation, users can add VEP plugins.
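A tumor-only vcf2maf invocation can be assembled as follows. This is a hedged sketch: the function name, script location, and default genome build are assumptions, and only a handful of commonly used `vcf2maf.pl` options are shown rather than ClinDet's actual settings.

```python
def vcf2maf_command(input_vcf, output_maf, tumor_id, ref_fasta,
                    vep_path, vep_data, ncbi_build="GRCh38",
                    script="vcf2maf.pl"):
    """Build a vcf2maf command as an argument list; nothing is executed."""
    return [
        "perl", script,
        "--input-vcf", input_vcf,
        "--output-maf", output_maf,
        "--tumor-id", tumor_id,      # sample name written to the MAF
        "--ref-fasta", ref_fasta,    # reference used for alignment
        "--vep-path", vep_path,      # directory containing the VEP script
        "--vep-data", vep_data,      # VEP cache directory
        "--ncbi-build", ncbi_build,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once per caller output, yielding one MAF file per caller.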

1.4.4. Consensus results from multiple software tools

The MAF-format outputs from all SNV callers are then processed by a custom R script to generate consensus mutation detection results. For somatic structural variations, ClinDet employs five software tools for detection; to obtain consensus results, Jasmine merges structural variation events that share the same orientation and whose breakpoint positions lie within 500 bp of each other. For copy number variations, ClinDet uses seven software tools for detection and organizes the final results into segment-format files. Users can select specific tools for subsequent analyses according to their requirements.
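The two consensus ideas described above can be sketched in a few lines. This is not the custom R script or Jasmine's algorithm: the voting rule, the `min_callers` default, the variant key layout, and the dictionary fields used for structural variants are assumptions made for illustration.

```python
from collections import defaultdict


def consensus_variants(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    calls_by_caller: dict mapping caller name to an iterable of variant
    keys, e.g. (chrom, pos, ref, alt) tuples.
    Returns a dict mapping each retained key to its sorted supporting callers.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for key in variants:
            support[key].add(caller)
    return {key: sorted(callers)
            for key, callers in support.items()
            if len(callers) >= min_callers}


def same_sv_event(sv_a, sv_b, max_distance=500):
    """Apply the stated merge criterion to two structural variant calls.

    Each call is a dict with 'chrom', 'start', 'end', and 'orientation'.
    Calls match when the orientation agrees and both breakpoints lie
    within `max_distance` bp of each other.
    """
    return (sv_a["chrom"] == sv_b["chrom"]
            and sv_a["orientation"] == sv_b["orientation"]
            and abs(sv_a["start"] - sv_b["start"]) <= max_distance
            and abs(sv_a["end"] - sv_b["end"]) <= max_distance)
```

Raising `min_callers` trades sensitivity for specificity, which matters for RNA-seq calls where caller-specific artifacts are common.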