2. Setup Clindet
2.1. Prerequisites
2.1.1. System Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| Disk space | ~200 GB | 500 GB+ |
| RAM | 32 GB | 64 GB+ |
| CPU cores | 8 | 20+ |
| OS | Linux | Linux |
Disk space note: The reference genome setup alone downloads approximately 170 GB of files. Ensure you have sufficient free space before starting.
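Before starting the download, it can help to confirm how much space is free on the target filesystem. The following is a minimal sketch and not part of Clindet: the helper names `free_gb` and `has_free_gb` are hypothetical, the 200 GB threshold is simply the minimum from the table above, and a POSIX-compatible `df` is assumed.

```shell
# Print the free space, in whole GB, on the filesystem holding the given directory.
# df -Pk reports 1024-byte blocks; column 4 of the second line is "Available".
free_gb() {
    df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if at least MIN_GB are free under DIR.
# Usage: has_free_gb DIR MIN_GB
has_free_gb() {
    dir=$1
    min_gb=$2
    avail=$(free_gb "$dir")
    [ -n "$avail" ] && [ "$avail" -ge "$min_gb" ]
}

# Example: check the current directory against the documented minimum.
if has_free_gb . 200; then
    echo "disk space OK"
else
    echo "less than 200 GB free in $(pwd)"
fi
```

Run the check from the directory where the reference files will be stored, since free space can differ between filesystems on the same machine.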
2.1.2. Software
Conda — environment and package management
SingularityCE — containerized tool execution
Git — to clone the repository
Verify your installations:
conda --version
singularity --version
git --version
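The three checks above can also be wrapped in a small preflight helper. This is an illustrative sketch only: `missing_tools` is a hypothetical name, and it tests whether each command is on `PATH`, not whether the installed version is suitable.

```shell
# Print the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the three prerequisites listed in this section.
missing=$(missing_tools conda singularity git)
if [ -n "$missing" ]; then
    echo "Missing required tools:"
    echo "$missing"
else
    echo "All required tools found"
fi
```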
2.2. Clone the Repository
git clone https://github.com/zyllifeworld/clindet.git
cd clindet
2.3. Quick Test
The mini_test_data/ folder contains a reduced dataset (chromosome 21 only) for a fast end-to-end test. Run this first to verify your environment is correctly configured before moving on to full-scale analyses.
Note: Running Clindet requires reference annotation files (e.g., dbSNP, tool-specific configuration files) that are downloaded during the reference genome setup step below. The default configuration has been validated across a wide range of cancer datasets and handles cross-tool compatibility issues, such as chromosome naming conventions (chr prefix vs. no prefix in the reference FASTA). Beginners should stick with the default setup. Experienced users may customize as needed.
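If you do bring your own reference, a quick look at the first FASTA header shows which naming convention it uses. The sketch below is an assumption-laden illustration, not a Clindet utility: `fasta_chr_style` is a hypothetical helper, it inspects only the first sequence header, and it expects an uncompressed FASTA file.

```shell
# Classify a FASTA file by the name of its first sequence.
# Prints "chr-prefixed" (e.g. ">chr21") or "unprefixed" (e.g. ">21").
# Returns non-zero if the file has no header line.
fasta_chr_style() {
    first=$(awk '/^>/ { print; exit }' "$1")
    [ -n "$first" ] || return 1
    case $first in
        '>chr'*) echo "chr-prefixed" ;;
        *)       echo "unprefixed" ;;
    esac
}

# Usage (path is a placeholder):
#   fasta_chr_style /path/to/reference.fa
```

Comparing this result with the chromosome names in your BED and VCF annotation files is a cheap way to catch a mismatch before a long run.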
2.3.1. Install the Conda Environment
conda env create -f envs/clindet.yaml
conda activate clindet
2.3.2. Configure Environment Reuse
By default, Snakemake rebuilds Conda environments on every run. To install environments once and reuse them, edit workflow/config/conf/softwares.yaml and set each tool to its installed environment name:
conda:
  clindet_main: 'clindet'
  multiqc: 'clindet'
  clindet_rsem: 'clindet_rsem'
  clindet_vep: 'clindet_vep'
  facets:
  hmftools: 'hmftools'
  strelka: 'strelka'
  trust4: 'clindet_rsem'
  rna: 'clindet_rsem'
  clindet_mut: "clindet_mutflag"
If you prefer Snakemake to rebuild environments on every run, leave all values above empty.
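When reusing environments, every name set in softwares.yaml has to exist already. One way to check is sketched below. It assumes that `conda env list` prints one environment per line with the name in the first column and comment lines starting with `#`; `missing_envs` is a hypothetical helper that reads that listing on standard input.

```shell
# Read `conda env list` output on stdin and print each requested
# environment name that does not appear in the first column.
missing_envs() {
    awk -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        /^#/ || NF == 0 { next }      # skip comments and blank lines
        { have[$1] = 1 }              # first column is the env name
        END {
            for (i = 1; i <= n; i++)
                if (!(names[i] in have)) print names[i]
        }
    '
}

# Usage, with the environment names from the example configuration:
#   conda env list | missing_envs clindet clindet_rsem clindet_vep \
#       hmftools strelka clindet_mutflag
```

Any name printed would need to be created first, or its entry left empty so that Snakemake builds the environment itself.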
2.3.3. Configure Temporary Directory
GATK tools can fail when the default temporary directory runs out of space. Set a custom temp_directory with sufficient disk space in workflow/config/conf/softwares.yaml:
params:
  java:
    temp_directory: "/path/to/your/tmp"
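A typo in this path tends to surface only once a GATK step fails, so it may be worth checking the configured directory up front. The sketch below is illustrative: `yaml_temp_dir` and `tmp_dir_ready` are hypothetical helpers, and the first one is a simple line match rather than a real YAML parser, so it assumes a single `temp_directory:` entry whose path contains no spaces.

```shell
# Print the value of the first "temp_directory:" line in a YAML file,
# with surrounding single or double quotes removed (\047 is a single quote).
yaml_temp_dir() {
    awk '$1 == "temp_directory:" {
        v = $2
        gsub(/["\047]/, "", v)
        print v
        exit
    }' "$1"
}

# Succeed if DIR exists (creating it if necessary) and is writable.
tmp_dir_ready() {
    mkdir -p "$1" 2>/dev/null && [ -w "$1" ]
}

# Usage:
#   dir=$(yaml_temp_dir workflow/config/conf/softwares.yaml)
#   tmp_dir_ready "$dir" || echo "cannot write to $dir"
```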
2.3.4. Configure Singularity Bind Path
When using Singularity, you must bind your home directory (or the directory containing your data and reference files) so the container can access them. The bind path is passed via --singularity-args:
--singularity-args "--bind /your/home/path:/your/home/path"
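If your data and reference files live in more than one location, several directories can be bound at once, because Singularity's `--bind` option accepts a comma-separated list of `source:destination` pairs. The helper below is a small sketch with a hypothetical name (`bind_args`). It binds each directory to the same path inside the container and assumes the paths contain no commas or colons.

```shell
# Build a single --bind argument that maps each given directory
# to the identical path inside the container.
bind_args() {
    spec=""
    for dir in "$@"; do
        spec="${spec:+$spec,}$dir:$dir"
    done
    printf '%s\n' "--bind $spec"
}

# Usage (the second path is a placeholder for a reference directory):
#   snakemake ... --use-singularity \
#       --singularity-args "$(bind_args "$HOME" /path/to/references)"
```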
2.3.5. Run the Quick Test
Replace /your/home/path below with your actual home directory, then choose the command for your data type:
# DNA (WES)
snakemake -c 20 --config run_type=wes \
--configfile mini_test_data/dna/data/test_config.yaml \
--rerun-triggers mtime --benchmark-extended \
--use-singularity --singularity-args "--bind /your/home/path:/your/home/path" \
--latency-wait 300 --use-conda --conda-frontend conda -k
# RNA
snakemake -c 20 --config run_type=rna \
--configfile mini_test_data/rna/fusion/data/test_rna.yaml \
--rerun-triggers mtime --benchmark-extended \
--use-singularity --singularity-args "--bind /your/home/path:/your/home/path" \
--latency-wait 300 --use-conda --conda-frontend conda -k
2.3.5.1. Expected Outputs
After a successful RNA run, the results directory should look like this:
mini_test/rna/hg38_chr21/results
├── benchmarks
│   └── fusion
│       ├── mini.star_arriba_map_1.benchmark.txt
│       └── mini.star_arriba_map.benchmark.txt
├── fusion
│   └── mini_arriba_fusion.tsv  # contains TMPRSS2-ERG fusion
└── mapped
    └── STAR
        └── mini
            ├── Aligned.out.bam
            ├── Log.final.out
            ├── Log.out
            ├── Log.progress.out
            ├── mini_pass1.log
            ├── mini.sorted.bam.bai
            ├── mini_star.log
            ├── mini_unmapped_R1.fq
            ├── mini_unmapped_R2.fq
            ├── SJ.out.tab
            └── _STARgenome/ ...
The key output file is fusion/mini_arriba_fusion.tsv, which should report the TMPRSS2-ERG fusion, a known driver event in the test dataset.
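To confirm the result without opening the file by hand, the fusion table can be searched for the gene pair. This sketch assumes the usual Arriba layout (tab-separated, a header line starting with `#`, and the two partner genes in the first two columns); `has_fusion` is a hypothetical helper and matches the pair in either order.

```shell
# Succeed if the fusion table FILE contains a row joining GENE_A and GENE_B.
# Usage: has_fusion FILE GENE_A GENE_B
has_fusion() {
    awk -F '\t' -v a="$2" -v b="$3" '
        /^#/ { next }                                  # skip the header
        ($1 == a && $2 == b) || ($1 == b && $2 == a) { found = 1; exit }
        END { exit !found }
    ' "$1"
}

# Usage, run from the results directory shown above:
#   has_fusion fusion/mini_arriba_fusion.tsv TMPRSS2 ERG && echo "fusion found"
```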
If the quick test completes successfully, proceed to the Reference Genome Setup below. Otherwise, double-check your Conda environment and Singularity bind path configuration.
2.3.6. Note for Experienced Users
If you already have reference files (human genome FASTA, GTF, dbSNP, etc.) on your cluster, you can skip the download step and point directly to your existing files. Edit the reference paths under the resources section in your workflow config file (e.g., workflow/config/config.yaml). See Configuring the Workflow for details.
2.4. Reference Genome Setup
Once the quick test passes, download and configure the full human b37 reference genome:
snakemake --config run_type=build_b37
2.4.1. Legacy Script (Deprecated)
The shell script build_conda_envs.sh is no longer recommended. It downloads the same ~170 GB of reference files but is not integrated with the current Snakemake workflow. Use snakemake --config run_type=build_b37 instead.
2.4.2. Generate a BED File for WES Analysis
Create a BED file from a GTF annotation to define exome capture regions. The BED file should contain three required columns (chr, start, end) and an optional fourth column (gene_name). Example:
1 11858 12237 DDX11L1
1 12602 12731 DDX11L1
1 12964 13062 DDX11L1
1 13210 14511 DDX11L1;WASH7P
Clindet provides a reference BED file for the b37 genome (used in Use Case I). You can also generate your own with the provided script gtf2bed.R.
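For orientation, the core of a GTF-to-BED conversion can be sketched in a few lines of awk. This is not the logic of gtf2bed.R, which may pad or merge intervals differently. It is a simplified illustration that emits one line per exon, converts the 1-based GTF start to the 0-based BED start, and assumes a `gene_name "..."` attribute in column 9 (a `.` is written when it is absent). The name `gtf_exons_to_bed` is hypothetical.

```shell
# Convert the exon records of an uncompressed GTF file into 4-column BED lines:
# chr, start (0-based), end, gene_name.
gtf_exons_to_bed() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }
        $3 == "exon" {
            name = "."
            # The prefix gene_name " is 11 characters long; RLENGTH also
            # counts the closing quote, hence RLENGTH - 12 for the value.
            if (match($9, /gene_name "[^"]+"/)) {
                name = substr($9, RSTART + 11, RLENGTH - 12)
            }
            print $1, $4 - 1, $5, name
        }
    ' "$1"
}

# Usage (file names are placeholders); sort by chromosome, then start:
#   gtf_exons_to_bed annotation.gtf | sort -k1,1 -k2,2n > exons.bed
```

Overlapping exons from different transcripts are not merged here, so the output is usually passed through an interval-merging step before being used as a capture target file.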
2.5. Next Steps
Configure your workflow parameters: set up config.yaml and samples.tsv for your project
Browse use cases: see real-world analysis examples
DNA-seq workflow details: in-depth documentation for DNA analysis
RNA-seq workflow details: in-depth documentation for RNA analysis