Ferlab-Ste-Justine/seq-data-validation is a bioinformatics pipeline that validates the integrity and format of sequencing data files including FASTQ, BAM/CRAM, and VCF/GVCF files. It performs a series of checks to ensure that the files are not corrupted, conform to expected formats, and (optionally) contain valid variant information. The pipeline generates a comprehensive report summarizing the validation results for each file. Optionally, it can replace sample IDs in the data files based on a user-provided mapping.
The pipeline consists of three streams to validate different types of sequencing data, with an optional module that replaces sample IDs with new IDs provided by the user.
Sample ID Replacement module (optional):
- FASTQ - Replace sample ID in file names.
- BAM/CRAM - Parse and update the header, replacing any occurrences of the sample ID using custom scripts and samtools reheader. Optionally update read groups with samtools addreplacerg if the RGID contains the sample ID.
- VCF/GVCF - Parse and update the header, replacing any occurrences of the sample ID using custom scripts and bcftools reheader.
- Text files - Replace sample ID occurrences in any text file using custom scripts.
It accepts a mixed input of data files and internally separates the files by data type, running the relevant stream.
At the end, it produces a set of data files with updated sample IDs.
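As an illustration of the FASTQ case above, where only the file name changes, here is a minimal Python sketch. It is not the pipeline's own script: the function name `replace_sample_id_in_name` is hypothetical, and the assumption is that the old ID appears verbatim in the file name.

```python
from pathlib import PurePosixPath


def replace_sample_id_in_name(path: str, old_id: str, new_id: str) -> str:
    """Return `path` with `old_id` replaced by `new_id` in the file name only.

    Directory components are left untouched (an assumption of this sketch).
    """
    p = PurePosixPath(path)
    if old_id not in p.name:
        raise ValueError(f"{old_id!r} not found in file name {p.name!r}")
    return str(p.with_name(p.name.replace(old_id, new_id)))


print(replace_sample_id_in_name("runs/S001/S001_R1.fastq.gz", "S001", "NEW_S001"))
# runs/S001/NEW_S001_R1.fastq.gz
```

Restricting the replacement to the file name avoids rewriting parent directories that happen to contain the same ID.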
Main Validation Streams:
- FASTQ - Verify FASTQ integrity and matching pairs (if paired-end) with fq lint and seqfu check.
- BAM/CRAM - Validate file integrity and format, check the reference, validate the index, and diagnose errors with samtools quickcheck and picard validateSamFile.
- GVCF/VCF - Validate integrity, format, and (optionally) variants with gatk4 validateVariants.
It accepts a mixed input of FASTQ, BAM/CRAM, and VCF/GVCF files and internally separates the files by data type, running the relevant stream.
At the end, it produces a report summarising the status (PASS/FAIL) of each submitted data file and index, visualized with MultiQC.
This schema was created with draw.io, following the good practices recommended by the nf-core community. See nf-core Graphic Design.
Tip
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
participant,sample,fileType,file1,file2
P001,S001,FASTQ,sample1_R1.fastq.gz,sample1_R2.fastq.gz
P001,S001,BAM,sample1.bam,sample1.bam.bai
P001,S001,GVCF,sample1.gvcf,sample1.gvcf.idx
Each row represents a data file or a pair of files (FASTQ pairs or data file and its index). The fileType column indicates the type of data (FASTQ, BAM, CRAM, VCF, GVCF). The file1 and file2 columns contain the paths to the data files. For single-end FASTQ files or data files without an index, leave the file2 column empty.
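To make the samplesheet layout concrete, the following Python sketch checks each row and groups rows by validation stream. It is only an illustration: the pipeline does this separation internally in Nextflow, and the names `group_samplesheet` and `STREAM_OF`, as well as the exact error conditions, are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = ["participant", "sample", "fileType", "file1", "file2"]

# Assumed grouping of fileType values into the three validation streams.
STREAM_OF = {
    "FASTQ": "FASTQ",
    "BAM": "BAM/CRAM",
    "CRAM": "BAM/CRAM",
    "VCF": "VCF/GVCF",
    "GVCF": "VCF/GVCF",
}


def group_samplesheet(text):
    """Check each samplesheet row and group rows by validation stream."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    streams = defaultdict(list)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        file_type = (row["fileType"] or "").strip()
        if file_type not in STREAM_OF:
            raise ValueError(f"line {line_no}: unknown fileType {file_type!r}")
        if not (row["file1"] or "").strip():
            raise ValueError(f"line {line_no}: file1 is required")
        # file2 may be empty (single-end FASTQ or a data file without an index).
        streams[STREAM_OF[file_type]].append(row)
    return dict(streams)
```

Calling `group_samplesheet` on the example samplesheet above would return one row under each of the three stream keys.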
[!NOTE] If running the optional Sample ID Replacement module (--replace_sample_id true), prepare a CSV file mapping old sample IDs to new sample IDs:
sample_id_map.csv:
oldID,newID
S001,NEW_S001
S002,NEW_S002
[!CAUTION] The newID values must exactly match sample in the input samplesheet. oldID values must be contained within the internal sample IDs of the data files for the replacement to work correctly.
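The first condition in the caution above can be checked before launching a run. The sketch below is a hypothetical helper (`unmatched_new_ids` is not part of the pipeline) that reports newID values absent from the samplesheet's sample column. It does not cover the second condition, since checking oldID values requires reading the data files themselves.

```python
import csv
import io


def unmatched_new_ids(mapping_text, samplesheet_text):
    """Return newID values that do not appear in the samplesheet's sample column."""
    samples = {
        row["sample"] for row in csv.DictReader(io.StringIO(samplesheet_text))
    }
    new_ids = [row["newID"] for row in csv.DictReader(io.StringIO(mapping_text))]
    # Exact string comparison, as the caution requires an exact match.
    return [new_id for new_id in new_ids if new_id not in samples]
```

An empty list means every newID has a matching sample entry.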
params.json:
Parameters can be provided via a JSON or YAML file, directly on the command line, or as a mix of both.
When input contains CRAM or GVCF files, reference genome files must be provided. Here is an example params.json file for running the pipeline with sample ID replacement:
{
  "input": "samplesheet.csv",
  "outdir": "results/",
  "replace_sample": true,
  "id_mapping": "sample_id_map.csv",
  "fasta": "/path/to/reference.fasta",
  "fai": "/path/to/reference.fasta.fai",
  "fasta_dict": "/path/to/reference.dict"
}
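The reference requirement for CRAM and GVCF inputs can be sketched as a small pre-flight check. This is an assumption-laden illustration, not pipeline code: it assumes that all three reference keys shown in the example (`fasta`, `fai`, `fasta_dict`) are needed, and the helper name `missing_reference_params` is hypothetical.

```python
import json

# Assumption: all three reference keys from the example params.json are required.
REFERENCE_KEYS = ("fasta", "fai", "fasta_dict")


def missing_reference_params(params, file_types):
    """Return reference keys missing from params when CRAM or GVCF inputs are present."""
    if not {"CRAM", "GVCF"} & set(file_types):
        return []
    return [key for key in REFERENCE_KEYS if not params.get(key)]


params = json.loads(
    '{"input": "samplesheet.csv", "outdir": "results/", '
    '"fasta": "/path/to/reference.fasta"}'
)
print(missing_reference_params(params, ["FASTQ", "GVCF"]))
# ['fai', 'fasta_dict']
```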
- To run locally or in a virtual machine, use Docker:
nextflow run Ferlab-Ste-Justine/seq-data-validation -profile docker \
  -r v1.0.0 \
  -params-file params.json

or with a combination of a parameters file and command-line parameters:
nextflow run Ferlab-Ste-Justine/seq-data-validation -profile docker \
  -r v1.0.0 \
  --input samplesheet.csv \
  --outdir <OUTDIR> \
  --replace_sample true \
  -params-file params.json

[!NOTE] Parameters provided via the command line will override those provided in the parameters file if there are any conflicts.
- In a production environment with a specific configuration and parameters:
nextflow -c app.config run Ferlab-Ste-Justine/seq-data-validation \
  -r v1.0.0 \
  --input samplesheet.csv \
  --outdir <OUTDIR> \
  -params-file params.json

[!WARNING] Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
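The precedence rule noted earlier (command-line parameters override the parameters file on conflict) can be pictured with a short Python sketch. This only models the documented behavior and is not how Nextflow implements it; `resolve_params` is a hypothetical name.

```python
def resolve_params(file_params, cli_params):
    """Merge parameters so that command-line values win on conflicts."""
    merged = dict(file_params)  # start from the parameters file
    merged.update(cli_params)   # command-line values override on conflict
    return merged


file_params = {"input": "samplesheet.csv", "outdir": "results/", "replace_sample": False}
cli_params = {"outdir": "run42/", "replace_sample": True}
print(resolve_params(file_params, cli_params))
# {'input': 'samplesheet.csv', 'outdir': 'run42/', 'replace_sample': True}
```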
Ferlab-Ste-Justine/seq-data-validation was originally written by Georgette Femerling, Samantha Yuen, Félix-Antoine Le Sieur, Lysiane Bouchard, David Morais.
We thank the Ferlab team and its partners for their support and collaboration in the development of this pipeline.
If you would like to contribute to this pipeline, please see the contributing guidelines.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

