Ferlab-Ste-Justine/seq-data-validation

Introduction

Ferlab-Ste-Justine/seq-data-validation is a bioinformatics pipeline that validates the integrity and format of sequencing data files including FASTQ, BAM/CRAM, and VCF/GVCF files. It performs a series of checks to ensure that the files are not corrupted, conform to expected formats, and (optionally) contain valid variant information. The pipeline generates a comprehensive report summarizing the validation results for each file. Optionally, it can replace sample IDs in the data files based on a user-provided mapping.

Pipeline Summary

The pipeline consists of three streams to validate different types of sequencing data, with an optional module that replaces sample IDs with new IDs provided by the user.

Sample ID Replacement module (optional):

  • FASTQ - Replace sample ID in file names.
  • BAM/CRAM - Parse and update the header, replacing any occurrences of the sample ID using custom scripts and samtools reheader. Optionally, update read groups with samtools addreplacerg if the RGID contains the sample ID.
  • VCF/GVCF - Parse and update the header, replacing any occurrences of the sample ID using custom scripts and bcftools reheader.
  • Text files - Replace sample ID occurrences in any text file using custom scripts.

It accepts a mixed input of data files and internally separates the files by data type, running the relevant stream.

At the end, it produces a set of data files with updated sample IDs.
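
As an illustration of the FASTQ case above, where only the file name changes, here is a minimal sketch. The helper `rename_fastq` is hypothetical and is not taken from the pipeline's own scripts; it assumes the old sample ID appears literally in the file name.

```python
from pathlib import PurePosixPath


def rename_fastq(path: str, old_id: str, new_id: str) -> str:
    """Return `path` with `old_id` replaced by `new_id` in the file name only.

    Hypothetical helper for illustration; directory components are left untouched.
    """
    p = PurePosixPath(path)
    if old_id not in p.name:
        # Refuse to silently keep the old name when the ID is absent.
        raise ValueError(f"{old_id!r} not found in file name {p.name!r}")
    return str(p.with_name(p.name.replace(old_id, new_id)))
```

For example, `rename_fastq("data/S001/S001_R1.fastq.gz", "S001", "NEW_S001")` leaves the `data/S001/` directory as it is and changes only the file name.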

IDRepairDiagram

Main Validation Streams:

  • FASTQ - Verify FASTQ integrity and matching pairs (if paired-end) with fq lint and seqfu check.
  • BAM/CRAM - Validate file integrity and format, check the reference, validate the index, and diagnose errors with samtools quickcheck and picard ValidateSamFile.
  • VCF/GVCF - Validate integrity, format, and (optionally) variants with gatk4 ValidateVariants.

It accepts a mixed input of FASTQ, BAM/CRAM, and VCF/GVCF files and internally separates the files by data type, running the relevant stream.

At the end, it produces a report summarizing the status (PASS/FAIL) of each submitted data file and index, visualized with MultiQC.
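
The separation of a mixed input into streams can be pictured with a small sketch. This is not the pipeline's Nextflow code; the stream names (`fastq`, `alignment`, `variant`) and the function `split_by_stream` are assumptions made for illustration, and rows are assumed to be dictionaries with a `fileType` key as in the samplesheet.

```python
from collections import defaultdict

# Assumed mapping from samplesheet fileType to a validation stream (names are illustrative).
STREAM_BY_TYPE = {
    "FASTQ": "fastq",
    "BAM": "alignment",
    "CRAM": "alignment",
    "VCF": "variant",
    "GVCF": "variant",
}


def split_by_stream(rows):
    """Group samplesheet rows by the validation stream that would handle them."""
    streams = defaultdict(list)
    for row in rows:
        file_type = row["fileType"].upper()
        if file_type not in STREAM_BY_TYPE:
            raise ValueError(f"Unsupported fileType: {row['fileType']!r}")
        streams[STREAM_BY_TYPE[file_type]].append(row)
    return dict(streams)
```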

DataValidationDiagram

This diagram was created with draw.io, following the good practices recommended by the nf-core community. See nf-core Graphic Design.

Usage

[!TIP] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

participant,sample,fileType,file1,file2
P001,S001,FASTQ,sample1_R1.fastq.gz,sample1_R2.fastq.gz
P001,S001,BAM,sample1.bam,sample1.bam.bai
P001,S001,GVCF,sample1.gvcf,sample1.gvcf.idx

Each row represents a data file or a pair of files (FASTQ pairs or data file and its index). The fileType column indicates the type of data (FASTQ, BAM, CRAM, VCF, GVCF). The file1 and file2 columns contain the paths to the data files. For single-end FASTQ files or data files without an index, leave the file2 column empty.
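
A quick pre-flight check of the samplesheet can catch layout mistakes before launching a run. The sketch below is an assumption-laden illustration, not the pipeline's own schema validation: the function `read_samplesheet` is hypothetical, and it checks only the column names and fileType values described above.

```python
import csv
import io

REQUIRED_COLUMNS = ["participant", "sample", "fileType", "file1", "file2"]
ALLOWED_TYPES = {"FASTQ", "BAM", "CRAM", "VCF", "GVCF"}


def read_samplesheet(text: str):
    """Parse samplesheet CSV text and apply basic sanity checks (illustrative only)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["fileType"] not in ALLOWED_TYPES:
            raise ValueError(f"Line {line_no}: unknown fileType {row['fileType']!r}")
        if not row["file1"]:
            raise ValueError(f"Line {line_no}: file1 is required")
        # An empty file2 means single-end FASTQ or a data file without an index.
        row["file2"] = row["file2"] or None
        rows.append(row)
    return rows
```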

[!NOTE] If running the optional Sample ID Replacement module (--replace_sample true), prepare a CSV file mapping old sample IDs to new sample IDs:

sample_id_map.csv:

oldID,newID
S001,NEW_S001
S002,NEW_S002

[!CAUTION] The newID values must exactly match the sample column in the input samplesheet. Each oldID value must be contained within the internal sample IDs of the data files for the replacement to work correctly.
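
The two conditions in the caution above can be checked ahead of time. The following is a rough sketch under stated assumptions: `load_id_map` and `check_id_map` are hypothetical helpers, and `internal_ids` is assumed to be a list of sample IDs that you have already extracted from the data files yourself (for example from BAM read groups or VCF sample columns).

```python
import csv
import io


def load_id_map(text: str) -> dict:
    """Read the oldID,newID mapping CSV into a dict, rejecting duplicate oldIDs."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(text)):
        old, new = row["oldID"], row["newID"]
        if old in mapping:
            raise ValueError(f"Duplicate oldID: {old!r}")
        mapping[old] = new
    return mapping


def check_id_map(mapping: dict, samplesheet_samples: set, internal_ids: list) -> list:
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    for old, new in mapping.items():
        # newID must match a value of the samplesheet 'sample' column exactly.
        if new not in samplesheet_samples:
            problems.append(f"newID {new!r} is not a sample in the samplesheet")
        # oldID must be contained in at least one internal sample ID.
        if not any(old in internal for internal in internal_ids):
            problems.append(f"oldID {old!r} not found in any internal sample ID")
    return problems
```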

params.json: Parameters can be provided via a JSON or YAML file, directly on the command line, or as a mix of both.

When the input contains CRAM or GVCF files, reference genome files must be provided. Here is an example params.json file for running the pipeline with sample ID replacement:

{
  "input": "samplesheet.csv",
  "outdir": "results/",
  "replace_sample": true,
  "id_mapping": "sample_id_map.csv",
  "fasta": "/path/to/reference.fasta",
  "fai": "/path/to/reference.fasta.fai",
  "fasta_dict": "/path/to/reference.dict"
}
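
The reference requirement can be verified before launching a run. This sketch assumes that all three reference keys from the example (`fasta`, `fai`, `fasta_dict`) are needed whenever CRAM or GVCF files are present; the source only states that reference genome files must be provided, so treat the exact key list as an assumption. The function `missing_reference_params` is hypothetical.

```python
NEEDS_REFERENCE = {"CRAM", "GVCF"}
# Assumed set of reference keys, taken from the example params.json.
REFERENCE_KEYS = ("fasta", "fai", "fasta_dict")


def missing_reference_params(params: dict, file_types) -> list:
    """Return the reference keys that are absent or empty when a reference is required."""
    present_types = {t.upper() for t in file_types}
    if not NEEDS_REFERENCE & present_types:
        # No CRAM or GVCF in the input, so no reference is required.
        return []
    return [key for key in REFERENCE_KEYS if not params.get(key)]
```

The `params` dictionary can be obtained with `json.load` on the parameters file, and `file_types` from the fileType column of the samplesheet.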

Running the pipeline

  • To run locally or in a virtual machine, use Docker:

    nextflow run Ferlab-Ste-Justine/seq-data-validation -profile docker \
        -r v1.0.0 \
        -params-file params.json

    or with a combination of a parameters file and command-line parameters:

    nextflow run Ferlab-Ste-Justine/seq-data-validation -profile docker \
        -r v1.0.0 \
        --input samplesheet.csv \
        --outdir <OUTDIR> \
        --replace_sample true \
        -params-file params.json

    [!NOTE] Parameters provided via the command line will override those provided in the parameters file if there are any conflicts.

  • In a production environment with a specific configuration and parameters:

    nextflow -c app.config run Ferlab-Ste-Justine/seq-data-validation \
        -r v1.0.0 \
        --input samplesheet.csv \
        --outdir <OUTDIR> \
        -params-file params.json

    [!WARNING] Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
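
The precedence rule noted above, where command-line parameters override those from the parameters file, behaves like a dictionary merge in which the command-line values are applied last. The sketch below only models that rule; `resolve_params` is a hypothetical name and is not how Nextflow is implemented internally.

```python
def resolve_params(file_params: dict, cli_params: dict) -> dict:
    """Merge parameters so that command-line values win on conflicting keys."""
    # Later entries in a dict merge overwrite earlier ones with the same key.
    return {**file_params, **cli_params}


file_params = {"input": "samplesheet.csv", "outdir": "results/", "replace_sample": True}
cli_params = {"outdir": "other_results/"}
effective = resolve_params(file_params, cli_params)
```

Here `effective["outdir"]` comes from the command line, while `input` and `replace_sample` keep the values from the parameters file.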

Credits

Ferlab-Ste-Justine/seq-data-validation was originally written by Georgette Femerling, Samantha Yuen, Félix-Antoine Le Sieur, Lysiane Bouchard, and David Morais.

We thank the Ferlab team and its partners for their support and collaboration in the development of this pipeline.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
