A comprehensive quality control pipeline for cleaning and preparing imputed genotype data for protein quantitative trait locus (pQTL) analysis. This pipeline is based on Alessia Mapelli and Solène Cadiou's work, and has been adapted for reproducible genomic data processing.
- Comprehensive QC: Filters mirror SNPs, problematic variants, and low-quality samples
- Format Conversion: Handles PGEN and BED formats
- Harmonization: Standardizes variant IDs and alleles across datasets
- Modular Design: Configurable for different projects (INTERVAL, BELIEVE, etc.)
- Reproducible: Uses Snakemake workflow management with containerized tools
- Snakemake (workflow management)
- Singularity (container runtime)
- Git (version control)
See environment.yml for Python dependencies and Makefile for common commands.
- Clone the repository:

      git clone https://github.com/ht-diva/genomics_QC_pipeline.git
      cd genomics_QC_pipeline

- Select a project configuration:

      # Choose one of the available configurations
      make project-interval   # For INTERVAL study
      make project-believe    # For BELIEVE study
      make project-<custom>   # For custom configurations

- Run the pipeline:

      sbatch submit.sbatch

The pipeline supports multiple project configurations stored in config/config.<name>.yaml. The active project is set by:
- Creating/updating the `.project` file, or
- Using Makefile helpers (recommended)
Results are written to the path specified by workspace_path in your config file (default: ./results).
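
As a rough illustration of how the active configuration could be resolved, the Python sketch below assumes that `.project` is a one-line text file holding the project name (for example `interval`). That file format and the helper name `active_config_path` are assumptions made for this example; they are not taken from the pipeline's code.

```python
from pathlib import Path


def active_config_path(repo_root: Path) -> Path:
    """Return config/config.<name>.yaml for the project named in .project.

    Assumption: .project is a one-line text file containing the project name.
    """
    project_file = repo_root / ".project"
    if not project_file.is_file():
        raise FileNotFoundError(
            f"No .project file in {repo_root}; select a project first"
        )
    name = project_file.read_text().strip()
    if not name:
        raise ValueError(".project is empty")
    # Configs are stored as config/config.<name>.yaml
    return repo_root / "config" / f"config.{name}.yaml"
```

With a `.project` file containing `interval`, this would point to `config/config.interval.yaml`, whose `workspace_path` entry then decides where results are written.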
The pipeline consists of several processing steps organized in 3 main blocks:
- ID Standardization: Converts variant IDs to `chr:pos:ref:alt` format
- Sample Selection: Filters to include only individuals with matching proteomic data
- Mirror SNP Handling: Identifies and removes problematic palindromic variants
- Variant Filtering: Removes low-quality variants based on multiple metrics (MAF, HWE, etc.)
- Imputation quality: Filters based on imputation quality metrics (INFO-score or MINIMAC3)
- ID Harmonization: Rewrites variant IDs as `chr:pos:A0:A1`, with the two alleles in alphabetical order
- Allele Harmonization: Ensures consistent allele representation across datasets
- Format Conversion: Supports PGEN ↔ BED conversions
- Merging: Combines chromosome-specific files into unified datasets
- Delivery: Organizes final outputs in standardized directory structure
- Report Generation: Generates summary reports of the variant filtering and harmonization results
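
The two ID steps in the list above can be sketched in a few lines of Python. This is only an illustration of the ID formats, not the pipeline's implementation: the function names are hypothetical, and plain string sorting is assumed for the alphabetical ordering (how the pipeline treats indels or letter case is not covered here).

```python
def standardize_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build a chr:pos:ref:alt variant ID from its components."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def harmonize_id(variant_id: str) -> str:
    """Rewrite chr:pos:ref:alt as chr:pos:A0:A1 with alleles sorted.

    Assumption: alphabetical order means plain string sorting of the
    two allele fields.
    """
    chrom, pos, first, second = variant_id.split(":")
    a0, a1 = sorted((first, second))
    return f"{chrom}:{pos}:{a0}:{a1}"


vid = standardize_id("1", 12345, "T", "C")  # "1:12345:T:C"
print(harmonize_id(vid))                    # "1:12345:C:T"
```

Because the harmonized ID no longer says which allele is the reference, the alleles stored in the genotype files have to be updated to match the new IDs, which is what the allele harmonization step is for.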
results/
├── bed/
│   └── qc_harmonised/          # Harmonized BED files ready for analysis
├── pgen/
│   ├── *_impute.info
│   ├── pseudo_biallelic.txt
│   ├── qc_harmonised/          # Fully quality-controlled, harmonized PGEN files
│   ├── recode_rsid.txt
│   └── reports/
│       ├── all_chromosomes_filtering_summary_report.txt
│       ├── all_chromosomes_filtering_summary_report.tsv
│       └── all_chromosomes_harmonization_summary_report.txt
└── README.txt
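
After a run, a quick check that the expected outputs exist can catch an incomplete delivery early. The sketch below is a hypothetical helper, not part of the pipeline; it only tests for the fixed paths shown in the tree above and skips the `*_impute.info` pattern.

```python
from pathlib import Path
from typing import List

# Fixed paths taken from the output tree, relative to the results directory.
EXPECTED_OUTPUTS = [
    "bed/qc_harmonised",
    "pgen/qc_harmonised",
    "pgen/pseudo_biallelic.txt",
    "pgen/recode_rsid.txt",
    "pgen/reports/all_chromosomes_filtering_summary_report.txt",
    "pgen/reports/all_chromosomes_filtering_summary_report.tsv",
    "pgen/reports/all_chromosomes_harmonization_summary_report.txt",
    "README.txt",
]


def missing_outputs(results_dir: Path) -> List[str]:
    """Return the expected relative paths that do not exist yet."""
    return [rel for rel in EXPECTED_OUTPUTS if not (results_dir / rel).exists()]
```

Calling `missing_outputs(Path("./results"))` returns an empty list when every listed file and directory is present.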
| Steps | Rule Name | Purpose | Output Files |
|---|---|---|---|
| Basic info | | | |
| | `list_rs` | Generate lists of rsIDs and pseudo-biallelic variants | `merge_rsids.txt`, `recode_rsids.txt`, `pseudo_biallelic_var.txt`, `pseudo_biallelic.txt` |
| | `header_info` | Generate a basic information report from original dataset per chromosome | text report |
| QC | | | |
| | `recode_pgen` | Replace IDs with chr:pos:ref:alt format | Recoded PGEN files |
| | `select_sample` | Select individuals present in both genomic and proteomic datasets | Filtered sample files |
| | `get_mirror_snps` | Identify mirror SNP pairs (X:XXXXXXX:A:B and X:XXXXXXX:B:A) | `{chrom}_mirror_snps.txt` |
| | `filter_mirror_snps` | Remove mirror SNPs from dataset | `{chrom}_filtered_mirror_snps.{pgen,pvar,psam}` |
| | `filter_problematic_snps` | Filter out a list of predefined problematic SNPs | Filtered PGEN files |
| | `filter_var` | Perform QC: remove failed samples, heterozygosity outliers, MAF, and HWE filtering | Quality-controlled PGEN files |
| | `create_bgen` | Convert filtered data to BGEN format | `{chrom}_impute_recoded_selected_sample_filtered_var.{bgen,sample}` |
| | `qctool` | Compute SNP statistics using qctool | `snp-stats_chr_{chrom}_impute_recoded_selected_sample_filtered_var.txt` |
| | `get_hq_variants` | Filter variants with info score > 0.7 | List of high-quality variants |
| | `filter_hq_variants` | Extract high-quality variants from PGEN files | Chromosome-specific PGEN files with HQ variants |
| | `filter_by_minimac3` | Extract high-quality variants from PGEN files | Chromosome-specific PGEN files with HQ variants |
| | `merge_filter_hq_variants` | Merge chromosome-specific PGEN files | Combined PGEN file with HQ variants |
| Harmonization | | | |
| | `build_snp_mapping_files` | Generate SNP mapping files for harmonization | Mapping table and harmonized PVAR table |
| | `update_pgen_id` | Update variant IDs to chr:pos:A0:A1 format (alphabetical order) | PGEN files with updated IDs |
| | `update_pgen_alleles` | Harmonize alleles to match new IDs | PGEN files with harmonized alleles |
| | `merge_qc_harmonised_pgen` | Merge final harmonized PGEN files | Final combined PGEN file |
| | `pgen2bed` | Convert PGEN to BED format (hard-call-threshold = 0.49999999) | BED files with harmonized alleles |
| | `merge_qc_harmonised_bed` | Merge final harmonized BED files | Final combined BED file |
| Documentation | | | |
| | `write_readme` | Generate documentation with traceability information | `README.txt` with git information |
| | `generate_chromosome_summary_report` | Generate a comprehensive report with variant filtering results | Text report |
| | `generate_chromosome_summary_report` | Generate a table of variant filtering results | TSV table |
| | `generate_harmonization_summary_report` | Generate a comprehensive report with harmonization results | Text report |
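
To make the mirror SNP definition in the table concrete (two variants at the same position whose alleles are swapped, such as `X:XXXXXXX:A:B` and `X:XXXXXXX:B:A`), here is a minimal Python sketch that finds such pairs in a list of `chr:pos:ref:alt` IDs. The function name is hypothetical, and the actual `get_mirror_snps` rule may be implemented differently.

```python
def find_mirror_snps(variant_ids):
    """Return, sorted, every ID whose allele-swapped partner is also present.

    IDs are expected in chr:pos:ref:alt format.
    """
    seen = set(variant_ids)
    mirrors = set()
    for vid in seen:
        chrom, pos, ref, alt = vid.split(":")
        partner = f"{chrom}:{pos}:{alt}:{ref}"
        # Skip the degenerate case ref == alt, where an ID would mirror itself
        if ref != alt and partner in seen:
            mirrors.add(vid)
    return sorted(mirrors)


ids = ["1:100:A:G", "1:100:G:A", "1:200:C:T"]
print(find_mirror_snps(ids))  # ['1:100:A:G', '1:100:G:A']
```

The sketch reports both members of each pair; whether the pipeline removes both or keeps one is not stated in this README.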
This graph illustrates the progression of the workflow from the input files to the final outputs. For simplicity, the workflow is restricted to two chromosomes only.
To adapt the pipeline for your project:
- Create a new config file in `config/config.<your_project>.yaml`
- Update paths and parameters as needed
- Add any project-specific rules to the appropriate `.smk` files
For issues or questions:
- Check existing GitHub Issues
- Refer to the original INTERVAL QC script
If you use this pipeline, please cite:
- This pipeline repository