GoekeLab/xpore-minimal-nextflow-pipeline

Minimal Differential Modification Pipeline (Xpore)

A minimal pipeline for detecting differential RNA modifications from nanopore direct RNA sequencing data using Xpore. The pipeline:

  1. Aligns reads (fastq) to your reference using minimap2
  2. Uses blue-crab to convert POD5 files to BLOW5 format
  3. Uses f5c to align raw signal data to the reference (resquiggling)
  4. Runs Xpore (https://github.com/GoekeLab/xpore) to detect differential RNA modifications between conditions (modified vs unmodified)

Before You Start

Xpore Mode

Xpore can be run in two modes, controlled by the xpore_mode parameter in nextflow.config:

  • transcriptome mode (default): Results report modification positions using transcriptome coordinates
  • genome mode: Results report modification positions using genomic coordinates

In both modes, reads are aligned with minimap2 against the transcriptome reference. For genome mode, a GTF/GFF file is required to map transcriptomic positions to genomic coordinates.

Required Files

The following input files are needed:

  • POD5 files: Raw signal files (.pod5 format)
  • FASTQ files: Basecalled reads (.fastq format)
  • Reference sequence: A FASTA file containing your reference transcriptome (.fa or .fasta). At the moment the pipeline accepts only a single reference for all samples. To run different samples against different references, run the pipeline once per reference.

Note:

  • Kmer models: Pre-trained models for the RNA004 chemistry are currently included in data/models/; these are used by f5c and Xpore.

System Requirements

This pipeline can be run locally or on an HPC cluster using SLURM for scheduling.

To run the pipeline locally you'll need:

  • Nextflow installed (see: https://www.nextflow.io/docs/latest/install.html#installation)
  • Docker, Singularity or Conda/Mamba for dependency management
  • Note - running the pipeline locally is possible for small subsets of data or test data, but is discouraged for larger data files (pod5 files can be very large)

To run the pipeline on an HPC cluster you'll need:

  • Access to an HPC system with SLURM
  • Nextflow installed (on HPC clusters it is often available via module load nextflow). This repo includes a run_slurm.sh script that handles module loading and job submission, but it may need to be adjusted for your institution's system. Note that run_slurm.sh assumes your HPC cluster uses Singularity; it can be edited to use Conda or Docker for dependency management instead.

Quick Start Guide

Step 1: Samplesheet Preparation

Create a CSV file listing all your samples. Use example_sample_spreadsheet.csv as a template.

Required columns:

| Column | Description | Example |
| --- | --- | --- |
| sample_id | Unique name for each sample | WT, KO |
| comparison_group | The group within which samples will be compared against each other. Each comparison group needs a 'modified' and an 'unmodified' sample | 1, 2, 3 |
| condition | Either modified or unmodified | modified |
| replicate | If you have replicates within a condition (e.g. WT1, WT2), these will be input into Xpore together | 1, 2, 3 |
| fastq | Path to FASTQ file | /path/to/sample.fastq |
| pod5 | Path to POD5 file | /path/to/sample.pod5 |

Note on fastq and pod5 paths: these can be absolute or relative paths (relative to )

Example spreadsheet:

sample_id,comparison_group,condition,replicate,fastq,pod5
WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
WT_rep2,1,modified,2,/path/to/WT_rep2.fastq,/path/to/WT_rep2.pod5
KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
KO_rep2,1,unmodified,2,/path/to/KO_rep2.fastq,/path/to/KO_rep2.pod5

Note: You can analyze multiple replicates by using the replicate column, and run multiple independent comparisons at once by using different comparison_group values in your samplesheet. For example:

sample_id,comparison_group,condition,replicate,fastq,pod5
WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
WT_rep2,1,modified,2,/path/to/WT_rep2.fastq,/path/to/WT_rep2.pod5
KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
KO_rep2,1,unmodified,2,/path/to/KO_rep2.fastq,/path/to/KO_rep2.pod5
WT_celltype2_rep1,2,modified,1,/path/to/WT_celltype2_rep1.fastq,/path/to/WT_celltype2_rep1.pod5
WT_celltype2_rep2,2,modified,2,/path/to/WT_celltype2_rep2.fastq,/path/to/WT_celltype2_rep2.pod5
KO_celltype2_rep1,2,unmodified,1,/path/to/KO_celltype2_rep1.fastq,/path/to/KO_celltype2_rep1.pod5
KO_celltype2_rep2,2,unmodified,2,/path/to/KO_celltype2_rep2.fastq,/path/to/KO_celltype2_rep2.pod5

Replicates within the same comparison group are reported in the same results table. Each independent comparison group generates its own results directory (here diffmod_group1/ and diffmod_group2/).
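The samplesheet rules above (required columns, unique sample_id values, a condition of either modified or unmodified, and both conditions present in every comparison group) can be checked before a run. The following is a minimal Python sketch, not part of the pipeline; the function name validate_samplesheet and the wording of its messages are hypothetical.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = ["sample_id", "comparison_group", "condition", "replicate", "fastq", "pod5"]
VALID_CONDITIONS = {"modified", "unmodified"}


def validate_samplesheet(handle):
    """Return a list of human-readable problems found in a samplesheet (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    conditions_by_group = defaultdict(set)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["sample_id"] in seen_ids:
            problems.append(f"line {line_no}: duplicate sample_id {row['sample_id']!r}")
        seen_ids.add(row["sample_id"])
        if row["condition"] not in VALID_CONDITIONS:
            problems.append(
                f"line {line_no}: condition must be modified or unmodified, got {row['condition']!r}"
            )
        conditions_by_group[row["comparison_group"]].add(row["condition"])

    # Every comparison group needs at least one sample of each condition.
    for group, conditions in sorted(conditions_by_group.items()):
        for needed in sorted(VALID_CONDITIONS - conditions):
            problems.append(f"comparison_group {group}: no {needed} sample")
    return problems


# Toy samplesheet: group 2 has no unmodified sample, so one problem is reported.
example = io.StringIO(
    "sample_id,comparison_group,condition,replicate,fastq,pod5\n"
    "WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5\n"
    "KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5\n"
    "WT_rep1b,2,modified,1,/path/to/WT_rep1b.fastq,/path/to/WT_rep1b.pod5\n"
)
for problem in validate_samplesheet(example):
    print(problem)
```

To check a real file, pass an open file handle instead of the in-memory example, e.g. `with open("your_samples.csv", newline="") as fh: print(validate_samplesheet(fh))`. The sketch does not check that the fastq and pod5 paths exist.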

Step 2: Update Configuration

Edit nextflow.config to point to your files:

params {
    samplesheet = "/full/path/to/your_samples.csv"
    outdir = "/full/path/to/output_directory"
    xpore_mode = "transcriptome"  // or "genome"
    reference = "/full/path/to/your_transcriptome.fa"
    gtf_or_gff = "/full/path/to/your_annotation.gtf"  // Required for genome mode
    f5c_kmer_model = "${projectDir}/data/models/rna004.nucleotide.5mer.model.txt" //Typically will not need to be updated for RNA004 data
    xpore_diffmod_model = "${projectDir}/data/models/RNA004_5mer_model.txt" //Typically will not need to be updated for RNA004 data
}

What to change:

  • samplesheet: Path to your sample CSV file
  • outdir: Where you want results saved (the folder you specify will be created automatically)
  • xpore_mode: Choose "transcriptome" or "genome" depending on desired output coordinates
  • reference: Path to your reference transcriptome FASTA file
  • gtf_or_gff: Path to your GTF or GFF annotation file (only required if using genome mode; GTF format recommended)

Step 3: Run the Pipeline Locally or Submit the Pipeline to SLURM

Clone the repository into your working directory (locally or on your HPC) and navigate to the main project directory.

From the main project directory, run the pipeline using one of the following:

For local execution:

nextflow run main.nf -profile conda,local

If the pipeline is interrupted, you can restart it with:

nextflow run main.nf -profile conda,local -resume

When running locally, Nextflow will display progress in your terminal in real-time. You'll see:

  • Each process as it starts and completes
  • Progress bars showing completion status
  • Any errors or warnings as they occur

The pipeline runs in the foreground, so keep the terminal open. If interrupted, restart with the -resume flag to continue from the last successful step.

For HPC execution:

Submit the job via SLURM:

sbatch run_slurm.sh

Note: run_slurm.sh will:

  • Submit a main job that manages the pipeline
  • Automatically submit additional jobs for each processing step

To check your running jobs, use:

squeue -u $USER

You should see a job named nextflow_main (the controller) and additional jobs for each processing step as they start.

If your pipeline fails or is interrupted, you don't need to start over. The -resume flag in run_slurm.sh automatically resumes from the last successful step.

Results

Results are saved in the directory specified by outdir in your config file (default: results/).

Output Directory Structure

20260127_results/
├── blow5/                     # Converted pod5 files (intermediate)
├── minimap2/                  # Aligned fastq reads (intermediate)
├── eventalign/f5c/            # Signal-level alignments (intermediate)
└── xpore/
    ├── dataprep/              # Preprocessed data (intermediate)
    └── diffmod/               # FINAL RESULTS HERE
        ├── diffmod_group1/    # Results for comparison group 1
        └── diffmod_group2/    # Results for comparison group 2 (if applicable)

Main Results Files

The most important results are in xpore/diffmod/diffmod_groupX/ directories:

  • diffmod.table: Main results table with modification statistics for each RNA position
  • diffmod.log: Log file with processing information
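With several comparison groups, it can help to collect the result tables programmatically. The sketch below assumes the directory layout shown above (xpore/diffmod/diffmod_groupX/diffmod.table under the output directory); the function name find_diffmod_tables is hypothetical, and "results" stands in for whatever outdir you configured.

```python
from pathlib import Path


def find_diffmod_tables(outdir):
    """Map each comparison group directory name to its diffmod.table path (hypothetical helper)."""
    diffmod_root = Path(outdir) / "xpore" / "diffmod"
    return {
        table.parent.name: table
        for table in sorted(diffmod_root.glob("diffmod_group*/diffmod.table"))
    }


# Prints nothing if the directory does not exist or holds no result tables yet.
for group, table in find_diffmod_tables("results").items():
    print(group, table)
```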

Key columns in diffmod.table (for a full description of the output table, see https://xpore.readthedocs.io/en/latest/outputtable.html):

  • position: Position in the reference sequence
  • kmer: The 5-nucleotide sequence context
  • diff_mod_rate_<condition1>_vs_<condition2>: Differential modification rate between condition1 and condition2 (modification rate of modified minus modification rate of unmodified)
  • pval_<condition1>_vs_<condition2>: significance level from z-test of the differential modification rate
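As an illustration of how these columns might be used, the sketch below keeps rows that pass a p-value cutoff and a minimum absolute differential modification rate. It rests on several assumptions: that diffmod.table is comma-delimited with a header row, that an id column is present, and that the condition labels in the column names are modified and unmodified (the code finds the columns by prefix, so other labels also work). The function name significant_sites, the toy rows, and the cutoffs 0.05 and 0.1 are placeholders rather than recommendations, and no multiple-testing correction is applied.

```python
import csv
import io


def significant_sites(handle, max_pval=0.05, min_abs_rate=0.0):
    """Yield rows whose p-value and |differential modification rate| pass the cutoffs."""
    reader = csv.DictReader(handle)
    fields = reader.fieldnames or []
    # Column names embed the condition labels, so look them up by prefix.
    pval_col = next((c for c in fields if c.startswith("pval_")), None)
    if pval_col is None:
        raise ValueError("no pval_<condition1>_vs_<condition2> column found")
    rate_col = "diff_mod_rate_" + pval_col[len("pval_"):]
    if rate_col not in fields:
        raise ValueError(f"no {rate_col} column found")
    for row in reader:
        try:
            pval = float(row[pval_col])
            rate = float(row[rate_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if pval <= max_pval and abs(rate) >= min_abs_rate:
            yield row


# Made-up rows for illustration only; real tables contain more columns.
toy_table = io.StringIO(
    "id,position,kmer,diff_mod_rate_modified_vs_unmodified,pval_modified_vs_unmodified\n"
    "tx1,100,GGACT,0.60,0.001\n"
    "tx1,250,AAACA,0.05,0.700\n"
    "tx2,42,GGACC,-0.30,0.020\n"
)
for row in significant_sites(toy_table, max_pval=0.05, min_abs_rate=0.1):
    print(row["id"], row["position"], row["kmer"])
```

Taking the absolute value of the rate keeps sites that differ in either direction; drop abs() if only one direction is of interest.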

Cleaning Up Work Files

Nextflow stores intermediate files in a work/ directory which can grow very large, especially with nanopore data. After your pipeline has completed successfully, you can reclaim disk space by running nextflow clean -f to remove all cached work files. If you plan to use -resume later, you can selectively clean old runs with nextflow clean -before <run_name> to keep only the most recent cache. You can view past runs with nextflow log.

Troubleshooting

Common Issues

Problem: Jobs fail immediately

  • Check the error log: cat nextflow_*.err
  • Common causes:
    • Incorrect file paths in samplesheet
    • Missing reference file
    • Conda environment issues

Citations

  • Nextflow: Di Tommaso et al. (2017) Nat Biotechnol
  • minimap2: Li (2018) Bioinformatics
  • f5c: Gamaarachchi et al. (2020) BMC Bioinformatics
  • xpore: Pratanwanich et al. (2021) Nat Biotechnol

Advanced Usage

Customizing Resource Allocation

If you need to adjust compute resources, edit the relevant profile (e.g. the hpc profile) in nextflow.config. For example, to increase memory for the f5c step:

withName: 'F5C_INDEX_EVENTALIGN' {
    memory = '64 GB'  // Increased from 32 GB
}

File Formats Reference

  • POD5: Binary format for raw nanopore signal data (output from sequencer)
  • BLOW5: Binary form of the SLOW5 format for raw nanopore signal data; POD5 files are converted to BLOW5 (with blue-crab) for use with f5c
  • FASTQ: Text format with sequences and quality scores (from basecalling)
  • FASTA: Text format with reference sequences (no quality scores)
  • BAM: Binary format for aligned reads (compressed SAM)
  • Eventalign: Tab-separated file with signal-level alignments to reference
