GitHub - GoekeLab/xpore-minimal-nextflow-pipeline: A minimal nextflow pipeline to align (fastq) reads using minimap2, align raw nanopore signal (pod5) using f5c, and perform differential modification analysis using Xpore.

Minimal Differential Modification Pipeline (Xpore)

A minimal pipeline for detecting differential RNA modifications from nanopore direct RNA sequencing data using Xpore. The pipeline:

Aligns reads (fastq) to your reference using minimap2
Uses blue-crab to convert POD5 files to BLOW5 format
Uses f5c to align raw signal data to the reference (resquiggling)
Runs Xpore (https://github.com/GoekeLab/xpore) to detect differential RNA modifications between conditions (modified vs unmodified)

Before You Start

Xpore Mode

Xpore can be run in two modes, controlled by the xpore_mode parameter in nextflow.config:

transcriptome mode (default): Results report modification positions using transcriptome coordinates
genome mode: Results report modification positions using genomic coordinates

In both modes, reads are aligned with minimap2 against the transcriptome reference. For genome mode, a GTF/GFF file is required to map transcriptomic positions to genomic coordinates.

Required Files

The following input files are needed:

POD5 files: Raw signal files (.pod5 format)
FASTQ files: Basecalled reads (.fastq format)
Reference sequence: A FASTA file (.fa or .fasta) containing your reference transcriptome. At the moment the pipeline only accepts a single reference for all samples. If you want to run different samples with different references, you'll have to run the pipeline multiple times.

Note:

Kmer models: At the moment, pre-trained models for the RNA004 chemistry are provided in data/models/; these are used by both f5c and Xpore.

System Requirements

This pipeline can be run locally or on an HPC cluster using SLURM for scheduling.

To run the pipeline locally you'll need:

Nextflow installed (see: https://www.nextflow.io/docs/latest/install.html#installation)
Docker, Singularity or Conda/Mamba for dependency management
Note: running the pipeline locally is feasible for small subsets of data or test data, but is discouraged for larger datasets (pod5 files can be very large).

To run the pipeline on an HPC cluster you'll need:

Access to an HPC system with SLURM
Nextflow installed (on HPC clusters it is often available via module load nextflow). This repo includes a run_slurm.sh script that handles module loading and job submission, but it may need to be adjusted for your institution's system. Note that run_slurm.sh assumes your HPC cluster uses Singularity; it can be edited to use Conda or Docker for dependency management instead.

Quick Start Guide

Step 1: Samplesheet Preparation

Create a CSV file listing all your samples. Use example_sample_spreadsheet.csv as a template.

Required columns:

| Column | Description | Example |
| --- | --- | --- |
| `sample_id` | Unique name for each sample | `WT`, `KO` |
| `comparison_group` | Group within which samples are compared against each other. Each comparison group needs a `modified` and an `unmodified` sample | `1`, `2`, `3` |
| `condition` | Either `modified` or `unmodified` | `modified` |
| `replicate` | Replicate number. Replicates within a condition (e.g. WT1, WT2) are passed to Xpore together | `1`, `2`, `3` |
| `fastq` | Path to FASTQ file | `/path/to/sample.fastq` |
| `pod5` | Path to POD5 file | `/path/to/sample.pod5` |

Note on fastq and pod5 paths: these can be absolute or relative paths (relative to )

Example spreadsheet:

sample_id,comparison_group,condition,replicate,fastq,pod5
WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
WT_rep2,1,modified,2,/path/to/WT_rep2.fastq,/path/to/WT_rep2.pod5
KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
KO_rep2,1,unmodified,2,/path/to/KO_rep2.fastq,/path/to/KO_rep2.pod5

Note: You can analyze multiple replicates by using the replicate column, and run multiple independent comparisons in one run by using different comparison_group values in your samplesheet. For example:

sample_id,comparison_group,condition,replicate,fastq,pod5
WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
WT_rep2,1,modified,2,/path/to/WT_rep2.fastq,/path/to/WT_rep2.pod5
KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
KO_rep2,1,unmodified,2,/path/to/KO_rep2.fastq,/path/to/KO_rep2.pod5
WT_celltype2_rep1,2,modified,1,/path/to/WT_celltype2_rep1.fastq,/path/to/WT_celltype2_rep1.pod5
WT_celltype2_rep2,2,modified,2,/path/to/WT_celltype2_rep2.fastq,/path/to/WT_celltype2_rep2.pod5
KO_celltype2_rep1,2,unmodified,1,/path/to/KO_celltype2_rep1.fastq,/path/to/KO_celltype2_rep1.pod5
KO_celltype2_rep2,2,unmodified,2,/path/to/KO_celltype2_rep2.fastq,/path/to/KO_celltype2_rep2.pod5

Replicates within the same comparison group are reported in the same results table. Independent comparison groups generate separate results, in this example in diffmod_group1/ and diffmod_group2/.
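The samplesheet rules described in this step can be checked before launching a run. The following is a minimal Python sketch, not part of the pipeline: validate_samplesheet is a hypothetical helper, and the rules it enforces (all six columns present, unique sample_id values, condition restricted to modified or unmodified, both conditions present in every comparison group) are read off the table and notes above rather than taken from main.nf.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = ["sample_id", "comparison_group", "condition",
                    "replicate", "fastq", "pod5"]
VALID_CONDITIONS = {"modified", "unmodified"}


def validate_samplesheet(handle):
    """Return a list of human-readable problems found in a samplesheet.

    `handle` is any open text file or file-like object containing the CSV.
    An empty list means no problem was detected by these checks.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    conditions_by_group = defaultdict(set)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = (row["sample_id"] or "").strip()
        condition = (row["condition"] or "").strip()
        group = (row["comparison_group"] or "").strip()
        if sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate sample_id '{sample_id}'")
        seen_ids.add(sample_id)
        if condition not in VALID_CONDITIONS:
            problems.append(
                f"line {line_no}: condition must be 'modified' or "
                f"'unmodified', got '{condition}'"
            )
        conditions_by_group[group].add(condition)

    # Every comparison group needs both conditions to be comparable.
    for group, conditions in sorted(conditions_by_group.items()):
        for needed in sorted(VALID_CONDITIONS - conditions):
            problems.append(f"comparison_group {group}: no '{needed}' sample")
    return problems


# Demo on an in-memory sheet shaped like the example above.
demo = io.StringIO(
    "sample_id,comparison_group,condition,replicate,fastq,pod5\n"
    "WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5\n"
    "KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5\n"
)
for problem in validate_samplesheet(demo):
    print(problem)
```

To check a real file, pass an open handle instead, for example open("example_sample_spreadsheet.csv", newline="").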

Step 2: Update Configuration

Edit nextflow.config to point to your files:

params {
    samplesheet = "/full/path/to/your_samples.csv"
    outdir = "/full/path/to/output_directory"
    xpore_mode = "transcriptome"  // or "genome"
    reference = "/full/path/to/your_transcriptome.fa"
    gtf_or_gff = "/full/path/to/your_annotation.gtf"  // Required for genome mode
    f5c_kmer_model = "${projectDir}/data/models/rna004.nucleotide.5mer.model.txt" //Typically will not need to be updated for RNA004 data
    xpore_diffmod_model = "${projectDir}/data/models/RNA004_5mer_model.txt" //Typically will not need to be updated for RNA004 data
}

What to change:

samplesheet: Path to your sample CSV file
outdir: Where you want results saved (the folder you specify will be created automatically)
xpore_mode: Choose "transcriptome" or "genome" depending on desired output coordinates
reference: Path to your reference transcriptome FASTA file
gtf_or_gff: Path to your GTF or GFF annotation file (only required if using genome mode; GTF format recommended)
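The one dependency between these settings (gtf_or_gff is only required in genome mode) can be expressed as a small pre-flight check. This is a hedged Python sketch: check_params is a hypothetical helper, and the dictionary is a hand-written mirror of the params block, not something parsed from nextflow.config.

```python
VALID_MODES = {"transcriptome", "genome"}


def check_params(params):
    """Return a list of problems for a dict mirroring the params block."""
    problems = []
    # transcriptome is the documented default mode.
    mode = params.get("xpore_mode", "transcriptome")
    if mode not in VALID_MODES:
        problems.append(
            f"xpore_mode must be 'transcriptome' or 'genome', got '{mode}'"
        )
    for key in ("samplesheet", "outdir", "reference"):
        if not params.get(key):
            problems.append(f"{key} is not set")
    # The annotation file is only needed to map positions to the genome.
    if mode == "genome" and not params.get("gtf_or_gff"):
        problems.append("gtf_or_gff is required when xpore_mode is 'genome'")
    return problems


# Hypothetical settings: genome mode requested without an annotation file.
example = {
    "samplesheet": "/full/path/to/your_samples.csv",
    "outdir": "/full/path/to/output_directory",
    "xpore_mode": "genome",
    "reference": "/full/path/to/your_transcriptome.fa",
}
for problem in check_params(example):
    print(problem)
```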

Step 3: Run the Pipeline Locally or Submit the Pipeline to SLURM

Clone the repository into your working directory (locally or on your HPC) and navigate to the main project directory.

From the main project directory, run the pipeline using one of the following:

For local execution:

nextflow run main.nf -profile conda,local

If the pipeline is interrupted, you can restart it with:

nextflow run main.nf -profile conda,local -resume

When running locally, Nextflow will display progress in your terminal in real-time. You'll see:

Each process as it starts and completes
Progress bars showing completion status
Any errors or warnings as they occur

The pipeline runs in the foreground, so keep the terminal open. If interrupted, restart with the -resume flag to continue from the last successful step.

For HPC execution:

Submit the job via SLURM:

sbatch run_slurm.sh

Note: run_slurm.sh will:

Submit a main job that manages the pipeline
Automatically submit additional jobs for each processing step

To check your running jobs, use:

squeue -u $USER

You should see a job named nextflow_main (the controller) and additional jobs for each processing step as they start.

If your pipeline fails or is interrupted, you don't need to start over. The -resume flag in run_slurm.sh automatically resumes from the last successful step.

Results

Results are saved in the directory specified by outdir in your config file (default: results/).

Output Directory Structure

20260127_results/
├── blow5/                     # Converted pod5 files (intermediate)
├── minimap2/                  # Aligned fastq reads (intermediate)
├── eventalign/f5c/            # Signal-level alignments (intermediate)
└── xpore/
    ├── dataprep/              # Preprocessed data (intermediate)
    └── diffmod/               # FINAL RESULTS HERE
        ├── diffmod_group1/    # Results for comparison group 1
        └── diffmod_group2/    # Results for comparison group 2 (if applicable)

Main Results Files

The most important results are in xpore/diffmod/diffmod_groupX/ directories:

diffmod.table: Main results table with modification statistics for each RNA position
diffmod.log: Log file with processing information

Key columns in diffmod.table (for a full description of the output, see https://xpore.readthedocs.io/en/latest/outputtable.html):

position: Position in the reference sequence
kmer: The 5-nucleotide sequence context
diff_mod_rate_<condition1>_vs_<condition2>: Differential modification rate between condition1 and condition2 (modification rate of modified - modification rate of unmodified)
pval_<condition1>_vs_<condition2>: Significance level from the z-test of the differential modification rate
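As a starting point for reading the table, the columns listed above can be used to shortlist candidate sites. The Python sketch below makes several assumptions: that diffmod.table is comma-separated (the delimiter is a parameter in case it is not), that the first columns starting with pval_ and diff_mod_rate_ are the ones of interest, and that significant_sites, the cutoffs, and the condition labels in the demo are purely illustrative. No multiple-testing correction is applied.

```python
import csv
import io


def significant_sites(handle, pval_cutoff=0.05, min_abs_diff=0.0, delimiter=","):
    """Return rows passing the cutoffs, sorted by ascending p-value."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fields = reader.fieldnames or []
    pval_cols = [f for f in fields if f.startswith("pval_")]
    diff_cols = [f for f in fields if f.startswith("diff_mod_rate_")]
    if not pval_cols or not diff_cols:
        raise ValueError("no pval_* or diff_mod_rate_* column found")
    pval_col, diff_col = pval_cols[0], diff_cols[0]

    hits = []
    for row in reader:
        try:
            pval = float(row[pval_col])
            diff = float(row[diff_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        # Comparisons involving NaN are False, so NaN rows are dropped too.
        if pval <= pval_cutoff and abs(diff) >= min_abs_diff:
            hits.append({
                "position": row.get("position"),
                "kmer": row.get("kmer"),
                "diff_mod_rate": diff,
                "pval": pval,
            })
    hits.sort(key=lambda hit: hit["pval"])
    return hits


# Demo on a tiny made-up table; the values and labels are not real results.
demo = io.StringIO(
    "id,position,kmer,diff_mod_rate_KO_vs_WT,pval_KO_vs_WT\n"
    "tx1,100,GGACT,-0.5,0.001\n"
    "tx1,200,AAACA,0.1,0.6\n"
    "tx2,50,GGACC,0.4,0.01\n"
)
for hit in significant_sites(demo, pval_cutoff=0.05, min_abs_diff=0.2):
    print(hit["position"], hit["kmer"], hit["diff_mod_rate"], hit["pval"])
```

For a real run, pass an open handle to the diffmod.table of the comparison group you are interested in.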

Cleaning Up Work Files

Nextflow stores intermediate files in a work/ directory, which can grow very large, especially with nanopore data. After your pipeline has completed successfully, you can reclaim disk space by running nextflow clean -f to remove cached work files. If you plan to use -resume later, you can instead clean only older runs with nextflow clean -f -before <run_name>, which removes the work files of runs executed before the named run and keeps the more recent cache. Use -n in place of -f for a dry run that only lists what would be removed. You can view past runs with nextflow log.

Troubleshooting

Common Issues

Problem: Jobs fail immediately

Check the error log: cat nextflow_*.err
Common causes:
- Incorrect file paths in samplesheet
- Missing reference file
- Conda environment issues
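The first cause in the list, incorrect file paths in the samplesheet, can be ruled out before submitting. This is a minimal Python sketch under stated assumptions: missing_inputs is a hypothetical helper, and base_dir (the directory against which relative paths are resolved) is a parameter you must supply, since it depends on how you launch the pipeline.

```python
import csv
import io
from pathlib import Path


def missing_inputs(handle, base_dir="."):
    """Return (sample_id, column, value) for each fastq/pod5 path not found."""
    base = Path(base_dir)
    missing = []
    for row in csv.DictReader(handle):
        for column in ("fastq", "pod5"):
            value = (row.get(column) or "").strip()
            path = Path(value)
            if not path.is_absolute():
                path = base / path  # resolve relative paths against base_dir
            # An empty cell counts as missing even though base_dir exists.
            if not value or not path.exists():
                missing.append((row.get("sample_id"), column, value))
    return missing


# Demo with placeholder paths, which are reported as missing.
demo = io.StringIO(
    "sample_id,comparison_group,condition,replicate,fastq,pod5\n"
    "WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5\n"
)
for sample_id, column, value in missing_inputs(demo):
    print(f"{sample_id}: {column} not found: {value}")
```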

Citations

Nextflow: Di Tommaso et al. (2017) Nat Biotechnol
minimap2: Li (2018) Bioinformatics
f5c: Gamaarachchi et al. (2020) BMC Bioinformatics
xpore: Pratanwanich et al. (2021) Nat Biotechnol

Advanced Usage

Customizing Resource Allocation

If you need to adjust compute resources, edit the relevant profile (e.g. the hpc profile) in nextflow.config. For example, to increase memory for the F5C step:

withName: 'F5C_INDEX_EVENTALIGN' {
    memory = '64 GB'  // Increased from 32 GB
}

File Formats Reference

POD5: Binary format for raw nanopore signal data (output from sequencer)
BLOW5: Binary form of the SLOW5 format for raw nanopore signal data; converted from POD5 by blue-crab for use with f5c
FASTQ: Text format with sequences and quality scores (from basecalling)
FASTA: Text format with reference sequences (no quality scores)
BAM: Binary format for aligned reads (compressed SAM)
Eventalign: Tab-separated file with signal-level alignments to reference
