A minimal pipeline for differential RNA modification detection from nanopore direct RNA sequencing data using Xpore. The pipeline:
- Aligns reads (fastq) to your reference using minimap2
- Uses blue-crab to convert POD5 files to BLOW5 format
- Uses f5c to align raw signal data to the reference (resquiggling)
- Runs Xpore (https://github.com/GoekeLab/xpore) to detect differential RNA modifications between conditions (modified vs unmodified)
Xpore can be run in two modes, controlled by the xpore_mode parameter in nextflow.config:
- `transcriptome` mode (default): results report modification positions using transcriptome coordinates
- `genome` mode: results report modification positions using genomic coordinates
In both modes, reads are aligned with minimap2 against the transcriptome reference. For genome mode, a GTF/GFF file is required to map transcriptomic positions to genomic coordinates.
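To illustrate conceptually what the genome-mode conversion does, here is a minimal sketch (not the pipeline's actual code) that maps a transcript position to a genomic coordinate using exon blocks taken from an annotation, assuming a forward-strand transcript and 0-based, half-open exon coordinates:

```python
def transcript_to_genome(tx_pos, exons):
    """Map a 0-based transcript position to a genomic coordinate.

    exons: list of (genomic_start, genomic_end) tuples, 0-based half-open,
    ordered 5'->3' along a forward-strand transcript.
    """
    offset = 0  # transcript bases consumed by previous exons
    for start, end in exons:
        length = end - start
        if tx_pos < offset + length:
            return start + (tx_pos - offset)  # position falls in this exon
        offset += length
    raise ValueError("position beyond transcript length")

# Hypothetical transcript with two exons: genomic 100-150 and 200-260
exons = [(100, 150), (200, 260)]
print(transcript_to_genome(10, exons))  # 110 (within the first exon)
print(transcript_to_genome(60, exons))  # 210 (10 bases into the second exon)
```

Reverse-strand transcripts additionally require flipping the orientation, which is why the pipeline needs the GTF/GFF annotation rather than the FASTA alone.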
The following input files are needed:
- POD5 files: raw signal files (`.pod5` format)
- FASTQ files: basecalled reads (`.fastq` format)
- Reference sequence: a FASTA file (`.fa` or `.fasta`) containing your reference transcriptome. At the moment the pipeline only accepts a single reference for all samples; if you want to run different samples with different references you'll have to run the pipeline multiple times.
Note:
- K-mer models: pre-trained models for the RNA004 chemistry are included in `data/models/`; these are used by f5c and Xpore.
This pipeline can be run locally or on an HPC cluster using SLURM for scheduling.
To run the pipeline locally you'll need:
- Nextflow installed (see: https://www.nextflow.io/docs/latest/install.html#installation)
- Docker, Singularity or Conda/Mamba for dependency management
- Note - running the pipeline locally is possible for small subsets of data or test data, but is discouraged for larger data files (pod5 files can be very large)
To run the pipeline on an HPC cluster you'll need:
- Access to an HPC system with SLURM
- Nextflow installed (on HPC clusters it is often available via `module load nextflow`)
- Included in this repo is a `run_slurm.sh` script that handles module loading and job submission, but it may need to be tweaked for your institution's system. Note that `run_slurm.sh` assumes your HPC cluster uses Singularity; it can be edited to use Conda or Docker for dependency management.
Create a CSV file listing all your samples. Use example_sample_spreadsheet.csv as a template.
Required columns:
| Column | Description | Example |
|---|---|---|
| `sample_id` | Unique name for each sample | WT, KO |
| `comparison_group` | The group within which samples will be compared against each other. Each comparison group needs both a `modified` and an `unmodified` sample | 1, 2, 3 |
| `condition` | Either `modified` or `unmodified` | modified |
| `replicate` | Replicates within a condition (e.g. WT1, WT2) will be input into Xpore together | 1, 2, 3 |
| `fastq` | Path to FASTQ file | /path/to/sample.fastq |
| `pod5` | Path to POD5 file | /path/to/sample.pod5 |
Note on fastq & pod5 paths: these can be absolute or relative paths (relative to )
Example spreadsheet:
```
sample_id,comparison_group,condition,replicate,fastq,pod5
WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
WT_rep2,1,modified,2,/path/to/WT_rep2.fastq,/path/to/WT_rep2.pod5
KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
KO_rep2,1,unmodified,2,/path/to/KO_rep2.fastq,/path/to/KO_rep2.pod5
```
Note: you can analyze multiple replicates using the `replicate` column, and run multiple independent comparisons in one run by using different `comparison_group` values in your samplesheet, for example:
```
sample_id,comparison_group,condition,replicate,fastq,pod5
WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
WT_rep2,1,modified,2,/path/to/WT_rep2.fastq,/path/to/WT_rep2.pod5
KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
KO_rep2,1,unmodified,2,/path/to/KO_rep2.fastq,/path/to/KO_rep2.pod5
WT_celltype2_rep1,2,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
WT_celltype2_rep2,2,modified,2,/path/to/WT_rep2.fastq,/path/to/WT_rep2.pod5
KO_celltype2_rep1,2,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
KO_celltype2_rep2,2,unmodified,2,/path/to/KO_rep2.fastq,/path/to/KO_rep2.pod5
```
Replicates within the same comparison group will appear in the same results table. Independent comparison groups generate separate results in `diffmod_group1/` and `diffmod_group2/`.
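Before launching a run, it can help to sanity-check the samplesheet. A minimal validation sketch (not part of the pipeline) that verifies the required columns are present, conditions are valid, and every comparison group has both a modified and an unmodified sample:

```python
import csv
import io

REQUIRED = {"sample_id", "comparison_group", "condition", "replicate", "fastq", "pod5"}

def validate_samplesheet(text):
    """Return a list of problems found in a samplesheet (empty list = OK)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    problems = []
    missing = REQUIRED - set(rows[0].keys()) if rows else REQUIRED
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems
    groups = {}  # comparison_group -> set of conditions seen
    for row in rows:
        if row["condition"] not in ("modified", "unmodified"):
            problems.append(f"{row['sample_id']}: bad condition {row['condition']!r}")
        groups.setdefault(row["comparison_group"], set()).add(row["condition"])
    for group, conditions in groups.items():
        if conditions != {"modified", "unmodified"}:
            problems.append(f"comparison_group {group}: needs both conditions")
    return problems

sheet = """sample_id,comparison_group,condition,replicate,fastq,pod5
WT_rep1,1,modified,1,/path/to/WT_rep1.fastq,/path/to/WT_rep1.pod5
KO_rep1,1,unmodified,1,/path/to/KO_rep1.fastq,/path/to/KO_rep1.pod5
"""
print(validate_samplesheet(sheet))  # [] -> no problems found
```

Catching a missing `unmodified` sample here is much cheaper than discovering it after several hours of resquiggling.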
Edit nextflow.config to point to your files:
```groovy
params {
    samplesheet         = "/full/path/to/your_samples.csv"
    outdir              = "/full/path/to/output_directory"
    xpore_mode          = "transcriptome" // or "genome"
    reference           = "/full/path/to/your_transcriptome.fa"
    gtf_or_gff          = "/full/path/to/your_annotation.gtf" // Required for genome mode
    f5c_kmer_model      = "${projectDir}/data/models/rna004.nucleotide.5mer.model.txt" // Typically will not need to be updated for RNA004 data
    xpore_diffmod_model = "${projectDir}/data/models/RNA004_5mer_model.txt" // Typically will not need to be updated for RNA004 data
}
```
What to change:
- `samplesheet`: path to your sample CSV file
- `outdir`: where you want results saved (the folder name you input will be created automatically)
- `xpore_mode`: choose `"transcriptome"` or `"genome"` depending on the desired output coordinates
- `reference`: path to your reference transcriptome FASTA file
- `gtf_or_gff`: path to your GTF or GFF annotation file (only required if using genome mode; GTF format recommended)
Clone the repository into your working directory (locally or on your HPC) and navigate to the main project directory.
From the PRJ008 directory, submit the pipeline using either:
For local execution:
```
nextflow run main.nf -profile conda,local
```
If the pipeline is interrupted, you can restart it with:
```
nextflow run main.nf -profile conda,local -resume
```
When running locally, Nextflow displays progress in your terminal in real time. You'll see:
- Each process as it starts and completes
- Progress bars showing completion status
- Any errors or warnings as they occur
The pipeline runs in the foreground, so keep the terminal open. If interrupted, restart with the -resume flag to continue from the last successful step.
For HPC execution:
For submitting a job via slurm on an HPC:
```
sbatch run_slurm.sh
```
Note: `run_slurm.sh` will:
- Submit a main job that manages the pipeline
- Automatically submit additional jobs for each processing step
To check your running jobs, use:
```
squeue -u $USER
```
You should see a job named `nextflow_main` (the controller) and additional jobs for each processing step as they start.
If your pipeline fails or is interrupted, you don't need to start over. The -resume flag in run_slurm.sh automatically resumes from the last successful step.
Results are saved in the directory specified by outdir in your config file (default: results/).
```
20260127_results/
├── blow5/            # Converted pod5 files (intermediate)
├── minimap2/         # Aligned fastq reads (intermediate)
├── eventalign/f5c/   # Signal-level alignments (intermediate)
└── xpore/
    ├── dataprep/     # Preprocessed data (intermediate)
    └── diffmod/      # FINAL RESULTS HERE
        ├── diffmod_group1/   # Results for comparison group 1
        └── diffmod_group2/   # Results for comparison group 2 (if applicable)
```
The most important results are in xpore/diffmod/diffmod_groupX/ directories:
- `diffmod.table`: main results table with modification statistics for each RNA position
- `diffmod.log`: log file with processing information
Key columns in `diffmod.table` (for a full description of the output see: https://xpore.readthedocs.io/en/latest/outputtable.html):
- `position`: position in the reference sequence
- `kmer`: the 5-nucleotide sequence context
- `diff_mod_rate_<condition1>_vs_<condition2>`: differential modification rate between condition1 and condition2 (modification rate of modified minus modification rate of unmodified)
- `pval_<condition1>_vs_<condition2>`: significance level from a z-test of the differential modification rate
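As a quick post-processing sketch, the results table can be filtered for significant positions, e.g. p-value below 0.05 and an absolute modification-rate difference above 0.1. The column names below follow the `<condition1>_vs_<condition2>` pattern described above with hypothetical condition labels; adjust them (and the thresholds) to your own data:

```python
import csv
import io

def significant_sites(table_text, pval_col, rate_col,
                      max_pval=0.05, min_abs_rate=0.1):
    """Yield rows passing simple significance and effect-size filters."""
    for row in csv.DictReader(io.StringIO(table_text)):
        if (float(row[pval_col]) < max_pval
                and abs(float(row[rate_col])) > min_abs_rate):
            yield row

# Toy two-row table using hypothetical condition names
table = """id,position,kmer,diff_mod_rate_modified_vs_unmodified,pval_modified_vs_unmodified
tx1,100,GGACT,0.45,0.001
tx1,250,AGACC,0.02,0.800
"""
hits = list(significant_sites(table,
                              "pval_modified_vs_unmodified",
                              "diff_mod_rate_modified_vs_unmodified"))
print([row["position"] for row in hits])  # ['100']
```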
Nextflow stores intermediate files in a work/ directory which can grow very large, especially with nanopore data. After your pipeline has completed successfully, you can reclaim disk space by running nextflow clean -f to remove all cached work files. If you plan to use -resume later, you can selectively clean old runs with nextflow clean -before <run_name> to keep only the most recent cache. You can view past runs with nextflow log.
Problem: Jobs fail immediately
- Check the error log: `cat nextflow_*.err`
- Common causes:
- Incorrect file paths in samplesheet
- Missing reference file
- Conda environment issues
- Nextflow: Di Tommaso et al. (2017) Nat Biotechnol
- minimap2: Li (2018) Bioinformatics
- f5c: Gamaarachchi et al. (2020) BMC Bioinformatics
- xpore: Pratanwanich et al. (2021) Nat Biotechnol
If you need to adjust compute resources, edit the profiles (e.g. the hpc profile) in nextflow.config. For example, to increase memory for the f5c step:
```groovy
withName: 'F5C_INDEX_EVENTALIGN' {
    memory = '64 GB' // Increased from 32 GB
}
```
- POD5: Binary format for raw nanopore signal data (output from the sequencer)
- BLOW5: Compressed version of POD5, faster to process
- FASTQ: Text format with sequences and quality scores (from basecalling)
- FASTA: Text format with reference sequences (no quality scores)
- BAM: Binary format for aligned reads (compressed SAM)
- Eventalign: Tab-separated file with signal-level alignments to reference
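To illustrate the eventalign format, here is a sketch that averages event current levels per reference k-mer. The column names (`reference_kmer`, `event_level_mean`) follow the nanopolish-style header that f5c emits; verify them against the header your f5c version actually produces:

```python
import csv
import io
from collections import defaultdict

def mean_current_per_kmer(eventalign_text):
    """Average the event_level_mean column for each reference k-mer."""
    sums = defaultdict(lambda: [0.0, 0])  # kmer -> [running sum, count]
    for row in csv.DictReader(io.StringIO(eventalign_text), delimiter="\t"):
        acc = sums[row["reference_kmer"]]
        acc[0] += float(row["event_level_mean"])
        acc[1] += 1
    return {kmer: total / n for kmer, (total, n) in sums.items()}

# Toy tab-separated example with a minimal subset of columns
rows = ("contig\tposition\treference_kmer\tevent_level_mean\n"
        "tx1\t5\tGGACT\t98.0\n"
        "tx1\t5\tGGACT\t102.0\n")
print(mean_current_per_kmer(rows))  # {'GGACT': 100.0}
```

This kind of per-k-mer summary is essentially what Xpore's dataprep step computes (in a far more sophisticated form) before modeling modification rates.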