This is a work-in-progress document to plan the tools to be included in the pipeline (flowchart). It should be moved to the wiki once it is finalized, for posterity. A shorter summary of the tools and what they do will be in the README.
| Category | Tool | Nextflow Implementation |
|---|---|---|
| Quality Control | NanoPlot | nf-core/nanoplot |
| | hostile | nf-core/hostile |
| | chopper | nf-core/chopper |
| Assembly | Flye | nf-core/flye |
| | myloasm | - |
| Binning | SemiBin2 | - |
| Bin QC | CheckM2 | - |
| SNP/SV Detection | rhea (SV, timecourse) | - |
| Taxonomic Profiling | Emu | gms-16s |
| | Lemur + MAGnet | - |
| | Sylph | - |
| | SingleM pipe | - |
| Functional Annotation | SeqScreen | DSL1 to 2 |
| | bakta | - |
| Community Assessment | SingleM appraise | - |
| Reporting, Visualization | taxburst | - |
| | MetagenomeScope | - |
| Total number of unique tools: 17 | | |
| Number with Nextflow implementations: 7ish/17? | | |
| Soon to be many more... | | |
The tools below will be implemented in the future; they are moved here to preserve the links etc.
| Category | Tool | Nextflow Implementation |
|---|---|---|
| Quality Control | filtlong | nf-core/filtlong |
| Assembly | MetaCompass | - |
| Pangenomics | parsnp | - |
| | tMHG-Finder | - |
| Taxonomic Profiling | MetaPhlAn | - |
| Metabolic Reconstruction | Bakdrive | - |
| | micom | - |
The tables above summarize all the tools listed below. Tools are grouped by category and listed in order of priority within each category. Tools with available Nextflow implementations are linked in the tables.
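The tool counts quoted in the first table can be re-derived rather than maintained by hand. The following is a minimal Python sketch, assuming the rows are transcribed as `(category, tool, implementation)` tuples and that `-` means "no Nextflow implementation yet"; the `TOOLS` list and the `tally` helper are illustrative names, not files or functions in the repo.

```python
# Rows transcribed from the first table in this document:
# (category, tool, Nextflow implementation). "-" means none is known yet.
TOOLS = [
    ("Quality Control", "NanoPlot", "nf-core/nanoplot"),
    ("Quality Control", "hostile", "nf-core/hostile"),
    ("Quality Control", "chopper", "nf-core/chopper"),
    ("Assembly", "Flye", "nf-core/flye"),
    ("Assembly", "myloasm", "-"),
    ("Binning", "SemiBin2", "-"),
    ("Bin QC", "CheckM2", "-"),
    ("SNP/SV Detection", "rhea", "-"),
    ("Taxonomic Profiling", "Emu", "gms-16s"),
    ("Taxonomic Profiling", "Lemur + MAGnet", "-"),
    ("Taxonomic Profiling", "Sylph", "-"),
    ("Taxonomic Profiling", "SingleM pipe", "-"),
    ("Functional Annotation", "SeqScreen", "DSL1 to 2"),
    ("Functional Annotation", "bakta", "-"),
    ("Community Assessment", "SingleM appraise", "-"),
    ("Reporting, Visualization", "taxburst", "-"),
    ("Reporting, Visualization", "MetagenomeScope", "-"),
]


def tally(rows):
    """Return (total row count, names of tools with a non-"-" implementation)."""
    implemented = [tool for _, tool, impl in rows if impl != "-"]
    return len(rows), implemented


total, implemented = tally(TOOLS)
print(f"{len(implemented)}/{total} tools list a Nextflow implementation: {implemented}")
```

With the rows as transcribed here, this counts 6 of 17 rows with an entry in the implementation column (SeqScreen is counted even though it is still in the DSL1 to 2 transition), which is in the neighborhood of the "7ish" estimate.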
Original plan from flowchart sketched on whiteboard on 2025-05-13: Inputs from Todd Treangen
- Add github links and citations for these tools
- List the LLM questions that will lead to each step
- Check and link the pipelines for which nextflow DSL1 or DSL2 implementation exists + citations
- 3 broad branches of the pipeline
  - 16S
  - metagenomic classification
  - assembly-based metagenomics
- 1 vs n samples
  - rhea is good for n samples, so it will dictate the rest of the pipeline
- Type of data technology (see the Flye docs for ideas on more detailed options):
  - Nanopore (ONT)
  - PacBio (PacBio)
  - Synthetic long reads (Illumina)
- Assembly vs reads
- Reference guided vs de novo assembly
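The branching described in this list can be sketched as a small decision function. This is only an illustration in Python under stated assumptions: the `RunSpec` fields and the `plan_tools` helper are hypothetical names, the tool-to-branch assignments are read off the summary table, and running the same QC tools on every branch (and adding rhea whenever there is more than one metagenomic sample) is an assumption rather than a settled design.

```python
from dataclasses import dataclass


@dataclass
class RunSpec:
    data_type: str      # "16S" or "metagenomic" (hypothetical labels)
    n_samples: int      # 1 vs n samples
    use_assembly: bool  # assembly-based vs read-based analysis


def plan_tools(spec: RunSpec) -> list[str]:
    """Pick a tool list for one of the three broad branches."""
    if spec.data_type not in ("16S", "metagenomic"):
        raise ValueError(f"unknown data type: {spec.data_type!r}")

    # Assumption: QC runs first on every branch.
    tools = ["NanoPlot", "hostile", "chopper"]

    if spec.data_type == "16S":
        tools.append("Emu")
    elif spec.use_assembly:
        tools += ["Flye", "SemiBin2", "CheckM2"]
    else:
        tools += ["Lemur + MAGnet", "Sylph", "SingleM pipe"]

    # rhea targets n samples, so it is only added for multi-sample metagenomes.
    if spec.data_type == "metagenomic" and spec.n_samples > 1:
        tools.append("rhea")
    return tools


print(plan_tools(RunSpec(data_type="metagenomic", n_samples=5, use_assembly=True)))
```

Keeping the decision in one function makes it easy to map each LLM question onto a `RunSpec` later, but the real pipeline may well need more axes (sequencing technology, reference guided vs de novo).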
- NanoPlot: QC plotting suite for long-read sequencing data and alignments. Used for QC on the raw files and on the final reads after filtering. nf-core/module | citation
- hostile: A tool for filtering out reads that align to a host genome (removes host contamination from microbial metagenomes). nf-core/modules: hostile-clean; hostile-fetch, code | citation
- chopper: A tool to filter nanopore sequencing reads by quality and length. Used to filter out low-quality and short reads. nf-core/module, code | citation
- filtlong: Quality filtering tool for long reads by read quality and length. nf-core/module, code | no citation
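To make "filter by quality and length" concrete, here is a minimal Python sketch of that kind of filter. It is not chopper's or filtlong's implementation: the function names, the thresholds (`min_length=500`, `min_quality=10.0`), and the choice to average quality in error-probability space are all assumptions made for illustration.

```python
import math


def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Mean read quality, averaged over error probabilities (not raw Phred scores)."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence: str, quality_string: str,
              min_length: int = 500, min_quality: float = 10.0) -> bool:
    """True if the read passes both the (hypothetical) length and quality cutoffs."""
    return len(sequence) >= min_length and mean_phred(quality_string) >= min_quality


def filter_reads(records, **thresholds):
    """Yield (name, sequence, quality) records that pass keep_read."""
    for name, sequence, quality in records:
        if keep_read(sequence, quality, **thresholds):
            yield name, sequence, quality


reads = [
    ("long_good", "A" * 600, "5" * 600),   # Q20, long enough
    ("too_short", "A" * 100, "5" * 100),   # Q20, but short
    ("low_qual", "A" * 600, "#" * 600),    # Q2, long enough
]
kept = [name for name, _, _ in filter_reads(reads)]
print(kept)
```

In the pipeline itself the filtering would be done by the nf-core chopper module; a sketch like this is mainly useful for explaining to users what the quality and length parameters mean.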
De novo assembly: Note: uses a lot of RAM: 32 GB minimum
- Flye: De novo assembler for single-molecule sequencing reads using repeat graphs (PacBio and Oxford Nanopore). nf-core/module, code | citation
- myloasm: A de novo metagenome assembler that takes long reads and outputs polished contigs in a single command. (Not published yet, but probably will be by the time the Somatem manuscript is out; a very complete GitHub repo & docs already exist; also a Jim Shaw & Heng Li collab.)
Reference guided: We aim to implement a reference-guided assembler in the future; however, meta-compass is currently short-read only...
- meta-compass: A metagenomic reference-guided assembler that leverages multiple reference genomes
Currently we only use one binner, but we aspire to include multiple binners + refinement tools down the road.
- SemiBin2: Uses self-supervised learning to learn feature embeddings from the contigs, with emphasis on long-read sequencing data. citation
- parsnp: A fast microbial core-genome alignment tool, which can output core genome phylogeny, multiple genome alignments and SNP calls. citation
- tMHG-Finder: Tool for tree guided maximal homologous group (MHG) identification from multiple genomes. MHGs enable more accurate phylogenetic reconstruction than gene annotations, accounting for horizontal gene transfer. citation & older MHG-finder
Includes gene duplication loss
- rhea: Detects structural variants (SV, >10 bp indels) and HGT between temporally evolving microbial metagenomic samples for large cohorts of related or similar genomes (1:n samples). citation: Curry et al., Bioinformatics, 2024
- Emu: Taxonomic classification and abundance estimation of 16S rRNA reads for long-read data. Nextflow DSL2 implementation: gms-16s + gms-16S citation | citation
- Lemur: For rapid and accurate taxonomic profiling on long-read metagenomic datasets. citation
- Sylph: A tool for rapid and accurate species level taxonomic profiling of metagenomic data using k-mer sketches. citation: Shaw et al., Nature Biotechnology, 2024 | documentation
- MetaPhlAn: Included because of its recent support for long reads; combined with HUMAnN, it enables a lightweight functional annotation analysis.
- Centrifuger might be better for large databases due to compression? | citation
- SeqScreen: Functional screening of pathogenic sequences in metagenomic data. nextflow: DSL1 to 2 transition | citation: Balaji et al., Genome Biology, 2022
- includes antibiotic resistance genes.
- HUMAnN: HMP Unified Metabolic Analysis Network - profiling microbial community metabolic potential. (Open questions: not for long reads? (says Austin); eukaryotic; RAM intensive.)
- bakta
How is this different from taxonomic classification?
- SeqScreen: Functional screening of pathogenic sequences in metagenomic data.
- Centrifuge: A rapid and memory-efficient classification system for metagenomic sequences
- MAGnet: Metagenomic Analysis of Genomes in the ENvironmental Toolkit
- SeqScreen: Functional screening of pathogenic sequences in metagenomic data
- Bakdrive / recent/private version: Can take in Emu output. citation: Wang et al., Bioinformatics, 2023
- CheckM2
- SingleM appraise: Check how MetAMOS implements this (says Todd)
- Check if tools ran correctly
- nextflow execution report (html)
- MultiQC
- SingleM appraise: Check how the MetAMOS report was made from scratch (says Todd)
- taxburst
- MetagenomeScope: Web based visualization tool for metagenomic assembly graphs.
- emperor: Interactive ordination plot visualization
- bacterial: RefSeq
- viral: NCBI Viral Genomes
- CAMP (paper): A Snakemake workflow that aims for one-click deployment while remaining modular. It works with long reads and hybrid assemblies as well as short reads. They are also incorporating an LLM called bootcamp, similar to omi, so they might really beat us to it with Somatem?
- aviary: A Snakemake workflow that Austin espouses, but Todd doesn't like that it's not published.
  - (Austin) This is the pipeline used for our previous depletion paper. It was put together by Ben Woodcroft’s group and has SingleM and all the rest of their "m" tools.
- cloudres: MLST assignment and AMR detection. Could learn/borrow code.
- mmlong2: Genome-centric long-read metagenomics workflow for automated recovery and analysis of prokaryotic genomes with Nanopore or PacBio HiFi sequencing data.
- MUFFIN: A hybrid assembly and differential binning workflow for metagenomics, transcriptomics and pathway analysis.
- mapo tofu: A Nextflow pipeline for short reads.
- Can borrow their Sylph module?
- BugBuster
- No tool in our list
Content from this document was reformatted/summarized into a few other documents for easier access. I used Windsurf AI for this, so keeping these in the same repo makes them easy to iterate on with AI when changes occur.
List
- tool_voting.csv: Contains tool names with category for Todd to vote on/suggest new ones
- tool_links.csv: Contains links to the GitHub repos and citations for easy parsing into embeddings by Sahil
- mock_nf_params.yaml: A dadasnake-inspired YAML file formatted with the tools in this list
  - I want Sahil to edit this file as appropriate (write access needed), so will need to copy/move it into the omi repo.
  - Will explore making a sub-module that's shared in both repos in the future if > 2 changes are being made by Sahil.
    - (Not possible to use git-supported soft links to share with another computer.) But we need some symlink mechanism to keep the files linked, and some way to sync their commit history as well; currently this will need to be done manually?
- Tool links: archived
  - meta-compass,Assembly,A metagenomic reference-guided assembler that leverages multiple reference genomes,https://github.com/marbl/MetaCompass,,,
  - filtlong,Quality control / preprocessing,Quality filtering tool for long-reads by read quality and length,https://github.com/rrwick/Filtlong,,https://github.com/nf-core/modules/tree/master/modules/nf-core/filtlong,No citation available
  - AutoCycler,Assembly,Consensus long-read assembly pipeline combining multiple alternative assemblies,https://github.com/rrwick/AutoCycler,https://www.biorxiv.org/content/10.1101/2025.05.12.653612v1.full,,
  - Canu,Assembly,Hierarchical assembler designed for high-noise single-molecule sequencing,https://github.com/marbl/canu,https://doi.org/10.1101/gr.215087.116,https://github.com/nf-core/modules/tree/master/modules/nf-core/canu,End of life since 2021; don't use
  - parsnp,Pangenomic analyses,A fast microbial core-genome alignment tool,https://github.com/marbl/parsnp,https://academic.oup.com/bioinformatics/article/40/5/btae311/7667868,,
  - tMHG-Finder,Pangenomic analyses,Tree guided maximal homologous group (MHG) identification,https://github.com/yongze-yin/tMHG-Finder,https://www.biorxiv.org/content/10.1101/2025.03.16.643543v1.full,,
  - Centrifuge,Taxonomic classification,A rapid and memory-efficient classification system for metagenomic sequences,https://github.com/DaehwanKimLab/centrifuge,https://doi.org/10.1101/gr.210641.116,https://github.com/nf-core/modules/tree/master/modules/nf-core/centrifuge,
  - Centrifuger,Taxonomic classification,Improved version of Centrifuge for large databases,https://github.com/mourisl/centrifuger,https://link.springer.com/article/10.1186/s13059-024-03244-4,,
  - EggNOG-mapper,Functional annotation,Fast functional annotation of novel sequences using orthology assignments,https://github.com/eggnogdb/eggnog-mapper,https://doi.org/10.1093/molbev/msab293,https://github.com/nf-core/modules/tree/master/modules/nf-core/eggnogmapper,
  - HUMAnN,Functional annotation,HMP Unified Metabolic Analysis Network - profiling microbial community metabolic potential,https://github.com/biobakery/humann,,,,Not for long reads; RAM intensive
  - Bakdrive,Metabolic reconstruction,Metagenomic Analysis of Genomes in the ENvironmental Toolkit,https://gitlab.com/treangenlab/bakdrive,https://academic.oup.com/bioinformatics/article/39/Supplement_1/i47/7210449,,Can take in Emu output
  - micom,Metabolic reconstruction,Microbiome modeling,https://github.com/micom-dev/micom,https://journals.asm.org/doi/10.1128/msystems.00606-19,,Best to use with bakdrive
  - Apollo,Metabolic reconstruction,Interactive sequence annotation editor,https://genomearchitect.readthedocs.io/,https://link.springer.com/article/10.1186/gb-2002-3-12-research0082,,Relevance for nextflow workflow unclear
  - FastQC,Report,A quality control tool for high throughput sequence data,https://github.com/s-andrews/FastQC,,https://github.com/nf-core/modules/tree/master/modules/nf-core/fastqc,Can be enhanced with LLM
  - MetagenomeScope,Visualization,Web based visualization tool for metagenomic assembly graphs,https://github.com/marbl/MetagenomeScope,,,
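Since the archived rows are meant to be parsed downstream (e.g. into embeddings), a small loader shows how they can be read. This is a Python sketch under assumptions: the archived rows carry no header, so the `COLUMNS` names are guesses from the field contents, and `load_tool_links` is a hypothetical helper. Rows with more fields than expected (the HUMAnN row has one extra comma) have their trailing fields folded into `notes`.

```python
import csv
import io

# Assumed column names; the archived rows themselves have no header line.
COLUMNS = ["tool", "category", "description", "code_url",
           "citation_url", "nextflow_url", "notes"]

# Two rows copied from the archive above, used as sample input.
ARCHIVED = """\
filtlong,Quality control / preprocessing,Quality filtering tool for long-reads by read quality and length,https://github.com/rrwick/Filtlong,,https://github.com/nf-core/modules/tree/master/modules/nf-core/filtlong,No citation available
parsnp,Pangenomic analyses,A fast microbial core-genome alignment tool,https://github.com/marbl/parsnp,https://academic.oup.com/bioinformatics/article/40/5/btae311/7667868,,
"""


def load_tool_links(text: str) -> list[dict]:
    """Parse header-less tool-link rows into dicts keyed by COLUMNS."""
    tools = []
    n_fixed = len(COLUMNS) - 1  # every column except the trailing notes
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue
        fields = [field.strip() for field in fields]
        head = fields[:n_fixed]
        head += [""] * (n_fixed - len(head))            # pad short rows
        notes = " ".join(f for f in fields[n_fixed:] if f)  # fold extras into notes
        tools.append(dict(zip(COLUMNS, head + [notes])))
    return tools


tools = load_tool_links(ARCHIVED)
missing_nextflow = [t["tool"] for t in tools if not t["nextflow_url"]]
print(missing_nextflow)
```

Adding an explicit header row to `tool_links.csv` would remove the need to guess column names and would let `csv.DictReader` do this directly.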
llm_bioinformatics_training_pairs_with_output.csv
16/Jul/25 - flowchart -> questions
I generated these questions using ChatGPT with the following prompt (condensed 3 serial prompts into 1):
I am building an LLM to be used by biologists without bioinformatics background to run bioinformatic tools. The broad modality of my tools is listed with specific tools in brackets.
How would a biologist with metagenomic data at hand describe questions that each of the following bioinformatic tools would answer. Give me 3 questions for each modality.
- Annotated genome (Bakta | Prokka)
- Pangenomics (tmhg-finder)
- Metagenomic taxonomic profiling
- Read classification (SeqScreen)
- 16S taxonomic profiling
Please output this as a csv file. And include the typical output of the tool as well. Such as
**“What genes are present in this genome, and what do they do?”**
_(→ Functional annotation of coding sequences, gene product names, EC numbers, etc.)_
TODO:
- Add more nodes to this list or use the full flowchart. This was only a starting point with a few key nodes to get started.
Under what circumstances should each tool be used, and what are the pros and cons?
- EMU
- LEMUR
  - Pros: Lemur can efficiently process large datasets within minutes to hours in limited computational resource settings.
  - Cons: Reliance on bacterial marker genes means it cannot generalize to viral genome classification.
    - Less sensitive than Kraken 2 or MetaMaps, which use all long reads and complete genomes.
    - A bacterium from a novel (i.e., out-of-database) family will necessarily be missed by Lemur.
- MAGNET: Use Magnet to correct false positives from Lemur. Particularly in the presence of low-abundance or low-coverage data, Magnet can improve precision by detecting and filtering out many false positive calls. Its goal is to detect and remove potential false positives by performing competitive read alignment, leveraging all of the reads mapped against the entire reference genome (rather than just the marker gene reads and marker genes used by Lemur).
  - Pros: Post-processing with Magnet doesn’t incur much additional cost for small and medium datasets.
  - Cons: On large and diverse datasets, post-processing with Magnet takes a significant portion of the total time due to the richer microbial composition of the dataset.
- Spiel: Relying on a wide pool of single-copy universal marker genes allows Lemur to achieve high recall and relative abundance estimation accuracy while using only a small portion of the input data. These markers cover all bacteria but only a fraction of any given genome. In contrast, Magnet starts with the set of genomes identified by Lemur and evaluates the read-alignment quality and coverage distributions across the genome to make a maximally informed call about whether the putative genome is actually in the sample.
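The idea of using genome-wide coverage to weed out false positives can be illustrated with a breadth-of-coverage check. The Python sketch below is not Magnet's algorithm: it only merges alignment intervals per candidate genome and keeps genomes whose covered fraction passes a cutoff. The function names, the input layout, and the `min_breadth=0.5` threshold are all hypothetical.

```python
def breadth_of_coverage(alignments, genome_length: int) -> float:
    """Fraction of genome positions covered by at least one alignment.

    alignments: iterable of 0-based, half-open (start, end) intervals.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(alignments):
        start, end = max(start, 0), min(end, genome_length)
        if start >= end:
            continue  # empty or fully out-of-range interval
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


def filter_candidates(candidates: dict, min_breadth: float = 0.5) -> list[str]:
    """Keep candidate genomes whose breadth of coverage reaches min_breadth.

    candidates: {genome_name: (genome_length, [(start, end), ...])}
    """
    return [name for name, (length, alignments) in candidates.items()
            if breadth_of_coverage(alignments, length) >= min_breadth]


candidates = {
    "genome_A": (1000, [(0, 400), (300, 700), (650, 900)]),  # reads spread across the genome
    "genome_B": (1000, [(100, 150), (120, 160)]),            # reads piled on one small region
}
print(filter_candidates(candidates))
```

A genome hit only through a few marker-gene regions (like `genome_B` here) shows low breadth even when it has reads, which is the intuition behind checking coverage across the whole genome rather than the markers alone.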