Planning tools

This is a work-in-progress document for planning the tools to be included in the pipeline (flowchart). Once finalized, it should be moved to the wiki for posterity. A shorter summary of the tools and what they do will go in the README.

Summary Table

Category	Tool	Nextflow Implementation
Quality Control	NanoPlot	nf-core/nanoplot
	hostile	nf-core/hostile
	chopper	nf-core/chopper
Assembly	Flye	nf-core/flye
	myloasm	-
Binning	SemiBin2	-
Bin QC	CheckM2	-
SNP/SV Detection	rhea (SV, timecourse)	-
Taxonomic Profiling	Emu	gms-16s
	Lemur + MAGnet	-
	Sylph	-
	SingleM pipe	-
Functional Annotation	SeqScreen	DSL1 to 2
	bakta	-
Community Assessment	SingleM appraise	-
Reporting, Visualization	taxburst	-
	MetagenomeScope	-
Total number of unique tools: 17
Number with Nextflow implementations: 7ish/17?
Soon to be many more...
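The counts above can be re-derived from the table itself. The following is a minimal sketch, assuming the table is kept as tab-separated text in which an empty Category cell means "same category as the previous row" and "-" means no Nextflow implementation yet. The function name `load_summary` is hypothetical.

```python
import csv
import io

# Rows copied from the summary table above, tab-separated.
# An empty first cell inherits the category of the previous row.
SUMMARY_TSV = (
    "Category\tTool\tNextflow Implementation\n"
    "Quality Control\tNanoPlot\tnf-core/nanoplot\n"
    "\thostile\tnf-core/hostile\n"
    "\tchopper\tnf-core/chopper\n"
    "Assembly\tFlye\tnf-core/flye\n"
    "\tmyloasm\t-\n"
    "Binning\tSemiBin2\t-\n"
    "Bin QC\tCheckM2\t-\n"
    "SNP/SV Detection\trhea (SV, timecourse)\t-\n"
    "Taxonomic Profiling\tEmu\tgms-16s\n"
    "\tLemur + MAGnet\t-\n"
    "\tSylph\t-\n"
    "\tSingleM pipe\t-\n"
    "Functional Annotation\tSeqScreen\tDSL1 to 2\n"
    "\tbakta\t-\n"
    "Community Assessment\tSingleM appraise\t-\n"
    "Reporting, Visualization\ttaxburst\t-\n"
    "\tMetagenomeScope\t-\n"
)


def load_summary(tsv_text):
    """Parse the table and forward-fill blank category cells."""
    rows = []
    category = ""
    for rec in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        category = rec["Category"] or category
        rows.append(
            {
                "category": category,
                "tool": rec["Tool"],
                "nextflow": rec["Nextflow Implementation"],
            }
        )
    return rows


rows = load_summary(SUMMARY_TSV)
with_impl = [r for r in rows if r["nextflow"] != "-"]
print(len(rows), "rows;", len(with_impl), "with a Nextflow entry")
```

On the table as written this gives 17 rows, 6 of which have something other than "-" in the last column. Rows such as "Lemur + MAGnet" or the two SingleM entries would need their own rule if tools rather than rows are being counted.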

Archived tools

These will be implemented in the future; they are moved here to preserve the links etc.

Category	Tool	Nextflow Implementation
Quality Control	filtlong	nf-core/filtlong
Assembly	MetaCompass	-
Pangenomics	parsnp	-
	tMHG-Finder	-
Taxonomic Profiling	MetaPhlAn	-
Metabolic Reconstruction	Bakdrive	-
	micom	-

The table summarizes all the tools listed below. Tools are grouped by category and listed in order of priority within each category. Tools with available Nextflow implementations are linked above.

Original plan from the flowchart sketched on a whiteboard on 2025-05-13, with inputs from Todd Treangen

TODO:

Add github links and citations for these tools
List the LLM questions that will lead to each step
Check and link the pipelines for which nextflow DSL1 or DSL2 implementation exists + citations

Major decision points

3 broad branches of the pipeline
- 16S
- metagenomic classification
- assembly based metagenomics
1 vs n samples
- rhea is good for n samples, so it will dictate the rest of the pipeline
Type of data technology (see the Flye docs for ideas on more detailed options):
- Nanopore (ONT)
- PacBio (PacBio)
- Synthetic long reads (Illumina)
Assembly vs reads
Reference guided vs de novo assembly
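The decision points above could eventually drive tool selection in code. The following is only an illustrative sketch: the `RunRequest` fields, the `plan_steps` function, and the specific tool order per branch are assumptions pieced together from the summary table, not a settled design.

```python
from dataclasses import dataclass


@dataclass
class RunRequest:
    """Hypothetical description of what the user wants to run."""
    analysis: str    # "16S", "classification", or "assembly"
    n_samples: int   # 1 vs n samples
    technology: str  # "ONT", "PacBio", or "synthetic"


# Shared preprocessing, taken from the Quality Control rows of the table.
QC_STEPS = ["NanoPlot", "hostile", "chopper"]


def plan_steps(req):
    """Return an ordered list of tool names for one request (illustrative)."""
    if req.analysis == "16S":
        return QC_STEPS + ["Emu"]
    if req.analysis == "classification":
        return QC_STEPS + ["Lemur", "MAGnet"]
    if req.analysis == "assembly":
        steps = QC_STEPS + ["Flye", "SemiBin2", "CheckM2"]
        # rhea targets n samples, so it is only added for multi-sample runs.
        if req.n_samples > 1:
            steps.append("rhea")
        return steps
    raise ValueError(f"unknown analysis branch: {req.analysis!r}")


print(plan_steps(RunRequest("assembly", n_samples=3, technology="ONT")))
```

The technology field is carried along but not used yet; it is where assembler presets for ONT vs PacBio reads would hook in.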

Tools list

Quality control / preprocessing

NanoPlot: QC plotting suite for long-read sequencing data and alignments. Used for QC on the raw files and on the final reads after filtering. nf-core/module | citation
hostile: A tool for filtering reads that align to a host genome (removes host contamination from microbial metagenomes). nf-core/modules: hostile-clean; hostile-fetch, code | citation
chopper: A tool to filter nanopore sequencing reads by quality and length. Used to filter out low-quality and short reads. nf-core/module, code | citation
filtlong: Quality filtering tool for long reads, by read quality and length. nf-core/module, code | no citation
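To make "filter by quality and length" concrete, here is a minimal sketch of the idea in Python. It is not how chopper or filtlong are implemented; the thresholds `min_len` and `min_q` are placeholder values, and the Phred+33 offset is assumed. Mean quality is computed by averaging error probabilities rather than Phred scores, since averaging the scores directly overstates read quality.

```python
import math


def mean_quality(qual, offset=33):
    """Mean Phred quality of a FASTQ quality string.

    Converts each score to an error probability, averages those,
    and converts the average back to the Phred scale.
    """
    if not qual:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(seq, qual, min_len=500, min_q=10.0):
    """True if the read passes both the length and the quality filter.

    min_len and min_q are illustrative defaults, not tool defaults.
    """
    return len(seq) >= min_len and mean_quality(qual) >= min_q


# Toy reads: (sequence, quality string).
reads = [
    ("A" * 600, "I" * 600),  # long, high quality
    ("A" * 100, "I" * 100),  # too short
    ("A" * 600, "#" * 600),  # long, but very low quality
]
kept = [r for r in reads if keep_read(*r)]
print(len(kept), "of", len(reads), "reads kept")
```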

Assembly

De novo assembly. Note: uses a lot of RAM (32 GB minimum).

Flye: De novo assembler for single-molecule sequencing reads using repeat graphs (PacBio and Oxford Nanopore). nf-core/module, code | citation
myloasm: A de novo metagenome assembler that takes long reads and outputs polished contigs in a single command. (Not published yet, but probably will be by the time the Somatem manuscript is out; the GitHub repo and docs are already very complete. Also a Jim Shaw & Heng Li collab.)

Reference guided: We aim to implement a reference-guided assembler in the future; however, meta-compass is currently short-read only.

meta-compass: A metagenomic reference-guided assembler that leverages multiple reference genomes

Binning

Currently we use only one binner, but we aspire to include multiple binners plus refinement tools down the road.

SemiBin2: Uses self-supervised learning to learn feature embeddings from the contigs, with emphasis on long-read sequencing data. citation

Pangenomic analyses

parsnp: A fast microbial core-genome alignment tool, which can output core genome phylogeny, multiple genome alignments and SNP calls. citation
tMHG-Finder: Tool for tree guided maximal homologous group (MHG) identification from multiple genomes. MHGs enable more accurate phylogenetic reconstruction than gene annotations, accounting for horizontal gene transfer. citation & older MHG-finder

SNP and SV Detection

Includes gene duplication and loss

rhea: Detects structural variants (SV, >10 bp indels) and HGT between temporally evolving microbial metagenomic samples for large cohorts of related or similar genomes (1:n samples). citation: Curry et al., Bioinformatics, 2024

Taxonomic classification/profiling

Emu: Taxonomic classification and abundance estimation of 16S rRNA reads for long-read data. Nextflow DSL2 implementation: gms-16s + gms-16S citation | citation
Lemur: For rapid and accurate taxonomic profiling on long-read metagenomic datasets. citation
- MAGnet: Refines taxonomic profiles for accuracy using reference genome mapping from all the reads. same citation as Lemur: citation
Sylph: A tool for rapid and accurate species level taxonomic profiling of metagenomic data using k-mer sketches. citation: Shaw et al., Nature Biotechnology, 2024 | documentation
MetaPhlAn: Included because of its recent support for long reads; combined with HUMAnN, it enables a lightweight functional annotation analysis.
Centrifuger: Might be better for large databases due to compression? | citation

Functional annotation

SeqScreen: Functional screening of pathogenic sequences in metagenomic data. nextflow: DSL1 to 2 transition | citation: Balaji et al., Genome Biology, 2022
- includes antibiotic resistance genes.
HUMAnN: HMP Unified Metabolic Analysis Network, for profiling microbial community metabolic potential. (Possibly not for long reads, says Austin; eukaryotic; RAM intensive.)
bakta

Read Classification

How is this different from taxonomic classification?

SeqScreen: Functional screening of pathogenic sequences in metagenomic data.
Centrifuge: A rapid and memory-efficient classification system for metagenomic sequences

Pathogen identification

MAGnet: Metagenomic Analysis of Genomes in the ENvironmental Toolkit
SeqScreen: Functional screening of pathogenic sequences in metagenomic data

Metabolic reconstruction

Bakdrive / recent/private version: Can take in Emu output. citation: Wang et al., Bioinformatics, 2023
- micom: Best to use with bakdrive. citation: Diener et al., mSystems, 2020

Final: Validation / QC

CheckM2
SingleM appraise: Check how MetAMOS implements this, says Todd
Check if tools ran correctly

Report

nextflow execution report (html)
MultiQC
SingleM appraise: Check how the MetAMOS report was made from scratch, says Todd

Visualization tools

taxburst
MetagenomeScope: Web based visualization tool for metagenomic assembly graphs.
emperor: Interactive ordination plot visualization.

Databases

bacterial: RefSeq
viral: NCBI Viral Genomes

Other workflows to learn from

CAMP (paper) is a Snakemake workflow that aims for one-click deployment while staying modular. It works with long reads and hybrid assemblies as well as short reads. They are also incorporating an LLM called bootcamp, similar to omi, so they might really beat us to what we are building with Somatem?
aviary is a Snakemake workflow that Austin recommends, but Todd doesn't like that it's not published.

(Austin) This is the pipeline used for our previous depletion paper. It was put together by Ben Woodcroft's group and has SingleM and all the rest of their M tools.

cloudres: MLST assignment and AMR detection. Could learn from or borrow code.
mmlong2: Genome-centric long-read metagenomics workflow for automated recovery and analysis of prokaryotic genomes with Nanopore or PacBio HiFi sequencing data.
MUFFIN: A hybrid assembly and differential binning workflow for metagenomics, transcriptomics and pathway analysis.
mapo tofu: A Nextflow pipeline for short reads.
- Can borrow their Sylph module?
BugBuster
- No tool in our list

Derivatives of this doc

Content from this document was reformatted or summarized into a few other documents for easier access. I used Windsurf AI for this, so keeping them in the same repo makes them easy to iterate on with AI when changes occur.

List

tool_voting.csv: Contains tool names with categories, for Todd to vote on or suggest new ones
tool_links.csv: Contains links to the GitHub repos and citations, for easy parsing when Sahil builds embeddings
mock_nf_params.yaml: A dadasnake-inspired YAML file listing the tools in this document
- I want Sahil to edit this file as appropriate (write access needed), so it will need to be copied (not moved) into the omi repo.
- Will explore making a submodule shared by both repos in the future if Sahil makes more than 2 changes.
- (It is not possible to use git-supported soft links to share with another computer.) We still need some symlink mechanism to keep the files linked, and some way to sync their commit history; for now this will need to be done manually?

backups

Tool links: archived

meta-compass,Assembly,A metagenomic reference-guided assembler that leverages multiple reference genomes,https://github.com/marbl/MetaCompass,,,
filtlong,Quality control / preprocessing,Quality filtering tool for long-reads by read quality and length,https://github.com/rrwick/Filtlong,,https://github.com/nf-core/modules/tree/master/modules/nf-core/filtlong,No citation available
AutoCycler,Assembly,Consensus long-read assembly pipeline combining multiple alternative assemblies,https://github.com/rrwick/AutoCycler,https://www.biorxiv.org/content/10.1101/2025.05.12.653612v1.full,,
Canu,Assembly,Hierarchical assembler designed for high-noise single-molecule sequencing,https://github.com/marbl/canu,https://doi.org/10.1101/gr.215087.116,https://github.com/nf-core/modules/tree/master/modules/nf-core/canu,End of life since 2021; don't use
parsnp,Pangenomic analyses,A fast microbial core-genome alignment tool,https://github.com/marbl/parsnp,https://academic.oup.com/bioinformatics/article/40/5/btae311/7667868,,
tMHG-Finder,Pangenomic analyses,Tree guided maximal homologous group (MHG) identification,https://github.com/yongze-yin/tMHG-Finder,https://www.biorxiv.org/content/10.1101/2025.03.16.643543v1.full,,
Centrifuge,Taxonomic classification,A rapid and memory-efficient classification system for metagenomic sequences,https://github.com/DaehwanKimLab/centrifuge,https://doi.org/10.1101/gr.210641.116,https://github.com/nf-core/modules/tree/master/modules/nf-core/centrifuge,
Centrifuger,Taxonomic classification,Improved version of Centrifuge for large databases,https://github.com/mourisl/centrifuger,https://link.springer.com/article/10.1186/s13059-024-03244-4,,
EggNOG-mapper,Functional annotation,Fast functional annotation of novel sequences using orthology assignments,https://github.com/eggnogdb/eggnog-mapper,https://doi.org/10.1093/molbev/msab293,https://github.com/nf-core/modules/tree/master/modules/nf-core/eggnogmapper,
HUMAnN,Functional annotation,HMP Unified Metabolic Analysis Network - profiling microbial community metabolic potential,https://github.com/biobakery/humann,,,,Not for long reads; RAM intensive
Bakdrive,Metabolic reconstruction,Metagenomic Analysis of Genomes in the ENvironmental Toolkit,https://gitlab.com/treangenlab/bakdrive,https://academic.oup.com/bioinformatics/article/39/Supplement_1/i47/7210449,,Can take in Emu output
micom,Metabolic reconstruction,Microbiome modeling,https://github.com/micom-dev/micom,https://journals.asm.org/doi/10.1128/msystems.00606-19,,Best to use with bakdrive
Apollo,Metabolic reconstruction,Interactive sequence annotation editor,https://genomearchitect.readthedocs.io/,https://link.springer.com/article/10.1186/gb-2002-3-12-research0082,,Relevance for nextflow workflow unclear
FastQC,Report,A quality control tool for high throughput sequence data,https://github.com/s-andrews/FastQC,,https://github.com/nf-core/modules/tree/master/modules/nf-core/fastqc,Can be enhanced with LLM
MetagenomeScope,Visualization,Web based visualization tool for metagenomic assembly graphs,https://github.com/marbl/MetagenomeScope,,,
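Since the archived entries above are comma-separated records, they can be loaded with Python's standard `csv` module. This is a minimal sketch: the column names in `COLUMNS` are an assumption inferred from the field order (the backup has no header row), and only three of the archived records are reproduced here.

```python
import csv
import io

# Assumed column names; the backup itself carries no header.
COLUMNS = [
    "tool", "category", "description",
    "code_url", "citation_url", "nextflow_url", "notes",
]

# Three records copied from the archive above.
BACKUP = """\
filtlong,Quality control / preprocessing,Quality filtering tool for long-reads by read quality and length,https://github.com/rrwick/Filtlong,,https://github.com/nf-core/modules/tree/master/modules/nf-core/filtlong,No citation available
Canu,Assembly,Hierarchical assembler designed for high-noise single-molecule sequencing,https://github.com/marbl/canu,https://doi.org/10.1101/gr.215087.116,https://github.com/nf-core/modules/tree/master/modules/nf-core/canu,End of life since 2021; don't use
parsnp,Pangenomic analyses,A fast microbial core-genome alignment tool,https://github.com/marbl/parsnp,https://academic.oup.com/bioinformatics/article/40/5/btae311/7667868,,
"""


def parse_tool_links(text):
    """Turn each CSV record into a dict keyed by the assumed columns."""
    records = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue
        # Pad short rows so every record has all columns.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        records.append(dict(zip(COLUMNS, fields)))
    return records


records = parse_tool_links(BACKUP)
with_nextflow = [r["tool"] for r in records if r["nextflow_url"]]
print(with_nextflow)
```

Rows with more fields than `COLUMNS` (the HUMAnN record has an extra comma) would be silently truncated by `zip`, so those are worth fixing in the CSV before relying on a parser like this.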

LLM Bioinformatics Training Pairs

llm_bioinformatics_training_pairs_with_output.csv 16/Jul/25 - flowchart -> questions

I generated these questions using ChatGPT with the following prompt (3 serial prompts condensed into 1):

I am building an LLM to be used by biologists without bioinformatics background to run bioinformatic tools. The broad modality of my tools is listed with specific tools in brackets.
How would a biologist with metagenomic data at hand describe questions that each of the following bioinformatic tools would answer. Give me 3 questions for each modality.
- Annotated genome (Bakta | Prokka)
- Pangenomics (tmhg-finder)
- Metagenomic taxonomic profiling
- Read classification (SeqScreen)
- 16S taxonomic profiling

Please output this as a csv file. And include the typical output of the tool as well. Such as 
**“What genes are present in this genome, and what do they do?”**  
_(→ Functional annotation of coding sequences, gene product names, EC numbers, etc.)

TODO:

Add more nodes to this list or use the full flowchart. This was only a starting point with a few key nodes to get started.

Tool knowledge

Under what circumstances should each tool be used, and what are the pros and cons?

Taxonomic profiling

Emu
Lemur
- Pros: Lemur can efficiently process large datasets within minutes to hours in limited computational resource settings.
- Cons: Reliance on bacterial marker genes means it cannot generalize to viral genome classification.
  - Less sensitive than Kraken 2 or MetaMaps, which use all long reads and complete genomes.
  - A bacterium from a novel (i.e. out-of-database) family will necessarily be missed by Lemur.
MAGnet: Use MAGnet to correct false positives from Lemur. Particularly in the presence of low-abundance or low-coverage data, MAGnet can improve precision by detecting and filtering out many false positive calls. It does this by performing competitive read alignment using all of the reads mapped against the entire reference genome (rather than just the marker-gene reads and marker genes used by Lemur).
- Pros: Post-processing with MAGnet doesn't incur much additional cost for small and medium datasets.
- Cons: On large and diverse data, post-processing with MAGnet takes a significant portion of the total time because of the richer microbial composition of the dataset.
- Spiel: Relying on a wide pool of single-copy universal marker genes allows Lemur to achieve high recall and relative abundance estimation accuracy while using only a small portion of the input data. These markers cover all bacteria but only a fraction of any given genome. In contrast, MAGnet starts with the set of genomes identified by Lemur and evaluates the read-alignment quality and coverage distributions across the genome to make a maximally informed call about whether the putative genome is actually in the sample.
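The Lemur-then-MAGnet idea (cheap marker-gene calls first, whole-genome evidence second) can be illustrated with a toy filter. This is not MAGnet's actual algorithm: the `GenomeAlignment` fields, the breadth-of-coverage criterion, the `min_breadth` threshold, and the genome names and numbers below are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class GenomeAlignment:
    """Hypothetical whole-genome alignment summary for one candidate."""
    name: str
    genome_length: int   # reference genome length in bp
    covered_bases: int   # positions covered by at least one read


def breadth(aln):
    """Fraction of the reference genome covered by reads."""
    return aln.covered_bases / aln.genome_length


def filter_false_positives(candidates, min_breadth=0.1):
    """Split candidate genomes into kept and rejected by coverage breadth.

    A genome called from marker genes alone but with reads piling up on
    only a tiny part of the genome is treated as a likely false positive.
    """
    kept, rejected = [], []
    for aln in candidates:
        (kept if breadth(aln) >= min_breadth else rejected).append(aln)
    return kept, rejected


# Placeholder candidates standing in for a marker-gene profile.
candidates = [
    GenomeAlignment("genome_A", genome_length=4_000_000, covered_bases=3_200_000),
    GenomeAlignment("genome_B", genome_length=5_000_000, covered_bases=40_000),
]
kept, rejected = filter_false_positives(candidates)
print([a.name for a in kept], [a.name for a in rejected])
```

The real tool also weighs read-alignment quality and the shape of the coverage distribution, which a single breadth cutoff does not capture.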
