treangenlab/somatem-docs

Planning tools

This is a work-in-progress document for planning the tools to be included in the pipeline (flowchart). It should be moved to the wiki once finalized, for posterity. A shorter summary of the tools and what they do will be in the readme.

Summary Table

Category | Tool | Nextflow Implementation
Quality Control | NanoPlot | nf-core/nanoplot
 | hostile | nf-core/hostile
 | chopper | nf-core/chopper
Assembly | Flye | nf-core/flye
 | myloasm | -
Binning | SemiBin2 | -
Bin QC | CheckM2 | -
SNP/SV Detection | rhea (SV, timecourse) | -
Taxonomic Profiling | Emu | gms-16s
 | Lemur + MAGnet | -
 | Sylph | -
 | SingleM pipe | -
Functional Annotation | SeqScreen | DSL1 to 2
 | bakta | -
Community Assessment | SingleM appraise | -
Reporting, Visualization | taxburst | -
 | MetagenomeScope | -
Total number of unique tools: 17
Number with Nextflow implementations: 7ish/17?
Soon to be many more...
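
The tally above can be recomputed instead of counted by hand. This is a minimal Python sketch under three assumptions: the rows below are an accurate transcription of the summary table, "Lemur + MAGnet" counts as one row, and any entry other than "-" (including the in-progress "DSL1 to 2" port) counts as an implementation. `SUMMARY_ROWS` and `count_nextflow` are illustrative names, not files or functions in this repo.

```python
# Rows transcribed from the summary table: (category, tool, nextflow_entry).
# "-" means no Nextflow implementation is listed yet.
SUMMARY_ROWS = [
    ("Quality Control", "NanoPlot", "nf-core/nanoplot"),
    ("Quality Control", "hostile", "nf-core/hostile"),
    ("Quality Control", "chopper", "nf-core/chopper"),
    ("Assembly", "Flye", "nf-core/flye"),
    ("Assembly", "myloasm", "-"),
    ("Binning", "SemiBin2", "-"),
    ("Bin QC", "CheckM2", "-"),
    ("SNP/SV Detection", "rhea", "-"),
    ("Taxonomic Profiling", "Emu", "gms-16s"),
    ("Taxonomic Profiling", "Lemur + MAGnet", "-"),
    ("Taxonomic Profiling", "Sylph", "-"),
    ("Taxonomic Profiling", "SingleM pipe", "-"),
    ("Functional Annotation", "SeqScreen", "DSL1 to 2"),
    ("Functional Annotation", "bakta", "-"),
    ("Community Assessment", "SingleM appraise", "-"),
    ("Reporting, Visualization", "taxburst", "-"),
    ("Reporting, Visualization", "MetagenomeScope", "-"),
]


def count_nextflow(rows):
    """Return (rows with a Nextflow entry, total rows)."""
    with_nf = [tool for _, tool, nf in rows if nf.strip() != "-"]
    return len(with_nf), len(rows)


covered, total = count_nextflow(SUMMARY_ROWS)
print(f"{covered}/{total} rows have a Nextflow entry")
```

The result depends on how in-progress ports and paired tools are counted, so treat it as a cross-check on the hand-written tally, not a replacement for it.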

Archived tools

These will be implemented in the future; they are moved here to preserve the links, etc.

Category | Tool | Nextflow Implementation
Quality Control | filtlong | nf-core/filtlong
Assembly | MetaCompass | -
Pangenomics | parsnp | -
 | tMHG-Finder | -
Taxonomic Profiling | MetaPhlAn | -
Metabolic Reconstruction | Bakdrive | -
 | micom | -

The tables summarize all the tools listed below. Tools are grouped by category and listed in order of priority within each category. Tools with available Nextflow implementations are linked above.

Original plan from flowchart sketched on whiteboard on 2025-05-13: Inputs from Todd Treangen

TODO:

  • Add GitHub links and citations for these tools
  • List the LLM questions that will lead to each step
  • Check and link the pipelines for which a Nextflow DSL1 or DSL2 implementation exists + citations

Major decision points

  • 3 broad branches of the pipeline
    • 16S
    • metagenomic classification
    • assembly based metagenomics
  • 1 vs n samples
    • rhea is good for n samples, so it will dictate the rest of the pipeline
  • Sequencing technology (see the Flye docs for ideas on more detailed options):
    • Nanopore (ONT)
    • PacBio (PacBio)
    • Synthetic long reads (Illumina)
  • Assembly vs reads
  • Reference guided vs de novo assembly
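
The decision points above can be read as a small routing function. Here is a hedged Python sketch of that reading. The `RunInputs` fields, the `plan_pipeline` name, the step labels, and the ordering of steps are all assumptions made for illustration, not the pipeline's actual interface. The tool choices per branch follow the summary table.

```python
from dataclasses import dataclass

# Technology labels are illustrative shorthand for the options listed above.
TECHNOLOGIES = {"ont", "pacbio", "synthetic_long_reads"}


@dataclass
class RunInputs:
    amplicon_16s: bool            # True for 16S data, False for shotgun metagenomics
    n_samples: int                # 1 vs n samples
    technology: str               # one of TECHNOLOGIES
    assemble: bool                # assembly vs reads
    reference_guided: bool = False


def plan_pipeline(inputs: RunInputs) -> list[str]:
    """Map the major decision points to an ordered list of step labels."""
    if inputs.technology not in TECHNOLOGIES:
        raise ValueError(f"unknown technology: {inputs.technology!r}")
    if inputs.n_samples < 1:
        raise ValueError("n_samples must be at least 1")

    if inputs.amplicon_16s:
        return ["16S profiling (Emu)"]
    if not inputs.assemble:
        return ["metagenomic classification (Lemur + MAGnet or Sylph)"]
    if inputs.reference_guided:
        # No long-read reference-guided assembler is planned in yet.
        raise NotImplementedError("reference-guided assembly is not available yet")

    steps = [
        "de novo assembly (Flye or myloasm)",
        "binning (SemiBin2)",
        "bin QC (CheckM2)",
    ]
    if inputs.n_samples > 1:
        # Assumption: rhea is added whenever more than one sample is given.
        steps.append("SV detection across samples (rhea)")
    return steps


print(plan_pipeline(RunInputs(False, 4, "ont", assemble=True)))
```

Keeping the routing in one pure function makes it easy to compare against the flowchart when branches are added or reordered.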

Tools list

Quality control / preprocessing

Assembly

De novo assembly (note: uses a lot of RAM, 32 GB minimum)

  • Flye: De novo assembler for single-molecule sequencing reads using repeat graphs (PacBio and Oxford Nanopore). nf-core/module, code | citation
  • myloasm: A de novo metagenome assembler that takes long reads and outputs polished contigs in a single command. (Not published yet, but probably will be by the time the Somatem manuscript is out. A very complete GitHub repo and docs already exist. Also a Jim Shaw & Heng Li collaboration.)
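
Given the 32 GB minimum noted above, a preflight check before launching assembly could look like the following Python sketch. It assumes a Linux-like system where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`. The function names and the idea of checking total (not free) RAM are illustrative choices, not part of any listed tool.

```python
import os

MIN_ASSEMBLY_RAM_GB = 32  # minimum stated in this document for de novo assembly


def total_ram_gb() -> float:
    """Total physical memory in GiB (assumes os.sysconf supports these keys)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    n_pages = os.sysconf("SC_PHYS_PAGES")
    return page_size * n_pages / 1024**3


def enough_ram_for_assembly(ram_gb: float, minimum_gb: float = MIN_ASSEMBLY_RAM_GB) -> bool:
    """True when the machine meets the minimum RAM for de novo assembly."""
    return ram_gb >= minimum_gb


if __name__ == "__main__":
    ram = total_ram_gb()
    status = "ok" if enough_ram_for_assembly(ram) else "below the 32 GB minimum"
    print(f"{ram:.1f} GiB RAM: {status}")
```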

Reference guided: We aim to implement a reference-guided assembler in the future; however, MetaCompass is currently short-read only.

  • MetaCompass: A metagenomic reference-guided assembler that leverages multiple reference genomes

Binning

Currently we use only one binner, but we aspire to include multiple binners and refinement tools down the road.

  • SemiBin2: Uses self-supervised learning to learn feature embeddings from the contigs, with emphasis on long-read sequencing data. citation

Pangenomic analyses

  • parsnp: A fast microbial core-genome alignment tool, which can output core genome phylogeny, multiple genome alignments and SNP calls. citation
  • tMHG-Finder: Tool for tree guided maximal homologous group (MHG) identification from multiple genomes. MHGs enable more accurate phylogenetic reconstruction than gene annotations, accounting for horizontal gene transfer. citation & older MHG-finder

SNP and SV Detection

Includes gene duplication and loss

  • rhea: Detects structural variants (SV, >10 bp indels) and HGT between temporally evolving microbial metagenomic samples for large cohorts of related or similar genomes (1:n samples). citation: Curry et al., Bioinformatics, 2024

Taxonomic classification/profiling

  • Emu: Taxonomic classification and abundance estimation of 16S rRNA reads for long-read data. Nextflow DSL2 implementation: gms-16s + gms-16S citation | citation
  • Lemur: For rapid and accurate taxonomic profiling on long-read metagenomic datasets. citation
    • MAGnet: Refines taxonomic profiles for accuracy using reference genome mapping from all the reads. same citation as Lemur: citation
  • Sylph: A tool for rapid and accurate species level taxonomic profiling of metagenomic data using k-mer sketches. citation: Shaw et al., Nature Biotechnology, 2024 | documentation
  • MetaPhlAn: Included because it recently added support for long reads and enables a lightweight functional annotation analysis when combined with HUMAnN.
  • Centrifuger: Might be better for large databases due to compression? | citation

Functional annotation

  • SeqScreen: Functional screening of pathogenic sequences in metagenomic data. nextflow: DSL1 to 2 transition | citation: Balaji et al., Genome Biology, 2022
    • includes antibiotic resistance genes.
  • HUMAnN: HMP Unified Metabolic Analysis Network, for profiling microbial community metabolic potential. (Open questions: not for long reads? (says Austin); eukaryotic; RAM intensive.)
  • bakta

Read Classification

How is this different from taxonomic classification?

  • SeqScreen: Functional screening of pathogenic sequences in metagenomic data.
  • Centrifuge: A rapid and memory-efficient classification system for metagenomic sequences

Pathogen identification

  • MAGnet: Metagenomic Analysis of Genomes in the ENvironmental Toolkit
  • SeqScreen: Functional screening of pathogenic sequences in metagenomic data

Metabolic reconstruction

Final: Validation / QC

Report

  • nextflow execution report (html)
  • MultiQC
  • SingleM appraise. Check how the MetAMOS report was made from scratch (says Todd).

Visualization tools

Databases

Other workflows to learn from

  • CAMP (paper): A Snakemake workflow that aims for one-click deployment while staying modular. It works with long reads and hybrid assemblies along with short reads. They are also incorporating an LLM called bootcamp, similar to omi, so they might really beat us with Somatem?
  • aviary: A Snakemake workflow that Austin espouses, but Todd doesn't like that it's not published.

(Austin) This is the pipeline used for our previous depletion paper. It was put together by Ben Woodcroft's group and has SingleM and all the rest of their "m" tools.

  • cloudres: MLST assignment and AMR detection. Could learn from or borrow code.
  • mmlong2: Genome-centric long-read metagenomics workflow for automated recovery and analysis of prokaryotic genomes with Nanopore or PacBio HiFi sequencing data.
  • MUFFIN: A hybrid assembly and differential binning workflow for metagenomics, transcriptomics and pathway analysis.
  • mapo tofu: A Nextflow pipeline for short reads.
  • BugBuster
    • No tool in our list

Derivatives of this doc

Content from this document was reformatted or summarized into a few other documents for easier access. I used Windsurf AI for this, so keeping these files in the same repo makes them easy to iterate on with AI when changes occur.

List

  • tool_voting.csv: Contains tool names with category for Todd to vote on/suggest new ones
  • tool_links.csv: Contains links to the GitHub repos and citations, for easy parsing into embeddings by Sahil
  • mock_nf_params.yaml: A dadasnake-inspired YAML file populated with the tools in this list
    • I want Sahil to edit this file as appropriate (write access needed), so it will need to be copied or moved into the omi repo.
    • Will explore making a sub-module that's shared by both repos in the future if more than 2 changes are being made by Sahil.
    • Git-supported soft links cannot be shared with another computer, but we need some symlink-like mechanism to keep the files linked, and some way to sync their commit history as well. Currently this will need to be done manually?
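
Until a shared sub-module exists, one lightweight way to notice drift between the two copies of a file is to compare content hashes. The Python sketch below assumes both repos are checked out on the same machine. The paths in the usage comment are hypothetical, and the check only detects divergence; it does not sync commit history.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copies_in_sync(docs_copy: Path, omi_copy: Path) -> bool:
    """True when both checkouts hold byte-identical copies of the file."""
    return file_digest(docs_copy) == file_digest(omi_copy)


# Hypothetical usage; adjust the paths to the local checkouts:
# copies_in_sync(Path("somatem-docs/mock_nf_params.yaml"),
#                Path("omi/mock_nf_params.yaml"))
```

A check like this could run by hand or in a pre-commit hook as a reminder to copy changes across.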

backups

LLM Bioinformatics Training Pairs

llm_bioinformatics_training_pairs_with_output.csv 16/Jul/25 - flowchart -> questions

I generated these questions using ChatGPT with the following prompt (3 serial prompts condensed into 1):

I am building an LLM to be used by biologists without bioinformatics background to run bioinformatic tools. The broad modality of my tools is listed with specific tools in brackets.
How would a biologist with metagenomic data at hand describe questions that each of the following bioinformatic tools would answer. Give me 3 questions for each modality.
- Annotated genome (Bakta | Prokka)
- Pangenomics (tmhg-finder)
- Metagenomic taxonomic profiling
- Read classification (SeqScreen)
- 16S taxonomic profiling

Please output this as a csv file. And include the typical output of the tool as well. Such as 
**“What genes are present in this genome, and what do they do?”**  
_(→ Functional annotation of coding sequences, gene product names, EC numbers, etc.)

TODO:

  • Add more nodes to this list or use the full flowchart. This was only a starting point with a few key nodes to get started.

Tool knowledge

Under what circumstances should each tool be used, and what are the pros and cons?

Taxonomic profiling

  • EMU
  • LEMUR
    • Pros: Lemur can efficiently process large datasets within minutes to hours in limited computational resource settings.
    • Cons: Reliance on bacterial marker genes necessarily implies it cannot generalize to viral genome classification.
      • Less sensitive than Kraken 2 or MetaMaps, which use all long reads and complete genomes.
      • A bacterium from a novel (i.e., out-of-database) family will necessarily be missed by Lemur.
  • MAGNET: Use Magnet to correct false positives from Lemur. Particularly with low-abundance or low-coverage data, Magnet can improve precision by detecting and filtering out many false positive calls. It does this by performing competitive read alignment that leverages all of the reads mapped against the entire reference genome (rather than just the marker gene reads and marker genes used by Lemur).
    • Pros: Post-processing with Magnet doesn't incur much additional cost for small and medium datasets.
    • Cons: On large and diverse datasets, post-processing with Magnet takes a significant portion of the total time because of the richer microbial composition.
    • Spiel: relying on a wide pool of single-copy universal marker genes allows Lemur to achieve high recall and relative abundance estimation accuracy while using only a small portion of the input data. These markers cover all bacteria but only a fraction of any given genome. In contrast, Magnet starts with the set of genomes identified by Lemur and evaluates the read-alignment quality and coverage distributions across the genome to make a maximally informed call about whether the putative genome is actually in the sample.
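
The two-stage idea described above (a marker-gene profiler proposes candidate genomes, then whole-genome alignment evidence confirms or rejects each one) can be illustrated with a toy filter. This Python sketch is not Magnet's actual algorithm. The `GenomeAlignmentStats` fields, the thresholds, and the genome names are hypothetical, and the per-genome statistics are assumed to come from an aligner run beforehand.

```python
from dataclasses import dataclass


@dataclass
class GenomeAlignmentStats:
    genome: str           # candidate genome proposed by the first stage
    breadth: float        # fraction of the reference covered by reads, 0..1
    mean_identity: float  # mean alignment identity of mapped reads, 0..1


def confirm_candidates(stats, min_breadth=0.2, min_identity=0.9):
    """Keep candidates whose whole-genome evidence passes both thresholds.

    The thresholds are illustrative placeholders, not values used by Magnet.
    """
    return [
        s.genome
        for s in stats
        if s.breadth >= min_breadth and s.mean_identity >= min_identity
    ]


candidates = [
    GenomeAlignmentStats("genome_A", breadth=0.85, mean_identity=0.98),
    GenomeAlignmentStats("genome_B", breadth=0.03, mean_identity=0.97),  # likely false positive
    GenomeAlignmentStats("genome_C", breadth=0.40, mean_identity=0.82),  # poor alignment quality
]
print(confirm_candidates(candidates))
```

The sketch shows why the second stage costs more on diverse samples: every candidate genome needs its own alignment statistics, so the work grows with the number of genomes the first stage reports.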

Assembly

About

documents of the pipeline: flowchart, tools list, etc. in structured human readable formats
