LiuzLab/RareCollab

Introduction

RareCollab is a Python package for rare-disease diagnosis, powered by Ollama. It integrates multimodal patient data (DNA, RNA, and phenotype information) to support candidate variant prioritization and diagnostic interpretation.

Our paper is available on arXiv: https://arxiv.org/abs/2602.04058

Before Running RareCollab

Before running RareCollab, please prepare the following files and directories:

  1. Create a folder containing your DNA VCF files.

  2. Run AI-MARRVEL and save the output to /input/aim-out.

  3. Create an HPO folder such as /input/hpo/, containing one file per sample in the format ID.hpo.txt. Each file should include a list of HPO terms (for example, HP:XXXXXXX) and follow the same format required by AI-MARRVEL input HPO files.

  4. Create a work folder for intermediate files generated by RareCollab.

  5. Create an output folder for results generated by RareCollab.

  6. Create a reference folder and download all required reference files from here.

  7. Prepare the hg38.chr.fa reference genome file from UCSC Downloads.

  8. Optionally, create an NCBI API key for your email address on the NCBI API Keys page. A key is not required, but it can speed up searches that use the NCBI Entrez APIs.
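
The checklist above can be partly automated. The following is a minimal sketch, not part of RareCollab: the folder names under `root` are hypothetical, and the HPO check assumes one term per line, which should be verified against the AI-MARRVEL input format.

```python
import re
from pathlib import Path

# HPO term IDs have the form "HP:" followed by seven digits, e.g. HP:0001250.
HPO_ID = re.compile(r"HP:\d{7}")


def prepare_layout(root):
    """Create work, output, and HPO folders under `root` (hypothetical names)."""
    root = Path(root)
    paths = {
        "work_path": root / "work",
        "output_path": root / "output",
        "hpo_path": root / "input" / "hpo",
    }
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def malformed_hpo_lines(hpo_file):
    """Return the non-empty lines of an ID.hpo.txt file that contain no HPO ID."""
    bad = []
    for line in Path(hpo_file).read_text().splitlines():
        line = line.strip()
        if line and not HPO_ID.search(line):
            bad.append(line)
    return bad
```

Running `malformed_hpo_lines` over every file in the HPO folder before starting the pipeline catches typos early, before any model is called.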

Usage

Step 1: Preprocessing

Run the preprocessing step to map variants to genes:

import RareCollab

RareCollab.Preprocessing.VarToGene(
    work_path=work_path,
    aim_path=aim_path,
    max_workers=5,
    HasGroundTruth=True,
    GroudTruthList=GroudTruthList,
    GENECODE_ANNOT_path=GENECODE_ANNOT_path,
    MANE_TRANSCRIPT_path=MANE_TRANSCRIPT_path
)

Parameters

  • work_path (str): Path to the empty work directory created for intermediate files.
  • aim_path (str): Path to the AI-MARRVEL output directory, for example /input/aim-out.
  • max_workers (int): Number of samples to process in parallel.
  • HasGroundTruth (bool): Whether ground-truth disease genes are available for the samples.
  • GroudTruthList (str or None): Path to the ground-truth file if HasGroundTruth=True; otherwise set it to None.
  • GENECODE_ANNOT_path (str): Path to parsed_gencode_v49_basic_annotation.feather.
  • MANE_TRANSCRIPT_path (str): Path to gtf_MANE_select_transcript.feather.

If HasGroundTruth=True, the ground-truth file should contain the following five columns:

  • SampleID: Sample IDs matching the folder names under /input/aim-out/
  • chr: Chromosome in Ensembl-style format (no chr prefix), consistent with AI-MARRVEL output
  • pos: Position
  • ref: Reference allele
  • alt: Alternate allele

If HasGroundTruth=False, set:

GroudTruthList = None
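
Because the ground-truth file must use exactly the five columns listed above, and chromosomes without the chr prefix, it can help to validate it before calling VarToGene. The sketch below is an illustration only: the helper names are hypothetical, and the comma delimiter is an assumption, since the expected file format (CSV or TSV) is not stated here.

```python
import csv
import io

REQUIRED_COLUMNS = ["SampleID", "chr", "pos", "ref", "alt"]


def normalize_ground_truth(rows):
    """Check the five required columns and strip any 'chr' prefix."""
    cleaned = []
    for i, row in enumerate(rows, start=1):
        missing = [c for c in REQUIRED_COLUMNS if not row.get(c)]
        if missing:
            raise ValueError(f"row {i}: missing values for {missing}")
        chrom = row["chr"]
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]  # "chr7" -> "7"
        cleaned.append({
            "SampleID": row["SampleID"],
            "chr": chrom,
            "pos": int(row["pos"]),
            "ref": row["ref"],
            "alt": row["alt"],
        })
    return cleaned


def read_ground_truth(text, delimiter=","):
    """Parse delimited text with a header row (delimiter is an assumption)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return normalize_ground_truth(reader)
```

For example, a made-up row `S1,chr7,12345,A,G` would be normalized to chromosome `7` with an integer position.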

Optional RNA Preprocessing

If RNA data are available, run:

RareCollab.Preprocessing.RNA(
    work_path=work_path,
    splicing_path=splicing_path,
    expression_path=expression_path,
    ase_path=ase_path
)

Parameters

  • work_path (str): Path to the work directory.
  • splicing_path (str): Path to the output from FRASER2.
  • expression_path (str): Path to the output from OUTRIDER.
  • ase_path (str): Path to the output from GATK ASEReadCounter.

Step 2: Run the Diagnostic Engine

Run the Mixture-of-Experts (MoE) diagnostic engine and retrieve candidate variants:

RareCollab.DiagnosticEngine.MoE(work_path=work_path)

RareCollab.DiagnosticEngine.Candidates(
    work_path=work_path,
    max_workers=5
)

Step 3: Run RareCollab Agents

You can run the following agent modules in parallel if sufficient computational resources are available.

In our setup, we used gpt-oss:20b as the MODEL_NAME and 0.7 as the TEMPERATURE for all agents.
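
One way to run the agent modules side by side is a thread pool from the standard library. This is a hedged sketch rather than a RareCollab feature: `run_in_parallel` is a hypothetical helper, and it assumes the agent calls are independent of one another and spend most of their time waiting on the Ollama server.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(tasks, max_parallel=2):
    """Run named zero-argument callables concurrently.

    Returns (results, errors), both keyed by task name, so that one
    failing agent does not hide the output of the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(fn): name for name, fn in tasks.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going if one task fails
                errors[name] = exc
    return results, errors
```

Each entry in `tasks` would be a small wrapper (for instance a lambda) around one of the RunAgent calls described in the following subsections. If the agents turn out to be CPU-bound, or share state through the work directory in ways that conflict, separate processes or sequential runs may be the safer choice.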

3.1 Module-1: Database Agent

RareCollab.DatabaseAgent.RunAgent(
    work_path=work_path,
    ClinVar_path=ClinVar_path,
    NCBI_EMAIL=NCBI_EMAIL,
    NCBI_KEY=NCBI_KEY,
    MODEL_NAME=MODEL_NAME_db,
    OLLAMA_URL=OLLAMA_URL_db,
    TEMPERATURE=TEMPERATURE_db,
    max_workers=5
)

Parameters

  • work_path (str): Path to the work directory.
  • ClinVar_path (str): Path to ClinVar_vcf.feather.
  • NCBI_EMAIL (str or None): Email address used for NCBI/Entrez access. If unavailable, set it to None.
  • NCBI_KEY (str or None): NCBI API key. If unavailable, set it to None.
  • MODEL_NAME_db (str): Model name used by the Database Agent.
  • OLLAMA_URL_db (str): Ollama endpoint used by the Database Agent.
  • TEMPERATURE_db (float): Temperature used by the Database Agent.
  • max_workers (int): Number of samples to process in parallel.
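
Since every agent takes the same three model settings, they can be kept in one place and expanded into keyword arguments. The sketch below is illustrative: `AgentConfig` is a hypothetical helper, and the default URL assumes a local Ollama server on its standard port; whether RareCollab expects the base URL or a full API path is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """Model settings shared by an agent call (hypothetical helper)."""
    model_name: str = "gpt-oss:20b"  # model named in the setup note above
    temperature: float = 0.7         # temperature named in the setup note above
    ollama_url: str = "http://localhost:11434"  # assumption: local Ollama default

    def as_kwargs(self):
        """Map the settings to the keyword names used by the RunAgent calls."""
        return {
            "MODEL_NAME": self.model_name,
            "OLLAMA_URL": self.ollama_url,
            "TEMPERATURE": self.temperature,
        }
```

A call could then pass `**AgentConfig().as_kwargs()` in place of the separate `_db`, `_is`, `_hpo`, and similar variables, while still allowing a different endpoint per agent.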

3.2 Module-2: In-Silico Agent

RareCollab.InSilicoAgent.RunAgent(
    work_path=work_path,
    MODEL_NAME=MODEL_NAME_is,
    OLLAMA_URL=OLLAMA_URL_is,
    TEMPERATURE=TEMPERATURE_is
)

Parameters

  • work_path (str): Path to the work directory.
  • MODEL_NAME_is (str): Model name used by the In-Silico Agent.
  • OLLAMA_URL_is (str): Ollama endpoint used by the In-Silico Agent.
  • TEMPERATURE_is (float): Temperature used by the In-Silico Agent.

3.3 Module-3: Phenotype Agent

3.3.1 Preprocessing

First, run phenotype preprocessing:

RareCollab.PhenotypeAgent.Preprocessing(
    work_path=work_path,
    HPO_patient_path=HPO_patient_path,
    HPO_lib_path=HPO_lib_path,
    HPO_genes_path=HPO_genes_path,
    OMIM_path=OMIM_path
)

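Parameters

  • work_path (str): Path to the work directory.
  • HPO_patient_path (str): Path to the folder of per-sample HPO files, for example /input/hpo/.
  • HPO_lib_path, HPO_genes_path, OMIM_path (str): Paths to the HPO and OMIM reference files. The expected file names are not listed here.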

Then run the phenotype-related agents below.

3.3.2 HPO Agent

RareCollab.PhenotypeAgent.RunAgent_HPO(
    work_path=work_path,
    MODEL_NAME=MODEL_NAME_hpo,
    OLLAMA_URL=OLLAMA_URL_hpo,
    TEMPERATURE=TEMPERATURE_hpo
)

Parameters

  • work_path (str): Path to the work directory.
  • MODEL_NAME_hpo (str): Model name used by the HPO Agent.
  • OLLAMA_URL_hpo (str): Ollama endpoint used by the HPO Agent.
  • TEMPERATURE_hpo (float): Temperature used by the HPO Agent.

3.3.3 OMIM Agent

RareCollab.PhenotypeAgent.RunAgent_OMIM(
    work_path=work_path,
    MODEL_NAME=MODEL_NAME_om,
    OLLAMA_URL=OLLAMA_URL_om,
    TEMPERATURE=TEMPERATURE_om
)

Parameters

  • work_path (str): Path to the work directory.
  • MODEL_NAME_om (str): Model name used by the OMIM Agent.
  • OLLAMA_URL_om (str): Ollama endpoint used by the OMIM Agent.
  • TEMPERATURE_om (float): Temperature used by the OMIM Agent.

3.3.4 Literature Agent

RareCollab.PhenotypeAgent.RunAgent_Literature(
    work_path=work_path,
    MODEL_NAME=MODEL_NAME_lit,
    OLLAMA_URL=OLLAMA_URL_lit,
    TEMPERATURE=TEMPERATURE_lit,
    Entrez_EMAIL=NCBI_EMAIL,
    Entrez_KEY=NCBI_KEY,
    NCBI_EMAIL=NCBI_EMAIL,
    NCBI_KEY=NCBI_KEY
)

Parameters

  • work_path (str): Path to the work directory.
  • MODEL_NAME_lit (str): Model name used by the Literature Agent.
  • OLLAMA_URL_lit (str): Ollama endpoint used by the Literature Agent.
  • TEMPERATURE_lit (float): Temperature used by the Literature Agent.
  • NCBI_EMAIL and Entrez_EMAIL (str or None): Email address used for NCBI/Entrez access. If unavailable, set it to None.
  • NCBI_KEY and Entrez_KEY (str or None): NCBI API key. If unavailable, set it to None.
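
To keep the email address and API key out of scripts, they can be read from environment variables and fall back to None when unset, which matches the "set it to None" convention above. This is a sketch under assumptions: the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are hypothetical choices, not names that RareCollab reads itself.

```python
import os


def ncbi_credentials(env=None):
    """Return (email, key) from the environment, using None for unset or empty values."""
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL") or None    # hypothetical variable name
    key = env.get("NCBI_API_KEY") or None    # hypothetical variable name
    return email, key
```

The returned pair can be passed directly as `NCBI_EMAIL` and `NCBI_KEY` (and as `Entrez_EMAIL` and `Entrez_KEY` for the Literature Agent).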

3.4 Module-4: RNA Agent

Run this module only if preprocessed RNA data are available.

First, quantify allele-specific RNA signals from BAM files:

RareCollab.RNAAgent.AlleleQuantification(
    work_path=work_path,
    BAM_root_path=BAM_root_path
)

Parameters

  • work_path (str): Path to the work directory.
  • BAM_root_path (str): Path to the directory containing all sorted RNA BAM files. Each BAM filename must include its sample ID.
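
Because samples are matched to BAM files through the filename, it is worth checking the mapping before quantification. The sketch below is an illustration only: the helper names are hypothetical, the matching rule (plain substring) is an assumption about how the sample ID is looked up, and only the top level of the directory is searched.

```python
from pathlib import Path


def match_bams(bam_root, sample_ids):
    """Map each sample ID to the BAM files whose name contains it."""
    bams = sorted(Path(bam_root).glob("*.bam"))  # top level only (assumption)
    return {sid: [p for p in bams if sid in p.name] for sid in sample_ids}


def ambiguous_or_missing(matches):
    """Return the samples that do not map to exactly one BAM file."""
    return {sid: hits for sid, hits in matches.items() if len(hits) != 1}
```

Substring matching flags IDs that are prefixes of other IDs: a sample named `S1` also matches `S10.sorted.bam`, so such cases show up as ambiguous and may call for renaming the files.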

Then run the RNA Agent:

RareCollab.RNAAgent.RunAgent(
    work_path=work_path,
    MODEL_NAME=MODEL_NAME_rna,
    OLLAMA_URL=OLLAMA_URL_rna,
    TEMPERATURE=TEMPERATURE_rna,
    CLINGEN_DOSAGE=CLINGEN_DOSAGE,
    UseNCBI=True,
    CLINVAR_ASSEMBLY="GRCh38",
    NCBI_EMAIL=NCBI_EMAIL,
    NCBI_KEY=NCBI_KEY
)

Parameters

  • work_path (str): Path to the work directory.
  • MODEL_NAME_rna (str): Model name used by the RNA Agent.
  • OLLAMA_URL_rna (str): Ollama endpoint used by the RNA Agent.
  • TEMPERATURE_rna (float): Temperature used by the RNA Agent.
  • CLINGEN_DOSAGE (str): Path to ClinGen_Dosage_Info.csv.
  • UseNCBI (bool): Whether to query sequence region information from NCBI.
  • CLINVAR_ASSEMBLY (str): Reference assembly used for ClinVar/NCBI queries, for example GRCh38.
  • NCBI_EMAIL (str or None): Email address used for NCBI/Entrez access.
  • NCBI_KEY (str or None): NCBI API key.

If UseNCBI=True, please provide:

  • CLINVAR_ASSEMBLY
  • NCBI_EMAIL
  • NCBI_KEY

If UseNCBI=False, the NCBI search step will be skipped. In that case, you can ignore CLINVAR_ASSEMBLY, NCBI_EMAIL, and NCBI_KEY.
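
The UseNCBI rule above can be captured in a small helper that builds only the NCBI-related keyword arguments. This is a sketch, not RareCollab code: `rna_agent_ncbi_kwargs` is a hypothetical name, and treating all three values as mandatory when UseNCBI=True is one reading of the requirement.

```python
def rna_agent_ncbi_kwargs(use_ncbi, assembly="GRCh38", email=None, key=None):
    """Build the NCBI-related keyword arguments for the RNA Agent call."""
    kwargs = {"UseNCBI": bool(use_ncbi)}
    if use_ncbi:
        required = {
            "CLINVAR_ASSEMBLY": assembly,
            "NCBI_EMAIL": email,
            "NCBI_KEY": key,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"UseNCBI=True requires: {', '.join(missing)}")
        kwargs.update(required)
    # With UseNCBI=False the three NCBI settings are simply left out.
    return kwargs
```

The result can be merged into the call with `**`, alongside work_path, the model settings, and CLINGEN_DOSAGE.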

Step 4: Integration

Run the final integration step to review results and generate the output files:

RareCollab.Integration.Review(
    work_path=work_path,
    reference_genome=reference_genome,
    output_path=output_path
)

Parameters

  • work_path (str): Path to the work directory.
  • reference_genome (str): Path to your hg38.chr.fa reference genome file.
  • output_path (str): Path to the empty output directory created for RareCollab results.
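
Before the integration step, a quick pre-flight check can confirm that the reference genome exists and that the output directory is empty, as the parameter descriptions above require. This is a minimal sketch with a hypothetical helper name; it only inspects the filesystem and does not call RareCollab.

```python
from pathlib import Path


def integration_preflight(reference_genome, output_path):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not Path(reference_genome).is_file():
        problems.append(f"reference genome not found: {reference_genome}")
    out = Path(output_path)
    if not out.is_dir():
        problems.append(f"output directory does not exist: {output_path}")
    elif any(out.iterdir()):
        problems.append(f"output directory is not empty: {output_path}")
    return problems
```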
