RareCollab is an Ollama-powered Python package for rare disease diagnosis. It integrates multimodal patient data, including DNA, RNA, and phenotype information, to support candidate variant prioritization and diagnostic interpretation.
Our paper is available on arXiv: https://arxiv.org/abs/2602.04058
Before running RareCollab, please prepare the following files and directories:
- Create a folder containing your DNA VCF files.
- Run AI-MARRVEL and save the output to /input/aim-out.
- Create an HPO folder such as /input/hpo/, containing one file per sample named in the format ID.hpo.txt. Each file should include a list of HPO terms (for example, HP:XXXXXXX) and follow the same format required by AI-MARRVEL input HPO files.
- Create a work folder for intermediate files generated by RareCollab.
- Create an output folder for results generated by RareCollab.
- Create a reference folder and download all required reference files from here.
- Prepare the hg38.chr.fa reference genome file from UCSC Downloads.
- Create an NCBI API key using your email address from NCBI API Keys. This step is optional, but it can speed up searches using the NCBI and Entrez APIs.
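Before running the pipeline, it can help to sanity-check the HPO folder described above. The sketch below is not part of RareCollab; `check_hpo_dir` is a hypothetical helper, and it assumes each ID.hpo.txt file lists one HPO term per line in the seven-digit form HP:XXXXXXX. If your AI-MARRVEL HPO files use a different layout, adjust the pattern accordingly.

```python
import re
from pathlib import Path

# Assumption: one term per line, "HP:" followed by exactly seven digits.
HPO_TERM = re.compile(r"^HP:\d{7}$")


def check_hpo_dir(hpo_dir):
    """Return {sample_id: [invalid lines]} for problematic *.hpo.txt files.

    A file with no terms at all is reported with an empty list.
    """
    problems = {}
    for path in sorted(Path(hpo_dir).glob("*.hpo.txt")):
        sample_id = path.name[: -len(".hpo.txt")]
        lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
        bad = [ln for ln in lines if not HPO_TERM.match(ln)]
        if bad or not lines:
            problems[sample_id] = bad
    return problems
```

Calling `check_hpo_dir("/input/hpo/")` returns an empty dictionary when every file passes this check.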
Run the preprocessing step to map variants to genes:
import RareCollab
RareCollab.Preprocessing.VarToGene(
work_path=work_path,
aim_path=aim_path,
max_workers=5,
HasGroundTruth=True,
GroudTruthList=GroudTruthList,
GENECODE_ANNOT_path=GENECODE_ANNOT_path,
MANE_TRANSCRIPT_path=MANE_TRANSCRIPT_path
)
- work_path (str): Path to the empty work directory created for intermediate files.
- aim_path (str): Path to the AI-MARRVEL output directory, for example /input/aim-out.
- max_workers (int): Number of samples to process in parallel.
- HasGroundTruth (bool): Whether ground-truth disease genes are available for the samples.
- GroudTruthList (str or None): Path to the ground-truth file if HasGroundTruth=True; otherwise set it to None.
- GENECODE_ANNOT_path (str): Path to parsed_gencode_v49_basic_annotation.feather.
- MANE_TRANSCRIPT_path (str): Path to gtf_MANE_select_transcript.feather.
If HasGroundTruth=True, the ground-truth file should contain the following five columns:
- SampleID: Sample IDs matching the folder names under /input/aim-out/
- chr: Chromosome in ENSEMBL-style format, without the chr prefix, consistent with AI-MARRVEL output
- pos: Position
- ref: Reference allele
- alt: Alternate allele
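The five-column requirement can be checked before preprocessing. This is a minimal sketch, not a RareCollab function: `check_ground_truth` is a hypothetical helper, and the tab delimiter is an assumption because the on-disk format of the ground-truth file is not specified here.

```python
import csv
import io

REQUIRED_COLUMNS = ["SampleID", "chr", "pos", "ref", "alt"]


def check_ground_truth(handle, delimiter="\t"):
    """Return a list of human-readable problems found in a ground-truth table.

    Assumption: the file is delimited text with a header row.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    errors = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        chrom = row["chr"] or ""
        pos = row["pos"] or ""
        if chrom.lower().startswith("chr"):
            errors.append(f"line {line_no}: chromosome {chrom} has a chr prefix")
        if not pos.isdigit():
            errors.append(f"line {line_no}: position {pos} is not an integer")
    return errors


example = io.StringIO("SampleID\tchr\tpos\tref\talt\nS1\t1\t12345\tA\tG\n")
print(check_ground_truth(example))
```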
If HasGroundTruth=False, set:
GroudTruthList = None
If RNA data are available, run:
RareCollab.Preprocessing.RNA(
work_path=work_path,
splicing_path=splicing_path,
expression_path=expression_path,
ase_path=ase_path
)
- work_path (str): Path to the work directory.
- splicing_path (str): Path to the output from FRASER2.
- expression_path (str): Path to the output from OUTRIDER.
- ase_path (str): Path to the output from GATK ASEReadCounter.
Run the Mixture-of-Experts (MoE) diagnostic engine and retrieve candidate variants:
RareCollab.DiagnosticEngine.MoE(work_path=work_path)
RareCollab.DiagnosticEngine.Candidates(
work_path=work_path,
max_workers=5
)
You can run the following agent modules in parallel if sufficient computational resources are available.
In our setup, we used gpt-oss:20b as the MODEL_NAME and 0.7 as the TEMPERATURE for all agents.
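One way to launch several agents at once is a thread pool, sketched below. This is an illustration only: `run_in_parallel` and `fake_agent` are hypothetical names, the stand-in function takes the place of the real RunAgent calls, and the Ollama URL shown is Ollama's default local address rather than a value prescribed by RareCollab. Whether the actual agents can safely share one Python process is not stated here, so running them as separate processes may be the safer choice.

```python
from concurrent.futures import ThreadPoolExecutor

# Settings reported in the text above; the URL is an assumption
# (Ollama's default local endpoint).
MODEL_NAME = "gpt-oss:20b"
TEMPERATURE = 0.7
OLLAMA_URL = "http://localhost:11434"


def run_in_parallel(tasks):
    """Run {name: (callable, kwargs)} concurrently and return {name: result}.

    Any exception raised by a task is re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {
            name: pool.submit(fn, **kwargs) for name, (fn, kwargs) in tasks.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


def fake_agent(work_path, MODEL_NAME, OLLAMA_URL, TEMPERATURE):
    """Stand-in for an agent call; replace with the real RunAgent function."""
    return f"{MODEL_NAME}@{TEMPERATURE}"


shared = dict(
    work_path="/work",  # hypothetical path
    MODEL_NAME=MODEL_NAME,
    OLLAMA_URL=OLLAMA_URL,
    TEMPERATURE=TEMPERATURE,
)
results = run_in_parallel({
    "in_silico": (fake_agent, shared),
    "hpo": (fake_agent, shared),
})
print(results)
```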
RareCollab.DatabaseAgent.RunAgent(
work_path=work_path,
ClinVar_path=ClinVar_path,
NCBI_EMAIL=NCBI_EMAIL,
NCBI_KEY=NCBI_KEY,
MODEL_NAME=MODEL_NAME_db,
OLLAMA_URL=OLLAMA_URL_db,
TEMPERATURE=TEMPERATURE_db,
max_workers=5
)
- work_path (str): Path to the work directory.
- ClinVar_path (str): Path to ClinVar_vcf.feather.
- NCBI_EMAIL (str or None): Email address used for NCBI/Entrez access. If unavailable, set it to None.
- NCBI_KEY (str or None): NCBI API key. If unavailable, set it to None.
- MODEL_NAME_db (str): Model name used by the Database Agent.
- OLLAMA_URL_db (str): Ollama endpoint used by the Database Agent.
- TEMPERATURE_db (float): Temperature used by the Database Agent.
- max_workers (int): Number of samples to process in parallel.
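Because NCBI_EMAIL and NCBI_KEY may each be None, one convenient pattern is to read them from environment variables and fall back to None when they are unset. The sketch below is an assumption about workflow, not a RareCollab feature; `ncbi_credentials` and the variable names NCBI_EMAIL and NCBI_API_KEY are illustrative choices.

```python
import os


def ncbi_credentials(env=None):
    """Return (email, key), using None for anything unset or empty.

    The environment variable names are hypothetical; pick whatever
    fits your setup.
    """
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL") or None
    key = env.get("NCBI_API_KEY") or None
    return email, key


NCBI_EMAIL, NCBI_KEY = ncbi_credentials()
```

This keeps the API key out of scripts and notebooks that may be shared.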
RareCollab.InSilicoAgent.RunAgent(
work_path=work_path,
MODEL_NAME=MODEL_NAME_is,
OLLAMA_URL=OLLAMA_URL_is,
TEMPERATURE=TEMPERATURE_is
)
- work_path (str): Path to the work directory.
- MODEL_NAME_is (str): Model name used by the In-Silico Agent.
- OLLAMA_URL_is (str): Ollama endpoint used by the In-Silico Agent.
- TEMPERATURE_is (float): Temperature used by the In-Silico Agent.
First, run phenotype preprocessing:
RareCollab.PhenotypeAgent.Preprocessing(
work_path=work_path,
HPO_patient_path=HPO_patient_path,
HPO_lib_path=HPO_lib_path,
HPO_genes_path=HPO_genes_path,
OMIM_path=OMIM_path
)
- work_path (str): Path to the work directory.
- HPO_patient_path (str): Path to the directory containing patient HPO files, for example /input/hpo/.
- HPO_lib_path (str): Path to hp.obo.
- HPO_genes_path (str): Path to HPO_genes_to_phenotype.txt.
- OMIM_path (str): Path to OMIM_Disease_Description.tsv.
Then run the phenotype-related agents below.
RareCollab.PhenotypeAgent.RunAgent_HPO(
work_path=work_path,
MODEL_NAME=MODEL_NAME_hpo,
OLLAMA_URL=OLLAMA_URL_hpo,
TEMPERATURE=TEMPERATURE_hpo
)
- work_path (str): Path to the work directory.
- MODEL_NAME_hpo (str): Model name used by the HPO Agent.
- OLLAMA_URL_hpo (str): Ollama endpoint used by the HPO Agent.
- TEMPERATURE_hpo (float): Temperature used by the HPO Agent.
RareCollab.PhenotypeAgent.RunAgent_OMIM(
work_path=work_path,
MODEL_NAME=MODEL_NAME_om,
OLLAMA_URL=OLLAMA_URL_om,
TEMPERATURE=TEMPERATURE_om
)
- work_path (str): Path to the work directory.
- MODEL_NAME_om (str): Model name used by the OMIM Agent.
- OLLAMA_URL_om (str): Ollama endpoint used by the OMIM Agent.
- TEMPERATURE_om (float): Temperature used by the OMIM Agent.
RareCollab.PhenotypeAgent.RunAgent_Literature(
work_path=work_path,
MODEL_NAME=MODEL_NAME_lit,
OLLAMA_URL=OLLAMA_URL_lit,
TEMPERATURE=TEMPERATURE_lit,
Entrez_EMAIL=NCBI_EMAIL,
Entrez_KEY=NCBI_KEY,
NCBI_EMAIL=NCBI_EMAIL,
NCBI_KEY=NCBI_KEY
)
- work_path (str): Path to the work directory.
- MODEL_NAME_lit (str): Model name used by the Literature Agent.
- OLLAMA_URL_lit (str): Ollama endpoint used by the Literature Agent.
- TEMPERATURE_lit (float): Temperature used by the Literature Agent.
- NCBI_EMAIL and Entrez_EMAIL (str or None): Email address used for NCBI/Entrez access. If unavailable, set it to None.
- NCBI_KEY and Entrez_KEY (str or None): NCBI API key. If unavailable, set it to None.
Run this module only if preprocessed RNA data are available.
First, quantify allele-specific RNA signals from BAM files:
RareCollab.RNAAgent.AlleleQuantification(
work_path=work_path,
BAM_root_path=BAM_root_path
)
- work_path (str): Path to the work directory.
- BAM_root_path (str): Path to the directory containing all sorted RNA BAM files. The sample ID must be included in each BAM filename.
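Since each BAM filename must contain its sample ID, a quick check for missing or ambiguous matches can save a failed run. The sketch below is a hypothetical helper, not RareCollab's own matching logic; it assumes the BAM files sit directly in the root directory and uses plain substring matching.

```python
from pathlib import Path


def match_bams(bam_root, sample_ids):
    """Map each sample ID to the BAM filenames that contain it.

    Assumption: BAM files are directly under bam_root (no subfolders).
    """
    bam_names = sorted(p.name for p in Path(bam_root).glob("*.bam"))
    return {sid: [name for name in bam_names if sid in name] for sid in sample_ids}


def report(matches):
    """Split sample IDs into those with no BAM and those with several."""
    missing = [sid for sid, hits in matches.items() if not hits]
    ambiguous = [sid for sid, hits in matches.items() if len(hits) > 1]
    return missing, ambiguous
```

Substring matching flags IDs such as S1 as ambiguous when a file for S10 is also present, which is worth resolving by renaming before quantification.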
Then run the RNA Agent:
RareCollab.RNAAgent.RunAgent(
work_path=work_path,
MODEL_NAME=MODEL_NAME_rna,
OLLAMA_URL=OLLAMA_URL_rna,
TEMPERATURE=TEMPERATURE_rna,
CLINGEN_DOSAGE=CLINGEN_DOSAGE,
UseNCBI=True,
CLINVAR_ASSEMBLY="GRCh38",
NCBI_EMAIL=NCBI_EMAIL,
NCBI_KEY=NCBI_KEY
)
- work_path (str): Path to the work directory.
- MODEL_NAME_rna (str): Model name used by the RNA Agent.
- OLLAMA_URL_rna (str): Ollama endpoint used by the RNA Agent.
- TEMPERATURE_rna (float): Temperature used by the RNA Agent.
- CLINGEN_DOSAGE (str): Path to ClinGen_Dosage_Info.csv.
- UseNCBI (bool): Whether to query sequence region information from NCBI.
- CLINVAR_ASSEMBLY (str): Reference assembly used for ClinVar/NCBI queries, for example GRCh38.
- NCBI_EMAIL (str or None): Email address used for NCBI/Entrez access.
- NCBI_KEY (str or None): NCBI API key.
If UseNCBI=True, please provide:
- CLINVAR_ASSEMBLY
- NCBI_EMAIL
- NCBI_KEY
If UseNCBI=False, the NCBI search step will be skipped. In that case, you can ignore CLINVAR_ASSEMBLY, NCBI_EMAIL, and NCBI_KEY.
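The UseNCBI rule can be expressed as a small pre-flight check. This is a sketch of the rule as described above, using a hypothetical helper name; it does not reflect any validation RareCollab itself may perform.

```python
def missing_ncbi_settings(UseNCBI, CLINVAR_ASSEMBLY=None, NCBI_EMAIL=None, NCBI_KEY=None):
    """Return the names of NCBI settings that are required but not provided.

    When UseNCBI is False the NCBI search is skipped, so nothing is required.
    """
    if not UseNCBI:
        return []
    settings = {
        "CLINVAR_ASSEMBLY": CLINVAR_ASSEMBLY,
        "NCBI_EMAIL": NCBI_EMAIL,
        "NCBI_KEY": NCBI_KEY,
    }
    return [name for name, value in settings.items() if not value]


print(missing_ncbi_settings(True, CLINVAR_ASSEMBLY="GRCh38"))
```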
Run the final integration step to review results and generate the output files:
RareCollab.Integration.Review(
work_path=work_path,
reference_genome=reference_genome,
output_path=output_path
)
- work_path (str): Path to the work directory.
- reference_genome (str): Path to your hg38.chr.fa reference genome file.
- output_path (str): Path to the empty output directory created for RareCollab results.
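Two quick checks before the final step are that the output directory is empty and that the reference FASTA looks as expected. Both helpers below are hypothetical and not part of RareCollab. The expectation of chr-prefixed sequence names is an assumption based on the file name hg38.chr.fa and on UCSC's naming convention for hg38.

```python
from pathlib import Path


def is_empty_dir(path):
    """True if path is an existing directory with no entries."""
    p = Path(path)
    return p.is_dir() and not any(p.iterdir())


def first_fasta_header(path):
    """Return the first sequence name in a FASTA file, or None if none is found."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                return parts[0] if parts else ""
    return None


def uses_chr_prefix(path):
    """True if the first sequence name starts with 'chr' (assumed UCSC style)."""
    name = first_fasta_header(path)
    return bool(name) and name.startswith("chr")
```

Only the first header line is read, so the check stays fast even on a full genome file.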