Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
Lucas Stoffl, Benedikt Wiestler, Johannes C. Paetzold · arXiv 2026
VERITAS runs biomedical hypothesis tests on multi-modal datasets end-to-end: from a natural-language question to a statistically grounded verdict, with every intermediate artifact auditable. A team of LLM agents formulates an analysis plan, calls a medical-imaging segmentation foundation model (SAT), writes and executes statistical code in a sandbox, and delivers a verdict that a deterministic Evidence Classification Operator (ECO) scores against power, directionality, and effect size — no post-hoc LLM grading, no fabricated p-values.
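To make the ECO step concrete, below is a minimal sketch of the kind of deterministic check it performs. The labels, field names, and thresholds are illustrative assumptions for this README, not the paper's actual operator:

# Sketch of an ECO-style deterministic verdict check. All names and
# thresholds here are illustrative, not VERITAS's actual implementation.
from dataclasses import dataclass

@dataclass
class StatResult:
    p_value: float          # from the sandboxed statistical code
    effect_size: float      # signed, e.g. Cohen's d
    power: float            # achieved power of the test
    hypothesized_sign: int  # +1 or -1, the direction the hypothesis predicts

def classify_evidence(r: StatResult, alpha: float = 0.05,
                      min_power: float = 0.8, min_effect: float = 0.2) -> str:
    if r.power < min_power:
        return "underpowered"            # no verdict without adequate power
    right_direction = (r.effect_size * r.hypothesized_sign) > 0
    if r.p_value < alpha and right_direction and abs(r.effect_size) >= min_effect:
        return "supported"
    if r.p_value < alpha and not right_direction:
        return "contradicted"            # significant, but in the wrong direction
    return "not_supported"

Because the classification is a pure function of the computed statistics, the agents cannot talk their way into a verdict the numbers do not support.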
On a 64-hypothesis benchmark (ACDC cardiac MRI + UCSF-PDGM glioma), locally deployed VERITAS reaches 87.9% evidence-label accuracy on ACDC with zero hallucinated statistical significance. See the paper for full results.
Browse example_run/ for two complete frozen runs of the same hypothesis — agent transcripts, segmentation request, statistical code, produced plots, and the final verdict JSON. One uses small gpt-oss and qwen3 models via local Ollama, the other GPT-5.2 via OpenRouter. Both reach the same verdict. No install, no API key required to inspect them.
1. Install
git clone https://github.com/LucZot/veritas.git
cd veritas
conda create -n veritas python=3.12 -y
conda activate veritas
pip install -e ".[all]"

2. Configure MCP servers
cp mcp_servers.example.json mcp_servers.json

The default code_execution block works as-is. The sat block has ${SAT_PYTHON} / ${SAT_REPO_PATH} / ${SAT_CHECKPOINT_DIR} placeholders that step 4 will fill in automatically. (If you skip step 4, you can either export those env vars or paste absolute paths in by hand.)
3. Code-execution MCP server (required)
Installs the dependencies the sandbox needs to run agent-authored statistical code. Reuses the veritas conda env:
bash scripts/setup_code_execution.sh

4. SAT segmentation MCP server (required for raw imaging)
SAT is a text-prompted medical-imaging segmentation foundation model. It is only needed when running on raw imaging datasets (ACDC, UCSF-PDGM), and it requires a GPU, a separate conda env (Python 3.11), and ~6.5 GB of checkpoints.
bash scripts/setup_sat.sh # creates 'sat' env, clones SAT, installs SAT deps,
# downloads checkpoints, and patches mcp_servers.json
                     # (idempotent: re-runs skip what's already done)
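If you want to confirm what the script wrote, the snippet below prints the SAT paths from mcp_servers.json and checks they exist. The assumption that the sat block keeps its paths under an "env" map is illustrative, not a documented schema; adjust the keys to whatever your file actually holds.

# Sanity-check the paths step 4 patched into mcp_servers.json.
# The "sat" -> "env" layout below is an assumption, not a documented schema.
import json, os, pathlib

cfg = json.loads(pathlib.Path("mcp_servers.json").read_text())
for key, value in cfg.get("sat", {}).get("env", {}).items():
    path = os.path.expandvars(str(value))
    print(f"{key}: {path} [{'OK' if os.path.exists(path) else 'MISSING'}]")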
5. LLM provider

Ollama (local, no API key):
# Install Ollama (see https://ollama.com), then start the daemon:
export OLLAMA_LOG_LEVEL=warn # suppress verbose Ollama output
ollama serve &
# Pull every model used by the default config:
ollama pull gpt-oss:20b
ollama pull qwen3:8b
ollama pull qwen3-coder:30b

Use experiments/configs/default_ollama_local.json.
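To confirm the daemon is up and all three models are pulled, you can query Ollama's standard model-listing endpoint on its default port (11434; adjust if you changed it):

# Check the local Ollama daemon via its /api/tags model-listing endpoint.
import json, urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    names = [m["name"] for m in json.load(resp)["models"]]
for required in ("gpt-oss:20b", "qwen3:8b", "qwen3-coder:30b"):
    print(required, "OK" if required in names else "MISSING")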
OpenRouter (frontier models via API):
export OPENROUTER_API_KEY=sk-or-...

Use experiments/configs/default_openrouter_gpt52.json.
6. Datasets
export BIO_DATA_ROOT=/path/to/your/data   # add to ~/.bashrc to make it permanent

Config files use __DATA_ROOT__ as a placeholder, replaced with $BIO_DATA_ROOT at runtime. The SAT cache ($BIO_DATA_ROOT/sat_cache/) is created automatically on first run.
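The placeholder contract is presumably plain string substitution; a minimal sketch of the idea (the actual logic lives in the repo's runtime, not in this snippet):

# Illustrates the __DATA_ROOT__ contract described above.
import json, os, pathlib

raw = pathlib.Path("experiments/configs/default_ollama_local.json").read_text()
config = json.loads(raw.replace("__DATA_ROOT__", os.environ["BIO_DATA_ROOT"]))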
ACDC (cardiac MRI, 100 patients) — ACDC challenge (free registration). Extract so the tree looks like:
$BIO_DATA_ROOT/
└── ACDC/
└── database/
├── training/
│ ├── patient001/
│ │ ├── patient001_4d.nii.gz
│ │ ├── patient001_frame01.nii.gz
│ │ ├── Info.cfg
│ │ └── ...
│ └── ...
└── testing/
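A quick, purely illustrative way to confirm the extraction matches this layout:

# Verify the ACDC tree landed where the configs expect it.
import os, pathlib

root = pathlib.Path(os.environ["BIO_DATA_ROOT"]) / "ACDC" / "database"
for sub in ("training/patient001/Info.cfg", "testing"):
    print(sub, "OK" if (root / sub).exists() else "MISSING")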
UCSF-PDGM (brain glioma, 501 patients) — TCIA (free, no registration). Extract so the tree looks like:
$BIO_DATA_ROOT/
└── UCSF-PDGM/
├── UCSF-PDGM-v5/ # NIfTI per patient
│ ├── UCSF-PDGM-0004_nifti/
│ └── ...
├── UCSF-PDGM-metadata_v5.csv
└── ...
Then build the manifest:
python scripts/generate_pdgm_manifest.py --data-root $BIO_DATA_ROOT/UCSF-PDGM

7. Run

One hypothesis, local (Ollama):
python experiments/run_experiments.py \
--bank experiments/tiered_hypothesis_bank.json \
--hypotheses cardiac_01_dcm_lvef_lower \
--n-runs 1 \
--config experiments/configs/default_ollama_local.json \
--output-dir results/veritas_main

One hypothesis, OpenRouter:
python experiments/run_experiments.py \
--bank experiments/tiered_hypothesis_bank.json \
--hypotheses cardiac_01_dcm_lvef_lower \
--n-runs 1 \
--config experiments/configs/default_openrouter_gpt52.json \
--output-dir results/veritas_main

Full benchmark (64 hypotheses × 10 runs):
python experiments/run_experiments.py \
--bank experiments/tiered_hypothesis_bank.json \
--n-runs 10 \
--config experiments/configs/default_ollama_local.json \
--output-dir results/veritas_main

Re-evaluate an existing experiment folder offline (no LLM calls):
python experiments/evaluate_experiment_folder.py results/veritas_main/exp_YYYYMMDD_HHMMSS

See experiments/README.md for all flags, the evaluation-framework rationale, and troubleshooting.
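Each run folder contains the final verdict JSON (see example_run/ for frozen instances). A sketch of pulling the headline field out of one; the file-name pattern and "label" key are assumptions for illustration, so check a frozen run for the real layout:

# Peek at finished runs' verdicts. File names and fields are assumed here;
# inspect example_run/ for the actual layout.
import json, pathlib

run_dir = pathlib.Path("results/veritas_main")  # point this at your output dir
for verdict_file in run_dir.rglob("*verdict*.json"):
    verdict = json.loads(verdict_file.read_text())
    print(verdict_file, "->", verdict.get("label", verdict))

Cite as: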
@article{stoffl2026veritas,
title={VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems},
author={Stoffl, Lucas and Wiestler, Benedikt and Paetzold, Johannes C},
journal={arXiv preprint arXiv:2604.12144},
year={2026}
}

We gratefully build on Virtual Lab (Swanson et al., Nature 2025) for agent and meeting primitives, and on SAT (Zhao et al., NPJ Digital Medicine 2025) for medical-image segmentation in Phase 2A.
