Run benchmarks against your own data to see how well the memory system is working.
```bash
# Quick health check (~1 second)
npx causantic benchmark-collection --quick

# Standard benchmark (~30 seconds)
npx causantic benchmark-collection

# Full benchmark (~2-5 minutes)
npx causantic benchmark-collection --full
```

Depending on the level, a run scores up to four categories:

**Health**
- Collection size and structure (chunks, sessions, projects)
- Graph connectivity and edge density
- Cluster coverage and quality
- Per-project breakdown
**Retrieval**
- Can the system find related chunks? (Adjacent Recall)
- Can it bridge across sessions? (Cross-Session Bridging)
- Does project filtering work? (Precision@K)
- How much of the returned context is useful? (Token Efficiency)
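As an illustration of two of these retrieval metrics, here is a minimal sketch; the types, field names, and exact formulas are assumptions for illustration, not causantic's internal API:

```typescript
// Illustrative shapes only; causantic's real result objects are not documented here.
interface RetrievedChunk {
  projectId: string;
  tokens: number;
  relevant: boolean; // judged against the query's known-good chunks
}

/** Precision@K: fraction of the top-K results matching the expected project. */
function precisionAtK(results: RetrievedChunk[], expectedProject: string, k: number): number {
  const topK = results.slice(0, k);
  if (topK.length === 0) return 0;
  return topK.filter((c) => c.projectId === expectedProject).length / topK.length;
}

/** Token efficiency: share of returned tokens that belong to relevant chunks. */
function tokenEfficiency(results: RetrievedChunk[]): number {
  const total = results.reduce((sum, c) => sum + c.tokens, 0);
  if (total === 0) return 0;
  return results.filter((c) => c.relevant).reduce((sum, c) => sum + c.tokens, 0) / total;
}
```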
**Chain Quality**
- How well does episodic chain walking work?
- What percentage of queries produce chains vs falling back to search?
- How long are chains and how relevant per token?
**Latency**
- How fast are queries across different retrieval modes?
- Are there performance bottlenecks?
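Latency results are summarized as percentiles such as p95 (referenced in the tuning thresholds below). Here is a minimal nearest-rank sketch; whether causantic computes the statistic exactly this way is an assumption:

```typescript
// Nearest-rank percentile; an assumed method for summarizing latency samples.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const queryLatencies = [120, 340, 95, 2100, 410, 180, 260]; // ms, hypothetical
console.log(`p95: ${percentile(queryLatencies, 95)}ms`);
```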
The overall score (0-100) is a weighted composite:
| Category | Weight | What it means |
|---|---|---|
| Health | 25% | Collection structure and organization |
| Retrieval | 35% | Can the system find the right context? |
| Chain Quality | 25% | Does episodic chain walking produce useful narratives? |
| Latency | 15% | Is query performance acceptable? |
Only scored categories contribute; weights renormalize. A `--quick` run scores health only.
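As a sketch of that renormalization, using the weights from the table above (the exact aggregation causantic uses isn't specified here, so treat this as illustrative):

```typescript
// Category weights from the table above; per-category scores are on 0-100.
const WEIGHTS: Record<string, number> = {
  health: 0.25,
  retrieval: 0.35,
  chainQuality: 0.25,
  latency: 0.15,
};

/** Composite over whichever categories were actually scored, renormalized. */
function compositeScore(scores: Partial<Record<string, number>>): number {
  const scored = Object.keys(scores).filter((k) => k in WEIGHTS);
  const totalWeight = scored.reduce((sum, k) => sum + WEIGHTS[k], 0);
  if (totalWeight === 0) return 0;
  return scored.reduce((sum, k) => sum + WEIGHTS[k] * (scores[k] as number), 0) / totalWeight;
}

// A --quick run scores health only, so its weight renormalizes to 1.0:
console.log(compositeScore({ health: 82 }));                // 82
console.log(compositeScore({ health: 82, retrieval: 60 })); // (0.25*82 + 0.35*60) / 0.60 ≈ 69.2
```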
Score ranges:
- 80-100: Excellent — memory system is working very well
- 60-79: Good — working well, tuning recommendations may help
- 40-59: Fair — some areas need attention (check recommendations)
- 0-39: Needs work — follow tuning recommendations and re-ingest
The report includes specific recommendations with config changes:

```json
// causantic.config.json
{
  "clustering": { "threshold": 0.12 },
  "traversal": { "maxDepth": 25 }
}
```

After making changes, re-run the benchmark to measure improvement:

```bash
npx causantic benchmark-collection --seed 42
```

Using the same `--seed` ensures comparable results.
**Low cluster coverage (<70%)**
- Raise `clustering.threshold` (default: 0.10, try 0.12)
- Lower `clustering.minClusterSize` (default: 4, try 3)
- Then run `npx causantic maintenance run update-clusters`
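Applied together in `causantic.config.json` (the nesting is inferred from the dotted option names above; a sketch, not a verified schema):

```json
// causantic.config.json
{
  "clustering": {
    "threshold": 0.12,
    "minClusterSize": 3
  }
}
```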
**Low chain quality (high fallback rate)**
- Rebuild edges: `npx causantic maintenance rebuild-edges`
- Check the edge count; if edges are missing, re-ingest with the latest parser
- Ensure cross-session edges exist by checking the edge type distribution
**Low source diversity (0% cluster sources despite good cluster coverage)**
- Lower `retrieval.mmrLambda` (default: 0.7, try 0.5); this increases MMR's diversity bias, allowing cluster-expanded chunks to compete with near-duplicate vector hits
- If cluster coverage itself is low (<50%), fix clustering first; MMR can't diversify sources that don't exist
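To see why a lower lambda increases diversity, here is the standard Maximal Marginal Relevance selection loop in generic form; this is textbook MMR, not necessarily causantic's exact implementation:

```typescript
interface Candidate {
  id: string;
  relevance: number; // similarity to the query
}

/** Standard MMR: lambda rewards relevance; (1 - lambda) penalizes redundancy. */
function mmrSelect(
  candidates: Candidate[],
  sim: (a: string, b: string) => number, // pairwise chunk similarity
  lambda: number,
  k: number,
): Candidate[] {
  const selected: Candidate[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      // Redundancy: highest similarity to anything already selected.
      const redundancy = selected.length
        ? Math.max(...selected.map((s) => sim(pool[i].id, s.id)))
        : 0;
      const score = lambda * pool[i].relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```

At `lambda = 0.5`, redundancy is penalized as heavily as relevance is rewarded, so a near-duplicate of an already-selected chunk loses to a less similar cluster-expanded one.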
**High latency (p95 >5s)**
- Reduce `traversal.maxDepth` (default: 50, try 30)
- Reduce cluster expansion: `clusterExpansion.maxSiblings: 3`
- For large collections (>5,000 chunks): set `vectors.ttlDays: 60` to reduce the search space
- Note: first-call latency includes the embedding model load (~500ms). Subsequent calls are faster.
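For example, all three latency settings together (nesting inferred from the dotted option names; a sketch, not a verified schema):

```json
// causantic.config.json
{
  "traversal": { "maxDepth": 30 },
  "clusterExpansion": { "maxSiblings": 3 },
  "vectors": { "ttlDays": 60 }
}
```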
Each run is stored automatically. View trends:

```bash
npx causantic benchmark-collection --history
```

This shows how your score has changed over time, helping you verify that config changes are having the intended effect.
Use `--seed` for deterministic sampling:

```bash
npx causantic benchmark-collection --seed 42
```

The same seed produces identical query samples, making before/after comparisons meaningful.
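Deterministic sampling like this is commonly implemented with a seeded PRNG driving a shuffle; here is a generic sketch (mulberry32 and Fisher-Yates are illustrative choices, not necessarily what causantic uses):

```typescript
// mulberry32: a tiny deterministic PRNG. Same seed => same sequence.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Same seed and same pool always yield the same n-item sample. */
function sampleQueries<T>(pool: T[], n: number, seed: number): T[] {
  const rand = mulberry32(seed);
  const copy = [...pool];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1)); // Fisher-Yates shuffle
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, n);
}
```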
| Flag | Description | Default |
|---|---|---|
| `--quick` | Health only | - |
| `--standard` | Health + retrieval | (default) |
| `--full` | All categories | - |
| `--categories` | Comma-separated list | (from profile) |
| `--sample-size` | Queries to sample | 50 |
| `--seed` | Random seed | (random) |
| `--project` | Limit to one project | (all) |
| `--output` | Report directory | `./causantic-benchmark/` |
| `--json` | JSON output only | false |
| `--no-tuning` | Skip recommendations | false |
| `--history` | Show past trends | false |
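Flags compose; for example, a reproducible full run scoped to a single project, with machine-readable output (`my-app` is a placeholder project name):

```bash
npx causantic benchmark-collection --full --project my-app --seed 42 --json
```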