# CHIMERA: Calibrated Hierarchical Introspection and Meta-cognitive Error Recognition Assessment

CHIMERA is a comprehensive benchmark for evaluating the meta-cognitive calibration of Large Language Models: their ability to accurately recognize, express, and respond to their own uncertainty, errors, and knowledge boundaries.
Traditional benchmarks measure what models produce. CHIMERA measures whether models know when they're wrong.
## Tracks

CHIMERA evaluates meta-cognitive capabilities through four complementary tracks:
| Track | Description | Primary Metric |
|---|---|---|
| Calibration Probing | Tests whether stated confidence matches actual accuracy | ECE |
| Error Detection | Tests ability to identify errors in presented statements | F1 Score |
| Knowledge Boundary | Tests appropriate abstention on unanswerable questions | Abstention F1 |
| Self-Correction | Tests detection and correction of corrupted reasoning | E2E Success |
### Calibration Probing

Tests whether a model's expressed confidence actually predicts correctness. A well-calibrated model should be correct 80% of the time when it expresses 80% confidence.

**Metrics:** ECE, MCE, Brier Score, Reliability Diagrams
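As a sketch of the core idea, Expected Calibration Error (ECE) bins predictions by stated confidence and averages the gap between confidence and accuracy within each bin. The helper below is illustrative and not CHIMERA's metrics module; the binning scheme and names are assumptions.

```python
# Minimal ECE sketch: preds is a list of (stated confidence in [0, 1], was_correct).
def expected_calibration_error(preds, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / len(preds)) * abs(avg_conf - accuracy)  # weight by bin size
    return ece

preds = [(0.9, True), (0.9, True), (0.9, False), (0.6, True), (0.6, False)]
print(f"ECE: {expected_calibration_error(preds):.3f}")  # → ECE: 0.180
```

A perfectly calibrated model scores 0: in every bin, average confidence equals observed accuracy.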
### Error Detection

Presents statements containing deliberate errors and tests whether the model can identify them.

**Error Types:** Factual, Logical, Computational, Temporal, Magnitude, Hallucination
### Knowledge Boundary

Tests whether models appropriately abstain on unanswerable questions while answering confidently when they do know.

**Question Types:** Answerable, Impossible, Too Specific, Obscure Facts, Future Events
### Self-Correction

Introduces corruptions into reasoning chains and tests whether the model can detect and correct them.

**Perturbation Types:** Value Corruption, Step Removal, Logic Inversion, Premise Change
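To illustrate the idea (the step format below is hypothetical, not CHIMERA's task schema), a value corruption swaps one number in an otherwise valid chain:

```python
# Illustrative value-corruption perturbation on a reasoning chain.
original = [
    "Step 1: The train travels 60 km/h for 2 hours.",
    "Step 2: Distance = 60 * 2 = 120 km.",
]
corrupted = original.copy()
corrupted[1] = "Step 2: Distance = 60 * 2 = 140 km."  # corrupted value

# The model succeeds end-to-end if it flags step 2 (detection)
# and restores 120 km (correction).
print(corrupted[1])
```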
## Installation

```bash
git clone https://github.com/Rahul-Lashkari/chimera.git
cd chimera
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
```

## Configuration

Create a `.env` file with your API keys:

```bash
GOOGLE_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here  # Optional
```

Or use YAML configuration:

```yaml
# configs/gemini_eval.yaml
model:
  provider: gemini
  name: gemini-2.0-flash
tracks:
  - calibration
  - error_detection
  - knowledge_boundary
  - self_correction
evaluation:
  n_tasks: 100
  seed: 42
```
## Usage

```bash
# Check environment and dependencies
chimera check

# Run the full benchmark
chimera run --model gemini --track all

# Run a specific track
chimera run --model gemini --track calibration --n-tasks 50

# Dry run (generate tasks without API calls)
chimera run --track calibration --dry-run

# Use a custom configuration
chimera run --config configs/gemini_eval.yaml

# Analyze existing results
chimera analyze results/run_20260130/

# Generate a report
chimera report results/run_20260130/ --format html
```
## Python API

```python
from chimera.evaluation import EvaluationPipeline, PipelineConfig

config = PipelineConfig(
    tracks=["calibration", "error_detection"],
    model_provider="gemini",
    model_name="gemini-2.0-flash",
    n_tasks=100,
    seed=42,
)

pipeline = EvaluationPipeline(config)
results = pipeline.run()

print(f"Overall Score: {results.overall_score:.2%}")
for track, summary in results.track_summaries.items():
    print(f"  {track}: {summary.score:.2%}")
```
Compare multiple models:

```python
from chimera.evaluation import ModelComparison

comparison = ModelComparison()
comparison.add_model_results("gemini-2.0-flash", gemini_results)
comparison.add_model_results("gpt-4o", gpt4_results)

rankings = comparison.compute_rankings()
for rank in rankings:
    print(f"{rank.rank}. {rank.model_name}: {rank.score:.2%}")
```
## Metrics

### Calibration

| Metric | Description | Optimal |
|---|---|---|
| ECE | Expected Calibration Error | 0 |
| MCE | Maximum Calibration Error | 0 |
| Brier Score | Mean squared error of probabilistic predictions | 0 |
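The Brier score follows directly from its definition as a mean squared error; the helper below is an illustrative sketch using (confidence, outcome) pairs, not CHIMERA's metrics API:

```python
# Brier score: mean squared error between stated confidence and the 0/1 outcome.
def brier_score(preds):
    """preds: list of (confidence in [0, 1], correct as bool)."""
    return sum((conf - float(ok)) ** 2 for conf, ok in preds) / len(preds)

# Two perfectly confident, correct calls plus one 50% hedge that was right:
preds = [(1.0, True), (0.0, False), (0.5, True)]
print(f"{brier_score(preds):.4f}")
```

Unlike ECE, the Brier score penalizes both miscalibration and inaccuracy, so it is 0 only for a model that is always certain and always right.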
### Error Detection

| Metric | Description |
|---|---|
| Precision | Fraction of detected errors that are actual errors |
| Recall | Fraction of actual errors that were detected |
| F1 Score | Harmonic mean of precision and recall |
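These three quantities can be worked out from raw detection counts; the helper below is an illustrative sketch, not CHIMERA's metrics module:

```python
# F1 from detection counts: true positives, false positives, false negatives.
def f1_score(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)  # detected errors that were real
    recall = true_pos / (true_pos + false_neg)     # real errors that were detected
    return 2 * precision * recall / (precision + recall)

# 8 real errors flagged, 2 false alarms, 2 real errors missed:
print(f"{f1_score(8, 2, 2):.2f}")  # → 0.80
```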
### Knowledge Boundary

| Metric | Description |
|---|---|
| Abstention Rate | Frequency of declining to answer |
| Appropriate Abstention F1 | F1 score over abstain-vs-answer decisions |
### Self-Correction

| Metric | Description |
|---|---|
| Detection Rate | Fraction of corruptions correctly identified |
| Correction Accuracy | Fraction of detected errors correctly fixed |
| E2E Success | Detection Rate × Correction Accuracy |
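Since a model must both spot the corruption and produce the right fix, the end-to-end rate is the product of the two (illustrative numbers):

```python
# E2E success multiplies the two stage rates: a model that detects 90% of
# corruptions and fixes 80% of those succeeds end-to-end 72% of the time.
detection_rate = 0.90
correction_accuracy = 0.80
e2e_success = detection_rate * correction_accuracy
print(f"{e2e_success:.2f}")  # → 0.72
```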
## Project Structure

```
chimera/
├── src/chimera/
│   ├── cli/          # Command-line interface
│   ├── evaluation/   # Evaluation pipeline and aggregation
│   ├── generators/   # Task generators (one per track)
│   ├── interfaces/   # Model API interfaces
│   ├── metrics/      # Metric computation
│   ├── models/       # Pydantic data models
│   └── runner/       # Benchmark execution
├── tests/            # Test suite (802 tests)
├── configs/          # YAML configuration files
├── docs/             # Documentation (MkDocs)
├── examples/         # Example scripts and notebooks
└── results/          # Evaluation outputs
```
## Documentation

| Document | Description |
|---|---|
| Quick Start Guide | Get running in 5 minutes |
| Configuration Guide | YAML and environment setup |
| CLI Reference | Complete CLI documentation |
| Calibration Concepts | Theory of confidence calibration |
| Introspection Concepts | Meta-cognitive evaluation framework |
| Metrics Reference | All metrics explained |
| API: Models | Data models reference |
| API: Generators | Task generators reference |
| API: Evaluation | Evaluation pipeline reference |
## Development

```bash
# Run all tests
pytest tests -v

# Run with coverage
pytest tests --cov=src/chimera --cov-report=html

# Run a specific test module
pytest tests/test_evaluation/ -v
```

```bash
# Install development dependencies
pip install -e ".[dev]"

# Format code
black src tests
isort src tests

# Lint
ruff check src tests

# Type check
mypy src

# Security check
bandit -r src
```

## Why CHIMERA?

- **Safety:** A model that knows when it's uncertain is safer than one that confidently hallucinates. CHIMERA directly measures safety-relevant epistemic properties.
- **Agents:** Agentic systems must know when to pause and ask for help. CHIMERA tests this prerequisite capability.
- **Trust:** Users need to know when to trust model outputs. Calibrated confidence enables appropriate human-AI collaboration.
- **Alignment:** Truthfulness requires knowing what is true. CHIMERA measures the meta-cognitive foundations of honest AI.
## License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
## Acknowledgments

CHIMERA draws inspiration from foundational research on calibration and meta-cognition:
- Guo et al. (2017) - On Calibration of Modern Neural Networks
- Kadavath et al. (2022) - Language Models (Mostly) Know What They Know
- Lin et al. (2022) - Teaching Models to Express Their Uncertainty in Words
## Citation

```bibtex
@software{chimera2026,
  title={CHIMERA: Calibrated Hierarchical Introspection and Meta-cognitive Error Recognition Assessment},
  author={Rahul Lashkari},
  year={2026},
  url={https://github.com/Rahul-Lashkari/chimera}
}
```