You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Optional inference for claims, mapping, and Lean proposals; suggest-only architecture with human-gated apply
Evaluation infrastructure
Prompt template versioning (prompt_template_id, prompt_template_sha256), reference fixtures under benchmarks/llm_eval/, benchmark task llm_eval, human review rubric
Live testing
Full end-to-end pipeline run validated on math_sum_evens with allenai/olmo-3.1-32b-instruct; all three LLM surfaces (claims, mapping, Lean) operational
Commands
just llm-claim-proposals <paper_id>, just llm-mapping-proposals <paper_id>, just llm-lean-proposals <paper_id>; just llm-live-eval for smoke reports
Claims with valid source_span and linked_assumptions / linked_symbols
Done
mapping.json maps every machine-checked claim to Lean in formal/ScientificMemory/Chemistry/Solutions/DilutionRef.lean
Done
theorem_cards.json and manifest.json via just publish-artifacts chem_dilution_reference (no hand-edited coverage)
Done
Gold under benchmarks/gold/chem_dilution_reference/ (claims, source_spans.json, assumptions)
Done
benchmarks/baseline_thresholds.json aligned with post-sprint totals
Done
Full local CI parity: Contributor playbook – Local CI (validate-all, pytest, adsorption tests, portal lint/build, smoke-graph, benchmark; Lean lake build)
Done
v0.3 (next)
Theme
Deliverable
Proof repair
Human-gated proof-repair-apply on formal/ (no CI auto-apply); deeper automation optional
Corpus depth
physics_kinematics_uniform at 15 claims with full intake; probability / larger physics still open
Corpus growth
batch-admit with --dry-run (CLI-only policy); contributor dry-run workflow; release corpus delta script
Content target (Milestone 3)
Goal
State
Original target
20–40 meaningful machine-checked claims across one or more papers
Current
Exceeds upper bound: 62 declarations across six papers
Next stretch
More real-source papers or richer domains
Track progress
just metrics (proof completion / Milestone 3 line) and portal dashboard