This checklist supports prime-intellect-llm.md and the llm_lean_proposals sidecar. Canonical formal/ is never modified by generation; only proof-repair-apply with explicit human-review flags may write Lean after you review a converted bundle.
- Prompts are versioned in
pipeline/src/sm_pipeline/llm/prompt_templates.py; changing them updatesmetadata.prompt_template_sha256on new runs and thellm_prompt_templatesmap injust benchmarkoutput. - Reviewed anchors live under
benchmarks/llm_eval/(reference_llm_*_proposals.jsonper paper). After intentional prompt or behavior changes, regenerate or edit those files and runjust benchmark(Gate 6 includestasks.llm_evalminima). - Extended coverage:
uv run --project pipeline pytest pipeline/tests/test_llm_eval_harness.py -q. - Optional smoke report:
just llm-live-eval --paper-ids math_sum_evens --use-fake-provider. - Optional full Lean sanity after manual apply experiments:
LLM_EVAL_RUN_LAKE_BUILD=1 just llm-eval-lean-build-optional.
| Check | Command | Expected |
|---|---|---|
| Unit + integration | uv run --project pipeline pytest pipeline/tests/test_llm_lean_proposals.py pipeline/tests/test_benchmark_runner.py -q |
Pass |
| Full pipeline tests | uv run --project pipeline pytest pipeline/tests -q |
Pass |
| LLM eval harness | uv run --project pipeline pytest pipeline/tests/test_llm_eval_harness.py -q |
Pass; tasks.llm_eval, llm_prompt_templates |
| Sidecar schema | Covered by test_validate_llm_lean_sidecar_invalid_warns |
Warn-only path for invalid llm_lean_proposals.json |
| Benchmark task | tasks.llm_lean_suggestions in benchmark_runner.TASK_NAMES |
just benchmark includes keys: llm_lean_proposal_files, lean_proposals_total, lean_proposals_conversion_ready, by_edit_kind |
| Fake provider plumbing | uv run --project pipeline python -m sm_pipeline.cli llm-lean-proposals --paper-id math_sum_evens --use-fake-provider -o tmp/live_matrix_fake_lean.json |
Writes valid bundle; paper_id forced to CLI value; proposals may be empty |
Prime Intellect may return 404 for some catalog models (no provider route for your account). Before relying on a model, confirm a minimal authenticated chat completion succeeds (see Live test A).
- Load
.envfrom repo root (or export env vars). GET {PRIME_INTELLECT_BASE_URL or https://api.pinference.ai/api/v1}/modelswithAuthorization: Bearer …and pick a model id that returns 200 on a minimalPOST …/chat/completionsbody (messages: [{role:user,content:OK}]).- Set
SM_LLM_MODEL_LEANto that model (optionalSM_LLM_MODEL_DEFAULTfallback).
uv run --project pipeline python -m sm_pipeline.cli llm-lean-proposals --paper-id math_sum_evens -o tmp/lean_math_sum_evens.jsonRecord: wall-clock time, metadata.latency_seconds, token fields, whether each proposal’s find is unique in the target Lean file (required for proof-repair-apply).
Repeat with --paper-id langmuir_1918_adsorption (or another multi-declaration paper). Compare proposal count and failure rate (empty replacements vs conversion-ready entries).
- Pick a proposal with non-empty
replacementsandtarget_fileunderformal/. uv run --project pipeline python -m sm_pipeline.cli llm-lean-proposals-to-apply-bundle tmp/lean_math_sum_evens.json --proposal-id <id> -o tmp/lean_apply_bundle.jsonuv run --project pipeline python -m sm_pipeline.cli proof-repair-apply tmp/lean_apply_bundle.json(no--apply) and inspect output.- Only if correct:
--apply --i-understand-human-reviewedon a throwaway branch/worktree.
After any real apply: lake build (or just lake-build), just publish-artifacts <paper_id>, just validate, just benchmark.
| Case | Expected |
|---|---|
Invalid llm_lean_proposals.json committed |
validate-all stderr warning; exit 0 |
Converter on proposal with empty replacements |
ValueError with clear message |
target_file outside formal/ or contains .. |
Pydantic validation error on bundle load / converter refusal |
| Date (UTC) | Operator | Test | Model (SM_LLM_MODEL_LEAN) |
Result / notes |
|---|---|---|---|---|
| 2026-03-20 | automation | Fake CLI math_sum_evens |
fake |
Valid JSON; proposals: [] from default stub |
| 2026-03-20 | agent | Live eval math_sum_evens (real API) |
meta-llama/llama-3.1-70b-instruct |
Report: benchmarks/reports/llm_live_20260320T144531Z.json. Claims+mapping succeeded via configured allenai/olmo-3.1-32b-instruct; Lean call returned HTTP 404 on chat/completions for this Lean model (provider route unavailable for account). |
| 2026-03-20 | agent | Live eval math_sum_evens with Lean override |
allenai/olmo-3.1-32b-instruct |
Report: benchmarks/reports/llm_live_math_sum_evens_lean_olmo.json. Claims+mapping succeeded; Lean reached provider but failed JSON extraction (Invalid control character), indicating non-strict JSON output from model response. |
| 2026-03-20 | agent | Live eval after error fixes | allenai/olmo-3.1-32b-instruct |
Report: benchmarks/reports/llm_live_math_sum_evens_after_error_fixes.json. Claims/mapping/Lean all schema-valid (ok: true); Lean returned proposal_count: 0 (valid empty suggestion set). |
| 2026-03-20 | agent | Full E2E pipeline run (math_sum_evens) |
allenai/olmo-3.1-32b-instruct |
Branch: e2e-llm-pipeline-run. All LLM surfaces executed live: claims (0 proposals, schema-valid), mapping (1 proposal applied via --apply --i-understand-human-reviewed), Lean (1 proposal generated, schema-valid). Applied: mapping merged into mapping.json. Final gates: Lean build passed, validate-all passed (warnings only), benchmark passed with llm_prompt_templates digests. Report: benchmarks/reports/llm_live_e2e_math_sum_evens.json. |