This document maps each metric category from SPEC Section 12 to where the metric is produced or displayed. Use it to locate and evaluate product, formalization, extraction, infra, and research-value metrics.
## Product metrics

| Metric | Source / display |
|---|---|
| Number of papers admitted | Portal dashboard; corpus/index.json; coverage dashboard totals |
| Number of claim bundles | Portal dashboard (Totals: Claims); aggregate from paper manifests |
| Percentage of claims with source spans | Derived from claims with valid source_span; coverage in manifest / dashboard |
| Percentage of claims mapped to formal targets | Dashboard "Mapped" vs "Claims"; manifest.coverage_metrics.mapped_claim_count |
| Percentage of mapped claims machine-checked | Dashboard "Machine-checked"; manifest.coverage_metrics.machine_checked_count |
| Percentage linked to executable kernels | Dashboard "Kernel-linked"; manifest.coverage_metrics.kernel_linked_count |
| Median time from paper intake to first artifact page | CLI: uv run --project pipeline python -m sm_pipeline.cli metrics --median-intake-time (uses intake_report.json created_at and metadata first_artifact_at). Papers without both are excluded. |
Primary display: Portal coverage dashboard (source: `portal/app/dashboard/page.tsx`). Data comes from `corpus/papers/<paper_id>/manifest.json` and its `coverage_metrics`.
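The dashboard percentages are simple ratios over the `coverage_metrics` counters named in the table above. A minimal sketch, assuming a `claim_count` denominator field alongside those counters (the exact manifest key for the denominator is an assumption):

```python
def coverage_percentages(metrics: dict) -> dict:
    """Turn raw coverage_metrics counts into dashboard-style percentages.

    Counter names follow the manifest keys cited above; `claim_count`
    (the denominator) is assumed and may differ in the real schema.
    """
    total = metrics.get("claim_count", 0)
    if total == 0:
        # No claims yet: report zero coverage rather than dividing by zero.
        return {"mapped_pct": 0.0, "machine_checked_pct": 0.0, "kernel_linked_pct": 0.0}
    return {
        "mapped_pct": 100.0 * metrics["mapped_claim_count"] / total,
        "machine_checked_pct": 100.0 * metrics["machine_checked_count"] / total,
        "kernel_linked_pct": 100.0 * metrics["kernel_linked_count"] / total,
    }

example = {"claim_count": 8, "mapped_claim_count": 6,
           "machine_checked_count": 4, "kernel_linked_count": 2}
print(coverage_percentages(example))
```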
## Formalization metrics

| Metric | Source / display |
|---|---|
| Declaration compile rate, proof completion rate | Lean build (Lake); CI gate 1; proof completion rate and Milestone 3 progress via just metrics --proof-completion. |
| Axiom / sorry count (repo-level and per-paper) | CLI: just metrics --axiom-count; scans formal/**/*.lean; per-paper via mapping.json namespace. |
| Dependency reuse ratio | CLI: metrics --dependency; computed from manifest dependency_graph and theorem_cards dependency_ids. |
| Average theorem-card fan-in / fan-out | Same as above; report includes average_fan_in, average_fan_out, max_fan_in, max_fan_out. |
| Number of claims marked disputed after formalization | just metrics --research-value: disputed_claims_with_formal_targets (claims with status: disputed and non-empty linked_formal_targets) |
Primary sources: Lean build success (CI); corpus/papers/<paper_id>/manifest.json (theorem cards, dependency graph); theorem_cards.json.
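The fan-in / fan-out figures can be derived from theorem-card `dependency_ids` alone. A sketch under the assumption that each card is a dict with `id` and `dependency_ids` (the real `theorem_cards.json` records may carry more fields):

```python
from collections import Counter

def fan_stats(cards: list[dict]) -> dict:
    """Average and max fan-in / fan-out over theorem cards.

    Fan-out of a card = its outgoing dependency count; fan-in = how many
    other cards list it in their dependency_ids.
    """
    fan_in = Counter()
    for card in cards:
        for dep in card.get("dependency_ids", []):
            fan_in[dep] += 1
    fan_outs = [len(c.get("dependency_ids", [])) for c in cards]
    fan_ins = [fan_in.get(c["id"], 0) for c in cards]
    n = len(cards) or 1
    return {
        "average_fan_out": sum(fan_outs) / n,
        "average_fan_in": sum(fan_ins) / n,
        "max_fan_out": max(fan_outs, default=0),
        "max_fan_in": max(fan_ins, default=0),
    }

cards = [
    {"id": "a", "dependency_ids": []},
    {"id": "b", "dependency_ids": ["a"]},
    {"id": "c", "dependency_ids": ["a", "b"]},
]
print(fan_stats(cards))
```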
## Extraction metrics

| Metric | Source / display |
|---|---|
| LLM / suggestion sidecar counts, unresolved preserved links | just benchmark tasks llm_suggestions (optional claim/mapping/suggested file counts and linked_*_unresolved on canonical claims) and llm_lean_suggestions (llm_lean_proposals.json footprint and conversion-ready tallies); thresholds in benchmarks/baseline_thresholds.json (tasks.llm_suggestions, tasks.llm_lean_suggestions; minima default to zero until you raise floors). |
| LLM eval regression anchors (non-canonical) | Task llm_eval: reviewed benchmarks/llm_eval/<paper_id>/reference_llm_*_proposals.json vs gold claim ids, corpus mapping.json, and static Lean conversion-ready checks; includes disagreement rates, promotion counters, and reviewer-time totals when metadata is present. Declared prompt template digests appear at top level of the benchmark report as llm_prompt_templates (id → SHA-256). See benchmarks/llm_eval/README.md, benchmarks/README.md, and ADR 0013. |
| Claim extraction precision/recall/F1, assumption precision/recall/F1 | Gold task in just benchmark; report keys gold_claim_precision, gold_claim_recall, gold_claim_f1, gold_assumption_*. See Gold labels. |
| Symbol normalization conflict rate | CLI: just metrics --symbol-conflict (or just metrics); see Metrics CLI below. |
| Source-span alignment | For each claim id listed in benchmarks/gold/<paper_id>/source_spans.json, corpus source_span is compared to the reference. Benchmark (Gate 6): tasks.gold includes source_span_total_compared, source_span_alignment_error_count, and source_span_alignment_error_rate (same computation as the metrics CLI). Regression: benchmarks/baseline_thresholds.json may set tasks_ceiling.gold.source_span_alignment_error_rate (maximum allowed rate; 0 means no mismatches). CLI duplicate: just metrics --source-span-alignment emits source_span_alignment with total_compared, alignment_error_count, alignment_error_rate. See Gold labels. |
Primary source: the benchmark suite. Run `just benchmark` (or `uv run --project pipeline python -m sm_pipeline.cli benchmark`). Results are written to `benchmarks/reports/latest.json` and include:

- `proof_success_snapshot` (`machine_checked_count`, `declaration_count`, `fraction_machine_checked`);
- `tasks.theorem_cards.per_paper` (per-paper quality slices);
- `tasks.gold` (including the source-span alignment fields above);
- `tasks.llm_suggestions` (papers scanned, sidecar file counts, unresolved-link counts on canonical claims);
- `tasks.llm_lean_suggestions` (`llm_lean_proposals.json` counts and conversion-ready proposal tallies);
- `tasks.llm_eval` (reference-bundle proxies under `benchmarks/llm_eval/`);
- `llm_prompt_templates` (declared prompt-template SHA-256 map).

Companion outputs: `proof_success_summary.md` and `benchmarks/reports/trend/proof_success_history.json` (rolling trend history). Gate 6 trend check: if `fraction_machine_checked` decreased vs. the previous run, a warning is written to the summary (warn-only). See also Generated artifacts. The main PR validation workflow runs the benchmark with the regression check (Gate 6), including runtime budgets, minimum thresholds under `tasks`, and optional ceilings under `tasks_ceiling`; extraction/proof metrics are enforced before merge.
## Infra metrics

| Metric | Source / display |
|---|---|
| Cold build time, cached build time, portal build time | CI job timings; locally, `time just build` / `time pnpm --dir portal build` |
| Flaky test rate, schema migration break rate | CI history; manual tracking; test runs in GitHub Actions |
Primary source: GitHub Actions workflow run durations and logs; local timing of just build, just validate, and portal build.
Optional automation (report-only): Workflow jobs can record pass/fail per job to a JSON artifact; compare consecutive runs manually or via the GitHub Actions API. Full flaky-rate dashboards are not required for v0.2; document counts from Actions run history when auditing infra health.
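One way to compare consecutive run artifacts manually, assuming each artifact is a flat job-name to "pass"/"fail" map (the artifact shape is not specified above, so this layout is an assumption):

```python
def flipped_jobs(prev: dict[str, str], cur: dict[str, str]) -> dict[str, tuple[str, str]]:
    """List jobs whose status changed between two per-job pass/fail
    artifacts -- a cheap flakiness signal for infra-health audits.
    Only jobs present in both runs are compared."""
    return {job: (prev[job], cur[job])
            for job in prev.keys() & cur.keys()
            if prev[job] != cur[job]}

print(flipped_jobs({"lint": "pass", "test": "pass"},
                   {"lint": "pass", "test": "fail"}))
```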
## Research-value metrics

| Metric | Source / display |
|---|---|
| Reusable foundation count, cross-paper reuse count | CLI: just metrics --research-value; from manifest dependency_graph and theorem_cards. |
| Literature error discoveries | Same CLI; from corpus/papers/<paper_id>/literature_errors.json when present (schema: literature_error.schema.json); else ADR placeholder count. |
| Claims whose assumptions were clarified | Same report: claims_with_clarified_assumptions (claims with non-empty linked_assumptions). |
| Kernels with formally linked invariants | Same report: kernels_with_formally_linked_invariants (kernels in corpus/kernels.json with non-empty linked_theorem_cards). |
| Disputed claims still linked to formal targets | Same report: disputed_claims_with_formal_targets (reviewer workflow signal after formalization). |
Primary sources: just metrics --research-value; data contracts documented in pipeline/src/sm_pipeline/metrics/research_value.py.
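Cross-paper reuse reduces to inverting the dependency relation: which declarations are depended on by cards from more than one paper. A sketch assuming theorem cards expose `dependency_ids` as described above; the declaration names in the example are hypothetical:

```python
from collections import defaultdict

def cross_paper_reuse(papers: dict[str, list[dict]]) -> list[str]:
    """Declarations used as dependency targets by more than one paper
    (the reusable-foundation signal). Input maps paper_id to its theorem
    cards; card shape follows the dependency_ids convention described above."""
    used_by = defaultdict(set)
    for paper_id, cards in papers.items():
        for card in cards:
            for dep in card.get("dependency_ids", []):
                used_by[dep].add(paper_id)
    return sorted(dep for dep, users in used_by.items() if len(users) > 1)

# Hypothetical declaration names; paper ids are real corpus entries.
papers = {
    "langmuir_1918_adsorption": [{"dependency_ids": ["Foundations.massBalance"]}],
    "freundlich_1906_adsorption": [{"dependency_ids": ["Foundations.massBalance"]}],
    "math_sum_evens": [{"dependency_ids": ["Nat.sum"]}],
}
print(cross_paper_reuse(papers))
```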
## Gold labels

Extraction precision and recall use a gold-labelled set of claims and assumptions per paper when present. Layout: `benchmarks/gold/<paper_id>/claims.json` and `assumptions.json`, same schema as the corpus. The gold task (`benchmarks/tasks/gold/scorer.py`) runs via `just benchmark` and returns:

- Counts: `papers_with_gold`, `gold_claim_count`
- Claim quality (when corpus and gold exist for the same papers): `gold_claim_precision`, `gold_claim_recall`, `gold_claim_f1`
- Assumption quality (when both gold and corpus have assumptions): `gold_assumption_precision`, `gold_assumption_recall`, `gold_assumption_f1`
- Source-span alignment (when any `benchmarks/gold/<paper_id>/source_spans.json` exists): `source_span_total_compared`, `source_span_alignment_error_count`, `source_span_alignment_error_rate` (via `sm_pipeline.metrics.source_span_alignment.compute_source_span_alignment`)
Matching contract (deterministic): corpus and gold items are matched by `id`. True positives = ids present in both corpus and gold for that paper. Precision = TP / |corpus|, recall = TP / |gold|, F1 = 2PR / (P + R). Aggregation is micro-averaged over all papers with gold. When no gold directories exist, the task returns `papers_with_gold: 0`, `gold_claim_count: 0`. This repository currently tracks gold for six formalized core papers and keeps `tasks.gold.papers_with_gold` at 6 (two hard-dimension stress scaffolds are indexed but intake-only). Gold F1 floors are enforced, as are ceilings: `tasks_ceiling` holds upper bounds such as `tasks_ceiling.gold.source_span_alignment_error_rate`. Forks may set `papers_with_gold` to 0 or omit ceilings while bootstrapping. Current gold directories under `benchmarks/gold/`: langmuir_1918_adsorption, freundlich_1906_adsorption, chem_dilution_reference, temkin_1941_adsorption, math_sum_evens, physics_kinematics_uniform.
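The matching contract above fits in a few lines. A sketch, with each paper that has gold represented as a `(corpus_ids, gold_ids)` pair of id sets:

```python
def micro_prf(per_paper: list[tuple[set[str], set[str]]]) -> tuple[float, float, float]:
    """Micro-averaged precision/recall/F1 under the id-matching contract:
    TP = ids present in both corpus and gold, pooled over all papers."""
    tp = sum(len(corpus & gold) for corpus, gold in per_paper)
    n_corpus = sum(len(corpus) for corpus, _ in per_paper)
    n_gold = sum(len(gold) for _, gold in per_paper)
    p = tp / n_corpus if n_corpus else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

p, r, f1 = micro_prf([({"c1", "c2", "c3"}, {"c1", "c2"}),
                      ({"c4"}, {"c4", "c5"})])
print(p, r, f1)  # pooled: TP=3, |corpus|=4, |gold|=4
```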
Process: for every new paper added to `corpus/index.json`, create `benchmarks/gold/<paper_id>/` with at least `claims.json` (and optionally `assumptions.json` and `source_spans.json`) as part of the intake or first-artifact workflow. This keeps gold labelling an ongoing intake step (add gold when admitting a paper), not a backlog.

- Choose a paper: use one that has (or will have) corpus `claims.json` and optionally `assumptions.json`.
- Scaffold the directory: run `just scaffold-gold <paper_id>` to create `benchmarks/gold/<paper_id>/` from the corpus (copies claims, builds `source_spans.json` from claim `source_span`, copies assumptions), then curate as needed. Or create the directory and files by hand.
- Add gold labels: create or edit `claims.json` and/or `assumptions.json` with the same JSON schema as the corpus. Each item needs an `id`; matching is by `id` (a corpus item matches gold if its id exists in gold; precision and recall are computed over matched counts).
- Optional source-span reference: `scaffold-gold` writes `source_spans.json` when claims have `source_span`. Or add it by hand: a list of `{ "claim_id": "<id>", "source_span": { ... } }`, used by the gold task and `just metrics --source-span-alignment`.
- Run the benchmark: `just benchmark`. The report includes `gold_claim_*` and `gold_assumption_*` when applicable, plus source-span alignment fields on `tasks.gold` when any gold `source_spans.json` exists. CI runs the benchmark on every PR.

Current state: all six formalized core papers have gold (claims, plus `source_spans.json` where claims carry spans; assumptions copied when present in the corpus). The two hard-dimension stress scaffolds are indexed but not yet gold-populated. Add gold for each new non-scaffold paper before merging (see paper-intake.md and ROADMAP.md).
## Derived metrics

| Metric | Source |
|---|---|
| Axiom / sorry count | CLI: just metrics --axiom-count. Scans formal/**/*.lean for word-boundary axiom and sorry; reports total, per-file, and per-paper (when mapping.json namespace maps files to papers). Best-effort (no full Lean AST). |
| Reusable foundation count, cross-paper reuse | CLI: just metrics --research-value. From manifests and theorem_cards: declarations used as dependency targets by more than one paper. |
| Literature errors | Same report: when corpus/papers/<paper_id>/literature_errors.json exists, count used; else ADR placeholder. |
| Claims with clarified assumptions | Same report: claims_with_clarified_assumptions. |
| Kernels with formally linked invariants | Same report: kernels_with_formally_linked_invariants. |
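The word-boundary axiom/sorry scan is a one-line regex. A sketch of the best-effort counting described above (like the CLI, it has no Lean AST, so occurrences in comments or strings are counted too):

```python
import re

AXIOM_SORRY = re.compile(r"\b(axiom|sorry)\b")

def count_axiom_sorry(lean_source: str) -> dict[str, int]:
    """Count word-boundary `axiom` and `sorry` occurrences in Lean source.
    Identifiers like `sorryAx` are not matched (no word boundary after
    `sorry`), but comment text is -- best-effort only."""
    counts = {"axiom": 0, "sorry": 0}
    for match in AXIOM_SORRY.finditer(lean_source):
        counts[match.group(1)] += 1
    return counts

src = "axiom choice : P\ntheorem t : Q := by sorry\n-- sorryAx is not counted\n"
print(count_axiom_sorry(src))  # {'axiom': 1, 'sorry': 1}
```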
## Partial automation

| Metric | What would be needed |
|---|---|
| Unit/dimension tagging automation | Schema has dimensional_metadata; just metrics --dimension-visibility and just metrics --dimension-suggestions (heuristic suggestions for human triage); no automatic corpus edit. |
## Metrics CLI

Run all derived metrics, or specific ones, from the repo root:

```sh
just metrics
just metrics -o report.json
# Or with flags:
uv run --project pipeline python -m sm_pipeline.cli metrics --median-intake-time --dependency --symbol-conflict -o report.json
```

- Median intake time: requires `intake_report.json` (`created_at`) and metadata `first_artifact_at` per paper.
- Dependency: reads `manifest.json` `dependency_graph` or theorem-card `dependency_ids`; outputs reuse ratio, fan-in, fan-out.
- Symbol conflict: reads each paper's `symbols.json`; outputs per-paper and aggregate conflict rates.
- Proof completion: reads all manifests' `coverage_metrics`; outputs `proof_completion_rate` and Milestone 3 progress. Aggregate machine-checked declaration totals are authoritative in status/repo-snapshot.md (regenerate with `just repo-snapshot`).
- Normalization report: cross-paper duplicate `normalized_name` report (8.3 visibility); includes `suggest_merge` for human triage; `--normalization-report`, or included in the default run.
- Axiom count: scans `formal/**/*.lean` for `axiom` and `sorry`; `--axiom-count`.
- Research value: reusable foundations, cross-paper reuse, and literature-error count (from the optional literature_errors.json or the ADR placeholder); `--research-value`.
- Source-span alignment: when `benchmarks/gold/*/source_spans.json` exists; `--source-span-alignment`.
- Normalization visibility (8.3): symbols with `role_unclear`, claims without `linked_assumptions`; `--normalization-visibility`.
- Assumption suggestions (8.3): candidate assumptions for claims with none (text overlap); `--assumption-suggestions`; human triage only.
- Dimension visibility (8.3): symbols with vs. without `dimensional_metadata`; `--dimension-visibility`.
- Dimension suggestions (8.3): heuristic suggested unit/dimension for symbols (from kernels and symbol names); `--dimension-suggestions`; human triage only, no corpus auto-edit.
- Normalization policy (8.3): waiver-backed policy checks; `--normalization-policy`. Reads `benchmarks/normalization_policy.json`; reports unwaived cross-paper duplicates, assumption coverage, and dimension violations; warn-only unless CI promotes to fail.
- Reviewer status: theorem-card reviewer queues and consistency checks; emitted as `reviewer_status` in `just metrics` output.
## Quick reference

- Dashboard (product coverage): portal `/dashboard`; data from paper manifests.
- Derived metrics: run `just metrics` (or `uv run --project pipeline python -m sm_pipeline.cli metrics`); optional `-o report.json`.
- Extraction / proof metrics: run `just benchmark`; see `benchmarks/reports/` (e.g. `latest.json` with `tasks.gold`, `tasks.llm_suggestions`, `tasks.llm_lean_suggestions`, and `proof_success_snapshot`).
- Formalization health: Lean build in CI (gate 1); Lake and theorem cards for structure.
- Infra health: CI run times and test results in GitHub Actions.