29,750 competition-math proofs from the Goedel-LM/Lean-workbook-proofs dataset, migrated from Lean 4.8 to Lean 4.27.0 / Mathlib v4.27.0.
The original proofs were generated by DeepSeek-Prover-V1.5 against the Lean Workbook problem set. This repository provides the migrated .lean files plus 60,341 extracted (proof state, tactic) pairs for ML training.
| Metric | Value |
|---|---|
| Total proofs | 29,750 |
| Compiling on Lean 4.27 | 28,016 (94.1%) |
| Traced tactic pairs | 60,341 |
| Theorems with traced pairs | 24,879 |
| Unique tactic heads | 73 |
| Median proof depth | 1 |
| Deepest proof | 37 tactics |
GoedelProofs/ 29,750 .lean proof files (one theorem each)
data/
tactic_pairs.jsonl 60,341 (state_before, tactic, depth) pairs
compile_status.json Per-proof compilation status
integrity_report.json Sorry/admit/cheat contamination sweep (0 found)
manifest.json Proof file ↔ upstream problem_id mapping
scripts/ Migration and extraction scripts
patches/ LeanDojo Lean 4.27 compatibility patch
docs/
migration.md Migration process and Mathlib rename details
analysis.md Full dataset statistics and tactic distributions
Each line of data/tactic_pairs.jsonl:
{"theorem": "lean_workbook_10009", "state": "a : ℝ\nb : ℝ\n⊢ a ^ 2 + b ^ 2 ≥ 0", "tactic": "nlinarith [sq_nonneg a, sq_nonneg b]", "depth": 0, "source": "goedel_workbook"}theorem: upstream problem ID from Lean Workbookstate: Lean 4 tactic state (goal + hypotheses) before the tactictactic: the tactic applieddepth: proof tree depth (0 = root goal)source: always"goedel_workbook"
Each entry in data/compile_status.json:
{"proof_id": "Proof_00000", "problem_id": "lean_workbook_10009", "status": "ok", "elapsed_s": 32.99}Status is one of: ok, warn (compiles with warnings), error, timeout.
git clone https://github.com/mike1729/goedel-workbook-lean427.git
cd goedel-workbook-lean427
# Fetch Mathlib oleans (takes ~30 min on first run)
lake exe cache get
lake buildimport json
pairs = [json.loads(line) for line in open("data/tactic_pairs.jsonl")]
print(f"{len(pairs)} tactic pairs from {len(set(p['theorem'] for p in pairs))} theorems")
# Filter to deep proofs (depth >= 4)
deep = [p for p in pairs if p["depth"] >= 4]
print(f"{len(deep)} deep tactic steps")The migration from Lean 4.8 to 4.27 involved:
- Mathlib lemma renames (7 identifier renames + BigOperators
in→∈notation) field_simpbehavioral fix (firstcombinator to handle more aggressive normalization)- Batch compilation with 32 workers, 60s timeout per proof
94.1% of proofs compile successfully. The remaining 5.9% fail due to irrecoverable Mathlib behavioral changes (linarith failures, max recursion, timeouts).
See docs/migration.md for the full migration process and docs/analysis.md for dataset statistics.
- 5.9% of proofs don't compile (1,734 out of 29,750). These are included for completeness; check
data/compile_status.jsonto filter. - Machine-generated proofs are shallow: 46% are single-tactic (mostly
nlinarith), and only 13% have depth >= 4. - Competition algebra focus: 71% of goals are inequalities. Not representative of general Lean/Mathlib usage.
- Tactic pairs were extracted via Pantograph replay, not LeanDojo source tracing. The states are faithful but don't include source position metadata.
Apache 2.0 (see LICENSE).
- Proofs: Goedel-LM/Lean-workbook-proofs by the Goedel-LM team
- Proof generator: DeepSeek-Prover-V1.5 by the DeepSeek team
- Problem set: Lean Workbook by InternLM
- Extraction: Pantograph REPL for tactic state extraction
If you use this dataset, please cite the upstream sources:
@article{lin2025goedel,
title={Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving},
author={Lin, Yong and Agarwal, Shange and Jiang, Bohan and Lu, Changran and Yuan, Yufei and Liu, Hao and Wang, Tongxuan and Ruan, Zhongjian and Zhang, Qingyu and Zeng, Aimin and others},
journal={arXiv preprint arXiv:2502.07640},
year={2025}
}
@article{xin2024deepseek,
title={DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search},
author={Xin, Huajian and Guo, Daya and Shao, Zhihong and Ren, Zhizhou and Zhu, Qihao and Liu, Bo and Ruan, Chong and Li, Wenda and Liang, Xiaodan},
journal={arXiv preprint arXiv:2408.08152},
year={2024}
}