Replicating "From Reasoning to Answer" (Zhang et al., EMNLP 2025)
| Metric | Value |
|---|---|
| RFH Match | 8 / 10 of our top-10 Reasoning-Focus Heads match the paper's ground truth |
| Top RFH | L16.H0 (A→R attention = 0.899) — exact #1 match |
| RFH Zone | Layers 14–22 (peak at L16) — replicates paper exactly |
| Causal Effect | Late layers (20–27) have 1.8× the patching impact of early layers |
| NLD Peak | Layer 27 — NLD = 0.105 (strongest causal link from reasoning to answer) |
| Traces Analyzed | 194 (RFH) + 10 QA pairs (patching) |
This repository contains an independent replication of the mechanistic interpretability experiments from:
"From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models" Zhang, Lin, Rajmohan & Zhang (Microsoft Research) — EMNLP 2025 [Paper] [Code]
We replicate two key experiments on DeepSeek-R1-Distill-Qwen-7B (the exact model used in the paper):
- Reasoning-Focus Head (RFH) Identification — which attention heads route information from reasoning (`<think>`) tokens to the final answer?
- Activation Patching — causal proof that reasoning tokens functionally determine the answer.
All experiments run on a single consumer GPU using custom memory management for the 7B model.
RFHs are attention heads where answer tokens consistently attend to reasoning tokens within `<think>...</think>`. We measure the mean Answer → Reasoning attention weight per (layer, head) pair, averaged across 194 MATH-500 reasoning traces.
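The per-head score can be sketched with NumPy; the function and argument names here are illustrative, not the repo's actual API. For each head, we sum the attention mass an answer-token query places on reasoning-token keys, then average over answer tokens:

```python
import numpy as np

def answer_to_reasoning_score(attn, reasoning_idx, answer_idx):
    """Per-head Answer -> Reasoning attention for one layer.

    attn: [n_heads, seq, seq] attention weights (rows are queries).
    Sums the mass each answer-token query places on reasoning-token
    keys, then averages over the answer tokens.
    """
    block = attn[:, answer_idx][:, :, reasoning_idx]  # [n_heads, |A|, |R|]
    return block.sum(axis=-1).mean(axis=-1)           # [n_heads]
```

Averaging this score across traces and ranking (layer, head) pairs produces the top-10 comparison table.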
| Rank | Ours | Paper | In Paper Set? |
|---|---|---|---|
| 1 | L16.H0 | (16, 0) | ✓ |
| 2 | L19.H15 | (19, 15) | ✓ |
| 3 | L1.H1 | (1, 1) | ✓ |
| 4 | L22.H7 | (22, 7) | ✓ |
| 5 | L17.H18 | (17, 18) | ✓ |
| 6 | L17.H14 | (17, 14) | ✓ |
| 7 | L14.H19 | — | ✗ |
| 8 | L17.H19 | (17, 19) | ✓ |
| 9 | L16.H14 | (16, 14) | ✓ |
| 10 | L12.H7 | — | ✗ |
Clear peak at layers 14–22 — matching the paper's reported RFH zone.
We patch residual-stream activations from corrupted → clean reasoning traces, one layer at a time, and measure the Normalized Logit Difference (NLD): how much patching recovers the correct answer.
Method: per-token patching (step=1) on the last ~5 reasoning tokens + the last 3 answer tokens, following the paper's `hook_fine_grain_resid_auto(answer_backoff=3)`.
- Late layers (20–27) have 1.8× the causal impact of early layers (0–9)
- Valley at layers 15–18: patching here has minimal effect on the answer
- Sharp rise from layer 20: reasoning information flows through late layers to the final prediction
- Layer 27 (final): strongest patching effect (NLD = 0.105)
- Pattern matches the paper's Figure 6
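The position selection and the patch itself can be sketched as follows; the function and parameter names are illustrative, mirroring the paper's `answer_backoff=3` setting:

```python
import numpy as np

def tail_positions(reasoning_end, seq_len, n_reasoning=5, answer_backoff=3):
    """Token positions to patch: the last `n_reasoning` reasoning tokens
    plus the last `answer_backoff` answer tokens (not the full sequence)."""
    reasoning = range(max(0, reasoning_end - n_reasoning), reasoning_end)
    answer = range(max(reasoning_end, seq_len - answer_backoff), seq_len)
    return list(reasoning) + list(answer)

def patch_resid(dst_resid, src_resid, positions):
    """Copy residual-stream activations [seq, d_model] from the source
    run into the destination run at the selected positions, one layer
    at a time."""
    patched = dst_resid.copy()
    patched[positions] = src_resid[positions]
    return patched
```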
```bash
# Download the model (~14 GB)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local-dir models/DeepSeek-R1-Distill-Qwen-7B

# Install dependencies
pip install -r requirements.txt
```

```bash
# Experiment 1 — RFH Analysis (~15 min, uses pre-collected traces)
ML_MEMORY_LIMIT_GB=107 ML_MAX_SEQ_LEN=5000 python code/rfh_analysis.py

# Experiment 2 — Activation Patching (~30 min for 10 pairs)
ML_MEMORY_LIMIT_GB=80 MAX_PAIRS=10 python run_patching.py

# Full patching run (all 56 valid pairs)
ML_MEMORY_LIMIT_GB=80 MAX_PAIRS=56 python run_patching.py
```

| Experiment | Peak VRAM | Approx. Runtime |
|---|---|---|
| RFH Analysis (N=194) | ~50 GB | ~15 min |
| Activation Patching (N=10) | ~20 GB | ~30 min |
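Both entry points read the `ML_MEMORY_LIMIT_GB` environment variable shown in the commands above. A minimal sketch of parsing such a cap (the real `memory_guard.py` may handle it differently):

```python
import os

def memory_limit_gb(default=80.0):
    """Parse the ML_MEMORY_LIMIT_GB cap, falling back to a default
    when the variable is unset or malformed."""
    raw = os.environ.get("ML_MEMORY_LIMIT_GB")
    try:
        return float(raw) if raw is not None else default
    except ValueError:
        return default
```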
Supports Metal/MPS and CUDA backends.
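Backend selection can be auto-detected at startup; a sketch of the usual preference order (CUDA, then Apple MPS, then CPU), not necessarily the repo's exact logic:

```python
import torch

def pick_device():
    """Pick the best available backend: CUDA, then Apple MPS, else CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```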
```
.
├── README.md
├── LICENSE
├── requirements.txt
├── run_patching.py               # Entry point: activation patching
├── code/
│   ├── rfh_analysis.py           # Entry point: RFH identification
│   ├── memory_guard.py           # MPS memory management
│   ├── attention_analysis.py     # Answer-to-reasoning attention
│   ├── activation_patching.py    # Residual stream patching
│   ├── QAPairPatchingAnalyzer.py # QA pair causal analysis
│   ├── QAPairResultProcesser.py  # Response parsing
│   ├── const.py                  # Paper ground truth indices
│   ├── trace_collection.py       # Trace generation
│   ├── mech_interp_setup.py      # Environment verification
│   ├── utils/                    # Shared utilities
│   └── *.ipynb                   # Original paper notebooks
├── traces/
│   └── MATH-500_*_withR.jsonl    # Reasoning traces (MATH-500)
├── results/
│   ├── patching_results.json     # Raw patching data
│   ├── fig[1-7]_*.png            # Static figures
│   └── anim[1-2]_*.gif           # Animated visualizations
└── data/
    ├── 3_sample_qa_pair_responses.jsonl
    ├── 3_selected_qa_pairs.csv
    └── ...
```
- RFH analysis processes attention patterns in layer chunks (14 layers per pass) to stay within memory limits. OOM errors on long traces are caught and the trace is skipped.
- Activation patching follows the paper's `hook_fine_grain_resid_auto`: per-token patching on the reasoning tail and answer tail only (not the full sequence).
- Model loaded via HuggingFace `AutoModelForCausalLM` → TransformerLens `HookedTransformer` using the `Qwen/Qwen2.5-7B` architecture mapping. `QAPairPatchingAnalyzer` adapted for MPS: `bfloat16` → `float16`, auto device selection.
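The chunked, OOM-tolerant loop from the first note above can be sketched in plain Python (names are illustrative; the real code catches torch's out-of-memory exceptions rather than `MemoryError`):

```python
def layer_chunks(n_layers, chunk_size=14):
    """Yield (start, end) layer ranges, e.g. 28 layers in two 14-layer passes."""
    for start in range(0, n_layers, chunk_size):
        yield start, min(start + chunk_size, n_layers)

def process_traces(traces, process):
    """Run `process` on each trace, skipping any trace that runs out of
    memory instead of aborting the whole run. MemoryError stands in for
    torch.cuda.OutOfMemoryError / the MPS equivalent."""
    results = {}
    for i, trace in enumerate(traces):
        try:
            results[i] = process(trace)
        except MemoryError:
            continue  # long trace exceeded the limit; skip it
    return results
```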
This work replicates experiments from:
- Zhang, J., Lin, Q., Rajmohan, S., & Zhang, D. (2025). From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models. EMNLP 2025. [arXiv] [Code]
Built with TransformerLens and DeepSeek-R1.
```bibtex
@inproceedings{zhang2025reasoning,
  title     = {From Reasoning to Answer: Empirical, Attention-Based and
               Mechanistic Insights into Distilled {DeepSeek} {R1} Models},
  author    = {Zhang, Jue and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei},
  booktitle = {EMNLP},
  year      = {2025}
}
```






