
Mechanistic Interpretability of Reasoning in DeepSeek-R1

Replicating "From Reasoning to Answer" (Zhang et al., EMNLP 2025)


Summary of Results


Highlights

| Metric | Value |
| --- | --- |
| RFH Match | 8 / 10 of our top-10 Reasoning-Focus Heads match the paper's ground truth |
| Top RFH | L16.H0 (A→R attention = 0.899), exact #1 match |
| RFH Zone | Layers 14–22 (peak at L16), replicating the paper |
| Causal Effect | Late layers (20–27) have 1.8× the patching impact of early layers |
| NLD Peak | Layer 27, NLD = 0.105 (strongest causal link from reasoning to answer) |
| Traces Analyzed | 194 (RFH) + 10 QA pairs (patching) |

Overview

This repository contains an independent replication of the mechanistic interpretability experiments from:

"From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models" Zhang, Lin, Rajmohan & Zhang (Microsoft Research) — EMNLP 2025 [Paper] [Code]

We replicate two key experiments on DeepSeek-R1-Distill-Qwen-7B (the exact model used in the paper):

  1. Reasoning-Focus Head (RFH) Identification — Which attention heads route information from reasoning (<think>) tokens to the final answer?
  2. Activation Patching — Causal proof that reasoning tokens functionally determine the answer.

All experiments run on a single consumer GPU using custom memory management for the 7B model.


Experiment 1: Reasoning-Focus Heads

RFHs are attention heads where answer tokens consistently attend to reasoning tokens within <think>...</think>. We measure the mean Answer → Reasoning attention weight per (layer, head) pair, averaged across 194 MATH-500 reasoning traces.
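As a minimal sketch of this measurement (not the repo's actual code), the per-head A→R score for one layer can be computed from its attention weights, assuming `attn` is a `[n_heads, seq, seq]` tensor and the reasoning/answer token spans are known:

```python
import torch

def answer_to_reasoning_score(attn, reasoning_slice, answer_slice):
    """Mean attention mass that answer tokens place on reasoning tokens.

    attn: [n_heads, seq, seq] attention weights for a single layer.
    Returns a [n_heads] tensor; averaging these scores across traces
    and ranking all (layer, head) pairs yields the RFH candidates.
    """
    # rows = answer (query) positions, cols = reasoning (key) positions
    block = attn[:, answer_slice, :][:, :, reasoning_slice]
    return block.mean(dim=(1, 2))

# toy example: 2 heads, 8 tokens; reasoning = tokens 1-4, answer = 6-7
attn = torch.softmax(torch.randn(2, 8, 8), dim=-1)
scores = answer_to_reasoning_score(attn, slice(1, 5), slice(6, 8))
```

Averaged over the 194 traces, this is the quantity the per-layer profile below summarizes.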

Top-10 RFHs vs. Paper Ground Truth

RFH Comparison

| Rank | Ours | Paper | In Paper Set? |
| --- | --- | --- | --- |
| 1 | L16.H0 | (16, 0) | ✓ |
| 2 | L19.H15 | (19, 15) | ✓ |
| 3 | L1.H1 | (1, 1) | ✓ |
| 4 | L22.H7 | (22, 7) | ✓ |
| 5 | L17.H18 | (17, 18) | ✓ |
| 6 | L17.H14 | (17, 14) | ✓ |
| 7 | L14.H19 | | ✗ |
| 8 | L17.H19 | (17, 19) | ✓ |
| 9 | L16.H14 | (16, 14) | ✓ |
| 10 | L12.H7 | | ✗ |

Per-Layer Attention Profile

RFH Profile

Clear peak at layers 14–22 — matching the paper's reported RFH zone.


Experiment 2: Activation Patching

We patch residual-stream activations from corrupted → clean reasoning traces, one layer at a time, and measure the Normalized Logit Difference (NLD): how much patching recovers the correct answer.

Method: Per-token patching (step=1) on the last ~5 reasoning tokens + last 3 answer tokens, following the paper's hook_fine_grain_resid_auto(answer_backoff=3).
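The NLD metric can be sketched as follows. This uses the normalization common in activation-patching work, (patched - corrupted) / (clean - corrupted) over logit differences, which may differ in detail from the paper's exact definition:

```python
def normalized_logit_diff(patched, clean, corrupted):
    """Normalized Logit Difference (NLD).

    Each argument is logit(correct) - logit(incorrect) at the final
    answer position under the corresponding run. 0 means the patch
    recovered nothing; 1 means full recovery of the clean behaviour.
    """
    return (patched - corrupted) / (clean - corrupted)

# a layer whose patch recovers about 10% of the clean answer signal
nld = normalized_logit_diff(patched=0.6, clean=4.1, corrupted=0.21)
```

Computing this per patched layer produces the per-layer causal-effect curve shown below.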

Per-Layer Causal Effect

Patching NLD

Layer-by-Layer Scan (Animated)

Patching Animation

Consistency Across QA Pairs

NLD Heatmap

Per-Pair Overlay

Attention vs. Causal Impact (Animated)

Dual Animation

Key Findings

  • Late layers (20–27) have 1.8× the causal impact of early layers (0–9)
  • Valley at layers 15–18: patching here has minimal effect on the answer
  • Sharp rise from layer 20: reasoning information flows through late layers to the final prediction
  • Layer 27 (final): strongest patching effect (NLD = 0.105)
  • Pattern matches the paper's Figure 6
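The 1.8× statistic is just a ratio of mean per-layer NLD over the two layer bands; a toy illustration (the actual numbers come from `patching_results.json`, and the values below are made up to match the qualitative shape):

```python
def causal_impact_ratio(nld_by_layer, early=range(0, 10), late=range(20, 28)):
    """Ratio of mean per-layer NLD in late vs. early layers.

    `nld_by_layer` is a sequence of 28 NLD values (one per layer),
    e.g. averaged over QA pairs.
    """
    mean = lambda idxs: sum(nld_by_layer[i] for i in idxs) / len(idxs)
    return mean(late) / mean(early)

# toy profile with the shape described above: a modest early effect,
# a mid-stack valley, and a late-layer rise
toy_nld = [0.05] * 10 + [0.02] * 10 + [0.09] * 8
ratio = causal_impact_ratio(toy_nld)  # 1.8 for this toy profile
```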

Reproducing the Results

Prerequisites

# Download the model (~14 GB)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local-dir models/DeepSeek-R1-Distill-Qwen-7B

# Install dependencies
pip install -r requirements.txt

Run Experiments

# Experiment 1 — RFH Analysis (~15 min, uses pre-collected traces)
ML_MEMORY_LIMIT_GB=107 ML_MAX_SEQ_LEN=5000 python code/rfh_analysis.py

# Experiment 2 — Activation Patching (~30 min for 10 pairs)
ML_MEMORY_LIMIT_GB=80 MAX_PAIRS=10 python run_patching.py

# Full patching run (all 56 valid pairs)
ML_MEMORY_LIMIT_GB=80 MAX_PAIRS=56 python run_patching.py

Resources

| Experiment | Peak VRAM | Approx. Runtime |
| --- | --- | --- |
| RFH Analysis (N=194) | ~50 GB | ~15 min |
| Activation Patching (N=10) | ~20 GB | ~30 min |

Supports Metal/MPS and CUDA backends.


Repository Structure

.
├── README.md
├── LICENSE
├── requirements.txt
├── run_patching.py                  # Entry point: activation patching
├── code/
│   ├── rfh_analysis.py              # Entry point: RFH identification
│   ├── memory_guard.py              # MPS memory management
│   ├── attention_analysis.py        # Answer-to-reasoning attention
│   ├── activation_patching.py       # Residual stream patching
│   ├── QAPairPatchingAnalyzer.py    # QA pair causal analysis
│   ├── QAPairResultProcesser.py     # Response parsing
│   ├── const.py                     # Paper ground truth indices
│   ├── trace_collection.py          # Trace generation
│   ├── mech_interp_setup.py         # Environment verification
│   ├── utils/                       # Shared utilities
│   └── *.ipynb                      # Original paper notebooks
├── traces/
│   └── MATH-500_*_withR.jsonl       # Reasoning traces (MATH-500)
├── results/
│   ├── patching_results.json        # Raw patching data
│   ├── fig[1-7]_*.png               # Static figures
│   └── anim[1-2]_*.gif              # Animated visualizations
└── data/
    ├── 3_sample_qa_pair_responses.jsonl
    ├── 3_selected_qa_pairs.csv
    └── ...

Methodology Notes

  • RFH analysis processes attention patterns in layer chunks (14 layers/pass) to stay within memory limits. OOM errors on long traces are caught and skipped.
  • Activation patching follows the paper's hook_fine_grain_resid_auto: per-token patching on the reasoning tail and answer tail only (not the full sequence).
  • Model loaded via HuggingFace AutoModelForCausalLM → TransformerLens HookedTransformer, using the Qwen/Qwen2.5-7B architecture mapping.
  • QAPairPatchingAnalyzer adapted for MPS: bfloat16 → float16, automatic device selection.
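The chunked, OOM-tolerant scan from the first note can be sketched like this. `forward_fn` is a hypothetical stand-in for a hooked forward pass that returns per-layer attention statistics for just the requested layers, so the full `[layers, heads, seq, seq]` tensor never materializes at once:

```python
def chunked_attention_scan(forward_fn, n_layers=28, chunk=14):
    """Collect per-layer attention stats in chunks of layers.

    Chunks that run out of memory on a long trace are skipped
    rather than aborting the whole run.
    """
    results = {}
    for start in range(0, n_layers, chunk):
        layers = list(range(start, min(start + chunk, n_layers)))
        try:
            results.update(forward_fn(layers))
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise
            # trace too long for this chunk: record nothing, keep going

    return results

# usage: the second chunk OOMs on this (fake) trace and is skipped
def fake_forward(layers):
    if 14 in layers:
        raise RuntimeError("MPS backend out of memory")
    return {layer: 0.0 for layer in layers}

scan = chunked_attention_scan(fake_forward)
```

Matching on the error message rather than a specific exception class keeps the same handler working on both MPS and CUDA backends, which surface OOM differently.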

Acknowledgments

This work replicates experiments from:

  • Zhang, J., Lin, Q., Rajmohan, S., & Zhang, D. (2025). From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models. EMNLP 2025. [arXiv] [Code]

Built with TransformerLens and DeepSeek-R1.

Citation

@inproceedings{zhang2025reasoning,
  title     = {From Reasoning to Answer: Empirical, Attention-Based and
               Mechanistic Insights into Distilled {DeepSeek} {R1} Models},
  author    = {Zhang, Jue and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei},
  booktitle = {EMNLP},
  year      = {2025}
}
