
Mechanistic Interpretability of Reasoning in DeepSeek-R1

Replicating "From Reasoning to Answer" (Zhang et al., EMNLP 2025)


Summary of Results


Highlights

| Metric | Value |
| --- | --- |
| RFH Match | 8 / 10 of our top-10 Reasoning-Focus Heads match the paper's ground truth |
| Top RFH | L16.H0 (A→R attention = 0.899), exact #1 match |
| RFH Zone | Layers 14–22 (peak at L16), replicating the paper |
| Causal Effect | Late layers (20–27) have 1.8× the patching impact of early layers |
| NLD Peak | Layer 27, NLD = 0.105 (strongest causal link from reasoning to answer) |
| Traces Analyzed | 194 (RFH) + 10 QA pairs (patching) |

Overview

This repository contains an independent replication of the mechanistic interpretability experiments from:

"From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models" Zhang, Lin, Rajmohan & Zhang (Microsoft Research) — EMNLP 2025 [Paper] [Code]

We replicate two key experiments on DeepSeek-R1-Distill-Qwen-7B (the exact model used in the paper):

  1. Reasoning-Focus Head (RFH) Identification — Which attention heads route information from reasoning (<think>) tokens to the final answer?
  2. Activation Patching — Causal proof that reasoning tokens functionally determine the answer.

All experiments run on a single consumer GPU using custom memory management for the 7B model.


Experiment 1: Reasoning-Focus Heads

RFHs are attention heads where answer tokens consistently attend to reasoning tokens within <think>...</think>. We measure the mean Answer → Reasoning attention weight per (layer, head) pair, averaged across 194 MATH-500 reasoning traces.
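As a minimal sketch of this measurement (not the repo's actual code), the per-head A→R score for one layer can be computed from its attention weights, assuming `attn` is a `[n_heads, seq, seq]` tensor and the reasoning/answer token spans are known:

```python
import torch

def answer_to_reasoning_score(attn, reasoning_slice, answer_slice):
    """Mean attention mass that answer tokens place on reasoning tokens.

    attn: [n_heads, seq, seq] attention weights for a single layer.
    Returns a [n_heads] tensor; averaging these scores across traces
    and ranking all (layer, head) pairs yields the RFH candidates.
    """
    # rows = answer (query) positions, cols = reasoning (key) positions
    block = attn[:, answer_slice, :][:, :, reasoning_slice]
    return block.mean(dim=(1, 2))

# toy example: 2 heads, 8 tokens; reasoning = tokens 1-4, answer = 6-7
attn = torch.softmax(torch.randn(2, 8, 8), dim=-1)
scores = answer_to_reasoning_score(attn, slice(1, 5), slice(6, 8))
```

Averaged over the 194 traces, this is the quantity the per-layer profile below summarizes.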

Top-10 RFHs vs. Paper Ground Truth

RFH Comparison

| Rank | Ours | Paper | In Paper Set? |
| --- | --- | --- | --- |
| 1 | L16.H0 | (16, 0) | ✓ |
| 2 | L19.H15 | (19, 15) | ✓ |
| 3 | L1.H1 | (1, 1) | ✓ |
| 4 | L22.H7 | (22, 7) | ✓ |
| 5 | L17.H18 | (17, 18) | ✓ |
| 6 | L17.H14 | (17, 14) | ✓ |
| 7 | L14.H19 | | ✗ |
| 8 | L17.H19 | (17, 19) | ✓ |
| 9 | L16.H14 | (16, 14) | ✓ |
| 10 | L12.H7 | | ✗ |

Per-Layer Attention Profile

RFH Profile

Clear peak at layers 14–22 — matching the paper's reported RFH zone.


Experiment 2: Activation Patching

We patch residual-stream activations from corrupted → clean reasoning traces, one layer at a time, and measure the Normalized Logit Difference (NLD): how much patching recovers the correct answer.

Method: Per-token patching (step=1) on the last ~5 reasoning tokens + last 3 answer tokens, following the paper's hook_fine_grain_resid_auto(answer_backoff=3).
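The NLD metric can be sketched as follows. This uses the normalization common in activation-patching work, (patched - corrupted) / (clean - corrupted) over logit differences, which may differ in detail from the paper's exact definition:

```python
def normalized_logit_diff(patched, clean, corrupted):
    """Normalized Logit Difference (NLD).

    Each argument is logit(correct) - logit(incorrect) at the final
    answer position under the corresponding run. 0 means the patch
    recovered nothing; 1 means full recovery of the clean behaviour.
    """
    return (patched - corrupted) / (clean - corrupted)

# a layer whose patch recovers about 10% of the clean answer signal
nld = normalized_logit_diff(patched=0.6, clean=4.1, corrupted=0.21)
```

Computing this per patched layer produces the per-layer causal-effect curve shown below.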

Per-Layer Causal Effect

Patching NLD

Layer-by-Layer Scan (Animated)

Patching Animation

Consistency Across QA Pairs

NLD Heatmap

Per-Pair Overlay

Attention vs. Causal Impact (Animated)

Dual Animation

Key Findings

  • Late layers (20–27) have 1.8× the causal impact of early layers (0–9)
  • Valley at layers 15–18: patching here has minimal effect on the answer
  • Sharp rise from layer 20: reasoning information flows through late layers to the final prediction
  • Layer 27 (final): strongest patching effect (NLD = 0.105)
  • Pattern matches the paper's Figure 6
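The 1.8× statistic is just a ratio of mean per-layer NLD over the two layer bands; a toy illustration (the actual numbers come from `patching_results.json`, and the values below are made up to match the qualitative shape):

```python
def causal_impact_ratio(nld_by_layer, early=range(0, 10), late=range(20, 28)):
    """Ratio of mean per-layer NLD in late vs. early layers.

    `nld_by_layer` is a sequence of 28 NLD values (one per layer),
    e.g. averaged over QA pairs.
    """
    mean = lambda idxs: sum(nld_by_layer[i] for i in idxs) / len(idxs)
    return mean(late) / mean(early)

# toy profile with the shape described above: a modest early effect,
# a mid-stack valley, and a late-layer rise
toy_nld = [0.05] * 10 + [0.02] * 10 + [0.09] * 8
ratio = causal_impact_ratio(toy_nld)  # 1.8 for this toy profile
```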

Reproducing the Results

Prerequisites

# Download the model (~14 GB)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local-dir models/DeepSeek-R1-Distill-Qwen-7B

# Install dependencies
pip install -r requirements.txt

Run Experiments

# Experiment 1 — RFH Analysis (~15 min, uses pre-collected traces)
ML_MEMORY_LIMIT_GB=107 ML_MAX_SEQ_LEN=5000 python code/rfh_analysis.py

# Experiment 2 — Activation Patching (~30 min for 10 pairs)
ML_MEMORY_LIMIT_GB=80 MAX_PAIRS=10 python run_patching.py

# Full patching run (all 56 valid pairs)
ML_MEMORY_LIMIT_GB=80 MAX_PAIRS=56 python run_patching.py

Resources

| Experiment | Peak VRAM | Approx. Runtime |
| --- | --- | --- |
| RFH Analysis (N=194) | ~50 GB | ~15 min |
| Activation Patching (N=10) | ~20 GB | ~30 min |

Supports Metal/MPS and CUDA backends.


Repository Structure

.
├── README.md
├── LICENSE
├── requirements.txt
├── run_patching.py                  # Entry point: activation patching
├── code/
│   ├── rfh_analysis.py              # Entry point: RFH identification
│   ├── memory_guard.py              # MPS memory management
│   ├── attention_analysis.py        # Answer-to-reasoning attention
│   ├── activation_patching.py       # Residual stream patching
│   ├── QAPairPatchingAnalyzer.py    # QA pair causal analysis
│   ├── QAPairResultProcesser.py     # Response parsing
│   ├── const.py                     # Paper ground truth indices
│   ├── trace_collection.py          # Trace generation
│   ├── mech_interp_setup.py         # Environment verification
│   ├── utils/                       # Shared utilities
│   └── *.ipynb                      # Original paper notebooks
├── traces/
│   └── MATH-500_*_withR.jsonl       # Reasoning traces (MATH-500)
├── results/
│   ├── patching_results.json        # Raw patching data
│   ├── fig[1-7]_*.png               # Static figures
│   └── anim[1-2]_*.gif              # Animated visualizations
└── data/
    ├── 3_sample_qa_pair_responses.jsonl
    ├── 3_selected_qa_pairs.csv
    └── ...

Methodology Notes

  • RFH analysis processes attention patterns in layer chunks (14 layers/pass) to stay within memory limits. OOM errors on long traces are caught and skipped.
  • Activation patching follows the paper's hook_fine_grain_resid_auto: per-token patching on the reasoning tail and answer tail only (not the full sequence).
  • Model loaded via HuggingFace AutoModelForCausalLM → TransformerLens HookedTransformer, using the Qwen/Qwen2.5-7B architecture mapping.
  • QAPairPatchingAnalyzer adapted for MPS: bfloat16 → float16, automatic device selection.
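The chunked, OOM-tolerant scan from the first note can be sketched like this. `forward_fn` is a hypothetical stand-in for a hooked forward pass that returns per-layer attention statistics for just the requested layers, so the full `[layers, heads, seq, seq]` tensor never materializes at once:

```python
def chunked_attention_scan(forward_fn, n_layers=28, chunk=14):
    """Collect per-layer attention stats in chunks of layers.

    Chunks that run out of memory on a long trace are skipped
    rather than aborting the whole run.
    """
    results = {}
    for start in range(0, n_layers, chunk):
        layers = list(range(start, min(start + chunk, n_layers)))
        try:
            results.update(forward_fn(layers))
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise
            # trace too long for this chunk: record nothing, keep going

    return results

# usage: the second chunk OOMs on this (fake) trace and is skipped
def fake_forward(layers):
    if 14 in layers:
        raise RuntimeError("MPS backend out of memory")
    return {layer: 0.0 for layer in layers}

scan = chunked_attention_scan(fake_forward)
```

Matching on the error message rather than a specific exception class keeps the same handler working on both MPS and CUDA backends, which surface OOM differently.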

Acknowledgments

This work replicates experiments from:

  • Zhang, J., Lin, Q., Rajmohan, S., & Zhang, D. (2025). From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models. EMNLP 2025. [arXiv] [Code]

Built with TransformerLens and DeepSeek-R1.

Citation

@inproceedings{zhang2025reasoning,
  title     = {From Reasoning to Answer: Empirical, Attention-Based and
               Mechanistic Insights into Distilled {DeepSeek} {R1} Models},
  author    = {Zhang, Jue and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei},
  booktitle = {EMNLP},
  year      = {2025}
}
