rattlesnakey/Awesome-Actionable-MI-Survey


🔎 Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models


We will continue to update this repository.

If you enjoy or benefit from the project, a star ⭐ on GitHub would be greatly appreciated and will help you stay informed about future updates.


📖 Overview

A systematic survey on how to locate interpretable objects, steer model behaviors, and improve LLMs (Alignment, Capability, Efficiency) via Mechanistic Interpretability.

Overview Figure

Note: The figure illustrates our framework: Locate (Identifying internal objects), Steer (Manipulating behaviors), and Improve (Downstream applications).

Mechanistic Interpretability (MI) has evolved from merely observing model internals to actively intervening in them. This repository maintains a curated list of papers reviewed in our survey, focusing on Actionable MI.

🔥 Latest News

  • [2026-03-28] We have significantly updated our paper list with 18 new papers! These additions cover the latest advances in reasoning steering, knowledge editing, efficient inference, and more. 🚀
  • [2026-01-21] Our paper is available on arXiv! Check it out here.
  • [2026-01-20] This repository was created to track the latest progress in Actionable MI.

🏷 Taxonomy & Legends

To help navigate the paper list, we use the following abbreviations for Objects, Localizing Methods, and Steering Methods:

Interpretable Objects

The core interpretable objects in our survey are shown below:

Localizing Methods (How to find it?)

  • Magnitude Analysis: Analysis of weight or activation magnitudes; training-free but data-dependent
  • Causal Attribution: Patching and ablation
  • Gradient Detection: Gradient-based scoring; data-dependent and incurs extra compute
  • Probing: Supervised property decoding
  • Vocab Projection: Logit Lens, direct semantic mapping
  • Circuit Discovery: Causal subnetwork identification
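To make two of these methods concrete, here is a minimal NumPy sketch on toy data. It is not taken from any surveyed paper: the dimensions, the random unembedding matrix `W_U`, the hidden state `h`, and both helper functions are illustrative assumptions. A real Logit Lens setup would usually also apply the model's final normalization layer before unembedding, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): 8-dim residual stream, 5-token vocabulary.
d_model, vocab = 8, 5
W_U = rng.normal(size=(d_model, vocab))  # hypothetical unembedding matrix
h = rng.normal(size=d_model)             # hypothetical intermediate hidden state


def logit_lens(hidden, unembed):
    """Vocab projection: read an intermediate state directly as token logits."""
    return hidden @ unembed


def ablation_attribution(hidden, unembed, token):
    """Causal attribution by zero-ablation.

    Returns, for each hidden dimension, how much the target token's logit
    drops when that dimension alone is set to zero.
    """
    base = logit_lens(hidden, unembed)[token]
    scores = np.empty(hidden.shape[0])
    for i in range(hidden.shape[0]):
        ablated = hidden.copy()
        ablated[i] = 0.0
        scores[i] = base - logit_lens(ablated, unembed)[token]
    return scores


logits = logit_lens(h, W_U)
top_token = int(np.argmax(logits))
scores = ablation_attribution(h, W_U, top_token)

print("top token under the lens:", top_token)
print("most influential dimension:", int(np.argmax(np.abs(scores))))
```

Because the toy readout is linear, each ablation score equals that dimension's direct contribution to the logit. In a real transformer, downstream nonlinear layers break this equivalence, which is why patching and ablation are run on the full forward pass.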

Steering Methods (How to control it?)

  • Amplitude Manipulation: Scaling and replacement
  • Targeted Optimization: Weight editing, targeted fine-tuning
  • Vector Arithmetic: Steering via feature and task vectors
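As a rough illustration of Vector Arithmetic and Amplitude Manipulation, the sketch below works on synthetic activations in NumPy. Everything in it is an assumption made for the example: the activations are random, the "concept" is planted along one axis, and the function names are hypothetical. The difference-of-means construction is one common way to obtain a steering vector, not the only one used in the papers listed here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical cached activations for two contrasting prompt sets.
concept = np.zeros(d_model)
concept[0] = 1.0
pos_acts = rng.normal(size=(32, d_model)) + 2.0 * concept
neg_acts = rng.normal(size=(32, d_model)) - 2.0 * concept


def steering_vector(pos, neg):
    """Unit-norm difference of mean activations between the two sets."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)


def add_vector(hidden, v, alpha):
    """Vector arithmetic: shift every row of `hidden` along v by alpha."""
    return hidden + alpha * v


def ablate_direction(hidden, v):
    """Vector arithmetic: remove the component of each row along unit vector v."""
    return hidden - np.outer(hidden @ v, v)


def scale_units(hidden, idx, factor):
    """Amplitude manipulation: rescale selected units (factor=0 ablates them)."""
    out = hidden.copy()
    out[:, idx] *= factor
    return out


v = steering_vector(pos_acts, neg_acts)
hidden = rng.normal(size=(4, d_model))  # hypothetical activations to steer

steered = add_vector(hidden, v, alpha=3.0)
ablated = ablate_direction(hidden, v)
damped = scale_units(hidden, idx=[0, 5], factor=0.0)

print("projection shift after steering:", (steered @ v - hidden @ v).round(3))
print("projection after ablation:", (ablated @ v).round(3))
```

In practice these operations would be applied inside a forward hook at a chosen layer, and the coefficient `alpha` would be tuned per model and task.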

📑 Paper List

For studies that employ multiple objects or multiple localizing/steering methods, we annotate only the primary one.
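For readers who want to query the list programmatically, one possible representation is sketched below. The `PaperEntry` class and `filter_entries` helper are hypothetical and not part of this repository; the three sample rows are copied from the tables that follow, and `None` stands for a "-" tag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PaperEntry:
    title: str
    obj: str                   # interpretable object, e.g. "Residual Stream"
    localizing: Optional[str]  # None where the list shows "-"
    steering: Optional[str]    # None where the list shows "-"
    venue: str
    year: int


entries = [
    PaperEntry("Refusal in Language Models Is Mediated by a Single Direction",
               "Residual Stream", "Causal Attribution", "Vector Arithmetic",
               "NeurIPS", 2024),
    PaperEntry("Locating and Editing Factual Associations in GPT",
               "FFN", "Causal Attribution", "Targeted Optimization",
               "NeurIPS", 2022),
    PaperEntry("Knowledge Circuits in Pretrained Transformers",
               "MHA & FFN", "Circuit Discovery", None,
               "NeurIPS", 2024),
]


def filter_entries(items, obj=None, localizing=None, steering=None):
    """Keep entries matching every tag that is given; omitted tags match anything."""
    return [
        e for e in items
        if (obj is None or e.obj == obj)
        and (localizing is None or e.localizing == localizing)
        and (steering is None or e.steering == steering)
    ]


for entry in filter_entries(entries, localizing="Causal Attribution"):
    print(entry.year, entry.title)
```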

1. Improve Alignment

Safety and Reliability

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Safety Layers in Aligned Large Language Models: The Key to LLM Security | Residual Stream | Causal Attribution | Targeted Optimization | ICLR | 2025 | Link |
| Refusal in Language Models Is Mediated by a Single Direction | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2024 | Link |
| LLMs Encode Harmfulness and Refusal Separately | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models | Residual Stream | Probing | Vector Arithmetic | ICML | 2025 | Link |
| Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? | Residual Stream | Probing | Vector Arithmetic | ArXiv | 2025 | Link |
| Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs | Residual Stream | Vocab Projection | Amplitude Manipulation | ArXiv | 2026 | Link |
| A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | Residual Stream | Probing | Targeted Optimization | ICML | 2024 | Link |
| Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2024 | Link |
| Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Refusal Direction is Universal Across Safety-Aligned Languages | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models | Residual Stream | Vocab Projection | Vector Arithmetic | ICLR | 2024 | Link |
| In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation | Residual Stream | Vocab Projection | Vector Arithmetic | ICML | 2024 | Link |
| TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space | Residual Stream | Probing | Vector Arithmetic | ACL | 2024 | Link |
| LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations | Residual Stream | Probing | Vector Arithmetic | ICLR | 2025 | Link |
| Multi-Attribute Steering of Language Models via Targeted Intervention | Residual Stream | Gradient Detection | Vector Arithmetic | ACL | 2025 | Link |
| Improving Instruction-Following in Language Models through Activation Steering | Residual Stream | - | Vector Arithmetic | ICLR | 2025 | Link |
| On the Role of Attention Heads in Large Language Model Safety | MHA | Causal Attribution | Amplitude Manipulation | ICLR | 2025 | Link |
| Refine Large Language Model Fine-tuning via Instruction Vector | MHA | Causal Attribution | Targeted Optimization | ArXiv | 2024 | Link |
| Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons | Neuron | Causal Attribution | Amplitude Manipulation | ArXiv | 2025 | Link |
| Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models | Neuron | Magnitude Analysis | Amplitude Manipulation | ICML | 2024 | Link |
| H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron | Neuron | Magnitude Analysis | Targeted Optimization | ICLR | 2025 | Link |
| Neuron-Aware Data Selection in Instruction Tuning for Large Language Models | Neuron | Magnitude Analysis | - | ICLR | 2026 | Link |
| Precision Knowledge Editing: Enhancing Safety in Large Language Models | Neuron | Magnitude Analysis | Targeted Optimization | ArXiv | 2025 | Link |
| Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | SAE Feature | Magnitude Analysis | Amplitude Manipulation | Blog | 2024 | Link |
| Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Understanding Refusal in Language Models with Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders | SAE Feature | Magnitude Analysis | Vector Arithmetic | ICML | 2025 | Link |
| Saif: A sparse autoencoder framework for interpreting and steering instruction following of language models | SAE Feature | Magnitude Analysis | Vector Arithmetic | ArXiv | 2025 | Link |
| Training Superior Sparse Autoencoders for Instruct Models | SAE Feature | Magnitude Analysis | Vector Arithmetic | ArXiv | 2025 | Link |
| Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units | SAE Feature | Gradient Detection | - | ArXiv | 2026 | Link |
| Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning | Token Embedding | Gradient Detection | Vector Arithmetic | ArXiv | 2025 | Link |
| Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis | MHA & FFN | Circuit Discovery | Targeted Optimization | EMNLP | 2025 | Link |
| On Localizing and Deleting Toxic Memories in Large Language Models | FFN | Causal Attribution | Targeted Optimization | NAACL | 2025 | Link |

Fairness and Bias

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models | Residual Stream | Causal Attribution | - | ArXiv | 2025 | Link |
| MPF: Aligning and Debiasing Language Models Post Deployment via Multi Perspective Fusion | Residual Stream | - | Amplitude Manipulation | ICML | 2025 | Link |
| Mitigate Position Bias in LLMs via Scaling a Single Hidden States Channel | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ACL | 2025 | Link |
| Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability | Residual Stream | Causal Attribution | Amplitude Manipulation | ArXiv | 2025 | Link |
| Investigating Gender Bias in Language Models Using Causal Mediation Analysis | MHA | Causal Attribution | Amplitude Manipulation | NeurIPS | 2020 | Link |
| Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective | MHA | Magnitude Analysis | Amplitude Manipulation | TMLR | 2025 | Link |
| Linear Representations of Political Perspective Emerge in Large Language Models | MHA | Probing | Vector Arithmetic | ICLR | 2025 | Link |
| Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5 | MHA | Magnitude Analysis | - | ICAIF | 2025 | Link |
| Eliminating Position Bias of Language Models: A Mechanistic Approach | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2025 | Link |
| Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model | MHA | Causal Attribution | Targeted Optimization | ACLWS | 2023 | Link |
| Locating and Mitigating Gender Bias in Large Language Models | FFN | Causal Attribution | Targeted Optimization | ICIC | 2024 | Link |
| Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare | FFN | Causal Attribution | Amplitude Manipulation | EMNLP | 2025 | Link |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | FFN | Vocab Projection | Targeted Optimization | ACL | 2025 | Link |
| Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing | Neuron | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |
| The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models | Neuron | Gradient Detection | Amplitude Manipulation | ICLR | 2024 | Link |

Persona and Role

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models | Residual Stream | Magnitude Analysis | Vector Arithmetic | ArXiv | 2026 | Link |
| Can Role Vectors Affect LLM Behaviour? | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Persona vectors: Monitoring and controlling character traits in language models | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks | Residual Stream | Probing | - | AACL | 2025 | Link |
| Mechanistic Interpretability of Emotion Inference in Large Language Models | Residual Stream | Probing | Vector Arithmetic | ACL | 2025 | Link |
| Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| Steering Llama 2 via Contrastive Activation Addition | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2024 | Link |
| Probing then Editing Response Personality of Large Language Models | Residual Stream | Probing | Targeted Optimization | COLM | 2025 | Link |
| Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI | Residual Stream | Causal Attribution | - | ArXiv | 2025 | Link |
| From Rational Answers to Emotional Resonance: The Role of Controllable Emotion Generation in Language Models | Residual Stream | - | Vector Arithmetic | ArXiv | 2025 | Link |
| Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Steering Latent Traits, Not Learned Facts: An Empirical Study of Activation Control Limits | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Personality Vector: Modulating Personality of Large Language Models by Model Merging | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Billy: Steering large language models via merging persona vectors for creative generation | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Personas as a way to model truthfulness in language models | Residual Stream | Probing | - | EMNLP | 2024 | Link |
| Who's asking? User personas and the mechanics of latent misalignment | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2024 | Link |
| Understanding How Value Neurons Shape the Generation of Specified Values in LLMs | Neuron | Causal Attribution | Amplitude Manipulation | EMNLP | 2025 | Link |
| Neuron based Personality Trait Induction in Large Language Models | Neuron | Causal Attribution | Amplitude Manipulation | ICLR | 2025 | Link |
| From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning | Neuron | Causal Attribution | Targeted Optimization | ICML | 2024 | Link |

2. Improve Capability

Logic and Reasoning

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Reasoning-Finetuning Repurposes Latent Representations in Base Models | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Improving Reasoning Performance in Large Language Models via Representation Engineering | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies | Residual Stream | Vocab Projection | Targeted Optimization | ArXiv | 2025 | Link |
| Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2025 | Link |
| Probing for Arithmetic Errors in Language Models | Residual Stream | Probing | - | EMNLP | 2025 | Link |
| Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization | Residual Stream | Probing | Vector Arithmetic | AAAI | 2026 | Link |
| Understanding Reasoning in Thinking Language Models via Steering Vectors | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries | Residual Stream | Probing | - | EMNLP | 2024 | Link |
| Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Residual Stream | Probing | - | ICLR | 2025 | Link |
| The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2025 | Link |
| Uncovering Latent Chain of Thought Vectors in Language Models | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Steering LLM Reasoning Through Bias-Only Adaptation | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| ATLAS: Adaptive Test-Time Latent Steering with External Verifiers for Enhancing LLMs' Reasoning | Residual Stream | Probing | Vector Arithmetic | ArXiv | 2026 | Link |
| Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time | MHA | Probing | Amplitude Manipulation | ArXiv | 2025 | Link |
| How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning | MHA | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis | MHA | Causal Attribution | Amplitude Manipulation | EMNLP | 2024 | Link |
| Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models | MHA | Causal Attribution | - | EMNLP | 2025 | Link |
| Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation | MHA & FFN | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Interpreting and Improving Large Language Models in Arithmetic Calculation | MHA & FFN | Causal Attribution | Targeted Optimization | ICML | 2024 | Link |
| A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis | MHA & FFN | Causal Attribution | - | EMNLP | 2023 | Link |
| Uncovering the Interpretation of Large Language Models | MHA & FFN | Causal Attribution | - | COMPSAC | 2024 | Link |
| Understanding Addition in Transformers | MHA & FFN | Causal Attribution | Amplitude Manipulation | ICLR | 2024 | Link |
| Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking | MHA & FFN | Gradient Detection | Targeted Optimization | ACL | 2025 | Link |
| How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | MHA & FFN | Circuit Discovery | - | NeurIPS | 2023 | Link |
| Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics | MHA & FFN | Circuit Discovery | - | ICLR | 2025 | Link |
| Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Reasoning Models Generate Societies of Thought | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2026 | Link |
| I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Internal states before wait modulate reasoning patterns | SAE Feature | Magnitude Analysis | Vector Arithmetic | EMNLP | 2025 | Link |
| Can we interpret latent reasoning using current mechanistic interpretability tools? | Token Embedding | Causal Attribution | Amplitude Manipulation | Blog | 2025 | Link |
| Analyzing chain-of-thought prompting in large language models via gradient-based feature attributions | Token Embedding | Gradient Detection | - | ICML | 2023 | Link |
| Probabilistic Soundness Guarantees in LLM Reasoning Chains | Token Embedding | Magnitude Analysis | - | EMNLP | 2025 | Link |
| Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |

Multilingualism

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Importance-based Neuron Allocation for Multilingual Neural Machine Translation | Neuron | Magnitude Analysis | Amplitude Manipulation | ACL | 2021 | Link |
| On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons | Neuron | Magnitude Analysis | Amplitude Manipulation | NAACL | 2024 | Link |
| Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models | Neuron | Magnitude Analysis | Amplitude Manipulation | ACL | 2024 | Link |
| How do Large Language Models Handle Multilingualism? | Neuron | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| On Relation-Specific Neurons in Large Language Models | Neuron | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder | Neuron | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages | SAE Feature | Magnitude Analysis | Amplitude Manipulation | NAACL | 2025 | Link |
| On the Language Neutrality of Pre-trained Multilingual Representations | Residual Stream | Probing | - | EMNLP | 2020 | Link |
| Can Cross-Lingual Transferability of Multilingual Transformers Be Activated Without End-Task Data? | Residual Stream | - | Vector Arithmetic | ACL | 2023 | Link |
| Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2023 | Link |
| Do Llamas Work in English? On the Latent Language of Multilingual Transformers | Residual Stream | Vocab Projection | Vector Arithmetic | ACL | 2024 | Link |
| Exploring Alignment in Shared Cross-lingual Spaces | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2024 | Link |
| Why do LLaVA Vision-Language Models Reply to Images in English? | Residual Stream | Probing | Vector Arithmetic | EMNLP | 2024 | Link |
| ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2025 | Link |
| Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models | Residual Stream | Vocab Projection | Vector Arithmetic | ACL | 2025 | Link |
| The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities | Residual Stream | Vocab Projection | - | ICLR | 2025 | Link |
| Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Tracing Multilingual Factual Knowledge Acquisition in Pretraining | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |

Knowledge Management

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Fine-tuning Done Right in Model Editing | FFN | Gradient Detection | Targeted Optimization | ICLR | 2026 | Link |
| ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall | FFN | Causal Attribution | Targeted Optimization | ICLR | 2026 | Link |
| LoKI: Low-damage Knowledge Implanting of Large Language Models | FFN | Causal Attribution | Targeted Optimization | AAAI | 2026 | Link |
| Locating and Editing Factual Associations in GPT | FFN | Causal Attribution | Targeted Optimization | NeurIPS | 2022 | Link |
| Mass-Editing Memory in a Transformer | FFN | Causal Attribution | Targeted Optimization | ICLR | 2023 | Link |
| Joint Localization and Activation Editing for Low-Resource Fine-Tuning | MHA | Magnitude Analysis | Targeted Optimization | ICML | 2025 | Link |
| Taming Knowledge Conflicts in Language Models | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models | MHA | Causal Attribution | Amplitude Manipulation | ACL | 2024 | Link |
| Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models | MHA | Causal Attribution | Amplitude Manipulation | ArXiv | 2024 | Link |
| Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs | MHA | Causal Attribution | Amplitude Manipulation | ACL | 2025 | Link |
| Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2026 | Link |
| Probing and Boosting Large Language Models Capabilities via Attention Heads | MHA | Probing | Targeted Optimization | EMNLP | 2025 | Link |
| TIES-Merging: Resolving Interference When Merging Models | MHA & FFN | Magnitude Analysis | Vector Arithmetic | NeurIPS | 2023 | Link |
| Neuron-Level Knowledge Attribution in Large Language Models | MHA & FFN | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model | MHA & FFN | Magnitude Analysis | Targeted Optimization | ACL | 2024 | Link |
| Knowledge Localization: Mission Not Accomplished? Enter Query Localization! | MHA & FFN | Magnitude Analysis | Amplitude Manipulation | ICLR | 2025 | Link |
| Enhancing Large Language Model Performance with Gradient-Based Parameter Selection | MHA & FFN | Magnitude Analysis | Targeted Optimization | AAAI | 2025 | Link |
| The Geometry of Forgetting: Analyzing Machine Unlearning through Local Learning Coefficients | MHA & FFN | Magnitude Analysis | - | ICML | 2025 | Link |
| Knowledge Circuits in Pretrained Transformers | MHA & FFN | Circuit Discovery | - | NeurIPS | 2024 | Link |
| Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning | MHA & FFN | Probing | Targeted Optimization | ArXiv | 2024 | Link |
| Unveiling Linguistic Regions in Large Language Models | MHA & FFN | Gradient Detection | Targeted Optimization | ACL | 2024 | Link |
| Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models | MHA & FFN | Gradient Detection | Vector Arithmetic | ACL | 2025 | Link |
| Activation-Guided Consensus Merging for Large Language Models | MHA & FFN | Magnitude Analysis | Vector Arithmetic | NeurIPS | 2025 | Link |
| Dissecting Recall of Factual Associations in Auto-Regressive Language Models | MHA & FFN | Causal Attribution | - | EMNLP | 2023 | Link |
| Multilingual Knowledge Editing with Language-Agnostic Factual Neurons | Neuron | Magnitude Analysis | Targeted Optimization | COLING | 2025 | Link |
| Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons | Neuron | Gradient Detection | Amplitude Manipulation | AAAI | 2024 | Link |
| IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons | Neuron | Gradient Detection | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts | Neuron | Gradient Detection | Amplitude Manipulation | AAAI | 2025 | Link |
| Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing | Neuron | - | Amplitude Manipulation | EMNLP | 2025 | Link |
| Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering | SAE Feature | Magnitude Analysis | Amplitude Manipulation | NAACL | 2025 | Link |
| SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | SAE Feature | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| Impact of Co-occurrence on Factual Knowledge of Large Language Models | Residual Stream | Probing | - | EMNLP | 2023 | Link |
| Backward Lens: Projecting Language Model Gradients into the Vocabulary Space | Residual Stream | Vocab Projection | Targeted Optimization | EMNLP | 2024 | Link |
| ReFT: Representation Finetuning for Language Models | Residual Stream | Causal Attribution | Targeted Optimization | NeurIPS | 2024 | Link |
| Analysing the Residual Stream of Language Models Under Knowledge Conflicts | Residual Stream | Probing | - | ArXiv | 2024 | Link |
| How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study | Residual Stream | Probing | - | COLING | 2024 | Link |
| Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers? | Residual Stream | Probing | - | COLING | 2025 | Link |
| Transferring Linear Features Across Language Models With Model Stitching | Residual Stream | Probing | Vector Arithmetic | NeurIPS | 2025 | Link |

3. Improve Efficiency

Efficient Training

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Task-Specific Skill Localization in Fine-tuned Language Models | Neuron | Magnitude Analysis | Targeted Optimization | ICML | 2023 | Link |
| LANDeRMT: Dectecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation | Neuron | Gradient Detection | Targeted Optimization | ACL | 2024 | Link |
| Sparse is enough in fine-tuning pre-trained large language models | Neuron | Gradient Detection | Targeted Optimization | ICML | 2024 | Link |
| Fine-tuning Happens in Tiny Subspaces: Exploring Intrinsic Task-specific Subspaces of Pre-trained Language Models | Neuron | Magnitude Analysis | Targeted Optimization | ACL | 2023 | Link |
| Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model | Neuron | Magnitude Analysis | Targeted Optimization | COLING | 2025 | Link |
| Language-Specific Neurons Do Not Facilitate Cross-Lingual Transfer | Neuron | Magnitude Analysis | Targeted Optimization | ACL | 2025 | Link |
| Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models | Neuron | Magnitude Analysis | Targeted Optimization | AACL | 2025 | Link |
| How do Large Language Models Handle Multilingualism? | Neuron | Causal Attribution | Targeted Optimization | NeurIPS | 2024 | Link |
| Optimizing Multimodal Language Models through Attention-based Interpretability | MHA | Magnitude Analysis | Targeted Optimization | ICAI | 2025 | Link |
| In-context Learning and Induction Heads | MHA | Magnitude Analysis | - | ArXiv | 2022 | Link |
| How Transformers Implement Induction Heads: Approximation and Optimization Analysis | MHA | Magnitude Analysis | - | ArXiv | 2024 | Link |
| What needs to go right for an induction head? a mechanistic study of in-context learning circuits and their formation | MHA | Magnitude Analysis | - | ICML | 2024 | Link |
| The developmental landscape of in-context learning | MHA | Magnitude Analysis | - | TMLR | 2025 | Link |
| In-Context Meta Learning Induces Multi-Phase Circuit Emergence | MHA | Magnitude Analysis | - | ICLR | 2025 | Link |
| Joint Localization and Activation Editing for Low-Resource Fine-Tuning | MHA | Magnitude Analysis | Vector Arithmetic | ICML | 2025 | Link |
| The slingshot mechanism: An empirical study of adaptive optimizers and the Grokking Phenomenon | MHA & FFN | Magnitude Analysis | - | NeurIPS | 2022 | Link |
| Explaining grokking through circuit efficiency | MHA & FFN | Magnitude Analysis | - | ArXiv | 2023 | Link |
| Towards Empirical Interpretation of Internal Circuits and Properties in Grokked Transformers on Modular Polynomials | MHA & FFN | Magnitude Analysis | - | TMLR | 2024 | Link |
| Progress measures for grokking via mechanistic interpretability | MHA & FFN | Magnitude Analysis | - | ICLR | 2023 | Link |
| Predicting grokking long before it happens: A look into the loss landscape of models which grok | MHA & FFN | Magnitude Analysis | - | ArXiv | 2023 | Link |
| Exploring Grokking: Experimental and Mechanistic Investigations | MHA & FFN | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Omnigrok: Grokking Beyond Algorithmic Data | MHA & FFN | Magnitude Analysis | - | ICLR | 2023 | Link |
| Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test | MHA & FFN | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Grokking of implicit reasoning in transformers: A mechanistic journey to the edge of generalization | MHA & FFN | Magnitude Analysis | - | NeurIPS | 2024 | Link |
| Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task | MHA & FFN | Magnitude Analysis | - | COLM | 2024 | Link |
| Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics | MHA & FFN | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |
| Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates | MHA & FFN | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |

Efficient Inference

| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
| --- | --- | --- | --- | --- | --- | --- |
| TokenSkip: Controllable Chain-of-Thought Compression in LLMs | Token Embedding | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective | Token Embedding | Gradient Detection | Amplitude Manipulation | ArXiv | 2025 | Link |
| Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters | Token Embedding | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Fit and prune: Fast and training-free visual token pruning for multi-modal large language models | Token Embedding | Magnitude Analysis | Amplitude Manipulation | AAAI | 2025 | Link |
| Zipcache: Accurate and efficient kv cache quantization with salient token identification | Token Embedding | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling | Token Embedding | Magnitude Analysis | Amplitude Manipulation | COLM | 2025 | Link |
| What Layers When: Learning to Skip Compute in LLMs with Residual Gates | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Accelerating Large Language Model Inference with Self-Supervised Early Exits | Residual Stream | Probing | Amplitude Manipulation | ArXiv | 2024 | Link |
| LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Residual Stream | Probing | Amplitude Manipulation | ACL | 2024 | Link |
| HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference | Residual Stream | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2023 | Link |
| Learning to Skip the Middle Layers of Transformers | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ACL | 2025 | Link |
| Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit-levels | Residual Stream | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Towards Superior Quantization Accuracy: A Layer-sensitive Approach | Residual Stream | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models | Residual Stream | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Mix-QViT: Mixed-precision vision transformer quantization driven by layer importance and quantization sensitivity | Residual Stream | Gradient Detection | - | ArXiv | 2025 | Link |
| Lsaq: Layer-specific adaptive quantization for large language model deployment | Residual Stream | Vocab Projection | - | ArXiv | 2024 | Link |
| Towards Building Efficient Sentence BERT Models using Layer Pruning | Residual Stream | Causal Attribution | Amplitude Manipulation | ACL | 2024 | Link |
| KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs | MHA & FFN | Circuit Discovery | - | COLM | 2025 | Link |
| Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity | MHA & FFN | Magnitude Analysis | - | ArXiv | 2026 | Link |
| Massive activations in large language models | MHA & FFN | Magnitude Analysis | - | NeurIPS | 2024 | Link |
| Systematic outliers in large language models | MHA & FFN | Circuit Discovery | - | ICLR | 2025 | Link |
| Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | MHA & FFN | Circuit Discovery | - | NeurIPS | 2023 | Link |
| RazorAttention: Efficient kv cache compression through retrieval heads | MHA | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| DuoAttention: Efficient long-context llm inference with retrieval and streaming heads | MHA | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| Unveiling visual perception in language models: An Attention head analysis approach | MHA | Magnitude Analysis | - | CVPR | 2025 | Link |
| Fast and Low-Cost Genomic Foundation Models via Outlier Removal | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2026 | Link |
| Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations | MHA | Magnitude Analysis | Amplitude Manipulation | IJCAI | 2025 | Link |
| Efficient Streaming Language Models with Attention Sinks | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2024 | Link |
| Unraveling babel: Exploring multilingual activation patterns within large language models | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2024 | Link |
| Neuron Specialization: Leveraging Intrinsic Task Modularity for Multilingual Machine Translation | Neuron | Magnitude Analysis | - | EMNLP | 2024 | Link |
| The super weight in large language models | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2024 | Link |
| Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models | FFN | Magnitude Analysis | Amplitude Manipulation | ACL | 2024 | Link |
| Unveiling super experts in mixture-of-experts large language models | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |

🌟 Citation

If you find this survey or repository useful for your research, please cite:

@article{zhang2026locate,
  title={Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models},
  author={Zhang, Hengyuan and Zhang, Zhihao and Wang, Mingyang and Su, Zunhai and Wang, Yiwei and Wang, Qianli and Yuan, Shuzhou and Nie, Ercong and Duan, Xufeng and Xue, Qibo and others},
  journal={arXiv preprint arXiv:2601.14004},
  year={2026}
}

📧 Contact

Feel free to open an issue or contact us if you have any questions or want to include your work in this list!

Corresponding Authors: Hengyuan Zhang (hengyuan.zhang88@gmail.com) and Zhihao Zhang (zhihaozhang017@gmail.com)
