Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
We will continue to update this repository.
If you enjoy or benefit from the project, a star ⭐ on GitHub would be greatly appreciated and will help you stay informed about future updates.
A systematic survey on how to locate interpretable objects, steer model behaviors, and improve LLMs (Alignment, Capability, Efficiency) via Mechanistic Interpretability.
Note: The figure illustrates our framework: Locate (identifying internal objects), Steer (manipulating behaviors), and Improve (downstream applications).
Mechanistic Interpretability (MI) has evolved from merely observing model internals to actively intervening in them. This repository maintains a curated list of papers reviewed in our survey, focusing on Actionable MI.
- [2026-03-28] We have significantly updated our paper list with 18 new papers! These additions cover the latest advancements in reasoning steering, knowledge editing, efficient inference, and more.
- [2026-01-21] Our paper is available on arXiv! Check it out here.
- [2026-01-20] This repository was created to track the latest progress in Actionable MI.
To help navigate the paper list, we use the following abbreviations for Objects, Localizing Methods, and Steering Methods:
The core interpretable objects in our survey are the Residual Stream, MHA (multi-head attention), FFN (feed-forward network), Neuron, SAE Feature, and Token Embedding, as tagged in the tables below. The localizing and steering methods are:
- Magnitude Analysis: Analysis of weight or activation magnitudes; training-free but data-dependent
- Causal Attribution: Activation patching and ablation
- Gradient Detection: Gradient-based attribution; data-dependent and incurs extra compute
- Probing: Supervised property decoding
- Vocab Projection: Projecting hidden states onto the vocabulary (e.g., Logit Lens) for direct semantic mapping
- Circuit Discovery: Causal subnetwork identification
- Amplitude Manipulation: Scaling or replacing activations
- Targeted Optimization: Weight editing, targeted fine-tuning
- Vector Arithmetic: Steering via feature and task vectors
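To make the localizing methods concrete, here is a toy sketch of Vocab Projection in the Logit-Lens style: an intermediate residual-stream state is multiplied directly by the unembedding matrix to see which token it already "points at". All dimensions, weights, and activations below are synthetic stand-ins, not taken from any surveyed paper or real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real LLMs use d_model ~ 4096, |V| ~ 100k).
d_model, vocab_size, n_layers = 16, 50, 4

# Stand-ins for a model's unembedding matrix and per-layer residual states.
W_U = rng.normal(size=(d_model, vocab_size))
residual_stream = [rng.normal(size=d_model) for _ in range(n_layers)]

def logit_lens(hidden, W_U):
    """Project an intermediate hidden state directly onto the vocabulary
    and return a softmax distribution over tokens."""
    logits = hidden @ W_U
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Inspect which token each layer's residual state currently favors.
for layer, h in enumerate(residual_stream):
    probs = logit_lens(h, W_U)
    print(f"layer {layer}: top token id = {int(np.argmax(probs))}")
```

In a real model the same projection is applied to cached hidden states from a forward hook, which is what makes this a training-free localizing method.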
For studies employing multiple objects or localizing/steering methods, we annotate the primary tag.
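The Vector Arithmetic steering method can likewise be sketched in a few lines: a direction is extracted as the difference in mean activations between two contrastive prompt sets, then added to (or projected out of) the residual stream at inference time. Again, all arrays here are synthetic placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

# Hypothetical cached residual-stream activations at one layer for two
# contrastive prompt sets (e.g., refusal-inducing vs. benign prompts).
acts_pos = rng.normal(loc=1.0, size=(32, d_model))
acts_neg = rng.normal(loc=-1.0, size=(32, d_model))

# Difference-in-means steering vector, normalized to unit length.
steer = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden, vec, alpha):
    """alpha set: activation addition; alpha=None: directional ablation
    (remove the hidden state's component along vec)."""
    if alpha is None:
        return hidden - (hidden @ vec) * vec
    return hidden + alpha * vec

h = rng.normal(size=d_model)
h_steered = apply_steering(h, steer, alpha=4.0)   # push along the direction
h_ablated = apply_steering(h, steer, alpha=None)  # erase the direction
print(abs(h_ablated @ steer))  # ~0: no remaining component along steer
```

The same two operations (addition and ablation) underlie most of the Vector Arithmetic entries in the tables below; they differ mainly in how the direction is found (probing, causal attribution, SAE features, etc.).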
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Safety Layers in Aligned Large Language Models: The Key to LLM Security | Residual Stream | Causal Attribution | Targeted Optimization | ICLR | 2025 | Link |
| Refusal in Language Models Is Mediated by a Single Direction | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2024 | Link |
| LLMs Encode Harmfulness and Refusal Separately | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models | Residual Stream | Probing | Vector Arithmetic | ICML | 2025 | Link |
| Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? | Residual Stream | Probing | Vector Arithmetic | ArXiv | 2025 | Link |
| Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs | Residual Stream | Vocab Projection | Amplitude Manipulation | ArXiv | 2026 | Link |
| A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | Residual Stream | Probing | Targeted Optimization | ICML | 2024 | Link |
| Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2024 | Link |
| Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Refusal Direction is Universal Across Safety-Aligned Languages | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models | Residual Stream | Vocab Projection | Vector Arithmetic | ICLR | 2024 | Link |
| In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation | Residual Stream | Vocab Projection | Vector Arithmetic | ICML | 2024 | Link |
| TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space | Residual Stream | Probing | Vector Arithmetic | ACL | 2024 | Link |
| LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations | Residual Stream | Probing | Vector Arithmetic | ICLR | 2025 | Link |
| Multi-Attribute Steering of Language Models via Targeted Intervention | Residual Stream | Gradient Detection | Vector Arithmetic | ACL | 2025 | Link |
| Improving Instruction-Following in Language Models through Activation Steering | Residual Stream | - | Vector Arithmetic | ICLR | 2025 | Link |
| On the Role of Attention Heads in Large Language Model Safety | MHA | Causal Attribution | Amplitude Manipulation | ICLR | 2025 | Link |
| Refine Large Language Model Fine-tuning via Instruction Vector | MHA | Causal Attribution | Targeted Optimization | ArXiv | 2024 | Link |
| Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons | Neuron | Causal Attribution | Amplitude Manipulation | ArXiv | 2025 | Link |
| Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models | Neuron | Magnitude Analysis | Amplitude Manipulation | ICML | 2024 | Link |
| H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron | Neuron | Magnitude Analysis | Targeted Optimization | ICLR | 2025 | Link |
| Neuron-Aware Data Selection in Instruction Tuning for Large Language Models | Neuron | Magnitude Analysis | - | ICLR | 2026 | Link |
| Precision Knowledge Editing: Enhancing Safety in Large Language Models | Neuron | Magnitude Analysis | Targeted Optimization | ArXiv | 2025 | Link |
| Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet | SAE Feature | Magnitude Analysis | Amplitude Manipulation | Blog | 2024 | Link |
| Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Understanding Refusal in Language Models with Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders | SAE Feature | Magnitude Analysis | Vector Arithmetic | ICML | 2025 | Link |
| SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models | SAE Feature | Magnitude Analysis | Vector Arithmetic | ArXiv | 2025 | Link |
| Training Superior Sparse Autoencoders for Instruct Models | SAE Feature | Magnitude Analysis | Vector Arithmetic | ArXiv | 2025 | Link |
| Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units | SAE Feature | Gradient Detection | - | ArXiv | 2026 | Link |
| Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning | Token Embedding | Gradient Detection | Vector Arithmetic | ArXiv | 2025 | Link |
| Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis | MHA & FFN | Circuit Discovery | Targeted Optimization | EMNLP | 2025 | Link |
| On Localizing and Deleting Toxic Memories in Large Language Models | FFN | Causal Attribution | Targeted Optimization | NAACL | 2025 | Link |
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models | Residual Stream | Causal Attribution | - | ArXiv | 2025 | Link |
| MPF: Aligning and Debiasing Language Models Post Deployment via Multi Perspective Fusion | Residual Stream | - | Amplitude Manipulation | ICML | 2025 | Link |
| Mitigate Position Bias in LLMs via Scaling a Single Hidden States Channel | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ACL | 2025 | Link |
| Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability | Residual Stream | Causal Attribution | Amplitude Manipulation | ArXiv | 2025 | Link |
| Investigating Gender Bias in Language Models Using Causal Mediation Analysis | MHA | Causal Attribution | Amplitude Manipulation | NeurIPS | 2020 | Link |
| Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective | MHA | Magnitude Analysis | Amplitude Manipulation | TMLR | 2025 | Link |
| Linear Representations of Political Perspective Emerge in Large Language Models | MHA | Probing | Vector Arithmetic | ICLR | 2025 | Link |
| Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5 | MHA | Magnitude Analysis | - | ICAIF | 2025 | Link |
| Eliminating Position Bias of Language Models: A Mechanistic Approach | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2025 | Link |
| Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model | MHA | Causal Attribution | Targeted Optimization | ACLWS | 2023 | Link |
| Locating and Mitigating Gender Bias in Large Language Models | FFN | Causal Attribution | Targeted Optimization | ICIC | 2024 | Link |
| Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare | FFN | Causal Attribution | Amplitude Manipulation | EMNLP | 2025 | Link |
| Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions | FFN | Vocab Projection | Targeted Optimization | ACL | 2025 | Link |
| Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing | Neuron | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |
| The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models | Neuron | Gradient Detection | Amplitude Manipulation | ICLR | 2024 | Link |
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models | Residual Stream | Magnitude Analysis | Vector Arithmetic | ArXiv | 2026 | Link |
| Can Role Vectors Affect LLM Behaviour? | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Persona vectors: Monitoring and controlling character traits in language models | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks | Residual Stream | Probing | - | AACL | 2025 | Link |
| Mechanistic Interpretability of Emotion Inference in Large Language Models | Residual Stream | Probing | Vector Arithmetic | ACL | 2025 | Link |
| Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| Steering Llama 2 via Contrastive Activation Addition | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2024 | Link |
| Probing then Editing Response Personality of Large Language Models | Residual Stream | Probing | Targeted Optimization | COLM | 2025 | Link |
| Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI | Residual Stream | Causal Attribution | - | ArXiv | 2025 | Link |
| From Rational Answers to Emotional Resonance: The Role of Controllable Emotion Generation in Language Models | Residual Stream | - | Vector Arithmetic | ArXiv | 2025 | Link |
| Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Steering Latent Traits, Not Learned Facts: An Empirical Study of Activation Control Limits | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Personality Vector: Modulating Personality of Large Language Models by Model Merging | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Billy: Steering large language models via merging persona vectors for creative generation | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Personas as a way to model truthfulness in language models | Residual Stream | Probing | - | EMNLP | 2024 | Link |
| Who's asking? User personas and the mechanics of latent misalignment | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2024 | Link |
| Understanding How Value Neurons Shape the Generation of Specified Values in LLMs | Neuron | Causal Attribution | Amplitude Manipulation | EMNLP | 2025 | Link |
| Neuron based Personality Trait Induction in Large Language Models | Neuron | Causal Attribution | Amplitude Manipulation | ICLR | 2025 | Link |
| From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning | Neuron | Causal Attribution | Targeted Optimization | ICML | 2024 | Link |
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Reasoning-Finetuning Repurposes Latent Representations in Base Models | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Improving Reasoning Performance in Large Language Models via Representation Engineering | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies | Residual Stream | Vocab Projection | Targeted Optimization | ArXiv | 2025 | Link |
| Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2025 | Link |
| Probing for Arithmetic Errors in Language Models | Residual Stream | Probing | - | EMNLP | 2025 | Link |
| Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization | Residual Stream | Probing | Vector Arithmetic | AAAI | 2026 | Link |
| Understanding Reasoning in Thinking Language Models via Steering Vectors | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries | Residual Stream | Probing | - | EMNLP | 2024 | Link |
| Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process | Residual Stream | Probing | - | ICLR | 2025 | Link |
| The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2025 | Link |
| Uncovering Latent Chain of Thought Vectors in Language Models | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Steering LLM Reasoning Through Bias-Only Adaptation | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| ATLAS: Adaptive Test-Time Latent Steering with External Verifiers for Enhancing LLMs' Reasoning | Residual Stream | Probing | Vector Arithmetic | ArXiv | 2026 | Link |
| Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time | MHA | Probing | Amplitude Manipulation | ArXiv | 2025 | Link |
| How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning | MHA | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis | MHA | Causal Attribution | Amplitude Manipulation | EMNLP | 2024 | Link |
| Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models | MHA | Causal Attribution | - | EMNLP | 2025 | Link |
| Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation | MHA & FFN | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Interpreting and Improving Large Language Models in Arithmetic Calculation | MHA & FFN | Causal Attribution | Targeted Optimization | ICML | 2024 | Link |
| A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis | MHA & FFN | Causal Attribution | - | EMNLP | 2023 | Link |
| Uncovering the Interpretation of Large Language Models | MHA & FFN | Causal Attribution | - | COMPSAC | 2024 | Link |
| Understanding Addition in Transformers | MHA & FFN | Causal Attribution | Amplitude Manipulation | ICLR | 2024 | Link |
| Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking | MHA & FFN | Gradient Detection | Targeted Optimization | ACL | 2025 | Link |
| How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | MHA & FFN | Circuit Discovery | - | NeurIPS | 2023 | Link |
| Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics | MHA & FFN | Circuit Discovery | - | ICLR | 2025 | Link |
| Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Reasoning Models Generate Societies of Thought | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2026 | Link |
| I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Internal states before wait modulate reasoning patterns | SAE Feature | Magnitude Analysis | Vector Arithmetic | EMNLP | 2025 | Link |
| Can we interpret latent reasoning using current mechanistic interpretability tools? | Token Embedding | Causal Attribution | Amplitude Manipulation | Blog | 2025 | Link |
| Analyzing chain-of-thought prompting in large language models via gradient-based feature attributions | Token Embedding | Gradient Detection | - | ICML | 2023 | Link |
| Probabilistic Soundness Guarantees in LLM Reasoning Chains | Token Embedding | Magnitude Analysis | - | EMNLP | 2025 | Link |
| Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Importance-based Neuron Allocation for Multilingual Neural Machine Translation | Neuron | Magnitude Analysis | Amplitude Manipulation | ACL | 2021 | Link |
| On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons | Neuron | Magnitude Analysis | Amplitude Manipulation | NAACL | 2024 | Link |
| Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models | Neuron | Magnitude Analysis | Amplitude Manipulation | ACL | 2024 | Link |
| How do Large Language Models Handle Multilingualism? | Neuron | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| On Relation-Specific Neurons in Large Language Models | Neuron | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder | Neuron | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages | SAE Feature | Magnitude Analysis | Amplitude Manipulation | NAACL | 2025 | Link |
| On the Language Neutrality of Pre-trained Multilingual Representations | Residual Stream | Probing | - | EMNLP | 2020 | Link |
| Can Cross-Lingual Transferability of Multilingual Transformers Be Activated Without End-Task Data? | Residual Stream | - | Vector Arithmetic | ACL | 2023 | Link |
| Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2023 | Link |
| Do Llamas Work in English? On the Latent Language of Multilingual Transformers | Residual Stream | Vocab Projection | Vector Arithmetic | ACL | 2024 | Link |
| Exploring Alignment in Shared Cross-lingual Spaces | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2024 | Link |
| Why do LLaVA Vision-Language Models Reply to Images in English? | Residual Stream | Probing | Vector Arithmetic | EMNLP | 2024 | Link |
| ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2025 | Link |
| Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models | Residual Stream | Vocab Projection | Vector Arithmetic | ACL | 2025 | Link |
| The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities | Residual Stream | Vocab Projection | - | ICLR | 2025 | Link |
| Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Tracing Multilingual Factual Knowledge Acquisition in Pretraining | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Fine-tuning Done Right in Model Editing | FFN | Gradient Detection | Targeted Optimization | ICLR | 2026 | Link |
| ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall | FFN | Causal Attribution | Targeted Optimization | ICLR | 2026 | Link |
| LoKI: Low-damage Knowledge Implanting of Large Language Models | FFN | Causal Attribution | Targeted Optimization | AAAI | 2026 | Link |
| Locating and Editing Factual Associations in GPT | FFN | Causal Attribution | Targeted Optimization | NeurIPS | 2022 | Link |
| Mass-Editing Memory in a Transformer | FFN | Causal Attribution | Targeted Optimization | ICLR | 2023 | Link |
| Joint Localization and Activation Editing for Low-Resource Fine-Tuning | MHA | Magnitude Analysis | Targeted Optimization | ICML | 2025 | Link |
| Taming Knowledge Conflicts in Language Models | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models | MHA | Causal Attribution | Amplitude Manipulation | ACL | 2024 | Link |
| Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models | MHA | Causal Attribution | Amplitude Manipulation | ArXiv | 2024 | Link |
| Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs | MHA | Causal Attribution | Amplitude Manipulation | ACL | 2025 | Link |
| Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2026 | Link |
| Probing and Boosting Large Language Models Capabilities via Attention Heads | MHA | Probing | Targeted Optimization | EMNLP | 2025 | Link |
| TIES-Merging: Resolving Interference When Merging Models | MHA & FFN | Magnitude Analysis | Vector Arithmetic | NeurIPS | 2023 | Link |
| Neuron-Level Knowledge Attribution in Large Language Models | MHA & FFN | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model | MHA & FFN | Magnitude Analysis | Targeted Optimization | ACL | 2024 | Link |
| Knowledge Localization: Mission Not Accomplished? Enter Query Localization! | MHA & FFN | Magnitude Analysis | Amplitude Manipulation | ICLR | 2025 | Link |
| Enhancing Large Language Model Performance with Gradient-Based Parameter Selection | MHA & FFN | Magnitude Analysis | Targeted Optimization | AAAI | 2025 | Link |
| The Geometry of Forgetting: Analyzing Machine Unlearning through Local Learning Coefficients | MHA & FFN | Magnitude Analysis | - | ICML | 2025 | Link |
| Knowledge Circuits in Pretrained Transformers | MHA & FFN | Circuit Discovery | - | NeurIPS | 2024 | Link |
| Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning | MHA & FFN | Probing | Targeted Optimization | ArXiv | 2024 | Link |
| Unveiling Linguistic Regions in Large Language Models | MHA & FFN | Gradient Detection | Targeted Optimization | ACL | 2024 | Link |
| Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models | MHA & FFN | Gradient Detection | Vector Arithmetic | ACL | 2025 | Link |
| Activation-Guided Consensus Merging for Large Language Models | MHA & FFN | Magnitude Analysis | Vector Arithmetic | NeurIPS | 2025 | Link |
| Dissecting Recall of Factual Associations in Auto-Regressive Language Models | MHA & FFN | Causal Attribution | - | EMNLP | 2023 | Link |
| Multilingual Knowledge Editing with Language-Agnostic Factual Neurons | Neuron | Magnitude Analysis | Targeted Optimization | COLING | 2025 | Link |
| Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons | Neuron | Gradient Detection | Amplitude Manipulation | AAAI | 2024 | Link |
| IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons | Neuron | Gradient Detection | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts | Neuron | Gradient Detection | Amplitude Manipulation | AAAI | 2025 | Link |
| Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing | Neuron | - | Amplitude Manipulation | EMNLP | 2025 | Link |
| Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering | SAE Feature | Magnitude Analysis | Amplitude Manipulation | NAACL | 2025 | Link |
| SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | SAE Feature | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| Impact of Co-occurrence on Factual Knowledge of Large Language Models | Residual Stream | Probing | - | EMNLP | 2023 | Link |
| Backward Lens: Projecting Language Model Gradients into the Vocabulary Space | Residual Stream | Vocab Projection | Targeted Optimization | EMNLP | 2024 | Link |
| ReFT: Representation Finetuning for Language Models | Residual Stream | Causal Attribution | Targeted Optimization | NeurIPS | 2024 | Link |
| Analysing the Residual Stream of Language Models Under Knowledge Conflicts | Residual Stream | Probing | - | ArXiv | 2024 | Link |
| How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study | Residual Stream | Probing | - | COLING | 2024 | Link |
| Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers? | Residual Stream | Probing | - | COLING | 2025 | Link |
| Transferring Linear Features Across Language Models With Model Stitching | Residual Stream | Probing | Vector Arithmetic | NeurIPS | 2025 | Link |
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Task-Specific Skill Localization in Fine-tuned Language Models | Neuron | Magnitude Analysis | Targeted Optimization | ICML | 2023 | Link |
| LANDeRMT: Detecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation | Neuron | Gradient Detection | Targeted Optimization | ACL | 2024 | Link |
| Sparse is enough in fine-tuning pre-trained large language models | Neuron | Gradient Detection | Targeted Optimization | ICML | 2024 | Link |
| Fine-tuning Happens in Tiny Subspaces: Exploring Intrinsic Task-specific Subspaces of Pre-trained Language Models | Neuron | Magnitude Analysis | Targeted Optimization | ACL | 2023 | Link |
| Let's Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model | Neuron | Magnitude Analysis | Targeted Optimization | COLING | 2025 | Link |
| Language-Specific Neurons Do Not Facilitate Cross-Lingual Transfer | Neuron | Magnitude Analysis | Targeted Optimization | ACL | 2025 | Link |
| Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models | Neuron | Magnitude Analysis | Targeted Optimization | AACL | 2025 | Link |
| How do Large Language Models Handle Multilingualism? | Neuron | Causal Attribution | Targeted Optimization | NeurIPS | 2024 | Link |
| Optimizing Multimodal Language Models through Attention-based Interpretability | MHA | Magnitude Analysis | Targeted Optimization | ICAI | 2025 | Link |
| In-context Learning and Induction Heads | MHA | Magnitude Analysis | - | ArXiv | 2022 | Link |
| How Transformers Implement Induction Heads: Approximation and Optimization Analysis | MHA | Magnitude Analysis | - | ArXiv | 2024 | Link |
| What needs to go right for an induction head? a mechanistic study of in-context learning circuits and their formation | MHA | Magnitude Analysis | - | ICML | 2024 | Link |
| The developmental landscape of in-context learning | MHA | Magnitude Analysis | - | TMLR | 2025 | Link |
| In-Context Meta Learning Induces Multi-Phase Circuit Emergence | MHA | Magnitude Analysis | - | ICLR | 2025 | Link |
| Joint Localization and Activation Editing for Low-Resource Fine-Tuning | MHA | Magnitude Analysis | Vector Arithmetic | ICML | 2025 | Link |
| The slingshot mechanism: An empirical study of adaptive optimizers and the Grokking Phenomenon | MHA & FFN | Magnitude Analysis | - | NeurIPS | 2022 | Link |
| Explaining grokking through circuit efficiency | MHA & FFN | Magnitude Analysis | - | ArXiv | 2023 | Link |
| Towards Empirical Interpretation of Internal Circuits and Properties in Grokked Transformers on Modular Polynomials | MHA & FFN | Magnitude Analysis | - | TMLR | 2024 | Link |
| Progress measures for grokking via mechanistic interpretability | MHA & FFN | Magnitude Analysis | - | ICLR | 2023 | Link |
| Predicting grokking long before it happens: A look into the loss landscape of models which grok | MHA & FFN | Magnitude Analysis | - | ArXiv | 2023 | Link |
| Exploring Grokking: Experimental and Mechanistic Investigations | MHA & FFN | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Omnigrok: Grokking Beyond Algorithmic Data | MHA & FFN | Magnitude Analysis | - | ICLR | 2023 | Link |
| Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test | MHA & FFN | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Grokking of implicit reasoning in transformers: A mechanistic journey to the edge of generalization | MHA & FFN | Magnitude Analysis | - | NeurIPS | 2024 | Link |
| Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task | MHA & FFN | Magnitude Analysis | - | COLM | 2024 | Link |
| Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics | MHA & FFN | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |
| Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates | MHA & FFN | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| TokenSkip: Controllable Chain-of-Thought Compression in LLMs | Token Embedding | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective | Token Embedding | Gradient Detection | Amplitude Manipulation | ArXiv | 2025 | Link |
| Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters | Token Embedding | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Fit and prune: Fast and training-free visual token pruning for multi-modal large language models | Token Embedding | Magnitude Analysis | Amplitude Manipulation | AAAI | 2025 | Link |
| Zipcache: Accurate and efficient kv cache quantization with salient token identification | Token Embedding | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling | Token Embedding | Magnitude Analysis | Amplitude Manipulation | COLM | 2025 | Link |
| What Layers When: Learning to Skip Compute in LLMs with Residual Gates | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Accelerating Large Language Model Inference with Self-Supervised Early Exits | Residual Stream | Probing | Amplitude Manipulation | ArXiv | 2024 | Link |
| LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding | Residual Stream | Probing | Amplitude Manipulation | ACL | 2024 | Link |
| HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference | Residual Stream | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2023 | Link |
| Learning to Skip the Middle Layers of Transformers | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| ShortGPT: Layers in Large Language Models are More Redundant Than You Expect | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ACL | 2025 | Link |
| Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit-levels | Residual Stream | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Towards Superior Quantization Accuracy: A Layer-sensitive Approach | Residual Stream | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models | Residual Stream | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Mix-QViT: Mixed-precision vision transformer quantization driven by layer importance and quantization sensitivity | Residual Stream | Gradient Detection | - | ArXiv | 2025 | Link |
| Lsaq: Layer-specific adaptive quantization for large language model deployment | Residual Stream | Vocab Projection | - | ArXiv | 2024 | Link |
| Towards Building Efficient Sentence BERT Models using Layer Pruning | Residual Stream | Causal Attribution | Amplitude Manipulation | ACL | 2024 | Link |
| KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs | MHA & FFN | Circuit Discovery | - | COLM | 2025 | Link |
| Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach Driven by Numerical and Structural Dual-Sensitivity | MHA & FFN | Magnitude Analysis | - | ArXiv | 2026 | Link |
| Massive activations in large language models | MHA & FFN | Magnitude Analysis | - | NeurIPS | 2024 | Link |
| Systematic outliers in large language models | MHA & FFN | Circuit Discovery | - | ICLR | 2025 | Link |
| Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing | MHA & FFN | Circuit Discovery | - | NeurIPS | 2023 | Link |
| RazorAttention: Efficient kv cache compression through retrieval heads | MHA | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| DuoAttention: Efficient long-context llm inference with retrieval and streaming heads | MHA | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| Unveiling visual perception in language models: An Attention head analysis approach | MHA | Magnitude Analysis | - | CVPR | 2025 | Link |
| Fast and Low-Cost Genomic Foundation Models via Outlier Removal | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2026 | Link |
| Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations | MHA | Magnitude Analysis | Amplitude Manipulation | IJCAI | 2025 | Link |
| Efficient Streaming Language Models with Attention Sinks | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2024 | Link |
| Unraveling babel: Exploring multilingual activation patterns within large language models | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2024 | Link |
| Neuron Specialization: Leveraging Intrinsic Task Modularity for Multilingual Machine Translation | Neuron | Magnitude Analysis | - | EMNLP | 2024 | Link |
| The super weight in large language models | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2024 | Link |
| Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models | FFN | Magnitude Analysis | Amplitude Manipulation | ACL | 2024 | Link |
| Unveiling super experts in mixture-of-experts large language models | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
If you find this survey or repository useful for your research, please cite:
@article{zhang2026locate,
title={Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models},
author={Zhang, Hengyuan and Zhang, Zhihao and Wang, Mingyang and Su, Zunhai and Wang, Yiwei and Wang, Qianli and Yuan, Shuzhou and Nie, Ercong and Duan, Xufeng and Xue, Qibo and others},
journal={arXiv preprint arXiv:2601.14004},
year={2026}
}

Feel free to open an issue or contact us if you have any questions or want to include your work in this list!
Corresponding Author: Hengyuan Zhang (hengyuan.zhang88@gmail.com) and Zhihao Zhang (zhihaozhang017@gmail.com)

