Mechanistic interpretability
Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms underlying their computations. The approach studies neural networks in a manner analogous to how compiled binary programs can be reverse-engineered to understand their functions.
History
The term mechanistic interpretability was coined by Chris Olah to describe his work on circuit analysis, in contrast to the prevailing methods in interpretable AI. Circuit analysis attempted to completely characterize individual features and circuits within models, whereas the broader field tended toward gradient-based approaches such as saliency maps.[1][2]
Before circuit analysis, work in the subfield combined techniques such as feature visualization, dimensionality reduction, and attribution with methods from human-computer interaction to analyze models such as the vision model Inception v1.[3][4]
Key concepts
Mechanistic interpretability aims to identify structures, circuits or algorithms encoded in the weights of machine learning models.[5][6] This contrasts with earlier interpretability methods that focused primarily on input-output explanations.[7]
Linear representation hypothesis
This hypothesis posits that high-level concepts are represented as linear directions in the activation space of neural networks. Empirical evidence from word embeddings and large language models supports this view, although it does not hold universally.[8][9]
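A toy illustration of the hypothesis is word-vector arithmetic, in which a concept such as gender corresponds to a consistent direction in embedding space. The sketch below uses hand-built 2-D vectors rather than real learned embeddings, purely to show the geometry:

```python
import numpy as np

# Toy 2-D "embedding" space: axis 0 encodes royalty, axis 1 encodes gender.
# These vectors are illustrative stand-ins, not real learned embeddings.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# If "gender" is a linear direction, then king - man + woman
# should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))  # queen
```

In real models such directions are noisy and only approximately linear, which is why the hypothesis is stated as a tendency rather than a law.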
Methods
Mechanistic interpretability employs causal methods to understand how internal model components influence outputs, often using formal tools from causality theory.[10]
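One common causal technique is activation patching: an intermediate activation recorded on one input is substituted into a forward pass on a different input, and the change in output measures how strongly that component mediates the difference in behavior. The following is a minimal sketch on a toy two-layer PyTorch model (all sizes, inputs, and the choice of patched layer are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model standing in for a real network; sizes are arbitrary.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

clean_input = torch.randn(1, 4)
corrupted_input = torch.randn(1, 4)

# 1. Cache the hidden activation from the clean run.
cache = {}
handle = model[1].register_forward_hook(
    lambda mod, inp, out: cache.update(hidden=out.detach())
)
model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, patching in the clean activation.
#    (Returning a tensor from a forward hook replaces the layer's output.)
handle = model[1].register_forward_hook(lambda mod, inp, out: cache["hidden"])
patched_out = model(corrupted_input)
handle.remove()

corrupted_out = model(corrupted_input)

# If the patch moves the output back toward the clean run, the patched
# component causally mediates the behavioral difference between inputs.
print("effect of patch:", (patched_out - corrupted_out).norm().item())
```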
In the field of AI safety, mechanistic interpretability is used to understand and verify the behavior of complex AI systems and to attempt to identify potential risks such as AI misalignment.[11][6]
Sparse autoencoders
A sparse autoencoder (SAE) is a model trained to disentangle neural network activations into sparse representations. The learned dimensions often represent simple, human-understandable concepts. The technique was applied to large language model interpretability by Anthropic.[12][13][14]
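A minimal SAE sketch in PyTorch is shown below; the dictionary size and L1 sparsity coefficient are illustrative choices rather than values from any published setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        # The dictionary is much wider than the input so that many
        # sparse features can share the same activation space.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f        # reconstruction, features

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most
    # feature activations to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Usage on a batch of (here, random stand-in) model activations:
acts = torch.randn(64, 512)
sae = SparseAutoencoder()
x_hat, f = sae(acts)
sae_loss(acts, x_hat, f).backward()
```

The L1 term trades reconstruction fidelity for sparsity, so each activation vector is explained by only a few dictionary directions, which is what makes the learned dimensions candidates for human-interpretable concepts.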
References
- ^ Saphra, Naomi; Wiegreffe, Sarah (2024). "Mechanistic?" (PDF). BlackboxNLP Workshop.
- ^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (10 March 2020). "Zoom In: An Introduction to Circuits". Distill. 5 (3) e00024.001. doi:10.23915/distill.00024.001. ISSN 2476-0757.
- ^ Olah, Chris; Satyanarayan, Arvind; Johnson, Ian; Carter, Shan; Schubert, Ludwig; Ye, Katherine; Mordvintsev, Alexander (6 March 2018). "The Building Blocks of Interpretability". Distill. 3 (3) e10. doi:10.23915/distill.00010. ISSN 2476-0757.
- ^ Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (1 April 2020). "An Overview of Early Vision in InceptionV1". Distill. 5 (4) e00024.002. doi:10.23915/distill.00024.002. ISSN 2476-0757.
- ^ Conmy, Arthur; Mavor-Parker, Augustine N.; Lynch, Aengus; Heimersheim, Stefan; Garriga-Alonso, Adrià (10 December 2023). "Towards automated circuit discovery for mechanistic interpretability". Proceedings of the 37th International Conference on Neural Information Processing Systems. Art. 719. Red Hook, New York: Curran Associates Inc. pp. 16318–16352.
- ^ a b Levy, Steven (27 October 2025). "Why AI Breaks Bad". Wired. ISSN 1059-1028. Retrieved 18 March 2026.
- ^ Kästner, Lena; Crook, Barnaby (11 October 2024). "Explaining AI through mechanistic interpretability". European Journal for Philosophy of Science. 14 (4) 52. doi:10.1007/s13194-024-00614-4. ISSN 1879-4920.
- ^ Mikolov, Tomas; Yih, Wen-tau; Zweig, Geoffrey (June 2013). "Linguistic Regularities in Continuous Space Word Representations". In Vanderwende, Lucy; Daumé III, Hal; Kirchhoff, Katrin (eds.). Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia: Association for Computational Linguistics. pp. 746–751.
- ^ Park, Kiho; Choe, Yo Joong; Veitch, Victor (21 July 2024). "The linear representation hypothesis and the geometry of large language models". Proceedings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Research. Vol. 235, art. 1605. Vienna, Austria. pp. 39643–39666.
- ^ Vig, Jesse; Gehrmann, Sebastian; Belinkov, Yonatan; Qian, Sharon; Nevo, Daniel; Singer, Yaron; Shieber, Stuart (6 December 2020). "Investigating gender bias in language models using causal mediation analysis". Proceedings of the 34th International Conference on Neural Information Processing Systems. Art. 1039. Red Hook, New York: Curran Associates Inc. pp. 12388–12401. ISBN 978-1-7138-2954-6.
- ^ Sullivan, Mark (22 April 2025). "This startup wants to reprogram the mind of AI—and just got $50 million to do it". Fast Company. Retrieved 12 May 2025.
- ^ "Researchers are figuring out how large language models work". The Economist. 11 July 2024. ISSN 0013-0613. Retrieved 18 March 2026.
- ^ Geiger, Atticus; Ibeling, Duligur; Zur, Amir; Chaudhary, Maheep; Chauhan, Sonakshi; Huang, Jing; Arora, Aryaman; Wu, Zhengxuan; Goodman, Noah; Potts, Christopher; Icard, Thomas (2025). "Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability". Journal of Machine Learning Research. 26 (83): 1–64. ISSN 1533-7928.
- ^ Somvanshi, Shriyank; Islam, Md Monzurul; Rafe, Amir; Tusti, Anannya Ghosh; Chakraborty, Arka; Baitullah, Anika; Chowdhury, Tausif Islam; Alnawmasi, Nawaf; Dutta, Anandi; Das, Subasish (June 2026). "Bridging the Black Box: A Survey on Mechanistic Interpretability in AI". ACM Computing Surveys. 58 (8) 210. doi:10.1145/3787104. ISSN 0360-0300.
Further reading
- Nanda, Neel (2023). "Emergent Linear Representations in World Models of Self-Supervised Sequence Models". BlackboxNLP Workshop: 16–30. doi:10.18653/v1/2023.blackboxnlp-1.2.