Awesome LLM Interpretability

A curated list of amazingly awesome tools, papers, articles, and communities focused on Large Language Model (LLM) Interpretability.

Table of Contents

LLM Interpretability Tools

Tools and libraries for LLM interpretability and analysis.

  1. The Learning Interpretability Tool - An open-source platform for visualizing and understanding ML models; supports classification, regression, and generative models (text & image data); includes saliency methods, attention attribution, counterfactuals, TCAV, embedding visualizations, and Facets-style data analysis.
  2. Comgra - Helps you analyze and debug neural networks in PyTorch.
  3. Pythia - Interpretability analysis to understand how knowledge develops and evolves during training in autoregressive transformers.
  4. Phoenix - AI Observability & Evaluation - Evaluate, troubleshoot, and fine tune your LLM, CV, and NLP models in a notebook.
  5. Automated Interpretability - Code for automatically generating, simulating, and scoring explanations of neuron behavior.
  6. Fmr.ai - AI interpretability and explainability platform.
  7. Attention Analysis - Analyzing attention maps from BERT transformer.
  8. SpellGPT - Explores GPT-3's ability to spell its own token strings.
  9. SuperICL - Super In-Context Learning code which allows black-box LLMs to work with locally fine-tuned smaller models.
  10. Git Re-Basin - Code release for "Git Re-Basin: Merging Models modulo Permutation Symmetries."
  11. Functionary - Chat language model that can interpret and execute functions/plugins.
  12. Sparse Autoencoder - Sparse Autoencoder for Mechanistic Interpretability.
  13. Rome - Locating and editing factual associations in GPT.
  14. Inseq - Interpretability for sequence generation models.
  15. Neuron Viewer - Tool for viewing neuron activations and explanations.
  16. LLM Visualization - Visualizes LLM internals at a low level.
  17. Vanna - Abstractions that use RAG to generate SQL with any LLM.
  18. Copy Suppression - Designed to help explore different prompts for GPT-2 Small, as part of a research project regarding copy-suppression in LLMs.
  19. TransformerViz - Interactive tool to visualize transformer models via their latent space.
  20. TransformerLens - A Library for Mechanistic Interpretability of Generative Language Models.
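Several of the tools above (notably TransformerLens) are built around the same core pattern: registering hooks on a model's forward pass so intermediate activations can be cached and inspected. The sketch below illustrates that pattern on a toy two-layer network in plain NumPy; the `TinyModel` class, its dimensions, and the hook signature are illustrative inventions, not any library's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyModel:
    """Toy two-layer MLP whose forward pass exposes activations via hooks."""

    def __init__(self, d_in=4, d_hidden=8, d_out=4):
        self.W1 = rng.normal(size=(d_in, d_hidden))
        self.W2 = rng.normal(size=(d_hidden, d_out))
        self.hooks = []  # callables invoked as hook(name, activation)

    def forward(self, x):
        h = np.maximum(0, x @ self.W1)  # ReLU hidden layer
        for hook in self.hooks:
            hook("hidden", h)           # let observers see the activation
        return h @ self.W2

# Cache activations without modifying the model's code.
cache = {}
model = TinyModel()
model.hooks.append(lambda name, act: cache.setdefault(name, act.copy()))

out = model.forward(np.ones(4))
print(cache["hidden"].shape)  # (8,) -- cached hidden activations
```

Interpretability libraries apply this same idea to every attention head and MLP layer of a real transformer, which is what makes techniques like activation patching and probing practical.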

LLM Interpretability Papers

Academic and industry papers on LLM interpretability.

  1. Interpretability Illusions in the Generalization of Simplified Models - Shows how interpretability methods based on simplified models (e.g., linear probes) can be prone to generalization illusions.
  2. Self-Influence Guided Data Reweighting for Language Model Pre-training - An application of training data attribution methods to reweight training data and improve performance.
  3. Data Similarity is Not Enough to Explain Language Model Performance - Discusses the limits of embedding models for explaining effective data selection.
  4. Post Hoc Explanations of Language Models Can Improve Language Models - Evaluates the ability of language-model-generated explanations to also improve model quality.
  5. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models (NeurIPS 2023 Spotlight) - Highlights the limits of Causal Tracing: how a fact is stored in an LLM can be changed by editing weights in a different location than where Causal Tracing suggests.
  6. Finding Neurons in a Haystack: Case Studies with Sparse Probing - Explores the representation of high-level human-interpretable features within neuron activations of large language models (LLMs).
  7. Copy Suppression: Comprehensively Understanding an Attention Head - Investigates a specific attention head in GPT-2 Small, revealing its primary role in copy suppression.
  8. Linear Representations of Sentiment in Large Language Models - Shows how sentiment is represented in Large Language Models (LLMs), finding that sentiment is linearly represented in these models.
  9. Emergent world representations: Exploring a sequence model trained on a synthetic task - Explores emergent internal representations in a GPT variant trained to predict legal moves in the board game Othello.
  10. Towards Automated Circuit Discovery for Mechanistic Interpretability - Introduces the Automatic Circuit Discovery (ACDC) algorithm for identifying important units in neural networks.
  11. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations - Examines small neural networks to understand how they learn group compositions, using representation theory.
  12. Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias - Causal mediation analysis as a method for interpreting neural models in natural language processing.
  13. The Quantization Model of Neural Scaling - Proposes the Quantization Model for explaining neural scaling laws in neural networks.
  14. Discovering Latent Knowledge in Language Models Without Supervision - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
  15. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model - Analyzes mathematical capabilities of GPT-2 Small, focusing on its ability to perform the 'greater-than' operation.
  16. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Using a sparse autoencoder to decompose the activations of a one-layer transformer into interpretable, monosemantic features.
  17. Language models can explain neurons in language models - Explores how language models like GPT-4 can be used to explain the functioning of neurons within similar models.
  18. Emergent Linear Representations in World Models of Self-Supervised Sequence Models - Linear representations in a world model of Othello-playing sequence models.
  19. "Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model" - Explores stepwise inference in autoregressive language models using a synthetic task based on navigating directed acyclic graphs.
  20. "Successor Heads: Recurring, Interpretable Attention Heads In The Wild" - Introduces 'successor heads,' attention heads that increment tokens with a natural ordering, such as numbers and days, in LLMโ€™s.
  21. "Large Language Models Are Not Robust Multiple Choice Selectors" - Analyzes the bias and robustness of LLMs in multiple-choice questions, revealing their vulnerability to option position changes due to inherent "selection biasโ€.
  22. "Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory" - Presents a novel approach to understanding neural networks by examining feature complexity through category theory.
  23. "Let's Verify Step by Step" - Focuses on improving the reliability of LLMs in multi-step reasoning tasks using step-level human feedback.
  24. "Interpretability Illusions in the Generalization of Simplified Models" - Examines the limitations of simplified representations (like SVD) used to interpret deep learning systems, especially in out-of-distribution scenarios.
  25. "The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models" - Presents a novel approach for identifying and mitigating social biases in language models, introducing the concept of 'Social Bias Neurons'.
  26. "Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition" - Investigates how LLMs perform the task of mathematical addition.
  27. "Measuring Feature Sparsity in Language Models" - Develops metrics to evaluate the success of sparse coding techniques in language model activations.
  28. Toy Models of Superposition - Investigates how models represent more features than dimensions, especially when features are sparse.
  29. Spine: Sparse interpretable neural embeddings - Presents SPINE, a method transforming dense word embeddings into sparse, interpretable ones using denoising autoencoders.
  30. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors - Introduces a novel method for visualizing transformer networks using dictionary learning.
  31. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling - Introduces Pythia, a toolset designed for analyzing the training and scaling behaviors of LLMs.
  32. On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron - Critically examines the effectiveness of the "Sentiment Neuron".
  33. Engineering monosemanticity in toy models - Explores engineering monosemanticity in neural networks, where individual neurons correspond to distinct features.
  34. Polysemanticity and capacity in neural networks - Investigates polysemanticity in neural networks, where individual neurons represent multiple features.
  35. An Overview of Early Vision in InceptionV1 - A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.
  36. Visualizing and measuring the geometry of BERT - Delves into BERT's internal representation of linguistic information, focusing on both syntactic and semantic aspects.
  37. Neurons in Large Language Models: Dead, N-gram, Positional - An analysis of neurons in large language models, focusing on the OPT family.
  38. Can Large Language Models Explain Themselves? - Evaluates the effectiveness of self-explanations generated by LLMs in sentiment analysis tasks.
  39. Interpretability in the Wild: GPT-2 small (arXiv) - Provides a mechanistic explanation of how GPT-2 small performs indirect object identification (IOI) in natural language processing.
  40. Sparse Autoencoders Find Highly Interpretable Features in Language Models - Explores the use of sparse autoencoders to extract more interpretable and less polysemantic features from LLMs.
  41. Emergent and Predictable Memorization in Large Language Models - Investigates whether the memorization of specific training sequences by a large model can be predicted from smaller models or partial training runs.
  42. Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars - Demonstrates that focusing only on specific parts like attention heads or weight matrices in Transformers can lead to misleading interpretability claims.
  43. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - This paper investigates the representation of truth in Large Language Models (LLMs) using true/false datasets.
  44. Interpretability at Scale: Identifying Causal Mechanisms in Alpaca - This study presents Boundless Distributed Alignment Search (Boundless DAS), an advanced method for interpreting LLMs like Alpaca.
  45. Representation Engineering: A Top-Down Approach to AI Transparency - Introduces Representation Engineering (RepE), a novel approach for enhancing AI transparency, focusing on high-level representations rather than neurons or circuits.
  46. Explaining black box text modules in natural language with language models - Natural language explanations for LLM attention heads, evaluated using synthetic text.
  47. N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models - Explains each LLM neuron as a graph.
  48. Augmenting Interpretable Models with LLMs during Training - Uses LLMs to build interpretable classifiers of text data.
  49. ChainPoll: A High Efficacy Method for LLM Hallucination Detection - ChainPoll, a novel hallucination detection methodology that substantially outperforms existing alternatives, and RealHall, a carefully curated suite of benchmark datasets for evaluating hallucination detection metrics proposed in recent literature.
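A recurring technique in the papers above (e.g. "Towards Monosemanticity" and "Sparse Autoencoders Find Highly Interpretable Features") is training a sparse autoencoder on model activations: reconstruct activations through an overcomplete hidden layer with an L1 sparsity penalty, so each hidden unit tends toward a single interpretable feature. The following is a minimal sketch of that idea in NumPy; the dimensions, learning rate, and plain-SGD loop are illustrative choices, not any paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict, l1, lr = 8, 32, 1e-3, 1e-2  # overcomplete: d_dict > d_model

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))

# Stand-in for a batch of residual-stream activations from a real model.
acts = rng.normal(size=(256, d_model))

for _ in range(200):
    f = np.maximum(0, acts @ W_enc)      # sparse feature activations (ReLU)
    recon = f @ W_dec                    # reconstruction of the activations
    err = recon - acts
    # Gradient of mean squared error plus an L1 penalty on the features.
    grad_f = err @ W_dec.T + l1 * np.sign(f)
    grad_f[f <= 0] = 0                   # ReLU gate blocks inactive features
    W_dec -= lr * f.T @ err / len(acts)
    W_enc -= lr * acts.T @ grad_f / len(acts)

mse = float((err ** 2).mean())
```

After training, individual columns of `W_dec` play the role of dictionary elements; the papers then inspect which inputs activate each feature to judge interpretability.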

LLM Interpretability Articles

Insightful articles and blog posts on LLM interpretability.

  1. Do Machine Learning Models Memorize or Generalize? - An interactive visualization exploring the phenomenon known as grokking (VISxAI hall of fame).
  2. What Have Language Models Learned? - An interactive visualization to understand how large language models work and the nature of their biases (VISxAI hall of fame).
  3. A New Approach to Computation Reimagines Artificial Intelligence - Discusses hyperdimensional computing, a novel method involving hyperdimensional vectors (hypervectors) for more efficient, transparent, and robust artificial intelligence.
  4. Interpreting GPT: the logit lens - Shows how the logit lens reveals a gradual convergence of GPT's probabilistic predictions across its layers, from initial nonsensical or shallow guesses to more refined predictions.
  5. A Mechanistic Interpretability Analysis of Grokking - Explores the phenomenon of 'grokking' in deep learning, where models suddenly shift from memorization to generalization during training.
  6. 200 Concrete Open Problems in Mechanistic Interpretability - Series of posts discussing open research problems in the field of Mechanistic Interpretability (MI), which focuses on reverse-engineering neural networks.
  7. Evaluating LLMs is a minefield - Challenges in assessing the performance and biases of large language models (LLMs) like GPT.
  8. Attribution Patching: Activation Patching At Industrial Scale - Method that uses gradients for a linear approximation of activation patching in neural networks.
  9. Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] - Introduces causal scrubbing, a method for evaluating the quality of mechanistic interpretations in neural networks.
  10. A circuit for Python docstrings in a 4-layer attention-only transformer - Examines a specific neural circuit within a 4-layer attention-only transformer responsible for generating Python docstrings.
  11. Discovering Latent Knowledge in Language Models Without Supervision - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
  12. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks - A survey of mechanistic interpretability methods.
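The logit lens article above rests on a simple operation: take the residual-stream vector at an intermediate layer and project it through the unembedding matrix, as if the model had to predict the next token right there. The toy sketch below shows the mechanics; all matrices are random stand-ins, and in real use `W_U` and the residual vectors would come from a trained transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 10

W_U = rng.normal(size=(d_model, vocab))    # unembedding matrix (stand-in)
resid_mid = rng.normal(size=d_model)       # residual stream after layer k
# Later layers add small refinements to the residual stream.
resid_final = resid_mid + 0.1 * rng.normal(size=d_model)

def lens(resid):
    """Project a residual vector to a next-token distribution (softmax)."""
    logits = resid @ W_U
    probs = np.exp(logits - logits.max())  # subtract max for stability
    return probs / probs.sum()

# Comparing the distributions layer by layer shows how the model's
# next-token "guess" sharpens as the residual stream is refined.
p_mid, p_final = lens(resid_mid), lens(resid_final)
top_mid, top_final = int(p_mid.argmax()), int(p_final.argmax())
```

The article's observation is that for GPT-2 these intermediate distributions converge gradually toward the final prediction rather than jumping at the last layer.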

LLM Interpretability Groups

Communities and groups dedicated to LLM interpretability.

  1. PAIR - Google's People + AI Research team, which works on open-source tools, interactive explorable visualizations, and interpretability research methods.
  2. Alignment Lab AI - Group of researchers focusing on AI alignment.
  3. Nous Research - Research group discussing various topics on interpretability.
  4. EleutherAI - Non-profit AI research lab that focuses on interpretability and alignment of large models.

Contributing and Collaborating

Please see CONTRIBUTING and CODE-OF-CONDUCT for details.
