Anthropic open-sources its thought-tracing "AI microscope" tool for graphing internal layers and their meanings

Today, Anthropic open-sourced the method it uses to trace the "thoughts" of language models, so that anyone can build on its research.

From Anthropic's announcement:

Our approach is to generate attribution graphs, which (partially) reveal the steps a model took internally to decide on a particular output. The open-source library we're releasing supports the generation of attribution graphs on popular open-weights models, and a frontend hosted by Neuronpedia lets you explore the graphs interactively.
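To give a feel for what an attribution graph is, here's a minimal conceptual sketch in Python. Everything in it (the node names, the toy attribution weights) is made up for illustration; it is not the circuit-tracer API, just the shape of the data structure: tokens and features as nodes, with directed edges weighted by how much one node's activation contributes to another's.

```python
# Hypothetical sketch of an attribution graph: NOT the circuit-tracer API,
# just an illustration of the underlying data structure.
import networkx as nx

# Toy "features" that might activate on the prompt "The capital of Texas is"
nodes = [
    ("tok:Texas", {"kind": "input"}),
    ("feat:US-state", {"kind": "feature", "layer": 4}),
    ("feat:capital-city", {"kind": "feature", "layer": 9}),
    ("out:Austin", {"kind": "logit"}),
]

# Directed edges weighted by (made-up) attribution scores: how much the
# source node's activation contributes to the target node's activation.
edges = [
    ("tok:Texas", "feat:US-state", 0.82),
    ("feat:US-state", "feat:capital-city", 0.61),
    ("feat:capital-city", "out:Austin", 0.74),
]

g = nx.DiGraph()
g.add_nodes_from(nodes)
g.add_weighted_edges_from(edges)

# Trace the highest-weight path from input token to output logit.
path = nx.dag_longest_path(g, weight="weight")
print(" -> ".join(path))
```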

What’s really cool is that in the circuit tracer you can explore the activations within the layers of an AI model (Google’s Gemma or Anthropic’s Haiku) and see community annotations of the common patterns that seem to give a node its meaning or drive its activity.
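If you want to poke at raw per-layer activations yourself, independent of the circuit tracer, Hugging Face transformers exposes hidden states directly. A minimal sketch, assuming you have torch and transformers installed and access to the Gemma weights (any open-weights causal LM works the same way):

```python
# Minimal sketch: dump per-layer hidden states from an open-weights model.
# Model choice is an assumption; any Hugging Face causal LM behaves the same.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # swap in any open-weights model you can load
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("The capital of Texas is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding output plus one tensor per
# transformer layer, each of shape (batch, seq_len, hidden_dim).
for i, h in enumerate(out.hidden_states):
    print(f"layer {i:2d}: mean |activation| = {h.abs().mean():.4f}")
```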

Will you find new ways that AI works to generate the next token, with more planning and deeper reasoning than one might expect?

I guess it’s OpenAI’s turn now! :smirking_face:
