Mapping interpretable features in a text embedding space

21 March 2024
Manhattan, New York
1 min

Covers work done between Dec 2023 and Feb 2024.

Abstract

Interface — Language models and embeddings are completely opaque systems. Even when we go to great lengths to surface what the model is “thinking” through prompting, the process is expensive and unreliable. This work explores an automated way to directly probe the vectors that embedding models output and “map out” which human-interpretable attributes of language specific directions in a model’s embedding space represent. This opens doors to better ways to understand and debug embeddings, and potentially to edit text more precisely and directly.
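To make “probing a direction” concrete, here is a minimal sketch (the vectors, shapes, and scaling factor below are illustrative assumptions, not artifacts from this work): projecting an embedding onto a unit-norm feature direction yields a scalar score for that attribute, and moving the embedding along the direction nudges the representation toward or away from it.

```python
import numpy as np

# Hypothetical stand-ins for a model's text embedding and a learned,
# human-interpretable feature direction in the same space.
embedding = np.random.randn(768)
feature_direction = np.random.randn(768)
feature_direction /= np.linalg.norm(feature_direction)

# Projection onto the direction: a scalar "activation" for that feature.
# Larger values mean the text expresses the attribute more strongly.
activation = float(embedding @ feature_direction)

# "Editing" in embedding space: move the embedding along the direction
# to strengthen (or, with a negative scale, weaken) the attribute.
edited = embedding + 2.0 * feature_direction
```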

Model — When we are debugging or optimizing embeddings, we may want to know exactly which attributes of the input data the dimensions of the embedding represent. I apply recent research on dictionary learning with sparse autoencoders to demonstrate a proof of concept for automatically discovering thousands of human-interpretable features in a text embedding space. These features let us interpret and edit text directly in the embedding space, and in the future may open up ways to manually tune embeddings for specific datasets and use cases.
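As a rough sketch of the dictionary-learning setup this describes (the embedding dimension, expansion factor, and L1 coefficient below are assumptions for illustration, not values from this work): a sparse autoencoder reconstructs each embedding from an overcomplete set of mostly-zero feature activations, and each decoder column becomes a candidate interpretable direction in the embedding space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Sparse autoencoder over text embeddings (dictionary learning)."""

    def __init__(self, embed_dim: int = 768, n_features: int = 8192):
        super().__init__()
        # Overcomplete dictionary: many more features than embedding dims.
        self.encoder = nn.Linear(embed_dim, n_features)
        self.decoder = nn.Linear(n_features, embed_dim)

    def forward(self, x: torch.Tensor):
        # Sparse feature activations: one (mostly zero) coefficient per
        # learned dictionary direction.
        f = F.relu(self.encoder(x))
        # Reconstruction of the original embedding from the sparse code.
        x_hat = self.decoder(f)
        return x_hat, f


def train_step(model, optimizer, embeddings, l1_coeff=1e-3):
    """One step: reconstruction loss plus an L1 sparsity penalty."""
    x_hat, f = model(embeddings)
    recon_loss = F.mse_loss(x_hat, embeddings)
    sparsity_loss = f.abs().mean()
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy usage: random vectors stand in for real text embeddings.
    model = SparseAutoencoder(embed_dim=768, n_features=8192)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    batch = torch.randn(64, 768)
    print(train_step(model, opt, batch))
```

Interpreting a feature then amounts to inspecting which input texts most strongly activate its decoder direction, and editing amounts to adjusting those activations before decoding back into the embedding space.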


Under construction…


