Not All Language Model Features Are Linear

(huggingface.co)

9 points · JessicaWong90 · 2y ago · 7 comments

kromem 2y ago

One of the questions I've been thinking about a lot looking at the past year of interpretability research is just how much of what we are finding is "what we're attuned to find" as opposed to "what's actually there."

Are we only measuring the tip of the iceberg, and have we just converged on getting better at measuring the tip?

jengels_ 2y ago

I feel like unsupervised methods like Anthropic's SAEs can be argued to find things we're not looking for (their most recent work is from a couple of days ago: https://transformer-circuits.pub/2024/scaling-monosemanticit...). And we can get some sense of how "much" of the model they're recovering by looking at their downstream reconstruction loss.
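
As a rough illustration of the "downstream reconstruction loss" idea: one common way to summarize it is to splice the SAE's reconstruction back into the model and compare the language-model loss against two baselines, the untouched model and the model with that activation zeroed out. The sketch below assumes you have already measured those three losses; the function name, argument names, and numbers are made up for illustration and are not taken from the paper or the linked post.

```python
def fraction_of_loss_recovered(clean_loss: float,
                               spliced_loss: float,
                               ablated_loss: float) -> float:
    """Share of the clean-vs-ablated loss gap that the SAE reconstruction closes.

    clean_loss   -- LM loss of the unmodified model
    spliced_loss -- LM loss when the activation is replaced by the SAE reconstruction
    ablated_loss -- LM loss when the activation is replaced by zeros
    All three are assumed to be measured on the same evaluation text.
    """
    gap = ablated_loss - clean_loss
    if gap <= 0:
        # Zero-ablation did not hurt the model, so the ratio is meaningless.
        raise ValueError("ablated_loss must be greater than clean_loss")
    return (ablated_loss - spliced_loss) / gap


# Hypothetical numbers, purely to show the arithmetic.
score = fraction_of_loss_recovered(clean_loss=2.0, spliced_loss=2.5, ablated_loss=7.0)
print(f"fraction of loss recovered: {score:.2f}")
```

A score near 1 means the reconstruction barely changes the model's predictions; a score near 0 means it is about as damaging as deleting the activation. It says nothing about whether the recovered features are the "right" ones, which is the concern raised above.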

kromem 2y ago

I'm skeptical about the 'completeness' of SAEs when it comes to comprehensive discovery of features:

https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/research-r...

jengels_ 2y ago

I'm one of the first authors on this paper, happy to answer any questions :)

tysonfurytoo 2y ago

If you're able to find a feature, is it possible to selectively replace it to optimize it?

Kind of like replacing a portion of unoptimized compiler output with hand-written assembly?

jengels_ 2y ago

It's a super interesting direction! That's one of the long-term goals of interp research: deconstruct model behavior into circuits of features, and then turn those circuits into code (that we can maybe even formally verify!).
