Interestingly, we initially thought prompt length would be a big factor in the performance of this approach. In practice, it turned out to matter less than we predicted. For instance, Prompt #3 was 410 tokens long, while Prompt #5 was only 88 tokens. The estimate for Prompt #3 aligned fairly well with the IG approach (0.746 cosine similarity, 0.643 Pearson correlation), while the estimate for Prompt #5 underperformed (0.55 cosine similarity, 0.295 Pearson correlation). Meanwhile, Prompt #2 was only 57 tokens long and performed quite well (0.852 cosine similarity, 0.789 Pearson correlation).
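For anyone curious how those agreement numbers are computed, here's a minimal sketch of comparing two per-token attribution vectors (our estimate vs. the IG scores). The function and variable names are illustrative, not the exact code from the analysis notebook:

```python
import numpy as np
from scipy.stats import pearsonr

def compare_attributions(estimated, reference):
    """Return (cosine similarity, Pearson correlation) between two
    same-length per-token attribution vectors."""
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    cosine = float(est @ ref / (np.linalg.norm(est) * np.linalg.norm(ref)))
    pearson, _ = pearsonr(est, ref)
    return cosine, float(pearson)

# e.g. compare_attributions(estimated_scores, ig_scores)
```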
Re: our definitions of short/average/long prompts -- we weren't very rigorous with those. In general, we considered anything under 100 tokens "short", 100-300 tokens "average", and 300+ tokens "long".
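In code, those loose buckets would just be something like the following (a hypothetical helper, and the token counts obviously depend on which tokenizer you use):

```python
def prompt_length_bucket(num_tokens: int) -> str:
    """Rough bucketing of prompts by token count, per the thresholds above."""
    if num_tokens < 100:
        return "short"
    if num_tokens <= 300:
        return "average"
    return "long"
```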
Our intuition here is that the estimation's performance depends less on prompt length and more on "ambiguity" in the prompt. Again, we don't have a rigorous definition of that yet, but it's something we're working on. If you take a look at the prompts in the analysis notebook you might get a sense of what I mean: Prompts 1-3 are pretty straightforward and mechanical, while Prompts 4 & 5 are more open to interpretation. We see the estimation's performance degrade as prompts become more and more open to interpretation.