Better interpretability for prompt engineering

I frequently have issues where the model ignores instructions in medium-to-long prompts. For each generated token, I would love to have a “heat map” over the prompt and context, so I can see which parts contribute most to that token’s generation.

I tried to implement this on LLaMA with integrated gradients. I got the gradient calculation working, but couldn’t get any useful information out of it.
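
For context, a minimal sketch of this kind of per-token attribution, using plain gradient-times-input saliency (simpler than full integrated gradients) on a Hugging Face causal LM; the model name below is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

prompt = "Answer in exactly three bullet points: why is the sky blue?"
ids = tok(prompt, return_tensors="pt").input_ids

# Embed the prompt manually so gradients can flow back to the embeddings.
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)

out = model(inputs_embeds=embeds)
next_logits = out.logits[0, -1]       # logits for the next token
next_id = int(next_logits.argmax())   # the token the model would emit next

# Gradient of that token's logit with respect to every prompt-token embedding.
next_logits[next_id].backward()
scores = (embeds.grad * embeds).sum(dim=-1).abs()[0]  # grad-times-input
scores = scores / scores.sum()

# Print a crude text "heat map": one weight per prompt token.
for token, score in zip(tok.convert_ids_to_tokens(ids[0]), scores.tolist()):
    print(f"{token:>15}  {score:.3f}")
```

In principle, repeating this for each generated token gives one row of the heat map described above, though as noted the raw gradients can be noisy.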

In general, more tooling for understanding how the model is interpreting the prompt would be great.


Welcome to the forum.

I’ve not heard of anything personally, but I’d love to see any heatmaps you’ve made. Hope you stick around.

Hi there.

So I have heard of a solution for this.

What you can do is use the embeddings model.

Take your prompt, embed it, and check the token cost.

Then remove each word from the prompt individually and submit each variant to the embeddings model.

Example

“Jimmy is the coolest individual I have ever met in my life”
“is the coolest individual I have ever met in my life”
“Jimmy the coolest individual I have ever met in my life”
“Jimmy is coolest individual I have ever met in my life”

You can then track the change in the token cost to determine the overall weighting of each word in the prompt.

I personally have never tried this. It does seem like an exhausting process, and depending on prompt size it may not be worth the cost, even though embeddings are cheap as chippies.
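
For anyone who wants to give it a go, here is a rough sketch of that leave-one-word-out loop, assuming the OpenAI Python SDK and the text-embedding-3-small model. Note it scores each word by how far the ablated prompt’s embedding moves (cosine distance) rather than by tracking token cost, which may not be exactly what was meant above:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

prompt = "Jimmy is the coolest individual I have ever met in my life"
words = prompt.split()

# Build one variant of the prompt per word, with that word removed.
ablated = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]

full_vec, *ablated_vecs = embed([prompt] + ablated)

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words whose removal moves the embedding the most get the highest weight.
for word, vec in zip(words, ablated_vecs):
    print(f"{word:>12}  {cosine_distance(full_vec, vec):.4f}")
```

One embeddings call per word in the prompt, so the cost scales linearly with prompt length.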

If someone has a better solution I would love to hear it.

If you do give this a go please let me know the results.