The Portability of an LLM Prompt

Does anyone know of any decent research on “prompt portability”? By that I mean a study of how easily a prompt can be reused across different LLMs.

A couple of years ago, I remember seeing a paper that studied an LLM’s robustness to slightly different variations of a prompt (sort of like an ablation study). “Portability” would be the opposite: a prompt’s robustness across different LLMs. (I am aware this is not trivial, given differing model architectures and training objectives, but it’s a curiosity that partially came about from this work: [2307.09009] How is ChatGPT's behavior changing over time?)

If there are no papers, I would love to hear about others’ anecdotal experiences with the matter!


You should know that there are many flaws in the methodology of that paper, and it is not widely considered to be serious work.


Thanks! I am aware it’s controversial, to say the least. I don’t put much stock in the content of any arXiv paper, but the aforementioned curiosity still stands.

To be clearer for future readers: my mention of that paper was tangential, and the portability comparison does not have to be limited to OpenAI-specific LLMs.

From experience, different LLMs are trained on different styles of prompt. You have to research your specific LLM to find out what that style is in order to get the best results.

However, OpenAI’s models appear to be the most forgiving of prompt style compared to other models out there, so no big research project is needed for them. Maybe it is because they have a ton of parameters and were trained over a wide variety of prompts.

But smaller models (<100B parameters), look out!

In my experience, every prompt is 100% portable (and predictable) when a few examples are included in the prompt (assuming each LLM used is equally capable).

I have so far tested this across AI21 (large models), Llama 2 70B, GPT-3.5, and GPT-4.
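
For anyone curious what that looks like in practice, here is a minimal sketch of the few-shot pattern described above, where the prompt is plain text that any chat or completion endpoint can accept. The classification task, labels, and example reviews are made up for illustration:

```python
# A minimal sketch of few-shot prompting: a task instruction plus a
# handful of worked examples, assembled into one model-agnostic prompt.
# The task, labels, and examples below are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    ("The delivery was two weeks late and the box was crushed.", "negative"),
    ("Setup took five minutes and it works perfectly.", "positive"),
    ("It arrived on the date listed on the order page.", "neutral"),
]

def build_prompt(text: str) -> str:
    """Assemble a model-agnostic few-shot classification prompt."""
    lines = ["Classify each review as positive, negative, or neutral.", ""]
    for review, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {review}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The unlabeled item goes last, in exactly the same layout as the
    # examples, so the model's most likely continuation is the label.
    lines.append(f"Review: {text}")
    lines.append("Label:")
    return "\n".join(lines)

print(build_prompt("Battery life is fine but the screen scratches easily."))
```

Because the examples pin down both the task and the output format, the same string tends to transfer across providers without per-model rewording.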


In our experience, smaller models need more careful prompt engineering.
Overall it depends on the task being performed: generative use cases can be less forgiving across models than extraction, for example, where each model has its own behaviour you need to hack around (e.g. find data and export it as JSON).
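
To make the JSON-extraction point concrete, here is a hedged sketch of the kind of hack-around this involves on the output side. It assumes the quirks are the common ones (markdown code fences around the object, or prose before and after it); the set of wrappers handled is an assumption, not an exhaustive list:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Defensively parse a JSON object out of a model reply.

    Different models wrap JSON output differently (bare object, markdown
    code fence, surrounding prose), so strip the common wrappers before
    parsing. The cases handled here are assumed quirks, not a full list.
    """
    text = raw.strip()
    # Drop a ```json ... ``` (or plain ```) fence if the model added one.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # If prose surrounds the object, fall back to the outermost braces.
    if not text.startswith("{"):
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start : end + 1]
    return json.loads(text)

# Example: the same extraction prompt, but one model fenced its answer.
print(parse_model_json('```json\n{"name": "Ada", "year": 1843}\n```'))
```

With a tolerant parser like this, the extraction prompt itself can stay identical across models and the per-model behaviour gets absorbed in post-processing.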