The Portability of an LLM Prompt

Does anyone know of any decent research on “prompt portability”? By that I mean a study of how easily a prompt can be reused across different LLMs.

A couple of years ago, I remember seeing a paper that studied an LLM’s robustness to slightly different variations of a prompt (sort of like an ablation study). “Portability” would be the opposite: a prompt’s robustness across different LLMs. (I am aware this is not trivial, given differing model architectures and training objectives, but it’s a curiosity that partially came about from this work: [2307.09009] How is ChatGPT's behavior changing over time?)

If there are no papers, I would love to hear about others’ anecdotal experiences with the matter!


You should know that there are many flaws in the methodology of that paper, and it is not widely considered to be serious work.


Thanks! I am aware it’s controversial, to say the least. I don’t put much stock in the content of any arXiv paper, but the aforementioned curiosity still stands.

To be clearer for future readers: my mention of that paper was tangential, and the portability comparison does not have to be limited to OpenAI-specific LLMs.

From experience, different LLMs are trained on different styles of prompt. You have to research your specific LLM to find out what that style is in order to get the best results.

However, OpenAI’s models appear to be the most forgiving of prompt style compared to other models out there, so no big research project is needed for them. Maybe it is because they have a ton of parameters and were trained over a wide variety of prompts.

But smaller models (<100B parameters), look out!

In my experience, every prompt is 100% portable (and predictable) when a few examples are included in the prompt (assuming each LLM used is equally capable).

I have so far tested this across AI21 (large models), Llama 2 70B, GPT-3.5, and GPT-4.
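
For anyone curious what that looks like in practice, here is a minimal sketch of the few-shot pattern described above, where the prompt is plain text that any chat or completion endpoint can accept. The classification task, labels, and example reviews are made up for illustration:

```python
# A minimal sketch of few-shot prompting: a task instruction plus a
# handful of worked examples, assembled into one model-agnostic prompt.
# The task, labels, and examples below are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    ("The delivery was two weeks late and the box was crushed.", "negative"),
    ("Setup took five minutes and it works perfectly.", "positive"),
    ("It arrived on the date listed on the order page.", "neutral"),
]

def build_prompt(text: str) -> str:
    """Assemble a model-agnostic few-shot classification prompt."""
    lines = ["Classify each review as positive, negative, or neutral.", ""]
    for review, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {review}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The unlabeled item goes last, in exactly the same layout as the
    # examples, so the model's most likely continuation is the label.
    lines.append(f"Review: {text}")
    lines.append("Label:")
    return "\n".join(lines)

print(build_prompt("Battery life is fine but the screen scratches easily."))
```

Because the examples pin down both the task and the output format, the same string tends to transfer across providers without per-model rewording.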


In our experience, smaller models need more careful prompt engineering.
Overall it depends on the task being performed: generative use cases can be less forgiving across models than extraction, for example, where each model has its own behaviour you need to hack around (e.g. find data and export it as JSON).
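
To make the JSON-extraction point concrete, here is a hedged sketch of the kind of hack-around this involves on the output side. It assumes the quirks are the common ones (markdown code fences around the object, or prose before and after it); the set of wrappers handled is an assumption, not an exhaustive list:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Defensively parse a JSON object out of a model reply.

    Different models wrap JSON output differently (bare object, markdown
    code fence, surrounding prose), so strip the common wrappers before
    parsing. The cases handled here are assumed quirks, not a full list.
    """
    text = raw.strip()
    # Drop a ```json ... ``` (or plain ```) fence if the model added one.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # If prose surrounds the object, fall back to the outermost braces.
    if not text.startswith("{"):
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            text = text[start : end + 1]
    return json.loads(text)

# Example: the same extraction prompt, but one model fenced its answer.
print(parse_model_json('```json\n{"name": "Ada", "year": 1843}\n```'))
```

With a tolerant parser like this, the extraction prompt itself can stay identical across models and the per-model behaviour gets absorbed in post-processing.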