Prompt result consistency - Need some perspectives to validate understanding

An observation about prompts behaving differently based on content: consistency is hard.

  1. "Create a one-liner for the below product in less than 50 words" - Works fine; the output stays under 50 words for most products (8/10).
  2. "Create a one-liner product description in 50 words" - The result depends on the product's description and features. Although the prompt asks for 50 words, the output goes beyond 50 for certain products (works 3/10).

I was in a discussion with someone who asked why it can't return a description of fewer than 50 words every time. The description covers the product's key features, so when multiple features are present the output grows beyond 50 words. Business people see it as a software requirement: prompts should behave consistently every time. Prompt-tuning guidelines say to provide detailed descriptions and avoid complex prompts. If you ask for a one-liner rather than a full description, it can summarize in under 50 words.

This is by design: since we pick the top features and construct the description from them, the output can run to more than 50 words. This is not something that can be fine-tuned into consistency. I need some references, or positive/negative feedback, to explain this to a non-technical audience.


Hi and welcome to the forum!

Number 1 requests an action and also includes a pointer to the required information to complete the task.

Number 2 is a generalised instruction to describe anything at random, as it does not point to the thing to be worked on.

I would expect there to be a difference in the result from both.

The model performs best when you make unambiguous requests of it. If there is a loose end, the model will inevitably find it and produce unexpected output.

Thanks @Foxalabs. For the same prompt I got three results of different lengths: 26 words, 19 words, and 21 words. The request for fewer than 20 words is not honoured consistently, and the business expects fewer than 20 words in every response.

The output from the model with a temperature above 0 is non-deterministic; that is to say, you will get different answers to the same question over time.

If your client requires fewer than 20 words, then you should call the API, count the words, and make another call if the length is over 20 words; see the sketch below. Or you could try asking for 15, as that should ensure the output is mostly under 20.
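A minimal count-and-retry sketch in Python, assuming the openai>=1.0 client and an OPENAI_API_KEY in the environment; the prompt text, word limit, and attempt count are placeholders you would tune:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def short_description(prompt: str, max_words: int = 20, max_attempts: int = 3) -> str:
    """Call the model, count the words, and retry until the output fits."""
    last = ""
    for _ in range(max_attempts):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        last = resp.choices[0].message.content.strip()
        if len(last.split()) < max_words:
            return last
    return last  # nothing fit; return the last attempt as a fallback
```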

Large language models like GPT are autoregressive, meaning they are not aware of their own output until they have produced it, which makes following word counts difficult for them.

n-shot prompting:

Simply provide the model with 10 worked examples at the end of your prompt (each example output under 20 words).
For 2 of the examples, make the initial output longer than 20 words and include the correction prompt and the revised, corrected output (see the sketch below).

Teach it by example.
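A sketch of this as chat messages; the products and descriptions here are invented placeholders, and you would append more pairs up to your chosen example count:

```python
# Worked examples, ending with one deliberate over-length output and its correction.
few_shot = [
    {"role": "user", "content": "One-liner (<20 words) for: Acme noise-cancelling headphones"},
    {"role": "assistant", "content": "Wireless headphones that silence the world so your music can speak."},
    {"role": "user", "content": "One-liner (<20 words) for: Acme smart kettle"},
    {"role": "assistant", "content": "A smart kettle that boils water on your schedule, keeps tea at the "
                                     "perfect temperature, and offers full app control for every single cup."},
    {"role": "user", "content": "That is over 20 words. Please shorten it."},
    {"role": "assistant", "content": "A smart kettle that boils on your schedule, controlled from your phone."},
]

# The real request goes last, so the model imitates the corrected pattern.
messages = few_shot + [
    {"role": "user", "content": "One-liner (<20 words) for: Acme standing desk"},
]
```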

I’ve not found any benefit to more than 5 examples myself; perhaps you have had a different experience?

It depends on the model version, the variability of the inputs, and the complexity of the task.

For GPT-3.5 I would start with 10 examples for this specific task.
For GPT-4 I would begin with 3 examples.

Test, refine, test.

As long as you don’t mind setting tokens on fire, you could use a chain approach: your product description query takes a word-count parameter along with, presumably, a product description variable. Start the word count at, say, 45 words. Then in the API call, set n=4 or so to generate four outputs, collect all of the content produced, and measure each word count with a len() over the split text. If none are under 50 words, run it again with a maximum of 40 words, then 35, and so on, until one of your generated answers is under the word count. A sketch of that follows below.
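A sketch of that shrinking-budget chain, again assuming the openai>=1.0 client; product_text, the hard limit, and the 45-down-by-5 schedule are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def description_under(product_text: str, hard_limit: int = 50) -> str | None:
    # Ask for progressively fewer words until a candidate fits the hard limit.
    for budget in range(45, 10, -5):  # 45, 40, 35, ...
        resp = client.chat.completions.create(
            model="gpt-4",
            n=4,  # four candidate outputs per call
            messages=[{
                "role": "user",
                "content": f"In at most {budget} words, write a one-liner for:\n{product_text}",
            }],
        )
        for choice in resp.choices:
            text = choice.message.content.strip()
            if len(text.split()) < hard_limit:
                return text
    return None  # nothing fit; the caller can fall back or retry
```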