Better to include everything in the first prompt or split between first and an eval prompt?

Assume we ask the LLM to write a small story for an English language class.
Naturally, I would add some requirements through prompting and then receive an answer back.
Most of the time it needs further improvement, so I do a second iteration where I ask the LLM to challenge the text against some requirements/grading criteria.

I am not sure whether it would make more sense to include my second iteration in the initial prompt as well. That would probably result in a large prompt, though.

On that matter, if I split it up into two separate iterations, what should I include in the first iteration and what in the second “evaluation” iteration?

What works best based on your experience?

On your first question: I would continue to split it into two steps, as that allows for a more focused evaluation and enhancement.

On your second question:

Could you share with us your current prompt? It sounds like you already have a two-step model in place that we could use as a starting point and review for potential further optimization.


Thanks! Sure, here is a rough structure.
First step
sys_msg_1: “You are a parent tasked to create a bedtime story for a 5 year old.”
rag_context: “Use {bedtime_story_examples.json} as an example.”
sys_msg_2: “Make very short sentences and use a very simple vocabulary”
sys_msg_3: “The topic is {user_msg}”
user_msg: “preparing for a birthday party of a 1 year old puppy”
Second step
sys_msg_1 “You are an English teacher. Your task is to review bedtime stories based on the following grading criteria.”
sys_msg_2:“You need to state whether the text is suitable for 5 year olds. If the story includes a problem, state how it was resolved . State whether there is a clear red line. Finally, state whether the text has substance. State whether the ending of the text is simple and realistic”
sys_msg_3: “Based on your feedback improve the text where needed. Output the improved text.”

That is roughly how it looks. It works fine but is not up to the standard I want it to be. Firstly, I am not sure whether it really follows the rag_context I provided, because in theory the examples I provided already fulfill the grading criteria…
Secondly, it does not always strictly follow my prompt. For instance, the text sometimes reads as suitable for 9-year-olds or so rather than 5-year-olds.
Finally, I am not sure in which order I should state the prompts. Does the order within a step matter, and does it matter whether I move parts of step two into step one and vice versa?


Thanks for sharing. I can see some optimization potential in your prompts. I'm on a plane and about to take off - will come back with some thoughts tomorrow.


I’m not sure what you’re doing with sys_msg_1: and all the other things like that. Your entire first request can be written as a single paragraph prompt. Just explain what you want it to do in a paragraph.

Then for your second pass, do it all in one paragraph too, something like: “Read this story and let me know if it’s understandable by a 5 year old, etc.” Avoid sentences that can have an infinite number of “interpretations”. For example, whether the “text has substance” or not is basically meaningless and just confusing, because it can have a million different meanings. It’s best for prompts to be so clear that there is only ONE thing you can possibly mean.


Here are some suggestions for refined system and user messages for each step. Consider further expanding the descriptions/requirements based on your specific needs. As has already been mentioned, the basic idea is that you consolidate the individual system messages into a single coherent system message and delineate it more clearly from the other inputs, i.e. the topic and the RAG context. Note that you should consider whether there is a need to also include the RAG context for the evaluation under Step 2. Whether this is needed really depends on how important alignment with the example bedtime stories is - for now I have left it out.


Step 1: Drafting of initial story

System message: You are a storyteller bot, tasked with creating captivating bedtime stories for a 5-year-old based on a defined topic. In creating the stories, you draw inspiration from a set of example bedtime stories. Your stories are crafted using clear and simple language, suitable for a 5-year-old, with short sentences and a very basic vocabulary.

User message: Topic: '''Description of topic'''; Example bedtime stories: '''RAG context'''


Step 2: Review and refinement of story

System message: You are an editor responsible for reviewing the adequacy of, and refining, children's bedtime stories. For a given story provided for your review, you diligently evaluate: (1) Whether the story is suitable for a 5-year-old; (2) Whether the story follows a clear logic and flow; (3) Whether the story has sufficient substance; … [ADD ADDITIONAL CRITERIA AS REQUIRED]. Based on the outcomes of your review, you make targeted revisions to the story. Your output consists solely of the revised bedtime story.

User message: Story for review and revision: '''Output from step 1'''
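
If helpful, here is a rough sketch of how the two steps could be chained with the OpenAI Python client - the model name, the variable names (step1_system, step2_system, topic, rag_context) and the helper function are just placeholders, not your exact setup:

from openai import OpenAI

client = OpenAI()

# hypothetical helper wrapping a single chat completion call
def chat(system_message, user_message, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

# Step 1: draft the initial story
draft = chat(
    step1_system,  # the Step 1 system message above
    f"Topic: '''{topic}'''; Example bedtime stories: '''{rag_context}'''",
)

# Step 2: review and refine the draft
final_story = chat(
    step2_system,  # the Step 2 system message above
    f"Story for review and revision: '''{draft}'''",
)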


I used sys_msg_1 and the like for readability. I am trying different prompts for different age levels and would just swap them around. I understand now that this may lead to worse performance. Thus, I will try to have one big f-string where I add all the system messages etc… Something along the lines of:

f"{sys_msg_1} + {sys_msg_2}...."

Understood, will try to have less ambiguity. Thanks!


Thanks, will consolidate the messages and see whether the performance improves with the RAG context in step 2.


In your case, I think what you want is the self-refine method.

Easy to test, not so easy to implement in practice, as you need to convert outputs into data you can manipulate (I use Instructor [1] for that, but now you also have Structured Outputs [2]).

Very briefly:

# self-refine loop (sketch). `generate` is your own wrapper around the model
# call: the evaluation call should return structured data (score, is_good_enough),
# e.g. via Instructor or Structured Outputs; the other calls return plain text.

def self_refine(system, user, system_f, user_f, max_rounds=3):
    out = generate(system, user)  # initial draft
    results = []

    for _ in range(max_rounds):
        # evaluate the current draft against the grading criteria
        feedback = generate(system_f, user_f.format(out=out))
        results.append((feedback.score, out))

        if feedback.is_good_enough:
            return out
        # otherwise revise the draft based on the feedback and evaluate again
        out = generate(system, user, out, feedback)

    # if you got here, no result was good enough, so return the best one you got
    results.sort(key=lambda r: r[0], reverse=True)
    return results[0][1]

# system   = system prompt for drafting
# user     = user prompt (topic and examples)
# system_f = system prompt for the feedback/evaluation step
# user_f   = user prompt template for the feedback step, e.g. "Review this story: {out}"
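
For reference, the `feedback` object above could be a small Pydantic model along these lines (the field names mirror the loop; the score scale and the comments field are just my assumptions) - with Instructor you would pass it as the response_model, with Structured Outputs as the response schema:

from pydantic import BaseModel

# minimal sketch of the structured feedback returned by the evaluation call;
# adapt the fields and scale to your own grading criteria
class Feedback(BaseModel):
    score: int            # e.g. 1-10 overall rating against the grading criteria
    comments: str         # concrete points to fix in the next revision
    is_good_enough: bool  # True if the story already meets all criteria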

[1] https://python.useinstructor.com
[2] https://platform.openai.com/docs/guides/structured-outputs