Assume we ask the LLM to write a small story for an English language class.
Naturally, I would add some requirements through prompting and then receive an answer back.
Most of the time it needs further improvement, so I do a second iteration where I ask the LLM to challenge the text against some requirements/grading criteria.
What I don't understand is whether it would make more sense to fold my second iteration into the initial prompt as well? That would probably make for a large prompt.
On that matter, if I do split it into two separate iterations, what should I include in the first iteration and what in the second "evaluation" iteration?
I would keep it split into two steps, as that allows for a more focused evaluation and improvement.
On your second question:
Could you share with us your current prompt? It sounds like you already have a two-step model in place that we could use as a starting point and review for potential further optimization.
Thanks! Sure, here is a rough structure. First step:
sys_msg_1: "You are a parent tasked to create a bedtime story for a 5 year old."
rag_context: "Use {bedtime_story_examples.json} as an example."
sys_msg_2: "Make very short sentences and use a very simple vocabulary"
sys_msg_3: "The topic is {user_msg}"
user_msg: "preparing for a birthday party of a 1 year old puppy"

Second step:

sys_msg_1: "You are an English teacher. Your task is to review bedtime stories based on the following grading criteria."
sys_msg_2: "You need to state whether the text is suitable for 5 year olds. If the story includes a problem, state how it was resolved. State whether there is a clear red line. Finally, state whether the text has substance. State whether the ending of the text is simple and realistic."
sys_msg_3: "Based on your feedback improve the text where needed. Output the improved text."
That's roughly how it looks. It works fine but is not up to the standard I want. Firstly, I am not sure whether it really follows the rag_context I provided, because in theory the examples I provided already fulfill the grading criteria…
Secondly, it does not always strictly follow my prompt. For instance, the text is sometimes more suitable for 9 year olds or so, not for 5 year olds.
Finally, I am not sure in which order I should state the prompts. Does the order matter? Does it matter where I put them within a step, and does it matter whether I move some of step two into step one and vice versa?
Thanks for sharing. I can see some optimization potential as regards your prompts. I'm on a plane and about to take off - will come back with some thoughts tomorrow.
I'm not sure what you're doing with sys_msg_1: and all the other things like that. Your entire first request can be written as a single paragraph prompt. Just explain what you want it to do in a paragraph.
Then for your second pass, do it all in one paragraph too, something like: "Read this story and let me know if it's understandable by a 5 year old, etc." Avoid sentences that can have an infinite number of "interpretations". For example, whether the "text has substance" or not is basically meaningless and just confusing, because it can have a million different meanings. It's best for prompts to be so clear that there is only ONE thing you can possibly mean.
Here are some suggestions for refined system and user messages for each step. Consider further expanding the descriptions/requirements based on your specific needs. As has already been mentioned, the basic idea is to consolidate the individual system messages into a single coherent system message and to delineate it more clearly from the other inputs, i.e. the topic and the RAG context. Note that you should consider whether the RAG context also needs to be included for the evaluation under Step 2. Whether this is needed really depends on how important alignment with the example bedtime stories is; for now I have left it out.
Step 1: Drafting of initial story
System message: You are a storyteller bot, tasked with creating captivating bedtime stories for a 5-year-old based on a defined topic. In creating the stories, you draw inspiration from a set of example bedtime stories. Your stories are crafted using clear and simple language, suitable for a 5-year-old, with short sentences and a very basic vocabulary.
User message: Topic: """Description of topic"""; Example bedtime stories: """RAG context"""
Step 2: Review and refinement of story
System message: You are an editor responsible for reviewing and refining children's bedtime stories. For a given story provided for your review, you diligently evaluate: (1) whether the story is suitable for a 5-year-old; (2) whether the story follows a clear logic and flow; (3) whether the story has sufficient substance; … [ADD ADDITIONAL CRITERIA AS REQUIRED]. Based on the outcomes of your review, you make targeted revisions to the story. Your output consists solely of the revised bedtime story.
User message: Story for review and revision: """Output from step 1"""
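Stitched together, the two steps are then just two model calls with these consolidated messages. A rough sketch, assuming the OpenAI Python SDK (the model name, file handling and the shortened system strings are placeholders to adapt):

from openai import OpenAI

client = OpenAI()

def run_step(system_message, user_message):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

step1_system = "You are a storyteller bot, tasked with creating captivating bedtime stories ..."  # full Step 1 system message
step2_system = "You are an editor responsible for reviewing and refining children's bedtime stories ..."  # full Step 2 system message

examples = open("bedtime_story_examples.json").read()  # RAG context
topic = "preparing for a birthday party of a 1 year old puppy"

draft = run_step(step1_system, f'Topic: """{topic}"""; Example bedtime stories: """{examples}"""')
story = run_step(step2_system, f'Story for review and revision: """{draft}"""')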
I used sys_msg_1 and the like for readability. I am trying different prompts for different age levels and would just swap them around. I understand now that this may lead to worse performance. Thus, I will try to have one big f-string where I add all the system messages etc. Something along the lines of:
f"{sys_msg_1} + {sys_msg_2}...."
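or simply joined with newlines, so the literal "+" does not end up in the prompt, e.g.:

system_message = "\n".join([sys_msg_1, sys_msg_2, sys_msg_3])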
Understood, will try to have less ambiguity. Thanks!
In your case, I think what you want is the self-refine method.
Easy to test, not so easy to implement in practice as you need to convert outputs to data you can manipulate (I use Instructor [1] for that, but now you also have "structured output" [2]).
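For example, the feedback you want back from the review step can be a small schema along these lines (a Pydantic sketch; the field names are simply what the loop below expects):

from pydantic import BaseModel

class Feedback(BaseModel):
    score: int            # e.g. 0-10 overall quality
    is_good_enough: bool  # stop condition for the refinement loop
    comments: str         # concrete issues to fix in the next revision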
Very briefly:
def self_refine(system, user, system_f, user_f, max_rounds=3):
    out = generate(system, user)  # initial draft
    results = []
    for _ in range(max_rounds):
        # review pass: evaluate the current draft against the grading criteria
        feedback = generate(system_f, user_f, out)
        results.append((feedback.score, out))
        if feedback.is_good_enough:
            return out
        # revision pass: rewrite the draft using the feedback
        out = generate(system, user, out, feedback)
    # if you got here, no result was good enough, so return the best one you got
    results.sort(key=lambda r: r[0], reverse=True)
    return results[0][1]
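To make the sketch concrete: generate() is just whatever wrapper you use around the model call. With the OpenAI SDK and a Feedback schema like the one above, it could look roughly like this (model name is a placeholder; adapt if you use Instructor instead):

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def generate(system, *parts, schema=None):
    messages = [{"role": "system", "content": system}]
    messages += [{"role": "user", "content": str(p)} for p in parts]
    if schema is None:
        # drafting / revision calls: return plain text
        r = client.chat.completions.create(model=MODEL, messages=messages)
        return r.choices[0].message.content
    # review call: return a parsed object (here, the Feedback model)
    r = client.beta.chat.completions.parse(
        model=MODEL, messages=messages, response_format=schema)
    return r.choices[0].message.parsed

# the review call in the loop then becomes:
#   feedback = generate(system_f, user_f, out, schema=Feedback)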