I have come across many cases where I'd like an LLM to generate ground-truth data and scores for me, but the results are often very poor.
Is it possible that this struggle (in GPT-4, for example) to both generate and evaluate responses exists because the two tasks fundamentally conflict?
That is, generation is creative and open-ended, while evaluation requires detached, critical judgment. The training GPT-4 went through focused on either creating content or evaluating pre-written examples, but not on doing both simultaneously. Could this be why I so often see subjective, mismatched scores in these generation cases? Thanks
I previously had success with GPT-4 when using a role-based approach:
Example:
Role 1: Write a 250-word paragraph about topic X.
Role 2: Count the words in the previous paragraph sequentially (e.g., 1. Word, 2. Next word, etc.).
Role 3: If the number of words is not exactly 250, switch back to Role 1 to add or remove words until the target is reached. Otherwise, the task is complete.
This process is repeated until the task is finalized.
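To make the loop concrete, here is a rough Python sketch of the structure. The prompts, word target, and the `ask` helper are placeholders rather than my exact setup, and the word count is done in code here instead of by the model's Role 2, which is more reliable but deviates from the prompt-only version described above:

```python
# Rough conceptual sketch of the role loop (prompts and names are placeholders).
# `ask` stands in for whatever function sends a prompt to the model and
# returns its text reply.

TARGET = 250  # placeholder word target

def role_loop(ask, topic: str, max_rounds: int = 5) -> str:
    # Role 1: generate the paragraph.
    paragraph = ask(f"Role 1: Write a {TARGET}-word paragraph about {topic}.")

    for _ in range(max_rounds):
        # Role 2: count the words. Done in code here, which deviates from the
        # prompt-only setup where the model counts them sequentially itself.
        count = len(paragraph.split())
        if count == TARGET:
            break  # Role 3's "otherwise, the task is complete" branch

        # Role 3: send the paragraph back with the count as feedback.
        paragraph = ask(
            f"Role 3: The paragraph below has {count} words instead of {TARGET}. "
            f"Add or remove words until it is exactly {TARGET} words long, "
            f"then return only the revised paragraph.\n\n{paragraph}"
        )

    return paragraph
```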
This approach worked reasonably well, improving model performance to as high as 95%-99% on what are, by today's standards, outdated tests. Another test I ran involved generating "ten sentences with word X at the end." In contrast, simply prompting the model to check its result only increased effectiveness by about 5%-10% over a 60-70% baseline with GPT-4 (0314).
My takeaway is that using explicit roles helps the model distance itself from its previously generated tokens, leading to better final answers. It also helps to break the task down into smaller ones.
Writing these prompts is a bit of a challenge. At some point you may observe strange behavior, like non-truthful responses or outright cheating. But it's fun, and I suggest you give it a try.
Hi, thanks very much for your insights. I am working with the GPT-4o API in a Python notebook. Is the role-based approach you mentioned possible through the API? If so, how did you structure it? Thanks!