I have come across many cases where I'd like an LLM to generate ground-truth data and scores for me, but the results are often very poor.
Is it possible that this struggle (in GPT-4, for example) to both generate and evaluate responses exists because the two tasks fundamentally conflict?
That is, generation is creative and open-ended, while evaluation requires detached, critical judgment. The training GPT-4 went through focused on either creating content or evaluating pre-written examples, but not on doing both simultaneously. Could this be why I so often see subjective, mismatched scores in these generation cases? Thanks
I previously had success with GPT-4 when using a role-based approach:
Example:
Role 1: Write a 250-word paragraph about topic X.
Role 2: Count the words in the previous paragraph sequentially (e.g., 1. Word, 2. Next word, etc.).
Role 3: If the number of words is not exactly 250, switch back to Role 1 to add or remove words until the target is reached. Otherwise, the task is complete.
This process is repeated until the task is finalized.
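To make the loop concrete, here is a rough Python sketch of the structure. The prompts, word target, and the `ask` helper are placeholders rather than my exact setup, and the word count is done in code here instead of by the model's Role 2, which is more reliable but deviates from the prompt-only version described above:

```python
# Rough conceptual sketch of the role loop (prompts and names are placeholders).
# `ask` stands in for whatever function sends a prompt to the model and
# returns its text reply.

TARGET = 250  # placeholder word target

def role_loop(ask, topic: str, max_rounds: int = 5) -> str:
    # Role 1: generate the paragraph.
    paragraph = ask(f"Role 1: Write a {TARGET}-word paragraph about {topic}.")

    for _ in range(max_rounds):
        # Role 2: count the words. Done in code here, which deviates from the
        # prompt-only setup where the model counts them sequentially itself.
        count = len(paragraph.split())
        if count == TARGET:
            break  # Role 3's "otherwise, the task is complete" branch

        # Role 3: send the paragraph back with the count as feedback.
        paragraph = ask(
            f"Role 3: The paragraph below has {count} words instead of {TARGET}. "
            f"Add or remove words until it is exactly {TARGET} words long, "
            f"then return only the revised paragraph.\n\n{paragraph}"
        )

    return paragraph
```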
This approach worked reasonably well, improving model performance to as high as 95%-99% on what are, by today's standards, outdated tests. Another test I ran involved generating "ten sentences with word X at the end." In contrast, simply prompting the model to check its result only increased effectiveness by about 5%-10% over a 60-70% baseline with GPT-4 (0314).
My takeaway is that using explicit roles helps the model distance itself from its previously generated tokens, leading to better final answers. It also helps to break the task down into smaller ones.
Writing these prompts is a bit of a challenge. At some point you may observe strange behavior, like non-truthful responses or outright cheating. But it's fun, and I suggest you give it a try.
Hi, thanks very much for your insights. I am working with the GPT-4o API in a Python notebook. Is the role-based approach you mentioned possible through the API? If so, how did you structure it? Thanks!