What I'm concluding is that things break down badly when you try to generate a lot of output: hallucinations, degraded reasoning, it's all there.
The best approach is to always generate small snippets of code/text, and to trim the context down to only what is absolutely required for the task. That latter part is obviously hard to satisfy, but important, as it's very likely that reasoning degrades as context grows.
One possibility is a sort of CoT-style approach where a separate prompt is responsible for extracting the required context before another prompt reasons over it.
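To make that concrete, here's a minimal sketch of the two-stage idea. Everything here is hypothetical: `call_llm` is a placeholder for whatever model API you actually use (it's stubbed out below so the control flow is visible end to end), and the prompt wording is just illustrative.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: substitute a real model call here (API client, local
    # model, etc.). Stubbed so the pipeline runs without any dependency.
    return f"<model response to {len(prompt)} chars of prompt>"


def extract_context(document: str, task: str) -> str:
    """Stage 1: ask the model to pull out only the passages needed for the task."""
    prompt = (
        "Extract only the passages from the document below that are "
        f"strictly required to accomplish this task: {task}\n\n"
        f"Document:\n{document}"
    )
    return call_llm(prompt)


def reason_over(context: str, task: str) -> str:
    """Stage 2: reason over the trimmed context, never the full document."""
    prompt = (
        f"Using only the context below, {task}\n\n"
        f"Context:\n{context}"
    )
    return call_llm(prompt)


def two_stage(document: str, task: str) -> str:
    # The reasoning prompt sees only the extracted context, keeping its
    # context window as small as possible.
    return reason_over(extract_context(document, task), task)
```

The point of the split is that the second, reasoning-heavy prompt never sees the full document, so its context stays close to the minimum the task actually needs.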
I should add a bit of additional context here, since I ran the translation task myself. I'm (very obviously) not able to read 9 different languages, so I placed the smallest/least common languages that I can read at the bottom of the list, where output degradation should be worst. Here's the exact prompt I used:
your task is to translate the user's input into nine different languages: Mandarin Chinese, Hindi, Spanish, French, Arabic, German, Finnish, Swedish and Danish.
The user's input will be in English and your task is to provide accurate translations for that input in the mentioned languages.
For each input, generate the translations following this specific format:
"Translation in English"
"Translation in Mandarin Chinese"
"Translation in Hindi"
"Translation in Spanish"
"Translation in French"
"Translation in Arabic"
"Translation in German"
"Translation in Finnish"
"Translation in Swedish"
"Translation in Danish"
Ensure each translation accurately corresponds to the user's input. Do not provide any phonetic transcriptions or pronunciation guides - only the translated text is required.
Note that this task may be a best-case example, as it essentially repeats the same content again and again, just in different languages.
I'm very cautious about using this example to infer anything else about the 32k context window; I only ran it to force as many output tokens as possible.