GPT-4 vs GPT-4-32k Performance for < 8k Token Prompts

Has anyone tested the cognition of GPT-4 vs. GPT-4-32k on prompts with less than 8,000 tokens?

This is weird! Why would I use GPT-4-32k on a prompt short enough to fit in GPT-4, which costs half as much? Because we’re all having fun here! It’s for science!


I have not, but perhaps as a point of reference, the recent “Lost in the Middle” paper (arXiv:2307.03172, “Lost in the Middle: How Language Models Use Long Contexts”) compared gpt-3.5-turbo to gpt-3.5-turbo-16k, and Claude 1.3 9k to Claude 1.3 100k. Both sets of models had nearly identical performance within their shared context windows.

It’s a good question to ask. For example, if gpt-3.5 16k outperformed the 4k version even at the same context lengths, then for some applications it’d be well worth paying for the 16k model even for small API calls. The same could be true for gpt-4 32k.
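If anyone wants to run this experiment, a minimal sketch of a comparison harness might look like the following. This is hypothetical code, not anything from the thread: `compare_models` and `agreement_rate` are names I made up, and the `ask` callable is left injectable so you can wrap whatever client you use (e.g. the OpenAI Python SDK) without the harness itself depending on it.

```python
def compare_models(prompts, models, ask):
    """For each prompt, collect one answer per model.

    ask(model, prompt) -> str is supplied by the caller, e.g. a thin
    wrapper around an OpenAI chat-completions call.
    """
    results = {}
    for prompt in prompts:
        # Send the identical short prompt to every model under test.
        results[prompt] = {model: ask(model, prompt) for model in models}
    return results


def agreement_rate(results, model_a, model_b):
    """Fraction of prompts where the two models gave identical answers."""
    if not results:
        return 0.0
    matches = sum(
        1
        for answers in results.values()
        if answers[model_a].strip() == answers[model_b].strip()
    )
    return matches / len(results)
```

With a real client you would pass something like `ask = lambda model, prompt: client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}]).choices[0].message.content`, keep every prompt under 8k tokens, and compare answer quality (exact-match agreement is a crude proxy; for open-ended prompts you would want a grading rubric instead).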