My understanding is that for a lot of models the usable context length is largely VRAM-bound, so it could very well be the same model, just running on cheaper hardware for shorter contexts.
This could also explain why the OpenAI API charges by input tokens and has an output limit; perhaps they route each request to nodes with a specific hardware configuration based on prompt length.
flowchart TD
    gpt4["GPT-4"] --> a["tiktoken +4k"]
    a --> n8k["8k"]
    a --> n16k["16k"]
    a --> n24k["24k"]
    a --> n32k["32k"]
    a --> etc["..."]
    a --> n128k["128k"]
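Purely for illustration, a rough sketch of what that kind of length-based routing could look like, assuming tiktoken for the token count; the bucket sizes and the pool-naming scheme are made up, not anything OpenAI has documented:

    import tiktoken

    # Hypothetical context tiers, mirroring the flowchart above.
    CONTEXT_BUCKETS = [8_192, 16_384, 24_576, 32_768, 131_072]

    def pick_bucket(prompt: str, model: str = "gpt-4") -> int:
        """Count prompt tokens and return the smallest context tier that fits."""
        enc = tiktoken.encoding_for_model(model)
        n_tokens = len(enc.encode(prompt))
        for bucket in CONTEXT_BUCKETS:
            if n_tokens <= bucket:
                return bucket
        raise ValueError(f"prompt of {n_tokens} tokens exceeds the largest tier")

    # A router could then dispatch to a node pool sized for that tier, e.g.
    # pool = f"gpt4-{pick_bucket(prompt) // 1024}k"  (hypothetical pool name)

If something like this were happening, charging by input tokens would map fairly directly onto which hardware tier a request occupies.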