I have a technical question: how is the context window of GPT models increased?
I want to know what differentiates gpt-3.5-turbo with a 4k context window from gpt-3.5-turbo with a 16k context window.
Is it a different model altogether? A different training methodology? Different positional embeddings? Or some other change to the model?
How is the model able to process more context? It would be great if someone could shed some light on this, with paper citations if available.