How does Codex, a descendant of GPT-3, allow a context length of 4,096 tokens while GPT-3 allows only 2,048?
I have gone through the OpenAI Codex paper but couldn't find any information about this. Could anyone explain how the token limit was increased and what technique was used?
Codex is much smaller than GPT-3 DAVINCI, so it's more efficient, and it's fine-tuned for a single task. Remember that the token limit is somewhat arbitrary: a transformer can keep reading and emitting tokens indefinitely. The limit exists because anything higher would (1) be too expensive and (2) exceed what the technology can do — the model can't keep track of contexts that long.
I am also interested in that question, and hope that a future version of GPT will increase the current 2048 token limit.
Increasing the limit is not that easy: the computation in a standard attention mechanism scales quadratically with the context length. There are ways around this, though, e.g. [2004.05150] Longformer: The Long-Document Transformer
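To make the quadratic scaling concrete, here's a toy NumPy sketch (my own illustration, not anything from the Codex paper) comparing the size of the full attention score matrix with a Longformer-style sliding window. The window size of 256 is just an example value:

```python
import numpy as np

def full_attention_scores(n_tokens, d_model=64, seed=0):
    """Standard self-attention: every token attends to every token,
    so the score matrix has n_tokens * n_tokens entries -- O(n^2)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_tokens, d_model))
    k = rng.standard_normal((n_tokens, d_model))
    return q @ k.T / np.sqrt(d_model)  # shape (n_tokens, n_tokens)

def sliding_window_nonzeros(n_tokens, window=256):
    """Longformer-style local attention: each token attends only to a
    fixed window of neighbors, so the cost grows linearly, O(n * window)."""
    half = window // 2
    return sum(min(i + half, n_tokens - 1) - max(i - half, 0) + 1
               for i in range(n_tokens))

# Doubling the context quadruples the full attention score matrix...
print(full_attention_scores(512).size)    # 262144
print(full_attention_scores(1024).size)   # 1048576
# ...but only roughly doubles the sliding-window cost.
print(sliding_window_nonzeros(2048))
print(sliding_window_nonzeros(4096))
```

That gap is why a 4,096-token dense-attention model is so much more expensive than a 2,048-token one, and why sparse-attention variants like Longformer exist.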
Yeah, my question was along similar lines. Was Codex trained with a Longformer-style architecture, since Longformer supports 4,096 tokens? Could anyone please confirm? Thank you.
Thanks for your response. We are working on a use case where splitting the data is not an option and the entire input exceeds 4,096 tokens. We thought of fine-tuning GPT-2 with an increased context length, so we wanted to understand the logic behind it.
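One practical obstacle with that plan: GPT-2 uses a learned absolute position-embedding table with a fixed number of rows (1,024), so fine-tuning at a longer context requires growing that table somehow. A common heuristic is to initialize the new rows by interpolating the learned ones and then fine-tune. Here's a minimal NumPy sketch of that idea; the function name is my own, not from any library:

```python
import numpy as np

def extend_position_embeddings(wpe, new_len):
    """Grow a learned position-embedding table of shape (old_len, d) to
    (new_len, d) by linearly interpolating the existing rows. This only
    gives an initialization; the model still needs fine-tuning at the
    new context length to actually use the longer positions well."""
    old_len, d = wpe.shape
    # Map each new position back into the old [0, old_len - 1] range.
    new_pos = np.linspace(0.0, old_len - 1, num=new_len)
    lo = np.floor(new_pos).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    frac = (new_pos - lo)[:, None]
    return (1.0 - frac) * wpe[lo] + frac * wpe[hi]

# Toy example: pretend the learned table is (1024, 8); stretch to 2048 rows.
wpe = np.random.default_rng(0).standard_normal((1024, 8))
wpe_2048 = extend_position_embeddings(wpe, 2048)
print(wpe_2048.shape)  # (2048, 8)
```

The first and last rows of the stretched table coincide with the original first and last rows, and everything in between is a blend of neighboring learned positions. Other initializations (e.g. copying or tiling the old rows) are also used; which works best is an empirical question.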
Codex is like a baby: it loses context well before 1,000 tokens. It works well for one function at a time, more like code completion that gives you scaffolding.
I managed to make it do some advanced math, but I need to test more. Maybe it just memorized the equations.