TLDR: To unleash the full power of Codex, we need to be able to run arbitrary per-request code at each generation step, selecting the next token from the token probability lists that GPT-3 generates. I believe this is a reasonable request that fits within the technical constraints of an API service like OpenAI's.
With GPT-3 in general, and with Codex in particular, there is significant additional knowledge that could benefit a completion but cannot be provided in the prompt text, yet can easily be applied by programmatically influencing the next-token decisions of the completion process (a sketch follows this list). These knowledge sources include:
- Context data far larger than the prompt size constraint allows
- Logistic regression models of these larger contexts
- Knowledge of the syntax structure of the target language
- Data derived from the AST of the parsed syntax
- Available external imports
- API surfaces of imports
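As a concrete (and entirely hypothetical) illustration, here is a minimal Python sketch of the kind of per-token hook I have in mind. The `select_next_token` function and its arguments are my invention, not an existing API; the point is that a filter built from any of the knowledge sources above can sit between the model's probability output and the next-token choice:

```python
# Hypothetical interface, not an existing OpenAI API. The hook receives the
# model's top-k next-token candidates and consults external knowledge (here,
# a set of tokens deemed syntactically valid by an AST/grammar analysis).
def select_next_token(
    top_logprobs: dict[str, float],  # token -> log probability, from the model
    valid_tokens: set[str],          # e.g. derived from parsing the code so far
) -> str:
    """Pick the most probable candidate that the external knowledge allows."""
    allowed = {t: lp for t, lp in top_logprobs.items() if t in valid_tokens}
    pool = allowed or top_logprobs   # fall back if the filter removes everything
    return max(pool, key=pool.get)
```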
Based on my experience using GPT-2, my understanding of how GPT-3 runs a completion is conceptually as follows:
- The underlying neural network model maintains a significant (multi-GB) “state” in high-performance (expensive) RAM/SRAM during the processing of a single completion / API call.
- On an API request, a worker (or worker slice) is allocated which has exclusive access to an instance of this stateful model.
- The state is “primed” by executing across the prompt input tokens.
- New completion tokens are generated one-by-one, in series: a simple external process takes the list of next-token probabilities from the model, selects one (via temperature, top_p, logit bias, etc.), and passes that token back into the model, which then produces the next set of token probabilities, and so on (sketched in code after this list).
- At the end of the API call, this expensive stateful worker is released back to the pool.
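Putting those steps together, the loop I am describing looks conceptually like the following sketch. `model` and its `prime`/`step` methods are stand-ins of my own for the real worker internals; the proposal is simply to let the caller supply `select` in place of the built-in temperature/top_p/logit-bias sampler:

```python
# Conceptual sketch only; `model` stands in for the expensive stateful worker.
def run_completion(model, prompt_tokens: list[str], select, max_tokens: int):
    probs = model.prime(prompt_tokens)  # prime the multi-GB state on the prompt
    output = []
    for _ in range(max_tokens):    # tokens are generated one-by-one, in series
        token = select(probs)      # <-- the proposed per-token arbitrary code
        output.append(token)
        probs = model.step(token)  # feed the choice back in for the next step
    return output                  # worker is then released back to the pool
```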
Because the cost of running the model is likely proportional to the wall-clock time of the entire API request, not just the fraction of that time spent “in the model”, and because generation is sequential, requiring a round-trip for each token, this arbitrary code would almost certainly need to run on OpenAI-controlled hardware to keep latency to a minimum, and to be bounded to (sub-millisecond?) per-token processing times.
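One way such a bound might be enforced (again, purely a sketch of my own, and the 0.5 ms figure is an illustrative guess) is a guard that detects when the user hook overruns its budget and falls back to the default sampler, so a slow hook cannot stall the expensive worker:

```python
import time

TOKEN_BUDGET_S = 0.0005  # assumed sub-millisecond per-token budget

def guarded_select(user_select, default_select, probs):
    start = time.perf_counter()
    token = user_select(probs)
    if time.perf_counter() - start > TOKEN_BUDGET_S:
        # Overrun detected after the fact; a production system would preempt
        # (or penalize/terminate the call) rather than merely fall back.
        return default_select(probs)
    return token
```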
There are detailed design and architecture considerations for such an API extension and its implementation, which I would be happy to discuss further on request. Significant among these is the desire to pass in both a code slug and a parameter slug, giving the arbitrary code additional context that varies per API call.
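To make the code-slug/parameter-slug split concrete, a request might look something like the following. Every field name below other than `model`, `prompt`, and `max_tokens` is hypothetical:

```python
request = {
    "model": "code-davinci-002",
    "prompt": "def load_config(path):",
    "max_tokens": 64,
    # Code slug: registered once, reused across many calls.
    "token_selector": "my-ast-filter-v3",
    # Parameter slug: cheap per-call context for that code to consume.
    "selector_params": {"in_scope_symbols": ["json", "os", "pathlib"]},
}
```

The split matters because the code slug can be validated, compiled, and cached once on OpenAI's side, while the parameter slug stays small enough to ship with every request.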