Feature Request: Run arbitrary code to select next token

TL;DR: To unleash the full power of Codex, we need to be able to supply, per request, arbitrary code that runs once per generated token and selects the next token from the token probability list that GPT-3 produces. I believe this is a reasonable request that fits within the technical constraints of an API service like OpenAI's.

Motivation

With GPT-3 in general, and with Codex in particular, there is significant additional knowledge that could benefit a completion but cannot be provided in the prompt text; it could, however, easily be applied by programmatically influencing the next-token decisions during the completion process (a sketch follows the list below). These knowledge sources include:

  • Context data much larger than fits in the prompt size constraint
  • Logistic regression models of these larger contexts
  • Knowledge of the syntax structure of the target language
  • Data derived from the AST of the parsed syntax
  • Available external imports
  • API surfaces of imports, etc.
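
To make this concrete, here is a minimal sketch (in TypeScript, since such code would presumably run in a V8-style sandbox, as discussed below) of what a per-token hook might look like when using syntax knowledge as its extra signal. Every name here (`Candidate`, `selectNextToken`, `allowedTokenIds`) is illustrative, not an existing API:

```typescript
// Hypothetical per-token selection hook; none of these names are real API.
// The host would call this once per generated token with the model's candidate
// tokens and their log-probabilities, and emit whichever token id it returns.

interface Candidate {
  tokenId: number;
  logprob: number;
}

// Example knowledge source: token ids the target-language parser would accept
// at the current position (in practice, recomputed as the completion grows).
const allowedTokenIds = new Set<number>([/* precomputed client-side */]);

function selectNextToken(candidates: Candidate[]): number {
  const byProb = [...candidates].sort((a, b) => b.logprob - a.logprob);
  // Prefer the most probable candidate that keeps the output syntactically valid.
  for (const c of byProb) {
    if (allowedTokenIds.has(c.tokenId)) return c.tokenId;
  }
  // Fall back to the model's unconstrained top choice.
  return byProb[0].tokenId;
}
```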

Background

Based on my experience using GPT-2, my understanding of how GPT-3 runs a completion is conceptually as follows:

  • The underlying neural network model maintains a significant (multi-GB) “state” in high-performance (expensive) RAM/SRAM during the processing of a single completion / API call.
  • On an API request, a worker (or worker slice) is allocated which has exclusive access to an instance of this stateful model.
  • The state is “primed” by executing across the prompt input tokens.
  • New completion tokens are generated one by one, in series: a simple external process takes the list of next-token probabilities from the model, selects one token (via temperature, top_p, logit bias, etc.), and passes that token back into the model, which then produces the next set of token probabilities, and so on (sketched in code after this list).
  • At the end of the API call, this expensive stateful worker is released back to the pool.
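
If that mental model is right, the serving loop looks roughly like the sketch below, and this feature request amounts to letting the caller supply the selection step. This is purely illustrative; the actual serving architecture is not public, and the `Model` interface here is invented:

```typescript
// Conceptual sketch of the loop described above (not actual OpenAI internals).
// The request is to let the caller supply `selectToken` instead of the
// built-in temperature / top_p / logit-bias sampler.

type TokenId = number;
type Candidate = { tokenId: TokenId; logprob: number };

interface Model {
  prime(promptTokens: TokenId[]): void;   // build up internal state from the prompt
  nextDistribution(): Candidate[];        // next-token probabilities given current state
  feed(token: TokenId): void;             // advance state by one chosen token
}

function complete(
  model: Model,
  promptTokens: TokenId[],
  maxTokens: number,
  selectToken: (candidates: Candidate[]) => TokenId,
): TokenId[] {
  model.prime(promptTokens);
  const output: TokenId[] = [];
  for (let i = 0; i < maxTokens; i++) {
    const candidates = model.nextDistribution();
    const chosen = selectToken(candidates);   // the step this request asks to customize
    output.push(chosen);
    model.feed(chosen);
  }
  return output;
}
```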

Likely concerns: Latency and Security

Because the cost of running the model is likely proportional to the wall-clock time of the entire API request, not just the fraction of that time spent “in the model”, and because the process is sequential, requiring a round trip for each token, it is likely a requirement that this arbitrary code run on OpenAI-controlled hardware, to keep latency to a minimum, and that it be bounded to (sub-millisecond?) per-token processing times.

Luckily, there is significant precedent for running lightweight untrusted code on secured systems, most obviously with modern JavaScript engines like V8, which successfully run untrusted code securely in web browsers. V8 is also used to run arbitrary untrusted code in other secure, high-performance contexts, such as Deno and the Cloudflare Workers runtime, and likely others. Other approaches could be borrowed from “cloud functions” runtimes, including Amazon’s open-source Firecracker micro-VMs. If the architecture is suitable, perhaps there is even a cross-promotion opportunity with Azure Functions.
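
Purely to illustrate the shape of “compile the submitted code once per request, then invoke it cheaply once per token, under a hard time limit”, here is a sketch using Node's built-in vm module. To be clear, vm on its own is not a security boundary, and this is not a proposal for how OpenAI should implement it; a production system would rely on hardened V8 isolates or micro-VMs as noted above:

```typescript
import * as vm from "node:vm";

// Illustrative only. Node's `vm` module is NOT a real security boundary; a
// production system would use hardened V8 isolates or micro-VMs.

// A user-submitted "code slug": an expression that reads `candidates` and
// `params` from its sandbox and evaluates to the chosen token id.
const userCode = `
  candidates.reduce((best, c) => (c.logprob > best.logprob ? c : best)).tokenId
`;

const script = new vm.Script(userCode);                     // compile once per API request
const sandbox: { candidates?: unknown; params?: unknown } = {};
vm.createContext(sandbox);                                  // contextify; expose no host globals

function selectNextToken(
  candidates: { tokenId: number; logprob: number }[],
  params: unknown,
): number {
  sandbox.candidates = candidates;
  sandbox.params = params;
  // Each per-token evaluation gets a hard wall-clock limit (milliseconds).
  return script.runInContext(sandbox, { timeout: 5 });
}
```

The important property is that the expensive work (receiving, validating, and compiling the submitted code) happens once per request, while the per-token step is just a bounded evaluation against data the host already holds in memory.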

Further details

There are many detailed design and architecture considerations for such an API extension and its implementation, which I would be happy to discuss further on request. Significant among these is a desire to pass in both a code slug and a parameter slug, to give the arbitrary code additional context that varies per API call.
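
For example, the request payload might carry both pieces alongside the normal completion parameters, roughly like the sketch below. Every field name here is invented for illustration; none of them exist in the current API:

```typescript
// Entirely hypothetical request payload illustrating the code slug +
// parameter slug idea; none of these fields exist in the current API.
const request = {
  model: "davinci-codex",                  // model name for illustration only
  prompt: "def parse_config(path):",
  max_tokens: 256,
  // The arbitrary per-token selection code, shipped once per request...
  token_selector_code:
    "function selectNextToken(candidates, params) { /* ... */ }",
  // ...and per-request context made available to that code on every call.
  token_selector_params: {
    allowed_token_ids: [],                 // e.g. derived from a client-side AST
    project_symbols: ["parse_config", "load_yaml"],
  },
};
```

Keeping the two separate would presumably let the code slug be validated and compiled once per request, while the (potentially large) parameter slug carries the per-call context such as AST-derived data or import surfaces.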
