TLDR: To unleash the full power of Codex, we need to be able to run arbitrary per-request code at each generation step, selecting the next token from the token probability lists that GPT-3 generates. I believe this is a reasonable request that fits within the technical constraints of an API service like OpenAI's.
With GPT-3 in general, and with Codex in particular, there is significant additional knowledge that could benefit a completion but cannot be provided in the prompt text, yet can easily be applied by programmatically influencing the next-token decisions of the completion process (a sketch follows this list). These knowledge sources include:
- Context data far larger than the prompt size constraint allows
- Logistic regression models of these larger contexts
- Knowledge of the syntax structure of the target language
- Data derived from the AST of the parsed syntax
- Available external imports
- API surfaces of imports
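As a concrete (and entirely hypothetical) illustration, here is a minimal Python sketch of the kind of per-token hook I have in mind. The `select_next_token` function and its arguments are my invention, not an existing API; the point is that a filter built from any of the knowledge sources above can sit between the model's probability output and the next-token choice:

```python
# Hypothetical interface, not an existing OpenAI API. The hook receives the
# model's top-k next-token candidates and consults external knowledge (here,
# a set of tokens deemed syntactically valid by an AST/grammar analysis).
def select_next_token(
    top_logprobs: dict[str, float],  # token -> log probability, from the model
    valid_tokens: set[str],          # e.g. derived from parsing the code so far
) -> str:
    """Pick the most probable candidate that the external knowledge allows."""
    allowed = {t: lp for t, lp in top_logprobs.items() if t in valid_tokens}
    pool = allowed or top_logprobs   # fall back if the filter removes everything
    return max(pool, key=pool.get)
```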
Based on my experience using GPT-2, my understanding of how GPT-3 runs a completion is conceptually as follows:
- The underlying neural network model maintains a significant (multi-GB) “state” in high-performance (expensive) RAM/SRAM during the processing of a single completion / API call.
- On an API request, a worker (or worker slice) is allocated which has exclusive access to an instance of this stateful model.
- The state is “primed” by executing across the prompt input tokens.
- New completion tokens are generated one-by-one, in series: a simple external process takes the list of next-token probabilities from the model, selects one (via temperature, top_p, logit bias, etc.), and passes that token back into the model, which then produces the next set of token probabilities, and so on (sketched in code after this list).
- At the end of the API call, this expensive stateful worker is released back to the pool.
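Putting those steps together, the loop I am describing looks conceptually like the following sketch. `model` and its `prime`/`step` methods are stand-ins of my own for the real worker internals; the proposal is simply to let the caller supply `select` in place of the built-in temperature/top_p/logit-bias sampler:

```python
# Conceptual sketch only; `model` stands in for the expensive stateful worker.
def run_completion(model, prompt_tokens: list[str], select, max_tokens: int):
    probs = model.prime(prompt_tokens)  # prime the multi-GB state on the prompt
    output = []
    for _ in range(max_tokens):    # tokens are generated one-by-one, in series
        token = select(probs)      # <-- the proposed per-token arbitrary code
        output.append(token)
        probs = model.step(token)  # feed the choice back in for the next step
    return output                  # worker is then released back to the pool
```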
Because the cost of running the model is likely proportional to the wall-clock time of the entire API request, not just the fraction of that time spent “in the model”, and because generation is sequential, requiring a round-trip for each token, this arbitrary code would almost certainly need to run on OpenAI-controlled hardware to keep latency to a minimum, and to be bounded to (sub-millisecond?) per-token processing times.
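One way such a bound might be enforced (again, purely a sketch of my own, and the 0.5 ms figure is an illustrative guess) is a guard that detects when the user hook overruns its budget and falls back to the default sampler, so a slow hook cannot stall the expensive worker:

```python
import time

TOKEN_BUDGET_S = 0.0005  # assumed sub-millisecond per-token budget

def guarded_select(user_select, default_select, probs):
    start = time.perf_counter()
    token = user_select(probs)
    if time.perf_counter() - start > TOKEN_BUDGET_S:
        # Overrun detected after the fact; a production system would preempt
        # (or penalize/terminate the call) rather than merely fall back.
        return default_select(probs)
    return token
```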
There are detailed design and architecture considerations for such an API extension and its implementation, which I would be happy to discuss further on request. Significant among these is the desire to pass in both a code slug and a parameter slug, giving the arbitrary code additional context that varies per API call.
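To make the code-slug/parameter-slug split concrete, a request might look something like the following. Every field name below other than `model`, `prompt`, and `max_tokens` is hypothetical:

```python
request = {
    "model": "code-davinci-002",
    "prompt": "def load_config(path):",
    "max_tokens": 64,
    # Code slug: registered once, reused across many calls.
    "token_selector": "my-ast-filter-v3",
    # Parameter slug: cheap per-call context for that code to consume.
    "selector_params": {"in_scope_symbols": ["json", "os", "pathlib"]},
}
```

The split matters because the code slug can be validated, compiled, and cached once on OpenAI's side, while the parameter slug stays small enough to ship with every request.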