Feature request: token injection during streaming for structured output generation

Hi all!

Context
Generating structured data responses (e.g. typed list, JSON) currently requires including complete response schemas in the prompt and instructing the model to output the correct format.

For example, if I want a JSON object with specific keys and value types, say

{
    "name": str,
    "age": int
}

my prompt includes this schema and an instruction to respond in the specified JSON format, and I expect/hope that the model outputs the desired format.
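For concreteness, a minimal sketch of this baseline approach using the Python openai client (the model name, prompt wording, and example input are placeholders, not our actual prompts):

import json
from openai import OpenAI

client = OpenAI()

schema = '{\n    "name": str,\n    "age": int\n}'

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Respond only with JSON matching this schema:\n" + schema},
        {"role": "user", "content": "Extract the person's name and age from: Eric is 33."},
    ],
)

# May raise if the model deviates from the schema (see Problem 1 below).
data = json.loads(response.choices[0].message.content)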

Problem 1
The model may fail to produce a response conforming to the schema. A common failure (even with GPT-4), for example, is returning boolean values as True and False rather than the lowercase true and false required by the JSON spec. These sorts of minor issues are common (especially when generating code or database queries), and a nontrivial amount of engineering is needed to parse and validate responses.
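To give a sense of that engineering, here is a rough sketch of the kind of repair-and-validate code the caller ends up writing (the specific fixes are just the ones mentioned above, not an exhaustive list):

import json

def parse_model_json(text: str) -> dict:
    """Best-effort parse of a model response that is supposed to be JSON."""
    # Strip any surrounding prose or markdown fences the model sometimes adds.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    candidate = text[start:end + 1]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Common deviations: Python-style literals instead of JSON literals.
        repaired = (candidate
                    .replace(": True", ": true")
                    .replace(": False", ": false")
                    .replace(": None", ": null"))
        return json.loads(repaired)  # may still raise; further repairs omitted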

Problem 2
The prompt token count increases because the schema has to be spelled out. For more complex JSON data, the number of tokens needed to express the desired response format can be significant.

Problem 3 (main problem)
A significant number of tokens are generated in the response just to reproduce the schema. This can meaningfully increase both response time and cost. In our use-cases, the increase in response time is prohibitive and constrains our use of GPT-4, requiring us to use less capable models with better response times. These less capable models are sometimes unable to reliably produce solutions to the problem at all, leading us to abandon the use-case entirely.

Continuing with the example given above, say the model responds to a prompt with

{
    "name": "Eric",
    "age": 33
}

Using the web token counter available on the OpenAI website for demonstration (apologies, it seems I can’t include links in my post), this response contains 21 tokens, but only the 4 tokens required for Eric" and 33\n are used to produce the desired output data. The remaining 81% of tokens are “wasted” reproducing the response schema. Critically, 81% of the response time is due to reproducing these redundant tokens.
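A similar count can be checked locally with tiktoken (assuming the cl100k_base encoding used by GPT-4; the exact figures may differ slightly from the web counter):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / gpt-3.5-turbo encoding

response = '{\n    "name": "Eric",\n    "age": 33\n}'
useful = ['Eric"', ' 33\n']  # the only spans carrying new information

total_tokens = len(enc.encode(response))
useful_tokens = sum(len(enc.encode(s)) for s in useful)
print(total_tokens, useful_tokens, 1 - useful_tokens / total_tokens)  # overhead fraction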

Granted, this toy example has data (i.e. JSON values) with a very small token count relative to the schema’s, but for the real-world examples we encounter this overhead is significant.

Existing solutions
The jsonformer library (on GitHub) provides a clever approach (for HuggingFace models) that addresses each of the problems above: instead of specifying the response schema and having the model replicate it in its output, provide the partial template and only generate the tokens that fill in the requested data.

Again continuing with our example, my prompt would not list the complete JSON response schema at all, and instead end with

{
    "name": "

The model fills in Eric" and I terminate token generation upon seeing the closing quotation. I then fill in the next chunk of the schema

{
   "name": "Eric",
   "age": 

And the model fills in 33\n, at which point all the desired data has been generated and no further tokens are needed.
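A sketch of this fill-in-the-template loop (fake_generate stands in for the real model call and is hard-coded here purely so the example runs; jsonformer’s actual implementation is more involved):

def fake_generate(prompt, stop):
    # Stand-in for a real model call: in practice this would stream tokens and cut off at `stop`.
    return "Eric" if prompt.endswith('"name": "') else "33"

def fill_template(generate, chunks):
    """Walk the template; the model only ever generates the data values."""
    output = ""
    for piece, stop in chunks:
        output += piece
        if stop is not None:
            output += generate(output, stop=stop) + stop
    return output

print(fill_template(fake_generate, [
    ('{\n    "name": "', '"'),   # model fills in: Eric
    (',\n    "age": ', '\n'),    # model fills in: 33
    ('}', None),
]))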

This solution cannot be implemented using the current (chat) completion APIs
Attempting to implement this approach using the OpenAI APIs fails because I can’t “fill in” intermediate chunks of the response schema without sending the entire content of the prompt and partial response and incurring the associated costs.
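The closest workaround today is one API call per data field, resending the full conversation plus the partial template each time and relying on stop sequences; every call pays for all of the accumulated prompt tokens again, and every field adds a round trip. A rough sketch (model name and prompt wording are placeholders):

from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Extract the person's name and age from: Eric is 33."}]

partial = '{\n    "name": "'
for stop, suffix in [('"', ',\n    "age": '), ('\n', '}')]:
    # Each call resends the full prompt *and* the partial response built so far.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=messages + [{"role": "user", "content": "Continue exactly from here, with no preamble:\n" + partial}],
        stop=[stop],
    )
    partial += resp.choices[0].message.content + stop + suffix

print(partial)  # the assembled JSON, if the model cooperated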

Feature request: token injection during streaming
This can be resolved by adding the capacity to specify new tokens to be appended to the context during token streaming. Referring back to the earlier example, on seeing the output Eric and its terminating quotation mark, I want to be able to insert the tokens for

,
    "age": 

that provide the necessary context for the next piece of data to be generated, and to be able to do this during streaming, rather than terminating the token stream and restarting with a new prompt that specifies these additional tokens.
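To make the request concrete, the client interface I have in mind might look something like the sketch below. This is purely hypothetical: neither inject() nor terminate() exists in the current openai library; only the streaming loop itself is real.

from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": 'Fill in: {\n    "name": "'}]

stream = client.chat.completions.create(model="gpt-4", messages=messages, stream=True)

buffer = ""
for chunk in stream:
    buffer += chunk.choices[0].delta.content or ""
    if buffer.endswith('"'):              # closing quote: the name value is complete
        stream.inject(',\n    "age": ')   # hypothetical: append these tokens to the live context
    elif buffer.endswith("\n"):           # newline: the age value is complete
        stream.terminate()                # hypothetical: stop generation without a new request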

An alternative solution that wouldn’t require token streaming would allow me to specify, along with my initial prompt, a list of intermediate termination tokens and suffixes to inject, in our example something like

[
    ("\"", ",\n    \"age\": "),
    ("\n", <terminate generation>)
]

that, on seeing the first entry of the pair, inserts the second entry into the context and continues generation from there. This approach could be combined with logit biasing for each list item to guarantee that the response data is in the correct format, in our example that age only includes digits, or that a boolean field only yields true or false, etc.
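The logit-biasing piece is already expressible with today’s API: token IDs for the allowed literals can be looked up with tiktoken and passed through the existing logit_bias parameter. A sketch for a boolean field (assuming each literal is a single token under the cl100k_base encoding):

import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("cl100k_base")
bias = {enc.encode(lit)[0]: 100 for lit in ["true", "false"]}  # +100 strongly favours exactly these tokens

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Is 33 a prime number? Answer with the JSON literal true or false."}],
    logit_bias=bias,
    max_tokens=1,
)
print(resp.choices[0].message.content)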

In sum
I want to emphasize that this feature, which is available for open models, would significantly reduce response times (and generation costs) for structured data responses and, at least for many of our use-cases, make it possible to use GPT-4 where otherwise we cannot. In some cases, where response time is critical and the problem sufficiently complex, using less capable models with better response times is not viable and we’re unable to apply OpenAI models at all.

Thanks for reading, have a great week!

Related discussions
Post 186510 titled "Ability to specify the first part of the assistant’s message in chat models" includes a related mention of the jsonformer library (again, apologies that I can’t insert links here).


We ran into this issue as well—here’s a partial exposition of our solution (skip the classification part; look at the task-specific prompt part):

In short, try inserting specially-formatted placeholders and telling the model to update only those placeholders and return everything else unchanged.

We took this a little further:

  • I wrote a postprocessing script that handled common JSON issues emerging from these completions, e.g. capitalization of booleans, quoting of strings, etc.
  • I asked for the most minimal possible completion and then used Python’s dict.update() to fold the results into the complete dictionary object before writing the output to a JSON file (a rough sketch of this step follows below).
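A rough sketch of that folding step (the template and field names here are only illustrative):

import json

# Full response template with placeholder values; the model is asked to return
# only the fields it fills in, keeping its completion as short as possible.
template = {"name": "<FILL>", "age": "<FILL>", "source": "user-profile-form"}

minimal_completion = '{"name": "Eric", "age": 33}'   # what the model returns

result = dict(template)
result.update(json.loads(minimal_completion))        # fold the minimal answer into the full object

with open("output.json", "w") as f:
    json.dump(result, f, indent=2)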

I read your nicely written article and found the same approach I’m taking with my multi-agent system: divide and conquer, just as you do.

Since your tasks might be deterministic, a reliable solution could be to use vector embeddings for classification instead of an NLP tool. For example, take the tasks mentioned in the article:

tasks:

  • execute a query
  • create a new database
  • update a database
  • import data from a URL
  • list metadata for all databases
  • get metadata about a database
  • create an API key for a service account
  • create an API key for the current account
  • revoke all API keys for a service account
  • export data
  • delete database
  • no task specified

These are all tasks that can be stored as embeddings in a vector DB. Then, when the user prompts for a task, the prompt is converted into an embedding and compared against the stored task embeddings.
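A sketch of what that comparison could look like with the OpenAI embeddings endpoint and plain cosine similarity (the embedding model is just one possible choice, and a real vector DB would replace the in-memory list):

import math
from openai import OpenAI

client = OpenAI()
tasks = ["execute a query", "create a new database", "update a database", "delete database"]  # etc.

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

task_vectors = embed(tasks)                           # computed once, stored in the vector DB
query_vector = embed(["please drop the staging db"])[0]

best = max(range(len(tasks)), key=lambda i: cosine(task_vectors[i], query_vector))
print(tasks[best])  # the task whose embedding is closest to the user prompt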
What do you think about this for classification?


I think it’s a good approach and it’s probably something I’d consider if I had a lot more tasks, or if those tasks had extensive descriptions attached to them. Our use case only had about a dozen tasks, though—I think the overhead of setting up the vector DB and search system would have added some extra complexity to our project without clear benefit.

Given the release of Microsoft’s guidance library, I sincerely hope/eagerly anticipate that API support for partial completions is on the way :slight_smile: