Hi all!
Context
Generating structured data responses (e.g. a typed list or JSON) currently requires including the complete response schema in the prompt and instructing the model to output the correct format.
For example, if I want a JSON object with specific keys and value types, say
{
"name": str,
"age": int
}
my prompt includes this schema along with an instruction to respond in the specified JSON format, and I expect (or hope) that the model outputs the desired format.
Problem 1
The model may fail to produce a response conforming to the schema. A common failure (even with GPT-4), for example, is to return boolean values as True and False rather than true and false, which is what the JSON spec requires. These sorts of minor issues are common (especially when generating code or database queries), and a nontrivial amount of engineering is needed to parse and validate responses.
Problem 2
Prompt token count increases to specify this schema. For more complex JSON data, the number of tokens used to express the desired response format can be significant.
Problem 3 (main problem)
A significant number of tokens are generated in the response just to reproduce the schema. This can meaningfully increase both response time and cost. In our use-cases, the increase in response time is prohibitive and constrains our use of GPT-4, requiring us to use less capable models with better response times. These less capable models are sometimes unable to reliably produce solutions to the problem at all, leading us to abandon the use-case entirely.
Continuing with the example given above, say the model responds to a prompt with
{
"name": "Eric",
"age": 33
}
Using the web token counter available on the OpenAI website for demonstration (apologies, it seems I can’t include links in my post), this response contains 21 tokens, but only the 4 tokens required for Eric" and 33\n are used to produce the desired output data. The remaining 81% of tokens are “wasted” in reproducing the response schema. Critically, 81% of the response time is due to reproducing these redundant tokens.
Granted, this toy example has data (i.e. JSON values) with a very small token count relative to the JSON schema token count, but for the real-world examples we encounter, this overhead is significant.
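For anyone who wants to check the arithmetic themselves, here is a rough sketch using the tiktoken library (exact counts depend on the whitespace in the response; the 21-token figure above came from the web tokenizer):

import tiktoken

# The example response from above; exact whitespace affects the count.
response = '{\n"name": "Eric",\n"age": 33\n}'

enc = tiktoken.encoding_for_model("gpt-4")  # cl100k_base encoding
total = len(enc.encode(response))
data = len(enc.encode('Eric"')) + len(enc.encode("33\n"))

print(total)              # total tokens generated for the response
print(data)               # tokens that carry the actual data
print(1 - data / total)   # fraction spent reproducing the schema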
Existing solutions
The jsonformer library (on GitHub) provides a clever approach (for HuggingFace models) that addresses each of the problems above: instead of specifying the response schema and having the model replicate it in its output, provide a partial template and generate only the tokens that fill in the requested data.
Again continuing with our example, my prompt would not list the complete JSON response schema at all, and instead end with
{
"name": "
The model fills in Eric" and I terminate token generation upon seeing the closing quotation mark. I then fill in the next chunk of the schema:
{
"name": "Eric",
"age":
And the model fills in 33\n, at which point all the desired data has been generated and no further tokens are needed.
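In pseudocode, the fill-in loop looks roughly like the sketch below; generate_until is a stand-in for whatever constrained generation call the backend provides (jsonformer's actual implementation is more general, driving the process from a JSON schema):

def fill_template(generate_until):
    """Fill the JSON template piece by piece: the model only ever generates
    the values, never the schema. generate_until(prompt, stop) is assumed to
    return the generated text up to, but not including, the stop string."""
    context = '{\n"name": "'
    name = generate_until(context, stop='"')       # e.g. 'Eric'
    context += name + '",\n"age": '                # schema chunk appended by us
    age = generate_until(context, stop="\n")       # e.g. '33'
    return {"name": name, "age": int(age.strip())}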
This solution cannot be implemented using the current (chat) completion APIs
Attempting to implement this approach using the OpenAI APIs fails because I can’t “fill in” intermediate chunks of the response schema without re-sending the entire prompt and partial response for each chunk, incurring the associated costs every time.
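The closest workaround I can see today is something like the following sketch against the chat completions endpoint, which must re-send the full (growing) prompt for every field; the model name and prompt wording are just illustrative:

import openai

instruction = ("Complete the JSON fragment below. "
               "Output only the continuation, nothing else.\n\n")
prefix = '{\n"name": "'

# Field 1: send the whole prompt, stop at the closing quote.
r1 = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": instruction + prefix}],
    stop=['"'],
)
name = r1["choices"][0]["message"]["content"]

# Field 2: send everything again, now including the partial response so far.
prefix += name + '",\n"age": '
r2 = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": instruction + prefix}],
    stop=["\n"],
)
age = r2["choices"][0]["message"]["content"]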
Feature request: token injection during streaming
This can be resolved by adding the capacity to specify new tokens to be appended to the context during token streaming. Referring back to the earlier example, on seeing the output Eric and the terminating quotation mark, I want to be able to insert the tokens for
,
"age":
that provide the necessary context for the next piece of data to be generated, and to be able to do this during streaming, rather than terminating the token stream and restarting with a new prompt that specifies these additional tokens.
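To make the request concrete, here is a sketch of what this might look like from the client side; the inject method is entirely hypothetical and only illustrates the capability being asked for:

import openai

stream = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": 'Describe the person.\n{\n"name": "'}],
    stream=True,
)

buffer = ""
for chunk in stream:
    buffer += chunk["choices"][0]["delta"].get("content", "")
    if buffer.endswith('"') and '"age"' not in buffer:
        # HYPOTHETICAL: append the next schema chunk to the live context
        # instead of aborting the stream and issuing a new request.
        stream.inject(',\n"age": ')
        buffer += ',\n"age": '
    elif buffer.endswith("\n"):
        break  # all requested data has been generated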
An alternative solution that wouldn’t require token streaming would allow me to specify, along with my initial prompt, a list of intermediate termination tokens and suffixes to inject, in our example something like
[
("\"", ",\n \"age\": "),
("\n", <terminate generation>)
]
that, on seeing the first entry of the pair, inserts the second entry into the context and continues generation from there. This approach could be combined with logit biasing for each list item to guarantee that the response data is in the correct format: in our example, that age only includes digits, or that a boolean field only yields true or false, etc.
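As a sketch of this non-streaming variant, the request could carry the pairs directly; the continuations parameter below is invented for illustration (logit_bias itself is an existing request parameter, and tiktoken can supply the digit token IDs):

import openai
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
digit_bias = {enc.encode(str(d))[0]: 100 for d in range(10)}

# HYPOTHETICAL "continuations" parameter: each item gives a stop string, a
# suffix to inject into the context, and an optional logit bias to apply
# while generating the segment that follows the injected suffix.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": 'Describe the person.\n{\n"name": "'}],
    continuations=[
        {"stop": '"', "suffix": ',\n"age": ', "logit_bias": digit_bias},
        {"stop": "\n", "suffix": None},   # None = terminate generation
    ],
)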
In sum
I want to emphasize that this feature, which is available for open models, would significantly reduce response times (and generation costs) for structured data responses and, at least for many of our use-cases, make it possible to use GPT-4 where otherwise we cannot. In some cases, where response time is critical and the problem sufficiently complex, using less capable models with better response times is not viable and we’re unable to apply OpenAI models at all.
Thanks for reading, have a great week!
Related discussions
Post 186510, titled "Ability to specify the first part of the assistant’s message in chat models", includes a related mention of the jsonformer library (again, apologies that I can’t insert links here).