Introducing Predicted Outputs

Dramatically decrease latency for gpt-4o and gpt-4o-mini by providing a reference string.

Speed up:

  • Updating a blog post in a document
  • Iterating on previous model responses
  • Rewriting code in an existing file, like with Exponent in this video, which saw a ~3X speed-up

Get started with our docs.

12 Likes

TL;DR:

if:

  • YOU can predict what the AI will write for a good portion of contiguous output, and
  • can give up several other features, including ones that limit cost, and
  • are willing to pay the output-token price for anything you “predict” that isn’t used.

then:

  • response may come faster

previous:
I think it is sufficient to say “I don’t get it”.

It seems it is not:

  • a completion of a partial assistant output
  • output the AI can reuse until there is a significant departure
  • a suggested response

Perhaps it needs to be exactly what the AI would write up to a point - but of course with changing inputs and random sampling, that ‘exactly’ could change at the first token.

chat = [
    {
        "role": "system",
        "content": """You are helpful AI, responding in JSON.""",
    },
    {
        "role": "user",
        "content": """Evaluate truth 0-10: bananas are just for humans""",
    }
]
prediction = {
    "type": "content",
    "content": """```plaintext
{
  "truthfulness_value": -1,
  "reasoning": "my own reasons",
  "statement_topic": "category",
```""",
}
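For completeness, here is a minimal sketch of how that snippet would be sent, assuming the openai Python SDK and the chat and prediction variables defined above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-mini",    # the model used in the benchmarks below
    messages=chat,          # message list defined above
    prediction=prediction,  # predicted output defined above
)

print(completion.choices[0].message.content)
```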

A substantial portion of the prediction I send should remain unchanged in the output. Or maybe, since this shows no evidence of actually hinting the AI, there is nothing substantial to match unless you instruct long patterns of output and hope they are reproduced exactly the same.

Absolutely nothing is altered in the response by the parameter. It doesn’t even get you the word plaintext instead of json, and the rest of the prediction isn’t followed either.

Is a speedup the only thing one could possibly see - with no other evidence that it is working?

I think much more documentation of the mechanism should be provided: the run length of tokens it operates on, whether chunks can be provided, whether a completion up to your own expected point of divergence works, etc.

1 Like

Yeah, it won’t change the output of the model at all. It just speeds it up if you can guess a big amount of what the model replies with. This can happen if you’re asking for edits on a big doc or code file.

5 Likes

Sending a prediction seems to have the opposite of the desired effect: lower output speeds.

  • When there is no match, using the prediction parameter gets you slower, more expensive AI instead of having no effect.

Predictions sent for a task of writing about cute kittens
(nothing cache-able; top_p: 0.0001, which still doesn’t produce a consistent response length):

  • prediction 0: (empty string)
  • prediction 1: “kitten kittens cute”
  • prediction 2: the text part of the predicted output documentation

Calls are interleaved.

For 5 trials of prediction 0 @ 2024-11-04 08:44PM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 123.340 | 103.2 | 103.2 | 154.4 |
| latency (s) | 0.537 | 0.5426 | 0.37 | 0.9877 |
| total response (s) | 1.788 | 2.1319 | 1.5359 | 2.1319 |
| total rate (tokens/s) | 86.413 | 77.396 | 77.396 | 92.525 |
| response tokens | 153.000 | 165 | 137 | 169 |
| cost tokens | 154.800 | 167 | 138 | 171 |
| prediction tokens | 0.000 | 0 | 0 | 0 |
| accepted tokens | 0.000 | 0 | 0 | 0 |
| rejected tokens | 0.800 | 1 | 0 | 1 |

For 5 trials of prediction 1 @ 2024-11-04 08:44PM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 88.180 | 85.5 | 54.6 | 115.8 |
| latency (s) | 0.801 | 0.5172 | 0.3544 | 1.806 |
| total response (s) | 2.568 | 2.3072 | 1.658 | 3.4577 |
| total rate (tokens/s) | 62.057 | 66.748 | 39.622 | 91.677 |
| response tokens | 147.600 | 154 | 137 | 154 |
| cost tokens | 152.400 | 159 | 141 | 159 |
| prediction tokens | 4.000 | 4 | 4 | 4 |
| accepted tokens | 0.000 | 0 | 0 | 0 |
| rejected tokens | 3.800 | 4 | 3 | 4 |

For 5 trials of prediction 2 @ 2024-11-04 08:44PM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 91.560 | 86.8 | 68.8 | 118.1 |
| latency (s) | 0.509 | 0.6555 | 0.3166 | 0.6555 |
| total response (s) | 2.156 | 2.2915 | 1.7055 | 2.3228 |
| total rate (tokens/s) | 69.247 | 62.405 | 59.813 | 85.019 |
| response tokens | 147.400 | 143 | 138 | 160 |
| cost tokens | 211.600 | 216 | 180 | 233 |
| prediction tokens | 353.000 | 353 | 353 | 353 |
| accepted tokens | 0.000 | 0 | 0 | 0 |
| rejected tokens | 63.200 | 72 | 28 | 72 |

Token Usage Log: 5 trials of 0:

| Measured | Completion | Prediction | Accepted | Rejected |
|---|---|---|---|---|
| 165 | 167 | 0 | 0 | 1 |
| 142 | 144 | 0 | 0 | 1 |
| 169 | 171 | 0 | 0 | 1 |
| 152 | 154 | 0 | 0 | 1 |
| 137 | 138 | 0 | 0 | 0 |

Token Usage Log: 5 trials of 1:

| Measured | Completion | Prediction | Accepted | Rejected |
|---|---|---|---|---|
| 154 | 159 | 4 | 0 | 4 |
| 148 | 153 | 4 | 0 | 4 |
| 137 | 141 | 4 | 0 | 3 |
| 147 | 152 | 4 | 0 | 4 |
| 152 | 157 | 4 | 0 | 4 |

Token Usage Log: 5 trials of 2:

| Measured | Completion | Prediction | Accepted | Rejected |
|---|---|---|---|---|
| 143 | 216 | 353 | 0 | 72 |
| 138 | 211 | 353 | 0 | 72 |
| 160 | 233 | 353 | 0 | 72 |
| 145 | 218 | 353 | 0 | 72 |
| 151 | 180 | 353 | 0 | 28 |

The accounting is all goofy. “measured” and “prediction” are the response and the sent prediction as measured by tiktoken; the rest are returned in the usage chunk. I can’t even be certain whether I get billed for an empty string…
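For anyone reproducing these numbers: the accepted/rejected figures come from the usage object on the response. A small helper sketch, assuming the completion_tokens_details fields described in the Predicted Outputs documentation:

```python
def log_prediction_usage(completion) -> None:
    """Print the billing-relevant usage fields from a chat completion
    that was made with the `prediction` parameter."""
    usage = completion.usage
    details = usage.completion_tokens_details
    print("completion tokens (billed):", usage.completion_tokens)
    print("accepted prediction tokens:", details.accepted_prediction_tokens)
    print("rejected prediction tokens:", details.rejected_prediction_tokens)
```

Rejected prediction tokens are billed at completion-token rates, which is consistent with the “cost tokens” column above growing when a long prediction misses.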

@_j - the intended use case for this is tasks related to rewriting code or documents with minor changes, e.g. “refactor this code to change the variable name from x to y” or “rewrite this blog post while only changing the name of the product from a to b”. In these cases, you pass the original draft as the prediction and then see inference speed up any time the model output and the predicted tokens match.

You shouldn’t expect this to help with tasks where you don’t have a good sense of a long stretch of the response before the model produces it (which is the case for your prompt above about a story related to cute kittens).
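As a concrete sketch of that intended use (the code being edited here is made up for illustration): pass the original text both as the input and as the prediction, and the unchanged spans stream back much faster.

```python
from openai import OpenAI

client = OpenAI()

original_code = """\
class User:
    def __init__(self, x):
        self.x = x

    def describe(self):
        return f"User with value {self.x}"
"""

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Rename the attribute x to y in this code. "
                       "Respond only with the full updated code.\n\n" + original_code,
        }
    ],
    # Most of the file should come back unchanged, so the original
    # text itself serves as the prediction.
    prediction={"type": "content", "content": original_code},
)

print(completion.choices[0].message.content)
```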

2 Likes

I’m pointing out that this is not passive in practice - it hurts performance when there are no prediction hits, and speed is the only thing you’re paying for.

I’m sure there’s a break-even point you can get beyond, but that is one more prediction you have to make, on top of the text you have to predict yourself. (gpt-4o-mini was used above.)

The next thing I’ll be curious to investigate is the intended use, on both shorter and longer runs of reproduced text. For example, an AI adding <strong> bolding to important words of the same text passed as the prediction.
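A sketch of that experiment (the passage and prompt are invented for illustration): the unmodified text is passed as the prediction, and the inserted <strong> tags are where the output diverges from it.

```python
from openai import OpenAI

client = OpenAI()

passage = (
    "Predicted Outputs can reduce latency when most of the response "
    "is already known before the model writes it."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "Wrap the most important words of this passage in "
                       "<strong> tags. Return only the passage.\n\n" + passage,
        }
    ],
    # The unmodified passage doubles as the prediction; the inserted
    # tags are the points of divergence.
    prediction={"type": "content", "content": passage},
)

print(completion.choices[0].message.content)
```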

It’s quite easy to predict parts of the model’s output when it is asked to make changes to an existing piece of text that is also provided as input.
It is not about something the AI has written previously up to some point. This is not prompt caching.

In your test scenario, however, the model isn’t given any text to modify. Instead, you’re requesting an entirely new completion each time.

I suggest running your tests again, but this time using the example provided in the documentation, or something similar, to effectively leverage the model’s predictive capabilities.

1 Like

Is this partly how Canvas works under the hood? :thinking:

Canvas? It could be, if you experience faster generation speed there when an existing canvas is rewritten - but not for you, because no function calling is allowed alongside a prediction. ChatGPT also builds long chats, which makes context caching useful, and it ends session interactions rather than truncating chats to a recent tail, because of cache economy.

ChatGPT is encouraged to rewrite the whole canvas document to its tool instead of issuing patches or working with line numbers (the way the clever might do it in their own code app, with a chat window, a line-numbered original, and an AI diff window, selecting sections for context and modifiability). So if total computation is reduced, and that is reflected in the speed, it’s certainly a good application.

Timeliness as a feature, for processing predictable inputs and outputs, is somewhat overruled by the fact that such predictable processing jobs can already be done with dozens of calls a second (until you hit Cloudflare errors before your rate limit). The actual app that takes advantage of this is a head-scratcher for you to solve.

1 Like