Introducing Predicted Outputs

Dramatically decrease latency for gpt-4o and gpt-4o-mini by providing a reference string.

Speed up:

  • Updating a blog post in a document
  • Iterating on previous model responses
  • Rewriting code in an existing file, like with Exponent in this video, which saw a ~3X speed-up

Get started with our docs.
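A minimal sketch of what a call might look like with the Python SDK, assuming the prediction parameter accepts a content block as shown in the docs example; the file name, prompt, and product names here are made up for illustration.

```python
# Minimal sketch: pass the text you expect back mostly unchanged as the prediction.
# File name, prompt, and product names are illustrative, not from the docs.
from openai import OpenAI

client = OpenAI()
original = open("blog_post.md").read()  # hypothetical existing draft

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Rewrite this post, changing only the product name "
                       "from Alpha to Beta:\n\n" + original,
        },
    ],
    # The unchanged draft doubles as the prediction; output tokens that match it
    # should stream back faster.
    prediction={"type": "content", "content": original},
)
print(completion.choices[0].message.content)
```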

18 Likes

TL;DR:

if:

  • YOU can predict what the AI will write for a good portion of contiguous output, and
  • can forgo several other API features, including cost-limiting parameters, and
  • are willing to pay the separate output-token price for anything you "predict" that isn't used.

then:

  • response may come faster

Previously:

I think it is sufficient to say "I don't get it".

It seems it is not:

  • a completion on a partial assistant output
  • output the AI can reuse until there is a significant departure
  • a way to suggest a response

Perhaps it needs to be exactly what the AI would write up to a point - but of course with changing inputs and random sampling, that 'exactly' could change at the first token.

Is a speedup the only thing one would possibly see, with no other evidence that it is working?

I think much more documentation should be provided: the mechanism, the run length of tokens it operates on, whether chunks can be provided, whether a completion up to your own expected divergence can be supplied, and so on.

3 Likes

Yeah, it won't change the output of the model at all. It just speeds it up if you can guess a large portion of what the model replies with. This can happen if you're asking for edits on a big doc or code file.

8 Likes

Sending a prediction seems to have the opposite of the desired effect: lower output speeds.

  • Using the prediction parameter gets you a slower, more expensive response when there is no match, instead of having no effect.

Predictions sent for a task of writing about cute kittens
(nothing cacheable; top_p: 0.0001, which still does not produce a consistent length):

  • prediction 0: (empty string)
  • prediction 1: ā€œkitten kittens cuteā€
  • prediction 2: the text part of the predicted output documentation

Calls are interleaved.
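A simplified sketch of how trials like these can be run, interleaving the three predictions and recording both tiktoken counts and the usage chunk. This is a reconstruction, not the exact script behind the numbers below; the stream_options flag, the o200k_base encoding choice, and the documentation-text file name are assumptions.

```python
# Reconstruction of an interleaved trial loop; details are assumptions,
# not the exact script that produced the tables below.
import time
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # tokenizer family used by gpt-4o-mini

predictions = [
    "",                                        # prediction 0: empty string
    "kitten kittens cute",                     # prediction 1
    open("predicted_outputs_doc.txt").read(),  # prediction 2 (hypothetical file)
]

def one_trial(prediction_text: str) -> dict:
    start = time.time()
    first_token = None
    parts, usage = [], None
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Write about cute kittens."}],
        prediction={"type": "content", "content": prediction_text},
        top_p=0.0001,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries usage
    )
    for chunk in stream:
        if chunk.usage:
            usage = chunk.usage
        elif chunk.choices and chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.time()
            parts.append(chunk.choices[0].delta.content)
    total = time.time() - start
    first_token = first_token or time.time()
    text = "".join(parts)
    return {
        "latency_s": first_token - start,                     # time to first token
        "total_s": total,                                     # total response time
        "measured_tokens": len(enc.encode(text)),             # "measured" column
        "prediction_tokens": len(enc.encode(prediction_text)),
        "usage": usage,  # completion / accepted / rejected come from here
    }

# Interleaved: one call per prediction each pass, five passes in total.
results = [one_trial(p) for _ in range(5) for p in predictions]
```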

For 5 trials of prediction 0 @ 2024-11-04 08:44PM:

Stat                    Average    Cold      Minimum   Maximum
stream rate (tokens/s)  123.340    103.2     103.2     154.4
latency (s)             0.537      0.5426    0.37      0.9877
total response (s)      1.788      2.1319    1.5359    2.1319
total rate (tokens/s)   86.413     77.396    77.396    92.525
response tokens         153.000    165       137       169
cost tokens             154.800    167       138       171
prediction tokens       0.000      0         0         0
accepted tokens         0.000      0         0         0
rejected tokens         0.800      1         0         1

For 5 trials of prediction 1 @ 2024-11-04 08:44PM:

Stat                    Average    Cold      Minimum   Maximum
stream rate (tokens/s)  88.180     85.5      54.6      115.8
latency (s)             0.801      0.5172    0.3544    1.806
total response (s)      2.568      2.3072    1.658     3.4577
total rate (tokens/s)   62.057     66.748    39.622    91.677
response tokens         147.600    154       137       154
cost tokens             152.400    159       141       159
prediction tokens       4.000      4         4         4
accepted tokens         0.000      0         0         0
rejected tokens         3.800      4         3         4

For 5 trials of prediction 2 @ 2024-11-04 08:44PM:

Stat                    Average    Cold      Minimum   Maximum
stream rate (tokens/s)  91.560     86.8      68.8      118.1
latency (s)             0.509      0.6555    0.3166    0.6555
total response (s)      2.156      2.2915    1.7055    2.3228
total rate (tokens/s)   69.247     62.405    59.813    85.019
response tokens         147.400    143       138       160
cost tokens             211.600    216       180       233
prediction tokens       353.000    353       353       353
accepted tokens         0.000      0         0         0
rejected tokens         63.200     72        28        72

Token Usage Log: 5 trials of 0:

Measured  Completion  Prediction  Accepted  Rejected
165       167         0           0         1
142       144         0           0         1
169       171         0           0         1
152       154         0           0         1
137       138         0           0         0

Token Usage Log: 5 trials of 1:

Measured  Completion  Prediction  Accepted  Rejected
154       159         4           0         4
148       153         4           0         4
137       141         4           0         3
147       152         4           0         4
152       157         4           0         4

Token Usage Log: 5 trials of 2:

Measured  Completion  Prediction  Accepted  Rejected
143       216         353         0         72
138       211         353         0         72
160       233         353         0         72
145       218         353         0         72
151       180         353         0         28

The accounting is all goofy. "Measured" and "prediction" are the response and the sent prediction as counted by tiktoken; the rest are returned in the usage chunk. I can't even be certain whether I get billed for an empty string…
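For reference, a sketch of where those usage-chunk numbers come from, reusing a trial result from the loop sketched above; the completion_tokens_details field names are assumptions to verify against the current API reference.

```python
# Field names under completion_tokens_details are assumed; verify against the
# current API reference before relying on them.
usage = results[0]["usage"]  # usage object captured from a streamed trial above
details = usage.completion_tokens_details
print("completion (billed) tokens:", usage.completion_tokens)
print("accepted prediction tokens:", details.accepted_prediction_tokens)
print("rejected prediction tokens:", details.rejected_prediction_tokens)
```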

1 Like

@_j - the intended use case for this is tasks related to rewriting code or documents with minor changes, e.g. "refactor this code to change the variable name from x to y" or "rewrite this blog post while only changing the name of the product from a to b". In these cases, you pass the original draft as the prediction and then see inference speed up any time the model output and the predicted tokens match.

You shouldn't expect this to help with tasks where you don't have a good sense of a long stretch of the response before the model produces it (which is what your prompt above about a story related to cute kittens is attempting to do).

5 Likes

I'm pointing out that this is not passive in practice: it hurts performance when there are no hits, and speed is the only thing being paid for.

I'm sure there's a break-even point you can get beyond, but that is one more prediction you have to make, besides the text you have to predict yourself. (gpt-4o-mini was used above.)

The next thing I'll be curious to investigate is the intended use, on shorter and longer runs of reproduction - for example, the AI adding <strong> bolding to important words of the same text passed as the prediction.

It's quite easy to predict parts of the model's output when it is asked to make changes to an existing piece of text that is also provided as input.
It is not about something the AI has written previously up to some point. This is not prompt caching.

In your test scenario, however, the model isn't given any text to modify. Instead, you're requesting an entirely new completion each time.

I suggest running your tests again, but this time using the example provided in the documentation, or something similar, to effectively leverage the model's predictive capabilities.

2 Likes

Is this partly how Canvas works under the hood? :thinking:

1 Like

Canvas? It could be, if you experience faster generation speed there when it rewrites an existing canvas - but you couldn't build the same thing yourself, because no function calling is allowed alongside a prediction. ChatGPT also builds long chats, making the context cache useful, and it ends session interactions instead of tailing chats because of cache economy.

ChatGPT is encouraged to rewrite the whole canvas document to its tool instead of issuing patches or using line numbers (as the clever might do in their own code app, with a chat window, a line-numbered original, and an AI diff window, selecting sections for context and modifiability). So if total computation is reduced, as reflected by the speed, it is certainly a good application.

Timeliness as a feature, for processing predictable inputs and outputs, is somewhat overruled by the fact that those predictable processing jobs can already be done with dozens of calls a second (until you hit Cloudflare errors before your rate limit). The actual app that takes advantage of this is a head-scratcher for you to solve.

2 Likes

I'm sorry - I'm a little slow here.

I thought we'd finally be getting partial assistant output in prompts, but this is not that.

In fact, it looks like we're billed normally for all output tokens, predicted or not. Is that right?

I mean, it's cool tech, and I wonder how you guys are doing it (parallel generation? multiple-token prediction + skip ahead? quants?)

But for this particular use case, aren't diffs / fuzzy diffs much faster and cheaper? :thinking:

2 Likes

If I understand it all correctly (on a very basic level), it first runs a smaller model over the tokens, which in theory, if it's right, should produce exactly the same output (as in code or large texts where only a small portion needs updating).

As long as the smaller model's tokens line up with the prediction, you don't get charged.

But once the smaller model deviates, the reins are passed to the actual model, which now performs the inference.
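However the draft is produced - by a smaller model as guessed above, or taken directly from the supplied prediction text - the verification step can be pictured roughly like the toy sketch below. It is not OpenAI's implementation; score_positions is a made-up stand-in for one parallel forward pass of the big model.

```python
# Toy illustration of draft-and-verify decoding; not OpenAI's implementation.
# `score_positions` stands in for one parallel pass of the big model that
# returns, for each draft position, the token it would emit there itself.
from typing import Callable, List, Sequence

def accept_draft(
    score_positions: Callable[[Sequence[int], Sequence[int]], List[int]],
    context: Sequence[int],
    draft: Sequence[int],
) -> List[int]:
    actual = score_positions(context, draft)  # one pass checks every draft token
    accepted: List[int] = []
    for drafted, would_emit in zip(draft, actual):
        if drafted != would_emit:
            break                    # first divergence: resume normal decoding here
        accepted.append(drafted)     # matching tokens need no extra sequential passes
    return accepted
```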

I was thinking (hoping) the same. I doubt it'll happen anytime soon, considering the safety issues :sob:

I'm wondering if maybe the performance of diffs puts a strain on the quality of the output?

1 Like

I don't see how or why :thinking:

Although some implementations struggle with diffs, a fuzzy match (especially when it comes to line numbers) typically works.
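The fuzzy-match idea can be sketched with nothing more than the standard library: find the block of source lines that best matches what the model quoted, then swap in its replacement. The function, its threshold, and its shape are illustrative only.

```python
# Illustrative sketch of fuzzy-matching a model-written edit onto a file.
import difflib

def apply_fuzzy_edit(source: str, quoted_original: str, replacement: str) -> str:
    src_lines = source.splitlines()
    window = max(1, len(quoted_original.splitlines()))
    best_score, best_start = 0.0, None
    # Slide a window over the source and score each block against the quoted text.
    for start in range(len(src_lines) - window + 1):
        candidate = "\n".join(src_lines[start:start + window])
        score = difflib.SequenceMatcher(None, candidate, quoted_original).ratio()
        if score > best_score:
            best_score, best_start = score, start
    if best_start is None or best_score < 0.8:  # arbitrary confidence cutoff
        raise ValueError("no sufficiently close match for this edit")
    patched = src_lines[:best_start] + replacement.splitlines() + src_lines[best_start + window:]
    return "\n".join(patched)
```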

1 Like

I'm thinking this might be useful for editing large texts. I haven't tested it yet, but I'll report back if I do.

3 Likes

Does this have any bearing on the maximum input length?

This topic was automatically closed after 13 days. New replies are no longer allowed.