How to speed up GPT-4 generation

Quick question about any tricks anyone’s found to speed up GPT-4 inference in the following scenario.

My user will be writing a report, and I’d like to essentially generate a summary of the report whenever they click “generate”.

So that they don’t have to wait the ~10 seconds at the end when they click generate, I want to generate a summary in the background every X seconds as they write. This still isn’t ideal, because they’ll likely still have to wait a decent amount of time once they press generate.

Are there any tricks that can speed up inference given that the prefix of the report stays the same as the report of the last generation?

By the way, I can’t use streaming unfortunately.


One trick is to just use streaming.

Unlike the AI, a human can’t read text in a single brain cycle. So if you stream the words, even gpt-4 tends to stream faster than users can read.
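To make the pattern concrete, here’s a minimal sketch of how streamed output is typically consumed. A plain generator stands in for the API stream here so the example runs on its own; with the openai v1 SDK the real call would be `client.chat.completions.create(..., stream=True)`, and each chunk’s text would come from `chunk.choices[0].delta.content`.

```python
# Sketch of consuming a streamed response. A generator stands in for the
# real stream (openai v1 SDK assumed) so the consumption pattern is visible.
def fake_stream():
    for token in ["The ", "report ", "argues ", "that..."]:
        yield token  # the SDK would yield chunks, not raw strings

shown = ""
for delta in fake_stream():
    shown += delta          # append each delta to the UI as it arrives
print(shown)
```

The point is that the user starts reading after the first chunk, so perceived latency is a fraction of the full generation time.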

Ah yes, I should have mentioned: I can’t use streaming. I won’t go into why, but unfortunately it’s not an option.

Can’t as in missing technical aptitude, or as in a product manager decided it’s a risky, high-cost nice-to-have? I understand; large companies especially seem to struggle with this.


Are there any tricks that can speed up inference given that the prefix of the report stays the same as the report of the last generation?

you’re saying the rolling summary would work?

do you need to use gpt-4? with 3.5 turbo instruct, you would just continue generation where you left off, and it would probably be the simplest to use. Otherwise, you could use gpt-4 to split the task and just ask it to summarize the last paragraph, or continue the summary where it was left off.

or am I misunderstanding something here?
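A minimal sketch of the “continue where you left off” idea. gpt-3.5-turbo-instruct uses the legacy completions endpoint, which simply continues the prompt text, so you can feed in the summary so far and let the model extend it. The helper name and prompt wording below are just illustration:

```python
def build_continuation_prompt(report: str, summary_so_far: str) -> str:
    """Prompt for a legacy-completions model like gpt-3.5-turbo-instruct:
    the model continues the text, so it picks up the summary mid-stream."""
    return (
        f"Report:\n{report}\n\n"
        "Summary (continue from where it stops, without repeating it):\n"
        f"{summary_so_far}"
    )

prompt = build_continuation_prompt("Q3 revenue grew 8%.", "Revenue grew")
# With the openai v1 SDK this would be sent roughly as:
#   client.completions.create(model="gpt-3.5-turbo-instruct",
#                             prompt=prompt, max_tokens=150)
# and the returned text appended to summary_so_far.
```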

We can’t because we have to print the output in the same text box that the user is typing in, and they may want to type while it’s generating. And this is the text box of a third party app, not one we made.

This is actually the first I’m hearing of 3.5 turbo instruct. I’ll look into that. But yeah, a simple version would look like gpt-4 summarizing the last paragraph and appending that to the previous output. That’s not completely ideal, though, because we want it to have some context of the previous paragraphs so it doesn’t repeat itself or say something that doesn’t make sense.

I know what I’m asking is pretty vague, but I’m not sure if anyone’s come up with something that makes use of the fact that a large part of the new report has already been fed into the model. Maybe the solution is something like:

  1. User begins writing report
  2. We generate summary in background for current report state
  3. User adds to / edits report
  4. We take the diff of new report and old report and say “edit the summary with the following additions”

Not sure if that would make it faster; maybe I’ll try…
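A rough sketch of step 4, assuming the diff-then-edit approach: `difflib` from the standard library produces the diff, and the prompt wording and helper name are my own illustration, not a tested recipe.

```python
import difflib

def summary_edit_prompt(old_report: str, new_report: str, old_summary: str) -> str:
    """Build a prompt asking the model to patch the old summary
    using only a unified diff of the report changes."""
    diff = "\n".join(
        difflib.unified_diff(
            old_report.splitlines(), new_report.splitlines(),
            fromfile="previous", tofile="current", lineterm="",
        )
    )
    return (
        "Here is an existing summary of a report:\n"
        f"{old_summary}\n\n"
        "The report has changed. Here is a unified diff of the changes:\n"
        f"{diff}\n\n"
        "Edit the summary to reflect the changes. Return the full updated summary."
    )

prompt = summary_edit_prompt("Sales rose.", "Sales rose.\nCosts fell.", "Sales went up.")
# prompt would be sent as the user message of a chat completion call
```

Note this saves input tokens on the report side but still requires sending the old summary, and the model must return the whole updated summary unless you combine it with something like string-replacement edits.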

ah you’re trying to save input tokens?

yeah that makes everything more complicated. you’ll likely have to find some way to split your document into logical chunks. if your users are using regular headings and such this might be easier.

it’s obviously possible to just take a section with the diff and explain to the model what it’s looking at. alternatively you could forget the diff and submit the affected section for re-summarization, that would probably make your job much easier.

and if your abstract or what you’re trying to create is a summary of the summaries and you don’t expect it to change much, you could try to instruct the model to submit changes as string replacements. you could use function calling if you want. “here’s an old summary {summary}. however, the document may have changed. use the provided functions to edit the summary as necessary”. string replace, pass, rewrite. something like that.
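Here’s a sketch of what that could look like: a tool schema the model can call to submit edits as exact string replacements, plus the local function that applies one. The tool names and parameter schema are illustrative, not an official API surface.

```python
# Illustrative tool definitions for "edit the summary via string replacement".
tools = [
    {
        "type": "function",
        "function": {
            "name": "string_replace",
            "description": "Replace one exact substring of the old summary.",
            "parameters": {
                "type": "object",
                "properties": {
                    "old": {"type": "string"},
                    "new": {"type": "string"},
                },
                "required": ["old", "new"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "pass_unchanged",
            "description": "The summary is still accurate; keep it as-is.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

def apply_string_replace(summary: str, old: str, new: str) -> str:
    # Apply a single edit the model requested; only the first match changes.
    return summary.replace(old, new, 1)

updated = apply_string_replace("Sales fell in Q2.", "fell", "rose")
```

These would be passed via the `tools` parameter of a chat completion request, with the old summary and the changed document in the prompt as described above.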

Any solutions here? It takes around 6s to get a response from GPT-4 Turbo, which is pretty long once you factor in calling functions, sending the data back to GPT, and another ~6s for it to respond. ~12s response times to user prompts is a bit much.

That’s interesting I’ll try that. Thanks for the help


According to @_j

It may unfortunately depend on your usage tier. Do you know what tier you’re at?

Hey, could I ask you some questions about streaming? I’m having trouble setting it up!


Which language are you having issues in?
