How to speed up GPT4 generation

schwartzray8 · December 6, 2023, 9:24pm

Quick question about any tricks anyone’s found to speed up GPT4 inference in the following scenario.

My user will be writing a report, and I’d like to essentially generate a summary of the report whenever they click “generate”.

As to not have them wait the ~10 seconds at the end when they click generate, I want to generate a summary in the background every X seconds as the write. This still isn’t ideal because they will still likely have to wait a decent amount of time once they press generate.

Are there any tricks that can speed up inference given that the prefix of the report stays the same as the report of the last generation?

By the way, I can’t use streaming unfortunately.

Diet · December 7, 2023, 1:27am

One trick is to just use streaming.

Unlike the AI, a human can’t read text in a single brain cycle. So if you stream the words, even gpt-4 tends to stream faster than users can read.

schwartzray8 · December 7, 2023, 1:42am

Ah yes I should have mentioned. I can’t use streaming. I won’t go into why, but ya I can’t use streaming unfortunately.

Diet · December 7, 2023, 1:58am

Can’t as in missing technical aptitude / product manager decided that it’s a risky high cost nice to have? I understand, I know that large companies especially seem to struggle with this.

however:

Are there any tricks that can speed up inference given that the prefix of the report stays the same as the report of the last generation?

you’re saying the rolling summary would work?

do you need to use gpt-4? with 3.5 turbo instruct, you would just continue generation where you left off, and it would probably be the simplest to use. Otherwise, you could use gpt-4 to split the task and just ask it to summarize the last paragraph, or continue the summary where it was left off.

or am I misunderstanding something here?

schwartzray8 · December 7, 2023, 3:34am

We can’t because we have to print the output in the same text box that the user is typing in, and they may want to type while it’s generating. And this is the text box of a third party app, not one we made.

This is actually the first I’m hearing of 3.5 turbo instruct. I’ll look into that. But ya a simple version would look like gpt-4 summarizing the last paragraph and appending that to the previous output. That’s not completely ideal though because we want it to have some context of what was in the previous paragraphs so it doesn’t repeat itself / say something that doesn’t make sense.

I know what I’m asking is pretty vague, but I’m not sure if anyone’s come up with something that makes use of the fact that a large part of the new report has already been fed into the model. Maybe the solution is something like:

User begins writing report
We generate summary in background for current report state
User adds to / edits report
We take the diff of new report and old report and say “edit the summary with the following additions”

Not sure if that would make it faster maybe I’ll try…

Diet · December 7, 2023, 4:17am

ah you’re trying to save input tokens?

yeah that makes everything more complicated. you’ll likely have to find some way to split your document into logical chunks. if your users are using regular headings and such this might be easier.

it’s obviously possible to just take a section with the diff and explain to the model what it’s looking at. alternatively you could forget the diff and submit the affected section for re-summarization, that would probably make your job much easier.

and if your abstract or what you’re trying to create is a summary of the summaries and you don’t expect it to change much, you could try to instruct the model to submit changes as string replacements. you could use function calling if you want. “here’s an old summary {summary}. however, the document may have changed. use the provided functions to edit the summary as necessary”. string replace, pass, rewrite. something like that.

s.kovacs · December 12, 2023, 4:42pm

Any solutions here? Takes around 6s to get a response from GPT4 turbo. Pretty long if you consider calling functions and sending data back to GPT + another 6s for it to respond. ~12s response times to user prompts is a bit much.

schwartzray8 · December 12, 2023, 5:06pm

That’s interesting I’ll try that. Thanks for the help

Diet · December 12, 2023, 5:25pm

According to @_j

It may unfortunately depend on your usage tier - do you know what tier you are at?

dyquaye · January 29, 2024, 5:30pm

Hey could I ask you some questions about streaming I’m having trouble setting it up!

Diet · January 29, 2024, 5:33pm

Sure!

Which language are you having issues in?

extra characters for discourse

Topic		Replies	Views
ChatGPT API Very Slow at generating Responses API gpt-4 , api	8	5442	December 25, 2023
Gpt-4-0125-preview INCREDIBLY slower than 3.5 turbo API	12	9586	July 22, 2024
GPT-4 API to slow when you have to work with a 46 second time out API	11	2781	July 30, 2023
Performance issue with gpt-4-turbo-preview API API gpt-4 , api , performance	1	1245	February 17, 2024
Completion vs. chat performance API api-speed	3	3269	December 24, 2023

How to speed up GPT4 generation

Related topics