It looks like GPT-4-32k is rolling out

Benchmarking in a vacuum isn’t interesting to any real product. I am measuring end user experience.
User presses button. User has to wait until they see the result. How long does this take?

In your situation with an end user waiting, I would stream the answer back to them. The first token latency is a few seconds, just like it is in the Playground.

Above we were mostly measuring last-token latency and/or token output rate per second, not first-token streaming latency, which is probably the most interesting performance parameter for your application.
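For anyone who wants to reproduce that kind of measurement, here is a rough sketch of timing first-token latency and output rate with streaming via the pre-1.0 `openai` Python package; the model name and prompt are placeholders, not anything specific from this thread.

```python
# Rough benchmarking sketch (placeholder model/prompt): measure first-token
# latency and the chunk output rate of a streamed chat completion.
# Assumes OPENAI_API_KEY is set and the pre-1.0 openai package is installed.
import time
import openai

start = time.time()
first_token_at = None
n_chunks = 0

response = openai.ChatCompletion.create(
    model="gpt-4-32k",  # placeholder: swap in whichever model you are benchmarking
    messages=[{"role": "user", "content": "Explain streaming latency in one paragraph."}],
    stream=True,
)

for chunk in response:
    delta = chunk["choices"][0]["delta"].get("content", "")
    if delta:
        if first_token_at is None:
            first_token_at = time.time()
        n_chunks += 1

end = time.time()
print(f"Total time: {end - start:.1f} s")
if first_token_at is not None:
    print(f"First-token latency: {first_token_at - start:.1f} s")
    print(f"Output rate after first token: {n_chunks / (end - first_token_at):.1f} chunks/s")
```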

I already said that I must post-process the data before I can show it.

(One common case of this is moderation solutions.)

So, no, I in fact cannot stream the data, as I already said; token generation rate directly impacts the latency to result for the user, as I already said; so generation significantly faster than reading speed is both useful and important, as I already said.

Another case where generation faster than human reading speed is important is when generating steps to execute, such as in plugins.

So, I stand by the assessment that GPT-4 generation speed currently is a significant impediment.

I cannot find anything in the OpenAI docs to support this. They only mention moderation in passing: “You can use our free moderation endpoint and safety best practices to help you keep your app safe.”

So I would interpret this to mean, in your case: stream to your users, then send the full result to moderation (after it’s done), and if it comes back flagged, put the user on cooldown, or something similar, and see what happens.
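As a rough sketch of that pattern (the `send_to_user` and `put_user_on_cooldown` callbacks are hypothetical placeholders for your own app logic; only the `Moderation` call is the actual free endpoint, pre-1.0 SDK style):

```python
# Stream the completion to the user first, then moderate the full text once
# generation is done. Callbacks are placeholders for your own app logic.
import openai

def stream_then_moderate(messages, send_to_user, put_user_on_cooldown):
    parts = []
    for chunk in openai.ChatCompletion.create(
        model="gpt-4", messages=messages, stream=True
    ):
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            send_to_user(delta)          # user sees tokens as they arrive
            parts.append(delta)

    full_text = "".join(parts)
    verdict = openai.Moderation.create(input=full_text)
    if verdict["results"][0]["flagged"]:
        put_user_on_cooldown()           # or whatever your policy requires
    return full_text
```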

I think you are taking something too literally, but feel free to share where it says you have to moderate everything before it goes out. If that were the case, the streaming feature itself would violate the policy, which is a contradiction.

But if you are referring to some other moderation or post-processing step, then yeah, as a batch, you aren’t getting past last-token latency. And no AI model will have great, responsive last-token latency, at least for the next 6 months, until sub-quadratic models that actually work well (and run on smoking-hot FPGAs or ASICs) come out.

I understand the argument: end-user latency is a combination of multiple things, and I’m sure people are aware of this. Benchmarking one thing separately still provides relevant information that other people can use.

@jwatte I’m sorry if my information about human reading speed isn’t relevant to you; it may be for other people. If you’re having issues with latency, I’d advise creating a separate topic about it, as that will make it easier for others to help you. This thread is for the (very small) early rollout of GPT-4-32K.

I’m expecting most people will be using the 32k model while streaming to the end user; running an agent on GPT-4-32k is just too expensive :laughing:

1 Like

What a weird thing to say. It’s absolutely critical to benchmark all components of a process in order to understand where bottlenecks are occurring.

2 Likes

My latest use of 32k is to have long conversations with GPT and to analyze legal terms of service agreements. So more of a onesie-twosie thing, not production.

But I have been running 32k in parallel with 8k for model redundancy, randomly for 25% of all traffic. Nothing earth shattering or mission critical, just redundancy to get data out without retries.

4 Likes

Ha ha, what a coincidence, I’ve been using it for legal texts as well, more specifically to add legal explanations to specific context, very useful stuff :laughing:

1 Like

This is indeed the case – I was not referring to “the OpenAI moderation API”; I was referring to certain moderation situations that need to see the full context. (As mentioned before, …)

Sure! I’m answering the argument that I was wrong when I said that, in the use cases I have, generation speed matters and generation above human reading speed is important.

In runtime, currently, yes absolutely it’s too expensive!

In dollars, GPT-4 is cheap compared to what I would be willing to pay if I could get fine-tuning and guaranteed performance.

1 Like

Only the paid API (which I have)? Not the chat window? Are you getting the full 32K in the chat window?

I don’t have ChatGPT, so no idea. The 32k is, for me, through the API. And I can use 32k in the Playground for quick testing purposes.

There’s a much shorter limit in the chat window.
I have paid ChatGPT, but even so, the limit to the amount of text I can paste in a chat window is probably around 2000 tokens?

When I have something longer, I have to fire up a Python terminal and paste the request in there instead. Somewhat annoying, but not the end of the world.
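A generic sketch of that workflow, with placeholder file name, model, and prompt (not anyone’s actual request from this thread):

```python
# Read a long document from disk and send it through the API instead of the
# chat window. File name, model, and question are placeholders.
import openai

with open("terms_of_service.txt") as f:
    document = f.read()

resp = openai.ChatCompletion.create(
    model="gpt-4-32k",
    messages=[
        {"role": "system", "content": "You analyze legal terms of service."},
        {"role": "user", "content": document + "\n\nSummarize the clauses that limit my rights."},
    ],
)
print(resp["choices"][0]["message"]["content"])
```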

I am more concerned about that number. Someone forgot to add float parsing/rounding?

1 Like

Who doesn’t like sub-picosecond timestamps? :sunglasses:

Here, “Time taken: 38.4 seconds”.
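(For anyone hitting the same thing: formatting the elapsed time is a one-liner in Python. This is a generic sketch, not the exact code behind that screenshot.)

```python
import time

start = time.time()
# ... make the API call here ...
elapsed = time.time() - start
print(f"Time taken: {elapsed:.1f} seconds")  # one decimal place instead of raw float precision
```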

Happy now?

2 Likes

It just shows high ambitions for the future :smile:

2 Likes

The power of large language models is immense. Crazy to think where this will be in a year.

2 Likes

Absolutely can’t wait! Every day now something new and amazing to discover. :slight_smile:

1 Like

Wow, GPT-4 32K would be a game-changer for me. How can I convince them to open the gate for me?

I’m really looking forward to it, but my application for the GPT-4 API has still not been approved. Could you help me get it through (org-62SU5I2LI2FRYTUhW2N5oCQN)? Thanks a lot!

1 Like

I have paid ChatGPT, but even so, the limit to the amount of text I can paste in a chat window is probably around 2000 tokens?

A workaround could be to just tell it you’ll paste it in multiple parts. Though that would probably involve manually splitting it into short enough chunks, which would be extra work.
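A rough sketch of automating that split with tiktoken (the chunk size, model name, and file name are assumptions; adjust as needed):

```python
# Split a long text into chunks of at most ~2000 tokens so each part fits in
# the chat window. Chunk size, model, and file name are placeholders.
import tiktoken

def split_into_chunks(text, max_tokens=2000, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

with open("long_document.txt") as f:
    parts = split_into_chunks(f.read())

for i, part in enumerate(parts, 1):
    print(f"--- part {i} of {len(parts)} ---")
    print(part)
```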

Not sure if anyone would find this helpful, but I’ve been working on a basic Python script for interacting with the API that has features like automatic chat logging, conversation saving/loading, a text-box pop-up for pasting lots of text, etc.

3 Likes