GPT-3.5 and GPT-4 Turbo 1106: Question about Output Context Length

Hi,

I have a question. We’re currently working on an LLM-based application using GPT-3.5 Turbo 16k and GPT-4 32k with Enterprise GPT (via the Azure OpenAI service).

I have read that models like GPT-3.5 Turbo 16k 0613 will be discontinued in June 2024 and that the recommended replacement is GPT-3.5 Turbo 1106, which has a default input context of 16k tokens but an output limit of 4k tokens. Will a 16k output limit also be available with this model, and a 32k output limit for GPT-4?

Having large output lengths of 16k and 32k tokens is very important for our use cases, and I have not found any documentation saying that enterprise customers will see variants such as a GPT-3.5 Turbo 1106 with a 16k max output length, or a GPT-4 1106 with 32k.

Please clarify this.


No, GPT-4-1106 also has a 4k output limit.

The only hope we have is that Microsoft pushes the discontinuation of the 03 and 06 models out again.

When they cancelled davinci embeddings, they offered reimbursement for re-embedding everything.

Maybe they’ll come up with an “upgrade path”, but we now know that they can and will replace good products with worse ones.

But some of our use cases need the larger outputs that have been working well so far. Say, for example, you’re generating a test script with 30-40 steps; that is not going to work with a 4k token limit. I cannot tell end users to split that into multiple test scripts when it relates to one function, and this is just one example.
The use cases we have built are giving us significant efficiency gains, and a 4k token limit isn’t going to cut it. I hope they give folks using Enterprise GPT a larger output context. If they don’t, I assume a lot of other people will also have to change their implementations, e.g. to the Assistants API with some sort of chunking and collating.

There’s a workaround, but it’s kind of a pain and expensive (a rough code sketch follows the steps below):

step 1:

we need a test script for bla bla bla

could you generate an outline of the tests that should be done? for now, they should all be placeholders that throw notImplemented; we’ll implement them later.

step 2:

we need a test script for bla bla bla

here’s what we have so far:

test 1: 
 not_implemented
test 2:
 not_implemented
test 3:
 not_implemented
...

can you write me some implementations for cases 1 and 2? please give them to me as diffs or something

step 3:

we need a test script for bla bla bla

here’s what we have so far:

test 1: 
 test impl a
test 2:
 test impl b
test 3:
 not_implemented
...

can you write me some implementations for cases 3 and 4? please give them to me as diffs or something

etc.
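Here’s a minimal sketch of that flow in Python, assuming the v1 `openai` SDK against an Azure OpenAI deployment. The endpoint, key, deployment name, prompts, and the “~40 tests, two per call” numbers are all placeholders, not a tested implementation:

```python
# Rough sketch of the outline-then-fill workaround above.
# Assumes the v1 openai Python SDK; endpoint, key, API version, and
# deployment name are placeholders for your own Azure OpenAI setup.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    api_version="2024-02-01",
)
DEPLOYMENT = "gpt-4-1106"  # your Azure deployment name (assumption)


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4000,  # roughly the per-call output cap on the 1106 models
    )
    return resp.choices[0].message.content


# Step 1: get an outline where every test body is a not_implemented stub.
outline = ask(
    "We need a test script for <feature>. Generate an outline of the tests "
    "that should be done; for now every test body should just throw "
    "notImplemented, we'll implement them later."
)

# Steps 2..n: implement a couple of tests per call, then collate.
implementations = []
for first in range(1, 40, 2):  # assume ~40 test cases, two per request
    implementations.append(ask(
        "We need a test script for <feature>. Here's what we have so far:\n\n"
        f"{outline}\n\n"
        f"Can you write me implementations for cases {first} and {first + 1}? "
        "Please give them to me as standalone snippets I can paste in."
    ))

full_script = outline + "\n\n" + "\n\n".join(implementations)
print(full_script)
```

The cost pain comes from re-sending the growing script (or outline) as input on every call, which is why this is expensive even though each call stays under the 4k output cap.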

But again, that requires medium-to-complex changes to things that are already deployed to production. Model changes and updates should provide improvements over current models, not limit them. I appreciate the input context increasing, but to me 32k was enough and 128k is plenty; having at least 16k and 32k output token limits is required for many of our use cases, and having to redesign all of this will cost money and resources.

I am pleading with OpenAI to see if they can still provide these output token limits for enterprise customers (if not all customers).

yep

unfortunately that didn’t do anything for davinci embeddings either.

they just killed it :frowning:

As you get deeper into developing your application, you’ll likely find that the current AI models are very reluctant to produce even 1,000 tokens of writing. They will only extend the text in a processing situation where they must replicate an input, and even then the AI likely finds a way to inexplicably wrap up the output before hitting any max_tokens limit you set yourself…

Token production is limited by trained behavior, not just the hard cap.
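A quick way to see this for yourself (a minimal sketch, assuming the v1 `openai` Python SDK and a `client`/deployment configured as in the earlier sketch):

```python
# Ask for something long and inspect why the model stopped.
resp = client.chat.completions.create(
    model=DEPLOYMENT,
    messages=[{"role": "user",
               "content": "Write a very long, detailed test plan for <feature>."}],
    max_tokens=4000,  # well above what the model will usually produce
)
choice = resp.choices[0]
print(choice.finish_reason)           # usually "stop", not "length"
print(resp.usage.completion_tokens)   # often far below the 4,000 you allowed
```

If `finish_reason` is "stop", the model ended the answer on its own; "length" would mean it actually hit the cap.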


That’s also true - it will give you five or six, and then insert a comment saying

// put the remaining stuff here :kissing_closed_eyes:

and close the schema
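One way to catch that before it ships is a hypothetical guard like the following; the regex, the `full_script` variable, and the re-prompt step are assumptions, not something from this thread:

```python
import re

# Hypothetical check for "put the remaining stuff here"-style placeholder
# comments, so the caller can re-prompt for the missing pieces.
PLACEHOLDER = re.compile(
    r"//\s*(put|add|insert|implement)\b.*\b(remaining|rest|here)\b",
    re.IGNORECASE,
)


def looks_truncated(code: str) -> bool:
    return bool(PLACEHOLDER.search(code))


if looks_truncated(full_script):
    # e.g. re-request only the cases that are still stubbed out
    print("Model elided part of the output; ask for the missing cases again.")
```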


That will have a significant impact on IT companies using generative AI to efficiently produce first-level drafts that would otherwise take much more time. We are already able to produce a lot today with GPT-3.5 Turbo 16k and GPT-4 32k: think about much more complex code generation with high-quality prompts in one or two attempts, or SDLC use cases where far more content generation is required. I am hoping companies will put some pressure on this ask and something may change. Anyway, I won’t speculate about what they might or might not do; I just wanted to ask whether I was missing larger-output models, and apparently that’s not the case.
If we have to change designs like these every 6 months to a year, it won’t be sustainable, and companies will have to rethink the approach they take when building apps with LLMs.
