For those with access to gpt-4-32k, what do you see in openai.Engine.list()?
I see:
gpt-4, permissions None, ready True
gpt-4-0314, permissions None, ready True (the March 14 snapshot, presumably equivalent to the above)
Will I eventually see something like gpt-4-32k in the list as well, so I can choose between them?
Using Python: print(openai.Model.list())
    {
      "created": 1678604599,
      "id": "gpt-4-32k",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": false,
          "allow_sampling": false,
          "allow_search_indices": false,
          "allow_view": false,
          "created": 1682719317,
          "group": null,
          "id": "modelperm-XXX",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "gpt-4-32k"
    },
    {
      "created": 1678604599,
      "id": "gpt-4-32k-0314",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": false,
          "allow_sampling": false,
          "allow_search_indices": false,
          "allow_view": false,
          "created": 1682719443,
          "group": null,
          "id": "modelperm-XXX",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "gpt-4-32k-0314"
    },
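If you just want a quick yes/no on access, you can filter that listing down to the GPT-4 variants (same openai.Model.list() call as above, pre-1.0 Python SDK):

```python
import openai

# Print every GPT-4 variant visible to this API key; if "gpt-4-32k"
# shows up here, the account has access.
models = openai.Model.list()
for model in models["data"]:
    if model["id"].startswith("gpt-4"):
        print(model["id"])
```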
Is anyone with access to gpt-4-32k interested in helping out with a test? I want to know how good GPT-4 is at condensing a full movie script into a synopsis. I'll pay for usage, of course.
I cannot wait to have it on my personal account. I got to play with it a little when it first came out, because another account had access.
From what I was able to test earlier in fiction applications, the 32k model was great at reading more material but didn't necessarily write more when prompted. However, as a larger context window for the chain-style prompting that authors do by hand or with tools like Sudowrite's Story Engine, I think the 32k model has a lot of potential applications.
My experience so far: when I fed it 19k tokens, it spat out a small 300-token response, smaller than what the GPT-4-8k version typically gives on a much smaller, similar input.
I was expecting the output to be much larger, and I was surprised it wasn't. I'm not sure whether that was a "prompt engineering" issue on my end or something else, so it's good to know, @eawestwrites, that this is what you experienced too.
We'll see how this really behaves as more folks start using it.
Still on 8k, sadly, considering I've been submitting feedback and have been on the waitlist since day one. I really wanted to compare it to Claude's new 100k model.
Yeah, I wish I could request 32k on my account.
I want to play with prompts to get it to write more than just 300-500-word responses. Specific limits I want to test include:
- What are the limits of the "memory" component if I ask it to draft, say, the next chapter?
- Could I just say "and here's what happens in the next chapter" followed by a list of commands? (I've had success with this on 8k.)
Really, my hypothesis is that the best way to leverage 32k is to figure out a "chat"-like instruction sequence, almost as if you're playing Managing Editor to your "junior writing partner":
Read this (the story so far). Respond "got it" when you're done.
-got it-
What are your specific ideas to continue the story in the next chapter if {your ending} is where we want to end up in {8} chapters?
-get result-
That's great! What kind of prompt would I need to give you, including instructions to match the writing style of the existing book, to write that chapter with all of the plot ideas you have while keeping characterization?
And then that's probably where that prompt could be fed into GPT-4-8k, or possibly even Turbo…
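In code, that back-and-forth might look something like this (a sketch using the pre-1.0 ChatCompletion API; the story text, ending, and prompt wording are all placeholders):

```python
import openai

# Hypothetical sketch of the "managing editor" loop above.
# story_so_far and desired_ending are placeholders to fill in.
story_so_far = "..."
desired_ending = "..."

messages = [
    {"role": "system", "content": "You are a junior writing partner."},
    {"role": "user", "content": "Read this (the story so far): "
        + story_so_far + "\nRespond 'got it' when you're done."},
]
reply = openai.ChatCompletion.create(model="gpt-4-32k", messages=messages)
messages.append(reply["choices"][0]["message"])

# Ask for plot ideas that land on the desired ending in 8 chapters.
messages.append({"role": "user", "content":
    "What are your specific ideas to continue the story in the next chapter if "
    + desired_ending + " is where we want to end up in 8 chapters?"})
reply = openai.ChatCompletion.create(model="gpt-4-32k", messages=messages)
print(reply["choices"][0]["message"]["content"])
```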
I guess I'll just wait until 32k is out for everyone, unless there's someone I can ask for access.
N2U
I like your idea of playing managing editor for a junior writer AI.
I think what you're asking about in terms of "memory" is the context window, i.e. the amount of context the model can "remember"; in this case that's 32k tokens (about 40 pages, as mentioned earlier).
With the prices mentioned earlier:
That works out to roughly $2-4 (excluding VAT and tax) every time you use the full 32k context window, meaning you would burn through the standard approved usage limit of $120 in 30-60 messages.
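For reference, the arithmetic (assuming the list prices at the time: $0.06 per 1k prompt tokens and $0.12 per 1k completion tokens for the 32k model):

```python
# Rough per-call cost when the full 32k window is used,
# at the gpt-4-32k list prices of the time.
prompt_tokens = 31_000      # context filled almost to the limit
completion_tokens = 1_000
cost = prompt_tokens / 1000 * 0.06 + completion_tokens / 1000 * 0.12
print(f"${cost:.2f} per call")  # -> $1.98
```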
Please keep experimenting!
Think of GPT as a person who's really good at language and text but doesn't understand what you want or in what context. You can achieve what you're asking for with a bit of creative use of gpt-3.5-turbo. If you want to know more, I can highly recommend this course:
N2U
I've gotten access as well, but I haven't done much 32k testing because I want to make sure I've tested my prompts first.
I'm thinking it would be really impractical if the model constantly tried to consume the maximum number of tokens without specifically being told to do so, so we may have to construct a task that forces that.
I'm thinking something like a translation task into multiple languages might do the trick.
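Something roughly like this, maybe (passage, language list, and max_tokens are just a sketch, again on the pre-1.0 SDK):

```python
import openai

# Force a long completion by asking for the same passage translated
# into several languages. Passage and parameters are illustrative.
passage = "..."  # any few-hundred-token passage
languages = ["French", "German", "Spanish", "Japanese", "Italian"]
prompt = ("Translate the following passage into each of these languages, "
          "one complete translation per language: "
          + ", ".join(languages) + ".\n\n" + passage)
reply = openai.ChatCompletion.create(
    model="gpt-4-32k",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=8000,  # leave plenty of room for the translations
)
print(reply["choices"][0]["message"]["content"])
```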
qrdl
Hey, congrats.
A really interesting test I can't wait to try: run code reviews at 8/16/24/32k windows on the same code and see whether the 32k window catches the issues the 8k window does.
I'm very interested to see whether there is capability loss at 32k, or whether you're better off chunking.
You have to make sure your code has reasonably subtle bugs/issues for this to be a relevant test, however.
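Something like this is what I have in mind (a naive fixed-window chunker, not cohesion-aware; the file name and review prompt are placeholders):

```python
import openai
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def review(code_chunk, model):
    # Ask one model for a review of one chunk of code.
    reply = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content":
                   "Review this code for subtle bugs:\n\n" + code_chunk}],
    )
    return reply["choices"][0]["message"]["content"]

source = open("my_module.py").read()  # placeholder file under review
tokens = enc.encode(source)

# Pass 1: 8k model, ~7k-token chunks (headroom left for the reply).
for i in range(0, len(tokens), 7000):
    print(review(enc.decode(tokens[i:i + 7000]), "gpt-4"))

# Pass 2: the whole file in one 32k call, then diff the findings by hand.
print(review(source, "gpt-4-32k"))
```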
Have you used Claude? How do you find it compared to GPT-4?
I've been testing 32k with my roguelike code base… expensive, but very helpful when the 4k or 8k models can't handle the question and need more context (code)… It still sometimes steers you in weird directions to "solve" the problem… reminds me of the "get rid of all spam" request where the AI says, "Okay, I will remove all humans and there will be no more spam." lol. You have to be really specific. Same with fiction, though, really…
Like others, I've found that taking what it outputs and feeding it back as if the user came up with it can work well… as can saying you want to get to the root cause of the issue rather than just make the error/bug go away… Overall, though, it's super useful for a non-coder like myself.
qrdl
Yeah, 32k will have use cases for sure, but I'm trying to get a deeper understanding of the capability degradation that can occur at longer contexts.
It's a bit frustrating that nobody is (at least publicly) looking into this; everyone just seems to be excited about the headline context size.
Are you thinking a bigger context window could be worse in some cases? I guess it depends on what you're putting into the prompt(s)… garbage in / garbage out is more relevant than ever. Heh.
What kind of tests do you want to run?
qrdl
See my post above.
The question is: what's the best way to find bugs in code? Prompt GPT-4 8k at a time, or 32k at a time? Or both? Ignore the cost for a moment; just assume it isn't a factor.
Some bugs will obviously span chunking boundaries, so 32k may be required by default. But will it pick up the subtle ones, or do you still have to go 8k at a time?
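(One hedge I'm considering if chunking is unavoidable: overlap the windows so code near a boundary always appears in two chunks. Sizes below are arbitrary, just a sketch.)

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def overlapping_chunks(text, size=7000, overlap=1000):
    # Step forward by (size - overlap) so each boundary region
    # is covered by two consecutive chunks.
    tokens = enc.encode(text)
    step = size - overlap
    return [enc.decode(tokens[i:i + size])
            for i in range(0, len(tokens), step)]
```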
From my limited time with 32k and 3+ years of prompting older GPT models, I'd say it's best to give it only what it needs to solve the problem or find the bug. In some cases, though, I've found that I didn't include the offending function or method, and it had trouble "spotting the bug" without the full context… So it's a balance, I'd say: finding the right context length for your particular task. Bigger isn't always better, in my experience.
Are you asking "will it be as good at finding subtle bugs within an 8k chunk IF I'm lucky enough to get the chunking right?" That's a very different question from "will it be as good at finding subtle bugs in a random 8k chunk drawn independently of my chunking?" or "will it be as good at finding subtle bugs in code that's only 8k long?"
Three different questions, no? All interesting.
Does 32k have the same number of attention heads?
qrdl
Yeah, exactly; I made the same point above. By default you'll probably want to use 32k to catch chunk-spanning bugs. But should you also do 8k? Presumably you wouldn't chunk randomly, but with reasonable cohesion.
I think the consensus is yes, if you want to be sure, but hopefully someone will do some official evals.
Sorry, I meant to reply to bruce. Can't seem to edit the reply-to, unfortunately.
I cannot imagine ever sending 32k tokens in pursuit of a bug; even 8k seems like a lot.
That's just a lot of code.
But I guess it really depends on the bug you're looking for. If the code throws an error, you definitely shouldn't need to send that much code.
If the code is outputting an incorrect result, that's different, but that's still a ton of code and probably not something an LLM can handle unless the code is very well documented, you have correct pseudocode and well-formatted algorithms to reference, or both.
qrdl
It isn't really a bug hunt so much as a second look. Getting to 100% code coverage in unit tests can be expensive, and it isn't just about bugs: GPT-4 is quite good at security reviews, and it can be used for style commentary as well.
For actual known bugs, I find GPT-4 isn't that great, and I frequently have to intervene.