It looks like GPT-4-32k is rolling out

For those with access to gpt-4-32k, what do you see in openai.Engines.list()?
I see
gpt-4, permissions None, ready True
gpt-4-0314, permissions None, ready True (the March 14 checkpoint, presumably equivalent to the above)

Will I eventually see something like gpt-4-32k in the list as well, so I can choose between them?

Using Python: print(openai.Model.list())

```
{
  "created": 1678604599,
  "id": "gpt-4-32k",
  "object": "model",
  "owned_by": "openai",
  "parent": null,
  "permission": [
    {
      "allow_create_engine": false,
      "allow_fine_tuning": false,
      "allow_logprobs": false,
      "allow_sampling": false,
      "allow_search_indices": false,
      "allow_view": false,
      "created": 1682719317,
      "group": null,
      "id": "modelperm-XXX",
      "is_blocking": false,
      "object": "model_permission",
      "organization": "*"
    }
  ],
  "root": "gpt-4-32k"
},
{
  "created": 1678604599,
  "id": "gpt-4-32k-0314",
  "object": "model",
  "owned_by": "openai",
  "parent": null,
  "permission": [
    {
      "allow_create_engine": false,
      "allow_fine_tuning": false,
      "allow_logprobs": false,
      "allow_sampling": false,
      "allow_search_indices": false,
      "allow_view": false,
      "created": 1682719443,
      "group": null,
      "id": "modelperm-XXX",
      "is_blocking": false,
      "object": "model_permission",
      "organization": "*"
    }
  ],
  "root": "gpt-4-32k-0314"
},
```
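For anyone checking their own account, here’s a minimal sketch using the same pre-1.0 openai-python interface as above (fill in your own API key):

```python
import openai

openai.api_key = "sk-..."  # your API key

# Check whether the 32k variants are visible to this account
model_ids = [m["id"] for m in openai.Model.list()["data"]]
for target in ("gpt-4-32k", "gpt-4-32k-0314"):
    print(target, "available" if target in model_ids else "not available")
```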

Is anyone with access to GPT-4-32k interested in helping out with a test? I want to know how good GPT-4 is at condensing a full movie script into a synopsis. I’ll pay for usage, of course.


I cannot wait to have it on my personal account. I got to play with this a little bit when it first came out because another account had access.

From what I was able to test earlier in fiction applications, the 32k model was great at reading more material, but it didn’t necessarily write more when prompted. However, as a larger context window for the chain-type prompting authors are doing by hand or with tools like Sudowrite’s Story Engine, I think the 32k model has a lot of potential applications.


My experience so far: when I fed it 19k tokens, it spat out a small ~300-token response, smaller than what the GPT-4-8k version typically produces on a much smaller, similar input.

I was expecting the output to be much larger, and I was surprised it wasn’t. I’m not sure if this is a “prompt engineering” issue on my end or something else, so it’s good to know, @eawestwrites, that this is what you experienced too.

We’ll see how this really behaves as more folks start using it.


Still on 8k, sadly, considering I’ve been submitting feedback and have been on the waitlist since day one. I really wanted to compare it to Claude’s new 100k model.


Yeah, I wish I could request 32k on my account.

I want to play with prompts to get it to write more than just 300-500-word responses. Specific limits I want to test include:

  • What are the limits of the “memory” component if I ask it to draft, say, the next chapter?
  • Could I just say “and here’s what happens in the next chapter” followed by a list of commands? (I’ve had success with this on 8k.)

Really, my hypothesis is that the best way to leverage 32k is to figure out a “chat”-like instruction sequence, almost as if you’re playing Managing Editor to your “junior writing partner”:

Read this: (the story so far). Respond “got it” when you’re done.
-got it-
What are your specific ideas to continue the story in the next chapter if {your ending} is where we want to end up in {8} chapters?
-get result-
That’s great! What kind of prompt would I need to give you, including instructions to match the writing style of the existing book, to write that chapter with all of the plot ideas you have while keeping characterization?

And then that’s probably where that prompt could be fed into GPT-4-8k or possibly even Turbo…
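To make that concrete, here’s a rough sketch of the managing-editor loop against the chat completions API (pre-1.0 openai-python, as used earlier in the thread; the story file, ending, and chapter count are hypothetical placeholders):

```python
import openai

# Hypothetical inputs for illustration
story_so_far = open("story.txt").read()
ending = "the heroine reconciles with her estranged brother"

messages = [{"role": "user", "content":
             f"Read this (the story so far). Respond 'got it' when you're done:\n\n{story_so_far}"}]
reply = openai.ChatCompletion.create(model="gpt-4-32k", messages=messages)
messages.append(reply["choices"][0]["message"])  # should be "got it"

messages.append({"role": "user", "content":
                 f"What are your specific ideas to continue the story in the next chapter "
                 f"if {ending} is where we want to end up in 8 chapters?"})
reply = openai.ChatCompletion.create(model="gpt-4-32k", messages=messages)
messages.append(reply["choices"][0]["message"])  # the plot ideas

messages.append({"role": "user", "content":
                 "What kind of prompt would I need to give you, including instructions to "
                 "match the writing style of the existing book, to write that chapter with "
                 "all of the plot ideas you have while keeping characterization?"})
reply = openai.ChatCompletion.create(model="gpt-4-32k", messages=messages)
print(reply["choices"][0]["message"]["content"])  # a prompt you could hand to gpt-4-8k or turbo
```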

I guess I’ll just wait until 32k is out for everyone unless there’s someone I can ask for access 🙂

I like your idea of playing managing editor for a junior writer AI ❤️

I think what you’re asking about in terms of “memory” is the context window, i.e. the amount of context the model can “remember”; in this case it’s 32k tokens (about 40 pages, as mentioned earlier).

With the prices mentioned earlier:

That will be ~$2-4 (excluding VAT and tax) every time you use the full 32k context window, meaning you will burn through the standard approved usage limit of $120 in 30-60 messages.
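To make the arithmetic concrete, a quick back-of-the-envelope in Python (using the gpt-4-32k launch pricing of $0.06 per 1k prompt tokens and $0.12 per 1k completion tokens):

```python
# Rough per-call cost of a full 32k-token gpt-4-32k call at launch pricing:
# $0.06 / 1k prompt tokens, $0.12 / 1k completion tokens
prompt_tokens, completion_tokens = 31_000, 1_000  # nearly full window, short reply
low = prompt_tokens / 1000 * 0.06 + completion_tokens / 1000 * 0.12  # ~ $1.98
high = 32_000 / 1000 * 0.12  # ~ $3.84 if the window were all completion tokens
print(f"~${low:.2f} to ~${high:.2f} per full-window call")
print(f"~{120 / high:.0f} to ~{120 / low:.0f} calls on a $120 limit")  # ~ 31 to 61
```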

Please keep experimenting ❤️

Think of GPT as a person who’s really good at language and text but doesn’t understand what you want or in what context. You can achieve what you’re asking for with a bit of creative use of gpt-3.5-turbo. If you want to know more, I can highly recommend the course:

I’ve gotten access as well, but I haven’t done much 32k testing because I want to make sure I’ve tested my prompts first.

I’m thinking it would be really impractical if the model constantly tried to consume the maximum number of tokens without specifically being told to do so, so we may have to construct a task that forces it to.

I’m thinking something like a translation task into multiple languages might do the trick.
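Something along these lines, perhaps; a sketch (the file name, language list, and token budget are all illustrative):

```python
import openai

# Fan one passage out into many languages in a single completion to force
# the model to produce a long output
passage = open("sample_chapter.txt").read()  # a few thousand tokens
languages = ["French", "German", "Spanish", "Italian", "Portuguese", "Dutch"]

prompt = ("Translate the following passage into each of these languages in turn, "
          f"with a heading per language: {', '.join(languages)}.\n\n{passage}")
response = openai.ChatCompletion.create(
    model="gpt-4-32k",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=20_000,  # leave room for the prompt inside the 32k window
)
print(response["usage"])  # see how many tokens it actually consumed
```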


Hey, congrats.

A really interesting test I can’t wait to try out:

Run code reviews at 8/16/24/32k windows on the same code and see whether the 32k window can catch the issues the 8k window does.

I’m very interested to see if there is capability loss at 32K, or if you’re better off chunking.

You have to make sure your code has reasonably subtle bugs/issues for this to be a relevant test, however.
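A sketch of how that comparison might be wired up, assuming tiktoken for token counting (the file name, prompt, and reply reserve are illustrative):

```python
import openai
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode(open("project.py").read())  # the code base under review

for window in (8_000, 16_000, 24_000, 32_000):
    budget = window - 1_000  # reserve ~1k tokens for the model's reply
    findings = []
    for start in range(0, len(tokens), budget):
        chunk = enc.decode(tokens[start:start + budget])
        reply = openai.ChatCompletion.create(
            model="gpt-4-32k",  # one model, varying how much code it sees per call
            messages=[{"role": "user",
                       "content": "Review this code for subtle bugs:\n\n" + chunk}],
        )
        findings.append(reply["choices"][0]["message"]["content"])
    print(f"{window}-token window: {len(findings)} chunk(s) reviewed")
```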

Have you used Claude? How do you find it compared to GPT4?

I’ve been testing 32k with my roguelike code base… expensive, but very helpful when 4k or 8k can’t handle the question and more context (code) is needed… It still tends to steer you in weird directions sometimes to “solve” the problem… reminds me of the “Get rid of all spam” request where the AI says, “Okay, I will remove all humans and there will be no more spam.” lol You have to be really specific. Same with fiction, though, really…

Like others, I’ve found that taking what it outputs and feeding it back as if the user came up with it can work well… or saying that you want to get to the root cause of the issue rather than just make the error/bug go away… Overall, though, it’s super useful for a non-coder like myself.


Yeah, 32k will for sure have use cases, but I’m trying to get a deeper understanding of the capability degradation that can occur at larger context sizes.

It’s a bit frustrating that nobody is (at least publicly) looking into this; everyone just seems to be getting excited about the topline number.


Are you thinking a bigger context window could be worse in some cases? I guess it depends on what you’re putting into the prompt(s)… garbage in / garbage out is more relevant than ever. Heh.

What kind of tests do you want to run?


See my post above.

The question is: what’s the best way to find bugs in code - prompting GPT-4 8k at a time, or 32k at a time? Or both? Ignore the cost for a moment; just assume it isn’t a factor.

Some bugs will obviously span chunking boundaries, so 32k as a default may be required. But will it pick up the subtle ones, or do you still have to go 8k at a time?


From my limited time with 32k and 3+ years prompting older GPT models, I’d say it’s best to give it only what it needs to solve the problem or find the bug. However, in some cases I’ve found that I didn’t include the offending function / method or whatever, and it had trouble “spotting the bug” without all the info / context… So it’s a balance, I’d say: finding the right context length for your particular task. Bigger isn’t always better, in my experience.


Are you asking ‘will it be as good at finding subtle bugs within an 8k chunk IF I’m lucky enough to get the chunking right?’ That seems a very different question than ‘will it be as good at finding subtle bugs in a random 8k chunk drawn independently from my chunking?’ or ‘will it be as good at finding subtle bugs in code only 8k long?’

Three different questions, no? All interesting.
Does 32k have the same number of attention heads?


Yeah, exactly, I made the same point above. By default you’ll probably want to do 32k to catch chunk-spanning bugs. But should you also do 8k? Presumably you wouldn’t chunk randomly, but with reasonable cohesion.

I think the consensus is yes, if you want to be sure, but hopefully someone will do some official evals.

Sorry, I meant to reply to bruce. Can’t seem to edit the reply-to, unfortunately.
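For what it’s worth, one sketch of what “reasonable cohesion” chunking could look like for Python source: split at top-level definition boundaries instead of fixed token offsets (the file name and the chars-per-token heuristic are illustrative):

```python
import ast

# Chunk a Python file at top-level definition boundaries rather than
# raw token offsets, so each chunk stays cohesive
source = open("project.py").read()  # illustrative file name
lines = source.splitlines(keepends=True)

chunks, current, size = [], [], 0
for node in ast.parse(source).body:
    segment = "".join(lines[node.lineno - 1:node.end_lineno])
    if size + len(segment) > 8_000 * 4:  # ~4 chars per token heuristic
        chunks.append("".join(current))
        current, size = [], 0
    current.append(segment)
    size += len(segment)
if current:
    chunks.append("".join(current))

print(len(chunks), "cohesive chunks of at most ~8k tokens each")
```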

I cannot imagine ever sending 32k tokens in pursuit of a bug—even 8k seems like a lot.

That’s just a lot of code.

But, I guess it really depends on the bug you’re looking for. If the code throws an error, you definitely shouldn’t need to send that much code.

If the code is outputting the incorrect result, that’s different, but still a ton of code and probably not something an LLM can do unless the code is very well documented, you’ve got correct pseudocode and well-formatted algorithms to reference, or both.


It isn’t really a bug hunt, but rather a second look. Getting to 100% code coverage in unit tests can be expensive, and it isn’t just about bugs: there are also security reviews, which GPT-4 is quite good at. It can be used for style commentary as well.

For actual known bugs, I find GPT-4 isn’t that great, and I frequently have to intervene.