It looks like GPT-4-32k is rolling out

Yeah,

  1. I’ve tested prompt injections on my system and others.
  2. I’ve signed up within the first few minutes.
  3. I’ve been a regular in this forum.
  4. I’ve stated at signup that I’m most excited about 32k.

Maybe my ML score was through the roof or something :rofl:

3 Likes

Maybe,
I also mentioned the 32k version when I signed up, so there may be something to it.

I can understand why people are excited, but I think the rollout will be slow on this one.

1 Like

Rollout might speed up due to competitive pressure. For example, plugins were rolled out right after Google I/O, and the Bard waitlist was relaxed. Bard has fresh access to the web, or at least some proxy of it, I believe.

Given the cost, server utilization might not be a huge issue.

I guess I’m around the 48-hour mark, and while I do have the 8k model, I still don’t have access to the 32k.

1 Like

I am wondering what is the real use case for a 32K context window? I am using the LLM for knowledge base Q&A, in one case for state regulatory information. I have broken all the documents down into Sections. This is the base structure of all of these documents: Title → Division → Part → Chapter → Article → Section.

The largest Section I have is 41,411 characters = 8,494 tokens.

Now, I suspect that 90% of the queries against this knowledge base will be answered within single Sections, and probably 75% of those answers will be found in one or two paragraphs. So, if I leave my largest Sections intact because I have a 32K context window, I can still only bring back at most 3 or 4 documents.

What is the point of sending huge text files as context when chunking those same documents will give me the same if not better responses, AND allow a greater number of documents to be evaluated?
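
For concreteness, this is roughly the retrieval loop I have in mind, as a minimal Python sketch. The `embed()` function here is a toy stand-in for whatever real embedding call you use; it just hashes words into a normalized vector so the sketch runs end to end.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a real embedding call (e.g. an embeddings API).
    # This toy version hashes words into a fixed-size vector and normalizes it;
    # swap in your real provider.
    v = np.zeros(256)
    for word in text.lower().split():
        v[hash(word) % 256] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def top_sections(query: str, sections: list[str], k: int = 4) -> list[str]:
    """Rank pre-chunked Sections by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(sections, key=lambda s: float(np.dot(q, embed(s))), reverse=True)[:k]
```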

Outside of summarizing a book, or being able to process a larger number of documents, I don’t see the real use case for 32K or 100K context windows.

Well, imagine a bug in code that involves several different modules. You need all of the code to find it.

Or imagine reviewing a document where you state one rather specific detail in one part and then contradict it in another part.

Chunking won’t work so well, because those subtleties won’t necessarily get captured when you process the chunk and try to apply it against the other ones.

For this reason, I’m very curious about how well Claude 100k works. If they’re just rolling chunks, it probably doesn’t work so well.

3 Likes

Now that you mention it, every bit of that 8K is coming in handy for some code issues.

The 32k context helps a little bit with programming. It would be interesting to see the overall picture of an application. Roughly 250 lines of Haskell code (I took some random code from the Servant library) came to about 4k tokens.

If we now want to give the system a context of 50k lines of code, that would come to roughly 800k tokens.

Imagine you would like to put in programs that have a million lines of code, maybe for starships that fly to Mars. Or the Chrome browser, which is estimated at around 35 million lines of code: feeding that in would take a context of roughly 560 million tokens. So really, 32k is not much. It requires humans to cherry-pick what code they want to present to the system.
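
Just to make the arithmetic explicit (back-of-the-envelope, using the ~16 tokens per line implied by the Servant sample above):

```python
# 250 lines ≈ 4,000 tokens from the Servant sample -> ~16 tokens per line of code.
tokens_per_line = 4_000 / 250

print(f"{50_000 * tokens_per_line:,.0f}")      # ~800,000 tokens for 50k lines
print(f"{35_000_000 * tokens_per_line:,.0f}")  # ~560,000,000 tokens for Chrome-sized code
```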

We can’t easily detect semantically identical code that way. Perhaps modules A and C contain a duplicate function, but the two are written differently, so a simple classic tool could not find the duplicate. Now, when we feed in modules A and B we can’t see the duplicate, and when we present B and C we also can’t find it. And there are simply too many combinations once we try this with a whole alphabet that has not only the letters A–Z but thousands of letters more.
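
The combinatorics alone are brutal: if only two modules fit in the window at once, the number of pairs you would have to feed in grows quadratically with the number of modules. A quick illustration:

```python
from math import comb

# Pairs of modules to compare when only two fit in the context window at a time.
for n_modules in (3, 26, 1_000, 10_000):
    print(n_modules, comb(n_modules, 2))  # C(n, 2) = n * (n - 1) / 2
# 3 -> 3, 26 -> 325, 1_000 -> 499_500, 10_000 -> 49_995_000
```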

So ideally we could benefit from billions of tokens in the context. But: that sounds like fantasy land :slight_smile:

1 Like

I am wondering what is the real use case for a 32K context window?

I have a feeling we’re going to find comments like this quite humorous in the future. It would be like in the early days of computers saying “why could you possibly need a 1 GB hard drive?”

2 Likes

Agree. Right now we don’t know, because we have been feeding the prompt with matches from embeddings, or whatever. But without this limitation, we can start to think bigger.

Like I said above, an obvious use of 32k is a Q&A bot without embeddings. But of course, the price is really high and may not be justified over embeddings, currently. That may change when the price goes down.

Let’s hope the market forces that brought down the cost of storage also work to bring down the cost of context window tokens!

I believe the current understanding (aside from a few recent research papers) is that the amount of compute necessary scales quadratically with context token length.

So, 10x the context tokens would require roughly 100x the FLOPs.
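
Putting that rule of thumb in numbers (just the n² scaling relative to an 8k baseline, ignoring everything else):

```python
# Standard attention cost grows roughly with the square of the context length.
base_ctx = 8_192
for ctx in (8_192, 32_768, 100_000, 1_000_000):
    print(f"{ctx:>9} tokens -> ~{(ctx / base_ctx) ** 2:,.0f}x the attention FLOPs of 8k")
```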

So, absent a revolutionary new idea (which can actually be implemented at scale without introducing other issues), that’s probably 6–10 years away.

:frowning_face:

Have you read about stuff like this?

1 Like

Not really. Sub-quadratic-time models are starting to be developed. This one uses good ol’ CNNs. I can’t attest to the performance or maturity, but this nut should be cracked soon. BTW, basic RNNs already have an infinite window, but they suffer in performance compared to transformers.

Lol, yeah, that was one of the “few recent research papers” I was referring to.

Sure, but that was the “absent a revolutionary new idea (which can actually be implemented at scale without introducing other issues)” caveat I mentioned.

It remains to be seen if those types of models can match the performance of the current SOTA models or if there will be other issues that come with them.

I’m a firm believer in the FFT (Fast Fourier Transform) and convolution in general for pretty much any linearized signal processing problem.

I was thinking of diving into this Git Repo when I get some time:

I think this is an interesting concept

I think models with an infinite window will always be worse than models with a finite context window, and not only because there’s a finite amount of computational resources available.

Having a finite context window forces you to think about what context you pass to the model.

With an infinite context window I could pass in the entire arXiv library, all the text from Stack Exchange, and all the laws from every nation… It’s still not going to make the answer better when I ask for a fairytale about dragons :laughing:

2 Likes

The biggest problem with infinite windows (today) is that they still tend to have a finite window local (spatially) to adjacent data (thinking RNNs here). And let’s not forget their recurrent, non-parallelizable nature (more and more lag). :-1:

The attention mechanism solves a bunch of things, but the big one is that the model pays attention to certain things without simply looking at adjacent things; it will pay attention to non-adjacent things if it has to.

So, the holy grail would be huge windows, with lots of attention everywhere, and low computational complexity. :+1:

So, the FFT and convolution provide low computational complexity. “Flash attention” is supposed to solve the attention problem (driving the quadratic cost down). Then, with low overall computational complexity (and parallelization), you can do larger windows in a reasonable time and get good performance.
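
For anyone who hasn’t played with it, the FFT trick is easy to see in a few lines of numpy: direct convolution of two length-n signals costs O(n²), while going through the FFT costs O(n log n), and the results match.

```python
import numpy as np

def fft_convolve(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Linear convolution via the FFT: O(n log n) instead of O(n^2) direct."""
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

x, h = np.random.randn(4_096), np.random.randn(4_096)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```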

In my experience, the real leaps come from breakthroughs in algorithms, not simply from throwing more machines at the problem. Soon, you will run a model comparable to GPT-4 on your phone, and you will do it with 10% of the computational resources of the current GPT-4. This is due to more efficient algorithms (and more memory).

Then, as computational resources grow (mostly through getting more and more parallelization), the low end (phones) and the high end (servers) grow with it.

But I think most folks will be happy with a local instance of GPT-4 on their phones. And the crazy super AI will be on the servers. But even that will go to your phone. Especially with quantum computing (guessing). There has to be a limit: eventually you will run into atoms, the speed of light, and the Casimir effect. But who knows!

It would make sense for OpenAI to at least put a facade / front end over some kind of cosine-similarity-style API providing ‘infinite context windows’.

The price could be ultra cheap as well.

The way to do it, I think, would be some sort of strategy pattern where you can select the similarity/embedding algorithms and maybe some other parameters (such as the max tokens in the real context window).

For example, I was running web GPT-4 against a PDF today that was about 40 pages long (the 2019 nonprofit filing by OpenAI, heh), and it didn’t make any sense to me that web GPT-4 choked on it. It should have been able to ‘skim’ the PDF, extract snippets similar to what I was looking for, and then run those through GPT-4.

There would be caching for faster snippet retrieval. Note they’d be very up front about what they’re doing here.
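
A rough sketch of the kind of strategy pattern I mean, with made-up names (the similarity function and the real-window token budget are the pluggable bits; the snippet embeddings would come from whatever cache they keep per document):

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

Similarity = Callable[[np.ndarray, np.ndarray], float]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

@dataclass
class RetrievalStrategy:
    # Pluggable pieces the caller could select.
    similarity: Similarity = cosine
    max_context_tokens: int = 8_000

    def select(self, query_vec: np.ndarray,
               snippets: Sequence[tuple[str, np.ndarray, int]]) -> list[str]:
        """snippets are (text, cached_embedding, token_count) tuples.
        Rank by similarity, then greedily pack the real context window."""
        ranked = sorted(snippets, key=lambda s: self.similarity(query_vec, s[1]),
                        reverse=True)
        chosen, used = [], 0
        for text, _, tokens in ranked:
            if used + tokens <= self.max_context_tokens:
                chosen.append(text)
                used += tokens
        return chosen
```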

fwiw:

‘retriever-less architectures’ heh, yeah maybe, or maybe we just don’t get access.

1 Like