@ForcefulCJS & @RobbieInOz – Sure, I’ll start with some basics.
First, most of the cases where I need to chunk or submit shorter context come from my work with scientific manuscripts and, occasionally, legal filings.
In both instances, the content would usually fit within 128k tokens (with simple text, assume 180+ pages), so in theory you might assume I'd be fine without chunking.
The reality is that I get far more accurate and comprehensive results by chunking, provided that I test the guiding/system prompt extensively before executing any batch process.
Because I'm processing hundreds, not thousands, of documents, the most effective method is to simply open a document, look for the relevant areas, and insert my own escape characters to indicate where it should break. I first search the document to confirm it contains no pipes, then typically use three pipes ("|||") as a break marker. If it were Markdown with a table, it might contain pipes, but any sequence can work, e.g., combine your sequence with backticks.
This may sound time-consuming or like an extra step; however, in my experience it's the best method to ensure a proper outcome.
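A minimal sketch of that parse, assuming you've already placed the markers (the file name is just a placeholder):

```python
from pathlib import Path

DELIM = "|||"  # the manually inserted break marker

# Hypothetical file name; substitute your own document.
text = Path("manuscript.txt").read_text(encoding="utf-8")

# By this point I've already confirmed the sequence doesn't occur naturally,
# so every occurrence is a break I placed deliberately.
chunks = [c.strip() for c in text.split(DELIM) if c.strip()]
print(f"{len(chunks)} chunks; largest is {max(len(c) for c in chunks):,} characters")
```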
Alternative: When I Use Code to Divide
- I enumerate the text in the simplest fashion: a maximum of X characters per iteration (typically no more than 16k characters, or roughly 2k words). A stripped-down sketch of the whole loop follows this list.
- I scan the current iteration in reverse, looking for anything that appears to be a section header, using a variety of logic; this differs for legal complaints versus manuscripts, etc.
- Markdown is the easiest option; if I can run a quick formatted-doc-to-Markdown conversion, I can quickly divide with a simple regex seeking a header, e.g.
^\s*#+
- Non-Markdown – I often look for patterns such as numbers or roman numerals with fewer than 110 characters following (or double+ linefeeds on either side of a short string of text, indicating an un-numbered header, etc.). For example, here I use a lookbehind to confirm a minimum of two line breaks, then match one to four roman-numeral characters followed by a period and a short heading:
(?<=\n\n)[IVX]{1,4}\.\s+.{6,110}$
- I have several search patterns like this.
- Remember: this is done as a reverse search (or with regex matches enumerated in reverse) so it starts at the end, and a weighted system determines whether I'm comfortable with the break.
- I have also used tools like spaCy (a Python NLP module; I should mention I prefer strongly typed languages but typically write scripts like this in Python) for the NLP side.
- I then take everything from that break forward and carry it into the subsequent chunk (e.g., an "appendNext" string that is added to each new loop iteration).
- If the patterns fail or I have low confidence in a break, I call a local LLM or GPT to find the best one (a minimal prompt sketch follows below). It's very difficult to find a short prompt that does this properly, which may surprise you because it's a relatively easy task in theory. For example, you cannot ask for a character position; you have to ask for the first or last sentence where the most substantial change of topic occurs. Yet even that isn't as easy as it sounds, because the LLM will correct the text (remember, it's an LLM: it wants to put words in the most common order and fix things even when you tell it not to).
- Simply put, extensive testing at this LLM-topic-division point is the most important task before running many documents. The other option is having it return the text you provided with its own break inserted; however, once you get past roughly 1,200 characters you'll see it begin to trim, etc.
- This is a situation where an iterative task with a simple/light LLM like Mistral, for something so rudimentary, is just easier – but that's an entirely separate multi-page post (lol)
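To make the loop concrete, here's a stripped-down sketch of the enumerate-then-reverse-scan approach. The two patterns are the ones from above, but the weights and confidence threshold are simplified placeholders, not my production logic:

```python
import re

MAX_CHARS = 16_000  # rough ceiling per chunk, per the note above

# Simplified stand-ins for the weighted header patterns described above;
# real patterns are tailored per content type (manuscripts vs. filings).
HEADER_PATTERNS = [
    (re.compile(r"(?m)^\s*#+\s"), 1.0),                           # Markdown header
    (re.compile(r"(?m)(?<=\n\n)[IVX]{1,4}\.\s+.{6,110}$"), 0.8),  # roman-numeral heading
]
MIN_WEIGHT = 0.5

def find_break(window: str) -> int | None:
    """Return the offset of the last plausible header in the window, or None."""
    best_pos, best_weight = None, 0.0
    for pattern, weight in HEADER_PATTERNS:
        matches = list(pattern.finditer(window))
        if matches and weight > best_weight:  # crude weighting; position matters too
            best_pos, best_weight = matches[-1].start(), weight
    return best_pos if best_weight >= MIN_WEIGHT else None

def chunk(text: str) -> list[str]:
    chunks, carry, pos = [], "", 0
    while pos < len(text):
        take = MAX_CHARS - len(carry)
        window = carry + text[pos : pos + take]
        pos += take
        carry = ""
        if pos >= len(text):
            chunks.append(window)  # final piece: keep everything
            break
        cut = find_break(window)
        if not cut:                # None or 0 means no confident break was found
            cut = len(window)      # keep the whole window (or call the LLM fallback)
        chunks.append(window[:cut])
        carry = window[cut:]       # "appendNext": the remainder rides into the next chunk
    return chunks
```

In practice I have more patterns, the weighting considers where in the window a match sits, and spaCy's sentence boundaries come in when regex alone isn't enough; the skeleton, though, is the same.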
Simply put, it's an enumeration with scan-to-remove-and-append-to-next logic. In most cases it helps to have an idea of your content's structure. Scientific studies/manuscripts and legal filings are very different, yet tailoring a pattern to each, individually, is fairly straightforward.
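For that LLM fallback step, here's roughly the shape of prompt that has survived my testing, with the caveat that `call_llm` is a placeholder for whatever client you use and the exact wording is the part you'll iterate on:

```python
FALLBACK_PROMPT = (
    "You will receive a passage of text. Identify the single sentence where "
    "the most substantial change of topic occurs. Return that sentence EXACTLY "
    "as it appears in the passage, character for character.\n"
    "IMMUTABLE RULE: do not correct, rephrase, trim, or complete the sentence."
)

def llm_break(window: str, call_llm) -> int:
    """Ask the model for the topic-shift sentence, then locate it ourselves."""
    sentence = call_llm(FALLBACK_PROMPT, window).strip()
    pos = window.find(sentence)
    # If the model "fixed" the text (it often will), the find fails;
    # in that case fall back to keeping the whole window.
    return pos if pos > 0 else len(window)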
It may sound complicated at first, but as you iterate through a dozen test documents, it becomes more of a "shooting fish in a barrel" situation where you methodically begin to see and catch more patterns in your content.
However, after reading this, I suspect you can appreciate why spending 30-60 seconds throwing escape sequences where you find them most relevant is easier in low-volume cases. It also gives you a solution that can essentially be parsed in a single Python operation without any potential pitfalls.
One last tip – understand the manner in which LLMs emulate logic and account for it when writing your prompt; getting the model to remember that each chunk is merely one of many is a common source of issues. You often have to go overboard to reinforce that this is just one piece in a group of many. Reinforce it at the beginning, again with an "IMMUTABLE RULE:" (or similar technique) at the end, plus once somewhere adjacent to the most fundamental of rules if that's not at the beginning.
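As a rough skeleton of that sandwich structure (the wording is illustrative, not a recipe):

```python
def build_system_prompt(core_rules: str, i: int, n: int) -> str:
    """Reinforce 'one chunk of many' at the start, the end, and beside the core rules."""
    return (
        f"You are processing chunk {i} of {n} from a single larger document. "
        f"Other chunks come before and after this one.\n\n"
        f"{core_rules}\n\n"
        f"IMMUTABLE RULE: treat this text as chunk {i} of {n}, never as a "
        f"complete document. Do not summarize or conclude as if it were whole."
    )
```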
I realize all of this sounds convoluted; however, it really isn't – particularly if you're processing a fair number of documents over the long term and accuracy is more important to you than a day spent customizing a script or iteratively optimizing prompts.
In the end you get far superior accuracy – and, even better, a framework you can quickly adapt to another task.
Speaking of Frameworks
You may think to yourself, "this all sounds convoluted; I'll just throw LangChain into the mix, tokenize, then iterate, and save myself some time." I haven't found that to be the case. I've found it easier to simply write a script like this, which is a relatively basic loop/search, than to stand up a "new and shiny" process leveraging tools like LangChain.
While that's been my experience, I'm often wrong – I'm very interested in the opinion of anyone who's found an alternative that's more effective.
I hope this can be of some help.