Streaming generation stops and prints long runs of whitespace

Hello,

I am developing a RAG application using the GPT-4o API and I’m having trouble with some responses sometimes “failing”.

The issue happens mostly when I ask the LLM to answer a question for multiple documents in a Markdown table, with the relevant document chunks passed as context. I have streaming set to True, so the LLM starts generating the response, prints only the header of the Markdown table, then stalls for 60-90 s, and finally outputs around 50 instances of token [58040], each of which decodes to 128 whitespace characters.
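
For reference, this is roughly how I'm consuming the stream (a simplified sketch using the openai Python SDK; the real messages list contains the system prompt, the user question and the retrieved chunks):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Simplified placeholder; the real list holds the system prompt, the user
# question and the retrieved document chunks.
messages = [
    {"role": "user", "content": "Answer the question for each contract in a Markdown table."},
]

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
# After the table header is printed, the stream hangs for 60-90 s and then
# emits ~50 chunks that are each a long run of whitespace.
```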

The error is not always reproducible, even when specifying the ‘seed’ parameter (though I believe that is to be expected given the documentation for it), but it happens about 1 out of every 3-4 calls, so it’s a real problem.

I’ve checked the length of the context being passed to the LLM and it doesn’t go over 6k tokens, sometimes much less. I’ve also tried with GPT-4 Turbo and it still fails the same way.

The documents I’m using are legal contracts, so I’m not at liberty to share any of the outputs, but they’re all mostly identical documents with very small differences between them. I am also cleaning the text, removing special characters and anything that could lead to unusual tokens, before it enters the RAG pipeline.

The generation error tends to happen mostly when many documents (10-15) are used as sources for the RAG application. The only “problem” I can think of is that the model is sent 10-15 very similar (and sometimes identical) document chunks and that this sometimes derails the generation, but I can’t say why that would happen.

Has anyone dealt with this issue or knows any way around it?
Thank you


Welcome to the community!

What’s your temperature/top-p?

You can also try to use logit bias to suppress that specific behavior: https://platform.openai.com/docs/api-reference/chat/create#chat-create-logit_bias

Applying a frequency penalty could also be an option.
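
A minimal sketch of what I mean (the token id here is just a placeholder for whichever id your tokenizer reports for the runaway whitespace token; I haven't verified it):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: replace with the id of the token you actually see in the stream.
WHITESPACE_TOKEN_ID = 58040

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Answer the question for each contract in a table."}],
    logit_bias={WHITESPACE_TOKEN_ID: -100},  # -100 effectively bans the token
    frequency_penalty=0.2,                   # optional: gently discourage repetition
)
print(response.choices[0].message.content)
```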



Hi!

I’m using temp=0.25 and logit_bias: {58040: -100}, but I’m still getting that token in the output.

I’m not considering the frequency penalty because the expected outputs include a lot of repetition, since the model answers the same question for many almost identical documents.

Hmm.

Errors like these sometimes seem to happen when, as I would characterize it, the model fails to abide by a certain schema. This is sometimes seen with JSON mode.

It’s possible that in this case it’s something different (“then stops for 60-90s”, which is very suspicious), but some users seem to have been reporting hangs recently-ish. I don’t know if this is that.


This shouldn’t typically impede anything; it will generally just confuse the model in terms of content, but not really in terms of format.

One thing you can try, if your tables are complex, might be to move away from Markdown and try JSON or XML.


I’ve also tried with GPT-4 Turbo

hmm :thinking:

there’s no way you can share the prompt?


My simplified prompt is:

SYSTEM

Objective

You are an AI legal assistant, designed to help the legal department of a company. As an assistant, you are responsible for understanding legal documents and answering questions about specific sections of legal contracts.
Your responses should be formatted in Markdown for readability. You will find below the formatting instructions.
Remember to keep your answers strictly based on the content of the contract. Do not provide personal opinions or legal advice. Your goal is to help users understand the content of the contract, not to advise them on legal matters.
You will have multiple tools at your disposal to respond to each request. It is important that you properly justify the use of these tools and first check if you really need to use one, as you might be able to extract the information from the conversation you have with the user. For example, if the user asks the same question twice, you can use your answer from the first question to respond to the second one.

Tool Policy

You will have multiple tools that you can use to respond to each contract. It is important to adequately justify the use of the tools and first observe if you really need to use one or if you can extract the information from the conversation you have with the user. For example, if the user asks the same question twice, you can use your response to the first question to answer the second.

Formatting Instructions

Your responses should be formatted in Markdown for readability. This includes using headers for each question, bullet points for lists of points, and bold or italic text for emphasis.
The title in markdown is compulsory. It should be related to the question and the title of the section used to answer the question.

HUMAN

[[–User query, below is a specific example–]]
Answer this question for each contract: What happens if there are unagreed risk conditions? Answer in a table and use one column to answer the question and one row per contract.

TOOL

[[—Document chunks—]]

HUMAN

Instructions

Now follow the user instructions mentioned above, please. Avoid using tools if possible.

< End of prompt >

What I found later is that removing the sentence asking for the output to be formatted as a table makes the error go away. The same happens with other questions that have the same table requirement, so the issue seems related to asking the model to output the data as a table.

Yeah, there’s a lot of stuff you can do here. Unfortunately, this is more art than science at this point, and if you ask 10 people you’ll get 20 different approaches.

Off the top of my head I would:

  1. drop the whole tools thing
  2. not put system instructions in a user message (you can have multiple system messages, by the way)
  3. put the schema as the very last message
  4. not use Markdown tables, but JSON or XML instead (see the sketch after this list)
  5. clarify your “justification” step: ask for reasoning before execution, if reasoning is necessary
  6. try to pack most of your conditional instructions into the schema instruction; if it requires additional information and gets too crowded, insert a conceptual hook/anchor into the schema that gets referenced earlier in the instructions (needle description for haystack search)
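
For illustration, here is a rough sketch of points 2-4 with the openai Python SDK (the message contents are placeholders, not a tested prompt):

```python
from openai import OpenAI

client = OpenAI()

# Sketch only: tools dropped, instructions kept in system messages,
# and the output schema placed in the very last message.
messages = [
    {"role": "system", "content": "You are an AI legal assistant. Answer strictly from the provided contract excerpts."},
    {"role": "system", "content": "Contract excerpts:\n<chunk 1>\n<chunk 2>\n..."},
    {"role": "user", "content": "What happens if there are unagreed risk conditions? Answer for each contract."},
    # Schema as the very last message, asking for JSON instead of a Markdown table.
    {"role": "system", "content": (
        'Respond with a JSON object of the form {"contracts": '
        '[{"contract": "<name>", "answer": "<answer>"}, ...]}.'
    )},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={"type": "json_object"},  # optional JSON mode
)
print(response.choices[0].message.content)
```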

I just figured out the solution to this issue. I was using the ‘cl100k_base’ tokenizer to interpret the model’s output while actually calling gpt-4o. Today I realized that this model uses a different tokenizer (‘o200k_base’), so after finding the actual token the model emits when it gets stuck, [72056], and setting its logit bias to -100, the issue is now solved.
Until now I had been retrying the call with a higher temperature just to get a response at all.
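
In case it helps, this is roughly how I found the right id (assuming a recent tiktoken version that knows about gpt-4o):

```python
import tiktoken

# gpt-4o uses the o200k_base encoding, not cl100k_base (used by gpt-4 / gpt-4-turbo),
# so the id I was biasing before (58040) meant nothing to gpt-4o.
enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.name)  # "o200k_base"

# Sanity check: decode the id the model was actually emitting when it got stuck.
print(repr(enc.decode([72056])))

# Encoding the runaway output (a long run of spaces) shows which ids to suppress.
runaway_ids = enc.encode(" " * 128)
print(runaway_ids)

# Ban the offending ids via logit_bias in subsequent calls.
logit_bias = {token_id: -100 for token_id in runaway_ids}
print(logit_bias)
```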


Thanks for coming back to share!

Hopefully it helps someone in the future…
