Response being cut off in Azure OpenAI

So I’ve deployed GPT-4 with 50k TPM, GPT-3.5 Turbo 16k with 120k TPM, and GPT-4 Turbo preview with 80k TPM in Azure OpenAI, and I’ve set max_tokens=-1. My solution follows RAG: it retrieves multiple chunks of the document based on the given prompt. When I chat with this and ask the model to list all the items present in the document, the response only includes about 11 or 12 items out of 50+. I’ve been switching between the three deployments hoping at least one of them would list every single item in the document, but no luck. Where could the problem be? I’ve also experimented with the max_tokens parameter, only to get InvalidRequestError.

P.S.: I can’t share the code for obvious reasons.

Hi! Welcome to the community :smiley:

max_tokens just tells the model after how many tokens the generation gets cut off; it doesn’t change how much context the model sees.
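Just to illustrate (a minimal sketch using the openai Python SDK against an Azure deployment; the endpoint, key, and deployment name are placeholders):

```python
from openai import AzureOpenAI  # assumes openai>=1.0

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",                                       # placeholder
    api_version="2023-12-01-preview",
)

resp = client.chat.completions.create(
    model="gpt-4",   # your Azure deployment name
    messages=[{"role": "user", "content": "List every item in the document."}],
    max_tokens=500,  # only caps how long the answer can be; it adds no context
)
print(resp.choices[0].message.content)
```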

> When I chat with this and ask the model to list all the items present in the document

> My solution follows RAG: it retrieves multiple chunks of the document based on the given prompt.

how is the model supposed to list “all the items” if it can only see some chunks?

:thinking:

If I’m asking it to retrieve “all the items”, then the model should retrieve all the chunks that are relevant, right? Or are you suggesting playing with a parameter that can control the number of chunks being fetched?

how should the model retrieve all chunks? the LLM doesn’t control your RAG system. what RAG tool are you using?

I’m using Cognitive Search with the help of LangChain. I’ve also tried FAISS.

yeah so those things control which chunks your LLM sees.

You may need to tweak your LangChain agent/chain to support a “retrieve all” scenario, or include a tool for it.
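Something along these lines is usually the knob (just a sketch assuming a plain LangChain RetrievalQA chain over FAISS; the index path, deployment name, and k value are illustrative):

```python
# Sketch only: assumes legacy LangChain with Azure credentials set via env vars
# (OPENAI_API_BASE, OPENAI_API_KEY, OPENAI_API_VERSION).
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

vectorstore = FAISS.load_local("my_index", OpenAIEmbeddings())  # illustrative index

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 30}  # k = how many chunks get handed to the LLM per question
)

qa = RetrievalQA.from_chain_type(
    llm=AzureChatOpenAI(deployment_name="gpt-4", temperature=0),
    retriever=retriever,
)
print(qa.run("List all the items in the document."))
```

If you genuinely need every chunk covered, a “stuff” chain like this will overflow the context window anyway, so something like `chain_type="map_reduce"` (or summarizing chunk by chunk) is the usual direction.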

RAG is basically just a database.
LangChain is like a program
and the LLM is essentially your processor/CPU/runtime.

All the parameters you mentioned in your OP tweak how the processor runs, when what you actually have is a programming/feature issue :confused:

Okay, so I tried tweaking the search_kwargs parameter of as_retriever(), changing k from 2 to 29, and gave the LLM a max_tokens of 2500. It did return more results (not the entire list), but at the same time it also hallucinated items that didn’t exist in the document.

Apparently, my input prompt alone accounts for about 13k tokens, which is why I can’t increase max_tokens beyond 2500.
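That’s roughly how the budget works out, by the way (just a sketch with tiktoken, not my actual code; the context-window figure is whatever your deployment supports):

```python
# Sketch: count prompt tokens and see what's left over for max_tokens.
import tiktoken

CONTEXT_WINDOW = 16_385  # e.g. gpt-3.5-turbo-16k; use your deployment's limit
prompt = "...the full prompt, including every retrieved chunk..."  # placeholder

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5/GPT-4
prompt_tokens = len(enc.encode(prompt))

print(f"prompt: {prompt_tokens} tokens, "
      f"room left for max_tokens: {CONTEXT_WINDOW - prompt_tokens}")
```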