Response being cut off in Azure OpenAI

So I’ve deployed GPT-4 with 50k TPM, GPT-3.5 Turbo 16k with 120k TPM, and GPT-4 Turbo preview with 80k TPM in Azure OpenAI, and I’ve set max_tokens=-1. My solution follows RAG: it retrieves multiple chunks of the document based on the given prompt. When I chat with this and ask the model to list all the items present in the document, the response only includes about 11 or 12 items out of 50+. I’ve been switching between the three deployments hoping at least one of them would list every single item in the document, but no luck. Where could the problem be? I’ve also experimented with the max_tokens parameter, only to get InvalidRequestError.

P.S.: I can’t share the code for obvious reasons.

Hi! Welcome to the community :smiley:

max_tokens just tells the model after how many tokens the generation gets cut off; it doesn’t change how much context the model sees.
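Just to illustrate (a minimal sketch using the openai Python SDK against an Azure deployment; the endpoint, key, and deployment name are placeholders):

```python
from openai import AzureOpenAI  # assumes openai>=1.0

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-KEY",                                       # placeholder
    api_version="2023-12-01-preview",
)

resp = client.chat.completions.create(
    model="gpt-4",   # your Azure deployment name
    messages=[{"role": "user", "content": "List every item in the document."}],
    max_tokens=500,  # only caps how long the answer can be; it adds no context
)
print(resp.choices[0].message.content)
```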

> When I chat with this and ask the model to list all the items present in the document

> My solution follows RAG: it retrieves multiple chunks of the document based on the given prompt.

how is the model supposed to list “all the items” if it can only see some chunks?

:thinking:

If I’m asking it to retrieve “all the items”, then the model should retrieve all the chunks that are relevant, right? Or are you suggesting playing with a parameter that can control the number of chunks being fetched?

how should the model retrieve all chunks? the LLM doesn’t control your RAG system. what RAG tool are you using?

I’m using Cognitive Search with the help of LangChain. I’ve also tried FAISS.

yeah so those things control which chunks your LLM sees.

You may need to tweak your LangChain agent/chain to support a “retrieve all” scenario, or include a tool for it.
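Something along these lines is usually the knob (just a sketch assuming a plain LangChain RetrievalQA chain over FAISS; the index path, deployment name, and k value are illustrative):

```python
# Sketch only: assumes legacy LangChain with Azure credentials set via env vars
# (OPENAI_API_BASE, OPENAI_API_KEY, OPENAI_API_VERSION).
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

vectorstore = FAISS.load_local("my_index", OpenAIEmbeddings())  # illustrative index

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 30}  # k = how many chunks get handed to the LLM per question
)

qa = RetrievalQA.from_chain_type(
    llm=AzureChatOpenAI(deployment_name="gpt-4", temperature=0),
    retriever=retriever,
)
print(qa.run("List all the items in the document."))
```

If you genuinely need every chunk covered, a “stuff” chain like this will overflow the context window anyway, so something like `chain_type="map_reduce"` (or summarizing chunk by chunk) is the usual direction.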

RAG is basically just a database.
LangChain is like a program
and the LLM is essentially your processor/CPU/runtime.

All the parameters you mentioned in your OP tweak how the processor runs, when what you actually have is a programming/feature issue :confused:

Okay, so I tried tweaking the search_kwargs parameter of as_retriever(), changing k from 2 to 29, and gave the LLM a max_tokens of 2500. It did return more results (not the entire list), but at the same time it also hallucinated items that didn’t exist in the document.

Apparently, my input prompt alone accounts for about 13k tokens, which is why I can’t increase max_tokens beyond 2500.
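That’s roughly how the budget works out, by the way (just a sketch with tiktoken, not my actual code; the context-window figure is whatever your deployment supports):

```python
# Sketch: count prompt tokens and see what's left over for max_tokens.
import tiktoken

CONTEXT_WINDOW = 16_385  # e.g. gpt-3.5-turbo-16k; use your deployment's limit
prompt = "...the full prompt, including every retrieved chunk..."  # placeholder

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5/GPT-4
prompt_tokens = len(enc.encode(prompt))

print(f"prompt: {prompt_tokens} tokens, "
      f"room left for max_tokens: {CONTEXT_WINDOW - prompt_tokens}")
```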