So I’ve deployed GPT-4 (50k TPM), GPT-3.5 Turbo 16k (120k TPM), and GPT-4 Turbo preview (80k TPM) in Azure OpenAI, with max_tokens=-1. My solution follows a RAG pattern: it retrieves multiple document chunks based on the given prompt. When I ask the model to list all the items present in the document, the response only includes about 11 or 12 items out of 50+. I tried switching between the three models hoping at least one of them would list every single item in the document, but none did. Where could the problem be? I’ve also experimented with the max_tokens parameter, only to get an InvalidRequestError.
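For context, here’s roughly the shape of the pipeline (a minimal sketch, not my exact code: the deployment name, vector store, and chain type are placeholders, and Azure credentials are assumed to be set in environment variables):

```python
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import RetrievalQA

# One of the three deployments; I swap deployment_name between
# gpt-4, gpt-35-turbo-16k, and gpt-4-turbo-preview
llm = AzureChatOpenAI(
    deployment_name="gpt-4",
    temperature=0,
    max_tokens=-1,  # the setting mentioned above
)

# vectorstore is built elsewhere from the document's chunks
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # retrieved chunks are stuffed into a single prompt
    retriever=vectorstore.as_retriever(),
)

answer = qa.run("List all the items present in the document")
```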
If I’m asking for “all the items”, shouldn’t the retriever fetch all the relevant chunks? Or are you suggesting there’s a parameter that controls how many chunks get fetched?
Okay, so I tried tweaking the search_kwargs parameter of as_retriever(), raising k from 2 to 29, and gave the LLM max_tokens=2500. It did return more items (not the entire list), but it also hallucinated items that don’t exist in the document.
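Concretely, the tweak looked something like this (sketch, reusing the names from my earlier snippet):

```python
# Fetch more chunks per query (k was 2 before) and cap the completion explicitly
retriever = vectorstore.as_retriever(search_kwargs={"k": 29})

llm = AzureChatOpenAI(
    deployment_name="gpt-35-turbo-16k",  # whichever deployment I'm testing
    temperature=0,
    max_tokens=2500,  # replaces the earlier max_tokens=-1
)

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
```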
It turns out my input prompt alone accounts for about 13k tokens, which is why I can’t increase max_tokens beyond ~2500: the context window has to hold the prompt plus the completion, and 13k + 2.5k is already close to the 16,384-token limit of GPT-3.5 Turbo 16k.
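A quick way to check the budget with tiktoken (a sketch; `build_prompt` is a hypothetical stand-in for however the chain assembles the final prompt from the retrieved chunks):

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-3.5/GPT-4 family
enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_WINDOW = 16_384  # gpt-3.5-turbo-16k; base gpt-4 is only 8_192

prompt = build_prompt(retrieved_chunks, question)  # hypothetical helper
prompt_tokens = len(enc.encode(prompt))            # ~13k in my case

# The window must hold prompt + completion, so max_tokens can't exceed
# roughly this (a small margin accounts for chat-format overhead):
headroom = CONTEXT_WINDOW - prompt_tokens - 100
print(f"prompt: {prompt_tokens} tokens; max_tokens headroom: ~{headroom}")
```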