GPT-4o context window is 128K, but getting error: "This model's maximum context length is 8192 tokens, however you requested 21026 tokens"

We have a paid API subscription for accessing OpenAI models from our Python code, and we are currently in Tier 1. I am using the ‘gpt-4o’ model and performing RAG over our custom data. But when I take a 10-page document and ask a question about it, I get the following error:

Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 21026 tokens (21026 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
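
To make the numbers in that message easier to compare, here is a minimal sketch that pulls the token counts out of the error text. The helper `parse_context_error` is hypothetical (it is not part of the openai, LangChain or LlamaIndex libraries), and the regular expression assumes the message keeps exactly the wording quoted above.

```python
import re

# Error text copied from the 400 response above (quotes normalised to ASCII).
ERROR_MESSAGE = (
    "This model's maximum context length is 8192 tokens, however you "
    "requested 21026 tokens (21026 in your prompt; 0 for the completion). "
    "Please reduce your prompt; or completion length."
)


def parse_context_error(message):
    """Hypothetical helper: return (limit, requested, prompt, completion) or None."""
    pattern = (
        r"maximum context length is (\d+) tokens, however you requested "
        r"(\d+) tokens \((\d+) in your prompt; (\d+) for the completion\)"
    )
    match = re.search(pattern, message)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


limit, requested, prompt_tokens, completion_tokens = parse_context_error(ERROR_MESSAGE)
overflow = requested - limit  # how many tokens over the reported limit
print(f"limit={limit} requested={requested} overflow={overflow}")
```

The reported limit (8192) and completion budget (0) are the two values to check against what the code was expected to send.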

My question is: GPT-4o has a context window of 128K, so ideally I should not get the above error.

Welcome to the community!

There are multiple things that could be going wrong here. Could you post your entire request? (Don’t forget to take out your API key.)


from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0, max_tokens=256)

  1. We pass ‘llm’ to the ServiceContext of LlamaIndex.
  2. We load our custom data and pass it to GPTVectorStoreIndex.
  3. Then, using LlamaIndex, we run inference over the custom data.

I think the community would find it helpful if you could share the actual REST calls made to the API. Can you get those from logs?
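
One way to get at those calls, sketched under assumptions: recent versions of the official openai Python client send requests through httpx, and both libraries write to standard `logging` loggers, so raising their level to DEBUG should surface request details. The helper name `enable_http_debug_logging` is hypothetical, and the logger names `"openai"` and `"httpx"` are assumptions about the installed versions; other wrappers may log under different names.

```python
import logging


def enable_http_debug_logging(logger_names=("openai", "httpx")):
    """Hypothetical helper: turn on DEBUG output for the given loggers."""
    # Attach a stderr handler to the root logger if none is configured yet.
    logging.basicConfig(
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    loggers = []
    for name in logger_names:
        logger = logging.getLogger(name)
        # Child loggers such as "openai._base_client" inherit this level.
        logger.setLevel(logging.DEBUG)
        loggers.append(logger)
    return loggers


# Call this once, before building the index or running the query.
loggers = enable_http_debug_logging()
```

Debug output can include request bodies, so check it for document contents and credentials before posting it.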


My guess is this is a langchain issue. Make sure you’re updated to the latest version or use the official OpenAI library.

We are using the latest LangChain library in our code. We also updated the openai library to the latest version, and with the following approach we still got this error:
“‘ChatCompletion’ object has no attribute ‘system_prompt’” (raised in Step 1 below)

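 from openai import OpenAI

 client = OpenAI(api_key="OPEN_AI_KEY_XXXXXXXX")
 # The create(...) line was missing from the post; reconstructed, with the model name assumed to be gpt-4o
 response = client.chat.completions.create(model="gpt-4o", messages=[
     {"role": "system",
      "content": """You are a helpful assistant for question-answering tasks."""},
     {"role": "user", "content": user_input}], temperature=0)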

Step 1: Passing the ‘response’ object to the ServiceContext object of LlamaIndex
Step 2: Loading data from ElasticSearch using LlamaIndex
Step 3: Passing the ServiceContext object to the GPTVectorStoreIndex class of LlamaIndex
Step 4: Creating the index using LlamaIndex
Step 5: Creating the query_engine
Step 6: Running a query with the query engine over the custom documents (loaded from ElasticSearch)

It’s a langchain issue.

You’ll be more likely to get help there.