Understanding Token Limits and Information Extraction Accuracy with GPT-4o: A Case Study on PDFs and Images

Using the ChatGPT Plus plan with the GPT-4o model (32k-token context window), I experimented with a 127-page PDF document to assess the model’s ability to extract information from images and tables. Of 56 questions, 6 answers were inaccurate. When I uploaded the same images or tables directly into the chat, however, the responses were more precise and accurate. This raises the following questions:

  1. Why are the responses more accurate when information is extracted from directly uploaded images than when the same content is extracted from a PDF document?
  2. While uploading the PDF document into the conversation, I quickly hit the context-window limit and was asked to start a new conversation. Do the PDF document and its contents count against the token limit of the context window? (A token-counting sketch follows this list.)
  3. Is the PDF document analyzed directly within the chat context or in an external space, and does the context-window limit depend solely on my questions and GPT-4o’s responses?
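
Regarding question 2, one way to estimate how much of the context window the PDF’s text alone would consume is to extract the text locally and count tokens. Below is a minimal sketch, assuming the `pypdf` and `tiktoken` Python packages and a recent `tiktoken` release that knows GPT-4o’s `o200k_base` encoding; the file name is a placeholder:

```python
from pypdf import PdfReader
import tiktoken

# Extract raw text from every page of the PDF (placeholder file name).
reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# GPT-4o uses the o200k_base encoding.
enc = tiktoken.encoding_for_model("gpt-4o")
token_count = len(enc.encode(text))

print(f"Extracted text: ~{token_count} tokens")
print(f"Fits in a 32k window: {token_count < 32_000}")
```

If the extracted text of a 127-page document already approaches or exceeds 32k tokens, the whole file could not sit verbatim in the context window alongside a conversation; hitting the limit so quickly would be consistent with that.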

I seek answers to these questions to better understand the limitations and potential of GPT-4o when working with large PDF documents and uploaded images.
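
To take the chat UI out of the equation, the same comparison could be reproduced against the API directly: send a page rendered as an image and the corresponding extracted text in separate requests, then compare the answers. Below is a minimal sketch, assuming the official `openai` Python SDK; the file paths and the question are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
QUESTION = "What is the total in the last row of the table?"  # placeholder

def ask_with_image(path: str) -> str:
    """Send the question together with a page rendered as a PNG image."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": QUESTION},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def ask_with_text(extracted_text: str) -> str:
    """Send the question together with text extracted from the PDF page."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"{extracted_text}\n\n{QUESTION}",
        }],
    )
    return resp.choices[0].message.content

print("Image-based answer:", ask_with_image("page_42.png"))
print("Text-based answer:", ask_with_text(open("page_42.txt").read()))
```

If the same accuracy gap shows up over the API, that would suggest the difference lies in the extraction step (lossy PDF text and layout extraction versus the vision model reading the rendered page) rather than in the chat interface itself.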