How can I make the assistant 'read' scanned documents that are in PDF format?

Hello everyone,
I’m developing a chatbot. Currently, I provide it with scanned documents in PDF format, but it’s not able to read them—it responds with irrelevant content.
The documents I mention were originally created in Word, then printed and later scanned; they do not contain images (they are just scanned pages with printed text).
I’m using the GPT-4o model (with fine-tuning), o1, and o3-mini.
I’m sharing parts of the code so you can tell me if something is wrong.

def upload_files_to_openai(self, files):
    '''
    Uploads the given files to OpenAI storage and returns their file IDs.
    '''
    uploaded_ids = []
    for upload in files:
        try:
            upload.file.seek(0)
            # (filename, file object, MIME type) tuple expected by files.create
            file_tupla = (
                upload.filename,
                upload.file,
                upload.content_type,
            )
            message_file = self.openai_handler.client.files.create(
                file=file_tupla,
                purpose='user_data'
            )
            uploaded_ids.append(message_file.id)

        except Exception as e:
            print(f"Error uploading the file: {e}")

    return uploaded_ids

def send_message(self, thread_id: str, prompt: str, token: Optional[str] = None,
                 files_ids: Optional[List[str]] = None,
                 role: str = "user"
                 ) -> Optional[str]:
    '''
    Sends a message to a thread in OpenAI.
    '''
    try:
        attachments = []
        if files_ids:
            attachments = [{"file_id": fid, "tools": [{"type": "file_search"}]} for fid in files_ids]

        logging.info(f"Sending message to thread {thread_id} in OpenAI...")
        message = self.client.beta.threads.messages.create(
            thread_id=thread_id,
            role=role,
            content=prompt,
            attachments=attachments,
        )

        return message.id
    except Exception as e:
        logging.error(f"Error sending message to OpenAI: {e}")
        return self.handle_error(e)

async def create_run(self, thread_id: str, assistant_id: str, model: str, 
                   instructions: str = None):
    '''
    Creates a run in a thread in OpenAI.
    '''
    if model == "ft:gpt-4o-2024-08-06:fine:tuning":
        reasoning = None
    elif model == "o1":
        reasoning = "low"
    else:
        reasoning = "high"

    try:
        logging.info(f"Creating run in thread {thread_id} in OpenAI...")
        buffer = []
        in_citation = False
        citation_buffer = []

        with self.client.beta.threads.runs.stream(
            thread_id=thread_id,
            assistant_id=assistant_id,
            additional_instructions=instructions,
            tool_choice={"type": "file_search"},
            model=model,
            reasoning_effort=reasoning,
            tools=[{
                "type": "file_search",
                "file_search": {"max_num_results": 50}
            }],
            event_handler=AssistantEventHandler(),
        ) as stream:
            for chunk in stream:
                text = ""
                if hasattr(chunk, "data") and hasattr(chunk.data, "delta"):
                    delta = chunk.data.delta
                    if hasattr(delta, "content") and delta.content:
                        for block in delta.content:
                            if hasattr(block, "text") and hasattr(block.text, "value"):
                                text += block.text.value
                for char in text:
                    if not in_citation:
                        if char == "【":  # start of a file_search citation marker, e.g. 【4:0†source】
                            in_citation = True
                            citation_buffer.append(char)
                        else:
                            buffer.append(char)
                    else:
                        citation_buffer.append(char)
                        if char == "】":  # end of the citation marker; drop it from the output
                            citation_buffer = []
                            in_citation = False

                if not in_citation and buffer:
                    cleaned = ''.join(buffer)
                    yield cleaned
                    buffer = []

            if buffer:
                yield ''.join(buffer)

        logging.info("Run created successfully.")

    except Exception as e:
        print(f"Error creating the run in OpenAI: {e}")
        logging.error(f"Error creating the run in OpenAI: {e}")
        yield self.handle_error(e)


  1. Migrate away from the Assistants endpoints.
  2. Don’t use vector stores and file_search if you want guaranteed “reading”.
  3. Chat Completions can receive an entire PDF in a user message and let the AI see all of it, without any searching or chunking.

I adapted some example code for you. Chat away in the Python console, asking about the PDF files you list near the top. The AI model to use is set about two-thirds of the way down.

(be sure to type “exit” to clean up the uploaded PDF files from storage)

"""Chat Completions Python PDF demo chatbot; SDK, no stream, no async"""
import os
import sys
import logging
import base64
from openai import OpenAI, RateLimitError, BadRequestError
from openai import APIConnectionError, APIStatusError

logger = logging.getLogger(__name__)
client=OpenAI()
myfilepaths = ["test1.pdf", "test2.pdf"]

file_ids = []
# upload from the list of files individually
for path in myfilepaths:
    with open(path, "rb") as file:
        uploaded_file = client.files.create(
            file=file, purpose="user_data"
        )
        file_ids.append(uploaded_file.id)
print(f"Uploaded File IDs: {file_ids}")

pdf_content = [
    {
        "type": "text",
        "text": (
            "Use these images and/or text from PDF as knowledge:\n"
        )
    },
]
# add the multiple PDFs to a user message content
for idx, file_id in enumerate(file_ids, start=1):
    pdf_content.append({
        "type": "file",
        "file": {
            #"file_data": f"data:application/pdf;base64,{b64data}",
            #"filename": f"file_{idx}.pdf",  # only with base64
            "file_id": file_id  # use a file from storage
        }
    })
pdf_message = {"role": "user", "content": pdf_content}

def send_chat_request(
    conversation_messages: list[dict],
    model: str = "gpt-4o-mini",
    max_tokens: int | None = 4000,
    *,
    timeout: float = 200.0,
) -> dict:
    """
    Call the OpenAI chat-completions endpoint with the supplied message list.
    OPENAI_API_KEY environment variable is used.
    """
    jsonbody={
        "model": model,
        "messages": conversation_messages,
        "max_completion_tokens" : max_tokens,
        "timeout" : timeout,
    }
    try:
        response = client.chat.completions.create(**jsonbody)
        return response.model_dump()
    except APIStatusError as err:
        logger.error(f"HTTP Err {err.status_code}: {err.message}")
        raise
    except APIConnectionError as err:
        logger.error(f"Request Error: {err}")
        raise


# Chat application pattern as script, where exit/break gives you ai_response
MODEL_NAME = "gpt-4o-mini"  # start with "gpt-4o-mini"
MAX_TOKENS = None       # Reasoning models need high value or None
MAX_HISTORY_LENGTH = 20  # 20 == 10 user inputs
SYSTEM_PROMPT = """
You are a helpful AI assistant, employing your expertise and vast world knowledge.
With careful planning, you fulfill every input truthfully, accurately, and robustly.
""".strip()

system_message = {
    "role": "developer" if MODEL_NAME.startswith("o") else "system",
    "content": SYSTEM_PROMPT
}
conversation_history: list[dict[str, str]] = []
ai_response: dict | None = None

print(f"Type your prompt to {MODEL_NAME}.  Enter “exit” to quit.", end="\n\n")

# A chatbot session sends repeatedly, growing a message context list
while True:
    user_input = input("prompt> ").strip()
    if user_input.lower() == "exit":
        print("\nExiting.  Inspect `ai_response` in a REPL for call details.")
        break
    user_message = {"role": "user", "content": user_input}
    recent_history = conversation_history[-MAX_HISTORY_LENGTH:]
    messages = [system_message, pdf_message, *recent_history, user_message]

    # Here, send_chat_request is purposefully allowed to raise traceback
    ai_response = send_chat_request(
        messages,
        model=MODEL_NAME,
        max_tokens=MAX_TOKENS,
    )
    # Parse out stuff we want and expect: just text content from assistant
    try:
        assistant_reply = ai_response["choices"][0]["message"]["content"]
    except (KeyError, IndexError, ValueError) as parse_err:
        print(f"Failed to parse response – {parse_err}", file=sys.stderr)
        continue

    # Add to a conversation history only after success (or could retry it)
    conversation_history.append(user_message)
    conversation_history.append({"role": "assistant", "content": assistant_reply})
    print("assistant>", assistant_reply)
    print(ai_response["usage"])
    print({key: ai_response["usage"][key] for key in list(ai_response["usage"].keys())[:2]})
for file_id in file_ids:
    client.files.delete(file_id)

This places a persistent user message with the PDFs right after the system message, so the file content never expires, while the chat history that follows can still be trimmed to a limited number of turns.


I understand, but the reality is that I use the Assistants API because I have several documents stored in the vector store, so it can respond to user queries using that already stored information.
To give you more context, the chatbot I’m developing is a legal assistant that uses the vector store with documents about laws and regulations, so it can respond as accurately and correctly as possible to user queries.

In this case, how could I solve my problem?

With any of the built-in attach methods, the chunks of the PDF can only end up in the same search results as your existing vector store.

What you would need to do in order to continue with Assistants is to perform your own document extraction: produce paginated text passages and rendered page images for vision, and pass them in as a user message, perhaps one placed immediately before the user question. That way you can track the thread message ID and delete the PDF message when it becomes obsolete. Even then, you don’t get complete observability of when the PDF message is bumped out of the thread’s chat history that gets loaded into the AI model’s context after more messages accumulate.
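
For illustration, here is a minimal sketch of that extraction-and-attach step. It assumes PyMuPDF (pip install pymupdf) for rendering pages, a vision-capable model on the run, and a placeholder thread ID; none of these names come from the original code. Each scanned page is rendered to a PNG, uploaded with purpose="vision", and posted as one user message whose ID you can keep for later deletion:

# Sketch only: render a scanned PDF to page images and post them to a thread.
# Assumes PyMuPDF is installed; "thread_abc123" and file names are placeholders.
import io
import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

def attach_scanned_pdf_as_images(thread_id: str, pdf_path: str, dpi: int = 150) -> str:
    image_blocks = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc, start=1):
        # Render the page to a PNG so a vision-capable model can read the scan
        pix = page.get_pixmap(dpi=dpi)
        uploaded = client.files.create(
            file=(f"page_{page_number}.png", io.BytesIO(pix.tobytes("png"))),
            purpose="vision",
        )
        image_blocks.append(
            {"type": "image_file", "image_file": {"file_id": uploaded.id}}
        )
    doc.close()
    # One user message holding every rendered page; keep its ID so the
    # message can be deleted from the thread once the PDF is obsolete.
    message = client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=[{"type": "text", "text": f"Scanned pages of {pdf_path}:"},
                 *image_blocks],
    )
    return message.id

# message_id = attach_scanned_pdf_as_images("thread_abc123", "scanned.pdf")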

The typical challenge here is that just one PDF can contain more information than fits in a context window, and any file search tool results already consume up to 16k tokens of that window whenever the AI decides to invoke a search anyway.
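
As a rough sanity check before inlining extracted text, you can count its tokens first. A small sketch, assuming the tiktoken package and a hypothetical page_texts list of already-extracted page strings:

# Sketch: estimate whether extracted PDF text fits a context budget.
# Assumes `pip install tiktoken`; page_texts is a hypothetical list of page strings.
import tiktoken

def fits_in_budget(page_texts: list[str], budget_tokens: int = 100_000) -> bool:
    enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by the gpt-4o family
    total = sum(len(enc.encode(text)) for text in page_texts)
    print(f"Extracted text is roughly {total} tokens")
    # Leave headroom for instructions, history, and up to ~16k of file_search results
    return total + 16_000 < budget_tokens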


The Responses endpoint can run the internal file_search tool and, at the same time, accept a PDF file input as part of a user message’s content.
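
A minimal sketch of that combination, with placeholder IDs for the vector store and the uploaded PDF (the PDF would be uploaded with purpose="user_data" first); the IDs and prompt text are illustrative, not from the original code:

# Sketch: Responses API call sending a PDF inline while enabling file_search.
# "vs_lawdocs123" and "file_abc123" are placeholder IDs.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_file", "file_id": "file_abc123"},   # the scanned PDF
            {"type": "input_text", "text": "Summarize the attached regulation."},
        ],
    }],
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_lawdocs123"],  # existing law/regulation store
    }],
)
print(response.output_text)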

There, however, every file search tool call to your existing vector store will cost you.

This has its own challenges, though. If you use the internal chat state (for example, to carry forward the previous internal file search results and discourage more unnecessary calling), then you again have user messages that cannot be arbitrarily placed, persisted, and managed, despite your desire for a one-time upload and continuous availability.
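
For reference, “internal chat state” here means chaining calls with previous_response_id; a short sketch continuing the response object from the previous example:

# Sketch: continuing a Responses conversation via server-side state.
# response.id comes from the prior client.responses.create(...) call above.
follow_up = client.responses.create(
    model="gpt-4o",
    previous_response_id=response.id,  # reuses earlier context, including tool results
    input="Which article covers data retention?",
)
print(follow_up.output_text)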