file_search (Assistants API) not returning the full output, only a preview of it

I am using the following code to set up a PDF assistant, give it a PDF file, and then ask it to extract some data from the file and return JSON in a specified format.

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

pdf_assistant = client.beta.assistants.create(
    model="gpt-4o-2024-11-20",
    tools=[{
        "type": "file_search", 
        "file_search": {
            "max_num_results": 20
        },
    }],
    name="PDF assistant"
)
file = client.files.create(file=open(pdf_path, "rb"), purpose="assistants")

vs = client.beta.vector_stores.create(name="test")

vector_store_file = client.beta.vector_stores.files.create(
    vector_store_id=vs.id,
    file_id=file.id,
    chunking_strategy={"type": "static", "static": {"max_chunk_size_tokens": 4096, "chunk_overlap_tokens": 600}}
)
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ],
    tool_resources={
        "file_search": {
            "vector_store_ids": [vs.id]
      }
    }
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=pdf_assistant.id
)
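For completeness, once the run finishes, the assistant's reply has to be read back from the thread; the snippet above never does this. Here is a minimal sketch of that step, continuing from the `client` and `thread` objects above; the `latest_assistant_text` helper is just an illustrative accessor I've added, and the API call is guarded so it only executes when a key is present:

```python
import os

def latest_assistant_text(messages_page) -> str:
    # Messages are returned newest-first; take the first text block
    # of the most recent message (the assistant's reply after the run)
    return messages_page.data[0].content[0].text.value

if os.environ.get("OPENAI_API_KEY"):
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(latest_assistant_text(messages))
```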

The file in this example contains 25 objects, so the resulting JSON should have 25 objects. The response usually contains just a subset of the objects, with a message like this: “This is an initial subset for precision validation. I will add more issues based on the detailed findings mentioned in the report. Full JSON will encapsulate all 25 as per summary cross-verification.”

So it knows there should be 25, but it only outputs a subset. I have played with the prompt quite a bit to try to get it to output all 25 in one response, but it still doesn't.

What am I doing wrong / missing and how do I get around this?

A more effective approach would be to use structured outputs with function calling instead of relying on free-text generation.

Alternatively, if the goal is only to extract JSON values from PDF files, I suggest using chat completions with structured outputs and passing in images of the PDF pages from which the JSON data is to be extracted.
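As a rough sketch of what I mean (not a drop-in solution): render each page to an image, attach the images to a chat completions request, and force the shape of the answer with a strict JSON schema so partial answers fail validation. This assumes PyMuPDF (`pip install pymupdf`) for rendering, an `OPENAI_API_KEY` in the environment, and a hypothetical `report.pdf` with `id`/`summary` fields; adapt the schema to your actual objects:

```python
import base64
import os

def build_schema() -> dict:
    # Strict JSON schema: the reply must be an object with an "objects" array
    return {
        "name": "pdf_objects",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "objects": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "id": {"type": "string"},
                            "summary": {"type": "string"},
                        },
                        "required": ["id", "summary"],
                        "additionalProperties": False,
                    },
                }
            },
            "required": ["objects"],
            "additionalProperties": False,
        },
    }

def page_image_parts(pdf_path: str) -> list:
    # Render every page to a PNG and wrap it as a base64 image content part
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            png = page.get_pixmap(dpi=150).tobytes("png")
            b64 = base64.b64encode(png).decode()
            parts.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            })
    return parts

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": "Extract all 25 objects."}]
                       + page_image_parts("report.pdf"),  # hypothetical file
        }],
        response_format={"type": "json_schema", "json_schema": build_schema()},
    )
    print(resp.choices[0].message.content)
```

With `strict: True` the model cannot return a truncated "initial subset" that omits the `objects` array entirely, though you still need to verify the count yourself.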

Here’s a comprehensive guide from OpenAI’s cookbook:


I am already using a PDF assistant tool with structured outputs. The task is not merely extracting JSON values from a PDF; it is fairly reasoning-heavy: extract certain data from the PDF and output it in JSON form.

From what I saw, the assistants API with file_search looks like the best fit for this, but I am not sure why the model isn’t returning the full output in one response. Any ideas @sps?

Ah, I see where you’re coming from with the assistants API approach. The reason you’re getting partial responses is actually a known limitation of the file_search tool: it can sometimes struggle with complete data extraction across large documents, especially PDFs.

While we could try tweaking parameters (like adjusting chunking strategy or max_num_results), I’ve found these workarounds aren’t always reliable for getting complete results. That’s why I shared that PDF extraction tutorial - extracting page images and their transcriptions, then sending those directly to the model tends to give more reliable results when you need to guarantee complete extraction of all items (in your case, all 25 objects).

LMK in case you have any questions about how to implement that approach.

Or if you need to stick with the assistants API for other reasons, I’d recommend starting with inspecting the run steps for the run in question and sharing the instructions that you’re passing to the assistant.
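For reference, here's a minimal sketch of the run-step inspection I'm suggesting. The IDs are placeholders (substitute the thread/run from your own session), it assumes the `openai` Python SDK with a key in the environment, and the `summarize_steps` helper is just an illustrative reducer:

```python
import os

def summarize_steps(steps) -> list:
    # Reduce each run step to (type, status) for a quick overview,
    # e.g. [("tool_calls", "completed"), ("message_creation", "completed")]
    return [(step.type, step.status) for step in steps]

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    # Placeholder IDs: use the thread/run from your own session
    steps = client.beta.threads.runs.steps.list(
        thread_id="thread_abc123", run_id="run_abc123"
    )
    print(summarize_steps(steps.data))
```

Seeing which chunks the `tool_calls` steps actually retrieved usually tells you whether the model ever had all 25 objects in context to begin with.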


Ah ok. That is what I was thinking too. Something seems off with the assistant API for this workflow, so I’ll stick to the regular chat completions endpoint. Thanks!