file_search (Assistants API) not returning the full output, only a preview of it

I am using the following code to set up a PDF assistant, give it a PDF file, and then ask it to extract some data from the file and return JSON in a specified format.

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

pdf_assistant = client.beta.assistants.create(
    model="gpt-4o-2024-11-20",
    tools=[{
        "type": "file_search", 
        "file_search": {
            "max_num_results": 20
        },
    }],
    name="PDF assistant"
)
file = client.files.create(file=open(pdf_path, "rb"), purpose="assistants")

vs = client.beta.vector_stores.create(name="test")

vector_store_file = client.beta.vector_stores.files.create(
    vector_store_id=vs.id,
    file_id=file.id,
    chunking_strategy={"type": "static", "static": {"max_chunk_size_tokens": 4096, "chunk_overlap_tokens": 600}}
)
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ],
    tool_resources={
        "file_search": {
            "vector_store_ids": [vs.id]
      }
    }
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=pdf_assistant.id
)
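For completeness, once the run finishes, the assistant's reply has to be read back from the thread; the snippet above never does this. Here is a minimal sketch of that step, continuing from the `client` and `thread` objects above; the `latest_assistant_text` helper is just an illustrative accessor I've added, and the API call is guarded so it only executes when a key is present:

```python
import os

def latest_assistant_text(messages_page) -> str:
    # Messages are returned newest-first; take the first text block
    # of the most recent message (the assistant's reply after the run)
    return messages_page.data[0].content[0].text.value

if os.environ.get("OPENAI_API_KEY"):
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(latest_assistant_text(messages))
```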

The file in this example contains 25 objects, so the resulting JSON should have 25 objects. The response usually contains just a subset of the objects, with a message like this: “This is an initial subset for precision validation. I will add more issues based on the detailed findings mentioned in the report. Full JSON will encapsulate all 25 as per summary cross-verification.”

So it knows there should be 25, but it only outputs a subset. I have played with the prompt quite a bit to try to get it to output all 25 in one response, but it still doesn't.

What am I doing wrong / missing and how do I get around this?

A more effective approach would be to use structured outputs with function calling instead of relying on free-text generation.

Alternatively, if the goal is only to extract JSON values from PDF files, I suggest using chat completions with structured outputs and passing in images of the PDF pages from which the JSON data is to be extracted.
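As a rough sketch of what I mean (not a drop-in solution): render each page to an image, attach the images to a chat completions request, and force the shape of the answer with a strict JSON schema so partial answers fail validation. This assumes PyMuPDF (`pip install pymupdf`) for rendering, an `OPENAI_API_KEY` in the environment, and a hypothetical `report.pdf` with `id`/`summary` fields; adapt the schema to your actual objects:

```python
import base64
import os

def build_schema() -> dict:
    # Strict JSON schema: the reply must be an object with an "objects" array
    return {
        "name": "pdf_objects",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "objects": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "id": {"type": "string"},
                            "summary": {"type": "string"},
                        },
                        "required": ["id", "summary"],
                        "additionalProperties": False,
                    },
                }
            },
            "required": ["objects"],
            "additionalProperties": False,
        },
    }

def page_image_parts(pdf_path: str) -> list:
    # Render every page to a PNG and wrap it as a base64 image content part
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            png = page.get_pixmap(dpi=150).tobytes("png")
            b64 = base64.b64encode(png).decode()
            parts.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            })
    return parts

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": "Extract all 25 objects."}]
                       + page_image_parts("report.pdf"),  # hypothetical file
        }],
        response_format={"type": "json_schema", "json_schema": build_schema()},
    )
    print(resp.choices[0].message.content)
```

With `strict: True` the model cannot return a truncated "initial subset" that omits the `objects` array entirely, though you still need to verify the count yourself.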

Here’s a comprehensive guide from OpenAI’s cookbook:


I am already using a PDF assistant tool with structured outputs. The task is not merely extracting JSON values from a PDF; it is fairly reasoning-heavy: extract certain data from the PDF and output it in JSON form.

From what I saw, the assistants API with file_search looks like the best fit for this, but I am not sure why the model isn’t returning the full output in one response. Any ideas @sps?

Ah, I see where you’re coming from with the assistants API approach. The reason you’re getting partial responses is actually a known limitation of the file_search tool: it can sometimes struggle with complete data extraction across large documents, especially PDFs.

While we could try tweaking parameters (like adjusting chunking strategy or max_num_results), I’ve found these workarounds aren’t always reliable for getting complete results. That’s why I shared that PDF extraction tutorial - extracting page images and their transcriptions, then sending those directly to the model tends to give more reliable results when you need to guarantee complete extraction of all items (in your case, all 25 objects).

LMK in case you have any questions about how to implement that approach.

Or if you need to stick with the assistants API for other reasons, I’d recommend starting with inspecting the run steps for the run in question and sharing the instructions that you’re passing to the assistant.
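For reference, here's a minimal sketch of the run-step inspection I'm suggesting. The IDs are placeholders (substitute the thread/run from your own session), it assumes the `openai` Python SDK with a key in the environment, and the `summarize_steps` helper is just an illustrative reducer:

```python
import os

def summarize_steps(steps) -> list:
    # Reduce each run step to (type, status) for a quick overview,
    # e.g. [("tool_calls", "completed"), ("message_creation", "completed")]
    return [(step.type, step.status) for step in steps]

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    # Placeholder IDs: use the thread/run from your own session
    steps = client.beta.threads.runs.steps.list(
        thread_id="thread_abc123", run_id="run_abc123"
    )
    print(summarize_steps(steps.data))
```

Seeing which chunks the `tool_calls` steps actually retrieved usually tells you whether the model ever had all 25 objects in context to begin with.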


Ah ok. That is what I was thinking too. Something seems off with the assistant API for this workflow, so I’ll stick to the regular chat completions endpoint. Thanks!