Error when using File Search (Uploaded PDF) with a Fine-Tuned Model

Hi,
I would be grateful for any advice or help.

I successfully created a new F/T model using GPT-4.1-mini, but I'm getting an error when using it with uploaded files.

#-------Test NEW F/T MODEL-------#

response = client.responses.create(
    model="ft:gpt-4.1-mini-2025-04-14:xxxxxxxxxxxxxxxxxxxxx",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_file", "file_id": file_id},  # PDF uploaded in previous step
                {"type": "input_text", "text": "You are a professional editor responsible for extracting citations from academic papers."},
            ],
        }
    ],
)

Error message:

Error code: 400 - {'error': {'message': 'There was an issue with your request. Please check your inputs and try again', 'type': 'invalid_request_error', 'param': None, 'code': None}}

As per https://platform.openai.com/docs/guides/pdf-files: “To help models understand PDF content, we put into the model’s context both the extracted text and an image of each page. The model can then use both the text and the images to generate a response.”

Model details (https://platform.openai.com/docs/models/gpt-4.1-mini) show that GPT-4.1-mini can be used for fine-tuning, can use text and image modalities as input, and supports 'file search' as part of its tool set.

Is anyone using a F/T model with uploaded files?

Maybe my training data is no longer in the appropriate format given the text+image content extracted from the PDF by OpenAI upon upload? I'm re-using the training file from our 'current/live' model, which was trained on GPT-3.5-turbo - the best model available for F/T at the time. It is a simple list of hundreds of examples, e.g. a paragraph of text as input, with the list of citations found in that text as output (or 'Not found').

1 Like

Does it work for you in the playground with image inputs?

1 Like

Hey @aprendendo.next , thanks very much for your suggestion.

I've tried it now in the playground with an uploaded PDF file. It works, but only extracts maybe 20% of the citations that our current fine-tuned model (based on GPT-3.5-turbo) finds. I also tested (via API) with GPT-4.1 (not mini) - with no fine-tuning - and it found 95% of the citations, but still did not match our current model. I expected that upgrading our current F/T model from GPT-3.5 → 4.1 would give better results.

I should stress that our current F/T model is given the text on a paragraph-by-paragraph basis, which can be slow but gives good accuracy. In my tests, the new F/T model gets the whole uploaded PDF and I tell it to go paragraph by paragraph through it. Will it be able to do that, do you think, i.e. check the text+image content for each paragraph so it doesn't miss citations? The example I'm giving it is deliberately challenging - it has approx. 130 citations in an 80-page PDF.

Just confirming one thing: in the playground, were you able to successfully call the fine-tuned model? If so, when you use the code provided by the playground under the "..." menu, does it still give you a 400 error? (If necessary, upgrade the OpenAI package first with: pip install --upgrade openai)

2 Likes

Hi @mimoflynn

gpt-3.5-turbo is a text-only model and cannot extract text from PDF documents. This is because, under the hood, PDF inputs are put into the model's context as both extracted text and an image of each page, and a text-only model cannot make use of the image part.

As of writing this post, vision fine-tuning is available only for gpt-4o-2024-08-06, per the OpenAI API docs.

2 Likes

Hi @aprendendo.next,

I’ve tried 2 options in the Playground, neither of which gave a 400 error:

Option #1.
I created an Assistant called Bob using my F/T model. Bob does return valid citations but only ~ 20% of what he should find.

Playground options under the "..." menu
I cannot see how to get the code for Bob - under the 3 dots, it only shows me a 'Stream output' option. In any case, I probably don't want to create an Assistant in my application unless it's the only way to get the output?

Option #2.
I used the Chat (with Responses API) option. This tries to search the file, but the file search call does not pick up my query, even though I entered it on the bottom left. So the query shows as 'Not found' and it finds no citations.

Code
Here we get to see the code using the "..." menu. I'm not sure why the File Search call has no queries:

"queries": [
  "Not found"
],


Here's the full code - for sure, I'm not setting it up correctly but I'm not sure where the problem is!



from openai import OpenAI
client = OpenAI()

response = client.responses.create(
  model="ft:gpt-4.1-mini-2025-04-14:xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  input=[
    {
      "role": "system",
      "content": [
        {
          "type": "input_text",
          "text": "You are a professional editor responsible for extracting citations from academic documents. "
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "Go sentence by sentence through each paragraph in the uploaded PDF, to identify and list all citations in the file."
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Go sentence by sentence through each paragraph in the uploaded PDF, to identify and list all citations in the file."
        }
      ]
    },
    {
      "id": "fs_6887a50ba46cxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "type": "file_search_call",
      "status": "completed",
      "queries": [
        "Not found"
      ],
      "results": null
    },
    {
      "id": "msg_6887a50e8a90819b857215622f627c60066c413e2916a052",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "【11:2025_ABCD_123.pdf】"
        }
      ]
    }
  ],
  text={
    "format": {
      "type": "text"
    }
  },
  reasoning={},
  tools=[
    {
      "type": "file_search",
      "vector_store_ids": [
        "vs_6887a1a7c02c819xxxxxxxxxxxxxxxxx"
      ]
    }
  ],
  tool_choice={
    "type": "file_search"
  },
  temperature=0.1,
  max_output_tokens=32768,
  top_p=0.5,
  store=True
)

Hi @sps ,

Thanks for your reply. I'm not using PDFs with gpt-3.5-turbo; the situation is:

  1. Current production job (runs nightly): uses a gpt-3.5-turbo F/T model that I trained 18 months ago. It's running fine and picking up most citations, but it is not perfect! It gets the PDF content in text format, on a paragraph-by-paragraph basis, which is why it's pretty good at finding most/all citations.

  2. New improved application being developed/tested: I'm uploading a PDF file to OpenAI, doing various other things with that content, and then need to extract citations. I want to use my newly fine-tuned model, which is based on 'gpt-4.1-mini'. As mentioned, this model can be used for F/T, can use the file search tool and has a huge context window, so it should be able to process the uploaded file on OpenAI's servers to get the citations.


Model details (https://platform.openai.com/docs/models/gpt-4.1-mini) show that 'GPT-4.1-mini' can be used for fine-tuning, can use text and image modalities as input, and supports 'file search' as part of its tool set.


Let me see if I have the details correct:

  1. You have a fine-tuned 3.5 model that gets 1 API call per paragraph and based on that it produces some expected result 80-100% of the time.

  2. You would like to increase the accuracy to as close to 100% as possible.

This is how I would start to tackle this:

  1. Attention limits: Attention is a finite resource. It's the reason asking the model to do too much at once degrades performance. So, to tackle that, I would try sending one page at a time to the model. I would use a Python PDF toolkit library to extract the text and render an image of each page, and pass both to the full 4o or 4.1 models and test the accuracy of the results (see the sketch after this list). You may also try o3 just as a baseline (much slower and more expensive, but the "gold standard" for untuned models).

  2. If you obtain acceptable results from step 1, you could try the mini versions of 4.1 and 4o, see what degradation (if any) occurs, and decide whether that is still acceptable performance.

  3. Try the nano version if step 2 is acceptable.
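
To make step 1 concrete, here is a minimal sketch of that per-page loop. It assumes PyMuPDF (the fitz package) as the PDF library and calls the vanilla gpt-4.1 model via the Responses API; the prompt, model name, and DPI are placeholders rather than anything from this thread:

import base64

import fitz  # PyMuPDF: pip install pymupdf
from openai import OpenAI

client = OpenAI()

def extract_citations_per_page(pdf_path, model="gpt-4.1"):
    """Send each page's extracted text plus a rendered image of that page to the model."""
    doc = fitz.open(pdf_path)
    page_results = []
    for page in doc:
        page_text = page.get_text()
        png_bytes = page.get_pixmap(dpi=150).tobytes("png")
        image_b64 = base64.b64encode(png_bytes).decode("utf-8")
        response = client.responses.create(
            model=model,
            input=[
                {
                    "role": "user",
                    "content": [
                        {"type": "input_text",
                         "text": "List all citations on this page:\n\n" + page_text},
                        {"type": "input_image",
                         "image_url": f"data:image/png;base64,{image_b64}"},
                    ],
                }
            ],
        )
        page_results.append(response.output_text)
    return page_results

You can then score the per-page results against your current baseline before deciding whether the mini or nano variants hold up.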

If none of the above is acceptable you could try fine tuning one of the newer models and retesting.

You have a current baseline; evaluate it against the best models available in terms of price and performance and find an acceptable midpoint.

Always remember that if an action is slow and independent, i.e. you do not rely on the result of one API call to process another, consider making those calls in parallel threads. If you have a sufficiently tiered account, you could run 10, 50, even 100 API calls at the same time, asynchronously await the results, and then build your response once all replies have been received. This would give you approximately an N-times speedup, where N is the number of parallel tasks performed.

4 Likes

Just as a side note, you could also look at passing the pdf file as a Base64-encoded file, as shown here in the API reference.

I've also had pretty good results uploading screenshots of PDFs (as suggested similarly by @Foxalabs), using this Base64 encoding method with both the GPT-4.1 and 4.1-mini models.

3 Likes

Thanks for sharing the details about your process, @mimoflynn.

IMO the PDF file input would perform better on a standard vanilla model than on a fine-tuned model.

Curious what your fine-tuning JSONL file looked like for the gpt-4.1-mini model.

I agree with @Foxalabs that using a smaller model to do only a single task is the best way to move forward for accuracy as opposed to handing it a bunch of tasks.

1 Like

hi @Foxalabs,

Many thanks for your input. You are correct in your understanding of the current state of play. Not only should I be able to get close to 100% citation extraction even on large docs, but there is always a risk that 3.5-turbo will be decommissioned, so I want to move to a newer model.

In terms of starting with the flagship models, that is where I started.
Because I already upload the PDFs to OpenAI (to generate other, separate outputs), I wanted to re-use those uploaded files for the citations extraction function.

I did an extensive comparison across 4 models with 3 prompts each (basic, few-shot, many-shot) and compared them to my current production/baseline model, scoring them for both false positives and missed citations. I dropped the worst-performing models (gpt-4.1-mini and o4-mini), iterated more on the best prompts, and tested further. GPT-4.1 was excellent, as was o3, but the latter is too slow with large docs. So GPT-4.1 was the 'winner'.

I am now testing whether an F/T model will give better results because, for example, on a test document that included 130 citations, GPT-4.1 missed 10 or so citations compared to the baseline.

I know I can continue with the current (live) approach by feeding GPT-4.1 the document as paragraph-by-paragraph text (it may not even need the images), but as the PDF is already on OpenAI servers, where it has text+image input, it would be great if the F/T model could use that efficiently. Or alternatively, is there any way to get the models (i.e. vanilla GPT-4.1) to go through it on a page-by-page / paragraph-by-paragraph basis? I already included that in my prompts in the tests I outlined above, but maybe there is a way to give it fileID+pagenumber as input??

Many thanks
Mim

1 Like

Thanks @sps. Hopefully my reply to @Foxalabs gives a bit more background.

Re the training data for my F/T models, here is a sample only (not my data, just got chatgpt to make up something similar):

{"messages": [{"role": "system", "content": "You are a professional editor responsible for extracting citation references from academic documents."}, {"role": "user", "content": "Recent studies have shown a significant relationship between sleep quality and cognitive performance (see Smith & Jones, 2020). Further research is required to clarify the causal links."}, {"role": "assistant", "content": "Smith & Jones, 2020"}]}
{"messages": [{"role": "system", "content": "You are a professional editor responsible for extracting citation references from academic documents."}, {"role": "user", "content": "The theory of planned behavior has been widely applied in health psychology (Ajzen, 1991; Conner & Sparks, 2005), offering insights into decision-making processes."}, {"role": "assistant", "content": "Ajzen, 1991", "Conner & Sparks, 2005"}]}


1 Like

Thanks for sharing the training data sample.

To fully utilize fine-tuning, you'd want to supply training data in exactly the same structure as you expect to supply to the model in production. Discrepancies between the training input messages structure and the production input messages structure will inevitably cause degraded performance, which is likely what is happening here: your training examples use plain-text paragraphs as input, while in production you are now supplying a whole PDF (text plus page images).

Thus, for an apples-to-apples comparison, the fine-tuning data would need to be updated to utilize examples of PDF input and the expected output, should you choose to do that.
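
To illustrate (this is a sketch, not something from the thread): as far as I know, a training example cannot contain a PDF file directly, but with vision fine-tuning it can contain page images, so an updated example might look roughly like the following. It is shown pretty-printed here; in the actual JSONL file each example sits on a single line, and note the earlier caveat in this thread about which models currently support vision fine-tuning:

{"messages": [
  {"role": "system",
   "content": "You are a professional editor responsible for extracting citation references from academic documents."},
  {"role": "user",
   "content": [
     {"type": "text", "text": "Extract all citations from this page."},
     {"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64 of the page image>"}}
   ]},
  {"role": "assistant", "content": "Smith & Jones, 2020; Ajzen, 1991"}
]}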

A better option would be to experiment with:

  • Transcribing the PDF verbatim and then having a model extract citations from the transcriptions, OR
  • Directly using images of single pages and extracting the citations.

Compare these with the performance of PDF inputs. See if you can reproduce it on smaller models.

Regardless of whether it’s reproducible on smaller models, I’d recommend using the Batch API for parallel, independent tasks like transcription or direct citation extraction from images, as it will cut the costs by 50%.
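
For reference, submitting a batch looks roughly like this; the JSONL file name and per-line request bodies are hypothetical, and the trade-off is the completion window of up to 24 hours:

from openai import OpenAI

client = OpenAI()

# Hypothetical requests file: one JSON object per line, e.g.
# {"custom_id": "page-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4.1-mini", "messages": [...]}}
batch_input = client.files.create(
    file=open("citation_requests.jsonl", "rb"),
    purpose="batch",
)

batch = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll client.batches.retrieve(batch.id) until status is "completed",
# then download the results via client.files.content(...).
print(batch.id, batch.status)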

2 Likes

Alright, let’s try it step by step.

First, let’s correct the error 400 issue.

  1. Make sure your openai package is updated by running:
    pip install --upgrade openai

  2. Try this code using gpt-4.1-mini, just to check if the API call is alright.

Sample code for sending a PDF file
import base64
import pathlib
from openai import OpenAI
client = OpenAI()

file_name = r"fullpath-to\your-pdf-file.pdf"
with open(file_name, "rb") as f:
    data = f.read()

base64_string = base64.b64encode(data).decode("utf-8")

response = client.responses.create(
  model= "gpt-4.1-mini", 
  input=[
    {
      "role": "system",
      "content": [
        {
          "type": "input_text",
          "text": "You are a professional editor responsible for extracting citations from academic documents. "
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "input_file",
          "filename": pathlib.Path(file_name).name,
          "file_data": f"data:application/pdf;base64,{base64_string}",
        },
        {
          "type": "input_text",
          "text": "Go sentence by sentence through each paragraph in the uploaded PDF, to identify and list all citations in the file."
        }
      ]
    }
    
  ],
  temperature=0.1,
  max_output_tokens=32768,
  top_p=0.5,
)
  3. If this works with gpt-4.1-mini, we have no syntax problem. Change the model name to your fine-tuned model and try the same code again. Let us know if this works.

Issue with not finding all citations

When using file_search:

  tools=[
    {
      "type": "file_search",
      "vector_store_ids": [
        "vs_6887a1a7c02c819xxxxxxxxxxxxxxxxx"
      ]
    }
  ],

It performs a semantic search; it does not necessarily return the whole PDF file as context. It is used by RAG prompts to look for specific context. In the playground, you attach a file near the input prompt and leave the 'tools' empty.

To pass the whole file, you need to pass it as an argument (as you tried at the very start, but got an error).

It would also be a good idea to test your fine-tuned model with a pure text input first, to rule out any problem with the way the API is turning a PDF into context.
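
For example, a minimal text-only call to the fine-tuned model could look like this (the model name is your redacted fine-tuned model, and the paragraph is just a placeholder taken from your sample training data):

from openai import OpenAI

client = OpenAI()

# Placeholder fine-tuned model name and paragraph text, purely for illustration.
response = client.responses.create(
    model="ft:gpt-4.1-mini-2025-04-14:xxxxxxxxxxxxxxxxxxxxx",
    input=[
        {"role": "system",
         "content": "You are a professional editor responsible for extracting citations from academic documents."},
        {"role": "user",
         "content": "Recent studies have shown a significant relationship between sleep quality and cognitive performance (see Smith & Jones, 2020)."},
    ],
)
print(response.output_text)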

1 Like

Thanks @sps.

Can you suggest how I might use examples of PDF input? Should I copy/paste the text from an entire PDF document (using a short one, e.g. a 10-page PDF) as input, for example? Should I also give it the images of that PDF, and can they be included in the same training data?

If I go back to our current/live approach, where the text is extracted from the PDF, split into paragraphs and then sent to the F/T model, can anyone recommend the best utility to use? I used PdfReader but it inserts random extra spaces. A lot of other utilities I tried were also not able to accurately extract the text, e.g. here's some text extracted by PdfReader where you can see the extra spaces; they appear in many citations too!

“
on Wednesday the 16th day o f July 2025 \n1. The correct approach to p rojects that cause emissions is a topical issue , but the author ’s opinion is tha t we
” etc.

Also thanks for the suggestion re batch, but the application is time-critical; reports are produced nightly.

Thanks also to @Foxalabs for the suggestion re parallel processing. I will look into how that works (I’m on Tier 4).

1 Like

Glad to help, the general process for parallel processing is as follows:

  1. Split your input data into chunks: paragraphs, pages, sentences, whatever is appropriate.
  2. Spawn a thread for each of these chunks, each containing an API call to the appropriate API endpoint and model.
  3. Set up an asynchronous await function that will watch for arriving responses and implement a simple "Are we there yet?" check to ensure all sent API calls have been received back.
  4. Process the results into the required output by looping over them.

Example (note: this is not asynchronous and could hit Python thread limits with thousands of requests at the same time; async would 'mostly' solve this):

import os
from openai import OpenAI
import PyPDF2
from concurrent.futures import ThreadPoolExecutor, as_completed
from dotenv import load_dotenv

# Load environment variables from .env if present
load_dotenv()

client = OpenAI()

# prompt
PROMPT_TEMPLATE = "Summarise the following PDF page in clear English:\n\n{page_text}"

def extract_pdf_pages(pdf_path):
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        return [page.extract_text() or "" for page in reader.pages]

def call_openai(page_text, page_num):
    prompt = PROMPT_TEMPLATE.format(page_text=page_text)
    response = client.responses.create(
        model="gpt-4.1",
        input=prompt
    )
    return page_num, response.output_text

def parallel_api_calls(pdf_path, num_threads=50):
    pages = extract_pdf_pages(pdf_path)
    results = [None] * len(pages)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = {
            executor.submit(call_openai, pages[i], i): i
            for i in range(len(pages))
        }
        for future in as_completed(futures):
            page_num, result = future.result()
            results[page_num] = result
    return "".join(results)

if __name__ == "__main__":
    pdf_path = "input.pdf"  # Path to your PDF file
    num_threads = 50        # Change this as needed (e.g., 10, 100)
    output = parallel_api_calls(pdf_path, num_threads)
    print(output)
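
Following the note above that async would mostly avoid the thread-limit issue, here is a rough sketch of the same fan-out using asyncio and the AsyncOpenAI client. It assumes it lives in the same file as the snippet above so it can reuse PROMPT_TEMPLATE and extract_pdf_pages:

import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def call_openai_async(page_text, page_num, semaphore):
    # The semaphore caps in-flight requests so we stay within rate limits.
    async with semaphore:
        response = await async_client.responses.create(
            model="gpt-4.1",
            input=PROMPT_TEMPLATE.format(page_text=page_text)
        )
        return page_num, response.output_text

async def parallel_api_calls_async(pdf_path, max_concurrency=50):
    pages = extract_pdf_pages(pdf_path)  # reuse the extractor defined above
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [call_openai_async(page, i, semaphore) for i, page in enumerate(pages)]
    results = [None] * len(pages)
    for page_num, text in await asyncio.gather(*tasks):
        results[page_num] = text
    return "".join(results)

# Example usage:
# output = asyncio.run(parallel_api_calls_async("input.pdf", max_concurrency=50))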