Programmatically reproduce GPT-4o file upload

I’m trying to reproduce programmatically in Python what I tested through chatgpt.com with GPT-4o, since it works very well and I want to integrate it into my data pipeline.

I tried with the Assistants API (https://platform.openai.com/docs/assistants/tools/code-interpreter), but it’s not returning the same results at all, so I suppose the ChatGPT portal is using something else under the hood.

Do I need to convert PDF pages to images and submit them to vision APIs, similarly to what they did here? https://community.powerplatform.com/galleries/gallery-posts/?postid=73cdb790-11c9-45b7-80d0-b991d1f43f34

If someone could help with this, it would be great.

Thanks,
Elisa

  1. While you can specify a model like gpt-4o-2024-11-20 for a date closer to what is on ChatGPT, ChatGPT utilizes a proprietary model and OpenAI may also conduct A/B testing with newer model versions on its users. Therefore, exact replication of ChatGPT’s behavior might not be possible.

  2. Assistants offer vector store-based file search capabilities, which divide documents into chunks based on a user-defined token size. The AI must then generate a search query to retrieve information, and this query will only return the top-ranked chunks matching the query, not the entire document.

  3. The output you observed was generated using ChatGPT’s code interpreter. This is evident from the utilization of the internal ace_tools module, which is specific to that platform. That module produced the tabular display by sending a dataframe to a specialized UI function.

  4. Given that the code interpreter was activated immediately after a file was uploaded, and considering that ChatGPT automatically provides context about these uploaded files (context which you would need to manually replicate in your messages), it’s plausible that Python was also employed for data extraction from the PDF.
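The chunk-based retrieval described in point 2 can be illustrated with a toy chunker. This is only a sketch: whitespace-separated words stand in for tokens, and the 800/400 defaults mirror file search's documented chunking parameters (max_chunk_size_tokens / chunk_overlap_tokens):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 400) -> list[str]:
    """Split text into overlapping chunks, mimicking file search's default
    chunking (800-token chunks with 400-token overlap); words approximate
    tokens here for illustration only."""
    words = text.split()
    step = chunk_size - overlap  # must be positive: overlap < chunk_size
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The key consequence for your use case: a search query returns only the top-ranked chunks, so a table that spans chunk boundaries may come back incomplete.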


Try code interpreter in Assistants: upload a file to storage, then attach the file ID to code_interpreter, and finally describe in messages or instructions the patterned workflow for code interpreter, instead of having the AI decide how to write code.


Hello @_j, and thanks for your support.
This is my code at the moment:

from openai import OpenAI
from dotenv import load_dotenv
import os

dotenv_path = os.path.join(os.path.dirname(__file__), '.env')
load_dotenv(dotenv_path)

# Retrieve the OpenAI API key from the environment variables
api_key = os.getenv("OPENAI_KEY")

client = OpenAI(api_key=api_key)

# Upload a file with an "assistants" purpose
file = client.files.create(
    file=open("pdf/CARMICHAEL_TCA_BANK 1_02-25-2015_0001.pdf", "rb"),
    purpose="assistants"
)

# Create an assistant using the file ID
assistant = client.beta.assistants.create(
    instructions=(
        "You are a maintenance engineer. You answer questions about "
        "dissolved gas analysis (DGA) in maintenance reports."
    ),
    model="gpt-4o",  # also tried "gpt-4o-2024-11-20", "gpt-4o-mini"
    tools=[{"type": "code_interpreter"}],
    tool_resources={
        "code_interpreter": {
            "file_ids": [file.id]
        }
    }
)

thread = client.beta.threads.create(
  messages=[
    {
      "role": "user",
      "content": (
        "I need to extract content from this maintenance report in PDF. "
        "The file contains tabular data, with some columns indicating the different sample dates. "
        "The goal is to extract the different measurements reported in the file for each sample date. "
        "The measurements include C2H2, CO2, CO, C2H6, C2H4, H2, CH4, N2, and O2. "
        "In the pdf file we have rows representing measurements and columns representing dates. "
        "The table might also contain some context information like Laboratory Number, Container Number or Temperature, "
        "which are not relevant for the desired output."
      ),
      "attachments": [
        {
          "file_id": file.id,
          "tools": [{"type": "code_interpreter"}]
        }
      ]
    }
  ]
)

run = client.beta.threads.runs.create_and_poll(
  thread_id=thread.id,
  assistant_id=assistant.id,
  instructions=(
    "Extract chemicals analysis from this file in a csv. "
    "The csv should have sample date as rows and "
    "C2H2, CO2, CO, C2H6, C2H4, H2, CH4, N2, O2 as columns."
  )
)


if run.status == 'completed':
    messages = client.beta.threads.messages.list(
        thread_id=thread.id
    )
    for message in messages.data:
        print(f"Message ID: {message.id}")
        print(f"Assistant ID: {message.assistant_id}")
        print(f"Role: {message.role}")
        print(f"Content: {message.content[0].text.value}")
        print(f"Created At: {message.created_at}")
        print(f"Attachments: {message.attachments}")  # manually retrieve output_file_id here
        print("-" * 40)
else:
    print(run.status)

def write_file_to_temp_dir(some_file_id, output_path):
    file_data = client.files.content(some_file_id)
    file_data_bytes = file_data.read()
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, "wb") as file:
        file.write(file_data_bytes)

output_file_id = "file-Fsxke9cCMVrasMnVLqkNxE"
output_path = 'test/test2.csv'
write_file_to_temp_dir(output_file_id, output_path)

It returns variable results, none of which is reliable or complete in terms of accuracy, while on chatgpt.com I get consistent and correct results for the different types of files tested.

When creating the assistant, I tried to replace gpt-4o with gpt-4o-2024-11-20, but it throws an error:

BadRequestError: Error code: 400 - {'error': {'message': "The requested model 'gpt-4o-2024-11-20' does not exist.", […]

even though at https://platform.openai.com/settings/organization/limits I can see it among the available models.

Thanks in advance for any additional hint,
Elisa

Hi again.

The model not being there is odd. I would check project API key settings or generate a new API key to populate that model.

You can also use the Playground site to prototype your idea.

You can see that after lots of automatic iterations and attempts in code (30k input tokens and a $0.03 session), the AI came up empty-handed. Code methods rely on searchable text, whereas the datasheet PDFs I uploaded (without inspecting them first) have pages and pages of graphs.

Your own uniform documents may contain actual text rather than just images or vector art, which would make this work like a charm. Then you can migrate to doing the text extraction in your own Python for free, and use chat completions, supplying as much text as you like up to the model’s input limits.
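A minimal sketch of that migration, assuming pypdf for the extraction step (the file path and prompt wording are illustrative, not your exact pipeline):

```python
def build_extraction_prompt(pdf_text: str) -> list[dict]:
    """Build a chat-completions message list asking for DGA measurements as CSV."""
    system = (
        "You are a maintenance engineer. Extract dissolved gas analysis "
        "measurements from the report text below. Answer with CSV only: "
        "sample date as rows; C2H2, CO2, CO, C2H6, C2H4, H2, CH4, N2, O2 as columns."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": pdf_text},
    ]

def extract_pdf_text(path: str) -> str:
    """Extract raw text from every page (pip install pypdf; imported lazily)."""
    from pypdf import PdfReader
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

# Usage (untested sketch, reusing the client created earlier):
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_extraction_prompt(extract_pdf_text("pdf/report.pdf")),
# )
# print(resp.choices[0].message.content)
```

This only works when the PDF has a searchable text layer; scanned pages would still need OCR or a vision model first.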


My PDF files primarily contain tabular data, but they come in very different formats from various laboratories. I need to standardize this data into consistent output tables for the upstream data pipeline.

I’m considering breaking the process into steps—for example, using Azure Document Intelligence and its Layout APIs to automatically extract tables (which I’ve already tested on my documents). As a second step, I could submit the extracted tables to GPT to generate the desired output.
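A sketch of the glue code for that second step. The cell shape here ({'row', 'col', 'content'} dicts) is a simplified stand-in for what Layout actually returns, and the pivot assumes dates in the header row and measurement names in the first column, as described above:

```python
GASES = ["C2H2", "CO2", "CO", "C2H6", "C2H4", "H2", "CH4", "N2", "O2"]

def standardize_table(cells: list[dict]) -> list[dict]:
    """Pivot extracted table cells (rows = measurements, columns = sample
    dates) into one record per date with the nine gases as keys, dropping
    context rows like Laboratory Number or Temperature."""
    grid = {(c["row"], c["col"]): c["content"] for c in cells}
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["col"] for c in cells) + 1
    records = []
    for col in range(1, n_cols):  # column 0 holds the measurement labels
        record = {"sample_date": grid.get((0, col), "")}
        for row in range(1, n_rows):  # row 0 holds the sample dates
            label = grid.get((row, 0), "")
            if label in GASES:  # skip non-measurement context rows
                record[label] = grid.get((row, col), "")
        records.append(record)
    return records
```

Deterministic reshaping like this could also replace the GPT step entirely whenever Layout recognizes the table cleanly, keeping the model only for the messier formats.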

I’d greatly appreciate any other suggestions! 🙂

Microsoft just released their MarkItDown library, which can convert most popular document file formats to Markdown suitable for AI use; it would be worth checking out.
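A minimal sketch of that approach (pip install markitdown; the file path is hypothetical, and the import is lazy so the snippet loads even without the package installed):

```python
def pdf_to_markdown(path: str) -> str:
    """Convert a document to Markdown with Microsoft's MarkItDown library;
    the resulting text can then be sent to chat completions instead of the raw PDF."""
    from markitdown import MarkItDown  # lazy import: pip install markitdown
    return MarkItDown().convert(path).text_content

# Usage (untested sketch):
# print(pdf_to_markdown("pdf/report.pdf")[:500])
```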
