I’m currently exploring ways to use the OpenAI API to clean or normalize inconsistent data in CSV files—for example, variations in company names, addresses, etc.
If anyone has tried something similar, I’d really appreciate any tips, references, or examples you could share.
Best practices or recommended approaches would be incredibly helpful!
Perhaps the first thing to think about is how to ask the right questions so the models can actually help you.
Take fixing company names, for example: how should the AI know that two variations refer to the same company? If we are talking about “mircosoft” meaning “Microsoft” in a list of big tech companies, it can probably help. But if it isn’t a big-tech list, what if a real “mircosoft” company exists somewhere else?
In this case you would have to provide a list of the companies you expect, so it can try to match against them.
Best practice here would be to isolate an occurrence, show some examples, explain what you want, and ask how it can help you, case by case.
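For instance, an exploratory prompt could look something like this (purely illustrative; the sample names here are made up):

```
I have a CSV column of company names with inconsistent spellings.
The companies I expect are: Microsoft, Google, Amazon.

Examples of what I want:
"mircosoft"  -> "Microsoft"
"google inc" -> "Google"

Here are some real rows from my file: "MSFT", "amzn", "Googel".
How would you approach normalizing these, and what else would you need from me?
```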
You can also have a look at these:
- GPT-4.1 Prompting Guide - how to build better prompts
- The New Skill in AI is Not Prompting, It's Context Engineering - why prompts can’t solve everything
Thank you for your reply — that’s actually something I anticipated as well.
My thinking is that I’d probably need to provide some kind of master reference list for company names, and have the AI try to match and correct the values accordingly.
I’ve mostly used ChatGPT via the UI in the past, but this is my first time trying to do something similar using the OpenAI API directly, so I’d really appreciate any guidance.
My goal is to send either a CSV file or raw CSV content to the API and ask it to clean or normalize certain fields — like company names — but I’m not sure where to begin from an API perspective.
Any suggestions on how to approach this would be very helpful!
In this case, I don’t think you can just throw two lists at it and ask for a result.
It works best, like I mentioned, if you provide a list of the company names you want, show some examples of incorrect and correct versions, and ask it to make its best guess. Use one prompt per row of data to classify, with only the necessary columns, or at most just a few.
It probably won’t work if you have thousands of possible companies.
It is not a trivial task to guide in a simple post though. You are asking for a complete system solution.
I suggest first familiarizing yourself with how to use the API, studying structured outputs, and using ChatGPT to help you prepare an outline of what you need to learn to implement it.
I made a small example to help you get started.
It will not scale to a large list, but perhaps it can motivate you to get started on exploring the API.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

reference_list = ["Google", "Microsoft", "Amazon", "Meta", "Apple"]
input_list = ["google inc", "MSFT", "A M Z N", "Meta Platforms", "apple.com", "Ggle"]

system_prompt = """# You are a data-normalization assistant.
## Your task is to map each company name in InputList to the single, best-matching company name in ReferenceList.
## Rules
Choose only from ReferenceList – never invent new names.
Consider abbreviations, typos, brand aliases, and related entities when matching.
Return your answer in JSON with a key 'matches' as an array of objects, each with:
input (the original item from InputList)
match (the chosen name from ReferenceList)"""

prompt = f"""## ReferenceList
{reference_list}

## InputList
{input_list}"""

response = client.responses.create(
    model="gpt-4.1-mini",
    input=prompt,
    instructions=system_prompt,
)
print(response.output_text)
```
For more, you can have a look at the quickstart guide.
Output:

```json
{
  "matches": [
    {"input": "google inc", "match": "Google"},
    {"input": "MSFT", "match": "Microsoft"},
    {"input": "A M Z N", "match": "Amazon"},
    {"input": "Meta Platforms", "match": "Meta"},
    {"input": "apple.com", "match": "Apple"},
    {"input": "Ggle", "match": "Google"}
  ]
}
```
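By the way, the example above only asks for JSON in the prompt, so the model could still drift from that shape. If you study structured outputs, you can have the SDK enforce the schema for you instead. A minimal sketch, assuming the current openai Python SDK’s responses.parse helper and reusing the two lists from above:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

reference_list = ["Google", "Microsoft", "Amazon", "Meta", "Apple"]
input_list = ["google inc", "MSFT", "A M Z N", "Meta Platforms", "apple.com", "Ggle"]

class Match(BaseModel):
    input: str   # the original item from InputList
    match: str   # the chosen name from ReferenceList

class MatchList(BaseModel):
    matches: list[Match]

response = client.responses.parse(
    model="gpt-4.1-mini",
    instructions="Map each name in InputList to the single best match in ReferenceList. Choose only from ReferenceList.",
    input=f"## ReferenceList\n{reference_list}\n\n## InputList\n{input_list}",
    text_format=MatchList,  # structured outputs: the reply is parsed into MatchList
)
print(response.output_parsed.matches)
```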
Thank you! I’ll definitely try out what you suggested.
As I’ve been doing more research, I’ve noticed there are different ways to interact with the API — not just through /v1/responses, but also using /v1/assistants, threads, and so on.
I’m a bit unclear on the difference between these approaches.
In my case, I’m not looking to have a back-and-forth conversation like ChatGPT. I’d rather process the data in batch — for example, trigger cleansing when a specific job runs.
Would you say the /v1/responses API is more suitable for that? Or is the Assistants API still a good fit for batch-style processing?
Also, I’m wondering which approach better supports working with CSV files.
I don’t think directly processing CSV will work, but it’s up to you to explore it.
The Assistants API is being deprecated; I would go with the Responses API.
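Since you mentioned triggering cleansing from a job, the simplest pattern is to read the CSV yourself and send one request per row. A rough sketch under my own assumptions (the file names, the company column, and the normalize_name helper are all made up for illustration):

```python
import csv
from openai import OpenAI

client = OpenAI()
reference_list = ["Google", "Microsoft", "Amazon", "Meta", "Apple"]  # your master list

def normalize_name(raw: str) -> str:
    # Hypothetical helper: map one raw company name onto the reference list.
    response = client.responses.create(
        model="gpt-4.1-mini",
        instructions=(
            "Map the given company name to the single best match in ReferenceList. "
            "Reply with the matched name only, nothing else. "
            f"ReferenceList: {reference_list}"
        ),
        input=raw,
    )
    return response.output_text.strip()

# "input.csv", "output.csv", and the "company" column are assumptions for this sketch.
with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["company"] = normalize_name(row["company"])  # one prompt per row, as discussed
        writer.writerow(row)
```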
Thank you for all the information so far, I’ll give it a try...
So I tried it.
I’m using the /v1/responses endpoint with code_interpreter enabled to cleanse a CSV file via instructions like:
“Please cleanse the uploaded CSV by fixing inconsistent company names, phone number formats, and typos.”
The API correctly processes the CSV and returns a cleansed version. However, in the resulting container.file object (from GET /v1/containers/{id}/files), I sometimes see this:
"path": "/mnt/data/cleansed_member_info.csv /mnt/data/file-MqcrsUSPDYxAEWfsjNLUDh-.csv"
Notice that the path field contains two paths separated by a space. Because of this, bytes becomes null, and I cannot access the output file.
I did not set the path manually; it’s generated by the assistant/code interpreter.
This is the full response of GET /v1/containers/xxxx/files:
```json
{
  "object": "list",
  "data": [
    {
      "id": "cfile_686ccc37b5208191b61f3d6c0795cf12",
      "object": "container.file",
      "created_at": 1751960631,
      "bytes": null,
      "container_id": "cntr_686ccbe74fc08191b4690c65d090ec2a0718809f71897bb8",
      "path": "/mnt/data/cleansed_members.csv /mnt/data/file-VubKaGSVRHNtpYCTaRE6Ln-.csv",
      "source": "assistant"
    },
    {
      "id": "cfile_686ccbe8c9408191ae43d943d777412f",
      "object": "container.file",
      "created_at": 1751960552,
      "bytes": 1603,
      "container_id": "cntr_686ccbe74fc08191b4690c65d090ec2a0718809f71897bb8",
      "path": "/mnt/data/file-VubKaGSVRHNtpYCTaRE6Ln-.csv",
      "source": "user"
    }
  ]
}
```
My questions:
- Why does this happen? Is it a bug in the code interpreter or a known behavior?
- How can I prevent multiple paths from being included in the path field?
- Is there a best practice to ensure the output file is always accessible and properly linked?
You can try looking into the container files.
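If I remember the API reference correctly, each assistant-generated file can be fetched via the container file content endpoint, GET /v1/containers/{container_id}/files/{file_id}/content, even when the listed path looks odd. A minimal sketch using plain HTTP (the container ID is the one from your response above; error handling omitted):

```python
import os
import requests

headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
container_id = "cntr_686ccbe74fc08191b4690c65d090ec2a0718809f71897bb8"

# List all files in the container.
files = requests.get(
    f"https://api.openai.com/v1/containers/{container_id}/files",
    headers=headers,
).json()

for f in files["data"]:
    if f["source"] != "assistant":
        continue  # skip the CSV you uploaded yourself
    # Download the raw bytes of each assistant-generated file.
    content = requests.get(
        f"https://api.openai.com/v1/containers/{container_id}/files/{f['id']}/content",
        headers=headers,
    )
    with open(f"{f['id']}.csv", "wb") as out:
        out.write(content.content)
```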