I’m currently exploring ways to use the OpenAI API to clean or normalize inconsistent data in CSV files—for example, variations in company names, addresses, etc.
If anyone has tried something similar, I’d really appreciate any tips, references, or examples you could share.
Best practices or recommended approaches would be incredibly helpful!
Perhaps the first thing to think about is how to ask the right questions so the models can actually help you.
Take fixing company names, for example: how should the AI know that two variations refer to the same company? If we are talking about “mircosoft” meaning “Microsoft” in a list of big tech companies, it can probably help. But if it isn’t a big-tech list, what if a real “mircosoft” company exists somewhere else?
In this case you would have to provide a list of the companies you expect, so it can try to match against them.
Best practice here would be to isolate an occurrence, show some examples, explain what you want, and ask how it can help you, case by case.
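For instance, an exploratory prompt could look something like this (purely illustrative; the sample names here are made up):

```
I have a CSV column of company names with inconsistent spellings.
The companies I expect are: Microsoft, Google, Amazon.

Examples of what I want:
"mircosoft"  -> "Microsoft"
"google inc" -> "Google"

Here are some real rows from my file: "MSFT", "amzn", "Googel".
How would you approach normalizing these, and what else would you need from me?
```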
You can also have a look at these:
- GPT-4.1 Prompting Guide - how to build better prompts
- The New Skill in AI is Not Prompting, It's Context Engineering - why prompts can’t solve everything
Thank you for your reply — that’s actually something I anticipated as well.
My thinking is that I’d probably need to provide some kind of master reference list for company names, and have the AI try to match and correct the values accordingly.
I’ve mostly used ChatGPT via the UI in the past, but this is my first time trying to do something similar using the OpenAI API directly, so I’d really appreciate any guidance.
My goal is to send either a CSV file or raw CSV content to the API and ask it to clean or normalize certain fields — like company names — but I’m not sure where to begin from an API perspective.
Any suggestions on how to approach this would be very helpful!
In this case, I don’t think you can just throw two lists at it and ask for a result.
It works best, like I mentioned, if you provide a list of the company names you want, show some examples of incorrect and correct versions, and ask it to make its best guess. Use one prompt per row of data to classify, with only the necessary columns, or at most just a few.
It probably won’t work if you have thousands of possible companies.
It is not a trivial task to guide in a simple post though. You are asking for a complete system solution.
I suggest first familiarizing yourself with how to use the API, studying structured outputs, and using ChatGPT to help you prepare an outline of what you need to learn to implement it.
I made a small example to help you get started.
It will not scale to a large list, but perhaps it can motivate you to get started on exploring the API.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

reference_list = ["Google", "Microsoft", "Amazon", "Meta", "Apple"]
input_list = ["google inc", "MSFT", "A M Z N", "Meta Platforms", "apple.com", "Ggle"]

system_prompt = """# You are a data-normalization assistant.
## Your task is to map each company name in InputList to the single, best-matching company name in ReferenceList.
## Rules
Choose only from ReferenceList – never invent new names.
Consider abbreviations, typos, brand aliases, and related entities when matching.
Return your answer in JSON with a key 'matches' as an array of objects, each with:
input (the original item from InputList)
match (the chosen name from ReferenceList)"""

prompt = f"""## ReferenceList
{reference_list}

## InputList
{input_list}"""

response = client.responses.create(
    model="gpt-4.1-mini",
    input=prompt,
    instructions=system_prompt,
)
print(response.output_text)
```
For more, you can have a look at the quickstart guide.
Output:

```json
{
  "matches": [
    {"input": "google inc", "match": "Google"},
    {"input": "MSFT", "match": "Microsoft"},
    {"input": "A M Z N", "match": "Amazon"},
    {"input": "Meta Platforms", "match": "Meta"},
    {"input": "apple.com", "match": "Apple"},
    {"input": "Ggle", "match": "Google"}
  ]
}
```
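By the way, the example above only asks for JSON in the prompt, so the model could still drift from that shape. If you study structured outputs, you can have the SDK enforce the schema for you instead. A minimal sketch, assuming the current openai Python SDK’s responses.parse helper and reusing the two lists from above:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

reference_list = ["Google", "Microsoft", "Amazon", "Meta", "Apple"]
input_list = ["google inc", "MSFT", "A M Z N", "Meta Platforms", "apple.com", "Ggle"]

class Match(BaseModel):
    input: str   # the original item from InputList
    match: str   # the chosen name from ReferenceList

class MatchList(BaseModel):
    matches: list[Match]

response = client.responses.parse(
    model="gpt-4.1-mini",
    instructions="Map each name in InputList to the single best match in ReferenceList. Choose only from ReferenceList.",
    input=f"## ReferenceList\n{reference_list}\n\n## InputList\n{input_list}",
    text_format=MatchList,  # structured outputs: the reply is parsed into MatchList
)
print(response.output_parsed.matches)
```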
Thank you! I’ll definitely try out what you suggested.
As I’ve been doing more research, I’ve noticed there are different ways to interact with the API — not just through /v1/responses, but also using /v1/assistants, threads, and so on.
I’m a bit unclear on the difference between these approaches.
In my case, I’m not looking to have a back-and-forth conversation like ChatGPT. I’d rather process the data in batch — for example, trigger cleansing when a specific job runs.
Would you say the /v1/responses API is more suitable for that? Or is the Assistants API still a good fit for batch-style processing?
Also, I’m wondering which approach better supports working with CSV files.
I don’t think directly processing CSV will work, but it’s up to you to explore it.
The Assistants API is being deprecated; I would go with the Responses API.
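Since you mentioned triggering cleansing from a job, the simplest pattern is to read the CSV yourself and send one request per row. A rough sketch under my own assumptions (the file names, the company column, and the normalize_name helper are all made up for illustration):

```python
import csv
from openai import OpenAI

client = OpenAI()
reference_list = ["Google", "Microsoft", "Amazon", "Meta", "Apple"]  # your master list

def normalize_name(raw: str) -> str:
    # Hypothetical helper: map one raw company name onto the reference list.
    response = client.responses.create(
        model="gpt-4.1-mini",
        instructions=(
            "Map the given company name to the single best match in ReferenceList. "
            "Reply with the matched name only, nothing else. "
            f"ReferenceList: {reference_list}"
        ),
        input=raw,
    )
    return response.output_text.strip()

# "input.csv", "output.csv", and the "company" column are assumptions for this sketch.
with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["company"] = normalize_name(row["company"])  # one prompt per row, as discussed
        writer.writerow(row)
```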
Thank you for all the information so far, I’ll give it a try...
So I tried it.
I’m using the /v1/responses endpoint with code_interpreter enabled to cleanse a CSV file via instructions like:
“Please cleanse the uploaded CSV by fixing inconsistent company names, phone number formats, and typos.”
The API correctly processes the CSV and returns a cleansed version. However, in the resulting container.file object (from GET /v1/containers/{id}/files), I sometimes see this:
"path": "/mnt/data/cleansed_member_info.csv /mnt/data/file-MqcrsUSPDYxAEWfsjNLUDh-.csv"
Notice that the path field contains two paths separated by a space. Because of this, bytes becomes null, and I cannot access the output file.
I did not set the path manually; it’s generated by the assistant/code interpreter.
This is the full response of GET /v1/containers/xxxx/files:
```json
{
  "object": "list",
  "data": [
    {
      "id": "cfile_686ccc37b5208191b61f3d6c0795cf12",
      "object": "container.file",
      "created_at": 1751960631,
      "bytes": null,
      "container_id": "cntr_686ccbe74fc08191b4690c65d090ec2a0718809f71897bb8",
      "path": "/mnt/data/cleansed_members.csv /mnt/data/file-VubKaGSVRHNtpYCTaRE6Ln-.csv",
      "source": "assistant"
    },
    {
      "id": "cfile_686ccbe8c9408191ae43d943d777412f",
      "object": "container.file",
      "created_at": 1751960552,
      "bytes": 1603,
      "container_id": "cntr_686ccbe74fc08191b4690c65d090ec2a0718809f71897bb8",
      "path": "/mnt/data/file-VubKaGSVRHNtpYCTaRE6Ln-.csv",
      "source": "user"
    }
  ]
}
```
My questions:
- Why does this happen? Is it a bug in the code interpreter or a known behavior?
- How can I prevent multiple paths from being included in the path field?
- Is there a best practice to ensure the output file is always accessible and properly linked?
You can try looking into the container files.
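If I remember the API reference correctly, each assistant-generated file can be fetched via the container file content endpoint, GET /v1/containers/{container_id}/files/{file_id}/content, even when the listed path looks odd. A minimal sketch using plain HTTP (the container ID is the one from your response above; error handling omitted):

```python
import os
import requests

headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
container_id = "cntr_686ccbe74fc08191b4690c65d090ec2a0718809f71897bb8"

# List all files in the container.
files = requests.get(
    f"https://api.openai.com/v1/containers/{container_id}/files",
    headers=headers,
).json()

for f in files["data"]:
    if f["source"] != "assistant":
        continue  # skip the CSV you uploaded yourself
    # Download the raw bytes of each assistant-generated file.
    content = requests.get(
        f"https://api.openai.com/v1/containers/{container_id}/files/{f['id']}/content",
        headers=headers,
    )
    with open(f"{f['id']}.csv", "wb") as out:
        out.write(content.content)
```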