Having issues building a customer support AI with OpenAI. Need help deploying to production?

Hi everyone,

I am trying to build a customer support AI for our organization using OpenAI, and honestly, the experience has been more frustrating than expected.

Earlier, I used the Assistants API for document indexing and retrieval, which was working decently. The issues with that were the lack of agentic behavior and reliability. Later, I moved toward the Responses API file search and also started experimenting with the Agents SDK.

But it still is not working well for us.

I want a customer service AI that can answer from our documents, support customers reliably without hallucinations, follow instructions consistently, perform actions, and give grounded responses that are actually useful.

I am stitching together multiple pieces, and even after doing that, the system does not behave the way I expect in real-world use. The models I have tested are GPT-5.2, GPT-5 mini, o3-mini, and previously GPT-4 Turbo.

The main issues are:

  • retrieval does not feel dependable
  • responses are inconsistent and behavior is unpredictable
  • the overall setup still does not feel solid enough for production customer support

At this point, I am tired of piecing everything together and still not getting the quality we need.

So I wanted to ask people here who have actually made this work in production: has anyone built a genuinely reliable customer support AI that does not hallucinate with the current OpenAI stack? Are you using the Agents SDK directly, or mostly building your own orchestration? For document-based support, what are you using? And how are you handling the model variations, like when the model sometimes answers the same query correctly and sometimes gets it wrong?

I want something dependable enough for real organizational customer service. I am fed up with building this way; I have already spent a lot of time and resources on this project.

I would really appreciate honest advice or any alternative approach from anyone who has gone through this. I no longer want to keep stretching this out; my team members are asking questions. Thanks.

1 Like

You are not alone. One of the biggest issues in production is model inconsistency, where the same query can return different-quality answers across runs. The newer frontier models like GPT-5.4 tend to be more stable and better at agentic workflows than older or smaller models, so they are worth testing for this kind of use case.
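One simple mitigation for that run-to-run variation is to sample the same query several times and keep the majority answer. This is a sketch with a stubbed model call (the `ask_model` function and the example responses are placeholders, not a real API call), and it only helps for short, factual answers, not long free-form text:

```python
from collections import Counter

def stable_answer(ask_model, query, n_runs=3):
    """Sample the model n_runs times and return the most common answer.

    `ask_model` stands in for whatever function calls your LLM.
    Returns the winning answer plus an agreement ratio you can use
    as a confidence signal (e.g. to trigger a human fallback).
    """
    answers = [ask_model(query) for _ in range(n_runs)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_runs

# Stubbed model that answers correctly 2 times out of 3, for demonstration.
_responses = iter(["Plan B", "Plan A", "Plan A"])
answer, agreement = stable_answer(lambda q: next(_responses), "Which plan includes SSO?")
```

Tripling the calls costs tokens and latency, so in practice this is worth it mainly for high-stakes intents, or as a fallback when a single answer fails a guardrail check.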

I have also seen the community report issues around file search and the Responses API, especially for retrieval-heavy and multi-step support workflows. I tested it a few times myself and it did not feel that bad, so the stack is usable, but file search does have some rough edges. Some of those may already be resolved; it would help if you listed the specific issues you are hitting here.

In my view, a reliable production implementation core needs good orchestration, guardrails, evals (output & retrieval quality), improvement over time, and also a human fallback option.
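As a concrete example of the retrieval-quality eval piece: recall@k over a small labeled set of query-to-document pairs is a common starting metric. A minimal sketch (the document IDs below are made up for illustration):

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the known-relevant documents that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical eval case: the gold answer for this query lives in "faq-12" and "kb-3".
retrieved = ["kb-3", "pricing-1", "faq-12", "kb-9"]
score = recall_at_k(retrieved, relevant_ids=["faq-12", "kb-3"], k=3)
# Both relevant docs appear in the top 3, so recall@3 is 1.0 here.
```

Running something like this over a few dozen real support queries, every time you change chunking or prompts, is what turns "retrieval feels unreliable" into a number you can actually improve.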

At that point, the choice is either to build it carefully in-house with those controls using OpenAI, LangChain, or LlamaIndex, or to use an AI customer service platform like Forethought, YourGPT, or Intercom to reduce some of the operational overhead.

3 Likes

I built the first chatbot and customer support bot for the Discourse platform (upon which this forum functions).

For what it is worth, here is the repo:

It is still used commercially for customer support (it currently uses gpt-5, FYI).

I built retrieval by hand; in fact, I was one of the first adopters of pgvector. When this was first written, the Assistants API was not even available. But imho you need to implement this yourself, because otherwise you will not be able to tune it satisfactorily.
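The core ranking step of hand-rolled retrieval is small enough to sketch in plain Python. The toy 3-dimensional vectors and document IDs below are illustrative; in a real setup the vectors come from an embedding model and pgvector does this same ranking inside Postgres (e.g. `ORDER BY embedding <=> query_vec`):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    """Rank stored (doc_id, embedding) pairs by similarity to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy "embeddings" for three support documents.
store = [("refund-policy", [0.9, 0.1, 0.0]),
         ("shipping-times", [0.1, 0.9, 0.1]),
         ("api-limits", [0.0, 0.2, 0.9])]
results = top_k([1.0, 0.0, 0.1], store, k=2)
```

Owning this layer is what lets you tune it: swap the similarity function, add keyword or recency boosts, filter by customer plan, and so on, none of which is possible when retrieval is a black box.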

It’s definitely challenging.

My suggestions:

  • don’t bother with the smaller models - whilst they may be cheaper and sometimes faster (depending on how much they reason), they aren’t accurate enough. 5.4 mini might be OK, but only if you allow it to reason more - and that defeats the purpose of a cheap model, as you will spend a lot more on tokens and increase latency.
  • build and design all your RAG locally so you have full control and can tune it - my solution manages all of that “locally” on the server and does not delegate it to OpenAI.
  • all prompts really matter (including tool descriptions); spend time tuning them.
  • experiment with different call and response “workflows” until you hit the right one.
  • lean on semantic routing if needed (especially for guardrailing).
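The semantic-routing idea in the last point can be sketched in a few lines. The route names, exemplar vectors, and dot-product similarity below are illustrative stand-ins; real routing would compare embedding-model vectors:

```python
def route(query_vec, routes, similarity):
    """Pick the route whose exemplar utterances are most similar to the query."""
    best_route, best_score = None, float("-inf")
    for name, exemplar_vecs in routes.items():
        score = max(similarity(query_vec, v) for v in exemplar_vecs)
        if score > best_score:
            best_route, best_score = name, score
    return best_route

def dot(a, b):  # stand-in similarity for the toy vectors below
    return sum(x * y for x, y in zip(a, b))

routes = {
    "billing": [[1.0, 0.0], [0.9, 0.2]],
    "refunds": [[0.0, 1.0], [0.1, 0.9]],
    "escalate_to_human": [[0.7, 0.7]],  # guardrail route for angry/legal/out-of-scope queries
}
chosen = route([0.1, 0.95], routes, dot)
```

The guardrailing win is that routing happens before the LLM answers: queries that land on `escalate_to_human` never reach generation at all, which is far more dependable than asking the model to refuse.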

Designing a completely foolproof system is impossible, as LLMs are fallible, but you can definitely deliver a useful system.

5 Likes

Welcome to the developer community, @Abhishek_Kapoor.

For this kind of customer support AI workflow, I’d suggest trying Agent Builder. It’s a visual tool for building multi-step agent workflows, with templates, drag-and-drop nodes, typed inputs/outputs, and live previewing, so it’s a fast way to prototype and debug a support flow before moving to production.

Here’s a template I found for a customer support agent:

When you’re ready to deploy, you can publish the workflow as a versioned object and either plug it into your app with ChatKit or export the code and use the Agents SDK for a more custom setup.

2 Likes