Seeking Advice: Overcoming Role-Switching Errors in Our LLM-Powered Customer Service Agent

Hey OpenAI Community! :rocket:

We’ve built an autonomous customer service agent named “Emily” for our marketplace, where gamers can pay pro players to join their team (companionship, coaching, etc.). With nearly 4 million monthly messages exchanged between customers, pro players, and our customer service, Emily plays a pivotal role in mediating these interactions.


  • Emily runs on 17 separate GPT-4 LLM applications, pulling data from different databases at different stages for both the demand and supply sides.
  • She has multiple roles: chatting in DMs for onboarding, support, and sales; mediating order chats for progress; scheduling; and more.
  • We’ve done hundreds of iterations on those 17 LLM applications to build Emily-1.0 and to make Emily’s identity as a marketplace representative clear: she shouldn’t impersonate customers or pro players.

Challenge: Despite multiple iterations on our prompts, in 3% to 18% of cases (depending on the LLM app), inside order chats with three participants (customer, pro player, Emily), Emily ends up taking the role of a customer or a pro player when she shouldn’t. For instance, she may jump into a chat responding as if she’s the pro player, which is not her intended behavior.

The highest error rate (18%) occurs in the LLM app (prompt) that is triggered at a specific time to check whether the scheduled gaming session between the parties has started on time.

So basically, she needs to check the order chat history, our database, and her prompt, and, depending on the situation, write certain messages and activate commands that confirm the order has started or the problem has been resolved. (!) Instead (!), she hallucinates that she is the pro player or the client and acts on their behalf.

What We’ve Tried:

  • We’ve iterated on this issue in our prompts over 20 times.
  • Explicitly mentioned in our prompts that Emily is solely a manager and shouldn’t take on any other role in our marketplace.
  • Utilized various plugins to refine and improve prompt construction.

We’re reaching out to this talented community for insights or experiences that might help us refine Emily’s behavior. Have you faced similar challenges? What strategies or techniques worked for you? Any feedback or advice would be greatly appreciated!

Thanks for your time and looking forward to the collective wisdom! :pray:


Hi and welcome to the Developer Forum!

The first thing that jumps out here is the third-person aspect: the models are tuned to respond to a single person at a time. Perhaps not explicitly, but due to the nature of the training and the feedback mechanism, that is what gets instilled.

I think the LLM is having difficulty when it is not clear that it is the third member of a group conversation. Emphasizing that point more would probably help, perhaps with a short example conversation included as a “shot”.
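A minimal sketch of such a shot, embedded directly in the system prompt. The names and the wording of the example conversation are hypothetical; only the idea (show the model itself acting as the third participant) is the point:

```python
# A hypothetical "shot": a short example conversation appended to the
# system prompt so the model sees itself acting as the mediator, not as
# either of the two human participants.
EXAMPLE_SHOT = """\
Example order chat (you are Emily, the third participant):
client_dmitry: Hi! I just placed the order...
pro_alex: Hi! I am your pro-player for order #123.
Emily: Hi both! I'm Emily, the Legionfarm mediator for this order. \
Alex, please start the session at the scheduled time.
"""

def build_system_prompt(base_instructions: str) -> str:
    """Append the worked example after the base instructions."""
    return base_instructions + "\n\n" + EXAMPLE_SHOT

prompt = build_system_prompt("You are Emily, the marketplace mediator.")
```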

Sounds like a super interesting use case!


I’d just add that you want to make it clear to the end user that it is an AI, so you don’t break OpenAI’s Terms of Service.

What exactly do you mean by this?

Can you show your prompt?

You might need to give an example in your user/assistant prompts… possibly a two-shot or three-shot example of how Emily should behave in a conversation…

ETA: I like @_j’s idea to use the lesser-known “name” parameter for the assistant messages…
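For anyone curious, a minimal sketch of what that might look like. The participant names are hypothetical; the `name` field on user messages lets the model tell the two human participants apart instead of seeing two anonymous `user` turns:

```python
# Sketch (participant names are hypothetical): using the Chat Completions
# "name" field so each human participant is distinguishable to the model.
messages = [
    {"role": "system",
     "content": "You are Emily, the order-chat mediator."},
    {"role": "user", "name": "client_dmitry",
     "content": "Hi! I just placed the order..."},
    {"role": "user", "name": "pro_alex",
     "content": "Hi! I am your pro-player for order #123."},
]

# The request payload as it would be sent to the API:
payload = {"model": "gpt-4", "temperature": 0.1, "messages": messages}
```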


I don’t have the same experience as some of the people here, but I’ll ask what’s probably an obvious question: What are you using as the temperature setting with the AI? The actual instructions are the most important, but a lower temp should make it follow your instructions more closely.

Seems like you have a waluigi in your midst!

I think you’ll find one of my favorite articles shared here a very fun, insightful read.


It could be Waluigi. You may be too over-controlling in your approach, teaching bad behavior in terms of token frequency that the model just blindly follows. Do you have a lot of “don’t do this…” in your prompt, or “only behave like this…”???


Thanks for the advice about making it clear to users that it’s an AI.
We didn’t know about that part of the ToS; we’ve just made changes, so now she will say that in her intro, plus this info is going to be visible in our product during interactions with her.

Here’s the prompt for one of the LLM apps. This one is the simplest and shortest, but it’s the one where we have the biggest percentage of mistakes:

You are “Emily”, the dedicated personal assistant for clients at Legionfarm. This platform allows everyday gamers to hire professional players for collaborative gaming sessions. Think of us as the “Uber” for gaming: we connect clients with Pro players and retain a commission for facilitating the connection. Your primary responsibility is to ensure that the order proceeds smoothly and that the Pro player adheres to the correct protocols, such as initiating the gaming session and updating the customer about its commencement. Always remember: you are to maintain your identity as “Emily” and should never impersonate either the client or the Pro player.


  • Client: Recognizable by nicknames containing ‘client’. Their typical opening line is “Hi! I just placed the order…”.
  • Pro Player: Their introduction is “Hi! I am your pro-player for order…”.
  • You: Emily, the mediator.
  • You possess the names and details of both the customer and the pro player, which can be found in the provided {{context}}.

Steps to Follow:

  1. Review the Chat: Take a moment to understand the current state of communication.
  2. Check Pro Player’s Engagement: Ensure the pro player has started the gaming session for this order. You’ll find this information in the latest messages of this order chat.
  3. Action Based on Scenarios:
  • Scenario A: Pro Player started the session
    If the pro player initiates the session and the customer doesn’t raise any concerns about the starting process or the scheduled time, then say something like: Hey everyone, it appears the session started as scheduled. Please reach out if you need any assistance.
  • Scenario B: Pro Player didn’t start the session
    If the Pro player hasn’t initiated the session, inform the client that you’ve given the Pro player a call to ping them, and use these 2 commands:
    1. {make_call}
    2. {send_telegram:'{"message":"The time has come to start the order, but you have not designated the gaming session as started. Please start the session or agree on a delay with the client via order chat <Order #, Order name>."}'}

Use the chat below that you need to work with:
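For illustration, the two Scenario B commands can be pulled out of the completion deterministically rather than trusted as free-form text. This is only a sketch: the command syntax is taken from the prompt above, straight quotes are assumed, and the regexes are a simplification:

```python
import json
import re

# Sketch: extract the {make_call} and {send_telegram:'...'} commands
# (syntax from the prompt above) so regular code can execute them.
MAKE_CALL_RE = re.compile(r"\{make_call\}")
SEND_TELEGRAM_RE = re.compile(r"\{send_telegram:'(\{.*?\})'\}")

def extract_commands(completion: str) -> dict:
    """Return which commands the model emitted, with parsed arguments."""
    commands = {"make_call": bool(MAKE_CALL_RE.search(completion))}
    m = SEND_TELEGRAM_RE.search(completion)
    commands["send_telegram"] = json.loads(m.group(1)) if m else None
    return commands

out = extract_commands(
    "I've given the Pro player a call. {make_call} "
    "{send_telegram:'{\"message\":\"Please start the session.\"}'}"
)
# out["make_call"] is True; out["send_telegram"]["message"] holds the text
```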

Thanks! We will update our prompts with examples.
Actually, this is the only thing that really helps every time.

For this one we used 0.1,
but I guess it needs an example of what is good and what is bad :slight_smile:

Your prompt is way too convoluted. You need to break it down and start separating your concerns. I’m assuming that you are running this on every message? Do you only run it on a single message, or on the whole conversation? If these were my job instructions, I would seriously not know what to do.

I’m starting to notice a warning flag when people use “If, then, when, but” sequences in their prompts. Not necessarily wrong, but I think they can be broken down and managed better with a sprinkle of programming logic.

To me it seems like you are trying to use an LLM to completely manage your orders/ready status through some sort of communication platform (Discord?). This is truly bizarre to me. These systems have existed for a long time, before LLMs.

In previous cases, a user would use a simple /[command]. But, I get it. LLMs. In this case a classifier would do the same.

You could start with a classifier to determine what the message is regarding, then craft your prompt based on that. There is so much noise in this prompt it made me go cross-eyed.
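A rough sketch of that classify-then-route idea. The labels and the keyword stub below stand in for a real classifier call (which in practice would be a small, cheap LLM request returning exactly one label):

```python
# Sketch: classify the latest message first, then build a narrow prompt
# for that single concern instead of one giant do-everything prompt.
ROUTES = {
    "session_started": "Confirm the session started and offer help.",
    "session_not_started": "Ping the pro player and inform the client.",
    "scheduling": "Help the parties agree on a new time.",
    "other": "Answer as the marketplace mediator.",
}

def classify(message: str) -> str:
    """Stub classifier; a real one would be a small LLM call that
    returns exactly one of the labels in ROUTES."""
    text = message.lower()
    if "started" in text:
        return "session_started"
    if "reschedule" in text or "delay" in text:
        return "scheduling"
    return "other"

def build_prompt(message: str) -> str:
    """Craft a minimal prompt for the one concern the message is about."""
    label = classify(message)
    return f"You are Emily, the mediator. Task: {ROUTES[label]}"
```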

The wider you cast your net, the more crap you will accidentally catch, and the less control you have over it all.

I was hoping this would be a cool example of the Waluigi Effect, but I now think this is the LLM simply being utterly lost.


Yeah the prompt is way too dense. Maybe also look into Chain of Thought (CoT) styles, and really get the LLM to break it down.

The current gen models really aren’t able to fully understand what’s all going on in your densely worded prompt.

You may also need a proactive computing layer to organize the status before sending to the LLM.
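Something like this, perhaps; a sketch of that layer in Python, where the scenario decision is made in plain code from the database record and the LLM only has to write the message. The field and function names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class OrderStatus:
    # Hypothetical fields pulled from the database before any LLM call.
    order_id: int
    scheduled_start: str
    session_started: bool

def scenario_for(status: OrderStatus) -> str:
    """Decide the scenario deterministically; no LLM judgment needed."""
    return "A" if status.session_started else "B"

def build_llm_input(status: OrderStatus, chat_tail: str) -> str:
    """Hand the model one pre-decided scenario to verbalize."""
    scenario = scenario_for(status)
    return (
        f"Order #{status.order_id}, scheduled {status.scheduled_start}. "
        f"Scenario {scenario} applies. Write Emily's message for this "
        f"scenario only.\nRecent chat:\n{chat_tail}"
    )
```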


This confuses me. LOL

If I ask ChatGPT: Consider the business Uber. Now apply what you know about that to gaming. Explain the result.

I get the following which is very muddy:

Uber has disrupted the traditional taxi industry by leveraging technology to create a decentralized platform that connects riders with drivers. It has democratized the access to personal transportation and has created economic opportunities for drivers. Applying a similar model to the gaming industry could be quite revolutionary.

  1. Decentralized Gaming Platform: Just like Uber connects drivers and riders, a decentralized gaming platform could connect game developers with gamers directly. This eliminates the need for a central publisher or distributor, thereby reducing costs and increasing profits for the developers.

  2. Dynamic Pricing: Similar to Uber’s surge pricing, games could have dynamic pricing based on real-time demand or other factors like location, time of the day, or special events.

  3. Real-Time Matching: Uber’s real-time matching of drivers and riders could be applied to multiplayer gaming. Players could be matched in real-time based on their skill levels, preferences, or even social connections.

  4. Crowdsourced Development: Just like Uber drivers use their own cars, game developers could contribute assets or modules to a larger game, effectively crowdsourcing the development. This would speed up the development process and allow for a more diverse set of game features.

  5. Ratings & Reviews: The Uber model relies heavily on ratings for both drivers and riders. Similarly, a decentralized gaming platform could use ratings to provide better matching and quality control for games.

  6. Economic Opportunities: Just like Uber has empowered individuals to earn money using their cars, a decentralized gaming model could allow individuals to earn money by contributing to game development, moderating content, or even by trading in-game assets.

  7. Local Experiences: Uber offers local rides tailored to each city. Similarly, games could offer local experiences, maps, or storylines that resonate with players from specific regions.

  8. Data-Driven Decisions: Uber uses data analytics for route optimization, dynamic pricing, and other operational decisions. In gaming, similar analytics could be used for personalizing gaming experiences, optimizing game performance, and more.

  9. Quick Iteration: Uber frequently updates its service based on data and feedback. A similar approach in gaming could result in more responsive and adaptive game development cycles.

  10. On-Demand Services: Like UberEATS for food delivery, a gaming platform could offer on-demand in-game services, tutorials, or coaching sessions, connecting experienced players with newbies for quick learning experiences.

In summary, applying the Uber business model to the gaming industry could result in a more decentralized, efficient, and personalized gaming experience for both developers and gamers.


So what should the correct, non-muddy answer look like?

Fair enough… I’m NOT sure because it’s NOT clear from the prompt. That is my whole point. :slight_smile:

My point is that if I can’t figure it out, GPT is probably having a difficult time also… better to speak clearly rather than stacking a bunch of fuzzy concepts together.

Rather than saying:

Think of us as the “Uber” for gaming

I’d look at the output of my last post for relevant data, and then state those points in direct and simple terms. For instance, let’s take #2 as something relevant:


“Games have dynamic pricing based on real-time demand, location, time of the day, and any special event pricing that may be in effect.” …and then you may want to give specifics, a pricing table.

That’s my 2 cents! :slight_smile:


You are 100% correct that if a person has a hard time coming up with an answer, GPT will too. It will always give you AN answer, so it can help with “hump” problems that just need a kick start, but it is using a similar neural network to the one in our heads… it will have similar issues with vagueness, illogic, and imprecise requests.

The solution space available to the model is almost infinite: at least 10^600 “locations” in latent space where answers can be found. Humans are only interested in an almost infinitely small subset of that space. If you don’t narrow down the options the model has to work with, by using solid, well-thought-out prompting, it WILL find a solution somewhere, just not one most humans will understand.

Thanks for the advice!
After we gave it examples, it started to do things properly; no mistakes at all now.
So this prompt actually works for us.


Actually, we gave it examples and it works perfectly now.

We iterated on prompts hundreds of times, and to be honest, that’s the shortest prompt; we have longer and more complicated ones. Other types of prompts didn’t really work. Just wondering if you’ve ever automated CS with LLMs or know of good cases? It would be nice to learn from people who have implemented LLMs at scale in an ops-heavy business to improve retention.

As the CEO and founder of a company that generates revenue and employs 70 people, it’s important for me to generate profits and increase our revenue. I’m not a big expert in ML or LLMs, but we built something that performs autonomously in our CS (sales, support, etc.), and our purchase 1 → 2 conversion has already grown 1.5× on a statistically significant number of new customers over the last 2 months.

Although I’ve received numerous suggestions to try simpler solutions or to utilize LLaMA, our current system is working effectively for us. We’ve found that building our entire customer service on LLMs is not only faster, but also lets our operations team iterate on it with ease.


Did you ever give the model itself (GPT-4) a typical input and the expected output, and ask it to generate a prompt to achieve it? That’s my current method; it removed 90% of the iterative effort. Then you can feed the model the prompt inside some block markers, explain the erroneous output you are getting, and most of the time it will nail it on the first attempt.
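A sketch of that workflow as a reusable template. The exact wording of the meta-prompt is just an illustration, not a recommended phrasing:

```python
# Sketch: a meta-prompt asking GPT-4 to write the prompt for you, given
# a typical input and its expected output (wording is hypothetical).
META_PROMPT = """\
Write a system prompt for an assistant that, given input like:
---INPUT---
{example_input}
---EXPECTED OUTPUT---
{example_output}
---
produces the expected output. Return only the prompt.
"""

def build_meta_prompt(example_input: str, example_output: str) -> str:
    """Fill the template with one worked input/output pair."""
    return META_PROMPT.format(
        example_input=example_input, example_output=example_output
    )
```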


Right on. If it works it works!

If you stick around I think you’ll find a lot of very valuable insights. I know I have.

No, I haven’t done any customer support with an LLM (besides simple chatbots to aid with a web app & general knowledge, if that counts). I think it’s a massive market though, and it would be MUCH preferable to what is common now :roll_eyes:.

I am… or was… an avid gamer once upon a time though, and I recall the days of finding matches through IRC channels using typical command-based ready systems. I’m definitely interested in seeing how an LLM functions in this case.