We’ve built an autonomous customer service agent named “Emily” for our marketplace, where gamers can pay pro players to play on the same team (companionship, coaching, etc.). With nearly 4 million monthly messages exchanged between customers, pro players, and our customer service, Emily plays a pivotal role in mediating these interactions.
Background:
Emily runs as 17 separate LLM applications on GPT-4, each extracting data from different databases at different stages for the demand and supply sides.
She has multiple roles: chatting in DMs for onboarding, support, and sales; mediating order chats for progress tracking, scheduling, and more.
We’ve done hundreds of iterations on those 17 LLM applications to build Emily-1.0 and to make sure Emily’s identity as a marketplace representative is clear: she shouldn’t impersonate customers or pro players.
Challenge: Despite multiple iterations on our prompts, in 3% to 18% of cases depending on the LLM app, inside order chats with 3 participants (customer, pro player, Emily), Emily ends up taking the role of a customer or a pro player when she shouldn’t. For instance, she may jump into a chat responding as if she’s the pro player, which is not her intended behavior.
Example:
The highest error rate (18%) occurs in the LLM app (prompt) that is triggered at a specific time to check whether the scheduled gaming session between the parties has started on time.
So basically, she needs to check the order chat history, our database, and her prompt, and, depending on the situation, write certain messages and activate commands that confirm the order has started or the problem has been resolved. Instead, she hallucinates and acts as the pro player or the client, pretending to be one of them and responding on their behalf.
What We’ve Tried:
We’ve iterated on this issue in our prompts over 20 times.
Explicitly mentioned in our prompts that Emily is solely a manager and shouldn’t take on any other role in our marketplace.
Utilized various plugins to refine and improve prompt construction.
We’re reaching out to this talented community for insights or experiences that might help us refine Emily’s behavior. Have you faced similar challenges? What strategies or techniques worked for you? Any feedback or advice would be greatly appreciated!
Thanks for your time and looking forward to the collective wisdom!
The first thing that jumps out here is the third-person aspect: the models are tuned to respond to a single person at a time. Perhaps not explicitly, but due to the nature of the training and the feedback mechanism, that is what gets instilled.
I think the LLM is having difficulty when it is not clear that it is the third member of a group conversation. Emphasizing that point more, maybe with a short example conversation as a “shot”, would probably help.
I’d just add that you want to make it clear to end-user that it is an AI, so you don’t break OpenAI Terms of Service.
What exactly do you mean by this?
Can you show your prompt?
You might need to give an example in your user/assistant prompts… possibly a two-shot or three-shot example of how Emily should behave in a conversation…
ETA: I like @_j’s idea to use the lesser-known “name” parameter for the assistant messages…
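For what it’s worth, here’s a minimal sketch of what that could look like (my own illustration, not the original setup; the participant names, wording, and SDK style are assumptions): one “shot” of the desired behaviour, with every message labelled via the optional `name` field so the model can tell Emily apart from the client and the pro player.

```python
# Hypothetical sketch: label every participant with the optional "name" field
# and include a one-shot example of the desired mediator behaviour before the
# live chat. Names and content are made up.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {"role": "system",
     "content": ("You are Emily, the Legionfarm mediator in a three-person order chat. "
                 "You never speak as the client or the pro player.")},
    # one-shot example of the desired behaviour
    {"role": "user", "name": "client_1234", "content": "Hi! I just placed the order..."},
    {"role": "user", "name": "pro_player_alex",
     "content": "Hi! I am your pro-player for order #1234. Starting the session now."},
    {"role": "assistant", "name": "Emily",
     "content": ("Hey everyone, it appears the session started as scheduled. "
                 "Please reach out if you need any assistance.")},
    # ...the live order chat to mediate would be appended here...
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```

If I remember right, the `name` value has to be a short identifier without spaces, so something like `client_1234` rather than a display name.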
It could be the Waluigi Effect. You may be too controlling in your approach, teaching bad behavior in terms of token frequency that the model then blindly follows. Do you have a lot of “don’t do this…” or “only behave like this…” in your prompt?
Thanks for the advice about making it clear to users that it’s an AI.
Didn’t know about this requirement in the ToS; we just made changes so now she will say that in her intro, and this info is going to be visible to users in our product during interactions with her.
Here’s the prompt for one of the LLM apps. This one is the simplest and shortest, but it’s also the one where we see the biggest percentage of mistakes:
You are “Emily”, the dedicated personal assistant for clients at Legionfarm. This platform allows everyday gamers to hire professional players for collaborative gaming sessions. Think of us as the “Uber” for gaming: we connect clients with Pro players and retain a commission for facilitating the connection. Your primary responsibility is to ensure that the order proceeds smoothly and that the Pro player adheres to the correct protocols, such as initiating the gaming session and updating the customer about its commencement. Always remember: you are to maintain your identity as “Emily” and should never impersonate either the client or the Pro player.
Participants:
Client: Recognizable by nicknames containing ‘client’. Their typical opening line is “Hi! I just placed the order…”.
Pro Player: Their introduction is “Hi! I am your pro-player for order…”.
You: Emily, the mediator.
You possess the names and details of both the customer and the pro player, which can be found in the provided {{context}}.
Steps to Follow:
Review the Chat: Take a moment to understand the current state of communication.
Check Pro Player’s Engagement: Ensure the pro player has started the gaming session for this order. You’ll find this information in the latest messages of this order chat.
Action Based on Scenarios:
Scenario A: Pro Player started the session
If the pro player initiates the session and the customer doesn’t raise any concerns about the starting process or the scheduled time, then say something like: Hey everyone, it appears the session started as scheduled. Please reach out if you need any assistance.
Scenario B: Pro Player didn’t start the session
If the Pro player hasn’t initiated the session, inform the client that you’ve called the Pro player to ping them, and use these 2 commands:
{make_call}
{send_telegram:'{"message":"The time has come to start the order, but you have not designated the gaming session as started. Please start the session or agree on a delay with the client via order chat <Order #, Order name>."}'}
Your prompt is way too convoluted. You need to break it down and start separating your concerns. I’m assuming that you are running this on every message? Do you run it only on a single message, or on the whole conversation? If these were my job instructions, I would seriously not know what to do.
I’m starting to notice a warning flag when people use “If, then, when, but” sequences in their prompts. Not necessarily wrong, but I think they can be broken down and managed better with a sprinkle of programming logic.
To me it seems like you are trying to use an LLM to completely manage your orders/ready status through some sort of communication platform (Discord?). This is truly bizarre to me. These systems have existed for a long time, before LLMs.
In previous cases, a user would use a simple /[command]. But, I get it. LLMs. In this case a classifier would do the same.
You could start with a classifier to determine what the message is regarding, then craft your prompt based on that (see the sketch below). There is so much noise in this prompt it made me go cross-eyed.
The wider you cast your net, the more crap you will accidentally catch, and the less control you have over it all.
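To make the classifier idea concrete, here is a rough sketch. The label set and the stub actions (`post_as_emily`, `make_call`, `send_telegram_reminder`) are placeholders I made up for whatever the real {make_call} / {send_telegram:…} commands actually do; this is not meant as the actual implementation.

```python
# Rough sketch of "classify, then act": the model only picks a label, and
# ordinary code decides what Emily posts and which commands fire.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["session_started", "session_not_started", "schedule_dispute", "other"]

def classify_order_chat(chat_history: str) -> str:
    """Ask the model for a single label instead of a free-form reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify this order chat into exactly one of: "
                        + ", ".join(LABELS) + ". Reply with the label only."},
            {"role": "user", "content": chat_history},
        ],
    )
    return response.choices[0].message.content.strip()

def post_as_emily(text: str) -> None:       # placeholder for the real chat API
    print(f"Emily: {text}")

def make_call() -> None:                    # placeholder for {make_call}
    print("calling pro player...")

def send_telegram_reminder() -> None:       # placeholder for {send_telegram:...}
    print("telegram reminder sent")

def handle(chat_history: str) -> None:
    label = classify_order_chat(chat_history)
    if label == "session_started":
        post_as_emily("Hey everyone, it appears the session started as scheduled. "
                      "Please reach out if you need any assistance.")
    elif label == "session_not_started":
        make_call()
        send_telegram_reminder()
    # schedule_dispute / other can each route to a narrower, purpose-built prompt
```

The point is that the model only ever has to pick a label; the wording Emily posts and the commands she fires are decided by plain code, so there is nothing left for her to hallucinate on behalf of the other participants.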
I was hoping this would be a cool example of the Waluigi Effect, but I now think this is the LLM simply being utterly lost.
You are 100% correct: if a person has a hard time coming up with an answer, GPT will too. It will always give you AN answer, so it can help with “hump” problems that just need a kick start, but it is using a neural network similar to the one in our heads… it will have similar issues with vagueness, illogic, and imprecise requests.
The solution space available to the model is almost infinite, at least 10^600 “locations” in latent space where answers can be found, and humans are only interested in an almost infinitely small subset of that space. If you don’t narrow down the options the model has to work with, by using solid, well-thought-out prompting, it WILL find a solution somewhere, just not one most humans will understand.
Actually we gave it examples and it works perfectly now.
We iterated on prompts hundreds of times, and to be honest, that’s the shortest prompt; we have longer and more complicated ones. Other types of prompt didn’t really work. Just wondering if you’ve ever automated CS with LLMs or know of good cases? It would be nice to learn from people who have implemented LLMs at high scale in an ops-heavy business to improve retention.
As the CEO and founder of a company that generates revenue and employs 70 people, it’s important for me to generate profits and increase our revenue. I’m not a big expert in ML or LLMs, but we built something that performs autonomously in our CS (sales, support, etc.), and our purchase 1 → 2 conversion has already grown by 1.5x on a statistically significant number of new customers over the last 2 months.
I’ve received numerous suggestions to try simpler solutions or to use LLaMA, but our current system is working effectively for us. We’ve found that building our entire customer service on LLMs is not only faster, it also allows our operations team to iterate on it with ease.
Did you ever give the model itself (GPT-4) a typical input and expected output and ask it to generate a prompt to achieve it? That’s my current method; it removed 90% of the iterative effort. Then you can feed the model the prompt inside some block markers and explain the erroneous output you are getting, and most of the time it will nail it on the first attempt.
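As a rough illustration of that workflow (the meta-prompt wording and the example transcripts below are mine and purely hypothetical, not a recommended phrasing):

```python
# Hypothetical sketch of asking the model to write the prompt for you:
# give it a sample input, the output you expected, and the output you got.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

meta_prompt = """
I am building an assistant called Emily that mediates a three-person order chat.

Typical input (order chat history):
<<<INPUT
Client: Hi! I just placed the order...
Pro player: Hi! I am your pro-player for order #1234. Starting the session now.
INPUT>>>

Output I expect from Emily:
<<<EXPECTED
Hey everyone, it appears the session started as scheduled. Please reach out if you need any assistance.
EXPECTED>>>

Erroneous output I currently get (Emily impersonates the pro player):
<<<ACTUAL
Hi! I'm your pro player, let's start the session!
ACTUAL>>>

Write a system prompt that reliably produces the expected behaviour and never the erroneous one.
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": meta_prompt}],
)
print(response.choices[0].message.content)
```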
If you stick around I think you’ll find a lot of very valuable insights. I know I have.
No, I haven’t done any customer support with an LLM (besides simple chatbots to aid with a web app and general knowledge, if that counts). I think it’s a massive market, though, and it would be MUCH preferable to what is common now.
I am… or was an avid gamer once upon a time, though, and I recall the days of finding matches through IRC channels using typical command-based ready systems. I’m definitely interested in seeing how an LLM functions in this case.