Building a chatbot that needs to respond to user messages that get censored

This is probably a general question that applies to any LLM, not just ChatGPT…

I am working on a project building an AI chatbot that acts like a counsellor for sensitive, personal issues that users may want to talk about, for example harassment they may have experienced at work or at university.

The bottom line is that we want users to be free to talk about their experiences. But as you can imagine, some of their stories can be quite disturbing and trigger OpenAI’s content filtering, and I’m sure you know that its automated content moderation makes false positive predictions and there’s nothing we can do about it. This has already happened a few times in our experiments: on one occasion, our program that simulates a ‘human’ user talked about self-harm, and that was flagged as violating OpenAI’s usage policy.
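For reference, this is roughly how you can see the flag yourself by pre-checking a message against OpenAI’s Moderation endpoint before it ever reaches the chat model. It is only a minimal sketch: the model name and the way I read the result follow the current openai Python SDK, so adjust it to your setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A mild stand-in for the kind of message our simulated 'human' user produces.
user_message = "When the harassment got really bad, I started hurting myself."

# Pre-check the message with the moderation endpoint before sending it to the chat model.
result = client.moderations.create(
    model="omni-moderation-latest",  # assumption: current moderation model name
    input=user_message,
).results[0]

print("flagged:", result.flagged)
# Show which categories fired (e.g. self-harm) and their scores.
for category, hit in result.categories.model_dump().items():
    if hit:
        print(category, result.category_scores.model_dump()[category])
```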

I’m sure many of you are building chatbots that have to deal with sensitive content like this. How do you deal with it? By hosting your own open-source models? By using commercial models that are less censored (if so, which ones)?

And I’m not talking about using guardrails to catch and block such messages, because we don’t want to stop users from communicating their stories; we want them to speak freely so that we can help them. Or is that the only way?

Any thoughts are much appreciated.


Hello @ziqizhang - It’s a great question and something I had to deal with recently for a non-profit use case: creating voice agents for domestic violence abuse hotlines as part of the TEDAI Hackathon 2024. You can check out our prompt here. A few non-profits have approached us since the hackathon to figure out whether we can help with their call volumes, and the reality is: how do you classify and understand caller intent and dispatch help, or, in this particular case, how do you identify the victim vs the perpetrator? I don’t think it is a question of hosting open-source vs closed-source models; it is more a question of continually training the models with a dataset so that they can eventually “understand true intent”.

Not very helpful, I know, but we took a hard look at our prompt and identified “trigger or danger words” that should alert a human. If the human gets alerted too many times incorrectly, the system should correct itself, using a combination of traditional human feedback and system feedback - not plain reinforcement learning, but reinforcement learning with human feedback.
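To make that loop concrete, here is a minimal sketch of the alert-and-correct idea. The trigger words, thresholds and class names are placeholders for illustration, not our production system:

```python
# Minimal sketch of the alert-and-correct loop described above.
# Trigger words, thresholds and names are placeholders, not production code.

TRIGGER_WORDS = {"hurt myself", "kill", "weapon", "threatened"}

class HumanAlertGate:
    """Alert a human when trigger words appear, and relax the rule for words that
    human reviewers repeatedly dismiss (a mix of human and system feedback)."""

    def __init__(self, max_false_alarms: int = 3):
        self.max_false_alarms = max_false_alarms
        self.false_alarms: dict[str, int] = {}  # trigger word -> dismissed-alert count

    def should_alert(self, message: str) -> bool:
        text = message.lower()
        hits = [w for w in TRIGGER_WORDS if w in text]
        # Ignore words that humans have already dismissed too many times.
        active = [w for w in hits if self.false_alarms.get(w, 0) < self.max_false_alarms]
        return bool(active)

    def record_feedback(self, message: str, was_real_emergency: bool) -> None:
        """Called after a human reviews the alerted message."""
        if was_real_emergency:
            return
        text = message.lower()
        for w in TRIGGER_WORDS:
            if w in text:
                self.false_alarms[w] = self.false_alarms.get(w, 0) + 1

gate = HumanAlertGate()
message = "He threatened me again last night."
if gate.should_alert(message):
    print("escalate to a human reviewer")
    gate.record_feedback(message, was_real_emergency=True)
```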


Thank you so much! Your prompt has already helped me identify some issues with mine.

You mentioned training models that understand intent and catch trigger words; I wonder what you do with the cases they catch? Because I imagine that if you send that content to the LLM backend, it will still be problematic.

I’ve been thinking of having local models that do ‘style transfer’, i.e., if we catch content that would violate LLM usage policies, we use these local models to rephrase it so that the meaning is preserved but the tone/sentiment is ‘neutralised’.
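Roughly what I have in mind, as a sketch: the moderation pre-check is the real OpenAI endpoint, but rephrase_with_local_model is a placeholder for whatever local paraphrasing model we end up using.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_with_local_model(text: str) -> str:
    """Placeholder for a local 'style transfer' model that keeps the meaning
    but neutralises the tone/sentiment (e.g. a small paraphraser run locally)."""
    # ... call the local model here ...
    return text  # no-op in this sketch

def prepare_for_backend(user_message: str) -> str:
    """If the message would likely trip the backend's filters, rephrase it locally first."""
    result = client.moderations.create(
        model="omni-moderation-latest",  # assumption: current moderation model name
        input=user_message,
    ).results[0]
    if result.flagged:
        return rephrase_with_local_model(user_message)
    return user_message

# What we would actually send to the backend LLM:
safe_text = prepare_for_backend("A user's story about harassment goes here.")
print(safe_text)
```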

However, I suppose that still cannot address the situations where the backend LLM’s content moderation makes wrong predictions or filters our content too ‘aggressively’, which I know happens more often than we would like (e.g. Fine-tuning blocked by moderation system).


@ziqizhang Yes, I think that is an excellent idea worth pursuing. In the old days of NLP we implemented document search indexes as lookup tables of potential words or sentence structures; to ‘neutralise’ the tone/sentiment, you could use a dictionary that records the flagged content and masks the intent with your own ruleset. It will not address the ‘moderation’ issue at large, and now you see why broad moderation across all applications without contextual nuance is hard. However, for researching this, a two-model approach with an internal intent mapping table may work.
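A very rough sketch of what such an internal intent mapping table could look like; the entries and replacement phrasings here are made up purely for illustration:

```python
import re

# Hypothetical intent mapping table: flagged phrasing -> neutral phrasing that
# still preserves the underlying meaning for the backend model.
INTENT_MAP = {
    r"\bthreatened to kill me\b": "made me fear for my safety",
    r"\bhurt(ing)? myself\b": "coping in unsafe ways",
    r"\bhe hit me\b": "there was physical violence",
}

def mask_intent(text: str) -> tuple[str, list[str]]:
    """Apply the ruleset: rewrite flagged phrasings and report which rules fired,
    so the original intent can still be routed (e.g. to a human reviewer)."""
    masked = text
    fired = []
    for pattern, neutral in INTENT_MAP.items():
        if re.search(pattern, masked, flags=re.IGNORECASE):
            masked = re.sub(pattern, neutral, masked, flags=re.IGNORECASE)
            fired.append(neutral)
    return masked, fired

masked, fired = mask_intent("He threatened to kill me and I started hurting myself.")
print(masked)            # neutralised text to send to the backend LLM
print("rules fired:", fired)
```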