Building a chatbot that needs to respond to user messages that get censored

This is probably a general question that applies to any LLM, not just ChatGPT…

I am working on a project building an AI chatbot that acts like a counsellor for sensitive, personal issues that users may want to talk about, for example harassment they may have experienced at work or at university.

The bottom line is that we want users to be free to talk about their experiences. But as you can imagine, some of their stories can be quite disturbing and trigger OpenAI’s content filtering, and I’m sure you know that its automated content moderation makes false positive predictions and there’s nothing we can do about it. This has already happened a few times in our experiments: on one occasion, our program that simulates a ‘human’ user talked about self-harm, and that was flagged as violating OpenAI’s usage policy.
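For reference, this is roughly how you can see the flag yourself by pre-checking a message against OpenAI’s Moderation endpoint before it ever reaches the chat model. It is only a minimal sketch: the model name and the way I read the result follow the current openai Python SDK, so adjust it to your setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A mild stand-in for the kind of message our simulated 'human' user produces.
user_message = "When the harassment got really bad, I started hurting myself."

# Pre-check the message with the moderation endpoint before sending it to the chat model.
result = client.moderations.create(
    model="omni-moderation-latest",  # assumption: current moderation model name
    input=user_message,
).results[0]

print("flagged:", result.flagged)
# Show which categories fired (e.g. self-harm) and their scores.
for category, hit in result.categories.model_dump().items():
    if hit:
        print(category, result.category_scores.model_dump()[category])
```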

I’m sure many of you are building chatbots that have to deal with sensitive content like this. How do you deal with it? By hosting your own open-source models? By using commercial models that are less censored (if so, which ones)?

And I’m not talking about using guardrails to catch and block such messages, because we don’t want to stop users from communicating their stories; we want them to speak freely so that we can help them. Or is that the only way?

Any thoughts are much appreciated.


Hello @ziqizhang - It’s a great question and something I had to deal with recently for a non-profit use case: creating voice agents for domestic violence abuse hotlines as part of the TEDAI Hackathon 2024. You can check out our prompt here. A few non-profits have approached us since the hackathon to figure out whether we can help with their call volumes, and the reality is: how do you classify and understand caller intent and dispatch help, or, in this particular case, how do you identify the victim vs the perpetrator? I don’t think it is a question of hosting open-source vs closed-source models; it is more a question of continually training the models with a dataset so that they can eventually “understand true intent”.

Not very helpful, I know, but we took a hard look at our prompt and identified “trigger or danger words” that should alert a human. If the human gets alerted too many times incorrectly, the system should correct itself, using a combination of traditional human feedback and system feedback - not plain reinforcement learning, but reinforcement learning with human feedback.
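To make that loop concrete, here is a minimal sketch of the alert-and-correct idea. The trigger words, thresholds and class names are placeholders for illustration, not our production system:

```python
# Minimal sketch of the alert-and-correct loop described above.
# Trigger words, thresholds and names are placeholders, not production code.

TRIGGER_WORDS = {"hurt myself", "kill", "weapon", "threatened"}

class HumanAlertGate:
    """Alert a human when trigger words appear, and relax the rule for words that
    human reviewers repeatedly dismiss (a mix of human and system feedback)."""

    def __init__(self, max_false_alarms: int = 3):
        self.max_false_alarms = max_false_alarms
        self.false_alarms: dict[str, int] = {}  # trigger word -> dismissed-alert count

    def should_alert(self, message: str) -> bool:
        text = message.lower()
        hits = [w for w in TRIGGER_WORDS if w in text]
        # Ignore words that humans have already dismissed too many times.
        active = [w for w in hits if self.false_alarms.get(w, 0) < self.max_false_alarms]
        return bool(active)

    def record_feedback(self, message: str, was_real_emergency: bool) -> None:
        """Called after a human reviews the alerted message."""
        if was_real_emergency:
            return
        text = message.lower()
        for w in TRIGGER_WORDS:
            if w in text:
                self.false_alarms[w] = self.false_alarms.get(w, 0) + 1

gate = HumanAlertGate()
message = "He threatened me again last night."
if gate.should_alert(message):
    print("escalate to a human reviewer")
    gate.record_feedback(message, was_real_emergency=True)
```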


Thank you so much! Your prompt has already helped me identify some issues with mine.

You mentioned training models that understand intent and catch trigger words; I wonder what you do with the cases they catch? Because I imagine that if you send that content to the LLM backend, it will still be problematic.

I’ve been thinking of having local models that do ‘style transfer’, i.e., if we catch content that would violate LLM usage policies, we use these local models to rephrase it so that the meaning is preserved but the tone/sentiment is ‘neutralised’.
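Roughly what I have in mind, as a sketch: the moderation pre-check is the real OpenAI endpoint, but rephrase_with_local_model is a placeholder for whatever local paraphrasing model we end up using.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_with_local_model(text: str) -> str:
    """Placeholder for a local 'style transfer' model that keeps the meaning
    but neutralises the tone/sentiment (e.g. a small paraphraser run locally)."""
    # ... call the local model here ...
    return text  # no-op in this sketch

def prepare_for_backend(user_message: str) -> str:
    """If the message would likely trip the backend's filters, rephrase it locally first."""
    result = client.moderations.create(
        model="omni-moderation-latest",  # assumption: current moderation model name
        input=user_message,
    ).results[0]
    if result.flagged:
        return rephrase_with_local_model(user_message)
    return user_message

# What we would actually send to the backend LLM:
safe_text = prepare_for_backend("A user's story about harassment goes here.")
print(safe_text)
```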

However, I suppose that still cannot address the situations where the backend LLM’s content moderation makes wrong predictions or filters our content too ‘aggressively’, which I know happens more often than we would like (e.g. Fine-tuning blocked by moderation system).


@ziqizhang Yes, I think that is an excellent idea worth pursuing. In the old days of NLP we implemented document search indexes as lookup tables of potential words or sentence structures; to ‘neutralise’ the tone/sentiment, you could use a dictionary that records the flagged content and masks the intent with your own ruleset. It will not address the ‘moderation’ issue at large, and now you see why broad moderation across all applications without contextual nuance is hard. However, for researching this, a two-model approach with an internal intent mapping table may work.
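A very rough sketch of what such an internal intent mapping table could look like; the entries and replacement phrasings here are made up purely for illustration:

```python
import re

# Hypothetical intent mapping table: flagged phrasing -> neutral phrasing that
# still preserves the underlying meaning for the backend model.
INTENT_MAP = {
    r"\bthreatened to kill me\b": "made me fear for my safety",
    r"\bhurt(ing)? myself\b": "coping in unsafe ways",
    r"\bhe hit me\b": "there was physical violence",
}

def mask_intent(text: str) -> tuple[str, list[str]]:
    """Apply the ruleset: rewrite flagged phrasings and report which rules fired,
    so the original intent can still be routed (e.g. to a human reviewer)."""
    masked = text
    fired = []
    for pattern, neutral in INTENT_MAP.items():
        if re.search(pattern, masked, flags=re.IGNORECASE):
            masked = re.sub(pattern, neutral, masked, flags=re.IGNORECASE)
            fired.append(neutral)
    return masked, fired

masked, fired = mask_intent("He threatened to kill me and I started hurting myself.")
print(masked)            # neutralised text to send to the backend LLM
print("rules fired:", fired)
```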