Building a chatbot that needs to respond to user messages that are censored

Thank you so much! Your prompt has already helped me identify some issues with mine.

You mentioned training models that understand the intent and catch trigger words. I wonder: what do you do with the cases they catch? I imagine that if you send that content to the LLM backend, it will still be problematic.

I’ve been thinking of having local models do ‘style transfer’: if we catch content that would violate the LLM usage policies, we use these local models to rephrase it so that the meaning is preserved but the tone/sentiment is ‘neutralised’, and then send the rephrased version to the backend. Something like the sketch below.
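
For concreteness, here is a rough sketch of the two-stage pipeline I have in mind, using Hugging Face `pipeline`s. The model names, the "toxic" label, and the 0.8 threshold are placeholders I picked for illustration, not things I've validated:

```python
# Hypothetical two-stage pipeline: a local classifier flags risky
# messages, and a local seq2seq model rephrases them before they
# ever reach the backend LLM.
from transformers import pipeline

# Stage 1: local moderation classifier.
# "unitary/toxic-bert" is just an example; any text-classification
# model with a suitable label scheme could be swapped in.
flagger = pipeline("text-classification", model="unitary/toxic-bert")

# Stage 2: local paraphraser used as a crude 'style transfer' step.
# Again, the model name is a placeholder for whatever local
# rephrasing model you actually train or pick.
rephraser = pipeline(
    "text2text-generation",
    model="humarin/chatgpt_paraphraser_on_T5_base",
)

def neutralise(message: str, threshold: float = 0.8) -> str:
    """Return the message unchanged if it looks safe; otherwise
    return a meaning-preserving rephrasing with the tone toned down."""
    result = flagger(message)[0]
    if result["label"] == "toxic" and result["score"] >= threshold:
        rewritten = rephraser(message, max_new_tokens=128)
        return rewritten[0]["generated_text"]
    return message

# The neutralised text is what actually gets sent to the backend LLM.
safe_text = neutralise("some user message here")
```

The idea is that the backend only ever sees the rephrased text, so its own moderation layer (hopefully) has less to object to.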

However, I suppose that still can’t address the situations where the backend LLM’s content moderation makes wrong predictions or filters our content too ‘aggressively’, which I know happens more often than we’d like (e.g. Fine-tuning blocked by moderation system).
