Avoiding jailbreak flag at a Generation flag in RAG

Hello everybody
I am using GPT 4o mini for my RAG pipeline. During generation, I am sending out a prompt along with some procedures I have found. I am repeatedly facing the jailbreak issue even though the content I am sending most definitely does not contain anything wrong. I would be very grateful for any suggestions on how to avoid this.
Example:
The question below translates to English as: What are the main types of capital relationships when creating credit concentration risk groups?
2025-05-07 12:14:10.610504 - INFO - Error processing question ‘Jakie są główne rodzaje powiązań kapitałowych przy tworzeniu grup wspólnego ryzyka koncentracji kredytowej?’ with whole RAG pipeline on attempt 5: Error code: 400 - {‘error’: {‘message’: “The response was filtered due to the prompt triggering Azure OpenAI’s content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation”, ‘type’: None, ‘param’: ‘prompt’, ‘code’: ‘content_filter’, ‘status’: 400, ‘innererror’: {‘code’: ‘ResponsibleAIPolicyViolation’, ‘content_filter_result’: {‘hate’: {‘filtered’: False, ‘severity’: ‘safe’}, ‘jailbreak’: {‘filtered’: True, ‘detected’: True}, ‘self_harm’: {‘filtered’: False, ‘severity’: ‘safe’}, ‘sexual’: {‘filtered’: False, ‘severity’: ‘safe’}, ‘violence’: {‘filtered’: False, ‘severity’: ‘safe’}}}}}
Thank you!!

1 Like

OpenAI cannot assist with Azure services.

User Prompt Attacks User prompt attacks are User Prompts designed to provoke the Generative AI model into exhibiting behaviors it was trained to avoid or to break the rules set in the System Message. Such attacks can vary from intricate roleplay to subtle subversion of the safety objective.

The AI moderation may just be too stupid to understand the purpose of the language being sent. The actions you can take, such as aligning system message with the input, ultimately depend on where the inspection is being done, on what type of model.

You can read here and see that “severity level” for jailbreak is a n/a, meaning it may not have a configuration level.

Form to lower the level here.

1 Like