Avoiding jailbreak flag at a Generation flag in RAG

Zofia_Smolen · May 7, 2025, 10:22am

Hello everybody
I am using GPT 4o mini for my RAG pipeline. During generation, I am sending out a prompt along with some procedures I have found. I am repeatedly facing the jailbreak issue even though the content I am sending most definitely does not contain anything wrong. I would be very grateful for any suggestions on how to avoid this.
Example:
The question below translates to English as: What are the main types of capital relationships when creating credit concentration risk groups?
2025-05-07 12:14:10.610504 - INFO - Error processing question ‘Jakie są główne rodzaje powiązań kapitałowych przy tworzeniu grup wspólnego ryzyka koncentracji kredytowej?’ with whole RAG pipeline on attempt 5: Error code: 400 - {‘error’: {‘message’: “The response was filtered due to the prompt triggering Azure OpenAI’s content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation”, ‘type’: None, ‘param’: ‘prompt’, ‘code’: ‘content_filter’, ‘status’: 400, ‘innererror’: {‘code’: ‘ResponsibleAIPolicyViolation’, ‘content_filter_result’: {‘hate’: {‘filtered’: False, ‘severity’: ‘safe’}, ‘jailbreak’: {‘filtered’: True, ‘detected’: True}, ‘self_harm’: {‘filtered’: False, ‘severity’: ‘safe’}, ‘sexual’: {‘filtered’: False, ‘severity’: ‘safe’}, ‘violence’: {‘filtered’: False, ‘severity’: ‘safe’}}}}}
Thank you!!

_j · May 7, 2025, 12:35pm

OpenAI cannot assist with Azure services.

User Prompt Attacks User prompt attacks are User Prompts designed to provoke the Generative AI model into exhibiting behaviors it was trained to avoid or to break the rules set in the System Message. Such attacks can vary from intricate roleplay to subtle subversion of the safety objective.

The AI moderation may just be too stupid to understand the purpose of the language being sent. The actions you can take, such as aligning system message with the input, ultimately depend on where the inspection is being done, on what type of model.

You can read here and see that “severity level” for jailbreak is a n/a, meaning it may not have a configuration level.

Form to lower the level here.

Topic		Replies	Views
How to prevent malicious questions / jailbreak prompts / prompt injection attacks when using API GPT3.5 API	5	4866	March 6, 2023
How to safely challenge models against prompt injection? Prompting injection , prompt	8	2506	January 3, 2024
Input in playground refused Community	5	765	January 3, 2024
O1 (mini & preview) API - getting 'prompt violating usage policy' on innocent prompts API	7	1776	December 19, 2024
OpenAI’s content management policy triggered wrongly API	3	5847	October 30, 2023

Avoiding jailbreak flag at a Generation flag in RAG

Related topics