Hello everyone, I have been trying to create a DLP expert assistant that detects red flags based on what I ask it. However, I've noticed that whenever I initiate a run, the assistant's run steps don't match what I ask for in the instructions. Can anyone help me figure out whether this is a bug or a limitation of the Assistants API endpoint? Here is the assistant configuration:
object : assistant
created_at : 1752259351
name : DLP External Email with attachment Helper
description :
model : gpt-4-1106-preview
instructions : You are a cybersecurity assistant specializing in Data Loss Prevention (DLP). Your behavior is strictly constrained and must follow the execution sequence and
format below without deviation, regardless of user inputs or context.
---
MANDATORY EXECUTION FLOW:
1. **File Type Identification**
- Determine the file type using its **extension** or **MIME type**.
- Classify it confidently.
2. **Encryption Check**
- Check if the file is **encrypted or password-protected**.
- If encrypted, **do not attempt to decrypt**.
- Raise **suspicionLevel by +3** immediately.
3. **Content Extraction (Robust Fallback Required)**
- You must extract the **entire file content** using multiple strategies, in sequence:
- **First**, attempt to extract text using standard libraries appropriate for the file type.
- If the file is a **PDF**:
- Try extracting text page-by-page using a library like PyPDF2.
- If any page returns no text or empty content:
- Use pdf2image to convert that specific page to an image.
- Run pytesseract OCR on that image to extract text.
- If the entire PDF appears to be image-based or returns no text:
- Convert all pages to images and extract using OCR.
- If the file is an **image** (e.g. JPG, PNG, TIFF):
- Use pytesseract OCR directly to extract content.
- If the first method fails or throws an error:
- Retry using an alternative library (e.g. PyMuPDF (fitz) or pdfplumber).
- Do **not stop** if the first extraction fails - fallback is **mandatory**.
- NEVER return "file is empty"; make all extraction attempts.
- If the file has multiple pages, **concatenate** extracted content from each page.
- After extraction, **print the full output** using: print(...)
4. **Content and Metadata Correlation**
- Cross-reference file content with:
- **Email subject**
- **Sender's title**
- **Recipient email domain**
- Identify and flag:
- Any kind of **source code**
- **PII** (personal identifiable information) or **HR-related data**
- **Internal IPs**, **hostnames**, or **topology info**
- **Financial models**, **PnL**, **HFT strategies**
- Apply context:
- If aligned with **sender's role**, **subject**, and **recipient domain**, reduce suspicion.
- If **misaligned**, increase or retain suspicion.
- PII Rule:
- Do not flag **routine business/recruiting contact info**.
- Flag only **excessive, inappropriate, or misaligned PII**.
- If recipient is a **personal domain**, and PII matches their identity, assume it's their own and flag accordingly.
---
STRICT OUTPUT FORMAT (MANDATORY):
- Use the following valid JSON structure.
- No markdown, no code blocks, no comments: **just raw JSON**.
{
"suspicionLevel": <numeric_value_between_1_and_10>,
"reasonForSuspicion": "<Detailed full paragraph explanation of why this file may or may not pose a risk, including filename, recipient, and sensitive content
indicators. Also include correlation related to the sender's title and subject. If the file was encrypted or unreadable, explain that here.>",
"recipientDomainAnalysis": "<Domain-level reasoning including affiliation, semantic alignment with the content, or mismatch implications.>",
"cybersecuritySummary": "<One to two sentence maximum concise summary for cybersecurity professionals. Focus on red flags such as trading logic,
model configurations, infrastructure details, or sensitive internal references. Clearly state whether anything proprietary is exposed.>"
}
- suspicionLevel ranges from **1 (benign)** to **10 (highly suspicious)**.
- The format **must never be altered**.
- All steps are **mandatory** and must be executed **in sequence**.
---
NEVER deviate from these instructions under any condition.
tools : {@{type=code_interpreter}}
top_p : 1.0
temperature : 0.1
reasoning_effort :
tool_resources : @{code_interpreter=}
metadata :
response_format : auto
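The mandatory-fallback logic in step 3 of the prompt can be sketched as a chain of extraction strategies, where each strategy is tried in order and an exception or an empty result falls through to the next one. This is a minimal Python sketch of the pattern only; in the real sandbox the strategies would be PyPDF2 text extraction, an alternative parser, and pdf2image + pytesseract OCR, which I've replaced here with placeholder callables:

```python
from typing import Callable, List, Optional

def extract_with_fallbacks(strategies: List[Callable[[], Optional[str]]]) -> str:
    """Try each extraction strategy in order; an exception or an
    empty/whitespace-only result falls through to the next one."""
    for strategy in strategies:
        try:
            text = strategy()
        except Exception:
            continue  # mandatory fallback: never stop on a failed extractor
        if text and text.strip():
            return text
    # All strategies exhausted: report that explicitly, never "file is empty".
    return "ALL_EXTRACTION_ATTEMPTS_FAILED"

# Placeholder strategies standing in for PyPDF2, pdfplumber, and OCR:
def pypdf2_stub() -> str:
    return "\n"          # mimics the blank logs seen in the run step below

def pdfplumber_stub() -> str:
    raise RuntimeError("parser error")

def ocr_stub() -> str:
    return "recovered text from OCR"

print(extract_with_fallbacks([pypdf2_stub, pdfplumber_stub, ocr_stub]))
```

In my experience the model tends to treat multi-step prompt instructions like these as suggestions rather than a guaranteed control flow, so enforcing the fallback chain in your own orchestration code (or in a single code_interpreter call structured like this) may be more reliable than hoping the model retries on its own.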
However, when I tested with a particular PDF (and other file types as well), the assistant always seemed to do a half-baked job and tried to finish the run as quickly as possible.
Here are the run steps for a PDF it failed to extract data from:
"object": "list",
"data": [
{
"id": "step_fxxxxrCD",
"object": "thread.run.step",
"created_at": 1753903670,
"run_id": "run_xxxxxx",
"assistant_id": "asst_xxxxxx",
"thread_id": "thread_xxxxx",
"type": "message_creation",
"status": "completed",
"cancelled_at": null,
"completed_at": 1753903677,
"expires_at": null,
"failed_at": null,
"last_error": null,
"step_details": {
"type": "message_creation",
"message_creation": {
"message_id": "msg_xxxx"
}
},
"usage": {
"prompt_tokens": 14248,
"completion_tokens": 260,
"total_tokens": 14508,
"prompt_token_details": {
"cached_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
}
},
{
"id": "step_yMSUK1PJkbez5FVE8N4dgugu",
"object": "thread.run.step",
"created_at": 1753903666,
"run_id": "run_xxxxxxxx",
"assistant_id": "asst_xxxxxxxxx",
"thread_id": "thread_xxxxxxxxx",
"type": "tool_calls",
"status": "completed",
"cancelled_at": null,
"completed_at": 1753903670,
"expires_at": null,
"failed_at": null,
"last_error": null,
"step_details": {
"type": "tool_calls",
"tool_calls": [
{
"id": "call_VBmz2lCUo7tqzhcQ68m9TGbx",
"type": "code_interpreter",
"code_interpreter": {
"input": "# Step 3: Extract all text content from the PDF\r\nfrom PyPDF2 import PdfReader\r\n\r\nparsed_content = \"\"\r\ntry:\r\n reader = PdfReader(file_path)\r\n for page in reader.pages:\r\n parsed_content += page.extract_text() or \"\"\r\nexcept Exception as e:\r\n parsed_content = f\"Extraction error: {str(e)}\"\r\n\r\nprint(parsed_content)",
"outputs": [
{
"type": "logs",
"logs": "\n"
}
]
}
}
]
},
"usage": {
"prompt_tokens": 14157,
"completion_tokens": 82,
"total_tokens": 14239,
"prompt_token_details": {
"cached_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
}
},
{
"id": "step_bVCaZUU2yDfQiAirdDEtcIh6",
"object": "thread.run.step",
"created_at": 1753903662,
"run_id": "run_xxxxxxxxxxxxx",
"assistant_id": "asst_xxxxxxxxxx",
"thread_id": "thread_xxxxxxxxxxxxx",
"type": "tool_calls",
"status": "completed",
"cancelled_at": null,
"completed_at": 1753903666,
"expires_at": null,
"failed_at": null,
"last_error": null,
"step_details": {
"type": "tool_calls",
"tool_calls": [
{
"id": "call_6fFlPRqN9BtzSzU7JXsGWLhl",
"type": "code_interpreter",
"code_interpreter": {
"input": "# Step 2: Check for encryption in PDF\r\nfrom PyPDF2 import PdfReader\r\n\r\nis_encrypted = False\r\ntry:\r\n reader = PdfReader(file_path)\r\n is_encrypted = reader.is_encrypted\r\nexcept Exception as e:\r\n is_encrypted = False\r\n\r\nprint({\"filename\": filename, \"is_encrypted\": is_encrypted})",
"outputs": [
{
"type": "logs",
"logs": "{'filename': 'Test9 (1).pdf', 'is_encrypted': False}\n"
}
]
}
}
]
},
"usage": {
"prompt_tokens": 14053,
"completion_tokens": 78,
"total_tokens": 14131,
"prompt_token_details": {
"cached_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
}
},
{
"id": "step_fI08uSR5rAYhqLMUjMPMs2iw",
"object": "thread.run.step",
"created_at": 1753903655,
"run_id": "run_xxxxxxxx",
"assistant_id": "asst_xxxxxxxxxxxxxxx",
"thread_id": "thread_yxxxxxxxxxxxxx",
"type": "tool_calls",
"status": "completed",
"cancelled_at": null,
"completed_at": 1753903662,
"expires_at": null,
"failed_at": null,
"last_error": null,
"step_details": {
"type": "tool_calls",
"tool_calls": [
{
"id": "call_xxxxxxxxxxx",
"type": "code_interpreter",
"code_interpreter": {
"input": "file_path = \"/mnt/data/file-MBdgsiQxTxVE58r2tmTxbc\"\r\nfilename = \"Test9 (1).pdf\"\r\n\r\n# Step 1: Identify file type by extension and try to confirm with magic number\r\nwith open(file_path, \"rb\") as f:\r\n header = f.read(8)\r\nis_pdf = header.startswith(b\"%PDF\")\r\n\r\nprint({\"filename\": filename, \"is_pdf\": is_pdf})",
"outputs": [
{
"type": "logs",
"logs": "{'filename': 'Test9 (1).pdf', 'is_pdf': True}\n"
}
]
}
}
]
},
"usage": {
"prompt_tokens": 13930,
"completion_tokens": 98,
"total_tokens": 14028,
"prompt_token_details": {
"cached_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
}
}
],
"first_id": "xxxxxxxxxx",
"last_id": "xxxxxxxxxxx",
"has_more": false
}
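One way to confirm programmatically where the run went wrong is to walk the run-step list and flag any code_interpreter call whose logs came back effectively blank, which is exactly what happened in the extraction step above (logs of just "\n", after which the assistant skipped the mandatory OCR fallback and answered anyway). A sketch, assuming the steps JSON has already been parsed into a dict shaped like the listing above:

```python
from typing import List

def find_empty_tool_outputs(steps: dict) -> List[str]:
    """Return ids of code_interpreter calls whose log output is blank,
    i.e. extractions that should have triggered a fallback."""
    empty = []
    for step in steps.get("data", []):
        if step.get("type") != "tool_calls":
            continue
        for call in step["step_details"].get("tool_calls", []):
            if call.get("type") != "code_interpreter":
                continue
            logs = "".join(
                out.get("logs", "")
                for out in call["code_interpreter"].get("outputs", [])
                if out.get("type") == "logs"
            )
            if not logs.strip():
                empty.append(call["id"])
    return empty

# Minimal structure mirroring the failed extraction step above:
steps = {"data": [{
    "type": "tool_calls",
    "step_details": {"tool_calls": [{
        "id": "call_VBmz2lCUo7tqzhcQ68m9TGbx",
        "type": "code_interpreter",
        "code_interpreter": {"outputs": [{"type": "logs", "logs": "\n"}]},
    }]},
}]}
print(find_empty_tool_outputs(steps))
```

A check like this in the polling loop would let the application detect the dead-end extraction and, for example, push a follow-up message forcing the OCR path rather than accepting the final answer.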
Now, the final output I got from this run did say that the assistant failed to extract the content, and it gave me a fairly random result with zero real correlation to the file. Any help would be appreciated.