Hi everyone,
I’m building a multimodal OCR/translation tool using the OpenAI Python SDK, and I want to avoid embedding large Base64 image strings in prompts because of the huge token cost. My ideal flow is:
- Upload a preprocessed image (JPEG with resizing/compression) to OpenAI.
- Get back a file/image reference ID.
- Send a chat completion request referencing that uploaded image so the model can process it (e.g., OCR + translation) without me putting the whole Base64 in the prompt.
Environment / Context
- openai Python SDK version: 1.97.1
- Models: gpt-4.1 (and variants like gpt-4.1-mini, gpt-4.1-nano)
- Python: 3.13 on macOS
- Current fallback (works): Inline Base64 of a compressed JPEG (resize to max width 1024, quality=70) embedded in the prompt, but it costs ~30k+ tokens per image because of the encoded size.
- Desired: Use “file upload + reference ID” instead, and only send a small textual prompt to the model.
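For reference, my current working fallback looks roughly like this (the placeholder bytes below stand in for my resized/compressed JPEG; the image_url data-URI content part is the shape I'm using today):

```python
import base64

# Current fallback: inline Base64 as a data URI inside the message.
# (img_bytes is a stand-in here; in the real tool it's the JPEG after
# resizing to max width 1024 at quality=70.)
img_bytes = b"\xff\xd8\xff\xe0-fake-jpeg-bytes"
data_uri = "data:image/jpeg;base64," + base64.b64encode(img_bytes).decode("ascii")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text and translate it to English."},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }
]
# client.chat.completions.create(model="gpt-4.1", messages=messages)
print(data_uri[:40])
```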
What I’m seeing
• The inline Base64 path works and returns results, but consumes tens of thousands of tokens per image.
• Attempting client.files.upload(…) fails or the method seems unavailable: in some runs the client has no upload attribute at all; in others the call raises an error.
• I have logging around the upload branch and fallback; when upload is skipped, it prints that it’s falling back to base64.
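Back-of-envelope math behind the ~30k-token figure (hedged: the tokens-per-character ratio for Base64 text is a rough rule of thumb, not an exact tokenizer count):

```python
# Rough estimate of the inline-Base64 token cost for one image.
jpeg_bytes = 90_000              # ~90 KB after resizing/compression (typical for me)
b64_chars = jpeg_bytes * 4 // 3  # Base64 inflates size by about 4/3
tokens_low = b64_chars // 4      # assuming ~3-4 characters per token for Base64 text
tokens_high = b64_chars // 3
print(f"~{tokens_low}-{tokens_high} tokens")  # roughly 30k-40k, matching what I see
```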
Specific questions
1. With openai==1.97.1 and the GPT-4.1 family, is the “file upload + reference ID” pattern supported for sending images (so the model can see/process the image) instead of inline Base64?
2. If so, what is the correct, current way to perform that in a chat.completions.create(…) call—i.e., once I have the file_id, how should I reference the image so the model uses it (without embedding the Base64)?
3. Why might client.files.upload be missing or not work in some contexts? Could it be due to:
• SDK misuse (wrong parameters / naming)?
• Using a non-vision-enabled endpoint or model configuration?
• Account/feature flags or required opt-ins?
4. What’s the semantic difference between purpose="vision" and purpose="assistants" when uploading an image to be used as input? Are there scenarios where one works and the other doesn’t?
5. If the file upload/reference path cannot be made to work in my environment, what is the best fallback strategy that minimizes token usage while still giving the model access to image content?
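For concreteness on question 2, here is the content shape I've been guessing at, pieced together from the Responses API docs (fully hedged: the input_text/input_image key names, and the idea that a file_id can be referenced this way, are my assumptions; I have not gotten this working, and the file_id is a placeholder):

```python
import json

# Hypothetical request body (assumption: mirrors the Responses API
# "input_image" content part; NOT verified against chat.completions).
file_id = "file-abc123"  # placeholder; would come from a successful upload

request_input = [
    {
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "OCR this image and translate the text to English."},
            {"type": "input_image", "file_id": file_id},
        ],
    }
]

# The call I imagine making (untested):
#   resp = client.responses.create(model="gpt-4.1-mini", input=request_input)
print(json.dumps(request_input, indent=2))
```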
Thanks in advance!
Minimal reproducible snippet I’m using to probe the upload path:
from openai import OpenAI
import io

client = OpenAI(api_key="YOUR_API_KEY")  # openai==1.97.1

# Load and prepare image
with open("test.jpg", "rb") as f:
    img_bytes = f.read()
buf = io.BytesIO(img_bytes)
buf.name = "image.jpg"  # Ensure multipart upload can infer the filename

# Debug introspection
print("[Debug] client attrs:", dir(client))
print("[Debug] has files:", hasattr(client, "files"))
if hasattr(client, "files"):
    print("[Debug] client.files attrs:", dir(client.files))

# Attempt file upload
try:
    file_obj = client.files.upload(file=buf, purpose="vision", file_name="image.jpg")
    print("Upload succeeded:", file_obj)
    file_id = getattr(file_obj, "id", None) or file_obj.get("id")
    print("Received file_id:", file_id)
except Exception as e:
    print("Upload failed:", e)
    # Fallback to the inline Base64 path here