I’m not sure whether this deserves a “solution” yet, or whether it is already solved.
Sending your example to chat completions shows that you are constructing user messages correctly with a base64 image part, but it doesn’t by itself verify that the fine-tuning isn’t still failing with some unexpected message about invalid modes.
I whacked together some one-shot Python code for testing an example: only after the call works on chat completions do we add it to a training file, along with how the AI should be responding.
Let’s import the supporting modules and write a function that reads your API key from an environment variable.
import httpx
import json
import base64
import os


def _get_headers() -> dict[str, str]:
    if not os.getenv('OPENAI_API_KEY'):
        raise ValueError("Please set the OPENAI_API_KEY environment variable.")
    return {'Authorization': f'Bearer {os.getenv("OPENAI_API_KEY")}'}
Then some functions to build the input messages, supposing that you just want questions answered about a single image. There is a fixed system message, plus a user message function that accepts text and an image path:
def _system_message():
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a GPT-4 vision AI model. Analyze user images."
                },
            ],
        },
    ]
def _user_image_message(text, image_path):
    with open(image_path, "rb") as image_file:
        base64_encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
    content_image_part = {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{base64_encoded_image}",
            "detail": "low",
        },
    }
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            content_image_part,
        ],
    }]
Both return a list of one message, so the results are easy to join into a full message list.
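If your images aren’t always PNGs, here’s a small variation (just a sketch, reusing the base64 import from above and the standard-library mimetypes module; the image/png fallback is my assumption) that picks the data-URL type from the file extension:

import mimetypes

def _user_image_message_any(text, image_path):
    # Guess the MIME type from the extension; fall back to PNG (assumption).
    mime_type = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as image_file:
        base64_encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
    content_image_part = {
        "type": "image_url",
        "image_url": {
            "url": f"data:{mime_type};base64,{base64_encoded_image}",
            "detail": "low",
        },
    }
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            content_image_part,
        ],
    }]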
Here’s how to make a minimal chat completions API call without using the OpenAI library, instead sending the request object as JSON with the httpx library (which the openai library itself uses):
def chat_completions(message_list):
    request_body = {
        "model": "gpt-4o-mini",
        "messages": message_list,
        "max_tokens": 100
    }
    response = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        headers=_get_headers(),
        json=request_body,
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
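Because each example you keep also previews what the training input will look like, you may want the token counts too. Here’s a variant of the same call (a sketch; the “usage” object is part of the normal chat completions response) that returns them alongside the text:

def chat_completions_with_usage(message_list):
    request_body = {
        "model": "gpt-4o-mini",
        "messages": message_list,
        "max_tokens": 100
    }
    response = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        headers=_get_headers(),
        json=request_body,
        timeout=120,
    )
    response.raise_for_status()
    data = response.json()
    # "usage" reports prompt_tokens, completion_tokens, and total_tokens.
    return data["choices"][0]["message"]["content"], data["usage"]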
So we can now input a user message with an image, and send it off to chat completions. Let’s do that:
if __name__ == "__main__":
    user_message = "What's in the image?"
    image_filename = "img1.png"
    assistant_example = "The image contains the word \"Apple\" written in a simple, black font."

    messages = _system_message() + _user_image_message(user_message, image_filename)

    try:
        # Make the test API call.
        response_content = chat_completions(messages)
        print(response_content)

        # Build the training data with the identical messages plus the desired assistant example.
        training_data = {
            "messages": messages + [{"role": "assistant", "content": assistant_example}]
        }

        # Append the training data as a single JSON line to the fine-tuning file.
        with open("mytraining.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(training_data) + "\n")
    except httpx.HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
    except Exception as err:
        print(f"An error occurred: {err}")
Was it successful as a “test call” that returned a response? If so, the script appends that same input, plus how you want the AI to actually be responding as the example behavior, as a line in the JSONL file (discarding the test response).
See whether that constructs the same type of JSONL that you’re currently using as your training file format for fine-tuning.
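A quick way to check is to re-read the file and confirm that every line parses as JSON and carries the expected roles (a sketch using only the standard library):

import json

with open("mytraining.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        example = json.loads(line)  # raises if the line isn't valid JSON
        roles = [message["role"] for message in example["messages"]]
        print(line_number, roles)   # expect ['system', 'user', 'assistant']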
Most importantly, the minimum of ten examples shows that fine-tuning now works, but it is rarely enough to actually make a better model. Using a fine-tuned model also costs more per call … you could invest that cost into prompting instead, and see if you can just talk your way into the results you want without a huge investment in developing an ample training set.
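If you do decide to build out a fuller training set, the same test-then-append pattern extends to a loop over several image/answer pairs (a sketch; the filenames, questions, and ideal answers below are placeholders I made up):

# Hypothetical (image file, question, desired assistant answer) triples.
examples = [
    ("img1.png", "What's in the image?", "The image contains the word \"Apple\"."),
    ("img2.png", "What's in the image?", "The image contains the word \"Banana\"."),
]
for image_file, question, ideal_answer in examples:
    msgs = _system_message() + _user_image_message(question, image_file)
    print(chat_completions(msgs))  # confirm the request is accepted before saving
    line = {"messages": msgs + [{"role": "assistant", "content": ideal_answer}]}
    with open("mytraining.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(line) + "\n")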