I want my home to be paperless. I already have a document scanner that names the files based on their contents, but it is pretty hopeless at it.
So I am writing a .NET app using gpt-4-vision-preview that looks through all the files the scanner dumps into a folder, names each one based on its contents, and files it in the correct directory on my PC.
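The overall flow is simple; roughly this (a minimal Python sketch of the idea rather than my actual .NET code - the folder paths, the helper name and the returned field names are all placeholders):

import shutil
from pathlib import Path

SCAN_FOLDER = Path("~/Scans/inbox").expanduser()      # where the scanner dumps files (placeholder path)
FILED_ROOT = Path("~/Documents/Filed").expanduser()   # root of the filing directories (placeholder path)

def analyze_document(path: Path) -> dict:
    """Placeholder for the gpt-4-vision-preview call described below; returns something like
    {"class": "credit card receipt", "date": "2023-11-02", "establishment": "Some Cafe", "amount": "SGD 12.50"}."""
    raise NotImplementedError

def process_inbox() -> None:
    for scan in sorted(SCAN_FOLDER.glob("*.jpg")):
        info = analyze_document(scan)
        doc_class = info.get("class", "unsorted")
        new_name = f"{info.get('date', 'unknown')} {info.get('establishment', scan.stem)}.jpg"
        target_dir = FILED_ROOT / doc_class
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(scan), str(target_dir / new_name))   # rename and file in one step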
Problems so far:
The API kept rejecting the image with “I’m sorry, but I cannot assist with requests that involve processing images that may contain sensitive personal data such as credit card information”. I think I have worked around that by getting it to produce JSON output (even though I don’t have the
.response_format = {"type": "json_object"}
parameter in my code) - I just tell it I want JSON out, and it provides something resembling JSON and no longer refuses the task. I can clean up the JSON in code later.
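The clean-up itself is simple enough; something along these lines (a rough sketch, assuming the reply wraps the JSON in prose or a markdown fence):

import json

def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a free-form model reply."""
    # Drop any markdown code fences the model may have added
    reply = reply.replace("```json", "").replace("```", "")
    # Take everything from the first "{" to the last "}"
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in reply")
    return json.loads(reply[start:end + 1])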
Now I am categorising the document into various classes - credit card receipt, cash receipt, delivery order, etc. - so the code can later move the file to the correct directory. It also extracts the transaction date, the amount including currency (as I travel a lot), the establishment, the item, and the card number. Unfortunately it seems I have to get the whole document back as JSON so it won’t refuse for security reasons, rather than just asking it for a small subset of the information as a small CSV or JSON object to keep costs down.
The item is also to be decided by the AI - if it recognises a receipt for a meal, it will determine breakfast, brunch, lunch, etc. from the timestamp on the receipt. Alternatively it might decide the receipt is for groceries or snacks. Or the document is something completely different - a quick user guide for some electronics item, a delivery order, or a bill from my phone company.
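The instruction I send with the image is roughly along these lines (a paraphrased sketch - the exact wording and field names are still changing):

# Rough sketch of the instruction text; the classes, keys and examples are my current working set.
EXTRACTION_PROMPT = """
Return a single JSON object describing this scanned document with these keys:
  "class":          one of "credit card receipt", "cash receipt", "delivery order",
                    "user guide", "bill", or "other"
  "date":           the transaction date, if any
  "amount":         the amount including currency (e.g. "SGD 12.50")
  "establishment":  the shop, restaurant or company name
  "item":           what was bought; for a meal, use the time on the receipt to pick
                    "breakfast", "brunch", "lunch", "dinner" or "snack",
                    otherwise something like "groceries"
  "card_number":    the card digits shown on the receipt, if any
Reply with the JSON object only.
"""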
The aim is to just scan my documents for the month into a folder and have the AI do the rest - probably 100 documents per month - and I’m hoping to do it for 1 cent or less per document. At the moment I am looking at 2 or 3 cents to process each one.
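Back-of-the-envelope, it looks like the image tokens and the full-document JSON output are what push it to 2-3 cents, and dropping the image detail to "low" is what would get it under 1 cent. Rough numbers (the per-token prices and tile counts are my assumptions from the preview pricing page, so treat them as approximate):

# Rough per-document cost estimate; prices and token counts are assumptions and may be out of date.
INPUT_PRICE_PER_1K = 0.01    # USD per 1K input tokens for gpt-4-vision-preview (assumed)
OUTPUT_PRICE_PER_1K = 0.03   # USD per 1K output tokens (assumed)

def estimate_cost(image_tokens: int, prompt_tokens: int, output_tokens: int) -> float:
    return ((image_tokens + prompt_tokens) * INPUT_PRICE_PER_1K
            + output_tokens * OUTPUT_PRICE_PER_1K) / 1000

# "high" detail on a full page is roughly 85 tokens + 170 per 512px tile (assumed ~6 tiles),
# and returning the whole document as JSON costs a few hundred output tokens:
print(estimate_cost(image_tokens=85 + 170 * 6, prompt_tokens=300, output_tokens=500))  # ~0.029
# "low" detail is a flat 85 image tokens, which is where the sub-1-cent target would come from:
print(estimate_cost(image_tokens=85, prompt_tokens=300, output_tokens=150))            # ~0.008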
Very interesting use case. Good luck with the problems. I’ve not worked with the vision API much myself yet, but hopefully smarter people will chime in here shortly.
Hope you stick around. We’ve got a great community growing.
Do you have working code you could share that includes this model and this response format? Because it doesn’t seem like a valid option for vision-preview (at least not currently for me).
This is a full working example in Python for asking about an input image with the new vision-preview. It’s still hard to find simple examples, so I thought I’d share:
import base64
import openai
import os


def main():
    # Path to a JPEG image
    image_path = "/Users/Documents/mouse_picture.jpg"

    # Read and encode the image in base64
    with open(image_path, "rb") as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

    # Craft the prompt for GPT
    prompt_messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is an image, is there a mouse in the image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}"
                    }
                }
            ]
        }
    ]

    # Send a request to GPT
    params = {
        "model": "gpt-4-vision-preview",
        "messages": prompt_messages,
        "api_key": os.environ["GPT_API_KEY"],
        # "response_format": {"type": "json_object"},  # Added response format
        "headers": {"Openai-Version": "2020-11-07"},
        "max_tokens": 4096,
    }
    result = openai.ChatCompletion.create(**params)
    print(result.choices[0].message.content)


if __name__ == "__main__":
    main()
I uploaded some YouTube data and was asking about a new headline idea based on existing data… then I remembered I could also “show it” the thumbnail… It read “game of thrones” in my “style” list on the left and thought it prominent - maybe because the show is popular?
Still… could be a useful tool… especially if it knew what to look for and point out with thumbnail images for YouTube… The code interpreter was useful for the CSV data too…