I want my home to be paperless. I already have a document scanner that names files based on their contents, but it is pretty hopeless at it.
So I am writing a .Net app using gpt-4-vision-preview that looks through all the files the scanner dumps into a folder, names each one based on its contents, and files it in the correct directory on my PC.
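The filing step itself is simple enough; roughly this shape (a sketch in Python rather than the actual .Net code, with a hypothetical class-to-directory mapping just to show the idea):

import shutil
from pathlib import Path

# Hypothetical mapping from the model's document class to a target directory.
CLASS_DIRS = {
    "credit_card_receipt": "Receipts/Card",
    "cash_receipt": "Receipts/Cash",
    "delivery_order": "Deliveries",
}

def file_document(scan: Path, doc: dict, root: Path) -> Path:
    # Move a scanned file into the right directory under a descriptive name.
    target_dir = root / CLASS_DIRS.get(doc["document_class"], "Unsorted")
    target_dir.mkdir(parents=True, exist_ok=True)
    new_name = f"{doc['transaction_date']} {doc['establishment']}{scan.suffix}"
    return Path(shutil.move(str(scan), str(target_dir / new_name)))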
Problems so far:
The API kept rejecting the image with “I’m sorry, but I cannot assist with requests that involve processing images that may contain sensitive personal data such as credit card information”. I think I have worked around that by getting it to produce JSON output. Even though I don’t have the
.response_format = {"type": "json_object"}
parameter in my code, I just tell it I want JSON out, and it provides something resembling JSON and no longer refuses the task. I can clean up the JSON in code later.
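The cleanup can be as simple as pulling the outermost {…} span out of the reply and parsing that (a minimal sketch, in Python for consistency with the example further down this thread):

import json
import re

def parse_model_json(raw: str) -> dict:
    # The model often wraps its "JSON" in prose or ```json fences;
    # grab the outermost {...} span and parse just that.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))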
Now I am categorising the document into various classes - credit card receipt, cash receipt, delivery order, etc. - so the code can later move the file to the correct directory. It also extracts the transaction date, the amount including currency (as I travel a lot), the establishment, the item, and the card number. Unfortunately it seems I have to get the whole document back in JSON format so it won’t refuse for security reasons, rather than asking it to retrieve just a subset of the information as a small CSV or JSON object to keep costs down.
The item is also to be decided by the AI - if it recognises a receipt for a meal, it will determine breakfast, brunch, lunch, etc. from the timestamp on the receipt. Alternatively it might decide the receipt is for groceries or snacks. Or the document is something completely different - a quick user guide for some electronics item, a delivery order, or a bill from my phone company.
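The prompt needs to look roughly like this - the class list and JSON keys below are illustrative, not my exact prompt:

# Illustrative extraction prompt; the classes and keys are placeholders.
EXTRACTION_PROMPT = """Classify this scanned document and reply with ONLY a JSON object:
{
  "document_class": one of "credit_card_receipt", "cash_receipt",
                    "delivery_order", "bill", "user_guide", "other",
  "transaction_date": "YYYY-MM-DD" or null,
  "amount": number or null,
  "currency": ISO 4217 code or null,
  "establishment": string or null,
  "item": string or null,
  "card_number": string or null
}
For meal receipts, set "item" to breakfast, brunch, lunch, or dinner based on
the timestamp. Use null for any field not present in the document."""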
The aim is to just scan my documents for the month into a folder and have the AI do the rest - probably 100 documents per month - and I’m hoping to do it for 1 cent or less per document. At the moment I am looking at 2 or 3 cents to process each one.
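One lever that may help with cost (I haven’t measured the savings on my own scans yet): the vision API accepts a per-image detail setting, and "low" makes it process a fixed low-resolution version of the image for far fewer input tokens, which might be enough for a receipt. The image part of the message becomes:

# Ask for the cheaper fixed low-resolution pass on this image.
image_part = {
    "type": "image_url",
    "image_url": {
        "url": f"data:image/jpeg;base64,{encoded_image}",  # base64-encoded scan
        "detail": "low",  # "low" | "high" | "auto"; "low" uses far fewer tokens
    },
}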
Very interesting use case. Good luck with the problems. I’ve not worked with the vision API much myself yet, but hopefully smarter people will chime in here shortly.
Hope you stick around. We’ve got a great community growing.
Do you have working code you could share that includes this model and this response format? It doesn’t seem like a valid option for vision-preview (at least not currently for me).
This is a full working example in Python for asking about an input image with the new vision-preview model. It’s still hard to find simple examples, so I thought I’d share:
import base64
import os

import openai


def main():
    # Path to the JPEG image to ask about
    image_path = "/Users/Documents/mouse_picture.jpg"

    # Read and encode the image in base64
    with open(image_path, "rb") as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

    # Craft the prompt for GPT
    prompt_messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Here is an image, is there a mouse in the image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}",
                    },
                },
            ],
        }
    ]

    # Send a request to GPT
    params = {
        "model": "gpt-4-vision-preview",
        "messages": prompt_messages,
        "api_key": os.environ["GPT_API_KEY"],
        # "response_format": {"type": "json_object"},  # not a valid option for vision-preview
        "headers": {"Openai-Version": "2020-11-07"},
        "max_tokens": 4096,
    }
    result = openai.ChatCompletion.create(**params)
    print(result.choices[0].message.content)


if __name__ == "__main__":
    main()
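Note this uses the pre-1.0 openai Python package (openai.ChatCompletion was removed in the 1.x SDK) and expects the key in the GPT_API_KEY environment variable, e.g. export GPT_API_KEY=... before running.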
I uploaded some YouTube data and was asking about a new headline idea based on existing data… then I remembered I could also “show it” the thumbnail… It read “game of thrones” in my “style” list on the left and thought it prominent - maybe because the show is popular?
Still… could be a useful tool… especially if it knew what to look for and point out with thumbnail images for YouTube… The code interpreter was useful for the CSV data too…
Being a rehabilitation doc, I’m very excited about using GPT-V as an assistive device for those with visual impairments or who are unsighted. A wearable might provide critical information such as “you are standing in front of a row of stores. Just to your left is a grocer, and to your right, what appears to be a post office. Several cars are directly between you and the entrances”. Further along, leveraging the LLM’s reasoning skills, perhaps “the path you’re on appears a bit dark, uneven, and unsafe. I would suggest crossing the street, where the way appears smoother and other pedestrians are present”. Multimodal models will entirely transform the lives of millions of disabled people. This is where it starts.
I could see some form of real-time, rapid spoken description of what is directly in the centre line of sight - car, door, lamp post, wall, etc. - and then a longer, more detailed description if you stayed looking at the same thing for a moment… soooo many ways this could be made completely awesome.
Also makes me wonder if some kind of haptics could be used too - perhaps a few electromagnetic pins, à la Braille, in a grid that could be raised and lowered to show potential paths forward or blockages… I think there is an untapped field ready to go exploring in.
There are already a couple of working prototypes of this very concept that leverage GPT, but they are relatively rudimentary. It was Suman Kanuganti, CEO of Personal.ai, who years ago approached blindness as not so much a lack of sight as a lack of information. In response, he created glasses with cameras that would reach out to a community of volunteers. When the app was launched, someone would describe to the wearer what they were looking at. Now GPT has taken that role over.

In the years to come, I really think we’ll see wearable AI with GPS that can provide assistance to those with dementia: “Frank, you’re walking the wrong way. Let’s make a left ahead and I’ll bring you back home. I’ll also let your family know we’re headed back, ok?” My own dad lost his vision but aged at home thanks in large part to his Alexa. I’ve longed for an affordable, accessible, useful device like that combined with the conversational and reasoning ability of an LLM: “Frank, I didn’t see you take your medicines yet. It’s past five. Stand up and let’s move to the oldies for a few minutes. Let’s play a memory game. Tell me again about your childhood…” etc.

These devices would greatly extend the dignity and safety of the elderly, detecting missed meds, falls, and wandering, and, via predictive analytics from wearable health devices, the need for future care. Today, 44 million elderly; by 2050, 88 million. The need is huge. The market is huge. Vision and multimodal models will be front and center.
Love it! Yes, it’s a data bandwidth problem: vision is the highest-bandwidth sense we have, with audio coming in second. The brain can do a huge amount of post-processing even on low-bandwidth inputs if those inputs can be translated appropriately.
I’m glad to hear your father at least had Alexa, but yes, an LLM would be infinitely more capable of building models of the world and handling daily tasks. Combining LLMs with traditional code - the reliability of proven systems like time tracking and database management for, as you mentioned, pills that need to be taken - and connecting that up to an LLM to provide a more human and contextually aware set of responses… It will be interesting to see how this area gets regulated and what sorts of red tape get thrown up… hopefully, progressive minds will win out.
Thanks for sharing in my enthusiasm. After 30 years of medical practice I’m back as a grad student in an MPH program, largely because I want to see AI leveraged to transform healthcare. It’s a passion of mine and it’s nice to see all the use-cases pop up like daisies. I stand in awe of the coders and devs who speak the jargon like I throw medical terms. This intersection will create untold breakthroughs and I consider myself lucky to watch it unfold. Cheers mate, and thanks for leading this thread.
Good luck with your master’s, and I hope you find the Forum a useful asset. There is a wide set of skills available here, and health care is always a popular topic.