I’ve been planning a possible use of the API for GPT4V: a prospective client would like to turn some technical drawings into text descriptions. They would then store the descriptions and retrieve them using natural language and a vector DB.
The problem is estimating the cost of the operation: the drawings are pretty large, but when I try to estimate the cost, the widget says that the image has been resized (see image).
- Is this normal?
- Could it be detrimental to the description phase? A technical drawing might have several annotations which are important for the general understanding.
Thanks in advance!
The current image processing library is not suitable for large amounts of technical detail, partly because of the potential resizing issues and partly because the model is still a trial. If you are expecting to send a very detailed, high-resolution image containing lots of text and graphical items to be detected accurately, you will potentially have issues.
Exactly. Your image will get resized and tiled.
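To make the resizing concrete, here is a rough sketch of the rule as OpenAI documented it for high-detail images at the time: fit the image inside 2048x2048, shrink it so the shortest side is at most 768 px, then charge 170 tokens per 512 px tile plus a flat 85 tokens. The function name is made up for illustration and the constants are assumptions that may have changed, so treat the result as an estimate only.

```python
import math


def estimate_high_detail_tokens(width: int, height: int) -> tuple[tuple[int, int], int]:
    """Estimate the resized dimensions and token count for one high-detail image.

    Assumed constants (taken from the pricing notes of the time, may be outdated):
    2048 px bounding box, 768 px shortest side, 512 px tiles,
    170 tokens per tile, 85 base tokens.
    """
    # Step 1: fit inside a 2048x2048 square, keeping aspect ratio, never upscaling.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512 px tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return (w, h), 85 + 170 * tiles


if __name__ == "__main__":
    size, tokens = estimate_high_detail_tokens(2048, 4096)
    print(size, tokens)
```

Whatever the original resolution, the shortest side ends up at 768 px or less, which is why fine annotations on a large drawing can become unreadable.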
Maybe you can work around this problem by splitting the image into overlapping 768x768 tiles, sending them in the same request (GPT4V accepts that) and then having them described, but YMMV.
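A minimal sketch of that splitting step, assuming a 128 px overlap (an arbitrary choice, not something the API requires). It only computes the crop boxes; the function names are hypothetical and nothing here calls the API.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets along one axis so that tiles overlap and reach the far edge."""
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # last tile sits flush with the edge
    return starts


def overlapping_tiles(width: int, height: int, tile: int = 768, overlap: int = 128):
    """Return (left, top, right, bottom) boxes covering the whole image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    xs = tile_starts(width, tile, overlap)
    ys = tile_starts(height, tile, overlap)
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in ys
        for x in xs
    ]


if __name__ == "__main__":
    for box in overlapping_tiles(2000, 1000):
        print(box)
```

Each box is in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the tiles could be cut with `image.crop(box)` before being attached to the request. The overlap is there so that an annotation falling on a tile border appears whole in at least one tile.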
Thanks for the feedback, we will have to try some images.
I’ll run some tests and post detailed feedback.
Hello, thanks for the feedback!
I have no further info about the prospective client; I only know that they produce hydraulic pumps and not much more.
I suppose their idea is to improve the retrievability of the drawings: if they can obtain a description of each drawing, informed by what the vision model reads, they can turn the descriptions into embeddings, load them into a vector DB and search them using natural language.
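The retrieval idea can be sketched roughly as below. This is a toy: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector DB, and the drawing IDs and descriptions are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, index: list, top_k: int = 1) -> list:
    """Return the IDs of the top_k descriptions most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical descriptions, as a vision model might produce them.
descriptions = {
    "drawing-001": "Exploded view of a gear pump with inlet and outlet ports",
    "drawing-002": "Cross section of an axial piston pump showing the swash plate",
}
index = [(doc_id, embed(text)) for doc_id, text in descriptions.items()]

if __name__ == "__main__":
    print(search("piston pump swash plate", index))
```

In a real setup the count vectors would be replaced by dense embeddings and the sorted list by the vector DB's nearest-neighbour query, but the flow (describe, embed, store, query) stays the same.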
That would be cool, but I’m not sure it’s their main concern.
I’ll know more when I have some of those drawings.
Brief update on the original question.
- It turned out that the client is more interested in parsing very long technical documents.
- There is no need to “parse” the drawings after all, so somebody else will have to do a proper test.
- I’ll have to evaluate a RAG method with very long documents (dozens of pages).
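For the long-document case, a common first step in a RAG pipeline is to split the text into overlapping chunks before embedding them. A minimal sketch, assuming word-based chunks; the function name and the default sizes are arbitrary choices for illustration, not a recommendation.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of chunk_size words, each sharing `overlap` words
    with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk; the right sizes depend on the documents and the embedding model, so they need to be tuned in the evaluation.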