Use own data with images for queries

I want to run queries on custom data consisting of science papers, around 1,000 of them. They have images as well. What is the best approach here?

I would start by extracting the text from the papers and exposing it for search.

It could be searched via vector correlation with embeddings, or based on keyword search, or a hybrid of embeddings and keywords.
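A minimal sketch of that hybrid idea, using a toy bag-of-words "embedding" in place of a real embedding model (in practice you would call an embedding API and a proper keyword index like BM25; the scoring weights here are illustrative):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    # Fraction of query terms that appear verbatim in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query, docs, alpha=0.5):
    # Blend vector similarity and keyword overlap; alpha weights the two.
    scored = [(alpha * cosine(embed(query), embed(doc))
               + (1 - alpha) * keyword_score(query, doc), doc)
              for doc in docs]
    return [doc for _, doc in sorted(scored, reverse=True)]

docs = [
    "spectral analysis of exoplanet atmospheres",
    "deep learning for protein folding",
    "atmospheric chemistry of exoplanets",
]
print(hybrid_search("exoplanet atmospheres", docs)[0])
# -> spectral analysis of exoplanet atmospheres
```

The blend means a document can rank well either by exact term match or by overall similarity, which is the usual motivation for going hybrid.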

Images are tough right now, but maybe with the Vision API around the corner, even the images can be processed, turned into text and exposed in the search as well.


Thanks, that's what I thought.
There's plenty of documentation on text. How does it work in training? Or when you ask it to give you an image of something? Is it an external system? Metadata on images?


Maybe it's best to get the LLM to describe the image, then store that text for image search?

Yes, precisely. That's basically what the Vision component of the API would do. You give it an image, and ask for a text description.
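That step could be sketched like this. `describe_image` is a placeholder for whatever vision model ends up doing the captioning (no specific API is assumed here); the point is just that each image becomes a text record in the same index as everything else:

```python
def describe_image(image_path):
    # Placeholder: a real implementation would send the image to a
    # vision model and return its generated text description.
    return f"description of {image_path}"

def index_images(image_paths):
    # Map each image to its text description so images join the text index.
    return {path: describe_image(path) for path in image_paths}

index = index_images(["fig1.png", "fig2.png"])
print(index["fig1.png"])
# -> description of fig1.png
```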

FYI, text is the “lowest common denominator” for search. If you have an audio file, you transcribe it to text as well to expose it for search.

So you will have to start with text no matter what.

If you only worked in one domain, for example, only images, you could use one model good at embedding images for an image only search. But once you go “multi-modal”, it all goes text, mainly for ease of use and transportability.

It’s hard to compare vectors across different models. So you pick a channel (usually text) and translate everything over to that channel and make the comparison coming from a single model or a set of models all comparing the same data from a single channel.

You could conceivably use one model for text, one for vision, one for audio, etc., and then fuse all the results in some manner. This adds complexity, and since your discrimination function (comparator) operates across different domains with different detectors, it raises doubts about how consistent the fused ranking really is.
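If you did go the multi-model route, one common fusion approach (my suggestion, not something from the thread) is reciprocal rank fusion, which combines ranked lists using only ranks, sidestepping the problem that scores from different models aren't comparable:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # Fuse several ranked lists (e.g., from text, image, and audio
    # searches) into one. k dampens the influence of top ranks;
    # 60 is the commonly used default from the RRF literature.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_results = ["paper_A", "paper_B", "paper_C"]
image_results = ["paper_B", "paper_C", "paper_A"]
print(reciprocal_rank_fusion([text_results, image_results]))
# -> ['paper_B', 'paper_A', 'paper_C']
```

Because it ignores raw scores, RRF works even when the per-modality models are completely different, though it still can't tell you whether a "good" image match means the same thing as a "good" text match.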

So what's the status of the Vision API? Any timelines?

It was just announced for ChatGPT. So my guess is later this year in the API at the earliest, or Q1/Q2 next year.

If this is a pressing project, you could see if there are other APIs or open-source models suitable for the task of describing images and graphs of data.

But no matter what, your pipeline will most likely be:

{Thing} → {Text} <=== Search ↔ User
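The diagram above, sketched as code. Each converter is a stub standing in for a real extractor (PDF text extraction, image captioning, audio transcription); the point is that every modality is normalized to text before a single search runs over all of it:

```python
def to_text(item):
    # {Thing} -> {Text}: stubs for real extractors/models.
    kind, payload = item
    converters = {
        "pdf": lambda p: f"extracted text of {p}",
        "image": lambda p: f"caption: {p}",
        "audio": lambda p: f"transcript of {p}",
    }
    return converters[kind](payload)

def search(query, items):
    # One search over the unified text channel, regardless of source type.
    corpus = {payload: to_text((kind, payload)) for kind, payload in items}
    return [name for name, text in corpus.items() if query.lower() in text.lower()]

items = [("pdf", "paper1"), ("image", "figure2"), ("audio", "talk3")]
print(search("caption", items))
# -> ['figure2']
```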


Maybe the Google one might help. (Not allowed to post the link.) It's a bit risky to put in a lot of work if the API is coming anyway.


Wonder if any of these actually handles images - "Chat with any PDF using the new ChatGPT API" - not just OCR?

It's an AI chatbot that allows users to chat with files and databases.