Over the past months I have developed an application that combines ChatGPT with a document editor. It's called DocGPT(.io), and its overall goal is to dramatically increase the learning speed of its users while working with documents.
Building this thing, I of course needed to embed the text of the documents somehow, so that DocGPT could find the right answer to a document-related question. At first I used OpenAI's text embedding API and stored the vectors in Firebase, but I quickly learned that this gets really expensive once you have a lot of documents, because each vector consists of 1536 dimensions.
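To put a rough number on the storage problem: a minimal back-of-the-envelope sketch, assuming one 1536-dimensional float32 vector per page (the exact encoding Firebase uses may differ, so treat this as an estimate):

```python
# Rough storage estimate for per-page embedding vectors.
# Assumptions (mine, not from the OpenAI docs): float32 values,
# one vector per page, no serialization overhead.
DIMS = 1536
BYTES_PER_FLOAT = 4  # float32

def embedding_storage_bytes(num_pages: int) -> int:
    """Bytes needed to store one 1536-dim vector per page."""
    return num_pages * DIMS * BYTES_PER_FLOAT

# One page is already ~6 KB of raw floats:
print(embedding_storage_bytes(1))    # 6144
# A 300-page document needs ~1.8 MB just for the vectors:
print(embedding_storage_bytes(300))  # 1843200
```

Multiply that by thousands of user documents and the database bill grows quickly, which matches the cost problem described above.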
The next step for me was to embed multiple pages of a document into one vector. Unfortunately, this significantly decreased the accuracy of the embedding search. It seems that multiple categories of information stored in one vector average each other out, so you can no longer reliably find the right information for a question.
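The averaging-out effect can be demonstrated with a tiny toy example. This is just a sketch with made-up 4-dimensional "embeddings" (real embeddings are 1536-dim and not orthogonal like this), but it shows how mean-pooling two unrelated pages dilutes the similarity to a query:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mean_pool(vectors):
    """Average several vectors into one (how multi-page chunks blur)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy embeddings for two pages about unrelated topics:
page_a = [1.0, 0.0, 0.0, 0.0]  # e.g. a page about pricing
page_b = [0.0, 1.0, 0.0, 0.0]  # e.g. a page about installation
pooled = mean_pool([page_a, page_b])

query = [1.0, 0.0, 0.0, 0.0]   # a question about pricing
print(cosine(query, page_a))   # 1.0   -- perfect match with the single page
print(cosine(query, pooled))   # ~0.707 -- diluted by the unrelated page
```

With more unrelated pages pooled together the score drops further, so the combined vector stops standing out against other documents in the search.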
What I do now is the following: every time a user uploads a document, I send all pages and all chapters to the ChatGPT API and ask it to summarize each of them in 3 short bullet points. I then store the summaries, and every time a user asks a question, I send the bullet points together with the question to ChatGPT and ask it to return the page or chapter that most probably answers the question. The result is much better accuracy at much lower cost, because the bullet points take far less space to store.
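The routing step could look something like the sketch below. The prompt wording and the `ask_chatgpt` helper are my assumptions for illustration, not necessarily what DocGPT actually sends; only the prompt-building part is shown concretely, since the model call itself is just a standard chat completion request:

```python
# Hypothetical sketch of the "pick the right page from summaries" step.
# build_routing_prompt and its wording are illustrative assumptions.

def build_routing_prompt(question: str, summaries: dict[int, list[str]]) -> str:
    """Assemble a single prompt from the question plus stored bullet points.

    summaries maps page number -> list of 3 short bullet points.
    """
    lines = [f"Question: {question}", "", "Page summaries:"]
    for page, bullets in sorted(summaries.items()):
        lines.append(f"Page {page}:")
        lines.extend(f"- {b}" for b in bullets)
    lines.append("")
    lines.append("Reply with only the number of the page most likely "
                 "to answer the question.")
    return "\n".join(lines)

summaries = {
    1: ["Introduces the product", "Lists system requirements", "Shows pricing tiers"],
    2: ["Explains installation steps", "Covers common install errors", "Links to support"],
}
prompt = build_routing_prompt("How do I install it?", summaries)
# The prompt would then be sent as a normal chat completion request,
# and the returned page number used to fetch the full page text.
print(prompt)
```

One nice design property of this approach is that the expensive summarization happens once per upload, while each question only pays for a short routing prompt instead of re-embedding or re-reading whole pages.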
I'm curious what you guys think about this method! How do you solve the problems that come with vector databases?