Idea for cheaper & more powerful text embeddings (DocGPT.io)

Hi everyone,

Over the past months I have developed an application that combines ChatGPT with a document editor. It’s called DocGPT(.io), and its overall goal is to dramatically increase its users’ learning speed while working with documents.

While building this I of course needed to embed the text information in the documents somehow, so that DocGPT could find the right answer to a document-related question. At first I used OpenAI’s text embedding API and stored the vectors in my Firebase, but I quickly learned that this becomes really expensive if you have a lot of documents, because each vector consists of 1536 dimensions.
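For reference, that first approach might look roughly like this with the openai and firebase_admin Python libraries. This is a minimal sketch, not DocGPT's actual code: the collection name, document IDs, and the choice of text-embedding-ada-002 (which returns 1536-dimension vectors) are my assumptions.

```python
import firebase_admin
from firebase_admin import firestore
from openai import OpenAI

firebase_admin.initialize_app()  # reads GOOGLE_APPLICATION_CREDENTIALS
db = firestore.client()
client = OpenAI()  # reads OPENAI_API_KEY

def embed_and_store(doc_id: str, page_num: int, page_text: str) -> None:
    # One embedding per page: 1536 floats stored per Firestore document,
    # which is what makes storage expensive at scale.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=page_text)
    vector = resp.data[0].embedding  # list of 1536 floats
    db.collection("embeddings").document(f"{doc_id}_{page_num}").set(
        {"doc_id": doc_id, "page": page_num, "vector": vector}
    )
```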

The next step for me was to embed multiple pages of a document into one vector. Unfortunately, this significantly decreased the accuracy of retrieving the right information via embedding search. It seems that multiple categories of information stored in one vector average each other out, so that you are no longer able to find the right information from a question.
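In effect, the multi-page variant amounts to pooling several page embeddings into one vector, e.g. by averaging, which is where the dilution comes from. A sketch of the idea (assuming page embeddings like those produced above):

```python
import numpy as np

def pool_pages(page_vectors: list[list[float]]) -> list[float]:
    # Mean-pool several 1536-dim page embeddings into a single vector.
    # If the pages cover distinct topics, the average lands close to
    # none of them, which is why retrieval accuracy drops.
    return np.mean(np.stack(page_vectors), axis=0).tolist()
```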

What I do now is the following: every time a user uploads a document, I send all pages and all chapters to the ChatGPT API and ask it to summarize each of them in 3 short bullet points. I store the summaries, and every time a user asks a question, I load the bullet points together with the question into a request to ChatGPT and ask it to return the page or chapter that most probably answers the question. The result is much better accuracy and much lower cost, because the bullet points cost far less to store.
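In code, the two steps might look something like this. This is a sketch with the openai Python library; the model name (gpt-3.5-turbo) and the prompt wording are my guesses, not the exact prompts DocGPT uses.

```python
from openai import OpenAI

client = OpenAI()

def summarize_page(page_text: str) -> str:
    # Upload time: compress a page into 3 short bullet points.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Summarize the following page in 3 short bullet points:\n\n{page_text}",
        }],
    )
    return resp.choices[0].message.content

def route_question(question: str, summaries: dict[int, str]) -> str:
    # Query time: show ChatGPT all bullet-point summaries and ask which
    # page most probably answers the question.
    listing = "\n\n".join(f"Page {page}:\n{bullets}" for page, bullets in summaries.items())
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Here are bullet-point summaries of a document's pages:\n\n{listing}\n\n"
                f"Question: {question}\n"
                "Reply with only the number of the page that most probably answers the question."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()
```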

I’m curious what you guys think about this method! How do you solve the problems that go along with vector databases?

Best regards,
Jan

How many API calls does it take to summarize the full document? And when you look at the table of contents of summaries, what do you do if all the summaries don’t fit into the context window of a single API call? Do you compare the best answers?

I can summarize up to 6 pages of a document at once, which means the number of API calls per document depends on the size of the document.
The summarization process can be carried out in multiple stages, meaning I can build summaries of document sections based on the summaries of the individual pages. If the page summaries become too large to fit into the context window, I can look at the summaries of the sections/chapters first and then drill down to the pages to find the answer. This process can be repeated as many times as needed.

In the end, I tell ChatGPT to return the page with the highest likelihood of containing the wanted piece of information.
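Put together, the multi-stage lookup could be sketched like this, reusing the hypothetical route_question helper from above; the two-level chapter/page data layout is my assumption of how the stages fit together.

```python
def find_answer_page(
    question: str,
    chapter_summaries: dict[int, str],
    page_summaries_by_chapter: dict[int, dict[int, str]],
) -> str:
    # Stage 1: pick the chapter whose summary most probably covers the question.
    # (A sketch: a production version would validate that the reply is a number.)
    chapter = int(route_question(question, chapter_summaries))
    # Stage 2: within that chapter, pick the page with the highest likelihood
    # of containing the wanted piece of information.
    return route_question(question, page_summaries_by_chapter[chapter])
```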
