I am a noob in Python / programming / OpenAI, but after a lot of thinking and reading I came up with a solution. I decided to write up the problem and the steps here, with the intent of helping other noobs like me.
The problem: a user asks a question in natural language, OpenAI searches for the answer in my own documents and returns an answer to the user in natural language. Simple yet extremely powerful, right?
“So all I need is to send all of the documents + user question to OpenAI?”
Problem: this won’t work because the documents can be very long (500 pages). Sending them with every question would be very expensive and would exceed the limit on how much text can be passed to OpenAI models in a single request.
“So why don’t you train a model and fine-tune with everything? Just insert all of your documents during fine-tuning.”
- If I simply feed in the text, the model would learn how the documents are written in terms of language. Unless I spend a looooot of time building some sort of Q&A set for fine-tuning, the model would still learn “mostly” the language, not the facts.
- Imagine that my documents are 100 × 1,000 pages long, but 99% of that is unimportant. Training would be very expensive and impractical, and nothing could guarantee that when I ask a question OpenAI would “base” the answer on the relevant part.
So after understanding the ‘limitations’ of the previous approaches, I watched one trillion videos on YouTube and re-read the OpenAI documentation. Once I understood what an embedding is, I knew the answer was there.
For noobs like me, an embedding is a ‘decomposition’ of the ‘meaning of a text’ into a lot of numbers. Imagine that it is a language, like Portuguese or English, but understood by computers, carrying the overall meaning of the text.
Think of a Library with a lot of books (Your documents). In the first two approaches I was saying:
‘hey, read the entire library so I may ask you anything about any book in the future, but you have to ‘remember’ everything, and I need you to refer only to a specific book (document) but I won’t tell you which one lol’.
Now think of an embedding as a “book cover with a small summary” of each book. So instead of making OpenAI read everything and making OpenAI staff cof, cof, rich, cof, cof… I can use embeddings to point to the books that are most related to my topic.
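To make the “text turned into numbers” idea concrete, here is a toy sketch. It is not how OpenAI computes embeddings (those come from a trained model and have many more dimensions); the vocabulary and the `toy_embedding` function are made up purely for illustration.

```python
from typing import List

# Hypothetical fixed vocabulary: each word is one "dimension" of the toy embedding.
# Real embedding models learn their dimensions instead of using a word list.
VOCABULARY = ["football", "brazil", "team", "recipe", "oven"]


def toy_embedding(text: str) -> List[float]:
    """Turn a text into numbers by counting how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]


football_book = toy_embedding("Brazil football team history")
cooking_book = toy_embedding("oven recipe for bread")
question = toy_embedding("which football team is best")

print(football_book)  # [1.0, 1.0, 1.0, 0.0, 0.0]
print(cooking_book)   # [0.0, 0.0, 0.0, 1.0, 1.0]
print(question)       # [1.0, 0.0, 1.0, 0.0, 0.0]
```

The question shares non-zero positions with the football book and none with the cooking book, which is the intuition behind matching by “book cover”.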
My Database in practical terms:
- Those very long Documents were split into small parts (1 page/part)
- The text was extracted and sent to ChatGPT Turbo to be summarized / converted to bullet points.
- The very summarized / concise text is grouped into a ‘2 pages document’
- “2 pages document” is used to generate an embedding (book cover)
- Store both the book cover and the ‘2 pages document’
Note: ChatGPT Turbo is super cheap: 20 pages cost about 30 cents to summarize, and embedding is way, way cheaper.
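The database steps above could be sketched roughly like this. It is only a sketch under assumptions: `Record`, `build_index`, `fake_summarize` and `fake_embed` are hypothetical names, the two `fake_` functions are offline stand-ins for the real summarization and embedding API calls, and grouping a fixed number of summaries per document is just one reading of the ‘2 pages document’ step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    text: str               # the summarized "2 pages document"
    embedding: List[float]  # its "book cover"


def build_index(
    pages: List[str],
    summarize: Callable[[str], str],
    embed: Callable[[str], List[float]],
    summaries_per_doc: int = 2,
) -> List[Record]:
    """Summarize each page, group the summaries, and embed each group."""
    summaries = [summarize(page) for page in pages]
    records = []
    for start in range(0, len(summaries), summaries_per_doc):
        text = "\n".join(summaries[start:start + summaries_per_doc])
        # Store the text and its embedding together, so the text can be
        # looked up later once the best-matching embedding is found.
        records.append(Record(text=text, embedding=embed(text)))
    return records


# Stand-ins so the sketch runs offline; in practice these would call the API.
def fake_summarize(page: str) -> str:
    return page[:40]  # pretend the first 40 characters are the summary


def fake_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]  # NOT a real embedding


index = build_index(
    ["page one text", "page two text", "page three text"],
    fake_summarize,
    fake_embed,
)
print(len(index))  # 2
```

Passing `summarize` and `embed` in as functions keeps the pipeline itself independent of any particular API version.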
- When a user asks a question in natural language, I generate an embedding of the question (‘a book cover/summary of the question’)
- Compare the embedding of the question with the document embeddings that were stored (put the question’s book cover side by side with all of the documents’ book covers and pick the one that is most related). This can be done with the ‘cosine similarity’ of the embeddings.
- Now I take the original question (not the embedding) together with the ‘2 pages document’ that belongs to the best-matching embedding, and pass both to the OpenAI API, with the document as the ‘ground truth’.
This is like saying “hey, in this library there are one billion books and I have a question about a red book that speaks about Football, more specifically what makes Brazil the most powerful team ever. Base your answer entirely on the book that is most like that.”
- Question + answer are displayed to the user
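The question-time steps above might look something like the sketch below. The function names, the tiny two-number embeddings and the prompt wording are all assumptions for illustration; real embeddings would come from the embeddings API and the final prompt would be sent to a chat or completion model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_related(
    question_embedding: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
) -> str:
    """Return the stored text whose embedding is closest to the question's."""
    best_text, _ = max(
        index, key=lambda item: cosine_similarity(question_embedding, item[1])
    )
    return best_text


def build_prompt(question: str, document: str) -> str:
    """Combine the original question with the retrieved document."""
    return (
        "Answer the question using only the document below.\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}"
    )


# Made-up embeddings: (stored text, its "book cover").
index = [
    ("Summary of the football book", [1.0, 0.0]),
    ("Summary of the cooking book", [0.0, 1.0]),
]
question = "What makes Brazil the most powerful team ever?"
question_embedding = [0.9, 0.1]  # would come from embedding the question

document = most_related(question_embedding, index)
prompt = build_prompt(question, document)
print(document)  # Summary of the football book
```

With many documents, it may be worth keeping the top few matches instead of only the best one, as long as they still fit in the model's size limit.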
I would love to hear your opinion about this! And I would love to meet any OpenAI people who are ever in Brazil! Many thanks in advance, I hope this was helpful.
Any suggestions on how I can improve anything?
Also, is there any certification available? I think the tool is so powerful that I want to specialize and eventually build a career around it!