Why this post?
I am a noob in Python / programming / OpenAI, but after thinking and reading a lot I came up with a solution. I decided to write up the problem and the steps here, hoping to help other noobs like me.
What I wanted to build:
A user asks a question in natural language, OpenAI searches for the answer in my own documents, and returns an answer to the user in natural language. Simple yet extremely powerful, right?
First approach and problems: pass question + documents to OpenAI.
"So all I need is to send all of the documents + the user question to OpenAI?"
Problems: This won't work because the documents can be very long (think 500 pages), which would be very expensive and would exceed the limit on how much text can be passed to OpenAI models in a single request (the context window).
Second approach and problems: Train/fine-tune a model with everything.
"So why don't you fine-tune a model with everything? Just feed in all of your documents during fine-tuning."
Problems:
- If I simply feed in the raw text, the model would mainly learn how the documents are written, i.e. the language and style. Unless I spent a looooot of time building some sort of Q&A pairs for fine-tuning, the model would still "mostly" learn the language, not the content.
- Imagine that my documents are 100 x 1,000 pages long, but 99% of that is unimportant. Training would be very expensive and impractical, and nothing would guarantee that when I ask a question, OpenAI would "base" the answer on the relevant part.
Third approach (the good one): Use "embeddings" and summarize the documents.
So, after understanding the "limitations" of the previous approaches, I watched one trillion videos on YouTube and re-read the OpenAI documentation. Once I understood what an embedding is, I knew the answer was there.
What is an embedding and why is it useful?
For noobs like me: an embedding is a "decomposition" of the text's meaning into a lot of numbers. Imagine it as a language, like Portuguese or English, but one understood by computers, carrying the overall meaning of the text in computer-language.
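To make that less abstract, here is a minimal sketch of what generating an embedding looks like in Python, assuming the v1.x openai SDK (the model name is just one of the available embedding models, not the only choice):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Turn a piece of text into a vector of numbers that captures its meaning.
response = client.embeddings.create(
    model="text-embedding-3-small",  # assumption: any embedding model works here
    input="Brazil has won the FIFA World Cup five times.",
)
vector = response.data[0].embedding
print(len(vector))  # e.g. 1536 floats encoding the "meaning" of the sentence
```

The result is just a long list of numbers; two texts with similar meanings produce vectors that point in similar directions.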
Yeah, right, and why is that useful here?
Think of a library with a lot of books (your documents). In the first two approaches I was effectively saying:
"hey, read the entire library so I can ask you anything about any book in the future; you have to 'remember' everything, and I need you to refer to only one specific book (document), but I won't tell you which one lol".
Now think of an embedding as a "book cover with a small summary" for each book. So instead of making OpenAI read everything (and making the OpenAI staff *cough, cough* rich *cough, cough*...), I can use embeddings to point to the books most related to my topic.
So the data flow looks like this.
My database, in practical terms:
- The very long documents were split into small parts (1 page per part)
- Text was extracted from each part and sent to ChatGPT (gpt-3.5-turbo) to be summarized / converted into bullet points
- The very summarized / concise text is grouped into a "2-page document"
- The "2-page document" is used to generate an embedding (the book cover)
- Both the book cover (embedding) and the "2-page document" are stored
Note: gpt-3.5-turbo is super cheap; summarizing 20 pages cost me around 30 cents, and embeddings are way, way cheaper. A sketch of this indexing step is just below.
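For other noobs, here is a rough sketch of that indexing pipeline. The function names (summarize_page, embed, index_document) and the model choices are my own assumptions, not the only way to do it:

```python
from openai import OpenAI

client = OpenAI()

def summarize_page(page_text: str) -> str:
    """Compress one page into concise bullet points with the chat model."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Summarize the following text into concise bullet points."},
            {"role": "user", "content": page_text},
        ],
    )
    return response.choices[0].message.content

def embed(text: str) -> list[float]:
    """Generate the 'book cover' (embedding vector) for a chunk of text."""
    response = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return response.data[0].embedding

def index_document(pages: list[str], group_size: int = 2) -> list[dict]:
    """pages: the 1-page chunks extracted from one long document."""
    summaries = [summarize_page(p) for p in pages]
    records = []
    # Group the summaries into small "2-page documents" and embed each group.
    for i in range(0, len(summaries), group_size):
        doc = "\n".join(summaries[i : i + group_size])
        records.append({"text": doc, "embedding": embed(doc)})
    return records  # store these records (text + embedding) in your database
```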
- When a user asks a question in natural language, I generate an embedding of the question ("a book cover/summary of the question")
- Compare the question embedding to the stored document embeddings (put the question's book cover side by side with every document's book cover and pick the most related one). This is done with the "cosine similarity" of the embeddings.
- Now I take the original question (not the embedding) together with the "2-page document" matched by the embedding and pass both to the OpenAI API as the "ground truth"
Which is like saying: "hey, in this library there are a billion books, and I have a question about a red book about football, more specifically what makes Brazil the most powerful team ever. Base your answer entirely on the book that most resembles that."
- The question + answer are displayed to the user (a sketch of this query flow is just below)
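Here is a sketch of that query flow, reusing the records produced by the indexing sketch above; I compute the cosine similarity by hand with numpy, though many vector databases do this for you:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """How aligned two embeddings are; closer to 1.0 means closer in meaning."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, records: list[dict]) -> str:
    # 1. Make the "book cover" (embedding) of the question.
    q_emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Find the stored "2-page document" whose cover is most similar.
    best = max(records, key=lambda r: cosine_similarity(q_emb, r["embedding"]))

    # 3. Ask the model to answer using only that document as ground truth.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer the question using ONLY this document:\n\n"
                        + best["text"]},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```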
I would love to hear your opinions on this! And I would love to meet any OpenAI people who are ever in Brazil! Many thanks in advance; I hope this was helpful.
Any suggestions on how I can improve this?
Also, is there any certification available? I think this tool is so powerful that I want to specialize in it and eventually build a career around it!