@kintela You definitely need to use embeddings. The way it works is that the embedding maps your text data to a vector (a list of numbers). Then a new piece of data comes in, say a question, and it gets embedded too. That query vector gets compared against all the vectors you have stored, one at a time (just multiply the numbers together element-wise and sum … this is called a dot product). Then you take your top-matching vectors and look up the corresponding text behind them. Now you have similar or related text.
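To make the dot-product idea concrete, here is a toy sketch in plain Python. The vectors and document names are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
# Toy similarity search: rank stored vectors against a query by dot product.

def dot(a, b):
    # Multiply element-wise and sum -- the dot product described above.
    return sum(x * y for x, y in zip(a, b))

# Pretend these are the embeddings of three stored documents.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "contact support": [0.0, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]  # pretend embedding of the incoming question

# Sort document names by how well their vector matches the query.
ranked = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
print(ranked[0])  # the best-matching document
```

The top-ranked name is then the key you use to fetch the actual text behind that vector.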
For example, after the search for the most correlated vectors, you have data that is related to the incoming question. You then feed this data back into GPT, in the prompt, and ask GPT to answer the question based on all this related data. Then it should spit out an answer based on your data. Even though GPT was never trained on your data, it is smart enough to absorb the data in the prompt and draw from it.
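A minimal sketch of that prompt-stuffing step, with placeholder chunks and question (the actual API call is left as a comment, since client setup varies):

```python
# Assemble retrieved chunks plus the question into one prompt for GPT.
retrieved_chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Refunds are returned to the original payment method.",
]
question = "How long do refunds take?"

context = "\n\n".join(retrieved_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)
print(prompt)
# Then send `prompt` to GPT via your client of choice, e.g.:
# response = client.chat.completions.create(...)
```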
For an example notebook, check out THIS!
The link will show you the general concepts. Once you figure that out, it’s then a decision on how you will construct your embedding database. For me, I can create an in-memory data structure for searching, and also have a database to retrieve the actual text. I can do this in a serverless environment with around 400k different embeddings and about 1 second of latency. But there are SaaS offerings for vector databases too, such as Pinecone. They aren’t cheap, but they can be faster than my approach and handle much more data (think billions of embeddings).
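An in-memory setup like that can be as simple as an embedding matrix plus an id-to-text map. This is just a sketch with toy 3-dimensional vectors, not a production index:

```python
import numpy as np

# Minimal in-memory index: one matrix row per stored embedding,
# plus an id -> text map standing in for the text database.
texts = {0: "refund policy details", 1: "shipping schedule", 2: "support hours"}
matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
])

query = np.array([0.85, 0.15, 0.05])  # pretend embedding of the question
scores = matrix @ query               # one dot product per stored vector
top = np.argsort(scores)[::-1][:2]    # ids of the top-2 matches
print([texts[int(i)] for i in top])
```

The same pattern scales to hundreds of thousands of rows; a single matrix-vector product is fast enough that you only reach for a vector database when the data no longer fits in memory or you need sharding.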
Your input will have to be text. So you need to extract the text from your PDF, DOC, and XLS files to get it to work. There are tools you can use to do this, or you can just copy and paste.
Another “hyperparameter” is how big the chunks of data you embed at once should be (sentence, paragraph, page, thought, etc.). I don’t think there is a hard and fast rule, and it depends on which GPT you are using (they all have different window sizes), so that is something to consider and experiment with. But give the embedding a large enough chunk that is coherent, so that when a series of these chunks is stuffed into the prompt, it doesn’t look like an incoherent jumbled mess to the AI.
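One simple chunking strategy is to split on blank lines (paragraphs) and merge paragraphs up to a rough size cap. The cap here is an arbitrary character count just for the sketch; in practice you would tune it (often in tokens) to your model’s window:

```python
def chunk(text, max_chars=500):
    # Split into paragraphs, then greedily merge them up to max_chars
    # so each chunk stays a coherent run of text.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\nThird."
print(chunk(doc, max_chars=30))
```

Each returned chunk is then what you embed as one vector.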
You can even “micro embed”; I do this in the name-similarity engine I built for my business (embed the first and last names of people separately, and compare similarity). It all depends on your use case and what makes sense.
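As an illustrative stand-in for that idea (not the actual engine), here is a sketch that represents each name field as a bag of character bigrams and compares the fields independently, which is the key point of micro-embedding:

```python
def bigrams(s):
    # Break a string into its set of 2-character substrings.
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    # Overlap of bigram sets: 1.0 means identical, 0.0 means disjoint.
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Compare first names and last names separately, one score per field.
first_sim = jaccard("Jonathan", "Jonathon")
last_sim = jaccard("Smith", "Smyth")
print(first_sim, last_sim)
```

With real embeddings you would replace the bigram sets with per-field vectors, but the structure is the same: one small embedding per field, one similarity score per field.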