This is a hard problem and I have not tackled a task quite this big. I was working on something far larger, using Wikipedia as a source of ground truth, but I have not worked on it for a while. So keep in mind that my next ideas are only hypothetical, but now that I know a bit more about the problem you’re working on, I think my recommendations may be more accurate:
Option 1: Search Index
Before I realized that GPT-3 has a lot of general knowledge, I was working on using Wikipedia as a source of truth to be incorporated into my cognitive architecture. For that problem, I had to find a way to make 5 million articles rapidly searchable, so I settled on SOLR as an offline search index. Since you are dealing with several orders of magnitude fewer documents, this solution should work rather well for you. Here’s the video about that:
Essentially, you break the problem down into several steps, as outlined in my book (which I recommend you read if you haven’t):
When a query comes in, you first use GPT-3 to generate appropriate search terms to fetch the correct information. This can be done with a prompt like “Extract search terms to Google the correct information for the following query” or something like that. GPT-3 is really great at writing Google queries.
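For illustration, a rough sketch of that step with the pre-1.0 openai Python package might look like this (the engine name and the `extract_search_terms` helper are just my assumptions, not a recipe):

```python
import openai  # assumes the pre-1.0 openai Python package; API key read from OPENAI_API_KEY

def extract_search_terms(user_query: str) -> str:
    """Ask GPT-3 to turn a natural-language question into search-engine terms."""
    prompt = (
        "Extract search terms to Google the correct information for the following query.\n\n"
        f"Query: {user_query}\n\nSearch terms:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",  # engine choice is an assumption; any instruct model should do
        prompt=prompt,
        temperature=0,
        max_tokens=64,
    )
    return response["choices"][0]["text"].strip()
```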
Use said search terms to search your SOLR instance for the correct documents - this will take some experimentation and tweaking. You could instead use the embedding/dot product search method, which could hypothetically be more accurate.
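If you go the SOLR route, a minimal sketch using the pysolr client might look something like this (the core name and field name are placeholders I made up):

```python
import pysolr  # a common third-party Python client for SOLR

# Hypothetical local SOLR core holding the knowledge base documents.
solr = pysolr.Solr("http://localhost:8983/solr/knowledge_base", timeout=10)

def search_documents(search_terms: str, limit: int = 5):
    """Return the top documents whose body field matches the GPT-3-generated terms."""
    results = solr.search(f"body:({search_terms})", rows=limit)
    return list(results)
```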
Once you’ve fetched the correct documents, which may still be too long to search with a single GPT-3 prompt, you will need to recursively summarize or distill them, as shown in my “compress anything” videos about the recursive summarizer. I cannot even take credit for this idea, as a commenter on my video pointed out the value of using recursive summarization for recall/fetch purposes.
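A bare-bones sketch of the recursive idea (character-based chunking and the prompt wording are simplifications on my part):

```python
import openai  # pre-1.0 openai Python package assumed

def recursive_summarize(text: str, chunk_chars: int = 4000) -> str:
    """Naive sketch: split the text, summarize each chunk, repeat until it fits one prompt."""
    if len(text) <= chunk_chars:
        return text
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    summaries = []
    for chunk in chunks:
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=f"Summarize the following passage, keeping all facts and figures:\n\n{chunk}\n\nSummary:",
            temperature=0,
            max_tokens=256,
        )
        summaries.append(response["choices"][0]["text"].strip())
    # Recurse on the concatenated summaries until the whole thing is short enough.
    return recursive_summarize("\n".join(summaries), chunk_chars)
```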
You’ll still need a check somewhere in here to know whether or not the correct information is even present. But as I’ve demonstrated in other threads, GPT-3 is really good at just giving you a boolean answer about whether something is present.
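That check can be a tiny prompt of its own, something along these lines (again, just a sketch; the prompt wording and engine are my guesses):

```python
import openai  # pre-1.0 openai Python package assumed

def information_present(question: str, passage: str) -> bool:
    """Ask GPT-3 for a yes/no verdict on whether the passage can answer the question."""
    prompt = (
        f"Passage:\n{passage}\n\n"
        f"Question: {question}\n\n"
        "Does the passage contain the information needed to answer the question? Answer yes or no:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        temperature=0,
        max_tokens=3,
    )
    return response["choices"][0]["text"].strip().lower().startswith("yes")
```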
Option 2: Finetune a KB memory bot
This is only hypothetically possible and I have not tried it. GPT-3 is capable of storing quite a lot of information, so it’s entirely possible that you can finetune a model on your 5000 documents and just use that to spit out the correct facts. Such a model would (1) be prohibitively expensive to train and (2) require quite a bit of experimentation to determine whether it’s accurate and viable.
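If you did try it, the training data would have to go into OpenAI’s prompt/completion JSONL format. A hypothetical sketch of preparing that file (the field names, separator, and stop-token conventions here are my assumptions based on the finetuning docs of the time):

```python
import json

# Hypothetical example: turn the 5000 KB documents into prompt/completion pairs.
kb_documents = [
    {"question": "What port does the payments service listen on?", "answer": "Port 8443."},
    # ... one entry per fact you want the model to recall
]

with open("kb_finetune.jsonl", "w") as f:
    for doc in kb_documents:
        record = {
            "prompt": f"{doc['question']}\n\n###\n\n",   # separator marks the end of the prompt
            "completion": f" {doc['answer']} END",        # leading space and stop token per the old guide
        }
        f.write(json.dumps(record) + "\n")
```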
Once I cycle back to research mode, this may be one of my projects. Many, many, many people have a need to search an arbitrarily large knowledge base for facts and figures, and to be able to rely on it. Indeed, even my cognitive architecture would benefit from such a feature. Since OpenAI now allows you to continuously finetune a model, perhaps this method would not be so unwieldy. Essentially, as your KB grows, you just run incremental training sessions to integrate new information into your model.
As I mentioned, I will be experimenting with this… eventually. It would solve many problems to be able to simply accumulate all of an ACOG’s memories in one model that has the magical ability of instant recall. However, I am skeptical of relying on black boxes like this for critical functionality. For example, imagine you have an autonomous agent at some point in the future - you want all of its memories to be explicit and declarative (X happened at Y time) and not just embedded in a model. Kind of like how Teslas must keep sensor logs in case of a crash.
I think OpenAI should update the documentation for Embeddings too, @daveshapautomator, because the current documentation makes one believe that the answer is coming from the KB. We know at this point that either there is a bug or the documentation needs revision. Adding @moderators for any follow-up.
Great stuff @daveshapautomator.
Based on my own experience, even a fine-tuned Davinci will not be able to answer questions from the fine-tuning file unless each question was written multiple times and answered exactly the same way each time. To fine-tune a model to answer correctly, there is a massive amount of work in duplicating every question again and again, and even with temperature = 0, it might invent things based on the general flavor of the answer rather than THE answer.
I really like your idea about checking whether or not the information is there in the middle step. Nevertheless, it requires an additional model just for that and quite a complex architecture. That’s where all the fun is, isn’t it?
There are many ways to skin this cat. I suspect that accumulating memories/KB/documents/logs in a search index is probably the way to go. SOLR can search millions of documents in a fraction of a second - plenty fast enough for 99% of use cases.
Actually, that reminds me, I found Milvus but haven’t used it yet. This may be the correct way to go: https://milvus.io/
If someone just figures out vector search + question answering, that alone would be a billion-dollar business. This is the way of the future.
Here. Specifically the “Text search using embeddings” section. When a question is outside the scope, it should not return any document, but it does. Maybe there is some confidence score that gets returned too, and if that’s the case, we could just treat a “low confidence score” as a boolean Yes/No. Thoughts?
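Something like this is what I have in mind, assuming the raw similarity scores are available (a rough sketch only; the 0.8 cutoff is a guess on my part and would need tuning against the actual KB):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # made-up cutoff; tune it on real queries

def best_match(query_embedding, doc_embeddings, documents):
    """Return the best-scoring document only if it clears the threshold, else None."""
    scores = np.dot(doc_embeddings, query_embedding)  # assumes embeddings are unit-normalized
    best = int(np.argmax(scores))
    if scores[best] < SIMILARITY_THRESHOLD:
        return None  # treat as "the answer is not in the KB"
    return documents[best]
```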
I try to avoid blackbox things like that since I have no idea how it works. That’s why I never really used the now-deprecated Answers endpoint.
Personally, I would just store all the KB documents and their associated embeddings in a local DB (with 5000, you can easily do this in SQLite or even just a JSON document). Then when you have a search query, get the embedding for that and take the dot product against all 5000 documents. It will take less than a second and you can just sort by highest dot product.
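To make that concrete, here is a minimal sketch of the dot-product lookup, assuming the pre-1.0 openai Python package and a JSON file holding the documents plus their embeddings (the file name, field names, and embedding engine are all my assumptions):

```python
import json
import numpy as np
import openai  # pre-1.0 openai Python package assumed

# Hypothetical layout: one JSON file with the 5000 documents and their precomputed embeddings.
with open("kb_with_embeddings.json") as f:
    kb = json.load(f)  # e.g. [{"text": "...", "embedding": [...]}, ...]

doc_matrix = np.array([entry["embedding"] for entry in kb])

def top_documents(query: str, k: int = 5):
    """Embed the query and return the k documents with the highest dot product."""
    response = openai.Embedding.create(
        input=query,
        engine="text-embedding-ada-002",  # use whatever engine you embedded the KB with
    )
    query_vec = np.array(response["data"][0]["embedding"])
    scores = doc_matrix @ query_vec
    best = np.argsort(scores)[::-1][:k]
    return [(kb[i]["text"], float(scores[i])) for i in best]
```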
In my ACOG, when I am searching my memories, I just grab the 5 or 10 most relevant ones.
Thank you. Thank you. Thank you. Such help means a lot when you have told your employer that you are leaving to work on a new venture while having no clue how the product works. Plus you have to pay for your kids’ college fees.
Human memories are squishy. They tend to get compressed over time. This is called consolidation, which happens in the background and while we sleep. Alcohol, for instance, disrupts memory consolidation, which reduces learning.
I’m talking about an artificial cognition, not a user-facing application. I’m merely explaining why I recall the top memories in my ACOG. For your chatbot, it may or may not make sense to do the same.
I made some more changes to the prompt, as seen below, and now I am getting “Unknown”, as expected. I asked questions such as “What is 2+2?”, “What is the capital of the USA?”, “What is Node.js?”, etc., and so far it is working.
I am an answering bot with limited knowledge base about a series of web services. I have been trained on a context and if you ask me a question, I will use the provided context to give you the answer. If you ask me a question that is not mentioned in the context, I will respond with “Unknown”. For example, if you ask me how many days in a week and the knowledge base I am trained on does not have this information, I will respond “Unknown”.
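In case it helps anyone following along, the way I am wiring that prompt into the completion call looks roughly like this (the engine name and helper are just my setup, not a recommendation):

```python
import openai  # pre-1.0 openai Python package assumed

INSTRUCTIONS = (
    "I am an answering bot with a limited knowledge base about a series of web services. "
    "I will use the provided context to answer. If the question is not mentioned in the "
    "context, I will respond with \"Unknown\"."
)

def answer_from_context(context: str, question: str) -> str:
    """Compose the instruction, retrieved context, and user question into one completion call."""
    prompt = f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    response = openai.Completion.create(
        engine="text-davinci-002",  # engine choice is an assumption
        prompt=prompt,
        temperature=0,
        max_tokens=200,
    )
    return response["choices"][0]["text"].strip()
```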
I am watching your Roe v. Wade video. From the answers you are getting, it looks like the AI is also pulling in data from outside the verdict. Is that correct?
Sorry to intrude on the convo here, but I’d like to get the same thing done as @chinmay.duke, with less OpenAI experience on my end ;).
Is there any video/article of yours anywhere I can watch/read up on the “dot product” search? I’m trying the embeddings method as well in my journey, after having tested the soon-to-be-deprecated Upload File / Answers method.