I have a forum with over 1M posts. I want a Q&A chatbot that can answer questions based on an understanding of what's in the SQL table ('posts'), which can be organized by thread and grouped by forum category.
My thinking prior to this week's dev changes was to make multiple passes to categorize and then summarize each thread, then add API functions to do keyword searches generated by OpenAI, hoping it would craft really great function calls.
But now I’m not sure if I should transition to using Assistants instead, and upload the data another way?
In my mind, I need a middle interpreter to infer relationships between a keyword-match rule in my DB and a vector DB created by OpenAI, allowing for natural-language search instead of keywords.
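Roughly what I had in mind, sketched with the OpenAI Python SDK. `search_posts`, the schema, and the SQLite connection are just placeholders for however the keyword query would actually hit my `posts` table:

```python
# Sketch: let the model craft the keyword search via function calling.
# search_posts() is a hypothetical helper; adjust SQL/schema to your forum.
import json
import sqlite3
from openai import OpenAI

client = OpenAI()

def search_posts(keywords: str, category: str | None = None) -> list[dict]:
    """Hypothetical keyword search over the posts table (SQLite shown for brevity)."""
    conn = sqlite3.connect("forum.db")
    sql = "SELECT id, thread_id, body FROM posts WHERE body LIKE ?"
    params: list = [f"%{keywords}%"]
    if category:
        sql += " AND category = ?"
        params.append(category)
    rows = conn.execute(sql + " LIMIT 20", params).fetchall()
    return [{"id": r[0], "thread_id": r[1], "body": r[2][:500]} for r in rows]

tools = [{
    "type": "function",
    "function": {
        "name": "search_posts",
        "description": "Keyword search over forum posts, optionally filtered by category.",
        "parameters": {
            "type": "object",
            "properties": {
                "keywords": {"type": "string"},
                "category": {"type": "string"},
            },
            "required": ["keywords"],
        },
    },
}]

question = "How do I fix the login loop people keep mentioning?"
resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": question}],
    tools=tools,
    # Force the model to produce a search_posts call for this sketch.
    tool_choice={"type": "function", "function": {"name": "search_posts"}},
)
call = resp.choices[0].message.tool_calls[0]
results = search_posts(**json.loads(call.function.arguments))
```

The open question for me is whether the model's generated keyword calls would be good enough, or whether I need the vector layer on top.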
This is a big project, so I'm really hopeful I can earn a few minutes of your time and build this to help my community.
I am assuming that the SQL table "posts" consists of free text (like user posts), as opposed to item records (with fields like title, description, cost, location, etc.), and that you want to be able to semantically search the text in your SQL table.
A technique I used in a similar case (my "posts" are in an Apache Solr index) was to "synchronize" the Solr index with a vector DB. That is, I basically embedded the entire Solr index in a vector store and created code to keep that vector store updated with any changes in the local index. Essentially, I maintain an embedded copy of my Solr database. I then run my prompts as cosine-similarity searches against the vector store and use the LLM not only to determine the best answer, but also to return the Solr index IDs so that I can pull up the original objects if I want to. This includes the metadata that I defined during embedding.
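A rough sketch of that loop, with an in-memory index standing in for the vector store and made-up post records; the point is that each embedded row carries its ID and metadata, so the originals can be pulled back up after the cosine-similarity search:

```python
# Sketch of the embed-and-query loop: swap the in-memory arrays for whatever
# vector store you actually use; the metadata rides along with each vector.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # any embedding model works here

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

# 1) "Synchronize": embed every post once, then re-embed rows as they change.
posts = [
    {"id": 101, "author": "alice", "date": "2023-11-02", "body": "..."},
    {"id": 102, "author": "bob",   "date": "2023-11-03", "body": "..."},
]
vectors = embed([p["body"] for p in posts])

# 2) Query: cosine similarity between the question and every stored vector.
def top_k(question: str, k: int = 5) -> list[dict]:
    q = embed([question])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:k]
    return [posts[i] | {"score": float(sims[i])} for i in best]

# 3) Answer: hand the top matches (with their IDs/metadata) to the LLM, so the
#    reply can cite which records it drew from.
question = "How do I reset my password?"
hits = top_k(question)
context = "\n\n".join(f"[post {h['id']} by {h['author']}]\n{h['body']}" for h in hits)
answer = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system", "content": "Answer using only the forum posts provided."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ],
)
```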
How you do that using the new Assistants API or even GPTs is still a mystery to me since, in your case, you would need to upload your "posts" table as a series of no more than 20 files, each with a maximum size of 512 megabytes (and 2 million tokens). It is also unclear to me how, in this scenario, you would get either of them to return metadata such as post author, title, date posted, etc.
That keyword strategy would work if you were trying to pull up items, like in a store: records with single-value fields. I don't think it works well when trying to do keyword searches against free-text fields, and that's the very problem vector stores were created to solve in the first place.
I, too, would be really interested to hear any thoughts on this. How you create a "keyword match rule" to use against hundreds if not thousands of records of free-flowing text would be of great interest to me.