Embeddings are mostly for words, sentences, paragraphs, etc., not a bunch of numbers. If your numbers have meaning, then use traditional computing to filter and refine the data, and then send (mostly) text/word-based data for the LLM to summarize.
I’ve seen folks attempt to get an LLM to understand a bunch of random JSON from the database. You can probably do this, but it needs to be prompted correctly.
Your best bet is to go into Playground, copy and paste your retrievals in there, and get the LLM to make sense of it through prompting. But LLMs aren’t good at math or numerical comparisons/relationships without heavy coaching through prompt engineering, so it’s best to let traditional computing methods sort this out.
If you can find prompts and data structures that work in Playground, then you can start coding this out. But going straight into LangChain, unless it’s a common use-case, I would expect it to be a train wreck.
@curt.kennedy
So the goal is to understand and answer queries on their payment transaction history, and also to perform actions on the website, with just natural language.
If embeddings are difficult, can we achieve this with fine-tuning? (But @bill.french has mentioned that we have to give our DB info in order to do this, which can be a privacy concern for my users. If OpenAI guarantees that our info is not being used, then we can consider this approach too.)
Can you please shed some more light on how to do this?
Then the user asks, “What is my most recent transaction?”
You would go to the database, pull the latest date, and spit out “The most recent transaction is $100.00”
This didn’t use any embeddings.
Now the question is, how to get the computer to map the question to your code. You could use embeddings for this. For example, if you embed a bunch of “actions” and one of them is “get recent transaction amount”, and then you compare this (vector-wise) to all of your actions, and when it correlates high with “get recent transaction amount”, you just run the code that does this and spit out an answer. So the embedding acts as a layer that maps the user intent to the code you need to execute.
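To make that intent-to-code mapping concrete, here is a minimal sketch, assuming the pre-1.0 openai Python package, text-embedding-ada-002, and a hypothetical SQLite transactions table; the action phrase and table layout are made up for illustration.

```python
import sqlite3
import numpy as np
import openai

def embed(text):
    # one dense vector per phrase, via the embeddings endpoint
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def get_recent_transaction(user_id, db_path="app.db"):
    # plain SQL does the numeric part; no embeddings involved here
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT amount FROM transactions WHERE user_id = ? "
            "ORDER BY txn_date DESC LIMIT 1", (user_id,)).fetchone()
    return f"The most recent transaction is ${row[0]:.2f}"

# embed each "action" phrase once, up front
ACTIONS = {"get recent transaction amount": get_recent_transaction}
ACTION_VECTORS = {name: embed(name) for name in ACTIONS}

def route(user_question, user_id):
    # pick the action whose embedded phrase correlates best with the question
    q = embed(user_question)
    best = max(ACTION_VECTORS, key=lambda name: cosine(q, ACTION_VECTORS[name]))
    return ACTIONS[best](user_id)
```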
But I have a question about this:
Firstly, it would be very difficult to write so many SQL queries, as there are so many possible use cases for any particular user.
Even if we did so, then unless we have an exact SQL query on hand (stored for the same or a decently similar question in the DB), we cannot answer it correctly.
And what if the user asks a general question, like something regarding our company?
We want to build a chatbot that can answer general questions plus the factual DB info a particular user has access to, so we should have that use case embedded too, right, even though it is not an SQL query?
So what could be done in such a case?
Again, really thankful for this idea.
Can you elaborate on your use case? Are you trying to just do Q&A (and turn-by-turn conversations) on the PDFs (and other documents) in your Google Drive?
My feeling is that you will need to do the trick where you use ChatGPT to generate a SQL statement. So the steps will be:
First internally vectorize your data (you could use Weaviate or Postgres to store the vectors)
Then based on the user’s query, you find a couple of “example” rows from the DB (for example: “find me red t-shirts for men over age 50”)
Using these few examples and the table schema, ask ChatGPT to generate the SQL.
Run the SQL on the DB to get the results from the SQL.
(This approach was also suggested by @jwatte yesterday – you can look at his detailed explanation)
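A minimal sketch of steps 2–4, assuming the pre-1.0 openai package and the ChatCompletion endpoint; the schema, the example-row retrieval, and the model choice are placeholders for whatever you actually use.

```python
import openai

SCHEMA = "products(id INTEGER, name TEXT, color TEXT, gender TEXT, min_age INTEGER, price REAL)"

def generate_sql(question, example_rows):
    # example_rows: a few rows retrieved via vector similarity to the question
    prompt = (
        f"Table schema:\n{SCHEMA}\n\n"
        f"Example rows:\n{example_rows}\n\n"
        f"Write a single SQL SELECT statement that answers: {question}\n"
        "Return only the SQL."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip()

# sql = generate_sql("find me red t-shirts for men over age 50", example_rows)
# results = db_cursor.execute(sql).fetchall()   # step 4: run it against your own DB
```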
Agree with @bill.french here – don’t rely on ChatGPT to do the counting. I had an example where I fed ChatGPT survey data where 52 out of 100 people voted for Option A. When I asked “How many people voted for Option A?” – it kept randomly throwing out answers on different re-runs.
What it did do incredibly well was running classifications, summarizations and Q&A. In fact, the classification results were mind-boggling (e.g. “Classify the survey responses using thematic pattern method”).
This was advice @curt.kennedy shared with me months ago, and it allowed me to create solutions that map record values to field names using embeddings. It’s not a bed of roses, however, as Curt also makes clear.
This makes it possible to use a query as a vector similarity to know that the query is about (x) fields of data (without the user specifying the exact field names). It also allowed me to create vectors about the records - i.e., depending on the data, each row can have a vector representing its “meaning”. Combined, a query about a specific field and records that are similar in meaning could be used to identify the information that is specific to that query. Transforming it into an output the user can relate to is a simple matter of prompt engineering.
My success with this approach has been very good where the data is actual words. Numbers, as Curt points out, are a challenge. However, extracting analytics based on records with words is clearly possible with embeddings.
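As a rough sketch of that field-plus-record idea: embed each field name and a text rendering of each row, then use query similarity to pick both the field and the matching records. The field names and rows below are invented, and the embedding call assumes the pre-1.0 openai package.

```python
import numpy as np
import openai

def embed(text):
    r = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(r["data"][0]["embedding"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical field names and rows from a transactions table
FIELDS = ["vendor name", "item description", "payment method"]
FIELD_VECTORS = {f: embed(f) for f in FIELDS}
rows = [{"vendor name": "ACME", "item description": "red t-shirt", "payment method": "card"}]
ROW_VECTORS = [(row, embed(" ".join(str(v) for v in row.values()))) for row in rows]

def find_field_and_rows(query, top_k=5):
    q = embed(query)
    # which field the question is "about", without the user naming it exactly
    field = max(FIELD_VECTORS, key=lambda f: cosine(q, FIELD_VECTORS[f]))
    # which records are closest in meaning to the question
    best = sorted(ROW_VECTORS, key=lambda rv: cosine(q, rv[1]), reverse=True)[:top_k]
    return field, [r for r, _ in best]
```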
To decide if the user input goes to the LLM for a general answer or the code, you need a filter. The filter’s objective is “General Question” or “Run Code”.
To do this, there are 3 general things you could do, and you can do all of them or some of them.
1-token categorizer, using a fine-tuned Babbage or Ada. Output ’ 0’ if “General Question” or ’ 1’ if “Run Code”. Train this on both sets of inputs, as many as you can.
Embeddings. See if the user input aligns well with your “Run Code” embedded phrases; if it does, output “1”, otherwise output “0”.
Regular expressions. Search for specific keywords known for the “Run Code” case. Output “1” if “Run Code”, or “0” otherwise.
Then if you do all three, blend the three numbers: you could take a simple average and round. You could also take a weighted average; for example, if you think the classifier is really good, put 50% of the weight on it and 25% on each of the others. So round(0.5*Classifier + 0.25*Regex + 0.25*Embeddings).
Once you dial this in, it should be good at determining which path it goes down: use the LLM for a general question, or run some code and spit out an answer.
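A hedged sketch of the blend, where classifier_score() and embedding_score() are stubs standing in for your fine-tuned model and your embedding comparison; the keyword patterns are examples only.

```python
import re

RUN_CODE_PATTERNS = [r"\btransaction\b", r"\bbalance\b", r"\btotal spent\b"]

def regex_score(text):
    # 1 if any "Run Code" keyword appears, else 0
    return 1 if any(re.search(p, text, re.IGNORECASE) for p in RUN_CODE_PATTERNS) else 0

def classifier_score(text):
    # stub: call your fine-tuned Ada/Babbage here and map ' 1' -> 1, ' 0' -> 0
    return 0

def embedding_score(text):
    # stub: 1 if the input embeds close to any of your "Run Code" phrases, else 0
    return 0

def route(user_input):
    blended = (0.5 * classifier_score(user_input)
               + 0.25 * regex_score(user_input)
               + 0.25 * embedding_score(user_input))
    return "Run Code" if round(blended) == 1 else "General Question"
```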
Thank you @curt.kennedy, I think I have got what I was thinking about.
Now the only thing left is to learn exactly how embeddings actually work and to make some sample SQL queries and user prompts.
If possible, since you have a clear idea of these things, can you share where I can learn embeddings properly, so that I can build my project without any major difficulties?
@bill.french @curt.kennedy @EricGT @alden @nelson @castel
I have another problem with my DB: users can define their own item tags or vendor/company tags.
And if the question is something like:
“what is the total money spent on apple” (users can have any tag, like APPLES or Cheap Items)
“what is the total money spent for the small business” (users can have any tag here, like small vendors)
In this case, how can I determine whether “apple” is an item tag or a company tag?
We are facing this issue because users can use any name, so every user has his or her own names.
Now how can we know what they are referring to?
We have two solutions, but I want to know if you have any more optimized solution:
We take the user input, extract keywords, and search the database for where they are present (I don’t know exactly how this is done, but we have an idea to do something like that), and we send this info in the prompt to the completion endpoint along with the SQL query we get from the embedding model, as @curt.kennedy described in this answer.
Or the user should put it explicitly in the question, saying something like:
“what is the total money spent on items with item tag apple”
“what is the total money spent on organizations with company tag small business”
Then we can have similar questions in our embeddings, and we can give the code associated with that embedding, along with the question, to the completion API to write a query; we can execute it and return the result plus the previous prompt to GPT-4 to write a descriptive answer.
But since this is not a good solution, I want to hear from you: have you faced such a difficulty before?
If yes, how do you solve this problem?
This is probably the first thing to learn about, as it will add more clarity to your most recent question concerning tags.
I have (ironically) a bias against tags in AI solutions, especially the ones created by users, because they often bias the meaning of data. It’s a free-for-all as you point out.
I agree that tags could bias the vector retrieval, and lead to worse performance (because of the added bias). You can “tag” things for your own sorting in the database, that’s OK, but don’t let the tag be a searchable item through embedding correlations.
The only way I can think of a tag influencing, is a situation where you have a classifier map to tags. Then you subset your embedding search only over the smaller set of data with those tags (and you create the tags, and train the classifier accordingly).
In this scenario, you will speed up the search, and prevent “correlation leakage” to things that are certainly off topic, assuming you tagged everything correctly.
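A small sketch of that idea, assuming you already have a classifier that maps a question to one of your own tags and a list of records with precomputed vectors (both hypothetical here); plain cosine similarity over the pre-filtered subset stands in for whatever vector store you use.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tag_filtered_search(query_vector, records, query_tag, top_k=5):
    # records: list of dicts like {"text": ..., "vector": np.ndarray, "tag": ...}
    subset = [r for r in records if r["tag"] == query_tag]   # restrict by tag first
    ranked = sorted(subset, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return ranked[:top_k]

# tag = my_tag_classifier(user_question)          # your own classifier, not shown here
# hits = tag_filtered_search(embed(user_question), records, tag)
```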
This understanding really comes from years of experience and getting your hands dirty. Once you get the hang of it, you will get there, keep trying, and asking questions!
OK @RonaldGRuckus , I like the idea of Hybrid Search. The article explained it pretty well.
It just uses the harmonic sum to weight the two searches, combining dense (embedding) and sparse (keyword, using BM25).
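For anyone following along, here is a rough sketch of one way to combine the two scores, using the rank_bm25 package for the sparse side and embeddings for the dense side. The harmonic-style blend below is just an illustration; the exact weighting in the article may differ, and the corpus here is made up.

```python
import numpy as np
import openai
from rank_bm25 import BM25Okapi

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["red cotton t-shirt for men", "blue running shoes", "leather wallet"]
bm25 = BM25Okapi([d.split() for d in docs])      # sparse index over tokenized docs
doc_vectors = [embed(d) for d in docs]           # dense vectors, computed once

def hybrid_scores(query):
    sparse = bm25.get_scores(query.split())
    sparse = sparse / (sparse.max() if sparse.max() > 0 else 1.0)   # normalize to [0, 1]
    dense = np.array([cosine(embed(query), v) for v in doc_vectors])
    # harmonic-style blend: high only when both sparse and dense agree
    return [2 * s * d / (s + d) if (s + d) > 0 else 0.0 for s, d in zip(sparse, dense)]
```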
My only concern is the computational complexity of BM25. Also, how should you extract keywords?
I’m thinking of caching previous keyword/BM25 pairs, and re-ranking periodically as data comes in. Thoughts?
Are you implementing this yourself, or running a library, or using a service?
I see some BM25 code out there. Maybe benchmark it, but at surface level it looks like a lot of computations (double-nested for-loops => latency). So it would require offline processing periodically to refresh the keyword/BM25 pairs.
There is a simple solution. Just pass the table name along with the column names and types. Then put the question in natural language and it will generate SQL. Now, run this SQL in the background and display the answer. Please do not pass data to OpenAI; that has two problems:
(1) The input is limited to a fixed number of tokens.
(2) You don’t want to expose your data to OpenAI.
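A minimal sketch of that flow, assuming the pre-1.0 openai package and a local SQLite database; the schema and model choice are placeholders. Only the schema and the question go to the API; the generated SQL runs locally.

```python
import sqlite3
import openai

SCHEMA = "transactions(id INTEGER, user_id INTEGER, amount REAL, vendor TEXT, txn_date TEXT)"

def ask(question, db_path="app.db"):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   f"Schema: {SCHEMA}\n"
                   f"Write one SQL SELECT statement that answers: {question}\n"
                   "Return only the SQL."}],
        temperature=0,
    )
    sql = resp["choices"][0]["message"]["content"].strip()
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()   # rows never leave your machine
```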
Hey @anon10827405, I was going to respond to your last post, but it was deleted?
Anyway, it makes sense to have embeddings (dense) against the description and keyword BM25 (sparse) against the keywords, with the harmonic sum combining them. This would work well as a recommendation because the keywords often extrapolate beyond the description, so combining this with a description seems like a great idea!
In my situation, I have an ML pipeline already extracting entities using AWS Comprehend. Things like keywords, locations, PII, sentiment, etc. But the data is just sitting around doing nothing. So if I could use it in the sparse side of search, that could be good.
Just didn’t want to use it on the dense side, or embeddings. So BM25 or similar could be leveraged for sparse (+hybrid with dense). Just need to figure out how to package it to reduce latency. My overall latency spec is pretty high, say 60 seconds, so no need to use Weaviate or Pinecone.
I used this solution for my sports database. I used few-shot examples and detailed explanations – mostly edge-case examples – advanced filters / slang recognition.
It’s pretty intelligent in ways I didn’t suspect at first.
Your data, while exposed to OpenAI through API calls, is not utilized or retained by OpenAI. Their privacy policy is very strict, and they are unlikely to violate it.