A sanity check for future plugins to access private SQL databases

OK guys, I will put something together and send it tomorrow. Sleep time for me now. Cheers, Michael.

Bill, and everyone else, thanks for your replies.
I will put a prompt here, but first two things:

  1. Are you a programmer? Lately I see non-programmers thinking that GPT is some sort of magic, and that the answer is already there and you just need to find it using the right prompt.
    This is far from the truth. Prompt engineering is just a buzzword; the bottom line is that if the neural network is trained based on errors, you are going to get errors. It is not some sort of god.
    The reality is that it makes lots of errors and provides lots of wrong data.

  2. To feed the result back to the chat and ask it to fix it is a very bad design. It is already slow, and you are suggesting I double the waiting time to around half a minute, which makes the UX completely useless. Not to mention the price at scale.

Here is one of the prompts. I changed the column names due to privacy, and this is just the pre-prompt behind the scenes, which is added to every user input. The user input can be anything and you can’t determine it in advance, so I am not including it here.

I replaced some sensitive data with ***** and xxxx:

       ### produce Postgres SQL for ***** type of db, with tables their properties:
        # Table name: *****, columns = [date Date, xx NUMERIC, xx NUMERIC, xx NUMERIC, xx NUMERIC, xx bigint, xx bigint, name VARCHAR, full_name text, xx text, PRIMARY KEY (date, name)]
        # dates are like 2022-12-31
        # name column is always *** in big letters
        # when asked about today/now select last available date
        # always ignore null rows.
        # *** column optional values: *** or ***
        # you produce only SQL, no explanations
      ### the user query is: 
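Mechanically, a pre-prompt like this is just concatenated with the raw user question before the API call. A minimal sketch, where the table and column names are hypothetical stand-ins for the redacted ones:

```python
# Sketch: concatenate a fixed pre-prompt with the raw user question.
# The schema below is hypothetical, standing in for the redacted one.

PRE_PROMPT = """### produce Postgres SQL for our db, with tables and their properties:
# Table name: prices, columns = [date Date, open NUMERIC, close NUMERIC, name VARCHAR, PRIMARY KEY (date, name)]
# dates are like 2022-12-31
# when asked about today/now select last available date
# always ignore null rows.
# you produce only SQL, no explanations
### the user query is: """

def build_prompt(user_query: str) -> str:
    """Append the user's question to the fixed pre-prompt."""
    return PRE_PROMPT + user_query.strip()
```

The resulting string is then sent as the prompt of a single completion request, and the completion is expected to be bare SQL.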

I should add that, since some of this data is already public, ChatGPT has some prior knowledge about our system (e.g., that we are a public company).

Of course you can write this prompt in 1,000 different ways, but that probably has nothing to do with the SQL syntax errors, especially when you can see that the produced SQL is doing the right thing but just has a syntax error.

Hi Guys, I am planning on releasing a database + NL question corpus that gets exactly at the issues we are discussing.

Would it be useful if I built a public corpus around the following schema:

   <table id="stock" writes="full">
        <column id="Symbol" type="text"/>
        <column id="Name" type="text"/>
        <column id="Market_Cap" type="int" lex="Market Cap"/>
        <column id="Country" type="text"/>
        <column id="IPO_Year" type="int" lex="IPO Year"/>
        <column id="Sector" type="text"/>
        <column id="Industry" type="text"/>
        <column id="exchange" type="text"/>
        <primary_key key="Symbol"/>
    </table>

    <table id="quote" writes="full" lex="quote,quotes">
        <column id="time" type="time"/>
        <column id="symbol" type="text" properties="none"/>
        <column id="price" type="real" units="USD"/>
        <column id="volume" type="int"/>
        <unique key="time,symbol"/>
        <foreign_key from="symbol" to="stock"/>
    </table>

The database state will be some subset of the stocks of the NASDAQ and their price at some frequency over the day.

The natural language questions would be like:
“sectors of stocks”
“names of energy stocks”
“names of energy stocks that are down 5% in the last two hours”

Would this be an interesting corpus to try to do a little prompt engineering on? Golly, I will eat my hat if ChatGPT can do this in a robust way.

Let me know if I should proceed.

Michael

Michael, this has been done by many already and will probably be part of every broker soon.
The problem is that it can’t answer sufficiently complex questions, and the simple queries are easy to answer on any finance website’s UI anyway.

My point is that if we want to clarify the feasibility of prompt engineering for the ChatGPT NL → SQL problem, we need an open and clear example. I think this stock example fits that requirement. And I will proceed with an even simpler example:

Just the single table:

<table id="stock" writes="full">
    <column id="Symbol" type="text"/>
    <column id="Name" type="text"/>
    <column id="Last_Sale" type="real" />
    <column id="Net_Change" type="real" />
    <column id="percent_Change" type="real" />
    <column id="Market_Cap" type="int" />
    <column id="Country" type="text"/>
    <column id="IPO_Year" type="int" />
    <column id="Volume" type="int"/>
    <column id="Sector" type="text"/>
    <column id="Industry" type="text"/>
    <column id="summary_quote" type="text"/>
    <column id="exchange" type="text"/>
    <primary_key key="Symbol"/>
</table>
And the database state will be the stocks listed on NASDAQ.

I wonder how it will do with questions like "Energy stocks with percentage change down more than 2.3 percent".
This is a single table, a simple case. But there are a lot of stocks. What about
"what is the price of ACAX"? Will it know that ACAX is a symbol?
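One cheap way to sanity-check such questions is to materialize the schema locally and run candidate SQL against toy rows. A sketch using SQLite as a stand-in for Postgres; the rows are made up, not real market data:

```python
import sqlite3

# Sketch: build the single-table stock schema in an in-memory SQLite
# database and run the kind of SQL we hope the model produces for
# "Energy stocks with percentage change down more than 2.3 percent".
# The two rows below are fictional test data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stock (
        symbol TEXT PRIMARY KEY,
        name TEXT,
        last_sale REAL,
        net_change REAL,
        percent_change REAL,
        market_cap INTEGER,
        country TEXT,
        ipo_year INTEGER,
        volume INTEGER,
        sector TEXT,
        industry TEXT,
        summary_quote TEXT,
        exchange TEXT
    )
""")
conn.executemany(
    "INSERT INTO stock (symbol, name, percent_change, sector) VALUES (?, ?, ?, ?)",
    [("ENGY", "Acme Energy", -3.1, "Energy"),
     ("TECH", "Acme Tech", 0.4, "Technology")],
)

# A candidate query of the kind the model should generate:
rows = conn.execute(
    "SELECT name FROM stock WHERE sector = 'Energy' AND percent_change < -2.3"
).fetchall()
```

If the generated query returns the expected toy rows, at least its logic and syntax are plausible before it ever touches the real database.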

We will see. Hopefully I will have something to report later today or tomorrow.

Cheers,
Michael

 

Michael, you don’t need to try anything; the normal GPT version online can do this. It has a knowledge cutoff in 2021, but you can ask about stocks before that period, and yes, it knows ticker names and other facts about markets.

what is the price of stock ACAX?

ChatGPT says:

I’m sorry, but as an AI language model, I do not have access to real-time stock market data. The price of a stock such as ACAX (if it exists) can fluctuate rapidly and is dependent on a variety of factors such as market demand, company performance, and global economic conditions. I recommend checking a reputable financial website or consulting with a financial advisor to obtain the latest information on the price of a specific stock.

Michael, that’s because you are asking for the price now. Ask for the price on 7 Dec 2020 and you will get an answer (if it’s not a holiday). It’s well known that it has price-action data; nothing to prove here.


Again, the point is to figure out the feasibility of prompt engineering for the NL → SQL problem.
Almost done. Will release a video + example database + questions.

I would suggest trying to add some examples: take some common failure cases and include them as examples in the prompt (obviously with corrected syntax).

It will be more expensive, as you will use a lot of tokens on the examples.

If you can identify some common syntax errors it’s making (like putting window clauses in the wrong place), then some examples of valid SQL queries that satisfy common use cases may help.
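A sketch of that suggestion: prepend corrected example pairs to the prompt, so the model sees valid syntax for the constructs it tends to get wrong. The question/SQL pairs here are hypothetical:

```python
# Sketch: build a few-shot prompt whose examples show corrected
# versions of common failure cases (e.g. NULLS LAST placement,
# window clauses). The question/SQL pairs below are hypothetical.

EXAMPLES = [
    ("top 5 names by volume",
     "SELECT name FROM stock ORDER BY volume DESC NULLS LAST LIMIT 5;"),
    ("rank stocks by market cap within each sector",
     "SELECT name, RANK() OVER (PARTITION BY sector "
     "ORDER BY market_cap DESC) FROM stock;"),
]

def build_few_shot_prompt(schema: str, question: str) -> str:
    """Schema description, then solved examples, then the new question."""
    parts = [schema]
    for q, sql in EXAMPLES:
        parts.append(f"-- Question: {q}")
        parts.append(sql)
    parts.append(f"-- Question: {question}")
    return "\n".join(parts)
```

The trade-off is exactly the token cost mentioned above: every example is re-sent on every request.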

I would also try just writing your prompt as plain text - this is a prompt that I’ve been experimenting with on one of our database tables.

I'm using Postgres and want to query the following table:

Table: xyz

Columns:

X: bigint
X: character varying
X: text
X: integer
X: character varying
X: integer
X: character varying
X: integer
... there are about 30 or so columns ...

Write me a query for the following:

XXXXXXXXX

Do not output any explanation or additional text.

Could not agree more. Example of me saying exactly this to a database community drunk on ChatGPT.

Untrue. Users who are drunk on AGI have misconstrued the term in ways that do not match engineering principles. English is now officially code and should be crafted the way we craft code. So, prompt engineering is a thing, and OpenAI wouldn’t have agreed to lend its team to build a course on prompt engineering if it weren’t.

Agreed. It’s not ideal, but doing so with the intent of holding the AI’s feet to the fire, without creating additional latency, is probably a good thing. LLMs are capable of self-correcting if given the inference time (not to be conflated with the response time of an API call) to consider what they say before actually saying it.

AI models will say things before thinking. Prompt engineering makes it possible to get them to think before saying things.

@rantrg and @michael.minock Thanks for providing examples. I love this thread because there’s so much to learn about here.

Seems like a good use case. There is a potential security issue if you use the API key in the manifest, but you could get around this by using an OAuth system when the plugin is first installed.

An easy way to demonstrate this is with the simple, powerful

“Let’s think step by step”

| No. | Category | Template | Accuracy |
|---|---|---|---|
| 1 | instructive | Let’s think step by step. | 78.7 |
| 2 | | First, (*1) | 77.3 |
| 3 | | Let’s think about this logically. | 74.5 |
| 4 | | Let’s solve this problem by splitting it into steps. (*2) | 72.2 |
| 5 | | Let’s be realistic and think step by step. | 70.8 |
| 6 | | Let’s think like a detective step by step. | 70.3 |
| 7 | | Let’s think | 57.5 |
| 8 | | Before we dive into the answer, | 55.7 |
| 9 | | The answer is after the proof. | 45.7 |
| 10 | misleading | Don’t think. Just feel. | 18.8 |
| 11 | | Let’s think step by step but reach an incorrect answer. | 18.7 |
| 12 | | Let’s count the number of “a” in the question. | 16.7 |
| 13 | | By using the fact that the earth is round, | 9.3 |
| 14 | irrelevant | By the way, I found a good restaurant nearby. | 17.5 |
| 15 | | Abrakadabra! | 15.5 |
| 16 | | It’s a beautiful day. | 13.1 |
  • (Zero-shot) 17.7
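In code, the zero-shot chain-of-thought trick from the table above amounts to appending the trigger phrase after the question (a minimal sketch):

```python
# Sketch: zero-shot chain-of-thought prompting appends an instructive
# trigger phrase so the model reasons before answering.

COT_TRIGGER = "Let's think step by step."

def cot_prompt(question: str) -> str:
    return f"Q: {question}\nA: {COT_TRIGGER}"
```

These templates were reportedly evaluated with a two-stage scheme, where a second call extracts the final answer from the generated reasoning.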

One of my favorite examples to demonstrate the power of prompt engineering.

Use the following clues to answer the following multiple-choice question.
 
Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.
 
Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

Solution:
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick


Use the following clues to answer the following multiple-choice question, using the following procedure:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant
(2) Second, combine the relevant clues to reason out the answer to the question
(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)
 
Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.
 
Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick
Solution:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant:

Clue 1: Miss Scarlett was the only person in the lounge. This clue is not relevant.
Clue 2: The person with the pipe was in the kitchen. This clue is not relevant.
Clue 3: Colonel Mustard was the only person in the observatory. This clue is relevant.
Clue 4: Professor Plum was not in the library nor the billiard room. This clue is not relevant.
Clue 5: The person with the candlestick was in the observatory. This clue is relevant.
 
(2) Second, combine the relevant clues to reason out the answer to the question:
Clue 3 and Clue 5 both say that the person with the candlestick was in the observatory. Colonel Mustard was the only person in the observatory, so Colonel Mustard was the person with the candlestick.

(3) Third, write the final answer:
(a) Yes; Colonel Mustard was in the observatory with the candlestick
  1. To feed the result back to the chat and ask it to fix it is a very bad design. It is already slow, and you are suggesting I double the waiting time to around half a minute, which makes the UX completely useless. Not to mention the price at scale.

Not at all. What you are asking for is a perfect query every time.
If you don’t at the very least verify your queries, you aren’t going to ever have a solid product. I mean seriously, have you even considered the possibility of people creating malicious queries? Not even that, but if the query is wrong, then what? You say “Oops, try again?” It doesn’t make sense and is honestly counter-productive.

I don’t understand the point about price. Compared to what? If you expect to do everything in one single call, well, it’s just not going to work out. Why not focus on a solid, bullet-proof product/service, and then focus on efficiencies after? If a competitor somehow manages to make it lightning fast, then learn, but with nothing to compare to, how can you say that it’s too expensive, or takes too long?

People will pay, and people will wait.

A thought: Why not build a very robust system, and take the costs/time, but use a caching system along with a dataset of common questions, so that most queries are immediately generated? This is very easy to accomplish with embeddings.
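A sketch of that caching idea: keep an embedding for every answered question and serve the stored SQL when a new question is close enough. The `embed` function here is a toy character-frequency stand-in; a real system would call an embedding model instead:

```python
import math

# Sketch: cache generated SQL keyed by an embedding of the question,
# so near-duplicate questions skip the slow generation path entirely.

def embed(text: str) -> list[float]:
    # Toy embedding for illustration only: character-frequency vector.
    # A real system would use an embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SqlCache:
    def __init__(self, threshold=0.95):
        self.entries = []  # list of (embedding, sql) pairs
        self.threshold = threshold

    def get(self, question):
        """Return cached SQL for a sufficiently similar question, else None."""
        q = embed(question)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, question, sql):
        self.entries.append((embed(question), sql))
```

With a vector index in place of the linear scan, the same shape scales to a large corpus of common questions.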


OK Guys, ChatGPT surprised me on the upside in this experiment. Quite a difference between this and the multi-table
TICKIT example. I definitely want to get to the bottom of this. Note it is exactly the same prompt strategy and underlying model.

https://youtu.be/8Auc1OpqCfo

Okay - and are the prompts unchanged from your earlier tests proclaiming ChatGPT as a failure (where multiple tables are involved)? Saying the “exact same prompt strategy” is possibly different from saying “exact same prompts,” right? I just want to understand what has changed from your first experiments to the latest video.

Also, I have never seen a single successful SQL inference that was less than 256 tokens. Why did you fix that at 150? And more specifically, have you tested your outcomes at higher and lower values to benchmark the outcomes?

Help me understand - are these queries the actual prompts you are sending to ChatGPT (i.e., stocks in industry Packaged Foods)? If not, could you share the prompt that would have been generated from this query? I also see mixed case in the queries; why the inconsistency?


And how do you actually know this is the case?

When I test outcomes, I perform hundreds of them in an automated process that I can easily repeat. Each outcome is validated by a human and ranked for accuracy and performance. I’ve come to learn that evaluating success one at a time without any formal test protocol or metrics is unreliable and generally biased.

Most importantly, any change to the prompts or other settings requires a full test battery and re-assessment. This approach tells me how prompts compare to other prompt versions. We then rank prompt versions to determine the highest success. A single prompt development process might have dozens of different versions and thousands of tests.

OK, I pasted the exact prompts into the video description for the TICKIT database where ChatGPT bombed.

https://youtu.be/3HtidCV9BxI

If you have ideas about a custom prompt or other settings, I can try those out. I suppose I could use C-Phrase matching to focus the prompt on a relevant part of the schema per question, but that would make it a hybrid approach. Something I may do eventually.

I will likewise paste the prompts into the description of the latest video over the single stock table.
https://youtu.be/8Auc1OpqCfo
Expect that in the next hour or two.

I agree that a more formal method of testing is needed, but for now I just want to get a feel for the limitations and tradeoffs of wrinkly-brain ChatGPT and smooth-brain C-Phrase.

Regards,
Michael

Thanks for taking the time to expose those. It’s still unclear to me what you are sending to prompt GPT to respond, and I don’t have the time to watch the video to figure out what’s happening.

For members to help you suss out what may be failing, I was expecting to see a process breakdown - e.g., when a user enters (x), my app transforms that into (y) prompt and calls the API with (z) payload and parameters and fails to respond with (z1) as the expected outcome.

For other engineers to help you determine how to help, you need to provide a repeatable test that we can run using code or a test lattice of some type. Hypothesizing about what may or may not be happening in a video is not going to attract investment in helping you make your AI features better.

Just go to the description of both videos; there it should be pretty clear exactly what was sent and received back. You don’t need to watch the whole video. In short, the takeaway is that it bombed on the TICKIT database and did pretty well on the very simple stocks database.

I am going to explore some other aspects of how to jointly use C-Phrase and ChatGPT later. I think I can improve the performance over TICKIT via a hybrid approach.


Maybe I’m dense, but it’s not.

Is this the prompt?

### Postgres SQL tables, with their properties:
# stock(symbol, name, last_sale, percent_change, market_cap, country, IPO_year, volume, sector, industry, notes)
#
### stocks in industry Packaged Foods

I’m neither an SQL nor a prompt expert, but my assumption is that to do this well, you need a prompt process that is far more complex than what you seem to be intimating. Example - this is the approach that I expected to see.

  • Generate a candidate query
  • Quality check the SQL with GPT itself, asking it to spot common errors (NULLS LAST, for instance) and produce a correct version of the query
  • Run the SQL against the tables
  • If there was an error or no result, ask GPT to fix the query to produce a correct result and repeat the loop
  • Otherwise, return the result
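Those steps can be sketched as a small loop, with the model calls and SQL execution injected as callables. This is my reading of the bullet points, not a definitive implementation; the GPT self-check step is folded into the repair call here:

```python
# Sketch of the generate -> run -> repair loop described above.
# generate() and repair() stand in for LLM calls; run_sql() executes
# the query against the database. All three are injected so the
# control flow itself is testable.

def answer(question, generate, repair, run_sql, max_attempts=3):
    """Produce a working (sql, result) pair, asking the model to fix
    the query on errors or empty results, up to max_attempts runs."""
    sql = generate(question)
    for _ in range(max_attempts):
        try:
            result = run_sql(sql)
        except Exception as err:
            # The query failed: feed the error back and try again.
            sql = repair(question, sql, str(err))
            continue
        if result:
            return sql, result
        # No rows: treat as a soft failure and ask for a revision.
        sql = repair(question, sql, "query returned no rows")
    raise RuntimeError("could not produce a working query")
```

This is exactly the extra-latency trade-off debated earlier in the thread: each repair round adds a model call, but only on the failure path.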