Using GPT-3 on database schemas

Is it possible to give GPT-3 your database schema, so that it knows about your database, and then ask it questions about the data?

2 Likes

I’ve seen people using Codex to generate SQL queries, so I assume so. https://www.reddit.com/r/GPT3/comments/pf5isz/openai_codex_for_sql/

That’s generally possible with a combination of the /search and /completions (with Codex) endpoints:

  1. Use /search to find, given a user query, the most relevant tables from the schema.
  2. Use /completions, where inside a comment you show the most likely relevant tables, then append the user query and ask Codex to complete the SQL statement. A sketch of this step is below.
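
In code, step 2 might look roughly like this. This is only a sketch: it assumes the v3 openai Node client and the code-davinci-002 Codex model, and the table strings are made up (they would come from step 1):

// Sketch of step 2: relevant tables in a comment, then ask Codex
// to complete the SQL statement for the user's question.
const { Configuration, OpenAIApi } = require("openai");

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

async function generateSql(userQuery, relevantTables) {
  const prompt = [
    "### Postgres tables, with their columns:",
    ...relevantTables.map((t) => `# ${t}`),
    `### ${userQuery}`,
    "SELECT",
  ].join("\n");

  const res = await openai.createCompletion({
    model: "code-davinci-002", // Codex model available at the time
    prompt,
    max_tokens: 150,
    temperature: 0,
    stop: [";"],
  });
  return "SELECT" + res.data.choices[0].text + ";";
}

// Example: the tables would come from the /search step (step 1).
generateSql("Total revenue per customer in 2022", [
  "customers(id, name, email)",
  "orders(id, customer_id, total, created_at)",
]).then(console.log);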
2 Likes

Can we use the /answers API instead, since it inherently does both search and completion for us?

@omkar, did you come up with a solution to this?

This is doable if you specify the DB details within the context of a programming language that supports SQL queries, like Python, Ruby, or PHP.

HTH

:slight_smile:

@boris - Now that the /search endpoint is deprecated, how could this be achieved? Embeddings are the suggested alternative, but in order to find the most relevant tables for a user’s query, what sort of data would need to be embedded?
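
For example, would it be enough to embed one string per table (table name plus column names) and then rank tables against the question by cosine similarity? Something like this sketch, assuming the v3 openai Node client (the table strings are hypothetical):

// Sketch: embed one string per table, then rank tables against the
// user's question by cosine similarity.
const { Configuration, OpenAIApi } = require("openai");

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

const tableDocs = [
  "customers: id, name, email -- one row per registered customer",
  "orders: id, customer_id, total, created_at -- one row per order",
];

async function embed(texts) {
  const res = await openai.createEmbedding({
    model: "text-embedding-ada-002",
    input: texts,
  });
  return res.data.data.map((d) => d.embedding);
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function rankTables(question) {
  const [qVec, ...tableVecs] = await embed([question, ...tableDocs]);
  return tableDocs
    .map((doc, i) => ({ doc, score: cosine(qVec, tableVecs[i]) }))
    .sort((a, b) => b.score - a.score);
}

rankTables("Which customers spent the most last month?").then(console.log);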

1 Like

Great question @kvnam, I’m looking to understand the same thing; any help would be much appreciated.

Hey, I’m assuming you haven’t tried using the hybrid index?

You wouldn’t use embeddings for a keyword query; you would use a typical relational database, ideally with some sort of NLP entity extraction. However, it’s also ideal to use both and compare the results together for maximum performance. Keep in mind that you can use GPT for entity extraction, though it’d probably be more efficient to train your own.

“Hello, my day is fine. I’m curious, what is the capital of London?” → OpenAI( Embed the sentence for semantic relevance ) → [Optional] Extract the query “Capital of London” → Rank using a Search Engine → Compare both results’ score

In this case, they both should return the correct answer, assuming you have it, which would be stored as:

Dense Vectors: (Wikipedia page of London)
Sparse Vectors: (Important factual keywords extracted from the page)
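
Comparing the two scores can be as simple as a weighted combination. A sketch (the alpha weighting, ids, and scores here are illustrative, not any specific library’s API):

// Sketch: compare dense (semantic) and sparse (keyword) results.
// alpha = 1 -> purely semantic; alpha = 0 -> purely keyword-based.
function hybridScore(denseScore, sparseScore, alpha = 0.5) {
  return alpha * denseScore + (1 - alpha) * sparseScore;
}

const candidates = [
  { id: "london-wiki", dense: 0.91, sparse: 0.72 },
  { id: "paris-wiki", dense: 0.55, sparse: 0.2 },
];

const ranked = candidates
  .map((c) => ({ ...c, score: hybridScore(c.dense, c.sparse) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked[0].id); // "london-wiki"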

Thanks @RonaldGRuckus, I am trying out the hybrid index. I work with JavaScript, so translating some of the code and finding packages has been slow. I understand the concept of hybrid indexes, and I’m currently testing a BM25 package to see if it will suffice to give me the right set of results, which I can then send as prompt context to OpenAI for the final answer.
Since this thread is about converting questions to SQL queries, I was curious whether there is a way to convert the question to the exact entities and then to a SQL query, if I provide my schema to a model. But I think the hybrid index approach might be simpler to try first.
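
For reference, BM25 is small enough to implement without a package. A minimal self-contained sketch in Node (the schema snippets used as documents are made up):

// Minimal BM25 scorer: rank documents against a query and keep the
// top K to send as prompt context.
const K1 = 1.5;
const B = 0.75;

const tokenize = (s) => s.toLowerCase().match(/[a-z0-9_]+/g) || [];

function bm25Rank(docs, query, topK = 3) {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((s, t) => s + t.length, 0) / docs.length;

  // document frequency per term
  const df = new Map();
  for (const toks of tokenized) {
    for (const t of new Set(toks)) df.set(t, (df.get(t) || 0) + 1);
  }

  const scores = tokenized.map((toks, i) => {
    let score = 0;
    for (const q of tokenize(query)) {
      const n = df.get(q);
      if (!n) continue;
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      const tf = toks.filter((t) => t === q).length;
      score +=
        (idf * tf * (K1 + 1)) /
        (tf + K1 * (1 - B + (B * toks.length) / avgLen));
    }
    return { doc: docs[i], score };
  });

  return scores.sort((a, b) => b.score - a.score).slice(0, topK);
}

// Example: schema snippets as documents (hypothetical tables).
console.log(
  bm25Rank(
    [
      "customers: id, name, email",
      "orders: id, customer_id, total, created_at",
      "products: id, sku, price",
    ],
    "total spent by each customer"
  )
);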

JavaScript as in Node.js? It’s all so new. Based on the docs on Pinecone’s website, they do intend to provide a helper class so we don’t need to prepare the sparse vectors ourselves. I believe they’re having a seminar about it soon. You can also find a little more helpful information here:

1 Like

Yes @RonaldGRuckus, I meant Node.js, and you’re right, it does seem like a more intensive approach to use SQL queries. Most BM25 packages in JavaScript run the search directly and return the results for the query. I asked ChatGPT how to convert these to embeddings and it suggested:

Once the BM25 object is initialized, you can generate embeddings by calling the search method with each query as input. The search method will return a list of document IDs sorted by their relevance scores to the query. You can treat these relevance scores as embedding vectors for each document.

I’m still trying to work this one out, but before that I’m checking whether a plain BM25 run is enough for me to generate the prompt context documents.
And thank you for sharing that link; it definitely explains the use of sparse and dense embeddings in more layman’s terms. I just need to figure out the generate-vectors step for sparse vectors with Node.

If you are generating BM25 sparse vectors, you should at some point see something like this:

{
    "indices": [2, 4, 6],
    "values":  [0.1, 0.3, 0.5]
}

This is the sparse vector data that you would feed alongside the semantic embeddings generated by OpenAI.
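
Combined with the dense embedding, the record you’d upsert would look roughly like this. A sketch only: the id and values are dummies, and the exact upsert call depends on your vector store and client version:

// Sketch: one sparse-dense record combining an OpenAI embedding with
// the BM25 sparse values above, in the shape Pinecone documents for
// hybrid search.
const denseEmbedding = new Array(1536).fill(0.01); // stand-in for an ada-002 vector

const record = {
  id: "doc-1",
  values: denseEmbedding, // semantic embedding from OpenAI
  sparseValues: {
    indices: [2, 4, 6],      // vocabulary indices from the BM25 step
    values: [0.1, 0.3, 0.5], // their BM25 weights
  },
};

// e.g. index.upsert([record]) with Pinecone's Node client
console.log(record.sparseValues);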

1 Like

@kvnam were you ever able to generate the sparse vectors in Node?

I made this work with a Postgres DB: you ask it questions in English, it generates a query, runs it against the DB, and returns the results. It works surprisingly well but runs into issues periodically. The general loop is sketched below.
If there’s interest, I can look into publishing it on GitHub.
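
Roughly, the loop looks like this. This is a sketch, not the actual code; the model, prompt, and connection details are simplified stand-ins:

// Sketch of the loop: English question -> generated SQL -> run
// against Postgres -> results.
const { Client } = require("pg");
const { Configuration, OpenAIApi } = require("openai");

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

async function askDatabase(question, schemaDescription) {
  const completion = await openai.createCompletion({
    model: "text-davinci-003",
    prompt:
      `Given this Postgres schema:\n${schemaDescription}\n` +
      `Write a single SQL query answering: ${question}\nSQL:`,
    max_tokens: 200,
    temperature: 0,
  });
  const sql = completion.data.choices[0].text.trim();

  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // Running model-generated SQL is risky; a read-only role helps.
    const { rows } = await client.query(sql);
    return { sql, rows };
  } finally {
    await client.end();
  }
}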

Hey @dror, that’s great! I would love to see the GitHub link; I’m interested in how you did this. Thanks!

OK, so this is more of a proof of concept than a real product, but I went ahead and published it.

It works surprisingly well.

2 Likes