Ideas for teaching ChatGPT about an entire DB schema?

I wrote this little toy to test out if I can use the ChatGPT API to generate queries on demand:

require "openai"
require "pg"
require "mini_sql"

API_KEY = "KEY GOES HERE"

pg_conn = PG.connect(dbname: "discourse_development")
conn = MiniSql::Connection.get(pg_conn)

schema = []

table_name = nil
columns = nil

priority_tables = %w[posts topics notifications users user_actions]

conn
  .query(
    "select table_name, column_name from information_schema.columns order by case when table_name in (?) then 0 else 1 end asc, table_name ",
    priority_tables
  )
  .each do |row|
    if table_name != row.table_name
      schema << "#{table_name}(#{columns.join(",")})" if columns
      table_name = row.table_name
      columns = []
    end
    columns << row.column_name
  end

schema << "#{table_name}(#{columns.join(",")})"



messages = [
  {
    role: "system",
    content:
      "Given the following DB schema, I am going to ask you questions and I want you to answer them with valid postgres SQL. Do not decorate the answer with text, only answer with SQL. You are SQL generator. You only generate SQL. Do not include Note:, do not explain the query"
  }
]

puts "What query would you like to write?"
question = gets.chomp

chunked = []
chunk = +""
schema.each do |table|
  chunk << table
  chunk << " "
  if chunk.length > 4000
    chunked << chunk
    chunk = +""
  end
end
chunked << chunk if chunk.length > 0

chunked[0..2].each { |data| messages << { role: "system", content: data } }

messages << { role: "user", content: question }

client = OpenAI::Client.new(access_token: API_KEY)
response =
  client.chat(parameters: { model: "gpt-3.5-turbo", messages: messages })

text = response.dig("choices", 0, "message", "content")

# puts text

text = $2 if text =~ /```(sql)?(.*)```/m

puts text.strip

How well does this do?

which month of 2021 did the user sam post the most posts in (excluding deleted posts and topics)?

WITH user_posts AS (
  SELECT date_trunc('month', p.created_at) as post_month, COUNT(*) as post_count
  FROM posts p
  JOIN users u ON u.id = p.user_id
  WHERE u.username = 'sam' AND p.deleted_at IS NULL AND p.post_type = 1 AND p.topic_id NOT IN (SELECT id FROM topics WHERE deleted_at IS NOT NULL)
  GROUP BY date_trunc('month', p.created_at)
)
SELECT post_month, post_count
FROM user_posts
WHERE post_count = (SELECT MAX(post_count) FROM user_posts)

How many notifications did users get in the last 3 months?

SELECT COUNT(*) 
FROM notifications 
WHERE created_at >= NOW() - INTERVAL '3 months';

(likes are stored in user_actions) how many likes did the user sam get last month?

SELECT COUNT(*) 
FROM user_actions 
WHERE action_type = 2 
AND target_user_id = (SELECT id FROM users WHERE username = 'sam')
AND created_at >= date_trunc('month', CURRENT_DATE - interval '1 month') 
AND created_at < date_trunc('month', CURRENT_DATE);

The challenge though is:

  1. I am forced to truncate the schema… my full schema is 12000 tokens
  2. I will have even fewer tokens once I pass in all the business rules, e.g. likes are stored in the user_actions table

Is there any method for fine tuning ChatGPT?

Any ideas on how to work around the token limit? Perhaps I start with a round of questions:

Given the question “…” which of these tables (table definition) are relevant?
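
Roughly, that two-pass idea could look like this (a sketch only: the select_relevant_tables helper and the prompt wording are mine, reusing the client and schema built in the script above):

# Hypothetical helper: ask the model which tables matter for the question,
# then keep only those entries from the schema array built earlier.
def select_relevant_tables(client, schema, question)
  table_names = schema.map { |entry| entry.split("(").first }

  response =
    client.chat(
      parameters: {
        model: "gpt-3.5-turbo",
        messages: [
          { role: "system", content: "Answer only with a comma separated list of table names." },
          { role: "user", content: "Given the question \"#{question}\", which of these tables are relevant? #{table_names.join(", ")}" }
        ]
      }
    )

  picked = response.dig("choices", 0, "message", "content").to_s.split(",").map(&:strip)
  schema.select { |entry| picked.include?(entry.split("(").first) }
end

Only the surviving tables would then be chunked and appended as system messages before the real question.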

Is the “right” approach here just to use the more expensive davinci and fine-tune?

No. ChatGPT is not tunable (by customers / developers) at this time.

I made some progress by fine-tuning the prompts a bit and leaning on assistant/user chains. This allows me to teach ChatGPT to answer only in SQL, regardless of the question.
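
Roughly the shape of such a chain (a sketch: the demonstration question and answer below are placeholders, not the actual prompts):

# Illustrative few-shot chain: a worked user/assistant exchange shows the model
# the expected SQL-only behaviour before the real question is appended.
messages = [
  { role: "system", content: "You only answer with valid Postgres SQL. Never explain the query." },
  { role: "user", content: "How many users are there?" },
  { role: "assistant", content: "SELECT COUNT(*) FROM users;" },
  { role: "user", content: question }
]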

But fundamentally the token limits for both ChatGPT and Codex make them not quite suitable for the task. I would need to feed in a ton of examples to get consistently great results, and I am already at 3000 tokens now.

GPT-4 experiments (no API key yet) are producing far more interesting results than 3.5. Combined with the higher token caps, I think this should be a very practical solution.

It’s been a while, but did you make any progress on this one? Happy to learn!

A fair bit of progress.

These days we use the equivalent of a GPT for these kinds of tasks.

Discourse AI ships with a persona that stuffs some of the schema into the prompt and is able to retrieve portions of the schema using tool calls.
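
As a rough sketch of what a schema lookup tool definition can look like with the OpenAI tools API (the tool name, description and model below are illustrative, not the actual Discourse AI persona code):

# Sketch of a schema-lookup tool: when the model needs columns for a table it
# has not seen in the prompt, it can call this instead of guessing.
schema_tool = {
  type: "function",
  function: {
    name: "lookup_schema",
    description: "Return the column names for the given tables",
    parameters: {
      type: "object",
      properties: {
        tables: { type: "array", items: { type: "string" } }
      },
      required: ["tables"]
    }
  }
}

response =
  client.chat(
    parameters: { model: "gpt-4-turbo", messages: messages, tools: [schema_tool] }
  )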

You still get the odd hallucination but overall it does a very good job.

I think further progress could be made by adding more feedback loops: validating the generated SQL and ensuring that it actually returns results.
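
A minimal sketch of such a loop, assuming the same mini_sql connection and chat client as the original script (the run_with_retry helper and the retry wording are hypothetical):

# Hypothetical feedback loop: execute the generated SQL and, if Postgres rejects
# it, feed the error message back to the model for another attempt.
def run_with_retry(client, conn, messages, attempts: 2)
  attempts.times do
    response = client.chat(parameters: { model: "gpt-3.5-turbo", messages: messages })
    sql = response.dig("choices", 0, "message", "content").to_s
    sql = $2 if sql =~ /```(sql)?(.*)```/m

    begin
      return conn.query(sql.strip)
    rescue PG::Error => e
      messages << { role: "assistant", content: sql }
      messages << { role: "user", content: "That SQL failed with: #{e.message}. Fix it and answer only with SQL." }
    end
  end
  nil
end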

Also given 128k tokens, if you are rich enough you can just stuff the entire database schema into the prompt.

If you were to build it again from scratch, would you do it the “traditional” way or build it with agents like AutoGen or LangChain’s SQL agent? Did you have a look into LangChain’s SQL agent? I guess it takes a multi-step approach: first looking into the schema to find the context, then trying the query, etc.?