Ideas for teaching ChatGPT about an entire DB schema?

I wrote this little toy to test out if I can use the ChatGPT API to generate queries on demand:

require "openai"
require "pg"
require "mini_sql"

pg_conn = PG.connect(dbname: "discourse_development")
conn = MiniSql::Connection.get(pg_conn)

schema = []

table_name = nil
columns = nil

priority_tables = %w[posts topics notifications users user_actions]

conn
  .query(
    "select table_name, column_name from information_schema.columns order by case when table_name in (?) then 0 else 1 end asc, table_name",
    priority_tables
  )
  .each do |row|
    if table_name != row.table_name
      schema << "#{table_name}(#{columns.join(",")})" if columns
      table_name = row.table_name
      columns = []
    end
    columns << row.column_name
  end

schema << "#{table_name}(#{columns.join(",")})"

messages = [
  {
    role: "system",
    content:
      "Given the following DB schema, I am going to ask you questions and I want you to answer them with valid postgres SQL. Do not decorate the answer with text, only answer with SQL. You are SQL generator. You only generate SQL. Do not include Note:, do not explain the query"
  }
]

puts "What query would you like to write?"
question = gets.chomp

chunked = []
chunk = +""
schema.each do |table|
  chunk << table
  chunk << " "
  if chunk.length > 4000
    chunked << chunk
    chunk = +""
  end
end
chunked << chunk if chunk.length > 0

chunked[0..2].each { |data| messages << { role: "system", content: data } }

messages << { role: "user", content: question }

client = OpenAI::Client.new(access_token: API_KEY)
response = client.chat(parameters: { model: "gpt-3.5-turbo", messages: messages })

text = response.dig("choices", 0, "message", "content")

# puts text

text = $2 if text =~ /```(sql)?(.*)```/m

puts text.strip

How well does this do?

which month of 2021 did the user sam post the most posts in (excluding deleted posts and topics)?

WITH user_posts AS (
  SELECT date_trunc('month', p.created_at) as post_month, COUNT(*) as post_count
  FROM posts p
  JOIN users u ON u.id = p.user_id
  WHERE u.username = 'sam' AND p.deleted_at IS NULL AND p.post_type = 1 AND p.topic_id NOT IN (SELECT id FROM topics WHERE deleted_at IS NOT NULL)
  GROUP BY date_trunc('month', p.created_at)
)
SELECT post_month, post_count
FROM user_posts
WHERE post_count = (SELECT MAX(post_count) FROM user_posts)

How many notifications did users get in the last 3 months?

SELECT COUNT(*)
FROM notifications
WHERE created_at >= NOW() - INTERVAL '3 months';

(likes are stored in user_actions) how many likes did the user sam get last month?

SELECT COUNT(*)
FROM user_actions
WHERE action_type = 2 
AND target_user_id = (SELECT id FROM users WHERE username = 'sam')
AND created_at >= date_trunc('month', CURRENT_DATE - interval '1 month') 
AND created_at < date_trunc('month', CURRENT_DATE);

The challenge though is:

  1. I am forced to truncate the schema… my full schema is 12000 tokens
  2. I will have even fewer tokens once I pass in all the business rules, e.g.: (likes are stored in the user_actions table)
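One way of spending part of the budget on business rules is to keep them in a list and prepend each as a system message before the schema chunks. A minimal sketch; the rule texts below are illustrative, taken from the examples earlier in this post (likes as `action_type = 2` in `user_actions`, soft deletes via `deleted_at`):

```ruby
# Hypothetical list of business rules, each injected as its own system
# message ahead of the user question. Rule wording is an assumption.
BUSINESS_RULES = [
  "likes are stored in the user_actions table with action_type = 2",
  "deleted posts and topics have deleted_at set; exclude them unless asked"
]

def with_business_rules(messages)
  BUSINESS_RULES.map { |r| { role: "system", content: r } } + messages
end

with_business_rules([{ role: "user", content: "how many likes did sam get?" }])
# => two system rule messages followed by the user message
```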

Is there any method for fine tuning ChatGPT?

Any ideas on how to work around the token limit? Perhaps I start with a round of questions:

Given the question “…” which of these tables (table definition) are relevant?

Is the “right” approach here just to use the more expensive davinci and fine tune?
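The two-round idea above could be sketched like this: a cheap first round that only sees table names and picks the relevant ones, then a second round that gets full column lists for just those tables. Everything here (`table_selection_messages`, `filter_schema`, the example reply) is hypothetical scaffolding, not from the script above:

```ruby
# Round 1: ask the model which tables matter, sending only table names.
def table_selection_messages(schema, question)
  table_names = schema.map { |t| t[/\A[^(]+/] } # "posts(id,raw)" -> "posts"
  [
    { role: "system",
      content: "Given this list of table names, answer ONLY with a comma " \
               "separated list of the tables needed to answer the question." },
    { role: "system", content: table_names.join(", ") },
    { role: "user", content: question }
  ]
end

# Round 2 prep: keep only the schema entries the model asked for,
# given a hypothetical reply such as "posts, users".
def filter_schema(schema, reply)
  wanted = reply.split(",").map(&:strip)
  schema.select { |t| wanted.include?(t[/\A[^(]+/]) }
end

schema = ["posts(id,user_id,raw)", "users(id,username)", "badges(id,name)"]
filter_schema(schema, "posts, users")
# => ["posts(id,user_id,raw)", "users(id,username)"]
```

The filtered schema would then replace the truncated `chunked[0..2]` slice in the SQL-generating round, so the token budget is spent only on tables the question needs.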


No. ChatGPT is not tunable (by customers / developers) at this time.

I made some progress by fine tuning the prompts a bit and leaning on assistant/user chains. This allows me to teach ChatGPT to answer only in SQL regardless of the question.

But fundamentally the token limit for both ChatGPT and Codex makes them not exactly suitable for the task. I am going to need to feed a ton of examples to get very consistent and great results, and I am already at 3000 tokens now.
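For tracking how close a prompt is to the limit, a rough heuristic is about 4 characters per token for English text. This is only an approximation, not a tokenizer (exact counts would need something like OpenAI's tiktoken); the helper names here are made up for illustration:

```ruby
# Very rough token estimate: ~4 chars/token is a common rule of thumb
# for English prose; real counts come from the model's tokenizer.
def approx_tokens(text)
  (text.length / 4.0).ceil
end

def messages_within_budget?(messages, budget)
  messages.sum { |m| approx_tokens(m[:content]) } <= budget
end

messages = [
  { role: "system", content: "You are SQL generator. You only generate SQL." },
  { role: "user", content: "how many posts did sam write last month?" }
]
messages_within_budget?(messages, 4096)
# => true
```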


GPT-4 experiments (no API key yet) are producing far more interesting results than 3.5. I think combined with the higher caps this should be a very practical solution.
