Correct way to submit the db schema with each prompt

I feel like I’m losing my mind. Let me start off by saying I’m a Ruby developer, not a Python or Java developer. Let me also say I’ve never done anything with LLM or AI dev before, so most of these posts are confusing as heck. Tbh, I feel like there is a massive chasm in the dev world between those that get this stuff and those that don’t, and those that get it just assume the rest of us can figure it out.

With that said, after watching YouTube videos, reading Reddit posts and Medium articles, I decided to just ask Claude and ChatGPT the following: “explain to me at a high level how to search a Postgres database using an LLM or AI”.

They both came back with basically the same thing:

  1. Send your db schema to the AI with an explanation of each table and column and association

  2. Send your prompt and ask the AI to generate a SQL query.

  3. Run that query against your database (rough sketch below).
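Here’s that flow as I understand it, as a rough Ruby sketch (I’m a Ruby dev). Every name in it is made up, and it assumes the OPENAI_API_KEY env var and the pg gem:

```ruby
require "net/http"
require "json"
require "pg"

# Step 1: the schema (plus a comment or two) that travels with every prompt
SCHEMA = <<~SQL
  -- orders placed by customers
  CREATE TABLE orders (id serial PRIMARY KEY, customer_id int, total numeric, created_at timestamp);
SQL

# Step 2: send the prompt and ask the model for a SQL query
def sql_for(question)
  body = {
    model: "gpt-4",
    messages: [
      { role: "system", content: "Given this Postgres schema, reply with a single SQL query only:\n#{SCHEMA}" },
      { role: "user", content: question }
    ]
  }
  res = Net::HTTP.post(URI("https://api.openai.com/v1/chat/completions"), body.to_json,
                       "Authorization" => "Bearer #{ENV["OPENAI_API_KEY"]}",
                       "Content-Type"  => "application/json")
  JSON.parse(res.body).dig("choices", 0, "message", "content")
end

# Step 3: run the generated query against the database
conn = PG.connect(dbname: "mydb")
puts conn.exec(sql_for("What were total sales last month?")).to_a.inspect
```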

Love it. Makes sense, doesn’t require me to send my massive db in some odd format to the AI, and is manageable. But I had questions.

Specifically I asked:

  • how do I send the schema to the AI? What format?
  • if I send the schema every time it’s going to cost a fortune. What can I do?
  • how do I train the AI if the SQL being returned is not producing the correct results? Do I need to do that every time?

The AI came back and told me:

  1. You can create a JSON, YAML or OpenAPI representation of your schema, and it gave me the representation based on my db schema, which I saved to a file and augmented with descriptions and associations.

  2. You post this formatted representation to OpenAI to create a context embedding. You will get an ID to reference in future API calls.

  3. When you make an API call to the completions endpoint, tell the AI to use those embeddings, which won’t count against your tokens.

  4. Use fine-tuning to create an augmented version of the AI model that is trained by submitting a file with example prompts and the correct SQL. You’ll get a model ID you can then use for the next completion call, or better, you can create a model alias that you can keep updating the tuning for.

Love love love it. Makes so much sense. Tell the AI about your db, store that knowledge with the AI, reference it when sending prompts, use an augmented version of the AI’s model that you keep training over time.

It even gave me code to use.

Here’s the problem - it appears to all be wrong.

There is no context embedding endpoint in OpenAI, just an embeddings endpoint that doesn’t accept my JSON representation. When I send the schema as plain text I get back an array of numbers, which I know is an embedding, but it’s longer than the original schema representation, and there is no embedding ID to use.
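To be concrete, here’s roughly the call I ended up making (Ruby sketch; the input string is just an example):

```ruby
require "net/http"
require "json"

# The real embeddings endpoint: text in, vector of floats out. No ID comes
# back that a later completions call could reference.
res = Net::HTTP.post(URI("https://api.openai.com/v1/embeddings"),
                     { model: "text-embedding-ada-002",
                       input: "orders(id, customer_id, total, created_at)" }.to_json,
                     "Authorization" => "Bearer #{ENV["OPENAI_API_KEY"]}",
                     "Content-Type"  => "application/json")
vector = JSON.parse(res.body).dig("data", 0, "embedding")
puts vector.length # => 1536 floats -- longer than the schema text itself
```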

When I call the completions endpoint there is no embedding attribute I can use to send the embeddings (or the phantom embedding ID).

And finally, there is no way to fine-tune the OpenAI model to create an augmented version that understands my data. At least nothing I can find.

And every darn search I do is super confusing or points to a “recipe” for generating SQL from your schema in the OpenAI Cookbook that doesn’t exist.

So what happened here? Why is the AI sending me on insanely frustrating wild goose chases? How can I actually figure this out?

Thank you for helping a true LLM newbie gain some sanity.

Are you looking for this one?

No, thanks. It’s my understanding there used to be a specific article that others have linked to explaining how to use NLP via OpenAI to generate SQL. Regardless, why did the AI give me completely misleading information on how to do this?

Ah, perhaps this one then: Natural language to SQL.

It is under the examples page.

Yeah, thank you, but it doesn’t really answer any of the concerns I have. Submitting the entire schema with every prompt is incredibly inefficient. Also, there is nothing about fine-tuning or creating my own model.

What do you mean by submitting? As in getting counted towards tokens? You shouldn’t have to rewrite or recreate it every time, but yeah, it will count towards tokens. If you go into the on-site Playground, then Assistants, then within a new instance click on the +Function button, there should be two examples there of the JSON format/schema for creating custom functions.
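From memory, the format is roughly this: a JSON Schema describing a callable function (shown here as a Ruby hash since you’re a Ruby dev; details paraphrased, so treat it as a sketch):

```ruby
require "json"

# Paraphrase of the Playground's weather example -- a JSON Schema that tells
# the model what the function is called and what arguments it takes.
get_weather = {
  name: "get_current_weather",
  description: "Get the current weather for a location",
  parameters: {
    type: "object",
    properties: {
      location: { type: "string", description: "City name, e.g. San Francisco" },
      unit: { type: "string", enum: ["celsius", "fahrenheit"] }
    },
    required: ["location"]
  }
}

# It gets passed to the chat completions endpoint in the `functions` array;
# the model can then answer with structured arguments for you to execute.
puts JSON.pretty_generate(get_weather)
```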

@hennyllogan thanks. Yes, I mean submitting as in including the entire JSON schema with every prompt. That’s an expensive operation to be doing thousands of times a day. There has to be a way to create my own “model” that already knows about the schema so when I use curl or whatever to post a prompt I don’t need to include it.
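Back-of-envelope with made-up numbers: a 2,000-token schema resent with 1,000 prompts a day is 2,000,000 extra input tokens a day; at GPT-4’s $0.03 per 1K input tokens that’s about $60 a day just to keep re-teaching the model my own schema.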

@hennyllogan yeah, I see them, but they are for weather and stock prices. Where can I find information on how to use these functions to train a model on my database schema? What about fine-tuning?

This is kind of what I mean about the learning curve. It’s very difficult to know what to do as things are moving so fast.

@jlippiner Have you found any solution yet? I am also looking for the same thing, and you are right that sending the schema each time is not the right way to do it.

@jlippiner, if you find a solution to this issue, please share it. This is a problem many of us are facing.