Correct way to submit the db schema with each prompt

I feel like I’m losing my mind. Let me start off by saying I’m a Ruby developer, not a Python or Java developer. Let me also say I’ve never done anything with LLM or AI dev before, so most of these posts are confusing as heck. Tbh, I feel like there is a massive chasm between the dev world and those who get this stuff, and the people who do get it just assume the rest of us can figure it out.

With that said, after watching YouTube videos and reading Reddit posts and Medium articles, I decided to just ask Claude and ChatGPT the following: “explain to me, from a high level, how to search a Postgres database using an LLM or AI”.

They both came back with basically the same thing:

  1. Send your db schema to the AI with an explanation of each table and column and association

  2. Send your prompt and ask the AI to generate a SQL query.

  3. Run that query against your database. (A rough sketch of this whole loop is below.)
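
Here’s that sketch, which is just my reconstruction and nothing official: it assumes Python (I’m a Ruby dev, but the HTTP calls are the same from any language) with the `requests` and `psycopg2` packages, an `OPENAI_API_KEY` environment variable, a plain-text `schema.txt` dump of the tables, and placeholder model and connection details.

```python
# Rough reconstruction of the "schema -> prompt -> SQL -> run it" loop.
# Assumptions (mine, not from any official recipe): Python with `requests`
# and `psycopg2`, OPENAI_API_KEY in the environment, a plain-text schema
# dump in schema.txt, and placeholder model/connection details.
import os
import requests
import psycopg2

schema = open("schema.txt").read()                 # step 1: schema + descriptions
question = "Total orders per customer last month"

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",                    # placeholder model name
        "messages": [
            {"role": "system",
             "content": "You write PostgreSQL queries for this schema:\n"
                        f"{schema}\n"
                        "Reply with a single SQL statement and nothing else."},
            {"role": "user", "content": question}, # step 2: ask for the SQL
        ],
    },
)
sql = resp.json()["choices"][0]["message"]["content"]

# Step 3: run it. Real code should validate/limit the SQL before executing it.
conn = psycopg2.connect("dbname=mydb")             # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(sql)
    print(cur.fetchall())
```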

Love it. Makes sense, doesn’t require me to send my massive db to the AI in some odd format, and is manageable. But I had questions.

Specifically I asked:

  • How do I send the schema to the AI? What format?
  • If I send the schema every time, it’s going to cost a fortune. What can I do?
  • How do I train the AI if the SQL being returned is not producing the correct results? Do I need to do that every time?

The AI came back and told me:

  1. You can create a JSON, YAML, or OpenAPI representation of your schema, and it gave me the representation based on my db schema, which I saved to a file and augmented with descriptions and associations.

  2. You post this formatted representation to OpenAI to create a context embedding. You will get an ID to reference in future API calls.

  3. When you make an API call to the completions endpoint, tell the AI to use those embeddings, which won’t count against your tokens.

  4. Use fine-tuning to create an augmented version of the AI model that is trained by submitting a file with example prompts and the correct SQL. You’ll get a model ID you can then use for the next completion call, or better, you can create a model alias that you can keep updating the tuning for.

Love love love it. Makes so much sense. Tell the AI about your db, store that knowledge with the AI, reference it when sending prompts, use an augmented version of the AI’s model that you keep training over time.

It even gave me code to use.

Here’s the problem - it appears to all be wrong.

There is no context embedding endpoint in OpenAI, just an embeddings endpoint that doesn’t accept my JSON representation. When I send it as text I get back an array of numbers, which I know are embeddings but are longer than the original schema representation. There is no embedding ID to use.
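
Just to show what I mean, this is all the embeddings endpoint actually gives you as far as I can tell (sketch assumes Python and `requests`; the model name is just whichever embedding model you pick):

```python
# What the embeddings endpoint actually returns: a fixed-length vector of
# floats for whatever text you send it. There is no reusable "context
# embedding id" to pass to later calls. Model name is just an example.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "text-embedding-3-small",
          "input": "customers(id, name, email, created_at)"},
)
vector = resp.json()["data"][0]["embedding"]
print(len(vector), vector[:5])   # e.g. 1536 floats, longer than the input text
```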

When I call the completions endpoint there is no embedding attribute I can use to send the embeddings (or the phantom embedding ID).

And finally, there is no way to fine-tune the OpenAI model to create an augmented version that understands my data. At least nothing I can find.

And every darn search I do is super confusing or points to a “recipe” for generating SQL from your schema in the OpenAI Cookbook that doesn’t exist.

So what happened here? Why is the AI sending me on insanely frustrating wild goose chases? How can I actually figure this out?

Thank you for helping a true LLM newbie gain some sanity.

Are you looking for this one?

No, thanks. It’s my understanding there used to be a specific article that others have linked to explaining how to use NLP via OpenAI to generate SQL. Regardless, why did the AI give me completely misleading information on how to do this?

Ah, perhaps this one then: Natural language to SQL.

It is under the examples page.

Yeah, thank you, but it doesn’t really answer any of the concerns I have. Submitting the entire schema with every prompt is incredibly inefficient. Also, there is nothing about fine-tuning or creating my own model.

What do you mean by submitting? Like as in getting counted towards tokens? You shouldn’t have to rewrite or recreate it every time, but yeah, it will count towards tokens. If you go into the on-site Playground, then Assistants, then within a new instance click on the + Function button, there should be two examples there of the JSON format or schema for creating custom functions.
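
Something in this shape, if it helps (a made-up run_sql example rather than the built-in weather/stock samples, so the names here are hypothetical):

```python
# A made-up example of the function/tool JSON shape, adapted from the
# weather/stock samples to a hypothetical "run a SQL query" function.
run_sql_tool = {
    "type": "function",
    "function": {
        "name": "run_sql_query",             # hypothetical name
        "description": "Run a read-only SQL query against the reporting database",
        "parameters": {                      # JSON Schema describing the arguments
            "type": "object",
            "properties": {
                "sql": {
                    "type": "string",
                    "description": "A single PostgreSQL SELECT statement",
                },
            },
            "required": ["sql"],
        },
    },
}
# It gets sent with the request, e.g. json={..., "tools": [run_sql_tool]}
```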

@hennyllogan thanks. Yes, I mean submitting as in including the entire JSON schema with every prompt. That’s an expensive operation to be doing thousands of times a day. There has to be a way to create my own “model” that already knows about the schema so that when I use curl or whatever to post a prompt I don’t need to include it.

@hennyllogan yeah, I see them, but they are for weather and stock prices. Where can I find information on how to use these functions to train a model on my database schema? What about fine-tuning?

This is kind of what I mean about the learning curve. It’s very difficult to know what to do as things are moving so fast.

@jlippiner Have you found any solution yet? I am also looking for the same thing, and you are right that sending the schema each time is not the right way to do it.

@jlippiner, if you find a solution to this issue, please share it. This is a problem many of us are facing.

me too

The AI says:

Currently, on platform.openai.com, it’s not possible to directly train a custom model with your own data, such as your SQL schema. However, you can integrate the existing model and tailor it to respond to specific queries about your database.

Here are some ideas on how to proceed:

Data Preprocessing: Ensure you extract relevant information from your SQL schema to formulate questions and answers.

Integration: You can create an application that uses the OpenAI API. This application can process user queries and return responses based on your database data.

Context: When using the API, you can send the necessary context (such as parts of your SQL schema) along with the user’s query to help the model respond more accurately (see the sketch after this list).

Synchronization: Ensure your SQL schema is always updated and available for the model to refer to as needed.

Error Handling: Implement logic to handle any errors or questions the model cannot directly answer.
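
A minimal sketch of the “Context” idea above, assuming Python and a hand-maintained dict of table descriptions (the table names and the matching heuristic are placeholders):

```python
# Minimal sketch of the "Context" idea: keep per-table descriptions in a dict
# and only include the tables that look relevant to the question, so the
# prompt stays small. Table names and the matching heuristic are placeholders.
TABLE_DOCS = {
    "customers": "customers(id, name, email, created_at)",
    "orders":    "orders(id, customer_id, total_cents, placed_at)",
    "products":  "products(id, sku, name, price_cents)",
}

def schema_context(question: str) -> str:
    """Return only the table definitions that seem relevant to the question."""
    q = question.lower()
    relevant = [doc for name, doc in TABLE_DOCS.items() if name.rstrip("s") in q]
    return "\n".join(relevant or TABLE_DOCS.values())   # fall back to everything

print(schema_context("How many orders did each customer place last week?"))
```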

To train a custom model based on your own data, such as your SQL schema, here are several options you might consider:

Hugging Face Transformers: Allows you to train custom language models using your local data. You can fine-tune a pre-trained model to specialize in your specific domain (rough sketch at the end of this post).

Google AutoML: Offers tools to train custom models using your own data without needing advanced machine learning expertise.

Amazon SageMaker: Enables you to build, train, and deploy custom machine learning models in the cloud.

Microsoft Azure Machine Learning: Provides tools to easily train custom models, also in the cloud.

IBM Watson Studio: Offers a platform for developing and training custom AI models using your own data.

These platforms allow you to upload your data, train specific models, and tailor them for particular tasks. The choice of platform depends on your specific needs, budget, and experience with AI tools.
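
For the Hugging Face Transformers route, a fine-tuning run has roughly this shape. This is only a minimal sketch, assuming a small causal language model, a JSONL file of question/SQL training pairs, and default Trainer settings; the file and model names are placeholders:

```python
# Minimal sketch of fine-tuning a small causal LM on (question, SQL) pairs
# with Hugging Face Transformers. File and model names are placeholders;
# training settings are deliberately tiny.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "gpt2"                                   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token             # gpt2 has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# JSONL rows like {"text": "Question: ...\nSQL: SELECT ..."} (hypothetical file)
dataset = load_dataset("json", data_files="schema_examples.jsonl")["train"]

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True,
                    padding="max_length", max_length=256)
    out["labels"] = out["input_ids"].copy()           # causal LM: labels = inputs
    return out

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sql-ft",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
)
trainer.train()
```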