How to train OpenAI on my own data from a database?

Hi guys, I'm trying to create a backend service in Node.js using OpenAI, where I want to let users ask questions about data stored in my Postgres database, but I can't find a way to do it. Can anyone suggest what to do in this case? I tried training OpenAI with a simple training doc, but I also need to get correct answers based on the data I have in the database. Thanks!

1 Like

Hi @john0239

You can pass your DB schema in a system message and instruct the assistant to generate Postgres queries for the user's questions using the Chat Completions API.
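For example, something like this (a minimal sketch, assuming the `openai` Node SDK; the schema here is invented for illustration):

```ts
// Minimal sketch: translate a user question into a Postgres query via chat completions.
// Assumes the `openai` npm package (v4) and OPENAI_API_KEY in the environment.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical schema for illustration; replace with your real tables and columns.
const schema = `
CREATE TABLE customers (id serial PRIMARY KEY, name text, email text);
CREATE TABLE orders (id serial PRIMARY KEY, customer_id int REFERENCES customers(id), total numeric, created_at timestamptz);
`;

export async function questionToSql(question: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You translate user questions into a single read-only Postgres SELECT statement. " +
          `Use only the tables and columns in this schema:\n${schema}\nReturn only the SQL, no explanation.`,
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```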

2 Likes

Hi @sps, thank you for your suggestion, but the problem is that my database is very complicated, and I can't trust OpenAI 100% to generate the queries, because it could select the wrong data instead of the correct data I need. That's why I asked whether there is any other way.

1 Like

If you do not trust the query that GPT generates, I suggest you use a query checker on your end to make sure the generated query is correct and will actually work with your data.

It might sound like a tricky piece of code, but it lets you check both that the query is semantically valid and that the tables and values it selects are the right ones.

Other than that, passing the DB schema when generating the query and including a few sample rows is probably the best way to get GPT to generate the queries, though there is still some chance it hallucinates.
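One cheap way to build such a checker (a sketch of one possible approach, assuming the `pg` package, not anyone's actual implementation): accept only a single SELECT statement, then let Postgres itself validate it with EXPLAIN before you ever run it.

```ts
// Sketch of a query checker: reject anything that isn't a single SELECT,
// then let Postgres validate tables, columns and syntax via EXPLAIN (which
// plans the query without executing it).
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function checkQuery(sql: string): Promise<{ ok: boolean; reason?: string }> {
  const trimmed = sql.trim().replace(/;+\s*$/, "");
  if (!/^select\b/i.test(trimmed)) {
    return { ok: false, reason: "only SELECT statements are allowed" };
  }
  if (trimmed.includes(";")) {
    return { ok: false, reason: "multiple statements are not allowed" };
  }
  try {
    await pool.query(`EXPLAIN ${trimmed}`);
    return { ok: true };
  } catch (err) {
    return { ok: false, reason: (err as Error).message };
  }
}
```

Giving the database role that the backend uses read-only permissions adds another layer of protection if a bad query slips through.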

2 Likes

To integrate OpenAI with your Node.js backend and answer user questions based on your Postgres database, follow these steps (a minimal sketch follows the list):

  1. Set up a Node.js backend.
  2. Use the OpenAI API to generate responses.
  3. Prepare input format for user questions.
  4. Process user queries and send them to the OpenAI API.
  5. Retrieve relevant data from your database based on the question.
  6. Post-process the OpenAI response.
  7. Combine database data with the language model’s output.

Remember, the language model won’t directly access your database; it generates responses based on its training on a vast dataset. Ensure user input security and privacy measures are in place. Training the model specifically on your database data requires significant resources and expertise in NLP and ML, so using pre-trained models like GPT-3 for language generation is recommended.
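A minimal sketch of that flow in an Express route (assuming the `express`, `openai`, and `pg` packages; the model name, schema, and route are illustrative):

```ts
// End-to-end sketch: user question -> model-generated SQL -> Postgres -> natural-language answer.
import express from "express";
import OpenAI from "openai";
import { Pool } from "pg";

const app = express();
app.use(express.json());

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Steps 1-3: backend, API client, and an input format; replace with your real schema.
const schema =
  "CREATE TABLE orders (id serial, customer_id int, total numeric, created_at timestamptz);";

app.post("/ask", async (req, res) => {
  const question: string = req.body.question;

  // Step 4: send the question plus schema to the OpenAI API to get a SQL query.
  const sqlCompletion = await openai.chat.completions.create({
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content: `Write one read-only Postgres SELECT for the user's question. Schema:\n${schema}\nReturn only SQL.`,
      },
      { role: "user", content: question },
    ],
  });
  const sql = sqlCompletion.choices[0].message.content ?? "";

  // Step 5: retrieve the relevant data. In production, validate the SQL first
  // (see the query-checker suggestion earlier in this thread).
  const { rows } = await pool.query(sql);

  // Steps 6-7: combine the rows with the question and let the model phrase the answer.
  const answerCompletion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "Answer the user's question using only the JSON rows provided." },
      { role: "user", content: `Question: ${question}\nRows: ${JSON.stringify(rows)}` },
    ],
  });

  res.json({ sql, answer: answerCompletion.choices[0].message.content });
});

app.listen(3000);
```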

The question here is, how? I've heard of the possible solution of passing the database schema to the LLM and letting it write the Postgres query, but I have also read that current LLMs are notoriously bad at writing good SQL.

Based on my experience with GPT-4 and even 3.5 Turbo, I would say that most open-source SQL query generators are okay, but GPT is the best of the lot.

For my use case, I pass the schema along with a short description of each column name and the data it stores, plus a few samples, and then hand it off to GPT to generate a query for me.

I also apply a few checks when the query comes back, using an AST-style parse to verify that it is correct.
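The code itself isn't shown here, but a rough Node equivalent of that kind of AST check could use a SQL parser library such as `node-sql-parser` (my choice of library is an assumption, and the table allow-list is hypothetical):

```ts
// Sketch of an AST-style check on a model-generated query.
// Library choice (node-sql-parser) and table names are assumptions for illustration.
import { Parser } from "node-sql-parser";

const parser = new Parser();
const ALLOWED_TABLES = new Set(["orders", "customers"]); // hypothetical allow-list

export function validateGeneratedSql(sql: string): boolean {
  let ast;
  try {
    ast = parser.astify(sql); // throws on syntax errors
  } catch {
    return false;
  }
  const statements = Array.isArray(ast) ? ast : [ast];
  // Accept exactly one SELECT statement.
  if (statements.length !== 1 || statements[0].type !== "select") return false;
  // Only allow tables we expect; tableList entries look like "select::null::orders".
  const tables = parser.tableList(sql);
  return tables.every((t) => ALLOWED_TABLES.has(t.split("::").pop() ?? ""));
}
```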

1 Like

@udm17 can you please share the part of the code where you're doing this?

@petergray3219 can you provide me an example of it (on the coding side)?

From your description, it sounds like you really want to use embeddings. This is my go-to video for any beginner on the subject: https://www.youtube.com/watch?v=Ix9WIZpArm0&ab_channel=Chatwithdata

Now, in your case, your data is in a Postgres database. That's similar to my use case, where my data is stored in an Apache Solr database. You simply create your embeddings directly from the database! At least, that is what I have been doing for some months now, to great effect.

So, instead of ingesting PDFs, you ingest your Postgres records (with any associated metadata).

And now, assuming you have a process in place to retrieve relevant context documents from your vector store (embeddings), you have context about YOUR data to send to the AI model so it can answer questions.
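A rough sketch of that idea in Node (my assumption of how it could look, using the `openai` and `pg` packages and an invented `articles` table; in practice you would persist the vectors in a store such as Weaviate or pgvector rather than recompute them per request):

```ts
// Sketch of embedding-based retrieval over Postgres rows, then answering with that context.
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({ model: "text-embedding-ada-002", input: text });
  return res.data[0].embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export async function answerFromRecords(question: string): Promise<string | null> {
  // "Ingest": turn each record into text and embed it (normally done once, ahead of time).
  const { rows } = await pool.query("SELECT id, title, body FROM articles");
  const docs = await Promise.all(
    rows.map(async (r) => {
      const text = `${r.title}\n${r.body}`;
      return { text, vector: await embed(text) };
    })
  );

  // Retrieve the records most similar to the question.
  const qVector = await embed(question);
  const context = docs
    .sort((a, b) => cosine(b.vector, qVector) - cosine(a.vector, qVector))
    .slice(0, 3)
    .map((d) => d.text)
    .join("\n---\n");

  // Send the question plus the retrieved context to the chat model.
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return completion.choices[0].message.content;
}
```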

If you are actually trying to train the model with your data, that’s a horse of a different color. You may want to watch this video on Embedding vs. Finetuning to determine which method is best in your case: https://www.youtube.com/watch?v=9qq6HTr7Ocw&t=110s&ab_channel=DavidShapiro~AI

5 Likes

Hey @SomebodySysop, thanks for your answer. Can you tell me what programming language you are using in your code?

PHP. But only because I have 20+ years of experience working with it, so I was better off figuring out how to do this in PHP rather than learning Python. Remember that OpenAI and all the vector database services expose APIs, so theoretically you can use any programming language you feel comfortable in, or whatever language you use for your current infrastructure. My content infrastructure is the Drupal CMS, so for me PHP is a no-brainer.

@SomebodySysop can you share the structure of the code so I can translate it to Express.js?

A couple of months ago, I asked why anyone would ever need a 32K token context window. Well, to answer your question, I needed to turn to Claude and feed my 2300 lines of code into its 100K token context window. I would have preferred using GPT-4 Codex, but it taps out at 2K (despite an 8K token context). Here is the resulting outline of the code structure:

  • solrai_processObject($fieldArray, $document)

    • Main logic to process a Solr document, checking its type and handling accordingly. Calls solrai_getClassPropertyFields().
  • solrai_getClassPropertyFields($field, $document)

    • Constructs the Weaviate schema properties for a document. Calls helper functions based on doc type.
  • solrai_handleFileType($field, $document, …)

  • solrai_handleCommentType($field, $document, …)

  • solrai_handleNodeType($field, $document, …)

    • Handle the different data source types - file, comment, node.
  • solrai_vectorizeContent($docId, $site, …)

    • Sends the content to Weaviate to be vectorized. Handles errors.
  • solrai_summarizeContent($content)

    • Checks if content should be summarized. Calls solrai_summarizeGetContent().
  • solrai_summarizeGetContent($content)

  • solrai_getMasterSummary($content, …)

  • solrai_summarizeChunk($chunkContent, …)

    • Functions to summarize large content by chunking.
  • solrai_generateQuestions($text)

    • Generates questions for a given text using GPT-3.

The main flow is (a skeleton sketch follows the list):

  1. Get Solr objects into a queue
  2. Process each object
    • Handle based on type
    • Construct Weaviate properties
    • Vectorize content
  3. Summarize if applicable
  4. Generate questions if applicable
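For anyone porting this flow to Express/Node, a bare skeleton might look like the following; every function body is a placeholder, not a translation of the actual solrai_* code:

```ts
// Skeleton of the ingestion flow above, expressed in Node. All bodies are placeholders.
type SourceDoc = { id: string; type: "file" | "comment" | "node"; content: string };

async function fetchQueue(): Promise<SourceDoc[]> { return []; }                   // 1. get source objects
async function buildProperties(doc: SourceDoc): Promise<Record<string, unknown>> { // 2. per-type properties
  return { id: doc.id, type: doc.type };
}
async function vectorize(props: Record<string, unknown>): Promise<void> {}         // 2. send to vector store
async function summarize(doc: SourceDoc): Promise<string> { return doc.content; }  // 3. chunk + summarize
async function generateQuestions(text: string): Promise<string[]> { return []; }   // 4. ask the model

export async function runPipeline(): Promise<void> {
  for (const doc of await fetchQueue()) {
    await vectorize(await buildProperties(doc));
    if (doc.content.length > 4000) await summarize(doc); // threshold is illustrative
    await generateQuestions(doc.content);
  }
}
```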
2 Likes

Have you tried LlamaIndex? It connects your custom data sources to large language models.

1 Like