How to train OpenAI on my own data from a database?

Hi guys, I'm trying to create a backend service in Node.js using OpenAI, where I want to let users ask questions about data stored in my Postgres database, but I can't find a way to do it. Can anyone suggest what to do in this case? I tried training OpenAI with a simple training doc, but I also need to get correct answers based on the data I have in the database. Thanks!

1 Like

Hi @john0239

You can pass your DB schema in a system message and instruct the assistant to generate Postgres queries for the user's questions using the Chat Completions API.
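For example, something like this (a minimal sketch, assuming the `openai` Node SDK; the schema here is invented for illustration):

```ts
// Minimal sketch: translate a user question into a Postgres query via chat completions.
// Assumes the `openai` npm package (v4) and OPENAI_API_KEY in the environment.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical schema for illustration; replace with your real tables and columns.
const schema = `
CREATE TABLE customers (id serial PRIMARY KEY, name text, email text);
CREATE TABLE orders (id serial PRIMARY KEY, customer_id int REFERENCES customers(id), total numeric, created_at timestamptz);
`;

export async function questionToSql(question: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You translate user questions into a single read-only Postgres SELECT statement. " +
          `Use only the tables and columns in this schema:\n${schema}\nReturn only the SQL, no explanation.`,
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```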

2 Likes

Hi @sps, thank you for your suggestion, but the problem is that my database is very complicated, and I can't trust OpenAI 100% to generate the queries, because it could select the wrong data instead of the correct data I need. That's why I asked whether there is any other way.

1 Like

If you do not trust the query that GPT generates, I suggest you use a query checker on your end to make sure the generated query is correct and will actually work with your data.

It might sound like a tricky piece of code, but it lets you check both that the query is semantically valid and that the tables and values it selects are the right ones.

Other than that, passing the DB schema when generating the query and including a few sample rows is probably the best way to get GPT to generate the queries, though there is still some chance it hallucinates.
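One cheap way to build such a checker (a sketch of one possible approach, assuming the `pg` package, not anyone's actual implementation): accept only a single SELECT statement, then let Postgres itself validate it with EXPLAIN before you ever run it.

```ts
// Sketch of a query checker: reject anything that isn't a single SELECT,
// then let Postgres validate tables, columns and syntax via EXPLAIN (which
// plans the query without executing it).
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function checkQuery(sql: string): Promise<{ ok: boolean; reason?: string }> {
  const trimmed = sql.trim().replace(/;+\s*$/, "");
  if (!/^select\b/i.test(trimmed)) {
    return { ok: false, reason: "only SELECT statements are allowed" };
  }
  if (trimmed.includes(";")) {
    return { ok: false, reason: "multiple statements are not allowed" };
  }
  try {
    await pool.query(`EXPLAIN ${trimmed}`);
    return { ok: true };
  } catch (err) {
    return { ok: false, reason: (err as Error).message };
  }
}
```

Giving the database role that the backend uses read-only permissions adds another layer of protection if a bad query slips through.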

2 Likes

To integrate OpenAI with your Node.js backend and answer user questions based on your Postgres database, follow these steps (a minimal sketch follows the list):

  1. Set up a Node.js backend.
  2. Use the OpenAI API to generate responses.
  3. Prepare input format for user questions.
  4. Process user queries and send them to the OpenAI API.
  5. Retrieve relevant data from your database based on the question.
  6. Post-process the OpenAI response.
  7. Combine database data with the language model’s output.

Remember, the language model won’t directly access your database; it generates responses based on its training on a vast dataset. Ensure user input security and privacy measures are in place. Training the model specifically on your database data requires significant resources and expertise in NLP and ML, so using pre-trained models like GPT-3 for language generation is recommended.
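A minimal sketch of that flow in an Express route (assuming the `express`, `openai`, and `pg` packages; the model name, schema, and route are illustrative):

```ts
// End-to-end sketch: user question -> model-generated SQL -> Postgres -> natural-language answer.
import express from "express";
import OpenAI from "openai";
import { Pool } from "pg";

const app = express();
app.use(express.json());

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Steps 1-3: backend, API client, and an input format; replace with your real schema.
const schema =
  "CREATE TABLE orders (id serial, customer_id int, total numeric, created_at timestamptz);";

app.post("/ask", async (req, res) => {
  const question: string = req.body.question;

  // Step 4: send the question plus schema to the OpenAI API to get a SQL query.
  const sqlCompletion = await openai.chat.completions.create({
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content: `Write one read-only Postgres SELECT for the user's question. Schema:\n${schema}\nReturn only SQL.`,
      },
      { role: "user", content: question },
    ],
  });
  const sql = sqlCompletion.choices[0].message.content ?? "";

  // Step 5: retrieve the relevant data. In production, validate the SQL first
  // (see the query-checker suggestion earlier in this thread).
  const { rows } = await pool.query(sql);

  // Steps 6-7: combine the rows with the question and let the model phrase the answer.
  const answerCompletion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "Answer the user's question using only the JSON rows provided." },
      { role: "user", content: `Question: ${question}\nRows: ${JSON.stringify(rows)}` },
    ],
  });

  res.json({ sql, answer: answerCompletion.choices[0].message.content });
});

app.listen(3000);
```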

The question here is, how? I've heard of the possible solution of passing the database schema to the LLM and letting it write the Postgres query, but I have also read that current LLMs are notoriously bad at writing good SQL.

Based on my experience with GPT-4 and even 3.5 Turbo, I would say that most open-source SQL query generators are okay, but GPT is the best of the lot.

For my use case, I pass the schema along with a short description of each column name and the data it stores, plus a few samples, and then hand it off to GPT to generate a query for me.

I also apply a few checks when the query comes back, using an AST-style parse to verify that it is correct.
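The code itself isn't shown here, but a rough Node equivalent of that kind of AST check could use a SQL parser library such as `node-sql-parser` (my choice of library is an assumption, and the table allow-list is hypothetical):

```ts
// Sketch of an AST-style check on a model-generated query.
// Library choice (node-sql-parser) and table names are assumptions for illustration.
import { Parser } from "node-sql-parser";

const parser = new Parser();
const ALLOWED_TABLES = new Set(["orders", "customers"]); // hypothetical allow-list

export function validateGeneratedSql(sql: string): boolean {
  let ast;
  try {
    ast = parser.astify(sql); // throws on syntax errors
  } catch {
    return false;
  }
  const statements = Array.isArray(ast) ? ast : [ast];
  // Accept exactly one SELECT statement.
  if (statements.length !== 1 || statements[0].type !== "select") return false;
  // Only allow tables we expect; tableList entries look like "select::null::orders".
  const tables = parser.tableList(sql);
  return tables.every((t) => ALLOWED_TABLES.has(t.split("::").pop() ?? ""));
}
```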

1 Like

@udm17 can you please share the part of the code where you're doing this?

@petergray3219 can you provide me an example of it (on the coding side)?

From your description, it sounds like you really want to use embeddings. This is my go-to video for any beginner on the subject: https://www.youtube.com/watch?v=Ix9WIZpArm0&ab_channel=Chatwithdata

Now, in your case, your data is in a Postgres database. That's similar to my use case, where my data is stored in an Apache Solr database. You simply create your embeddings directly from the database! At least, that is what I have been doing for some months now, to great effect.

So, instead of ingesting PDFs, you ingest your Postgres records (with any associated metadata).

And now, assuming you have a process in place to retrieve relevant context documents from your vector store (embeddings), you have context about YOUR data to send to the AI model so it can answer questions.
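A rough sketch of that idea in Node (my assumption of how it could look, using the `openai` and `pg` packages and an invented `articles` table; in practice you would persist the vectors in a store such as Weaviate or pgvector rather than recompute them per request):

```ts
// Sketch of embedding-based retrieval over Postgres rows, then answering with that context.
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({ model: "text-embedding-ada-002", input: text });
  return res.data[0].embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export async function answerFromRecords(question: string): Promise<string | null> {
  // "Ingest": turn each record into text and embed it (normally done once, ahead of time).
  const { rows } = await pool.query("SELECT id, title, body FROM articles");
  const docs = await Promise.all(
    rows.map(async (r) => {
      const text = `${r.title}\n${r.body}`;
      return { text, vector: await embed(text) };
    })
  );

  // Retrieve the records most similar to the question.
  const qVector = await embed(question);
  const context = docs
    .sort((a, b) => cosine(b.vector, qVector) - cosine(a.vector, qVector))
    .slice(0, 3)
    .map((d) => d.text)
    .join("\n---\n");

  // Send the question plus the retrieved context to the chat model.
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return completion.choices[0].message.content;
}
```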

If you are actually trying to train the model with your data, that’s a horse of a different color. You may want to watch this video on Embedding vs. Finetuning to determine which method is best in your case: https://www.youtube.com/watch?v=9qq6HTr7Ocw&t=110s&ab_channel=DavidShapiro~AI

5 Likes

Hey @SomebodySysop, thanks for your answer. Can you tell me what programming language you are using in your code?

PHP. But only because I have 20+ years of experience working with it, so I was better off figuring out how to do this in PHP rather than learning Python. Remember that OpenAI and all the vector database services expose APIs, so theoretically you can use any programming language you feel comfortable in, or whatever language you use for your current infrastructure. My content infrastructure is the Drupal CMS, so for me PHP is a no-brainer.

@SomebodySysop can you share the structure of the code so I can translate it to Express.js?

A couple of months ago, I asked why anyone would ever need a 32K token context window. Well, to answer your question, I needed to turn to Claude and feed my 2300 lines of code into its 100K token context window. I would have preferred using GPT-4 Codex, but it taps out at 2K (despite an 8K token context). Here is the resulting outline of the code structure:

  • solrai_processObject($fieldArray, $document)

    • Main logic to process a Solr document, checking its type and handling accordingly. Calls solrai_getClassPropertyFields().
  • solrai_getClassPropertyFields($field, $document)

    • Constructs the Weaviate schema properties for a document. Calls helper functions based on doc type.
  • solrai_handleFileType($field, $document, …)

  • solrai_handleCommentType($field, $document, …)

  • solrai_handleNodeType($field, $document, …)

    • Handle the different data source types - file, comment, node.
  • solrai_vectorizeContent($docId, $site, …)

    • Sends the content to Weaviate to be vectorized. Handles errors.
  • solrai_summarizeContent($content)

    • Checks if content should be summarized. Calls solrai_summarizeGetContent().
  • solrai_summarizeGetContent($content)

  • solrai_getMasterSummary($content, …)

  • solrai_summarizeChunk($chunkContent, …)

    • Functions to summarize large content by chunking.
  • solrai_generateQuestions($text)

    • Generates questions for a given text using GPT-3.

The main flow is (a skeleton sketch follows the list):

  1. Get Solr objects into a queue
  2. Process each object
    • Handle based on type
    • Construct Weaviate properties
    • Vectorize content
  3. Summarize if applicable
  4. Generate questions if applicable
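For anyone porting this flow to Express/Node, a bare skeleton might look like the following; every function body is a placeholder, not a translation of the actual solrai_* code:

```ts
// Skeleton of the ingestion flow above, expressed in Node. All bodies are placeholders.
type SourceDoc = { id: string; type: "file" | "comment" | "node"; content: string };

async function fetchQueue(): Promise<SourceDoc[]> { return []; }                   // 1. get source objects
async function buildProperties(doc: SourceDoc): Promise<Record<string, unknown>> { // 2. per-type properties
  return { id: doc.id, type: doc.type };
}
async function vectorize(props: Record<string, unknown>): Promise<void> {}         // 2. send to vector store
async function summarize(doc: SourceDoc): Promise<string> { return doc.content; }  // 3. chunk + summarize
async function generateQuestions(text: string): Promise<string[]> { return []; }   // 4. ask the model

export async function runPipeline(): Promise<void> {
  for (const doc of await fetchQueue()) {
    await vectorize(await buildProperties(doc));
    if (doc.content.length > 4000) await summarize(doc); // threshold is illustrative
    await generateQuestions(doc.content);
  }
}
```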
2 Likes

Have you tried LlamaIndex? It connects your custom data sources to large language models.

1 Like