Creating a Chatbot using the data stored in my huge database

Hello Everyone,

I want to build a custom chat bot which can answer questions based on the data in my databse
Below are my tries and the problems I am facing
I am open for all suggestions, so please do help me

  1. Tried without using langchain

The code establishes a connection to a PostgreSQL database and prompts the user for information they want to obtain.
It then generates an SQL query based on the (user input + the table names of the db) using the OpenAI GPT-3.5 language model.
The code extracts table names from the generated query and fetches column information from the connected database. It generates a prompt that includes the table names and column details, and uses the GPT-3.5 model to generate a final SQL query based on this prompt.
The final SQL query is executed on the database, and the results are printed. (currently )
Overall, the code utilizes natural language processing and database interactions to generate and execute SQL queries based on user input.

  1. Tried using Langchain
import os
from langchain import OpenAI, SQLDatabase
from langchain.chains import SQLDatabaseSequentialChain

os.environ['OPENAI_API_KEY'] = "****"

dburi = 'postgresql://postgres:****@****:****/****'
db = SQLDatabase.from_uri(dburi)

# llm = OpenAI(temperature=0, model='text-curie-001')
llm = OpenAI(temperature=0)
db_chain = SQLDatabaseSequentialChain(llm=llm, database=db, verbose=True)

resp = db_chain.run('what is my last po value for testaccount')
print(resp)

the problem I have Faced is that the prompt size is getting to 1,29,300+ Tokens some how
I am unable to figure it out why it is happening
I tried custom prompt templates also but that did not decrease the prompt size
What I felt is that they just add my custom prompt data to their prompt and send to the open ai api instead of just sending my custom prompt

So if any one can help me in any way around, pls do

  1. Other than these two methods I have seen that there is something known as fine tuning and embeddings
    I know how fine tuning is done using the Open AI but I don’t have much Idea of Embeddings
    I want to know which one is better to use
    As far as what I have re searched I came to know that in both cases we have to give all the database information to them
    which is not secure I think for my organization as my user privacy will be at risk

So finally what could be a better way to build a bot that can answer my questions based on the information in my db

3 Likes

@wfhbrian @bill.french
Can you guys help me in this ??

@ruby_coder @cliff.rosen
can you guys help me in this

“Tagging separately as a new user cannot tag more than two in one post”

I’m also facing a similar issue. I want to build a chatbot based on my data which is on Google Drive. I am getting similar issues with the prompt size being too large and even custom prompts templates not working. Would you mind sharing any progress you have made?

1 Like

I haven’t progressed much currently I am re_searching still
just asking you were trying to do using which method?

2 Likes

Have you watched the video noted in this post?

8 Likes

I watched just a bit Looks like It is similar to what I need
Thank you very much

2 Likes

I’m not much of an expert but this is

import json
import webbrowser
from flask import Flask, request
from flask_cors import CORS
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.discovery import build
from urllib.parse import urlparse
from urllib.parse import parse_qs
from collections import deque
import io
from PyPDF2 import PdfReader
import tiktoken
from qdrant_test import QdrantVectorStore
import openai

openai.api_key = #//my openai key*********"

SCOPES = ['https://www.googleapis.com/auth/drive']
client_secrets = #//my client secret*********
app = Flask(__name__)
CORS(app)


def get_folder_id_from_url(url: str):
    url_path = urlparse(url).path
    folder_id = url_path.split("/")[-1]
    return folder_id


def list_files_in_folder(service, folder_id):
    query = f"'{folder_id}' in parents"
    results = service.files().list(q=query, fields="nextPageToken, files(id, name, mimeType, webViewLink)").execute()
    items = results.get("files", [])
    return items


def download_pdf(service, file_id):
    request = service.files().get_media(fileId=file_id)
    file = io.BytesIO(request.execute())
    return file


def extract_pdf_text(pdf_file):
    reader = PdfReader(pdf_file)
    text = ''
    for page_num in range(len(reader.pages)):
        text += reader.pages[page_num].extract_text()
    return text


def chunk_tokens(document: str, token_limit: int = 200):
    tokenizer = tiktoken.get_encoding(
        "cl100k_base"
    )

    chunks = []
    tokens = tokenizer.encode(document, disallowed_special=())

    while tokens:
        chunk = tokens[:token_limit]
        chunk_text = tokenizer.decode(chunk)
        last_punctuation = max(
            chunk_text.rfind("."),
            chunk_text.rfind("?"),
            chunk_text.rfind("!"),
            chunk_text.rfind("\n"),
        )
        if last_punctuation != -1:
            chunk_text = chunk_text[: last_punctuation + 1]
        cleaned_text = chunk_text.replace("\n", " ").strip()
        # cleaned_text = re.sub(r'\W+', '', cleaned_text)

        if cleaned_text and (not cleaned_text.isspace()):
            chunks.append(
                {"text": cleaned_text}
            )

        tokens = tokens[len(tokenizer.encode(chunk_text, disallowed_special=())):]

    return chunks


def chatgpt_answer(question, context):
    prompt = f"""

        Use ONLY the context below to answer the question. If you do not know the answer, simply say I don't know.

        Context:
        {context}

        Question: {question}
        Answer:"""

    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a question answering chatbot"},
            {"role": "user", "content": prompt}
        ]
    )
    return res['choices'][0]['message']['content']


def get_documents_from_folder(service, folder_id):
    folders_to_process = deque([folder_id])
    documents = []

    while folders_to_process:
        current_folder = folders_to_process.popleft()
        items = list_files_in_folder(service, current_folder)

        for item in items:
            mime_type = item.get("mimeType", "")

            if mime_type == "application/vnd.google-apps.folder":
                folders_to_process.append(item["id"])
            elif mime_type in ["application/vnd.google-apps.document", "application/pdf"]:
                # Retrieve the full metadata for the file
                file_metadata = service.files().get(fileId=item["id"]).execute()
                mime_type = file_metadata.get("mimeType", "")

                if mime_type == "application/vnd.google-apps.document":
                    doc = service.files().export(fileId=item["id"], mimeType="text/plain").execute()
                    content = doc.decode("utf-8")
                elif mime_type == "application/pdf":
                    pdf_file = download_pdf(service, item["id"])
                    content = extract_pdf_text(pdf_file)

                if len(content) > 0:
                    documents.append(content)

    return documents


@app.route("/oauth/redirect", methods=['POST', 'GET'])
def redirect_callback():
    authorization_response = request.url
    print("authorization response: ", authorization_response)
    parsed_url = urlparse(authorization_response)
    auth_code = parse_qs(parsed_url.query)['code'][0]
    print("auth code: ", auth_code)

    flow = InstalledAppFlow.from_client_config(
        client_secrets,
        SCOPES,
        redirect_uri="http://127.0.0.1:5000/oauth/redirect"
    )

    flow.fetch_token(code=auth_code)
    credentials = flow.credentials
    credentials_string = credentials.to_json()
    with open("gdrive_credentials.txt", "w") as text_file:
        text_file.write(credentials_string)

    return "Google Drive Authorization Successful!"


@app.route("/authorize", methods=['GET'])
def authorize_google_drive():

    flow = InstalledAppFlow.from_client_config(
        client_secrets,
        SCOPES,
        redirect_uri="http://127.0.0.1:5000/oauth/redirect"
    )

    authorization_url, state = flow.authorization_url(prompt='consent')
    webbrowser.open(authorization_url)
    return authorization_url


@app.route("/load", methods=['POST'])
def load_docs_from_drive():
    data = request.json
    google_drive_folder_path = data.get('folder_path')
    if not google_drive_folder_path:
        return {"msg": "A folder path must be provided in order to load google drive documents"}

    with open('gdrive_credentials.txt') as f:
        line = f.readline()
    credentials_json = json.loads(line)

    creds = Credentials.from_authorized_user_info(
        credentials_json
    )

    if not creds.valid and creds.refresh_token:
        creds.refresh(Request())
        credentials_string = creds.to_json()
        with open("gdrive_credentials.txt", "w") as text_file:
            text_file.write(credentials_string)

    service = build('drive', 'v3', credentials=creds)

    folder_id = get_folder_id_from_url(google_drive_folder_path)

    documents = get_documents_from_folder(service, folder_id)

    chunks = []
    for doc in documents:
        document_chunks = chunk_tokens(doc)
        chunks.extend(document_chunks)

    vector_store = QdrantVectorStore(collection_name="google-drive-docs")
    vector_store.upsert_data(chunks)

    return "docs loaded"


@app.route("/query", methods=['POST'])
def query_knowledge_base():
    data = request.json
    query = data.get('query')
    vector_store = QdrantVectorStore(collection_name="google-drive-docs")
    results = vector_store.search(query)

    context = ""
    for entry in results:
        text = entry.get('text')
        context += text

    llm_answer = chatgpt_answer(query, context)
    print(llm_answer)
    return llm_answer


if __name__ == "__main__":
    app.run()


This is the code that I am running. What do you think? How can I improve it or make it better?

Ps: edited the code a bit but got it from a Github Repo by @htrivedi99

2 Likes

A better way to say thanks is to just click the heart image at the bottom of the reply.

This tells others that you liked it without reading another reply.
The heart also lets others know that

  • this is a useful reply.
  • which users are providing helpful information.
3 Likes

@abhi3hack, welcome to the community!

I’m slammed this week, but I will quickly mention this:

… using the data stored in my huge database.

I’m not an expert in LangChain, but I suspect it is configured to sustain the previous prompts.

When I see the term “huge”, it suggests you’ll need to consider how embeddings can help you meet your objectives. Your example code shows a general query against the data and mentions of postgres, so I’ll toss this into your mix -

Without a clear understanding of the project, its difficult to say how you should approach this or when has gotten your current approach off the rails.

3 Likes

Thanks @bill.french for the github link, I will look it into it.

My project which I want to work upon is what hubspot has already done but it is a bit more advanced, currently I just want to answer the user questions with the analytics data in my database(which is postgresql).
Later I also want to scale it the hubspot level
for example: their payments history, their invoices creation time and all you know just basically all the analytics data

If I use embeddings or fine tuning Do I have to share all of my data to open ai
In order to make a chatbot like hubspot has done ( because what I understood is that we have to make open ai the data available in some way so that it can write answers using that data )

Or Is there any other way where we can do it without sharing the data in my database

please refer this link to see how hubspot did it

Thanks for sharing your project @abhi3hack
You are likely going to run into this problem is a limitation of Large Language Model, so regardless of which tool you decided to use, we have to overcome the limitation of LLM. The issue you are running into is because you have close to 130K of tokens from you database results and that is too much tokens to fit into a prompt. GPT3.5 takes about 4K and the current GPT4 only support up to 8K.

Solution 1 - Manage the Input Length
Modify your SQL query to make sure you are only returning a small sample size using explicit query with condition and limit. For example, instead of using

select * from my_table

Try using query like this

select text from my_table where text like '%some_keyword%' limit 3

Solution 2 - Embedding Data Storage
Instead of using a relational database, you can consider migrating or duplicating a copy of your data to an embedding data storage system like a vector database. It’s a similar the first approach, but instead of using keyword matching, you can use semantic search which will give you a better results.

Solution 3 - Fine Tuning
If you have a lot of custom data and the first two solutions just doesn’t cut it. You can consider fine tuning a model with your own data. You will have to export your data to a more acceptable format like a json or csv file for fine tuning. If you are taking this approach, I would recommend using open source models to fine tune and you can fine most of them on Hugging Face. If you data is real time, keep in mind you are still probably going to need to use the first and second approach as well because you are not going to be fine tuning your model with the latest data all the time.

Conclusion
It’s a lot of work but this space is moving very fast now, so it will get easier down the line.
Currently I am working on a project called Superinsight that allows you to create a chatbot using any type of documents. We currently don’t support database integration yet, but if decided not to build your own, let me know, we can add postgres integration into our project.
You can learn more at https://docs.superinsight.me/

5 Likes

Thank you @nelson for responding
I have few queries in your response

Solution 1 - Manage the Input Length
In case of langchain I was facing the token issue, I am not aware of how can I modify a sql query in that case
The method in which I have done without langchain i.e by getting the sql queries from chat gpt by providing it the db info I don’t have the token issue, but yeah thank your for your advice I would definitely implement it

Solution 2 - Embedding Data Storage
I don’t have much idea of how this works can you please share relavant resources which could help me in doing this

@bill.french had shared me this link GitHub - pgvector/pgvector: Open-source vector similarity search for Postgres
Is it related to this only?

Just mentioning the main requirement is to answer user questions based on the analytics data in our db, further we would like to scale up to level of chatspot as have mentioned in my previous responses.

Solution 3 - Fine Tuning
In this method I have one major doubt i.e is it possible to fine tune a model without giving out my users information in the data base because I feel it becomes a privacy concern for users

my understanding of fine tuning is basically from this video: How to Fine Tune GPT3 | Beginner's Guide to Building Businesses w/ GPT-3 - YouTube
I don’t know if that is the exact way to do things, so please feel free to correct me where ever it is possible.

Looking forward to hear from you!!!

@curt.kennedy @alden and @RonaldGRuckus
can you guys help me in resolving this query??

Let’s be clear about the terminology - analytics data does not exist in your database. Only data exists there; you need to transform that data to create analytics. By definition, analytics are analyses of data.

If you want to use AGI to ask questions about analytics, you have to educate it with some analytics. It cannot magically transform all your data into answers that require potentially complex analysis and aggregations.

Here’s an elementary example I created for customer survey data. I’m suggesting this not as an approach to your specific requirements; but only as further understanding about the part you are missing in your quest to mimic ChatSpot.

2 Likes

Hey @bill.french
Firstly sorry for the terminology error I guess you understood it correctly anyways
Users data is there in data base if user asks question the chatbot should be able to answer it

I don’t have an account on impertinent can you say this much for me please:
To make this possible using fine-tuning or embeddings do we have to give the content in the database to it
or if we send a sample database that I use for testing which has a fake data, can it produce a same result (this sample db has same tables and columns as the original one)
If yes then which approach would be more good.

I think this is what localGPT tries to offer. I’ve found it in GitHub.

1 Like

The business requirement stated in this thread is only tangentially related to the localGPT project which was inspired by the original [privateGPT] (GitHub - imartinez/privateGPT: Interact with your documents using the power of GPT, 100% privately, no data leaks).

The element of data privacy is different from the entire AGI stack being air-gapped from the Internet. I don’t think @abhi3hack wants his solution cut off from the Internet, and any attempt to run whatever models he deems are best for the solution, probably does not (and should not) live on his local device.

This project is ideal for AGI applications that must run at the edge. @abhi3hack’s project is not that.

Firstly thank you @bill.french for giving me some clarity on these things

So This looks like embedding is the one I need, Even most of the online resources have suggested the same thing
But what do you mean by learner shots, sending sample QA in the prompts ??

I can do this but in this also you are saying " Export a small collection of 100 database records "
which is again giving the data to customgpt
In this way can I try using a sample database with fake data but same structure?

One main question I have currently about embedding is that
Our database is having mostly data which is either some kind of numerical value or some have json fields where we may have some text
Now in this case how can embedding actually help me
What I read from the online resources is that Embedding is used to check similarity in data using vectors dot product, like in case of documents where everything is in text it can do that task by checking the similarity
But wat about numerical data can it understand just by looking at the table names and column names ??

“Please correct me if my understanding of Embedding is wrong!!”

Regarding this what I actually need is a chatbot that can be used by my live users of my website to answer what ever queries they have or they can say to perform an action on the website
so just basically what currently chatspot is doing similar to that
If this is possible by using LocalGPT I will be okay with that
what do you think @castel will it help me in achieving this?

In the mean while I will also look at what privateGPT and LocalGPT does