Question - Chatbot using your own data?

I want to create a chatbot using GPT-4 for my own data. I have 100 GB of data, and I want to develop a chatbot that can answer questions based on my own documents, files, Excel sheets, CSV, etc. I am a bit confused about where to start. Do I need to create a new chatbot using the LangChain module, or is it possible through the ChatGPT Enterprise/Plus subscription?

Welcome to the community!

That’s quite a bit of data, so you won’t be able to access it all at once. You’ll want to look at RAG (Retrieval Augmented Generation) and similar.

Or, if you’re not a coder, you can get a ChatGPT Plus subscription and upload files a few at a time…

The OpenAI Developer Quickstart guide is a great place to start…

Be sure to stick around and ask specific questions if you have them. Your question is a bit general - like asking, “I want to build a car. Where do I start?” :wink:

Again, welcome.


Yeah, I tried using the LangChain module and RAG. However, I have to upload the files every time before asking a question, and I've only done it locally. I wanted to know more.

Also, I’ve tried to build a chatbot using the Azure OpenAI service, but it is not up to the mark.
As per your example, my question is, while I know how to build a car and also how to sell it, I don’t want to create a new car by developing new parts separately myself.

I have built an open-source project for the same use case, i.e. chatting with your data using the OpenAI API. The code is available here: GitHub - Anil-matcha/ChatPDF (chat with any PDF: upload the documents you'd like to chat with, ask questions, extract information, and summarize documents with AI, sources included). It may be useful to you.

@matcha72, nice! But it is just a simple project implemented on your local system. Think about what happens if we have 100 GB of data, including Excel, CSV, text, PDF, PPTX, and DOC files. I cannot load my data every time, and there is a chance of model overfitting.

Do you have any idea about any facility provided by OpenAI to store our data, with the help of ChatGPT Enterprise?

I cannot answer your question, but I might be able to find someone who can.

It would be helpful to know whether you are currently a ChatGPT Enterprise customer, or would become one to meet this need.


I do not. But I’m interested to know who does.

I'm also interested to know why you'd rather try to squeeze that much data into GPTs or the Assistants API (both are mechanisms where OpenAI stores your data) rather than build it out yourself, using this methodology:

This may answer some of your questions:


If I need to be an Enterprise member in order to fulfill the idea, I would love to do that.

The project doesn't read the entire data set every time. The entire data set is stored as embeddings, and whenever you make a query, the relevant embeddings are matched to generate the answer.


Hi, the type of workflow you are describing sounds to me like it's best handled by embeddings. I've done this with lots of documents and can successfully interact with the data via chat.

What you need to do is take your 100 GB of data and break it up into "pages" of, say, roughly 1,000 characters each, breaking on spaces so you don't split words. Then get an embedding array for each page (you can call the OpenAI API to get the embedding array, or use a Python library to generate it locally, but you can't mix and match). Advance through your text by about 800 characters per page so that there's roughly a 200-character overlap between successive pages, and get another embedding array. Repeat until all your data is covered.

Once you've done this, take the prompt a user supplies and get an embedding vector for it too. Then run a cosine similarity function between the prompt's vector and all the vectors you have for your content. (I can supply this function if you want, but if you look online or ask GPT there's no shortage of sources for the mathematical formula; it's very simple.) Then take the top X results and include them in your prompt along with the user's original request. You'll be surprised how knowledgeable your AI will be.

It sounds like you have a mix of data, so for some of the data types, like Excel files, you may want to put the content into a database and provide functions to an assistant via the API to get relevant data back. It's a little harder in that area to give generic advice.
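As a rough sketch of the paging step described above (plain Python; the function name and defaults are mine, chosen to match the ~1,000-character pages with ~200-character overlap):

```python
def chunk_text(text, page_size=1000, overlap=200):
    """Split text into overlapping "pages", breaking on spaces.

    Each page is at most `page_size` characters; the next page starts
    `overlap` characters before the previous one ended, so successive
    pages share roughly 200 characters of context.
    """
    pages = []
    start = 0
    while start < len(text):
        end = min(start + page_size, len(text))
        # Back up to the last space so we don't cut a word in half.
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        pages.append(text[start:end])
        if end >= len(text):
            break
        # Advance, keeping `overlap` characters of the previous page.
        start = max(end - overlap, start + 1)
    return pages
```

Each page would then be sent for embedding individually; the overlap keeps sentences that straddle a page boundary retrievable from at least one page.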

This gives a brief but complete overview on how to call the API (openai’s API) to get the embedding for text:
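In case the guide moves, here is a minimal standard-library sketch of that call (it POSTs to OpenAI's public `/v1/embeddings` endpoint; the model name is one example, and the function itself is my own wrapper, not an official client):

```python
import json
import urllib.request

def get_embedding(text, api_key, model="text-embedding-ada-002"):
    """Send one text to OpenAI's /v1/embeddings endpoint and
    return the embedding vector as a list of floats."""
    request = urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=json.dumps({"model": model, "input": text}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # The response carries one result per input, each with an "embedding" array.
    return body["data"][0]["embedding"]
```

In practice you'd use the official `openai` Python package instead; the raw request is shown only to make the shape of the call explicit.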

For every text you send into that call, you'll get back an array of 1536 float values (the vector). You can use the vector to measure mathematically how relevant one piece of text is to another. For example, suppose I take the phrase "I have two dogs, Lucky and Clover" and get a vector for it, take another phrase "I live in a small house" and get a vector for that, and then you send in a prompt like "tell me about your pets" and get a vector for the prompt. When you do a cosine similarity, the prompt's vector will mathematically be more closely connected to the first phrase than to the second. Think of cosine similarity as a score that tells you how related two vectors are. Given two vectors, this function will compute the cosine similarity:

        public double CalculateCosineSimilarity(double[] vectorA, double[] vectorB)
        {
            if (vectorA.Length != vectorB.Length)
                throw new ArgumentException("Vectors must have the same length.");

            double dotProduct = 0.0;
            double magnitudeA = 0.0;
            double magnitudeB = 0.0;

            // Accumulate the dot product and squared magnitudes in one pass.
            for (int i = 0; i < vectorA.Length; i++)
            {
                dotProduct += vectorA[i] * vectorB[i];
                magnitudeA += Math.Pow(vectorA[i], 2);
                magnitudeB += Math.Pow(vectorB[i], 2);
            }

            magnitudeA = Math.Sqrt(magnitudeA);
            magnitudeB = Math.Sqrt(magnitudeB);

            if (magnitudeA == 0.0 || magnitudeB == 0.0)
                throw new ArgumentException("One or both input vectors have zero magnitude.");

            return dotProduct / (magnitudeA * magnitudeB);
        }
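If it helps, here is the same ranking step sketched in Python, with toy 3-dimensional vectors standing in for real 1536-dimensional embeddings (the phrases are from the example above; the numbers are made up purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Score how related two vectors are (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    if mag_a == 0.0 or mag_b == 0.0:
        raise ValueError("One or both input vectors have zero magnitude.")
    return dot / (mag_a * mag_b)

# Toy "embeddings" for two stored pages (real ones come from the API).
pages = {
    "I have two dogs, Lucky and Clover": [0.9, 0.1, 0.2],
    "I live in a small house": [0.1, 0.8, 0.3],
}
# Pretend embedding of the prompt "tell me about your pets".
prompt_vector = [0.85, 0.15, 0.25]

# Rank stored pages by similarity to the prompt; the top X go into the final prompt.
ranked = sorted(
    pages,
    key=lambda p: cosine_similarity(pages[p], prompt_vector),
    reverse=True,
)
```

With these toy numbers, the dog phrase scores highest, so it is what you would paste into the prompt alongside the user's question.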

Yes, this is possible, and I can help you.


A stack that is getting quite popular uses ChromaDB as the vector database, LangChain for chunking, and GPT-4 for generation.

I want to demo this in my organisation, but I know the questions will be about security.
Is it possible to develop a secure RAG pipeline without leaking info at any stage, as mentioned above? And what about data leaking through the API, as with ChatGPT?

Security is the main concern, but storage is also crucial. We have numerous documents, and uploading files repeatedly when starting the program is not practical.

Is the Assistants API for the same use cases? Did you check the Assistants API? But 100 GB is a lot of data, as the price for storage is $0.20/GB/day (about $20/day, or roughly $600/month, for 100 GB).

@k.shreekar-patra - did you get an answer about how to secure the data? I have done small projects with an open-source Llama model (entirely offline, without internet). However, I want to do the same with an OpenAI API key, but I am worried about data leaks. Please let me know how to protect data while using an OpenAI API key.

@SomebodySysop - as per OpenAI's privacy policy, they will store the data for 30 days since we are using an API key while working with our enterprise data. But do you know how to stop OpenAI from using this data to enhance their models? In an enterprise, data privacy is the main concern. Please let us know if you have any details on this.

I do not have any details on this. I believe in their privacy policy, they explicitly state that your data is not used to train their models.