I would like users in my company to be able to consult, through a chatbot and in natural language, the thousands of PDF, DOC and XLS files of offers that we have generated over more than 30 years, and I don’t know if this is possible.
I don’t know whether I need to train or build my own model, or whether I have to give all this data to OpenAI so it can incorporate it into its own.
The most likely way that you’ll want to achieve this is with embeddings.
OpenAI has its own embedding models, which I recommend for their ease of use.
It seems like you want to provide a conversational interface, like ChatGPT, to your users. The way this works is by querying your files via their embeddings and then providing the retrieved results to the conversational model before it answers the user’s question.
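To make that concrete, here is a minimal sketch of the retrieval step. It assumes the documents have already been converted to text and embedded (e.g. with an OpenAI embedding model); the corpus entries and the tiny three-dimensional vectors below are made up purely for illustration — real embeddings have hundreds or thousands of dimensions.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, corpus, k=2):
    """Return the k document snippets most similar to the query."""
    scored = sorted(
        corpus,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in scored[:k]]

# Hypothetical corpus: each entry is a chunk of one legacy document.
corpus = [
    {"text": "2019 pricing offer for widgets", "embedding": [0.9, 0.1, 0.0]},
    {"text": "HR vacation policy",             "embedding": [0.0, 0.2, 0.9]},
    {"text": "2021 widget discount terms",     "embedding": [0.8, 0.3, 0.1]},
]

# Pretend embedding of the question "what widget offers do we have?"
query_embedding = [0.85, 0.2, 0.05]
context = top_k(query_embedding, corpus)
# `context` is then prepended to the user's question in the prompt
# sent to the conversational model, which answers from that context.
```

At scale you would swap the linear scan for a vector database, but the principle — embed, rank by similarity, stuff the best matches into the prompt — stays the same.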
It’s possible. I often refer folks to @wfhbrian 's document about embeddings; it’s a really good place to start the journey. This approach appears to work well even at scale, a requirement you’ll have to work through given the volume of legacy information you need to make available to your users.
It’s a steep climb to build an embedding system, but it is fun if you have the perseverance to wade through many new concepts.
For a less technical route, you might want to take a few baby steps and play with some tools that will help you create a proof of concept. I just did one for my company in an attempt to see if we could outsource part of the process to CustomGPT. It’s not cheap, but you can experiment for free to get a feel. Once you build it from scratch, this service (and others like it) will seem cheap.
Using an off-the-shelf modern document system (Coda) and CustomGPT, I was able to build a pretty good chatbot for a small collection of FAQs. It supports about 300 different questions, whose responses were measured to be 76% perfect and 23% good.
I built this approach to automate and compress the time it takes for anyone to build and train a model focused on a specific topic. This parallels the high-level approach we decided to build for testing any LLM systems at Stream It. The content shown in the diagrams is all public FAQ stuff.
Content Authoring and Management - this is a unique data grid with exceptional text-editing features as well as a direct integration with OpenAI’s APIs for content writing, summarization, and keyword/entity extraction.
Build Training Data - exporting a PDF from the authoring and management layer is simple work. This will eventually connect with CustomGPT to perform the export transformation and project update automatically.
CustomGPT does its magic - we then capture the API key and model ID and push these values back into the Coda system so that inference testing will use the latest CustomGPT model that is deployed.
Inference Testing - we then test previous and some new queries to determine outcome performance. Each test is ranked and we use these metrics to gauge if the experience is suitable for production use.
Refining and Assessment Rankings - these help us identify where the corpus needs more work. Rinse → Repeat.
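The inference-testing and ranking steps above can be sketched in a few lines. This is my own simplification, not anything CustomGPT or Coda provides: each test query gets a human ranking ("perfect", "good", "poor"), and the aggregate percentages drive the production go/no-go call. The threshold values are assumptions for illustration.

```python
from collections import Counter

def assess(rankings, perfect_min=0.70, acceptable_min=0.95):
    """Aggregate per-query rankings into percentages and a go/no-go call."""
    counts = Counter(rankings)
    total = len(rankings)
    pct = {label: counts[label] / total for label in ("perfect", "good", "poor")}
    # Assumed policy: enough "perfect" answers, and almost nothing "poor".
    production_ready = (
        pct["perfect"] >= perfect_min
        and pct["perfect"] + pct["good"] >= acceptable_min
    )
    return pct, production_ready

# Example run: 76 perfect, 23 good, 1 poor out of 100 test queries,
# mirroring the FAQ numbers mentioned earlier in the thread.
rankings = ["perfect"] * 76 + ["good"] * 23 + ["poor"]
pct, ready = assess(rankings)
```

Queries that land in the "poor" bucket point at the parts of the corpus that need more work — that’s the Rinse → Repeat loop.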
Perhaps you can clarify something for me about CustomGPT. My understanding is that current methods do not allow for foundational models to absorb new knowledge without issues like catastrophic forgetting. As such, I don’t see how CustomGPT could create an LLM containing custom knowledge without redoing the pre-training process, which would surely be cost prohibitive. Yet, I believe you are suggesting this is exactly what they do, and their website seems to support that idea as well.
I would much appreciate it if you could shed some light on this. Just what does CustomGPT do in the way of creating custom models with new data?