Who has had success adding many and/or large documents to the 'Knowledge' section?

Hey builders,

I am trying to add a lot of documentation for a platform into one GPT to help me work better.

I’ve been gradually adding to a master Markdown document and periodically testing it in the GPT I am building. Recently I have reached a point where the GPT just errors and does not respond at all… :arrow_down:

I started to do some troubleshooting, and it is worth noting that I am working under the assumption that GPTs are the same as Assistants in the API, and that the ‘Knowledge’ part of GPTs is the same as the Assistants tool ‘Knowledge Retrieval’ in the API.

Knowledge retrieval (from docs)
Retrieval augments the Assistant with knowledge from outside its model, such as proprietary product information or documents provided by your users. Once a file is uploaded and passed to the Assistant, OpenAI will automatically chunk your documents, index and store the embeddings, and implement vector search to retrieve relevant content to answer user queries.

For anyone who doesn’t know about chunking, embedding and vector search: together they enable a type of search that people have been doing for a while with the OpenAI embedding models, called semantic search or similarity search.

So my assumption is that Knowledge in GPTs would chunk the documents we upload, embed them, and store those embeddings in a vector DB for the GPT to query in our chats.

The whole point of semantic search is to offload knowledge externally so that it doesn’t overload the GPT’s context window, which is now 128,000 tokens. This should let a GPT have a MASSIVE knowledge base that it can search (ideally iteratively), pulling only specific chunks of info into its actual context to answer our questions.
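To make the chunk, embed and search idea concrete, here is a minimal sketch of the general technique. It is not how OpenAI implements Knowledge: `toy_embed` is a hypothetical stand-in that uses simple word counts, whereas a real system would call a learned embedding model and store the vectors in a vector DB. The sample document is made up.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]


# Made-up document, split into small chunks for illustration.
document = (
    "Invoices are created from the billing page. "
    "Users can reset a password from the login screen. "
    "Reports can be exported as CSV from the dashboard."
)
chunks = chunk_text(document, max_words=8)
print(search("how do I reset my password", chunks, top_k=1))
```

Only the best-matching chunk would be handed to the model, which is why the full document never has to fit in the context window.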

This is where I feel my assumption should be right… but it’s either wrong or I am misunderstanding how OpenAI wants us to interact with Knowledge.

These are the tests I have run so far. (“Failed” means the same as the image above: no response.)

  • 136K-word Markdown : failed :x:
  • 122K-word Markdown : failed :x:
  • 110K-word Markdown : failed :x:
  • 100K-word Markdown : failed :x:

Then I tried it in another GPT just for testing, and it actually replied, but it never used knowledge retrieval… It actually couldn’t, which is crazy strange. It would only retrieve when Code Interpreter was turned on, but that was using code, not semantic search, to search the document…

Retrieval was working when I first started this project… have I been shadow banned or something with all my troubleshooting?

I am at a stage where I feel like the platform is being rather unreasonable with me, as a lot of this is nonsensical :rofl:

So I have many open questions right now, and I don’t expect anyone to answer them all. But here are a few:

  • Are people using Knowledge and is it working fine for you?
  • If yes, how many files, words, etc? File types?
  • Has anyone tried to push its limits? What’s the most files/words you have added to a knowledge base?

@openai Can someone get rate-limited in the GPT creation environment, or limited in any other way?

I’d love to hear from anyone about their GPT knowledge journey.

5 Likes

I’ve been having a very similar issue with mine. I was actually considering deleting it and starting from scratch, but haven’t gotten that far yet. The first day I was playing with it, I was able to get up to 10 documents loaded and saved. However, in an attempt to provide it with more information, I compiled the documents into two different CSV files and could not get either one to load without an error.

One thing that I have started doing is converting the knowledge into JSON and feeding it into the GPT builder. It has processed the data more efficiently and “updated” my GPT, but I still have not been able to get it to reliably output some of the knowledge I have provided.

¯_(ツ)_/¯

1 Like

Thanks for the reply. When you say ‘feeding it into the GPT builder’, do you mean uploading it, or actually telling the builder chatbot to update the knowledge with JSON?

I feel it’s the latter, so does that JSON get added as a file? If not, what do you think it’s doing? You could probably ask it if you don’t know.

Yeah, I’m just pasting it into the GPT builder chat and submitting. Unfortunately, it doesn’t get added as a file, but I have yet to try to “tell it” to add the prompt as a file… In theory, if you upload the file to the GPT builder (without error), it does get added to the Configure page as a file.

Will be testing and let you know, but I won’t be holding my breath.

Also, last night as a test I finally got fed up and tried creating a new GPT. Before even naming it I was able to upload 12 files… Got excited, deleted the other one, and now it’s telling me that I can’t publish publicly until I verify my domain. I re-registered and verified; all good on the DNS side, and even the OpenAI side says it’s been verified under Settings - Builder profile. But when I attempt to publish, it takes me in an endless loop about domain verification, go to builder, blah, blah, blah…

One step forward, two steps back …lol

Changes couldn't be published
Successfully verified

From what some other folks have tested, fewer than 20 pages seems to do fine. However, I find the GPT gets less reliable after too many evolutions/adjustments, compared with starting with the final instructions and final documents already in the Knowledge section from the beginning. Creating a test GPT to fine-tune things, then moving all of that, including the documents, to a fresh GPT might be best.

4 Likes

This has been exactly my experience! The more I edit, the more it seems to degrade or dilute the original knowledge it was given. The more documents I add or prompts I provide, the further down the list of priorities the original knowledge falls.

1 Like

We optimized the content of a CSV with 5,000 rows and up to 11 columns of data per row by structuring it as a JSON file. It has around 420,000 words, and it works fast and accurately. But we use it in the Playground of the Assistants, not the public version “My GPTs” as the dataset is private. So I am unsure if our example is relevant to your use case.

Two things that helped us get consistently good results:

  1. Created an index as part of the JSON file, so each entry has an id between 0 and 5,878
  2. In the chatbot instruction, we ask to look at all the indexed elements, from 0 to 5,878, before providing a response
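For anyone wondering what an indexed JSON file like this might look like, here is a rough sketch. It is an illustration, not the poster’s actual script: the column names, the sample rows and the `csv_to_indexed_json` helper are all hypothetical, and it assumes the CSV has no column already named `id`.

```python
import csv
import io
import json


def csv_to_indexed_json(csv_text):
    """Turn CSV rows into JSON objects, each tagged with a sequential id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: no CSV column is called "id", otherwise it would overwrite the index.
    entries = [{"id": i, **row} for i, row in enumerate(reader)]
    return {"count": len(entries), "entries": entries}


# Made-up sample data; a real file would be read from disk instead.
sample_csv = "name,category,price\nWidget,tools,9.99\nGadget,toys,19.50\n"
indexed = csv_to_indexed_json(sample_csv)
print(json.dumps(indexed, indent=2))

# The matching instruction names the full id range explicitly.
last_id = indexed["count"] - 1
instruction = f"Look at all indexed entries, from id 0 to {last_id}, before responding."
print(instruction)
```

The printed JSON is what would be uploaded as the knowledge file, and the instruction line is the kind of sentence that would go into the chatbot’s instructions.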
9 Likes

You, sir, are a legend. It’s working!

I’m developing GPTs and this method works. I took my Markdown, converted it to JSON using its headings, and it’s not erroring anymore. The JSON file is actually bigger than the Markdown file, but it must chunk the JSON properly compared to the Markdown.
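A minimal sketch of one way to do a heading-based Markdown to JSON conversion is below. It is an assumption about the approach, not the exact script used: the `markdown_to_sections` helper and the sample text are hypothetical, only `#`-style headings are recognised, and `#` lines inside fenced code blocks would be misread as headings.

```python
import json
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_to_sections(markdown_text):
    """Split Markdown into one record per heading, with the body text under it."""
    sections = []
    heading, level, body = "", 0, []

    def flush():
        # Store the section collected so far, skipping empty preambles.
        text = "\n".join(body).strip()
        if heading or text:
            sections.append(
                {"id": len(sections), "level": level, "heading": heading, "content": text}
            )

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, level, body = match.group(2).strip(), len(match.group(1)), []
        else:
            body.append(line)
    flush()
    return sections


# Made-up sample document.
sample = "# Platform docs\nIntro text.\n\n## Billing\nInvoices live here.\n\n## Users\nPassword resets.\n"
sections = markdown_to_sections(sample)
print(json.dumps(sections, indent=2))
```

Each record carries its own `id`, so the same file also works with the indexing tip above.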

1 Like

As for your optimisation tips:

Doesn’t making it check all the IDs overload its context window?

Also, do you have code interpreter on?

I am glad it worked!

The code interpreter is OFF.

For our specific application, which looks at all the values for every single query, it makes sense to force the agent to look at all IDs. When we did not do that, we noticed the agent was not parsing beyond id 4,100 or so.

In another application where we compare documents, we sometimes get errors unless we tell the agent to look at all the pages, so this was the hint for us.

Hi. I’m trying to do something similar, but have very little coding or programming experience. When you say you converted the CSV into JSON, how did you get that into the GPT? Could you show a screenshot of how you indexed the information, as an example?

Thanks for the good input. So far I have had only partial success adding documents to the Knowledge section.

Right now my problem is how to load relational-like data, including various contextual information about the data, e.g. customer feedback. I want to use the Assistants API to ask questions about the data and the added text.

My first problem is that I cannot produce simple lists. Some items are missing.

A simplified, fictional example: I would like to load information about products, customers and sales. I have these in three Excel files. In addition, I have text files with customer feedback.

I want to ask questions like: list all customers, list all products… list products bought by customer x, etc. Prepare a report with issues and ideas for improvement, etc.

Issues:
  • Not all products show up when I ask “List all products”. Is there a maximum amount of text GPT takes into account? Should I provide context as suggested by ai4sp? …

How should I organise the data: one JSON file or many?
Please help me with how best to organise the data: nested JSON, three separate JSON files, Markdown, Word, etc.

This seems like something that would leverage Code Interpreter really well. Code Interpreter could be used to query your CSVs directly, which wouldn’t use anywhere near as much context as relying on the knowledge base alone.

So make sure to switch on Code Interpreter and see if it creates those lists for you much more accurately.
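As a rough idea of the kind of code that could answer those questions, here is a small sketch. Everything in it is hypothetical: the column names, the sample rows and the helper functions are made up, and the Excel files are assumed to have been exported to CSV first. What Code Interpreter actually writes will differ.

```python
import csv
import io

# Made-up stand-ins for the three files (products, customers, sales).
products_csv = "product_id,name\n1,Widget\n2,Gadget\n3,Gizmo\n"
customers_csv = "customer_id,name\n10,Alice\n11,Bob\n"
sales_csv = "customer_id,product_id\n10,1\n10,3\n11,2\n"


def load(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


products = load(products_csv)
customers = load(customers_csv)
sales = load(sales_csv)


def list_all(rows):
    """Return every name in a table, so no item can be silently dropped."""
    return [row["name"] for row in rows]


def products_bought_by(customer_name):
    """Join customers -> sales -> products for one customer name."""
    customer_ids = {c["customer_id"] for c in customers if c["name"] == customer_name}
    product_ids = {s["product_id"] for s in sales if s["customer_id"] in customer_ids}
    return [p["name"] for p in products if p["product_id"] in product_ids]


print(list_all(products))
print(products_bought_by("Alice"))
```

Because the code reads every row, a “list all products” style question does not depend on which chunks retrieval happens to return.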

Did you try zipping it and uploading that? It worked for me.