My GPT - Knowledge base - Best practices

Hello,

I’m working on a custom GPT model focused on IT solutions and architecture. I’ve uploaded 20 PDF files, each over 300 pages long, but I’m seeing slow knowledge retrieval and errors, as reported in other discussions. Please help me with the following questions:

1. I’m looking for best practices for knowledge upload, like whether to convert PDFs to TXT and if including PDFs with images is advisable.

2. I’ve hit a roadblock where I can’t upload any additional files in individual conversations after the initial 20 in the knowledge of my chat. Is this a known limitation?

3. Can I compile multiple documents into one larger file to get around the 20-file limit, or would this affect output quality and organization?

Thanks for any insights!


We have no insight into the underlying mechanics of the retrieval process. Yes, there’s a limit, and yes, for whatever reason it seems OpenAI prefers that you send massive files instead of splitting them.

Generalized to typical RAG:

  • Images are ignored
  • PDF is not ideal as it’s meant more for printing and can do weird things to formatting
  • Markdown may be best as it’s clean and the formatting is explicit

For anything serious I wouldn’t recommend using retrieval. As an example, I uploaded a single car care guide for a specific model. My Custom GPT gave me incorrect torque values that, if followed, would have led to my brake caliper and tires coming loose on the road.

If you are serious, I would recommend building your own RAG system and either jumping into the API or using Actions to make calls to it. With Actions you can steer the RAG process, for example by calling specific areas (of dentistry, say) and shaping the query.
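To give a feel for what "your own RAG" means at its smallest, here is a rough sketch that splits a Markdown document into chunks at its headings and ranks them by keyword overlap with the query. This is not how OpenAI's retrieval works (that is unknown); the function names `split_markdown`, `tokenize` and `retrieve` are made up for illustration, and a real system would more likely use embeddings, BM25, or both.

```python
import re
from collections import Counter


def split_markdown(text):
    """Split Markdown into (heading, body) chunks at each heading line."""
    chunks, heading, body = [], "", []

    def flush():
        # Keep a chunk only if it has a heading or some non-blank body text.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, chunks, top_k=3):
    """Return up to top_k chunks, ranked by how often the query terms occur."""
    query_terms = set(tokenize(query))
    scored = []
    for heading, body in chunks:
        counts = Counter(tokenize(heading + " " + body))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, heading, body))
    # sort() is stable, so chunks with equal scores keep document order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(heading, body) for _, heading, body in scored[:top_k]]
```

An Action (or your own API endpoint) would then call something like `retrieve(question, chunks)` and pass only the returned chunks to the model, which keeps exact keywords such as part names in play instead of relying on semantic similarity alone.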


Thanks for the response. I will explore using my own RAG, but I am not that advanced yet.
So would you convert all the documents into a specific format and merge many into one?

Regarding your case: you can try adding the text below to the instructions and then trying again:
“Please respond factually, always relying on data from the knowledge base. Avoid generating responses without a factual basis, and if you cannot find relevant data, communicate that. Always include the confidence level of your responses, and when necessary, request more details for clarification.”


I actually don’t know where to find the current limits for GPTs. For Assistants each file has a limit of:

The maximum file size is 512 MB and no more than 2,000,000 tokens (computed automatically when you attach a file).

The massive file size limit to me would imply that yes, you want to merge them together if you are hitting the quantity limit.
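If you do go the merging route, a minimal sketch could look like the following, assuming the documents have already been converted to plain text or Markdown. The `merge_documents` name and the "Source: filename" heading are my own conventions, not anything OpenAI requires, and treating 512 MB as 512 × 1024 × 1024 bytes is an assumption. The 2,000,000-token limit is not checked here because that needs a tokenizer.

```python
from pathlib import Path

# 512 MB limit quoted above for Assistants; binary megabytes are an assumption.
MAX_FILE_BYTES = 512 * 1024 * 1024


def merge_documents(paths, out_path):
    """Concatenate text/Markdown files into one file, one heading per source."""
    sections = []
    for path in map(Path, paths):
        body = path.read_text(encoding="utf-8").strip()
        # Hypothetical convention: label each source so its origin stays visible.
        sections.append(f"# Source: {path.name}\n\n{body}\n")
    merged = "\n".join(sections)

    size = len(merged.encode("utf-8"))
    if size > MAX_FILE_BYTES:
        raise ValueError(f"merged file is {size} bytes, over the 512 MB limit")

    Path(out_path).write_text(merged, encoding="utf-8")
    return merged
```

Keeping a heading per source file is a guess at what helps organization after merging; whether it improves retrieval quality in GPTs is something you would have to test.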

On paper, adding that instruction sounds like a great idea, but there are a couple of issues:

  1. The confidence could be a hallucinated figure, as the RAG process is separate and unknown. Logprobs would be a more useful figure.
  • Don’t get me wrong, this can be useful. RAG is meant to augment the model’s knowledge, so this would be the equivalent of asking my nephew to grab me a 12mm crowfoot wrench from the top drawer, him grabbing as many wrenches as fit in his hand, and then me saying: “tell me how confident you are in something you clearly don’t know”. The answer can be a fabricated value that is not reliable enough.
  2. In my case torque values blend together and are only separable by keywords, which is why I wanted to try it out. Assuming that OpenAI is using a typical sliding context window ONLY for semantics (ignoring keywords), ALL the torque values would blend together, and they did. So whatever the underlying system returned would seem very plausible to a model that didn’t know any better, and it would falsely report that it’s highly confident.
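To make the logprobs point concrete: if you call a model through an API that returns per-token log probabilities (the Custom GPT interface does not show them), you can turn them into a number instead of asking the model to grade itself. A rough sketch, where `sequence_confidence` and its output keys are names I made up:

```python
import math


def sequence_confidence(token_logprobs):
    """Summarise per-token log probabilities of one generated answer."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    total = sum(token_logprobs)
    return {
        # Probability of the whole token sequence (shrinks with length).
        "joint_probability": math.exp(total),
        # Geometric mean per token, comparable across answer lengths.
        "mean_token_probability": math.exp(total / len(token_logprobs)),
        # The single least certain token, often the interesting one.
        "min_token_probability": math.exp(min(token_logprobs)),
    }
```

Keep in mind this measures how certain the model was about the tokens it produced, not whether the retrieved passage was the right one, so it does not fix the blended torque values problem by itself.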

I guess I could ask it to perform a “reality check”, where a very low torque value for a very strong bolt wouldn’t make sense, but in that case it makes more sense just to use a general torque guide instead.


Knowledge is a type of RAG, but that doesn’t mean it must be used as RAG.

We’re not even talking about RAG’s general hallucination issues, just the problem of forcing GPTs to stick to their built-in knowledge files. It’s a unique problem, and annoying enough on its own. If you haven’t run into it yet, you’re lucky.

Having a large limit doesn’t mean you have to fill it all the way up. With instructions and knowledge files on the order of 100,000 tokens, GPTs will not read everything. Additionally, how much of a document gets scanned depends on the instructions and the prompt at the time, and even smaller documents still require some scanning.

There is also a timeout for loading documents. If you want documents to be used faster, you need to make them easy to read, so that the GPT doesn’t have to convert data to text or work out ambiguous objects such as symbols or unusually shaped characters. Each of those is like an obstacle it has to get over. With well-managed file content and appropriate prompts, retrieval will be faster than before.

The rest you will have to test yourself, because it depends on the GPT’s files and implementation. You will get to know some of its problems along the way.

Research on the RAG glitch issue came out on January 11th. It may help you understand and avoid the problem, but that’s the general idea. This is not specific to OpenAI’s dynamic issues, which change daily due to backend operations.

For this, they suggested Markdown, which means you would upload a file structured with elements like Title, Heading 1, Heading 2, etc.

This type of structured data makes it more likely that the AI handles your tasks well.

Just my opinion, but you are trying to upload way too much information. You said you are not very experienced, and in my experience GPTs (which someone mentioned are a special case of AI) don’t really use the knowledge files unless you force them to. So I force it.

I am happy when it just reads one file and sticks to my instructions to read the file! :grinning:

Below are a couple of responses on this issue from when I was building a custom GPT to analyse some EPUB content.

When it comes to analyzing text data, the ease of analysis between EPUB and PDF formats can vary depending on the specific requirements of the analysis. Here’s a comparison of both formats in terms of ease of analysis:

  1. EPUB:

    • Structure: EPUB is an open e-book format that’s essentially a ZIP archive containing HTML files, images, and other resources. The text in EPUB files is stored in HTML, which can be more straightforward to parse programmatically.
    • Flexibility: Being based on HTML, EPUB is more flexible and reflowable, meaning it can adjust to different screen sizes. This can make extracting text easier, as it’s more standardized.
    • Tools and Libraries: There are many libraries available for parsing HTML and XML, making it easier to extract and analyze text from EPUB files.
  2. PDF:

    • Structure: PDF (Portable Document Format) is a fixed-layout flat document format. It encapsulates text, fonts, images, and 2D vector graphics. This format is designed to present content in a fixed layout, similar to a printed page.
    • Text Extraction Complexity: Extracting text from PDFs can be more complex than from EPUBs. PDFs can store text as images (in the case of scanned documents) or in complex layouts, which can require more advanced parsing techniques like Optical Character Recognition (OCR) to extract text.
    • Tools and Libraries: While there are tools and libraries available for reading and extracting information from PDFs (like Adobe Acrobat, PyPDF2, PDFMiner), the process can be more involved, especially with PDFs that have complex layouts or are not text-based (scanned images).

In summary, if the goal is to analyze text content, EPUB files are generally easier to work with due to their HTML-based structure and reflowable nature. PDF files, while universally used and consistent in presentation, can pose more challenges in text extraction, especially if the layout is complex or if the text is stored as images.

Yes, extracting text from EPUB files is feasible and can be done using various methods. EPUB files are essentially ZIP archives that contain HTML files along with other resources like stylesheets and images. To extract text from an EPUB file, you typically follow these steps:

  1. Unzip the EPUB File: Since an EPUB is a ZIP archive, the first step is to unzip it. You can use any standard unzipping tool or software for this.

  2. Locate the Text Content: After unzipping, you’ll find a collection of files including HTML files, which contain the actual text of the book. These files are usually organized in a directory structure.

  3. Parse the HTML Files: The text can be extracted by parsing these HTML files. This can be done using HTML parsing libraries available in many programming languages. For example, in Python, you can use libraries like BeautifulSoup or lxml to parse HTML and extract the text.

  4. Handle Encoding and Format: EPUB files may use different types of encoding or include text formatting (like bold, italics, etc.) and images. You might need to handle these aspects depending on your text analysis needs.

  5. Concatenate and Clean the Text: If the book is split across multiple HTML files, you’ll need to concatenate these to form the complete text. Additionally, you might need to clean the text by removing HTML tags, headers, footers, or any irrelevant data.

  6. Optional Steps for Advanced Analysis: If your analysis requires understanding the structure of the book (like chapters, sections, etc.), you may need to parse the EPUB’s table of contents file (usually an XML file) to map the text to its structure.

The ease of this process largely depends on the complexity of the EPUB file and the tools or programming languages you are comfortable with. Python, for instance, offers a comprehensive set of tools and libraries for this kind of text extraction and parsing.
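The steps above can be sketched with nothing but the Python standard library, which avoids depending on BeautifulSoup or lxml. This is a minimal sketch under some assumptions: the names `extract_epub_text` and `html_to_text` are made up, the content files are assumed to be UTF-8, and chapters are ordered by file name rather than by the EPUB’s table of contents or spine, which a more careful version would parse.

```python
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <head>, <script> and <style> content."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append("\n")  # end of a block element ends the line

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup):
    """Strip tags from one HTML/XHTML document and tidy the whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def extract_epub_text(epub_path):
    """Unzip the EPUB, parse each HTML file, and concatenate the text."""
    with zipfile.ZipFile(epub_path) as archive:
        # Assumption: sorting by file name approximates reading order.
        names = sorted(
            name for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        )
        chapters = [
            html_to_text(archive.read(name).decode("utf-8", errors="replace"))
            for name in names
        ]
    return "\n\n".join(chapter for chapter in chapters if chapter)
```

The resulting plain text can then be saved as a TXT or Markdown file and uploaded as knowledge instead of the original e-book.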


You would have to join multiple PDFs to fit within the 20-file limit. Or you can build an external RAG system and integrate it with your GPT via Actions.