GPTs - best file format for Knowledge to feed GPTs?

vmaucorps · November 20, 2023, 1:07pm

Don’t forget to check the “Code Interpreter” checkbox.
If you don’t, your GPT will probably work, but I’m not sure if it’s really checking the knowledge files.

I solved most of my issues (I’m using the CSV file format).

andrei3 · November 24, 2023, 4:58am

GPT can see the knowledge with code interpreter disabled.

vmaucorps · November 25, 2023, 8:37am

I checked and you’re right.

My problem was that I was using CSV files, which seem to require Code Interpreter.

ahoffman · December 5, 2023, 3:24pm

Foo-bar, davidthomasheider, OpenAI, et al.,

I am working with regulatory codes, trying to find the best file format, preparation, and cleaning steps to improve the quality and speed of knowledge files.

Regulatory codes are usually difficult for humans to understand, and finding related provisions is time-consuming. They are text-based and organized into chapters, sections, and subsections, with cross-references embedded in sentences throughout the code that point to other specific provisions.

As you can see, these intricacies make the code a complex web that’s difficult for the GPT or API to understand.

I’m thinking maybe the code needs to be formatted into an .xlsx file with the sections in one column and the text/sentences in the next column… It just seems like too much prep, and OpenAI might just have to increase capabilities on their end?

Any input on this matter would be greatly appreciated!!

Zakaria, my regulatory code is similar to academic papers but maybe harder to follow without the code references… or the code references make it harder?? IDK how the back end works but I would like to know so we can get this working better.
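
A minimal sketch of the two-column idea above, assuming each provision starts on its own line with a dotted number such as 1.2 and that cross-references are written as "Section 1.2". Both patterns, the function names, and the sample text are hypothetical and would need adjusting to the actual code. The sketch writes CSV rather than .xlsx because that needs only the standard library; the result can be opened in Excel and saved as .xlsx afterwards.

```python
import csv
import io
import re

# Assumption: a provision begins with a dotted number, e.g. "12.3.1 Text...".
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)$")
# Assumption: cross-references are written as "Section 12.3.1".
XREF_RE = re.compile(r"[Ss]ection\s+(\d+(?:\.\d+)*)")


def split_sections(code_text):
    """Return [section_number, text] rows; wrapped lines join the previous row."""
    rows = []
    for line in code_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SECTION_RE.match(line)
        if match:
            rows.append([match.group(1), match.group(2)])
        elif rows:
            rows[-1][1] += " " + line
    return rows


def to_csv(rows):
    """Write section, text, and the referenced section numbers as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["section", "text", "references"])
    for section, text in rows:
        writer.writerow([section, text, ";".join(XREF_RE.findall(text))])
    return buffer.getvalue()


sample = """1.1 Exits shall be marked.
1.2 Signs required by Section 1.1 shall be
illuminated.
"""
rows = split_sections(sample)
csv_text = to_csv(rows)
```

Pulling the references into their own column is one way to make the cross-links explicit instead of leaving them buried in sentences. A wrapped line that happens to start with a number would be misread as a new section, so real input would need a stricter pattern.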

zakaria · December 5, 2023, 4:53pm

Indeed, gaining more insight into the backend is essential. When I asked ChatGPT about its training and which formats are most effective, I learned that .xlsx files are generally preferred due to their structured nature. In contrast, PDF files can be cumbersome, especially when extracting information from tables, images, and graphs. You could try with the .xlsx files, but like you said, it requires a tremendous amount of work, so it’s counterproductive, imo.

I modified some PDFs to isolate only the essential information (but that is also too much prep), and this approach seemed to enhance the performance of custom GPT models. Also, the way prompts are configured plays a significant role. It’s often beneficial to set them up so that the model first searches through its knowledge database, sometimes even referencing the title of a specific PDF.

jrs · December 5, 2023, 5:37pm

Could you elaborate on this please? Are the articles hard coded into the spreadsheet or linked? If hard coded, what kind of file size are we looking at?

sstefan · December 13, 2023, 3:18pm

Markdown works for me, but only when saved as a .txt file

suiramdev · December 15, 2023, 3:21pm

Yes, it always seems to fail saying that it has encountered a problem accessing and reading the contents of .md files.

# Read the uploaded Markdown file from the Code Interpreter sandbox.
with open('/mnt/data/MyFile.md', 'r') as file:
    MyFile_content = file.read()

MyFile_content
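
One workaround consistent with the earlier report that Markdown works when saved as .txt is to copy the file under a .txt name before uploading it. This is only a sketch: the function name is hypothetical, and whether it avoids the error is based on the experience reported in this thread, not on documented behavior.

```python
import shutil
from pathlib import Path


def copy_markdown_as_txt(md_path):
    """Copy a Markdown file next to itself with a .txt extension; content is unchanged."""
    source = Path(md_path)
    target = source.with_suffix(".txt")
    shutil.copyfile(source, target)
    return target
```

Calling `copy_markdown_as_txt("MyFile.md")` would create `MyFile.txt` in the same folder, which can then be uploaded as the knowledge file.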

nfosman · December 21, 2023, 6:19pm

Yes. I had the same problem. Any explanation for this behavior?

sstefan · December 21, 2023, 6:53pm

No idea. Not too happy with the quality of retrieval anyway, so I’m working around it

IAdvisor · December 22, 2023, 9:17am

I’m currently building a GPT for my newsletter. I’m feeding the GPT .txt files (the conclusion from this topic) containing previous articles along with the sources that led to the result.

I have a lot of data, and I’m wondering whether it’s better to split it into one file per previous article or to make longer documents. Any idea whether this has any effect on the learning process of a GPT, guys?

andrei3 · December 22, 2023, 10:31am

If you have the ``` symbols or other unusual symbols in your MD file, it throws an error when uploading the file. Try to sanitize your MD file.
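
A minimal sketch of what such sanitizing could look like, assuming the triple-backtick fences and stray control characters are the culprits (which is this thread's guess, not a confirmed cause). The function name is hypothetical. Fenced code is rewritten as four-space-indented code, which Markdown also treats as a code block.

```python
FENCE = "`" * 3  # the triple-backtick marker, built this way to keep this example readable


def sanitize_markdown(text):
    """Replace fenced code blocks with indented ones and drop control characters."""
    cleaned_lines = []
    in_code = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # drop the fence line itself
            continue
        cleaned_lines.append("    " + line if in_code else line)
    cleaned = "\n".join(cleaned_lines)
    # Keep tabs, newlines, and printable characters (including non-ASCII letters).
    return "".join(ch for ch in cleaned if ch in "\t\n" or ch.isprintable())
```

The result still reads as Markdown, so headings and lists are preserved while the symbols suspected of breaking the upload are gone.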

TheHumanist · December 22, 2023, 6:01pm

When you say you are putting these in .xlsx form in an excel spreadsheet… are you putting actual articles with headings and such? How are you laying this out/formatting it in a .xlsx file?

habibulloxon · December 26, 2023, 12:40pm

Is JSON a good option?
If so, could you please share a good example structure for the JSON file? I have no idea how to train GPTs more effectively.

Thanks in advance.
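
For illustration only, one possible layout is a flat list with one record per article. Nothing in this thread confirms that any particular JSON structure is retrieved better, so the field names (`id`, `title`, `source`, `text`) and the helper below are assumptions, not a documented schema.

```python
import json

# Hypothetical field names; adjust to whatever the knowledge files actually contain.
REQUIRED_FIELDS = ("id", "title", "source", "text")


def build_knowledge_json(articles):
    """Check that every record has the same fields, then serialize the list."""
    for article in articles:
        missing = [field for field in REQUIRED_FIELDS if field not in article]
        if missing:
            raise ValueError(f"article {article.get('id', '?')} is missing {missing}")
    # ensure_ascii=False keeps accented characters readable in the file.
    return json.dumps(articles, indent=2, ensure_ascii=False)


example = [
    {
        "id": "issue-001",
        "title": "Example article title",
        "source": "Example source description",
        "text": "Full text of the article goes here.",
    }
]
knowledge_json = build_knowledge_json(example)
```

Keeping every record to the same small set of keys means the structure can be described in a sentence or two of GPT instructions.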

lli · December 29, 2023, 11:44am

I am exploring ways to handle this as well. My ideal scenario is that I could feed it technical documentation where it would be able to tell me where to find certain content. This is probably more suitable for other approaches involving some kind of indexing and search, but I’m interested in whether it would be possible to get both worlds for cases where you don’t even know what to search for and you’re only able to describe your requirement.

I have noticed that for structured formats like JSON and HTM/HTML, it may choose to either “look at it” or parse it via python. If it goes with the former option it seemingly has to scan over the whole thing, and starts talking about adjusting the increments parameter for its scroll tool to handle timeouts and precision.

I can explicitly tell it to parse the data using python, but then it needs to know exactly how it’s structured in advance. If I can fit this into the instructions it can be fast and precise, and looking at the debug output it appears to print sections of the data to itself before giving its response.

However, with this structured approach it loses the ability to search and traverse the file to discover unknown aspects. I’ve messed a little with teaching it how to list out properties first and using these results to drill further, making a separate ToC file, etc, but at this point the instructions get far too contrived and verbose.
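
The "list out properties first, then drill further" approach could be sketched roughly as below. This is an assumption about what such instructions might ask the GPT to run, not what it actually does internally; the function names and the sample document are hypothetical.

```python
import json


def outline(node, prefix="", max_depth=2):
    """List dotted paths and value types, descending max_depth levels of keys."""
    lines = []
    if max_depth == 0:
        return lines
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.append(f"{path}: {type(value).__name__}")
            lines.extend(outline(value, path, max_depth - 1))
    elif isinstance(node, list) and node:
        # Inspect only the first element as a representative; it does not use up depth.
        path = f"{prefix}.0" if prefix else "0"
        lines.extend(outline(node[0], path, max_depth))
    return lines


def drill(node, path):
    """Follow a dotted path produced by outline(); numeric parts index into lists."""
    for part in path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


doc = json.loads(
    '{"title": "API guide", "endpoints": [{"name": "login", "method": "POST"}]}'
)
paths = outline(doc)
value = drill(doc, "endpoints.0.method")
```

The first call gives a compact map of an unknown file, and the second fetches just the part of interest, so the structure does not have to be spelled out in the instructions in advance.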

Ultimately I am still seeing the best results by pasting the relevant documentation in the actual conversation, ideally in markdown. Other formats are usually fine, but like others have mentioned markdown is a good middle-ground between plain text and a semantic structure.

As an aside, I’ve noticed that information fetched via actions or knowledge is persisted “in the background” for the conversation. This is different from when it uses the browsing tool, where it forgets everything it saw and only has its own summary to go by after that point.

nickm · January 10, 2024, 11:40am

Hi all, after a ton of testing, I have found that uploading a PDF to ChatGPT and telling it to convert the PDF to Markdown syntax in a .txt file is what works best for me.

dupervalconsulting · January 14, 2024, 1:52am

Hi,

How do you do that in a GPT? Do you upload the PDF, ask it to convert to TXT, then upload the txt file to your GPT?

I tried that but it gives me a “summarized” version of the PDF, which is unusable, of course.

L

rich5 · January 14, 2024, 3:55pm

I’ve been wondering too… so I asked ChatGPT. Just some good pointers.

rich5 · January 14, 2024, 4:16pm

I cleaned it up a bit:

To effectively format ‘knowledge’ documents for GPT models like ChatGPT, you can integrate the following comprehensive strategies:

Clear and Concise Language:
- Use simple, direct language.
- Avoid complex sentences and technical jargon.
Structured Format:
- Organize content with headings, subheadings, bullet points, and numbered lists.
- This helps in breaking down and highlighting key information.
Contextual Information:
- Include enough context for the AI to understand the topic and content.
Avoid Ambiguity:
- Be specific and clear to prevent misunderstandings.
Data Accuracy:
- Ensure the information is current and correct.
Relevant Keywords:
- Use keywords effectively but avoid overstuffing.
Summary Sections:
- Summarize main points at the start or end, especially for lengthy documents.
Use of Examples:
- Provide examples to clarify complex concepts.
Regular Updates:
- Keep the document updated with the latest information.
Accessibility and Readability:
- Ensure the document is easy to read in terms of font and layout.

For GPT models specifically, consider these additional formatting tips:

File Formats:
- Prefer .XLSX for structured data due to its detailed interpretability, albeit slower processing.
- Use CSV for faster processing but enable Code Interpreter for effectiveness.
- Avoid PDFs for complex data.

dupervalconsulting · January 14, 2024, 6:01pm

Thanks. CSV for text seems… ill-advised. I converted some DOCX files to Markdown semi-manually (Pandoc) and queries on the same text give better results than the equivalent PDF file.

I agree that more documentation on that topic from the folks at OpenAI would be welcome. ChatGPT’s answer doesn’t look like a hallucination, but who knows. Maybe humans know better.

Thanks,

L
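
The Pandoc conversion mentioned above could be scripted roughly like this. It is a sketch under assumptions: the helper names are hypothetical, GitHub-flavored Markdown (`gfm`) is just one of Pandoc's Markdown output formats, and writing to a .txt extension follows the earlier reports in this thread, not anything Pandoc requires.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(docx_path):
    """Build the Pandoc command line for DOCX to Markdown saved with a .txt extension."""
    source = Path(docx_path)
    target = source.with_suffix(".txt")
    # The output format is set explicitly because the .txt extension does not imply Markdown.
    return ["pandoc", str(source), "--from", "docx", "--to", "gfm", "--output", str(target)]


def convert(docx_path):
    """Run the conversion; requires the pandoc executable to be installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    command = pandoc_command(docx_path)
    subprocess.run(command, check=True)
    return Path(command[-1])
```

The output usually still needs a manual pass for tables and images, which matches the "semi-manually" description above.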
