GPTs - best file format for Knowledge to feed GPTs?

I am building a GPT and am curious whether the format I upload the data in affects the quality. Should I send DOCX, TXT, PDF, or HTML?


I don’t have any hard data on this, and in theory it should all be the same, but I feel like if you upload in PDF format, it understands the hierarchical structure of the information (headings, sections, etc.) better than it does in other formats.
I haven’t tested this all that thoroughly, though.

I’d like to know too. I’m uploading in HTML and wonder if it’s being distracted by all the DOM metadata. I’m thinking of just rewriting the important content as JSON.
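One way to do that rewrite is to strip the DOM down to just headings and text before handing it to the GPT. Below is a minimal sketch using only Python’s standard-library `html.parser`; the JSON field names (`heading`, `text`) and the sample HTML are my own choices, not anything prescribed by GPTs.

```python
import json
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect heading/text pairs and drop scripts, styles, and attributes."""
    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_heading = False
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.sections.append({"heading": "", "text": ""})
        elif tag in ("script", "style"):
            self._skip = True  # ignore non-content markup entirely

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
        elif tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if self._skip or not self.sections or not data.strip():
            return
        key = "heading" if self._in_heading else "text"
        self.sections[-1][key] += data.strip() + " "

html_doc = "<h2>Refunds</h2><p>Allowed within 30 days.</p><script>var x = 1;</script>"
parser = SectionExtractor()
parser.feed(html_doc)
sections = [{k: v.strip() for k, v in s.items()} for s in parser.sections]
print(json.dumps(sections))
```

The result is a compact JSON list of sections with none of the DOM noise, which you can then upload as a Knowledge file.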

I am very interested in this question too! I suppose it depends on how the Knowledge module works behind the scenes. I suspect it just works as pure extra “textual” instructions, so I would say TXT…

Nope, each document type has different characteristics.

A plain text file will be parsed very fast and a .XLSX file very slow.

For .XLSX a code interpreter is needed; for .HTM it isn’t.

And .XLSX can trigger “structured data” handling, which the GPT can interpret “smartly,” where a .PDF won’t.

So it really depends on the semantics, taxonomy and context of your purpose.

It’s not “just” a filetype, since the filetype powers the knowledge.


I have been testing with about 250,000 dedicated, high-quality articles about one specific subject.

I tested these formats:

  • JSON
  • XLSX
  • PDF
  • HTM
  • TXT

Uploaded as a 750 MB text file, it couldn’t make any sense of complex questions.

Uploaded as a 150 MB Excel sheet, it was superior.

It was able to read the data, fetch the columns, parse data and dates, link its knowledge from one fact to another, etc…

But it was slowwwww, so I decided to host the SQL database myself and wrote a simple API with an endpoint on my server.
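The self-hosted pattern described above can be sketched in a few lines: a database queried by a small HTTP endpoint that the GPT calls as an Action. This is only an illustration of the approach under my own assumptions, using an in-memory SQLite table; the poster’s actual schema, endpoint path, and server are unknown.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs

# In-memory stand-in for the self-hosted database; the schema is an assumption.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE articles (date TEXT, location TEXT, subject TEXT, content TEXT)")
db.execute("INSERT INTO articles VALUES ('2023-11-12', 'NEW YORK', 'markets', 'Some text...')")

def search(subject):
    """Return matching rows as plain dicts, ready to serialize for the GPT Action."""
    rows = db.execute(
        "SELECT date, location, subject, content FROM articles WHERE subject = ?",
        (subject,),
    ).fetchall()
    return [dict(zip(("date", "location", "subject", "content"), r)) for r in rows]

class Endpoint(BaseHTTPRequestHandler):
    """Hypothetical endpoint, e.g. GET /search?subject=markets."""
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        body = json.dumps(search(qs.get("subject", [""])[0])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

print(search("markets"))
```

Hooked up via an Action, the GPT only retrieves the rows it needs per question instead of parsing a 150 MB spreadsheet on every call, which is where the speed win comes from.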

You get what you feed it; structured data in = structured data out and plain text in = plain text out.


It must be using a slow library to open MS Excel files, and running that code for every call.

  1. Results : impressive
  2. Speed : not impressive

Presumably, a CSV-exported version would be fast.


Yes, that worked and was fast, but it was limited in terms of structured data and the way GPT associated rows/cells (comma-separated).

The Excel file was able to act as a wrapper for 250,000 articles; the .CSV topped out at something like 8,000 (GPT complained the context/content was too much).

So it works when you have “little” data only (a couple hundred articles, not tens of thousands).

If I have some PDFs that contain academic papers and I want to train the custom GPT on that knowledge, would you recommend Excel? So far, with the PDFs, it couldn’t make any sense of complex questions, like you said.

Depends on the structure of your data.

In my case the source of all knowledge was a SQL database with fields like “date / location / item / subject / source / etc…”.

This way GPT was able to “link” certain topics (named in the “subject” column) with “dates” and “locations”, giving interesting results.

But when it’s “just” an article, XLS doesn’t make much sense.

GPT tries to find the date and location for an article itself, e.g. when this is the format:

NEW YORK, 2023/11/12 - And some interesting line of text to fill up this fake item…

This way GPT tries to infer the {CITY} / {DATE} - {CONTENT} pattern by itself.
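Rather than hoping the model infers that dateline pattern, you can pre-parse it into explicit fields yourself before upload. A small regex sketch against the fake item above; the pattern is my own guess at the convention, so adjust it to your real data.

```python
import re

# Matches the "CITY, YYYY/MM/DD - content" dateline convention from the example.
DATELINE = re.compile(
    r"^(?P<city>[A-Z][A-Z .]+),\s*(?P<date>\d{4}/\d{2}/\d{2})\s*-\s*(?P<content>.+)$"
)

def parse_article(line):
    """Split a dateline-style article into city/date/content fields, if it matches."""
    m = DATELINE.match(line)
    return m.groupdict() if m else {"city": None, "date": None, "content": line}

item = parse_article(
    "NEW YORK, 2023/11/12 - And some interesting line of text to fill up this fake item…"
)
print(item["city"], item["date"])
```

The extracted fields can then go into dedicated columns (or JSON keys), which is exactly the kind of structure the poster found the GPT handles best.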

Just try out what works best for you; I uploaded many, many files and a GB of data… and deleted them all with a single mouse click (because XLS gave the best results, but in the slowest possible way).


I am noticing more hallucinated / made-up data when providing a data source that is a CSV. It seems to stick to the provided knowledge files when I share a TXT doc.

In both cases, I have the same instructions to not make up answers and to stick to the provided data source.

Curious if anyone else has seen something similar.

For (unstructured) retrieval, I think Markdown would be best. PDFs are a big “meh”: they can be unreliable and hard to read. It would be worth running the PDF through Code Interpreter first to see how easy it is to read. Last I checked, they were using a very basic parser.

Would love to see some tests.

For structured formats like CSV and JSON, I believe it would be better to use an API (Actions). I know we’re strictly discussing GPTs, but via the API I found GPT really good at creating GraphQL queries.
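For illustration, this is the kind of query I mean; the schema, field names, and arguments here are entirely invented, not from any real API.

```graphql
# Hypothetical schema: fetch articles on a subject, with the structured fields
# (date, location) that the GPT can then reason over.
query ArticlesBySubject($subject: String!) {
  articles(subject: $subject, first: 10) {
    date
    location
    subject
    content
  }
}
```

Because GraphQL lets the model request exactly the fields it needs, the responses stay small and structured, which fits the “structured data in = structured data out” point above.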

I have uploaded multiple PDFs, but the hit rate is super low. It’s even worse than my personal embedding experiment (a very simple vector-database setup with similarity search). So I’m wondering how GPT does its file processing.


Markdown syntax seems ideal for this type of thing because it gives semantic meaning to the text (headings, callouts, quotes, emphasis, etc.) without all the clutter of HTML. It’s also what powers the text rendering behind ChatGPT. What are your thoughts on Markdown? Has anyone done testing on it?

As for hallucinations, make sure you add to the system message that it should “not make it up if the answer can’t be found in the source material”. Be sure to also mention to “think step by step”. This has virtually eliminated hallucinations for me.


Agreed, I am much happier with the results when using Markdown. I also pay close attention to the code interpreter log at the end of each reply; it suggests that the amount of data actually analysed is always between 500 and 2,000 characters, so I make sure to keep my knowledge files below that.

And I forgot to say that I always ask ChatGPT to review, validate and reformat my Markdown to optimise future analysis.
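Part of that review step can be automated locally. As one example, a minimal check (my own heuristic, not any official validator) that heading levels never skip, e.g. an `#` followed directly by `###`, since flattened or jumbled hierarchy is exactly what weakens retrieval:

```python
import re

def heading_issues(md_text):
    """Flag headings that jump more than one level deeper than the previous one."""
    issues = []
    prev_level = 0
    for lineno, line in enumerate(md_text.splitlines(), start=1):
        m = re.match(r"^(#{1,6})\s", line)
        if not m:
            continue  # not a heading line
        level = len(m.group(1))
        if level > prev_level + 1:
            issues.append((lineno, f"h{prev_level} -> h{level} skips a level"))
        prev_level = level
    return issues

doc = "# Title\n### Jumped too deep\n## Fine\n"
print(heading_issues(doc))
```

Running this over a knowledge file before upload catches broken hierarchy without burning a ChatGPT round-trip on it.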

To add to this convo (and I wish we had a knowledgeable OpenAI employee to assist and chime in): when using a .zip, what is the added latency? And once uncompressed, does it stay that way for future use, with the files stored somewhere? I’m curious about the long-term GPT storage mechanics, and whether the .zip only needs a single first-use unzip.

I might add: does anyone know of a file or .zip size limit?

A note in advance: I usually speak and write German, not English, which will probably show in my use of the English language.

I tested the following formats:

  • PDF
  • XLSX
  • DOCX
  • RTF
  • XML
  • MD

In my experience, Markdown (MD) works best for structured text that also uses formal symbols, such as math equations. The reason for this is, admittedly, conjecture: the output format is similar, and (structured) training data is available in that format.

Whenever I ask for math assistance, Markdown with embedded or MathJax-rendered equations is used to render the output on the client side.

The syntax is also used to display, e.g., tables, headings at various hierarchical depths, and embedded graphics.
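To make the format concrete, a small Markdown fragment of the kind described (my own illustrative content, showing a heading, an embedded equation, and a table):

```markdown
## Quadratic formula

For $ax^2 + bx + c = 0$ with $a \neq 0$:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

| symbol | meaning              |
| ------ | -------------------- |
| $a$    | leading coefficient  |
| $x$    | root of the equation |
```

The same file works as input knowledge and matches the notation the model emits, which is the similarity I am conjecturing about above.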

On the input side, I don’t know with certainty what methods are used to preprocess the data; hence the caveat above that the reasons are conjecture.

The problem at present is that the system has little built-in functionality to transform knowledge in, say, PDF into formats that produce higher-quality output faster.

For the example under consideration, with a small amount of knowledge data, what worked best was a service called Mathpix (free version): convert the PDF with its math expressions to MD, then manually paste the result of this preprocessing step (done outside the AI system) into the relevant input fields. But, to be blunt, this feels clumsy and inconvenient.

The idea originates from another problem that still awaits a solution: handling multiple PDFs. In roughly 30 tries, that only rarely worked without system failure. With Markdown as the only input format, the problem was absent even with knowledge distributed across multiple files.

To sum up, I distilled the following “guidelines” for my purposes:

  • Use Markdown in the multiple file case.
  • Use Markdown in the case of a single file that contains elements in languages other than the natural one.

Ideally, a conversion would be possible directly in the system, or an integration with outside services would be available. Furthermore, I am a bit skeptical that the product fulfils data-protection criteria, because at present it is possible to download the knowledge files using prompts or the code interpreter, if activated.

Best wishes