GPTs - best file format for Knowledge to feed GPTs?

I am building a GPT and am curious whether the format I upload the data in affects the quality. Should I send DOCX, TXT, PDF, or HTML?


I don’t have any hard data on this, and in theory it should all be the same, but I feel like if you upload in PDF format, it understands the hierarchical structure of the information (headings, sections, etc.) better than it does in other formats.
I haven’t tested this all that thoroughly, though.

I’d like to know too. I’m uploading in HTML and wonder if it’s being distracted by all the DOM metadata. I’m thinking of just rewriting the important content as JSON.
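One way to do that rewrite is to strip the DOM down to just headings and text before handing it to the GPT. Below is a minimal sketch using only Python’s standard-library `html.parser`; the JSON field names (`heading`, `text`) and the sample HTML are my own choices, not anything prescribed by GPTs.

```python
import json
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect heading/text pairs and drop scripts, styles, and attributes."""
    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_heading = False
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.sections.append({"heading": "", "text": ""})
        elif tag in ("script", "style"):
            self._skip = True  # ignore non-content markup entirely

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
        elif tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if self._skip or not self.sections or not data.strip():
            return
        key = "heading" if self._in_heading else "text"
        self.sections[-1][key] += data.strip() + " "

html_doc = "<h2>Refunds</h2><p>Allowed within 30 days.</p><script>var x = 1;</script>"
parser = SectionExtractor()
parser.feed(html_doc)
sections = [{k: v.strip() for k, v in s.items()} for s in parser.sections]
print(json.dumps(sections))
```

The result is a compact JSON list of sections with none of the DOM noise, which you can then upload as a Knowledge file.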

I am very interested in this question too! I suppose it depends on how the Knowledge module works behind the scenes. I suspect it just works as pure extra “textual” instructions, so I would say TXT…

Nope, each document type has different characteristics.

A plain text file will be parsed very fast and a .XLSX file very slow.

For .XLSX a code interpreter is needed; for .HTM it isn’t.

And .XLSX can trigger “structured data” handling, which the GPT can interpret “smartly,” where a .PDF won’t.

So it really depends on the semantics, taxonomy and context of your purpose.

It’s not “just” a filetype, since the filetype powers the knowledge.


I have been testing with about 250,000 dedicated, high-quality articles about one specific subject.

I tested these formats:

  • JSON
  • XLSX
  • PDF
  • HTM
  • TXT

Uploaded as a 750 MB text file, it couldn’t make any sense of complex questions.

Uploaded as a 150 MB Excel sheet, it was superior.

It was able to read the data, fetch the columns, parse data and dates, link its knowledge from one fact to another, etc…

But it was slowwwww, so I decided to host the SQL database myself and wrote a simple API with an endpoint on my server.
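The self-hosted pattern described above can be sketched in a few lines: a database queried by a small HTTP endpoint that the GPT calls as an Action. This is only an illustration of the approach under my own assumptions, using an in-memory SQLite table; the poster’s actual schema, endpoint path, and server are unknown.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs

# In-memory stand-in for the self-hosted database; the schema is an assumption.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE articles (date TEXT, location TEXT, subject TEXT, content TEXT)")
db.execute("INSERT INTO articles VALUES ('2023-11-12', 'NEW YORK', 'markets', 'Some text...')")

def search(subject):
    """Return matching rows as plain dicts, ready to serialize for the GPT Action."""
    rows = db.execute(
        "SELECT date, location, subject, content FROM articles WHERE subject = ?",
        (subject,),
    ).fetchall()
    return [dict(zip(("date", "location", "subject", "content"), r)) for r in rows]

class Endpoint(BaseHTTPRequestHandler):
    """Hypothetical endpoint, e.g. GET /search?subject=markets."""
    def do_GET(self):
        qs = parse_qs(urlparse(self.path).query)
        body = json.dumps(search(qs.get("subject", [""])[0])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

print(search("markets"))
```

Hooked up via an Action, the GPT only retrieves the rows it needs per question instead of parsing a 150 MB spreadsheet on every call, which is where the speed win comes from.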

You get what you feed it; structured data in = structured data out and plain text in = plain text out.


It must be using a slow library to open MS Excel files, and running that code for every call.

  1. Results : impressive
  2. Speed : not impressive

Presumably, a CSV-exported version would be fast.


Yes, that worked and was fast, but it was limited in terms of structured data and the way GPT associated rows/cells (comma-separated).

The Excel file was able to act as a wrapper for 250,000 articles; the .CSV topped out at something like 8,000 (GPT complained the context/content was too much).

So it works when you have “little” data only (a couple hundred articles, not tens of thousands).

If I have some PDFs that contain academic papers and I want to train the custom GPT on that knowledge, would you recommend Excel? So far, with the PDFs, it couldn’t make any sense of complex questions, like you said.

Depends on the structure of your data.

In my case the source of all knowledge was a SQL database with fields like “date / location / item / subject / source / etc…”.

This way GPT was able to “link” certain topics (named in the “subject” column) with “dates” and “locations”, giving interesting results.

But when it’s “just” an article, XLS doesn’t make much sense.

GPT tries to find the date and location for an article itself, e.g. when this is the format:

NEW YORK, 2023/11/12 - And some interesting line of text to fill up this fake item…

This way GPT tries to infer the {CITY} / {DATE} - {CONTENT} pattern by itself.
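Rather than hoping the model infers that dateline pattern, you can pre-parse it into explicit fields yourself before upload. A small regex sketch against the fake item above; the pattern is my own guess at the convention, so adjust it to your real data.

```python
import re

# Matches the "CITY, YYYY/MM/DD - content" dateline convention from the example.
DATELINE = re.compile(
    r"^(?P<city>[A-Z][A-Z .]+),\s*(?P<date>\d{4}/\d{2}/\d{2})\s*-\s*(?P<content>.+)$"
)

def parse_article(line):
    """Split a dateline-style article into city/date/content fields, if it matches."""
    m = DATELINE.match(line)
    return m.groupdict() if m else {"city": None, "date": None, "content": line}

item = parse_article(
    "NEW YORK, 2023/11/12 - And some interesting line of text to fill up this fake item…"
)
print(item["city"], item["date"])
```

The extracted fields can then go into dedicated columns (or JSON keys), which is exactly the kind of structure the poster found the GPT handles best.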

Just try out what works best for you; I uploaded many, many files and a GB of data… and deleted them all with a single mouse click (because XLS gave the best results, but in the slowest possible way).


I am noticing more hallucinated / made-up data when providing a data source that is a CSV. It seems to stick to the provided knowledge files when I share a TXT doc.

In both cases, I have the same instructions to not make up answers and to stick to the provided data source.

Curious if anyone else has seen something similar.

For (unstructured) retrieval, I think Markdown would be best. PDFs are a big “meh”: they can be unreliable and hard to read. It would be worth running the PDF through Code Interpreter first to see how easy it is to read. Last I checked, they were using a very basic parser.

Would love to see some tests.

For structured formats like CSV and JSON, I believe it would be better to use an API (Actions). I know we’re strictly discussing GPTs, but via the API I found GPT really good at creating GraphQL queries.
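For illustration, this is the kind of query I mean; the schema, field names, and arguments here are entirely invented, not from any real API.

```graphql
# Hypothetical schema: fetch articles on a subject, with the structured fields
# (date, location) that the GPT can then reason over.
query ArticlesBySubject($subject: String!) {
  articles(subject: $subject, first: 10) {
    date
    location
    subject
    content
  }
}
```

Because GraphQL lets the model request exactly the fields it needs, the responses stay small and structured, which fits the “structured data in = structured data out” point above.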

I have uploaded multiple PDFs, but the hit rate is super low. It’s even worse than my personal embedding experiment (a very simple vector-database setup with similarity search). So I’m wondering how GPT does its file processing.


Markdown syntax seems ideal for this type of thing because it gives semantic meaning to the text (headings, callouts, quotes, emphasis, etc.) without all the clutter of HTML. It’s also what powers the text rendering behind ChatGPT. What are your thoughts on Markdown? Has anyone done testing on it?

As for hallucinations, make sure you add to the system message that it should “not make it up if the answer can’t be found in the source material”. Be sure to also mention to “think step by step”. This has virtually eliminated hallucinations for me.


Agreed, I am much happier with the results when using Markdown. I also pay close attention to the code interpreter log at the end of each reply; it suggests that the amount of data actually analysed is always between 500 and 2,000 characters, so I make sure to keep my knowledge files below that.

And I forgot to say that I always ask ChatGPT to review, validate and reformat my Markdown to optimise future analysis.
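Part of that review step can be automated locally. As one example, a minimal check (my own heuristic, not any official validator) that heading levels never skip, e.g. an `#` followed directly by `###`, since flattened or jumbled hierarchy is exactly what weakens retrieval:

```python
import re

def heading_issues(md_text):
    """Flag headings that jump more than one level deeper than the previous one."""
    issues = []
    prev_level = 0
    for lineno, line in enumerate(md_text.splitlines(), start=1):
        m = re.match(r"^(#{1,6})\s", line)
        if not m:
            continue  # not a heading line
        level = len(m.group(1))
        if level > prev_level + 1:
            issues.append((lineno, f"h{prev_level} -> h{level} skips a level"))
        prev_level = level
    return issues

doc = "# Title\n### Jumped too deep\n## Fine\n"
print(heading_issues(doc))
```

Running this over a knowledge file before upload catches broken hierarchy without burning a ChatGPT round-trip on it.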

To add to this convo (and I wish we had a knowledgeable OpenAI employee to assist and chime in): when using a .zip, what is the added latency? And once uncompressed, does it stay that way for future use, with the files stored somewhere? I’m curious about the long-term GPT storage mechanics, and whether the .zip only needs a single first-use unzip.

I might add: does anyone know of a file or .zip size limit?

A note in advance: I usually speak and write German, not English, which will probably show in my use of the English language.

I tested the following formats:

  • PDF
  • XLSX
  • DOCX
  • RTF
  • XML
  • MD

In my experience, Markdown (MD) works best for structured text that also uses formal symbols, such as math equations. The reason for this is, admittedly, conjecture: the output format is similar, and (structured) training data is available in that format.

Whenever I ask for math assistance, Markdown with embedded or MathJax-rendered equations is used to render the output on the client side.

The syntax is also used to display, e.g., tables, headings at various hierarchical depths, and embedded graphics.
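To make the format concrete, a small Markdown fragment of the kind described (my own illustrative content, showing a heading, an embedded equation, and a table):

```markdown
## Quadratic formula

For $ax^2 + bx + c = 0$ with $a \neq 0$:

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

| symbol | meaning              |
| ------ | -------------------- |
| $a$    | leading coefficient  |
| $x$    | root of the equation |
```

The same file works as input knowledge and matches the notation the model emits, which is the similarity I am conjecturing about above.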

On the input side, I don’t know with certainty what methods are used to preprocess the data; hence the caveat above that the reasons are conjecture.

The problem at present is that the system has little built-in functionality to transform knowledge in, say, PDF into formats that produce higher-quality output faster.

For the example under consideration, with a small amount of knowledge data, what worked best was a service called Mathpix (free version): convert the PDF with its math expressions to MD, then manually paste the result of this preprocessing step (done outside the AI system) into the relevant input fields. But, to be blunt, this feels clumsy and inconvenient.

The idea originates from another problem that still awaits a solution: handling multiple PDFs. In roughly 30 tries, that only rarely worked without system failure. With Markdown as the only input format, the problem was absent even with knowledge distributed across multiple files.

To sum up, I distilled the following “guidelines” for my purposes:

  • Use Markdown in the multiple file case.
  • Use Markdown in the case of a single file that contains elements in languages other than the natural one.

Ideally, a conversion would be possible directly in the system, or an integration with outside services would be available. Furthermore, I am a bit skeptical that the product fulfils data-protection criteria, because at present it is possible to download the knowledge files using prompts or the code interpreter, if activated.

Best wishes