Optimal doc format/structure for custom gpt?

eric8 · November 26, 2023, 12:17am

Hi all,

I’m working on a custom GPT for some old legal docs. The documents contain lots of subsections, e.g., (5)(b)(ii) “…legalese yada yada yada…” I’m hoping for the benefit of your expertise:

What file format (e.g., txt, pdf) is best here? I saw one post that discussed converting content to JSON, but it was inconclusive.
Is there particular contextual labeling that would make each doc easier for the GPT to process?
I need to combine some of the docs because of the 10-file limit. Any recommendations for marking each one within a merged doc so the GPT can tell them apart?
Is it helpful to upload a separate overview doc that describes the contents of each of the other 9 files? And if so, how should it be structured?
Any other suggestions for preparing the docs for optimal GPT processing/retrieval?

Thanks,

Eric

Foxalabs · November 26, 2023, 12:25am

Hi Eric,

The GPT models have already seen a great deal of legal text and have probably encoded the layout fairly well, so any text format that preserves the tabs should work fine. Markdown-encoded text would seem to be a great way to do this.

Document labelling can simply be a document name at the top in a recognised legal format.

Combining is fine; the model gets streams of data anyway.
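As a rough illustration of labelling plus combining, here is a minimal Python sketch that merges several documents into one Markdown file, with each document's name as a heading at the top and a closing marker at the end. The `# DOCUMENT:` and `# END OF DOCUMENT:` labels and the `merge_documents` helper are assumptions made up for this example, not a format the GPT builder requires.

```python
def merge_documents(docs):
    """Merge (title, body) pairs into one Markdown string.

    Each document is wrapped in a heading and an end marker so the
    boundaries stay visible after merging. The label wording is a
    hypothetical convention, not an official one.
    """
    parts = []
    for title, body in docs:
        parts.append(
            f"# DOCUMENT: {title}\n\n"
            f"{body.strip()}\n\n"
            f"# END OF DOCUMENT: {title}"
        )
    # Blank line between documents keeps the Markdown headings well-formed.
    return "\n\n".join(parts)


if __name__ == "__main__":
    merged = merge_documents([
        ("Act A", "(5)(b)(ii) First clause text."),
        ("Act B", "(1)(a) Second clause text."),
    ])
    print(merged)
```

Whether the end marker helps retrieval is untested here; the heading alone may be enough.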

An overview document may help, but I have not done enough testing to see whether the difference would be significant.

If you are planning on using assistants for the retrieval, you need to do nothing. If you are using your own embedding system and vector database, then you might look at some intelligent chunking to keep associated text together: for example, don’t break a sentence in the middle. If that is unavoidable, use overlap in your chunks, so that some of the previous chunk and some of the next chunk are included in the current chunk and semantic meaning is not lost at chunk boundaries.
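To make the chunking idea concrete, here is a minimal Python sketch that splits text on sentence boundaries and repeats the last sentence of each chunk at the start of the next one. The function names, the `max_chars` and `overlap` parameters, and the regex splitter are all assumptions for illustration. The splitter is deliberately naive and would mis-split abbreviations such as "e.g." that are common in legal text, so treat it as a starting point only.

```python
import re


def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(sentences, max_chars=200, overlap=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of a chunk are repeated at the start
    of the next chunk. A sentence longer than max_chars still becomes
    its own chunk rather than being cut in the middle.
    """
    chunks = []
    start = 0
    while start < len(sentences):
        end = start
        length = 0
        # Always take one sentence, then keep adding while they fit.
        while end < len(sentences) and (
            end == start or length + len(sentences[end]) + 1 <= max_chars
        ):
            length += len(sentences[end]) + 1
            end += 1
        chunks.append(" ".join(sentences[start:end]))
        if end >= len(sentences):
            break
        # Step back for the overlap, but always advance by at least one.
        start = max(end - overlap, start + 1)
    return chunks


if __name__ == "__main__":
    text = "A one. B two. C three. D four."
    for chunk in chunk_sentences(split_sentences(text), max_chars=20):
        print(chunk)
```

Each chunk would then be embedded and stored in the vector database; that step depends on the embedding service and database chosen, so it is left out here.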

Smeg33 · November 26, 2023, 8:08am

Excuse me Foxabilo, I would like to learn how to do the aforementioned. Would you care to point me in the right direction to get started on this? I’ve built some custom GPTs but really want to advance to building this with a vector database, API calls, etc., and add it to my website. I’m still new to much of this and appreciate any advice.

Foxalabs · November 26, 2023, 10:42am

Firstly, I’d look at the OpenAI documentation on platform.openai.com, and also at the links on the left of the forum under RESOURCES (Documentation / API reference). You can also use YouTube, which has a good AI community now with plenty of helpful videos on the various topics mentioned here. If you need this for a commercial application, I’d suggest hiring a developer to assist, as some of the topics can be quite technical for a non-coder to take in at one go.
