I’m working on a custom GPT for some old legal docs. The documents contain lots of subsections, e.g., (5)(b)(ii) “…legalese yada yada yada…” I’m hoping for the benefit of your expertise:
- What file format (e.g., txt, pdf) is best here? I saw one post where there was discussion about converting content to JSON format, but it was inconclusive.
- Is there particular contextual labeling that would make each doc easier for the GPT to process?
- I need to combine some of the docs due to the 10-file limit. Any recommendations for distinguishing each one within a merged doc so the GPT can make the proper distinctions?
- Is it helpful to upload a separate overview doc that describes the contents of each of the other 9 files? And if so, how should it be structured?
- Any other suggestions for preparing the docs for optimal GPT processing/retrieval?
The GPT models have already seen a great deal of legal text and have probably encoded the layout fairly well, so any text format that preserves the tabs should work fine. Markdown-encoded text would seem to be a great way to do this.
Document labelling can simply be a document name at the top, in a recognised legal format.
Combining is fine; the model gets the data as a stream anyway.
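As a rough sketch of what a labelled, merged file could look like: the `merge_documents` name, the "Document N" heading, and the end marker are all my own assumptions for illustration, not a format the GPT requires.

```python
def merge_documents(docs):
    """Merge (title, body) pairs into one Markdown string.

    Each document gets a top-level heading with its name and an explicit
    end marker, so the boundaries stay visible after merging.
    The label format here is hypothetical; adapt it to your own naming.
    """
    parts = []
    for index, (title, body) in enumerate(docs, start=1):
        parts.append(
            f"# Document {index}: {title}\n\n"
            f"{body.strip()}\n\n"
            f"[End of Document {index}: {title}]"
        )
    # A horizontal rule between documents gives a second visual separator.
    return "\n\n---\n\n".join(parts) + "\n"


if __name__ == "__main__":
    merged = merge_documents([
        ("Example Act 1950", "(5)(b)(ii) ...legalese..."),
        ("Example Act 1962", "(2)(a) ...more legalese..."),
    ])
    print(merged)
```

The document titles in the usage example are placeholders. In practice you would read each file's text and pass in its real citation as the title.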
An overview document may help, but I have not done enough testing to see whether it would make a significant difference.
If you are planning on using assistants for the retrieval, you need to do nothing. If you are using your own embedding system and vector database, then you might look at some intelligent chunking to keep associated text together. For example, don’t break a sentence in the middle, or, if that is unavoidable, use overlap in your chunks so that some of the previous chunk and some of the next chunk are included in the current chunk. That way semantic meaning is not lost on chunk boundaries.
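To make the overlap idea concrete, here is a minimal sketch, assuming a naive sentence splitter and a character budget per chunk. The function names, `max_chars`, and `overlap_sentences` are hypothetical, and this variant only carries sentences forward from the previous chunk; real legal text with abbreviations and citations will need a smarter splitter.

```python
import re


def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ? or !
    # Abbreviations such as "No." or "v." will be split incorrectly.
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def chunk_with_overlap(text, max_chars=1000, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap_sentences` sentences of each chunk are repeated at
    the start of the next one, so context survives the chunk boundary.
    A chunk can exceed max_chars when a single sentence is very long.
    """
    chunks = []
    current = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next chunk.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "A one. B two. C three. D four."
    for chunk in chunk_with_overlap(sample, max_chars=14, overlap_sentences=1):
        print(chunk)
```

Each chunk would then be embedded and stored in the vector database. Splitting on subsection markers such as (5)(b)(ii) instead of sentences may suit legal documents better, but that is a design choice to test on your own data.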
Excuse me Foxabilo, I would like to learn how to do the aforementioned. Would you care to point me in the right direction to get started on this? I’ve built some custom GPTs but really want to advance to building this with a vector database, API calls, etc., and add it to my website. I’m still new to much of this and appreciate any advice.
Firstly, I’d look at the OpenAI documentation on platform.openai.com and also to the left on the forum, under RESOURCES and Documentation / API reference. You can also use YouTube, which has a good AI community now with plenty of helpful videos on the various topics mentioned here. If you need this for a commercial application, I’d suggest hiring a developer to assist, as some of the topics can be quite technical for a non-coder to take in one go.