Merge multiple files for custom GPTs

I’m trying to build a custom GPT with multiple PDFs. My problem is that I need not only the knowledge inside the documents; the names of those files are also critical for referencing.

I’ve extracted the text from those documents and merged it into one large “knowledge.txt”, and I tried to differentiate the original documents with this structure:

## {name of the document}


{document's content}


----------


## {name of another document}


{another document's content}


----------

I tried everything in the instructions (well, maybe not everything, because I haven’t solved it), but the GPT doesn’t return the name of the document correctly. It responds correctly and gives the title of the section inside the document where it found the info, but not the document’s name.

Any piece of advice?

I’m facing the same problem. Did you manage to solve it?

Hi @evandro.spadari :wave:

Welcome to the community!

Since @fjdelarubia faced the same problem, let us look at how we can solve this issue:

First:

We should combine the three documents into a single knowledge.txt file, marking where each document starts and ends. I created a hypothetical scenario with three documents of 1,000 words each.

This is the format for the knowledge.txt file:

### Start of Document: Financial Report 2023
[Content of the Financial Report 2023]
### End of Document: Financial Report 2023
---
### Start of Document: Marketing Strategy 2024
[Content of the Marketing Strategy 2024]
### End of Document: Marketing Strategy 2024
---
### Start of Document: Technical Analysis 2025
[Content of the Technical Analysis 2025]
### End of Document: Technical Analysis 2025
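The layout above can be generated with a script instead of being assembled by hand. Here is a minimal Python sketch, assuming the text has already been extracted from the PDFs into plain .txt files. The function names `build_knowledge` and `build_from_folder` are hypothetical, and using the file stem as the document name is just one possible choice.

```python
from pathlib import Path

# Marker templates copied from the knowledge.txt format shown above.
START = "### Start of Document: {name}"
END = "### End of Document: {name}"


def build_knowledge(documents: dict[str, str]) -> str:
    """Join already-extracted document texts into one delimited string."""
    blocks = []
    for name, text in documents.items():
        blocks.append(
            "\n".join([START.format(name=name), text.strip(), END.format(name=name)])
        )
    # Documents are separated by a line containing only '---'.
    return "\n---\n".join(blocks) + "\n"


def build_from_folder(folder: str, output: str = "knowledge.txt") -> None:
    """Hypothetical helper: merge every .txt file in `folder`.

    Assumption: the file stem (name without extension) is the document name.
    """
    paths = sorted(Path(folder).glob("*.txt"))
    documents = {p.stem: p.read_text(encoding="utf-8") for p in paths}
    Path(output).write_text(build_knowledge(documents), encoding="utf-8")
```

Calling `build_from_folder("extracted_texts")` would then write knowledge.txt into the current directory; the folder name here is only an example.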

Second:

We should write clear instructions for the GPT, stating that each document sits between “### Start of Document” and “### End of Document”.

system_message:
"""
You are “Polepole - Document Reference GPT,” and your primary role is to provide accurate and contextual information from a combined text file that contains multiple documents. Your task is to ensure that any information retrieved is correctly associated with its respective document content, even though the file does not use JSON or YAML format but is structured in a plain text format.

Core Responsibilities:

	1.	Document Content Retrieval:
	•	Recognize and distinguish between documents: The text file is organized with clear markers that indicate the start and end of each document. Your role is to accurately retrieve information from the correct sections of the text, ensuring that the response is relevant to the user’s query.
	•	Content Segmentation: Each document in the text file is separated by distinct markers such as “### Start of Document:” and “### End of Document:”. Use these markers to identify and retrieve content specific to each document.
	2.	Contextual Understanding:
	•	Synthesizing Information Across Documents: Some questions may require drawing on information from multiple documents within the text file. Be prepared to synthesize information from different sections of the file to provide a comprehensive and accurate response.
	•	Topic-Based Responses: While responding, focus on the topics mentioned in the user’s query, ensuring that the answer is derived from the appropriate sections of the text file.
	3.	Maintaining Accuracy:
	•	Avoiding Confusion: Ensure that the content retrieved and provided to the user does not mix up information from different documents unless the query explicitly requires it. Maintain clarity in distinguishing between the different sections of the text file.
	•	No Hallucination: Base your responses strictly on the content available in the text file. Avoid generating information that is not supported by the provided text.
	4.	Response Format:
	•	Clear and Concise: Provide clear, concise, and directly relevant responses to the user’s query. Make sure the responses are easy to understand and correctly reflect the content of the relevant document.
	•	Contextual Accuracy: When required, use contextual clues within the text to ensure that the information you provide is accurate and relevant to the specific document’s content.
	5.	Structured Text Handling:
	•	Text File Format: The knowledge base is provided in a plain text file (knowledge.txt). This file is structured with clear document markers, and your task is to navigate this structure efficiently to retrieve and provide accurate information.
	•	Markers for Navigation: Utilize the markers “### Start of Document:” and “### End of Document:” to correctly navigate and extract the relevant information for each query.

Final Note:

As Polepole - Document Reference GPT, your role is critical in ensuring that users receive accurate and contextually appropriate information from a combined text file. Your success lies in correctly interpreting the structure of the text file and providing responses that are clear, concise, and relevant to the user’s queries, drawing on the correct sections of the file.
"""

…and Voilààà!

This is how knowledge.txt looks:

This is how the outputs look, using the documents’ names:

Of course, HAKUNA MATATA!

I hope this is what you wanted and that it helps.

This is a solid workaround and the delimiter approach works well. One thing I’d add for anyone hitting this at scale: the manual assembly step (combining files into knowledge.txt, keeping the markers in sync when files change) gets painful fast if your knowledge base evolves.

I ran into the same problem and ended up creating a desktop tool called Riflet to handle the merging step. It lets you pull from multiple local folders, GitHub repos, PDFs, Word docs, whatever, and combines everything into a single text export. You can filter files out with right-click patterns before exporting, and it shows a live token estimate so you know if you’re about to blow past the file size limit before you upload.

The output is just a plain text file (.txt or .md), so it works perfectly with the delimiter structure you described.
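For anyone maintaining the file by hand, the "markers in sync" problem can at least be caught before uploading. Here is a rough Python sketch, assuming the exact “### Start of Document:” / “### End of Document:” markers from the earlier reply. `check_markers` is a hypothetical name and has nothing to do with how Riflet works internally.

```python
import re

# Patterns for the two marker lines used in knowledge.txt.
START_RE = re.compile(r"^### Start of Document: (.+)$")
END_RE = re.compile(r"^### End of Document: (.+)$")


def check_markers(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the markers are consistent."""
    problems = []
    open_name = None  # name of the document currently open, if any
    seen = set()      # document names already used
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        start = START_RE.match(line)
        end = END_RE.match(line)
        if start:
            name = start.group(1)
            if open_name is not None:
                problems.append(f"line {number}: '{name}' starts before '{open_name}' ends")
            if name in seen:
                problems.append(f"line {number}: duplicate document name '{name}'")
            seen.add(name)
            open_name = name
        elif end:
            name = end.group(1)
            if open_name != name:
                problems.append(
                    f"line {number}: end marker '{name}' does not match open document '{open_name}'"
                )
            open_name = None
    if open_name is not None:
        problems.append(f"'{open_name}' has no end marker")
    return problems
```

Running `check_markers(open("knowledge.txt", encoding="utf-8").read())` and printing the result gives a quick sanity check after each manual edit.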