Multi-document comparison and Q/A

I have run into a few problems while experimenting with GPT. Specifically, I have been trying to compare two or more documents using GPT’s capabilities.

To provide some context, I have two documents that contain tables, information, and other relevant content. Each document is longer than roughly 10,000 tokens. My goal is to perform a comprehensive comparison of these documents. To do this, I chunked both documents, created embeddings for each chunk, and stored them in an embeddings database (Chroma). I then queried the documents with LangChain’s “refine” chain type.

However, I have identified several challenges that I am currently facing:

  1. The retrieval process from the database does not clearly differentiate between the information sourced from Document 1 and Document 2. As a result, it becomes challenging to discern which part of the information pertains to each document, leading to confusion in the comparison.
  2. Although I have tried to extract the source document using various LangChain features, the information about a chunk’s origin only appears as metadata in the output. It is not carried into the context window, so the model cannot keep track of which information comes from which document.
  3. Retrieval from Chroma sometimes misses crucial information. Certain important details are not consistently retrieved during the comparison when using the “refine” approach.
  4. If I use a chain type other than refine, the request exceeds the token limit. I was not able to fit all the relevant retrieved chunks within the token limit.

In my efforts to overcome these challenges, I have explored alternative techniques such as analyzing document chains, conversation retrieval chains, and map-reduce, among others. Unfortunately, none of these approaches have yielded successful results.

At this stage, I am seeking the community’s guidance on how to approach multi-document comparison within the constraints of the GPT-3.5 API or the GPT-4 8K-token API. I would greatly appreciate any insights or suggestions that help overcome these obstacles and produce accurate, reliable comparisons.
Can this be achieved with the currently available OpenAI APIs?


I’m facing the same problem. I’m currently developing an AI about US law, and sometimes I want to look up a definition across two laws or pieces of legislation. The problem is that the context ends up mixing the articles of both laws.
For example, let’s say I want to find an article about human rights in Law 1 and Law 2. With similarity search I get the right documents and passages, but the context mixes the articles of Law 1 and Law 2 together. As a result GPT-3.5 hallucinates a lot: it says that Articles 1 and 2 are from Law 1, which is wrong because they are from Law 2.

Is it possible to add a header or footer to each of your chunks, something like “<chunk i from document XYZ>”? That way, whenever a chunk is retrieved, both you and your model can always tell which document it comes from.
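
A minimal sketch of that idea in Python, assuming the chunks are already plain strings. The function name `tag_chunks` and the sample texts are hypothetical, for illustration only, and are not part of any library:

```python
def tag_chunks(doc_name, chunks):
    """Prefix every chunk with a header that names its source document."""
    tagged = []
    for i, chunk in enumerate(chunks, start=1):
        # The header travels with the chunk text, so it is embedded,
        # retrieved, and shown to the model together with the content.
        header = f"<chunk {i} from document {doc_name}>"
        tagged.append(f"{header}\n{chunk}")
    return tagged


# Hypothetical sample data
law1_chunks = tag_chunks("Law 1", ["Article 1 text", "Article 2 text"])
law2_chunks = tag_chunks("Law 2", ["Article 1 text"])
```

Because the header is part of the chunk text itself rather than separate metadata, it stays visible in the context window no matter which chain type assembles the prompt.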


It is very possible to add metadata to chunks. You control them at the end of the day, and third-party chunking systems should let you add your own metadata headers to chunks.


Yes, I think it’s a good solution. For example, in my case, I could do something in the context like this:
title : “Law 1”,
content: “asdfasdf”

title : “Law 2”,
content: “asdfasdf”


So the title will come from the metadata and the content from the similarity search.
Since the title is the title property of the metadata, I could merge the content of all chunks that share the same title.
But I don’t know yet how to pass this context to the model; I still need to investigate a solution.
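
A rough sketch of the merge step in Python, assuming each search hit comes back as a (metadata, content) pair and that the metadata has a `title` key. The name `merge_by_title` and the sample hits are hypothetical:

```python
from collections import defaultdict


def merge_by_title(hits):
    """Group retrieved chunks by the title stored in their metadata.

    hits: list of (metadata_dict, content_string) pairs, in retrieval order.
    Returns one {"title", "content"} entry per distinct title.
    """
    grouped = defaultdict(list)
    for metadata, content in hits:
        grouped[metadata["title"]].append(content)
    # Dicts keep insertion order, so titles appear in first-seen order.
    return [
        {"title": title, "content": "\n".join(parts)}
        for title, parts in grouped.items()
    ]


# Hypothetical search results with interleaved sources
hits = [
    ({"title": "Law 1"}, "Article 3 text"),
    ({"title": "Law 2"}, "Article 1 text"),
    ({"title": "Law 1"}, "Article 7 text"),
]
merged = merge_by_title(hits)
```

Each entry in `merged` can then be written into the context as one block per law, instead of an interleaved list of chunks.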

I have managed to build the context format, but I’m still facing a problem. This is the context structure that I have created:
#####Start of context structure#####
Title: Title of the document
Content: Document content
#####End of context structure#####
Every document follows this structure:
###Beginning of document### \n
Content: ${doc.content} \n
###End of document###
So the qaTemplate looks like this:

Use the following context to answer the question at the end.
                       The structure that follows for you to identify it is the following:
                       #####Start of context structure#####
                       Title: Title of the document
                       Content: document content
                       #####End of context structure#####
                       The context is the following:
                       #####Start of context#####
                       #####Final context#####

                       #####Beginning of question#####
                       #####End of question#####
                       Please provide your answer below:

The problem I’m still facing is that articles from different laws get mixed up. For example, I ask for certain articles of Law 1 but I get articles from Law 2. I thought that dividing the context this way would be a good solution, but it was not.
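
For reference, here is a sketch in Python of how a prompt with this structure could be assembled. It rests on some assumptions: the template above does not show where the context and the question are inserted, so their positions here are a guess, the `Title:` line inside each document block is added for illustration, and the names `render_document` and `build_prompt` are hypothetical:

```python
def render_document(title, content):
    """Wrap one document in explicit begin/end markers."""
    return (
        "###Beginning of document###\n"
        f"Title: {title}\n"  # assumption: title included per document
        f"Content: {content}\n"
        "###End of document###"
    )


def build_prompt(documents, question):
    """Assemble the full QA prompt from grouped documents and a question."""
    context = "\n".join(
        render_document(doc["title"], doc["content"]) for doc in documents
    )
    return (
        "Use the following context to answer the question at the end.\n"
        "#####Start of context#####\n"
        f"{context}\n"
        "#####Final context#####\n"
        "#####Beginning of question#####\n"
        f"{question}\n"
        "#####End of question#####\n"
        "Please provide your answer below:"
    )


# Hypothetical sample input
docs = [
    {"title": "Law 1", "content": "Article 1 text"},
    {"title": "Law 2", "content": "Article 1 text"},
]
prompt = build_prompt(docs, "What does Article 1 of Law 1 say?")
```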

After switching to GPT-4, this approach now works flawlessly. GPT-4 is able to identify and separate the information from each document. However, there are challenges related to the maximum context size and the cost of using the model. When I tried the same approach with GPT-3.5-16k, the results were unsatisfactory, particularly when analyzing different files.


Have you guys tried document comparison offered by Langchain? Document Comparison | 🦜️🔗 Langchain
It will not solve all the problems, but can handle quite a few cases. I’m still struggling to answer questions like “What are the common clauses in these contracts?” with legal documents in the background. Simple factual comparisons work pretty well.


Unless you’ve got excellent prompting, I would not trust gpt-3.5 for legal texts. The last place you want hallucinations is in the law.

Semantic chunking:

Summary chunking:

And, also, you might try adding questions that the documents answer. Add them to the metadata, or the embeddings themselves. Something like this:

			// Construct the context document string with labeled elements
			$documentString = "Document Title: '{$documentTitle}'\n";
			$documentString .= "Document Content: {$contextDocument}\n";
			// Optionally include a summary of the source document
			if ($this->includeSummary === true) {
				$documentString .= "Source document summary: {$documentSummary}\n";
			}
			$documentString .= "Event Date: {$documentDate}\n";
			$documentString .= "Document Groups: {$documentGroups}\n";
			$documentString .= "Document Taxonomy/Tags: {$documentTaxonomy}\n";
			$documentString .= "URL: {$documentURL}\n";
			// Optionally include the questions that this document answers
			if ($this->includeQuestions === true) {
				$documentString .= "Questions that this document answers: {$documentQuestions}\n";
			}

So, if you generate questions that Law1 and Law2 answer, they should answer some of the same questions, which means the questions should strengthen the similarity between the two documents in your vector search.
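
To illustrate the effect, here is a toy sketch in Python. It uses a plain bag-of-words cosine similarity as a stand-in for a real embedding model, which is an assumption made only to keep the example self-contained; real embeddings behave differently in detail. The sample texts and the shared question are invented:

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy 'embedding': word counts of the lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two invented passages that share no vocabulary
law1 = "tenants may terminate a lease with thirty days notice"
law2 = "landlords must return deposits within two weeks"
# An invented question that both passages help answer
question = "what rights does a renter have when moving out"

plain = cosine(bag_of_words(law1), bag_of_words(law2))
augmented = cosine(
    bag_of_words(law1 + " " + question),
    bag_of_words(law2 + " " + question),
)
```

In this toy setup `augmented` comes out higher than `plain`, because the shared question gives the two texts common terms. The same intuition is what the generated questions are meant to exploit with real embeddings.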

And, speaking of vector search, you need a good vector engine. I’ve been getting very good results with Weaviate’s OpenAI text2vec transformer. I am working with regulatory docs as well.

Were you able to fix this issue? I am working on something similar and could use a few tips and tricks

  1. Your metadata needs to be included in the cosine similarity search. In Weaviate, you can do this by specifying whether a class property is searchable or not. This lets you simply add a classifier to the embeddings for each document, so that the model will always know which document any chunk it receives belongs to.

  2. Same as above. Metadata, aka class “properties” in Weaviate, should be searchable and returned with the embedding chunk to the model so it knows from which document the chunk originates.

  3. This is an embedding issue. See our conversation on “Semantic Chunking” and in particular this post: Using gpt-4 API to Semantically Chunk Documents - #72 by sergeliatko

You are probably using the “sliding window” method, thus losing important context in your embeddings. Don’t know what the “refine” methodology is, but if it’s a summarization technique, that could be a major part of your problem as well.

  4. The token windows have increased since this post (a year ago). GPT-4o now sports a 128K token window while Gemini 1.5 Pro boasts a 1M token window. However, if you follow the guidelines for Semantic Chunking and restrict your chunks to a size that captures the “atomic ideas” present in the text, then token limits shouldn’t be a problem, or at least not as big a problem as they were.
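
If you still need to stay under a hard limit, a simple budget check before building the prompt helps. Here is a minimal sketch in Python, assuming the chunks arrive already ranked by relevance. The whitespace word count is only a crude stand-in for a real tokenizer, and `approx_tokens` and `pack_chunks` are hypothetical names:

```python
def approx_tokens(text):
    """Crude token estimate: number of whitespace-separated words.

    Assumption: a real tokenizer will give different counts.
    """
    return len(text.split())


def pack_chunks(ranked_chunks, budget):
    """Select chunks in ranked order without exceeding the token budget.

    A chunk that does not fit is skipped, so a smaller chunk further
    down the ranking can still be included.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = approx_tokens(chunk)
        if used + cost > budget:
            continue
        selected.append(chunk)
        used += cost
    return selected, used


# Hypothetical ranked chunks and a tiny budget for demonstration
chunks = ["one two three", "four five six seven", "eight"]
selected, used = pack_chunks(chunks, budget=4)
```

Smaller, idea-sized chunks make this packing much less lossy, since skipping one chunk drops one idea rather than a large block of text.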

I know this was posted a year ago, but better late than never. It could help someone else in the same boat.