Multi-document comparison and Q/A

  1. Your metadata needs to be included in the cosine similarity search. In Weaviate, you can do this by specifying whether a class property is searchable or not. That way, you can simply add a document identifier as a property on each chunk so the model always knows which document any chunk it receives belongs to (rough sketch further down).

  2. Same as above. Metadata, aka class “properties” in Weaviate, should be searchable and returned with the embedding chunk to the model so it knows from which document the chunk originates.

  3. This is an embedding issue. See our conversation on “Semantic Chunking” and in particular this post: Using gpt-4 API to Semantically Chunk Documents - #72 by sergeliatko

You are probably using the “sliding window” method, thus losing important context in your embeddings. I don’t know what the “refine” methodology is, but if it’s a summarization technique, that could be a major part of your problem as well.
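
Going back to points 1 and 2: here is a minimal sketch of what that looks like with the Weaviate Python client (this assumes the v3-style client and a text vectorizer module; the `DocumentChunk` class, `source_document` property, and query text are placeholders you’d swap for your own):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # point this at your instance

# Each chunk carries a "source_document" property, so every vector stays
# tied to the document it came from.
client.schema.create_class({
    "class": "DocumentChunk",
    "vectorizer": "text2vec-openai",  # or whichever vectorizer module you run
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {
            "name": "source_document",
            "dataType": ["text"],
            "indexFilterable": True,   # usable in where-filters
            "indexSearchable": True,   # included in keyword/hybrid search
        },
    ],
})

# At query time, request the metadata alongside the chunk text so the model
# is told which document each retrieved chunk belongs to.
result = (
    client.query
    .get("DocumentChunk", ["content", "source_document"])
    .with_near_text({"concepts": ["termination clauses"]})
    .with_limit(5)
    .do()
)

for hit in result["data"]["Get"]["DocumentChunk"]:
    print(hit["source_document"], "->", hit["content"][:80])
```

With the source document returned next to each chunk, you can pass both to the model, and it can compare or attribute answers across documents.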

  1. The token windows have increased since this post (a year ago). GPT-4o now sports a 128K token window, while Gemini 1.5 Pro boasts a 1M token window. However, if you follow the guidelines for Semantic Chunking and restrict your chunks to a size that captures the “atomic ideas” present in the text, then token limits shouldn’t be a problem – or at least not as big a problem as they used to be.
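
As a rough illustration of that last point, here’s a small sketch that splits text on paragraph boundaries (a naive stand-in for real semantic chunking as discussed in the linked post) and checks each chunk against a token budget with tiktoken; the 512-token budget, the encoding, and the file name are just example values:

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-4 family; adjust if needed.
enc = tiktoken.get_encoding("cl100k_base")

MAX_CHUNK_TOKENS = 512  # example budget, tune it to your pipeline


def paragraph_chunks(text: str) -> list[str]:
    """Naive stand-in for semantic chunking: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def check_chunks(text: str) -> None:
    for i, chunk in enumerate(paragraph_chunks(text)):
        n_tokens = len(enc.encode(chunk))
        verdict = "OK" if n_tokens <= MAX_CHUNK_TOKENS else "too big, split further"
        print(f"chunk {i}: {n_tokens} tokens -> {verdict}")


with open("my_document.txt", encoding="utf-8") as f:
    check_chunks(f.read())
```

If a chunk blows past the budget, that’s usually a sign it contains more than one “atomic idea” and should be split again, rather than a reason to reach for a bigger context window.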

I know this was posted a year ago, but better late than never. It could help someone else in the same boat.
