Multi-document comparison and Q/A

  1. Your metadata needs to be included in the cosine similarity search. In Weaviate, you can do this by specifying whether a class property is searchable or not. That way, you can simply add a document identifier as a property on each chunk so the model always knows which document any chunk it receives belongs to (rough sketch further down).

  2. Same as above. Metadata, aka class “properties” in Weaviate, should be searchable and returned with the embedding chunk to the model so it knows from which document the chunk originates.

  3. This is an embedding issue. See our conversation on “Semantic Chunking” and in particular this post: Using gpt-4 API to Semantically Chunk Documents - #72 by sergeliatko

You are probably using the “sliding window” method, thus losing important context in your embeddings. I don’t know what the “refine” methodology is, but if it’s a summarization technique, that could be a major part of your problem as well.
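
Going back to points 1 and 2: here is a minimal sketch of what that looks like with the Weaviate Python client (this assumes the v3-style client and a text vectorizer module; the `DocumentChunk` class, `source_document` property, and query text are placeholders you’d swap for your own):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # point this at your instance

# Each chunk carries a "source_document" property, so every vector stays
# tied to the document it came from.
client.schema.create_class({
    "class": "DocumentChunk",
    "vectorizer": "text2vec-openai",  # or whichever vectorizer module you run
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {
            "name": "source_document",
            "dataType": ["text"],
            "indexFilterable": True,   # usable in where-filters
            "indexSearchable": True,   # included in keyword/hybrid search
        },
    ],
})

# At query time, request the metadata alongside the chunk text so the model
# is told which document each retrieved chunk belongs to.
result = (
    client.query
    .get("DocumentChunk", ["content", "source_document"])
    .with_near_text({"concepts": ["termination clauses"]})
    .with_limit(5)
    .do()
)

for hit in result["data"]["Get"]["DocumentChunk"]:
    print(hit["source_document"], "->", hit["content"][:80])
```

With the source document returned next to each chunk, you can pass both to the model, and it can compare or attribute answers across documents.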

  1. The token windows have increased since this post (a year ago). GPT-4o now sports a 128K token window, while Gemini 1.5 Pro boasts a 1M token window. However, if you follow the guidelines for Semantic Chunking and restrict your chunks to a size that captures the “atomic ideas” present in the text, then token limits shouldn’t be a problem – or at least not as big a problem as they used to be.
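
As a rough illustration of that last point, here’s a small sketch that splits text on paragraph boundaries (a naive stand-in for real semantic chunking as discussed in the linked post) and checks each chunk against a token budget with tiktoken; the 512-token budget, the encoding, and the file name are just example values:

```python
import tiktoken

# cl100k_base is the encoding used by the GPT-4 family; adjust if needed.
enc = tiktoken.get_encoding("cl100k_base")

MAX_CHUNK_TOKENS = 512  # example budget, tune it to your pipeline


def paragraph_chunks(text: str) -> list[str]:
    """Naive stand-in for semantic chunking: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def check_chunks(text: str) -> None:
    for i, chunk in enumerate(paragraph_chunks(text)):
        n_tokens = len(enc.encode(chunk))
        verdict = "OK" if n_tokens <= MAX_CHUNK_TOKENS else "too big, split further"
        print(f"chunk {i}: {n_tokens} tokens -> {verdict}")


with open("my_document.txt", encoding="utf-8") as f:
    check_chunks(f.read())
```

If a chunk blows past the budget, that’s usually a sign it contains more than one “atomic idea” and should be split again, rather than a reason to reach for a bigger context window.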

I know this was posted a year ago, but better late than never. It could help someone else in the same boat.
