Perhaps I used the wrong phrase, “overlap”; some topics are described in more than one document. However, if two or more documents share the same topic, one of the documents will always cover that topic in depth, whereas the other documents are broader but also mention a few key points on that specific topic. Would that still cause a problem in your opinion?
As for your questions, here are my replies:
The average token size is about 8000. Since the concatenated document is about 1200 normal pages, it is of course split into x number of smaller chunks and fed into the embedding model.
I use text-embedding-ada-002
Yes, I get great results on the smaller documents, but when I use the concatenated document I get awful results
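The chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual pipeline: it approximates token counts from word counts (a real pipeline would use a tokenizer such as tiktoken for text-embedding-ada-002), and the `split_into_chunks` helper name is made up for this example.

```python
# A minimal sketch of splitting a long document into chunks small enough
# for an embedding model. Token counts are approximated here as
# words * 4/3; use a real tokenizer for production work.

def split_into_chunks(text, max_tokens=8000):
    """Split text into chunks of roughly max_tokens tokens each."""
    words = text.split()
    max_words = int(max_tokens * 3 / 4)  # rough words-per-token heuristic
    chunks = []
    for i in range(0, len(words), max_words):
        chunks.append(" ".join(words[i:i + max_words]))
    return chunks

# Example: a ~20000-word document split at ~8000 tokens (~6000 words) per chunk
chunks = split_into_chunks("word " * 20000, max_tokens=8000)
print(len(chunks))  # 4 chunks
```

Each chunk would then be embedded separately, which is exactly why a single retrieved chunk can lose the context of the surrounding document.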
Just to make sure I understood your question:
You have a long text, and in order to embed it you need to split it into multiple segments, resulting in multiple embeddings.
The results are not satisfactory because a search returns a single segment, which loses the context of the whole text.
I want to emphasise that the method works on the smaller documents (before concatenation). For example, if I do this on “document A” (120 pages) I get satisfying results. However, when I do it on the large concatenated document, which also contains “document A” + 11 other documents, the results are completely useless.
I just don’t know why this is happening. Perhaps, as you mention, it simply lost the context of the whole text because of the size?
Have you tried this approach?
Let’s say you have a document of 80000 tokens, and you split it into 10 chunks, each with 8000 tokens.
Let’s just call them doc_1.1, doc_1.2 and so on. Let’s say the other document is less than 8000 tokens and we call it doc_2.
When you search for documents, it’s likely you will get these results…
Based on the top 3 results, we know that the average score of doc_1 is (0.4 + 0.3)/2 = 0.35, which is higher than doc_2’s score, so we use doc_1.
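The averaging heuristic above can be sketched like this. The doc_1 chunk scores (0.4 and 0.3) come from the example; the doc_2 score (0.34) and the chunk IDs `doc_1.1` / `doc_1.7` are hypothetical values invented for illustration, since the actual search results were elided.

```python
# Sketch of the per-document score-averaging heuristic: group chunk
# similarity scores by parent document, average them, pick the winner.
from collections import defaultdict

# Hypothetical top-3 search results as (chunk_id, cosine_similarity) pairs.
results = [("doc_1.1", 0.4), ("doc_2", 0.34), ("doc_1.7", 0.3)]

def pick_document(results):
    """Average chunk scores per parent document; return (best_doc, averages)."""
    grouped = defaultdict(list)
    for chunk_id, score in results:
        parent = chunk_id.split(".")[0]  # "doc_1.7" -> "doc_1"
        grouped[parent].append(score)
    averages = {doc: sum(s) / len(s) for doc, s in grouped.items()}
    return max(averages, key=averages.get), averages

best, averages = pick_document(results)
print(best)  # doc_1 (average 0.35 beats doc_2's 0.34)
```

Once the winning parent document is chosen, you could then re-retrieve all of its chunks (or the whole original document) to restore the context a single segment loses.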