Semantic search on a large document

Hi all,

My goal is to use GPT-3 semantic search for a large concatenated document.

In my case, the documents primarily contain different topics. Some of the documents may overlap/supplement some topics, but in general, they are unique.

If I embed one document only, and query things I know for sure are in the document I get some satisfying results.

When I use the large concatenated document I was expecting slightly better results due to the fact that some of the documents overlap/supplement each other. However, I get some really bad results.

How can that be? What am I missing? I have been looking at this but I am not sure if that would make a difference. One big vector would work just as well as many smaller vectors in my opinion.



Hi @Cleveland
I think this is a common problem with semantic search: if your contexts overlap a lot, it is difficult to tell them apart.

  1. What is the average token size of each of your embeddings?
  2. Which model are you using to create the embeddings?
  3. Have you tried other downstream tasks such as question answering?

Hi @nelson,

Thank you for taking the time to reply.

Perhaps “overlap” was the wrong word. Some topics are described in more than one document. However, if two or more documents share a topic, one of them always covers that topic in depth, whereas the others are broader and only mention a few key points on it. Would that still cause a problem in your opinion?

As for your questions, here are my replies:

  1. The average token size is about 8000. Since the concatenated document is about 1200 normal pages, it is of course split into a number of smaller chunks, which are then fed into the embedding model

  2. I use text-embedding-ada-002

  3. Yes, I get great results on the smaller documents, but when I use the concatenated document I get awful results

Just to make sure I understood your question:
You have a long text, and in order to embed it you need to split it into multiple segments, resulting in multiple embeddings.
The test results are not satisfying because the search returns a single segment, which loses the context of the whole text.
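
To make the setup concrete, here is a minimal sketch of the split-then-search flow described above. It is only an illustration: the function names are hypothetical, whitespace-separated words stand in for real tokens, and the small hand-written vectors stand in for the embeddings an embedding model would return.

```python
import math

def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_id, score) pairs for the k chunks most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy data: whitespace "tokens" instead of model tokens.
tokens = "a b c d e f g".split()
chunks = split_into_chunks(tokens, 3)

# Assumption: these 2-d vectors are made up; real embeddings come from the model.
chunk_vecs = {
    "doc_1.1": [1.0, 0.0],
    "doc_1.2": [0.6, 0.8],
    "doc_2": [0.0, 1.0],
}
query_vec = [1.0, 0.0]
results = top_k(query_vec, chunk_vecs, k=2)
print(results)
```

Each chunk is scored on its own here, so a hit only ever carries the context of that one segment.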

Hi @georgei

Yes, I think you understood the question.

I want to emphasise that the method works on the smaller documents (before concatenation). E.g., if I do this on “document A” (120 pages), I get satisfying results. However, when I do it on the large concatenated document, which contains “document A” plus 11 other documents, the results are completely useless.

I just don’t know why this is happening. Perhaps as you mention it just lost the context of the whole text because of the size?

Hi @Cleveland
Have you tried to take this approach?
Let’s say you have a document of 80000 tokens, and you split it into 10 chunks of 8000 tokens each.
Let’s call them doc_1.1, doc_1.2 and so on. Let’s say the other document is less than 8000 tokens, and we call it doc_2.
When you search, you are likely to get results like these:

ID        Score
doc_1.2   0.4
doc_1.1   0.3
doc_2     0.2

Based on the top 3 results, the average score of doc_1 is (0.4 + 0.3) / 2 = 0.35, which is higher than doc_2’s 0.2, so we use doc_1.
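
The averaging step above could be sketched roughly like this. The function name is hypothetical, and it assumes chunk IDs follow the `doc_N.M` naming used in the example, so that everything before the first dot identifies the parent document.

```python
from collections import defaultdict

def rank_documents(chunk_scores):
    """Average chunk scores per parent document, highest average first."""
    by_doc = defaultdict(list)
    for chunk_id, score in chunk_scores:
        # Assumption: "doc_1.2" belongs to "doc_1"; an ID without a dot is its own document.
        doc_id = chunk_id.split(".")[0]
        by_doc[doc_id].append(score)
    averages = {doc: sum(scores) / len(scores) for doc, scores in by_doc.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

# The top 3 search results from the example above.
top_results = [("doc_1.2", 0.4), ("doc_1.1", 0.3), ("doc_2", 0.2)]
ranking = rank_documents(top_results)
print(ranking)  # doc_1 comes first with an average of about 0.35
```

Averaging is only one way to combine chunk scores. Taking the maximum per document is another option and may behave differently when a document has many weakly matching chunks.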


Another approach is to skip semantic search and use question answering instead, which is very good at locating an answer in a large amount of text.
See “What is Question Answering?” on Hugging Face.


Hi @nelson

Thanks for the reply and your suggestions!

I’ll try splitting the documents into smaller chunks and see if that improves the results. And I’ll have a look at the alternative approach you suggested, which could also be an even better solution.


Good luck, let me know if you run into any roadblocks.


I have a question. I have a script, a 900-page document, and I want to generate code based on it. If I take a small section of the document and ask questions about it, I get good answers, but with the complete document I do not. I want the model to consider the complete document when answering. Can you help me out with that?