Semantic search on large document

Cleveland · December 30, 2022, 5:48pm

Hi all,

My goal is to use GPT-3 semantic search for a large concatenated document.

In my case, the documents primarily contain different topics. Some of the documents may overlap/supplement some topics, but in general, they are unique.

If I embed only one document and query for things I know for sure are in it, I get satisfactory results.

When I use the large concatenated document I was expecting slightly better results due to the fact that some of the documents overlap/supplement each other. However, I get some really bad results.

How can that be? What am I missing? I have been looking at this but I am not sure if that would make a difference. One big vector would work just as well as many smaller vectors in my opinion.

/cleveland

nelson · December 30, 2022, 6:23pm

Hi @Cleveland
I think this is a common problem with semantic search: if your documents have a lot of overlap, it is difficult for semantic search to tell them apart.
Questions…

What is the average token size for each of your embeddings?
Which model are you using to create the embedding?
Have you tried other downstream tasks such as question answering?

Cleveland · December 31, 2022, 8:38am

Hi @nelson,

Thank you for taking the time to reply.

Perhaps “overlap” was the wrong word: some topics are described in more than one document. However, if two or more documents share the same topic, one of them will always cover that topic in depth, whereas the other documents are broader and only mention a few key points on that specific topic. Would that still cause a problem in your opinion?

As for your questions, here are my replies:

The average token size is about 8000. Since the concatenated document is about 1200 normal pages, it is of course split into some number of smaller chunks, which are then fed into the embedding model
I use text-embedding-ada-002
Yes, I get great results on the smaller documents, but when I use the concatenated document I get awful results

georgei · December 31, 2022, 1:12pm

Just to make sure I understood your question:
You have a long text, and in order to embed it you need to split it into multiple segments, resulting in multiple embeddings.
The results are not satisfactory because, when searching, each result is a single segment that has lost the context of the whole text.
Right?
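
For anyone following along, here is a minimal sketch of the setup described above: each segment has its own embedding, and a search ranks segments independently by cosine similarity to the query. The 3-dimensional vectors and the segment IDs are made up for illustration; real embeddings would come from an embedding model and have far more dimensions. `rank_segments` is a hypothetical helper name, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)


def rank_segments(query_vec, segment_vecs, top_k=3):
    """Return the top_k (segment_id, score) pairs, best match first."""
    scored = [
        (seg_id, cosine_similarity(query_vec, vec))
        for seg_id, vec in segment_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real segment embeddings (assumption).
segments = {
    "doc_A.1": [1.0, 0.0, 0.0],
    "doc_A.2": [0.0, 1.0, 0.0],
    "doc_B.1": [1.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

results = rank_segments(query, segments, top_k=2)
for seg_id, score in results:
    print(f"{seg_id}\t{score:.3f}")
```

Note that each segment is scored on its own, so nothing in this ranking knows which segments belong to the same source document.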

Cleveland · December 31, 2022, 3:55pm

Hi @georgei

Yes, I think you understood the question.

I want to emphasise that the method works on the smaller documents (before concatenation). E.g. if I do this on “document A” (120 pages), I get satisfactory results. However, when I do it on the large concatenated document, which contains “document A” + 11 other documents, the results are completely useless.

I just don’t know why this is happening. Perhaps, as you mention, it just loses the context of the whole text because of the size?

nelson · January 1, 2023, 6:17am

Hi @Cleveland
Have you tried to take this approach?
Let’s say you have a document of 80000 tokens, and you split it into 10 chunks of 8000 tokens each.
Let’s just call them doc_1.1, doc_1.2 and so on. Let’s say the other document is less than 8000 tokens and we call it doc_2.
When you search for documents, it’s likely you will get these results…

ID Score
doc_1.2 0.4
doc_2 0.2
doc_1.1 0.3

Based on the top 3 results, we know that the average score of doc_1 is (0.4+0.3)/2 = 0.35, which is higher than doc_2’s 0.2, so we use doc_1.
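
A minimal sketch of that aggregation, using the three example hits above. It assumes chunk IDs follow the `doc_N.M` naming from this post, so the parent document is everything before the first dot; `parent_id` and `average_score_per_document` are hypothetical helper names. Averaging is just one choice here, and taking the maximum score per document would be an equally simple alternative.

```python
from collections import defaultdict


def parent_id(chunk_id):
    """Map 'doc_1.2' to 'doc_1'; IDs without a dot are returned unchanged."""
    return chunk_id.split(".", 1)[0]


def average_score_per_document(hits):
    """Group (chunk_id, score) hits by parent document and average the scores.

    Returns a list of (document_id, average_score), best document first.
    """
    grouped = defaultdict(list)
    for chunk_id, score in hits:
        grouped[parent_id(chunk_id)].append(score)
    averages = {doc: sum(scores) / len(scores) for doc, scores in grouped.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)


# The top 3 search results from the example above.
hits = [("doc_1.2", 0.4), ("doc_2", 0.2), ("doc_1.1", 0.3)]

ranking = average_score_per_document(hits)
best_doc = ranking[0][0]
print(ranking)
print("use:", best_doc)
```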

nelson · January 1, 2023, 6:19am

Another approach is to use question answering instead of semantic search; it is very good at locating an answer in a large amount of text.
See What is Question Answering? - Hugging Face

Cleveland · January 1, 2023, 3:10pm

Hi @nelson

Thanks for the reply and your suggestions!

I’ll try splitting the documents into smaller chunks and see if that improves the results. And I’ll have a look at the alternative approach you suggested, which could also be an even better solution.
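
A rough sketch of what that splitting could look like, with a small overlap between neighbouring chunks so a sentence cut at a boundary still appears whole in one of them. The chunk size and overlap values are arbitrary placeholders, and the example splits a plain list of items; counting real model tokens would need a tokenizer, which is not shown here. `split_into_chunks` is a hypothetical helper name.

```python
def split_into_chunks(tokens, chunk_size=500, overlap=50):
    """Split a list of tokens into chunks of at most chunk_size items.

    Consecutive chunks share `overlap` items. The defaults are arbitrary
    illustration values, not recommendations.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


# Tiny example: 10 "tokens", chunks of 4 with an overlap of 1.
chunks = split_into_chunks(list(range(10)), chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be embedded separately and stored together with the ID of the document it came from.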

nelson · January 2, 2023, 8:44am

Good luck, let me know if you run into any roadblocks.

archikag5 · November 29, 2023, 6:27am

I have a question. I have a script, and I want to generate code based on it. It’s a document of 900 pages. If I take a small section of the document and then ask questions, I get good answers, but with the complete document it is not able to do so. I want it to consider the complete document when answering. Can you help me out with that?
