Hello. I’d like to confirm if I am creating embeddings in the correct way.
I am trying to match summarised statements in a summary document to their original sources which are contained in PDF files.
For example the summary statement might be: Most cars are grey
And the content in the PDF document might be: A study looked at the most common car colours. It was found that 51% world wide are grey
Using information from the documentation and tutorials I have seen, I have taken the following approach:
1. Read in the whole PDF
2. Split the text into individual sentences
3. Generate an embedding for each sentence using text-embedding-ada-002
4. Generate an embedding for the query
5. Compute the cosine similarity between the query embedding and each sentence embedding
6. Sort by similarity
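The ranking steps above can be sketched in plain Python. Note the tiny 3-dimensional vectors here are placeholders standing in for real ada-002 embeddings (which are 1536-dimensional), just to make the flow runnable; in practice each vector would come from a call to the embeddings API.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_sentences(query_embedding, sentence_embeddings):
    # sentence_embeddings: list of (sentence, vector) pairs.
    # Returns the sentences sorted most-similar first.
    scored = [
        (sentence, cosine_similarity(query_embedding, vec))
        for sentence, vec in sentence_embeddings
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]
sentences = [
    ("A study looked at the most common car colours.", [0.8, 0.3, 0.1]),
    ("It was found that 51% world wide are grey.", [0.95, 0.05, 0.0]),
    ("Unrelated sentence about birds.", [0.0, 0.2, 0.9]),
]
ranked = rank_sentences(query, sentences)
```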
I’m having pretty disappointing results so far. When reading similar forum topics, I often see recommendations to embed a page at a time, or a similarly large chunk of text.
I’m wondering if this would help get better results, since each embedding would carry more context. But the part I don’t understand is: when I get that whole page back as a result, how do I narrow down which part of the page matches the summarised statement, or is that not possible?
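One common answer to the narrowing-down question is a two-stage search: embed whole pages for the coarse pass, then split only the winning page into sentences and re-rank those against the same query. The sketch below uses a crude bag-of-words `fake_embed` as a hypothetical stand-in for a real embeddings API call, purely so the two-stage flow is runnable; the function name and vocabulary are my own invention, not part of any library.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def fake_embed(text):
    # Hypothetical stand-in for a real embeddings call (e.g. ada-002):
    # a tiny bag-of-words vector so this example runs offline.
    vocab = ["car", "colour", "grey", "bird", "study"]
    words = text.lower().split()
    # Small epsilon avoids a zero vector (and division by zero) for
    # text containing none of the vocabulary words.
    return [sum(w.strip(".,%").startswith(v) for w in words) + 1e-6 for v in vocab]

def best_sentence(query, pages):
    q = fake_embed(query)
    # Stage 1: rank whole pages against the query.
    top_page = max(pages, key=lambda p: cosine(q, fake_embed(p)))
    # Stage 2: split only the winning page into sentences and re-rank those.
    sentences = [s.strip() for s in top_page.split(".") if s.strip()]
    return max(sentences, key=lambda s: cosine(q, fake_embed(s)))

pages = [
    "A study looked at the most common car colours. It was found that 51% of cars world wide are grey",
    "Birds are often grey. Birds can fly",
]
match = best_sentence("Most cars are grey", pages)
```

The second stage only re-embeds the sentences of the top page, so it stays cheap even when the corpus of pages is large.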
That example phrase of yours above is too short to have a meaningful vector. If you use short phrases like that, you will see very poor results.
A more realistic query would be something like: “Across treatment groups, mean gestational age ranged from 12 to 15 weeks and mean birthweight ranged from 0.5 kg to 0.75 kg.”
My question was really about whether I am creating the embeddings for the source PDFs correctly, sentence by sentence. I don’t understand whether there is a way to narrow down the result if they were created in page-by-page chunks.
Thank you @raymonddavey, yes that’s pretty much what I’m trying to achieve. We have documents that contain summarised facts about a particular subject. These facts have been generated by summarising longer text contained in multiple PDF files.
Starting with the summarised fact, I am trying to find the most likely source of that fact by scanning through all the PDFs.
I’m glad I am on the right track. I guess the trick is narrowing the search result down as far as possible (page, paragraph or sentence) while still getting an accurate match against the query. I think it will be difficult, as a single fact may have come from one particular sentence, multiple sentences in one paragraph, multiple sentences across several paragraphs, entries within a table, and so on.
I agree with @raymonddavey that you will get good results when you break it down by tokens.
You can use spacy.io to split your sentences, and count tokens using the GPT2Tokenizer from the Hugging Face Transformers library.
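spaCy and GPT2Tokenizer are the right tools for accurate sentence splitting and exact token counts. As a dependency-free sketch of the same chunking idea, here is a regex-based splitter with a rough token estimate (the widely cited rule of thumb of roughly 4 characters per token for English text with GPT-style tokenizers); both are approximations, so swap in the real libraries for production use.

```python
import re

def split_sentences(text):
    # Very rough stand-in for spaCy's sentence segmenter: split on
    # sentence-ending punctuation followed by whitespace and a capital.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def approx_token_count(text):
    # Heuristic only: English averages roughly 4 characters per token
    # for GPT-style BPE tokenizers. Use GPT2Tokenizer for exact counts.
    return max(1, len(text) // 4)

def chunk_by_tokens(sentences, max_tokens=200):
    # Greedily pack whole sentences into chunks of at most max_tokens,
    # so no sentence is ever cut in half by a chunk boundary.
    chunks, current, current_tokens = [], [], 0
    for s in sentences:
        t = approx_token_count(s)
        if current and current_tokens + t > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(s)
        current_tokens += t
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("A study looked at the most common car colours. "
        "It was found that 51% world wide are grey. "
        "Grey overtook white in 2022.")
sentences = split_sentences(text)
chunks = chunk_by_tokens(sentences, max_tokens=20)
```

Packing whole sentences keeps each chunk coherent, which tends to matter more for embedding quality than hitting an exact token count.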