Understanding Embedding Granularity

Hello. I’d like to confirm if I am creating embeddings in the correct way.

I am trying to match summarised statements in a summary document to their original sources which are contained in PDF files.

For example, the summary statement might be: “Most cars are grey”

And the content in the PDF document might be: “A study looked at the most common car colours. It was found that 51% worldwide are grey”

Using information from the documentation and tutorials I have seen, I have taken the following approach (a rough code sketch follows the list):

  1. Read in the whole PDF
  2. Split the text into individual sentences
  3. Generate embeddings for each sentence using text-embedding-ada-002
  4. Generate an embedding for the query
  5. Compute the cosine similarity between the query embedding and each sentence embedding
  6. Sort by similarity
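
In code, my current attempt looks roughly like this. It’s only a simplified sketch: the PDF reading and sentence splitting are placeholders, and I’m using the openai Python client (v1 style) with numpy for the cosine similarity.

```python
# Rough sketch of steps 1-6 above. PDF extraction and sentence splitting
# are placeholders; assumes OPENAI_API_KEY is set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Embed a list of strings with text-embedding-ada-002."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# sentences = split_into_sentences(read_pdf("source.pdf"))  # placeholder helpers
sentences = [
    "A study looked at the most common car colours.",
    "It was found that 51% worldwide are grey.",
]
query = "Most cars are grey"

sentence_vectors = embed(sentences)
query_vector = embed([query])[0]

# Score every sentence against the query and sort by similarity.
ranked = sorted(
    zip((cosine(query_vector, v) for v in sentence_vectors), sentences),
    reverse=True,
)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```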

I’m having pretty disappointing results so far. Reading similar forum topics, I often see recommendations to embed a page at a time, or a similarly large chunk of text.

I’m wondering whether this would give better results, since there is more context in each embedding. The part I don’t understand is: when I get a whole page back as a result, how do I narrow down which part of that page matches the summarised statement, or is that not possible?

Thanks,
Simon

The example phrase you gave above is too short a piece of text to produce a meaningful vector. If you use short phrases like that, you will see very poor results.

HTH

Sorry, that was a bad example.

A more accurate example would be something like: ‘Across treatment groups, mean gestational age ranged from 12 to 15 weeks and mean birthweight ranged from 0.5 kg to 0.75 kg’

My question was really about whether I am creating the embeddings for the source PDFs correctly, sentence by sentence. I don’t understand whether there is a way to narrow down the result if the embeddings were created in page-by-page chunks.

Thanks for clarifying.

I think @raymonddavey does this regularly and can offer some insight into how he breaks these down.

@nelson also does this, and perhaps he will chime in and offer his insights as well.


You are doing it correctly, but you may get better results if you break the text into paragraphs or blocks around 200 to 400 tokens in size.

I assume the results you are looking for don’t fit into a limited number of categories, so you are not looking to coarsely classify your queries.

It sounds like you are looking to recall previous case notes that are similar to your existing case?

If so, then you are on the right track.


Thank you @raymonddavey, yes that’s pretty much what I’m trying to achieve. We have documents that contain summarised facts about a particular subject. These facts have been generated by summarising longer text contained in multiple PDF files.

Starting with the summarised fact, I am trying to find the most likely source of that fact by scanning through all the PDFs.

I’m glad I am on the right track. I guess the trick is to narrow down the size of the search result as much as possible (page, paragraph or sentence) whilst still getting an accurate comparison against the query. I think it will be difficult, as a single fact may have come from one particular sentence, multiple sentences in one paragraph, multiple sentences over multiple paragraphs, entries within a table, etc.
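
One idea I’m toying with, purely as a sketch: embed the larger chunks for a first pass, then re-score only the sentences inside the best-matching chunk, reusing the same embedding and cosine-similarity helpers from my snippet above. Something like:

```python
# Hypothetical two-stage narrowing. `chunks` is a list of page/paragraph
# strings and `chunk_sentences[i]` holds the sentences of chunks[i].
# Reuses embed() and cosine() from the earlier snippet.
def narrow_down(query, chunks, chunk_sentences):
    query_vec = embed([query])[0]

    # Stage 1: find the best-matching chunk.
    chunk_vecs = embed(chunks)
    best = max(range(len(chunks)), key=lambda i: cosine(query_vec, chunk_vecs[i]))

    # Stage 2: re-score only that chunk's sentences to narrow the match.
    sent_vecs = embed(chunk_sentences[best])
    scored = sorted(
        zip((cosine(query_vec, v) for v in sent_vecs), chunk_sentences[best]),
        reverse=True,
    )
    return best, scored
```

I haven’t tested yet whether that actually helps.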

I agree with @raymonddavey that you will get good results when you break it down by tokens.
You can use spacy.io to split your text into sentences and count tokens using the Hugging Face Transformers GPT2Tokenizer.
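
Roughly something like this (just a sketch; the 400-token limit is only an example from the 200 to 400 range @raymonddavey mentioned, and en_core_web_sm is whichever English spaCy pipeline you have installed):

```python
# Sketch: split text into sentences with spaCy, then pack sentences into
# chunks of at most ~400 GPT-2 tokens before embedding each chunk.
import spacy
from transformers import GPT2Tokenizer

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def chunk_text(text, max_tokens=400):
    chunks, current, current_tokens = [], [], 0
    for sent in nlp(text).sents:
        n = len(tokenizer.encode(sent.text))
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sent.text)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```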
