Tabular data converted to embeddings not returning accurate results

Hey,

I have a set of Postgres tables related to an entity, say, People. I’m collecting all data related to people in a tabular format. An example from my CSV:

Name, Email, Company, Last Contacted, Account Created, Address
John Doe, johndoe@email.com, ABC Company, 2023/01/22, 2022/04/10, "123 Some Street, LA, CA"
Jane Doe, janedoe@email.com, 123 Company, 2023/05/20, 2022/07/17, "957 Some Street2, LA, CA"

I ran this CSV through the createEmbedding API call, sending each row of the CSV as an array of string tokens (without the headers). When I query for things like:

“What is John Doe’s email?” - I get the right result with the highest similarity
“Find people who live in LA?” - Completely inaccurate results (some users don’t have an address)

I then converted the above data into a paragraph with more context. For example:

John Doe, johndoe@email.com, ABC Company, 2023/01/22, 2022/04/10, "123 Some Street, LA, CA"

became

John Doe is a user with email johndoe@email.com. They work for ABC Company. Their account was created on 2022/04/10 and they were last contacted on 2023/01/22. Their address is 123 Some Street, LA, CA.

After running each converted row through the embeddings, the results are now worse, and not even the first question “What is John Doe’s email?” works correctly.

I’m using the following code to generate the embeddings:


const embeddingResponse = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input, // either a single string or an array of strings, e.g. ["John Doe", "john@email.com", ...]
});
// The vector(s) are then in embeddingResponse.data.data[i].embedding

I would appreciate some input on how I should be preparing this data for embeddings so that I can perform effective semantic search. Still a novice with AI, so please feel free to include any references, etc., that may help me understand this better. Thanks!


An inherent issue with searching by semantic relevance is the importance of keywords (which by themselves can be semantically worthless) in the query.

What you are asking is for an apple to become an orange.

Fortunately, there is a solution: hybrid indexes. Pinecone, for example, has just released theirs, which allows for both sparse vectors (keywords) and dense vectors (embeddings) in the same query.

Now you can search by both and get more relevant results.
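To give a feel for it, a hybrid query against such an index looks roughly like this with the Python client (the index name and the query vectors here are placeholders, not from your setup):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("people-hybrid")  # placeholder index name

# dense_vec: the OpenAI embedding of the query
# sparse_vec: BM25-style keyword weights, {"indices": [...], "values": [...]}
results = index.query(
    vector=dense_vec,
    sparse_vector=sparse_vec,
    top_k=5,
    include_metadata=True,
)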


Why not just look it up directly in your DB? No embeddings required. For location proximity, I would derive a lat/lon for each address and then use Haversine (or one of its many approximations) to get those hits.
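For instance, a minimal Haversine implementation in Python, assuming each address has already been geocoded to a lat/lon pair:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# e.g. downtown LA to Santa Monica, roughly 23 km:
# haversine_km(34.0522, -118.2437, 34.0195, -118.4912)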

Thanks @anon10827405, so are you suggesting that I move away from OpenAI and use Pinecone to generate the embeddings/vectors instead? So if I have a set of keywords, they are better suited to sparse vectors, which embeddings are not?

I know this is a very generic/basic question, but could you share some input formats/use cases that work well for embeddings? It would be great if you could shed some light on the situation with my data, the issues with embeddings, and why they are or are not a good choice here. Thanks again!

Personally, I have migrated from Elasticsearch to Pinecone just for ease (laziness). I figure that some of these questions may be answered better using semantics, and the combination (hopefully) will return better results.

@kvnam
I would still use OpenAI for their embeddings and then just store them in whatever vector database you prefer. If you follow the example in the link I provided, you’ll see that they convert the keywords using BM25 (they claim they’ll eventually have their own helper function).

The link has a great use-case of converting images & their respective keywords into vectors and storing them into the database.
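For intuition, a sparse vector in Pinecone’s format is just token indices paired with their BM25 weights. The numbers below are made up for illustration:

# Hypothetical output for a short text: each index is a token id,
# each value is that token's BM25 weight.
sparse_vec = {"indices": [1045, 2293, 3000], "values": [0.42, 0.18, 0.91]}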

Thanks @curt.kennedy, yes, there are other options here, including traditional search engines, etc. But the idea is to let users type any question (in English) related to the data they want to search for and provide them with a set of results. Since OpenAI lets you process natural language, I want to leverage this to make search more user friendly. I’m still in the process of identifying how best to use it, so I would appreciate some input on what the strengths of embeddings are.

Thanks for explaining @anon10827405, I will go through the link you provided to understand this better. In the meanwhile, for the OpenAI embeddings, if I understand you correctly, the array of keywords should work fine as input…? Does it matter that the actual headers are not present in the array? What is the difference if I send the same keywords in paragraph format? Does it help at all?

It sounds like you need more of a “Hybrid AI” approach. Embeddings alone could get you close key matches to your DB, assuming you are embedding on key values. I do this for name similarities, with some math thrown around it, and it works well. But distance and other metrics are hard or impossible to quantify by word or sentence similarity; you will get a lot of garbage, so you need to switch domains. An upfront AI classifier can help send a query down the correct rail (optimal) instead of wasting time on weaker algorithms/approaches.
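To illustrate the classifier idea, here is a minimal sketch. The route labels are made up; you’d pick ones matching your own pipelines:

import openai

ROUTES = ["direct_db_lookup", "semantic_search", "location_query"]  # hypothetical routes

def classify(question: str) -> str:
    """Ask the model which pipeline should handle this question."""
    prompt = (
        "Classify the user question into exactly one of: "
        + ", ".join(ROUTES) + ".\n"
        f"Question: {question}\nLabel:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=5,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

# classify("Find people who live in LA")  ->  ideally "location_query"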

BM25 (the keyword relevancy search) will train on the keywords you fit it with and return sparse vectors, which you’ll attach to your embeddings. All of this will become obvious once you go through the case study. I highly recommend running all the commands yourself up until the point of batching it and sending it to Pinecone.

When you make a query, you will embed the query twice: once for its semantics (using OpenAI), and again for its keyword relevancy (using BM25).
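For example, the two query representations might be built like this. This is a sketch; transform_query is the pinecone-text helper for query-side sparse encoding, so check the method names against your version of the library:

import openai

query = "Find people who live in LA"

# Dense vector: the query's semantics, via OpenAI.
dense = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=query,
)["data"][0]["embedding"]

# Sparse vector: the query's keyword relevancy, via the fitted BM25 model
# (bm25 is the pinecone_text.BM25 object from the snippet further down).
sparse = bm25.transform_query(query)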

Yes, I was intending to first get embeddings to return the top similar matches and then pass that first set into the prompt as context to get an answer… Not sure if that is what you meant by Hybrid AI…

So it seems embeddings with keywords alone cannot provide the accuracy I was expecting… I’ll read up on this further. In the meantime, if you have any suggestions on specific domains or AI classifiers I can use for this, I would be very grateful. I’m still new to AI, so it will help me save time, thank you!

Perfect thanks @anon10827405 I’ll do that!

When you use BM25, more specifically, at this point:

import pinecone_text
from transformers import BertTokenizerFast

# load the BERT tokenizer from Hugging Face
tokenizer = BertTokenizerFast.from_pretrained(
    'bert-base-uncased'
)

def tokenize_func(text):
    token_ids = tokenizer(
        text,
        add_special_tokens=False
    )['input_ids']
    return tokenizer.convert_ids_to_tokens(token_ids)

bm25 = pinecone_text.BM25(tokenize_func)

You’ll see that it uses a tokenizer (which is what I believe he is talking about), so no worries about it!
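As a quick sanity check, you can run one of your queries through it. Note that bert-base-uncased lowercases, so “LA” becomes “la”:

tokens = tokenize_func("Find people who live in LA")
print(tokens)  # expect something like: ['find', 'people', 'who', 'live', 'in', 'la']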


By “Hybrid AI” I mean something like LangChain, which can break a request into different domains (search, calculator, live time, location, LLM response). Here is a quick intro to this thinking:
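For a flavor of it, an early-LangChain agent that routes between tools looks something like this (the tool names are illustrative; “serpapi” needs its own API key):

from langchain.llms import OpenAI
from langchain.agents import load_tools, initialize_agent

llm = OpenAI(temperature=0)

# "llm-math" gives the agent a calculator; "serpapi" gives it web search.
tools = load_tools(["llm-math", "serpapi"], llm=llm)

# A zero-shot ReAct agent decides, per question, which tool (domain) to use.
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

agent.run("Who is the CEO of ABC Company, and what is 2023 minus 2022?")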


Thank you @curt.kennedy I’ll start here!


Huh? Please explain how he is going to use the Haversine formula in this case.

The OP is correlating embedded addresses to get proximity. A better approach is to convert each address to lat/lon and then use Haversine to estimate the distances between the various lat/lon pairs.