Hey,
I have a set of Postgres tables related to an entity, say for example People. I’m collecting all data related to people in a tabular format. An example from my CSV:
Name, Email, Phone, Company, Last Contacted, Account Created, Address
John Doe, johndoe@email.com, ABC Company, 2023/01/22, 2022/04/10, "123 Some Street, LA, CA"
Jane Doe, janedoe@email.com, 123 Company, 2023/05/20, 2022/07/17, "957 Some Street2, LA, CA"
I ran this CSV through the createEmbedding API call, sending each row in the CSV as an array of string tokens (without the headers); a simplified sketch of what I send per row is included after the examples below. When I query for things like:
“What is John Doe’s email?” - I get the right result with the highest similarity
“Find people who live in LA?” - Completely inaccurate results (some of the returned users don’t even have an address)
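For reference, this is roughly what I send for one row. The values are hard-coded here just for illustration; in my real code they come from the parsed CSV, and the openai client is the same one initialized in my snippet further down.

// Simplified sketch of one row being embedded as an array of string tokens.
// (Hard-coded values for illustration; `openai` is the initialized client shown below.)
const rowTokens = ["John Doe", "johndoe@email.com", "ABC Company", "2023/01/22", "2022/04/10", "123 Some Street, LA, CA"];
const rowEmbedding = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input: rowTokens, // with array input, the API returns one embedding per array element
});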
I then converted each row of the above data into a paragraph with more context. For example
John Doe, johndoe@email.com, ABC Company, 2023/01/22, 2022/04/10, "123 Some Street, LA, CA"
became
John Doe is a user with email johndoe@email.com. They work for ABC Company. Their account was created on 2022/04/10 and they were last contacted on 2023/01/22. Their address is 123 Some Street, LA, CA.
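The conversion itself is just string templating, roughly like this (the toParagraph helper and the person field names are simplified illustrations, not my exact code):

// Rough sketch of turning a parsed CSV row into a sentence-style paragraph.
// `person` is an object built from the CSV headers; the template mirrors the example above.
const toParagraph = (person) =>
  `${person.name} is a user with email ${person.email}. They work for ${person.company}. ` +
  `Their account was created on ${person.accountCreated} and they were last contacted on ${person.lastContacted}. ` +
  `Their address is ${person.address}.`;

const paragraph = toParagraph({
  name: "John Doe",
  email: "johndoe@email.com",
  company: "ABC Company",
  lastContacted: "2023/01/22",
  accountCreated: "2022/04/10",
  address: "123 Some Street, LA, CA",
});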
After running each converted paragraph through the embeddings call, the results got worse: now not even the first question, “What is John Doe’s email?”, returns the right result.
I’m using the following code to generate the embeddings:

const { Configuration, OpenAIApi } = require("openai"); // openai Node SDK (v3)
const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));

const embeddingResponse = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input, // Either the paragraph string or the row as an array of strings, e.g. ["John Doe", "johndoe@email.com", ...]
});
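For completeness, the lookup side is conceptually a similarity ranking like the sketch below. Cosine similarity and the in-memory storedRows array are assumptions/simplifications for illustration; my actual vectors and queries live in Postgres.

// Conceptual sketch of the lookup: embed the question, then rank stored row
// embeddings by cosine similarity.
const cosineSimilarity = (a, b) => {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

const questionResponse = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input: "What is John Doe's email?",
});
const questionVector = questionResponse.data.data[0].embedding;

// `storedRows` is a hypothetical array of { text, embedding } built from the CSV rows.
const ranked = storedRows
  .map((row) => ({ ...row, score: cosineSimilarity(questionVector, row.embedding) }))
  .sort((a, b) => b.score - a.score);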
I would appreciate some input on how I should prepare this data for embeddings so that I can perform effective semantic search. I’m still a novice with AI, so please feel free to include any references, etc. that may help me understand this better. Thanks!