I am using ada-002 to embed large datasets from accounting systems. I am getting mixed results and wonder what format I should use for the text to be embedded. I tried:
Company Name: Boom FM, Invoice Number: INV-0045, Invoice Status is PAID, Invoice Reference: BOBCAT12, Invoice Creation Date: 2023-01-15T00:00:00, Invoice Due Date: Jan 29, 2023 12:00, Total Amount: £60
I embedded the same record with a different invoice number. When I query with the exact same text as above, the top match from the vector database (Pinecone) is the invoice record with the wrong invoice number. Both records come back with a score of 1.
I wonder if anyone here has already taken structured data like invoice records for a spin and maybe run into similar issues?
You might consider removing the bias in your embeddings (in all your vectors) and using the “de-biased” vectors instead.
One (simple) way to do this is to take the average of all your vectors and then subtract that average from each vector (both your invoice vectors in the DB and the search vector). There are other de-biasing methods, BTW.
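Here is a rough sketch of that mean-subtraction step, in plain Python so it runs without extra packages. The tiny 3-dimensional vectors and the helper names (`mean_vector`, `subtract`, `debias`) are made up for illustration; real ada-002 vectors are much longer and would come from your own embedding calls.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise average of a list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def subtract(vector: Vector, mean: Vector) -> Vector:
    """Subtract the mean from a single vector."""
    return [a - b for a, b in zip(vector, mean)]


def debias(vectors: List[Vector], mean: Vector) -> List[Vector]:
    """Subtract the same mean from every vector in the list."""
    return [subtract(v, mean) for v in vectors]


# Toy stand-ins for invoice embeddings (hypothetical values, not real output).
invoice_vectors = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.3],
]

# Compute the mean once from the stored vectors...
mean = mean_vector(invoice_vectors)
centered = debias(invoice_vectors, mean)

# ...and subtract that same mean from the search vector at query time.
query_vector = [0.9, 0.1, 0.2]
query_centered = subtract(query_vector, mean)

print(centered)
print(query_centered)
```

The important detail is that the search vector gets the same mean subtracted as the stored vectors, so the mean has to be saved somewhere alongside the index.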
Then run your method of choice to find the “best match”, trying both the Dot Product and the Euclidean Distance to see if either of them gets you where you want to be.
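To compare the two measures side by side, something like the following could work. It is only a sketch: the vectors are made-up, already mean-subtracted toy values, and `rank_by_dot` / `rank_by_euclidean` are hypothetical helpers, not part of Pinecone or any other library.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_dot(query: Vector, vectors: List[Vector]) -> List[int]:
    """Indices of vectors, best match first, by descending dot product."""
    return sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)


def rank_by_euclidean(query: Vector, vectors: List[Vector]) -> List[int]:
    """Indices of vectors, best match first, by ascending distance."""
    return sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))


# Toy de-biased vectors (hypothetical values for illustration only).
vectors = [
    [0.1, -0.1, 0.0],
    [0.0, 0.0, -0.1],
    [-0.1, 0.1, 0.1],
]
query = [0.1, -0.1, 0.0]  # identical to the first stored vector

print("dot product ranking:", rank_by_dot(query, vectors))
print("euclidean ranking:  ", rank_by_euclidean(query, vectors))
```

One thing to keep in mind: after subtracting the mean the vectors are no longer unit length, so the dot product can favor vectors with a larger norm, whereas an exact duplicate of the query always has a Euclidean distance of 0.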