What is the best format for structured data embeddings like invoice records?

:wave: I am using ada-002 and looking to embed large datasets from accounting systems. I am getting mixed results and am wondering what text format I should use for the text to be embedded. I tried:

Company Name: Boom FM, Invoice Number: INV-0045, Invoice Status is PAID, Invoice Reference: BOBCAT12, Invoice Creation Date: 2023-01-15T00:00:00, Invoice Due Date: Jan 29, 2023 12:00, Total Amount: £60

I embedded the same record with a different invoice number, and when querying with the exact same text as above, the top match from the vector database (Pinecone) returns the invoice record with the wrong invoice number. Both return with a score of 1 :thinking:

Wondering if anyone here has taken structured data like invoice records for a spin already, and maybe ran into similar issues? :bowing_man:

2 Likes

You might consider removing the bias in your embeddings (in all your vectors) and use the “de-biased” vectors.

One (simple) way to do this is to take an average of all your vectors together and then subtract that average from each vector (both your invoice vectors in the DB, and the search vector). There are other de-biasing methods, BTW.
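The averaging approach above can be sketched in plain Ruby. This is a minimal illustration (the method names `mean_vector` and `debias` are mine, not from any library): compute the component-wise mean of all vectors, then subtract it from each vector.

```ruby
# Compute the component-wise average of a set of equal-length vectors.
def mean_vector(vectors)
  dims = vectors.first.size
  sums = Array.new(dims, 0.0)
  vectors.each { |v| v.each_with_index { |x, i| sums[i] += x } }
  sums.map { |s| s / vectors.size }
end

# "De-bias" by subtracting the mean vector from every vector
# (remember to also subtract it from the search/query vector).
def debias(vectors)
  mean = mean_vector(vectors)
  vectors.map { |v| v.each_with_index.map { |x, i| x - mean[i] } }
end

# Tiny example with 2-D vectors:
centered = debias([[1.0, 2.0], [3.0, 4.0]])
# => [[-1.0, -1.0], [1.0, 1.0]]
```

After centering, the shared "background" component that every embedding has in common is removed, so the remaining differences (like the invoice number) carry more of the similarity signal.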

Then run your method of choice to get the “best match”, trying both the Dot Product and the Euclidean Distance to see if either of these methods gets you where you want to be.

HTH

2 Likes

Thanks :raised_hands: appreciated and looking into it :eyes:

1 Like

Welcome Aboard @michaelriedler !

:slight_smile:

1 Like

Model used: text-similarity-ada-001

I just ran the vectors for your text, the first one as you posted and the second one by slightly changing the invoice number.

Then I ran the Dot Product against those two vectors, as follows:

 a = Embedding.where(id:9).pluck(:vector)[0]
 b = Embedding.where(id:10).pluck(:vector)[0]
 Embeddings.dot_product(a, b)
=> 0.9996880150165247

Sanity Check:

Embeddings.dot_product(a, a)
=> 0.9999999773456315
Embeddings.dot_product(b, b)
=> 0.9999999300630409

Sorry, I did not create more sample data to test.

Testing the Euclidean Distance:

Embeddings.euclidean_distance(a, a)
=> 0.0
Embeddings.euclidean_distance(b, b)
=> 0.0
Embeddings.euclidean_distance(a, b)
=>  0.024977537421116434
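For anyone following along, the `Embeddings.dot_product` and `Embeddings.euclidean_distance` helpers used above could be implemented along these lines. This is a guess at one plausible implementation, not the poster's actual code:

```ruby
module Embeddings
  # Dot product: sum of pairwise products. For unit-length vectors
  # (as OpenAI embeddings are), this equals the cosine similarity.
  def self.dot_product(a, b)
    a.zip(b).sum { |x, y| x * y }
  end

  # Euclidean distance: square root of the summed squared differences.
  def self.euclidean_distance(a, b)
    Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
  end
end

Embeddings.dot_product([1, 2, 3], [4, 5, 6])        # => 32
Embeddings.euclidean_distance([0.0, 0.0], [3.0, 4.0]) # => 5.0
```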

Hope this helps.

Sorry, I’m too lazy to create many records of your data and run the methods against all the vectors, but hopefully you get the idea! :slight_smile:

4 Likes

What a hero :star:
Thanks so much for sharing, really appreciated. Already testing and seeing differences, which is EXCITING. Will report back, really helpful pointers ruby_coder :raised_hands:

1 Like