How do I add weights to specific parts of the embeddings?

Hi,
I have a SQL database with these columns for my purchases table that I want to save into a vector database.

vendor: string
line_item: string

To my knowledge, I should be converting each record to a single vector. To work within this limitation, my plan is to concatenate the fields into one large string, adding extra context about what each value means, before converting that string to a vector with one of OpenAI’s embedding models. Example:

The vendor of the purchase is: ${vendor}.
The line item of the purchase is: ${line_item}.

My question is: What if I want to search for claims with extreme emphasis on the line items, and only a little emphasis / weight on the vendor?

Would this be done at the query level? That is, when I form the query, would I just specify that I care more about the line item than the vendor?

  1. “Return claims that have line items similar to ${line_item}. Secondly, prefer claims that were from ${vendor}”
  2. Turn that into a vector using one of OpenAI’s embedding models
  3. Search vector database using vector from #2

Is this correct?

This seems like a case that is deceptively close to just a database search.

I can push you in the right direction by showing how an embeddings index might return the labeled corresponding document.

Entity extraction →

${vendor}
${date}
${line_item1}
${line_item2}
${line_item3}

Entity to language index →

{vendor} purchase invoice {date}
{vendor} sold us {line_item1}
We bought {line_item2} from {vendor}
{line_item3} was obtained from {vendor} on {date}
{vendor} was the source of {line_item4}…

One can see that the semantics of the common vendor are enhanced over the varied line items.
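As a rough sketch of how that index text could be generated from the extracted entities (the function name, the rotating template list, and the sample date and line items are my own placeholders, not anything from your data):

```python
from typing import List


def build_index_sentences(vendor: str, date: str, line_items: List[str]) -> List[str]:
    """Return varied sentences that each mention the vendor once."""
    # Phrasings loosely following the templates above; rotated per line item.
    templates = [
        "{vendor} sold us {item}",
        "We bought {item} from {vendor}",
        "{item} was obtained from {vendor} on {date}",
        "{vendor} was the source of {item}",
    ]
    sentences = [f"{vendor} purchase invoice {date}"]
    for i, item in enumerate(line_items):
        template = templates[i % len(templates)]
        # str.format ignores keyword arguments a template does not use.
        sentences.append(template.format(vendor=vendor, item=item, date=date))
    return sentences


# Placeholder values for illustration only.
sentences = build_index_sentences(
    "Ace Typewriters", "2023-05-01", ["ink ribbon", "platen roller", "type slug"]
)
text_to_embed = "\n".join(sentences)
```

Here `text_to_embed` is what you would pass to the embedding model in place of the raw record. The vendor shows up once per sentence, while each line item shows up only once overall.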

Hi, sorry, I’m not following. To clarify, if I stored the following as a vector:

The vendor of the purchase is: ${vendor}.
The line item of the purchase is: ${line_item}.
The date of the purchase is: ${date}.

And I wanted to structure my query so that it prioritized the line_item first and the vendor second, would a query like:

“Return claims that have line items similar to ${line_item}. Secondly, prefer claims that were from ${vendor}” (turned into a vector) work?

My point was that you can improve the relevance of embeddings similarity scores by making what you “store” distinct from the original document: repeat the vendor prominently within the embedded text while varying the language around it. That strengthens matches on the vendor while still keeping the ability to match on items.

If you want to know how to “structure my query” to express a preference against an unimproved vector database, keep in mind that you are not running a “query” but a “best k” semantic similarity search informed by a large language model…

Pepsi Pepsi Pepsi Pepsi Pepsi Pepsi Pepsi Pepsi Pepsi Pepsi Pepsi Pepsi pepsi pepsi pepsi pepsi pepsi pepsi pepsi pepsi pepsi pepsi
company company vendor vendor vendor vendor diet pepsi syrup 2 gallon
Pepsi Pepsi

So in this example, given how many times the vendor appears, wouldn’t it over-index on the vendor relative to the line items?

{vendor} purchase invoice {date}
{vendor} sold us {line_item1}
We bought {line_item2} from {vendor}
{line_item3} was obtained from {vendor} on {date}
{vendor} was the source of {line_item4}…

So are you saying that if I want the line_item to hold the most weight, I have to do it at the data level by repeating the line_item multiple times, rather than attempting it at the query level?

That’s my expectation: an invoice that lists three different kinds of ball bearings and four different kinds of lug nuts is going to have a better semantic match on those items than on its one mention of the vendor.

It is only because all members of a set of “invoices” are so uniform that you don’t instead get stronger semantic similarity matches on dimensions one might describe as “short curt phrases”, “confusing writing”, or “mechanical things” when running an embeddings similarity search.

Another consideration I can add here:

  • embeddings math will return similarity scores
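To make that concrete, here is a minimal sketch of the scoring step, assuming cosine similarity is the metric (your vector database may use a different one). The helper names and the two-dimensional toy vectors are made up for illustration and are not outputs of any embedding model:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for stored invoice embeddings.
stored_vectors = [
    ("invoice-a", [1.0, 0.0]),
    ("invoice-b", [0.6, 0.8]),
    ("invoice-c", [0.0, 1.0]),
]
results = top_k([1.0, 0.2], stored_vectors, k=2)
```

Nothing in this step knows about “vendor” or “line item”. It only sees one score per stored vector, which is why any emphasis has to come from the text that was embedded.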

While one might simply plop "Vendor: Ace Typewriters, Item: Ink Ribbon" in and get matches (where, again, ink ribbon invoices from many vendors would score higher than other invoices from the vendor in your “query”), another technique is:

  • Make a simulated invoice that looks as close to your invoice embeddings as possible.

That may be a bit hard to synthesize while emphasizing the vendor, but you can spruce it up:

The vendor of the purchase is: ${vendor}.
The line item of the ${vendor} purchase is: ${line_item}.
${vendor} shipped 2 units
completed invoice from ${vendor}

This also looks distinct from the leasing agreements or net terms contract documents you might also dump into the database for the same permissions group of users.
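A small sketch of assembling that simulated invoice as the search text (the function name and the `units` parameter are hypothetical; the wording just mirrors the template above):

```python
def build_simulated_invoice(vendor: str, line_item: str, units: int = 2) -> str:
    """Build query text shaped like a stored invoice, with the vendor repeated."""
    lines = [
        f"The vendor of the purchase is: {vendor}.",
        f"The line item of the {vendor} purchase is: {line_item}.",
        f"{vendor} shipped {units} units",
        f"completed invoice from {vendor}",
    ]
    return "\n".join(lines)


# Example values taken from the "Ace Typewriters" case mentioned earlier.
query_text = build_simulated_invoice("Ace Typewriters", "Ink Ribbon")
```

You would embed `query_text` and run the similarity search with that vector. To tilt toward the line item instead, swap which field gets repeated in the template.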