I’m trying to rate how strongly strings of text relate to the Big 5 Personality Factors by generating embeddings using text-embedding-3-large for a text string and for a description of each Big 5 factor, then calculating the cosine similarity between the text string and each factor description. (So I’m not trying to classify the text string into one category, but to measure the extent to which it relates to each factor.)
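Roughly, the pipeline looks like this (a minimal sketch; the factor descriptions and sample text are placeholders, not the ones I’m actually testing):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Placeholder descriptions, one per Big 5 factor.
factor_descriptions = {
    "openness": "Curiosity, imagination, and openness to new experiences.",
    "conscientiousness": "Organization, self-discipline, and dependability.",
    "extraversion": "Sociability, assertiveness, and positive energy.",
    "agreeableness": "Compassion, cooperation, and trust in others.",
    "neuroticism": "Anxiety, moodiness, and emotional instability.",
}

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

factor_vectors = embed(list(factor_descriptions.values()))
text_vector = embed(["I love meeting new people at parties."])[0]

# One similarity score per factor, rather than a single winning category.
scores = {
    name: cosine_similarity(text_vector, vec)
    for name, vec in zip(factor_descriptions, factor_vectors)
}
print(scores)
```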
However, I’m finding that if I vary the description for each factor (e.g. length, detail, keywords), I get quite different results each time, and in many cases the cosine similarities are fairly even across the Big 5 factors, without much differentiation.
So, my question is: are there any best practices for writing category/factor descriptions that will produce the best results?
An embedding is not a problem solver or a topic finder. It is a similarity mechanism: the vector captures the state of an AI model after processing an input. Algorithmic comparison of two embedding vectors can score how correlated the two inputs are, even down to formatting or tone, along with other qualities only an AI picks up on.
An internet forum post that is instigating and targeting, for example, is not going to match well with an overall description of stalking or brigading behavior that would score the poster’s toxic influence on that forum.
Therefore, you will need to do your own computations: build a vector database of textual examples that are similar to the inputs you will send, each manually rated as strong in one or several of the metrics.
Then, after getting a ranked return for a text against those embeddings and their labels, you can proceed to an algorithm that determines the strength in each factor.
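One way that last step could look, reusing the `embed()` and `cosine_similarity()` helpers from the snippet in the question (the example texts, ratings, and the similarity-weighted averaging are all illustrative choices, not a prescription):

```python
# Illustrative hand-rated examples; a real store would hold many more,
# ideally rated on all five factors.
rated_examples = [
    {"text": "I planned every detail of the trip weeks in advance.",
     "ratings": {"conscientiousness": 0.9, "neuroticism": 0.2}},
    {"text": "I'd rather stay home than go to a crowded party.",
     "ratings": {"extraversion": 0.1, "openness": 0.4}},
]

example_vectors = embed([ex["text"] for ex in rated_examples])

def factor_strengths(text: str, k: int = 5) -> dict[str, float]:
    """Score a text on each factor via a similarity-weighted average
    of the manual ratings of its k most similar rated examples."""
    query = embed([text])[0]
    sims = [cosine_similarity(query, vec) for vec in example_vectors]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    totals: dict[str, float] = {}
    weights: dict[str, float] = {}
    for i in top:
        for factor, rating in rated_examples[i]["ratings"].items():
            totals[factor] = totals.get(factor, 0.0) + sims[i] * rating
            weights[factor] = weights.get(factor, 0.0) + sims[i]
    return {f: totals[f] / weights[f] for f in totals}
```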
Thanks for the reply! Ideally I would train a custom model on textual examples, but I was hoping to use a zero-shot approach (as this seems to work fairly well). Rather than assign the text string to the category with the highest cosine similarity score, I’d use the magnitude of each score to reflect how strongly the text relates to each personality factor.
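For instance, something like this hypothetical rescaling of the five raw similarities (the softmax and the temperature value are just one illustrative choice for making small differences easier to compare):

```python
import numpy as np

def relative_strengths(scores: dict[str, float], temperature: float = 0.05) -> dict[str, float]:
    """Rescale raw cosine similarities into relative weights that sum
    to 1, amplifying small differences between factors."""
    names = list(scores)
    values = np.array([scores[n] for n in names])
    exp = np.exp((values - values.max()) / temperature)
    return dict(zip(names, exp / exp.sum()))
```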
“Ask an AI” is something less algorithmic in nature. You are essentially asking: based on this input, rank the tokens by certainty, and hope that out of the ~200k tokens in the vocabulary, the most prominent ones are numbers that relate to a score value.
With structured output (where you are still given logprobs, because you have only told the AI how to format its answer), you can look at the logprobs, which are a ranking of those top tokens. If at the “happy” position there is “8”: 39% and “6”: 21%, you can interpolate beyond the single token the AI and its random sampler would have output.
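A minimal sketch of that interpolation, assuming a prompt that constrains the answer to a single digit rather than a full structured output (the model name, prompt, and sample text are placeholders):

```python
import math
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[
        {"role": "system",
         "content": "Rate how strongly the text reflects extraversion "
                    "on a scale of 0-9. Reply with a single digit only."},
        {"role": "user", "content": "I love meeting new people at parties."},
    ],
    max_tokens=1,
    logprobs=True,
    top_logprobs=10,
)

# The top-ranked alternatives at the score position, with their logprobs.
top = response.choices[0].logprobs.content[0].top_logprobs
digit_probs = {t.token: math.exp(t.logprob) for t in top if t.token.strip().isdigit()}

# Probability-weighted average over the digit tokens, renormalized so
# the digit probabilities sum to 1.
total = sum(digit_probs.values())
score = sum(int(tok) * p / total for tok, p in digit_probs.items())
print(f"Interpolated score: {score:.2f}")
```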
If you do not have extensive training data, this may be the path for you, at an order of magnitude greater expense than embeddings and without any reusable AI data. If you do have training data, you can explore both fine-tuning on the scoring task and seeing how embeddings perform. Even a synthesis of the two AI results might be worth considering.
Good luck, as the implementation is the secret sauce of many other products.