I have a simple function in JavaScript (TypeScript) that compares a word against a list of words and returns the cosine similarity for each one:
import OpenAI from "openai";

// Assumed setup: the client reads OPENAI_API_KEY from the environment.
const client = new OpenAI();

export async function filterSimilarWords(target: string, candidates: string[]) {
  // Create embeddings for target and candidates
  const targetEmbeddingRes = await client.embeddings.create({
    model: "text-embedding-3-large",
    input: target,
  });
  const candidatesEmbeddingRes = await client.embeddings.create({
    model: "text-embedding-3-large",
    input: candidates,
  });
  const targetEmbedding = targetEmbeddingRes.data[0].embedding;
  const candidateEmbeddings = candidatesEmbeddingRes.data.map(e => e.embedding);

  // Cosine similarity function: dot product divided by the product of the norms
  function cosineSimilarity(vecA: number[], vecB: number[]): number {
    const dot = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
    const normA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
    const normB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
    return dot / (normA * normB);
  }

  // Score each word, then sort from most to least similar
  return candidates
    .map((word, i) => ({
      word,
      similarity: cosineSimilarity(targetEmbedding, candidateEmbeddings[i]),
    }))
    .sort((a, b) => b.similarity - a.similarity);
}
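To rule out the math itself, the similarity helper can be checked in isolation. This is only a minimal sketch: the vectors below are toy values I made up for the check, not real embeddings, and the helper is a standalone copy of the one above.

```typescript
// Standalone copy of the cosine similarity helper from the function above.
function cosineSimilarity(vecA: number[], vecB: number[]): number {
  const dot = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
  const normA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const normB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
  return dot / (normA * normB);
}

// Toy vectors (invented for this check, not real embeddings).
const parallel = cosineSimilarity([1, 2, 3], [2, 4, 6]);   // same direction
const orthogonal = cosineSimilarity([1, 0], [0, 1]);       // perpendicular
const opposite = cosineSimilarity([1, 2], [-1, -2]);       // opposite direction

// Rounded to 3 decimals to hide floating-point noise.
console.log(parallel.toFixed(3), orthogonal.toFixed(3), opposite.toFixed(3));
// prints "1.000 0.000 -1.000"
```

Since the helper behaves as expected on these inputs, the odd scores seem to come from the embeddings rather than from the similarity calculation.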
With Serbian words, it performs very poorly.
For example, if I provide the target word ‘парадајз’ (tomato)
and the candidates [‘поврће’, ‘мајонез’, ‘пилећи’] (vegetable, mayonnaise, chicken),
it returns a similarity over 0.8 for mayonnaise and chicken, and less than 0.5 for vegetable.
Is there something I can do to improve the results, like giving it context somehow, or is it what it is?
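To make the "context" idea concrete, here is a rough sketch of what I have in mind: wrapping each bare word in a short phrase before embedding it. The helper name `withContext` and the template text ("Намирница", roughly "food item") are my own placeholders, and I have not verified that this actually improves the scores.

```typescript
// Hypothetical helper: wraps a bare word in a short phrase before embedding.
// The default template is my own guess, not something taken from the API docs.
function withContext(word: string, template: string = "Намирница: {word}"): string {
  // split/join avoids the special "$" handling of String.prototype.replace
  return template.split("{word}").join(word);
}

const targetInput = withContext("парадајз");
const candidateInputs = ["поврће", "мајонез", "пилећи"].map((w) => withContext(w));

// These strings would be passed as `input` to client.embeddings.create
// in place of the bare words.
console.log(targetInput);
console.log(candidateInputs);
```

The results would still need to be mapped back to the original words, since the embedded strings are no longer the bare candidates.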