Search vs Similarity

mike3 · August 19, 2022, 9:17am

So I’ve got a bunch of projects, and I’ve got a bunch of users (or I would, were this not all hypothetical). I want to suggest users for projects based on the project description, and the user’s skills & bio.

Is it better to use search embeddings or similarity embeddings?

Is there any difference in comparing the embeddings when doing similarity vs search? I’ve successfully implemented search using the doc/query embeddings. Does similarity work more or less the same way?

jhsmith12345 · August 19, 2022, 3:35pm

The way I understand it, similarity search is for smaller chunks. That said, Google’s sentence encoder and Pinecone are supposed to be better, cheaper, and faster.

boris · August 19, 2022, 4:59pm

Hi Mike, thanks for the question.

Search embeddings are generally best for matching very short pieces of text to long pieces of text, such as few-word queries to project descriptions. Based on what you said, you would be comparing project descriptions with user skills and bios, which are both longer texts, so I suspect similarity embeddings would work best.
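On the question of how the comparison works: as far as I know, the scoring step is the same in both cases, usually cosine similarity between two vectors. What differs is which model produces each vector (separate doc and query models for search, a single model for both texts with similarity). Here is a minimal sketch of the ranking step, assuming the embeddings have already been computed. The three-dimensional vectors, the user names, and the helper `rank_users` are made up for illustration; real embeddings are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_users(project_vec, user_vecs):
    """Return (user, score) pairs sorted from most to least similar.

    Hypothetical helper: `user_vecs` maps a user name to the embedding
    of that user's skills and bio.
    """
    scored = [
        (name, cosine_similarity(project_vec, vec))
        for name, vec in user_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption, not real data).
project = [0.9, 0.1, 0.0]
users = {
    "alice": [0.8, 0.2, 0.1],
    "bob": [0.0, 0.1, 0.9],
}

ranking = rank_users(project, users)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With real data you would embed each project description and each user profile once, store the vectors, and run only the ranking step when you need suggestions.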

To improve your embeddings further, you can customize them by learning a translation matrix, as done in the notebook below. You need at least a hundred examples of pairs you consider similar (or dissimilar); in your case, for example, cases where users have previously worked on projects:

https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vq31CdSRpgkI"
      },
      "source": [
        "# Customizing embeddings\n",
        "\n",
        "This notebook demonstrates one way to customize OpenAI embeddings to a particular task.\n",
        "\n",
        "The input is training data in the form of [text_1, text_2, label] where label is +1 if the pairs are similar and -1 if the pairs are dissimilar.\n",
        "\n",
        "The output is a matrix that you can use to multiply your embeddings. The product of this multiplication is a 'custom embedding' that will better emphasize aspects of the text relevant to your use case. In binary classification use cases, we've seen error rates drop by as much as 50%.\n",
        "\n",
        "In the following example, I use 1,000 sentence pairs picked from the SNLI corpus. Each pair of sentences are logically entailed (i.e., one implies the other). These pairs are our positives (label = 1). We generate synthetic negatives by combining sentences from different pairs, which are presumed to not be logically entailed (label = -1).\n",
        "\n",
        "For a clustering use case, you can generate positives by creating pairs from texts in the same clusters and generate negatives by creating pairs from sentences in different clusters.\n",
        "\n",

This file has been truncated.
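To make the idea in the notebook intro concrete: the learned matrix is multiplied with each embedding, and similarity is then computed on the products. The sketch below shows only that application step, not the training. The matrix values, the vectors, and the helper `apply_matrix` are hypothetical and hand-picked so the effect is visible; a real matrix would be learned from your labeled pairs.

```python
import math


def apply_matrix(embedding, matrix):
    """Multiply a row vector by a matrix (embedding @ matrix)."""
    n_cols = len(matrix[0])
    return [
        sum(embedding[i] * matrix[i][j] for i in range(len(embedding)))
        for j in range(n_cols)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The identity matrix leaves embeddings unchanged (baseline).
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical "learned" matrix that down-weights the third dimension,
# as if training had found that dimension irrelevant to the task.
learned = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.1],
]

# Toy vectors that agree on the first two dimensions but not the third.
user = [0.6, 0.3, 0.7]
project = [0.7, 0.2, -0.6]

before = cosine_similarity(user, project)
after = cosine_similarity(
    apply_matrix(user, learned),
    apply_matrix(project, learned),
)
print(f"before: {before:.3f}, after: {after:.3f}")
```

In this toy case the pair scores higher after the multiplication because the dimension on which the two vectors disagree has been suppressed.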
