Semantic Search doubt

  1. What are the advantages of using GPT-3 for semantic search over using word embeddings or pre-trained transformers (or similar models)?

Understanding scores

The similarity score is a positive score that usually ranges from 0 to 300 (but can sometimes go higher), where a score above 200 usually means the document is semantically similar to the query. At the moment, the score is very useful for ranking (we’ve seen it outperform many existing semantic ranking approaches). For example, you can use it for re-ranking the top few hundred examples from an existing information retrieval system.

Each search query produces a different distribution of scores for a fixed group of documents. For instance, if you have a group of documents that are summaries of books, the query “sci-fi novels” might have a mean score of 150 and standard deviation of 50, whereas the query “cat training” might have a mean score of 200 and standard deviation of 10, if you were to search these queries against every document in the group. The variation is a consequence of the search setup, where the query’s probability (what is used to create the score) is conditioned on the document’s probability.

If you need scores that don’t vary by query, you can randomly sample 50-100 documents for a query and calculate the mean and standard deviation, then normalize new scores for that same query using that mean and standard deviation.
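The normalization procedure described above can be sketched in a few lines. This is a minimal illustration, not official code: `normalize_scores` is a hypothetical helper name, and the toy sample stands in for the 50-100 randomly sampled documents you would actually score against the query.

```python
import statistics

def normalize_scores(sample_scores, new_scores):
    # Per-query z-score normalization: sample_scores are the raw search
    # scores of 50-100 randomly sampled documents for this query; new
    # scores for the *same* query are then expressed in standard
    # deviations above or below that query's mean.
    mean = statistics.mean(sample_scores)
    stdev = statistics.stdev(sample_scores)
    return [(s - mean) / stdev for s in new_scores]

# Toy stand-in for the sampled raw scores of one query:
sample = [100, 150, 200]
# A raw score of 250 is two standard deviations above this query's mean:
print(normalize_scores(sample, [250]))  # -> [2.0]
```

Because each query gets its own mean and standard deviation, normalized scores from different queries become comparable to each other, which the raw scores are not.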

  1. Can someone please explain the above section from the website in an easy-to-understand manner, and show how I should actually normalize the score in code?

  2. Does GPT-3 Semantic Search work for other languages too?

  3. Briefly, how does GPT-3 do semantic search under the hood? Does it freeze some layer to compare latent vectors or …?

Many thanks.


I can help answer at least a couple of those questions:

Does GPT-3 Semantic Search work for other languages too?

The vast majority of the training set is English, but there was enough content from other languages to give GPT-3 some modest multilingual capabilities. Performance on English text tends to be significantly better than other languages.

Briefly, how does GPT-3 do semantic search under the hood?

The search query and each document are combined into prompts that are sent to the model. The computed log probabilities of the tokens in the document are used to produce a score indicating how likely it is that the document is related to the query.
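The exact prompt template and score transformation are not public, but the idea described above can be sketched as a toy function: average the per-token log probabilities of the document (computed with the query in the prompt), then rescale. The scaling constants here are purely made up to land in the roughly 0-300 range mentioned earlier.

```python
def relatedness_score(token_logprobs):
    # token_logprobs: per-token log probabilities of the document's tokens,
    # as the model computes them when the query is part of the prompt.
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    # Hypothetical rescaling into the ~0-300 range mentioned earlier;
    # the real transformation used by the API is not public.
    return 100 * mean_lp + 300

likely_doc = [-0.4, -0.7, -0.3]      # tokens the model expects given the query
surprising_doc = [-3.5, -4.2, -2.9]  # tokens the model finds unlikely

# A document whose tokens are more probable given the query scores higher:
assert relatedness_score(likely_doc) > relatedness_score(surprising_doc)
```

This also explains why scores vary by query (as in the quoted help-center text): the log probabilities are conditioned on the full combined prompt, so two different queries produce two different score distributions over the same documents.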


@dschnurr Does this mean search can be used for categorical completions? Instead of finding the most likely completion given a prompt out of all possible completions, I am trying to find the most likely completion from a set of possible completions.

Thank you for sharing, dschnurr.

I would love to understand this a bit more and notably what advice OpenAI has in order to encourage good design of documents and queries.

The reason this matters is that the document most similar to the query is often not the document that provides the information needed to produce the best completion. At the moment, we have to invent tricks to get the right behavior, but doing so requires us to also guess how the ranking is done. Anything that you are able to share, or any advice you have to make it more likely to get the best matches, would be helpful.

The working assumption is that it is essentially an optimized version of similarity for internal representations of the documents and queries. When you say that the two get combined into one, it sounds like this is not what is happening though. I am a bit curious how that differs from similarity (and how it could remain scalable). Are you sure that this combination into a common prompt is done for the semantic search itself, and not by the endpoint using semantic search as a step?

From these experiments, it does seem like ranking is done more by similarity than by completion likelihood:

It’s more accurate to say that the score indicates whether the documents are semantically related to the input query, rather than how likely they are as a completion if the input query were passed as a prompt. Depending on your use case, it may still work, but you’d probably need to do some testing to see how well it works.

Alternatively you could pass your query + potential completion pairs to the /completions endpoint with max_tokens=0, echo=True, logprobs=1 set – and aggregate the log probabilities of the tokens corresponding to your categories in the prompt text to figure out which one seems to be the most likely overall (you may need to do some normalization between categories across a set of queries).
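A rough sketch of that trick, assuming the classic /completions response shape: the API call itself is left as a comment (it requires a network call and an API key), and the aggregation step is illustrated with made-up log-probability values. The helper names are hypothetical.

```python
# The call described above would look roughly like this (old-style SDK):
#   resp = openai.Completion.create(
#       engine="davinci", prompt=query + " " + category,
#       max_tokens=0, echo=True, logprobs=1)
# The response then includes per-token logprobs for the echoed prompt.

def category_logprob(token_logprobs, query_token_count):
    # Sum only the logprobs of the tokens belonging to the candidate
    # category, i.e. everything after the shared query prefix.
    return sum(token_logprobs[query_token_count:])

def best_category(scores_by_category):
    # Pick the category whose tokens the model found most likely overall.
    return max(scores_by_category, key=scores_by_category.get)

# Made-up per-token logprobs for two "<query> <category>" prompts whose
# first four tokens are the (shared) query:
scores = {
    "sports":   category_logprob([-1.0, -2.0, -0.5, -1.5, -0.2, -0.3], 4),
    "politics": category_logprob([-1.0, -2.0, -0.5, -1.5, -2.8, -3.1], 4),
}
assert best_category(scores) == "sports"
```

Because the query prefix is identical across candidates, its logprobs cancel out of the comparison; summing only the category tokens is what makes the scores comparable. Categories of different token lengths may still need the per-query normalization mentioned above.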

And sorry for the delayed response, was on vacation for the past week 🙂


The query and document are not simply concatenated together; there is some other content in the combined prompt that optimizes for determining semantic relatedness rather than whether the document is the most likely completion if the input query were used as a prompt.

@dschnurr Thanks for the reply and for the explanation.

Yeah, I am currently using the second method, though it is quite token inefficient.

I’m super late to the party, but wanted to share this article on search score normalization - hope you find it useful: Understanding Search Scores | OpenAI Help Center