Why doesn't the search endpoint use vector similarity methods?

Currently, key/query similarity scores are computed through prompting and log probabilities. In the literature you generally see vector similarities used instead (e.g., DPR, RAG, REALM), since they enable optimizations such as pre-embedding and LSH. This would boost query speed and significantly reduce search cost!
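
To make the speed claim concrete, here's a rough numpy sketch (corpus size and dimensions are made up) of why pre-embedding pays off: once documents are embedded offline, each query costs a single matrix-vector product instead of one model call per document.

```python
import numpy as np

# Hypothetical pre-embedded corpus: one L2-normalized vector per document.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 768))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def search(query_embedding, top_k=5):
    """Rank every document with a single matrix-vector product."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = corpus @ q                 # cosine similarity against all documents
    return np.argsort(-scores)[:top_k]  # indices of the best matches
```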

Understandably, GPT is a decoder-only model and can't leverage representations such as [CLS] token embeddings the way encoder models (e.g., BERT) can, but I'm sure some workaround could be found by combining GPT's output vectors. Why hasn't this been explored yet, given the obvious benefits?
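
For illustration, here's one shape such a workaround could take, sketched with the open-source GPT-2 via Hugging Face transformers (this is my guess at "combining output vectors", not anything OpenAI has confirmed): mean-pool the final hidden states over non-padding tokens.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModel.from_pretrained("gpt2")

def embed(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pool real tokens
```

Of course, without contrastive fine-tuning such pooled vectors tend to retrieve poorly, which may be part of the answer to "why not".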


Actually, would it be simplest to pair GPT-3 with a custom retrieval model? Should the search/answers endpoints be ignored altogether? Are there any experiments comparing GPT-3's retrieval performance with other methods?


We now offer the embeddings endpoint, and you can perform embeddings search using vector similarity, as laid out in this notebook.
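
Roughly, the flow in that notebook boils down to something like this sketch (old-style openai Python client from this era; the engine name matches what's used later in this thread, and the documents and query here are made up):

```python
import openai
import numpy as np

openai.api_key = "sk-..."  # your API key

def get_embedding(text, engine="babbage-similarity"):
    resp = openai.Embedding.create(input=[text.replace("\n", " ")], engine=engine)
    return resp["data"][0]["embedding"]

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed the documents once up front...
docs = ["How to file a complaint", "Refund policy", "Shipping times"]
doc_embeddings = [get_embedding(d) for d in docs]

# ...then rank them against each incoming query.
query_embedding = get_embedding("how do I get my money back")
ranked = sorted(
    zip(docs, doc_embeddings),
    key=lambda pair: cosine_similarity(query_embedding, pair[1]),
    reverse=True,
)
print([doc for doc, _ in ranked])
```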


@boris hooray :tada::pray:. Was that feature in the works for a while, or can I take credit for suggesting it? :crossed_fingers:


Hi Boris, I am having a bit of trouble with the notebook since I don’t have a technical background. Usually I can follow technical instructions and create my own Colab notebooks for testing, as I’ve done for the search endpoint and other OpenAI use cases, but I’m having trouble with the embeddings endpoint in Colab. My three questions about the notebook and the OpenAI documentation are as follows:

  1. The demo dataset comes from Amazon reviews, and I see there is code to combine certain columns, fetch only a subset of the data, and do some other cleaning. I have my own (quite small) dataset in a CSV file ready to go, so for non-technical people like me, it would be super helpful if the notebook and documentation could be simplified by just indicating [your CSV file goes here] where appropriate, removing the intricate code for creating a CSV file from a very large dataset.

  2. The OpenAI documentation for the search endpoint is very clear that a JSON Lines file must be uploaded to OpenAI first. It appears that the embeddings endpoint will accept CSV files. That’s great. What isn’t clear from your notebook is whether your CSV from the Amazon data was previously uploaded to OpenAI. Can the CSV file be used just by loading it into a Colab session, or does it have to be uploaded to OpenAI first?

  3. I am getting two error messages: “cannot import name ‘get_embedding’ from ‘utils’” and “cannot import name ‘get_embedding’ from ‘openai’”. I don’t know where I am going wrong, since I’m trying to follow your instructions exactly. Stack Overflow suggested perhaps I am using the wrong version of Python, or perhaps utils is located in an incorrect file location. Any insight on this?

Sorry for my long and perhaps amateurish questions, but my company Lexata is a start-up with a part-time CTO, and I’m the one with subject matter expertise who is rigorously testing our OpenAI outcomes. I do like to code in Colab myself if my CTO is busy.

As an aside, I really appreciate that OpenAI is pretty friendly to non-developers (who are willing to work hard to learn). Often, non-developers have compelling use cases based on their industry experience, but we can only collaborate effectively with technical people by entering the technical world and learning as much as we can. Then the great ideas can be brought to life by the whole team. That’s what I’m trying to do. Frankly, I really wish there were an OpenAI Residency stream for non-developers to collaborate with developers. I think it would make the program richer and create a bigger pipeline of promising commercial (and non-profit) applications. I’m referring to this: OpenAI Residency.

Thanks very much for your assistance.

Lemme see if I can take a swing at some of these.

  1. Can you point me to the code you’re looking at? We should be able to simplify some things.

  2. With the search endpoint you could upload a file ahead of time because we wanted to scale up a very expensive endpoint (I suspect you’re referring to OpenAI API). With embeddings, you don’t need to do that. The biggest reason is that once you have the embeddings you can do all the search on your end without any more calls to us.
    I think what Boris meant was that you can use a CSV as the starting point to get embeddings but that CSV doesn’t get uploaded to us.

  3. Ah, yeah, this one’s on us. The code refers to a file in our openai-python package. Specifically: openai-python/utils.py at main · openai/openai-python · GitHub. You could copy the file to your working directory or just copy the functions from there. We want to update these references soon.
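
For example, in a Colab cell, one way to pull that file into your working directory (assuming the raw URL derived from the path above is still current) is:

```python
# In a Colab cell: download utils.py next to the notebook, then import the helpers.
# The raw URL is derived from the repo path above; adjust it if the layout changes.
!wget -q https://raw.githubusercontent.com/openai/openai-python/main/utils.py
from utils import get_embedding, cosine_similarity
```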


Thanks @hallacy.

  1. I am trying to piece together code from three notebooks linked in your embeddings documentation:

a) semantic search: openai-python/Semantic_text_search_using_embeddings.ipynb at main · openai/openai-python · GitHub

b) obtain a dataset: openai-python/Obtain_dataset.ipynb at main · openai/openai-python · GitHub

c) get embeddings: openai-python/Get_embeddings.ipynb at main · openai/openai-python · GitHub

I am having trouble following the logic because there are so many cross-references among pieces of code, some of which I don’t need because I have a CSV dataset.

  2. Let’s say I am implementing a semantic search solution for my website. Isn’t it true that I would still be making API calls to GPT-3 in order to obtain the embedding of my user’s query? Then, once I have that embedding, I can finish the search on my end. Do you agree?

  3. Thanks, I’ll try to figure that out. If I am using Colab, can I upload the file to a session?

Aside from my questions above, I’d love to touch base with you again over Zoom sometime soon to share more details about Lexata’s work and GPT-3.

Best, Leslie

  1. Gotcha. The trick we’re going for, especially in Obtain_dataset, is that a lot of CSVs won’t necessarily have the exact data you want to embed. Instead, you’d have to combine various fields together to get your prompt.
    If your data is already ready to go, you should just be able to swap out your field name for combined here (there’s a sketch of this after the list below):
    df.combined.apply(lambda x: get_embedding(x, engine='babbage-similarity'))

  2. I got DM’d about this a few times, and yes, you’re totally correct.
    What I was trying to say was that for a given document you only need to embed once, so doing a search over your dataset should be considerably cheaper than using the search endpoint. You’ll still need to embed each query first.

  3. I believe so! I think if you click on the folder icon on the left in Colab, you should be able to upload your file.
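
To make item 1 concrete, here’s a sketch with made-up file and column names (substitute your own):

```python
import pandas as pd
from utils import get_embedding  # or your local copy of the helper

df = pd.read_csv("my_data.csv")  # hypothetical file name: use your own
df["embedding"] = df["my_text_column"].apply(  # hypothetical column name
    lambda x: get_embedding(x, engine="babbage-similarity")
)
df.to_csv("my_data_with_embeddings.csv", index=False)
```

And on item 3: besides the folder icon, Colab can also prompt for an upload programmatically with `from google.colab import files; files.upload()`.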

And sure! I’ll email you.


@lmccallum We’ve updated the Python library and the import statements in OpenAI API to make it clearer how to import the functions.


Thanks @hallacy. Sorry I was totally disorganized on our call today. It’s been a long week. Appreciate your patience with a non-developer.