Semantic search using uploaded files (only performs lexical search for me)


It looks like search using an uploaded file only works lexically, not semantically.
The search example from the docs works well. But when I upload a file with this content:

{ "text": "White House" }
{ "text": "hospital" }
{ "text": "school" }

and query with { file: <fileId>, query: 'the president' }, I’m getting the error ‘No similar documents were found’.

Querying for ‘school’ returns the matching document.
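For completeness, the full request body looks roughly like this (a sketch; max_rerank is optional):

```json
{
  "file": "<fileId>",
  "query": "the president",
  "max_rerank": 200
}
```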

File info returns:

{
  "id": <fileId>,
  "object": "file",
  "bytes": 70,
  "created_at": 1623666119,
  "filename": "president.jsonl",
  "purpose": "search",
  "status": "processed",
  "status_details": null
}

How do I perform semantic search within an uploaded file?


Thanks for the reply.

I’m using babbage, as suggested by the docs.

I’d suspect that adding “The”, as in “The White House”, could change the result.

Probably, but I need it to match semantically on the meaning, not lexically on the word.

Isn’t the keyword search killing all the semantic power of the search?

How do I then perform semantic search on >200 documents?


Is there a way to bypass keyword search, and do a direct semantic search? For me, it breaks even if the document contains “birthdays” and I search for “birthday”.


I am having the same problems. I posted this recently:

The search documentation states this:

“File-based search is a two-step procedure that begins by narrowing the documents in the provided file to at most max_rerank number of documents using a conventional keyword search.”

  1. Can anyone elaborate on what is meant by “keyword search”? It must be something more sophisticated than “exact match” but less sophisticated than the semantic re-ranking occurring in the second step.
  2. If nothing is found during the keyword step, no re-ranking is performed, correct?
  3. Can the problem in #2 be avoided by specifying max_rerank larger than the total number of documents?

For context, my use case involves documents with highly complex, technical language, so there could be a relatively high proportion of instances where nothing is found from a user’s query based on a keyword search. That’s exactly why I need the semantic capabilities of GPT-3 – not to be hamstrung by keyword search. See the irony here?

I understand that there are compute costs that make some constraints necessary. Would love to discuss the future of search in more depth with the OpenAI team.


I’m hitting this problem as well now. Has anyone found a solution? Is it a bug in the API?

When I use the search endpoint with the “documents” param it returns expected results, but if I use the file param with the exact same documents (just uploaded as a .jsonl file instead), then I get no results unless there are exact keyword matches.


I think you have to use the embeddings endpoint. Get the embeddings for your documents (in advance) and for your search query (at search time), then use similarity ranking to get the top search results. That’s what I did. The embeddings endpoint is working amazingly well for me.

You are spot on @lmccallum

I asked support about it as well, and they said:

“Unfortunately, there’s no way to do File-based Search without keyword narrowing, but I’ll share that as a feature request with the team!”


“You can’t provide more than 200 documents to Search without using Files, and when using Files, you can’t remove the keyword search step, so your best bet for that would be signing up for embeddings”


Today is the shut-down day for the old Search API, in favor of the file-based approach being discussed in this thread. Did they ever come up with a way for us to do semantic search without the keyword pre-filter step? Quite frankly, that’s the silliest thing I’ve seen OpenAI do. As clearly stated by all of you in this thread, it creates a chicken-and-egg catch-22 that completely ruins the search for me.

So are things different now than they were 10 months ago? If so, can someone point me to a document that tells me how to eliminate the keyword-search-based filtering step?

@alitana @hallacy

Yes, the text-search models, which use vector embeddings of the text, are excellent and do not narrow down by keyword first. Here is a link to the embeddings guide:


Thanks. Is there a comprehensive tutorial that shows a Node.js example that would take a user’s input, prepare the embeddings, and then submit the transformed query to the Semantic Search API?

Hey @robert.oschler I was in the same situation and here is what we are using now (until we find a better solution). Not sure if I got it right, but this works for me:

  1. Create the embeddings (an array of numbers) for each document you want to search (example using got):

import got from "got";

type EmbeddingResponse = { data: { embedding: number[] }[] };

export const getEmbeddings = async (input: string, model: string) => {
  const options = {
    headers: { authorization: "Bearer " + process.env.OPENAI_API_KEY },
    json: { input, model },
  };
  const response = await got
    .post("https://api.openai.com/v1/embeddings", options)
    .json<EmbeddingResponse>();
  return response?.data[0]?.embedding;
};
  2. Store the embeddings for speed and cost
  3. Create the embedding for your search query
  4. Compare the embeddings using Cosine Similarity (something like the distance between the query and the document):
export function cosinesim(A, B) {
  let dotproduct = 0;
  let mA = 0;
  let mB = 0;
  for (let i = 0; i < A.length; i++) {
    dotproduct += A[i] * B[i];
    mA += A[i] * A[i];
    mB += B[i] * B[i];
  }
  mA = Math.sqrt(mA);
  mB = Math.sqrt(mB);
  return dotproduct / (mA * mB);
}

This will give you a score for each document relative to your query, representing how similar they are (higher means more semantically similar).
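To turn those per-document scores into actual search results, here is a minimal self-contained sketch (the toy 3-dimensional vectors stand in for real embeddings, and the similarity function is repeated so the snippet runs on its own):

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Score every stored document against the query embedding, best match first.
function rankDocuments(queryEmbedding, docs) {
  return docs
    .map((doc) => ({ text: doc.text, score: cosineSim(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score);
}

const docs = [
  { text: "White House", embedding: [0.9, 0.1, 0.0] },
  { text: "hospital", embedding: [0.1, 0.9, 0.1] },
  { text: "school", embedding: [0.0, 0.2, 0.9] },
];
const queryEmbedding = [0.8, 0.2, 0.1]; // pretend this came from "the president"
const ranked = rankDocuments(queryEmbedding, docs);
// ranked[0].text === "White House"
```

Sorting descending and taking the top N gives you ranked results with no keyword pre-filter at all.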

I hope this is useful.


Credits for the cosinesim function go to

Thanks honas. Is there similar code that shows how to do the next step and execute the Similarity Search? Or has the similarity search been displaced entirely by the Cosine similarity match code you show?

I was under the impression that you had to do the pre-filter, document similarity step, and then do the actual call to the Semantic Search API. If that impression is still correct, then I am not sure how to proceed from a point after the Cosine document similarity search to the semantic search. If that impression is not correct, and what you have shown me is the entire procedure, then I’ll just do as you suggest.

In my understanding, this is the whole procedure; the cosine similarity represents how close your document and your query are in embedding space.
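To make that intuition concrete: cosine similarity ranges from -1 to 1. A toy illustration with hand-picked vectors (not real embeddings):

```javascript
// Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite direction.
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

const same = cosineSim([1, 2, 3], [1, 2, 3]);        // ~1: identical direction
const unrelated = cosineSim([1, 0, 0], [0, 1, 0]);   // 0: nothing in common
const opposite = cosineSim([1, 2, 3], [-1, -2, -3]); // ~-1: opposite direction
```

In practice, embeddings of related text land close together, so their similarity is high even when they share no keywords.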


Great. I’ll give it a go. Thanks again.

Hello again @honas,

During your trials, did you find an optimal size for text blocks to be used as “documents” for a semantic search using embeddings?

I want to let people “deep search” into transcripts. My current thought is to treat each block of 256 words as a “document” and then do a semantic search using the user’s input query against the collection of 256 word documents I created from the transcript. That way I can take them directly to the parts of the transcript that are most relevant to their search query.

What do you think of a 256-word text block size? I do know about the 200-document limit, so for extremely long transcripts I may have to increase the word count per text block, but I don’t see that happening very often.
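A minimal sketch of that chunking step (the blockSize default and the startWord bookkeeping field are illustrative choices, assuming simple whitespace tokenization):

```javascript
// Split a transcript into fixed-size word blocks, each treated as one
// "document" for embedding. startWord records the offset into the transcript
// so matches can be deep-linked back to the right spot.
function chunkTranscript(transcript, blockSize = 256) {
  const words = transcript.split(/\s+/).filter((w) => w.length > 0);
  const blocks = [];
  for (let i = 0; i < words.length; i += blockSize) {
    blocks.push({
      startWord: i,
      text: words.slice(i, i + blockSize).join(" "),
    });
  }
  return blocks;
}

// A 600-word transcript splits into 3 blocks: 256 + 256 + 88 words.
const transcript = Array.from({ length: 600 }, (_, i) => "word" + i).join(" ");
const blocks = chunkTranscript(transcript, 256);
```

Each block's text then goes through the embeddings endpoint, and the query is ranked against the stored block embeddings.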

It’s difficult for me to provide a definitive answer to your question without knowing more about your specific use case and the data you’re working with. However, there are a few things to consider when choosing the size of text blocks for a semantic search using embeddings.

First, it’s important to consider the trade-off between the granularity of your search results and the accuracy of the semantic search. Using smaller text blocks will allow you to provide more detailed and fine-grained search results, but it may also decrease the accuracy of the semantic search if the small text blocks don’t contain enough information to accurately represent the meaning of the text. On the other hand, using larger text blocks will provide less detailed search results, but it may increase the accuracy of the semantic search because the larger text blocks will contain more information.

Another thing to consider is the length of the transcripts you’re working with. If the transcripts are very long, using larger text blocks may be necessary to avoid hitting the 200-document limit you mentioned (though I’m not sure that limit still applies if you use the embeddings endpoint rather than the search endpoint). In that case, it may be worthwhile to experiment with different text block sizes to find the optimal balance between accuracy and granularity.

Ultimately, the best text block size will depend on your specific use case and the data you’re working with.


Thanks. So basically, it’s empirically determined (aka try and try again. :smile: )

Yeah, I can’t see how that limit matters anymore, since you (the developer) are comparing the embedding of a single user input against your own database of stored embeddings for the target documents, which were also retrieved one at a time from the embeddings API.

Did you work with very specific domain knowledge? Can you elaborate a bit more on how you did it? I am still struggling.