Semantic search using uploaded files (only performs lexical search for me)


It looks like search using an uploaded file only works lexically, not semantically.
The search example from the docs works well. But when I upload a file with this content:

{ "text": "White House" }
{ "text": "hospital" }
{ "text": "school" }

and query with { file: <fileId>, query: 'the president' }, I’m getting the error ‘No similar documents were found’.

Querying for ‘school’ returns the matching document.
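For completeness, the full request body looks roughly like this (a sketch; max_rerank is optional):

```json
{
  "file": "<fileId>",
  "query": "the president",
  "max_rerank": 200
}
```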

File info returns:

{
  "id": <fileId>,
  "object": "file",
  "bytes": 70,
  "created_at": 1623666119,
  "filename": "president.jsonl",
  "purpose": "search",
  "status": "processed",
  "status_details": null
}

How do I perform semantic search within an uploaded file?


Thanks for the reply.

I’m using babbage, as suggested by the docs.

I’d suspect that adding “The”, as in “The White House”, could change the result.

Probably, but I need it to match semantically on the meaning, not lexically on the word.

Isn’t the keyword search killing all the semantic power of the search?

How do I then perform semantic search on >200 documents?


Is there a way to bypass keyword search, and do a direct semantic search? For me, it breaks even if the document contains “birthdays” and I search for “birthday”.


I am having the same problems. I posted this recently:

The search documentation states this:

“File-based search is a two-step procedure that begins by narrowing the documents in the provided file to at most max_rerank number of documents using a conventional keyword search.”

  1. Can anyone elaborate on what is meant by “keyword search”? It must be something more sophisticated than “exact match” but less sophisticated than the semantic re-ranking occurring in the second step.
  2. If nothing is found during the keyword step, no re-ranking is performed, correct?
  3. Can the problem in #2 be avoided by specifying max_rerank larger than the total number of documents?

For context, my use case involves documents with highly complex, technical language, so there could be a relatively high proportion of instances where nothing is found from a user’s query based on a keyword search. That’s exactly why I need the semantic capabilities of GPT-3 – not to be hamstrung by keyword search. See the irony here?

I understand that there are compute costs that make some constraints necessary. Would love to discuss the future of search in more depth with the OpenAI team.


I’m hitting this problem as well now. Has anyone found a solution? Is it a bug in the API?

When I use the search endpoint with the “documents” param it returns expected results, but if I use the file param with the exact same documents (just uploaded as a .jsonl file instead), then I get no results unless there are exact keyword matches.


I think you have to use the embeddings endpoint. Get the embeddings for your documents (in advance) and for your search query (at search time), then use similarity ranking to get the top search results. That’s what I did. The embeddings endpoint is working amazingly well for me.

You are spot on @lmccallum

I asked support about it as well, and they said:

“Unfortunately, there’s no way to do File-based Search without keyword narrowing, but I’ll share that as a feature request with the team!”


“You can’t provide more than 200 documents to Search without using Files, and when using Files, you can’t remove the keyword search step, so your best bet for that would be signing up for embeddings”


Today is the shut-down day for the old Search API, in favor of the file-based approach being discussed in this thread. Did they ever come up with a way for us to do semantic search without the keyword pre-filter step? Quite frankly, that’s the silliest thing I’ve seen OpenAI do. As clearly stated by all of you in this thread, it creates a chicken-and-egg catch-22 that completely ruins the search for me.

So are things different now than they were 10 months ago? If so, can someone point me to a document that tells me how to eliminate the keyword-search-based filtering step?

@alitana @hallacy

Yes, the text-search models, which use vector embeddings of the text, are excellent and do not narrow down by keyword first. Here is a link to the embeddings guide:


Thanks. Is there a comprehensive tutorial that shows a Node.js example that would take a user’s input, prepare the embeddings, and then submit the transformed query to the Semantic Search API?

Hey @robert.oschler I was in the same situation and here is what we are using now (until we find a better solution). Not sure if I got it right, but this works for me:

  1. Create the embeddings (an array of numbers) for each document you want to search (example using got):

import got from "got";

type EmbeddingResponse = { data: { embedding: number[] }[] };

export const getEmbeddings = async (input: string, model: string) => {
  const options = {
    headers: { authorization: "Bearer " + process.env.OPENAI_API_KEY },
    json: { input, model },
  };
  const response = await got
    .post("https://api.openai.com/v1/embeddings", options)
    .json<EmbeddingResponse>();
  return response?.data[0]?.embedding;
};
  2. Store the embeddings for speed and cost
  3. Create the embedding for your search query
  4. Compare the embeddings using Cosine Similarity (something like the distance between the query and the document):
export function cosinesim(A, B) {
  let dotproduct = 0;
  let mA = 0;
  let mB = 0;
  for (let i = 0; i < A.length; i++) {
    dotproduct += A[i] * B[i];
    mA += A[i] * A[i];
    mB += B[i] * B[i];
  }
  mA = Math.sqrt(mA);
  mB = Math.sqrt(mB);
  return dotproduct / (mA * mB);
}

This will give you a score for each document relative to your query, representing how similar they are (higher means more semantically similar).
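To turn those per-document scores into actual search results, here is a minimal self-contained sketch (the toy 3-dimensional vectors stand in for real embeddings, and the similarity function is repeated so the snippet runs on its own):

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Score every stored document against the query embedding, best match first.
function rankDocuments(queryEmbedding, docs) {
  return docs
    .map((doc) => ({ text: doc.text, score: cosineSim(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score);
}

const docs = [
  { text: "White House", embedding: [0.9, 0.1, 0.0] },
  { text: "hospital", embedding: [0.1, 0.9, 0.1] },
  { text: "school", embedding: [0.0, 0.2, 0.9] },
];
const queryEmbedding = [0.8, 0.2, 0.1]; // pretend this came from "the president"
const ranked = rankDocuments(queryEmbedding, docs);
// ranked[0].text === "White House"
```

Sorting descending and taking the top N gives you ranked results with no keyword pre-filter at all.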

I hope this is useful.


Credits for the cosinesim function go to

Thanks honas. Is there similar code that shows how to do the next step and execute the Similarity Search? Or has the similarity search been displaced entirely by the Cosine similarity match code you show?

I was under the impression that you had to do the pre-filter, document similarity step, and then do the actual call to the Semantic Search API. If that impression is still correct, then I am not sure how to proceed from a point after the Cosine document similarity search to the semantic search. If that impression is not correct, and what you have shown me is the entire procedure, then I’ll just do as you suggest.

In my understanding, this is the whole procedure; the cosine similarity represents how close your document and your query are in embedding space.
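To make that intuition concrete: cosine similarity ranges from -1 to 1. A toy illustration with hand-picked vectors (not real embeddings):

```javascript
// Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite direction.
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

const same = cosineSim([1, 2, 3], [1, 2, 3]);        // ~1: identical direction
const unrelated = cosineSim([1, 0, 0], [0, 1, 0]);   // 0: nothing in common
const opposite = cosineSim([1, 2, 3], [-1, -2, -3]); // ~-1: opposite direction
```

In practice, embeddings of related text land close together, so their similarity is high even when they share no keywords.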


Great. I’ll give it a go. Thanks again.

Hello again @honas,

During your trials, did you find an optimal size for text blocks to be used as “documents” for a semantic search using embeddings?

I want to let people “deep search” into transcripts. My current thought is to treat each block of 256 words as a “document” and then do a semantic search using the user’s input query against the collection of 256 word documents I created from the transcript. That way I can take them directly to the parts of the transcript that are most relevant to their search query.

What do you think of a 256-word text block size? I do know about the 200-document limit, so for extremely long transcripts I may have to increase the word count per text block, but I don’t see that happening very often.
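A minimal sketch of that chunking step (the blockSize default and the startWord bookkeeping field are illustrative choices, assuming simple whitespace tokenization):

```javascript
// Split a transcript into fixed-size word blocks, each treated as one
// "document" for embedding. startWord records the offset into the transcript
// so matches can be deep-linked back to the right spot.
function chunkTranscript(transcript, blockSize = 256) {
  const words = transcript.split(/\s+/).filter((w) => w.length > 0);
  const blocks = [];
  for (let i = 0; i < words.length; i += blockSize) {
    blocks.push({
      startWord: i,
      text: words.slice(i, i + blockSize).join(" "),
    });
  }
  return blocks;
}

// A 600-word transcript splits into 3 blocks: 256 + 256 + 88 words.
const transcript = Array.from({ length: 600 }, (_, i) => "word" + i).join(" ");
const blocks = chunkTranscript(transcript, 256);
```

Each block's text then goes through the embeddings endpoint, and the query is ranked against the stored block embeddings.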

It’s difficult for me to provide a definitive answer to your question without knowing more about your specific use case and the data you’re working with. However, there are a few things to consider when choosing the size of text blocks for a semantic search using embeddings.

First, it’s important to consider the trade-off between the granularity of your search results and the accuracy of the semantic search. Using smaller text blocks will allow you to provide more detailed and fine-grained search results, but it may also decrease the accuracy of the semantic search if the small text blocks don’t contain enough information to accurately represent the meaning of the text. On the other hand, using larger text blocks will provide less detailed search results, but it may increase the accuracy of the semantic search because the larger text blocks will contain more information.

Another thing to consider is the length of the transcripts you’re working with. If the transcripts are very long, using larger text blocks may be necessary to avoid hitting the 200-document limit you mentioned (though I’m not sure that limit still applies if you use the embeddings endpoint rather than the search endpoint). In that case, it may be worthwhile to experiment with different text block sizes to find the optimal balance between accuracy and granularity.

Ultimately, the best text block size will depend on your specific use case and the data you’re working with.


Thanks. So basically, it’s empirically determined (aka try and try again. :smile: )

Yeah, I can’t see how that limit matters anymore, since you (the developer) are comparing the embedding of a single user input against your own database of stored embeddings for the target documents, which were also retrieved one at a time from the embeddings API.

Did you work with very specific domain knowledge? Can you elaborate a bit more on how you did it? I am still struggling.