Semantic search using uploaded files (only performs lexical search for me)

Hello.

Looks like search using an uploaded file only works lexically, not semantically.
The example from the docs works well. But when I upload a file with this content:

{ "text": "White House" }
{ "text": "hospital" }
{ "text": "school" }

and query with { file: <fileId>, query: 'the president' }, I get the error 'No similar documents were found'.

Querying for 'school' returns the matching document.
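For reference, here's a minimal sketch of the setup described above. The JSONL file is built locally; the API calls are shown as comments because they use the since-deprecated openai-python 0.x client and the legacy search endpoint from that era, so they won't run against the current API:

```python
import json

# The three documents from the example above, one JSON object per line (JSONL).
docs = [{"text": "White House"}, {"text": "hospital"}, {"text": "school"}]
jsonl = "\n".join(json.dumps(d) for d in docs)
with open("president.jsonl", "w") as f:
    f.write(jsonl)

# Legacy-era calls (deprecated; shown only to illustrate the reported setup):
# import openai
# openai.api_key = "..."
# uploaded = openai.File.create(file=open("president.jsonl"), purpose="search")
# openai.Engine("babbage").search(file=uploaded["id"], query="the president")
```

With this file, a query of "school" matches lexically, while "the president" shares no keyword with "White House", which is where the reported failure occurs.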

File info returns:

{
"id": <fileId>,
"object": "file",
"bytes": 70,
"created_at": 1623666119,
"filename": "president.jsonl",
"purpose": "search",
"status": "processed",
"status_details": null
}

How do I perform semantic search within an uploaded file?


Thanks for the reply.

I’m using babbage, as suggested by the docs.

I’d suspect that adding “The” to “White House” could change the result.

Probably, but I need it to match semantically on the meaning, not lexically on the word.

Isn’t the keyword search killing all the semantics of the search?

How do I then perform semantic search on >200 documents?


Is there a way to bypass keyword search, and do a direct semantic search? For me, it breaks even if the document contains “birthdays” and I search for “birthday”.


I am having the same problems. I posted this recently:

The search documentation states this:

“File-based search is a two-step procedure that begins by narrowing the documents in the provided file to at most max_rerank number of documents using a conventional keyword search.”

  1. Can anyone elaborate on what is meant by “keyword search”? It must be something more sophisticated than “exact match” but less sophisticated than the semantic re-ranking occurring in the second step.
  2. If nothing is found during the keyword step, no re-ranking is performed, correct?
  3. Can the problem in #2 be avoided by specifying max_rerank larger than the total number of documents?

For context, my use case involves documents with highly complex, technical language, so there could be a relatively high proportion of instances where nothing is found from a user’s query based on a keyword search. That’s exactly why I need the semantic capabilities of GPT-3 – not to be hamstrung by keyword search. See the irony here?

I understand that there are compute costs that make some constraints necessary. Would love to discuss the future of search in more depth with the OpenAI team.

Hello!!

I’m hitting this problem as well now. Has anyone found a solution? Is it a bug in the API?

When I use the search endpoint with the “documents” param, it returns the expected results, but if I use the file param with the exact same documents (just uploaded as a .jsonl file instead), then I get no results unless there are exact keyword matches.

I think you have to use the embeddings endpoint. Get the embeddings for your documents (in advance) and for your search query (at search time), and use the similarity ranking to get the top search results. That’s what I did. The embeddings endpoint is working amazingly for me.
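A minimal sketch of that embeddings approach, with toy 3-dimensional vectors standing in for real embeddings. In practice each vector would come from the embeddings endpoint: precomputed once for each document, and computed per query at search time; the ranking step itself is just cosine similarity:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-in embeddings for the documents from the original example.
doc_embeddings = {
    "White House": [0.9, 0.1, 0.0],
    "hospital":    [0.1, 0.8, 0.3],
    "school":      [0.0, 0.3, 0.9],
}

# Toy stand-in embedding for the query "the president".
query_embedding = [0.85, 0.15, 0.05]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    doc_embeddings,
    key=lambda d: cosine(query_embedding, doc_embeddings[d]),
    reverse=True,
)
print(ranked[0])  # → White House
```

Because the ranking is purely vector-based, there is no keyword-narrowing step: “the president” can surface “White House” even with zero word overlap, which is exactly what the file-based search couldn’t do.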

You are spot on @lmccallum

I asked support about it as well, and they said:

“Unfortunately, there’s no way to do File-based Search without keyword narrowing, but I’ll share that as a feature request with the team!”

And:

“You can’t provide more than 200 documents to Search without using Files, and when using Files, you can’t remove the keyword search step, so your best bet for that would be signing up for embeddings”
