What constitutes a "document" in Search

Hi Tom,

  1. For semantic search, “Ada” is the least capable engine but also the least expensive, whereas “Davinci” is top-of-the-line (and priced accordingly).

  2. To understand max_rerank, it’s useful to first clarify the two meanings of “document.” First, the JSON Lines (JSONL) file you upload is a document: it contains the information from which GPT-3 will draw your search results. (With semantic search via the search endpoint, this file tells GPT-3 what the permissible outputs are. You aren’t getting open-ended search results from the internet, as with completions etc.) Somewhat counterintuitively, each line in the JSONL file is also called a “document,” and each line represents a potential search result. (So for searching resumes, you’d probably put each resume on a single line; sketch 1 after the list shows what that file looks like.)

  3. max_rerank is the maximum number of search results you want GPT-3 to return. So if you specify max_rerank = 8, you’ll get up to 8 resumes back in your search; if you want 50 resumes, use that number (sketch 2 after the list shows the call). I think max_rerank has to be set using some judgement, i.e., taking into account the linguistic complexity of your JSONL file. I think the default max_rerank is 150.

  4. I believe that the total number of lines (documents) in the JSONL file does not affect price. What affects price is (a) the engine chosen, (b) the length of the search query in tokens, (c) the length of each search result (line) in tokens, and (d) the number (max_rerank) of search results. So if you have really long lines in your JSONL file and specify max_rerank = 500 using Davinci, that will be a lot more expensive than having pretty short lines with max_rerank = 10 using Ada (sketch 3 after the list gives a back-of-envelope estimate). For your use case, since resumes are fairly short documents and you only have 150 of them, I think your costs will be pretty low overall, even with Davinci.

  5. Items (b) and (c) above can’t exceed 2048 tokens combined. (As a rule of thumb, one token is roughly 4 characters, or about three-quarters of an English word; the exact conversion is in the GPT-3 documentation.) For my use case, I expect to have some pretty long lines in my JSONL file, so I might have to work around the 2048 limit by splitting long single lines into multiple lines (sketch 4 after the list shows one way to do this). I think I might have a neat way of actually improving the search results by doing so, but I won’t go into that detail here.

  6. The first step in a GPT-3 search is to narrow the pool of possible search results (lines/resumes) down to max_rerank candidates. Until recently this was accomplished by a simple keyword search, but I believe OpenAI has upgraded it to fuzzy matching. If you set max_rerank too small, there is a risk that the first step won’t collect all of the most relevant lines/resumes to feed into the second step, which would be bad. To avoid this, for my use case I’d like to re-rank everything in my JSONL file (no max_rerank cap at all), but the cost could be prohibitive, and perhaps performance would be too slow? The second step, where the real magic happens, takes place after the fuzzy search has chosen your max_rerank documents: it uses the full power of GPT-3’s semantic search to rank those lines/resumes in order of relevance and present them as search results (sketch 5 after the list).

  7. If I’m wrong about any of the above, hopefully an OpenAI person will correct me.
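
To make some of this concrete, here are a few rough Python sketches, keyed to the points above. Treat them as illustrations under my own assumptions, not official OpenAI code; all file names, IDs, and resume text are made up.

Sketch 1 (point 2): building the JSONL file, where each line is one self-contained JSON object (one resume), then uploading it. As I understand the file format for search, each line wants a “text” field.

```python
import json

import openai

openai.api_key = "sk-..."  # placeholder key

# Each line of the JSONL file is one JSON object -- one "document" in the
# per-line sense, i.e., one potential search result (one resume).
resumes = [
    {"text": "Jane Doe. Senior data engineer. 8 years of Python, SQL, Spark."},
    {"text": "John Smith. Front-end developer. React, TypeScript, CSS."},
]

with open("resumes.jsonl", "w") as f:
    for resume in resumes:
        f.write(json.dumps(resume) + "\n")

# Upload the whole file -- the "document" in the uploaded-file sense.
upload = openai.File.create(file=open("resumes.jsonl"), purpose="search")
print(upload["id"])  # keep this file ID for the search call in sketch 2
```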
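
Sketch 2 (point 3): running a search against the uploaded file with max_rerank = 8, so at most 8 resumes come back. The file ID is hypothetical, and I’m reconstructing the library call from memory, so check it against the API docs.

```python
import openai

openai.api_key = "sk-..."  # placeholder key

results = openai.Engine("davinci").search(
    file="file-abc123",  # hypothetical ID from the upload in sketch 1
    query="machine learning engineer with NLP experience",
    max_rerank=8,  # at most 8 documents get semantically scored and returned
)

# As I understand the response: "document" is the zero-based line number in
# the JSONL file, and "score" is the relevance score (higher = more relevant).
for doc in results["data"]:
    print(doc["document"], doc["score"])
```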
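
Sketch 3 (point 4): a back-of-envelope token count. This is not OpenAI’s official billing formula, just factors (b) through (d) multiplied out to get an order-of-magnitude feel for the cost.

```python
def rough_search_tokens(query_tokens, avg_doc_tokens, max_rerank):
    # Simplifying assumption: each reranked document is scored together
    # with the query exactly once.
    return max_rerank * (avg_doc_tokens + query_tokens)

# 150 resumes of ~600 tokens each, a 20-token query, reranking all 150:
print(rough_search_tokens(query_tokens=20, avg_doc_tokens=600, max_rerank=150))
# -> 93000 tokens per search, billed at the chosen engine's per-token rate
```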
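
Sketch 4 (point 5): one way to split a long document into several lines that each stay under the token cap. The 4-characters-per-token ratio is the rough rule of thumb mentioned above, not an exact conversion.

```python
def split_document(text, max_tokens=1800, chars_per_token=4):
    """Split one long document into chunks that should each stay safely
    under the 2048-token cap, leaving headroom for the query."""
    max_chars = max_tokens * chars_per_token
    chunks, current, length = [], [], 0
    for word in text.split():
        # Flush the current chunk before it would overflow the budget.
        if current and length + len(word) + 1 > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```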
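
Sketch 5 (point 6): the two-step flow in miniature. The keyword filter below is only a toy stand-in for whatever narrowing OpenAI actually runs server-side; step 2, the semantic rerank, happens on OpenAI’s side and is shown as a comment.

```python
def keyword_filter(query, documents, max_rerank):
    # Step 1: score each document by how many query terms it contains,
    # then keep only the top max_rerank candidates.
    terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: sum(term in doc.lower() for term in terms),
        reverse=True,
    )
    return ranked[:max_rerank]

candidates = keyword_filter(
    query="machine learning engineer",
    documents=["resume one ...", "resume two ...", "resume three ..."],
    max_rerank=2,
)
# Step 2 (server-side): the chosen engine semantically scores each
# candidate against the query and returns them ranked by relevance.
```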

I hope the above is helpful. Leslie
