What constitutes a "document" in Search

I am just starting the OpenAI journey. Semantic search is my main area of interest. The term “document” seems to refer to JSONL files everywhere I see it used. I need to be able to search unstructured text files (e.g., product reviews, research papers). Is there a way to search a collection of such documents using natural-language search requests? If so, any pointers would be greatly appreciated. Thanks


Hi @tom.meehan, welcome to the API beta!
In order to use the /search API you’ll first have to create and upload a .jsonl file containing the text of the papers that you want to search. Depending on the length of your papers, this may require breaking the text up and spreading it across several JSONL lines.
Each line is of the format:
{"text": "your text - maybe a paragraph of your research paper", "metadata": "additional data about your text. This property is optional and does not alter search behavior."}
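If it helps, a file in that format can be produced with a short script like the one below. The paragraph texts and the filename here are made-up placeholders; the only real requirement is one JSON object per line with a "text" field (and optionally "metadata"):

```python
import json

# Hypothetical example: build a JSONL "documents" file from a few
# paper paragraphs. Each line becomes one searchable "document".
paragraphs = [
    "Abstract: we study semantic search over product reviews.",
    "Section 2: our method ranks candidate passages by relevance.",
]

with open("papers.jsonl", "w", encoding="utf-8") as f:
    for i, text in enumerate(paragraphs):
        line = {"text": text, "metadata": f"paper-1, paragraph {i}"}
        f.write(json.dumps(line) + "\n")

# Reading it back: one JSON object per line.
with open("papers.jsonl", encoding="utf-8") as f:
    docs = [json.loads(line) for line in f]
```

The resulting file is what you then upload via the files endpoint before searching against it.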

I am also interested in learning about what other members in this forum have learnt about uploading large documents.


I can see I need to do some serious study of the fundamentals. I have no idea what “ada” is, or for that matter “max_rerank”. I gather they have something to do with the cost of doing a search. Regardless, at this point I have been tasked with mapping out the steps for doing the following: I have 150 resumés. I need to determine whether the API can be used to search for the best match for a particular job description, i.e. the prompt might be: “Find me the best candidate for the following job description: iOS programmer with 10+ years experience, must be fluent in both Swift and Objective-C”. Is this doable… if so, any pointers as to the approach to take would be very helpful. Thanks.

Hi Tom,

  1. For semantic search, “Ada” is a less effective search engine, but also less expensive, whereas Davinci is top-of-the-line.

  2. To understand max-rerank, it’s useful to first clarify the two meanings of “document.” First, the JSON Lines file is a document that you’ll upload, containing the information from which GPT-3 will draw your search results. (With semantic search, using the search endpoint, this file tells GPT-3 what the permissible outputs are. You aren’t getting open-ended search results from the internet as with completions etc.) Somewhat counterintuitively, each line in the JSONL file is also called a “document.” And each line represents a potential search result. (So for searching resumes, you’d probably put each resume in a single line.)

  3. Max-rerank is the number of search results you want GPT-3 to return. So if you specify max-rerank = 8, you’ll get 8 resumes returned in your search. If you want 50 resumes, use that number. I think max-rerank has to be set using some judgement, i.e., taking into account the linguistic complexity of your JSONL file. I think the default max-rerank is 150.
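For what it’s worth, a sketch of what the HTTP call might look like against the beta-era /search endpoint. The endpoint path and field names follow the beta docs; the file ID and API key below are placeholders you’d get from the upload step and your account:

```python
import json
import urllib.request


def build_search_request(engine, query, file_id, max_rerank, api_key):
    """Build the HTTP request for the (legacy) GPT-3 search endpoint.

    The file ID and API key are placeholders, not real values.
    """
    url = f"https://api.openai.com/v1/engines/{engine}/search"
    body = {"query": query, "file": file_id, "max_rerank": max_rerank}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_search_request(
    engine="davinci",
    query="iOS programmer fluent in Swift and Objective-C",
    file_id="file-XXXX",  # placeholder ID from the file-upload step
    max_rerank=8,         # ask GPT-3 to rerank and return the top 8
    api_key="sk-...",     # placeholder key
)
# urllib.request.urlopen(req) would actually send it; omitted here.
```

Swapping "davinci" for "ada" in the engine slot is the cost/quality trade-off discussed in point 1.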

  4. I believe that the total number of lines (documents) in the JSONL file does not affect price. What affects price is (a) the engine chosen, (b) the length of the search query in tokens, (c) the length of each search result (line) in tokens, and (d) the number (max-rerank) of search results. So, if you have really long lines in your JSONL file and you specify max-rerank = 500 using Davinci, that will be a lot more expensive compared to having pretty short lines with max-rerank = 10 using Ada. For your use case, since resumes are fairly short documents and you only have 150 of them, I think your costs will be pretty low overall, even with Davinci.

  5. Items (b) and (c) above can’t exceed 2048 tokens. (I forget how to convert English words to tokens, but it’s in the GPT-3 documentation somewhere.) For my use case, I expect to have some pretty long lines in my JSONL file, so I might have to work around the 2048 limit by splitting long single lines into multiple lines. (I think I might have a neat way of actually improving the search results by doing so - but I won’t go into that detail here.)
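One rough way to do that splitting, using the approximation of about 4 characters per token (the real BPE tokenizer will count differently, so the budget here deliberately leaves headroom below 2048):

```python
def split_into_lines(text, max_tokens=1500, chars_per_token=4):
    """Split a long document into chunks that should stay under the
    token limit, using a rough characters-per-token estimate.
    Splits on word boundaries only; real token counts will differ.
    """
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        chunks.append(current)
    return chunks


# A made-up long document: 5000 words of filler.
long_text = ("word " * 5000).strip()
lines = split_into_lines(long_text)
```

Each chunk would then become its own JSONL line, ideally with metadata tying the pieces back to the original document.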

  6. The first step in the GPT-3 search is to narrow down your possible search results (lines/resumes) to the max-rerank. This was until recently accomplished by a simple keyword search, but GPT-3 has upgraded it to fuzzy searching, I believe. If you specify max-rerank too small, there is a risk that the first step won’t collect all the most relevant lines/resumes to feed into the second step. That would be bad. To avoid this, for my use case I’d like to re-rank everything in my JSONL file (no max-rerank at all), but the cost could be prohibitive and perhaps performance would be too slow? The second step in GPT-3’s search, where the real magic happens, takes place after the fuzzy search has chosen your max-rerank documents. The second step uses the full power of GPT-3’s semantic search to rank those lines/resumes in order of relevance and present them as search results.
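The two-step shape described above can be illustrated with a toy sketch. Both scoring functions here are stand-ins I made up (lexical overlap and a crude term count), not what GPT-3 actually computes; the point is only the structure: a cheap pass narrows the pool to max_rerank candidates, and the expensive pass reranks just those:

```python
def keyword_score(query, doc):
    """First pass: cheap lexical overlap (stand-in for the keyword/fuzzy filter)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def semantic_score(query, doc):
    """Second pass: placeholder for the expensive semantic reranker.
    Here just a term-occurrence count; the real step is the GPT-3 model."""
    return sum(doc.lower().count(w) for w in query.lower().split())


def search(query, documents, max_rerank):
    # Step 1: narrow the pool to max_rerank candidates cheaply.
    candidates = sorted(
        documents, key=lambda d: keyword_score(query, d), reverse=True
    )[:max_rerank]
    # Step 2: rerank only those candidates with the expensive scorer.
    return sorted(candidates, key=lambda d: semantic_score(query, d), reverse=True)


resumes = [
    "Swift and Objective-C developer, 12 years iOS experience",
    "Java backend engineer, Spring, 8 years",
    "iOS contractor, Swift, some Objective-C",
    "Data scientist, Python and R",
]
results = search("iOS Swift Objective-C", resumes, max_rerank=2)
```

This also shows the risk noted above: with max_rerank=2, a relevant resume that scores poorly in step 1 never reaches step 2 at all.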

  7. If I’m wrong about any of the above, hopefully an OpenAI person will correct me.

I hope the above is helpful. Leslie


Thank you for your explanation. I’m still puzzled about a couple of things you said.

  1. “GPT-3 has no idea whatsoever about what constitutes a ‘best candidate’”

I would have thought that this would be an easy one for GPT-3. It should be able to identify the entities as computer languages and platforms. I would think that GPT-3 would be able to rank each candidate based on the occurrence of each of these entities within each resume, giving a higher rank to those with more mentions of these, and possibly additionally use the employment durations as a means to further raise or lower the ultimate rank of a given candidate. What am I missing?

  2. “you’ll very soon be in what OpenAI calls high-stakes domains”

I guess the concept of “high stakes” in the context of ML in general seems a bit odd to me. I read the following in the documentation about what constitutes a high stakes domain:

“Applications that have high risk of irreversible harm or are founded on discredited or unscientific premises”

I am not sure how there could ever be “irreversible harm” - unless it’s referring to cases where someone responds negatively to something they read (i.e., being offended), or they misinterpret some medical statement which leads them to do something harmful to their health, etc.

Further, “founded on discredited or unscientific premises” seems a bit subjective. Discredited by whom? Given the current political climate, this has lost any real meaning - same with the phrase “unscientific premises”. It seems like “science” itself has been debunked - so it would be tough to use it to make a determination about high stakes.

But regardless of all of the above, I can’t see how evaluating resumes could be construed as a high stakes domain.


I’m sure they would have a different opinion. Of course if you think about it, the results of any search that anyone does (even for something trivial) might be significant to them. If for some reason the results were incomplete or inaccurate it could really ruin their day. But “irreversible”? or “unscientific”? That’s a stretch to say the least. Oh well, it is what it is. Thanks for your feedback.
