How to perform Search using models fine-tuned on technical domains?

Hello,

TL;DR:
Is there a way to use the Completions API to replicate the Search API but using a fine-tuned model (i.e. does anyone know what the Search endpoint’s prompt is)? Or will the Search endpoint support fine-tuned models soon?

Here’s the context in case there is a better way that I could approach the problem:

Use Case:
Our product serves product teams at business-to-business (B2B) companies (starting with SaaS companies).

We want to connect a company’s product roadmap to customer conversations. The end goal is being able to say “this feature was critical to this customer buying this product” even for features that aren’t a separate product (or the inverse, this customer won’t buy our product because we don’t have feature XYZ). We think this is possible today because of technologies like Gong, which record and transcribe all customer conversations.

Unlike in an “Answers” use case, completeness is important and the value of completeness is linear. If 10 customers discuss the need for a feature, it is twice as good if we are able to identify 8 of those customers vs just 4.

Domain expertise is critical for this use case because customers and team members often refer to the same feature/capability in a number of different ways:

  • They could describe the feature technically
  • They could use the marketing terms for it
  • They could lay out the pain point the feature solves
  • They could explain their use case that the feature addresses

All of this information is captured in a company’s knowledge base (KB) in some form (or their external docs).

Limitations:
The volume of data generated is too large to run everything through the API (from both a time and cost perspective). A mid-sized (~500 person) B2B SaaS company may generate over 400 hrs of transcripts every day (that’s 32M tokens generated a month). This would perhaps be feasible as a one-time classification task, but we want to be able to offer real-time search and ongoing tracking of features.

Since these features may be talked about in a myriad of ways without the feature being explicitly mentioned, using the File-based Search Endpoint hasn’t yielded adequate performance because it uses keyword search to winnow down the results. Thus, we’ve added a step (Step 1 below) to improve the list of initial documents that are then fed into the Semantic Search.

Current Approach:
Pre-processing:

  • The model is fine-tuned on the KB through the method described as “Case study: Creating an expert model in the legal domain which understands internal company jargon” on this page of OpenAI documentation.
    – Fine-tuning with babbage today

Search:

  1. When a search occurs (let’s say “Single Sign-On”), we first ask the fine-tuned babbage model to create keywords from the search phrase. Setting Temp=0 and TopP=1 has yielded excellent domain-specific keywords with the fine-tuned model (i.e. the equivalent of returning “authentication” for the search “Single Sign-On” but for a company’s specific terminology).
    a. The prompt we use is “Extract technical and customer-facing keywords from the following text.” along with 3 examples.
  2. We run a keyword search (not using OpenAI for it) and get documents that have matched any of the generated keywords or original search phrase.
  3. We call the Search endpoint with a standard babbage model (have also experimented with curie but haven’t seen much difference). We use the original search phrase and pass in the matched documents from Step 2.
  4. Sort the remaining documents by rank (after filtering out anything with Rank < 200) and return the results (a rough sketch of Steps 1, 3, and 4 is below).
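
For concreteness, a rough sketch of Steps 1, 3, and 4 with the openai Python library (the fine-tuned model ID, few-shot examples, and document list are placeholders, and the engine-level search call follows the documented Search usage):

import openai

# Step 1: ask the fine-tuned model for domain-specific keywords.
# "babbage:ft-your-org-..." is a hypothetical fine-tune ID.
few_shot_examples = "..."  # the 3 in-context examples mentioned above
keyword_response = openai.Completion.create(
    model="babbage:ft-your-org-2021-10-01",
    prompt=few_shot_examples
    + "\nExtract technical and customer-facing keywords from the following text.\n"
    + "Text: Single Sign-On\nKeywords:",
    temperature=0,
    top_p=1,
    max_tokens=32,
    stop=["\n"],
)
keywords = [k.strip() for k in keyword_response.choices[0].text.split(",")]

# Step 2 (keyword search) happens outside of OpenAI and yields matched_documents.
matched_documents = ["...", "..."]

# Step 3: semantic search over the keyword-matched documents.
search_response = openai.Engine("babbage").search(
    documents=matched_documents,
    query="Single Sign-On",
)

# Step 4: drop anything below the 200 "match" threshold and sort by score.
results = sorted(
    (d for d in search_response["data"] if d["score"] >= 200),
    key=lambda d: d["score"],
    reverse=True,
)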

The Primary Problem
Step 3 has a tendency to miss documents that should match (or rank these documents low), specifically when the document is talking about the search topic in an indirect, domain-specific way (e.g. doesn’t explicitly use the search phrase).

Given the domain-specific performance I’ve seen from the fine-tuned babbage model, I would love to be able to use it in the Search step.
Since fine-tuned models can only be used for Completions, is there any way to replicate Search through the Completions API?

Other Ideas
For Step 2, I’ve considered fine-tuning a BERT-based semantic similarity model on the KB (as mentioned in this post). That way, we wouldn’t need Step 1 and the initial filtering would hopefully be more complete. It would also be possible to just perform Step 2 (and skip Step 3) if the model was accurate enough (although I’ve used a non-fine-tuned RoBERTa in the past for a different problem and it didn’t seem as effective as GPT-3). I would prefer to stick with GPT-3 where possible given its power in understanding text.

I’ve considered treating this as a classification task (if we just scoped the “search terms” to a set of features). However, getting the examples necessary for training the classifier would not be scalable, and the feature set of a roadmap changes enough that classification would be run constantly (which then makes it computationally similar to search).

4 Likes

@neilmadsen Great breakdown of the problem. If you want to experiment with the Completions endpoint, couldn’t you just pack docs into the prompt with headers for each, then add instructions at the end to print which ones match the query? Maybe you can use the log probabilities of the output tokens to estimate confidence and apply a threshold.

Something like:

Query

Header1
Doc1

Header2
Doc2

...

Which of the documents above answer the query? Print their headers below:
1.

Then you could probably finetune the model on both jargon and specific search behavior. It might require you to minimize the number of tokens that appear in your search docs, as the approach probably won’t scale well without heavy optimization and constraints.
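
Roughly along these lines with the Completions endpoint, using the logprobs field as a crude confidence signal (the engine name, prompt wording, and the averaging heuristic below are all just illustrative):

import math

import openai

prompt = (
    "Query: single sign-on support\n\n"
    "Header1\nDoc1 text...\n\n"
    "Header2\nDoc2 text...\n\n"
    "Which of the documents above answer the query? Print their headers below:\n1."
)

response = openai.Completion.create(
    engine="davinci",   # or a fine-tuned model via model=...
    prompt=prompt,
    temperature=0,
    max_tokens=20,
    logprobs=1,         # return per-token log probabilities
    stop=["\n\n"],
)

choice = response.choices[0]
matched_headers = ("1." + choice.text).splitlines()   # e.g. ["1. Header2", "2. Header1"]

# Crude confidence: geometric-mean probability of the emitted tokens.
token_logprobs = choice.logprobs.token_logprobs
confidence = math.exp(sum(token_logprobs) / len(token_logprobs))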

That’s a great idea, thanks @asabet!

I’ll give that approach a try, and experiment with whether analyzing the log probabilities could get me the equivalent of the Search score. I especially like the idea of fine-tuning it on this type of search behavior, so it could pick up the type of matches that I want.

1 Like

@neilmadsen would love to know the results :pray:. Data annotation for finetuning search behavior probably won’t be too demanding either.

@asabet Yeah I’ll follow up with my results!

Definitely hoping that the few-shot capabilities of GPT-3 will eventually allow for fine-tuning on pretty nuanced situations, like being able to identify when someone is simply asking questions about a feature vs. when they are communicating that they need it or that it’s critical for their decision to buy.

For now, the goal is to just capture all the instances of the topic, and I’m optimistic about the approach you mentioned.

2 Likes

I also have a complex semantic search problem. For relatively small but specialized/technical text domains, GPT-3’s search endpoint isn’t fit for purpose yet. I do think they are working on eliminating the keyword step, allowing users to jump straight to semantic search without narrowing to max 200. Unless and until they do that, the semantic search endpoint is mostly helpful for “massive text” situations where the language is everyday language.

1 Like

To follow up here with the things I’ve tried.

Tried a couple of different completion prompts (including variations on what you mentioned above @asabet).

The prompt that had the best output but still didn’t quite match search was the following (note this is specific to my use case):

You are a Product Manager for <company> and you’re trying to find texts of customer conversations that match customer problems on your roadmap.
For each problem and text, you should respond with whether the customer writing the text has the given problem.
Customers who use the same words but don’t have the problem shouldn’t be a Yes.
Text: <text>
Problem: <problem>
Does the Text entail the Problem? <Yes/No>

It could be that I just haven’t given enough examples, and that with more fine-tuning on this it would work better than search, but for now I’ve seen it be too picky about matches.
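
For anyone trying something similar, the Yes/No decision can also be read off the token log probabilities instead of the sampled text. A rough sketch (the company name, example Text/Problem, and the 0.6 cut-off are placeholders):

import math

import openai

prompt = (
    "You are a Product Manager for Acme and you're trying to find texts of customer "
    "conversations that match customer problems on your roadmap.\n"
    "For each problem and text, you should respond with whether the customer writing "
    "the text has the given problem.\n"
    "Customers who use the same words but don't have the problem shouldn't be a Yes.\n"
    "Text: We keep having to reset a separate password for every tool we use.\n"
    "Problem: The customer needs Single Sign-On.\n"
    "Does the Text entail the Problem?"
)

response = openai.Completion.create(
    engine="curie",
    prompt=prompt,
    temperature=0,
    max_tokens=1,
    logprobs=5,   # include the top alternatives for the answer token
)

# top_logprobs[0] maps candidate first tokens (e.g. " Yes", " No") to log probs.
top_alternatives = response.choices[0].logprobs.top_logprobs[0]
yes_probability = math.exp(top_alternatives.get(" Yes", float("-inf")))
is_match = yes_probability > 0.6   # arbitrary threshold, tuned per domain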

One other note on technical domains – I found that the codex models performed better on this prompt than the standard models when the Text and Problem were technical. For example, codex could even determine whether the version of the software in the Text (e.g. “Ubuntu XX.YY”) was different from the one in the Problem (e.g. “Ubuntu AA.BB”). Unfortunately I don’t think those models can be fine-tuned yet.

Going to continue to try a few more approaches and maybe fine-tuning if I have time. May just stick with the standard search (not using the keyword search, just calling the documents directly) for now.

2 Likes

Hey @neilmadsen, sorry for the late response! I think I might have to experiment personally to get a better sense of the problem and behaviour. How many samples did you finetune with? Also, what’s your probability threshold for the Yes token? Is there a consistent margin between true-positive probabilities and false positives? You might be able to exploit that if there’s a significant gap, but more finetune data is probably needed. Further, have you tried ranking instead of classification? Ranking would be more scalable, as you would have something like

Text 1: <text>
Text 2: <text>
Text 3: <text>
...
Problem: <problem>
Rank the texts by relevance to answering the problem: 

You might get valid top-2 or top-k results this way, without the strictness of classification.

Very interesting observation wrt codex. Likely codex’s advantage comes from having been finetuned on GitHub code, where there are more instances of such domain text. Codex will also be more scalable with the 4096-token limit; hopefully finetuning for it comes out soon.

DM me if you want to talk more in-depth about experiments and results, as I might want to explore completion search for my own use-cases sometime in the future.

1 Like

@asabet – I was using few-shot with up to about 10 examples and wasn’t seeing a significant difference. I didn’t go all the way to fine-tuning on 50-60 examples, since I was seeing issues with the few-shot approach (most GPT-3 benchmarks I’ve seen show a pretty linear increase in performance from zero-shot to few-shot to fine-tuned).

The behavior I was seeing was that the model would heavily default to either “Yes” or “No” (with 80% probability on the chosen answer), depending on how I phrased the prompt and which model I used (the simpler models like babbage tended to default to No, while davinci-codex defaulted to Yes). This happened even when I experimented with different distributions of Yes and No in the examples (e.g. even if 70% of the examples were Yes, babbage would still default to No).

I’ve ended up with a decent solution by doing the following. For now, I’m optimizing for avoiding false positives and I’m OK with instances where we miss something due to equivalent names being used that a non-domain-specific model wouldn’t know.

These are the current steps:

  1. Take the search query and use a fine-tuned curie model to do keyword extraction (up to 10 keywords)
  2. A simple keyword search of the text using all of the returned keywords
  3. Extract the sentence that contains the keyword, plus the sentence on either side (so three sentences total). This is mainly to manage costs and to avoid confusing the simpler model with irrelevant text (a sketch of this step follows the list).
  4. Use the search endpoint with ada to get back results
  5. Then, for results that are on the fence as a match (I arbitrarily set this as a rank between 200 and 250, since I’m using 200 as the “match” threshold), I rerun them through a curie search. This is pretty effective at catching false positives from the ada model.
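
A minimal sketch of the Step 3 sentence-window extraction (the regex-based sentence splitter is just illustrative; a proper sentence splitter would be more robust):

import re

def keyword_windows(text, keywords, window=1):
    """Return each sentence containing a keyword, plus `window` sentences on either side."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    snippets = []
    for i, sentence in enumerate(sentences):
        if any(keyword.lower() in sentence.lower() for keyword in keywords):
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            snippets.append(" ".join(sentences[start:end]))
    return snippets

# The resulting snippets are what get passed to the ada search in Step 4.
snippets = keyword_windows(
    "We moved to the new plan last month. SSO keeps failing for our team. It blocks onboarding.",
    ["sso", "single sign-on"],
)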

This works well enough for now – it’s significantly better than keyword search, and is able to avoid places where the terms in the query are mentioned but don’t match the semantics of the query.

One place where search still seems to struggle is with the structure of the query. I may end up needing to use a davinci model to rewrite queries before sending them through the above steps.

As an example, the grammatically correct query “snapshots appearing slowly” gives far fewer results than the grammatically incorrect query “snapshots appear slow”. I’m curious why the model doesn’t handle adverbs as well as adjectives in this case.
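
For illustration, the rewrite step could be as simple as a zero-shot instruction before the rest of the pipeline runs (the engine choice and prompt wording below are just a first guess):

import openai

rewrite = openai.Completion.create(
    engine="davinci-instruct-beta",   # assumption: an instruct-style engine of that era
    prompt=(
        "Rewrite the following search query as a short, plainly worded phrase "
        "describing the customer's problem.\n\n"
        "Query: snapshots appearing slowly\n"
        "Rewritten:"
    ),
    temperature=0,
    max_tokens=16,
    stop=["\n"],
)
normalized_query = rewrite.choices[0].text.strip()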

2 Likes

Circling back to this, few-shot classifiers in this kind of scenario can suffer from recency bias, amongst other things. The defaulting behavior you’re observing with few-shotting might be caused by this bias, whereas finetuned models don’t have this issue.

Also replacing search endpoint queries with embeddings can bring additional efficiencies, as you can pre-embed your knowledge-base now.
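
Roughly along these lines, pre-computing the knowledge-base vectors once and comparing against the query at search time (the first-generation similarity engine named below is just one option):

import numpy as np
import openai

def embed(texts, engine="text-similarity-babbage-001"):
    response = openai.Embedding.create(input=texts, engine=engine)
    return np.array([item["embedding"] for item in response["data"]])

# Pre-embed the knowledge base once and store the vectors.
kb_texts = ["Doc about SSO setup...", "Doc about snapshot performance..."]
kb_vectors = embed(kb_texts)

# At query time, embed the query and rank documents by cosine similarity.
query_vector = embed(["single sign-on support"])[0]
scores = kb_vectors @ query_vector / (
    np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(query_vector)
)
ranked = sorted(zip(scores, kb_texts), reverse=True)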

1 Like

Yep, I’ve switched to embeddings and that has solved the first “filtering” step. Now that fine-tuning is more scalable I’ve also used that (with about 300 examples total using babbage), and have been getting a 70% accuracy rate (which is good enough for now).

2 Likes

Reading over this thread, it seems as if fine-tuned models can be used in two places to improve search results on technical domains.

  1. Use a fine-tuned model to extract domain-specific keywords from the search query.
  2. Use a fine-tuned model to rank/filter the general-knowledge search results for the extracted keywords.

In between those steps you are performing search on a general knowledge base.

In step 1 above, the fine-tuned domain model can cast a wider net for the general knowledge search by adding relevant keywords. In step 2 above the fine-tuned model narrows the results of the general knowledge search by ranking/filtering them and only selecting the top n results.

This is apparently a workable solution for at least some domains, but seems like a workaround.

Are there any plans to allow the fine-tuning of the embeddings engines, or some other plans to allow fine-tuning search? I’m wondering if implementing the described approach might be obviated by some imminent feature release.

2 Likes

I think the embeddings endpoint obviates the need for fine-tuning search. I think the harder task for specialized domains is getting the desired results from the completions endpoint. I am working on a pipeline for question-answering in the legal domain where the first step is getting the top n search results based on semantic similarity between query and result (each possible result being one line in the JSON Lines file whose embeddings have been uploaded to the endpoint), and then working with those n results to produce reliable answers to the query. The search step with embeddings is pretty straightforward compared to the rest of the NLP pipeline.

Yes, we’ve found that using embeddings minimizes the need for a lot of this. We’ve also found that embeddings actually perform best on full sentences rather than keywords, so we pipe the full sentences into embeddings rather than doing keyword extraction first.

The next step is trying to get rid of the fine-tuned model that filters down the results of the search (and getting the embeddings search to do a better job of identifying all results; today it still misses a number of them when the threshold is set at a level that avoids too many false positives).

Long term (we aren’t worried about this now), there is also a question for us around how to find the full match to a query. We arbitrarily split up customer calls into snippets of 512 tokens today, and that seems to work well enough, but it doesn’t give us a way to say “this is the full conversation that is relevant to this search”.
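
For what it’s worth, the splitting itself is nothing fancy; a sketch of fixed-size token chunking using the GPT-2 tokenizer (which approximates the API’s tokenization) looks roughly like this:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def chunk_transcript(text, chunk_size=512):
    """Split a transcript into consecutive snippets of roughly chunk_size tokens."""
    token_ids = tokenizer.encode(text)
    return [
        tokenizer.decode(token_ids[i:i + chunk_size])
        for i in range(0, len(token_ids), chunk_size)
    ]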