I’m working on an application where a user will type in the name of a dog or cat breed. When this breed is input, I want to resolve it against a large list of known breeds (400+) and pick the closest one from the list.
For example, if someone types in “bull dog”, I would want to find the closest match in my list of breeds and respond with something like “It looks like you have a Bulldog— is that right?”, or “mini aussie doodle” might resolve to “AussieDoodle”.
Is this something that would be best achieved with a function call? Or perhaps by providing a link in the content to a JSON file of breeds?
I’m looking for the most efficient way to do this in terms of token usage and cost. Any help would be greatly appreciated!
Have you tried ‘out of the box’ GPT for this? Chances are slim that it does NOT know this list. I’m pretty sure it KNOWS the correct name of every dog breed.
Vectorize each item in the list separately and search by vector in the list.
Or simply add an extra step to the process of sanitizing the user input (you do AI-sanitize user inputs, don’t you?) with something like “fix the breed names in the message if the user is not using the official name”…
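A minimal sketch of that sanitizing pass, assuming the official openai Python client (the model name and instruction wording are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sanitize_breeds(user_message: str) -> str:
    """Pre-processing pass: rewrite the message with official breed names."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Fix the breed names in the message if the user is "
                           "not using the official name. Otherwise return the "
                           "message unchanged.",
            },
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```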
This breed is going to be fed into a back-end system where the breed name has to be written exactly as it is in the back-end system or else it won’t be recognized. So I need to be able to convert a free-form input of a breed into a highly specific breed name from a list of a few hundred. Hope that makes it more clear. Thanks!
Yes, this was my original way of thinking how to do it, the problem is the list is going to be quite large. My worry is that putting all of these into a System message is going to quickly eat up my tokens. But maybe there is no way of getting around that?
You can also write a simple function that verifies presence in the list. The function call simply returns true or false (and the only input is a breed name). Then OpenAI can either ask for a different spelling or, most likely, will be able to retry with a different name.
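A rough sketch of what that could look like; the breed list and schema wording here are only illustrative:

```python
# Illustrative names; the real list and model wiring are up to your app.
KNOWN_BREEDS = {"Bulldog", "AussieDoodle", "French Bulldog"}  # ...your 400+ entries

def breed_exists(breed_name: str) -> bool:
    """The whole function: returns True only on an exact back-end match."""
    return breed_name in KNOWN_BREEDS

# Tool definition to pass as `tools=[...]` in client.chat.completions.create(),
# so the model can verify a spelling and retry with another name if it gets False.
breed_exists_tool = {
    "type": "function",
    "function": {
        "name": "breed_exists",
        "description": "Check whether a breed name exists, spelled exactly as in the back-end system.",
        "parameters": {
            "type": "object",
            "properties": {
                "breed_name": {"type": "string", "description": "The exact breed name to verify."},
            },
            "required": ["breed_name"],
        },
    },
}
```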
Yes, I’ve done fuzzy matching like this before (Levenshtein distance, etc.) with other external libraries, but I’d prefer to take advantage of the LLM if possible, to give the user a better experience when it has a hard time finding a close match.
Vectorize each item in the list separately and search by vector in the list.
would be the way to go.
In your DB create entities:

Breed:
  id: string/uuid (ideally UUID3 from your lowercase name + type, and app namespace)
  code: string (e.g. “FRENCH_BULLDOG”)
  name: string (e.g. “French Bulldog”)
  type: string (e.g. “dog”)

BreedAlternativeName:
  id: string/uuid (ideally UUID3 from your lowercase name + Breed.type, and app namespace)
  breed_id: string/uuid (Many-to-One)
  name: string (e.g. “French bull dog”)

Optionally:

BreedTranslation:
  id: string/uuid (ideally UUID3 from your language code + lowercase string + Breed.type, and app namespace)
  object_id: string/uuid (Breed.id | BreedAlternativeName.id), or choose a more future-proof way of having separate translation entities for breeds and their alternative names
  language_code: string (standard lang code)
  name: string (the localized replacement for the translated string, e.g. “bulldog français”)
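For illustration, a minimal Python sketch of those deterministic UUID3 ids (the namespace value and helper names are made up for the example):

```python
import uuid

# Hypothetical app namespace; generate one once and hard-code it in your app.
APP_NAMESPACE = uuid.UUID("9f2c1a44-7d31-4b5e-8f0a-3c6d2e1b5a70")

def breed_id(name: str, animal_type: str) -> uuid.UUID:
    """UUID3 from lowercase name + type + app namespace, per the Breed entity."""
    return uuid.uuid3(APP_NAMESPACE, f"{name.lower()}|{animal_type}")

def translation_id(language_code: str, text: str, animal_type: str) -> uuid.UUID:
    """UUID3 from language code + lowercase string + type, per BreedTranslation."""
    return uuid.uuid3(APP_NAMESPACE, f"{language_code}|{text.lower()}|{animal_type}")

# The same input always yields the same id, on every instance of the app:
assert breed_id("French Bulldog", "dog") == breed_id("french bulldog", "dog")
```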
Set up the automated vectorization of all newly added entities (breeds and their alternative names) on the name field, and handle the retrieval part with cosine similarity. That way you can search your entities with queries like “small black and white dog with a smashed nose” and get the id and other fields of FRENCH_BULLDOG among the results (personally, I use Directus + Weaviate, hooked into OpenAI, for this).
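To make the retrieval part concrete, here is a minimal sketch without any vector DB, assuming OpenAI’s embeddings endpoint (the model name is an example, and in production you’d persist the vectors instead of keeping them in memory):

```python
import math
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    # Model name is an example; any embedding model works the same way here.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Vectorize the name field of every breed / alternative-name row at write time.
names = ["French Bulldog", "Bulldog", "AussieDoodle"]
index = list(zip(names, embed(names)))

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Rank the indexed names by cosine similarity to the query."""
    query_vec = embed([query])[0]
    scored = [(name, cosine(query_vec, vec)) for name, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

print(search("small black and white dog with a smashed nose"))
```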
Then on a user’s message received by the app:
Set up an LLM-based check for whether the user’s message contains a breed name or description, and return the list of breeds/descriptions found in the message exactly as the user provided them, e.g.:
{"breeds":[{"text":"small bull dog","type":"dog"},{"text":"small black and white dog with an ugly smashed nose","type":"dog"}]}
Note that, depending on the app, the model may require more context to correctly identify the type (dog vs cat, based on the conversation). Also, your model needs a way to escape, returning {"breeds":[]} when no breeds were mentioned.
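A rough sketch of that extraction step, assuming the openai Python client with JSON mode (prompt wording and model name are placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "List every pet breed name or breed description in the user's message, "
    "exactly as the user wrote it, as JSON: "
    '{"breeds":[{"text":"...","type":"dog|cat"}]}. '
    'If no breeds are mentioned, return {"breeds":[]}.'
)

def extract_breeds(user_message: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_format={"type": "json_object"},  # forces syntactically valid JSON
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(response.choices[0].message.content)["breeds"]
```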
If you have the results from your previous step:
2.1. Do a case-insensitive search by name over breeds and alt names, using the text and type from the results of step 1 (see the sketch after this list) → resolve the id of the breed (it can be found via the alt name if you matched on an alt name, see 2.2.) and do your logic using the standard name/code in your app.
2.2. If 2.1 matched on an alt name of the breed → check with the user that the found breed is what they mean; once confirmed, you can use the breed safely in your app.
2.3. If 2.1.-2.2. produced no results → query your DB with the text extracted in step 1 to get entities by similarity search, select the top matches (this may be done with another LLM) and confirm with the user which breed they mean.
2.4. Once you have the exact breed the user means (confirmed in step 2.3.), add the text from the query(ies) as an alternative name of the breed, to skip the vector search the next time someone uses that alt name. Only then can you safely use the breed in your app logic.
2.5. If the user refuses to confirm the found breed match → either they are not able to identify the breed (show a picture in this case to confirm) or your app does not have the breed in the DB (in this case you need to add it for the future, so set up an admin notification procedure).
If there are no results from step 1 → stand by.
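To tie 2.1-2.3 together, a toy sketch of the matching order; the in-memory dicts stand in for the DB entities above, and search() is the similarity helper sketched earlier:

```python
# Toy in-memory stand-ins for the DB entities above (not real schema code).
breeds = {"french bulldog": {"code": "FRENCH_BULLDOG", "name": "French Bulldog"}}
alt_names = {"french bull dog": "french bulldog"}  # alt name -> standard name key

def resolve_breed(text: str):
    """Steps 2.1-2.3 in order; returns (payload, action) for the app to act on."""
    key = text.lower()
    if key in breeds:
        return breeds[key], "use"                 # 2.1: matched the standard name
    if key in alt_names:
        return breeds[alt_names[key]], "confirm"  # 2.2: alt name -> confirm with user
    # 2.3: no exact hit -> similarity search (the `search` helper sketched earlier),
    # then confirm the top candidates with the user; per 2.4, store the confirmed
    # text as a new alternative name so the vector search is skipped next time.
    return search(text), "choose"
```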
Also, make your app always use only the standard names in its outputs, to educate the users (and to spare yourself the whole vector search). So maybe some “output filter” logic similar to the above needs to be introduced, to force standard breed names in app outputs.
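Such an output filter could start as naive as this (reusing the toy breeds/alt_names mappings from the previous sketch):

```python
import re

def enforce_standard_names(output_text: str) -> str:
    """Replace known alternative names with standard names in app output."""
    for alt, canonical in alt_names.items():
        standard = breeds[canonical]["name"]
        output_text = re.sub(re.escape(alt), standard, output_text, flags=re.IGNORECASE)
    return output_text
```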
Why use UUID3 and not integers or timestamp-based UUIDs?
As you do the embeddings, you may want to avoid embedding the same text multiple times. Using a “static” UUID that generates the same id for the same lowercase text (also across all the instances of your app) gives you a way to store the embedding vectors by UUID somewhere and find them later when needed. It also locks your embeddings, which may be a good or a bad thing; you need to think it through. Personally, I try to use it when possible; no issues so far for me, only the benefits.
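As a sketch, caching embeddings by that static UUID could look like this, reusing the hypothetical breed_id() and embed() helpers from the earlier sketches:

```python
import uuid

embedding_cache: dict[uuid.UUID, list[float]] = {}  # stand-in for persistent storage

def get_embedding(text: str, animal_type: str) -> list[float]:
    """Embed each (lowercase text, type) pair once; repeats become a key lookup."""
    key = breed_id(text, animal_type)  # same text -> same UUID3, on every instance
    if key not in embedding_cache:
        embedding_cache[key] = embed([text])[0]
    return embedding_cache[key]
```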