LLM implementation for context-dependent synonym matching

Hi!
I think the following problem might occur in many fields, so this question might be useful for others.
The problem is: “You have a name and you want to match its synonym in a list.”
This process requires understanding of context, so LLMs are, in my opinion, the way to go, at least for my case.

In my case, I run a nutrition app. A classic example: if I get “boiled zucchini”, then I want to match the synonym I have in my database, let’s say “Courgette boiled”. They are exactly the same thing.
Now the trick is the length of the list (1k, 10k, 100k, …). Are you bound to experience lost-in-the-middle effects? My list will surely grow from 1k to 10k.
What structure and API should be chosen, according to the length of the list, to minimize cost while maintaining extremely high accuracy?

I also thought of pre-filtering: return the top_k = 50 candidates using embeddings, which would solve the lost-in-the-middle problem (see the sketch below). Since I don’t know how cheaply this problem can be solved with an LLM, I cannot infer how good this solution might be in terms of time and price.
To be even more precise, I generate, with an LLM, a description and synonyms for every ingredient item. So the embeddings are quite powerful, but not powerful enough on their own.
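
A minimal sketch of that two-stage pipeline, assuming the OpenAI Python client; the model names, the prompt wording, and the `load_db_items` helper are placeholders, not a definitive implementation:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    """Embed a batch of texts, L2-normalized so a dot product is cosine similarity."""
    # For lists beyond the API's per-request input limit, split into batches.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Offline, once: embed every database item (name plus the LLM-generated
# description/synonyms, which is what makes the embeddings stronger).
db_items = load_db_items()  # hypothetical helper: your 1k-10k ingredient strings
db_vecs = embed(db_items)

def match(query, top_k=50):
    # Stage 1: cheap embedding pre-filter down to top_k candidates,
    # which sidesteps lost-in-the-middle on the full list.
    q = embed([query])[0]
    order = np.argsort(db_vecs @ q)[::-1][:top_k]
    candidates = [db_items[i] for i in order]

    # Stage 2: a single short-context LLM call picks the exact match.
    prompt = (
        f'Which entry means the same thing as "{query}"? Answer with the number only.\n'
        + "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # A production version would validate the reply before indexing.
    return candidates[int(resp.choices[0].message.content.strip())]
```

Stage 1 costs only one embedding request per query; stage 2 is a single LLM call over 50 candidates, so the model never sees the full list.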

Even if this problem is not trivial, I’m not saying it is especially hard to solve. I’m more interested in an approach that is efficient, in both time and price, to implement.

Thank you for your help

Efficient would be just one call. That is, to a language AI that can do this:

Provide search synonyms: “boiled zucchini”

Output:

{
  "synonyms": [
    "steamed zucchini",
    "poached zucchini",
    "cooked zucchini",
    "blanched zucchini",
    "boiled courgette",
    "soft zucchini",
    "zucchini boiled",
    "zucchini cooked in water",
    "simmered zucchini",
    "tender zucchini"
  ],
  "examples": [
    "To achieve a tender texture, the zucchini should be steamed gently for 3 to 4 minutes until just cooked.",
    "The recipe calls for poached zucchini served alongside grilled chicken.",
    "Adding blanched zucchini to salads gives them a refreshing and nutritious boost.",
    "Serve the boiled courgette as a simple side dish seasoned with salt and pepper.",
    "A quick meal might consist of gently simmered zucchini paired with rice and herbs."
  ]
}

It has produced hypothetical document phrases for the embeddings, besides more keyword-based look-alikes. Embedding some of the plain-text examples and averaging them should give higher-quality rankings, and the many synonyms are ready to match cheaply as keywords, even without further AI.
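
A rough sketch of that one-call idea (the model names, prompt wording, and JSON schema are assumptions): generate the expansion once, embed the query together with its synonyms and hypothetical-document example phrases, and average.

```python
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def expanded_query_vector(query):
    # One LLM call: keyword-style synonyms plus hypothetical example sentences.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": f'Provide search synonyms for "{query}". Return JSON with '
                       'keys "synonyms" (short look-alike names) and "examples" '
                       '(full sentences using them).',
        }],
    )
    expansion = json.loads(resp.choices[0].message.content)

    # Embed the query plus everything it generated, then average: the mean
    # vector should rank the database better than the raw query alone.
    texts = [query] + expansion["synonyms"] + expansion["examples"]
    er = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in er.data])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean), expansion["synonyms"]
```

The returned synonyms can additionally be matched against the database as plain keywords (exact or BM25 lookup) at no further AI cost, as noted above.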

Your algorithm (which you say should be one call) is not clear to me. It seems that you generate augmented embeddings, then sort by similarity, and finally ask an LLM for the correct match from the reduced list. That is two calls, if I understood your point.

That is: one language model call as an alternative to calling an LLM multiple times, “is it in this list”, “is it in the next list”, as described in the middle of your post, which would need to be broken down into small contexts for quality. (A sketch of that multi-call pattern, for comparison, follows below.)
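
For comparison, a rough sketch of that multi-call pattern (chunk size, model, and prompt are assumptions) makes the cost visible:

```python
from openai import OpenAI

client = OpenAI()

def match_by_chunks(query, db_items, chunk_size=100):
    """Scan the whole list in small windows, one LLM call per window."""
    hits = []
    for start in range(0, len(db_items), chunk_size):
        chunk = db_items[start:start + chunk_size]
        prompt = (
            f'Does this list contain an entry meaning the same as "{query}"? '
            "Reply with the exact entry, or NONE.\n" + "\n".join(chunk)
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content.strip()
        if answer != "NONE":
            hits.append(answer)
    return hits  # 10k items at 100 per chunk = 100 LLM calls per query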