Why not try forming your own keywords database?

Here is the governing equation for a system I developed, basically a log-normalized TF-IDF, from my notes:

The information of a word W in document D is then log(1+r)*log(N/R), where r is the frequency of W in D, R is the number of documents that contain W, and N is the total number of documents.

Then I take the input, break it into words, correlate those words with each document, and get the information they have in common. This is done in memory for speed and uses set intersections, which are also fast. Add up all the information in common, and that is your score for the document. Rank the documents from high to low.
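To make the keyword leg concrete, here is a minimal sketch of the scoring described above, not my production code. The `tokenize` helper, the function names, and the dict-of-dicts index layout are all assumptions made for illustration; only the log(1+r)*log(N/R) weighting and the set intersection come from the description.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, trim punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def build_index(docs):
    """docs: dict of doc_id -> text. Returns doc_id -> {word: information}."""
    n_docs = len(docs)  # N
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()  # R for each word
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {
            w: math.log(1 + r) * math.log(n_docs / doc_freq[w])
            for w, r in counts.items()
        }
        for doc_id, counts in term_freqs.items()
    }

def keyword_rank(query, index):
    """Sum the information of the words shared by the query and each document."""
    query_words = set(tokenize(query))
    scores = {}
    for doc_id, info in index.items():
        common = info.keys() & query_words  # set intersection
        scores[doc_id] = sum(info[w] for w in common)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Build the index once and keep it in memory; each query is then a handful of set intersections and additions. Note that a word appearing in every document gets log(N/N) = 0, so it contributes nothing to the score.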

You also run an embedding leg. Do the cosine similarity search (a plain dot product if the vectors are unit length), also in memory for speed. Rank these correlations from high to low.
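A rough sketch of the embedding leg, assuming you already have an embedding vector for the query and for each document from whatever model you use. The function names are made up for illustration, and it is written in plain Python to stay self-contained; in practice you would likely hold the vectors in a NumPy array and do one matrix multiply.

```python
import math

def normalize(vec):
    # Scale to unit length so a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def build_unit_vectors(doc_vecs):
    """doc_vecs: dict of doc_id -> embedding. Normalize once, keep in memory."""
    return {doc_id: normalize(v) for doc_id, v in doc_vecs.items()}

def embedding_rank(query_vec, unit_doc_vecs):
    """Rank documents by cosine similarity to the query, high to low."""
    q = normalize(query_vec)
    scores = {
        doc_id: sum(a * b for a, b in zip(q, v))
        for doc_id, v in unit_doc_vecs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing the document vectors up front means each query only costs one dot product per document.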

Finally, combine the two with RRF (reciprocal rank fusion). And now you have your overall ranking.

So you are combining semantic (embeddings) with keywords. With RRF you can even bias one correlation leg over the other.
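Here is a small sketch of the fusion step. Each document gets 1/(k + rank) from every leg it appears in, and the contributions are added up. The per-leg `weights` argument is just one way to bias a leg and is my assumption here, as is the function name; k = 60 is the constant commonly used with RRF, but treat it as tunable.

```python
def rrf_fuse(rankings, weights=None, k=60):
    """Fuse ranked lists of doc ids with (optionally weighted) RRF.

    rankings: list of lists, each ordered best first.
    weights:  one multiplier per ranking; defaults to equal weighting.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

keyword_leg = ["b", "a", "c"]    # keyword ranking, best first
embedding_leg = ["a", "c", "b"]  # embedding ranking, best first

equal = rrf_fuse([keyword_leg, embedding_leg])
biased = rrf_fuse([keyword_leg, embedding_leg], weights=[3.0, 1.0])
```

With equal weights, "a" comes out on top because it ranks well in both legs. Weighting the keyword leg 3x pushes "b" to the top instead. Because RRF only looks at ranks, the raw keyword scores and cosine similarities never need to be on the same scale.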

I would avoid the word2vec route, because the keyword algorithm above does the information weighting for you. With vectors, you are just mapping a string to a vector, and you’d still have to work out the information content of each word yourself. You could do this with the formula above. But I guess my worry there is latency.

But you could use vectors as more of a “fuzzy information correlation”. Fuzzy because the vectors would give you a proximity to other similar phrases.

So the vector system has advantages, but could add more latency. So feel free to try it, but it might be more work.

The cool thing about RRF is that you could run all three legs: embeddings, keywords (like I have above), and the fuzzy vector-based version of keywords. Then fuse all three with RRF.