- has anyone precalculated wikipedia embeddings and would be willing to share?
- Or perhaps OpenAI has already done them?
- If not and I need to do it, should I put them on GitHub or Hugging Face?
- And finally, if I cannot find any precalculated ones, is there anything special I should include in the output format? For example, it might be best to allow updating the contents from the latest Wikipedia version instead of relying on a static file; an embedding may still be reasonably accurate even after the underlying facts are updated. I am also wondering whether the links between pages and language versions could be useful somehow?
I suppose Wikipedia embeddings would be useful for many users, so I expect someone has already calculated a lot of them. Hoping for the best!
In my experience, when you do a cosine similarity search over embedding data, the language of the stored embeddings does not matter, so you can probably embed just the English version and still handle queries in any supported language. (Though some pages in other-language Wikipedias have no English counterpart.) To keep the data fresh, you can run an automatic scan every day that compares the modified dates of the stored data against Wikipedia itself; there is no need to check on every query.
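To make the cosine-similarity part concrete, here is a minimal sketch with NumPy. The corpus here is random placeholder data standing in for precomputed Wikipedia article embeddings; the function name and the 1536 dimension (typical of OpenAI embedding models) are illustrative assumptions, not a fixed API.

```python
import numpy as np

def cosine_top_k(query_emb: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k corpus rows most similar to the query.

    Normalizes both sides so the dot product equals cosine similarity.
    """
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb)
    sims = corpus_norm @ query_norm          # one similarity score per row
    return np.argsort(-sims)[:k]             # indices, best match first

# Placeholder corpus: 1000 "article" embeddings of dimension 1536.
doc_embeddings = np.random.rand(1000, 1536).astype(np.float32)
query = np.random.rand(1536).astype(np.float32)
top = cosine_top_k(query, doc_embeddings)
```

The key point for the multilingual claim above is that nothing here depends on language: as long as the embedding model maps queries and documents into the same vector space, the search is just geometry.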
Thanks. One more thing just came to mind: would it be useful to have a hierarchy of embeddings?
Simple version: first search document, then section on that page.
This would help retrieve only the most relevant sections, while filtering out "random sections" from other pages that only briefly touch the topic. Those may be useful in some contexts, but I suppose not in most cases.
Complex version: Hierarchy based on some type of clusters?
I think it would help. Imagine having millions of embedded items: you would not want to run a cosine similarity search over the whole corpus. Adding some metadata to the embedded files, on top of the hierarchy you are proposing, could help narrow the search.
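The "simple version" of the hierarchy above could be sketched as a two-stage search: first rank pages by a page-level embedding, then rank only the sections of the top pages. All data and names here are hypothetical placeholders; in practice the page embedding might be the embedding of the lead section or a mean of the section embeddings.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Placeholder data: one embedding per page, plus per-section embeddings
# grouped by page index. Dimension 64 is arbitrary for the sketch.
page_embs = normalize(np.random.rand(100, 64))
sections = {i: normalize(np.random.rand(5, 64)) for i in range(100)}

def hierarchical_search(query: np.ndarray, top_pages: int = 3, top_sections: int = 2):
    """Stage 1: pick the best pages. Stage 2: pick sections within them."""
    q = query / np.linalg.norm(query)
    best_pages = np.argsort(-(page_embs @ q))[:top_pages]
    results = []
    for p in best_pages:
        sec_sims = sections[int(p)] @ q
        for s in np.argsort(-sec_sims)[:top_sections]:
            results.append((int(p), int(s), float(sec_sims[s])))
    # Re-rank the surviving (page, section, score) triples by score.
    return sorted(results, key=lambda r: -r[2])
```

With 100 pages of 5 sections each, stage 1 scores 100 vectors and stage 2 only 15, instead of all 500; at Wikipedia scale that pruning is where the savings come from.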