Can vector database data be stored in Chinese?

Wondering if it’s possible to store data in Chinese in a vector database and then ask questions about it in English.

A vector database itself places no restrictions on what you “ask”. You can embed almost any Unicode text and receive an embedding vector from an AI model.

It is the quality of the embeddings model you use that determines whether semantic similarity between topics can be discerned when the languages differ.

Similarity is the key word here. To begin with, “data stored” doesn’t exactly look like “questions”, so even within one language some transformation of the information is beneficial.

That aspect of similarity extends to the language being used. Large AI models tend to build a shared understanding of ideas across languages.

Expect lower similarity scores, and so lower thresholds, when comparing across languages such as English and Chinese. If the database also holds English text, an English match may score as more relevant than the ideal Chinese target that actually contains the knowledge. This can also be beneficial, as you might not want Chinese search results ranked high for an English query.
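To make that ranking effect concrete, here is a minimal sketch. The three-dimensional vectors and document ids are invented for illustration and do not come from any real embeddings model; only the cosine similarity arithmetic is standard. The optional `lang` filter is one assumed way to stop same-language matches from crowding out the Chinese target.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, corpus, lang=None):
    """Return (score, id) pairs, best first, optionally limited to one language."""
    hits = [
        (cosine_similarity(query_vec, doc["vec"]), doc["id"])
        for doc in corpus
        if lang is None or doc["lang"] == lang
    ]
    return sorted(hits, reverse=True)

# Invented toy vectors: the Chinese refund document is the ideal answer,
# but the English shipping FAQ sits closer to the English query.
corpus = [
    {"id": "zh-refund-policy", "lang": "zh", "vec": [0.6, 0.5, 0.62]},
    {"id": "en-shipping-faq", "lang": "en", "vec": [0.8, 0.55, 0.1]},
    {"id": "zh-company-history", "lang": "zh", "vec": [0.1, 0.2, 0.95]},
]
query_vec = [0.9, 0.4, 0.05]  # stands in for an English question about refunds

for score, doc_id in rank(query_vec, corpus):
    print(f"{score:.3f}  {doc_id}")  # the English FAQ ranks first here

for score, doc_id in rank(query_vec, corpus, lang="zh"):
    print(f"{score:.3f}  {doc_id}")  # Chinese-only: the refund policy ranks first
```

With real embeddings the gap will vary by model, so it is worth measuring scores on your own data before fixing a threshold.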

If you are willing to make liberal use of AI, you can transform or translate the input, or obtain metadata for it, to make it more compatible with the corpus. You can also produce embeddings from a translation or a question-like summary of your data, so that a broader range of inputs can be matched back to the corpus.

So yes, you can ask. If the data being searched is exclusively in one language, you will likely get top-ranked results that are still relevant, even if only the AI language model you are chatting with can read them.
