Vector databases for sentences with mixed languages

This is what GPT-4 said: "Using vector embeddings for chat history in multiple languages (x, y, z) can be effective, but there are some considerations:

  1. Each Row in One Language: If each row of your chat history is consistently in one language, vector embeddings should work well. Modern NLP models are quite capable of handling different languages separately. You could use language-specific embeddings for each row based on the detected language.

  2. Rows with Multiple Languages: This is more challenging. Sentences that mix multiple languages (code-switching) can be difficult for some NLP models to handle effectively because the context can be language-dependent. However, there are multilingual models available (like some versions of BERT) that are trained on multiple languages and can handle such scenarios better.

  3. Data Preprocessing: Regardless of the language, preprocessing steps like normalization, tokenization, and handling of special characters are important for effective use of embeddings.

  4. Model Choice: Choose an NLP model that supports multilingual embeddings and understands the context of mixed languages better. Some models are specifically trained on multiple languages and might be more effective in this scenario.

It’s important to test and evaluate the chosen approach with actual data to ensure that the embeddings are meaningful and useful for your application, especially in the more complex case of mixed-language data."

When learning x, I sometimes mix y and z to define phrases and vocab, depending on which language is easier to translate to and from and which is more intuitive for me (that’s probably why they say languages get easier to pick up the more you learn).

How would vector embeddings be made for these mixed-language sentences? I know Facebook came out with a multilingual model; would I then use FAISS with it?
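FAISS (from Facebook/Meta) is just the similarity-search index; the embedding model is a separate choice, and any multilingual sentence-embedding model can feed it. A minimal sketch of that pairing, where the model name is only an example and not a recommendation:

```python
# Sketch: one multilingual embedding model for every row, whatever the language mix,
# indexed with FAISS. The model name is an example, not a recommendation.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

rows = [
    "강아지 is puppy, 子犬 in Japanese",          # code-switched row
    "How do you politely say thank you?",
    "ありがとうございます is more formal than ありがとう",
]

# Normalized vectors, so inner product equals cosine similarity.
vectors = model.encode(rows, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = model.encode(["polite ways to say thanks"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 2)            # top-2 nearest rows
print(ids[0], scores[0])
```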

The OpenAI embedding models are multilingual, with text-embedding-3-large demonstrating a big step up on the multilingual benchmark included in the announcement blog post.

I haven’t explored this recently, but I have noted that different languages live in different regions of the embedding space, where text in the same language can be at a closer distance than text expressing the same concept.

In applications such as retrieval for an AI that can also understand answers in multiple languages, cross-language retrieval would seem to be of significant benefit. When returning search results directly to a user, though, it would be an undesirable outcome.


To satisfy curiosity, let’s test 3-large.

  • The search input is “OpenAI’s most recent GPT foundation model, GPT-4, was released on March 14, 2023. It can be accessed direct…”
  • The chunks are English Wikipedia text about GPT and about the Qt GUI toolkit, plus similar chunks from Japanese Wikipedia
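Roughly, numbers like the ones below can be produced as follows. The API call is the standard embeddings endpoint; the int8 step (scale by the max absolute value, round, dequantize) is only my guess at what the "float08" column means, not necessarily what was actually run:

```python
# Sketch: embed the search input and the chunks with text-embedding-3-large,
# then compare cosine similarity of the full float32 vectors against a simple
# symmetric int8 quantization. The quantization scheme is an assumption.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data], dtype=np.float32)

def dequantized_int8(v):
    scale = np.abs(v).max() / 127.0
    return np.round(v / scale).astype(np.int8).astype(np.float32) * scale

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# input_list is pasted at the end of this post; entry 0 is the search input,
# the remaining four entries are the chunks.
query_vec = embed([input_list[0]])[0]
chunk_vecs = embed(input_list[1:])

for i, v in enumerate(chunk_vecs, start=1):
    print(f"{i}: float32 {cosine(query_vec, v):.4f}  int8 {cosine(query_vec, dequantized_int8(v)):.4f}")
```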

Here are the results. Each entry shows the score from the full 32-bit embeddings and then from 8-bit in-RAM embeddings, to explore the quality impact of embedding quantization.

 == Cosine similarity comparisons ==
- 1:" Generative pretraining (G" -
 float32: 0.5520
 float08: 0.5513
- 2:" Qt is used for developing" -
 float32: 0.1801
 float08: 0.1828
- 3:" アーキテクチャは、デコーダのみのTransform" -
 float32: 0.6260
 float08: 0.6249
- 4:" QtはC++で開発されており、単独のソースコードに" -
 float32: 0.0916
 float08: 0.0937
Content        English   Japanese
GPT target     0.55      0.62
Qt irrelevant  0.18      0.09

The English GPT-related search input matches most strongly with the Japanese text selection about GPT.

Conclusion: text-embedding-3-large embeddings don’t reject what might be the more relevant chunk (depending on my particular cut-and-paste boundaries and differences in the human writing) just because it is in another language.


input_list = [
"""
OpenAI’s most recent GPT foundation model, GPT-4, was released on March 14, 2023. It can be accessed directly by users via a premium version of ChatGPT, and is available to developers for incorporation into other products and services via OpenAI’s API.
“”".strip(),

“”"
Generative pretraining (GP) was a long-established concept in machine learning applications.[16][17][18] It was originally used as a form of semi-supervised learning, as the model is trained first on an unlabelled dataset (pretraining step) by learning to generate datapoints in the dataset, and then it is trained to classify a labelled dataset.[19]

While the unnormalized linear transformer dates back to 1992,[20][21][22] the modern transformer architecture was not available until 2017 when it was published by researchers at Google in a paper “Attention Is All You Need”.[23] That development led to the emergence of large language models such as BERT in 2018[24] which was a pre-trained transformer (PT) but not designed to be generative (BERT was an “encoder-only” model).[25] Also around that time, in 2018, OpenAI published its article entitled “Improving Language Understanding by Generative Pre-Training,” in which it introduced the first generative pre-trained transformer (GPT) system (“GPT-1”).[26]

Prior to transformer-based architectures, the best-performing neural NLP (natural language processing) models commonly employed supervised learning from large amounts of manually-labeled data. The reliance on supervised learning limited their use on datasets that were not well-annotated, and also made it prohibitively expensive and time-consuming to train extremely large language models.[26]
“”".strip(),

“”"
Qt is used for developing graphical user interfaces (GUIs) and multi-platform applications that run on all major desktop platforms and mobile or embedded platforms. Most GUI programs created with Qt have a native-looking interface, in which case Qt is classified as a widget toolkit. Non-GUI programs can also be developed, such as command-line tools and consoles for servers. An example of such a non-GUI program using Qt is the Cutelyst web framework.[14]

Qt supports various C++ compilers, including the GCC and Clang C++ compilers and the Visual Studio suite. It supports other languages with bindings or extensions, such as Python via Python bindings[15] and PHP via an extension for PHP5,[16] and has extensive internationalization support. Qt also provides Qt Quick, that includes a declarative scripting language called QML that allows using JavaScript to provide the logic. With Qt Quick, rapid application development for mobile devices became possible, while logic can still be written with native code as well to achieve the best possible performance.
“”".strip(),

“”"
アーキテクチャは、デコーダのみのTransformerネットワークで、2048トークン長のコンテキストと、1750億個のパラメータという前例のないサイズを持ち、保存するのに800 GBを必要とした。このモデルは、生成的な事前学習を用いて訓練され、以前のトークンに基づいて次のトークンが何であるかを予測するように訓練をされる。このモデルは、多くのタスクに対し、強力なゼロショット学習(英語版)と少数ショット学習を実証した[2]。著者らは、自然言語処理(NLP)における言語理解性能が、GPT-nの『ラベル付与されていないテキストの多様なコーパスに対する言語モデルの生成的事前学習と、それに続く各特定タスクにおける識別的な微調整』のプロセスによって向上したことを説明した。これにより、人間による監督や、時間のかかる手作業でのラベル付けが不要になった[2]。

GPT-3は、サンフランシスコの人工知能研究所OpenAIが開発したGPT-2の後継で、GPTシリーズの第3世代の言語予測モデルである[3]。2020年5月に公開され、2020年7月にベータテストが実施されたGPT-3は[4]、事前学習言語表現による自然言語処理(NLP)システムにおけるトレンドの一翼を担った[1]。

GPT-3が生成するテキストの品質は、それが人間によって書かれたものであるかどうかを判断することは困難なほど高く、利点と危険性の両面があるとされる[5]。GPT-3を紹介する原論文は、2020年5月28日、31人のOpenAIの研究者と技術者が発表した。彼らは論文の中で、GPT-3の潜在的な危険性を警告し、その危険性を軽減するための研究を呼びかけた[1]:34。オーストラリアの哲学者デイヴィッド・チャーマーズは、GPT-3を『これまでに作られた最も興味深く、重要なAIシステムの一つ』と評した[6]。2022年4月のニューヨーク・タイムズ紙では、GPT-3の能力について、人間と同等の流暢さで独自の散文を書くことができると論評している[7]。
“”".strip(),

“”"
QtはC++で開発されており、単独のソースコードによりX Window System(Linux、UNIX等)、Windows、macOS、組み込みシステムといった様々なプラットフォーム上で稼働するアプリケーションの開発が可能である。またコミュニティーにより多言語のバインディングが開発されており、JavaからQtを利用できるようにしたQt Jambi、さらにQtをRuby、Python、Perl、C#などから利用できるようにしたオープンソースのAPIが存在する。

このように開発が容易であり高速、スタイリッシュなQtはライセンスが多様なこともあり、KDEを始めとするオープンソースのアプリケーションに限らず、商業アプリケーションでの採用例も多く様々な分野で使用されている。

OpenGLやSVG、XMLといった最新技術にも対応している他、日本語を含む多バイト文字入力フレームワークへも対応している。
""".strip(),
]

Do those results apply even if I am chopping the vectors from the large model down to dim=256?

We expect that individual dimensions of a vector capture particular aspects of semantic distinction that allowed the initial training to meet its rewarded goals.

While such aspects are likely not exclusive to any single dimension, and likely remain indescribable, we could consider the hypothetical loss of dimensions that might have carried scalars for “NASDAQ vs Nikkei”, “cosplay versus masquerade”, or generalities like “cohesive thought” or “extended Unicode”.

Although this is somewhat meaningless for providing a general answer or for measuring the quality reduction in real applications, I’ll fill in the table for 256 and 64 dimensions, to meet a goal of smaller compute and memory consumption:

Content        EN-3072   JP-3072   EN-256   JP-256   EN-64   JP-64
GPT target     0.55      0.62      0.58     0.69     0.40    0.63
Qt irrelevant  0.18      0.09      0.22     0.18     0.20    0.15

A peculiar result: the pairing most impacted by extreme truncation is the on-topic English-to-English match, which drops significantly at 64 dimensions.

Reducing the dimensions of embeddings in a large semantic-search corpus like this can significantly alter the similarity scores between documents. That can lead to substantial changes in how documents rank against a search query, affecting the search results or the retrieved context that gets injected.
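For the reduced sizes, the -3 models let you either request fewer dimensions directly with the API’s `dimensions` parameter or truncate the full vector yourself and re-normalize. A sketch of the manual route (whether it reproduces the exact numbers above is untested here):

```python
# Sketch: shorten a full 3072-dim text-embedding-3-large vector to 256 or 64 dims
# by truncating and re-normalizing (the API's `dimensions` parameter is the other route).
import numpy as np

def shorten(v: np.ndarray, dims: int) -> np.ndarray:
    t = v[:dims]
    return t / np.linalg.norm(t)   # re-normalize so cosine similarity stays meaningful

# query_vec and chunk_vecs come from the earlier sketch; on normalized vectors,
# the dot product is the cosine similarity.
for dims in (3072, 256, 64):
    q = shorten(query_vec, dims)
    print(dims, [round(float(q @ shorten(v, dims)), 2) for v in chunk_vecs])
```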

A technique discussed elsewhere is two-round search: a preliminary quick pass over small vectors eliminates the bulk of the corpus before a more extensive, high-quality pass that requires loading the full-size vectors. This could preserve the multilingual abilities.
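A sketch of that idea, reusing the style of the earlier sketches: a small truncated index produces a shortlist cheaply, and only the shortlist is re-scored with the full-size vectors (`shorten` and `cosine` are from above; everything else here is assumed, not a reference implementation):

```python
# Sketch of two-round retrieval: a cheap first pass over truncated vectors narrows
# the candidate set, then the full-size vectors re-score only those candidates.
import faiss
import numpy as np

def build_small_index(full_vectors: np.ndarray, dims: int = 64) -> faiss.IndexFlatIP:
    small = np.stack([shorten(v, dims) for v in full_vectors]).astype(np.float32)
    index = faiss.IndexFlatIP(dims)
    index.add(small)
    return index

def two_round_search(query_vec, small_index, full_vectors, shortlist=200, final_k=5):
    # Round 1: quick search in the low-dimensional index.
    q_small = shorten(query_vec, small_index.d).astype(np.float32).reshape(1, -1)
    _, candidate_ids = small_index.search(q_small, shortlist)

    # Round 2: exact re-scoring of the shortlist with the full vectors.
    rescored = [(cosine(query_vec, full_vectors[i]), i)
                for i in candidate_ids[0] if i != -1]
    rescored.sort(reverse=True)
    return rescored[:final_k]
```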

I just made a mistake on Swagger and executed this:

{
  "table_name": "korean",
  "queries": [
    {
      "query": "string",
      "filter": {
        "document_id": "string",
        "source": "email",
        "source_id": "string",
        "author": "string",
        "start_date": "string",
        "end_date": "string"
      },
      "top_k": 10
    }
  ]
}

The Japanese word for string, the kind of string used for arts and crafts, is himo (ひも).

These were my results from the ChatGPT retrieval plugin server querying Supabase pgvector at dim 256 with the new large model:

2024-02-19 20:51:04.831 | INFO     | services.date:to_unix_timestamp:23 - Invalid date format: string
2024-02-19 20:51:04,831:WARNING - Warning: model not found. Using cl100k_base encoding.
doc[0].page_content: 줄 列、線、ひも
,score: -0.444327712059021
doc[0].page_content: 줄 列、線、ひも
,score: -0.444327712059021
doc[0].page_content: 줄 列、線、ひも
,score: -0.444327712059021
INFO:     127.0.0.1:34354 - "POST /query HTTP/1.1" 200 OK

I find that incredible, though it might have been a fluke that it could find that in the same space. I’ll keep trying and posting some results!

“puppy” is correct:

2024-02-19 20:55:51.575 | INFO     | services.date:to_unix_timestamp:23 - Invalid date format: string
2024-02-19 20:55:51,575:WARNING - Warning: model not found. Using cl100k_base encoding.
doc[0].page_content: 강아지 子犬
,score: -0.656075716018677
doc[0].page_content: 강아지 子犬
,score: -0.656075716018677
doc[0].page_content: 강아지 子犬
,score: -0.656075716018677

and kitten (returns cat):

2024-02-19 20:56:42.256 | INFO     | services.date:to_unix_timestamp:23 - Invalid date format: string
2024-02-19 20:56:42,256:WARNING - Warning: model not found. Using cl100k_base encoding.
doc[0].page_content: 고양이 猫
,score: -0.611923933029175
doc[0].page_content: 고양이 猫
,score: -0.611923933029175
doc[0].page_content: 고양이 猫
,score: -0.611923933029175

“I want to eat some of the famous apples in Daegu” returns

2024-02-19 20:58:22.186 | INFO     | services.date:to_unix_timestamp:23 - Invalid date format: string
2024-02-19 20:58:22,186:WARNING - Warning: model not found. Using cl100k_base encoding.
doc[0].page_content: 복숭아 桃
,score: -0.51775711774826
doc[0].page_content: 복숭아 桃
,score: -0.51775711774826
doc[0].page_content: 복숭아 桃
,score: -0.51775711774826
INFO:     127.0.0.1:56584 - "POST /query 

That says, “peach” in Korean and Japanese so it works I guess.
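For context on the negative scores: pgvector’s `<#>` operator returns the negative inner product, which is one plausible explanation, though I haven’t checked what the plugin’s Supabase datastore actually runs. This is not the plugin’s code, just a standalone sketch of querying a 256-dim pgvector column directly; the table and column names are assumptions:

```python
# Sketch only: query a Supabase/Postgres pgvector column directly, outside the plugin.
# Table/column names ("korean", "content", "embedding vector(256)") are assumptions.
# pgvector's <#> operator returns the NEGATIVE inner product, which would make
# a score like -0.44 correspond to an inner-product similarity of 0.44.
import psycopg2
from openai import OpenAI

client = OpenAI()
conn = psycopg2.connect("postgresql://user:pass@localhost:5432/postgres")  # placeholder DSN

def query_pgvector(text: str, top_k: int = 3):
    resp = client.embeddings.create(
        model="text-embedding-3-large", input=[text], dimensions=256
    )
    qvec = "[" + ",".join(str(x) for x in resp.data[0].embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content, embedding <#> %s::vector AS score "
            "FROM korean ORDER BY score LIMIT %s",
            (qvec, top_k),
        )
        return cur.fetchall()

for content, score in query_pgvector("string, like string used for arts and crafts"):
    print(content, score)
```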
