I understand that the amount of data available in Armenian is much less than is needed to train a "fluent AI bot". In that case, what measures should be taken to train ChatGPT (GPT-3 or 4) to work properly in Armenian?
Perhaps you could look into setting up a test group of individuals willing to focus on training the model, or providing a separately trained model that OpenAI could integrate into GPT-4 or the next version?
Welcome to the community.
The tokenizers for GPT-3, 3.5, and 4 are all based on English. But the models were trained on a large dataset, which is why they may demonstrate some "understanding" of non-English languages.
Having said that, it is possible to fine-tune (not train from scratch) the base GPT-3 models to generate better Armenian. You can do this by compiling a dataset of Armenian literature and fine-tuning one of the base models, such as davinci, with that data.
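As a rough sketch of the data-preparation step: the legacy GPT-3 fine-tuning API expects a JSONL file of prompt/completion pairs. The example pairs below are hypothetical placeholders, not real training data.

```python
import json

# Hypothetical Armenian prompt/completion pairs; a real dataset would be
# compiled from Armenian literature or other corpora, with many thousands
# of examples.
examples = [
    {"prompt": "Թարգմանիր անգլերեն: Բարև ->", "completion": " Hello"},
    {"prompt": "Շարունակիր նախադասությունը: Գիրքը ->", "completion": " սեղանին է"},
]

# Legacy fine-tuning expects one JSON object per line (JSONL),
# each with "prompt" and "completion" keys.
with open("armenian_finetune.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

The resulting file would then be uploaded and a fine-tune started, e.g. with the legacy CLI: `openai api fine_tunes.create -t armenian_finetune.jsonl -m davinci`.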
This sounds simple but involves a good number of steps. Also, since the tokenizer is optimized for English, Armenian text breaks into many more tokens per word, so the costs might be noticeably higher.
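To see why costs rise, compare UTF-8 byte counts: GPT-3's byte-level BPE rarely merges Armenian byte sequences, so token counts track raw bytes far more closely than they do for English. This stdlib-only comparison is a rough proxy (the Armenian sentence is my assumed translation; exact token counts require the actual tokenizer):

```python
def bytes_per_char(text: str) -> float:
    # Armenian letters need 2 UTF-8 bytes each; ASCII letters need 1.
    return len(text.encode("utf-8")) / len(text)

english = "The book is on the table"
armenian = "Գիրքը սեղանին է"  # assumed Armenian rendering of the same idea

print(bytes_per_char(english))   # 1.0 — pure ASCII
print(bytes_per_char(armenian))  # well above 1.0: 2 bytes per letter
```

Since billing is per token and Armenian yields more tokens for the same amount of text, both fine-tuning and inference cost more per word than the English equivalent.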
Feel free to ask questions about any issues you run into.