Hello, how I can feed gpt-4 with dictionary? I already have json table with entries but it is too big to send as normal string. Any idea? I want to make gpt talk in Silesian. Giving gpt-4 web link does not work.
This is an interesting case, because while gpt-3.5 or 4 can easily resurrect dead languages and write in them, it gives me this when tasked with writing a dictionary – which it can can do for languages with only five living speakers remaining:
I apologize for any confusion, but Upper Silesian is not considered a separate language. It is a dialect of the Polish language spoken in the Upper Silesia region of Poland. While it has some distinct regional vocabulary and pronunciation variations, it is not classified as a separate West Slavic language. Therefore, I cannot provide a dictionary of common words and nouns in Upper Silesian as it doesn’t have its own distinct vocabulary separate from Polish.
If you would like a dictionary of common Polish words and their translations, please let me know, and I’d be happy to assist you with that.
So given this knowledge, I rather would pursue a course of prompting it on your exact use, and language examples which it should follow, and you might find that it is more knowledgeable than you expect.
For generating language with sparse representation in the training corpus, you’ll want to go to minimum temperature.
BTW, training fluency on a completely unknown or fabricated language would instead require at least a long fine-tune, if not training your own AI model at very large expense.
When I sent json table as dictionary(using web gpt-4) It worked a little bit but I can’t or I don’t know how to do this from python. Dictionary is too big to send it from python. Silesian is mix of Polish-German and a little bit of Czech. When I tell gpt-4 to use dialect or silesian language it gives wrong words. It seems gpt creators were good with political correction but not with basic things like common Silesian language/dialect.
The issue with the training is identification. All the world’s languages were fed into the AI, and the fact that it can even somehow isolate them and use them distinctly is an emergent ability that one might not expect. However, you are up against a billion other speakers of several languages in trying to pick this one out.
Then there is that the writings must actually be commonly used for transcribing. If speakers instead prefer not to write or publish in their dialect, there might be a smaller number of texts that could actually be integrated.
There’s an open source AI on lmsys.org produced in China that doesn’t get this right even though it should also have English - writing English, it will toss some Chinese characters into the text.
Figuring out when Polish-looking language is not actually Polish and being able to construct a more likely token output based on that information is indeed going to be a challenge.
If it is a serious research application or something that could be a product, fine-tuning could be part of the answer. You give a new identity: “You are SilesianBot, an AI that only speaks Upper Silesian”. Then you train it on that system prompt, plus thousands of examples of questions and instructions someone might actually input, and answer in that language. Slowly the behavior will change, although preparing the samples would be laborious, and the fine-tuning expensive (think $100+) - plus continuing a fine-tune to refine or strengthen results is not yet available.
For an experiment, you might give AI a replacement word vocabulary, where it can “speak” polish, but overwrite words, besides prompted or multi-shot examples. I wouldn’t expect much, though.
This could be a tough one, I don’t think loading information into the prompt will help with this one, I think you will need a finetuned model. I think there is a parallel between programming languages and normal language in this regard. You’d get really bad results if there is a made-up programming language, I know because I tried it as an experiment. But there are good results if you are training it.
I don’t know how much it would cost to train something to the level of accuracy/quality that is appropriate, but there are existing projects that have done this with languages before. This individual has trained a model on a First Nation’s indigenous Canadian tribal language called Mi’kmaq: Lnu-AI - An Indigenous AI System .
That could be a good starting point if you are interested in the endeavor. Should give you a good idea on how much effort is required. Possibly you can reach out to the author and ask about the costs associated with it.
When I used web gpt-4 I uploaded dictionary in json also I told him the rules. He was able to write in Silesian something after that. But from python level I cant do that. What I figured out I can slice in python that json table to smaller tables and tell him: Here is part of dictionary that you need to remember. It worked a little bit but I was kinda expensive. Around 0 .1 usd per script launch. I already gave up on that idea.