Is translating complex human languages slow?


I am currently working for a start-up in Saudi Arabia that requires the LLM’s output to be in Arabic. We wrote the prompt in English and told it to translate its answer into Arabic, and keep in mind that Arabic is a complicated language with complex grammar. However, when we did that, the response took around 3 minutes and used approximately 7,500 tokens with GPT-4. We also need some data in JSON format, so I assume the instructions would have to be in English.

What’s faster:

  1. Making the context in Arabic, the instructions in English, and the output in Arabic

  2. Making the whole prompt in English and then outputting in Arabic?

I understand that I can experiment with this myself, but unfortunately it is very awkward to work with Arabic in VSCode, especially when I need to use both English and Arabic in one sentence (Arabic is a right-to-left language). If making the context in Arabic is faster, then I’ll figure out a way, or perhaps use a different IDE.

Thanks in advance :slight_smile:


Using the native language is usually best, unless there is a technical reason why English would be more performant. Languages that are not Germanic-based tend to have fewer tokens in the training set, so you may find comprehension falls with an all-Arabic prompt rather than just a translation request at the end. You are correct in that you will need to experiment to see which method works best for your use case.

Okay, I will definitely experiment with both strategies and see which is better.


It is not the translating that is slow, but the high token use when encoding Unicode Arabic.

English, 214 tokens:

Common use cases

Some common use cases where fine-tuning can improve results:

Setting the style, tone, format, or other qualitative aspects
Improving reliability at producing a desired output
Correcting failures to follow complex prompts
Handling many edge cases in specific ways
Performing a new skill or task that’s hard to articulate in a prompt

One high-level way to think about these cases is when it’s easier to “show, not tell”. In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves the performance over the baseline model.

Another scenario where fine-tuning is effective is in reducing costs and / or latency, by replacing GPT-4 or by utilizing shorter prompts, without sacrificing quality. If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned gpt-3.5-turbo model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.

Arabic, 610 tokens:

استخدامات شائعة

بعض الاستخدامات الشائعة حيث يمكن تحسين النتائج من خلال التكييف الدقيق:

تعيين الأسلوب أو اللهجة أو التنسيق أو جوانب أخرى ذات جودة
زيادة الموثوقية في إنتاج النتيجة المرغوبة
تصحيح الأخطاء في اتباع الإرشادات المعقدة
معالجة العديد من الحالات الحدودية بطرق محددة
أداء مهارة أو مهمة جديدة تصعب توصيلها بوضوح في الإرشاد

طريقة عالية المستوى للتفكير في هذه الحالات هي عندما يكون من الأسهل “إظهار، لا تخبر”. في الأقسام القادمة، سنتناول كيفية إعداد البيانات للتكييف الدقيق وأمثلة متنوعة حيث يحسن التكييف الأداء مقارنة بالنموذج الأساسي.

سيناريو آخر حيث يكون التكييف الدقيق فعّالًا هو تقليل التكاليف و/أو التأخير عن طريق استبدال GPT-4 أو باستخدام إرشادات أقصر، دون التضحية بالجودة. إذا كنت قادرًا على تحقيق نتائج جيدة مع GPT-4، فيمكنك في كثير من الأحيان الوصول إلى جودة مماثلة مع نموذج gpt-3.5-turbo المكيف بشكل دقيق من خلال تكييفه على استكمالات GPT-4، ربما باستخدام إرشاد توجيهي مختصر.

(forum doesn’t support shifting RTL…)

Obviously, Arabic is the required end product, and generating roughly three times the token count will take roughly three times as long.
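Much of that inflation traces back to UTF-8: each Arabic letter occupies two bytes, and byte-level BPE tokenizers trained mostly on English split those bytes into many more tokens per word. A rough, dependency-free way to see this (exact counts would need a tokenizer such as tiktoken, which is not used here):

```python
# Compare UTF-8 bytes per character for the two headings above.
# Byte-level BPE tokenizers see bytes, not letters, so more bytes
# per character generally means more tokens for the same text.
english = "Common use cases"
arabic = "استخدامات شائعة"  # the same heading in Arabic

def bytes_per_char(text: str) -> float:
    return len(text.encode("utf-8")) / len(text)

print(round(bytes_per_char(english), 2))  # 1.0
print(round(bytes_per_char(arabic), 2))   # 1.93
```

The English heading is pure ASCII (one byte per character), while the Arabic one needs nearly two, before the tokenizer’s weaker Arabic vocabulary compounds the gap further.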

A large prompt input also slightly reduces generation speed. Prompting in English may therefore be more efficient, but mixing languages may reduce fluency, and you must write more explicit output specifications.

Also, if the text above seems to contain incorrect usage, that is because it was generated in ChatGPT, which offers no way to lower the temperature. When generating less commonly used world languages via the API, you should lower the temperature setting, as the AI is less “sure” which token to output (higher perplexity). Start with 0.2.
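As a minimal sketch (the model name, system instruction, and user content are placeholders), the temperature knob sits at the top level of a chat completions request body:

```python
import json

# Hypothetical request body for the chat completions endpoint.
# The point is "temperature": 0.2 keeps sampling close to the
# most likely tokens, which helps with less common languages.
payload = {
    "model": "gpt-4",
    "temperature": 0.2,
    "messages": [
        {"role": "system",
         "content": "Respond only in Arabic. Return valid JSON."},
        {"role": "user", "content": "..."},
    ],
}

print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The same field is available whether you POST this body directly or pass `temperature=0.2` through an SDK call.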