Fine-Tuning Quirks: Why Is My Deterministic Model Giving Unexpected Translations?

Hello everyone!

I am working on a bot that needs to be able to translate short food-item user queries from any language into English.

Something simple like:

Hühnerbrust → Chicken Breast
Медовая дыня → Honeydew Melon
Jamón de cerdo → Pork Ham
Goose Breast → Goose Breast

This seemed like an easy problem to solve, but I ran into some unexpected issues.

What I’ve tried so far

1. DeepL API

First, I tried a simple API call to DeepL. This worked decently well, but in rare cases it didn’t produce the results I wanted. I want to be able to “hardcode” or train a model so that a few particular queries always come out right, which I can’t do with DeepL.

2. Prompt Design

Next, I tried a carefully designed prompt for gpt-3.5-turbo. This was about as robust as DeepL. With a prompt I can “hardcode” a few queries that it should always get right, but only as many as fit in the prompt, and I also want to keep the tokens per call low.

Here is the prompt I used:

      Translate the following food-item query into English, adhering strictly to the provided JSON format for the output. 
      The input is most likely in German language but can also be in any other language. If the input is already in English, leave it as is. 
      The text to translate is: '*insert food query here*'. 
      Please note that your response should only contain the translation in the exact JSON format provided below, with no additional text, comments, or formatting.
      Input: 'Hirsch Salami'
      Output: {"translation": "Deer Salami"}
      Input: 'Pfirsich'
      Output: {"translation": "Peach"}
      Input: 'chicken'
      Output: {"translation": "chicken"}
      Please ensure your response is formatted exactly like the examples provided, without any additional explanatory text.
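A minimal sketch of how a prompt like this could be filled in and how the JSON reply could be checked before it is trusted. The template here is a shortened stand-in for the prompt above, and the names `build_prompt` and `parse_translation` are made up for illustration.

```python
import json
from typing import Optional

# Shortened stand-in for the prompt above; the few-shot examples are omitted.
PROMPT_TEMPLATE = (
    "Translate the following food-item query into English, adhering strictly "
    "to the provided JSON format for the output.\n"
    "The text to translate is: '{query}'.\n"
    'Output: {{"translation": "..."}}'
)


def build_prompt(query: str) -> str:
    """Insert the user's food query into the prompt template."""
    return PROMPT_TEMPLATE.format(query=query.strip())


def parse_translation(raw_reply: str) -> Optional[str]:
    """Return the translation, or None if the reply is not the expected JSON."""
    try:
        data = json.loads(raw_reply.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    value = data.get("translation")
    return value if isinstance(value, str) else None
```

Returning `None` for anything that is not a JSON object with a string `translation` field lets the caller retry or fall back instead of passing malformed output on to the user.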

The main issue with the results

I have a list of about 300 food items that, if queried by the user, must produce a translation that matches my needs exactly, without fail.

As the translator stays in use, I also want to be able to expand this list so that it keeps improving.

The translator must also be able to translate any other food query that is not on that list, but for those translations rare mistakes are acceptable.

If either DeepL or my prompt got the translation for one of the 300 important queries wrong, there was no way I could “teach” it to get that query right next time.

So next I set out to create a fine-tuned model.

My expectation was that I could hardcode the 300 food items and that all other possible queries would be handled by the knowledge that is stored within the base model.
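For comparison, the behavior expected here (exact answers for a fixed list, best effort for everything else) can also be had without relying on the model to memorize anything. The sketch below is an alternative to the fine-tune, not what this post does: a plain lookup table in front of whatever translator is used as a fallback. `make_translator` and `normalize` are hypothetical names, and the fallback is a stub.

```python
from typing import Callable, Dict


def normalize(query: str) -> str:
    """Case-fold and collapse whitespace so trivial variants share one key."""
    return " ".join(query.casefold().split())


def make_translator(
    fixed: Dict[str, str], fallback: Callable[[str], str]
) -> Callable[[str], str]:
    """Answer from the fixed table when possible, otherwise call the fallback."""
    table = {normalize(source): english for source, english in fixed.items()}

    def translate(query: str) -> str:
        key = normalize(query)
        if key in table:
            return table[key]
        return fallback(query)

    return translate


# Usage with a stub standing in for DeepL or a model call.
translate = make_translator(
    {"Putenbrust": "Turkey Breast", "Schlagobers": "Cream"},
    fallback=lambda query: f"<model translation of {query}>",
)
```

Expanding the list then means adding a dictionary entry, with no retraining.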

The fine tune

This is the file I am currently testing: “file-e54mVKFoXcdMfUX2apo71gPy” (I don’t know whether you are able to access its content via the retrieve file content request).

The prompt-completion pairs all look like this:

{"prompt": "Translate the following food item into English: \"Pflaume\" -->", "completion": "{\"translation\": \"Plum\"}"}
{"prompt": "Translate the following food item into English: \"Banane\" -->", "completion": "{\"translation\": \"Banana\"}"}
{"prompt": "Translate the following food item into English: \"Alitas de pollo sin hueso\" -->", "completion": "{\"translation\": \"Chicken Wings Bone Removed\"}"}
{"prompt": "Translate the following food item into English: \"Queso de cabra crudo\" -->", "completion": "{\"translation\": \"Raw Goat Cheese\"}"}
{"prompt": "Translate the following food item into English: \"Ryż Carnaroli gotowany\" -->", "completion": "{\"translation\": \"Carnaroli Rice Cooked\"}"}
{"prompt": "Translate the following food item into English: \"Chleb na zakwasie\" -->", "completion": "{\"translation\": \"Sourdough Bread\"}"}
{"prompt": "Translate the following food item into English: \"Codillo de cerdo\" -->", "completion": "{\"translation\": \"Pork Knuckle\"}"}
{"prompt": "Translate the following food item into English: \"Pechuga de cordero\" -->", "completion": "{\"translation\": \"Lamb Breast\"}"}
{"prompt": "Translate the following food item into English: \"Ailes de poulet désossées\" -->", "completion": "{\"translation\": \"Chicken Wings Bone Removed\"}"}
{"prompt": "Translate the following food item into English: \"Jus d'orange frais\" -->", "completion": "{\"translation\": \"Fresh Orange Juice\"}"}

I use babbage-002 as the base model because I think my needs aren’t too complex for it to handle and because it is the cheapest and fastest option.

The file contains 761 prompt-completion pairs.
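Lines in this format can be generated rather than written by hand, which avoids escaping mistakes in the nested JSON. A small sketch, assuming the pairs are available as (item, translation) tuples; `make_record` is an illustrative name.

```python
import json

PROMPT_FORMAT = 'Translate the following food item into English: "{item}" -->'


def make_record(item: str, translation: str) -> str:
    """Serialize one training pair as a JSONL line in the format shown above."""
    # The completion is itself a JSON string, so it is dumped first and then
    # embedded (and escaped) by the outer dump.
    completion = json.dumps({"translation": translation}, ensure_ascii=False)
    record = {"prompt": PROMPT_FORMAT.format(item=item), "completion": completion}
    return json.dumps(record, ensure_ascii=False)


pairs = [("Pflaume", "Plum"), ("Chleb na zakwasie", "Sourdough Bread")]
jsonl = "\n".join(make_record(item, english) for item, english in pairs)
```

`ensure_ascii=False` keeps characters such as `ż` or `é` readable in the file instead of turning them into `\u` escapes.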

I set the temperature to 0 when calling the fine-tuned model to make its output as deterministic as possible.

The data also does not include any contradictions.
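A claim like this can be checked mechanically. The sketch below looks for identical prompts that map to more than one completion; the function name and the file name in the comment are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Set


def find_conflicts(jsonl_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each prompt with more than one distinct completion to those completions."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        seen[record["prompt"]].add(record["completion"])
    return {prompt: found for prompt, found in seen.items() if len(found) > 1}


# Hypothetical usage:
#   with open("train.jsonl", encoding="utf-8") as handle:
#       print(find_conflicts(handle))
```

This only catches identical prompts with differing completions. It will not flag labels that are merely inconsistent in meaning, such as two near-synonymous queries mapped to different styles of answer.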

I think fine tuning is an appropriate method to use for this task because 1) I can define the format of the output (in this case JSON) and 2) because

My problem with the fine tune

My problem is that the model quite often returns results that are plainly wrong. I can’t even explain those results as the base model taking precedence over my training data. I assume that proper translation is already part of the babbage model, so I don’t understand how it can return wrong results when even a prompt can return correct ones.

Here are a few examples of false results:

Query: “Putenbrust”
Expectation: “Turkey Breast”
Result from Finetune: “Pork Breast”

Query: “Schlagobers”
Expectation: “Cream”/“Whipped Cream”
Result from Finetune: “Sugar Butter”

Query: “Pute”
Expectation: “Turkey”
Result from Finetune: “Pheasant”

(All of these queries of course included the preamble “Translate the following food item into English: "food item" -->”, exactly as defined in the fine-tune file.)
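For reference, a request matching this setup could be assembled as in the sketch below. The post does not say which stop sequence or token limit was used, so `stop` and `max_tokens` are assumptions, as are the function names and the model ID placeholder.

```python
import json
from typing import Any, Dict

PROMPT_FORMAT = 'Translate the following food item into English: "{item}" -->'


def completion_request(item: str, model: str) -> Dict[str, Any]:
    """Build keyword arguments for a legacy completions call."""
    return {
        "model": model,
        "prompt": PROMPT_FORMAT.format(item=item),  # same preamble as in training
        "temperature": 0,  # always pick the most likely next token
        "max_tokens": 30,  # assumption: enough for one short JSON object
        "stop": ["}"],     # assumption: cut generation at the closing brace
    }


def parse_completion(text: str) -> str:
    """Restore the brace removed by the stop sequence and read the translation."""
    return json.loads(text.strip() + "}")["translation"]


# With the openai Python package (v1+), the call would look roughly like:
#   OpenAI().completions.create(**completion_request("Putenbrust", "YOUR_FINE_TUNED_MODEL_ID"))
```

Temperature 0 only makes decoding pick the most likely token at each step. Whether that token sequence matches a training completion depends on how well the model actually learned the pair.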

What makes these results even crazier?

These exact queries are already in the training file. Shouldn’t a temperature of 0 make the model deterministic and return the exact completion that was trained if the same prompt is given to the model?

Here are these exact prompts from the fine tune:

  {"prompt": "Translate the following food item into English: \"Putenbrust\" -->", "completion": "{\"translation\": \"Turkey Breast\"}"}
  {"prompt": "Translate the following food item into English: \"Pute\" -->", "completion": "{\"translation\": \"Turkey Breast\"}"}
  {"prompt": "Translate the following food item into English: \"Schlagobers\" -->", "completion": "{\"translation\": \"Cream\"}"}

I’ve worked with fine tunes for quite a bit now but it seems I still have some fundamental misunderstanding of how they actually work.

I’d appreciate any assistance, whether it relates to my current issue or other potential solutions to my problem!

OK, I just tried davinci-002 and it works fine for now. Maybe babbage-002 just isn’t cut out for what I need. If anyone still has suggestions for making it work with babbage-002, I’d appreciate them. I’ll be making a lot of requests, so saving money is important.

The problem with such a prompt is that it gives instructions to something that doesn’t exist: following instructions is a trained behavior in other models, not something a base completion model does.

Funny thing, you can just understand how a completion engine works and get it to do the task at hand for 1/8th the price:


The new engines are very unlikely to end the text on their own, but here we have the closing quote (") as a stop phrase we can use.
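The screenshot from this reply is not preserved, so the following is only a guess at the kind of completion-style prompt meant here: a document the model can continue, ending on an opening quote, with the closing quote as the stop sequence. The layout, the labels, and the `max_tokens` value are assumptions; the example pairs come from the original post, and the sampling values are the ones mentioned in this reply.

```python
from typing import Any, Dict, List, Tuple

# Example pairs borrowed from the prompt in the original post.
EXAMPLES: List[Tuple[str, str]] = [
    ("Hirsch Salami", "Deer Salami"),
    ("Pfirsich", "Peach"),
    ("chicken", "chicken"),
]


def few_shot_prompt(item: str, examples: List[Tuple[str, str]] = EXAMPLES) -> str:
    """Lay out examples as a document that ends right where the answer starts."""
    lines = ["Food items and their English translations:", ""]
    for source, english in examples:
        lines.append(f'Original: "{source}"')
        lines.append(f'English: "{english}"')
        lines.append("")
    lines.append(f'Original: "{item}"')
    lines.append('English: "')  # the model continues from the open quote
    return "\n".join(lines)


def completion_request(item: str, model: str = "davinci-002") -> Dict[str, Any]:
    """Keyword arguments for a completions call that stops at the closing quote."""
    return {
        "model": model,
        "prompt": few_shot_prompt(item),
        "temperature": 0.1,
        "top_p": 0.1,
        "max_tokens": 20,  # assumption: food names are short
        "stop": ['"'],     # the closing quote ends the answer
    }
```

Because the prompt reads like a document rather than an instruction, a base model only has to continue the pattern, and the text returned before the stop sequence is the translation itself.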

And you can also see that babbage-002 cannot write the answer correctly with its much smaller training data and parameters, even when setting temp and top-p to 0.1 to restrain its perplexity.

Upgrade to davinci-002 untrained, at 5/4ths the cost of fine tuning babbage-002:


I know this other guy who was working on the same thing. His name was also Paul, ironically.

Let me see if I can find him… you guys should hook up with your ideas.

It’s like, exact… umm… well, let’s see if I can find him, hold on…

Wow, very helpful, thanks! I’ll play around with this a bit.

Sure, connect us, I’d be interested! :)

I’m also interested in whether it’s possible to train the babbage-002 model for translation.
My attempts to train this model have been unsuccessful.
It seems that requests to the untrained gpt-3.5-turbo model are priced the same as davinci-002, so I don’t see the point in using the latter.

babbage just doesn’t have the skill. Within GPT-3, it is a model under 1% the size of davinci. OpenAI likely didn’t train or optimize replacements for the other “letters” because they are outclassed by Llama variants you can run on CPU.

GPT-3 sizes by name (parameters, in billions):
ada - 0.4
babbage - 1.3
curie - 6.7
davinci - 175
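A quick check of the “under 1%” figure, reading b, c and d as babbage, curie and davinci and taking the numbers above as billions of parameters:

```python
# Sizes in billions of parameters, as listed above.
SIZES_B = {"ada": 0.4, "babbage": 1.3, "curie": 6.7, "davinci": 175.0}


def relative_size(model: str, reference: str = "davinci") -> float:
    """Size of `model` as a fraction of `reference`."""
    return SIZES_B[model] / SIZES_B[reference]


for name in SIZES_B:
    print(f"{name}: {relative_size(name):.2%} of davinci")
# babbage comes out at 0.74% of davinci
```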