Inconsistent Structured Output Results

Hi, I am having trouble getting consistent results with structured output.

Context:
Translating an Excel file into X languages.
I am using Chat Completions (Streaming Structured Output).
The source text is in English.

This is my user prompt:

const promptMessage = `Translate the following JSON array of texts to ${languages.join(", ")}.
Return the translations following this exact schema format:
{
  "translations": [
    {
      "id": "0", // IMPORTANT: Return this ID in your response
      ${languages
        .map((lang) => `"${lang.toLowerCase()}": "${lang} translation"`)
        .join(",\n      ")}
    }
  ]
}
IMPORTANT: Each translation must include the "id" field that was provided in the input. Do not include the "source" field in your response.

Here are the texts to translate:
${formattedContent}`;

This is how I create my translationSchema (using Zod):

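import { z } from "zod";

// Builds a Zod schema with one string field per target language,
// keyed by the lowercased language name.
function createLanguageTranslationSchema(languages) {
  const translationFields = {};
  languages.forEach((language) => {
    const langKey = language.toLowerCase();
    translationFields[langKey] = z.string().describe(`${language} translation`);
  });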

  return z.object({
    translations: z.array(
      z.object({
        id: z.string().describe("Row identifier"),
        ...translationFields,
      })
    ),
  });
}

This is the response I expect:

{
  "translations": [
    {
      "id": "0",
      "italian": "Italian translation",
      "arabic": "Arabic translation",
      "japanese": "Japanese translation",
      "russian": "Russian translation",
      "chinese - traditional (hk)": "Chinese - Traditional (HK) translation"
    }
  ]
}

I have implemented batching to stay within the gpt-4o-mini maximum output token limit (16k), so each request covers a batch of X rows.
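
For reference, a minimal sketch of this kind of row batching, assuming a fixed number of rows per batch. The names batchRows and maxRowsPerBatch are illustrative placeholders and not the actual implementation, which may size batches differently (for example by estimated tokens).

```javascript
// Illustrative sketch only: `batchRows` and `maxRowsPerBatch` are
// hypothetical names, not taken from the original code.
function batchRows(rows, maxRowsPerBatch) {
  if (!Number.isInteger(maxRowsPerBatch) || maxRowsPerBatch < 1) {
    throw new RangeError("maxRowsPerBatch must be a positive integer");
  }
  const batches = [];
  // Slice the input into consecutive chunks of at most maxRowsPerBatch rows.
  for (let i = 0; i < rows.length; i += maxRowsPerBatch) {
    batches.push(rows.slice(i, i + maxRowsPerBatch));
  }
  return batches;
}
```

Since each row produces one output string per language, the output size of a batch grows with the number of target languages, so the row count that fits in 16k tokens for one language may be too large for ten.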

The problem:
I am getting inconsistent results. For example:

  • Special characters, escape characters, or HTML tags in the text halt a batch, and processing moves on to the next batch. I have implemented an HTML cleanup function. Translating into a single language works; the problem appears when translating into 10 languages.
  • Certain seemingly random strings, with no special characters, also halt the batch.
  • After a random number of rows, it returns one language's translation for every language (for example, Italian translations in the Arabic, Japanese, Russian, and Chinese fields).
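
To make these symptoms easier to pin down per batch, here is a rough diagnostic sketch that could be run on each parsed response. It assumes the response has already been parsed into the { translations: [...] } shape shown earlier; checkBatchResult, inputIds, and languageKeys are hypothetical names introduced only for this example.

```javascript
// Illustrative diagnostic; all names here are hypothetical.
// inputIds:     ids of the rows sent in the batch, e.g. ["0", "1"]
// languageKeys: lowercased language keys, e.g. ["italian", "arabic"]
// result:       the parsed structured-output object
function checkBatchResult(inputIds, languageKeys, result) {
  const problems = [];
  const rows = Array.isArray(result?.translations) ? result.translations : [];
  const seen = new Set(rows.map((row) => row.id));

  // Rows that were sent but never came back (e.g. the batch halted early).
  for (const id of inputIds) {
    if (!seen.has(id)) problems.push({ id, issue: "missing row" });
  }

  for (const row of rows) {
    const values = languageKeys.map((key) => row[key]);
    if (values.some((v) => typeof v !== "string" || v.trim() === "")) {
      problems.push({ id: row.id, issue: "missing or empty translation" });
    } else if (languageKeys.length > 1 && new Set(values).size === 1) {
      // May also flag text that is legitimately the same in every
      // language, such as numbers or brand names.
      problems.push({ id: row.id, issue: "identical text for every language" });
    }
  }
  return problems;
}
```

Logging the returned list together with the batch's source rows should show whether the failures line up with particular inputs or with batch size and language count.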

What would you guys suggest?

Interestingly enough, these issues do not occur with the base gpt-4o-mini model or with the previous version of my fine-tuned model.