Improving Consistency and Filtering in Glossary Extraction for Chinese→English Translation (GPT-4o-mini)

Hi everyone,

I’m currently building a fanfiction translation app (Chinese → English) that uses the OpenAI API to both translate chapters and dynamically expand a glossary for term consistency.

I’m encountering issues specifically with the glossary extraction step, and I’d love some help from the community to understand if this is a model limitation, a prompt design flaw, or something else.


:pushpin: What I’m trying to do

After translating a chapter, I prompt the model to extract new glossary entries from the Chinese text that meet strict inclusion rules. Only specific categories like personal names, clans, techniques, ranks, etc., should be added if they are:

  • Not already in glossary_input
  • Found with exact matches in the English translation (text_en)
  • Assigned a gender: “male”, “female”, or “neutral”

I even apply a “compound name rule” for names like “千手柱間” → output 3 entries: “千手”, “柱間”, and the full name.


:warning: What’s going wrong

When using gpt-4o-mini, the results often include:

  • Terms that should be excluded, like “查克拉” (Chakra), which is too generic.
  • Partial or missing application of the name-splitting rule.
  • Some over-inclusions like “火影” or “Forest of Death”, which are arguable, but I’d still prefer stricter filtering.
  • Sometimes a misunderstanding of whether terms were already present in glossary_input.

You can see an example below (excerpted for clarity):

json

CopierModifier

"glossary_input": [
  {"vo": "和風", "translated": "Kazekaze", "genre": "male"},
  ...
],
"new_glossary": [
  {"vo": "查克拉", "translated": "Chakra", "genre": "neutral"},
  {"vo": "火影", "translated": "Hokage", "genre": "neutral"},
  ...
]

:red_question_mark: What I’m trying to figure out

  1. Model choice: Is gpt-4o-mini capable of this level of control and nuanced filtering? Or should I stick to gpt-4o despite higher cost?
  2. Prompting: Am I doing something wrong with my system prompt design? Should I break it down more explicitly or reformat it?
  3. Data flow: Could the issue come from token limits or how I structure the inputs (long text_cn/text_en blocks)?
  4. Other suggestions: Are there known strategies to improve term filtering/selection behavior for glossary-building tasks like this?

:brain: Extra context

  • I’m using the chat.completions.create() API with response_format="json_object" to ensure structured output.
  • Temperature is low (0.05), and I’ve tried adjusting top_p and penalties to no significant effect.
  • Here my prompt :
You are building the **new_glossary** array for a Chinese→{language_name} fan‑fiction chapter.

        ──────────────────────────────────────────────────────────────
        ✅ **Include a term** *only if* all conditions are met:
        1. The term is **NOT** already present in `glossary_input`.
        2. It belongs to **one** of these categories:
           • Personal given names & family names (e.g. 宇智波斑)
           • Clan / family names (e.g. 千手)
           • Techniques / jutsu / special powers (e.g. 豪火球之術)
           • Ranks, titles, organizations, summons, unique items or locations.
        3. You can find the exact English wording **inside `text_en`**.
        4. Genre must only be **male**, **female** or **neutral**.

        🚫 **Do NOT add**:
           • Everyday nouns/verbs/adjectives (水、火、眼睛…)
           • Grammatical particles, measure words, conjunctions.
           • Onomatopoeia or sound effects.
           • Ads, platform promos or meta‑text.

        🪄 **Compound personal names rule**
           For any multi‑character personal name, output **three** entries:
           {{"vo":"宇智波","translated":"Uchiha","genre":"neutral"}},
           {{"vo":"斑","translated":"Madara","genre":"male"}}
           {{"vo":"宇智波斑","translated":"Uchiha Madara","genre":"male"}}

        📄 **Output format (JSON only)**
        {"new_glossary":[ {{"vo":"…","translated":"…","genre":"…"}}, … ] }
        If no term qualifies, return: {"new_glossary": []}

I’d be super grateful for any thoughts, advice, or even examples from anyone who’s done something similar — either in translation QA workflows or glossary generation!

Thanks in advance :folded_hands:

I forgot to say that i’m sending to the AI the user prompt :
{"text_cn":"在二十歲之前取代猿飛...","text_en":"Becoming the Third Hokage before the age of twenty instead of Hiruzen Sarutobi was not just a fantasy for Kazekaze.\nAs a transmigrator, he was well aware of the future plot developments.\nThough Hashirama Senju was the God of Shinobi, he was afflicted with a terminal illness.\nWh...","glossary_input":[{"vo":"和風","genre":"male","translated":"Kazekaze"},{"vo":"旗木","genre":"male","translated":"Hatake"},{"vo":"朔茂","genre":"male","translated":"Sakumo"},{"vo":"朝孔雀","genre":"neutral","translated":"Morning Peacock"},{"vo":"榜排之術","genre":"neutral","translated":"Ranking Technique"}]}

So to resume i’m sending full VO chapter and full transated chapter, with glossary alredy in my db that are prensents in the chapter.

Can someone help me please ?

Perhaps this is too complex for gpt-4o-mini, like in “How many r’s are in strawberry?”.

Have you tried other models, like o4-mini?

Also, examples might help. There are more tips on this guide, it’s for gpt-4.1 but it may give you some new ideas:

If you’re having better luck with other models, then it might just be too complex for the smaller ones. You may consider distilling the larger model into a smaller one to save money at scale.

Why gpt-4o and not the newer 4.1?