I’m exploring the possibility of using the ChatGPT API for translation purposes and have a specific requirement. In my business, we use specialized terminology that varies across different languages.
I’m wondering if there’s a way to integrate a custom list of terms into the ChatGPT translation process. Essentially, I’d like to provide ChatGPT with a list of these terms in various languages to ensure that the translations it produces consistently use the correct business-specific language.
Has anyone done something similar or knows if this is feasible with the ChatGPT API? Any insights or suggestions on how to approach this would be greatly appreciated.
The lists are very long, about 10 000 rows… We could perhaps cut it down more but you get the picture.
Thanks in advance!
Maybe make a custom GPT for this task and include these lists as your knowledge files, or connect it to a retrieval action to fill the context appropriately.
Hi! Welcome to the forums!
The most popular pattern for augmenting LLMs is called RAG - Retrieval Augmented Generation.
You basically have a vectordb (faiss would probably be a super good enough option in your case) that is populated by embedding vectors
What you’d probably do is embed definitions of all these terms, and retrieve a sublist of the most related terms to the document with a relatedness cutoff.
If you have tons and tons of examples, another thing you can potentially look at is fine-tunes. While I’m typically of the opinion that finetunes are a waste of money, it’s possible that in your case they might be worthwhile since you’re looking to emulate a specific style in your output.
I’d personally take a look at rag, you can throw a PoC together with jupyter in about an hour or so, and see where you get from there.
I think this would be a poor use case for vector embeddings because it’s deterministic and we aren’t concerned about semantics.
What I’d do is take the text to be translated and,
- Given the language of the text, scan the text to identify all the special terms present
- Given the target language of the translation collect the matching pair special term
- Augment the text to include the special term translation, so something like,
When dealing with special_term_english do this
When dealing with [special_term_english=special_term_spanish] do this
Then include in the instructions something like,
You must translate this document from English to Spanish. Included in this document are several terms-of-art which have very specific, precise translation requirements. To assist you in this effort these terms-of-art and their necessary and correct translations are delineated in the format
x is the term-of-art in the source language and
y represents the only acceptable translation of the term-of-art in the target language.
Ah, yeah if the source documents are consistent you can indeed just do that.
Thanks for the suggestions guys!
What about finetuning a model per language?
Finetuning “translate this text from English to Spanish: this is lingo” => “esta poco loco”
In theory, using “this is lingo” in a text, would then be replaced with “esta poco loco”
Finetuning with thousands of these would work no?
Fine-tuning could work, but… That might be a lot of fine-tuning which can get expensive, then running fine-tuned models which is more expensive, all without knowing ahead of time how it will perform.
Honestly, I recommend just building some scaffolding around the LLM to make it’s job easier. Not every task is meant for an LLM to do and even those that are not all of them need to be done without help.