Issues with Multilingual Content Extraction from Knowledge Base in GPT-4 API

Hi everyone!
I work for a company that has recently implemented the GPT-4 API to develop a customized chatbot for the connected home appliances sector. One of our goals is to use GPT-4 to provide users with personalized recipes based on the appliances they own, including specific program settings for these devices.

We have built a comprehensive knowledge base containing various recipes. However, we are encountering an unusual issue regarding the extraction of these recipes. When the model receives requests in Italian for specific foods, it fails to find the appropriate recipes, despite their presence in our knowledge base documents. Interestingly, if the same request is made in English, the model extracts the recipes correctly.

Here are some additional details that might be helpful:

  • The issue specifically occurs with Italian requests, while in English, the model functions as expected.
  • The missing recipes are indeed present in our knowledge base and are accessible to the model (both in English and Italian).
  • We have verified and ensured that the Italian documents are correctly formatted and easily accessible to the model.
  • We even tried to force translation in the prompt, asking the model to translate the user input into English, elaborate, then re-translate the answer into the user’s language (see the sketch below).
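For reference, here is roughly what that workaround looks like (a simplified sketch: the model name, prompts, and function names are placeholders rather than our production values, and the knowledge-base retrieval step is omitted):

```python
# Simplified sketch of the translate -> answer -> translate-back workaround.
# Model name and prompts are placeholders, and retrieval is omitted for brevity.
from openai import OpenAI

client = OpenAI()

def ask(system_prompt: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

def answer_in_user_language(user_input: str, user_language: str) -> str:
    # 1. Translate the user's request into English before doing anything else.
    english_query = ask("Translate the user's message into English.", user_input)
    # 2. Elaborate the answer in English (knowledge-base context would go here).
    english_answer = ask("You are a recipe assistant for connected home appliances.",
                         english_query)
    # 3. Translate the answer back into the user's language.
    return ask(f"Translate the following text into {user_language}.",
               english_answer)
```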

I don’t know what else to try.

This might be a dumb idea, but why wouldn’t you just translate all documents into the same language like English?

It can still respond in the same language the request was made in, but a single-language corpus might make results a bit more consistent. Again, maybe it’s a bad idea, but I thought I’d ask.
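If it helps, a one-off preprocessing pass could look something like this (a rough sketch assuming python-docx for reading the files; translate() is a hypothetical stand-in for whatever translation service you use, since I don’t know your stack):

```python
# Rough sketch: translate every .docx recipe into English once, offline.
# Assumes python-docx; translate() is a hypothetical placeholder.
from pathlib import Path
from docx import Document

def translate(text: str, target_language: str = "English") -> str:
    raise NotImplementedError("plug in your translation service here")

out_dir = Path("knowledge_base_en")
out_dir.mkdir(exist_ok=True)

for path in Path("knowledge_base").glob("*.docx"):
    doc = Document(str(path))
    for paragraph in doc.paragraphs:
        if paragraph.text.strip():
            # Replaces the paragraph's runs with the translated text.
            paragraph.text = translate(paragraph.text)
    doc.save(str(out_dir / path.name))
```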

Yep, we tried, but it didn’t seem to improve performance.

Just to confirm, you are using assistants, correct?

Here are some general thoughts:

  • assistants are a gimmick (my opinion)
  • with a multilingual corpus, ada significantly prefers matching by language over content. → It follows that a mixed-language corpus may not be a good idea (see the sketch after this list).
  • the information embedded doesn’t have to be the information retrieved. For example, embedding the number of eggs and cups of flour won’t accomplish much with this weak embedding model. I’d take a look at the content and quality of your knowledge base and see if there’s some room for improvement there.
  • but most importantly, I’d take a look at your use case and what exactly you’re trying to achieve. What requests are the users making? Are they asking for recipes by mood, flavor, or by name? Simply random suggestions by appliance? How are they using your product exactly? For instance, if you have recipes for each appliance, then you don’t need to look through recipes from each appliance if the user only has one appliance. If you’re currently not pre-filtering, that could be a gigantic boost on its own.
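To make the language-matching point concrete, here’s a quick test you could run yourself (a minimal sketch assuming text-embedding-ada-002 and hypothetical example strings; it needs an OPENAI_API_KEY in the environment):

```python
# Minimal sketch: check whether embeddings cluster by language rather than
# by content. Document strings are hypothetical examples.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_it = embed("agnello")                       # Italian query: "lamb"
doc_en   = embed("Roast lamb with rosemary and garlic")   # on-topic, English
doc_it   = embed("Risotto ai funghi porcini")             # off-topic, Italian

# If the same-language but off-topic document scores higher, language is
# dominating the match and a mixed-language corpus will misbehave.
print("Italian query vs English lamb recipe:   ", cosine(query_it, doc_en))
print("Italian query vs unrelated Italian recipe:", cosine(query_it, doc_it))
```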

No, we’re not using assistants, and unfortunately we have no way to improve the knowledge base, which is provided by our customer.

Users are asking for recipes for specific basic ingredients such as “lamb” (that’s actually the word that made us aware of the problem), or for more complex dishes such as “pasta with tomato sauce” (ironically, in this case the bot responds with correct info in both Italian and English).

The bot knows which appliances the user has (we pass it an array with the list of appliances associated with the user profile), so yeah, we are pre-filtering, roughly like this:
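(Simplified sketch; the field names are placeholders for our actual schema.)

```python
# Simplified sketch of our appliance pre-filter: only recipes matching one of
# the user's appliances are passed on to retrieval. Field names are placeholders.
def prefilter_recipes(recipes: list[dict], user_appliances: list[str]) -> list[dict]:
    allowed = set(user_appliances)
    return [r for r in recipes if r["appliance"] in allowed]

recipes = [
    {"title": "Agnello al forno", "appliance": "oven"},
    {"title": "Pasta al pomodoro", "appliance": "multicooker"},
]
print(prefilter_recipes(recipes, ["oven"]))  # only the oven recipe survives
```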

(For info, the knowledge base consists of 313 .docx documents, each with a single recipe, all formatted in the same way.)

Thanks a lot for your answer!

Sometimes RAG (or AI) isn’t the answer - sometimes a keyword search will yield better results. :person_shrugging:
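For ingredient lookups like “lamb”/“agnello”, even a tiny bilingual keyword map can be more reliable than embeddings (a minimal sketch; the aliases and recipe fields are hypothetical):

```python
# Minimal sketch of a keyword fallback: a small bilingual ingredient map so
# "agnello" and "lamb" hit the same recipes regardless of corpus language.
INGREDIENT_ALIASES = {
    "agnello": "lamb",
    "lamb": "lamb",
    "pomodoro": "tomato",
    "tomato": "tomato",
}

def normalize(text: str) -> set[str]:
    # Map each token to its canonical (English) form where we know one.
    return {INGREDIENT_ALIASES.get(t, t) for t in text.lower().split()}

def keyword_search(query: str, recipes: list[dict]) -> list[dict]:
    terms = normalize(query)
    return [r for r in recipes if terms & normalize(r["text"])]

recipes = [{"title": "Agnello al forno", "text": "agnello rosmarino aglio"}]
print(keyword_search("lamb", recipes))  # matches despite the language mismatch
```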