Converting Telegram Group Chat History for Fine-tuning

Hi everyone,

I’m currently working on a project to fine-tune a language model using chat history from a French Telegram group. The goal is to create a chatbot that can effectively understand and respond frequently asked questions in French.

Here’s the challenge I’m facing: The group chat history includes messages from multiple participants, and I need to convert this data into a format suitable for fine-tuning. Specifically, I need to create JSONL files where conversations are structured as interactions between users and an assistant. Additionally, most responses in the group are from a particular admin, so I’d like to use their messages as the assistant role.
Additionally, there are photos, voice messages, and stickers in the group. Can ChatGPT understand them?

I’m looking for advice or ideas on the following:

  1. Strategies to group messages into coherent conversations, especially when dealing with multiple users and replies.
  2. Ensuring the fine-tuning process effectively handles French language data.
  3. Dealing with non-text elements like photos, voice messages, and stickers. How should these be handled during the fine-tuning process?