Training on transcribed audio output - How do I make the AI know who said what?

I asked this by private chat, but then I saw this forum and realised that the answer to this might be relevant to many other users, so I thought I’d post it here too.

Here’s my situation: I have about 10,000 text files containing transcriptions of things that my students have said. Each file corresponds to what one student has said on one particular day. Anyway, I’d like to train an AI model on this data to answer queries of the following forms:
• Give me a sentence that <student> might say. This sentence must contain <X>.
• Give me a sentence that <student> might say about <topic>. This sentence must contain <X>.
Here <X> is some item of language, e.g. a sequence of words such as “didn’t have to”, or a grammatical structure such as a second conditional.

For open ended generation, the manual recommends leaving the prompt empty. So my training data would look something like:
{"prompt":"", "completion":" <sentence>"}
{"prompt":"", "completion":" <sentence>"}
But where would I record which student said which sentence? In all of the transcriptions, each student refers to themselves as “I” or “me”, so from the text alone it’s impossible to tell which student said which sentence. For example, Student A’s transcription might say “I don’t drink because I’m worried about my health.” whereas Student B’s says “I go drinking every weekend.” It would be a problem if the AI generated output where Student A talks about his love of alcohol, or where Student B talks about being teetotal. So I think the training data definitely needs to include each student’s name alongside what they said. My question is: how do I do this?

Should I do something like the below?
{"prompt":"", "completion":" Student A: I don't drink because I'm worried about my health."}
{"prompt":"", "completion":" Student B: I go drinking every weekend."}

Or should I put the students’ names in the prompts?
{"prompt":"Student A", "completion":" I don't drink because I'm worried about my health."}
{"prompt":"Student B", "completion":" I go drinking every weekend."}

Or should I do something different?

TLDR: What’s the best way to indicate the name of the speaker in training data that consists of transcribed audio output?

Thanks in advance,


You would need to add a layer of speaker identity recognition. See the following:

Thanks for the reply. But I already know each speaker’s identity. Each file of transcribed text only contains output from one speaker. My question is how to include the speaker’s identity in the training data.
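
To make the two layouts from my question concrete, here is a rough sketch of how the training file could be built from the per-student text files. It is only an illustration, not a recommendation of either layout. It assumes things my post doesn't state: that each file is named like `<student>_<date>.txt`, that each line of a file holds one sentence, and the helper names (`speaker_from_filename`, `build_records`, `write_jsonl`) are made up for this example.

```python
import json
from pathlib import Path


def speaker_from_filename(path: Path) -> str:
    # Assumed naming scheme: "<student>_<date>.txt", e.g. "Student A_2023-01-15.txt".
    return path.stem.rsplit("_", 1)[0]


def build_records(speaker: str, text: str, name_in_prompt: bool = True) -> list[dict]:
    """Turn one student's transcription into prompt/completion records.

    Assumes one sentence per line; blank lines are skipped.
    """
    records = []
    for line in text.splitlines():
        sentence = line.strip()
        if not sentence:
            continue
        if name_in_prompt:
            # Layout 2: the speaker's name is the prompt.
            records.append({"prompt": speaker, "completion": " " + sentence})
        else:
            # Layout 1: empty prompt, name prefixed inside the completion.
            records.append({"prompt": "", "completion": f" {speaker}: {sentence}"})
    return records


def write_jsonl(transcript_dir: str, out_path: str, name_in_prompt: bool = True) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(transcript_dir).glob("*.txt")):
            speaker = speaker_from_filename(path)
            text = path.read_text(encoding="utf-8")
            for record in build_records(speaker, text, name_in_prompt):
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Calling `write_jsonl("transcripts", "train.jsonl")` would produce the name-as-prompt layout, and passing `name_in_prompt=False` would produce the name-inside-completion layout, so both could be tried on the same data.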