Voice to text that distinguishes different voices

I am looking for a solution to convert a video conversation to text.
Let's say we have 4 characters in a play; I want the text generated for each of them to appear under a different header.

Expected output:
Actor 1: Hi, how are you?
Actor 2: I am good, let me introduce my friend Raj.
Actor 3 (Raj): I am Raj, I am based out of Jersey.
Actor 4: Sorry, I heard your conversation. Glad to hear it, I am from Jersey.
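
To make the target format concrete, here is a rough sketch of the final formatting step only. It assumes that some speech-to-text and speaker-separation step (which is the part I am actually asking about) has already produced a list of pieces, each with a speaker label, a start time, and the recognized text. The names Segment and format_transcript are made up for illustration and do not come from any library.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical output of an earlier speaker-separation step.
    speaker: str   # label such as "Actor 1"
    start: float   # start time in seconds
    text: str      # recognized words for this piece of audio


def format_transcript(segments):
    """Order segments by time and emit one 'Speaker: text' line per turn.

    Consecutive segments from the same speaker are merged into one line.
    """
    lines = []
    last_speaker = None
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker == last_speaker:
            # Same speaker is still talking: extend the current line.
            lines[-1] += " " + seg.text
        else:
            # New speaker: start a new line with that speaker's header.
            lines.append(f"{seg.speaker}: {seg.text}")
            last_speaker = seg.speaker
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = [
        Segment("Actor 1", 0.0, "hi"),
        Segment("Actor 1", 0.8, "how are you"),
        Segment("Actor 2", 2.1, "I am good"),
    ]
    print(format_transcript(sample))
```

What I do not know is how to get the speaker label for each piece of audio in the first place.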