How to perform real-time English-to-Chinese translation using Whisper and GPT-3.5-Turbo?

Hello everyone, I have successfully translated an English audio file to Chinese using Whisper and GPT-3.5-Turbo. However, I am unsure how to achieve real-time English-to-Chinese or Chinese-to-English translation when using a microphone. Can anyone advise me on how to accomplish this?

Something like this came to mind: 1- Store the sound data received by the microphone with PyAudio somewhere 2- Send real-time received data to the model with the web socket get the answer, and use it.

However, recently, the OpenAI APIs have been experiencing latency and connection errors due to the intensity. This can negatively affect your process

1 Like

In C# I’ve been using System.Speech.Recognition library to capture the boundaries of someone’s speech.

        private void loadSpeechRecognition()
            // Create an in-process speech recognizer for the en-GB locale.  
            SpeechRecognitionEngine recogniser = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-GB"));
            recogniser.LoadGrammar(new DictationGrammar());

            // Add a handler for the speech recognized event.  
            recogniser.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);

            // Configure input to the speech recognizer.  

            // Start asynchronous, continuous speech recognition.  

When the event is triggered it records the resulting audio to a wav file.

        public void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
             using (MemoryStream memoryStream = new MemoryStream())
                 using (FileStream file = new FileStream("file.wav", FileMode.Create, FileAccess.Write))
                 _ = transcribe();

It then sends the request to the audio api to transcribe it before sending the transcription to the chat api.

    private async Task transcribe()
         HttpClient client = new HttpClient();
         HttpRequestMessage request = new HttpRequestMessage();

         request = new HttpRequestMessage(HttpMethod.Post, "");
         request.Headers.Add("Authorization", "Bearer " + api);

         var content = new MultipartFormDataContent();
         content.Add(new StringContent("whisper-1"), "model");
         content.Add(new ByteArrayContent(File.ReadAllBytes(@"E:\Chris\Script\WinForm\DesktopGPT\DesktopGPT\bin\Debug\net6.0-windows\file.wav")), "file", Path.GetFileName("file.wav"));
         request.Content = content;

         HttpResponseMessage response = await client.SendAsync(request);
          string responseBody = await response.Content.ReadAsStringAsync();
          var deserializedResponse = JsonConvert.DeserializeObject<AudioResponse>(responseBody);

          _ = GetChatAsync(deserializedResponse.strText);

If you swapped the transcription api for the translation api that should do roughly what you need. Until the whisper model can take a stream I think actual real-time is off the table but using the speeech recognition library to define the chunks of speech works a lot better than uploading chunks of an arbitrary frequency I find. It’s slow and clunky but it does listen in real time even if you wait a few for it to respond.

1 Like

SpeechRecognitionEngine Class (System.Speech.Recognition) | Microsoft Learn