Help Putting Whisper Code Into Python Script

Hey, I’m working with the OpenAI API and trying to convert this script to use the Whisper API, but I can’t figure out how to make it behave the same way. By “the same” I mean always listening via speech_recognition (r.listen()), so I don’t have to press a button or otherwise trigger the recording before talking to the bot.


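from openai import OpenAI
import os
import pyaudio  # not used directly; sr.Microphone needs it installed
import speech_recognition as sr
from pathlib import Path
from playsound import playsound

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

messages1 = [{'role': 'system', 'content': 'You are a person texting. Try to keep the responses short.'}]

while True:
    r = sr.Recognizer()
    mic = sr.Microphone()

    with mic as source:
        try:
            audio = r.listen(source, timeout=5)  # Adjust timeout as needed
            prompt = r.recognize_google(audio)
            print("you said: " + prompt)
        except sr.WaitTimeoutError:
            continue  # nothing heard within the timeout, listen again
        except sr.UnknownValueError:
            continue  # speech was unintelligible, listen again

    if prompt.lower() == 'quit':
        break

    usrmsg = {'role': 'user', 'content': prompt + ' '}
    messages1.append(usrmsg)
    print("[loading. . .]")

    completion = client.chat.completions.create(
        model='gpt-3.5-turbo', messages=messages1
    )

    text = completion.choices[0].message.content

    # Text-to-speech; the model and voice here are assumed values
    response = client.audio.speech.create(model='tts-1', voice='alloy', input=text)

    # Assuming 'output.mp3' is the file you want to write to
    output_file_path = Path("output.mp3")

    with open(output_file_path, "wb") as file:
        file.write(response.content)

    # Play the generated audio
    if os.path.exists("output.mp3"):
        playsound("output.mp3")

    gptmsg = {'role': 'assistant', 'content': text + ' '}
    messages1.append(gptmsg)

print("Bye! See you later!")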

Whisper Code (from docs):

from openai import OpenAI
client = OpenAI()

audio_file = open("/path/to/file/audio.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)


Hey there and welcome to the community!

A really helpful tutorial I keep coming back to is this one:

The diarization bit is unnecessary, but I believe what you are asking for is streaming audio to the model for transcription. The Whisper API does not have a streaming mode yet, so audio has to be recorded before it can be sent. You basically need a mechanism that automatically sends a recorded file to the Whisper API every X seconds.
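A minimal sketch of the "send a file every X seconds" idea, in case it helps. It assumes 16-bit mono PCM that has already been captured somewhere else (for example with PyAudio) and an `openai.OpenAI` instance passed in as `client`. The helper names `pcm_to_wav_bytes`, `split_pcm` and `transcribe_chunks` are made up for illustration, and the 5-second default is arbitrary.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, sample_width: int = 2) -> bytes:
    """Wrap raw mono PCM samples in a WAV container, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def split_pcm(pcm: bytes, seconds: float, sample_rate: int = 16000, sample_width: int = 2):
    """Yield consecutive chunks of roughly `seconds` of audio each."""
    step = int(seconds * sample_rate) * sample_width
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


def transcribe_chunks(client, pcm: bytes, seconds: float = 5.0, sample_rate: int = 16000):
    """Send each chunk to the transcription endpoint and collect the text."""
    texts = []
    for chunk in split_pcm(pcm, seconds, sample_rate):
        wav_bytes = pcm_to_wav_bytes(chunk, sample_rate)
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=("chunk.wav", wav_bytes),  # (filename, content) tuple
        )
        texts.append(result.text)
    return texts
```

Cutting at fixed intervals can split a word across two chunks, which is one reason cutting on pauses tends to give cleaner transcripts.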

The other thing to keep in mind is that “always on” is going to get extremely expensive pretty quick, and most of that expense is going to transcribing empty data (or worse, it misinterprets other sounds as speech and misfires), so if you do wish to build such a function, consider these consequences.
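One cheap way to avoid paying for silence is to check how loud each chunk is before sending it. This is a rough sketch, assuming 16-bit mono PCM in the machine's native byte order (little-endian on most desktops). The names `rms` and `has_speech` are made up, and the threshold of 500 is an arbitrary illustrative value that would need tuning per microphone. A loudness gate like this will still misfire on loud non-speech noise.

```python
import array
import math


def rms(pcm: bytes) -> float:
    """Root-mean-square level of 16-bit signed mono PCM."""
    samples = array.array("h")
    samples.frombytes(pcm[: len(pcm) // 2 * 2])  # drop a trailing odd byte
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_speech(pcm: bytes, threshold: float = 500.0) -> bool:
    """True when the chunk is loud enough to be worth transcribing."""
    return rms(pcm) >= threshold
```

A chunk would then only be uploaded when `has_speech(chunk)` is true.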

For context: This is why you need to say “Alexa” or “Hey Google” every time you want to use one of those things, so it “knows” when speech is happening. That is how big tech got around this problem.

I think what you are looking for is voice activity detection. As with Alexa and other home assistants, the app is always listening, but the actual recording for transcription only starts after a spoken command (Alexa! Order popcorn!).

Previously I implemented such a solution using Silero VAD, and while I couldn’t easily set my own activation word but instead had to use pre-determined choices, it did work quite well.
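For the script in the question there is also a lighter option that is not Silero VAD: `r.listen()` from speech_recognition already waits until the input level crosses its energy threshold and stops recording after a pause, so each captured phrase can be handed to Whisper instead of `recognize_google`. A minimal sketch under those assumptions; `transcribe_phrase` and `listen_forever` are hypothetical names, and the "quit" check is copied from the original script.

```python
def transcribe_phrase(client, audio) -> str:
    """Send one captured phrase (a speech_recognition AudioData) to Whisper."""
    wav_bytes = audio.get_wav_data()  # WAV-encoded bytes of the phrase
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=("phrase.wav", wav_bytes),  # (filename, content) tuple
    )
    return result.text


def listen_forever(handle_text):
    """Keep listening; call handle_text(text) for every phrase heard."""
    # Imported here so transcribe_phrase can be used without a microphone.
    import speech_recognition as sr
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate the energy threshold
        while True:
            try:
                audio = recognizer.listen(source, timeout=5)
            except sr.WaitTimeoutError:
                continue  # nobody spoke within 5 seconds, keep waiting
            text = transcribe_phrase(client, audio)
            # Whisper usually adds punctuation, so strip it before comparing.
            if text.strip().lower().rstrip(".!") == "quit":
                break
            handle_text(text)
```

Calling `listen_forever(print)` would print each transcribed phrase; in the chat script, `handle_text` would be the part that appends to `messages1` and requests the completion.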

Note that the source is already quite old in AI time. Maybe there have been newer developments that I am not aware of.

Hope this helps!
