Whisper syllable classification

I am trying to use the Whisper API to classify syllables in audio files. This works badly with no prompt (I get lots of random, dream-like AI text), so, following the “improving reliability” section’s advice on helping it detect uncommon words and acronyms, I use the following prompt:

“Ah, Eh, Ih, Oh, Uh, A, E, I, O, Doo, Lah, Woh, U”

However, I still rarely get something back resembling a syllable. Can I force the response to be one of the results from the prompt dictionary above? I tried playing with the temperature but this didn’t seem to help.
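For reference, the call I’m making looks roughly like this (a minimal sketch with the openai Python client; the file name and exact parameter values are just illustrative):

(Python)

from openai import OpenAI

client = OpenAI()

# Illustrative call: transcribe one clip with the syllable prompt and a low temperature
with open("syllable_clip.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Ah, Eh, Ih, Oh, Uh, A, E, I, O, Doo, Lah, Woh, U",
        temperature=0,
    )

print(transcription.text)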


Hi Syntheso,

I hope this helps, but I could be all wet behind the ears; I am NOT a coder. I focus strictly on Legal matters with ChatGPT, particularly my own case, which is up to $840M, so obviously that is my focus.

When I first started with ChatGPT thinking I was going to get great and knowledgeable responses, I was surprised to realize just how basic it was.

So I started a conversation about the basics.

Chat, do you know the alphabet?

Chat: Yes

Chat, do you know the pronunciation of each letter of the alphabet?

Chat: It printed the pronunciation of what it would sound like.

So I asked it to use Google Translate to listen to the pronunciation of each letter. And it did, but as it stated, it does not yet have the ability to receive audio files to interpret.

So I tried a few things, but like you, it had trouble with proper responses to many things, including acronyms and syllables. But you can also present your question to ChatGPT, and it will provide coding examples to accomplish your goals.

Attached is an example of what you could do in order to communicate with ChatGPT, get an audio response back, and have Whisper syllable classification integrated using the Whisper API.

If this helped please let me know, if not I am sorry but please don’t shoot the novice for lack of understanding! LOLOL!

Thank you and have a great day!

(Attachment Whisper Syllable Classification Integration.pdf is missing)

Hi @trueserv1, thanks for your response. I can’t see your PDF, unfortunately. It says “Attachment Whisper Syllable Classification Integration.pdf is missing”.

Hi Syntheso,

Here is my write-up, which includes a lot of ChatGPT’s responses. Chat can answer almost any question regarding programming and how to do it, and it will speed things up for you.

Whisper Syllable Classification Integration

From what I understand, you need to set up a system that allows you to speak into a microphone, convert the speech to text for processing by an AI like ChatGPT, a.k.a. (Alex), and then convert the AI’s text response back to audio. But then you also want to interpret what the syllables are in the speech. Is that correct? If yes, then I think you can use speech-to-text and text-to-speech APIs. Here is what I think might work, with the required code. I did something for myself last year so I could speak through my microphone to talk to ChatGPT without having to type all the time, to speed up the process. But I never thought about the syllables aspect of it. I am not a “Brainiac Coder” like you guys. I am a novice at best and ask a ton of questions to Alex to get most of the answers.

Below is what I got back from Alex a while back. I asked the questions again today to see if there were changes, updates, or a simpler method, and yep, there was! LOLOL

I hope this helps, and I apologize if you already have this or knew all of it.

I also found this link helpful as well.

https://platform.openai.com/docs/api-reference/audio/createSpeech

What I don’t like is that there is a 4096-character limit per text-to-speech request.
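If your text runs longer than that, one workaround is to split it into chunks under the limit before synthesizing. Here is a minimal sketch (the 4096 figure is from the docs; the splitting helper is just an illustration I put together):

(Python)

MAX_TTS_CHARS = 4096  # per-request character limit mentioned in the TTS docs

def split_for_tts(text, limit=MAX_TTS_CHARS):
    """Split text into chunks under the limit, breaking on whitespace."""
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > limit and current:
            # Current chunk is full; start a new one with this word
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Each returned chunk can then be sent to the TTS API as a separate request.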

I liked this link because it provided voices for Alex to use in his responses.

https://platform.openai.com/docs/guides/text-to-speech/quickstart
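For reference, the quickstart there boils down to something like this (a minimal sketch based on that page; the model and voice names may have changed since):

(Python)

from openai import OpenAI

client = OpenAI()

# Synthesize a short reply with one of the built-in voices and save it to disk
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello, how can I assist you today?",
)
response.stream_to_file("speech.mp3")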

I strictly work on Legal Stuff because of my legal case and all the corrupt Judges and Court Clerks. If you are interested let me know. And you can see how Military “Directed Energy Weapons” are being used against Civilians by wealthy people.

YouTube Channel: @davidsimpkins6059

First Section: Using the Speech-to-Text (STT) API

I usually use Google Cloud (Speech-to-Text).

Requirements

  • google-cloud-speech library
  • Microphone input handling with pyaudio

Below is example code for Speech-to-Text:

  1. Install the required libraries:

    (Bash)

    pip install google-cloud-speech pyaudio

  2. Speech-to-Text Script:

(Python)

import os

import pyaudio
from google.cloud import speech

# Set up Google Cloud credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/service-account-file.json"

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100 ms

def record_audio():
    """Record from the microphone until interrupted (Ctrl+C) and return raw PCM bytes."""
    audio_interface = pyaudio.PyAudio()
    stream = audio_interface.open(format=pyaudio.paInt16,
                                  channels=1,
                                  rate=RATE,
                                  input=True,
                                  frames_per_buffer=CHUNK)
    print("Recording...")
    frames = []
    try:
        while True:
            data = stream.read(CHUNK)
            frames.append(data)
    except KeyboardInterrupt:
        print("Recording stopped.")
    finally:
        stream.stop_stream()
        stream.close()
        audio_interface.terminate()
    return b''.join(frames)

def transcribe_audio(audio_content):
    """Send the recorded audio to Google Cloud Speech-to-Text and return the transcript."""
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(content=audio_content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code="en-US"
    )
    response = client.recognize(config=config, audio=audio)
    transcript = ""
    for result in response.results:
        transcript = result.alternatives[0].transcript
        print("Transcript: {}".format(transcript))
    return transcript

if __name__ == "__main__":
    audio_data = record_audio()
    transcript = transcribe_audio(audio_data)
    print("You said: {}".format(transcript))

Second Section: Using the Text-to-Speech (TTS) API

We’ll use Google Cloud Text-to-Speech for this example.

Requirements

  • google-cloud-texttospeech library
  • pydub for audio playback

Code: Text-to-Speech

  1. Install required libraries

    (Bash)

    pip install google-cloud-texttospeech pydub

  2. Text-to-Speech Script:

(Python)

import os

from google.cloud import texttospeech
from pydub import AudioSegment
from pydub.playback import play

# Set up Google Cloud credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/service-account-file.json"

def text_to_speech(text):
    """Synthesize speech from text with Google Cloud Text-to-Speech and save it as output.mp3."""
    client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    response = client.synthesize_speech(
        input=input_text, voice=voice, audio_config=audio_config
    )
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print("Audio content written to file 'output.mp3'")

def play_audio(file_path):
    """Play an audio file through the default output device."""
    sound = AudioSegment.from_file(file_path)
    play(sound)

if __name__ == "__main__":
    text = "Hello, how can I assist you today?"
    text_to_speech(text)
    play_audio("output.mp3")

Integration: Speech-to-Text (STT), then Text-to-Speech (TTS)

To create a full application where you can speak into a microphone, have your speech transcribed, processed, and then converted back to speech, you would integrate both parts:

  1. Record and transcribe audio to text.

  2. Send the text to the AI (e.g., via an API).

  3. Convert the AI’s text response to audio.

This requires coordination of both the STT and TTS parts, as well as an interface to send the text to the AI for processing.

This is an Example that Works

Here’s how you might put it all together:

(Python)

import os

import pyaudio
from google.cloud import speech, texttospeech
from pydub import AudioSegment
from pydub.playback import play

# Set up Google Cloud credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your/service-account-file.json"

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100 ms

def record_audio():
    """Record from the microphone until interrupted (Ctrl+C) and return raw PCM bytes."""
    audio_interface = pyaudio.PyAudio()
    stream = audio_interface.open(format=pyaudio.paInt16,
                                  channels=1,
                                  rate=RATE,
                                  input=True,
                                  frames_per_buffer=CHUNK)
    print("Recording...")
    frames = []
    try:
        while True:
            data = stream.read(CHUNK)
            frames.append(data)
    except KeyboardInterrupt:
        print("Recording stopped.")
    finally:
        stream.stop_stream()
        stream.close()
        audio_interface.terminate()
    return b''.join(frames)

def transcribe_audio(audio_content):
    """Send the recorded audio to Google Cloud Speech-to-Text and return the transcript."""
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(content=audio_content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code="en-US"
    )
    response = client.recognize(config=config, audio=audio)
    transcript = ""
    for result in response.results:
        transcript = result.alternatives[0].transcript
        print("Transcript: {}".format(transcript))
    return transcript

def text_to_speech(text):
    """Synthesize speech from text with Google Cloud Text-to-Speech and save it as output.mp3."""
    client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    response = client.synthesize_speech(
        input=input_text, voice=voice, audio_config=audio_config
    )
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print("Audio content written to file 'output.mp3'")

def play_audio(file_path):
    """Play an audio file through the default output device."""
    sound = AudioSegment.from_file(file_path)
    play(sound)

if __name__ == "__main__":
    audio_data = record_audio()
    transcript = transcribe_audio(audio_data)

    # Here you would send the transcript to the AI for processing.
    # For demonstration, we'll assume a static response.
    ai_response = "This is the AI response to your query."

    text_to_speech(ai_response)
    play_audio("output.mp3")

In this example, the system records audio until interrupted (e.g., by pressing Ctrl+C), transcribes it to text, generates an AI response (simulated here as a static response), converts the AI’s text response to audio, and plays it back.

You’ll need to replace “path/to/your/service-account-file.json” with the path to your actual Google Cloud service account file. Also, ensure your microphone and audio playback devices are properly set up and configured.
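The static ai_response in the example is just a stand-in. If you want to actually send the transcript to ChatGPT, a minimal sketch with the openai Python client might look like this (the model name is an assumption; use whichever chat model you have access to):

(Python)

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_ai(transcript):
    """Send the transcribed text to the Chat Completions API and return the reply text."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model works here
        messages=[{"role": "user", "content": transcript}],
    )
    return completion.choices[0].message.content

# In the main block above, you would replace the static line with:
# ai_response = ask_ai(transcript)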

To integrate Whisper:

Integrating OpenAI’s Whisper for syllable classification into a speech-to-text pipeline involves using the Whisper model to process the audio and extract text along with syllable information. Although OpenAI’s Whisper might not have a direct syllable classification feature out of the box, you can process the text output to estimate syllable counts using language processing techniques.

Here’s a step-by-step guide:

Installation

First, install the Whisper library and other necessary dependencies. Whisper is distributed as the openai-whisper Python package, which you can install like this:

(Bash)

pip install openai-whisper

Example Code

Below is an example of how to use Whisper for speech-to-text and then classify syllables using textstat.

Code: Whisper Integration

This example uses the openai-whisper package, which can process audio files directly via load_model and transcribe. Adapt the calls if its API has changed since this was written.

(Python)

import pyaudio
import wave
import whisper  # the openai-whisper package is imported as "whisper"
import textstat

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100 ms
FORMAT = pyaudio.paInt16
CHANNELS = 1

def record_audio(file_path):
    """Record from the microphone until interrupted (Ctrl+C) and save a WAV file."""
    audio_interface = pyaudio.PyAudio()
    stream = audio_interface.open(format=FORMAT,
                                  channels=CHANNELS,
                                  rate=RATE,
                                  input=True,
                                  frames_per_buffer=CHUNK)
    print("Recording...")
    frames = []
    try:
        while True:
            data = stream.read(CHUNK)
            frames.append(data)
    except KeyboardInterrupt:
        print("Recording stopped.")
    finally:
        stream.stop_stream()
        stream.close()
        audio_interface.terminate()
    with wave.open(file_path, 'wb') as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(audio_interface.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))

def transcribe_and_classify_syllables(file_path):
    """Transcribe a WAV file with Whisper and estimate the syllable count of the transcript."""
    # Load Whisper model
    model = whisper.load_model("base")  # adjust model name based on available models

    # Transcribe audio file
    result = model.transcribe(file_path)
    transcript = result['text']

    # Count syllables in the transcript
    syllable_count = textstat.syllable_count(transcript)

    return transcript, syllable_count

if __name__ == "__main__":
    audio_file = "output.wav"
    record_audio(audio_file)
    transcript, syllable_count = transcribe_and_classify_syllables(audio_file)
    print("Transcript:", transcript)
    print("Syllable Count:", syllable_count)

Explanation of How It Works

  1. Recording Audio:
  • Uses pyaudio to record audio from the microphone and save it as a WAV file.
  • You can stop recording by interrupting the process (e.g., pressing Ctrl+C).

  2. Transcribing Audio:
  • Loads the Whisper model and transcribes the audio file.
  • Extracts the text transcript from the Whisper model’s output.

  3. Classifying Syllables:
  • Uses textstat to count syllables in the transcribed text (a per-word sketch follows below).
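If you want per-word syllable estimates rather than a single total for the transcript, a small sketch using textstat might look like this:

(Python)

import textstat

def syllables_per_word(transcript):
    """Return a list of (word, estimated syllable count) pairs for the transcript."""
    return [(word, textstat.syllable_count(word)) for word in transcript.split()]

# Example:
# syllables_per_word("hello world")  ->  [("hello", 2), ("world", 1)]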

Summary

To integrate Whisper syllable classification, you record audio, transcribe it using Whisper, and then analyze the text to estimate syllable counts. The example uses the openai-whisper package’s simple API for loading models and transcribing audio; adjust the method calls if its documentation indicates changes.

Does this help? Sorry if you already knew all of this.