Text to speech: search timeline for specific text and get number of seconds into the track

I have some text that I am converting into speech, in MP3 format.

I have some images that will go with the MP3 to make an MP4.

I need to find at what second of the audio certain text is discussed, so I know how long to leave between images for them to line up correctly.

Does anyone know of a method to achieve this?

You can run the generated MP3 through the transcriptions endpoint using whisper-1 and investigate the timestamp_granularities API parameter.

With "word" granularity, it will return word-level timestamps in this format (output truncated here):

{
  "task": "transcribe",
  "language": "english",
  "duration": 44.08000183105469,
  "text": "I remind you of a joke. I know you've heard this joke. At this point where the lady is caught by the cop, the cop comes up to her and says, lady, you were going 60 miles an hour. ..",
  "words": [
    {
      "word": "I",
      "start": 0.5400000214576721,
      "end": 0.800000011920929
    },
    {
      "word": "remind",
      "start": 0.800000011920929,
      "end": 1.1799999475479126
    },
    {
      "word": "you",
      "start": 1.1799999475479126,
      "end": 1.2999999523162842
    },
    {
      "word": "of",
      "start": 1.2999999523162842,
      "end": 1.4600000381469727
    },
    {
      "word": "a",
      "start": 1.4600000381469727,
      "end": 1.8600000143051147
    },
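
To get from that list to "at what second is this text spoken", one option is to slide over the words array looking for a run of words that matches your phrase. This is only a minimal sketch: find_phrase_start and normalize are hypothetical helper names, not part of any API, and the words list is copied (with rounded times) from the sample response. It assumes the transcript wording matches your search text, which may not hold exactly, for example where numbers are written differently.

```python
import string

def normalize(token):
    """Lowercase a token and strip surrounding whitespace and punctuation."""
    return token.strip().strip(string.punctuation).lower()

def find_phrase_start(words, phrase):
    """Return the start time (seconds) of the first match of phrase, or None.

    words is the "words" list from a verbose_json response; each item has
    "word", "start" and "end" keys. Hypothetical helper for illustration.
    """
    target = [t for t in (normalize(p) for p in phrase.split()) if t]
    if not target:
        return None
    spoken = [normalize(w["word"]) for w in words]
    # Compare every window of len(target) consecutive spoken words
    for i in range(len(spoken) - len(target) + 1):
        if spoken[i:i + len(target)] == target:
            return words[i]["start"]
    return None

# First few entries of the sample response, times rounded
words = [
    {"word": "I", "start": 0.54, "end": 0.80},
    {"word": "remind", "start": 0.80, "end": 1.18},
    {"word": "you", "start": 1.18, "end": 1.30},
    {"word": "of", "start": 1.30, "end": 1.46},
    {"word": "a", "start": 1.46, "end": 1.86},
]

print(find_phrase_start(words, "remind you"))  # 0.8
```

Matching on normalized whole words keeps the search tolerant of capitalization and trailing punctuation; if exact matching proves too strict for your text, a fuzzy comparison would be the next thing to try.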

Here is an example that passes the parameters directly to Python’s requests library to make the RESTful multipart/form-data API call and timestamp my joke’s transcription.

import json, os, requests

audio_file_name = "joke.mp3"
api_key = os.getenv("OPENAI_API_KEY")
headers = {"Authorization": f"Bearer {api_key}"}
url = "https://api.openai.com/v1/audio/transcriptions"

with open(audio_file_name, "rb") as audio_file:
    parameters = {
        "file": (audio_file_name, audio_file),
        "model": (None, "whisper-1"),  # None is for no filename/mime
        "language": (None, "en"),
        "prompt": (None, "Here is the comedy show."),
        "response_format": (None, "verbose_json"),
        "temperature": (None, "0.1"),
        "timestamp_granularities[]": (None, "word"),
    }
    response = requests.post(url, headers=headers, files=parameters)

print(json.dumps(json.loads(response.content), indent=2))

You can see whether the format is useful for you; “segment” is also available as a granularity option to search within.
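
Whichever granularity you search, the end goal in the question is how long each image should stay on screen. As a rough sketch, assuming you already have the start second for each image's text and the total duration from the response, the display time of each image is the gap to the next cue. image_durations is a hypothetical helper name, and the cue times below are made-up values for illustration; only the 44.08 duration comes from the sample response.

```python
def image_durations(cue_times, audio_duration):
    """Return how many seconds each image should stay on screen.

    cue_times: start time in seconds for each image, in ascending order.
    audio_duration: total length of the audio in seconds.
    Hypothetical helper for illustration.
    """
    if not cue_times:
        return []
    if any(later < earlier for earlier, later in zip(cue_times, cue_times[1:])):
        raise ValueError("cue_times must be in ascending order")
    if audio_duration < cue_times[-1]:
        raise ValueError("audio_duration is shorter than the last cue time")
    # Each image runs until the next cue; the last runs to the end of the audio
    boundaries = list(cue_times) + [audio_duration]
    return [round(end - start, 3) for start, end in zip(boundaries, boundaries[1:])]

cues = [0.0, 12.5, 30.25]  # made-up cue times, one per image
durations = image_durations(cues, 44.08)
print(durations)  # [12.5, 17.75, 13.83]
```

These per-image durations are the values you would then hand to whatever tool assembles the MP4.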

That's really helpful, thank you.
