You can try out the timestamp_granularities API parameter on the transcriptions endpoint with whisper-1.
With the granularity set to "word", it returns word-level timestamps in this format (output truncated):
{
  "task": "transcribe",
  "language": "english",
  "duration": 44.08000183105469,
  "text": "I remind you of a joke. I know you've heard this joke. At this point where the lady is caught by the cop, the cop comes up to her and says, lady, you were going 60 miles an hour. ..",
  "words": [
    {
      "word": "I",
      "start": 0.5400000214576721,
      "end": 0.800000011920929
    },
    {
      "word": "remind",
      "start": 0.800000011920929,
      "end": 1.1799999475479126
    },
    {
      "word": "you",
      "start": 1.1799999475479126,
      "end": 1.2999999523162842
    },
    {
      "word": "of",
      "start": 1.2999999523162842,
      "end": 1.4600000381469727
    },
    {
      "word": "a",
      "start": 1.4600000381469727,
      "end": 1.8600000143051147
    },
Here is an example that passes the parameters to Python’s requests library, which makes the RESTful multipart/form-data API call and timestamps my joke’s transcription.
import json
import os
import requests

audio_file_name = "joke.mp3"
api_key = os.getenv("OPENAI_API_KEY")
headers = {"Authorization": f"Bearer {api_key}"}
url = "https://api.openai.com/v1/audio/transcriptions"

with open(audio_file_name, "rb") as audio_file:
    parameters = {
        "file": (audio_file_name, audio_file),
        "model": (None, "whisper-1"),  # None is for no filename/mime
        "language": (None, "en"),
        "prompt": (None, "Here is the comedy show."),
        "response_format": (None, "verbose_json"),
        "temperature": (None, "0.1"),
        "timestamp_granularities[]": (None, "word"),
    }
    # Send the request while the audio file is still open
    response = requests.post(url, headers=headers, files=parameters)

print(json.dumps(json.loads(response.content), indent=2))
You can then judge whether this format is useful for you. There is also “segment” as a granularity option, which gives you segment-level timestamps to search within.
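As a rough illustration of searching within the word-level output, here is a minimal sketch that looks up a phrase in a "words" list shaped like the response above and returns the start and end time of each match. The helper names normalize and find_phrase, and the rounded sample data, are my own assumptions for the example and are not part of the API.

```python
import re

def normalize(token):
    """Lowercase a token and strip punctuation so "Lady," matches "lady"."""
    return re.sub(r"[^\w']", "", token.lower())

def find_phrase(words, phrase):
    """Return a list of (start, end) times in seconds, one per occurrence of phrase.

    `words` is assumed to be the "words" list from a verbose_json response,
    i.e. dicts with "word", "start" and "end" keys.
    """
    target = [t for t in (normalize(p) for p in phrase.split()) if t]
    tokens = [normalize(w["word"]) for w in words]
    hits = []
    if not target:
        return hits
    # Slide a window of len(target) words across the transcript
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            last = i + len(target) - 1
            hits.append((words[i]["start"], words[last]["end"]))
    return hits

# Hypothetical sample shaped like the response above (timestamps rounded)
sample_words = [
    {"word": "I", "start": 0.54, "end": 0.80},
    {"word": "remind", "start": 0.80, "end": 1.18},
    {"word": "you", "start": 1.18, "end": 1.30},
    {"word": "of", "start": 1.30, "end": 1.46},
    {"word": "a", "start": 1.46, "end": 1.86},
]

print(find_phrase(sample_words, "remind you"))  # [(0.8, 1.3)]
```

With a real response you would pass in the parsed JSON's "words" list instead of the sample data. The matching here is exact after lowercasing and punctuation stripping, so it will miss words the model spelled differently.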