Word-level timestamps from Whisper v3's JSON are invalid

I'm playing with the new word-level timestamps in large-v3, but for some reason the JSON comes back with single quotes instead of double quotes. Running it through ChatGPT converts it easily, but that's pretty risky if it decides to hallucinate any of the output.

Example:

[{'text': ' Abstract.', 'timestamp': (0.0, 0.54)}, {'text': ' The', 'timestamp': (0.54, 1.48)}, {'text': ' ubiquity', 'timestamp': (1.48, 2.04)}, {'text': ' of', 'timestamp': (2.04, 2.2)}]

It seems you are describing how Python displays string objects: the quoting is chosen adaptively when a string is shown. Whether you see single quotes, double quotes, or escaped quotes of either type depends on how you got there and on the string's contents.

The bytes returned from a direct “requests” library call to the API are JSON:

"words": [\n {\n "word": "This",\n "start": 1.059999942779541,\n "end": 1.600000023841858\n },\n {\n "word": "is",\n "start": 1.600000023841858,\n "end": 1.7799999713897705\n },\n {\n "word": "a",\n "start": 1.7799999713897705,\n "end": 1.9800000190734863\n },\n {\n "word": "radio",\n "start": 1.9800000190734863,\n "end": 2.380000114440918\n },\n {\n "word": "show",\n "start": 2.380000114440918,\n "end": 2.619999885559082\n },\n {\n "word": "where",\n "start": 2.619999885559082,\n "end": 2.859999895095825\n },\n {\n "word": "people",\n "start": 2.859999895095825,\n "end": 3.140000104904175\n },\n {\n "word": "call",\n "start": 3.140000104904175,\n "end": 3.440000057220459\n },\n {\n "word": "us",\n "start": 3.440000057220459,\n "end": 3.640000104904175\n },\n {\n "word": "and",\n "start": 3.640000104904175,\n "end": 3.819999933242798\n },\n {\n "word": "ask",\n "start": 3.819999933242798,\n
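Bytes like these can be handed straight to the standard-library json module. A minimal sketch, assuming a response body shaped like the excerpt above; the `body` value here is a hand-shortened stand-in built from the first two words, not the actual API response:

```python
import json

# Hypothetical stand-in for the raw response bytes (first two words only).
body = (
    b'{"words": [\n {\n "word": "This",\n "start": 1.059999942779541,\n'
    b' "end": 1.600000023841858\n },\n {\n "word": "is",\n'
    b' "start": 1.600000023841858,\n "end": 1.7799999713897705\n }\n]}'
)

# json.loads accepts bytes as well as str and returns plain Python objects.
data = json.loads(body)

for w in data["words"]:
    print(w["word"], w["start"], w["end"])
```

Once parsed, the result is an ordinary dict and list, so printing it would show Python's quoting rather than JSON's.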

I'm just using the sample code they have on the model card:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])

  • Define a data object whose strings mix double quotes and an escaped single quote:
    chunks = [{'text': 'He said "Hello"', 'timestamp': (0.0, 0.54)}, {'text': ' because', 'timestamp': (0.54, 1.48)}, {'text': ' it\'s', 'timestamp': (1.48, 2.04)}, {'text': ' "polite"', 'timestamp': (2.04, 2.2)}]

  • Print:
    print(chunks)

  • See how the enclosing quotes alternate to give the cleanest presentation of each string's contents:
    [{'text': 'He said "Hello"', 'timestamp': (0.0, 0.54)}, {'text': ' because', 'timestamp': (0.54, 1.48)}, {'text': " it's", 'timestamp': (1.48, 2.04)}, {'text': ' "polite"', 'timestamp': (2.04, 2.2)}]
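The rule behind that alternation can be sketched in a few lines. Python's repr() prefers single quotes, switches to double quotes when the string contains a single quote but no double quote, and falls back to single quotes with an escape when both appear. The sample strings are illustrative only:

```python
# Strings chosen to hit each quoting case of repr().
samples = [" because", " it's", 'He said "Hello"', 'both \' and "']

# print() on a list shows the repr() of each element, which is what
# produces the mixed quoting seen when printing the chunks.
shown = [repr(s) for s in samples]

for text in shown:
    print(text)
```

None of this changes the strings themselves; only their displayed form differs.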

Or we can make a JSON string, which is no longer the list-and-dictionary structure you'd index into for parsing:

import json
print(json.dumps(chunks, indent=2))

[
  {
    "text": "He said \"Hello\"",
    "timestamp": [
      0.0,
      0.54
    ]
  },
  {
    "text": " because",
    "timestamp": [
      0.54,
      1.48
    ]
  },
  {
    "text": " it's",
    "timestamp": [
      1.48,
      2.04
    ]
  },
  {
    "text": " \"polite\"",
    "timestamp": [
      2.04,
      2.2
    ]
  }
]

(Enclosing it within ``` here is also a good way to present information on the forum.)
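If all you have left is the printed text rather than the live object, there is no need for a language model to convert it. A minimal sketch, assuming the text is a plain Python literal like the one in the original question: ast.literal_eval from the standard library parses it back into lists, dicts, and tuples without executing code, and json.dumps then emits valid JSON. The `printed` string is a shortened copy of that example.

```python
import ast
import json

# Text as it appears after print(result["chunks"]): a Python literal, not JSON.
printed = (
    "[{'text': ' Abstract.', 'timestamp': (0.0, 0.54)}, "
    "{'text': ' The', 'timestamp': (0.54, 1.48)}]"
)

# Safely rebuild the Python objects; tuples stay tuples here.
chunks = ast.literal_eval(printed)

# Serialize to real JSON; tuples are written as JSON arrays.
as_json = json.dumps(chunks)
print(as_json)

# Reading the JSON back gives lists where the tuples used to be.
round_trip = json.loads(as_json)
```

When the pipeline result is still in memory, calling json.dumps on result["chunks"] directly should be enough and skips the literal_eval step.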


Ah gotcha, makes perfect sense, thank you!