Dear Google Speech-to-Text Community,
I am encountering an issue with the Google Speech-to-Text v2 API (Chirp 2 model) where the startOffset values for consecutive words in the transcription appear to be incorrectly ordered. Specifically, later words in the transcription are assigned smaller startOffset values than earlier words, which violates the expected chronological order of the transcription data.
In the transcription output, one of the words has a startOffset value of 3456.340s, while the following word has a startOffset of 2488.100s. This creates a scenario where the "next" word is starting earlier in time than the "previous" one, which is logically impossible.
Example:
Word 1: startOffset: 3456.340s
Word 2: startOffset: 2488.100s (this word starts earlier than the first)
This causes a discrepancy in the transcription data, which leads to confusion when processing or analyzing the transcriptions.
I observed this with the Chirp 2 model in Google Speech-to-Text v2 while transcribing Polish-language audio (FLAC, good quality), so the issue may be specific to that model.
Offsets should not decrease as the transcription progresses, so the startOffset of each word should never be smaller than that of the word before it.
I would appreciate any insights or solutions to resolve this issue. It is important for us to ensure that the transcriptions are chronologically accurate and usable for further processing.
Thank you for your attention to this matter. I look forward to your assistance.
Best regards,
Bartosz Semanycz
"confidence": 0.85469407
}, {
"startOffset": "3456.340s",
"endOffset": "3456.340s",
"word": "do",
"confidence": 0.9208684
}, {
"startOffset": "3456.340s",
"endOffset": "3456.340s",
"word": "przedstawicieli",
"confidence": 0.94054395
}]
}],
"resultEndOffset": "2487.500s",
"languageCode": "pl-PL"
}, {
"alternatives": [{
"transcript": "organu prowadzącego ale tak samo do organu znaczy do rady pedagogicznej i tak samo do rodziców jesteście w stanie wspólnie wypracować razem wnioski i współpracować",
"confidence": 0.8449576,
"words": [{
"startOffset": "2488.100s",
"endOffset": "2488.540s",
"word": "organu",
"confidence": 0.9376314
}, {
"startOffset": "2488.540s",
"endOffset": "2489.340s",
"word": "prowadzącego,",
"confidence": 0.87923515
}, {
Hi @alfatv,
Welcome to Google Cloud Community!
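The issue isn't just that consecutive words across segment boundaries have decreasing start times. The fundamental problem demonstrated here is that within a single result segment, the word timestamps are grossly inaccurate and inconsistent with the segment's own resultEndOffset. In your excerpt, the words "do" and "przedstawicieli" both have startOffset and endOffset equal to 3456.340s, yet the segment they belong to reports a resultEndOffset of 2487.500s.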
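To find every affected word in a long response, a check along these lines may help. It is only a minimal sketch: it assumes the response has been loaded from JSON into a list of result dicts shaped like the excerpt above (alternatives, words, startOffset, endOffset, resultEndOffset), and it treats a missing offset as "0s", which is an assumption about how zero values are serialized. The function names are hypothetical.

```python
from decimal import Decimal


def parse_offset(value):
    """Convert a duration string such as "3456.340s" to seconds."""
    return Decimal(value.rstrip("s"))


def find_timestamp_anomalies(results):
    """Return (segment_index, word, reason) tuples for suspicious word timings."""
    anomalies = []
    previous_start = None
    for seg_index, result in enumerate(results):
        alternatives = result.get("alternatives") or []
        if not alternatives:
            continue
        raw_end = result.get("resultEndOffset")
        segment_end = parse_offset(raw_end) if raw_end else None
        # Only the first (top) alternative is inspected in this sketch.
        for word in alternatives[0].get("words", []):
            start = parse_offset(word.get("startOffset", "0s"))
            end = parse_offset(word.get("endOffset", "0s"))
            if segment_end is not None and end > segment_end:
                anomalies.append((seg_index, word["word"], "ends after resultEndOffset"))
            if previous_start is not None and start < previous_start:
                anomalies.append((seg_index, word["word"], "starts before previous word"))
            previous_start = start
    return anomalies
```

Passing the parsed results list to find_timestamp_anomalies gives a list you can attach to a defect report, one entry per word whose timing contradicts its segment or its predecessor.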
You may try isolating the smallest audio chunk containing this transition (e.g., from 2480s to 3500s, or even smaller if possible) and send only that to the API. Does the error still occur? This helps confirm it's not just an artifact of extremely long files but a core processing issue.
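One way to cut such a chunk is with ffmpeg, if it is installed on your machine. The sketch below only builds the command line; the file names are placeholders and the helper name is hypothetical. It re-encodes to FLAC rather than stream-copying, and uses a duration (-t) instead of an end time so the cut length is unambiguous.

```python
import subprocess


def build_trim_command(source, target, start_s, end_s):
    """Build an ffmpeg argv that extracts [start_s, end_s) seconds as FLAC."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    duration = end_s - start_s
    return [
        "ffmpeg",
        "-ss", str(start_s),   # seek to the chunk start in the input
        "-i", source,
        "-t", str(duration),   # keep this many seconds of audio
        "-c:a", "flac",        # re-encode so the output is a clean FLAC file
        target,
    ]


def trim_audio(source, target, start_s, end_s):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_trim_command(source, target, start_s, end_s), check=True)
```

For example, trim_audio("full.flac", "chunk.flac", 2480, 3500) would write the range mentioned above. Keep in mind that offsets returned for the trimmed file are relative to the start of the chunk, so add 2480s back before comparing them with the original response.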
Apart from that, you could test the same audio segment with a different model (if one is available for Polish in v2, or using v1) to see whether the timestamps are correct there. If they are, that would indicate the issue is specific to Chirp 2.
I also suggest filing a defect report. Because the issue tracker is public, this lets you follow the progress of your report. Please note that I can't provide any details or timelines at this moment. For future updates, I suggest keeping an eye on the issue tracker.
In your defect report, provide:
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.