Re: Issue with Incorrect Ordering of Offsets in Go...

alfatv · 04-28-2025 06:46 AM

Dear Google Speech-to-Text Community,

I am encountering an issue with the Google Speech-to-Text v2 API (Chirp 2 model) where the startOffset values for consecutive words in the transcription appear to be incorrectly ordered. Specifically, later words in the transcription are assigned smaller startOffset values than earlier words, which violates the expected chronological order of the transcription data.

Issue Details:

In the transcription output, one of the words has a startOffset value of 3456.340s, while the following word has a startOffset of 2488.100s. This creates a scenario where the "next" word is starting earlier in time than the "previous" one, which is logically impossible.
Example:
- Word 1: startOffset: 3456.340s
- Word 2: startOffset: 2488.100s (this word starts earlier than the first)

This causes a discrepancy in the transcription data, which leads to confusion when processing or analyzing the transcriptions.

Additional Information:

- This issue appears to be specific to the Chirp 2 model in Google Speech-to-Text v2, as it was observed while using this model to transcribe Polish-language audio (FLAC, good quality).
- The offset values for words should logically increase as the transcription progresses, so the startOffset of each subsequent word should never be smaller than that of the previous one.
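
For anyone who wants to check their own output for the same symptom, here is a minimal Python sketch that walks the words in order and reports every place where startOffset goes backwards. The field names (alternatives, words, startOffset) are taken from the JSON excerpt in this post; the helper names parse_offset and find_backward_jumps are made up for illustration, and the sketch assumes the results are available as a plain list of dictionaries and that only the first alternative of each result matters.

```python
def parse_offset(value):
    """Convert a duration string such as "2488.100s" to seconds as a float."""
    return float(value.rstrip("s"))


def find_backward_jumps(results):
    """Return (previous_word, next_word) pairs where startOffset decreases.

    Words are compared in transcript order, across result boundaries.
    Only the first alternative of each result is inspected (an assumption).
    """
    jumps = []
    previous = None
    for result in results:
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for word in alternatives[0].get("words", []):
            if previous is not None:
                prev_start = parse_offset(previous.get("startOffset", "0s"))
                this_start = parse_offset(word.get("startOffset", "0s"))
                if this_start < prev_start:
                    jumps.append((previous, word))
            previous = word
    return jumps


# Values copied from the excerpt in this post (confidence fields omitted).
sample = [
    {
        "alternatives": [{"words": [
            {"startOffset": "3456.340s", "endOffset": "3456.340s", "word": "do"},
            {"startOffset": "3456.340s", "endOffset": "3456.340s", "word": "przedstawicieli"},
        ]}],
        "resultEndOffset": "2487.500s",
    },
    {
        "alternatives": [{"words": [
            {"startOffset": "2488.100s", "endOffset": "2488.540s", "word": "organu"},
            {"startOffset": "2488.540s", "endOffset": "2489.340s", "word": "prowadzącego,"},
        ]}],
    },
]

jumps = find_backward_jumps(sample)
for prev, nxt in jumps:
    print(f'{prev["word"]} ({prev["startOffset"]}) -> {nxt["word"]} ({nxt["startOffset"]})')
```

Equal start times are deliberately not reported here, so that only true backward jumps show up.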

I would appreciate any insights or solutions to resolve this issue. It is important for us to ensure that the transcriptions are chronologically accurate and usable for further processing.

Thank you for your attention to this matter. I look forward to your assistance.

Best regards,

Bartosz Semanycz

"confidence": 0.85469407
}, {
"startOffset": "3456.340s",
"endOffset": "3456.340s",
"word": "do",
"confidence": 0.9208684
}, {
"startOffset": "3456.340s",
"endOffset": "3456.340s",
"word": "przedstawicieli",
"confidence": 0.94054395
}]
}],
"resultEndOffset": "2487.500s",
"languageCode": "pl-PL"
}, {
"alternatives": [{
"transcript": "organu prowadzącego ale tak samo do organu znaczy do rady pedagogicznej i tak samo do rodziców jesteście w stanie wspólnie wypracować razem wnioski i współpracować",
"confidence": 0.8449576,
"words": [{
"startOffset": "2488.100s",
"endOffset": "2488.540s",
"word": "organu",
"confidence": 0.9376314
}, {
"startOffset": "2488.540s",
"endOffset": "2489.340s",
"word": "prowadzącego,",
"confidence": 0.87923515
}, {

ruthseki

Hi @alfatv,

Welcome to Google Cloud Community!

The issue isn't just that consecutive words across segment boundaries have decreasing start times. The fundamental problem demonstrated here is that within a single result segment, the word timestamps are grossly inaccurate and inconsistent with the segment's own resultEndOffset.
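
To make that contradiction easy to demonstrate, a rough Python sketch along these lines could flag the affected words in a single result. It relies only on the fields visible in the posted JSON (startOffset, endOffset, resultEndOffset); the function name audit_result and the choice to look only at the first alternative are assumptions made for illustration.

```python
def parse_offset(value):
    """Convert a duration string such as "2487.500s" to seconds as a float."""
    return float(value.rstrip("s"))


def audit_result(result):
    """Return (word, reason) pairs for suspicious word timings in one result."""
    result_end = parse_offset(result["resultEndOffset"])
    findings = []
    # Assumption: only the first (top) alternative is of interest.
    for alternative in result.get("alternatives", [])[:1]:
        for word in alternative.get("words", []):
            start = parse_offset(word.get("startOffset", "0s"))
            end = parse_offset(word.get("endOffset", "0s"))
            if end > result_end:
                findings.append((word["word"], "ends after resultEndOffset"))
            if end == start:
                findings.append((word["word"], "zero duration"))
    return findings


# Values copied from the end of the first segment in the posted JSON.
segment = {
    "alternatives": [{"words": [
        {"startOffset": "3456.340s", "endOffset": "3456.340s", "word": "do"},
        {"startOffset": "3456.340s", "endOffset": "3456.340s", "word": "przedstawicieli"},
    ]}],
    "resultEndOffset": "2487.500s",
}

findings = audit_result(segment)
for word, reason in findings:
    print(f"{word}: {reason}")
```

Both words in the posted segment trip both checks, which is the kind of evidence worth including in a report.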

You may try isolating the smallest audio chunk containing this transition (e.g., from 2480s to 3500s, or even smaller if possible) and send only that to the API. Does the error still occur? This helps confirm it's not just an artifact of extremely long files but a core processing issue.
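
One possible way to cut out such a chunk is with ffmpeg, driven from a small Python script. This is only a sketch: it assumes ffmpeg is installed and on the PATH, the file names are placeholders, and the helper names build_ffmpeg_cmd and extract_chunk are made up for illustration.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, start_s, end_s):
    """Build an ffmpeg command that copies [start_s, end_s) of src into dst as FLAC."""
    duration = end_s - start_s
    return [
        "ffmpeg",
        "-ss", f"{start_s:.3f}",   # seek to the start position in the input
        "-i", src,
        "-t", f"{duration:.3f}",   # keep this many seconds of audio
        "-c:a", "flac",            # re-encode as FLAC
        dst,
    ]


def extract_chunk(src, dst, start_s, end_s):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_cmd(src, dst, start_s, end_s), check=True)


# Placeholder file names; the time range is the one suggested above.
cmd = build_ffmpeg_cmd("full_recording.flac", "chunk.flac", 2480, 3500)
print(" ".join(cmd))
```

Keep in mind that the offsets returned for the extracted chunk start again from zero, so they need to be shifted by the chunk's start time before comparing them with the original transcript.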

Apart from that, you could test the same audio segment with a different model (if one is available for Polish in v2, or using v1) to see whether the timestamps are correct there. This would help confirm whether the issue is specific to Chirp 2.

I also suggest filing a defect report. Since the issue tracker is public, this lets you follow the progress of your report. Please note that I can't provide any details or timelines at this moment. For future updates, I suggest keeping an eye on the issue tracker.

In your defect report, provide:

- The model (Chirp 2, language pl-PL).
- The JSON snippet you shared (or a similar one).
- Details about the audio (FLAC, long duration, Polish).
- A minimal reproducible audio sample if possible (even a few minutes around the problematic boundary might trigger it). This is the most helpful thing for Google engineers.
- An explicit note on the contradiction between word.startOffset / word.endOffset and the resultEndOffset within the same result segment, and on the zero-duration words.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

alfatv

Hi @ruthseki,

Thank you for your help.

I had been using v1 with the "long" model for a few years, even for very long videos (up to 7-8 hours each). I used to use MP3; now I am using FLAC. I never had problems like this. Today I am going to cut out a smaller segment and file a defect report.

When you ask for "details about the audio, long duration", do you mean the duration of the sample or the duration of the original file? The original file is about 3 hours 28 minutes long.

I have done about 10 transcriptions of different lengths and 4 of them are broken. They were recorded with different microphones, but they are all of rather good sound quality (conference recordings).

Kind regards

Bartosz

alfatv

I can confirm that the issue does not occur after extracting a shorter 2-minute file from the long one. There are no problems with phrase order in this shorter version. So I have to give you the full FLAC and full JSON.

I have created issue no. 415474103.
