Question regarding Speech-To-Text timing accuracy

I am currently developing a "snipping" tool that is based on Google Speech-to-Text data.

I am having a difficult time cutting the audio/video data accurately, because the word start and end times that Google returns with the transcription do not seem to line up with the audio.

I am currently assuming that the start/end times are inaccurate in most cases: when I pass the start and end times from Google's response to a CLI tool (FFmpeg), the audio always seems to be cut short. For example, "Today" gets cut to "Toda".
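For reference, here is a minimal sketch of the kind of cut I mean, written in Python. The function name `build_ffmpeg_cut`, the `pad_s` parameter, the file names, and the timestamps are all hypothetical and only for illustration; `-i`, `-ss`, and `-to` are standard FFmpeg options. The sketch re-encodes rather than using `-c copy`, on the assumption that stream copy can only cut on packet/keyframe boundaries and could therefore shift the cut points on its own.

```python
import subprocess  # only needed if the command is actually executed


def build_ffmpeg_cut(src, dst, start_s, end_s, pad_s=0.0):
    """Build an FFmpeg command that cuts [start_s, end_s] out of src.

    pad_s is a hypothetical safety margin (in seconds) added on both
    sides, to test whether the reported word end time is simply too early.
    """
    start = max(0.0, start_s - pad_s)  # never seek before the file start
    end = end_s + pad_s
    return [
        "ffmpeg",
        "-i", src,
        # -ss/-to placed after -i act as output options: the input is
        # decoded and the output is trimmed to the requested positions.
        "-ss", f"{start:.3f}",
        "-to", f"{end:.3f}",
        dst,  # no "-c copy", so the stream is re-encoded
    ]


# Made-up word timing for "Today": 1.2 s to 1.6 s, padded by 100 ms.
cmd = build_ffmpeg_cut("in.wav", "today.wav", 1.2, 1.6, pad_s=0.1)
print(" ".join(cmd))
# To run it (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```

If the word comes out complete with a small padding such as 0.1 s but is clipped with `pad_s=0.0`, that would point at the timestamps rather than at FFmpeg.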

Now I am wondering whether this is caused by FFmpeg or by inaccurate transcription timing.

Am I correct in my assumption that the timing data is simply inaccurate or incomplete?

Thanks for any help.
