Question regarding Speech-To-Text timing accuracy

richardabear · 06-23-2022 04:21 AM

I am currently developing a "snipping" tool that is based on Google Speech-to-Text data.

Basically, I am having a hard time cutting the audio/video based on the start and end times in the transcription data that Google returns.

I currently assume that the start/end times are inaccurate in most cases, because when I pass the start and end times from Google's returned data to a CLI tool (FFmpeg), the audio always seems to be cut short. For example, "Today" gets cut down to "Toda".
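For reference, the cut described above can be sketched roughly as follows. This is only a minimal sketch: the helper names, the file names, and the 0.15 s padding are made up for illustration. Placing -ss after -i and re-encoding (no stream copy) is intended to make FFmpeg seek by decoding rather than snapping to keyframes, and widening the window slightly is one way to check whether the reported end time is the part that is too early.

```python
import subprocess


def padded_window(start, end, pad=0.15, duration=None):
    """Widen a [start, end] window (seconds) by `pad` on each side.

    `pad` is an illustrative value, not a documented recommendation.
    """
    new_start = max(0.0, start - pad)
    new_end = end + pad
    if duration is not None:
        new_end = min(duration, new_end)
    return round(new_start, 3), round(new_end, 3)


def build_ffmpeg_cut(src, dst, start, end, pad=0.15, duration=None):
    """Build an FFmpeg argument list that cuts the padded window out of `src`."""
    cut_start, cut_end = padded_window(start, end, pad, duration)
    return [
        "ffmpeg", "-y",
        "-i", src,
        # -ss/-to after -i are output options: the input is decoded and
        # frames outside the window are discarded.
        "-ss", f"{cut_start:.3f}",
        "-to", f"{cut_end:.3f}",
        dst,
    ]


def cut(src, dst, start, end, **kwargs):
    """Run the cut; requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_ffmpeg_cut(src, dst, start, end, **kwargs), check=True)
```

If the word is complete with padding and clipped without it, the reported timestamps are the likely cause rather than FFmpeg.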

Now I am wondering whether this is caused by FFmpeg or by inaccurate transcription timing.

Am I correct in assuming that the timing data is simply inaccurate/incomplete?

Thanks for any help.

josegutierrez

It is normal to see some inaccuracy in the data returned by Speech-to-Text. Most of the time the cause is the microphone or input device you use, or the recording was stopped before the message had actually finished, so the audio itself only contains “Toda” rather than the full word “Today”. On the other hand, Speech-to-Text is not 100% accurate. It is recommended to send the audio at 16000 Hz; if you send it at a lower sample rate, you are more likely to get inaccurate data.
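To make the 16000 Hz recommendation concrete, here is a minimal sketch of a request body that declares the sample rate and asks for per-word timestamps, plus a small parser for the returned offsets. The field names are my understanding of the v1 REST API (RecognitionConfig) and should be checked against the current documentation; the function names and the bucket URI are hypothetical.

```python
def build_recognition_request(gcs_uri, language_code="en-US"):
    """Build a request body (plain dict) asking for word-level timestamps.

    Field names are assumed to match the v1 REST RecognitionConfig.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,  # the rate recommended above
            "languageCode": language_code,
            "enableWordTimeOffsets": True,
        },
        "audio": {"uri": gcs_uri},
    }


def offset_to_seconds(offset):
    """Convert a duration string such as "1.300s" to a float in seconds."""
    return float(offset.rstrip("s"))


def word_windows(response):
    """Return (word, start_seconds, end_seconds) for the top alternative of each result."""
    windows = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        for word in alternatives[0].get("words", []):
            windows.append((
                word["word"],
                offset_to_seconds(word["startTime"]),
                offset_to_seconds(word["endTime"]),
            ))
    return windows


# Hypothetical usage; the bucket path is a placeholder.
request_body = build_recognition_request("gs://my-bucket/recording.wav")
```

As far as I know, the documentation describes word offsets as being reported in 100 ms increments, so a cut made exactly at the reported end time can plausibly clip the tail of a word even when recognition itself is correct.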