I am investigating the benefits of upgrading from the V1 Phone Call "Enhanced" model to Chirp Telephony to get more accurate transcriptions of phone calls. I used the Google Console UI to test the two models, and was quite surprised to find that the Chirp model appears to be broken.
I used a 52-second call recording with 2 channels to test this.
The V1 model's results look fine: they are split out by channel and timestamped as expected, and I can enable punctuation as well.
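For anyone reproducing this outside the Console UI, the equivalent V1 REST request body (`speech:recognize`) would look roughly like the sketch below. Field names are taken from the public V1 API reference; the bucket URI is a placeholder.

```python
import json

# Sketch of a Speech-to-Text V1 request body for a 2-channel
# phone recording using the "Enhanced" phone_call model.
v1_request = {
    "config": {
        "languageCode": "en-US",
        "model": "phone_call",          # V1 telephony model
        "useEnhanced": True,            # "Enhanced" variant
        "audioChannelCount": 2,
        "enableSeparateRecognitionPerChannel": True,  # split results by channel
        "enableAutomaticPunctuation": True,
        "enableWordTimeOffsets": True,  # per-word timestamps
    },
    # Placeholder URI; point this at your own recording.
    "audio": {"uri": "gs://your-bucket/call-recording.wav"},
}

print(json.dumps(v1_request, indent=2))
```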
Here is the output of the Chirp Telephony model on the same recording. You can see that it is remarkably worse in comparison: beyond the missing punctuation, the model doesn't appear to be splitting the audio by channel at all.
This is so bad that I have to wonder: am I doing something wrong? Am I misunderstanding the purpose of "Chirp" as a drop-in replacement for the V1 speech models?
Hello,
Based on the Google public documentation,
“Chirp processes speech in much larger chunks than other models do. This means it might not be suitable for true, real-time use.”
Currently, many of the Speech-to-Text features are not supported by the Chirp model. The public documentation lists both the specific restrictions and the features Chirp does support.
On the other hand, STT V1 transcription models such as phone_call are specifically trained to recognize speech recorded over the phone, and for telephony audio they produce more accurate transcription results.
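If you want to try Chirp Telephony through the V2 API directly rather than the Console, a request body sketch might look like the following. This assumes the `chirp_telephony` model and a regional recognizer (Chirp is served from regional endpoints such as `us-central1`); the project, recognizer, and bucket URI are placeholders, and Chirp may not honor every feature the V1 models support.

```python
import json

# Placeholder recognizer path; "_" uses the default recognizer.
recognizer = "projects/your-project/locations/us-central1/recognizers/_"

# Sketch of a Speech-to-Text V2 recognize request body for Chirp Telephony.
v2_request = {
    "config": {
        "autoDecodingConfig": {},        # let the service detect the encoding
        "languageCodes": ["en-US"],
        "model": "chirp_telephony",
        "features": {
            # Requesting these does not guarantee Chirp supports them;
            # check the current model documentation for restrictions.
            "enableAutomaticPunctuation": True,
            "multiChannelMode": "SEPARATE_RECOGNITION_PER_CHANNEL",
        },
    },
    # Placeholder URI; point this at your own recording.
    "uri": "gs://your-bucket/call-recording.wav",
}

print(json.dumps(v2_request, indent=2))
```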
Hope this helps.