Chirp Model with Multi Channel Recordings Seems Broken

I am investigating the benefits of upgrading from the V1 Phone Call "Enhanced" model to Chirp Telephony in order to get more accurate transcriptions of phone calls. I used the Google Console UI to test the two models and was quite surprised to find that the Chirp model appears to be broken.

I used a 52 second call recording with 2 channels to test this.

The V1 model results appear to be fine. They are split out by channel and timestamp as expected, and I can enable punctuation as well.

Here is the output of the Chirp Telephony model with the same recording. You can see that it is remarkably worse in comparison. Beyond punctuation not being available, the model doesn't appear to be splitting the results out by channel at all.
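One way to check the channel split outside the Console UI is to group the returned segments by their channel tag and compare the two models side by side. This is only a minimal sketch: it assumes the results have already been exported as plain dicts, and the key names `channel_tag` and `transcript` as well as the sample segments are made up for illustration, not taken from the recording above.

```python
from collections import defaultdict


def group_by_channel(results):
    """Group transcript segments by channel tag, preserving input order.

    `results` is assumed to be a list of dicts with a "transcript" key and
    an optional "channel_tag" key (hypothetical export format).
    """
    grouped = defaultdict(list)
    for result in results:
        # A missing channel_tag is filed under 0, meaning "no channel assigned".
        channel = result.get("channel_tag", 0)
        grouped[channel].append(result["transcript"])
    return dict(grouped)


# Made-up segments standing in for a 2-channel call.
sample = [
    {"channel_tag": 1, "transcript": "hello thanks for calling"},
    {"channel_tag": 2, "transcript": "hi i have a question"},
    {"channel_tag": 1, "transcript": "sure go ahead"},
]

by_channel = group_by_channel(sample)
for channel, segments in sorted(by_channel.items()):
    print(channel, segments)
```

If every segment from one model lands under a single key (or under 0), that model's output is not being separated by channel, which would match what the UI shows.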

This is so bad that I have to wonder, am I doing something wrong? Am I not understanding the purpose of "Chirp" as a drop-in replacement for V1 Speech models?
