
Calling Speech-to-Text suddenly giving me bad transcripts (starting 2022-Dec-1)

For several months I've been using Speech-to-Text (S2T) to transcribe MP3 audio files (1-40 minutes long). It's given great results, and since I'm using the gcloud CLI I can script batches of submissions.

Today I submitted 10 jobs totalling 40 minutes, and the results are all junk. The JSON transcript files, which are normally 50-300K in size, are a few hundred bytes long and consist of just a handful of random individual words. One of these files I had run on Nov-11, and it gave a good result (a 230K JSON file of basically correct transcription).

To test this, I ran the same file through the "Create Transcription" GUI and it gave exactly the same correct result.

I modified my gcloud call (which was "gcloud beta ml speech ....") to remove the "beta" option, and the submission failed on encoding=mp3. I then put the "alpha" option after gcloud instead; this accepted the mp3 encoding but again returned a defective JSON transcription file.
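To be concrete, the three release tracks I tried look roughly like this (the bucket path here is just a placeholder):

# GA track -- rejects the submission, since mp3 is not among its --encoding choices:
gcloud ml speech recognize-long-running gs://my-bucket/sample.mp3 --encoding=mp3 --language-code=en-US --async

# beta and alpha tracks -- accept --encoding=mp3 but return the near-empty JSON:
gcloud beta ml speech recognize-long-running gs://my-bucket/sample.mp3 --encoding=mp3 --language-code=en-US --async
gcloud alpha ml speech recognize-long-running gs://my-bucket/sample.mp3 --encoding=mp3 --language-code=en-US --async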

It would really be a massive inconvenience to have to use the GUI to submit jobs one at a time. 

I went to the S2T "What's new" page and found nothing that seemed to explain this. (Incidentally, there is a bug there: if you click on the "Speech-to-Text v1" drop-down and choose "Speech-to-Text" under Public Features, you actually end up at a page titled "Speech-to-Text V2" with "Speech-to-Text On-Prem" above it, and no information on either one.)

Any suggestions will be greatly appreciated! 

2 REPLIES

This issue doesn't seem to be reproducible on my end. If you have premium support, you can contact GCP Support to investigate further, since this is specific to your project.

Thanks very much for taking a look at my post. 

To try to simplify things, I created a 20-second audio file and saved it as FLAC, stereo MP3, and mono MP3. I used gcloud to invoke the recognize-long-running function. In this case I used the beta version, since the standard recognize-long-running invocation (without alpha or beta) doesn't accept the encoding=mp3 option. The gcloud command I used is shown below.
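(For anyone who wants to reproduce this, the three test files can be generated with ffmpeg along these lines, where source.wav stands in for any recording:

ffmpeg -i source.wav -t 20 test.flac
ffmpeg -i source.wav -t 20 -ac 2 test-stereo.mp3
ffmpeg -i source.wav -t 20 -ac 1 test-mono.mp3

ffmpeg picks the FLAC and MP3 codecs from the output extensions; -t 20 trims to 20 seconds and -ac sets the channel count.)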

The FLAC version transcribed perfectly. However, the MP3 executions ran for a while and produced a small JSON file with just a word or two in it, and even that with a confidence value of < 0.4 (the FLAC confidence values are all > 0.9, as they should be).
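(A quick way to compare the outputs is to pull the transcript and confidence out of each JSON file, e.g. with jq; depending on whether the file holds the bare response or the whole operation object, the path is .results or .response.results:

jq '.results[].alternatives[0] | {transcript, confidence}' transcript.json
)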

Then I went to the cloud services web page that offers a simple demo (at Speech-to-Text: Automatic Speech Recognition | Google Cloud, under "Put Speech-to-Text into action"). This demo accepted all three versions of my file and produced a correct transcription.

In addition to that, since I'm a registered Cloud user, there's a more advanced GUI job submitter at https://console.cloud.google.com/speech/overview?project=vernal-design-355021

This tool also processes all three files just fine.

So if I could only talk to the programmers who wrote the code for these demo tools, maybe I could learn why my gcloud invocations, which should do the same thing, fail for MP3 files.
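One way to take gcloud out of the loop entirely is to call the REST API directly; as far as I know, the beta track corresponds to the v1p1beta1 endpoint. A sketch, with the request body mirroring the flags in my gcloud command below:

curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize \
  -d '{
    "config": {
      "encoding": "MP3",
      "languageCode": "en-US",
      "audioChannelCount": 1,
      "model": "latest_long"
    },
    "audio": {"uri": "gs://debug_cli/files/221209_1758-mono.mp3"}
  }'

This returns an operation name, which can then be polled at https://speech.googleapis.com/v1p1beta1/operations/OPERATION_NAME.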

Here's the command I used in the gcloud shells (SDK and even Cygwin):

gcloud beta ml speech recognize-long-running gs://debug_cli/files/221209_1758-mono.mp3 \
    --language-code=en_US --async --encoding=mp3 \
    --channel-count=1 --no-separate-channel-recognition --model=latest_long
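(Since the jobs are submitted with --async, each call just prints an operation ID; the result can then be fetched with the operations subcommands, where OPERATION_ID stands for the ID printed by the submit call:

gcloud beta ml speech operations describe OPERATION_ID
gcloud beta ml speech operations wait OPERATION_ID   # blocks until the job completes
)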

The demo files I used are in a bucket called "debug_cli", but I haven't yet figured out an easy way to make that public.
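(If uniform bucket-level access is enabled on the bucket, one command that should make its contents publicly readable is:

gsutil iam ch allUsers:objectViewer gs://debug_cli

objectViewer grants read-only access to the objects, not write access.)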