Re: Quality of Amharic STT models between Chrome Web Speech API and Google Cloud Speech (v1/v2)

nefasto · 01-22-2025 02:07 AM

Hello there!

I'm using the Speech-to-Text API to transcribe Amharic in real time. I'm currently using the V1 version of the API.

The transcription quality is quite poor, with a very high Word Error Rate (above 60%).

For comparison, I have used the Web Speech API in Chrome: the quality is definitely better, and the real-time factor is also very good!

To give you a real example of the results, here are the translated transcriptions of a short speech:

### Google Speech API V1P1:

**When a woman speaks, the traveler says, "How are you?" "6 women, 6 movies, 6 feet."**

### Web Speech API on Chrome: https://www.google.com/intl/it/chrome/demos/speech.html

**As we saw yesterday, an object has 6 faces when placed in any position. When we say six faces, what do we mean by a face? What are those faces called? What is the first face? The opposite is called the backface.**

### **Reference translated transcription:**

**As we saw yesterday, an object has 6 faces when placed in any position. When we say 6 faces, what do we mean by a face? What are those faces called? What is the first one? We call it the front face. If there is a front face, we call it the opposite of it, which is called the back face.**

So my question is: which model does the Chrome Web Speech API use? Is it possible to use that model with the Google Cloud Speech API?

MarvinLlamas

Hi @nefasto,

Welcome to Google Cloud Community!

It looks like you are dealing with poor transcription quality of Amharic speech when using the Google Cloud Speech-to-Text API (v1).

Here are some approaches that might help with your use case:

- **Audio preprocessing:** If your recording has background noise, consider applying noise reduction before sending the audio. Tools like `ffmpeg` provide denoising filters, and libraries such as `libsndfile` can handle reading and writing the audio files. Apply these techniques carefully to avoid distorting the speech.
- **Evaluation metrics:** Track Word Error Rate (WER) against reference transcripts so that you have an objective, quantifiable way to measure, monitor, and compare the performance of your STT setup.
- **Speech adaptation:** The speech adaptation feature lets you give the API hints, such as expected words and phrases, that can improve transcription accuracy.
- **Experiment with different model parameters:** Try model options such as `default`, `phone_call`, `command_and_search`, `video`, or `medical_dictation`. Not every model is available for every language, so check which ones support Amharic; experimenting with the available models could help you find the best fit for your audio.
- **Error handling:** Set up handling for API failures and network issues, including retry mechanisms, logging, and clear error messages.
- **Upgrade to v2 of the Speech-to-Text API:** I recommend switching to the v2 API, as it often incorporates newer models and improvements that might improve your accuracy.
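
As a rough illustration of the evaluation and model-experiment suggestions above, here is a minimal Python sketch. `word_error_rate` computes WER as a word-level edit distance divided by the number of reference words, and `build_recognition_configs` builds one v1 REST `RecognitionConfig` body per model so the results can be compared side by side. The function names, the `LINEAR16` encoding, and the 16 kHz sample rate are assumptions made for this example, and whether a given model or speech adaptation actually supports `am-ET` is not confirmed here and should be checked against the supported-languages documentation.

```python
from typing import Dict, List, Optional


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def build_recognition_configs(
    models: List[str],
    phrases: Optional[List[str]] = None,
    language_code: str = "am-ET",
    sample_rate_hertz: int = 16000,  # assumed; match your actual audio
) -> Dict[str, dict]:
    """Build one v1 REST RecognitionConfig body per model name.

    Hypothetical helper for side-by-side experiments; it only builds
    dictionaries and does not call the API.
    """
    configs: Dict[str, dict] = {}
    for model in models:
        config = {
            "encoding": "LINEAR16",  # assumed; match your actual audio
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "model": model,
        }
        if phrases:
            # Speech adaptation hints: words and phrases expected in the audio.
            config["speechContexts"] = [{"phrases": list(phrases)}]
        configs[model] = config
    return configs


if __name__ == "__main__":
    print(word_error_rate("what is the first face", "what is the face"))
    for name, cfg in build_recognition_configs(["default", "command_and_search"]).items():
        print(name, cfg)
```

For a meaningful comparison, WER should be computed on the Amharic transcripts themselves rather than on their English translations, and the same reference text should be used for every model or API being compared.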

Answers to your primary questions:

- **Which model is used for the Chrome Web Speech API?** As the model for the Chrome Web Speech API is not publicly available, I suggest reaching out to this link.
- **Is it possible to use it with the Google Cloud Speech API?** The model employed by the Chrome Web Speech API isn't directly accessible or adjustable within the Google Cloud Speech API. It's not possible to "import" it or specifically select it.

You may refer to the documentation below, which offers information on Google Cloud’s Speech-to-Text APIs and references:

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Quality of Amharic STT models between Chrome Web Speech API and Google Cloud Speech (v1/v2)