
Google Cloud Text-to-Speech API

Hello,

 
I hope you are keeping well. We are using the Google Cloud Text-to-Speech API for our use case; however, it often fails on different accents or hard keywords. For example, our use case is ordering food by voice. One popular item is named 'The CEO Burger', and when a user says 'order The CEO Burger', recognition consistently fails. Likewise, a lot of the time when a user speaks with a different accent, e.g. 'order Huevos Rancheros', the system fails to understand it. Is there a way to resolve these kinds of issues? We tried keyword boosting as well, but it does not seem to help much.
 
Looking forward to hearing from you. Thanks.
 
Regards,
Harsh Sanghvi

Handling variations in accents and recognizing specific keywords or phrases is a challenging aspect of text-to-speech (TTS) and automatic speech recognition (ASR) systems. While the Google Cloud Text-to-Speech API is a powerful tool, it may not perform perfectly in every situation. Here are some strategies you can consider to improve the accuracy of your voice recognition system:

Provide a phonetic transcription of hard-to-understand words or phrases. For example, you can specify how "Huevos Rancheros" is pronounced. The Google Cloud Text-to-Speech API supports SSML (Speech Synthesis Markup Language), including the phoneme tag for phonetic hints.

<speak>
  <p>The user wants to order <phoneme alphabet="ipa" ph="ˈweɪvoʊz rænˈtʃɛroʊz">Huevos Rancheros</phoneme>.</p>
</speak>
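If you generate SSML programmatically, you can wrap known-tricky phrases in phoneme tags before synthesis. A minimal stdlib-only sketch, assuming a hand-maintained pronunciation map (the helper name `to_ssml` and the IPA values are illustrative, not part of any Google API):

```python
# Build SSML with <phoneme> hints for hard-to-pronounce menu items.
# The pronunciation map is illustrative; fill in IPA for your own menu.
PRONUNCIATIONS = {
    "Huevos Rancheros": "ˈweɪvoʊz rænˈtʃɛroʊz",
}

def to_ssml(sentence: str) -> str:
    """Wrap known phrases in <phoneme> tags and return an SSML document."""
    for phrase, ipa in PRONUNCIATIONS.items():
        tag = f'<phoneme alphabet="ipa" ph="{ipa}">{phrase}</phoneme>'
        sentence = sentence.replace(phrase, tag)
    return f"<speak><p>{sentence}</p></speak>"

ssml = to_ssml("The user wants to order Huevos Rancheros.")
```

The resulting string can then be passed as the SSML input of a synthesis request in the client library of your choice.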


Adapt the recognizer to your specific vocabulary. Google Cloud Speech-to-Text supports speech adaptation (phrase hints with optional boost values), which can improve recognition accuracy for domain-specific terms and phrases like "The CEO Burger."

Continuously test and fine-tune your system based on real-world usage data. Collect and analyze user interactions to identify common misinterpretations and improve your recognition system over time.
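One lightweight way to act on usage data is to log each (expected, recognized) transcript pair and surface the most frequent misrecognitions, which then become candidates for phrase hints or phonetic tags. A stdlib-only sketch (the function name and the sample log are my own):

```python
from collections import Counter

def top_misrecognitions(interactions, n=3):
    """interactions: iterable of (expected, recognized) transcript pairs.
    Returns the n most frequent pairs where recognition went wrong."""
    errors = Counter(
        (expected, recognized)
        for expected, recognized in interactions
        if expected.lower() != recognized.lower()
    )
    return errors.most_common(n)

# Illustrative interaction log collected from real usage.
log = [
    ("order the ceo burger", "order the seo burger"),
    ("order the ceo burger", "order the seo burger"),
    ("order huevos rancheros", "order waves rancheros"),
    ("order fries", "order fries"),
]
```

Running `top_misrecognitions(log)` would rank the CEO Burger confusion first, pointing you at the phrase most worth boosting.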

Remember that perfect speech recognition is challenging, and even the most advanced systems can struggle with accents and uncommon phrases. It's important to provide users with alternative means of interaction and continually refine your system to improve its accuracy.

Hmm, so it seems like Google Speech-to-Text offers a few options for accented speech, but none for Spanish-accented English.

Our use case is to identify mispronunciations in speech. What would you recommend? Here are a few of our ideas:

1. Transcribe the speech with both the en-US model and an en-MX (Spanish-accented) model. If there's a mismatch, we assume the en-MX transcription is what the speaker was attempting to say, and we can match it against the en-US transcription to highlight the mistake.
2. Transcribe the phonemes: train a model to transcribe audio into IPA or an equivalent phoneme representation.
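Idea 1 above reduces to aligning two transcripts and flagging where they disagree. A stdlib-only sketch using difflib (the function name is my own; the transcripts are illustrative stand-ins for the two models' outputs):

```python
import difflib

def highlight_mismatches(us_transcript: str, accented_transcript: str):
    """Return word spans where the two transcriptions disagree.
    A disagreement suggests the speaker's pronunciation drifted toward
    what the accented model expects."""
    us_words = us_transcript.lower().split()
    mx_words = accented_transcript.lower().split()
    matcher = difflib.SequenceMatcher(None, us_words, mx_words)
    mismatches = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            mismatches.append((us_words[i1:i2], mx_words[j1:j2]))
    return mismatches
```

For example, comparing "order waves rancheros" (en-US output) against "order huevos rancheros" (accented-model output) isolates the single disagreeing word pair to show the user.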

Are there any approaches we haven't considered? Which of these should we go with, if not something else?

Thanks in advance.