Neural2 voices sometimes appear to sound like it's...

pgmichael · 02-23-2023 09:37 AM

For some reason, Neural2 voices sometimes sound drunk, or as if they are having a stroke. For instance, synthesize the following text with en-US-Neural2-J using the demo prompt on this page:

First came a stout puffy gentleman with a carpet bag; he wanted to go to the Bishopsgate station; then we were called by a party who wished to be taken to the Regent's Park; and next we were wanted in a side street where a timid, anxious old lady was waiting to be taken to the bank; there we had to stop to take her back again, and just as we had set her down a red-faced gentleman, with a handful of papers, came running up out of breath, and before Jerry could get down he had opened the door, popped himself in, and called out, “Bow Street Police Station, quick!” so off we went with him, and when after another turn or two we came back, there was no other cab on the stand.
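To reproduce this outside the demo page, the request could be built along these lines. This is only a sketch: it assumes the JSON shape of the public REST `text:synthesize` method (`input`, `voice`, `audioConfig`), and `buildSynthesizeRequest` is a hypothetical helper name, not part of any Google library.

```typescript
// Shape of the JSON body assumed for a text:synthesize request.
interface SynthesizeRequest {
  input: { text: string };
  voice: { languageCode: string; name: string };
  audioConfig: { audioEncoding: string };
}

// Hypothetical helper: builds the request body for a given text and voice.
function buildSynthesizeRequest(text: string, voiceName: string): SynthesizeRequest {
  // Voice names look like "en-US-Neural2-J"; the first two segments
  // are taken as the language code.
  const languageCode = voiceName.split("-").slice(0, 2).join("-");
  return {
    input: { text },
    voice: { languageCode, name: voiceName },
    audioConfig: { audioEncoding: "MP3" },
  };
}
```

Passing the paragraph above as `text` and `"en-US-Neural2-J"` as `voiceName` gives a body that can be POSTed to the API to compare the audio with what the demo page produces.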

Poala_Tenorio

My conclusion is that Text-to-Speech was having trouble with too many commas and semicolons. The pronunciation started to go wrong at "And next we were wanted in a side street where a timid", so I replaced the semicolon just before that part with a period, and it worked! The whole input text was pronounced clearly.

First came a stout puffy gentleman with a carpet bag; he wanted to go to the Bishopsgate station; then we were called by a party who wished to be taken to the Regent's Park. And next we were wanted in a side street where a timid, anxious old lady was waiting to be taken to the bank; there we had to stop to take her back again, and just as we had set her down a red-faced gentleman, with a handful of papers, came running up out of breath, and before Jerry could get down he had opened the door, popped himself in, and called out, “Bow Street Police Station, quick!” so off we went with him, and when after another turn or two we came back, there was no other cab on the stand.

pgmichael

@Poala_Tenorio Unfortunately, this workaround isn't really viable for me. I created a text-to-speech extension for Chrome that relies on the GCloud API, so users don't have control over the selected text most of the time (the text may be a selection from a webpage, and they would need to open the dev tools and edit the HTML every time this issue arises).

This seems to happen quite frequently, as I'm getting many reports of it. Is there any chance this bug could be flagged to the team working on the Text-to-Speech API?
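Until the bug is fixed, the semicolon-to-period substitution suggested earlier in the thread could in principle be applied automatically before the text is sent to the API, so users never have to edit anything. The sketch below is an assumption-laden illustration, not a confirmed fix: `breakLongSentences` and the 150-character threshold are hypothetical, and the function only counts characters since the last break it inserted, ignoring periods already in the text.

```typescript
// Hypothetical preprocessing step: once a run of semicolon-joined clauses
// has reached maxChars, turn the next "; " into ". " and capitalize the
// clause that follows, mirroring the manual edit described in the thread.
function breakLongSentences(text: string, maxChars: number = 150): string {
  // Split on semicolons followed by whitespace; other semicolons are left alone.
  const [first = "", ...rest] = text.split(/;\s+/);
  let result = first;
  let runLength = first.length; // characters since the last inserted break

  for (const clause of rest) {
    if (runLength >= maxChars) {
      // The run is already long: end the sentence here and start a new one.
      result += ". " + clause.charAt(0).toUpperCase() + clause.slice(1);
      runLength = clause.length;
    } else {
      // Still short enough: keep the original semicolon.
      result += "; " + clause;
      runLength += 2 + clause.length;
    }
  }
  return result;
}

// Short demonstration with a tiny threshold.
console.log(breakLongSentences("aaaa; bbbb; cccc", 5)); // prints "aaaa; bbbb. Cccc"
```

Whether a given threshold actually avoids the slurred output would have to be checked by ear against the voices in question; the rewrite also changes the text that is read aloud, which may matter for some users.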