I am using the google.cloud.speech_v2 client library for Python to get the transcription of a short (< 1 min) audio file in Spanish. It works fine with model = "long" and language code = "es-US". The same audio with "chirp" gives only the first part of the transcription. I have tried different audios and models; all of them work except for chirp and chirp 2.
One strange thing about the chirp results: I first get the truncated transcription and then "Transcript:1000", and I don't know what that means. The following is my code:
import os

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# PROJECT_ID is set from the environment in my script
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT")


def transcribe_chirp(
    audio_file: str,
) -> cloud_speech.RecognizeResponse:
    """Transcribe a short local audio file with the chirp model."""
    # Instantiates a client against the regional endpoint that serves chirp
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-central1-speech.googleapis.com",
        )
    )

    # Reads the audio file as bytes
    with open(audio_file, "rb") as f:
        audio_content = f.read()

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["es-US"],
        model="chirp",
        features=cloud_speech.RecognitionFeatures(
            # Enable automatic punctuation
            enable_automatic_punctuation=True,
        ),
    )

    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
        config=config,
        content=audio_content,
    )

    # Transcribes the audio into text (synchronous request)
    response = client.recognize(request=request)

    for result in response.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response
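I call it like this, with PROJECT_ID coming from the environment as above (the file name here is just a placeholder):

response = transcribe_chirp("short_spanish_audio.wav")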
Hi @jbraun,
Welcome to Google Cloud Community!
It looks like you're seeing "Transcript:1000" because the "chirp" model provides partial results during streaming. The "1000" is likely a metadata signal, not part of the transcript. The 'client.recognize' method is intended for non-streaming audio, so it doesn't work with "chirp".
To address your question, here are a couple of pointers that might help with your use case:
You may refer to the following documentation, which provides an overview of how the chirp and chirp 2 models are optimized for short-form, low-latency streaming recognition rather than long, full-file transcription like the "long" model:
Given that, you could try the 'streaming_recognize' method instead of 'recognize'.
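Here is a rough, untested sketch of what a streaming call could look like. Please treat it only as a starting point: the "chirp_2" model name, the us-central1 endpoint, and the chunking are assumptions on my side, so check the Speech-to-Text documentation for what is supported in your region:

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def transcribe_streaming(audio_file: str, project_id: str) -> list[str]:
    """Sketch of a streaming request; model and chunking are assumptions."""
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-central1-speech.googleapis.com",
        )
    )

    with open(audio_file, "rb") as f:
        audio_content = f.read()

    # Split the audio into small chunks; the streaming API limits
    # how much audio a single streaming message may carry.
    chunk_size = 25600
    chunks = [
        audio_content[i : i + chunk_size]
        for i in range(0, len(audio_content), chunk_size)
    ]

    streaming_config = cloud_speech.StreamingRecognitionConfig(
        config=cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            language_codes=["es-US"],
            model="chirp_2",  # assumption: a model with streaming support
        )
    )

    def requests():
        # The first message carries the recognizer and config,
        # the following messages carry the audio chunks.
        yield cloud_speech.StreamingRecognizeRequest(
            recognizer=f"projects/{project_id}/locations/us-central1/recognizers/_",
            streaming_config=streaming_config,
        )
        for chunk in chunks:
            yield cloud_speech.StreamingRecognizeRequest(audio=chunk)

    transcripts = []
    for response in client.streaming_recognize(requests=requests()):
        for result in response.results:
            if result.alternatives:
                transcripts.append(result.alternatives[0].transcript)
    return transcripts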
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
Thanks for your prompt response.
I am a bit confused, though. You recommend using the 'streaming_recognize' method, but the documentation explicitly states that streaming is not supported for chirp and that only the 'recognize' and 'batch_recognize' methods are available. For chirp 2, 'streaming_recognize' is supported in addition to 'recognize' and 'batch_recognize'.
Moreover, all of the sample code for chirp uses the 'recognize' method, and my code is essentially a copy of it.
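For reference, this is roughly how I read the documented 'batch_recognize' path for chirp (an untested sketch on my side; the gs:// URI and the timeout are placeholders):

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def transcribe_batch_chirp(
    gcs_uri: str, project_id: str
) -> cloud_speech.BatchRecognizeResponse:
    """Sketch of BatchRecognize with inline results; URI and timeout are placeholders."""
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-central1-speech.googleapis.com",
        )
    )

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["es-US"],
        model="chirp",
    )

    request = cloud_speech.BatchRecognizeRequest(
        recognizer=f"projects/{project_id}/locations/us-central1/recognizers/_",
        config=config,
        files=[cloud_speech.BatchRecognizeFileMetadata(uri=gcs_uri)],
        # Ask for the transcript to be returned inline instead of written to GCS
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
        ),
    )

    # batch_recognize returns a long-running operation
    operation = client.batch_recognize(request=request)
    response = operation.result(timeout=300)

    # With inline output, results are keyed by the input URI
    for result in response.results[gcs_uri].transcript.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response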
Thanks.