
Speech-to-Text chirp

I am using the google.cloud.speech_v2 client library for Python to transcribe a short (< 1 min) audio file in Spanish. It works fine with model = "long" and language code = "es-US". The same audio with "chirp" returns only the first part of the transcription. I have tried different audio files and models; all of them work except chirp and chirp 2.

One strange thing is that in the chirp results I get first the truncated transcription and then "Transcript:1000", which I don't know how to interpret. The following is my code:

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def transcribe_chirp(
    audio_file: str,
) -> cloud_speech.RecognizeResponse:
    # Instantiates a client against the regional endpoint where chirp is available
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-central1-speech.googleapis.com",
        )
    )

    # Reads the audio file as bytes
    with open(audio_file, "rb") as f:
        audio_content = f.read()

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["es-US"],
        model="chirp",
        features=cloud_speech.RecognitionFeatures(
            # Enable automatic punctuation
            enable_automatic_punctuation=True,
        ),
    )

    # PROJECT_ID is my Google Cloud project ID, defined elsewhere in the script
    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
        config=config,
        content=audio_content,
    )

    # Transcribes the audio into text
    response = client.recognize(request=request)

    for result in response.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response


Hi @jbraun,

Welcome to Google Cloud Community!

It looks like you're seeing "Transcript:1000" because the chirp model produces partial results while streaming; the "1000" is likely a metadata signal rather than part of the transcript. The client.recognize method is intended for non-streaming audio, so it doesn't work well with chirp.

To address your question, here are potential ways that might help with your use case:

  • Utilize streaming recognition: You may want to use the streaming_recognize method instead of recognize in your code to support models like chirp and chirp 2, which generate partial results as the audio is processed.
  • Request format: You may want to form a series of StreamingRecognizeRequest objects, where the first carries the recognizer and configuration and the following ones carry the audio data, enabling you to perform streaming transcription (see the sketch after this list).
  • Audio chunking: Make sure that you split large audio files into smaller chunks instead of loading and sending the entire file in a single request.
  • Processing results: You might want to iterate over the generator of StreamingRecognizeResponse objects returned by the method and assemble the final transcript from the partial results.
  • Partial-results handling: Make sure that you handle the interim (partial) results provided by the streaming API according to your specific requirements.
  • Error handling: You may want to incorporate error handling to manage any issues that could arise during the API request or while processing the audio.
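Here is a rough sketch of what that streaming flow could look like with the v2 client. Treat it as an outline rather than working production code: the regional endpoint, the chunk size, the chirp_2 model name, and the PROJECT_ID environment variable are assumptions you would adapt to your own project.

import os

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# Assumption: the project ID is available in this environment variable
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]


def transcribe_streaming_chirp2(audio_file: str) -> list:
    # Client pointed at the regional endpoint (assumed here to be us-central1)
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-central1-speech.googleapis.com",
        )
    )

    with open(audio_file, "rb") as f:
        audio_content = f.read()

    # Split the file into chunks; 25 KB per request is an arbitrary example size,
    # chosen only to stay well under the per-request limit
    chunk_size = 25600
    audio_requests = (
        cloud_speech.StreamingRecognizeRequest(audio=audio_content[i : i + chunk_size])
        for i in range(0, len(audio_content), chunk_size)
    )

    recognition_config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["es-US"],
        model="chirp_2",
    )
    streaming_config = cloud_speech.StreamingRecognitionConfig(config=recognition_config)

    # The first request carries the recognizer and config; the rest carry audio
    config_request = cloud_speech.StreamingRecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
        streaming_config=streaming_config,
    )

    def requests():
        yield config_request
        yield from audio_requests

    # Iterate over the streaming responses and collect the partial transcripts
    responses = []
    for response in client.streaming_recognize(requests=requests()):
        responses.append(response)
        for result in response.results:
            print(f"Transcript: {result.alternatives[0].transcript}")

    return responses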

You may also refer to the Speech-to-Text documentation on the chirp and chirp 2 models, which gives an overview of how they are optimized for short-form, low-latency streaming recognition rather than long, full-file transcriptions like the long model handles.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.


Thanks for your prompt response.

I am a bit confused, though. You recommend using the 'streaming_recognize' method, but the documentation explicitly states that streaming is not supported for chirp and that only the 'recognize' and 'batch_recognize' methods are available. For chirp 2, 'streaming_recognize' is supported in addition to 'recognize' and 'batch_recognize'.

Moreover, all of the sample code for chirp uses the 'recognize' method, and my code is essentially a copy of it.
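For reference, my understanding of the documented batch_recognize path for chirp looks roughly like the sketch below; the Cloud Storage URI, regional endpoint, timeout, and PROJECT_ID environment variable are placeholders I've filled in for illustration, not values taken from the docs.

import os

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# Assumption: the project ID is available in this environment variable
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]


def transcribe_batch_chirp(audio_uri: str) -> cloud_speech.BatchRecognizeResults:
    # audio_uri is a Cloud Storage URI, e.g. "gs://my-bucket/audio.wav" (placeholder)
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-central1-speech.googleapis.com",
        )
    )

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["es-US"],
        model="chirp",
    )

    request = cloud_speech.BatchRecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
        config=config,
        files=[cloud_speech.BatchRecognizeFileMetadata(uri=audio_uri)],
        # Return the transcript inline instead of writing it to Cloud Storage
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
        ),
    )

    # batch_recognize returns a long-running operation; wait for it to complete
    operation = client.batch_recognize(request=request)
    response = operation.result(timeout=300)

    for result in response.results[audio_uri].transcript.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response.results[audio_uri].transcript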

Thanks.