Recognition Returning Transcript Out of Order - Page 2

VitorBoldrin · 10-19-2023 11:57 AM

Hello,

Can someone help me with this problem? I'm really struggling with it.

I am using BatchRecognize from speech_v2 in Python to transcribe two-channel phone calls. The code works very well: it is precise, fast, and simple. The only problem is that the results come back out of order. The first part of the results contains everything from channel 1 and the second part contains everything from channel 2, instead of the two channels being interleaved in the order they were spoken.

So I expect something like:

Transcript: Hello
Channel tag: 1

Transcript: Hi who is this ?
Channel tag: 2

Transcript: Its me
Channel tag: 1

Transcript: me who ?
Channel tag: 2

but I get:

Transcript: Hello
Channel tag: 1

Transcript: Its me
Channel tag: 1

Transcript: Hi who is this ?
Channel tag: 2

Transcript: me who ?
Channel tag: 2

Is there anything I can do to fix that? Here is my code:

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

def transcribe_batch_gcs_input_inline_output_v2(
    project_id: str,
    gcs_uri: str,
) -> cloud_speech.BatchRecognizeResults:
    # Instantiates a client
    client = SpeechClient.from_service_account_file('key.json')

    # CONFIG
    features = cloud_speech.RecognitionFeatures(
        multi_channel_mode=cloud_speech.RecognitionFeatures.MultiChannelMode.SEPARATE_RECOGNITION_PER_CHANNEL
    )

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        model="telephony",
        language_codes=["pt-BR"],
        features=features,
    )

    file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=gcs_uri)

    request = cloud_speech.BatchRecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        files=[file_metadata],
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
        ),
    )

    # Transcribes the audio into text
    operation = client.batch_recognize(request=request)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=2000)

    print(response)

    for result in response.results[gcs_uri].transcript.results:
        print(f"Transcript: {result.alternatives[0].transcript}")
        print(f"Channel tag: {result.channel_tag}")

    return response.results[gcs_uri].transcript

# Both arguments are placeholders
transcribe_batch_gcs_input_inline_output_v2("my project id", "my file gcs uri")

My audio files are up to 10 minutes long, so unfortunately not using BatchRecognize is not an option. I have already looked through the documentation for something that would help, but I couldn't find anything.
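
To make the ordering I'm after concrete, here is a minimal sketch of the interleaving, assuming each result carries an end-time offset that can be sorted on. I believe the v2 results expose a result_end_offset field that the Python client returns as a timedelta, but I have not confirmed that it behaves this way with per-channel recognition. The Result class and the timestamps below are hypothetical stand-ins so the snippet runs without the API.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, List


@dataclass
class Result:
    """Hypothetical stand-in for one per-channel recognition result."""
    transcript: str
    channel_tag: int
    # Assumption: end time of this result, measured from the start of the audio.
    result_end_offset: timedelta


def interleave_by_time(results: Iterable[Result]) -> List[Result]:
    """Return results ordered by end time, with channel tag as a tie-breaker."""
    return sorted(results, key=lambda r: (r.result_end_offset, r.channel_tag))


# Made-up data in the order I currently receive it: channel 1 first, then channel 2.
results = [
    Result("Hello", 1, timedelta(seconds=1.0)),
    Result("Its me", 1, timedelta(seconds=4.5)),
    Result("Hi who is this ?", 2, timedelta(seconds=3.0)),
    Result("me who ?", 2, timedelta(seconds=6.0)),
]

for r in interleave_by_time(results):
    print(f"Transcript: {r.transcript}")
    print(f"Channel tag: {r.channel_tag}")
```

With the real response, the same sort would presumably be applied to response.results[gcs_uri].transcript.results. Sorting on end time is only an approximation when speakers overlap; word-level start offsets would be more precise if they are available.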