Recognition Returning Transcript Out of Order

VitorBoldrin · 10-19-2023 11:57 AM

Hello,

Can someone help me with this problem? I'm really struggling with it.

I am using BatchRecognize from speech_v2 in Python to transcribe two-channel phone calls. The code works very well: it is precise, fast, and simple. The only problem is that the results come back out of order. The first part of the results contains everything from the first channel and the second part contains everything from the second channel, instead of the two channels being interleaved in speaking order.

So I expect something like:

Transcript: Hello
Channel tag: 1

Transcript: Hi who is this ?
Channel tag: 2

Transcript: Its me
Channel tag: 1

Transcript: me who ?
Channel tag: 2

but I get:

Transcript: Hello
Channel tag: 1

Transcript: Its me
Channel tag: 1

Transcript: Hi who is this ?
Channel tag: 2

Transcript: me who ?
Channel tag: 2

Is there anything I can do to fix that? Here is my code:

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def transcribe_batch_gcs_input_inline_output_v2(
    project_id: str,
    gcs_uri: str,
) -> cloud_speech.BatchRecognizeResults:
    # Instantiates a client
    client = SpeechClient.from_service_account_file('key.json')

    # CONFIG: recognize each audio channel separately
    features = cloud_speech.RecognitionFeatures(
        multi_channel_mode=cloud_speech.RecognitionFeatures.MultiChannelMode.SEPARATE_RECOGNITION_PER_CHANNEL
    )

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        model="telephony",
        language_codes=["pt-BR"],
        features=features,
    )

    file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=gcs_uri)

    request = cloud_speech.BatchRecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        files=[file_metadata],
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            inline_response_config=cloud_speech.InlineOutputConfig(),
        ),
    )

    # Transcribes the audio into text
    operation = client.batch_recognize(request=request)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=2000)

    print(response)

    # Results are printed in the order the API returns them
    for result in response.results[gcs_uri].transcript.results:
        print(f"Transcript: {result.alternatives[0].transcript}")
        print(f"Channel tag: {result.channel_tag}")

    return response.results[gcs_uri].transcript


# Both arguments are placeholders
transcribe_batch_gcs_input_inline_output_v2("my project id", "my file gcs uri")

My audio files are up to 10 minutes long, so dropping BatchRecognize is, unfortunately, not an option. I have already searched the documentation for something that would help, but I couldn't find anything.

lsolatorio

Hi @VitorBoldrin,

Welcome, and thank you for reaching out to our community for help.

I understand that the speaker dialogue in your transcript comes back mixed up. I have encountered a somewhat similar case and suggested exploring speaker diarization to detect the different speakers. A Python code sample is also available to tag the speakers accordingly and get better transcriptions.

from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

speech_file = "resources/commercial_mono.wav"

with open(speech_file, "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)

diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=10,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    diarization_config=diarization_config,
)

print("Waiting for operation to complete...")
response = client.recognize(config=config, audio=audio)

# The transcript within each result is separate and sequential per result.
# However, the words list within an alternative includes all the words
# from all the results thus far. Thus, to get all the words with speaker
# tags, you only have to take the words list from the last result:
result = response.results[-1]

words_info = result.alternatives[0].words

# Printing out the output:
for word_info in words_info:
    print(f"word: '{word_info.word}', speaker_tag: {word_info.speaker_tag}")
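
The sample above prints one line per word. To get output closer to a dialogue, consecutive words that share a speaker tag can be merged into turns. The following is only a minimal sketch: `words_to_turns` is a hypothetical helper, and `sample_words` is made-up stand-in data that mimics the `word` and `speaker_tag` attributes used above, so it runs without calling the API.

```python
from itertools import groupby
from types import SimpleNamespace


def words_to_turns(words_info):
    """Merge consecutive words with the same speaker_tag into (tag, text) turns."""
    turns = []
    # groupby only groups *adjacent* items, which is what we want here:
    # a new turn starts every time the speaker tag changes.
    for tag, group in groupby(words_info, key=lambda w: w.speaker_tag):
        turns.append((tag, " ".join(w.word for w in group)))
    return turns


# Made-up stand-ins for the word objects in result.alternatives[0].words
sample_words = [
    SimpleNamespace(word="Hello", speaker_tag=1),
    SimpleNamespace(word="Hi", speaker_tag=2),
    SimpleNamespace(word="who", speaker_tag=2),
    SimpleNamespace(word="is", speaker_tag=2),
    SimpleNamespace(word="this", speaker_tag=2),
    SimpleNamespace(word="Its", speaker_tag=1),
    SimpleNamespace(word="me", speaker_tag=1),
]

for tag, text in words_to_turns(sample_words):
    print(f"Speaker {tag}: {text}")
```

With a real response, the call would be `words_to_turns(words_info)`.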

Hope this helps.
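
If the two speakers are already on separate channels and you would rather keep the per-channel recognition, another option may be to re-order the results by time on your side. The sketch below assumes that each result in the speech_v2 response exposes a `result_end_offset` that can be compared (in the Python client I would expect a `datetime.timedelta`); please verify that against your own `print(response)` output. `FakeResult`, `FakeAlternative`, `interleave_by_time`, and the timestamps are hypothetical and exist only to make the example runnable without the API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class FakeAlternative:
    """Hypothetical stand-in for a recognition alternative."""
    transcript: str


@dataclass
class FakeResult:
    """Hypothetical stand-in for one per-channel recognition result."""
    alternatives: list
    channel_tag: int
    result_end_offset: timedelta  # assumed field, see the note above


def interleave_by_time(results):
    """Return results ordered by the time each one ends, across channels."""
    # Drop results without alternatives so alternatives[0] is always safe.
    spoken = [r for r in results if r.alternatives]
    return sorted(spoken, key=lambda r: r.result_end_offset)


# Same order as in the question: all of channel 1, then all of channel 2.
# The offsets are invented for illustration.
per_channel = [
    FakeResult([FakeAlternative("Hello")], 1, timedelta(seconds=1)),
    FakeResult([FakeAlternative("Its me")], 1, timedelta(seconds=5)),
    FakeResult([FakeAlternative("Hi who is this ?")], 2, timedelta(seconds=3)),
    FakeResult([FakeAlternative("me who ?")], 2, timedelta(seconds=7)),
]

for result in interleave_by_time(per_channel):
    print(f"Transcript: {result.alternatives[0].transcript}")
    print(f"Channel tag: {result.channel_tag}")
```

In the original function this would replace the loop source with `interleave_by_time(response.results[gcs_uri].transcript.results)`. Sorting by end time is only an approximation when the two speakers talk over each other; if that matters, word-level time offsets may give a more precise ordering.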