
Call transcription with Gemini

We'd like to transcribe calls with speaker labels and timestamps. We have dual-channel stereo recordings, with one channel per speaker. How can we transcribe those with Gemini?

The naive solution of just asking for a transcription of the combined call results in the audio being converted to mono, and we then observe confusions in the speaker diarization. We also tried uploading the channels separately and asking for a combined transcript, but this doesn't work.

Our next idea is to use a separate tool to split the audio on silence, then upload a long interleaved sequence of ["speaker A", audio_segment, "speaker B", audio_segment, ...]. Any idea whether this could work?
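For what it's worth, building that interleaved sequence is straightforward if the API accepts a mixed list of text and inline-audio parts, as the Gemini Python SDK does. A minimal sketch (the prompt wording and MIME type are assumptions, and `turns` is whatever your silence splitter produces):

```python
def build_interleaved_prompt(turns):
    """Build a multimodal `contents` list alternating speaker labels and audio clips.

    turns: list of (speaker_label, audio_bytes) tuples in chronological order.
    Returns a list suitable as the contents of a generate_content call, where
    each audio clip is an inline part preceded by its speaker's label.
    """
    contents = [
        "Transcribe this call. Each audio clip below is preceded by its speaker label."
    ]
    for label, audio_bytes in turns:
        contents.append(f"{label}:")
        # Inline audio part; adjust mime_type to match your actual encoding.
        contents.append({"mime_type": "audio/wav", "data": audio_bytes})
    return contents
```

Each turn then carries an explicit label in the prompt itself, so the model doesn't have to infer speaker identity acoustically.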

Another idea we are considering is inserting small sound signals, e.g. clicks or beeps, to indicate when the turn switches. This feels a bit hacky and may also make it harder for the model to understand what is being said.


We also noticed there's a recent "audioTimestamp" flag, which should help. We tried it, but it's still wrong sometimes. How does this work under the hood?


Hi @tfriedel,

Welcome to Google Cloud Community!

Here are some insights on the ideas and questions you raised:

  • Separate Channels, Combined Transcript: Gemini likely treats individual audio files as independent units, making it difficult to combine them correctly for speaker diarization.
  • Silence-Based Splitting: This approach might work, but it could lead to issues with:
    • Speaker Diarization Errors: Silence might not always indicate a speaker change, especially if speakers overlap or there are long pauses.
    • Incorrect Timestamping: Splitting by silence may disrupt the natural flow of the conversation and result in inaccurate timestamps.
  • Click/Beep Markers: While this method could theoretically work, it's indeed hacky and might interfere with the transcription itself.

Additionally, the "audioTimestamp" flag aims to provide timestamps for words or phrases in the transcription. However, its accuracy depends on how well Gemini can detect speech boundaries.

To accurately generate timestamps for audio-only files, you must configure the audio_timestamp parameter in generation_config.
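As a rough sketch of what that configuration looks like with the Vertex AI Python SDK (the project, region, bucket path, and model name below are placeholders you would replace with your own):

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

# Placeholder project/region; use your own values.
vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")  # placeholder model name

# Placeholder Cloud Storage URI for the call recording.
audio = Part.from_uri("gs://your-bucket/call.wav", mime_type="audio/wav")

response = model.generate_content(
    [audio, "Transcribe this call with speaker labels and timestamps."],
    # audio_timestamp enables timestamp understanding for audio-only input.
    generation_config=GenerationConfig(audio_timestamp=True),
)
print(response.text)
```

Even with the flag set, timestamps are model predictions rather than measurements, which is consistent with the occasional inaccuracies you observed.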

Here are some approaches that you may consider:

  1. Explore Dedicated Speech Recognition APIs: Consider using dedicated speech recognition APIs like Google Cloud Speech-to-Text. These services often offer better speaker diarization and timestamping features.
  2. Experiment with Alternative Methods: Try using different methods for separating the audio, such as based on energy levels or spectral features.
  3. Fine-tune Gemini for Specific Audio: If you have a lot of similar audio data, you might be able to fine-tune Gemini to perform better on your specific type of calls.
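To illustrate point 2, a simple frame-energy segmenter is one way to split on low-energy regions without a dedicated VAD library. This is a minimal sketch on raw sample lists; the frame size, energy threshold, and minimum-silence length are illustrative values you would tune for your recordings:

```python
def split_on_energy(samples, sample_rate=16000, frame_ms=20,
                    threshold=1e-3, min_silence_frames=10):
    """Split a mono sample list into segments separated by runs of low-energy frames.

    A frame counts as silent when its mean squared amplitude falls below
    `threshold`; a run of `min_silence_frames` silent frames closes the
    current segment. Silent frames are dropped from the output.
    """
    frame_len = sample_rate * frame_ms // 1000
    segments, current, silent_run = [], [], 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        if energy < threshold:
            silent_run += 1
            if silent_run >= min_silence_frames and current:
                segments.append(current)
                current = []
        else:
            silent_run = 0
            current.extend(frame)
    if current:
        segments.append(current)
    return segments
```

Note this shares the weakness mentioned above: it cannot distinguish a pause from a speaker change, so it is best applied per channel of the stereo recording, where every segment's speaker is already known.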

Here is some useful documentation that you may check as well:

I hope the above information is helpful.