We'd like to transcribe calls with speaker labels and timestamps.
We have dual channel stereo recordings with one channel per speaker. How can we transcribe those with Gemini?
The naive approach of simply asking for a transcription of the combined call results in the audio being converted to mono, and we then see speaker-diarization confusion. We also tried uploading the channels separately and asking for a combined transcript, but that doesn't work either.
Our next idea is to use a separate tool to split each channel on silence, then upload a long interleaved sequence of ["speaker A", audio_segment, "speaker B", audio_segment, ...]. Could this work?
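The interleaved-prompt idea above can be sketched as follows. This assumes the channels have already been split on silence (e.g. with pydub's split_on_silence) and exported to WAV bytes; the speaker labels and the dict shape used for inline audio parts are illustrative conventions, not a documented Gemini contract.

```python
def interleave_segments(segments):
    """segments: (start_seconds, speaker_label, wav_bytes) tuples.
    Returns a flat prompt list alternating a text label with its audio part,
    ordered by start time so turns appear in conversation order."""
    parts = []
    for start, speaker, wav in sorted(segments, key=lambda s: s[0]):
        parts.append(f"{speaker} ({start:.1f}s):")
        parts.append({"mime_type": "audio/wav", "data": wav})
    return parts

# Hypothetical segments from the two channels:
segments = [
    (0.0, "Speaker A", b"\x00\x00"),
    (2.5, "Speaker B", b"\x00\x00"),
    (4.1, "Speaker A", b"\x00\x00"),
]
prompt = interleave_segments(segments) + ["Transcribe with speaker labels."]
```

Because the labels are already in the prompt, the model no longer has to diarize; the open question is whether it transcribes short isolated segments as accurately as continuous audio.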
Another idea we are considering is inserting short sound markers, e.g. clicks or beeps, at each turn switch. This feels a bit hacky and may also make it harder to understand what is being said.
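A rough sketch of that beep-marker idea: generate a short tone and splice it into the sample stream at each turn switch. The frequency, duration, and amplitude here are arbitrary choices for illustration, not recommendations.

```python
import array
import math

def make_beep(rate=16000, freq=1000.0, seconds=0.1, amplitude=8000):
    """Generate a short sine-tone marker as 16-bit samples."""
    n = int(rate * seconds)
    return array.array(
        "h",
        (int(amplitude * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)),
    )

def insert_markers(samples, switch_indices, beep):
    """Return a new sample array with `beep` spliced in at each switch index."""
    out = array.array("h")
    prev = 0
    for idx in sorted(switch_indices):
        out.extend(samples[prev:idx])
        out.extend(beep)
        prev = idx
    out.extend(samples[prev:])
    return out
```

One caveat: a marker overlapping speech energy could mask a word, so inserting it in the silence just before the turn switch is probably safer than at the exact boundary.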
We also noticed the recent "audioTimestamp" flag, which should help. We tried it, but the timestamps are still sometimes wrong. How does this work under the hood?
Hi @tfriedel,
Welcome to Google Cloud Community!
Here are some insights related to your ideas and questions:
Additionally, the "audioTimestamp" flag aims to provide timestamps for words or phrases in the transcription. However, its accuracy depends on how well Gemini can detect speech boundaries.
To accurately generate timestamps for audio-only files, you must configure the audio_timestamp parameter in generation_config.
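An untested sketch of that configuration against the Vertex AI Python SDK (vertexai.generative_models); the project, model name, and Cloud Storage path are placeholders, and audio_timestamp is the flag discussed above:

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

# Placeholders: substitute your own project, region, model, and audio URI.
vertexai.init(project="your-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

response = model.generate_content(
    [
        Part.from_uri("gs://your-bucket/call.wav", mime_type="audio/wav"),
        "Transcribe this call with speaker labels and timestamps.",
    ],
    # Enable timestamp generation for the audio input.
    generation_config=GenerationConfig(audio_timestamp=True),
)
print(response.text)
```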
Here are some approaches that you may consider:
Here is some useful documentation that you may check as well:
I hope the above information is helpful.