Call transcription with Gemini

tfriedel · 10-12-2024 03:03 AM

We'd like to transcribe calls with speaker labels and timestamps.
Our recordings are dual-channel stereo, with one channel per speaker. How can we transcribe them with Gemini?

The naive approach of just asking for a transcription of the combined call results in the audio being converted to mono, and we then see speaker diarization errors. We also tried uploading the two channels separately and asking for a combined transcript, but that doesn't work either.
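For reference, a minimal sketch of how the two channels can be separated before upload, using only the Python standard library. It assumes 16-bit PCM WAV input on a little-endian machine; `split_stereo` is a name made up for this example.

```python
import array
import io
import wave


def split_stereo(wav_bytes: bytes) -> tuple[bytes, bytes]:
    """Split a 16-bit stereo WAV into two mono WAVs (left, right)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        rate = src.getframerate()
        # Assumption: host byte order is little-endian, matching WAV data.
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))

    def to_wav(channel: array.array) -> bytes:
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(channel.tobytes())
        return buf.getvalue()

    # Stereo frames are interleaved: L0, R0, L1, R1, ...
    return to_wav(samples[0::2]), to_wav(samples[1::2])
```

Each returned value is a complete mono WAV file in memory, so timestamps within it still line up with the original call.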

Our next idea is to use a separate tool to split the audio on silence, then upload a long sequence of ["speaker A", audio_segment, "speaker B", audio_segment, ...]. Any idea if this could work?
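To make the idea concrete, here is a rough sketch of the segmentation step, assuming each channel is already decoded into a sequence of 16-bit mono samples. The function names, the energy threshold, and the frame and gap sizes are all placeholders that would need tuning on real calls; a proper voice activity detector would likely do better than this simple amplitude check.

```python
def voiced_segments(samples, rate, frame_ms=30, threshold=500, min_gap_ms=300):
    """Return (start_s, end_s) spans whose mean absolute amplitude
    reaches the threshold, bridging pauses up to min_gap_ms."""
    frame = max(1, rate * frame_ms // 1000)
    max_gap = min_gap_ms / 1000
    segments = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        if sum(abs(s) for s in chunk) / len(chunk) < threshold:
            continue  # treat this frame as silence
        start = i / rate
        end = min(i + frame, len(samples)) / rate
        if segments and start - segments[-1][1] <= max_gap:
            # Short pause: extend the previous segment instead of starting a new one.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


def interleave_turns(channel_a, channel_b, rate):
    """Label the segments of both channels and order them by start time."""
    turns = [("speaker A", s, e) for s, e in voiced_segments(channel_a, rate)]
    turns += [("speaker B", s, e) for s, e in voiced_segments(channel_b, rate)]
    return sorted(turns, key=lambda turn: turn[1])
```

Each `(label, start, end)` tuple can then be used to cut the matching audio slice and build the `[label, audio_segment, ...]` sequence. Because the start and end times come from the original recording, they could also be kept as timestamps rather than relying on the model for them.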

Another idea we are considering is inserting short sound signals, e.g. clicks or beeps, to indicate when the turn switches. This feels a bit hacky and may also make it more difficult to understand what is being said.
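A small sketch of what the beep variant could look like, assuming the call has already been cut into per-turn mono 16-bit segments. The helper names, tone frequency, duration, and amplitude are arbitrary choices for illustration. Placing the beep between segments rather than mixing it over the speech should at least avoid masking words, though whether the model treats it as a turn marker is untested.

```python
import math
from array import array


def make_beep(rate, freq_hz=1000.0, duration_ms=100, amplitude=8000):
    """Generate a sine tone as 16-bit mono samples."""
    n = rate * duration_ms // 1000
    return array(
        "h",
        (int(amplitude * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n)),
    )


def join_with_beeps(segments, rate):
    """Concatenate segments, placing one beep between consecutive turns."""
    beep = make_beep(rate)
    out = array("h")
    for index, segment in enumerate(segments):
        if index:  # no beep before the first turn
            out.extend(beep)
        out.extend(segment)
    return out
```

Note that the inserted beeps lengthen the audio, so any timestamps from the model would need to be shifted back by the total beep duration preceding them.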

We also noticed there's a recent "audioTimestamp" flag which should help. We tried it, but the output is still wrong sometimes. How does this work under the hood?