Is it possible to get a summary of an audio conversation with Gemini's multimodal live API in the same session? In theory you can configure a session to support both audio and text.
Hi @danon,
Welcome to Google Cloud Community!
It is indeed possible to get a summary of an audio conversation using Gemini's multimodal live API within the same session. By configuring the session to support both audio and text, you can handle the conversation via audio and subsequently request a text-based summary of the conversation once it ends.
Here is a potential approach that might help with your use case: configure both response modalities when you create the session, then send a plain-text summary request at the end of the call.
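A minimal sketch of that flow using the google-genai Python SDK (the model name and the prompt text are placeholders, not requirements):

import asyncio
import os
from google import genai

client = genai.Client(
    api_key=os.environ["GOOGLE_API_KEY"],
    http_options={"api_version": "v1alpha"},
)

# Request both modalities for the session.
config = {
    "generation_config": {
        "response_modalities": ["TEXT", "AUDIO"],
    }
}

async def main():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        # Audio turns would be streamed here during the call; at the end,
        # send a plain-text request for the summary in the same session.
        await session.send("Please summarize our conversation.", end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())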
In addition, Gemini’s multimodal live API is currently in an experimental phase and remains under active development. For more information, please refer to the documentation.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
I tried to follow your steps but can't make it work.
For context, my codebase acts as middleware: an external app streams voice conversations to it via websockets, and it forwards them to Gemini, so the Gemini multimodal live API can act as a real-time voice assistant.
Here are the relevant methods from my codebase:
async def connect(self, instructions: Optional[str] = None,
                  voice: Optional[str] = None,
                  temperature: Optional[float] = None,
                  model: Optional[str] = None,
                  max_output_tokens: Optional[int] = None,
                  agent_name: Optional[str] = None,
                  company_name: Optional[str] = None) -> None:
    """
    Establishes a connection to the Gemini API using API key authentication
    """
    try:
        # Store configuration
        self.admin_instructions = instructions
        self.voice = voice if voice and voice.strip() else "Puck"
        self.agent_name = agent_name
        self.company_name = company_name
        if temperature is not None:
            self.temperature = temperature
        if model is not None:
            self.model = model
        if max_output_tokens is not None:
            self.max_output_tokens = max_output_tokens
        # Create final instructions (system prompt)
        self.final_instructions = create_final_system_prompt(
            self.admin_instructions,
            self.language,
            self.customer_data,
            self.agent_name,
            self.company_name
        )
        # Initialize Gemini client with API key
        api_key = os.getenv('GOOGLE_API_KEY')
        if not api_key:
            raise ValueError("GOOGLE_API_KEY environment variable not set")
        self.logger.info("Initializing Gemini client with API key...")
        self.client = genai.Client(
            api_key=api_key,
            http_options={'api_version': 'v1alpha'}
        )
        # Prepare session configuration
        generation_config = {
            "responseModalities": ["TEXT", "AUDIO"],
            "candidateCount": 1,
            "maxOutputTokens": self.max_output_tokens,
            "temperature": self.temperature,
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {
                        "voiceName": self.voice
                    }
                }
            },
            "safetySettings": [
                {
                    "category": setting["category"],
                    "threshold": setting["threshold"]
                }
                for setting in GEMINI_SAFETY_SETTINGS
            ]
        }
        setup_message = {
            "model": self.model,
            "generation_config": generation_config,
            "system_instruction": {
                "parts": [{"text": self.final_instructions}]
            }
        }
        self.logger.debug(f"Sending setup message: {json.dumps(setup_message, indent=2)}")
        # Connect to Gemini Live API
        self.logger.info("Creating Gemini session...")
        self._session_manager = self.client.aio.live.connect(
            model=self.model,
            config=setup_message
        )
        # Initialize the session
        self.session = await self._session_manager.__aenter__()
        self.running = True
        self.read_task = asyncio.create_task(self._read_responses())
        self.logger.info("Successfully connected to Gemini API")
    except Exception as e:
        self.logger.error(f"Failed to connect to Gemini: {str(e)}")
        self.logger.debug("Full connection error details:", exc_info=True)
        raise RuntimeError(f"Failed to connect to Gemini: {str(e)}")

async def _read_responses(self) -> None:
    """Reads responses (text/audio/function calls) from Gemini"""
    try:
        while self.running:
            turn = self.session.receive()
            async for response in turn:
                if not self.running:
                    break
                # Only handle function calls if enabled
                if ENABLE_FUNCTIONS and hasattr(response, "tool_call") and response.tool_call:
                    await self._handle_tool_call(response.tool_call)
                # Handle audio data
                if response.data:
                    self.logger.debug(f"Received audio data: {len(response.data)} bytes")
                    if self.on_audio_ready_callback:
                        await self.on_audio_ready_callback(response.data)
                # Handle text responses
                if response.text:
                    self.logger.info(f"Received text response: {response.text}")
                    self.last_response = {
                        "text": response.text,
                        "usage": {}
                    }
                    if self.on_speech_started_callback:
                        await self.on_speech_started_callback()
                if hasattr(response, 'server_content') and hasattr(response.server_content, 'turn_complete'):
                    if response.server_content.turn_complete:
                        self.logger.debug("Turn complete signal received")
                        if self.on_audio_ready_callback:
                            await self.on_audio_ready_callback(None)
    except asyncio.CancelledError:
        self.logger.info("Response reading task cancelled")
        raise
    except Exception as e:
        self.logger.error(f"Error reading responses: {e}", exc_info=True)
        raise
    finally:
        self.running = False
This is the method for generating the final summary:
async def generate_session_summary(self) -> Optional[Dict[str, Any]]:
    """
    Sends a text-based summary request to Gemini in the SAME session.
    """
    from config import ENDING_PROMPT, DEBUG
    from google.genai import types
    try:
        if DEBUG == 'true':
            self.logger.debug("=== Starting Summary Generation ===")
        # Cancel existing read task first
        if self.read_task:
            if DEBUG == 'true':
                self.logger.debug("Cancelling existing read task...")
            self.read_task.cancel()
            try:
                await self.read_task
                if DEBUG == 'true':
                    self.logger.debug("Read task cancelled successfully")
            except asyncio.CancelledError:
                if DEBUG == 'true':
                    self.logger.debug("Read task cancellation handled")
            self.read_task = None
        await asyncio.sleep(0.1)
        if DEBUG == 'true':
            self.logger.debug("Making text request in existing session...")
        # Send summary prompt directly as text
        summary_prompt = ENDING_PROMPT
        if DEBUG == 'true':
            self.logger.debug(f"Sending summary request with text: {summary_prompt}")
        # Send the text directly
        await self.session.send(summary_prompt, end_of_turn=True)
        # Read the response
        full_response = ""
        try:
            async with asyncio.timeout(30):
                turn = self.session.receive()
                async for chunk in turn:
                    if DEBUG == 'true':
                        self.logger.debug(f"Received chunk: {chunk}")
                    if chunk.text:
                        if DEBUG == 'true':
                            self.logger.debug(f"Got text: {chunk.text}")
                        full_response += chunk.text
                    if (hasattr(chunk, 'server_content') and
                            hasattr(chunk.server_content, 'turn_complete') and
                            chunk.server_content.turn_complete):
                        if DEBUG == 'true':
                            self.logger.debug("Turn complete")
                        if full_response:
                            break
        except asyncio.TimeoutError:
            self.logger.error("Timeout waiting for summary")
            return None
        if DEBUG == 'true':
            self.logger.debug(f"Final response: {full_response}")
        if full_response:
            return {"summary": full_response}
        else:
            self.logger.warning("No summary text received")
            return None
    except Exception as e:
        self.logger.error(f"Error generating summary: {e}", exc_info=True)
        if DEBUG == 'true':
            self.logger.debug("Full error details:", exc_info=True)
        return None
    finally:
        if DEBUG == 'true':
            self.logger.debug("=== Summary Generation Complete ===")
During the voice conversation I correctly get the audio responses back from Gemini. When the call ends, the external app sends me a "close" message, and at that moment I want to generate the text summary by invoking generate_session_summary.
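Simplified, the teardown path looks like this (a sketch only; the handler and helper names here are illustrative, not my exact code):

async def handle_close(self, reason: str) -> None:
    # The external app signalled end of call: stop forwarding audio first.
    if self.audio_task:
        self.audio_task.cancel()
    # Then ask Gemini for the text summary in the same session.
    result = await self.gemini_client.generate_session_summary()
    summary = (result or {}).get("summary", "")
    # Finally report back to the external app and tear the session down.
    await self.send_disconnect(reason="completed",
                               output_variables={"CONVERSATION_SUMMARY": summary})
    await self.gemini_client.close()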
Excerpt from my server logs when the voice conversation ends and I try to get the summary:
2024-12-26 10:46:52.575 [INFO] ExternalAppGeminiBridge.ExternalAppServer-81fb-4294-903d-05ad43319bc5: Received 'close' from ExternalApp. Reason: end
2024-12-26 10:46:52.576 [INFO] ExternalAppGeminiBridge.ExternalAppServer-81fb-4294-903d-05ad43319bc5: Audio processing task cancelled
2024-12-26 10:46:52.576 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: === Starting Summary Generation ===
2024-12-26 10:46:52.577 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Cancelling existing read task...
2024-12-26 10:46:52.577 [INFO] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Response reading task cancelled
2024-12-26 10:46:52.577 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Read task cancellation handled
2024-12-26 10:46:52.682 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Making text request in existing session...
2024-12-26 10:46:52.682 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Sending summary request with text: Here is a summary request for the voice conversation we just had:
Please analyze this conversation and provide a structured summary including:
{
"main_topics": [],
"key_decisions": [],
"action_items": [],
"sentiment": ""
}
Please analyze our conversation and provide the summary in the requested format.
2024-12-26 10:46:52.802 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Received chunk: setup_complete=None server_content=LiveServerContent(model_turn=None, turn_complete=None, interrupted=None) tool_call=None tool_call_cancellation=None
2024-12-26 10:46:52.803 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Received chunk: setup_complete=None server_content=LiveServerContent(model_turn=None, turn_complete=True, interrupted=None) tool_call=None tool_call_cancellation=None
2024-12-26 10:46:52.803 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Turn complete
2024-12-26 10:46:52.804 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Final response:
2024-12-26 10:46:52.804 [WARNING] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: No summary text received
2024-12-26 10:46:52.804 [DEBUG] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: === Summary Generation Complete ===
2024-12-26 10:46:52.804 [INFO] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Closing Gemini connection...
2024-12-26 10:46:52.915 [INFO] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Gemini session reference cleared
2024-12-26 10:46:52.915 [INFO] ExternalAppGeminiBridge.GeminiClient_30e03fcf-6cb7-48a4-aebc-b63fda305922: Connection closed after 15.34s
2024-12-26 10:46:52.915 [DEBUG] ExternalAppGeminiBridge.ExternalAppServer-81fb-4294-903d-05ad43319bc5: Sending message to ExternalApp:
{
"version": "2",
"type": "disconnect",
"seq": 6,
"clientseq": 15,
"id": "30e03fcf-6cb7-48a4-aebc-b63fda305922",
"parameters": {
"reason": "completed",
"outputVariables": {
"CONVERSATION_SUMMARY": "",
"CONVERSATION_DURATION": "15.495136260986328"
}
}
}
2024-12-26 10:46:52.916 [INFO] ExternalAppGeminiBridge.ExternalAppServer-81fb-4294-903d-05ad43319bc5: Session stats - Duration: 15.50s, Frames sent: 35, Frames received: 75
2024-12-26 10:46:52.916 [INFO] ExternalAppGeminiBridge: [WS-0a011fd2] Connection handler finished
So as you can see, I can't get any conversation summary.
Thanks in advance!
I give up, at least for now. I've tried every possible combination. I don't think the text modality has access to the audio context. Although you can configure both text and audio modalities in one session, one doesn't have access to the context of the other, so you can't ask for a text summary of the audio session.
I am facing the exact same issue. As of now, Gemini 2.0 can only output audio or text in a response, not both.
I want to store the context in the form of text, but I realize I can't. The only way to do this might be using the Speech-to-Text API or transcription with Gemini 1.5.
This is not clearly mentioned in the documentation; I don't know why Google's documentation is so vague, scattered, and hard to digest.
You can have the model invoke a function with the conversation summary as a required parameter; that will work.
Any example code snippet?
Can you please elaborate more on that workaround? Thanks
@danon At the end of each conversation you can have the model invoke a function that receives the summary of that conversation.
Here is the GitHub repo for a live demo: https://github.com/google-gemini/multimodal-live-api-web-console
Just add one more function declaration for the summary and instruct the model in the prompt to call that function with the summary.
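For example, something along these lines in the Live API session setup (a sketch; the function name and schema are illustrative, not from the demo):

# Declare a summary "tool" in the session setup.
store_summary = {
    "name": "store_conversation_summary",
    "description": "Store a structured summary of the finished conversation.",
    "parameters": {
        "type": "OBJECT",
        "properties": {
            "summary": {"type": "STRING"},
            "sentiment": {"type": "STRING"},
        },
        "required": ["summary"],
    },
}

setup_message = {
    "generation_config": {"response_modalities": ["AUDIO"]},
    "tools": [{"function_declarations": [store_summary]}],
    "system_instruction": {
        "parts": [{"text": "When the conversation ends, call "
                           "store_conversation_summary with a summary of the call."}]
    },
}

# In the read loop, watch for response.tool_call and persist the
# "summary" argument carried in its function calls.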
Have you tried it? I'm asking because I have tried this and other ideas, and basically nothing works. That is why I said "I give up": it simply doesn't work. The caveat, not documented anywhere, is that although Gemini 2.0 Flash supports audio and text (multimodal), the modalities are in practice independent within a session, so you can't do use cases like the one I'm targeting (a text summary of the audio session).
From what I have found, the text and audio modalities have separate contexts, so you cannot request the audio transcription via a text prompt. What we ended up doing was run a separate AWS Transcribe worker to which we send the Gemini and user audio PCM chunks; when the session is about to reset, we send the conversation history back to Gemini as the first message to maintain context.
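The hand-off step looks roughly like this (a sketch; it assumes the transcript lines were already collected by the transcription worker, and the function name is illustrative):

async def reseed_session(session, transcript_lines: list[str]) -> None:
    # Replay the transcribed history as the first text turn of the
    # fresh session so the model keeps the conversational context.
    primer = ("Context from the conversation so far (transcribed):\n"
              + "\n".join(transcript_lines)
              + "\nContinue the conversation from this point.")
    await session.send(primer, end_of_turn=True)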