Hi:
I wrote a Python script to take audio from a microphone and process it with streaming_detect_intent(). I am using Dialogflow ES. Since I am writing a custom integration, I disabled the webhook (not that this should matter). I tested the agent with "Try it now."
The problem is that I don't get a query_result. The streaming seems to work: at some point recognition_result.is_final is True. However, I am not getting query results, and I don't see anything in the log. My code is a modified version of the Python example. Any suggestions would be appreciated!
def sample_streaming_detect_intent(audio_queue, project_id, session_id, sample_rate):
    from google.cloud import dialogflow

    language_code = "en"
    session_client = dialogflow.SessionsClient()
    audio_encoding = dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16
    sample_rate_hertz = sample_rate

    session_path = session_client.session_path(project_id, session_id)
    print(f"Session path: {session_path}")

    def request_generator(audio_config):
        query_input = dialogflow.QueryInput(audio_config=audio_config)
        # The first request contains the configuration.
        yield dialogflow.StreamingDetectIntentRequest(
            session=session_path, query_input=query_input
        )
        while True:
            chunk = audio_queue.get()
            if not chunk:
                break
            # The later requests contain audio data.
            yield dialogflow.StreamingDetectIntentRequest(input_audio=chunk)

    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=audio_encoding,
        language_code=language_code,
        sample_rate_hertz=sample_rate_hertz,
        single_utterance=False,
    )

    requests = request_generator(audio_config)
    responses = session_client.streaming_detect_intent(requests=requests)
    for response in responses:
        print(
            f"{response.recognition_result.is_final} intermediate transcript: {response.recognition_result.transcript}"
        )
        if response.recognition_result.is_final:
            print("=" * 20)
            print(f"Query text: {response.query_result.query_text}")
            print(f"Detected intent: {response.query_result.intent.action} confidence: {response.query_result.intent_detection_confidence}")
            print(f"Fulfillment text: {response.query_result.fulfillment_text}")
I noticed a typo:
response.query_result.intent.action}
I corrected it. Still no go. I then looked at the example that uses an audio file and noticed that its generator eventually stops, so I made the following modifications:
def sample_streaming_detect_intent(audio_queue, project_id, session_id, sample_rate):
    # Create a client.
    # Using the same `session_id` between requests allows continuation
    # of the conversation.
    from google.cloud import dialogflow

    # Adding a flag.
    done = False
    language_code = "en"
    session_client = dialogflow.SessionsClient()
    is_final_result = False
    # Note: hard-coding audio_encoding and sample_rate_hertz for simplicity.
    audio_encoding = dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16
    sample_rate_hertz = sample_rate

    session_path = session_client.session_path(project_id, session_id)
    print(f"Session path: {session_path}")

    def request_generator(audio_config):
        query_input = dialogflow.QueryInput(audio_config=audio_config)
        # The first request contains the configuration.
        yield dialogflow.StreamingDetectIntentRequest(
            session=session_path, query_input=query_input
        )
        while True:
            chunk = audio_queue.get()
            if not chunk:
                break
            if done:
                print("I AM DONE")
                break
            # The later requests contain audio data.
            yield dialogflow.StreamingDetectIntentRequest(input_audio=chunk)

    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=audio_encoding,
        language_code=language_code,
        sample_rate_hertz=sample_rate_hertz,
        single_utterance=False,
    )

    requests = request_generator(audio_config)
    responses = session_client.streaming_detect_intent(requests=requests)
    for response in responses:
        print(
            f"{response.recognition_result.is_final} intermediate transcript: {response.recognition_result.transcript}"
        )
        print(f"Detected intent: {response.query_result.action} confidence: {response.query_result.intent_detection_confidence}")
        # print(f"title: {response.query_result.parameters.get('title')}")
        if response.recognition_result.is_final:
            done = True
            print("=" * 20)
            print(f"Query text: {response.query_result.query_text}")
            print(f"Detected intent: {response.query_result.action} confidence: {response.query_result.intent_detection_confidence}")
            print(f"title: {response.query_result.parameters.get('title', 'blab blab blab')}")
            print(f"Fulfillment text: {response.query_result.fulfillment_text}")
Now, when the main loop detects that an intent was found, it terminates the generator, and I get results. Although this works and doesn't close the session, it doesn't seem any better than using single_utterance. What am I missing? Does streaming_detect_intent() do some sort of out-of-band signal processing under the hood involving the StopIteration exception?
False intermediate transcript: play
Detected intent: confidence: 0.0
False intermediate transcript: please
Detected intent: confidence: 0.0
False intermediate transcript: please read
Detected intent: confidence: 0.0
False intermediate transcript: please renew
Detected intent: confidence: 0.0
False intermediate transcript: please renew
Detected intent: confidence: 0.0
False intermediate transcript: please renew the
Detected intent: confidence: 0.0
False intermediate transcript: please renew the
Detected intent: confidence: 0.0
False intermediate transcript: please renew the day
Detected intent: confidence: 0.0
False intermediate transcript: please renew the den
Detected intent: confidence: 0.0
False intermediate transcript: please renew the event
Detected intent: confidence: 0.0
False intermediate transcript: please renew the adventure
Detected intent: confidence: 0.0
False intermediate transcript: please renew the adventure
Detected intent: confidence: 0.0
False intermediate transcript: please renew the adventures
Detected intent: confidence: 0.0
False intermediate transcript: please renew the adventures of
Detected intent: confidence: 0.0
False intermediate transcript: please renew the Adventures of Tom
Detected intent: confidence: 0.0
False intermediate transcript: please renew the Adventures of Tom
Detected intent: confidence: 0.0
False intermediate transcript: please renew the Adventures of Tom Sawyer
Detected intent: confidence: 0.0
False intermediate transcript: please renew the Adventures of Tom Sawyer
Detected intent: confidence: 0.0
False intermediate transcript: please renew the Adventures of Tom Sawyer
Detected intent: confidence: 0.0
False intermediate transcript: please renew the Adventures of Tom Sawyer
Detected intent: confidence: 0.0
True intermediate transcript: please renew the Adventures of Tom Sawyer
Detected intent: confidence: 0.0
I AM DONE
False intermediate transcript:
Detected intent: LPA_renew_title confidence: 1.0
====================
Query text: please renew the Adventures of Tom Sawyer
Detected intent: LPA_renew_title confidence: 1.0
title: Adventures of Tom Sawyer
Fulfillment text:
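For what it's worth, the log above looks consistent with a two-phase protocol: responses carry recognition_result while audio is streaming, and query_result only shows up in a later response, after the client stops sending (here, by killing the generator). A minimal sketch of a response loop that separates the two phases; all the classes below are hypothetical, dependency-free stand-ins for the real dialogflow response types, used only to illustrate the logic:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for StreamingDetectIntentResponse fields.
@dataclass
class RecognitionResult:
    is_final: bool = False
    transcript: str = ""

@dataclass
class QueryResult:
    query_text: str = ""
    fulfillment_text: str = ""

@dataclass
class StreamingResponse:
    recognition_result: RecognitionResult = field(default_factory=RecognitionResult)
    query_result: QueryResult = None  # only set on the final response

def handle_responses(responses):
    """Collect transcripts as they stream; return (final transcript, query result).

    The intent-matching phase only begins after the client stops sending
    audio, so query_result arrives on a *later* response than the one
    carrying is_final == True.
    """
    final_transcript = None
    for response in responses:
        if response.query_result is not None:
            # Intent phase: the stream's last response carries the result.
            return final_transcript, response.query_result
        if response.recognition_result.is_final:
            # Recognition phase is done; intent result is still to come.
            final_transcript = response.recognition_result.transcript
    return final_transcript, None
```

This matches the log: "I AM DONE" (generator terminated) appears before the response that finally carries the intent and parameters.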
Hello! What do you mean by "it doesn't seem to be better than using single_utterance"? Do you find that the intents detected with single utterance are more accurate than with streaming?
My concern is not about accuracy but economy. If I'm doing things right, I have to restart the generator whenever I get results. In that regard, I don't see the difference from calling streaming_detect_intent() with single utterance.
It sounds like there are a couple of issues to address here:
1) Restarting the generator when results arrive. This sounds like a samples issue that we (Google) should address. I've logged an issue to track it.
2) For your use case, it sounds like you might be better off using single utterance. Is this a possibility for you?
Thanks for your help! Some comments.
Concerning 1: so this is a bug? If my example is correct, intent information is only delivered once the generator is terminated; I would expect it as soon as is_final is true. So I suspect the StopIteration exception plays a role. I started looking at the repo at:
It seems I would really need to understand how RPC is used with Dialogflow. If you can give me some insight, perhaps I can get a better idea of where to look to help track down the problem.
That said, is this a problem with just the Python implementation? I suspect there are other problems with the asyncio client (for example, it doesn't seem to properly include information in the CancelledError exception).
Concerning 2: right now, I would prefer not to. Also, I still have to run tests to confirm, but on a few invocations streaming_detect_intent() fails to properly identify many of my utterances.
@EricSchmidt I have been reading the code for the StreamingDetectIntentRequest class. I noticed this line:
"After you sent all the input, you must half-close or abort the request stream."
In my modification, I definitely "abort" the request stream, albeit in a clumsy way. However, the comment doesn't mention how to *half-close* a request stream, and there is no description of what *half-closing* a request stream entails. If you could describe what happens, that would be great. Am I right to assume that sending an audio chunk with nothing in it would be a natural way to half-close the connection?
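As far as I can tell from the gRPC Python documentation, the client's side of a bidirectional stream is half-closed automatically when the request iterator is exhausted, i.e., when the generator returns. If that holds, then rather than sending an empty chunk, letting the generator end cleanly on a sentinel would be the natural half-close. A dependency-free sketch, where make_audio_request is a hypothetical factory standing in for StreamingDetectIntentRequest(input_audio=chunk):

```python
import queue

AUDIO_STREAM_END = None  # sentinel the capture thread pushes when the mic stops

def request_generator(config_request, audio_queue, make_audio_request):
    # The first request carries only the session/config.
    yield config_request
    while True:
        chunk = audio_queue.get()
        if chunk is AUDIO_STREAM_END:
            # Returning exhausts the iterator; gRPC should then
            # half-close the client side of the stream for us.
            return
        yield make_audio_request(chunk)
```

The point of the sentinel is that the producer (the audio thread) decides when the stream ends, instead of the response loop having to poke at shared state.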
Cheers,
Andrew
Yeah, "half-close" is a mystifying phrase! Your guess that sending an empty audio chunk would "half-close" the stream is sensible. I will investigate a bit and respond here.
@EricSchmidt Thanks for exploring! I did more research, and I think half-close refers to the client closing its side of a bidirectional stream; this is a concept from gRPC and HTTP/2. However, yes, please do find out how to half-close in Python. I learned about the debugging flag, and in the meanwhile I'll come up with an example to test. I'm also looking at the Go implementation to see if there is something different or missing.
Looks like you beat me to it! I found this:
I will take a look at the Go library to see if it has this `CloseSend()` capability.
It looks like `CloseSend()` is supported by the Go pkg:
https://github.com/grpc/grpc-go/blob/v1.58.1/stream.go#L92
(If necessary, I'll test ideas with Go, but I feel that is a burden.)
@EricSchmidt However, we are not sure what the equivalent of CloseSend() is in Python. So far, the hack with the shared variable seems to be the best way to tell the generator that something has been matched. Of course, that seems too weird to me.
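A slightly less weird variant of the shared-variable hack would be a threading.Event, which at least is explicit about being cross-thread state, combined with a queue timeout so the generator notices the event even when no more audio arrives. A sketch under those assumptions; the names are illustrative, not from the real sample:

```python
import queue
import threading

def request_generator(config_request, audio_queue, make_audio_request,
                      stop_streaming):
    # Config-only first request, as the API requires.
    yield config_request
    while not stop_streaming.is_set():
        try:
            # A timeout ensures a set event is noticed even when the
            # microphone thread has stopped producing chunks.
            chunk = audio_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        yield make_audio_request(chunk)
    # Falling off the end exhausts the generator, ending the request stream.
```

The response loop would call stop_streaming.set() when recognition_result.is_final is True, then keep iterating the responses until the one carrying query_result arrives.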
This raises another question: does a new request always start with sending query_input, or can I just send more audio? If I am engaging in a multi-turn conversation, sending query_input each time an utterance is detected doesn't make sense.
I feel we are getting close. In the days to come, I'll try recording an audio file that contains multiple utterances, and turning debugging on.
Again, I really appreciate your effort in assisting me.
I'm very glad to help! I wish we could put together a more elegant solution for you. Thank you for bringing this to our attention.
It looks like the Python client doesn't support CloseSend(), just close(). I agree that using a shared variable seems a bit weird. We should probably update our samples to handle this use case. I've logged this GitHub issue to track it.
Every stream, as I understand it, needs to start with a new empty (config-only) request. I believe the client closes the bidirectional stream once it's completed and opens a new stream for every new request.
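If that's right, a multi-turn conversation would look like a loop that opens a fresh stream, re-sending the config-only request, once per utterance, while the unchanged session keeps the conversational context on the server. A dependency-free sketch; open_stream is a hypothetical stand-in for session_client.streaming_detect_intent, and the tuples stand in for audio requests:

```python
def run_conversation(open_stream, build_config_request, turns):
    """One stream per utterance; the session carries context between streams."""
    results = []
    for audio_chunks in turns:
        def requests(chunks=audio_chunks):
            # Every new stream starts with a config-only request.
            yield build_config_request()
            for chunk in chunks:
                yield ("audio", chunk)
        last_response = None
        for response in open_stream(requests()):
            last_response = response  # the final response carries query_result
        results.append(last_response)
    return results
```

The per-turn cost is only the one small config message and the stream setup; the config itself can be built once and reused, since the mic parameters don't change.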
"I'm very glad to help! I wish we could put together a more elegant solution for you. Thank you for bringing this to our attention."
You are welcome. streaming_detect_intent is inherently powerful; perhaps it just needs more tweaking.
Now that I know that there are issues in the Python implementation, I can work around this.
I can also do the following (albeit this will take some time):
Look for a language-neutral description of the Dialogflow RPC protocol. The closest I can find is the API interactions page: https://cloud.google.com/dialogflow/es/docs/api-overview
If that doesn't exist, I can play with the Go implementation, make sure it doesn't have the same problems as the Python version, and get comfortable with it. I can also look at the Go client source code, then use that knowledge to add the equivalent of a CloseSend() to the Python implementation. If things work, I can open a pull request(?) or offer up a patch. Provided the problems are on the client side, things can be "easily" fixed 🙂
"Every request, as I understand it, needs to start with a new empty (config-only) request. I believe that the client closes the bi-directional streams once they're completed and opens a new stream for every new request."
Maybe there is a technical reason for this, but it's strange. If I am streaming from a mic, my mic parameters won't change.
Again, thanks for your help!
I agree that starting each stream with a config-only request is strange. It's always been that way, as far as I know.
Yes, please do feel free to create a pull request for the Python Dialogflow library, if you feel so inclined! We heartily accept contributions from customers and community members.
As for language-neutral RPC documentation, we have that! Here are some resources for you:
+ https://cloud.google.com/dialogflow/es/docs/reference/rpc/google.cloud.dialogflow.v2
+ The protocol buffers that define the client library interfaces: https://github.com/googleapis/googleapis/tree/master/google/cloud/dialogflow/v2
+ In case you're looking for the fundamentals of gRPC, you can find information here: https://grpc.io/docs/what-is-grpc/core-concepts/
+ A more detailed description of streaming from a gRPC perspective: https://grpc.io/docs/what-is-grpc/core-concepts/#bidirectional-streaming-rpc