Hello,
I'm working with the Dataflow streaming Flex Templates documented here: https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-proto-to-bigquery
Our protobuf schema uses the Struct type defined here:
https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
When I compile the .proto file into .pb using:
protoc -I=./ actions.proto --include_imports --descriptor_set_out=actions.pb
and attempt to start the job, I get the following error from Dataflow:
com.google.cloud.teleport.v2.common.UncaughtExceptionLogger - The template launch failed.
java.lang.IllegalArgumentException: Cannot infer schema with a circular reference. Proto Field: google.protobuf.Struct
at org.apache.beam.sdk.extensions.protobuf.ProtoSchemaTranslator.getSchema(ProtoSchemaTranslator.java:174)
In our case, the use of the struct type looks like this:
message Track {
  // Name of the action that the user has performed
  string event = 1;
  // Free-form dictionary of event properties
  google.protobuf.Struct properties = 2;
  // Free-form dictionary of user properties
  google.protobuf.Struct user_properties = 3;
}
From what I can tell, the error is due to Structs being able to contain other Structs. Is this just not supported by Java? Is there a workaround for using the Google-provided Struct type? Any experience or help would be greatly appreciated!
Hi @niallsc,
Welcome to Google Cloud Community!
The error “Cannot infer schema with a circular reference” stems from the recursive definition of google.protobuf.Struct. Struct is designed to be flexible and lets you represent arbitrary JSON-like data, which means a Struct can contain another Struct. That self-reference poses a challenge for certain serialization and schema inference mechanisms, particularly Beam's ProtoSchemaTranslator in Java.
ProtoSchemaTranslator tries to deduce the schema of your data (defined by the Track message) to optimize processing. Circular references throw a wrench in this process because the schema becomes infinitely nested.
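To make the "infinitely nested" point concrete, here is a minimal sketch of why a schema walker cannot terminate on Struct. The type graph is transcribed by hand from struct.proto (Struct holds a map of Value, Value can hold a Struct or a ListValue, ListValue holds repeated Value). The `TYPE_GRAPH` dictionary and `find_cycle` helper are illustrative names only; this is not Beam's actual implementation, just one way such a check could work.

```python
# Message-type dependencies, hand-transcribed from struct.proto.
# Only message-typed fields are listed; scalar fields cannot recurse.
TYPE_GRAPH = {
    "Track": ["google.protobuf.Struct"],
    # Struct: map<string, Value> fields
    "google.protobuf.Struct": ["google.protobuf.Value"],
    # Value: oneof kind includes struct_value and list_value
    "google.protobuf.Value": [
        "google.protobuf.Struct",
        "google.protobuf.ListValue",
    ],
    # ListValue: repeated Value values
    "google.protobuf.ListValue": ["google.protobuf.Value"],
}


def find_cycle(graph, start):
    """Return the first path that revisits a type, or None if there is none."""

    def visit(node, path):
        if node in path:
            # The type is already on the current path: the schema is recursive.
            return path[path.index(node):] + [node]
        for child in graph.get(node, []):
            cycle = visit(child, path + [node])
            if cycle:
                return cycle
        return None

    return visit(start, [])


cycle = find_cycle(TYPE_GRAPH, "Track")
print(" -> ".join(cycle))
```

The cycle is part of the Struct definition itself, so it exists no matter what data the Track messages actually carry.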
Examine whether the circular references within your Track message are really necessary. If possible, redesign your message structure to eliminate circular dependencies. If you can anticipate a fixed level of nesting or represent the data differently, that might be the cleanest approach.
I hope the above information is helpful.
Thank you for the reply. However, my original post already outlined the understanding conveyed in your response: "the error is due to Structs being able to contain other structs".
I was also able to get this data to load through Dataflow by creating a custom Dataflow template and reading the protobuf via:
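from google.protobuf.json_format import MessageToDict

def convert_proto_to_dict(data, schema_class):
    # Parse the serialized bytes into an instance of the generated message class
    message = schema_class()
    message.ParseFromString(data)
    # Keep the .proto field names instead of the default lowerCamelCase JSON names
    return MessageToDict(message, preserving_proto_field_name=True)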
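One possible follow-up step, sketched under assumptions: if the destination table stores the free-form fields as JSON text rather than as nested records, the dictionary returned above could be flattened before writing. The `stringify_free_form` helper and the sample row are hypothetical and not part of the original template. The field names come from the Track message in the question, and the column types of the target table are an assumption.

```python
import json

# Field names taken from the Track message in the question.
FREE_FORM_FIELDS = ("properties", "user_properties")


def stringify_free_form(row, fields=FREE_FORM_FIELDS):
    """Return a copy of row with the named fields JSON-encoded as strings."""
    out = dict(row)
    for name in fields:
        if name in out:
            out[name] = json.dumps(out[name], sort_keys=True)
    return out


# Sample input shaped like the output of MessageToDict for a Track message.
# Struct numbers are doubles, so an integer such as 3 arrives as 3.0.
row = {
    "event": "signup",
    "properties": {"plan": "pro", "seats": 3.0},
    "user_properties": {"address": {"city": "Dublin"}},
}
flat = stringify_free_form(row)
print(flat["user_properties"])
```

Keeping the arbitrary nesting inside a single string column avoids having to declare a fixed schema for data that has none.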