Dataflow Pubsub Proto to Bigquery with Struct

Hello, 

I'm working with the Dataflow streaming Flex Template documented here: https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-proto-to-bigquery

Our protobuf schema uses the Struct type defined here: 

https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto

When I compile the .proto file into .pb using: 

 

protoc -I=./ actions.proto --include_imports --descriptor_set_out=actions.pb

 

and attempt to start the job, I get the following error from Dataflow: 

 

com.google.cloud.teleport.v2.common.UncaughtExceptionLogger - The template launch failed.
java.lang.IllegalArgumentException: Cannot infer schema with a circular reference. Proto Field: google.protobuf.Struct
    at org.apache.beam.sdk.extensions.protobuf.ProtoSchemaTranslator.getSchema(ProtoSchemaTranslator.java:174)

 

In our case, the Struct type is used like this: 

 

message Track {
    // Name of the action that the user has performed
    string event = 1;
    // Free-form dictionary of event properties
    google.protobuf.Struct properties = 2;
    // Free-form dictionary of user properties
    google.protobuf.Struct user_properties = 3;
}

 

From what I can tell, the error is due to Structs being able to contain other Structs. Is this just not supported by Java? Is there a workaround that still lets us use the Google-provided Struct type? Any experience or help would be greatly appreciated!


Hi @niallsc,

Welcome to Google Cloud Community!

The error “Cannot infer schema with a circular reference” stems from the recursive definition of google.protobuf.Struct. The type is designed to be flexible and to represent arbitrary JSON-like data: a Struct maps string keys to Value messages, and a Value can itself hold another Struct or a ListValue. That self-reference is a problem for schema inference mechanisms that need a fixed schema, in this case Beam's ProtoSchemaTranslator in Java.

ProtoSchemaTranslator tries to derive the schema of your data (defined by the Track message) before processing starts. A type that refers back to itself breaks this process, because the resulting schema would be infinitely nested.
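To make the failure mode concrete, here is a minimal sketch in plain Python of a recursive schema walk that rejects self-referencing types. It is not Beam's actual implementation: the `MESSAGES` table and the `infer_schema` function are hypothetical, and the table is only a simplified model of how struct.proto's messages refer to each other.

```python
# Simplified, hypothetical model of the message types involved.
# Each message maps field name -> referenced message type (None = scalar field).
MESSAGES = {
    "Track": {
        "event": None,
        "properties": "google.protobuf.Struct",
        "user_properties": "google.protobuf.Struct",
    },
    "google.protobuf.Struct": {"fields": "google.protobuf.Value"},
    "google.protobuf.Value": {
        "number_value": None,
        "string_value": None,
        "bool_value": None,
        "struct_value": "google.protobuf.Struct",
        "list_value": "google.protobuf.ListValue",
    },
    "google.protobuf.ListValue": {"values": "google.protobuf.Value"},
}


def infer_schema(name, messages=MESSAGES, in_progress=None):
    """Expand a message into a nested schema, failing on self-reference."""
    in_progress = set() if in_progress is None else in_progress
    if name in in_progress:
        # The type is already being expanded higher up the call chain,
        # so a fully expanded schema would never terminate.
        raise ValueError(f"Cannot infer schema with a circular reference: {name}")
    in_progress.add(name)
    schema = {}
    for field, ref in messages[name].items():
        schema[field] = "scalar" if ref is None else infer_schema(ref, messages, in_progress)
    in_progress.discard(name)
    return schema
```

Calling `infer_schema("Track")` on this model raises the error as soon as the walk reaches `struct_value` inside Value, while a message with only scalar fields expands normally.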

Consider whether the recursive Struct fields in your Track message are really necessary. If possible, redesign the message structure to remove them. If you can anticipate a fixed level of nesting or represent the data differently, that might be the cleanest approach.

I hope the above information is helpful.

Thank you for the reply. However, my original post already stated the understanding your response conveys: "the error is due to Structs being able to contain other structs".

I was also able to get this data to load through Dataflow by creating a custom Dataflow template and reading the protobuf via: 

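from google.protobuf.json_format import MessageToDict

def convert_proto_to_dict(data, schema_class):
    # Parse the serialized payload into the generated message class.
    message = schema_class()
    message.ParseFromString(data)
    # Struct fields come back as plain nested dicts; keep the .proto field names.
    return MessageToDict(message, preserving_proto_field_name=True)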
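
The post does not say how the resulting dict was written to BigQuery. One possible follow-up step, sketched here under the assumption that the destination table declares `properties` and `user_properties` as STRING (or JSON) columns, is to serialize the free-form fields before writing the row. The `to_bigquery_row` helper and the `FREEFORM_FIELDS` constant are hypothetical names used only for illustration.

```python
import json

# Assumption: these are the Struct-typed fields of the Track message.
FREEFORM_FIELDS = ("properties", "user_properties")


def to_bigquery_row(record, freeform_fields=FREEFORM_FIELDS):
    """Return a copy of record with the free-form fields encoded as JSON text."""
    row = dict(record)
    for name in freeform_fields:
        if name in row:
            # Arbitrary nesting collapses into one string, so the column
            # keeps a fixed type no matter what the event carries.
            row[name] = json.dumps(row[name], sort_keys=True)
    return row
```

For example, `to_bigquery_row({"event": "signup", "properties": {"plan": {"tier": "pro"}}})` leaves `event` untouched and turns `properties` into a JSON string that can be queried later with BigQuery's JSON functions.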