
How to identify duplicate OBJECT_FINALIZE events in Cloud Storage Pubsub Notifications?

Is there an example or a recommended way to deduplicate messages that come from Cloud Storage Pub/Sub notifications? I'm specifically interested in identifying duplicates of the "Object Finalize" event type (https://cloud.google.com/storage/docs/pubsub-notifications#events), i.e. notifications for newly created objects.

I'm expecting the primary key to be a combination of fields like (id, generation), but I'm not sure yet. I found some useful information in the blog post "Handling duplicate data in streaming pipelines using Dataflow and Pub/Sub", but since I'm not using Dataflow or BigQuery, I need to implement this myself. The two images I've posted from that blog suggest that I need a custom implementation.

pramodbiligiri_0-1660228359485.png

pramodbiligiri_1-1660229495012.png

 

2 REPLIES

As you mentioned, this is a limitation of GCS, so to get deduplication without building it yourself, you would need to use Dataflow or BigQuery.

Thanks. I've decided to use the id field as the primary key for now, and I'm making my downstream system dedupe based on that.
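
For anyone taking the same approach, here is a minimal sketch of deduping on the id field. It assumes the notification uses the JSON_API_V1 payload format, so the message data is the object resource and carries an `id`; when the payload is absent it falls back to a key built from the `bucketId`, `objectId` and `objectGeneration` message attributes. The helper names `dedupe_key` and `SeenFilter` are hypothetical, and the in-memory set is only a stand-in for whatever persistent store the downstream system actually uses.

```python
import json


def dedupe_key(attributes, data=b""):
    """Return a key identifying one object version, or None for other events.

    `attributes` is the Pub/Sub message attribute dict and `data` is the
    raw message payload (bytes). Both are passed in by the caller.
    """
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None

    # With payloadFormat=JSON_API_V1 the payload is the object resource,
    # which includes an `id` field.
    if attributes.get("payloadFormat") == "JSON_API_V1" and data:
        resource = json.loads(data)
        if "id" in resource:
            return resource["id"]

    # Fallback (assumption): build an equivalent key from the attributes.
    return "{}/{}/{}".format(
        attributes["bucketId"],
        attributes["objectId"],
        attributes["objectGeneration"],
    )


class SeenFilter:
    """In-memory stand-in for a persistent 'already processed' store."""

    def __init__(self):
        self._seen = set()

    def is_new(self, key):
        # True the first time a key is seen, False for every repeat.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In a subscriber callback this would be used roughly as `key = dedupe_key(message.attributes, message.data)`, processing the message only when `key` is not None and `seen.is_new(key)` returns True. Because each overwrite of an object gets a new generation, a key that includes the generation treats overwrites as distinct events and only collapses redelivered notifications for the same object version.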