As an interim solution, we're using GCS buckets to store content in the form of YAML files, served through an internal web application. This could change later, but not for now.
We need monitoring in place in case any of these GCS buckets gets accidentally deleted, as well as an easy, default mechanism to restore a bucket and its contents.
For the monitoring part, the built-in Cloud Monitoring metrics seem to sample events over a 24-hour window, and we'd need finer time granularity (minutes to a couple of hours, max). We've resorted to a Cloud Function running on a schedule to check whether the bucket still exists, which looks sub-optimal.
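For reference, the scheduled check is essentially this (a minimal sketch; the bucket list and the alerting hook are simplified placeholders):

```python
from google.cloud import storage

# Buckets we expect to exist (hard-coded here; the real list comes from config).
EXPECTED_BUCKETS = ["content-bucket-1", "content-bucket-2"]

def check_buckets(request):
    """HTTP Cloud Function invoked by Cloud Scheduler."""
    client = storage.Client()
    # lookup_bucket() returns None when the bucket no longer exists.
    missing = [name for name in EXPECTED_BUCKETS
               if client.lookup_bucket(name) is None]
    if missing:
        print(f"ALERT: missing buckets: {missing}")  # placeholder for real alerting
    return f"missing: {missing}", 200
```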
For the recovery part, we're copying the content from the source bucket to a backup bucket on a schedule via a Transfer Service job, and manually copying the content back to the source bucket when a deletion happens, which also looks sub-optimal.
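The manual restore amounts to roughly the following (a sketch; it assumes the backup bucket holds a full copy, and the bucket names and location are placeholders):

```python
from google.cloud import storage

def restore_bucket(source_name: str, backup_name: str, location: str = "EU") -> None:
    """Recreate a deleted bucket and copy its contents back from the backup."""
    client = storage.Client()
    # Recreate the source bucket if it no longer exists.
    source = client.lookup_bucket(source_name)
    if source is None:
        source = client.create_bucket(source_name, location=location)
    # Copy every object from the backup bucket into the recreated bucket.
    backup = client.bucket(backup_name)
    for blob in client.list_blobs(backup_name):
        backup.copy_blob(blob, source)

restore_bucket("content-bucket-1", "content-bucket-1-backup")
```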
Do you have any recommendations for an easier, cleaner way to achieve this? Thanks!
That's a great question. A couple of thoughts:
1. To detect deletions, use audit logs. You can either export them to Cloud Pub/Sub and trigger a Cloud Function that notifies you on deletion, or create log-based metrics (for example, counting the number of bucket-delete operations per bucket) and set up alerts on them. The former has lower latency but requires a bit of code; since you already run a similar scheduled system, this should not be a heavy lift (see the first sketch after this list). The advantage, of course, is that you react to events in "real time" rather than polling for changes every so often. Your current approach still has value, though: there is a -- very remote -- possibility that some events are missed, either by your code or by the infrastructure. So if this is truly critical, you may want to both watch events in real time and maintain a "configuration as code" picture of which buckets are expected to exist.
2. Reduce the risk of failure. For highly critical buckets, restricting direct delete access and instead requiring buckets to be created or deleted through configuration as code (for example, Terraform) with code reviews in between can be worth the effort.
3. Data backups. Use event-driven storage transfers to minimize the window between a new object being created and its backup copy existing (see the second sketch below).
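Here is roughly what the Pub/Sub-triggered detection function from point 1 could look like. It assumes a log sink that routes Cloud Audit Log entries to a Pub/Sub topic (sink setup not shown), and the notification call is a placeholder:

```python
import base64
import json

def on_audit_log(event, context):
    """Background Cloud Function triggered by Pub/Sub messages from a log sink."""
    # A log sink delivers each LogEntry as base64-encoded JSON in the message data.
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    payload = entry.get("protoPayload", {})
    if payload.get("methodName") == "storage.buckets.delete":
        # resourceName looks like "projects/_/buckets/my-bucket".
        bucket = payload.get("resourceName", "unknown")
        print(f"ALERT: bucket deleted: {bucket}")  # placeholder for email/chat/paging
```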
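And for point 3, a sketch of creating an event-driven transfer with the Storage Transfer Service client library. It assumes you have already set up a Pub/Sub subscription fed by object-change notifications on the source bucket; the project, bucket, and subscription names are placeholders:

```python
from google.cloud import storage_transfer

def create_event_driven_backup(project_id: str, source_bucket: str,
                               sink_bucket: str, pubsub_subscription: str) -> None:
    """Create a transfer job that copies new objects as change events arrive."""
    client = storage_transfer.StorageTransferServiceClient()
    job = client.create_transfer_job({
        "transfer_job": {
            "project_id": project_id,
            "description": "Event-driven backup of the content bucket",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": source_bucket},
                "gcs_data_sink": {"bucket_name": sink_bucket},
            },
            # The event stream points at the Pub/Sub subscription that
            # receives object-change notifications from the source bucket.
            "event_stream": {
                "name": f"projects/{project_id}/subscriptions/{pubsub_subscription}",
            },
        }
    })
    print(f"Created transfer job: {job.name}")
```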
Please work with your account team to discuss options, beyond backups, for recovering from deletions.