GCS - Deleting the files inside a folder object without deleting the folder object itself

Currently we have GCS buckets, each with a folder inside, and the actual data files are inside the folder.

We want to clean up those data files but retain the folders.

Recently GCP launched the MatchesPrefix option for lifecycle policies, which can be applied at the object level. However, when I apply a lifecycle policy, it deletes both the folder and the data files inside it.

Is there a better way to retain the folder and delete only the data files using lifecycle policy options such as MatchesPrefix/MatchesSuffix?


This seems to be more a question of how to manage lifecycle rules; I would suggest you follow the Manage Object Lifecycles documentation.

Be sure to create a rule to delete only the objects and not the complete folder.

You can use Object Lifecycle Management, which can delete any number of objects, or the Google Cloud console, which can delete up to several million objects.

To mass-delete items in your bucket using Object Lifecycle Management, set a lifecycle configuration rule on your bucket with the Age condition set to 0 days and the action set to Delete.
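
For illustration, a minimal sketch of that rule using the google-cloud-storage Python client (the bucket name "my-bucket" is a placeholder):

from google.cloud import storage

# Assumes application default credentials are configured.
client = storage.Client()
bucket = client.get_bucket("my-bucket")  # placeholder bucket name

# Add a Delete action whose Age condition is 0 days, i.e. every
# object in the bucket becomes eligible for deletion.
bucket.add_lifecycle_delete_rule(age=0)
bucket.patch()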

For the main issue, this Stack Overflow question might be helpful too.

Hi, 

Thanks for your reply...!!!

I have set the lifecycle policy as below:

Action - Delete Object

MatchesPrefix - "FolderName/"

The expectation is to delete all the files inside the folder, but not the folder itself.

However, in my case it is deleting both the folder and the objects inside it.

Regards,

Dhiraj Shah

One thing to keep in mind is that Google Cloud Storage uses a flat namespace; it doesn't really have the concept of folders. You can read more about it here:

https://cloud.google.com/storage/docs/folders

Some tools create the illusion of folders by creating a zero-byte object (see the blue note at the bottom of that page), but when your lifecycle policy runs, those placeholder objects match too, which is why it appears that the 'folder' has also been deleted.
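
You can see this for yourself by listing the objects under a prefix; a quick sketch with the Python client (bucket and folder names here are placeholders):

from google.cloud import storage

client = storage.Client()

# The zero-byte placeholder object named "FolderName/" (if one exists)
# appears in this listing alongside the real data files, so a
# matchesPrefix rule on "FolderName/" matches it too.
for blob in client.list_blobs("my-bucket", prefix="FolderName/"):
    print(blob.name, blob.size)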

What is the requirement for keeping the 'folder'?  Is there another approach maybe?

Hi, 

Thanks for your reply...!!!

The requirement is: I have 10 folders inside the bucket, and files arrive in each folder on a daily basis.

I enabled the lifecycle policy as below:

{
  "rule": [{
    "action": {
      "type": "Delete"
    },
    "condition": {
      "age": 1,
      "matchesPrefix": ["at_feedback/", "customer_contact/", "account_information/"]
    }
  }]
}
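
In case it helps, the same configuration can also be applied programmatically; a sketch with the Python client, where the rule dicts mirror the JSON above (the bucket name is a placeholder):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")  # placeholder bucket name

# The rule dicts use the same shape as the JSON lifecycle resource.
# Note that this assignment replaces any existing lifecycle rules.
bucket.lifecycle_rules = [{
    "action": {"type": "Delete"},
    "condition": {
        "age": 1,
        "matchesPrefix": ["at_feedback/", "customer_contact/", "account_information/"],
    },
}]
bucket.patch()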

 

It cleans up both the folders and the files if no new files are received the next day.

Another interesting thing: with the above lifecycle policy configured, the Day-1 file is not deleted if a new file was received today. Ideally, files/objects that have aged 24 hours should get deleted, while new files/objects should not.

 

That's correct; as per my first reply, there isn't really any such thing as a folder. So if you have a delete policy with an age of 1 and nothing has been uploaded within the last day, it will delete everything.

Hi, 

That is valid, but if I have an object which has already reached the 24-hour age limit and another object which was loaded but has not yet aged 24 hours, then in this case the zero-byte folder object should stick around, while the actual file that has aged 24 hours should get deleted.

Correct me if I am wrong.

I'm sorry, I'm not sure I understand the question exactly. Specifically "zero-byte folder object" - I'm not sure what you mean by this.

Hi, 

Let me put it like this - 

1. Folder - Zero Byte Object

2. Files - Zero/Non-Zero Byte Object

Bucket Name - test

File Name - test1.txt loaded on 28-09-2022 10 AM

File Name - test2.txt loaded on 29-09-2022 10 AM

Lifecycle policy enabled with:

Action - Delete Object

MatchesPrefix - Folder Name

Condition - Age - 1 day

The expectation is:

  • The folder name should remain as it is.
  • Files which have aged 24 hours should get deleted.
  • Files which are newly added but have not yet aged 24 hours should not get deleted.

 

With that policy, objects (I suggest not thinking of them as files and folders) that were created more than 24 hours before the current time will be deleted. Some other points to note from the documentation:

https://cloud.google.com/storage/docs/lifecycle

"Cloud Storage performs an action asynchronously, so there can be a lag between when the conditions are satisfied and when the action is taken. Your applications should not rely on lifecycle actions occurring within a certain amount of time after a lifecycle condition is met."

and

"Changes to a bucket's lifecycle configuration can take up to 24 hours to go into effect, and Object Lifecycle Management might still perform actions based on the old configuration during this time."

 

Thanks for your reply. 

I am getting promising results with the MatchesSuffix option.

Greetings, @shah29.

Could you confirm your approach using suffixes?

I mean, I imagine you are using suffixes like file extensions to avoid deleting folders.

I am also considering that, but it requires every file to have an extension, which I still do not think is ideal, but it is something.

Best regards!

Hi, 

In my case all files have an extension, but that does not actually matter. If the rule you implement contains a suffix and an object matches it, then and only then will the rule act on that object; for objects where the rule does not match, there will be no action.
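
For illustration, a suffix-based rule might look like this with the Python client; the extensions and bucket name here are just placeholders, so use whatever your files actually end with:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")  # placeholder bucket name

# Only objects ending in one of these suffixes are deleted. The
# zero-byte "folder" placeholder ends in "/", so it never matches
# and is left alone.
bucket.lifecycle_rules = [{
    "action": {"type": "Delete"},
    "condition": {"age": 1, "matchesSuffix": [".txt", ".csv", ".log"]},
}]
bucket.patch()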

I hope this helps you. 

Regards, 

Dhiraj Shah

 

Can you please help with the suffix option? I mean, how do I specify it?

I have the same question, but I don't feel it was answered sufficiently in this thread.

Let's say I have a bucket with a top-level prefix named "app1-logs". (As an aside, the OP fairly thinks of these as directories. I know that they aren't, but that's how Linux users are trained to think about paths in that format, and I don't think it's a valuable response to get pedantic about that.)

gs://example-bucket/app1-logs/log1.log
gs://example-bucket/app1-logs/log2.log
gs://example-bucket/app1-logs/log3.log

I want to create a lifecycle policy that will delete everything under the "app1-logs/" prefix WITHOUT deleting the "app1-logs/" prefix. After every object under "app1-logs/" times out, I expect to see the following:

gs://example-bucket/app1-logs/

This can be solved with suffixes, but in a real-world use case you might have arbitrary object names. For example, a Hadoop job might write objects that end with ascending numbers like "...0001", "...0002", etc. It's impossible to write a suffix predicate for arbitrary file names.

My suspicion is that we would instead use a wildcard in the prefix, e.g. "app1-logs/*". Would that preserve my prefix?

Thanks