Hi, really could do with some help.
I've been running on GCP for a year and never had any issues, but in the past few days I added some securityContext settings to my pods:
podSecurityContext:
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault
securityContext:
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
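For clarity, in the rendered pod spec these settings end up at the pod and container level roughly like this (the layout is illustrative and the container name is a placeholder):

spec:
  securityContext:              # pod-level settings (from podSecurityContext above)
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app                 # placeholder name
      securityContext:          # container-level settings
        capabilities:
          drop:
            - ALL
        readOnlyRootFilesystem: true
        runAsNonRoot: true
        runAsUser: 1000
        allowPrivilegeEscalation: false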
I didn't see any issue with making these changes. However, in the past day I have found services stopped running on the cluster. When investigating, I found that Persistent Volumes had become corrupted; when GKE attempts to remount a volume it tries to fix the issues it finds on the disk, fails, and so my PVs are left unusable. This in turn made me look at the GKE worker nodes, both using "get events" via kubectl and in the GKE portal on the GCP console.
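For anyone wanting to reproduce the check, node-level events can be pulled with something like the following (<nodeID> is a placeholder):

# list recent events attached to Node objects, newest last
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
# then inspect the affected node's conditions
kubectl describe node <nodeID>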
I then found errors like this across two different clusters, both of which had recently had the securityContext added:
default 61m Warning IOError node/<nodeID> Buffer I/O error on dev nvme0n3, logical block 1081344, lost sync page write
default 61m Warning Ext4Error node/<nodeID> EXT4-fs error (device nvme0n3): ext4_put_super:1196: comm node-agent: Couldn't clean up the journal
default 61m Warning FilesystemIsReadOnly node/<nodeID> Node condition ReadonlyFilesystem is now: True, reason: FilesystemIsReadOnly, message: "EXT4-fs (nvme0n3): Remounting filesystem read-only"
In the GCP console, under the GKE nodes, the ReadonlyFilesystem warning is showing:
ReadonlyFilesystem: True | EXT4-fs (nvme0n3): Remounting filesystem read-only
My question is: what's going on? I didn't know that the readOnlyRootFilesystem securityContext setting at the pod level could affect the hosts in any way (I can't believe it, but that's pretty much the only change). I have also set up ephemeral emptyDir volumes on /tmp (roughly as in the snippet below), but I don't think that's the cause either.
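The /tmp mount is a plain emptyDir, something like this (the volume and container names are placeholders):

spec:
  volumes:
    - name: tmp               # scratch volume backing /tmp for the read-only root filesystem
      emptyDir: {}
  containers:
    - name: app               # placeholder name
      volumeMounts:
        - name: tmp
          mountPath: /tmp     # writable scratch space for the pod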
This has already happened to 4 different nodes in my clusters today, and it appears to happen randomly. I'm running GKE v1.26.13-gke.1052000.
If the volumes are corrupted, am I able to recover them? So far I have only been able to get some services back up by deleting the PV and PVC associated with the pod (essentially the commands below), which isn't ideal.
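For context, the workaround amounts to something like the following (names are placeholders, and it discards whatever was on the volume):

# scale the workload down so nothing holds the claim (placeholder names)
kubectl scale deployment my-app --replicas=0
# delete the claim and the underlying volume, then redeploy with a fresh PVC
kubectl delete pvc my-app-data
kubectl delete pv <pv-name>
kubectl scale deployment my-app --replicas=1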
Welcome to Google Cloud Community!
Your change to the Deployment's securityContext (readOnlyRootFilesystem) in Kubernetes can affect how your pods interact with the host system.
Recovering corrupted Persistent Volumes (PVs) can be difficult or nearly impossible, and deleting the PVC/PV to bring the system back up can result in data loss.
Moving forward, you can use tools such as Velero to back up your cluster workloads, including the PVCs/PVs; a quick sketch is below.
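For example, assuming Velero is already installed in the cluster and using a placeholder namespace, a one-off backup and a daily schedule might look like this:

# one-off backup of a namespace, including volume snapshots
velero backup create pre-change-backup --include-namespaces my-app --snapshot-volumes
# daily backup at 02:00 with the same scope
velero schedule create my-app-daily --schedule="0 2 * * *" --include-namespaces my-app --snapshot-volumes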
I hope this information is helpful.
If you need further assistance, you can always file a ticket with our support team.