
PVCs not resizing

UPDATE: This appears to be an issue with GKE 1.32.x up to 1.32.1-gke.1729000, globally.

Hello! Since yesterday my PVCs will no longer resize, across clusters, different GKE versions, and different versions of the pdcsi driver.

The normal behaviour is to change the PVC's storage resource request to a larger value and then, while the resize is pending a pod start, kill the existing pod attached to the PVC.
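Concretely, that workflow looks roughly like this (my-pvc and the app=my-app label are placeholders for the real claim and workload):

# Request a larger size on the claim (placeholder names)
kubectl patch pvc my-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# The resize waits for a pod (re)start, so delete the attached pod
kubectl delete pod -l app=my-app
# Once the new pod mounts the volume, the claim reports the larger capacity
kubectl get pvc my-pvc -o jsonpath='{.status.capacity.storage}{"\n"}'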

Now, I see the following event, and nothing further:

From v1.32.1-gke.1489001 with pdcsi v1.15.4:

CSI migration enabled for kubernetes.io/gce-pd; waiting for external resizer to expand the pvc

From v1.32.1-gke.1729000 with pdcsi v1.16.1:

waiting for an external controller to expand this PVC

Recent events:

  • Migrated to new node pools with cgroup v2
  • Upgraded GKE versions

Storage Class relevant details:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-standard
parameters:
  type: pd-standard
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Any ideas on what might be going wrong, or any advice on how to troubleshoot further?

8 REPLIES

Please check the CSI resizer pods:

kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system -l app=csi-gce-pd-controller --tail=50
# restart the above pods if needed

kubectl describe pvc <your-pvc-name>
kubectl get events --sort-by=.metadata.creationTimestamp

# check that the StorageClass allows expansion
kubectl get storageclass my-standard -o yaml | grep allowVolumeExpansion
kubectl patch storageclass my-standard -p '{"allowVolumeExpansion": true}'

# force a manual resize
kubectl patch pvc <your-pvc-name> -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'

If that still does not work, restart the kubelet on the node and delete the pod so a new one starts.
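You can also look at the claim's resize conditions directly; a quick sketch (substitute your own PVC name and namespace):

# Show the PVC's resize-related status conditions (Resizing / FileSystemResizePending)
kubectl get pvc <your-pvc-name> -n <namespace> -o jsonpath='{.status.conditions}{"\n"}'
# A FileSystemResizePending condition means the controller has already expanded the
# disk and the node will finish the resize the next time a pod mounts the volume.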

 

 

1. Check if the External Resizer is Running
Since the error states that it is “waiting for an external controller to expand this PVC,” verify if the external resizer is running correctly in your cluster:

kubectl get pods -n kube-system | grep csi

2. Verify the Resizer Logs
Check the logs of the external resizer:

kubectl logs -n kube-system <csi-resizer-pod-name> -c external-resizer

3. Check for Pending PVC Events
Describe the PVC to see more details on its status:


kubectl describe pvc <your-pvc-name>

4. Ensure Proper CSI Migration Configuration
Since you see the message "CSI migration enabled for kubernetes.io/gce-pd", verify that migration is properly configured and there are no conflicts. Run:

kubectl get csidrivers

5. Manually Restart the CSI Controller
Try restarting the csi-provisioner and csi-resizer:

kubectl delete pod -n kube-system -l app=pd-csi-controller
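If the logs show nothing useful, narrowing events to the claim itself can also help; a sketch with placeholder names:

# List only the events that reference this PVC, oldest first
kubectl get events -n <namespace> \
  --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=<your-pvc-name> \
  --sort-by=.metadata.creationTimestamp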

The pdcsi pods do not appear to have a container called "external-resizer" in 1.31 (where resizing works) or in 1.32 (where resizing fails).

I have tried the suggestions, but also done more research, and I believe this to be an issue with GKE 1.32.x

You should be able to replicate the issue like this, switching the channel between rapid (where it's broken) and regular (where it works):

#!/bin/bash

CLUSTER_NAME=pvctest-rapid
CLUSTER_REGION=moon-east1-a
PROJECT_ID=myproject1234
RELEASE_CHANNEL=rapid

# Print the requested (spec) vs. actual (status) PVC size
pvcStats() {
  echo "Configured space"
  kubectl get pvc --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME} -o=json | jq '.items[0].spec.resources.requests.storage'
  echo "Actual space"
  kubectl get pvc --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME} -o=json | jq '.items[0].status.capacity.storage'
}

gcloud container clusters create ${CLUSTER_NAME} \
    --enable-autoscaling --max-nodes=2 --min-nodes=1 \
    --release-channel ${RELEASE_CHANNEL} \
    --machine-type e2-standard-4 --region ${CLUSTER_REGION}

gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${CLUSTER_REGION}

kubectl apply -f - --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME}<<EOF
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cheap
parameters:
  type: pd-standard
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test-nginx
  name: test-nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-nginx
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: test-nginx
      name: test-nginx
    spec:
      volumes:
        - name: pvc-storage
          persistentVolumeClaim:
            claimName: claim
      containers:
        - name: web
          image: nginx
          ports:
            - containerPort: 80
              name: "http-server"
          volumeMounts:
            - mountPath: "/usr/share/nginx/html"
              name: pvc-storage
      restartPolicy: Always
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim
spec:
  storageClassName: cheap
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
EOF


pvcStats
# Request a larger size, then recycle the pod so the resize can complete
kubectl patch pvc claim -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'
sleep 5s
kubectl delete pods -l=app=test-nginx --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME}
sleep 10s
# On a working version this now reports 5Gi; on affected 1.32.x clusters it stays at 3Gi
pvcStats
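
# Cleanup (added for completeness; assumes the same variables as above)
gcloud container clusters delete ${CLUSTER_NAME} --region ${CLUSTER_REGION} --quiet
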
Hi nabk,
Welcome to Google Cloud Community!

I tried to reproduce your concern and I'm getting the same output. Per the GKE release channels documentation, the Rapid channel provides the newest GKE versions; these versions are excluded from the GKE SLA and may contain issues without known workarounds.

To ensure the features and APIs in your configuration work as expected, I suggest using the Regular channel instead of the Rapid channel. The Rapid channel is designed for early access and experimentation, which means some features can be unstable or even temporarily disabled. By switching to the Regular channel, you'll be using a more stable environment that supports the components in your configuration.
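If it helps, an existing cluster's channel can be changed with gcloud; a rough sketch (cluster name and region are placeholders, and the cluster's current version generally needs to be available in the target channel):

# Move the cluster from the Rapid channel to the Regular channel (placeholder names)
gcloud container clusters update CLUSTER_NAME --release-channel regular --region REGION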

If you need further assistance, please don't hesitate to submit a ticket to our support team.

For further reference, please see the GKE release channels documentation.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

You've hit upon a known and frustrating issue with GKE 1.32.x, specifically related to PVC resizing. The symptoms you're describing, where the PVC resize hangs with "waiting for external resizer" messages, are indicative of this problem.

Understanding the Issue

  • GKE 1.32.x Bug:
    • The root cause is a bug in the GKE 1.32.x series, particularly versions up to 1.32.1-gke.1729000. This bug disrupts the communication between the Kubernetes control plane and the CSI (Container Storage Interface) driver responsible for resizing Persistent Volumes.
    • The issue stops the external resizer from properly receiving the resize request.
  • CSI Driver Interaction:
    • Kubernetes relies on CSI drivers to manage storage operations, including volume resizing. The pd.csi.storage.gke.io driver handles Persistent Disks (PD) in Google Cloud.
    • The bug in GKE 1.32.x interferes with the ability of the CSI driver to receive and process resize requests.
  • Impact:
    • This issue prevents you from dynamically resizing your Persistent Volumes, which can be critical for applications that require flexible storage capacity.

Troubleshooting and Workarounds

  1. GKE Version Downgrade (If Possible):

    • If possible, the most reliable workaround is to downgrade your GKE clusters to a stable version before 1.32.x. For example, 1.31.x versions are generally considered stable.
    • This is not always possible, but is the most reliable fix.
  2. Wait for GKE Patch:

    • Google Cloud is aware of this issue and is working on a patch. Keep an eye on the GKE release notes and the Google Cloud Status Dashboard for updates.
    • The fact that you have seen this issue globally points to a platform-side problem rather than anything in your clusters, so the fix will have to come from Google.
  3. Manual Volume Resizing (Complex):

    • As a temporary workaround, you might be able to manually resize the underlying Persistent Disk using the gcloud command-line tool or the Google Cloud Console (a rough sketch follows after this list).
    • However, this is a complex and risky process that requires careful coordination with your application and Kubernetes.
    • You would have to:
      • Detach the volume from the node.
      • Resize the PD.
      • Resize the filesystem on the volume.
      • Reattach the volume to the node.
      • Then, resize the PVC object in Kubernetes.
    • This is highly discouraged, unless you are very comfortable with storage management.
  4. Create New PVCs and Migrate Data (Inconvenient):

    • Another workaround is to create new, larger PVCs and migrate your data to them.
    • This is inconvenient and can cause downtime, but it might be necessary if you urgently need to increase storage capacity.
  5. Check for CSI Driver Issues (Less Likely):

    • Although you mentioned you have seen this across different pdcsi driver versions, it is still worth double checking for any reported issues with the pd.csi.storage.gke.io driver.
    • However, because of the global nature of this issue, and the GKE version correlation, the GKE version is the most likely culprit.
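For workaround 3, a rough sketch of the disk and filesystem steps only, assuming a hypothetical disk named my-disk in zone us-central1-a with an ext4 filesystem; the detach/reattach and PVC bookkeeping from the list above still have to be handled by hand:

# Grow the underlying Persistent Disk (disks can only be enlarged, never shrunk)
gcloud compute disks resize my-disk --size=50GB --zone=us-central1-a
# On the node where the disk is attached, grow the filesystem (device path varies;
# /dev/disk/by-id/google-<device-name> is typical for PDs, ext4 assumed here)
sudo resize2fs /dev/disk/by-id/google-my-disk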

Recommendations

  • Monitor the GKE release notes for updates and patches.
  • If you need immediate PVC resizing, consider downgrading to a stable GKE version if possible.
  • Avoid manual volume resizing unless absolutely necessary and you have a strong understanding of storage management.
  • If possible, hold off on storage expansions for non-critical workloads until a patch is released.
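To watch for a patched version, the versions offered in each release channel can be listed; a sketch with a placeholder region:

# Show default and valid GKE versions per release channel in this region
gcloud container get-server-config --region us-central1 --format="yaml(channels)"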

 

Is this an issue affecting all 1.32.x versions of GKE?

It does seem resolved in the latest GKE version: 1.32.2-gke.165200
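For anyone else landing here, the control plane can be moved to that version explicitly; a sketch with placeholder cluster name and region:

# Upgrade the control plane to the patched version (placeholder name/region)
gcloud container clusters upgrade CLUSTER_NAME --master --cluster-version 1.32.2-gke.165200 --region REGION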
