UPDATE: This appears to be a global issue with GKE 1.32.x, at least up to 1.32.1-gke.1729000.
Hello! Since yesterday my PVCs will no longer resize, across clusters, different GKE versions, and different versions of the pdcsi driver.
The normal behaviour is to change the PVC's storage resource request to a larger number and then, while the resize is pending a Pod start, kill the existing Pod attached to the PVC.
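For illustration, that workflow looks roughly like this (the PVC and pod names here are placeholders):
# request a larger size on the PVC
kubectl patch pvc <my-pvc> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# the filesystem resize is pending a pod start, so delete the pod that mounts the PVC
kubectl delete pod <pod-using-the-pvc>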
Now, I see the following event, and nothing further:
From v1.32.1-gke.1489001 with pdcsi v1.15.4:
CSI migration enabled for kubernetes.io/gce-pd; waiting for external resizer to expand the pvc
From v1.32.1-gke.1729000 with pdcsi v1.16.1:
waiting for an external controller to expand this PVC
Storage Class relevant details:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-standard
parameters:
  type: pd-standard
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
Any ideas on what might be going wrong, or any advice on how to troubleshoot further?
Please check the CSI resizer pod:
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system -l app=csi-gce-pd-controller --tail=50
# restart above pods if needed.
kubectl describe pvc <your-pvc-name>
kubectl get events --sort-by=.metadata.creationTimestamp
# check the StorageClass and enable volume expansion at runtime if needed
kubectl get storageclass my-standard -o yaml | grep allowVolumeExpansion
kubectl patch storageclass my-standard -p '{"allowVolumeExpansion": true}'
# force manual resize
kubectl patch pvc <your-pvc-name> -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
If the resize is still stuck, restart the kubelet on the node and delete the pod so a new one is started.
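For that last step, a rough sketch, assuming the pod and node names below are placeholders and that the GKE node runs the kubelet under systemd:
# delete the pod so a replacement is scheduled and the pending filesystem resize can run
kubectl delete pod <pod-using-the-pvc>
# if it still does not complete, restart the kubelet on the node that hosts the pod
gcloud compute ssh <node-name> --zone <zone> -- sudo systemctl restart kubelet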
1. Check if the External Resizer is Running
Since the error states that it is “waiting for an external controller to expand this PVC,” verify if the external resizer is running correctly in your cluster:
kubectl get pods -n kube-system | grep csi
2. Verify the Resizer Logs
Check the logs of the external resizer:
kubectl logs -n kube-system <csi-resizer-pod-name> -c external-resizer
3. Check for Pending PVC Events
Describe the PVC to see more details on its status:
kubectl describe pvc <your-pvc-name>
4. Ensure Proper CSI Migration Configuration
Since you see the message "CSI migration enabled for kubernetes.io/gce-pd", verify that migration is properly configured and there are no conflicts. Run:
kubectl get csidrivers
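For example, something along these lines (the PV name is a placeholder; on a migrated in-tree volume I would expect to see a pv.kubernetes.io/migrated-to annotation pointing at pd.csi.storage.gke.io):
# confirm the GCE PD CSI driver object is registered
kubectl get csidriver pd.csi.storage.gke.io
# check whether the bound PV carries the CSI migration annotation
kubectl get pv <pv-name> -o yaml | grep migrated-to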
5. Manually Restart the CSI Controller
Try restarting the csi-provisioner and csi-resizer:
kubectl delete pod -n kube-system -l app=pd-csi-controller
The pdcsi pods do not appear to have a container called "external-resizer" in 1.31 (where resizing works) or in 1.32 (where resizing fails)
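A quick way to verify which containers a pdcsi pod actually runs (the pod name is a placeholder):
kubectl get pod -n kube-system <pdcsi-pod-name> -o jsonpath='{.spec.containers[*].name}'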
I have tried the suggestions, but I have also done more research, and I believe this to be an issue with GKE 1.32.x.
You should be able to replicate the issue like this, switching the channel between rapid (where it's broken) and regular (where it works):
#!/bin/bash
CLUSTER_NAME=pvctest-rapid
CLUSTER_REGION=moon-east1-a
PROJECT_ID=myproject1234
RELEASE_CHANNEL=rapid
pvcStats() {
  echo "Configured space"
  kubectl get pvc --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME} -o=json | jq '.items[0].spec.resources.requests.storage'
  echo "Actual space"
  kubectl get pvc --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME} -o=json | jq '.items[0].status.capacity.storage'
}
gcloud container clusters create ${CLUSTER_NAME} \
  --enable-autoscaling --max-nodes=2 --min-nodes=1 \
  --release-channel ${RELEASE_CHANNEL} \
  --machine-type e2-standard-4 --region ${CLUSTER_REGION}
gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${CLUSTER_REGION}
kubectl apply -f - --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME} <<EOF
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cheap
parameters:
  type: pd-standard
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test-nginx
  name: test-nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-nginx
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: test-nginx
      name: test-nginx
    spec:
      volumes:
      - name: pvc-storage
        persistentVolumeClaim:
          claimName: claim
      containers:
      - name: web
        image: nginx
        ports:
        - containerPort: 80
          name: "http-server"
        volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: pvc-storage
      restartPolicy: Always
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim
spec:
  storageClassName: cheap
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
EOF
pvcStats
kubectl patch pvc claim -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'
sleep 5s
kubectl delete pods -l=app=test-nginx --context=gke_${PROJECT_ID}_${CLUSTER_REGION}_${CLUSTER_NAME}
sleep 10s
pvcStats
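On the regular channel I would expect both values to read "5Gi" after the pod is recreated; on rapid, given the behaviour described above, the final pvcStats output stays roughly:
Configured space
"5Gi"
Actual space
"3Gi"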
Welcome to Google Cloud Community!
I tried to reproduce your concern and I'm getting the same output. Per the GKE release channels documentation, the Rapid channel provides the newest GKE versions, and these versions are excluded from the GKE SLA and may contain issues without known workarounds.
To ensure the features and APIs in your configuration work as expected, I suggest using the Regular channel instead of the Rapid channel. The Rapid channel is designed for early access and experimentation, which means some features can be unstable or even temporarily disabled. By switching to the Regular channel, you'll be using a more stable environment that supports the components in your configuration.
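If you want to move to the Regular channel, something along these lines should work (the cluster name and region are placeholders), or you can pass the flag when creating a new cluster:
# switch an existing cluster to the Regular channel
gcloud container clusters update <cluster-name> --release-channel regular --region <region>
# or create a new cluster directly on the Regular channel
gcloud container clusters create <cluster-name> --release-channel regular --region <region>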
If you need further assistance, please don't hesitate to submit a ticket to our support team.
For further reference, please see the GKE release channels documentation.
Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.
You've hit upon a known and frustrating issue with GKE 1.32.x, specifically related to PVC resizing. The symptoms you're describing, where the PVC resize hangs with "waiting for external resizer" messages, are indicative of this problem.
Understanding the Issue
Troubleshooting and Workarounds
1. GKE Version Downgrade (If Possible)
2. Wait for GKE Patch
3. Manual Volume Resizing (Complex): see the sketch after this list
4. Create New PVCs and Migrate Data (Inconvenient)
5. Check for CSI Driver Issues (Less Likely)
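For the manual resizing option, a rough sketch, assuming the underlying PD name, node name, zone, and size below are placeholders and the filesystem is ext4 (the device path under /dev/disk/by-id is also an assumption and may differ on your nodes):
# grow the underlying persistent disk directly
gcloud compute disks resize <pd-name> --size=50GB --zone=<zone>
# then grow the filesystem online from the node that has the disk attached
gcloud compute ssh <node-name> --zone <zone> -- sudo resize2fs /dev/disk/by-id/google-<pd-name>
Note that the PVC object in Kubernetes will still report the old capacity, so this is only a stopgap until the resizer works again.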
Recommendations
Is this an issue affecting all 1.32.x versions of GKE?
It does seem to be resolved in the latest GKE version, 1.32.2-gke.165200.
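If you want to move a cluster onto a fixed version explicitly, something like this should work (the cluster name and region are placeholders, and the version must be available in your release channel):
# upgrade the control plane to the fixed version, then the node pool
gcloud container clusters upgrade <cluster-name> --master --cluster-version 1.32.2-gke.165200 --region <region>
gcloud container clusters upgrade <cluster-name> --region <region>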