
Technology-specific documentation on GKE

Good Morning

I have a project starting to host an application on GKE. This is a private cluster, with staff-only (on-premise) connectivity via Interconnect, so there is no internet inbound or outbound. I would like to check which HA/DR and security aspects I should go through other than IAM. I am looking for technical, infrastructure-specific documentation:

1. Documentation on security for GKE (private Standard mode cluster).

2. Infrastructure documentation on regional (multi-zonal) GKE HA and DR, covering pod replicas, storage, failover, and load balancers. I could not find a single diagram showing a GKE cluster with nodes and pod replicas in different zones, and its interface with downstream systems like middleware and databases via load balancers and Ingress.

Please advise.

Solved
2 ACCEPTED SOLUTIONS

Hi @avindia ,

Welcome to Google Cloud Community!

Here are some HA/DR and security considerations for your private GKE Standard mode cluster connected via Interconnect.

  1. Security for Private GKE Standard Mode Clusters 

Since you're operating a private cluster with no internet access, security is paramount. Here's a layered approach, along with relevant documentation pointers:

  • Network Security:
    • VPC Network and Subnets: Properly segment your VPC. Use dedicated subnets for your GKE nodes, services, and any associated infrastructure (such as a jump host). Avoid overly permissive firewall rules.
    • Firewall Rules: Implement strict ingress and egress firewall rules. Only allow necessary traffic. Think in terms of a "least privilege" network model. Specifically, for your on-premise connectivity via Interconnect:
      • Allow ingress from your on-premise network's IP range to the GKE node IP range.
      • Allow egress from the GKE node IP range to your on-premise network's IP range (if communication back to on-premise is needed).
      • Consider using Service Accounts for Node Pools to further restrict outbound traffic.
    • Private Service Access (PSA): This is critical for connecting to Google Cloud services (like Cloud SQL, Cloud Storage, etc.) without exposing them to the internet. Use PSA to establish a private network connection between your VPC and the Google service producer network. Allocate a dedicated IP range for PSA.
    • Network Policy: GKE Network Policies are Kubernetes resources that control traffic between pods. Use them to isolate applications and services within the cluster.
  • Cluster Security:
    • Private Cluster Configuration: Ensure your GKE cluster is explicitly created as a private cluster. --enable-private-nodes gives nodes internal IP addresses only, and --enable-private-endpoint disables the public endpoint for the control plane.
    • Control Plane Access: Control how you access the Kubernetes API server (the control plane). You'll likely access it from your on-premise network. Authorize specific IP ranges from your on-premise network for API server access during cluster creation.
    • Node Security:
      • Shielded VMs: Consider using Shielded VMs for your node pools to protect against boot-level and kernel-level attacks.
      • Container-Optimized OS (COS): COS is hardened and optimized for running containers, reducing the attack surface.
    • Workload Identity: Use Workload Identity to allow your applications running in GKE to securely access Google Cloud services. Workload Identity binds Kubernetes service accounts to Google service accounts, eliminating the need to store and manage service account keys.
    • Managed Certificate Authority Service (CAS): Consider using CAS to issue certificates for your services within the cluster. This simplifies certificate management and improves security.
    • Image Security:
      • Artifact Registry: Use Artifact Registry to store your container images (it replaces the deprecated Container Registry). Scan your images for vulnerabilities during the build process.
      • Binary Authorization: Implement Binary Authorization to ensure that only trusted images are deployed to your cluster.
  • Application Security:
    • Service Mesh (Istio): Consider implementing a service mesh like Istio for advanced traffic management, security (mTLS), and observability. This can greatly enhance your security posture.
    • Best practices for GKE RBAC: Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within your organization.
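
As a sketch of how several of the cluster-security items above come together at creation time, the command below creates a regional private Standard cluster with private nodes, a private control-plane endpoint, on-premise-only API access, Shielded COS nodes, Workload Identity, and network policy enforcement. This is illustrative only: the cluster name, region, CIDR ranges, and PROJECT_ID are placeholders you would replace with your own values.

```sh
# Illustrative sketch only: names, region, CIDRs, and PROJECT_ID are placeholders.
gcloud container clusters create my-private-cluster \
    --region=us-central1 \
    --enable-ip-alias \
    --enable-private-nodes \
    --enable-private-endpoint \
    --master-ipv4-cidr=172.16.0.0/28 \
    --enable-master-authorized-networks \
    --master-authorized-networks=10.0.0.0/8 \
    --image-type=COS_CONTAINERD \
    --shielded-secure-boot \
    --shielded-integrity-monitoring \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --enable-network-policy
```

With --enable-private-endpoint set, the API server is reachable only from within the VPC (and, via Interconnect, from the authorized on-premise range given to --master-authorized-networks), which matches your no-internet requirement.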
  2. Regional (Multi-Zonal) GKE HA and DR

Addressing the HA and DR aspects:

  • Regional Clusters: Create a regional GKE cluster. This distributes the control plane and nodes across multiple zones within a region. This provides high availability if one zone fails.
  • Node Pools: Create node pools that span multiple zones. GKE will automatically distribute nodes across those zones.
  • Pod Replicas: Use Kubernetes Deployments or ReplicaSets to ensure you have multiple replicas of your pods running across different nodes (and therefore, ideally, different zones). Use podAntiAffinity to strongly encourage pod replicas to be scheduled on different nodes and zones. Inter-pod affinity and anti-affinity can be even more useful when they are used with higher level collections such as ReplicaSets, StatefulSets, Deployments, etc.
  • Storage:
    • Persistent Volumes (PVs) and Persistent Volume Claims (PVCs): Use Persistent Volumes (PVs) to provision storage. For regional HA, use regional persistent disks. These disks are replicated across multiple zones within the region. PVCs allow your pods to request storage from the PVs.
    • Consider Cloud SQL or other managed databases: If you can, use Cloud SQL with its built-in HA and replication features. This simplifies your database management.
  • Load Balancing and Ingress:
    • Internal Load Balancer: Since you're private, you'll primarily use the Internal Load Balancer. Create a Kubernetes Service of type LoadBalancer. GKE will automatically provision an internal load balancer in your VPC. The internal load balancer distributes traffic to your pods across all nodes in the cluster, regardless of zone.
    • Ingress: Use Ingress to expose your services to external (on-premise) clients. The Ingress controller will automatically configure the load balancer to route traffic based on the Ingress rules.
    • Health Checks: The load balancer uses health checks to determine which pods are healthy and can receive traffic. Make sure your applications expose a health check endpoint.
  • Failover:
    • Automatic Failover: If a zone fails, GKE will automatically reschedule pods from the failed zone to healthy zones within the region. The load balancer will automatically stop sending traffic to the failed pods.
    • Regional Persistent Disks: If you're using regional persistent disks, data will remain available even if one zone fails.
    • DR Strategy (Beyond HA): HA within a region is different from Disaster Recovery. For DR, you'd need to replicate your cluster and data to a different region. This involves more complex setup using tools like Velero for cluster backup and restore, and cross-region replication for databases. Consider your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) when designing your DR strategy.
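
To make the pod-replica and health-check points above concrete, here is a minimal Deployment sketch. All names, labels, and the image path are placeholders; it spreads replicas across zones with podAntiAffinity on the zone topology key and exposes a readiness endpoint for load-balancer health checks.

```yaml
# Illustrative sketch only: names, labels, and image path are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Prefer placing replicas in different zones; use
          # requiredDuringSchedulingIgnoredDuringExecution for a hard guarantee.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-app
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: my-app
          image: us-docker.pkg.dev/PROJECT_ID/my-repo/my-app:latest
          readinessProbe:
            # Health-check endpoint the load balancer relies on
            httpGet:
              path: /healthz
              port: 8080
```

With preferredDuringScheduling, the scheduler still places pods if a zone is unavailable; the required variant would instead leave replicas pending, which is usually the wrong trade-off during a zone outage.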

Conceptual Diagram 

[Diagram: regional GKE cluster with nodes and pod replicas across three zones, an internal load balancer, Interconnect to the on-premise network, and Cloud SQL reached via Private Service Access]

Explanation of the Diagram:

 

  • Zones: The diagram shows three zones (A, B, and C) within a single Google Cloud region.
  • GKE Nodes: GKE nodes (VMs) are distributed across the zones. These nodes are part of the regional GKE cluster's node pools.
  • Pod Replicas: Pod replicas (instances of your application) are running on different nodes and, ideally, in different zones. The podAntiAffinity setting helps achieve this distribution.
  • Internal Load Balancer: A Kubernetes Service of type LoadBalancer creates an internal load balancer, which distributes traffic across all healthy pod replicas regardless of which zone they are in. This internal LB is what your on-premise staff use to reach the application's services.

  • On-Premise Network: Your on-premise network is connected to your VPC via Interconnect. Traffic from on-premise clients flows through the Interconnect to the internal load balancer.
  • Cloud SQL: The diagram also shows a connection to Cloud SQL. You would use Private Service Access for this connection.
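
The internal load balancer in the diagram corresponds to a Service manifest along these lines (the Service name and selector label are placeholders; the annotation is the GKE one for provisioning an internal load balancer):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-internal-lb        # placeholder name
  annotations:
    # Tells GKE to provision an internal (VPC-only) load balancer
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  selector:
    app: my-app                   # placeholder label matching your Deployment
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

On-premise clients reach the Service's internal IP over the Interconnect; no external IP is ever allocated, so nothing is exposed to the internet.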

 

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

 


Thanks for the detailed response. I love this Google way!


