
Google Cloud SQL MySQL Crashed

This past weekend one of my Kubernetes clusters started reporting health check failures for all of my deployments. At first I thought this was a cluster issue, but after digging in a bit more I found the actual cause: my Google Cloud SQL (MySQL) instance had crashed and took about 10 minutes to come back online. All of my deployments connect to the SQL instance, which is why their health checks started to fail. Everything recovered on its own once the database was back up, but that took about 15 minutes in total.

I have looked over the SQL logs and metrics, and they show nothing out of the ordinary. The instance is configured with a shared core and does not have high availability enabled, since this is not a critical system. The metrics show no CPU, memory, or connection spikes of any kind, and none of them approach the instance's limits. The logs are empty for multiple days until the instance boots back up and reports an unexpected shutdown followed by crash recovery.

Luckily this is not a critical system; if it were, we would have more redundancy configured for the SQL instance. Still, I want to understand what happened. The logs I can see don't provide much information beyond the fact that the DB did not shut down correctly and that crash recovery took place. Any insight into a cause, or a way I can find more information about the cause, would be appreciated. Thanks!
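For reference, this is roughly how I have been pulling the instance's error-log entries out of Cloud Logging. It is only a rough sketch using the google-cloud-logging client library; the project ID and instance name are placeholders, not my real ones.

```python
from datetime import datetime, timedelta, timezone

from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client(project="my-project-id")  # placeholder project ID

# Look at the window around the crash; adjust to the actual incident time.
start = (datetime.now(timezone.utc) - timedelta(days=3)).isoformat()

# Cloud SQL exposes the MySQL error log as cloudsql.googleapis.com/mysql.err.
log_filter = (
    'resource.type="cloudsql_database" '
    'resource.labels.database_id="my-project-id:my-instance" '  # placeholder
    'logName="projects/my-project-id/logs/cloudsql.googleapis.com%2Fmysql.err" '
    f'timestamp>="{start}"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)
```

In my case this only shows the startup and crash-recovery messages mentioned above, with nothing in the days before the crash.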

 

2 REPLIES

Hello bobbake4,

I would suggest checking the logs inside MySQL itself. MySQL may have log files that record what caused the crash and give us a better understanding of what happened. Here is documentation on how to check your MySQL log files.
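If it helps, something like the sketch below (using PyMySQL; the host, user, and password are placeholders) can confirm which logs the instance has enabled and where MySQL thinks they live:

```python
import pymysql  # pip install pymysql

# Placeholder connection details -- use the instance's IP or the Cloud SQL proxy
# and a real database user.
conn = pymysql.connect(host="127.0.0.1", user="root", password="secret")

try:
    with conn.cursor() as cur:
        # Which logs are enabled and where they are written.
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'log_%'")
        for name, value in cur.fetchall():
            print(f"{name} = {value}")

        # Uptime in seconds -- a small value right after the incident
        # is another sign of an unplanned restart.
        cur.execute("SHOW GLOBAL STATUS LIKE 'Uptime'")
        print(cur.fetchone())
finally:
    conn.close()
```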

Great suggestion @dionv! I followed the instructions, but unfortunately I had not turned on SQL logging before the crash occurred, so I am not seeing any additional log events. I have updated the DB config to start logging additional information, so if this happens again I should have a bit more insight into what caused the failure.
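In case it is useful to anyone else, this is roughly how the flag change can be applied through the Cloud SQL Admin API. It is only a sketch using the google-api-python-client; the project and instance names are placeholders, and the particular flags are just the ones I chose to enable.

```python
from googleapiclient import discovery  # pip install google-api-python-client

# Uses Application Default Credentials for auth.
service = discovery.build("sqladmin", "v1")

project = "my-project-id"   # placeholder
instance = "my-instance"    # placeholder

# Enable the general and slow query logs and write them to log files.
# Note: patch replaces the whole databaseFlags list, so include any
# flags that are already set on the instance.
body = {
    "settings": {
        "databaseFlags": [
            {"name": "log_output", "value": "FILE"},
            {"name": "general_log", "value": "on"},
            {"name": "slow_query_log", "value": "on"},
        ]
    }
}

request = service.instances().patch(project=project, instance=instance, body=body)
response = request.execute()
print(response["name"], response["status"])  # the operation created by the patch
```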