
MySQL Database Crashing Daily at 12:00 AM UTC on Google Cloud SQL: Need Help to Diagnose and Resolve

Hello Google Cloud community,

We are facing a critical issue with our MySQL database on Google Cloud SQL, and we need assistance in diagnosing and resolving it. Our database has been crashing daily for the last 30 days, and the crashes occur at exactly 12:00 AM UTC (05:30 AM IST), when memory usage spikes to very high levels.

Project Details:

  • Database Version: MySQL 8.0.18
  • vCPUs: 8
  • Memory: 28 GB
  • SSD Storage: 825 GB
  • Cloud Platform: Google Cloud SQL

We have been monitoring the database closely, but we haven’t been able to pinpoint the exact cause. We believe a query, event, or some scheduled task might be triggering the memory spike at this exact time, leading to the crash.

What We Have Available:

  • General logs
  • Event logs
  • MySQL logs

We are looking for help in identifying:

  1. What is running or getting triggered in MySQL or Google Cloud SQL at 12:00 AM UTC that could cause the memory spike?
  2. Are there specific settings or configurations in MySQL or GCP that we should look at to optimize memory usage and prevent these crashes?
  3. What tools or techniques can we use to diagnose such time-specific issues on Google Cloud SQL?

What We’ve Tried:

  • We’ve reviewed the logs but haven’t identified a clear pattern.
  • Basic performance tuning and optimization have been applied, but the issue persists.

Any guidance, suggestions, or tips would be greatly appreciated! If additional logs or details are needed, we are happy to provide them.

Thank you!

1 ACCEPTED SOLUTION

I wanted to share an issue we experienced recently where the database crashed due to the innodb_buffer_pool_size setting. On the 17th, our database crashed while innodb_buffer_pool_size was set to around 13.3 GB. After the crash, we reduced innodb_buffer_pool_size to 12 GB the same day.

Interestingly, after making this change, the database did not crash on the 18th. However, at exactly 12:00 UTC on the 18th, we observed a significant spike, which can be seen in the image attached.

Has anyone else experienced something similar, where reducing the buffer pool size helped prevent crashes but led to performance spikes? Any recommendations for further fine-tuning to avoid such spikes in the future would be appreciated. (Attached: monitoring charts from 17th and 18th Sept 2024.)
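
For reference, the current setting and how full the buffer pool is can be inspected from any SQL client; this is just a read-only snapshot for context, not a tuning recommendation:

    -- Configured size in bytes
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- Live buffer pool statistics (POOL_SIZE is in pages, 16 KB each by default)
    SELECT POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES,
           MODIFIED_DATABASE_PAGES, PAGES_MADE_YOUNG, HIT_RATE
    FROM information_schema.INNODB_BUFFER_POOL_STATS;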


2 REPLIES

To troubleshoot this issue, your investigation should focus on identifying scheduled tasks or queries running at this exact time, analyzing logs, and optimizing memory usage and configuration settings.

First, it’s essential to check for any scheduled jobs or events in MySQL or the broader GCP environment that might trigger the crash. Using SHOW EVENTS can help identify MySQL events, and checking the event_scheduler variable confirms whether the event scheduler is enabled. Additionally, external factors like Cloud Scheduler or cron jobs from external applications or VMs should be investigated, as they might be interacting with the database around the same time.
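
As a concrete starting point, these checks can be run from any SQL client; your_schema below is just a placeholder for one of your own schemas:

    -- Is the MySQL event scheduler enabled?
    SHOW VARIABLES LIKE 'event_scheduler';

    -- Events defined in a specific schema (replace your_schema)
    SHOW EVENTS FROM your_schema;

    -- All events on the instance, with their schedules and last run time
    SELECT EVENT_SCHEMA, EVENT_NAME, STATUS, EVENT_TYPE,
           EXECUTE_AT, INTERVAL_VALUE, INTERVAL_FIELD, LAST_EXECUTED
    FROM information_schema.EVENTS
    ORDER BY EVENT_SCHEMA, EVENT_NAME;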

A detailed log analysis is also crucial. The MySQL error log (mysql.err) can reveal out-of-memory (OOM) errors, deadlocks, or long-running queries around 12:00 AM UTC. The slow query log, if enabled, will help identify resource-intensive queries that might spike memory usage. Google Cloud Logging can provide additional insights into system-level issues or external interactions with the database during the crash.
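
A minimal sketch of those checks from a SQL client follows; the 60-second threshold is an arbitrary example, and note that on Cloud SQL the slow query log is enabled through database flags rather than SET GLOBAL:

    -- Is the slow query log enabled, and with what threshold and output target?
    SHOW VARIABLES LIKE 'slow_query_log';
    SHOW VARIABLES LIKE 'long_query_time';
    SHOW VARIABLES LIKE 'log_output';

    -- Statements that have been running for more than 60 seconds right now
    -- (useful if polled just after 12:00 AM UTC)
    SELECT ID, USER, HOST, DB, COMMAND, TIME, STATE, LEFT(INFO, 120) AS query_snippet
    FROM information_schema.PROCESSLIST
    WHERE COMMAND <> 'Sleep' AND TIME > 60
    ORDER BY TIME DESC;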

In terms of performance, MySQL’s Performance Schema and Cloud SQL Insights are invaluable tools for identifying queries that consistently run around the crash time and consume significant resources. These tools allow for granular analysis of query performance, helping to pinpoint potential bottlenecks.
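
As a starting point for that digest-level analysis, a query along these lines (assuming the Performance Schema is enabled, which it is by default in MySQL 8.0) surfaces the heaviest statement patterns since the last restart:

    -- Top statement digests by total execution time, with temp-table pressure
    SELECT SCHEMA_NAME,
           LEFT(DIGEST_TEXT, 100)          AS digest_snippet,
           COUNT_STAR                      AS executions,
           ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_exec_seconds,  -- picoseconds to seconds
           SUM_CREATED_TMP_DISK_TABLES     AS tmp_disk_tables,
           SUM_ROWS_EXAMINED               AS rows_examined
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 20;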

Optimizing memory usage and configuration is key to preventing future crashes. For example, innodb_buffer_pool_size is typically set to around 60-80% of available RAM on a dedicated database server, leaving headroom for per-connection buffers and the operating system. Note that the query cache was removed entirely in MySQL 8.0, so it is not a factor here; memory pressure is more likely to come from the buffer pool, temporary tables, and per-connection buffers. In-memory temporary tables governed by tmp_table_size and max_heap_table_size should be monitored and adjusted if needed, though raising these values too far can itself lead to memory exhaustion, since the limits apply per temporary table rather than per server.
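
For a quick picture of where memory is actually going, the main settings and the sys schema views (installed by default in MySQL 8.0) can be checked like this; treat the output as a point-in-time snapshot rather than a full memory profile:

    -- Current values of the main memory-related settings
    SHOW VARIABLES WHERE Variable_name IN
      ('innodb_buffer_pool_size', 'tmp_table_size', 'max_heap_table_size',
       'sort_buffer_size', 'join_buffer_size', 'max_connections');

    -- Largest global memory consumers as instrumented by the Performance Schema
    SELECT event_name, current_alloc
    FROM sys.memory_global_by_current_bytes
    LIMIT 15;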

Connection management is another area of focus. Limiting max_connections and pooling connections in the application can prevent the server from being overwhelmed by concurrent connections. Since Cloud SQL does not provide built-in connection pooling, a pooler at the application layer is the practical way to cap concurrency and avoid the overhead of repeatedly opening new connections.
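
To see how close the instance gets to its connection ceiling around midnight UTC, the following can be polled shortly before and after the crash window:

    -- Configured limit vs. observed usage
    SHOW VARIABLES LIKE 'max_connections';
    SHOW STATUS LIKE 'Threads_connected';
    SHOW STATUS LIKE 'Max_used_connections';

    -- Which users/hosts hold the most open connections right now
    SELECT USER, SUBSTRING_INDEX(HOST, ':', 1) AS client_host, COUNT(*) AS connections
    FROM information_schema.PROCESSLIST
    GROUP BY USER, client_host
    ORDER BY connections DESC;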

Additionally, tools like Cloud Monitoring and Cloud Trace can help track memory and CPU usage, set alerts for critical thresholds, and analyze application-level performance issues. MySQL Workbench or other SQL clients with performance profiling features can assist in further optimizing query execution plans.
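
On the query-plan side, MySQL 8.0.18 (the exact version in use here) introduced EXPLAIN ANALYZE, which reports actual row counts and timings per plan step; the tables and filter below are purely illustrative placeholders:

    -- Measured execution plan for a suspect query (placeholder tables/columns)
    EXPLAIN ANALYZE
    SELECT o.id, o.created_at, c.email
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at >= NOW() - INTERVAL 1 DAY;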

Replication and recent changes should also be reviewed. If replicas experience the same crashes, it could indicate a data-related or systemic issue. Furthermore, examining recent changes to the database schema, application code, or Cloud SQL configuration may reveal clues as to what started causing the crashes.
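
If read replicas are in play, replica health on MySQL 8.0.18 is checked with SHOW SLAVE STATUS (the SHOW REPLICA STATUS alias only arrived in 8.0.22); run it on each replica from the mysql client and watch the lag and error fields:

    -- Key fields: Slave_IO_Running, Slave_SQL_Running,
    -- Seconds_Behind_Master, Last_IO_Error, Last_SQL_Error
    SHOW SLAVE STATUS\G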
