Building High Availability into a Large Cloud SQL Fleet, Part 3: Automated Scaling, Upgrades, and Lessons Learned from Incidents

Managing a large fleet of Cloud SQL instances and keeping them resilient and well-functioning has been a journey filled with valuable lessons. This blog post, the third in the series, discusses effective management, automated scaling, and upgrades of the Cloud SQL database fleet.

Part 1: Monitoring and Consistent Configuration
Part 2: Optimizations and Design Choices
Part 3: Automated Scaling, Upgrades, and Lessons Learned from Incidents

This Building High Availability into a Large Cloud SQL Fleet blog series is also available as a single PDF technical white paper. Download it here.

At Rubrik, we rely on a multi-tenant architecture to store customer metadata in a large fleet of Cloud SQL database instances. With numerous global production deployments—each supporting multiple customer accounts—maintaining high availability, performance, and robustness across this infrastructure is critical.

Running this fleet at scale has taught us three lessons, which this post covers in turn:

Automated scaling adjusts instance sizes based on real-time metrics so that the fleet absorbs workload spikes.
Strategic upgrades improve performance while keeping operations reliable at scale.
Thorough incident investigation, paired with continuous refinement of our monitoring and resilience practices, minimizes downtime.

Automated Scaling and Instance Splitting

To maintain performance and manage workload spikes effectively, the team implemented an auto-scaling mechanism across our Cloud SQL fleet. Our monitoring systems continuously observe metrics such as CPU, memory usage, and query throughput, and we adjust instance sizes based on demand. When the workload rises to a sustained level that justifies more capacity, instances are automatically upscaled to handle the additional load, so a brief spike alone does not trigger a resize while a lasting traffic surge does.
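The "sustained level" rule above can be sketched as a small decision function. This is a minimal illustration, not our production logic: the names ScalingPolicy and should_upscale, the 80% CPU threshold, and the six-sample window are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScalingPolicy:
    # Hypothetical values for illustration only.
    cpu_threshold: float = 0.80   # CPU utilization (0..1) considered "hot"
    sustained_samples: int = 6    # consecutive hot samples required to upscale


def should_upscale(cpu_samples: Sequence[float],
                   policy: ScalingPolicy = ScalingPolicy()) -> bool:
    """Return True only if the most recent samples are all at or above the threshold.

    A single spike is ignored; the load has to stay high for the whole window.
    """
    if len(cpu_samples) < policy.sustained_samples:
        return False  # not enough history to call the load "sustained"
    recent = cpu_samples[-policy.sustained_samples:]
    return all(sample >= policy.cpu_threshold for sample in recent)
```

Requiring every sample in the window to be hot is the simplest way to express "sustained"; a real policy would likely also consider memory and query throughput, as described above.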

For instances that have grown too large due to increased data or a high volume of tenants, we employ instance splitting. This strategy involves redistributing data and services across multiple instances, reducing the burden on any single server. By splitting oversized instances into smaller, manageable units, we can maintain high performance while simplifying resource allocation and troubleshooting.
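One way to picture the redistribution step is as a balancing problem over tenants. The sketch below uses a simple greedy heuristic (largest tenant first, onto the currently lightest target); the function name split_tenants and the use of a single "size" number per tenant are assumptions for illustration, and an actual split also involves data migration and service rerouting that are not shown.

```python
from typing import Dict, List


def split_tenants(tenant_sizes: Dict[str, int], parts: int = 2) -> List[List[str]]:
    """Assign tenants to `parts` target instances, balancing total size.

    Greedy heuristic: place the largest remaining tenant on the target
    that currently holds the least data.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    groups: List[List[str]] = [[] for _ in range(parts)]
    loads = [0] * parts
    # Sort by size descending; break ties by name so the result is deterministic.
    ordered = sorted(tenant_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for tenant, size in ordered:
        target = loads.index(min(loads))  # lightest target so far
        groups[target].append(tenant)
        loads[target] += size
    return groups
```

For example, split_tenants({"a": 50, "b": 30, "c": 20, "d": 10}) places "a" and "d" on one target and "b" and "c" on the other.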

Staying Ahead of the Upgrade Curve

In the fast-evolving database technology environment, staying ahead of the upgrade curve is crucial for maximizing performance and leveraging the latest innovations. In early 2023, we undertook a comprehensive upgrade of our entire Cloud SQL fleet from MySQL 5.7 to MySQL 8.0. Our commitment to continuous improvement and operational excellence drove this strategic move.

Upgrading to MySQL 8.0 unlocked features and performance enhancements that have had a significant impact on our database operations. Notably, instant column addition (ALGORITHM=INSTANT) significantly reduces the time required for schema changes, because adding a column becomes a metadata change rather than a table rebuild.
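As a rough illustration, a schema-change tool might build the DDL with an explicit ALGORITHM=INSTANT clause, so that MySQL returns an error instead of silently falling back to a slower algorithm when the instant path is not available. The helper name instant_add_column_sql and the table and column names in the usage note are hypothetical.

```python
import re

# Conservative identifier check; this sketch rejects anything unusual.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def instant_add_column_sql(table: str, column: str, column_type: str) -> str:
    """Build an ALTER TABLE statement that requests the INSTANT algorithm.

    `column_type` is treated as trusted input here (for example "INT NULL").
    """
    for name in (table, column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    return (
        f"ALTER TABLE `{table}` "
        f"ADD COLUMN `{column}` {column_type}, ALGORITHM=INSTANT"
    )
```

Calling instant_add_column_sql("accounts", "region", "VARCHAR(32) NULL") yields a statement that can be reviewed and run through whatever migration process is already in place.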

MySQL 8.0 also offers increased throughput, which translates to faster data processing and improved overall system performance. As our operations grow, these improvements allow us to manage increased transaction volumes smoothly while ensuring high levels of availability and reliability.

Incident Handling: Learning from Challenges

Our operational philosophy centers on learning from every database incident: we investigate thoroughly to pinpoint and resolve root causes whenever possible. This approach goes beyond reacting to individual incidents and builds a culture of continuous improvement and system resilience.

In cases where fully eliminating a root cause isn't feasible, we focus on developing automated mitigation strategies. These safeguards enhance incident management, ensuring our systems stay resilient and robust despite challenges.

For instance, runaway growth of the InnoDB history list length (HLL) can be mitigated by automatically throttling write rates. With innodb_max_purge_lag and innodb_max_purge_lag_delay configured, InnoDB delays DML operations once the purge backlog exceeds the configured threshold, up to a maximum delay. This gives purge threads time to catch up and prevents excessive HLL buildup.
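To give a feel for how such a throttle behaves, here is a simplified model of the delay applied to each DML statement. It loosely follows the formula described in older MySQL reference manuals (roughly ten times the backlog ratio, minus five, in milliseconds); the exact computation is version-dependent and should be checked against the documentation for the MySQL release in use. The function name is hypothetical, and the model only covers a positive delay cap.

```python
def dml_delay_microseconds(history_list_length: int,
                           max_purge_lag: int,
                           max_purge_lag_delay: int) -> int:
    """Approximate per-statement DML delay, in microseconds.

    Simplified model (an assumption, not the server's exact code):
    no delay until the history list exceeds max_purge_lag, then a delay
    that grows with the backlog ratio and is capped by max_purge_lag_delay.
    """
    if max_purge_lag_delay <= 0:
        raise ValueError("this sketch only models a positive delay cap")
    if max_purge_lag <= 0:
        return 0  # throttling disabled
    ratio = history_list_length / max_purge_lag
    if ratio <= 1.0:
        return 0  # purge is keeping up
    delay = int((ratio - 0.5) * 10_000)
    return min(delay, max_purge_lag_delay)
```

Under this model, a backlog at twice the threshold delays each statement by about 15 ms, and the cap keeps a very large backlog from stalling writes indefinitely.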

Furthermore, each incident provides valuable insights that drive enhancements to our monitoring systems. We continuously refine our approach by creating new alerts and introducing additional metrics, ensuring that we have the visibility needed to detect and respond to potential issues before they escalate. This iterative process strengthens our operational framework and helps us stay one step ahead of future challenges.

Ultimately, these incidents contribute to our overall system resilience, enabling us to minimize downtime and maintain high levels of service availability. By treating each challenge as an opportunity for growth, we not only bolster our current operations but also pave the way for a more stable and efficient future.

What’s Next?

At Rubrik, high availability is more than just preventing downtime; it reflects our dedication to continuous improvement. Through detailed monitoring, proactive optimization, and thorough incident management, we keep our large Cloud SQL fleet resilient, scalable, and capable of serving our global customers’ needs.

By sharing our strategies and lessons learned, we hope to encourage others to implement similar approaches, supporting their efforts to achieve robust high availability.

Moving forward, our strategy emphasizes three areas: enhanced workload separation; controls that limit resource consumption per query and per database; and frameworks that optimize operations such as data cleanup, pagination, backfill, and bulk updates. All three aim to maintain efficiency and performance as our operations scale.

Acknowledgments

This journey would not have been possible without the relentless efforts of our incredible platform database and SRE teams and DB delegates. A heartfelt thanks to platform database team members Rajorshi, Travis, Gabriel, Rahul, Yashwanth, Sudip, Anmol, Gurneet, and Hardik, and to SRE team members Prabudas, Suraj, and Mihir, whose expertise and dedication have been instrumental in scaling Rubrik's database fleet to support our ever-growing ARR. Your commitment to excellence and innovation is what drives us forward. Thank you for being the backbone of our success!
