At Rubrik, we rely on a multi-tenant architecture to store customer metadata in a large fleet of Cloud SQL database instances. With numerous production deployments globally, each supporting multiple customer accounts, maintaining high availability, performance, and robustness across this infrastructure is critical.

Managing a large fleet of Cloud SQL instances and ensuring they remain resilient and performant has been a journey filled with valuable lessons. So I’ve written a three-part blog series to share some of the best practices our team developed along the way with other technical practitioners and strategists. The series covers:
Design Choices and Optimizations
Automated Scaling, Upgrades, and Lessons Learned from Incidents (available February 2025)
Early and Proactive Optimization for Optimal Query Performance
Proactively optimizing queries during development and refining them in production reduces costly fixes, boosts developer productivity, and ensures seamless database performance at scale. So our team at Rubrik uses an approach that improves query performance at every step of the development process.

Shift-Left: Optimizing During Development
At Rubrik, we believe in catching potential database issues early in the development process, an approach known as the shift-left mindset. This means our Database Administrators (DBAs) review table designs and query structures during the development phase, rather than after deployment (when fixes can be costly and disruptive).
To support this, our engineering-wide design document template includes a dedicated section for database design. Here, developers are required to detail aspects such as the data model, expected workload, data growth projections, cleanup criteria, performance requirements, and security considerations. This practice ensures that database architecture is thoughtfully integrated from the outset and aligned with long-term operational needs.
Additionally, we’ve implemented the DB Delegate program, where volunteers from various feature teams are trained to perform schema change reviews. These delegates act as additional eyes and ears for the DBAs, ensuring that schema changes adhere to best practices and operational requirements before they are implemented. This distributed approach to schema reviews not only strengthens our shift-left strategy but also fosters cross-functional collaboration, empowering feature teams to take a more active role in database performance and scalability.
AI-Driven Query Optimization in Code Reviews
We’ve integrated automation into our code review process: newly added queries are automatically fed into an AI-driven optimization engine that suggests improvements to performance, indexing, and structure. By receiving real-time feedback during code review, developers can refine queries before they reach production, reducing the risk that suboptimal query patterns impact performance at scale. This shift-left approach empowers our teams to maintain high performance as the system grows and evolves.
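To make the feedback concrete, here is a hedged sketch of the kind of check such an engine (or a reviewer acting on its suggestion) can run; the table, columns, and index name are hypothetical:

    -- Inspect the plan for a newly added query (MySQL 8.0+):
    EXPLAIN FORMAT=TREE
    SELECT id, status
    FROM   snapshot_metadata
    WHERE  account_id = 42
      AND  created_at >= NOW() - INTERVAL 7 DAY;

    -- If the plan shows a full table scan, apply the suggested composite
    -- index and re-run EXPLAIN to confirm it switches to an index range scan:
    ALTER TABLE snapshot_metadata
      ADD INDEX idx_account_created (account_id, created_at);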
Managing Slow Queries in Production
Proactive query optimization doesn’t stop with development. To ensure our databases remain efficient under real-world loads, we analyze slow queries daily using automated tools, which systematically classify and prioritize the most impactful ones and present them to the relevant teams for review and optimization.
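As a sketch of the kind of triage this tooling automates, the most impactful statement patterns can be ranked from MySQL’s statement digests (assuming performance_schema is enabled; the limit is illustrative):

    -- Top 20 statement patterns by total time spent (timers are picoseconds):
    SELECT digest_text,
           count_star                      AS executions,
           ROUND(sum_timer_wait / 1e12, 2) AS total_latency_s,
           ROUND(avg_timer_wait / 1e9, 2)  AS avg_latency_ms
    FROM   performance_schema.events_statements_summary_by_digest
    ORDER  BY sum_timer_wait DESC
    LIMIT  20;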
This continuous review process enables us to identify and address potential performance bottlenecks before they affect customers. By optimizing query execution paths, adjusting indexes, and eliminating inefficiencies, we keep our databases running smoothly, ultimately contributing to a fast and seamless customer experience.
Together, these early and proactive optimization strategies create a resilient foundation for scalable performance, allowing us to meet the growing demands of our multi-tenant environment without compromising on speed or reliability.
Design Choices Matter
In a large-scale, multi-tenant database environment, smart design choices prevent performance bottlenecks and ensure long-term scalability. At Rubrik, we carefully evaluate architectural trade-offs to optimize efficiency, reliability, and maintainability.
Here are three key areas where our design decisions have made a significant impact:
Optimal Metadata Cleanup
Efficient metadata cleanup is essential for maintaining database performance in a multi-tenant environment. At Rubrik, we leverage MySQL partitioned tables to simplify this process, converting traditionally heavy DML operations (like bulk deletes) into streamlined DDL operations. We can drop partitions instead of deleting rows, which significantly reduces server load. A single ALTER TABLE statement can efficiently drop millions of rows, making cleanup operations faster and less resource-intensive.
To further reduce the impact of these operations, we schedule DDL-based cleanups during off-peak hours, ensuring that they do not disrupt customer-facing services.
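As a minimal sketch (the table and partition names are hypothetical), the cleanup reduces to a single DDL statement:

    -- Dropping an expired partition is fast DDL, regardless of row count:
    ALTER TABLE job_events DROP PARTITION p20250101;

    -- The equivalent DML would have to scan and delete rows one by one:
    -- DELETE FROM job_events WHERE created_at < '2025-01-02';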
Our team also developed a partition management framework that lets feature teams adopt partitioning by specifying just two parameters: interval (daily, weekly, monthly, or yearly) and retention period (e.g., 7 days, 6 months, or 2 years). This framework simplifies the adoption of partitioned tables across different features and ensures that large data tables are efficiently managed throughout their lifecycle.
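The framework’s actual interface isn’t shown here, but for a daily interval with a 7-day retention period it might generate DDL along these lines (table and partition names are hypothetical):

    -- A table partitioned by day on its timestamp column:
    CREATE TABLE job_events (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created_at DATETIME        NOT NULL,
        payload    JSON,
        PRIMARY KEY (id, created_at)   -- partition key must be in the PK
    )
    PARTITION BY RANGE COLUMNS (created_at) (
        PARTITION p20250101 VALUES LESS THAN ('2025-01-02'),
        PARTITION p20250102 VALUES LESS THAN ('2025-01-03'),
        PARTITION pmax      VALUES LESS THAN (MAXVALUE)
    );

    -- Daily maintenance: carve tomorrow's partition out of pmax, then drop
    -- the partition that has aged past the 7-day retention window:
    ALTER TABLE job_events REORGANIZE PARTITION pmax INTO (
        PARTITION p20250103 VALUES LESS THAN ('2025-01-04'),
        PARTITION pmax      VALUES LESS THAN (MAXVALUE)
    );
    ALTER TABLE job_events DROP PARTITION p20250101;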
Connection Management and Query Caching
Managing connections and optimizing query flow are critical in high-traffic, multi-tenant environments. We rely on ProxySQL to multiplex connections, which helps reduce the overhead associated with frequent connection opening and closing. ProxySQL effectively reduces connection churn on the database server, allowing it to handle higher loads with greater stability.
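Backends and users are configured through ProxySQL’s admin interface; in this hedged sketch the hostnames, credentials, and hostgroup IDs are hypothetical (multiplexing is on by default, and transaction_persistent keeps in-flight transactions pinned to one backend connection so it stays safe):

    -- Register the Cloud SQL backend and an application user:
    INSERT INTO mysql_servers (hostgroup_id, hostname, port, max_connections)
    VALUES (10, 'cloudsql-primary.internal', 3306, 500);

    INSERT INTO mysql_users (username, password, default_hostgroup,
                             transaction_persistent, max_connections)
    VALUES ('app_user', 'app_password', 10, 1, 2000);

    -- Apply and persist the configuration:
    LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
    LOAD MYSQL USERS TO RUNTIME;   SAVE MYSQL USERS TO DISK;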
In addition, ProxySQL’s query caching capabilities significantly reduce the number of repetitive queries sent to the database. This reduces the workload on Cloud SQL instances by caching frequently accessed query results, thereby improving response times and allowing the databases to dedicate resources to more critical or unique queries.
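Caching is enabled per query rule; in this sketch the rule ID, digest pattern, and TTL are illustrative (cache_ttl is in milliseconds):

    -- Cache matching result sets for 5 seconds so repeated identical reads
    -- are served from ProxySQL instead of reaching Cloud SQL:
    INSERT INTO mysql_query_rules (rule_id, active, match_digest, cache_ttl, apply)
    VALUES (100, 1, '^SELECT .* FROM account_settings', 5000, 1);

    LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;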
Workload Isolation: Reducing Noise for Optimal Performance
To tackle the "noisy neighbor" problem, where high-intensity workloads impact other operations on the same instance, we strategically separated different types of workloads into dedicated instances. Specifically, we moved our write-heavy job framework workload onto its own set of instances, isolating it from the customer metadata workload. This design choice allowed us to prevent the intensive, often bursty job framework operations from interfering with the performance of metadata-related transactions, which require consistent, low-latency access.
By isolating these workloads, we eliminated resource contention and significantly improved the stability and predictability of both environments. This separation also enables more targeted scaling and optimization for each workload type, ensuring that each receives the resources and configurations best suited to its unique requirements.
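One way to steer each workload to its own set of instances is a routing rule at the proxy layer; this hedged sketch uses hypothetical hostgroups, hostnames, and match patterns, and isn’t necessarily how our routing is implemented:

    -- Dedicated instances for the write-heavy job framework:
    INSERT INTO mysql_servers (hostgroup_id, hostname, port)
    VALUES (20, 'cloudsql-jobs-primary.internal', 3306);

    -- Send job-framework traffic to hostgroup 20; everything else continues
    -- to the default (metadata) hostgroup:
    INSERT INTO mysql_query_rules (rule_id, active, match_digest,
                                   destination_hostgroup, apply)
    VALUES (200, 1, 'job_framework_', 20, 1);

    LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
    LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;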
What’s Next?
Our next installment in this series will focus on automated scaling, upgrades, and lessons learned from incidents. I’ll demonstrate how we use automated scaling, how we handle upgrades, and what lessons we’ve learned from performance- and availability-related incidents.
If you don’t want to wait for the next installment, you can always download the series as a single PDF.