Monday, March 30, 2026

Zero-Downtime Patching in Lakebase Part 1: Prewarming


Ensuring customer databases are always available is one of the most important things we do in Lakebase. We've designed the system with redundancy at every level, automatically failing over and recovering your database in the event of hardware or software failures.

In a large-scale system, such unplanned failures are a statistical expectation, but for an individual database, they're not that common. For an individual database, planned maintenance tends to cause more workload disruption. After all, a typical database is patched more frequently than it experiences hardware failure.

Today, nearly every database provider operates with maintenance windows: periods where your database severs all active connections and gets updated and restarted, in a process that can take anywhere from a few seconds to minutes. While Lakebase lets you schedule updates at a time that is optimal for you, it's still a brief interruption when it happens.

We think we can do better. This blog post is the first in a series on how we're leveraging the Lakebase architecture to eliminate the impact of planned maintenance entirely. Our goal: make version updates and security patches completely unnoticeable.

In this post, we'll cover prewarming: a technique that prevents the performance degradation that follows a database restart. In future posts, we'll discuss improvements to the failover process itself and further optimizations that bring us closer to true zero-downtime patching.

The Problem with Cold Restarts

The challenge with restarting PostgreSQL is that in-memory caches (especially the buffer cache and local file cache) are lost. Even though the database is back online very quickly (1 second at P99), the workload may experience a slowdown in the first minutes after restart – we observed a ~70% reduction in pgbench TPS. This is due to a low cache hit ratio while data is read back from storage and the cache warms up. While this may seem like only a performance problem, it can become an availability issue if the slowdown is severe enough that the database cannot keep up with the workload and timeouts occur.
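To build intuition for why a cold cache hurts throughput this much, here is a toy model (not Lakebase internals; all latency numbers are invented for illustration) of how the cache hit ratio drives the average page-read latency:

```python
# Toy model: average page-read latency as a function of cache hit ratio.
# The latencies below are illustrative assumptions, not measured values.
CACHE_HIT_US = 10        # assumed cost of reading a page already in cache
STORAGE_READ_US = 500    # assumed cost of fetching a page from storage

def effective_read_us(hit_ratio: float) -> float:
    """Expected per-page read latency for a given cache hit ratio."""
    return hit_ratio * CACHE_HIT_US + (1 - hit_ratio) * STORAGE_READ_US

warm = effective_read_us(0.99)   # steady state: nearly everything cached
cold = effective_read_us(0.92)   # shortly after a cold restart

# Throughput is roughly inversely proportional to per-read latency, so
# even a modest drop in hit ratio cuts throughput by several times.
print(f"warm: {warm:.0f} us/read, cold: {cold:.0f} us/read")
print(f"cold-cache throughput: {warm / cold:.0%} of warm")
```

With these invented numbers, even a hit ratio that only falls from 99% to 92% leaves the database at roughly 30% of its warm throughput, which is in the same ballpark as the ~70% TPS drop described above.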

Techniques to address this exist in Postgres: pg_prewarm can be used to warm up buffer caches. However, this runs after a restart, when the workload is already impacted. Streaming replication can be used to set up a replica, which can be prewarmed before failing over to it (promoting it to primary). However, this requires creating a full replica and carefully orchestrating the prewarming before failover.
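As a sketch of the manual approach, the pg_prewarm extension is invoked with one SQL call per relation; the relation names below are examples from a pgbench schema, and the statements would be run via psql or any client after the extension is installed:

```python
# Sketch: build the pg_prewarm calls you would run manually after a
# restart. Relation names are illustrative (a pgbench schema).
def prewarm_statements(relations):
    """One pg_prewarm call per relation; each reads it into the buffer cache."""
    return [f"SELECT pg_prewarm('{rel}');" for rel in relations]

stmts = prewarm_statements(["pgbench_accounts", "pgbench_accounts_pkey"])
for s in stmts:
    print(s)

# Run against the database, e.g.:
#   CREATE EXTENSION IF NOT EXISTS pg_prewarm;
#   SELECT pg_prewarm('pgbench_accounts');
# The drawback noted above: this can only run after the restart, while
# the workload is already paying cold-cache latencies.
```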

Prewarming on the Lakebase Architecture

In the Lakebase architecture, we combine stateless, elastic compute nodes with disaggregated, shared storage. The compute nodes employ local caches to deliver maximum performance without sacrificing serverless properties. While the cache faces the same cold-start issues outlined above, we have more options with the Lakebase architecture.

Since Lakebase's Postgres compute replicas are stateless, we can spin them up and down on demand. We take advantage of this and combine it with automatic prewarming on planned restarts to minimize the performance impact on the workload. Here's how it works:

  1. A new version of Lakebase's Postgres compute image becomes available. You receive a notification and can schedule the restart for a time that works for you.
  2. Shortly before the scheduled time, our control plane spins up a new Postgres compute in the background. You don't see it, and you're not billed for it. The current primary's workload is unaffected.
  3. A list of pages in the current primary's cache is sent to the new compute. The new compute loads these pages into its cache from our shared storage tier without impacting the primary.
  4. The new compute subscribes to the WAL (write-ahead log) to keep its cache up to date. For efficiency, unlike a standard Postgres replica, it can ignore all WAL records that don't affect its cache. It gets the WAL from our Safekeepers, putting no additional load on the primary compute.
  5. When prewarming is complete, we quickly shut down the old primary, promote the new compute to primary, and swap it in. Promotion uses the standard pg_promote from OSS Postgres and doesn't restart the database server.
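The steps above can be sketched as a small simulation (this is an illustrative model, not Lakebase's actual implementation): snapshot the primary's cached page list, load those pages into a fresh compute from shared storage, replay only the WAL records that touch cached pages, then promote.

```python
# Illustrative simulation of the prewarmed-restart flow described above.
shared_storage = {page: f"v0:{page}" for page in range(100)}  # page -> contents

class Compute:
    def __init__(self):
        self.cache = {}          # local page cache
        self.is_primary = False

    def load_from_storage(self, pages):
        for p in pages:
            self.cache[p] = shared_storage[p]

    def apply_wal(self, wal):
        # Unlike a normal replica, records touching uncached pages are skipped.
        for page, contents in wal:
            if page in self.cache:
                self.cache[page] = contents

old_primary = Compute()
old_primary.is_primary = True
old_primary.load_from_storage(range(50))     # its current working set

# Steps 2-4: spin up a new compute, prewarm it, stream WAL to it.
new_compute = Compute()
hot_pages = list(old_primary.cache)          # page list sent to the new compute
new_compute.load_from_storage(hot_pages)
wal = [(3, "v1:3"), (80, "v1:80")]           # page 80 isn't cached -> ignored
new_compute.apply_wal(wal)

# Step 5: swap. The new compute starts with a warm, up-to-date cache.
old_primary.is_primary, new_compute.is_primary = False, True
print(new_compute.cache[3], len(new_compute.cache))  # -> v1:3 50
```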

Before:

After:

With the Lakebase architecture, you get this at no extra cost, without paying for additional replicas. As of today, all planned restarts of read/write endpoints are performed this way, without you having to do anything. Soon we'll be extending it to read-only endpoints as well.

Results

To measure the impact of cold caches, we ran a 10 GB pgbench workload (scale factor 670) on a database while restarting it – first with prewarming enabled, then without prewarming. The first chart shows a read-only workload (pgbench "select only"), while the second shows a read-write workload (pgbench "simple update").

[Chart: read-only workloads perform better after restarting with a prewarmed cache]
[Chart: read-write workloads perform better after restarting with a prewarmed cache]

In both cases, we see that throughput recovers almost instantly with prewarming. Without prewarming, recovery is much slower while the cold cache warms up. The difference is starkest for the read-only workload because prewarming improves the cache hit ratio, which helps reads proportionally more than writes.
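A rough back-of-the-envelope model (invented numbers, not from the benchmark) shows why the read-only gap is larger: writes pay a WAL/commit cost that a warm cache does not remove, so the cache-sensitive read portion is a smaller fraction of a read-write transaction.

```python
# Illustrative model of why prewarming helps read-only workloads more.
# All costs are invented for illustration.
READ_COLD_US, READ_WARM_US = 400, 15   # per-transaction read cost (cold vs warm)
WRITE_US = 200                          # WAL + commit cost, cache-independent

ro_speedup = READ_COLD_US / READ_WARM_US
rw_speedup = (READ_COLD_US + WRITE_US) / (READ_WARM_US + WRITE_US)
print(f"read-only speedup from a warm cache: {ro_speedup:.1f}x")
print(f"read-write speedup from a warm cache: {rw_speedup:.1f}x")
```

The read-only transaction speeds up by the full read-latency ratio, while the fixed write cost dilutes the benefit for the read-write transaction.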
