Zero-Downtime Deployments: Blue-Green, Canary, and Rolling

Downtime during deployments is unacceptable for modern applications. Users expect 24/7 availability, and every minute of downtime costs revenue and trust. Zero-downtime deployment isn't a luxury — it's a requirement. This guide covers the three main strategies, their trade-offs, and the often-overlooked challenge of database schema changes during zero-downtime deployments.

Strategy 1: Blue-Green Deployment

Maintain two identical environments: Blue (current production) and Green (new version). Deploy to Green, run health checks and smoke tests, then switch the load balancer from Blue to Green. If anything goes wrong, switch back instantly. The trade-off: you temporarily need double the infrastructure, and because both environments usually share one database, schema changes must stay compatible with both versions.
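The switch-and-rollback mechanics can be sketched in a few lines. This is a minimal illustration, not a real load balancer integration: `Environment`, `BlueGreenRouter`, and the `healthy` probe are hypothetical names standing in for whatever your infrastructure provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    version: str
    healthy: Callable[[], bool]  # hypothetical health/smoke-test probe


class BlueGreenRouter:
    """Stands in for the load balancer: all traffic goes to `live`."""

    def __init__(self, live: Environment, idle: Environment) -> None:
        self.live = live
        self.idle = idle

    def cut_over(self) -> bool:
        """Switch traffic to the idle environment if it passes its checks."""
        if not self.idle.healthy():
            return False  # keep serving from the current environment
        self.live, self.idle = self.idle, self.live
        return True

    def roll_back(self) -> None:
        """Point traffic back at the previous environment, which is still running."""
        self.live, self.idle = self.idle, self.live


blue = Environment("blue", "v1", healthy=lambda: True)
green = Environment("green", "v2", healthy=lambda: True)
router = BlueGreenRouter(live=blue, idle=green)

switched = router.cut_over()          # Green passes its checks, so it goes live
live_after_switch = router.live.name
router.roll_back()                    # something went wrong: back to Blue
live_after_rollback = router.live.name
```

Rollback is just the same pointer swap in reverse, which is why it is fast: the previous environment is never torn down during the release.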

Strategy 2: Canary Deployment

Route a small percentage of traffic (1-5%) to the new version while the majority stays on the current version. Monitor error rates, latency, and business metrics, comparing the canary against the current version. Gradually increase traffic to the new version if metrics are healthy. This catches issues that only appear with real user traffic while limiting their impact to a small share of users.
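A rough sketch of the two decisions involved: where to send each request, and whether to widen the rollout. The step sizes (1, 5, 25, 50, 100) and the 1% error-rate tolerance are illustrative assumptions, and both function names are hypothetical.

```python
import random


def pick_version(canary_percent: float, rng: random.Random) -> str:
    """Route one request: 'canary' for roughly canary_percent% of calls."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


def next_canary_percent(current: float, canary_error_rate: float,
                        stable_error_rate: float,
                        tolerance: float = 0.01) -> float:
    """Return the next traffic share: 0 to abort, otherwise the next step up."""
    steps = [1, 5, 25, 50, 100]  # assumed rollout schedule
    if canary_error_rate > stable_error_rate + tolerance:
        return 0.0  # canary is measurably worse: send all traffic to stable
    for step in steps:
        if step > current:
            return float(step)
    return 100.0  # already fully rolled out
```

In practice the comparison would also cover latency and business metrics; error rate alone is used here to keep the sketch short.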

Strategy 3: Rolling Deployment

Replace instances one at a time. Take one instance out of the load balancer, update it, run health checks, add it back, then move to the next instance. This requires no extra infrastructure but means both old and new versions run simultaneously during the rollout — your application must handle this (backward-compatible APIs, feature flags).
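The loop described above might look like the following sketch. Instances are plain dictionaries and `health_check` is a hypothetical callback; a real rollout would call your orchestrator or load balancer API instead of mutating fields.

```python
from typing import Callable, List


def rolling_deploy(instances: List[dict], new_version: str,
                   health_check: Callable[[dict], bool]) -> bool:
    """Update instances one at a time; stop at the first failed health check."""
    for instance in instances:
        instance["in_rotation"] = False        # take out of the load balancer
        previous = instance["version"]
        instance["version"] = new_version      # stand-in for the real update
        if not health_check(instance):
            instance["version"] = previous     # restore the old version
            instance["in_rotation"] = True
            return False                       # halt: later instances untouched
        instance["in_rotation"] = True         # add back, then move on
    return True


fleet = [{"id": i, "version": "v1", "in_rotation": True} for i in range(3)]
ok = rolling_deploy(fleet, "v2", health_check=lambda inst: True)
```

Note that a halted rollout leaves the fleet mixed: instances updated before the failure keep the new version, which is exactly the situation backward-compatible APIs are meant to survive.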

The Hard Part: Database Migrations

Zero-downtime database migrations require the expand-contract pattern: first, make additive changes (add new column, new table) that work with both old and new application versions. Then deploy the new application version. Finally, clean up (remove old columns) in a separate migration. Never rename or delete columns in the same release that changes the application code.

migrations/expand-contract-example.sql

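-- Phase 1 (EXPAND): Add the new column; the old columns keep working
ALTER TABLE users ADD COLUMN full_name TEXT;

-- Application v2 writes to BOTH the old and new columns during the transition,
-- for example on every user update (:first, :last, :id are bind parameters):
-- UPDATE users SET first_name = :first, last_name = :last,
--                  full_name = :first || ' ' || :last WHERE id = :id;

-- Phase 2 (MIGRATE): Backfill the new column for rows written before v2.
-- In PostgreSQL, || yields NULL if either name is NULL, so such rows stay unfilled.
UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL;

-- Phase 3 (CONTRACT): Run in a later release, once no app instance uses the old columns
-- ALTER TABLE users DROP COLUMN first_name;
-- ALTER TABLE users DROP COLUMN last_name;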
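To see the expand and migrate phases end to end, here is a small runnable sketch using Python's built-in sqlite3 module and an in-memory database. The `create_user` and `backfill` helpers, the sample rows, and the batch size are assumptions for illustration. The backfill runs in small batches with a commit after each one, a common way to keep transactions short on large tables, and it uses COALESCE so that rows with a NULL name still receive a value and the loop terminates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users (first_name, last_name) VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing"), ("Grace", "Hopper")],
)

# Phase 1 (EXPAND): additive change that the old application version ignores.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def create_user(first_name: str, last_name: str) -> None:
    """Application v2 write path: populate old and new columns together."""
    conn.execute(
        "INSERT INTO users (first_name, last_name, full_name) VALUES (?, ?, ?)",
        (first_name, last_name, f"{first_name} {last_name}"),
    )


def backfill(batch_size: int = 2) -> int:
    """Phase 2 (MIGRATE): fill full_name in small batches; return rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users "
            "SET full_name = COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') "
            "WHERE id IN (SELECT id FROM users WHERE full_name IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so each transaction stays short
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


create_user("Linus", "Torvalds")   # written by v2, already has full_name
updated = backfill()               # only the three pre-existing rows need filling
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
```

Phase 3 is deliberately absent: dropping the old columns belongs in a later release, after every running instance has stopped reading them.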

Deployment-related downtime is often caused not by the application deployment but by the database migration. A poorly planned ALTER TABLE on a large table can lock it for minutes. For large tables, use online tools built for the job, for example pt-online-schema-change for MySQL schema changes or pg_repack for PostgreSQL table rewrites.

Health Checks and Automated Rollback

Every deployment strategy requires health checks that go beyond 'the process is running.' Check that the application can connect to its dependencies (database, cache, external APIs), that critical endpoints return correct responses, and that business metrics (conversion rate, error rate) haven't degraded. Automated rollback should trigger if any health check fails for more than 2 consecutive intervals.
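The rollback rule can be expressed as a small counter per check. In this sketch, `RollbackMonitor` and the check names are hypothetical, and each check is a callable returning True when healthy; wiring it to real dependency probes and to an actual rollback action is left out.

```python
from typing import Callable, Dict


class RollbackMonitor:
    """Tracks consecutive failures per check and signals when to roll back."""

    def __init__(self, checks: Dict[str, Callable[[], bool]],
                 max_consecutive_failures: int = 2) -> None:
        self.checks = checks
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = {name: 0 for name in checks}

    def run_interval(self) -> bool:
        """Run every check once; return True if a rollback should trigger."""
        for name, check in self.checks.items():
            try:
                passed = check()
            except Exception:
                passed = False  # a check that crashes counts as a failure
            # A pass resets the streak; a failure extends it.
            self.failures[name] = 0 if passed else self.failures[name] + 1
        return any(n > self.max_consecutive_failures
                   for n in self.failures.values())


monitor = RollbackMonitor({
    "database": lambda: True,       # dependency is reachable
    "error_rate": lambda: False,    # metric has degraded after the release
})
decisions = [monitor.run_interval() for _ in range(3)]
```

With the default threshold, the third consecutive failure is the first one that exceeds 2, so `decisions` ends with a rollback signal. Requiring a streak rather than a single failure avoids rolling back on one transient blip.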

Zero-downtime deployment is a capability, not a destination. Start with rolling deployments (simplest), graduate to canary (when you need validation with real traffic), and use blue-green for high-stakes releases where instant rollback is critical. The goal is to make deployment boring — so routine and reliable that nobody worries about it.