Progressive Rollout and Canary Deploys

Intermediate · 18 min read

Why All-at-Once Deploys Are Gambling

You deploy a new version. It passes CI, it works in staging, it looks great in the preview deployment. You hit "deploy to production" and every user gets the new version simultaneously. Five minutes later, your error monitoring lights up. A subtle bug in the payment flow is failing for users with a specific browser locale. By the time you notice, 10,000 users have hit it.

This is the fundamental problem with all-at-once deployments: your blast radius is always 100%. Every deploy is a bet that nothing went wrong in the space between "works in CI" and "works for every user in every browser with every network condition."

Progressive rollouts shrink that bet. Instead of 100% of users getting the new version immediately, you start with 1%. If that 1% looks healthy, you go to 10%, then 50%, then 100%. If the 1% shows errors, you roll back before 99% of your users ever saw the problem.

Mental Model

Think of progressive rollouts like a restaurant soft launch. A new restaurant does not open to the public on day one with a Super Bowl ad. They invite 20 friends first, then open for one evening a week, then expand to full service. Each stage reveals problems (slow kitchen, confusing menu, broken dishwasher) while the blast radius is small. By grand opening night, the major issues are resolved. Deploying software works the same way.

Deployment Strategies Compared

There are three main strategies for deploying new versions. Each has different tradeoffs around complexity, rollback speed, and resource cost.

| Strategy | Blue-Green | Canary | Rolling |
| --- | --- | --- | --- |
| How it works | Two identical environments. Route all traffic from blue (old) to green (new) at once. | Route a small percentage of traffic to the new version. Increase gradually. | Replace instances one at a time. Old and new versions coexist during rollout. |
| Rollback speed | Instant — switch the router back to blue | Fast — reduce canary percentage to 0% | Slow — must re-deploy old version to each instance |
| Blast radius | 100% (all-or-nothing switch) | Configurable (1%, then 10%, then 50%, etc.) | Grows as instances are replaced |
| Resource cost | 2x (both environments run simultaneously) | 1x + small canary allocation | 1x (instances are replaced in place) |
| Complexity | Low — just a traffic switch | Medium — needs traffic splitting and monitoring | Low — built into most orchestrators |
| Best for | Simple apps, instant rollback requirement | Production apps needing gradual validation | Container orchestration (Kubernetes) |

Blue-Green Deployments

The simplest model. You have two identical environments — "blue" (current) and "green" (new). You deploy the new version to green, verify it, then switch the load balancer from blue to green. If something breaks, switch back.

The downside is binary: either 0% or 100% of users see the new version. There is no "let 5% try it first" option. And you need to maintain two full environments, which can be expensive for large applications.

Canary Deployments

Named after the canary in the coal mine — a small bird sent ahead to detect danger before the miners followed. In software, you send a small percentage of traffic to the new version (the canary) and watch for problems. If the canary survives, you send more traffic.

A typical canary progression:

Execution Trace

Stage 1: 1% canary
Deploy the new version and route 1% of traffic to it. Monitor error rates, latency, and Core Web Vitals for 15 minutes.
Catches: catastrophic failures — crashes, 500 errors, broken rendering.

Stage 2: 10% canary
If Stage 1 is healthy, increase to 10%. Monitor for 30 minutes. Watch for performance degradation under higher load.
Catches: load-dependent issues — memory leaks, slow queries, cache misses.

Stage 3: 50% canary
If Stage 2 is healthy, increase to 50%. Monitor for 1 hour. Compare metrics between the old and new versions side by side.
Catches: subtle issues — slight conversion drops, increased bounce rate.

Stage 4: 100% rollout
If all stages pass, route 100% of traffic to the new version. Keep the old version available for 24 hours as a rollback target.
Result: full release with a safety net.
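The promotion logic behind these stages can be sketched as a small state machine. The stage table and `nextPercentage` helper below are illustrative, not tied to any specific platform:

```typescript
// Hypothetical stage table mirroring the progression above.
type Stage = { targetPercentage: number; holdMinutes: number };

const STAGES: Stage[] = [
  { targetPercentage: 1, holdMinutes: 15 },
  { targetPercentage: 10, holdMinutes: 30 },
  { targetPercentage: 50, holdMinutes: 60 },
  { targetPercentage: 100, holdMinutes: 0 },
];

// After the hold window for the current stage, either advance to the next
// stage's percentage or drop the canary to 0% (instant rollback).
function nextPercentage(currentStage: number, healthy: boolean): number {
  if (!healthy) return 0;
  const next = Math.min(currentStage + 1, STAGES.length - 1);
  return STAGES[next].targetPercentage;
}
```

A scheduler would call `nextPercentage` once per hold window; an unhealthy reading at any stage sends traffic straight back to the stable version.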

Rolling Deployments

In a rolling deployment, instances of the old version are replaced one at a time with the new version. At any point during the rollout, some instances serve old and some serve new. Kubernetes uses this by default.

For frontend applications specifically, rolling deployments are less common because most frontend deployments are atomic (a single build artifact served by a CDN). Canary deployments make more sense when your "deployment" is a set of static files behind a CDN.

Quiz
You deploy a canary at 1% and monitor for 15 minutes. Error rates look normal. You promote to 10%. After 20 minutes at 10%, you notice a memory leak causing the new version to crash every 30 minutes. Why did you not catch this at 1%?

Implementing Canary Deploys with Vercel

Vercel's Rolling Releases feature provides built-in canary deployment support. You configure rollout stages directly in the Vercel dashboard, and the platform handles traffic splitting automatically.

{
  "rollingRelease": {
    "enabled": true,
    "stages": [
      { "targetPercentage": 1, "duration": 15 },
      { "targetPercentage": 10, "duration": 30 },
      { "targetPercentage": 50, "duration": 60 },
      { "targetPercentage": 100 }
    ]
  }
}

Each stage has a targetPercentage (how much traffic gets the new version) and a duration (how many minutes to hold at that stage before advancing). If an error threshold is breached during any stage, the rollout pauses automatically.

You can also control rollouts programmatically through the Vercel API:

curl -X PATCH "https://api.vercel.com/v1/deployments/{id}/rolling-release" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  -d '{"action": "promote"}'

Or roll back:

curl -X PATCH "https://api.vercel.com/v1/deployments/{id}/rolling-release" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  -d '{"action": "rollback"}'
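The same calls can be made from a script. A minimal sketch, assuming only the endpoint path and `action` body shown in the curl examples; the helper builds the request so it can be inspected or tested before the HTTP call:

```typescript
// Build the PATCH request for the rolling-release endpoint shown above.
// The endpoint path and body shape come from the curl examples; the rest
// of this wrapper is illustrative.
function rollingReleaseRequest(
  deploymentId: string,
  action: 'promote' | 'rollback',
  token: string,
) {
  return {
    url: `https://api.vercel.com/v1/deployments/${deploymentId}/rolling-release`,
    init: {
      method: 'PATCH',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ action }),
    },
  };
}

// Usage: const { url, init } = rollingReleaseRequest('dpl_abc', 'rollback', token);
// await fetch(url, init);
```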
Sticky sessions during canary

During a canary rollout, it is important that individual users consistently see the same version. If a user hits the new version on page load and the old version on the next navigation, they might encounter inconsistent behavior. Vercel handles this via cookies — once a user is assigned to the canary, they stay on it for the duration of their session.

Implementing Canary Deploys with Cloudflare Workers

Cloudflare Workers lets you implement custom traffic splitting at the edge. You write a Worker that decides which backend to route each request to:

interface Env {
  CANARY_PERCENTAGE: string;
  CANARY_ORIGIN: string;
  STABLE_ORIGIN: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const canaryPercentage = parseInt(env.CANARY_PERCENTAGE || '0', 10);

    // Sticky sessions: if the user already has a version cookie, honor it.
    const cookie = request.headers.get('Cookie') || '';
    const versionCookie = cookie.match(/deploy-version=(canary|stable)/);

    let useCanary: boolean;

    if (versionCookie) {
      useCanary = versionCookie[1] === 'canary';
    } else {
      // No cookie yet: randomly assign based on the configured percentage.
      useCanary = Math.random() * 100 < canaryPercentage;
    }

    // Forward to the chosen origin, preserving both path and query string.
    const url = new URL(request.url);
    const backend = useCanary ? env.CANARY_ORIGIN : env.STABLE_ORIGIN;
    const response = await fetch(backend + url.pathname + url.search, request);

    // Persist the assignment so the user sees a consistent version.
    const newResponse = new Response(response.body, response);
    if (!versionCookie) {
      newResponse.headers.append(
        'Set-Cookie',
        `deploy-version=${useCanary ? 'canary' : 'stable'}; Path=/; Max-Age=3600`
      );
    }

    return newResponse;
  },
};

This Worker checks for an existing version cookie (sticky sessions), and if none exists, randomly assigns the user to canary or stable based on the configured percentage. The assignment is persisted via a cookie so the user gets a consistent experience.
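One caveat with `Math.random()`: the assignment is forgotten the moment the cookie is lost. Hashing a stable identifier (user ID, session ID) instead puts the same user in the same bucket on every request. A sketch using FNV-1a, an arbitrary choice of fast non-cryptographic hash, not something the Worker above requires:

```typescript
// Deterministic bucketing: hash a stable identifier into 0..99, so the
// same user always lands in the same bucket even without a cookie.
function bucket(id: string): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV-1a 32-bit prime
  }
  return (hash >>> 0) % 100;
}

// Route to canary when the user's bucket falls below the rollout percentage.
const useCanary = bucket('user-123') < 10; // hypothetical 10% canary
```

Raising the rollout percentage then only ever adds users to the canary; nobody flips back and forth as the rollout progresses.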

Monitoring During Rollout

A canary deployment without monitoring is just a slow deployment. The entire point is to compare the canary against the stable version in real-time and catch problems early.

What to Monitor

Key Rules
  1. Error rate — compare 5xx and client-side errors between canary and stable. Any increase above baseline triggers investigation.
  2. Latency — compare p50, p95, and p99 response times. Latency regressions often indicate performance bugs or missing optimizations.
  3. Core Web Vitals — compare LCP, INP, and CLS between versions. A CLS regression means the new version has layout shift issues.
  4. Business metrics — conversion rate, cart additions, signups. Sometimes the new version works perfectly but users behave differently.
  5. Memory and CPU — server-side metrics that catch resource leaks before they cause outages.
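Whatever the metric, the comparison should be rate-based and relative to the stable baseline. A sketch, with the 2x-baseline threshold as an illustrative default:

```typescript
// Flag the canary when its error rate (errors per request) exceeds the
// stable baseline by more than `maxRatio`. Compares rates, never raw counts.
function canaryRegressed(
  canaryErrors: number,
  canaryRequests: number,
  stableErrors: number,
  stableRequests: number,
  maxRatio = 2,
): boolean {
  if (canaryRequests === 0 || stableRequests === 0) return false; // not enough data
  const canaryRate = canaryErrors / canaryRequests;
  const stableRate = stableErrors / stableRequests;
  return canaryRate > stableRate * maxRatio;
}
```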

Automated Rollback Triggers

Manual monitoring does not scale. Configure automated rollback that triggers when metrics cross a threshold:

rollback_triggers:
  - metric: error_rate
    threshold: 2x_baseline
    window: 5m
    action: rollback

  - metric: p99_latency
    threshold: 3000ms
    window: 10m
    action: pause

  - metric: cls
    threshold: 0.25
    window: 15m
    action: pause

The difference between rollback and pause matters. A 2x error rate spike is unambiguous — roll back immediately. A latency increase might be transient (cold caches warming up), so pausing gives you time to investigate before deciding.
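In code, the distinction amounts to a small decision function. A sketch mirroring the triggers above (the field names are illustrative):

```typescript
type TriggerAction = 'rollback' | 'pause' | 'none';

// An unambiguous error-rate spike rolls back immediately; latency and CLS
// breaches only pause the rollout, leaving time to investigate.
function evaluateTriggers(m: {
  errorRateVsBaseline: number; // e.g. 2.0 means 2x the stable baseline
  p99LatencyMs: number;
  cls: number;
}): TriggerAction {
  if (m.errorRateVsBaseline >= 2) return 'rollback';
  if (m.p99LatencyMs > 3000 || m.cls > 0.25) return 'pause';
  return 'none';
}
```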

Quiz
During a canary rollout at 10%, your monitoring shows the canary has a 0.5% higher error rate than stable. The absolute rates are 1.2% (canary) vs 0.7% (stable). Should you roll back?

Common Trap

A common mistake is comparing absolute numbers instead of rates during canary monitoring. If your canary handles 1% of traffic and stable handles 99%, the canary will naturally have far fewer total errors. Always compare error rates (errors per request), not error counts. Five errors in 100 requests (5% rate) is much worse than 500 errors in 100,000 requests (0.5% rate).

Blue-Green with Instant Rollback

For teams that want simplicity over gradual rollout, blue-green is the straightforward choice. On Vercel, every deployment is essentially blue-green: each deployment gets its own immutable URL, and promoting a deployment makes it the production URL. Rolling back is just promoting the previous deployment.

vercel rollback

That single command switches production back to the previous deployment. The new (broken) deployment still exists at its unique URL for debugging, but no production traffic reaches it.

The edge case blue-green misses

Blue-green's weakness is database migrations. If the new version requires a schema change (a new column, a changed index), and you roll back to the old version, the old code might not work with the new schema. This is why blue-green deployments require backward-compatible database migrations: the old version must be able to work with both the old and new schema. In practice, this means you deploy schema changes separately from code changes, often one deploy cycle ahead. For frontend-only deployments (no server, just static files), this is rarely an issue.
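In application code, backward compatibility usually means tolerating both schemas for one release cycle. A sketch with a hypothetical display_name column being added:

```typescript
// Expand/contract migration, read side (hypothetical column).
// Deploy 1: add the nullable display_name column; code must tolerate nulls.
// Deploy 2 (a later cycle): start writing it; only then retire the fallback.
type UserRow = { email: string; display_name?: string | null };

function displayName(row: UserRow): string {
  // Fall back to the old derivation while the new column backfills.
  return row.display_name ?? row.email.split('@')[0];
}
```

Because the fallback works against both schemas, rolling back either the code or the migration leaves the site functional.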

| What developers do | Why it's a problem | What they should do |
| --- | --- | --- |
| Skipping the 1% stage and starting the canary at 25% | The whole point of a canary is catching problems early. Starting at 25% means 25% of your users are affected before you detect anything. | Always start at 1% or lower to minimize blast radius |
| Using the same monitoring thresholds for all rollout stages | At 1%, a small number of errors causes a huge error rate spike. At 50%, the same absolute number of errors is a tiny blip. Thresholds should account for sample size. | Tighten thresholds at lower percentages, relax at higher percentages |
| Rolling back manually by re-deploying the old version | Re-deploying takes minutes. Instant rollback takes seconds. During an incident, every second of exposure matters. | Use instant rollback (vercel rollback, switch the load balancer target) |
| Monitoring only error rates during canary rollout | The new version might have zero errors but 2x latency. Or it might work perfectly but cause a 10% conversion drop. Error rate alone is insufficient. | Monitor error rates, latency, Web Vitals, and business metrics together |

The Rollout Mindset

Progressive rollout is not just a deployment technique — it is a mindset shift. Instead of "deploy and hope," you "deploy and observe." Instead of rollback being an emergency procedure, it is a normal, expected operation. Some rollouts get rolled back. That is fine. The system is working as designed.

The best teams treat every deploy as a canary, even when they are confident. Confidence is not data. The canary provides data. And data beats confidence every time — especially at 2am when the on-call engineer's phone buzzes.