Progressive Rollout and Canary Deploys
Why All-at-Once Deploys Are Gambling
You deploy a new version. It passes CI, it works in staging, it looks great in the preview deployment. You hit "deploy to production" and every user gets the new version simultaneously. Five minutes later, your error monitoring lights up. A subtle bug in the payment flow is failing for users with a specific browser locale. By the time you notice, 10,000 users have hit it.
This is the fundamental problem with all-at-once deployments: your blast radius is always 100%. Every deploy is a bet that nothing went wrong in the space between "works in CI" and "works for every user in every browser with every network condition."
Progressive rollouts shrink that bet. Instead of 100% of users getting the new version immediately, you start with 1%. If that 1% looks healthy, you go to 10%, then 50%, then 100%. If the 1% shows errors, you roll back before 99% of your users ever saw the problem.
Think of progressive rollouts like a restaurant soft launch. A new restaurant does not open to the public on day one with a Super Bowl ad. They invite 20 friends first, then open for one evening a week, then expand to full service. Each stage reveals problems (slow kitchen, confusing menu, broken dishwasher) while the blast radius is small. By grand opening night, the major issues are resolved. Deploying software works the same way.
Deployment Strategies Compared
There are three main strategies for deploying new versions. Each has different tradeoffs around complexity, rollback speed, and resource cost.
| Strategy | Blue-Green | Canary | Rolling |
|---|---|---|---|
| How it works | Two identical environments. Route all traffic from blue (old) to green (new) at once. | Route a small percentage of traffic to the new version. Increase gradually. | Replace instances one at a time. Old and new versions coexist during rollout. |
| Rollback speed | Instant — switch the router back to blue | Fast — reduce canary percentage to 0% | Slow — must re-deploy old version to each instance |
| Blast radius | 100% (all-or-nothing switch) | Configurable (1% then 10% then 50% etc.) | Grows as instances are replaced |
| Resource cost | 2x (both environments run simultaneously) | 1x + small canary allocation | 1x (instances are replaced in-place) |
| Complexity | Low — just a traffic switch | Medium — needs traffic splitting and monitoring | Low — built into most orchestrators |
| Best for | Simple apps, instant rollback requirement | Production apps needing gradual validation | Container orchestration (Kubernetes) |
Blue-Green Deployments
The simplest model. You have two identical environments — "blue" (current) and "green" (new). You deploy the new version to green, verify it, then switch the load balancer from blue to green. If something breaks, switch back.
The downside is binary: either 0% or 100% of users see the new version. There is no "let 5% try it first" option. And you need to maintain two full environments, which can be expensive for large applications.
Canary Deployments
Named after the canary in the coal mine — a small bird sent ahead to detect danger before the miners followed. In software, you send a small percentage of traffic to the new version (the canary) and watch for problems. If the canary survives, you send more traffic.
A typical canary progression:

1. 1% of traffic for 15 minutes: catches crashes and obvious errors while almost no one is exposed.
2. 10% for 30 minutes: enough volume for error rates and latency percentiles to be meaningful.
3. 50% for an hour: surfaces load-dependent problems at real scale.
4. 100%: the rollout is complete.
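The advancement logic behind a progression like this can be sketched in TypeScript. The stage values are illustrative, and the health check is assumed to come from monitoring:

```typescript
// Illustrative canary schedule: share of traffic and minutes to hold there.
interface Stage {
  percentage: number;  // share of traffic on the new version
  holdMinutes: number; // how long to watch before advancing
}

const stages: Stage[] = [
  { percentage: 1, holdMinutes: 15 },
  { percentage: 10, holdMinutes: 30 },
  { percentage: 50, holdMinutes: 60 },
  { percentage: 100, holdMinutes: 0 }, // full rollout, nothing left to hold for
];

// Advance only while the current stage stays healthy; otherwise drop to 0%.
function nextPercentage(current: number, healthy: boolean): number {
  if (!healthy) return 0; // roll back: stop sending traffic to the canary
  const idx = stages.findIndex((s) => s.percentage === current);
  return idx < stages.length - 1 ? stages[idx + 1].percentage : 100;
}
```

For example, `nextPercentage(1, true)` advances to 10, while `nextPercentage(10, false)` drops straight back to 0.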
Rolling Deployments
In a rolling deployment, instances of the old version are replaced one at a time with the new version. At any point during the rollout, some instances serve old and some serve new. Kubernetes uses this by default.
For frontend applications specifically, rolling deployments are less common because most frontend deployments are atomic (a single build artifact served by a CDN). Canary deployments make more sense when your "deployment" is a set of static files behind a CDN.
Implementing Canary Deploys with Vercel
Vercel's Rolling Releases feature provides built-in canary deployment support. You configure rollout stages directly in the Vercel dashboard, and the platform handles traffic splitting automatically.
```json
{
  "rollingRelease": {
    "enabled": true,
    "stages": [
      { "targetPercentage": 1, "duration": 15 },
      { "targetPercentage": 10, "duration": 30 },
      { "targetPercentage": 50, "duration": 60 },
      { "targetPercentage": 100 }
    ]
  }
}
```
Each stage has a targetPercentage (how much traffic gets the new version) and a duration (how many minutes to hold at that stage before advancing). If an error threshold is breached during any stage, the rollout pauses automatically.
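A stages array like this is easy to get wrong (non-increasing percentages, a final stage that never reaches 100%). This sketch validates the shape using the `targetPercentage` and `duration` fields from the config above; it is illustrative, not part of any Vercel tooling:

```typescript
interface RolloutStage {
  targetPercentage: number;
  duration?: number; // minutes to hold; the final stage needs none
}

// Sanity-check a stages array: traffic must increase strictly from stage
// to stage, and the last stage must reach 100%.
function validateStages(stages: RolloutStage[]): string[] {
  const errors: string[] = [];
  if (stages.length === 0) errors.push('at least one stage is required');
  for (let i = 1; i < stages.length; i++) {
    if (stages[i].targetPercentage <= stages[i - 1].targetPercentage) {
      errors.push(`stage ${i} does not increase traffic`);
    }
  }
  const last = stages[stages.length - 1];
  if (last && last.targetPercentage !== 100) {
    errors.push('final stage must reach 100%');
  }
  return errors;
}
```

Running this against the config above returns an empty error list; a misordered array returns one message per problem.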
You can also control rollouts programmatically through the Vercel API:
```shell
curl -X PATCH "https://api.vercel.com/v1/deployments/{id}/rolling-release" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"action": "promote"}'
```
Or roll back:
```shell
curl -X PATCH "https://api.vercel.com/v1/deployments/{id}/rolling-release" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"action": "rollback"}'
```
During a canary rollout, it is important that individual users consistently see the same version. If a user hits the new version on page load and the old version on the next navigation, they might encounter inconsistent behavior. Vercel handles this via cookies — once a user is assigned to the canary, they stay on it for the duration of their session.
Implementing Canary Deploys with Cloudflare Workers
Cloudflare Workers lets you implement custom traffic splitting at the edge. You write a Worker that decides which backend to route each request to:
```typescript
interface Env {
  CANARY_PERCENTAGE?: string; // e.g. "10" for 10% canary traffic
  CANARY_ORIGIN: string;      // base URL of the canary deployment
  STABLE_ORIGIN: string;      // base URL of the stable deployment
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const canaryPercentage = parseInt(env.CANARY_PERCENTAGE || '0', 10);

    // Sticky sessions: honor an existing version cookie if present.
    const cookie = request.headers.get('Cookie') || '';
    const versionCookie = cookie.match(/deploy-version=(canary|stable)/);

    let useCanary: boolean;
    if (versionCookie) {
      useCanary = versionCookie[1] === 'canary';
    } else {
      useCanary = Math.random() * 100 < canaryPercentage;
    }

    // Route to the chosen origin, preserving path and query string.
    const url = new URL(request.url);
    const backend = useCanary ? env.CANARY_ORIGIN : env.STABLE_ORIGIN;
    const response = await fetch(backend + url.pathname + url.search, request);

    // Clone the response so we can attach the assignment cookie.
    const newResponse = new Response(response.body, response);
    if (!versionCookie) {
      newResponse.headers.append(
        'Set-Cookie',
        `deploy-version=${useCanary ? 'canary' : 'stable'}; Path=/; Max-Age=3600`
      );
    }
    return newResponse;
  },
};
```
This Worker checks for an existing version cookie (sticky sessions), and if none exists, randomly assigns the user to canary or stable based on the configured percentage. The assignment is persisted via a cookie so the user gets a consistent experience.
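A common alternative to random assignment plus a cookie is deterministic bucketing on a stable identifier (a user or session ID). The same ID always lands in the same bucket, so stickiness needs no cookie at all. The hash below is an illustrative rolling hash, not from any library:

```typescript
// Hash a stable identifier into a bucket from 0 to 99.
function bucketFor(id: string): number {
  let hash = 0;
  for (let i = 0; i < id.length; i++) {
    hash = (hash * 31 + id.charCodeAt(i)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 100;
}

// A user is on the canary if their bucket falls below the canary percentage.
// Raising the percentage keeps existing canary users on the canary and only
// adds new buckets, which makes ramp-ups smooth.
function isCanary(id: string, canaryPercentage: number): boolean {
  return bucketFor(id) < canaryPercentage;
}
```

The tradeoff: deterministic bucketing requires a stable identifier on every request, which anonymous first-time visitors may not have, so the cookie approach above remains the simpler default for public traffic.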
Monitoring During Rollout
A canary deployment without monitoring is just a slow deployment. The entire point is to compare the canary against the stable version in real-time and catch problems early.
What to Monitor
1. Error rate — compare 5xx and client-side errors between canary and stable. Any increase above baseline triggers investigation.
2. Latency — compare p50, p95, and p99 response times. Latency regressions often indicate performance bugs or missing optimizations.
3. Core Web Vitals — compare LCP, INP, and CLS between versions. A CLS regression means the new version has layout shift issues.
4. Business metrics — conversion rate, cart additions, signups. Sometimes the new version works perfectly but users behave differently.
5. Memory and CPU — server-side metrics that catch resource leaks before they cause outages.
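For the latency comparison, percentiles must be computed from raw samples per version. A minimal sketch (the sample values are illustrative):

```typescript
// Compute the p-th percentile (nearest-rank) from a window of latency samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

const canaryLatencies = [120, 95, 130, 2400, 110]; // ms, illustrative window
console.log(percentile(canaryLatencies, 50)); // 120
console.log(percentile(canaryLatencies, 99)); // 2400
```

Note how a single slow outlier (2400ms) is invisible at p50 but dominates p99, which is exactly why both are monitored.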
Automated Rollback Triggers
Manual monitoring does not scale. Configure automated rollback that triggers when metrics cross a threshold:
```yaml
rollback_triggers:
  - metric: error_rate
    threshold: 2x_baseline
    window: 5m
    action: rollback
  - metric: p99_latency
    threshold: 3000ms
    window: 10m
    action: pause
  - metric: cls
    threshold: 0.25
    window: 15m
    action: pause
```
The difference between rollback and pause matters. A 2x error rate spike is unambiguous — roll back immediately. A latency increase might be transient (cold caches warming up), so pausing gives you time to investigate before deciding.
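A sketch of how such triggers might be evaluated, using the metric names and `rollback`/`pause` actions from the YAML above. The `Trigger` shape and `relative` flag are assumptions for this example, not any real tool's schema:

```typescript
type Action = 'rollback' | 'pause' | 'continue';

interface Trigger {
  metric: string;
  threshold: number; // absolute value, or a multiplier when relative is true
  relative: boolean; // true for "Nx_baseline"-style thresholds
  action: 'rollback' | 'pause';
}

// Check canary metrics against each trigger; the first breach decides the
// action, so list rollback triggers before pause triggers in the config.
function evaluate(
  triggers: Trigger[],
  canary: Record<string, number>,
  baseline: Record<string, number>
): Action {
  for (const t of triggers) {
    const value = canary[t.metric];
    if (value === undefined) continue;
    const limit = t.relative ? (baseline[t.metric] ?? 0) * t.threshold : t.threshold;
    if (value > limit) return t.action;
  }
  return 'continue';
}

const decision = evaluate(
  [{ metric: 'error_rate', threshold: 2, relative: true, action: 'rollback' }],
  { error_rate: 0.05 }, // canary: 5% errors
  { error_rate: 0.01 }  // stable baseline: 1% errors
);
console.log(decision); // "rollback": the canary exceeds 2x the baseline
```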
A common mistake is comparing absolute numbers instead of rates during canary monitoring. If your canary handles 1% of traffic and stable handles 99%, the canary will naturally have far fewer total errors. Always compare error rates (errors per request), not error counts. Five errors in 100 requests (5% rate) is much worse than 500 errors in 100,000 requests (0.5% rate).
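The arithmetic from that example, as a quick check:

```typescript
// Rates, not counts: divide errors by requests before comparing versions.
function rate(errors: number, requests: number): number {
  return errors / requests;
}

const canaryRate = rate(5, 100);       // 0.05: 5% of canary requests failed
const stableRate = rate(500, 100_000); // 0.005: 0.5% of stable requests failed
console.log(canaryRate > stableRate);  // true: the canary is 10x worse
```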
Blue-Green with Instant Rollback
For teams that want simplicity over gradual rollout, blue-green is the straightforward choice. On Vercel, every deployment is essentially blue-green: each deployment gets its own immutable URL, and promoting a deployment makes it the production URL. Rolling back is just promoting the previous deployment.
```shell
vercel rollback
```
That single command switches production back to the previous deployment. The new (broken) deployment still exists at its unique URL for debugging, but no production traffic reaches it.
The Edge Case Blue-Green Misses
Blue-green's weakness is database migrations. If the new version requires a schema change (a new column, a changed index), and you roll back to the old version, the old code might not work with the new schema. This is why blue-green deployments require backward-compatible database migrations: the old version must be able to work with both the old and new schema. In practice, this means you deploy schema changes separately from code changes, often one deploy cycle ahead. For frontend-only deployments (no server, just static files), this is rarely an issue.
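The backward-compatibility requirement can be illustrated with application code that tolerates both schemas during a hypothetical column rename (`name` to `full_name`, both names invented for this example):

```typescript
// During the transition, a row may carry the old column, the new one, or both.
interface UserRow {
  name?: string;      // old column, still present before the cleanup migration
  full_name?: string; // new column, added by the "expand" migration
}

// Prefer the new column and fall back to the old one, so this code works
// against either schema and stays safe to roll back.
function displayName(row: UserRow): string {
  return row.full_name ?? row.name ?? 'unknown';
}
```

Only after every deployed version reads the new column does a final "contract" migration drop the old one.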
| What developers do | What they should do |
|---|---|
| **Skipping the 1% stage and starting canary at 25%.** The whole point of canary is catching problems early. Starting at 25% means 25% of your users are affected before you detect anything. | Always start at 1% or lower to minimize blast radius. |
| **Using the same monitoring thresholds for all rollout stages.** At 1%, a small number of errors causes a huge error rate spike. At 50%, the same absolute number of errors is a tiny blip. Thresholds should account for sample size. | Tighten thresholds at lower percentages, relax at higher percentages. |
| **Rolling back manually by re-deploying the old version.** Re-deploying takes minutes. Instant rollback takes seconds. During an incident, every second of exposure matters. | Use instant rollback (`vercel rollback`, or switch the load balancer target). |
| **Monitoring only error rates during canary rollout.** The new version might have zero errors but 2x latency. Or it might work perfectly but cause a 10% conversion drop. Error rate alone is insufficient. | Monitor error rates, latency, Web Vitals, and business metrics together. |
The Rollout Mindset
Progressive rollout is not just a deployment technique — it is a mindset shift. Instead of "deploy and hope," you "deploy and observe." Instead of rollback being an emergency procedure, it is a normal, expected operation. Some rollouts get rolled back. That is fine. The system is working as designed.
The best teams treat every deploy as a canary, even when they are confident. Confidence is not data. The canary provides data. And data beats confidence every time — especially at 2am when the on-call engineer's phone buzzes.