04 — QA & Deployment Process

This document defines the gate sequence every Worker build must pass, and how code promotes from developer machine through staging to production. It exists because silent production outages almost always trace back to a skipped step — not to anyone doing something creative.

The gate sequence

Every build, on every branch, runs these steps in this order. An early failure short-circuits the rest.

1. Install          npm ci
2. Lint             npm run lint
3. Typecheck        npm run typecheck
4. Unit tests       npm run test
5. Secret scan      scripts/preflight-secrets.sh
6. Dry-run deploy   wrangler deploy --dry-run --env <target>
7. Build artifact   (implicit in dry-run; capture size and log it)
8. Smoke test       scripts/smoke.sh <deployed-url>   (after real deploy only)

Steps 1–6 run on pull-request CI. Step 8 runs after deploys to staging and production.

Any step's failure is a build failure. "The lint error is unrelated" is not a reason to merge a red build — fix it or revert the change that introduced it.
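The short-circuit behavior above can be sketched as a small local wrapper. The `run_gates` helper and the hard-coded staging target are illustrative, not part of the stub:

```shell
#!/usr/bin/env bash
# Sketch of the gate runner: execute each gate in order and stop at the
# first failure, exactly as CI does.
set -u

run_gates() {
  local i=1 gate
  for gate in "$@"; do
    echo "gate ${i}: ${gate}"
    if ! eval "${gate}"; then
      echo "gate ${i} failed: build is red" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "all gates green"
}

# The real sequence (steps 1-6); staging is used as the dry-run target
# here for illustration.
GATES=(
  "npm ci"
  "npm run lint"
  "npm run typecheck"
  "npm run test"
  "scripts/preflight-secrets.sh"
  "wrangler deploy --dry-run --env staging"
)
# run_gates "${GATES[@]}"   # uncomment to run for real
```

Because `run_gates` takes the gate list as arguments, the same helper works in a pre-push hook or a scratch script with a different command list.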

Environments — the promotion path

       ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
       │   dev        │      │   staging    │      │  production  │
       │  (per-eng)   │ ───▶ │   (shared)   │ ───▶ │   (live)     │
       └──────────────┘      └──────────────┘      └──────────────┘
           ▲                      ▲                      ▲
      wrangler dev         wrangler deploy         wrangler deploy
       or deploy            --env staging          --env production
      (no route)         (auto from main merge)    (auto from tag/release)

Never deploy to production directly from a developer machine in steady state. The one exception is the documented break-glass scenario (CI is down and a security fix must ship), and it requires a post-hoc PR plus an incident note.

Staging soak

Every change lives in staging for a minimum soak window before promotion to production:

Change type                                                    Minimum soak
Config-only (vars, routes, bindings on non-critical Workers)   15 minutes
Code change, tier-3 Worker                                     30 minutes
Code change, tier-2 Worker                                     2 hours
Code change, tier-1 Worker                                     24 hours
Schema/migration change touching D1 or persistent state        24 hours regardless of tier, plus explicit Platform lead sign-off

During soak, someone on the owning team must check Workers analytics / Logpush for new error classes. No soak means no promotion — don't talk yourself out of this one.
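The soak table can be encoded as a lookup helper, useful in release tooling. This is a sketch; the function name and the `config`/`code`/`schema` labels are assumptions, not anything the stub ships:

```shell
#!/usr/bin/env bash
# soak_minutes <change-type> [tier]: minimum soak window in minutes,
# per the table above.
soak_minutes() {
  local change_type="$1" tier="${2:-}"
  case "${change_type}" in
    config) echo 15 ;;
    schema) echo $((24 * 60)) ;;   # plus explicit Platform lead sign-off
    code)
      case "${tier}" in
        3) echo 30 ;;
        2) echo $((2 * 60)) ;;
        1) echo $((24 * 60)) ;;
        *) echo "unknown tier: ${tier}" >&2; return 1 ;;
      esac ;;
    *) echo "unknown change type: ${change_type}" >&2; return 1 ;;
  esac
}
```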

Rollback

Cloudflare Workers support instant rollback to the previous deployment:

wrangler rollback --env production

Rollback is not a recovery plan; it's a containment action. After rolling back:

  1. Open an incident in the on-call channel.
  2. Confirm metrics returned to baseline.
  3. Leave staging on the broken version so the fix can be developed against the real failure.
  4. Post-mortem within 5 business days for any tier-1 rollback.

For Workers backed by persistent state (D1, R2, Durable Objects), rollback of the code does not undo data changes. Migrations must be designed reversibly (see "Schema-safe deploys" below).

Schema-safe deploys — the expand/contract pattern

Any change to D1 schema, KV key shape, R2 path convention, or persisted Durable Object state follows expand/contract over two releases minimum:

  1. Expand: deploy code and schema change that can read both old and new shapes. Old code still works.
  2. Wait at least one full soak window and confirm no regressions.
  3. Contract: deploy follow-up release that removes old-shape support.

This keeps rollback safe. Step 1 can always be rolled back without data loss, because the previous release still works against the expanded schema. Step 3 is also rollback-safe: rolling it back restores the expand release, which still reads both shapes, so there is no data to corrupt.
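As a concrete illustration of the two-release shape, consider a hypothetical D1 column rename (users.name to users.full_name). The file names and SQL below are made up for the example; the point is that each migration ships in its own release:

```shell
#!/usr/bin/env bash
# Write an expand/contract migration pair as two separate files, one per
# release. Paths and schema are illustrative only.
set -eu
mkdir -p migrations

# Release 1 (expand): add the new column and backfill it. Code in this
# release reads full_name but falls back to name, so old code still works.
cat > migrations/0002_expand_full_name.sql <<'SQL'
ALTER TABLE users ADD COLUMN full_name TEXT;
UPDATE users SET full_name = name WHERE full_name IS NULL;
SQL

# Release 2 (contract): ship only after a full soak window with no
# regressions. Old-shape support is removed on purpose.
cat > migrations/0003_contract_drop_name.sql <<'SQL'
ALTER TABLE users DROP COLUMN name;
SQL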

CI gate — what runs where

The stub ships .github/workflows/ci.yml with three jobs:

Job                Runs on                  What it does
quality            Every push, every PR     Steps 1–6 above against the PR head.
deploy-staging     Push to main             Runs quality, then wrangler deploy --env staging, then smoke test against staging URL.
deploy-production  Git tag matching v*.*.*  Runs quality, deploys to production, runs production smoke test, posts deploy notification.

CI must use an API token scoped to Workers Scripts:Edit plus the specific zones the Worker touches. Do not use an account-global token.
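As a hedged sketch of the three-job layout (job names come from the table above; step details, triggers, and the secret name are assumptions, not the stub's actual file):

```yaml
name: ci
on:
  push:
    branches: [main]
    tags: ["v*.*.*"]
  pull_request:

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
      - run: npm run test
      - run: scripts/preflight-secrets.sh
      - run: npx wrangler deploy --dry-run --env staging
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}

  deploy-staging:
    if: github.ref == 'refs/heads/main'
    needs: quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx wrangler deploy --env staging
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
      - run: scripts/smoke.sh "$STAGING_URL"

  # deploy-production mirrors deploy-staging, gated on tags matching
  # v*.*.*, with the production env and a deploy notification step.
```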

Smoke tests

Every Worker ships a scripts/smoke.sh that, given a base URL, probes the well-known health endpoint (GET /__health) and any other endpoint critical enough that a silent 500 would matter. The stub's src/index.ts includes the /__health handler by default — keep it.

Smoke tests check:

  • 200 response to /__health
  • Response body includes the deployed commit SHA (the stub wires this via vars.COMMIT_SHA set at deploy time)
  • Critical custom endpoints return 2xx or the documented expected status

A failing smoke test after deploy triggers an automatic rollback attempt in CI, followed by a human page.
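The core assertions can be factored so the HTTP fetch is easy to stub in tests. `smoke_check` and its arguments are illustrative; the stub's actual smoke.sh may be shaped differently:

```shell
#!/usr/bin/env bash
# smoke_check <status> <body> <expected-sha>: pass/fail the two core
# assertions, given the status code and body returned by a request to
# <base-url>/__health.
smoke_check() {
  local status="$1" body="$2" expected_sha="$3"
  if [ "${status}" != "200" ]; then
    echo "smoke: /__health returned ${status}, expected 200" >&2
    return 1
  fi
  case "${body}" in
    *"${expected_sha}"*) echo "smoke: ok (sha ${expected_sha} live)" ;;
    *) echo "smoke: deployed sha missing from health body" >&2; return 1 ;;
  esac
}

# In the real script, something like:
#   body="$(curl -fsS "${BASE_URL}/__health")"
#   smoke_check 200 "${body}" "${COMMIT_SHA}"
```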

Pre-merge checklist (for PR description)

- [ ] Lint, typecheck, unit tests pass locally
- [ ] Secret scan clean (scripts/preflight-secrets.sh)
- [ ] wrangler deploy --dry-run --env staging succeeds
- [ ] Worker name, routes, bindings conform to 01 & 02
- [ ] If schema/state change: expand/contract plan documented in PR
- [ ] If tier-1: Platform lead reviewer added
- [ ] CHANGELOG.md updated

Pre-production checklist (before tagging a release)

- [ ] Staging soak window met for this change type and tier
- [ ] No new error classes in staging analytics during soak
- [ ] Rollback target identified (previous production deployment ID)
- [ ] On-call aware of deploy window
- [ ] Comms posted for tier-1 changes

Observability — what we watch

Every Worker emits, at minimum:

  • request.started / request.completed with duration and status
  • error.caught for any thrown exception handled by the top-level error boundary
  • dependency.call for subrequests, with duration and outcome

The stub's src/lib/logger.ts provides structured JSON logs in a shape Logpush and Workers Analytics Engine can both ingest. Don't reinvent logging — extend the logger if you need a new field.
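Purely as an illustration of the structured shape (the event names come from the list above; all other field names are assumptions, not the logger's actual schema), one request.completed line might look like:

```json
{
  "event": "request.completed",
  "status": 200,
  "duration_ms": 41,
  "commit_sha": "abc123"
}
```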

Deploy freezes

Deploy freezes apply during:

  • Major client launches (announced ≥ 72 hours in advance)
  • End-of-quarter close for billing-related Workers
  • Platform lead discretion during active incidents

Freezes are announced in the platform channel and tracked in a freeze log alongside this SOP. Break-glass deploys during a freeze require a Platform lead approval recorded in the PR.


Next: 05 — Resiliency & Disaster Recovery.