04 — QA & Deployment Process
This document defines the gate sequence every Worker build must pass, and how code promotes from developer machine through staging to production. It exists because silent production outages almost always trace back to a skipped step — not to anyone doing something creative.
The gate sequence
Every build, on every branch, runs these steps in this order. An early failure short-circuits the rest.
1. Install: `npm ci`
2. Lint: `npm run lint`
3. Typecheck: `npm run typecheck`
4. Unit tests: `npm run test`
5. Secret scan: `scripts/preflight-secrets.sh`
6. Dry-run deploy: `wrangler deploy --dry-run --env <target>`
7. Build artifact: implicit in the dry-run; capture the size and log it
8. Smoke test: `scripts/smoke.sh <deployed-url>` (after a real deploy only)
Steps 1–6 run on pull-request CI. Step 8 runs after deploys to staging and production.
Any step's failure is a build failure. "The lint error is unrelated" is not a reason to merge a red build — fix it or revert the change that introduced it.
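The short-circuit behavior can be sketched as a small runner. This is a minimal illustration, not the stub's CI implementation: `Gate`, `GateResult`, `runGates`, and the injected `exec` callback are hypothetical names, and the commands are copied from the list above with `staging` substituted for `<target>`.

```typescript
// Hypothetical sketch: run gates in order and stop at the first failure.
interface Gate {
  name: string;
  command: string;
}

interface GateResult {
  ok: boolean;
  ran: string[];   // gates that were actually started, in order
  failed?: string; // name of the first failing gate, if any
}

// Commands copied from the gate list; "staging" stands in for <target>.
const prGates: Gate[] = [
  { name: "install", command: "npm ci" },
  { name: "lint", command: "npm run lint" },
  { name: "typecheck", command: "npm run typecheck" },
  { name: "unit tests", command: "npm run test" },
  { name: "secret scan", command: "scripts/preflight-secrets.sh" },
  { name: "dry-run deploy", command: "wrangler deploy --dry-run --env staging" },
];

// `exec` returns a command's exit code. It is injected so the runner can be
// exercised without spawning real processes.
function runGates(gates: Gate[], exec: (command: string) => number): GateResult {
  const ran: string[] = [];
  for (const gate of gates) {
    ran.push(gate.name);
    if (exec(gate.command) !== 0) {
      // An early failure short-circuits the rest: later gates never run.
      return { ok: false, ran, failed: gate.name };
    }
  }
  return { ok: true, ran };
}
```

With a fake `exec` that fails only on `npm run typecheck`, the runner reports `typecheck` as the failed gate and never starts the unit tests, secret scan, or dry-run deploy.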
Environments — the promotion path
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│     dev      │      │   staging    │      │  production  │
│  (per-eng)   │ ───▶ │   (shared)   │ ───▶ │    (live)    │
└──────────────┘      └──────────────┘      └──────────────┘
       ▲                     ▲                     ▲
  wrangler dev        wrangler deploy       wrangler deploy
   or deploy           --env staging        --env production
   (no route)     (auto from main merge) (auto from tag/release)
Never deploy to production directly from a developer machine in steady state. The only exception is a documented break-glass scenario (CI is down and a security fix must ship), which requires a post-hoc PR and an incident note.
Staging soak
Every change lives in staging for a minimum soak window before promotion to production:
| Change type | Minimum soak |
|---|---|
| Config-only (vars, routes, bindings on non-critical Workers) | 15 minutes |
| Code change, tier-3 Worker | 30 minutes |
| Code change, tier-2 Worker | 2 hours |
| Code change, tier-1 Worker | 24 hours |
| Schema/migration change touching D1 or persistent state | 24 hours regardless of tier, plus explicit Platform lead sign-off |
During soak, someone on the owning team must check Workers analytics / Logpush for new error classes. No soak means no promotion — don't talk yourself out of this one.
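The soak table can be encoded as a lookup so that tooling can refuse to promote early. The sketch below is illustrative only: the `Change` type and the function names are hypothetical, and `kind: "config"` models just the table's config-only row for non-critical Workers (the table does not say what applies to config changes on critical Workers, so that case is not modeled).

```typescript
// Hypothetical encoding of the soak table; names and types are illustrative.
type Tier = 1 | 2 | 3;
type ChangeKind = "config" | "code" | "schema";

interface Change {
  kind: ChangeKind;
  tier: Tier; // ignored for "config" and "schema", which are tier-independent here
}

// Minimum soak in minutes, following the table row by row.
function minimumSoakMinutes(change: Change): number {
  if (change.kind === "schema") return 24 * 60; // 24 hours regardless of tier
  if (change.kind === "config") return 15;
  if (change.tier === 1) return 24 * 60;
  if (change.tier === 2) return 2 * 60;
  return 30; // tier-3 code change
}

// Schema/migration changes also need explicit Platform lead sign-off.
function requiresPlatformLeadSignOff(change: Change): boolean {
  return change.kind === "schema";
}

// True once the change has been in staging for at least the minimum window.
function soakSatisfied(change: Change, deployedToStagingAt: Date, now: Date): boolean {
  const elapsedMinutes = (now.getTime() - deployedToStagingAt.getTime()) / 60000;
  return elapsedMinutes >= minimumSoakMinutes(change);
}
```

A check like this covers only the clock. It does not replace a person reviewing analytics for new error classes during the window.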
Rollback
Cloudflare Workers support instant rollback to the previous deployment (for example, with `wrangler rollback`).
Rollback is not a recovery plan; it's a containment action. After rolling back:
- Open an incident in the on-call channel.
- Confirm metrics returned to baseline.
- Leave staging on the broken version so the fix can be developed against the real failure.
- Post-mortem within 5 business days for any tier-1 rollback.
For Workers backed by persistent state (D1, R2, Durable Objects), rollback of the code does not undo data changes. Migrations must be designed reversibly (see "Schema-safe deploys" below).
Schema-safe deploys — the expand/contract pattern
Any change to D1 schema, KV key shape, R2 path convention, or persisted Durable Object state follows expand/contract over at least two releases:
1. Expand: deploy code and schema change that can read both old and new shapes. Old code still works.
2. Wait at least one full soak window and confirm no regressions.
3. Contract: deploy follow-up release that removes old-shape support.
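This keeps rollback safe. The expand release (step 1) can always be rolled back without data loss because old code is still schema-compatible. The contract release (step 3) is rollback-safe because rolling it back lands on the expand release, which still reads the new shape, and by then the old shape is gone on purpose, so there is no old-shape data left to corrupt.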
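As an illustration of the expand phase, the sketch below reads both shapes and writes a value that old code can still read. The record shapes (`email` moving to `emails`) and the function names are invented for the example and do not come from any real Worker. Writing both fields during expand is an assumption about how "old code still works" is achieved.

```typescript
// Hypothetical record shapes, for illustration only.
interface OldRecord {
  email: string;
}

interface NewRecord {
  emails: string[];
}

// Expand-phase reader: accepts either shape and normalizes to the new one.
function readRecord(raw: string): NewRecord {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed === "object" && parsed !== null) {
    const candidate = parsed as Partial<OldRecord & NewRecord>;
    if (Array.isArray(candidate.emails)) {
      return { emails: candidate.emails };
    }
    if (typeof candidate.email === "string") {
      return { emails: [candidate.email] };
    }
  }
  throw new Error("unrecognized record shape");
}

// Expand-phase writer (assumption): emit both fields so that a rollback to the
// previous release still finds the `email` field it expects.
function writeRecordDuringExpand(record: NewRecord): string {
  return JSON.stringify({ emails: record.emails, email: record.emails[0] ?? "" });
}
```

The contract release would then drop the `email` branch from the reader and the extra field from the writer, once the soak window has passed and no old-shape values remain.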
CI gate — what runs where
The stub ships .github/workflows/ci.yml with three jobs:
| Job | Runs on | What it does |
|---|---|---|
| `quality` | Every push, every PR | Steps 1–6 above against the PR head. |
| `deploy-staging` | Push to `main` | Runs `quality`, then `wrangler deploy --env staging`, then smoke test against staging URL. |
| `deploy-production` | Git tag matching `v*.*.*` | Runs `quality`, deploys to production, runs production smoke test, posts deploy notification. |
CI must use an API token scoped to Workers Scripts:Edit plus the specific zones the Worker touches. Do not use an account-global token.
Smoke tests
Every Worker ships a scripts/smoke.sh that, given a base URL, probes the well-known health endpoint (GET /__health) and any other endpoint critical enough that a silent 500 would matter. The stub's src/index.ts includes the /__health handler by default — keep it.
Smoke tests check:
- 200 response to `/__health`
- Response body includes the deployed commit SHA (the stub wires this via `vars.COMMIT_SHA` set at deploy time)
- Critical custom endpoints return 2xx or the documented expected status
A failing smoke test after deploy triggers an automatic rollback attempt in CI, followed by a human page.
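For orientation, a health handler and the matching check might look roughly like this. It is a sketch under assumptions, not the stub's actual `src/index.ts`: the `Env` shape, the JSON field names `status` and `commit`, and the helpers `healthPayload` and `healthCheckPasses` are hypothetical.

```typescript
// Hypothetical sketch of a /__health handler; the stub's real handler may differ.
interface Env {
  COMMIT_SHA: string; // assumed to be populated from vars.COMMIT_SHA at deploy time
}

function healthPayload(env: Env): { status: string; commit: string } {
  return { status: "ok", commit: env.COMMIT_SHA };
}

const worker = {
  fetch(request: Request, env: Env): Response {
    const { pathname } = new URL(request.url);
    if (request.method === "GET" && pathname === "/__health") {
      return new Response(JSON.stringify(healthPayload(env)), {
        status: 200,
        headers: { "content-type": "application/json" },
      });
    }
    return new Response("not found", { status: 404 });
  },
};

// The first two smoke checks as a pure function: a 200 status and a body that
// contains the commit SHA that was just deployed.
function healthCheckPasses(status: number, body: string, expectedSha: string): boolean {
  return status === 200 && body.includes(expectedSha);
}
```

In a Worker module, `worker` would be the default export. Checking the SHA as well as the status catches the case where the deploy reported success but an older version is still serving traffic.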
Pre-merge checklist (for PR description)
- [ ] Lint, typecheck, unit tests pass locally
- [ ] Secret scan clean (scripts/preflight-secrets.sh)
- [ ] wrangler deploy --dry-run --env staging succeeds
- [ ] Worker name, routes, bindings conform to 01 & 02
- [ ] If schema/state change: expand/contract plan documented in PR
- [ ] If tier-1: Platform lead reviewer added
- [ ] CHANGELOG.md updated
Pre-production checklist (before tagging a release)
- [ ] Staging soak window met for this change type and tier
- [ ] No new error classes in staging analytics during soak
- [ ] Rollback target identified (previous production deployment ID)
- [ ] On-call aware of deploy window
- [ ] Comms posted for tier-1 changes
Observability — what we watch
Every Worker emits, at minimum:
- `request.started` / `request.completed` with duration and status
- `error.caught` for any thrown exception handled by the top-level error boundary
- `dependency.call` for subrequests, with duration and outcome
The stub's src/lib/logger.ts provides structured JSON logs in a shape Logpush and Workers Analytics Engine can both ingest. Don't reinvent logging — extend the logger if you need a new field.
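To make the event shape concrete, here is a minimal sketch of a structured logger for these events. It is not the stub's `src/lib/logger.ts`, which remains the source of truth. The field names `event`, `ts`, and `durationMs`, and the helpers `formatEvent`, `logEvent`, and `requestCompletedFields`, are assumptions made for illustration.

```typescript
// Hypothetical logger sketch; field names and helpers are assumptions.
type EventName =
  | "request.started"
  | "request.completed"
  | "error.caught"
  | "dependency.call";

interface LogFields {
  [key: string]: string | number | boolean | undefined;
}

// Serializes one event as a single JSON line. Core fields are spread last so
// that caller-supplied fields cannot overwrite `event` or `ts`.
function formatEvent(event: EventName, fields: LogFields, now: Date = new Date()): string {
  return JSON.stringify({ ...fields, event, ts: now.toISOString() });
}

function logEvent(event: EventName, fields: LogFields = {}): void {
  console.log(formatEvent(event, fields));
}

// Fields for request.completed: duration and status, as required above.
function requestCompletedFields(startedAtMs: number, status: number, nowMs: number): LogFields {
  return { durationMs: nowMs - startedAtMs, status };
}
```

A handler would call `logEvent("request.started")` on entry and `logEvent("request.completed", requestCompletedFields(startedAtMs, response.status, Date.now()))` on exit. New fields go through the same `fields` argument rather than a second logger.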
Deploy freezes
Deploy freezes apply during:
- Major client launches (announced ≥ 72 hours in advance)
- End-of-quarter close for billing-related Workers
- Platform lead discretion during active incidents
Freezes are announced in the platform channel and tracked in a freeze log alongside this SOP. Break-glass deploys during a freeze require Platform lead approval, recorded in the PR.