
05 — Resiliency & Disaster Recovery

Not every Worker deserves the same amount of engineering effort on redundancy. A marketing banner Worker failing for 10 minutes is mildly annoying; a billing webhook failing for 10 minutes could cost real money and trust. This document defines the tiers of criticality, what each tier requires, and how to prove those requirements are met.

Tiers of criticality

Every Worker is classified into exactly one tier. The tier is recorded in the Worker's Cloudflare tag (criticality:tier-1|tier-2|tier-3) and in the Worker's README.

Tier 1 — Mission-critical

Definition: Downtime directly harms Adventive or our clients. Examples: billing webhooks, auth/session issuance, client-facing form relays tied to lead revenue, payment callbacks.

Posture required:

  • Redundancy: Multi-region is inherent to Cloudflare Workers (they run at the edge), but persistent-state dependencies must be redundant too: D1 read replicas where available; R2 with replication to a second bucket; critical Queues with a dead-letter queue (DLQ) and an alternate consumer.
  • Graceful degradation: Define what "degraded but open" looks like for the Worker, and implement it. Example: a billing webhook that can't reach its database persists the event to a fallback queue and returns 202, never 5xx (see the sketch after this list).
  • Circuit breakers and retries on every outbound dependency, with explicit timeouts (never rely on defaults).
  • Observability: Structured logs + Workers Analytics Engine or Logpush to an external sink. Alerting on error rate, p95 latency, and absolute volume drop.
  • Runbook: A RUNBOOK.md in the repo with "if you are paged, do this." Updated after every incident.
  • Deploy rules: 24-hour staging soak (see 04). Tail consumer Worker configured. Platform lead review on every PR.
  • Backup & restore: Verified restore drill every 90 days for any D1/R2/DO state the Worker owns. "Verified" means an engineer other than the owner restored a snapshot to a test environment and exercised the Worker against it.
  • Third-party dependency map: README lists every external service the Worker calls, the expected latency, the timeout set, and the behavior on failure.
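
To make the degraded-but-open and explicit-timeout rules concrete, here is a minimal sketch of a tier-1 webhook handler. The binding names (DB for D1, FALLBACK for a Queue whose consumer replays events into the database) and the 2-second budget are illustrative assumptions, not the stub's actual implementation.

```ts
// Sketch only: a tier-1 billing webhook that degrades gracefully instead of returning 5xx.
// Hypothetical bindings: DB = D1 database, FALLBACK = Queue whose consumer replays into D1.
export interface Env {
  DB: D1Database;
  FALLBACK: Queue;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const event = await request.json<Record<string, unknown>>();

    try {
      // Explicit timeout on the outbound dependency; never rely on defaults.
      const insert = env.DB
        .prepare("INSERT INTO billing_events (id, payload) VALUES (?, ?)")
        .bind(crypto.randomUUID(), JSON.stringify(event))
        .run();
      await Promise.race([
        insert,
        new Promise((_, reject) => setTimeout(() => reject(new Error("d1_timeout")), 2_000)),
      ]);
      return new Response("ok", { status: 200 });
    } catch (err) {
      // Degraded but open: queue the event for replay and accept the webhook.
      await env.FALLBACK.send(event);
      // Structured log line for the tail consumer / Logpush sink.
      console.log(JSON.stringify({ level: "warn", msg: "billing db unavailable, event queued", error: String(err) }));
      return new Response("accepted", { status: 202 });
    }
  },
} satisfies ExportedHandler<Env>;
```

The fallback queue (and its DLQ) gives the consumer a durable record to replay once the database is healthy again.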

Tier 2 — Important

Definition: Downtime is noticeable and annoying but not revenue-impacting within the window it would take us to respond.

Posture required:

  • Circuit breakers on critical dependencies; timeouts set explicitly.
  • Structured logs.
  • Alerting on error rate > 5% for 5 minutes.
  • 2-hour staging soak.
  • RUNBOOK.md with at minimum: owners, what the Worker does, where logs are, how to roll back.
  • Backup: for any persistent state, documented backup mechanism (R2 lifecycle rules, D1 exports); restore drill annually.

Tier 3 — Routine

Definition: Internal tools, experiments, landing pages, and client-specific builds where failure is recoverable by reverting and the blast radius is small.

Posture required:

  • Observability enabled (it's the default in the stub).
  • Staging soak per 04.
  • Owner documented in README.
  • Backups for any non-reproducible state.

Matching tier to our product surface

Some rough defaults so this doesn't turn into a tier-war:

  • Tier 1: payment webhooks, billing sync, auth issuance; form relays tied to client lead revenue
  • Tier 2: public marketing sites, Turnstile proxies, image resize, core microservices under svc
  • Tier 3: internal dashboards, staff tools, experimentation, most cli landing pages

When a cli project is tied to a campaign worth six figures, elevate it to tier-2 for the duration. Write the tier change into the project's README with a date and a reason.

Backup & disaster recovery patterns

D1

  • Use wrangler d1 export on a schedule for tier-1 databases. Store exports in an R2 bucket with a 30-day lifecycle for tier-1, 7-day for tier-2.
  • Test a full restore quarterly for tier-1, annually for tier-2.
  • For reference data that must survive a Cloudflare account compromise, replicate daily to a second cloud (S3 / GCS) via a scheduled Worker (the snapshot pattern is sketched after this list).
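
The last bullet's scheduled-Worker pattern might look like the sketch below. It dumps a couple of hypothetical tables to a hypothetical BACKUPS R2 bucket; a true off-Cloudflare copy would swap the R2 put for a signed PUT to S3 or GCS, and full, consistent exports should still go through wrangler d1 export.

```ts
// Sketch only: nightly D1 snapshot from a scheduled Worker.
// Hypothetical bindings: DB = D1 database, BACKUPS = R2 bucket; table names are illustrative.
export interface Env {
  DB: D1Database;
  BACKUPS: R2Bucket;
}

export default {
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext): Promise<void> {
    const tables = ["billing_events", "customers"];
    const stamp = new Date().toISOString().slice(0, 10);

    for (const table of tables) {
      // Naive table-by-table dump; good enough for small reference data.
      const { results } = await env.DB.prepare(`SELECT * FROM ${table}`).all();
      await env.BACKUPS.put(`d1/${stamp}/${table}.json`, JSON.stringify(results), {
        httpMetadata: { contentType: "application/json" },
      });
    }
    // For an off-Cloudflare copy, swap the R2 put for a signed PUT to S3 / GCS.
  },
} satisfies ExportedHandler<Env>;
```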

R2

  • Enable object versioning on tier-1 buckets and keep at least 30 days of versions.
  • For irreplaceable objects (client deliverables, billing records), replicate cross-bucket to a second account or a second region where supported. If cross-account replication isn't available for the use case, a nightly R2-to-R2 sync Worker is acceptable (see the sketch after this list).
  • Document the retention policy in the bucket's description and in the Worker's README.
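
Where native replication isn't an option, the nightly sync Worker can be as small as the sketch below. SRC and DST are hypothetical bucket bindings; a destination bucket in a second account would in practice be reached through R2's S3-compatible API rather than a plain binding.

```ts
// Sketch only: nightly R2-to-R2 sync for irreplaceable objects.
// Hypothetical bindings: SRC and DST = R2 buckets.
export interface Env {
  SRC: R2Bucket;
  DST: R2Bucket;
}

export default {
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext): Promise<void> {
    let cursor: string | undefined;
    do {
      const page = await env.SRC.list({ cursor, limit: 500 });
      for (const obj of page.objects) {
        // Copy only objects the destination doesn't have yet; an etag comparison
        // would also catch overwrites, omitted here to keep the sketch short.
        if (await env.DST.head(obj.key)) continue;
        const src = await env.SRC.get(obj.key);
        if (src) await env.DST.put(obj.key, src.body, { httpMetadata: src.httpMetadata });
      }
      cursor = page.truncated ? page.cursor : undefined;
    } while (cursor);
  },
} satisfies ExportedHandler<Env>;
```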

KV

  • KV is best-effort and eventually consistent. Treat it as a cache unless explicitly stated otherwise (the read-through pattern is sketched after this list).
  • For tier-1 KV usage (e.g. rate-limit counters, auth tokens), document what happens on loss of that data. If the answer is "service is broken," the data belongs in D1 or a DO, not KV.
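
A minimal read-through sketch that treats KV strictly as a cache, assuming hypothetical CACHE (KV) and DB (D1) bindings: a miss, an eviction, or total loss of the namespace only costs a rebuild from the authoritative store.

```ts
// Sketch only: read-through lookup that treats KV strictly as a cache.
// Hypothetical bindings: CACHE = KV namespace, DB = D1 database (the source of truth).
async function getPlan(env: { CACHE: KVNamespace; DB: D1Database }, customerId: string) {
  const cached = await env.CACHE.get(`plan:${customerId}`, "json");
  if (cached !== null) return cached;

  // Cache miss or KV data loss: rebuild from the authoritative store.
  const row = await env.DB
    .prepare("SELECT plan FROM customers WHERE id = ?")
    .bind(customerId)
    .first();
  if (row) await env.CACHE.put(`plan:${customerId}`, JSON.stringify(row), { expirationTtl: 3600 });
  return row;
}
```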

Durable Objects

  • DOs hold authoritative state for tier-1 use cases (WebSocket coordination, rate limiting, transactional counters). Back them with periodic snapshots to R2 if the state is expensive to reconstruct (a minimal snapshot sketch follows this list).
  • Document DO lifecycle: when they're created, what triggers eviction, how to audit which DOs exist.
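
One way to take those periodic snapshots is an alarm inside the DO itself. The sketch below assumes a hypothetical SNAPSHOTS R2 binding; the class name, keys, and hourly cadence are illustrative.

```ts
// Sketch only: a Durable Object that snapshots its storage to R2 on a periodic alarm.
// Hypothetical binding: SNAPSHOTS = R2 bucket.
export interface Env {
  SNAPSHOTS: R2Bucket;
}

export class Counter {
  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(request: Request): Promise<Response> {
    const value = ((await this.state.storage.get<number>("count")) ?? 0) + 1;
    await this.state.storage.put("count", value);
    // Arm the snapshot alarm if one isn't already scheduled.
    if ((await this.state.storage.getAlarm()) === null) {
      await this.state.storage.setAlarm(Date.now() + 60 * 60 * 1000);
    }
    return new Response(String(value));
  }

  async alarm(): Promise<void> {
    // Dump the whole storage map so the state is cheap to reconstruct after loss.
    const entries = Object.fromEntries(await this.state.storage.list());
    await this.env.SNAPSHOTS.put(`do/counter/${this.state.id.toString()}.json`, JSON.stringify(entries));
    await this.state.storage.setAlarm(Date.now() + 60 * 60 * 1000);
  }
}
```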

Queues

  • Every tier-1 queue has a dead-letter queue and a documented procedure for replay.
  • Every tier-1 queue consumer declares max_retries and retry_delay explicitly; no relying on defaults (a consumer sketch follows this list).
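
A consumer that meets both bullets might look like this sketch. The queue's max_retries, retry_delay, and dead_letter_queue still live in its wrangler configuration; the DB binding and table are hypothetical.

```ts
// Sketch only: a tier-1 queue consumer with explicit per-message ack/retry.
// Hypothetical binding: DB = D1 database.
export interface Env {
  DB: D1Database;
}

export default {
  async queue(batch: MessageBatch<{ id: string }>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      try {
        await env.DB
          .prepare("INSERT OR IGNORE INTO events (id) VALUES (?)")
          .bind(msg.body.id)
          .run();
        msg.ack(); // explicit ack so one bad message doesn't retry the whole batch
      } catch (err) {
        console.log(JSON.stringify({ level: "error", msg: "event insert failed", id: msg.body.id, error: String(err) }));
        msg.retry({ delaySeconds: 60 }); // after max_retries, the message lands in the DLQ
      }
    }
  },
} satisfies ExportedHandler<Env, { id: string }>;
```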

Redundancy across Cloudflare itself

Cloudflare is highly available but not infinitely so. For tier-1 Workers where Adventive's SLA to a client depends on uptime:

  • Document the failure mode if the Cloudflare account, API, or edge is unavailable. A one-paragraph answer is enough.
  • Know what's recoverable offline. Code lives in git. Secrets are documented (by name, not value) in the Worker's README so they can be re-set if an account recovery is needed.
  • For data that cannot be lost, replicate off-Cloudflare on a schedule. The cost is small; the peace of mind is large.

Incident response — the minimum

When a Worker misbehaves:

  1. Confirm from metrics / Logpush that there's a real regression (not a scraper spike).
  2. If it's tier-1 or tier-2, open an incident in the platform channel.
  3. Attempt rollback (wrangler rollback --env production).
  4. If rollback fixes it: capture the diff between the bad and last-good deployments. Owner writes a post-mortem within 5 business days.
  5. If rollback doesn't fix it: move to mitigation mode — Worker kill-switch (disable route), cached-response fallback, or origin-direct DNS override, whichever is documented in the runbook.

Proving the posture — the quarterly audit

Once a quarter, the Platform lead runs an audit script (to be scheduled; see scripts/audit.sh in the stub) that:

  1. Lists every Worker in the account.
  2. Checks each against its declared tier for: tags set, observability.enabled = true, tail consumer present (tier-1), runbook present in repo (tier-1 and tier-2), backup drill performed on schedule (the check loop is sketched after this list).
  3. Produces a report. Anything non-compliant gets a ticket.
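
A skeleton of that check loop, in TypeScript for a Node 18+ runtime, could look like the sketch below. The account/token variables, the per-script settings endpoint, and the exact field names (tail_consumers, observability.enabled) are assumptions to verify against the current Cloudflare API docs before wiring this into scripts/audit.sh.

```ts
// Sketch only: skeleton of the quarterly audit's check loop (Node 18+ or Bun).
// Assumptions: CF_ACCOUNT_ID and CF_API_TOKEN in the environment; the settings
// path and field names below must be checked against the current API docs.
const ACCOUNT = process.env.CF_ACCOUNT_ID!;
const TOKEN = process.env.CF_API_TOKEN!;

async function api(path: string): Promise<any> {
  const res = await fetch(`https://api.cloudflare.com/client/v4/accounts/${ACCOUNT}${path}`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  return ((await res.json()) as any).result;
}

async function main(): Promise<void> {
  const scripts = await api("/workers/scripts"); // 1. list every Worker in the account
  for (const script of scripts) {
    const findings: string[] = [];
    // 2. check posture; field names are assumptions, adjust to the real payload
    const settings = await api(`/workers/scripts/${script.id}/settings`);
    if (!settings?.tail_consumers?.length) findings.push("no tail consumer");
    if (settings?.observability?.enabled !== true) findings.push("observability disabled");
    // 3. report; ticket creation stays with the Platform lead
    console.log(`${script.id}: ${findings.length ? findings.join(", ") : "ok"}`);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```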

The audit is not optional. Drift is the default; scheduled enforcement is how we prevent it.

When tier needs to change

  • Upgrading a tier is a PR against the Worker's README + tag, plus whatever work is needed to meet the new tier's posture (runbook, backup drill, alerts). Don't upgrade the tag without doing the work.
  • Downgrading a tier is rare and requires Platform lead sign-off. Usually it means we replaced the Worker with something better and this one is about to be retired.

This completes the SOP series. The stub project is in worker-stub/ (GitHub). Open its README.md to start your first Worker.