04 — Runbook

On-call quick reference

| Role | Name | Contact |
|---|---|---|
| Primary on-call | Jeffrey Lambert | Slack: @jeffrey |
| Secondary on-call | Patrick | Slack: @patrick |
| Escalation (Cloudflare issues) | Cloudflare Support | dash.cloudflare.com → Support |
| Escalation (Stripe issues) | Stripe Support | dashboard.stripe.com → Support |

Dashboards:

- Workers Analytics Engine: Cloudflare dashboard → Workers & Pages → adventive-admin-api → Analytics
- Real-time logs: wrangler tail adventive-admin-api --env production --format pretty
- Cloudflare Access audit log: Cloudflare dashboard → Zero Trust → Logs → Access

Status check (quick):

curl -s https://admin-api.adventive.com/health | jq .
# Expected: {"status":"ok","version":"x.y.z"}
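
For unattended checks (cron, CI), the same probe can be wrapped so that it exits non-zero on anything other than a healthy response. This is a minimal sketch assuming the response shape shown above. `health_is_ok` is a hypothetical helper name, and the match is a plain-text pattern, not a full JSON parse.

```shell
# health_is_ok: read a /health response body on stdin.
# Succeeds only if the body contains "status":"ok"
# (whitespace around the colon is tolerated).
health_is_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}
```

Usage: `curl -s --max-time 10 https://admin-api.adventive.com/health | health_is_ok || echo "health check failed" >&2`. An empty body (network failure) also counts as a failure.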

Day-to-day procedures

Add a new operator to Cloudflare Access

  1. Open Cloudflare Zero Trust dashboard → Access → Access Groups → Adventive Operators
  2. Click Edit → add the operator's email to the email list → Save
  3. If the operator is super-admin tier: update the RBAC_SUPER_ADMIN_EMAILS Wrangler secret and redeploy
    wrangler secret put RBAC_SUPER_ADMIN_EMAILS --env production
    # Enter new comma-separated list when prompted
    wrangler deploy --env production
    
  4. If the operator is billing tier: same procedure for RBAC_BILLING_EMAILS
  5. Send operator the admin URL and confirm they can authenticate: https://admin.adventive.com
  6. Verify first request appears in Workers Analytics Engine (shows operator email in request log)

Revoke operator access

  1. Zero Trust dashboard → Access → Access Groups → Adventive Operators → Edit → remove the operator's email → Save
  2. Revoke active sessions: Zero Trust → Logs → Access → search operator email → Revoke session
  3. If operator had billing or super-admin RBAC: remove from Wrangler secret and redeploy (same commands as add, minus their email)
  4. Confirm: opening https://admin.adventive.com in the revoked operator's browser should redirect to the Access login and then deny access

Deploy a hotfix

# 1. Make the fix on a hotfix branch, open PR, get review
git checkout -b hotfix/description
# ... make changes, commit ...
git push origin hotfix/description

# 2. After PR approval, merge to staging first
git checkout staging && git merge hotfix/description && git push

# 3. Verify fix on staging
# ... test manually or run relevant Playwright spec ...

# 4. Merge to main and trigger production deploy
git checkout main && git merge staging && git push

# 5. Monitor: watch tail logs and Analytics Engine dashboard for 10 minutes post-deploy
wrangler tail adventive-admin-api --env production --format pretty

For UI-only hotfixes: same flow, but only admin-ui build/deploy runs (no Worker redeploy needed).
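
The 10-minute watch in step 5 can be backed by a small polling loop alongside the tail output. The sketch below is not part of the repository: `monitor_health` and `fetch_health` are hypothetical names, and it assumes the /health endpoint returns `"status":"ok"` as shown in the status check.

```shell
# Hypothetical post-deploy monitor; names and defaults are assumptions.
HEALTH_URL="${HEALTH_URL:-https://admin-api.adventive.com/health}"

# fetch_health: print the /health response body (empty on network failure).
fetch_health() {
  curl -s --max-time 10 "$HEALTH_URL"
}

# monitor_health CHECKS INTERVAL_SECONDS
# Polls fetch_health CHECKS times, sleeping INTERVAL_SECONDS between polls.
# Returns 1 at the first response that does not report "status":"ok".
monitor_health() {
  local checks="$1" interval="$2" i=1 body
  while [ "$i" -le "$checks" ]; do
    body="$(fetch_health)" || body=""
    if ! printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'; then
      echo "check $i/$checks FAILED: ${body:-no response}" >&2
      return 1
    fi
    echo "check $i/$checks ok"
    if [ "$i" -lt "$checks" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}
```

Running `monitor_health 20 30` makes 20 checks 30 seconds apart, which covers roughly ten minutes. A non-zero exit is a cue to consider the rollback procedure.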

Roll back a Worker (admin-api)

# See available deployments
wrangler deployments list --env production

# Roll back to previous deployment
wrangler rollback --env production

# Or roll back to a specific deployment ID
wrangler rollback [deployment-id] --env production

# Verify
curl -s https://admin-api.adventive.com/health | jq .version
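
To make the verification less dependent on eyeballing, the version can be captured before and after the rollback and compared. This sketch assumes the `version` field in /health changes between deployments, which depends on how the build embeds it. Both function names are hypothetical.

```shell
# version_from_health: read a /health body on stdin and print the value of
# its "version" field. Prints nothing if the field is absent.
version_from_health() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# rollback_took_effect BEFORE AFTER
# Succeeds when both versions are non-empty and differ from each other.
rollback_took_effect() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}
```

For example, store `before="$(curl -s https://admin-api.adventive.com/health | version_from_health)"` prior to running `wrangler rollback`, capture `after` the same way, then call `rollback_took_effect "$before" "$after"`.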

Roll back the UI (admin-ui)

Option 1 (dashboard): Cloudflare Pages → adventive-admin-ui → Deployments tab → click the deployment you want → Rollback to this deployment

Option 2 (CLI):

wrangler pages deployment list --project-name adventive-admin-ui
wrangler pages deployment rollback [deployment-id] --project-name adventive-admin-ui

Rotate a Worker secret

# Example: rotating Stripe key after a security incident
wrangler secret put STRIPE_SECRET_KEY --env production
# Enter the new key value when prompted; takes effect on next request

wrangler secret put STRIPE_SECRET_KEY --env staging

No separate Worker redeploy is required: wrangler secret put publishes a new Worker version with the updated secret. Confirm the new key works:

# Hit a Stripe-dependent endpoint (e.g., invoice list)
curl -s https://admin-api.adventive.com/invoices \
  -H "Authorization: Bearer [valid-operator-token]" | jq .

Rotate RBAC operator lists

If an operator's RBAC tier changes without a corresponding Access Group change (e.g., an existing operator moves to a different tier), update the affected lists:

wrangler secret put RBAC_SUPER_ADMIN_EMAILS --env production
# Enter the full updated comma-separated list

wrangler secret put RBAC_BILLING_EMAILS --env production
# Enter the full updated list

No redeploy required.
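
Because each `wrangler secret put` replaces the whole value, a typo in the hand-edited list can silently drop an operator. The helpers below are a hypothetical sketch for building the new list. They assume the secret is a plain comma-separated string with no spaces around the commas, which is an assumption about how the Worker parses it.

```shell
# Hypothetical helpers for editing a comma-separated email list.

# list_has LIST EMAIL: succeed if EMAIL is an exact entry in LIST.
list_has() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# list_add LIST EMAIL: print LIST with EMAIL appended, unless already present.
list_add() {
  if [ -z "$1" ]; then
    printf '%s\n' "$2"
  elif list_has "$1" "$2"; then
    printf '%s\n' "$1"
  else
    printf '%s,%s\n' "$1" "$2"
  fi
}

# list_remove LIST EMAIL: print LIST without EMAIL, preserving order.
list_remove() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -Fxv -- "$2" | paste -sd, -
}
```

Wrangler does not print secret values, so the current list has to come from wherever the team records it. The result can be pasted at the prompt, or piped in, since Wrangler should accept the value on standard input: `list_add "$current" "new.operator@adventive.com" | wrangler secret put RBAC_BILLING_EMAILS --env production`.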

Roll a cohort back to the legacy admin

If an operator cohort encounters a workflow blocker in the new admin:

  1. Send affected operators the legacy admin URL: https://admin-legacy.adventive.com
  2. Confirm they can log in (JumpCloud LDAP credentials; Duo bypass currently active)
  3. Create a GitHub issue documenting: the workflow that failed, the operator who reported it, reproduction steps
  4. Do not disable the new admin — other cohorts may still be using it
  5. Fix the issue in a hotfix branch, test in staging, re-notify the cohort once resolved

No data migration is required in either direction — both admins share the same database.


Incident playbooks

Symptom: Admin UI shows blank page or 404 for all routes

Likely cause: Pages SPA routing fallback missing, or Pages deploy failed.

Diagnosis:

# Check Pages deployment status
wrangler pages deployment list --project-name adventive-admin-ui | head -5

# Check the _redirects file is present in the dist output
# Should contain: /* /index.html 200

Fix:

- If _redirects missing: add public/_redirects with content /* /index.html 200 and redeploy
- If Pages deploy failed: check GitHub Actions workflow for build error, fix, and push again

Verification: Navigate to https://admin.adventive.com/customers — should render the customers list, not a 404.
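
The _redirects check from the diagnosis step can be scripted so it also runs in CI before a deploy. This is a minimal sketch: `has_spa_fallback` is a hypothetical helper, and the `admin-ui/dist/_redirects` path in the usage line is an assumption about where the build writes its output.

```shell
# has_spa_fallback FILE
# Succeeds if FILE exists and contains the catch-all rewrite rule
# "/* /index.html 200" on a line of its own.
has_spa_fallback() {
  [ -f "$1" ] && grep -Eq '^/\*[[:space:]]+/index\.html[[:space:]]+200[[:space:]]*$' "$1"
}
```

Usage: `has_spa_fallback admin-ui/dist/_redirects || echo "SPA fallback rule missing" >&2`.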


Symptom: All API requests returning 401

Likely cause: CF Access JWT validation failing — wrong AUD tag, expired token, or Access policy misconfigured.

Diagnosis:

# 1. Check the CF_ACCESS_AUD secret is set
wrangler secret list --env production
# Verify CF_ACCESS_AUD is present (the command lists secret names only, not values)

# 2. Verify the AUD matches the Access Application
# Cloudflare dashboard → Access → Applications → Adventive Admin API → Application AUD

# 3. Check the Access audit log for the failing operator
# Zero Trust → Logs → Access → filter by operator email

Fix:

- If AUD mismatch: update the CF_ACCESS_AUD secret to match the Access Application's AUD tag
- If operator not in Access Group: add their email (see "Add a new operator" procedure)
- If token expired: operator needs to re-authenticate at https://admin.adventive.com; Access handles re-authentication automatically on next visit

Verification: Operator navigates to the admin UI, is redirected to Access login, authenticates, and lands on the dashboard.
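
When an AUD mismatch is suspected, it can help to look at the `aud` claim in a token the Worker actually receives. Cloudflare Access forwards the token in the `Cf-Access-Jwt-Assertion` request header and the `CF_Authorization` cookie. The sketch below decodes the payload segment only. `jwt_payload` is a hypothetical helper that does not verify the signature, so it is suitable for diagnosis and not for authorization decisions.

```shell
# jwt_payload TOKEN
# Print the decoded JSON payload (second dot-separated segment) of a JWT.
# The signature is NOT verified.
jwt_payload() {
  local seg pad
  # Take the payload segment and map base64url characters to standard base64.
  seg="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # Restore the '=' padding that base64url encoding strips.
  pad=$(( (4 - ${#seg} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    seg="${seg}="
    pad=$((pad - 1))
  done
  printf '%s' "$seg" | base64 -d
}
```

Usage: `jwt_payload "$TOKEN" | jq '{aud, email, exp}'`, then compare `aud` with the Application AUD shown in the dashboard.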


Symptom: API requests returning 500 on database operations

Likely cause: Hyperdrive connection failure, MySQL server unreachable, or schema mismatch.

Diagnosis:

# 1. Check real-time Worker logs for the error
wrangler tail adventive-admin-api --env production --format pretty

# 2. Look for: "connect ETIMEDOUT", "Access denied", "Unknown column"
# - ETIMEDOUT: DB host unreachable or Hyperdrive misconfigured
# - Access denied: DB credentials wrong or rotated without updating Hyperdrive
# - Unknown column: schema drift — a schema change was deployed that breaks this Worker

# 3. Test DB connectivity from a known-good host (EC2 → MySQL)

Fix:

- ETIMEDOUT: Check MySQL server status on EC2; check Hyperdrive binding in wrangler.toml points to correct host
- Access denied: Rotate and re-set DB credentials in Hyperdrive configuration (Cloudflare dashboard → Workers & Pages → Hyperdrive → edit binding)
- Schema mismatch: Identify the schema change, either revert it or update the Worker query to handle both old and new schema (schema-freeze policy must not be violated during transition)

Verification: curl -s https://admin-api.adventive.com/customers | jq .total should return a number (not an error).
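
The three log signatures from the diagnosis step map directly onto the three fixes, so the triage can be scripted. This is a minimal sketch: `classify_db_error` and its category labels are hypothetical, and it only matches the exact strings listed above.

```shell
# classify_db_error: read log text on stdin and print one category label,
# based on the signatures listed in the diagnosis step.
# Signatures are checked in the order shown; the first match wins.
classify_db_error() {
  local log
  log="$(cat)"
  case "$log" in
    *"connect ETIMEDOUT"*) echo "unreachable" ;;   # DB host or Hyperdrive config
    *"Access denied"*)     echo "credentials" ;;   # DB credentials wrong or rotated
    *"Unknown column"*)    echo "schema-drift" ;;  # schema change broke a query
    *)                     echo "unknown" ;;
  esac
}
```

Usage: save a few lines of tail output to a file and run `classify_db_error < snippet.log`.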


Symptom: Stripe-dependent endpoints (invoices, billing) returning errors

Likely cause: Stripe API key expired or revoked, Stripe API down, or rate limit hit.

Diagnosis:

# 1. Check Worker logs for Stripe error codes
wrangler tail adventive-admin-api --env production --format pretty
# Look for: Stripe error types (e.g., "authentication_error", "rate_limit_error")

# 2. Verify key is valid
# Stripe dashboard → Developers → API keys → confirm restricted key for admin-api is active

# 3. Check Stripe status
# https://status.stripe.com

Fix:

- Authentication failure: rotate STRIPE_SECRET_KEY (see "Rotate a Worker secret" procedure)
- Rate limit: Stripe rate limits are very generous for invoice reads; if hit, investigate for a loop or runaway client. Implement request-level caching if needed.
- Stripe outage: no fix on our side; surface the error to operators with a clear message and fall back to a read-only view if possible


Symptom: Operator reports "Access Denied" when navigating to a route they should have access to

Likely cause: RBAC tier misconfigured — operator's email not in the correct Wrangler secret list.

Diagnosis:

# 1. Check which email is in the Access JWT by inspecting the request log
wrangler tail adventive-admin-api --env production --format pretty
# Look for the request from the operator and the email extracted from the JWT

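# 2. Confirm the RBAC secrets exist
wrangler secret list --env production
# This lists secret names only; values cannot be read back.
# Check the operator's email against the team's recorded list for
# RBAC_SUPER_ADMIN_EMAILS or RBAC_BILLING_EMAILS, or re-enter the full list (see Fix)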

Fix:

# Update the relevant RBAC secret to include the operator's email
wrangler secret put RBAC_BILLING_EMAILS --env production
# Enter updated comma-separated list

No redeploy needed. Verify: operator refreshes the admin UI and retries the route.


Symptom: Workers Analytics Engine shows no data (blank dashboard)

Likely cause: Analytics binding misconfigured in wrangler.toml, or writeDataPoint calls not executing.

Diagnosis:

# Check that the ANALYTICS binding is defined in wrangler.toml
grep -A3 "analytics_engine_datasets" admin-api/wrangler.toml

# Check if Worker is emitting events by tailing and triggering a request
wrangler tail adventive-admin-api --env production --format pretty

Fix: If binding is missing or misconfigured, add it to wrangler.toml and redeploy. Analytics Engine data appears with a ~1-minute delay; wait before concluding there's an issue.
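
One case the plain grep above does not distinguish: Wrangler does not inherit bindings from the top level into named environments, so a top-level `[[analytics_engine_datasets]]` block alone leaves `--env production` without the binding. The check below is a hypothetical sketch that looks specifically for the environment-scoped table.

```shell
# has_env_analytics_binding FILE ENV
# Succeeds if FILE declares [[env.ENV.analytics_engine_datasets]],
# i.e. the binding is defined for that specific environment.
has_env_analytics_binding() {
  grep -Eq "^[[:space:]]*\[\[env\.$2\.analytics_engine_datasets\]\]" "$1"
}
```

Usage: `has_env_analytics_binding admin-api/wrangler.toml production || echo "no Analytics Engine binding for production" >&2`.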


Symptom: Cloudflare Pages build failing

Likely cause: TypeScript errors, missing environment variables, or dependency issues.

Diagnosis:

- Check GitHub Actions workflow run for the build step output
- Common failures: VITE_API_BASE_URL not set in Pages environment, generated API types out of sync with openapi.json

Fix:

- Set missing env var in Cloudflare Pages dashboard → Settings → Environment variables
- Regenerate API types and commit: cd admin-ui && npm run generate:types && git add src/api/generated.ts && git commit -m "chore: regenerate API types"
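
A missing build-time variable is easier to spot if the build fails fast with its name. The helper below is a hypothetical preflight sketch that could run before the build step. Only VITE_API_BASE_URL comes from this runbook; any other names passed to it would be assumptions.

```shell
# missing_env NAME...
# Print each NAME whose variable is unset or empty.
# Returns non-zero if at least one is missing.
missing_env() {
  local name value rc=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names must be trusted).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: `missing_env VITE_API_BASE_URL || { echo "set the variables listed above" >&2; exit 1; }`.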


Investigate a failed Access login

When an operator reports they cannot authenticate to the admin:

  1. Check Access audit log: Zero Trust → Logs → Access → filter by operator email and time range
  2. Common outcomes to look for:
     - ALLOW with a block at the application level → operator email not in Access Group
     - BLOCK with reason "Policy: Email not in list" → same; add email to group
     - BLOCK with reason "MFA required" → operator's identity provider session does not have MFA; operator needs to re-authenticate with MFA on their IdP account
     - BLOCK with reason "Revoked session" → manually revoked; investigate why, and re-add if legitimate
  3. If none of the above: ask operator to clear browser cookies for adventive.cloudflareaccess.com and retry. If still failing, escalate to Cloudflare Support with the CF-Ray ID from the blocked request.

Decommission the CodeIgniter admin (Phase 10)

Perform only after the 30-day freeze window with zero operator rollbacks and zero legacy admin traffic.

Pre-decommission verification:
1. Pull legacy admin access logs from EC2: zero operator sessions in last 14 days
2. Confirm all Cron Trigger Workers are live and delivering scheduled reports
   (formerly Reporting.php — verify each report type is being generated)
3. Confirm override file S3 → R2 migration is complete
4. Confirm Phase 0 secret rotation is complete (adventive.php has no live credentials)

Shutdown sequence:
1. Disable legacy admin login page (add maintenance notice: "This system has been retired.
   Use https://admin.adventive.com")
2. Set all legacy admin routes to return 410 Gone
3. Let sit for 7 days — monitor for any automated callers hitting legacy URLs
4. Archive Bitbucket repo: Settings → Archive repository
5. Terminate EC2 instance (coordinate with Patrick for infra sign-off)
6. Remove DNS record for admin-legacy.adventive.com
7. Remove Cloudflare Access application for legacy admin
8. Remove Bitbucket Pipelines deploy pipeline (or disable CI)
9. Update README.md in this planning folder: set Status = Decommissioned, Date = [today]
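
Shutdown step 2 can be spot-checked by probing a handful of legacy URLs and flagging any that do not return 410. This is a minimal sketch: `report_not_gone` is a hypothetical helper, and the list of paths to probe is not given in this runbook, so it has to be supplied by whoever runs the check.

```shell
# report_not_gone: read "STATUS URL" lines on stdin and print every line
# whose status is not 410. Returns non-zero if any such line was found.
#
# Lines can be produced with curl, for example:
#   curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$url"
report_not_gone() {
  awk 'NF && $1 != 410 { print; bad = 1 } END { exit bad }'
}
```

An empty output with exit status 0 means every probed route answered 410 Gone.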

Known issues & workarounds

| Issue | Workaround | Permanent fix |
|---|---|---|
| Schema-freeze policy blocks column renames | Use dual-write pattern: write to both old and new column names until legacy admin is decommissioned | Decommission legacy admin (Phase 10) |
| PHPExcel (abandoned) still used for month-end Excel export | Legacy admin continues to generate until Cron Trigger Worker replacement is live | Port month-end summary to PhpSpreadsheet or generate CSV via Cron Trigger Worker |
| Duo 2FA currently bypassed on legacy admin | Legacy admin only accessible during parallel-run; Cloudflare Access enforces MFA for new admin | Decommission legacy admin (Phase 10) |
| CF Access directory provider not yet chosen (JumpCloud vs. Google Workspace) | Use email allowlist as interim identity strategy | Resolve identity provider ADR before Phase 1 completes |
| Report_model.php (5,333 lines, 24 bespoke SQL functions) not ported | Partner reports continue to run from legacy Cron jobs during Phases 1–8 | Phase 4: evaluate dedicated analytics Worker or Cloudflare Analytics product |