04 — Runbook
On-call quick reference
| Role | Name | Contact |
|---|---|---|
| Primary on-call | Jeffrey Lambert | Slack: @jeffrey |
| Secondary on-call | Patrick | Slack: @patrick |
| Escalation (Cloudflare issues) | Cloudflare Support | dash.cloudflare.com → Support |
| Escalation (Stripe issues) | Stripe Support | dashboard.stripe.com → Support |
Dashboards:
- Workers Analytics Engine: Cloudflare dashboard → Workers & Pages → adventive-admin-api → Analytics
- Real-time logs:
  wrangler tail adventive-admin-api --env production --format pretty
- Cloudflare Access audit log: Cloudflare dashboard → Zero Trust → Logs → Access
Status check (quick):
curl -s https://admin-api.adventive.com/health | jq .
# Expected: {"status":"ok","version":"x.y.z"}

Day-to-day procedures
Add a new operator to Cloudflare Access
- Open Cloudflare Zero Trust dashboard → Access → Access Groups → Adventive Operators
- Click Edit → add the operator’s email to the email list → Save
- If the operator is super-admin tier: update the RBAC_SUPER_ADMIN_EMAILS Wrangler secret and redeploy
  wrangler secret put RBAC_SUPER_ADMIN_EMAILS --env production
  # Enter new comma-separated list when prompted
  wrangler deploy --env production
- If the operator is billing tier: same procedure for RBAC_BILLING_EMAILS
- Send operator the admin URL and confirm they can authenticate: https://admin.adventive.com
- Verify first request appears in Workers Analytics Engine (shows operator email in request log)
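Because `wrangler secret put` replaces the whole secret value and Wrangler does not print existing values, the full updated list has to be assembled from the team's own record before it is entered at the prompt. A minimal sketch of that assembly step, assuming the list is plain comma-separated emails with no spaces; the function name `add_email` and the sample addresses are hypothetical:

```shell
#!/bin/sh
# add_email CURRENT_LIST NEW_EMAIL
# Prints CURRENT_LIST with NEW_EMAIL appended, unless it is already present.
# Assumes a comma-separated list with no embedded whitespace.
add_email() {
  list=$1
  email=$2
  case ",$list," in
    *",$email,"*) printf '%s\n' "$list" ;;         # already listed: unchanged
    ",,")         printf '%s\n' "$email" ;;        # empty list: just the new email
    *)            printf '%s\n' "$list,$email" ;;  # otherwise append
  esac
}

# Example with placeholder addresses
add_email "alice@example.com,bob@example.com" "carol@example.com"
```

Wrapping the list in commas before matching keeps one address from matching as a substring of another.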
Revoke operator access
- Zero Trust dashboard → Access → Access Groups → Adventive Operators → Edit → remove the operator’s email → Save
- Revoke active sessions: Zero Trust → Logs → Access → search operator email → Revoke session
- If operator had billing or super-admin RBAC: remove from Wrangler secret and redeploy (same commands as add, minus their email)
- Confirm: an attempt to access https://admin.adventive.com from the revoked operator’s browser should redirect to Access login and then deny
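The "minus their email" step can be done mechanically rather than by hand-editing the list. A minimal sketch, assuming the same comma-separated format with no spaces; `remove_email` and the sample addresses are hypothetical:

```shell
#!/bin/sh
# remove_email CURRENT_LIST EMAIL
# Prints CURRENT_LIST without EMAIL. Entries are compared as whole,
# fixed strings, so a partial match never removes another address.
remove_email() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -Fxv -- "$2" | paste -sd, -
}

# Example with placeholder addresses
remove_email "alice@example.com,bob@example.com,carol@example.com" "bob@example.com"
```

The output is the value to enter when `wrangler secret put` prompts for the updated list.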
Deploy a hotfix
# 1. Make the fix on a hotfix branch, open PR, get review
git checkout -b hotfix/description
# ... make changes, commit ...
git push origin hotfix/description

# 2. After PR approval, merge to staging first
git checkout staging && git merge hotfix/description && git push

# 3. Verify fix on staging
# ... test manually or run relevant Playwright spec ...

# 4. Merge to main and trigger production deploy
git checkout main && git merge staging && git push

# 5. Monitor: watch tail logs and Analytics Engine dashboard for 10 minutes post-deploy
wrangler tail adventive-admin-api --env production --format pretty

For UI-only hotfixes: same flow, but only admin-ui build/deploy runs (no Worker redeploy needed).
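Before merging, it can help to confirm locally whether a hotfix really is UI-only. The sketch below assumes the repository keeps the Worker under a top-level admin-api/ directory (the path used elsewhere in this runbook) and is not the CI pipeline's actual path filter; `needs_worker_deploy` and the sample file names are hypothetical:

```shell
#!/bin/sh
# needs_worker_deploy reads changed file paths on stdin, one per line,
# and succeeds (exit 0) if any of them lies under admin-api/.
needs_worker_deploy() {
  grep -q '^admin-api/'
}

# Typical input (not run here):
#   git diff --name-only main...hotfix/description | needs_worker_deploy
# Demo with placeholder paths:
if printf '%s\n' "admin-ui/src/App.tsx" "admin-ui/package.json" | needs_worker_deploy; then
  echo "Worker deploy needed"
else
  echo "UI-only hotfix"
fi
```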
Roll back a Worker (admin-api)
# See available deployments
wrangler deployments list --env production

# Roll back to previous deployment
wrangler rollback --env production

# Or roll back to a specific deployment ID
wrangler rollback [deployment-id] --env production

# Verify
curl -s https://admin-api.adventive.com/health | jq .version

Roll back the UI (admin-ui)
Option 1 (dashboard): Cloudflare Pages → adventive-admin-ui → Deployments tab → click the deployment you want → Rollback to this deployment
Option 2 (CLI):
wrangler pages deployment list --project-name adventive-admin-ui
wrangler pages deployment rollback [deployment-id] --project-name adventive-admin-ui

Rotate a Worker secret
# Example: rotating Stripe key after a security incident
wrangler secret put STRIPE_SECRET_KEY --env production
# Enter the new key value when prompted; takes effect on next request

wrangler secret put STRIPE_SECRET_KEY --env staging

No separate wrangler deploy is required: wrangler secret put publishes a new version of the Worker with the updated secret immediately. Confirm the new key works:
# Hit a Stripe-dependent endpoint (e.g., invoice list)
curl -s https://admin-api.adventive.com/invoices \
  -H "Authorization: Bearer [valid-operator-token]" | jq .

Rotate RBAC operator lists
If operators are added or removed without a corresponding Access Group change (e.g., role change within existing operators):
wrangler secret put RBAC_SUPER_ADMIN_EMAILS --env production
# Enter the full updated comma-separated list

wrangler secret put RBAC_BILLING_EMAILS --env production
# Enter the full updated list

No redeploy required.
Roll a cohort back to the legacy admin
If an operator cohort encounters a workflow blocker in the new admin:
- Send affected operators the legacy admin URL: https://admin-legacy.adventive.com
- Confirm they can log in (JumpCloud LDAP credentials; Duo bypass currently active)
- Create a GitHub issue documenting: the workflow that failed, the operator who reported it, reproduction steps
- Do not disable the new admin — other cohorts may still be using it
- Fix the issue in a hotfix branch, test in staging, re-notify the cohort once resolved
No data migration is required in either direction — both admins share the same database.
Incident playbooks
Symptom: Admin UI shows blank page or 404 for all routes
Likely cause: Pages SPA routing fallback missing, or Pages deploy failed.
Diagnosis:
# Check Pages deployment status
wrangler pages deployment list --project-name adventive-admin-ui | head -5

# Check the _redirects file is present in the dist output
# Should contain: /* /index.html 200
Fix:
- If _redirects missing: add public/_redirects with content /* /index.html 200 and redeploy
- If Pages deploy failed: check GitHub Actions workflow for build error, fix, and push again
Verification: Navigate to https://admin.adventive.com/customers — should render the customers list, not a 404.
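The _redirects check from the diagnosis step can be scripted so it runs the same way every time. A minimal sketch, assuming the build output keeps the file at dist/_redirects; `has_spa_fallback` is a hypothetical helper, and the demo uses a temporary file in place of the real one:

```shell
#!/bin/sh
# has_spa_fallback FILE
# Succeeds if FILE contains the catch-all rule "/* /index.html 200"
# (fields may be separated by any run of spaces or tabs).
has_spa_fallback() {
  [ -f "$1" ] && grep -Eq '^/\*[[:space:]]+/index\.html[[:space:]]+200[[:space:]]*$' "$1"
}

# Demo against a temporary file standing in for dist/_redirects
tmp=$(mktemp)
printf '/* /index.html 200\n' > "$tmp"
if has_spa_fallback "$tmp"; then
  echo "fallback present"
else
  echo "fallback missing"
fi
rm -f "$tmp"
```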
Symptom: All API requests returning 401
Likely cause: CF Access JWT validation failing (wrong AUD tag, expired token, or misconfigured Access policy).
Diagnosis:
# 1. Check the CF_ACCESS_AUD secret exists
wrangler secret list --env production
# Verify CF_ACCESS_AUD is present (the list shows secret names only, not values)

# 2. Verify the AUD matches the Access Application
# Cloudflare dashboard → Access → Applications → Adventive Admin API → Application AUD

# 3. Check the Access audit log for the failing operator
# Zero Trust → Logs → Access → filter by operator email
Fix:
- If AUD mismatch: update the CF_ACCESS_AUD secret to match the Access Application’s AUD tag
- If operator not in Access Group: add their email (see “Add a new operator” procedure)
- If token expired: operator needs to re-authenticate at https://admin.adventive.com; Access handles re-authentication automatically on next visit
Verification: Operator navigates to the admin UI, is redirected to Access login, authenticates, and lands on the dashboard.
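When an AUD mismatch is suspected, it helps to see which aud claim a failing token actually carries. Cloudflare Access forwards the token to the origin in the Cf-Access-Jwt-Assertion request header. The sketch below decodes the payload only and does not verify the signature, so it is a debugging aid, not a validation step. It assumes a base64 binary that accepts -d; `jwt_payload` and the demo token are hypothetical:

```shell
#!/bin/sh
# jwt_payload TOKEN
# Prints the decoded (unverified) payload segment of a JWT.
jwt_payload() {
  # Take the middle segment and convert base64url to standard base64
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url omits
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Demo with a locally built, unsigned token (placeholder AUD value)
body=$(printf '%s' '{"aud":["example-aud-tag"]}' | base64 | tr '/+' '_-' | tr -d '=\n')
jwt_payload "header.$body.signature"
echo
```

Compare the printed aud value with the Application AUD shown in the Cloudflare dashboard.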
Symptom: API requests returning 500 on database operations
Likely cause: Hyperdrive connection failure, MySQL server unreachable, or schema mismatch.
Diagnosis:
# 1. Check real-time Worker logs for the error
wrangler tail adventive-admin-api --env production --format pretty

# 2. Look for: "connect ETIMEDOUT", "Access denied", "Unknown column"
#   - ETIMEDOUT: DB host unreachable or Hyperdrive misconfigured
#   - Access denied: DB credentials wrong or rotated without updating Hyperdrive
#   - Unknown column: schema drift (a schema change was deployed that breaks this Worker)

# 3. Test DB connectivity from a known-good host (EC2 → MySQL)
Fix:
- ETIMEDOUT: Check MySQL server status on EC2; check that the Hyperdrive binding in wrangler.toml references the intended Hyperdrive configuration and that the configuration points to the correct host
- Access denied: Rotate and re-set DB credentials in Hyperdrive configuration (Cloudflare dashboard → Workers & Pages → Hyperdrive → edit binding)
- Schema mismatch: Identify the schema change, either revert it or update the Worker query to handle both old and new schema (schema-freeze policy must not be violated during transition)
Verification: curl -s https://admin-api.adventive.com/customers -H "Authorization: Bearer [valid-operator-token]" | jq .total should return a number (not an error).
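The three log signatures in the diagnosis map directly onto the three fixes, so triage can be reduced to a lookup. A minimal sketch of that mapping; `classify_db_error` is a hypothetical helper, and the sample log lines are illustrative, not captured from production:

```shell
#!/bin/sh
# classify_db_error LOG_LINE
# Maps a Worker log line to the likely cause from the diagnosis list.
classify_db_error() {
  case $1 in
    *ETIMEDOUT*)        echo "unreachable: check MySQL host and Hyperdrive configuration" ;;
    *"Access denied"*)  echo "credentials: update DB credentials in Hyperdrive" ;;
    *"Unknown column"*) echo "schema-drift: revert the schema change or update the Worker query" ;;
    *)                  echo "unknown: inspect the full log entry" ;;
  esac
}

# Illustrative log lines (the host and column name are placeholders)
classify_db_error "Error: connect ETIMEDOUT 10.0.0.5:3306"
classify_db_error "ER_BAD_FIELD_ERROR: Unknown column 'plan_tier' in 'field list'"
```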
Symptom: Stripe-dependent endpoints (invoices, billing) returning errors
Likely cause: Stripe API key expired or revoked, Stripe API down, or rate limit hit.
Diagnosis:
# 1. Check Worker logs for Stripe error codes
wrangler tail adventive-admin-api --env production --format pretty
# Look for: Stripe error types (e.g., "authentication_error", "rate_limit_error") or HTTP 401/429 responses from Stripe

# 2. Verify key is valid
# Stripe dashboard → Developers → API keys → confirm restricted key for admin-api is active

# 3. Check Stripe status
# https://status.stripe.com
Fix:
- Authentication failure: rotate STRIPE_SECRET_KEY (see “Rotate a Worker secret” procedure)
- Rate limit: Stripe rate limits are very generous for invoice reads; if hit, investigate for a loop or runaway client. Implement request-level caching if needed.
- Stripe outage: no fix on our side; surface the error to operators with a clear message and fall back to a read-only view if possible
Symptom: Operator reports “Access Denied” when navigating to a route they should have access to
Likely cause: RBAC tier misconfigured: operator’s email not in the correct Wrangler secret list.
Diagnosis:
# 1. Check which email is in the Access JWT by inspecting the request log
wrangler tail adventive-admin-api --env production --format pretty
# Look for the request from the operator and the email extracted from the JWT

# 2. Compare against current RBAC secrets
wrangler secret list --env production
# The list shows secret names only, not values: confirm RBAC_SUPER_ADMIN_EMAILS and RBAC_BILLING_EMAILS exist,
# then check the operator's email against the team's record of each list
Fix:
# Update the relevant RBAC secret to include the operator's email
wrangler secret put RBAC_BILLING_EMAILS --env production
# Enter updated comma-separated list

No redeploy needed. Verify: operator refreshes the admin UI and retries the route.
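A frequent reason for a "missing" email is a case or whitespace difference between the address in the JWT and the entry in the list. The sketch below checks a recorded list for both an exact match and a match after normalization. Whether the Worker compares addresses exactly is an assumption this runbook does not settle, so a "normalized-only" result is a prompt to clean up the list entry. `rbac_match`, `normalize`, and the sample addresses are hypothetical:

```shell
#!/bin/sh
# normalize: lowercase and trim each line read from stdin
normalize() {
  tr '[:upper:]' '[:lower:]' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# rbac_match LIST EMAIL
# Prints "exact", "normalized-only", or "absent".
rbac_match() {
  entries=$(printf '%s\n' "$1" | tr ',' '\n')
  if printf '%s\n' "$entries" | grep -Fxq -- "$2"; then
    echo "exact"
  elif printf '%s\n' "$entries" | normalize \
      | grep -Fxq -- "$(printf '%s\n' "$2" | normalize)"; then
    echo "normalized-only"
  else
    echo "absent"
  fi
}

# Example: the stored entry has a leading space and different letter case
rbac_match "alice@example.com, Bob@Example.com" "bob@example.com"
```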
Symptom: Workers Analytics Engine shows no data (blank dashboard)
Likely cause: Analytics binding misconfigured in wrangler.toml, or writeDataPoint calls not executing.
Diagnosis:
# Check that the ANALYTICS binding is defined in wrangler.toml
grep -A3 "analytics_engine_datasets" admin-api/wrangler.toml

# Check if Worker is emitting events by tailing and triggering a request
wrangler tail adventive-admin-api --env production --format pretty
Fix: If binding is missing or misconfigured, add it to wrangler.toml and redeploy. Analytics Engine data appears with a ~1-minute delay; wait before concluding there’s an issue.
Symptom: Cloudflare Pages build failing
Likely cause: TypeScript errors, missing environment variables, or dependency issues.
Diagnosis:
- Check GitHub Actions workflow run for the build step output
- Common failures: VITE_API_BASE_URL not set in Pages environment, generated API types out of sync with openapi.json
Fix:
- Set missing env var in Cloudflare Pages dashboard → Settings → Environment variables
- Regenerate API types and commit:
cd admin-ui && npm run generate:types && git add src/api/generated.ts && git commit -m "chore: regenerate API types"
Investigate a failed Access login
When an operator reports they cannot authenticate to the admin:
- Check Access audit log: Zero Trust → Logs → Access → filter by operator email and time range
- Common outcomes to look for:
  - ALLOW with a block at the application level → operator email not in Access Group
  - BLOCK with reason “Policy: Email not in list” → same; add email to group
  - BLOCK with reason “MFA required” → operator’s identity provider session does not have MFA; operator needs to re-authenticate with MFA on their IdP account
  - BLOCK with reason “Revoked session” → manually revoked; investigate why, and re-add if legitimate
- If none of the above: Ask operator to clear browser cookies for adventive.cloudflareaccess.com and retry. If still failing, escalate to Cloudflare Support with the CF-Ray ID from the blocked request.
Decommission the CodeIgniter admin (Phase 10)
Perform only after the 30-day freeze window with zero operator rollbacks and zero legacy admin traffic.
Pre-decommission verification:
1. Pull legacy admin access logs from EC2: zero operator sessions in last 14 days
2. Confirm all Cron Trigger Workers are live and delivering scheduled reports (formerly Reporting.php; verify each report type is being generated)
3. Confirm override file S3 → R2 migration is complete
4. Confirm Phase 0 secret rotation is complete (adventive.php has no live credentials)
Shutdown sequence:
1. Disable legacy admin login page (add maintenance notice: "This system has been retired. Use https://admin.adventive.com")
2. Set all legacy admin routes to return 410 Gone
3. Let sit for 7 days and monitor for any automated callers hitting legacy URLs
4. Archive Bitbucket repo: Settings → Archive repository
5. Terminate EC2 instance (coordinate with Patrick for infra sign-off)
6. Remove DNS record for admin-legacy.adventive.com
7. Remove Cloudflare Access application for legacy admin
8. Remove Bitbucket Pipelines deploy pipeline (or disable CI)
9. Update README.md in this planning folder: set Status = Decommissioned, Date = [today]

Known issues & workarounds
| Issue | Workaround | Permanent fix |
|---|---|---|
| Schema-freeze policy blocks column renames | Use dual-write pattern: write to both old and new column names until legacy admin is decommissioned | Decommission legacy admin (Phase 10) |
| PHPExcel (abandoned) still used for month-end Excel export | Legacy admin continues to generate the export until the Cron Trigger Worker replacement is live | Port month-end summary to PhpSpreadsheet or generate CSV via Cron Trigger Worker |
| Duo 2FA currently bypassed on legacy admin | Legacy admin only accessible during parallel-run; Cloudflare Access enforces MFA for new admin | Decommission legacy admin (Phase 10) |
| CF Access directory provider not yet chosen (JumpCloud vs. Google Workspace) | Use email allowlist as interim identity strategy | Resolve identity provider ADR before Phase 1 completes |
| Report_model.php (5,333 lines, 24 bespoke SQL functions) not ported | Partner reports continue to run from legacy Cron jobs during Phases 1–8 | Phase 4: evaluate dedicated analytics Worker or Cloudflare Analytics product |