04 — Runbook¶
On-call quick reference¶
| Role | Name | Contact |
|---|---|---|
| Primary on-call | Jeffrey Lambert | Slack: @jeffrey |
| Secondary on-call | Patrick | Slack: @patrick |
| Escalation (Cloudflare issues) | Cloudflare Support | dash.cloudflare.com → Support |
| Escalation (Stripe issues) | Stripe Support | dashboard.stripe.com → Support |
Dashboards:
- Workers Analytics Engine: Cloudflare dashboard → Workers & Pages → adventive-admin-api → Analytics
- Real-time logs: wrangler tail adventive-admin-api --env production --format pretty
- Cloudflare Access audit log: Cloudflare dashboard → Zero Trust → Logs → Access
Status check (quick):
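A minimal sketch of such a check, assuming the `/health` endpoint used in the Worker rollback procedure is a suitable probe and that a redirect from the Access-protected UI counts as healthy. The function names (`http_code`, `verdict`, `status_check`) are illustrative, not part of the deployed tooling.

```shell
# Assumed probes: the API /health endpoint and the operator-facing admin URL.
API_HEALTH_URL="${API_HEALTH_URL:-https://admin-api.adventive.com/health}"
UI_URL="${UI_URL:-https://admin.adventive.com}"

# Print the HTTP status code for a URL, or 000 if the request fails outright.
http_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" 2>/dev/null || true
}

# Turn a status code into a one-word verdict.
# 3xx counts as "ok" because Access-protected URLs redirect unauthenticated
# callers to the Access login page.
verdict() {
  case "$1" in
    2??|3??) echo "ok" ;;
    000)     echo "unreachable" ;;
    *)       echo "degraded" ;;
  esac
}

# Check both surfaces and print one line per surface.
status_check() {
  for url in "$API_HEALTH_URL" "$UI_URL"; do
    code="$(http_code "$url")"
    printf '%-45s %s (%s)\n' "$url" "$(verdict "$code")" "$code"
  done
}

# Usage: status_check
```

Anything other than `ok` on either line is a cue to open the real-time logs listed above.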
Day-to-day procedures¶
Add a new operator to Cloudflare Access¶
- Open Cloudflare Zero Trust dashboard → Access → Access Groups → Adventive Operators
- Click Edit → add the operator's email to the email list → Save
- If the operator is super-admin tier: update the RBAC_SUPER_ADMIN_EMAILS Wrangler secret (see "Rotate RBAC operator lists"; no separate redeploy is needed)
- If the operator is billing tier: same procedure for RBAC_BILLING_EMAILS
- Send the operator the admin URL and confirm they can authenticate: https://admin.adventive.com
- Verify the first request appears in Workers Analytics Engine (shows operator email in request log)
Revoke operator access¶
- Zero Trust dashboard → Access → Access Groups → Adventive Operators → Edit → remove the operator's email → Save
- Revoke active sessions: Zero Trust → Logs → Access → search operator email → Revoke session
- If the operator had billing or super-admin RBAC: re-enter the relevant Wrangler secret without their email (same commands as in "Rotate RBAC operator lists")
- Confirm: an attempt to access https://admin.adventive.com from the revoked operator's browser should redirect to Access login and then deny
Deploy a hotfix¶
# 1. Make the fix on a hotfix branch, open PR, get review
git checkout -b hotfix/description
# ... make changes, commit ...
git push origin hotfix/description
# 2. After PR approval, merge to staging first
git checkout staging && git merge hotfix/description && git push
# 3. Verify fix on staging
# ... test manually or run relevant Playwright spec ...
# 4. Merge to main and trigger production deploy
git checkout main && git merge staging && git push
# 5. Monitor: watch tail logs and Analytics Engine dashboard for 10 minutes post-deploy
wrangler tail adventive-admin-api --env production --format pretty
For UI-only hotfixes: same flow, but only admin-ui build/deploy runs (no Worker redeploy needed).
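The 10-minute watch in step 5 can be backed by a simple probe loop alongside the tail logs. This is a sketch under assumptions: `watch_probe` and `probe_api` are hypothetical helpers, and `/health` is assumed to return a 2xx status when the Worker is healthy.

```shell
# watch_probe <probe-command> <attempts> <interval-seconds>
# Runs the probe repeatedly and prints the number of failed attempts.
watch_probe() (
  probe="$1"; attempts="$2"; interval="$3"
  failures=0
  i=1
  while [ "$i" -le "$attempts" ]; do
    if ! "$probe" >/dev/null 2>&1; then
      failures=$((failures + 1))
      echo "probe $i/$attempts failed" >&2
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
  echo "$failures"
)

# Hypothetical probe: succeeds only if /health answers with a 2xx status.
probe_api() {
  curl -fsS --max-time 10 https://admin-api.adventive.com/health
}

# Usage (20 probes, 30 s apart, roughly 10 minutes):
#   failed="$(watch_probe probe_api 20 30)"
#   [ "$failed" -eq 0 ] || echo "failures seen: consider rolling back" >&2
```

Passing the probe as a command name keeps the loop reusable for the UI or any other endpoint.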
Roll back a Worker (admin-api)¶
# See available deployments
wrangler deployments list --env production
# Roll back to previous deployment
wrangler rollback --env production
# Or roll back to a specific deployment ID
wrangler rollback [deployment-id] --env production
# Verify
curl -s https://admin-api.adventive.com/health | jq .version
Roll back the UI (admin-ui)¶
Option 1 (dashboard): Cloudflare Pages → adventive-admin-ui → Deployments tab → click the deployment you want → Rollback to this deployment
Option 2 (CLI):
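wrangler pages deployment list --project-name adventive-admin-ui
# Availability of the rollback subcommand depends on the Wrangler version; confirm with
# `wrangler pages deployment --help` and use Option 1 if it is not listed
wrangler pages deployment rollback [deployment-id] --project-name adventive-admin-ui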
Rotate a Worker secret¶
# Example: rotating Stripe key after a security incident
wrangler secret put STRIPE_SECRET_KEY --env production
# Enter the new key value when prompted; takes effect on next request
wrangler secret put STRIPE_SECRET_KEY --env staging
No separate Worker redeploy is required: wrangler secret put publishes a new version of the Worker with the updated secret. Confirm the new key works:
# Hit a Stripe-dependent endpoint (e.g., invoice list)
curl -s https://admin-api.adventive.com/invoices \
-H "Authorization: Bearer [valid-operator-token]" | jq .
Rotate RBAC operator lists¶
If operators are added or removed without a corresponding Access Group change (e.g., role change within existing operators):
wrangler secret put RBAC_SUPER_ADMIN_EMAILS --env production
# Enter the full updated comma-separated list
wrangler secret put RBAC_BILLING_EMAILS --env production
# Enter the full updated list
No redeploy required.
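Because the full comma-separated list has to be re-entered each time, building it with a small helper avoids dropped or duplicated entries. The sketch below assumes the current list is kept somewhere readable (secret values cannot be read back from Wrangler); `list_emails`, `add_email`, and `remove_email` are hypothetical helper names.

```shell
# list_emails <comma-separated-list>: print one trimmed, non-empty email per line.
list_emails() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# add_email <list> <email>: print the list with the email appended, without duplicates.
add_email() {
  { list_emails "$1"; list_emails "$2"; } | awk '!seen[$0]++' | paste -sd, -
}

# remove_email <list> <email>: print the list without the email (exact match only).
remove_email() (
  target="$(list_emails "$2")"
  list_emails "$1" | grep -vxF -e "$target" | paste -sd, -
)

# Usage (CURRENT_LIST comes from your own record of the secret):
#   new_list="$(add_email "$CURRENT_LIST" "new.operator@example.com")"
#   printf '%s' "$new_list" | wrangler secret put RBAC_BILLING_EMAILS --env production
```

Piping the value avoids retyping the list at the prompt; review `new_list` before sending it.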
Roll a cohort back to the legacy admin¶
If an operator cohort encounters a workflow blocker in the new admin:
- Send affected operators the legacy admin URL: https://admin-legacy.adventive.com
- Confirm they can log in (JumpCloud LDAP credentials; Duo bypass currently active)
- Create a GitHub issue documenting: the workflow that failed, the operator who reported it, reproduction steps
- Do not disable the new admin — other cohorts may still be using it
- Fix the issue in a hotfix branch, test in staging, re-notify the cohort once resolved
No data migration is required in either direction — both admins share the same database.
Incident playbooks¶
Symptom: Admin UI shows blank page or 404 for all routes¶
Likely cause: Pages SPA routing fallback missing, or Pages deploy failed.
Diagnosis:
# Check Pages deployment status
wrangler pages deployment list --project-name adventive-admin-ui | head -5
# Check the _redirects file is present in the dist output
# Should contain: /* /index.html 200
Fix:
- If _redirects missing: add public/_redirects with content /* /index.html 200 and redeploy
- If Pages deploy failed: check GitHub Actions workflow for build error, fix, and push again
Verification: Navigate to https://admin.adventive.com/customers — should render the customers list, not a 404.
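The `_redirects` check can be scripted so it runs before a deploy rather than after a blank page is reported. A minimal sketch, assuming the build output lands in `admin-ui/dist` (the exact path is an assumption); `has_spa_fallback` is a hypothetical helper.

```shell
# has_spa_fallback <dist-dir>
# Succeeds if <dist-dir>/_redirects contains the catch-all rule "/* /index.html 200".
has_spa_fallback() {
  [ -f "$1/_redirects" ] &&
    grep -Eq '^[[:space:]]*/\*[[:space:]]+/index\.html[[:space:]]+200[[:space:]]*$' "$1/_redirects"
}

# Usage, after building the UI:
#   has_spa_fallback admin-ui/dist || echo "SPA fallback rule missing from _redirects" >&2
```

The pattern tolerates extra whitespace but requires the exact catch-all source, target, and status code.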
Symptom: All API requests returning 401¶
Likely cause: CF Access JWT validation failing — wrong AUD tag, expired token, or Access policy misconfigured.
Diagnosis:
# 1. Check the CF_ACCESS_AUD secret exists (secret list shows names only, not values)
wrangler secret list --env production
# Verify CF_ACCESS_AUD is present
# 2. Verify the AUD matches the Access Application
# Cloudflare dashboard → Access → Applications → Adventive Admin API → Application AUD
# 3. Check the Access audit log for the failing operator
# Zero Trust → Logs → Access → filter by operator email
Fix:
- If AUD mismatch: update the CF_ACCESS_AUD secret to match the Access Application's AUD tag
- If operator not in Access Group: add their email (see "Add a new operator" procedure)
- If token expired: operator needs to re-authenticate at https://admin.adventive.com — Access handles re-authentication automatically on next visit
Verification: Operator navigates to the admin UI, is redirected to Access login, authenticates, and lands on the dashboard.
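When an AUD mismatch is suspected, it can help to look at the `aud` claim in the token the operator's browser actually holds. The sketch below decodes a JWT payload without verifying its signature, so it is for diagnosis only. It assumes the token is copied from the `CF_Authorization` cookie and that `base64 -d` is available (older macOS versions use `-D`); `jwt_payload` is a hypothetical helper.

```shell
# jwt_payload <jwt>: print the decoded (unverified) payload JSON of a JWT.
jwt_payload() (
  # The payload is the second dot-separated segment, base64url-encoded.
  seg="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # base64url omits padding; restore it so base64 can decode the segment.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
)

# Usage (TOKEN holds the cookie value; treat it as a credential):
#   jwt_payload "$TOKEN" | jq '.aud, .email, .exp'
```

The `aud` value printed should contain the Application AUD tag shown in the dashboard; if it does not, the CF_ACCESS_AUD secret or the Access Application is the one to correct.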
Symptom: API requests returning 500 on database operations¶
Likely cause: Hyperdrive connection failure, MySQL server unreachable, or schema mismatch.
Diagnosis:
# 1. Check real-time Worker logs for the error
wrangler tail adventive-admin-api --env production --format pretty
# 2. Look for: "connect ETIMEDOUT", "Access denied", "Unknown column"
# - ETIMEDOUT: DB host unreachable or Hyperdrive misconfigured
# - Access denied: DB credentials wrong or rotated without updating Hyperdrive
# - Unknown column: schema drift — a schema change was deployed that breaks this Worker
# 3. Test DB connectivity from a known-good host (EC2 → MySQL)
Fix:
- ETIMEDOUT: Check MySQL server status on EC2; check that the Hyperdrive binding in wrangler.toml points to the correct host
- Access denied: Rotate and re-set DB credentials in the Hyperdrive configuration (Cloudflare dashboard → Workers & Pages → Hyperdrive → edit binding)
- Schema mismatch: Identify the schema change, then either revert it or update the Worker query to handle both old and new schema (the schema-freeze policy must not be violated during transition)
Verification: curl -s https://admin-api.adventive.com/customers | jq .total should return a number (not an error).
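The three log signatures from the diagnosis step map to causes mechanically, so a small filter can label them while tailing. This is a sketch only: `classify_db_error` is a hypothetical helper, and the exact wording of the Worker's log lines is assumed to contain the strings listed in the diagnosis.

```shell
# classify_db_error <log-line>: print a short label for the likely cause.
classify_db_error() {
  case "$1" in
    *ETIMEDOUT*)        echo "unreachable" ;;   # DB host down or Hyperdrive misconfigured
    *"Access denied"*)  echo "credentials" ;;   # DB credentials wrong or rotated
    *"Unknown column"*) echo "schema-drift" ;;  # schema change broke a Worker query
    *)                  echo "unknown" ;;
  esac
}

# Usage: label matching lines while tailing production logs
#   wrangler tail adventive-admin-api --env production --format pretty |
#     while IFS= read -r line; do
#       label="$(classify_db_error "$line")"
#       [ "$label" = "unknown" ] || printf '%s\t%s\n' "$label" "$line"
#     done
```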
Symptom: Stripe-dependent endpoints (invoices, billing) returning errors¶
Likely cause: Stripe API key expired or revoked, Stripe API down, or rate limit hit.
Diagnosis:
# 1. Check Worker logs for Stripe error codes
wrangler tail adventive-admin-api --env production --format pretty
# Look for: stripe error codes (e.g., "authentication_failed", "rate_limit_error")
# 2. Verify key is valid
# Stripe dashboard → Developers → API keys → confirm restricted key for admin-api is active
# 3. Check Stripe status
# https://status.stripe.com
Fix:
- Authentication failure: rotate STRIPE_SECRET_KEY (see "Rotate a Worker secret" procedure)
- Rate limit: Stripe rate limits are very generous for invoice reads; if hit, investigate for a loop or runaway client. Implement request-level caching if needed.
- Stripe outage: no fix — surface error to operators with a clear message; fall back to read-only view if possible
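To separate the three causes quickly, a single authenticated read against the Stripe API is usually enough, since the HTTP status alone distinguishes them. A minimal sketch, assuming the key under test is exported as `STRIPE_SECRET_KEY` in the local shell and is allowed to list invoices; `stripe_probe_code` and `classify_stripe_code` are hypothetical helpers.

```shell
# stripe_probe_code: print the HTTP status of a minimal invoice read (000 on network failure).
stripe_probe_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -u "${STRIPE_SECRET_KEY}:" 'https://api.stripe.com/v1/invoices?limit=1' || true
}

# classify_stripe_code <http-status>: map the status to the likely cause.
classify_stripe_code() {
  case "$1" in
    200) echo "ok" ;;
    401) echo "auth" ;;          # key invalid, expired, or revoked: rotate it
    403) echo "permission" ;;    # restricted key lacks access to this resource
    429) echo "rate-limit" ;;    # look for a loop or runaway client
    5??) echo "stripe-side" ;;   # check the Stripe status page
    000) echo "network" ;;
    *)   echo "other" ;;
  esac
}

# Usage:
#   classify_stripe_code "$(stripe_probe_code)"
```

Run the probe from a trusted machine only, and unset the variable afterwards.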
Symptom: Operator reports "Access Denied" when navigating to a route they should have access to¶
Likely cause: RBAC tier misconfigured — operator's email not in the correct Wrangler secret list.
Diagnosis:
# 1. Check which email is in the Access JWT by inspecting the request log
wrangler tail adventive-admin-api --env production --format pretty
# Look for the request from the operator and the email extracted from the JWT
# 2. Confirm the RBAC secrets exist (secret list shows names only; values cannot be read back)
wrangler secret list --env production
# Compare the operator's email against your record of RBAC_SUPER_ADMIN_EMAILS or RBAC_BILLING_EMAILS
Fix:
# Update the relevant RBAC secret to include the operator's email
wrangler secret put RBAC_BILLING_EMAILS --env production
# Enter updated comma-separated list
No redeploy needed. Verify: operator refreshes the admin UI and retries the route.
Symptom: Workers Analytics Engine shows no data (blank dashboard)¶
Likely cause: Analytics binding misconfigured in wrangler.toml, or writeDataPoint calls not executing.
Diagnosis:
# Check that the ANALYTICS binding is defined in wrangler.toml
grep -A3 "analytics_engine_datasets" admin-api/wrangler.toml
# Check if Worker is emitting events by tailing and triggering a request
wrangler tail adventive-admin-api --env production --format pretty
Fix: If binding is missing or misconfigured, add it to wrangler.toml and redeploy. Analytics Engine data appears with a ~1-minute delay; wait before concluding there's an issue.
Symptom: Cloudflare Pages build failing¶
Likely cause: TypeScript errors, missing environment variables, or dependency issues.
Diagnosis:
- Check GitHub Actions workflow run for the build step output
- Common failures: VITE_API_BASE_URL not set in Pages environment, generated API types out of sync with openapi.json
Fix:
- Set missing env var in Cloudflare Pages dashboard → Settings → Environment variables
- Regenerate API types and commit: cd admin-ui && npm run generate:types && git add src/api/generated.ts && git commit -m "chore: regenerate API types"
Investigate a failed Access login¶
When an operator reports they cannot authenticate to the admin:
- Check Access audit log: Zero Trust → Logs → Access → filter by operator email and time range
- Common outcomes to look for:
  - ALLOW with a block at the application level → operator email not in Access Group
  - BLOCK with reason "Policy: Email not in list" → same; add email to group
  - BLOCK with reason "MFA required" → operator's identity provider session does not have MFA; operator needs to re-authenticate with MFA on their IdP account
  - BLOCK with reason "Revoked session" → manually revoked; investigate why, and re-add if legitimate
- If none of the above: Ask operator to clear browser cookies for adventive.cloudflareaccess.com and retry. If still failing, escalate to Cloudflare Support with the CF-Ray ID from the blocked request.
Decommission the CodeIgniter admin (Phase 10)¶
Perform only after the 30-day freeze window with zero operator rollbacks and zero legacy admin traffic.
Pre-decommission verification:
1. Pull legacy admin access logs from EC2: zero operator sessions in last 14 days
2. Confirm all Cron Trigger Workers are live and delivering scheduled reports
(formerly Reporting.php — verify each report type is being generated)
3. Confirm override file S3 → R2 migration is complete
4. Confirm Phase 0 secret rotation is complete (adventive.php has no live credentials)
Shutdown sequence:
1. Disable legacy admin login page (add maintenance notice: "This system has been retired.
Use https://admin.adventive.com")
2. Set all legacy admin routes to return 410 Gone
3. Let sit for 7 days — monitor for any automated callers hitting legacy URLs
4. Archive Bitbucket repo: Settings → Archive repository
5. Terminate EC2 instance (coordinate with Patrick for infra sign-off)
6. Remove DNS record for admin-legacy.adventive.com
7. Remove Cloudflare Access application for legacy admin
8. Remove Bitbucket Pipelines deploy pipeline (or disable CI)
9. Update README.md in this planning folder: set Status = Decommissioned, Date = [today]
Known issues & workarounds¶
| Issue | Workaround | Permanent fix |
|---|---|---|
| Schema-freeze policy blocks column renames | Use dual-write pattern: write to both old and new column names until legacy admin is decommissioned | Decommission legacy admin (Phase 10) |
| PHPExcel (abandoned) still used for month-end Excel export | Legacy admin continues to generate the export until the Cron Trigger Worker replacement is live | Port month-end summary to PhpSpreadsheet or generate CSV via Cron Trigger Worker |
| Duo 2FA currently bypassed on legacy admin | Legacy admin only accessible during parallel-run; Cloudflare Access enforces MFA for new admin | Decommission legacy admin (Phase 10) |
| CF Access directory provider not yet chosen (JumpCloud vs. Google Workspace) | Use email allowlist as interim identity strategy | Resolve identity provider ADR before Phase 1 completes |
| Report_model.php (5,333 lines, 24 bespoke SQL functions) not ported | Partner reports continue to run from legacy Cron jobs during Phases 1–8 | Phase 4: evaluate dedicated analytics Worker or Cloudflare Analytics product |