04 - Runbook
On-call quick reference
| Role | Name | Contact |
|---|---|---|
| Primary on-call | Jeffrey Lambert | Slack: @jeffrey |
| Secondary on-call | Patrick | Slack: @patrick |
| Escalation (Stripe issues) | Stripe Support | dashboard.stripe.com → Support |
| Escalation (Mailgun issues) | Mailgun Support | app.mailgun.com → Support |
| Escalation (Cloudflare issues) | Cloudflare Support | dash.cloudflare.com → Support |
Quick checks:
# Worker health
curl -s https://billing-worker.adventive.com/health | jq .
# Real-time logs
wrangler tail adventive-billing-worker --env production --format pretty
# Stripe webhook delivery status
# Stripe Dashboard → Developers → Webhooks → endpoint → Event deliveries
Day-to-day procedures
Monthly reconciliation (after each billing cycle)
The daily Cron Trigger Worker runs reconciliation automatically at 06:00 UTC. After a billing cycle closes (typically month-end), run a manual verification:
# Trigger reconciliation manually (admin endpoint — CF Access protected)
curl -s https://billing-worker.adventive.com/admin/reconcile \
-H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
-H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" | jq .
# Expected: {"status":"clean","accounts_checked":N,"drift_count":0}
# If drift_count > 0: see the "Symptom: Reconciliation drift detected" incident playbook
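The expected-output check above can be scripted so that a non-zero drift fails loudly. This is a minimal sketch, assuming the response shape shown in the expected-output comment; `check_drift` is a hypothetical helper name, not part of the billing Worker.

```shell
# check_drift: hypothetical helper. Takes the drift_count value from the
# reconcile response and returns non-zero when drift is present.
check_drift() {
  _cd_count="$1"
  # Reject anything that is not a non-negative integer (e.g. "null" from jq).
  case "$_cd_count" in
    ''|*[!0-9]*)
      echo "unexpected drift_count: '$_cd_count'" >&2
      return 2
      ;;
  esac
  if [ "$_cd_count" -eq 0 ]; then
    echo "clean"
  else
    echo "drift detected: drift_count=$_cd_count" >&2
    return 1
  fi
}

# Usage (not run here; needs the CF Access service token):
#   count="$(curl -s https://billing-worker.adventive.com/admin/reconcile \
#     -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
#     -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" | jq -r .drift_count)"
#   check_drift "$count"
```

Exit code 2 is kept separate from 1 so a malformed or unauthorized response is not mistaken for real drift.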
Check Mailgun delivery for the billing cycle:
- Mailgun Dashboard → Logs → filter by billing-noreply@notify.adventive.com and date range
- All invoiced accounts should show a delivered event for invoice_new or invoice_notify
Deploy a hotfix
# 1. Create hotfix branch from main
git checkout main && git pull
git checkout -b hotfix/description
# 2. Make the fix, commit, push
git commit -m "fix(billing): ..."
git push origin hotfix/description
# 3. Open PR → staging first
git checkout staging && git merge hotfix/description && git push
# GitHub Actions deploys to staging automatically
# 4. Verify fix on staging (use Stripe test mode + Stripe CLI to trigger events)
stripe trigger invoice.finalized --override invoice:customer=cus_test123
# 5. Merge to main → production deploy
git checkout main && git merge staging && git push
# GitHub Actions deploys to production
# 6. Monitor: tail logs for 10 minutes post-deploy
wrangler tail adventive-billing-worker --env production --format pretty
Roll back the billing Worker
# See available deployments
wrangler deployments list --env production
# Roll back to previous deployment
wrangler rollback --env production
# Or roll back to a specific deployment
wrangler rollback [deployment-id] --env production
# Verify
curl -s https://billing-worker.adventive.com/health | jq .version
After rollback, Stripe will continue delivering webhook events. Unprocessed events from the failed deployment will be retried by Stripe automatically for up to 3 days.
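To decide whether a failed event is still inside the retry window or needs manual redelivery, a small timestamp check can help. This is a sketch under the assumption that the window is measured from the event's `created` timestamp; `within_retry_window` is a hypothetical helper name.

```shell
# Seconds in the 3-day retry window stated above.
STRIPE_RETRY_WINDOW_SECS=$((3 * 24 * 60 * 60))

# within_retry_window: hypothetical helper.
# $1 = event "created" timestamp (Unix seconds)
# $2 = current time (Unix seconds); defaults to now
# Returns 0 while automatic retries can still be expected.
within_retry_window() {
  _rw_created="$1"
  _rw_now="${2:-$(date +%s)}"
  [ $((_rw_now - _rw_created)) -lt "$STRIPE_RETRY_WINDOW_SECS" ]
}

# Usage (not run here):
#   created="$(stripe events retrieve evt_xxx | jq -r .created)"
#   within_retry_window "$created" || echo "outside window: redeliver manually"
```

Events that fall outside the window will not arrive on their own and have to be redelivered from the Stripe Dashboard.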
Rotate a Worker secret
# Example: rotating Stripe key
wrangler secret put STRIPE_SECRET_KEY --env production
# Enter new value when prompted
wrangler secret put STRIPE_SECRET_KEY --env staging
# Verify by triggering a Stripe-dependent endpoint
curl -s https://billing-worker.adventive.com/health | jq .stripe_connected
No separate Worker deploy is required: wrangler secret put creates and deploys a new Worker version with the updated value.
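When the same secret has to be rotated in more than one environment, a thin wrapper keeps the commands consistent. The sketch below assumes the two environment names used in this runbook; `rotate_secret` and the `DRY_RUN` variable are hypothetical.

```shell
# rotate_secret: hypothetical wrapper around `wrangler secret put`.
# $1 = secret name; remaining args = environments (default: staging production).
# With DRY_RUN=1 it only prints the commands it would run.
rotate_secret() {
  [ -n "${1:-}" ] || { echo "usage: rotate_secret <NAME> [env...]" >&2; return 1; }
  _rs_name="$1"
  shift
  [ "$#" -gt 0 ] || set -- staging production
  for _rs_env in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "wrangler secret put $_rs_name --env $_rs_env"
    else
      # wrangler prompts for the value, so it never lands in shell history.
      wrangler secret put "$_rs_name" --env "$_rs_env" || return 1
    fi
  done
}

# Usage (not run here):
#   DRY_RUN=1 rotate_secret STRIPE_SECRET_KEY   # preview
#   rotate_secret STRIPE_SECRET_KEY             # rotate staging, then production
```

Staging goes first by default so a bad value is caught before production is touched; pass the environments explicitly to change the order.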
Rotate the Stripe webhook signing secret
If the Stripe webhook endpoint is deleted and re-registered, or if the signing secret is rotated via Stripe Dashboard:
- Stripe Dashboard → Developers → Webhooks → click endpoint → Reveal signing secret
- Copy the new secret
- Update the Wrangler secret: wrangler secret put STRIPE_WEBHOOK_SECRET --env production
- Verify: trigger a test event from Stripe Dashboard → endpoint should return 200
Resend an invoice email
If an operator reports an invoice email was not received:
# Admin resend endpoint (CF Access protected)
curl -s -X POST https://billing-worker.adventive.com/admin/resend-invoice \
-H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
-H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" \
-H "Content-Type: application/json" \
-d '{"invoice_id": "in_1THM9xCSGXf35AB6gDkwmE5y"}' | jq .
This re-fetches the invoice from Stripe, re-renders the PDF, and re-sends via Mailgun. It does NOT create a new Stripe invoice event.
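Because the resend endpoint takes a hand-typed invoice ID, it can be worth validating the ID before building the request body. This is a minimal sketch; `resend_payload` is a hypothetical helper, and the only assumption about the ID format is the `in_` prefix visible in the example above.

```shell
# resend_payload: hypothetical helper that checks the invoice ID shape and
# prints the JSON body used by /admin/resend-invoice.
resend_payload() {
  case "${1:-}" in
    in_|in_*[!A-Za-z0-9_]*)
      # Bare prefix, or characters that would need JSON escaping.
      echo "not a Stripe invoice ID: '${1:-}'" >&2
      return 1
      ;;
    in_*)
      printf '{"invoice_id": "%s"}\n' "$1"
      ;;
    *)
      echo "not a Stripe invoice ID: '${1:-}'" >&2
      return 1
      ;;
  esac
}

# Usage (not run here):
#   body="$(resend_payload in_xxx)" && curl -s -X POST \
#     https://billing-worker.adventive.com/admin/resend-invoice \
#     -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
#     -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" \
#     -H "Content-Type: application/json" -d "$body" | jq .
```

Rejecting anything outside letters, digits, and underscores means the value can be interpolated into the JSON body without escaping.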
Issue a refund
# Via Stripe Dashboard: Dashboard → Payments → find charge → Refund
# Stripe issues the refund; billing Worker does not need to act
# Acodei picks up the refund event and creates a QuickBooks credit note
# For a partial refund via Stripe CLI:
stripe refunds create --charge ch_xxx --amount 5000 # amount in cents
# Confirm: check Acodei sync within 15 minutes of Stripe refund
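Since the refund amount is given in cents, a typed dollar figure has to be converted first. The sketch below does this with string handling only, to avoid floating-point rounding; `to_cents` is a hypothetical helper and assumes a two-decimal currency such as USD.

```shell
# to_cents: hypothetical helper. Converts "50", "50.5" or "50.00" to integer
# cents without floating-point arithmetic.
to_cents() {
  case "${1:-}" in
    ''|*[!0-9.]*|*.*.*|.*|*.)
      echo "invalid amount: '${1:-}'" >&2
      return 1
      ;;
  esac
  _tc_dollars="${1%%.*}"
  case "$1" in
    *.*) _tc_frac="${1#*.}" ;;
    *)   _tc_frac="" ;;
  esac
  # Refuse sub-cent precision rather than rounding silently.
  if [ "${#_tc_frac}" -gt 2 ]; then
    echo "too many decimal places: '$1'" >&2
    return 1
  fi
  # Right-pad the fraction and keep its first two digits: "5" -> "50", "" -> "00".
  _tc_frac="${_tc_frac}00"
  _tc_frac="${_tc_frac%"${_tc_frac#??}"}"
  _tc_cents="${_tc_dollars}${_tc_frac}"
  # Strip leading zeros so the result is a plain decimal integer.
  while [ "${#_tc_cents}" -gt 1 ] && [ "${_tc_cents#0}" != "$_tc_cents" ]; do
    _tc_cents="${_tc_cents#0}"
  done
  echo "$_tc_cents"
}

# Usage (not run here):
#   stripe refunds create --charge ch_xxx --amount "$(to_cents 50.00)"
```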
Handle a dispute
- Stripe Dashboard → Disputes → find the dispute
- Review the evidence Stripe requests (invoice PDF, delivery confirmation, account records)
- Upload evidence via Stripe Dashboard — deadline shown on dispute detail page
- Invoice PDF available in R2: r2://adventive-invoices/{acct_id}/{inv_uuid}.pdf
- Mailgun delivery log: Dashboard → Logs → search by recipient email for delivery confirmation
- After dispute closes: Stripe resolves automatically; Acodei picks up the outcome event
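For dispute evidence, the invoice PDF can be pulled from R2 on the command line instead of the dashboard. This sketch only builds the object path from the layout given in the list above; `invoice_r2_path` is a hypothetical helper, and the download command in the usage comment assumes an authenticated wrangler session.

```shell
# Bucket name taken from the R2 path in this runbook.
INVOICE_BUCKET="adventive-invoices"

# invoice_r2_path: hypothetical helper. Prints "<bucket>/<acct_id>/<inv_uuid>.pdf",
# the object path form that `wrangler r2 object get` expects.
invoice_r2_path() {
  _ip_acct="${1:-}"
  _ip_inv="${2:-}"
  if [ -z "$_ip_acct" ] || [ -z "$_ip_inv" ]; then
    echo "usage: invoice_r2_path <acct_id> <inv_uuid>" >&2
    return 1
  fi
  printf '%s/%s/%s.pdf\n' "$INVOICE_BUCKET" "$_ip_acct" "$_ip_inv"
}

# Usage (not run here; recent wrangler versions may also need --remote):
#   wrangler r2 object get "$(invoice_r2_path <acct_id> <inv_uuid>)" \
#     --file dispute-evidence.pdf
```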
Update a customer's payment method
- Direct the customer to the Stripe Customer Portal (if B.9 is enabled): https://billing.stripe.com/p/login/{portal_token}
- If Customer Portal is not yet enabled: use Stripe Dashboard → Customers → find customer → Payment methods → Add payment method → share Stripe Checkout setup link with customer
Investigate a QuickBooks / Acodei drift
If an operator reports that QuickBooks totals don't match Stripe:
# Check the reconciliation endpoint output
curl -s https://billing-worker.adventive.com/admin/reconcile \
-H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
-H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" | jq .
# Check Acodei dashboard for sync errors:
# Acodei → Connected accounts → Adventive → Event log → filter by error
Common causes: Stripe event not yet picked up by Acodei (5–15 min delay), Acodei auth token expired (re-connect in Acodei settings), QuickBooks API rate limit (resolves automatically).
Incident playbooks
Symptom: Stripe webhook not delivered (endpoint failing)
Likely cause: Worker returning non-2xx, or Worker throwing an unhandled exception.
Diagnosis:
# 1. Check Stripe webhook delivery log
# Stripe Dashboard → Developers → Webhooks → endpoint → Event deliveries
# Look for: 5xx responses, timeouts, or connection refused
# 2. Check Worker logs for the error
wrangler tail adventive-billing-worker --env production --format pretty
# 3. Check Worker error rate in Cloudflare Dashboard
# Workers & Pages → adventive-billing-worker → Metrics → Errors
Fix:
- If Worker throwing: fix the exception in code; hotfix deploy
- If Worker returning 400 (signature mismatch): verify STRIPE_WEBHOOK_SECRET matches the endpoint's current signing secret (see "Rotate the Stripe webhook signing secret")
- If Worker unavailable: likely a Cloudflare infrastructure issue; check www.cloudflarestatus.com. Stripe will retry for up to 3 days
Verification: Stripe Dashboard → Event deliveries → redeliver the failed event → should show 200 response.
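The fix list above maps fairly directly onto the HTTP status shown in Stripe's delivery log. The sketch below encodes that mapping as a lookup; `webhook_triage` is a hypothetical helper, and it assumes unhandled Worker exceptions surface as 5xx responses, as the diagnosis step suggests.

```shell
# webhook_triage: hypothetical helper. Maps the status of a failed delivery
# attempt to the matching fix from this playbook.
# $1 = HTTP status code, or "timeout" / empty when no response was received.
webhook_triage() {
  case "${1:-}" in
    2??)
      echo "ok: delivered"
      ;;
    400)
      echo "signature: compare STRIPE_WEBHOOK_SECRET with the endpoint's signing secret"
      ;;
    5??)
      echo "worker-error: find the exception in wrangler tail, then hotfix"
      ;;
    ''|timeout)
      echo "unreachable: check Cloudflare status; Stripe retries for up to 3 days"
      ;;
    *)
      echo "unknown: inspect the delivery attempt in the Stripe Dashboard"
      ;;
  esac
}

# Usage:
#   webhook_triage 400
```

Anything the playbook does not cover falls through to the "unknown" branch rather than being guessed at.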
Symptom: Invoice email not delivered
Likely cause: Mailgun API failure, Mailgun API key expired, or PDF render failure (email suppressed when PDF is missing).
Diagnosis:
# 1. Check Worker logs around the time of invoice.finalized event
wrangler tail adventive-billing-worker --env production --format pretty \
--search "invoice.finalized"
# 2. Check Mailgun delivery log
# Mailgun Dashboard → Logs → search by recipient email or message-id
# 3. Check if PDF exists in R2
# Cloudflare Dashboard → R2 → adventive-invoices → browse for {acct_id}/{inv_uuid}.pdf
Fix:
- If Mailgun API key expired: rotate MAILGUN_API_KEY (see "Rotate a Worker secret"); resend via /admin/resend-invoice
- If PDF render failed: see "Symptom: PDF render failed"
- If Mailgun shows bounced or failed: check recipient email address is correct in Stripe Customer record; update if needed
- If Mailgun shows suppressed: customer previously unsubscribed — contact them via alternative channel; do not send via Mailgun until they re-subscribe
Verification: Resend via /admin/resend-invoice; confirm Mailgun shows delivered for the message.
Symptom: PDF render failed
Likely cause: Cloudflare Browser Rendering timeout, HTML template error, or Browser Rendering concurrency limit hit.
Diagnosis:
# 1. Check Worker logs for pdf_render_failed event
wrangler tail adventive-billing-worker --env production --format pretty \
--search "pdf_render_failed"
# 2. Check Analytics Engine for pdf_render_failed events
# Workers Analytics Engine dashboard → filter by event=pdf_render_failed
# 3. Attempt manual render for the specific invoice
curl -s -X POST https://billing-worker.adventive.com/admin/render-pdf \
-H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
-H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" \
-H "Content-Type: application/json" \
-d '{"invoice_id": "in_xxx"}' | jq .
Fix:
- If template error (malformed HTML, missing variable): fix Handlebars template and redeploy
- If timeout: invoice data may be unusually large; check if specific invoice has an extreme number of line items
- If concurrency limit: Browser Rendering allows 2 simultaneous renders; if >2 invoices are finalizing simultaneously, renders queue automatically; wait and retry via resend endpoint
- If Browser Rendering service unavailable: check developers.cloudflare.com/browser-rendering for service status; Stripe will hold the webhook event for retry
Verification: Confirm PDF appears in R2 at {acct_id}/{inv_uuid}.pdf; resend invoice email.
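For the "wait and retry" cases above (concurrency limit, transient timeout), a generic retry loop with exponential backoff avoids hammering the render endpoint. This is a sketch, not the Worker's own retry logic; `retry_with_backoff` is a hypothetical helper.

```shell
# retry_with_backoff: hypothetical helper.
# $1 = max attempts, $2 = initial delay in seconds, rest = command to run.
# The delay doubles after every failed attempt.
retry_with_backoff() {
  _rb_max="$1"
  _rb_delay="$2"
  shift 2
  _rb_attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$_rb_attempt" -ge "$_rb_max" ]; then
      echo "giving up after $_rb_attempt attempt(s)" >&2
      return 1
    fi
    echo "attempt $_rb_attempt failed; retrying in ${_rb_delay}s" >&2
    sleep "$_rb_delay"
    _rb_attempt=$((_rb_attempt + 1))
    _rb_delay=$((_rb_delay * 2))
  done
}

# Usage (not run here; curl -f makes HTTP errors count as failures):
#   retry_with_backoff 4 30 curl -sf -X POST \
#     https://billing-worker.adventive.com/admin/render-pdf \
#     -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
#     -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" \
#     -H "Content-Type: application/json" -d '{"invoice_id": "in_xxx"}'
```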
Symptom: Reconciliation drift detected
Likely cause: Metering event double-counted, usage record missed, Stripe invoice total doesn't match billing_invoice DB, or rounding difference.
Diagnosis:
# 1. Get drift details
curl -s "https://billing-worker.adventive.com/admin/reconcile?detail=true" \
-H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
-H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" | jq .
# 2. Compare: Stripe invoice for the drifting account
stripe invoices retrieve in_xxx | jq .amount_due
# 3. Compare: billing_invoice DB for the same account + period
# (Direct DB query via secure bastion)
# 4. Check metering events for the account in question
stripe get /v1/subscription_items/si_xxx/usage_record_summaries | jq .
Fix:
- If usage double-counted: find the duplicate Usage Record; issue a Stripe Credit Note for the overcharge; fix the idempotency key logic in billing Worker
- If usage missed: report missing usage via /usage endpoint for the affected account and period; Stripe will include on the next invoice
- If rounding: document as acceptable if < $0.01; if systematic, fix rounding in billing Worker
- During dual-run (cohorts not yet fully migrated): drift may indicate legacy billing_service also ran for the same account — investigate double-billing risk first
Verification: Re-run reconciliation; confirm drift_count = 0 for affected account.
Symptom: Failed payment (customer's invoice not collected)
Likely cause: Stripe Smart Retries exhausted, card declined, or payment method expired.
Diagnosis:
# 1. Check Stripe invoice status
stripe invoices retrieve in_xxx | jq '{status, amount_due, amount_paid, next_payment_attempt}'
# 2. Check Smart Retries schedule
# Stripe Dashboard → Billing → Revenue recovery → Smart Retries configuration
# Stripe Dashboard → Customers → {customer} → Invoices → see retry schedule
Fix:
- If Smart Retries still running: no action needed; Stripe will retry per configured schedule; dunning emails sent automatically
- If Smart Retries exhausted (invoice status = uncollectible): contact customer via Slack / direct outreach; ask them to update payment method via Customer Portal (B.9) or Stripe Checkout setup link
- If customer updates card: Stripe Dashboard → manually retry collection on the invoice
Verification: Invoice status changes from open to paid in Stripe Dashboard after successful collection.
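The decision in the fix list above depends on two fields from the invoice: `status` and `next_payment_attempt`. The sketch below turns that into a lookup; `payment_next_step` is a hypothetical helper and assumes `next_payment_attempt` is null once no further automatic retry is scheduled.

```shell
# payment_next_step: hypothetical helper.
# $1 = invoice status, $2 = next_payment_attempt (Unix seconds, "null" or empty).
payment_next_step() {
  case "${1:-}" in
    paid)
      echo "none: invoice collected"
      ;;
    open)
      case "${2:-}" in
        ''|null)
          echo "contact: no retry scheduled; ask the customer to update their payment method"
          ;;
        *)
          echo "wait: Stripe will retry automatically"
          ;;
      esac
      ;;
    uncollectible)
      echo "contact: retries exhausted; ask the customer to update their payment method"
      ;;
    *)
      echo "review: unexpected status '${1:-}'"
      ;;
  esac
}

# Usage (not run here):
#   inv="$(stripe invoices retrieve in_xxx)"
#   payment_next_step "$(echo "$inv" | jq -r .status)" \
#                     "$(echo "$inv" | jq -r .next_payment_attempt)"
```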
Symptom: Customer reports incorrect invoice amount
Likely cause: Usage record incorrect, wrong Stripe Price tier applied, or managed service job amount wrong.
Diagnosis:
# 1. Pull full invoice line items
stripe invoices retrieve in_xxx --expand='lines' | jq '.lines.data[]'
# 2. Compare metering data to source (Redshift impression counts for the period)
# Ask Patrick to run: SELECT SUM(impressions) FROM stats WHERE acct_id=... AND period=...
# 3. Check which Price tier was applied
# Look at lines[].price.tiers for the metered impression lines
Fix:
- If usage record wrong: issue a Stripe Credit Note for the overcharge; correct the metering pipeline for future periods
- If Price tier wrong: verify the Stripe Price object's tier thresholds match account_plan_usage for this account's plan
- If managed service job wrong: issue a Stripe Credit Note; correct the InvoiceItem amount for the specific job
Verification: Customer acknowledges corrected invoice; Credit Note sent to customer via Mailgun if applicable.
Symptom: Acodei not syncing to QuickBooks
Likely cause: Acodei OAuth token expired, QuickBooks API rate limit, or Acodei configuration drift.
Diagnosis:
1. Acodei dashboard → Connected accounts → Adventive → check last sync timestamp
2. Acodei → Event log → look for sync errors or auth failures
3. QuickBooks → check for duplicate entries (indicates sync ran twice)
Fix:
- If OAuth expired: reconnect Acodei to QuickBooks via Acodei settings (OAuth re-authorization flow)
- If rate limited: Acodei retries automatically; wait 30–60 minutes
- If configuration drift: compare Acodei field mappings against the expected Stripe event shape; update mappings in Acodei settings
Verification: Acodei → Event log shows successful sync for recent Stripe events; QuickBooks shows new entries.
Known issues & workarounds
| Issue | Workaround | Permanent fix |
|---|---|---|
| Browser Rendering concurrency limit (2 simultaneous renders) | Renders queue naturally; ~200 invoices/month at non-simultaneous pacing is fine | If batch volume spikes, use Cloudflare Queues to serialize render requests |
| PHPExcel still used for month-end Excel export | Legacy admin continues to generate until Cron Trigger Worker replacement is live | Port month-end summary to CSV via Cron Trigger Worker (Phase 6+) |
| Stripe Smart Retries not yet configured | Operators manually track failed invoices until Smart Retries is configured in B.3 | Configure Smart Retries as part of B.3; enable at least 3 retry attempts on configurable schedule |
| Historical invoice PDF unavailable in R2 pre-B.4 | PDFs still served from billing.adventivecdn.com for pre-migration invoices | Batch-copy historical PDFs from S3 to R2 as part of B.4 setup |
| 14-year billing_invoice history not in Stripe | Legacy DB remains accessible read-only for historical lookups | Export to R2 / archive format before legacy DB decommission (post-B.8) |
| Custom dunning still in admin during parallel run | Legacy admin handles dunning for non-migrated cohorts; billing Worker handles migrated cohorts | Retire at B.8 when 100% of customers are on Stripe Billing |