04 — Runbook

On-call quick reference

| Role | Name | Contact |
|---|---|---|
| Primary on-call | Jeffrey Lambert | Slack: @jeffrey |
| Secondary on-call | Patrick | Slack: @patrick |
| Escalation (Stripe issues) | Stripe Support | dashboard.stripe.com → Support |
| Escalation (Mailgun issues) | Mailgun Support | app.mailgun.com → Support |
| Escalation (Cloudflare issues) | Cloudflare Support | dash.cloudflare.com → Support |

Quick checks:

# Worker health
curl -s https://billing-worker.adventive.com/health | jq .

# Real-time logs
wrangler tail adventive-billing-worker --env production --format pretty

# Stripe webhook delivery status
# Stripe Dashboard → Developers → Webhooks → endpoint → Event deliveries

Day-to-day procedures

Monthly reconciliation (after each billing cycle)

The daily Cron Trigger Worker runs reconciliation automatically at 06:00 UTC. After a billing cycle closes (typically month-end), run a manual verification:

# Trigger reconciliation manually (admin endpoint — CF Access protected)
curl -s https://billing-worker.adventive.com/admin/reconcile \
  -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
  -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" | jq .

# Expected: {"status":"clean","accounts_checked":N,"drift_count":0}
# If drift_count > 0: see the "Symptom: Reconciliation drift detected" incident playbook
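
For scripted month-end checks, the expected output above can be reduced to an exit status. The following is a minimal sketch, not part of the billing Worker: check_reconcile is a hypothetical helper, and it assumes the response carries the status and drift_count fields shown in the expected output.

```shell
# Hypothetical helper: turn a reconcile result into an exit status.
# Arguments are values already extracted from the response, for example:
#   curl -s .../admin/reconcile ... | jq -r '.status, .drift_count'
# $1 = status field, $2 = drift_count field
check_reconcile() {
  local state="$1" drift="$2"
  case "$drift" in
    ''|*[!0-9]*)
      echo "unparseable drift_count: '$drift'" >&2
      return 2 ;;
  esac
  if [ "$state" = "clean" ] && [ "$drift" -eq 0 ]; then
    echo "OK: reconciliation clean"
    return 0
  fi
  echo "DRIFT: status=$state drift_count=$drift" >&2
  return 1
}
```

A non-zero exit status is the cue to open the reconciliation drift playbook; exit status 2 means the response could not be parsed and should be inspected by hand.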

Check Mailgun delivery for the billing cycle:

- Mailgun Dashboard → Logs → filter by billing-noreply@notify.adventive.com and date range
- All invoiced accounts should show a delivered event for invoice_new or invoice_notify

Deploy a hotfix

# 1. Create hotfix branch from main
git checkout main && git pull
git checkout -b hotfix/description

# 2. Make the fix, stage, commit, push
git add -A   # or stage specific paths
git commit -m "fix(billing): ..."
git push origin hotfix/description

# 3. Merge to staging first (or open a PR targeting staging)
git checkout staging && git pull && git merge hotfix/description && git push
# GitHub Actions deploys to staging automatically

# 4. Verify fix on staging (use Stripe test mode + Stripe CLI to trigger events)
stripe trigger invoice.finalized --override invoice:customer=cus_test123

# 5. Merge to main → production deploy
git checkout main && git merge staging && git push
# GitHub Actions deploys to production

# 6. Monitor: tail logs for 10 minutes post-deploy
wrangler tail adventive-billing-worker --env production --format pretty

Roll back the billing Worker

# See available deployments
wrangler deployments list --env production

# Roll back to previous deployment
wrangler rollback --env production

# Or roll back to a specific deployment
wrangler rollback [deployment-id] --env production

# Verify
curl -s https://billing-worker.adventive.com/health | jq .version

After rollback, Stripe will continue delivering webhook events. Unprocessed events from the failed deployment will be retried by Stripe automatically for up to 3 days.
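
To judge whether a missed event is still inside the retry window described above, its creation time can be compared with the current time. This is a rough sketch under two assumptions: the window is the 3 days stated above, and the first argument is the created field (Unix seconds) from the Stripe event object. The name within_retry_window is hypothetical.

```shell
# Hypothetical helper: is an event still inside the 3-day retry window?
# $1 = event "created" timestamp (Unix seconds)
# $2 = current time in Unix seconds (optional, defaults to now)
RETRY_WINDOW_SECONDS=$((3 * 24 * 60 * 60))

within_retry_window() {
  local created="$1" now="${2:-$(date +%s)}" age
  age=$((now - created))
  # Reject timestamps in the future as well as events that are too old.
  [ "$age" -ge 0 ] && [ "$age" -le "$RETRY_WINDOW_SECONDS" ]
}
```

Events that fall outside the window would need to be resent by hand, for example from the Event deliveries view in the Stripe Dashboard.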

Rotate a Worker secret

# Example: rotating Stripe key
wrangler secret put STRIPE_SECRET_KEY --env production
# Enter new value when prompted

wrangler secret put STRIPE_SECRET_KEY --env staging

# Verify by triggering a Stripe-dependent endpoint
curl -s https://billing-worker.adventive.com/health | jq .stripe_connected

No separate Worker redeploy is required: wrangler secret put creates a new Worker version with the updated secret and deploys it immediately.

Rotate the Stripe webhook signing secret

If the Stripe webhook endpoint is deleted and re-registered, or if the signing secret is rotated via Stripe Dashboard:

  1. Stripe Dashboard → Developers → Webhooks → click endpoint → Reveal signing secret
  2. Copy the new secret
  3. Update the Wrangler secret:
    wrangler secret put STRIPE_WEBHOOK_SECRET --env production
    wrangler secret put STRIPE_WEBHOOK_SECRET --env staging
    
  4. Verify: trigger a test event from Stripe Dashboard → endpoint should return 200
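
When a delivery is rejected after a rotation, it can help to check a captured request against a candidate secret locally. The sketch below follows Stripe's documented v1 scheme (HMAC-SHA256 over "timestamp.body", keyed with the signing secret); it is not the Worker's own verification code. The function names are hypothetical, openssl is assumed to be installed, and the timestamp tolerance check and constant-time comparison that a production verifier needs are left out.

```shell
# Compute the v1 signature for a payload.
# $1 = signing secret, $2 = timestamp from the header, $3 = raw request body
stripe_sig() {
  printf '%s.%s' "$2" "$3" \
    | openssl dgst -sha256 -hmac "$1" \
    | sed 's/^.*= *//'   # keep only the hex digest
}

# Return 0 when the first v1 value in the header matches the local digest.
# $1 = signing secret, $2 = Stripe-Signature header value, $3 = raw request body
verify_stripe_sig() {
  local secret="$1" header="$2" body="$3" ts v1
  ts=$(printf '%s' "$header" | tr ',' '\n' | sed -n 's/^t=//p' | head -n 1)
  v1=$(printf '%s' "$header" | tr ',' '\n' | sed -n 's/^v1=//p' | head -n 1)
  [ -n "$ts" ] && [ -n "$v1" ] \
    && [ "$(stripe_sig "$secret" "$ts" "$body")" = "$v1" ]
}
```

The body must be byte-for-byte identical to what Stripe sent; any re-serialization of the JSON changes the digest. A header may carry more than one v1 value while a secret is being rolled, and this sketch only checks the first.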

Resend an invoice email

If an operator reports an invoice email was not received:

# Admin resend endpoint (CF Access protected)
curl -s -X POST https://billing-worker.adventive.com/admin/resend-invoice \
  -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
  -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" \
  -H "Content-Type: application/json" \
  -d '{"invoice_id": "in_1THM9xCSGXf35AB6gDkwmE5y"}' | jq .

This re-fetches the invoice from Stripe, re-renders the PDF, and re-sends via Mailgun. It does NOT create a new Stripe invoice event.

Issue a refund

# Via Stripe Dashboard: Dashboard → Payments → find charge → Refund
# Stripe issues the refund; billing Worker does not need to act
# Acodei picks up the refund event and creates a QuickBooks credit note

# For a partial refund via Stripe CLI:
stripe refunds create --charge ch_xxx --amount 5000  # amount in cents

# Confirm: check Acodei sync within 15 minutes of Stripe refund
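
Because the refund amount is given in cents, a dollar figure from a support ticket has to be converted first. The helper below is an illustrative sketch (to_cents is a hypothetical name) that assumes a two-decimal currency such as USD; it uses string handling only so that no floating-point rounding is involved.

```shell
# Hypothetical helper: convert a decimal dollar amount to integer cents.
# Accepts forms such as 50, 50.0, 50.00 and 0.07; rejects anything else.
to_cents() {
  local amount="$1" whole frac
  case "$amount" in
    ''|.|*[!0-9.]*|*.*.*)
      echo "invalid amount: '$amount'" >&2
      return 1 ;;
  esac
  whole="${amount%%.*}"
  frac=""
  case "$amount" in *.*) frac="${amount#*.}" ;; esac
  if [ "${#frac}" -gt 2 ]; then
    echo "more than two decimal places: '$amount'" >&2
    return 1
  fi
  # Right-pad the fraction to two digits: "5" becomes "50", "" becomes "00".
  frac=$(printf '%s00' "$frac" | cut -c1-2)
  # Join the digits and strip leading zeros ("007" becomes "7", "000" becomes "0").
  printf '%s%s\n' "$whole" "$frac" | sed 's/^0*//; s/^$/0/'
}
```

The result can be passed straight to the amount flag, for example --amount "$(to_cents 50.00)".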

Handle a dispute

  1. Stripe Dashboard → Disputes → find the dispute
  2. Review the evidence Stripe requests (invoice PDF, delivery confirmation, account records)
  3. Upload evidence via Stripe Dashboard — deadline shown on dispute detail page
  4. Invoice PDF available in R2: r2://adventive-invoices/{acct_id}/{inv_uuid}.pdf
  5. Mailgun delivery log: Dashboard → Logs → search by recipient email for delivery confirmation
  6. After dispute closes: Stripe resolves automatically; Acodei picks up the outcome event

Update a customer's payment method

  1. Direct the customer to the Stripe Customer Portal (if B.9 is enabled): https://billing.stripe.com/p/login/{portal_token}
  2. If Customer Portal is not yet enabled: use Stripe Dashboard → Customers → find customer → Payment methods → Add payment method → share Stripe Checkout setup link with customer

Investigate a QuickBooks / Acodei drift

If an operator reports that QuickBooks totals don't match Stripe:

# Check the reconciliation endpoint output
curl -s https://billing-worker.adventive.com/admin/reconcile \
  -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
  -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" | jq .

# Check Acodei dashboard for sync errors:
# Acodei → Connected accounts → Adventive → Event log → filter by error

Common causes: Stripe event not yet picked up by Acodei (5–15 min delay), Acodei auth token expired (re-connect in Acodei settings), QuickBooks API rate limit (resolves automatically).


Incident playbooks

Symptom: Stripe webhook not delivered (endpoint failing)

Likely cause: Worker returning non-2xx, or Worker throwing an unhandled exception.

Diagnosis:

# 1. Check Stripe webhook delivery log
# Stripe Dashboard → Developers → Webhooks → endpoint → Event deliveries
# Look for: 5xx responses, timeouts, or connection refused

# 2. Check Worker logs for the error
wrangler tail adventive-billing-worker --env production --format pretty

# 3. Check Worker error rate in Cloudflare Dashboard
# Workers & Pages → adventive-billing-worker → Metrics → Errors

Fix:

- If Worker throwing: fix the exception in code; hotfix deploy
- If Worker returning 400 (signature mismatch): verify STRIPE_WEBHOOK_SECRET matches the endpoint's current signing secret (see "Rotate the Stripe webhook signing secret")
- If Worker unavailable: likely a Cloudflare infrastructure issue; check cloudflarestatus.com; Stripe will retry for up to 3 days

Verification: Stripe Dashboard → Event deliveries → redeliver the failed event → should show 200 response.
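
The Fix branches above can be summarised as a lookup from the HTTP status shown in Stripe's Event deliveries view. This is only an illustrative sketch: triage_webhook_status is a hypothetical helper, and treating every 5xx as an unhandled Worker exception is an assumption rather than something the Worker guarantees.

```shell
# Hypothetical helper: map a delivery status to the matching Fix branch.
# $1 = HTTP status from the delivery log, or "timeout" / empty if none
triage_webhook_status() {
  case "$1" in
    2[0-9][0-9]) echo "ok: delivery succeeded" ;;
    400)         echo "signature: compare STRIPE_WEBHOOK_SECRET with the endpoint's signing secret" ;;
    5[0-9][0-9]) echo "worker-error: check wrangler tail, then hotfix deploy" ;;
    ''|timeout)  echo "unavailable: check Cloudflare status; Stripe retries for up to 3 days" ;;
    *)           echo "unknown: inspect Worker logs for status $1" ;;
  esac
}
```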


Symptom: Invoice email not delivered

Likely cause: Mailgun API failure, Mailgun API key expired, or PDF render failure (email suppressed when PDF is missing).

Diagnosis:

# 1. Tail Worker logs for invoice.finalized (wrangler tail streams live logs only; redeliver the event from Stripe to reproduce)
wrangler tail adventive-billing-worker --env production --format pretty \
  --search "invoice.finalized"

# 2. Check Mailgun delivery log
# Mailgun Dashboard → Logs → search by recipient email or message-id

# 3. Check if PDF exists in R2
# Cloudflare Dashboard → R2 → adventive-invoices → browse for {acct_id}/{inv_uuid}.pdf

Fix:

- If Mailgun API key expired: rotate MAILGUN_API_KEY (see "Rotate a Worker secret"); resend via /admin/resend-invoice
- If PDF render failed: see "Symptom: PDF render failed"
- If Mailgun shows bounced or failed: check recipient email address is correct in Stripe Customer record; update if needed
- If Mailgun shows suppressed: customer previously unsubscribed; contact them via alternative channel; do not send via Mailgun until they re-subscribe

Verification: Resend via /admin/resend-invoice; confirm Mailgun shows delivered for the message.


Symptom: PDF render failed

Likely cause: Cloudflare Browser Rendering timeout, HTML template error, or Browser Rendering concurrency limit hit.

Diagnosis:

# 1. Check Worker logs for pdf_render_failed event
wrangler tail adventive-billing-worker --env production --format pretty \
  --search "pdf_render_failed"

# 2. Check Analytics Engine for pdf_render_failed events
# Workers Analytics Engine dashboard → filter by event=pdf_render_failed

# 3. Attempt manual render for the specific invoice
curl -s -X POST https://billing-worker.adventive.com/admin/render-pdf \
  -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
  -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" \
  -H "Content-Type: application/json" \
  -d '{"invoice_id": "in_xxx"}' | jq .

Fix:

- If template error (malformed HTML, missing variable): fix Handlebars template and redeploy
- If timeout: invoice data may be unusually large; check if specific invoice has an extreme number of line items
- If concurrency limit: Browser Rendering allows 2 simultaneous renders; if >2 invoices are finalizing simultaneously, renders queue automatically. Wait and retry via resend endpoint
- If Browser Rendering service unavailable: check developers.cloudflare.com/browser-rendering for service status; Stripe will hold the webhook event for retry

Verification: Confirm PDF appears in R2 at {acct_id}/{inv_uuid}.pdf; resend invoice email.
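
The "wait and retry" advice for concurrency limits and timeouts can be scripted with a generic retry loop. The sketch below is an operator-side illustration, not the Worker's queueing logic; retry_with_backoff is a hypothetical helper, and the usage comment assumes the admin endpoint answers with a non-2xx status when a render fails.

```shell
# Generic retry with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
# Example (curl -f makes HTTP errors count as failures):
#   retry_with_backoff 4 5 curl -sf -X POST .../admin/render-pdf ...
retry_with_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait after each failure
  done
}
```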


Symptom: Reconciliation drift detected

Likely cause: Metering event double-counted, usage record missed, Stripe invoice total doesn't match billing_invoice DB, or rounding difference.

Diagnosis:

# 1. Get drift details
curl -s "https://billing-worker.adventive.com/admin/reconcile?detail=true" \
  -H "CF-Access-Client-Id: ${SERVICE_TOKEN_ID}" \
  -H "CF-Access-Client-Secret: ${SERVICE_TOKEN_SECRET}" | jq .

# 2. Compare: Stripe invoice for the drifting account
stripe invoices retrieve in_xxx | jq .amount_due

# 3. Compare: billing_invoice DB for the same account + period
# (Direct DB query via secure bastion)

# 4. Check metered usage for the account in question
# (Stripe exposes recorded usage for reading only as per-period summaries)
stripe get /v1/subscription_items/si_xxx/usage_record_summaries | jq .

Fix:

- If usage double-counted: find the duplicate Usage Record; issue a Stripe Credit Note for the overcharge; fix the idempotency key logic in billing Worker
- If usage missed: report missing usage via /usage endpoint for the affected account and period; Stripe will include on the next invoice
- If rounding: document as acceptable if < $0.01; if systematic, fix rounding in billing Worker
- During dual-run (cohorts not yet fully migrated): drift may indicate legacy billing_service also ran for the same account; investigate double-billing risk first

Verification: Re-run reconciliation; confirm drift_count = 0 for affected account.
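
The rounding rule in the Fix list (acceptable below $0.01) can be checked mechanically when comparing a Stripe amount with a billing_invoice total. The following sketch assumes the Stripe figure is in integer cents and the DB figure is in decimal dollars, which is an assumption about the schema; classify_drift is a hypothetical helper.

```shell
# Hypothetical helper: classify the gap between Stripe and the DB.
# $1 = Stripe amount in cents (e.g. amount_due), $2 = DB total in dollars
classify_drift() {
  awk -v cents="$1" -v dollars="$2" 'BEGIN {
    eps = 0.000001                    # absorbs floating-point noise
    diff = cents - dollars * 100
    if (diff < 0) diff = -diff
    if (diff < eps)          print "clean"
    else if (diff < 1 - eps) print "rounding"   # strictly under one cent
    else                     print "drift"
  }'
}
```

A result of "rounding" on many accounts in the same cycle would point to the systematic case rather than an acceptable one-off.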


Symptom: Failed payment — customer's invoice not collected

Likely cause: Stripe Smart Retries exhausted, card declined, or payment method expired.

Diagnosis:

# 1. Check Stripe invoice status
stripe invoices retrieve in_xxx | jq '{status, amount_due, amount_paid, next_payment_attempt}'

# 2. Check Smart Retries schedule
# Stripe Dashboard → Billing → Revenue recovery → Smart Retries configuration
# Stripe Dashboard → Customers → {customer} → Invoices → see retry schedule

Fix:

- If Smart Retries still running: no action needed; Stripe will retry per configured schedule; dunning emails sent automatically
- If Smart Retries exhausted (invoice status = uncollectible): contact customer via Slack / direct outreach; ask them to update payment method via Customer Portal (B.9) or Stripe Checkout setup link
- If customer updates card: Stripe Dashboard → manually retry collection on the invoice

Verification: Invoice status changes from open to paid in Stripe Dashboard after successful collection.
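
The decision in the Fix list depends on two fields from the diagnosis output: status and next_payment_attempt. A small sketch of that mapping follows; payment_next_step is a hypothetical helper, and treating an open invoice with no scheduled attempt as a manual-retry case is an assumption layered on top of the branches above.

```shell
# Hypothetical helper: map invoice fields to the next operator action.
# Values can be extracted with, for example:
#   stripe invoices retrieve in_xxx | jq -r '.status, .next_payment_attempt'
# $1 = invoice status, $2 = next_payment_attempt (Unix seconds, or "null")
payment_next_step() {
  local inv_status="$1" next_attempt="$2"
  case "$inv_status" in
    paid)
      echo "done: invoice collected" ;;
    uncollectible)
      echo "outreach: retries exhausted; ask customer to update payment method" ;;
    open)
      if [ -n "$next_attempt" ] && [ "$next_attempt" != "null" ]; then
        echo "wait: Stripe will retry at $next_attempt"
      else
        echo "manual: no retry scheduled; retry collection from the Dashboard once the card is updated"
      fi ;;
    *)
      echo "unknown: status '$inv_status' needs manual review" ;;
  esac
}
```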


Symptom: Customer reports incorrect invoice amount

Likely cause: Usage record incorrect, wrong Stripe Price tier applied, or managed service job amount wrong.

Diagnosis:

# 1. Pull full invoice line items
stripe invoices retrieve in_xxx --expand='lines' | jq '.lines.data[]'

# 2. Compare metering data to source (Redshift impression counts for the period)
# Ask Patrick to run: SELECT SUM(impressions) FROM stats WHERE acct_id=... AND period=...

# 3. Check which Price tier was applied
# Look at lines[].price.tiers for the metered impression lines

Fix:

- If usage record wrong: issue a Stripe Credit Note for the overcharge; correct the metering pipeline for future periods
- If Price tier wrong: verify the Stripe Price object's tier thresholds match account_plan_usage for this account's plan
- If managed service job wrong: issue a Stripe Credit Note; correct the InvoiceItem amount for the specific job

Verification: Customer acknowledges corrected invoice; Credit Note sent to customer via Mailgun if applicable.
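
When checking whether the right tier was applied, it can help to recompute the expected charge from the tier thresholds. The sketch below assumes graduated tiers (each band of usage is charged at its own rate) with whole-cent unit amounts and no flat fees; the source does not say which tiers_mode the Prices use, so this should be confirmed against the Price object first. graduated_amount is a hypothetical helper.

```shell
# Sketch: expected charge in cents under graduated tiers.
# $1 = billed quantity
# remaining args = tiers as "up_to:unit_amount_cents" in ascending order,
#                  with "inf" as the up_to of the last tier
graduated_amount() {
  local qty="$1" prev=0 total=0 tier up_to unit upper
  shift
  for tier in "$@"; do
    up_to="${tier%%:*}"
    unit="${tier#*:}"
    # Upper bound of usage that falls inside this tier.
    if [ "$up_to" = "inf" ] || [ "$qty" -lt "$up_to" ]; then
      upper="$qty"
    else
      upper="$up_to"
    fi
    if [ "$upper" -gt "$prev" ]; then
      total=$((total + (upper - prev) * unit))
      prev="$upper"
    fi
  done
  echo "$total"
}
```

For example, with tiers 1000:10 and inf:5, a quantity of 1500 is charged as 1000 units at 10 cents plus 500 units at 5 cents. A mismatch between this figure and the invoice line would suggest either volume-based tiers or a wrong threshold.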


Symptom: Acodei not syncing to QuickBooks

Likely cause: Acodei OAuth token expired, QuickBooks API rate limit, or Acodei configuration drift.

Diagnosis:

  1. Acodei dashboard → Connected accounts → Adventive → check last sync timestamp
  2. Acodei → Event log → look for sync errors or auth failures
  3. QuickBooks → check for duplicate entries (indicates sync ran twice)

Fix:

- If OAuth expired: reconnect Acodei to QuickBooks via Acodei settings (OAuth re-authorization flow)
- If rate limited: Acodei retries automatically; wait 30–60 minutes
- If configuration drift: compare Acodei field mappings against the expected Stripe event shape; update mappings in Acodei settings

Verification: Acodei → Event log shows successful sync for recent Stripe events; QuickBooks shows new entries.


Known issues & workarounds

| Issue | Workaround | Permanent fix |
|---|---|---|
| Browser Rendering concurrency limit (2 simultaneous renders) | Renders queue naturally; ~200 invoices/month at non-simultaneous pacing is fine | If batch volume spikes, implement a queue (Workers Queue) to serialize render requests |
| PHPExcel still used for month-end Excel export | Legacy admin continues to generate until Cron Trigger Worker replacement is live | Port month-end summary to CSV via Cron Trigger Worker (Phase 6+) |
| Stripe Smart Retries not yet configured | Operators manually track failed invoices until Smart Retries is configured in B.3 | Configure Smart Retries as part of B.3; enable at least 3 retry attempts on configurable schedule |
| Historical invoice PDF unavailable in R2 pre-B.4 | PDFs still served from billing.adventivecdn.com for pre-migration invoices | Batch-copy historical PDFs from S3 to R2 as part of B.4 setup |
| 14-year billing_invoice history not in Stripe | Legacy DB remains accessible read-only for historical lookups | Export to R2 / archive format before legacy DB decommission (post-B.8) |
| Custom dunning still in admin during parallel run | Legacy admin handles dunning for non-migrated cohorts; billing Worker handles migrated cohorts | Retire at B.8 when 100% of customers are on Stripe Billing |