fotospiel-app/docs/ops/deployment/public-api-incident-playbook.md
Codex Agent 9a8305d986
Add Uptime Kuma monitoring template
2026-01-30 11:12:15 +01:00

Public API Incident Response Playbook (SEC-API-02)

Scope: Guest-facing API endpoints that rely on join tokens and power the guest PWA plus the public gallery. This includes:

  • /api/v1/events/{token}/* (stats, tasks, uploads, photos)
  • /api/v1/gallery/{token}/*
  • Signed download/asset routes generated via EventPublicController

The playbook focuses on abuse, availability loss, and leaked content.


1. Detection & Alerting

| Signal | Where to Watch | Notes |
| --- | --- | --- |
| 4xx/5xx spikes | Application logs (storage/logs/laravel.log), centralized logging | Look for repeated Join token access denied / token_rate_limited or unexpected 5xx. |
| Rate-limit triggers | Laravel log lines emitted from EventPublicController::handleTokenFailure | Contains IP + truncated token preview. |
| CDN/WAF alerts | Reverse proxy (if enabled) | Ensure 429/403 anomalies are forwarded to incident channel. |
| Synthetic monitors | Uptime Kuma (SEC-API-03) | Public uptime + guest API + support API health checks. |

Manual check commands:

php artisan log:tail --lines=200 | grep "Join token"
php artisan log:tail --lines=200 | grep "gallery"
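
When a plain grep is too coarse, the same check can be done with a short script that counts hits per marker. This is a minimal sketch, assuming the marker strings from the detection table appear verbatim in each log line; the exact Laravel log format is not specified here, and the helper names are illustrative.

```python
from collections import Counter, deque
from pathlib import Path

# Marker strings taken from the detection table; extend as needed.
MARKERS = ("Join token access denied", "token_rate_limited")


def count_markers(lines, markers=MARKERS):
    """Count how many of the given log lines contain each marker string."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def tail_lines(path, limit=200):
    """Return the last `limit` lines of a text file."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return list(deque(handle, maxlen=limit))


if __name__ == "__main__":
    # Log path as named in this playbook; adjust for the deployment.
    log_path = Path("storage/logs/laravel.log")
    if log_path.exists():
        for marker, hits in count_markers(tail_lines(log_path)).items():
            print(f"{hits:5d}  {marker}")
```

Run it from the application root so the relative log path resolves. A sudden jump in either count relative to a quiet period is the signal to look for.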

1.1 Synthetic Monitors (Uptime Kuma)

Primary uptime:

  • GET / (base domain) — HTTP 200-399.

Guest API (stable synthetic event tokens required):

  • GET /api/v1/events/{token} — expect JSON containing "slug" or "engagement_mode".
  • GET /api/v1/gallery/{token} — expect JSON containing "event" or "branding".
  • GET /api/v1/gallery/{token}/photos — expect JSON containing "data".
  • GET /api/v1/events/{token}/photos — expect JSON containing "data".

Support API health metrics (read-only token stored in Kuma):

  • GET /api/v1/support/tenants?per_page=1
  • GET /api/v1/support/events?per_page=1
  • GET /api/v1/support/photos?per_page=1

Defaults:

  • Interval: 60s
  • Timeout: 10s
  • Retries before alert: 2

Notes:

  • Do not store bearer tokens in the repo; configure them directly in Kuma.
  • If synthetic tokens rotate, guest monitors will flap. Keep a dedicated synthetic event/token.
  • Import template: docs/ops/deployment/uptime-kuma-import.template.json (replace placeholders before import).
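
The expectations above can also be checked outside Kuma, for example when validating a new synthetic token by hand. The sketch below assumes the listed names are top-level keys of the JSON response; a keyword monitor normally matches a substring of the body, so this check is somewhat stricter. The function names are illustrative and not part of the application.

```python
import json

# Expected top-level keys per monitored path, as listed above.
# "{token}" stays a placeholder; the real synthetic token lives only in Kuma.
EXPECTED_KEYS = {
    "/api/v1/events/{token}": ("slug", "engagement_mode"),
    "/api/v1/gallery/{token}": ("event", "branding"),
    "/api/v1/gallery/{token}/photos": ("data",),
    "/api/v1/events/{token}/photos": ("data",),
}


def status_ok(status):
    """Mirror the primary uptime rule: any HTTP status from 200 to 399 passes."""
    return 200 <= status <= 399


def body_ok(path, body):
    """Return True if the JSON body has at least one expected top-level key."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # not valid JSON at all
    if not isinstance(payload, dict):
        return False
    return any(key in payload for key in EXPECTED_KEYS[path])
```

Feed `body_ok` the raw response body saved from a manual curl call; a `False` result for a known-good token suggests the response shape changed and the monitor keywords need updating.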

2. Severity Classification

| Level | Criteria | Examples |
| --- | --- | --- |
| SEV-1 | Wide outage (>50% error rate), confirmed data leak or malicious mass-download | Gallery downloads serving wrong event, join-token table compromised. |
| SEV-2 | Localised outage (single tenant/event) or targeted brute force attempting to enumerate tokens | Single event returning 500, repeated invalid_token from single IP range. |
| SEV-3 | Minor functional regression or cosmetic issue | Rate limit misconfiguration causing occasional 429 for legitimate users. |

Escalate SEV-1/2 immediately to on-call via Slack #incident-response and open a PagerDuty incident (if configured).
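
The classification rules can be written down as a small decision function, which helps when wiring them into alert routing. This is a sketch only: the field names are hypothetical, and the one numeric threshold (error rate above 50%) is the only figure taken from the criteria above.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Observations gathered during triage; all field names are illustrative."""
    error_rate: float = 0.0            # fraction of failing requests, 0.0 to 1.0
    data_leak_confirmed: bool = False
    mass_download: bool = False
    localised_outage: bool = False     # limited to a single tenant or event
    token_enumeration: bool = False    # targeted brute force against tokens


def classify(facts):
    """Map triage facts onto the SEV levels defined in this section."""
    if facts.error_rate > 0.5 or facts.data_leak_confirmed or facts.mass_download:
        return "SEV-1"
    if facts.localised_outage or facts.token_enumeration:
        return "SEV-2"
    return "SEV-3"


def must_escalate(level):
    """SEV-1 and SEV-2 are escalated to on-call immediately."""
    return level in ("SEV-1", "SEV-2")
```

The checks run from most to least severe, so an incident that matches several criteria always gets the highest applicable level.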

3. Immediate Response Checklist

  1. Confirm availability
    • curl -I https://app.test/api/v1/gallery/{known_good_token}
    • Use tenant-provided test token to validate /events/{token} flow.
  2. Snapshot logs
    • Export last 15 minutes from log aggregator or storage/logs. Attach to incident ticket.
  3. Assess scope
    • Identify affected tenant/event IDs via log context.
    • Note IP addresses triggering rate limits.
  4. Decide mitigation
    • Brute force? → throttle/block offending IPs.
    • Compromised token? → revoke token via Filament or php artisan tenant:join-tokens:revoke {id} (once command exists).
    • Endpoint regression? → begin rolling fix or feature flag toggle.
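
For the "assess scope" step, it helps to see whether the offending IPs cluster in one network, since a single IP range points to targeted brute force rather than broad abuse. A minimal sketch using Python's standard ipaddress module; the /24 and /64 grouping sizes are assumptions, not values defined by this playbook.

```python
import ipaddress
from collections import Counter


def group_by_network(addresses, v4_prefix=24, v6_prefix=64):
    """Count addresses per enclosing network; prefix lengths are assumptions."""
    counts = Counter()
    for raw in addresses:
        try:
            ip = ipaddress.ip_address(raw.strip())
        except ValueError:
            continue  # skip anything that is not a valid IP address
        prefix = v4_prefix if ip.version == 4 else v6_prefix
        # strict=False masks the host bits so the address maps to its network.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts
```

Pass in the IPs collected from the rate-limit log lines and read `group_by_network(ips).most_common(5)` to find the ranges worth blocking first.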

4. Mitigation Tactics

4.1 Abuse / Brute force

  • Increase rate-limiter strictness temporarily by editing config/limiting.php (if available) or applying runtime block in the load balancer.
  • Use fail2ban/WAF rules to block offending IPs. For quick local action:
    sudo ufw deny from <ip_address>
    
  • Consider temporarily disabling gallery download by setting PUBLIC_GALLERY_ENABLED=false (feature flag planned) and clearing cache.
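
When several addresses need blocking at once, generating the ufw commands from a validated list avoids typos and malformed rules. The sketch below only builds the command strings for operator review and does not execute anything; the function name is illustrative.

```python
import ipaddress


def ufw_deny_commands(addresses):
    """Return one 'ufw deny from' command per unique, valid IP address."""
    commands = []
    seen = set()
    for raw in addresses:
        try:
            ip = ipaddress.ip_address(raw.strip())
        except ValueError:
            continue  # ignore malformed input rather than emit a bad rule
        if ip.is_loopback or ip in seen:
            continue  # never block localhost; skip duplicates
        seen.add(ip)
        commands.append(f"sudo ufw deny from {ip}")
    return commands
```

Keep the generated list in the incident ticket so the temporary blocks can be removed once the threat subsides.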

4.2 Token Compromise

  • Revoke specific token via Filament “Join Tokens” modal (Event → Join Tokens → revoke).
  • Notify tenant with replacement token instructions.
  • Audit join-token logs for additional suspicious use and consider rotating all tokens for the event.

4.3 Internal Failure (500s)

  • Tail logs for stack traces.
  • If due to downstream storage, fail closed: return 503 with maintenance banner while running php artisan storage:diagnostics.
  • Roll back recent deployment or disable new feature flag if traced to release.

5. Communication

| Audience | Channel | Cadence |
| --- | --- | --- |
| Internal on-call | Slack #incident-response, PagerDuty | Initial alert, hourly updates. |
| Customer Support | Slack #support with summary | Once per significant change (mitigation applied, issue resolved). |
| Tenants | Email template “Public gallery disruption” (see resources/lang/*/emails.php) | Only for SEV-1 or impactful SEV-2 after mitigation. |

Document timeline, impact, and mitigation in the incident ticket.

6. Verification & Recovery

After applying mitigation:

  1. Re-run test requests for affected endpoints.
  2. Validate join-token creation/revocation via Filament.
  3. Confirm error rates return to baseline in monitoring/dashboard.
  4. Remove temporary firewall blocks once threat subsides.
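
"Return to baseline" is easier to confirm with an explicit comparison than by eye. A minimal sketch, assuming a sample of recent HTTP status codes is available from the log aggregator; the tolerance of one percentage point is an illustrative default, not a value defined by this playbook.

```python
def error_rate(status_codes):
    """Fraction of responses with a 4xx or 5xx status; 0.0 for an empty sample."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    failures = sum(1 for code in codes if code >= 400)
    return failures / len(codes)


def back_to_baseline(current, baseline, tolerance=0.01):
    """True when the current error rate is within `tolerance` of the baseline."""
    return current <= baseline + tolerance
```

Compare a sample taken after mitigation against one from a comparable quiet period, and close out the step only when the check passes on consecutive samples.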

7. Post-Incident Actions

  • File RCA within 48 hours including: root cause, detection gaps, follow-up tasks (e.g., enabling synthetic monitors, adding audit fields).
  • Update documentation if new procedures are required (docs/prp/11-public-gallery.md, docs/prp/03-api.md).
  • Schedule backlog items for long-term fixes (e.g., better anomaly alerting, token analytics dashboards).

8. References & Tools

  • Log aggregation: storage/logs/laravel.log (local), Stackdriver/Splunk (staging/prod).
  • Rate limit config: App\Providers\AppServiceProvider (RateLimiter::for('tenant-api')) and EventPublicController::handleTokenFailure.
  • Token management UI: Filament → Events → Join Tokens.
  • Signed URL generation: app/Http/Controllers/Api/EventPublicController (for tracing download issues).

Keep this document alongside the other deployment runbooks and review quarterly.