Public API Incident Response Playbook (SEC-API-02)
Scope: Guest-facing API endpoints that rely on join tokens and power the guest PWA plus the public gallery. This includes:
- /api/v1/events/{token}/* (stats, tasks, uploads, photos)
- /api/v1/gallery/{token}/*
- Signed download/asset routes generated via EventPublicController
The playbook focuses on abuse, availability loss, and leaked content.
1. Detection & Alerting
| Signal | Where to Watch | Notes |
|---|---|---|
| 4xx/5xx spikes | Application logs (storage/logs/laravel.log), centralized logging | Look for repeated Join token access denied / token_rate_limited or unexpected 5xx. |
| Rate-limit triggers | Laravel log lines emitted from EventPublicController::handleTokenFailure | Contains IP + truncated token preview. |
| CDN/WAF alerts | Reverse proxy (if enabled) | Ensure 429/403 anomalies are forwarded to incident channel. |
| Synthetic monitors | Planned via SEC-API-03 | Placeholder until monitors exist. |
Manual check commands:
php artisan log:tail --lines=200 | grep "Join token"
php artisan log:tail --lines=200 | grep "gallery"
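To see which clients are behind a burst of token failures, the matching log lines can be grouped by IP. The helper below is a minimal sketch: `count_token_failures_by_ip` is a hypothetical name, and it assumes the client IP appears in each "Join token" line as a plain dotted-quad IPv4 address (the exact log layout is not specified here, so adjust the pattern to the real format).

```bash
#!/usr/bin/env bash

# Hypothetical helper: count "Join token" log lines per IPv4 address.
# Assumption: the client IP is written into the line as a dotted quad.
# Output: "<count> <ip>" per line, busiest IP first.
count_token_failures_by_ip() {
  local logfile="$1"
  grep -F "Join token" "$logfile" \
    | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

Usage: `count_token_failures_by_ip storage/logs/laravel.log | head`. A line containing more than one address contributes one count per address.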
2. Severity Classification
| Level | Criteria | Examples |
|---|---|---|
| SEV-1 | Wide outage (>50% error rate), confirmed data leak or malicious mass-download | Gallery downloads serving wrong event, join-token table compromised. |
| SEV-2 | Localised outage (single tenant/event) or targeted brute force attempting to enumerate tokens | Single event returning 500, repeated invalid_token from single IP range. |
| SEV-3 | Minor functional regression or cosmetic issue | Rate limit misconfiguration causing occasional 429 for legitimate users. |
Escalate SEV-1/2 immediately to on-call via Slack #incident-response and open PagerDuty incident (if configured).
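The ">50% error rate" criterion for SEV-1 can be checked quickly from a list of response status codes. This is a rough sketch, not an official threshold calculator: `error_rate_summary` is a hypothetical helper, it expects one HTTP status code per line on stdin (for example, extracted from proxy access logs), and it treats only 5xx responses as errors, which is an assumption.

```bash
#!/usr/bin/env bash

# Hypothetical helper: summarise the 5xx share of a stream of status codes.
# Input: one three-digit HTTP status per line on stdin; other lines are ignored.
# "wide_outage=yes" means the rate is strictly above 50%.
error_rate_summary() {
  LC_ALL=C awk '
    /^[0-9][0-9][0-9]$/ { total++; if ($1 >= 500) errors++ }
    END {
      if (total == 0) { print "total=0 errors=0 rate=0.0 wide_outage=no"; exit }
      rate = errors * 100 / total
      printf "total=%d errors=%d rate=%.1f wide_outage=%s\n",
             total, errors, rate, (rate > 50 ? "yes" : "no")
    }'
}
```

The result is only one input to classification; a confirmed data leak is SEV-1 regardless of error rate.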
3. Immediate Response Checklist
- Confirm availability
  - curl -I https://app.test/api/v1/gallery/{known_good_token}
  - Use tenant-provided test token to validate /events/{token} flow.
- Snapshot logs
  - Export last 15 minutes from log aggregator or storage/logs. Attach to incident ticket.
- Assess scope
  - Identify affected tenant/event IDs via log context.
  - Note IP addresses triggering rate limits.
- Decide mitigation
  - Brute force? → throttle/block offending IPs.
  - Compromised token? → revoke token via Filament or php artisan tenant:join-tokens:revoke {id} (once command exists).
  - Endpoint regression? → begin rolling fix or feature flag toggle.
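When no log aggregator is available, the log snapshot can be cut directly from the local file. The sketch below assumes Laravel's default line prefix `[YYYY-MM-DD HH:MM:SS]`; `snapshot_since` is a hypothetical helper name, and the cutoff must be given in the same timezone the application logs in.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print log entries at or after a cutoff timestamp.
# Assumption: entries start with "[YYYY-MM-DD HH:MM:SS]" (Laravel default).
# Lines without a timestamp (stack traces) follow the entry above them.
snapshot_since() {
  local logfile="$1" cutoff="$2"
  awk -v cutoff="$cutoff" '
    /^\[[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]\]/ {
      # Same-format timestamps sort correctly as plain strings.
      keep = ((substr($0, 2, 19) "") >= (cutoff ""))
    }
    keep { print }
  ' "$logfile"
}
```

Usage with GNU date: `snapshot_since storage/logs/laravel.log "$(date -d '15 minutes ago' '+%Y-%m-%d %H:%M:%S')" > incident-snapshot.log`.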
4. Mitigation Tactics
4.1 Abuse / Brute force
- Increase rate-limiter strictness temporarily by editing config/limiting.php (if available) or applying runtime block in the load balancer.
- Use fail2ban/WAF rules to block offending IPs. For quick local action: sudo ufw deny from <ip_address>
- Consider temporarily disabling gallery download by setting PUBLIC_GALLERY_ENABLED=false (feature flag planned) and clearing cache.
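For the quick local firewall action, it is safer to generate the block commands for review than to run them blindly. The following sketch assumes per-IP counts are available as "count ip" pairs (for example, the output of `sort | uniq -c`); `print_block_commands` and its threshold argument are hypothetical. It only prints commands and executes nothing.

```bash
#!/usr/bin/env bash

# Hypothetical helper: turn "count ip" pairs on stdin into ufw deny commands.
# Only IPv4-looking addresses with count >= threshold are emitted.
# Dry run by design: the commands are printed, not executed.
print_block_commands() {
  local threshold="$1"
  awk -v threshold="$threshold" '
    $1 ~ /^[0-9]+$/ &&
    $2 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ &&
    ($1 + 0) >= (threshold + 0) {
      print "sudo ufw deny from " $2
    }'
}
```

Review the printed list for shared or tenant office IPs before applying it, and keep a copy in the incident ticket so the blocks can be removed later.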
4.2 Token Compromise
- Revoke specific token via Filament “Join Tokens” modal (Event → Join Tokens → revoke).
- Notify tenant with replacement token instructions.
- Audit join-token logs for additional suspicious use and consider rotating all tokens for the event.
4.3 Internal Failure (500s)
- Tail logs for stack traces.
- If due to downstream storage, fail closed: return 503 with maintenance banner while running php artisan storage:diagnostics.
- Roll back recent deployment or disable new feature flag if traced to release.
5. Communication
| Audience | Channel | Cadence |
|---|---|---|
| Internal on-call | Slack #incident-response, PagerDuty | Initial alert, hourly updates. |
| Customer Support | Slack #support with summary | Once per significant change (mitigation applied, issue resolved). |
| Tenants | Email template “Public gallery disruption” (see resources/lang/*/emails.php) | Only for SEV-1 or impactful SEV-2 after mitigation. |
Document timeline, impact, and mitigation in the incident ticket.
6. Verification & Recovery
After applying mitigation:
- Re-run test requests for affected endpoints.
- Validate join-token creation/revocation via Filament.
- Confirm error rates return to baseline in monitoring/dashboard.
- Remove temporary firewall blocks once threat subsides.
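Re-running test requests is easier to repeat if it is scripted. The sketch below wraps curl in two hypothetical helpers, `http_status` and `verify_endpoints`; the 10-second timeout and the "any non-2xx is a failure" rule are assumptions to adapt per endpoint.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the HTTP status code for a URL.
# curl reports 000 when no response was received (timeout, DNS failure).
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Hypothetical helper: check every URL passed as an argument.
# Prints "<status> <url>" per URL and returns non-zero if any is not 2xx.
verify_endpoints() {
  local failed=0 url status
  for url in "$@"; do
    status="$(http_status "$url")"
    printf '%s %s\n' "$status" "$url"
    case "$status" in
      2??) ;;
      *) failed=1 ;;
    esac
  done
  return "$failed"
}
```

Usage: `verify_endpoints "https://app.test/api/v1/gallery/$TOKEN"`, where `TOKEN` holds a known-good test token supplied for the affected tenant.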
7. Post-Incident Actions
- File RCA within 48 hours including: root cause, detection gaps, follow-up tasks (e.g., enabling synthetic monitors, adding audit fields).
- Update documentation if new procedures are required (docs/prp/11-public-gallery.md, docs/prp/03-api.md).
- Schedule backlog items for long-term fixes (e.g., better anomaly alerting, token analytics dashboards).
8. References & Tools
- Log aggregation: storage/logs/laravel.log (local), Stackdriver/Splunk (staging/prod).
- Rate limit config: App\Providers\AppServiceProvider → RateLimiter::for('tenant-api') and EventPublicController::handleTokenFailure.
- Token management UI: Filament → Events → Join Tokens.
- Signed URL generation: app/Http/Controllers/Api/EventPublicController (for tracing download issues).
Keep this document alongside the other deployment runbooks and review quarterly.