Observability
Alana Shopping B2B uses Datadog as the unified observability platform: APM, LLM Observability, Database Monitoring, CI Visibility, Synthetic Monitoring, SLOs, and Incident Management.
Architecture
| Layer | Datadog Product | What It Monitors |
|---|---|---|
| Frontend | Browser RUM | User interactions, errors, Core Web Vitals, session replay |
| API | APM | Request traces, latency, error rates, distributed tracing |
| LLM | LLM Observability | LangGraph workflows, token usage, cost per model |
| Database | DBM | PostgreSQL query performance, slow queries, connection pool |
| Infrastructure | Infrastructure | Redis memory/clients/latency, DO droplet |
| CI/CD | CI Visibility | Pipeline performance, test results, flaky tests |
| Uptime | Synthetics | Health checks, API probes from US-East and SA-East |
| Alerting | Monitors | 8 app monitors + 6 infra monitors + 6 SLO burn rate alerts |
Dashboards
Four custom dashboards are available at us5.datadoghq.com. Run bash datadog/dashboards.sh to create them.
| Dashboard | Purpose |
|---|---|
| Billing Health | Stripe webhook success rate, payment latency p50/p95/p99, payment failures |
| Auth Health | Login success rate, callback latency, SCIM provisioning events, auth errors |
| API Performance | Request rate by route, p99 latency toplist, error rate, DBM slow queries |
| AI/LLM Performance | Request rate by model, latency, token usage, cost (24h), Canvas action distribution |
SLOs
| SLO | Target | Warning | Window |
|---|---|---|---|
| API p99 Response Time < 200ms | 99.0% | 99.5% | 30 days |
| API Error Rate < 0.1% | 99.9% | 99.95% | 30 days |
| Service Uptime > 99.9% | 99.9% | 99.95% | 30 days |
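Each target above implies an error budget over the 30-day window. The sketch below is plain arithmetic, not a Datadog API call. The function names are illustrative, and the burn-rate definition (observed error ratio divided by the budget ratio) is the common SRE convention, assumed here rather than taken from the actual burn rate alert definitions.

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of 'bad' time the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0


def burn_rate(observed_error_ratio: float, target_pct: float) -> float:
    """Budget consumption speed; 1.0 means the budget lasts exactly one window."""
    budget_ratio = (100.0 - target_pct) / 100.0
    return observed_error_ratio / budget_ratio


if __name__ == "__main__":
    slos = [
        ("API p99 Response Time < 200ms", 99.0),
        ("API Error Rate < 0.1%", 99.9),
        ("Service Uptime > 99.9%", 99.9),
    ]
    for name, target in slos:
        print(f"{name}: {error_budget_minutes(target):.1f} min per 30 days")
    # Hypothetical incident: 0.5% of requests failing against a 99.9% target.
    print(f"burn rate: {burn_rate(0.005, 99.9):.1f}x")
```

A 99.9% target leaves roughly 43.2 minutes of budget per 30 days, while the 99.0% latency target leaves roughly 432 minutes.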
Monitors
Application Monitors (8)
| Monitor | Threshold | Priority |
|---|---|---|
| Payment Failures | >2 failures in 5min | P1 |
| Auth Errors | >5 errors in 5min | P1 |
| SCIM Provisioning Failures | >1 failure in 10min | P2 |
| AI Degradation | >500ms p95 LangGraph latency | P2 |
| API Latency | >200ms p99 in 5min | P2 |
| Error Rate | >1% in 5min | P2 |
| Billing Webhook | >1 webhook failure in 10min | P1 |
| Auth Log Stream | >3 suspicious auth events in 5min | P2 |
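Thresholds such as ">2 failures in 5min" are count-over-window conditions. Datadog evaluates them server-side; the sketch below only illustrates the semantics locally. The `WindowedCounter` class is hypothetical, and it assumes events are recorded in chronological order with timestamps in seconds.

```python
from collections import deque


class WindowedCounter:
    """Counts events inside a trailing time window (timestamps in seconds)."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # event timestamps, oldest first

    def record(self, timestamp: float) -> None:
        """Register one event, e.g. a payment failure."""
        self._events.append(timestamp)

    def breached(self, now: float) -> bool:
        """True when more than `threshold` events fall inside the window."""
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        # Strict comparison, matching the ">" in the table.
        return len(self._events) > self.threshold


if __name__ == "__main__":
    # Payment Failures: >2 failures in 5min.
    payment_failures = WindowedCounter(window_seconds=300, threshold=2)
    for ts in (0, 60, 120):
        payment_failures.record(ts)
    print(payment_failures.breached(now=130))
```

Three failures within the window breach the Payment Failures condition; once the two oldest events age out, the condition clears.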
Infrastructure Monitors (6)
| Monitor | Threshold | Priority |
|---|---|---|
| Redis Memory Usage | >80% of maxmemory | P2 |
| Redis Connected Clients | >90 clients | P2 |
| Redis Command Latency p99 | >10ms | P3 |
| DBM Slow Queries | >500ms appearing >5× in 5min | P2 |
| DBM Connection Pool | >80% of max connections | P2 |
| DBM Lock Contention | >10 active locks | P3 |
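The Redis memory and connection pool monitors are percent-of-limit checks. A minimal sketch of that comparison follows; the helper names and sample values are hypothetical. Note that Redis treats `maxmemory 0` as "no limit", in which case a percentage is undefined, so the helper rejects non-positive limits.

```python
def usage_percent(used: float, limit: float) -> float:
    """Usage as a percentage of a configured limit."""
    if limit <= 0:
        # e.g. Redis maxmemory=0 means unlimited, so no percentage exists.
        raise ValueError("limit must be positive")
    return 100.0 * used / limit


def over_threshold(used: float, limit: float, threshold_pct: float = 80.0) -> bool:
    """True when usage strictly exceeds the threshold, as in '>80%'."""
    return usage_percent(used, limit) > threshold_pct


if __name__ == "__main__":
    # Hypothetical readings: 850 MB used of a 1000 MB maxmemory,
    # and 45 of 100 PostgreSQL connections in use.
    print(over_threshold(850, 1000))
    print(over_threshold(45, 100))
```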
Synthetic Tests
Four synthetic tests run from aws:us-east-1 and aws:sa-east-1:
| Test | URL | Frequency | Priority |
|---|---|---|---|
| Health Check | GET /health | 1 min | P1 |
| Readiness Check | GET /ready | 5 min | P2 |
| Search API | POST /api/v1/search | 5 min | P2 |
| MCP Endpoint | GET /api/mcp/sse | 10 min | P3 |
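The table above can be expressed as data to estimate how many probe executions the schedule generates. This is a rough sketch that assumes every test runs from both locations on every interval; the `SyntheticTest` type and `runs_per_day` helper are illustrative and not part of the project's scripts.

```python
from dataclasses import dataclass

LOCATIONS = ("aws:us-east-1", "aws:sa-east-1")


@dataclass(frozen=True)
class SyntheticTest:
    name: str
    method: str
    path: str
    frequency_minutes: int
    priority: str


TESTS = (
    SyntheticTest("Health Check", "GET", "/health", 1, "P1"),
    SyntheticTest("Readiness Check", "GET", "/ready", 5, "P2"),
    SyntheticTest("Search API", "POST", "/api/v1/search", 5, "P2"),
    SyntheticTest("MCP Endpoint", "GET", "/api/mcp/sse", 10, "P3"),
)


def runs_per_day(test: SyntheticTest, locations=LOCATIONS) -> int:
    """Executions per day, assuming one run per location per interval."""
    return (24 * 60 // test.frequency_minutes) * len(locations)


if __name__ == "__main__":
    for test in TESTS:
        print(f"{test.name}: {runs_per_day(test)} runs/day")
    print(f"total: {sum(runs_per_day(t) for t in TESTS)} runs/day")
```

Under that assumption the Health Check alone accounts for 2,880 of the 4,320 daily executions, which is worth keeping in mind when adjusting frequencies.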
On-Call
Schedule: alana-oncall-primary, 24/7 solo (America/Sao_Paulo)
| Priority | Response | Channels |
|---|---|---|
| P1 | Immediate | Email + Slack + Push |
| P2 | 5 minutes | Email + Slack |
| P3 | Daily digest |
The schedule is managed at us5.datadoghq.com/on-call. See datadog/oncall/oncall.md for setup instructions.
Bits AI SRE
Bits AI provides natural-language queries across APM, logs, infrastructure metrics, and monitors. Enable at us5.datadoghq.com → Settings → Bits AI.
Useful queries:
- “What errors occurred in the last hour?”
- “Which endpoints are slowest right now?”
- “What changed before the latency spike at 14:30?”
- “Show me active monitors and their status”
- “Summarize SLO burn rate for the past 24h”
Environment Variables
| Variable | Purpose | Where |
|---|---|---|
| DD_API_KEY | Datadog API key (server + CI) | Vercel + GitHub Secrets |
| DD_APP_KEY | Datadog application key (scripts) | Local .env.local |
| DD_SITE | Datadog site (us5.datadoghq.com) | Hardcoded in config |
| DD_SERVICE | Service name (alana-shopping-b2b) | Hardcoded in config |
| DD_ENV | Environment (production/preview/ci) | Vercel + GitHub Secrets |
| NEXT_PUBLIC_DD_RUM_APPLICATION_ID | RUM application ID | Vercel env vars |
| NEXT_PUBLIC_DD_RUM_CLIENT_TOKEN | RUM client token | Vercel env vars |
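A preflight check can catch missing variables before a deploy or CI run. The sketch below is a hypothetical helper, not one of the project's scripts. It assumes only the variables sourced from Vercel or GitHub Secrets must be present, since DD_SITE and DD_SERVICE are hardcoded and DD_APP_KEY is only needed locally.

```python
import os
from typing import Mapping

# Variables the table lists as coming from Vercel or GitHub Secrets.
REQUIRED_VARS = (
    "DD_API_KEY",
    "DD_ENV",
    "NEXT_PUBLIC_DD_RUM_APPLICATION_ID",
    "NEXT_PUBLIC_DD_RUM_CLIENT_TOKEN",
)
ALLOWED_ENVS = ("production", "preview", "ci")


def check_datadog_env(env: Mapping[str, str]) -> list:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    dd_env = env.get("DD_ENV")
    if dd_env and dd_env not in ALLOWED_ENVS:
        problems.append(f"DD_ENV has unexpected value {dd_env!r}")
    return problems


if __name__ == "__main__":
    issues = check_datadog_env(os.environ)
    for issue in issues:
        print(issue)
    raise SystemExit(1 if issues else 0)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.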
Setup Scripts
All Datadog configuration is scripted in the datadog/ directory.
Runbooks
See docs/runbooks/ for incident response procedures per alert type.