Observability

Alana Shopping B2B uses Datadog as the unified observability platform — APM, LLM Observability, Database Monitoring, CI Visibility, Synthetic Monitoring, SLOs, and Incident Management.

Architecture

| Layer | Datadog Product | What It Monitors |
|---|---|---|
| Frontend | Browser RUM | User interactions, errors, Core Web Vitals, session replay |
| API | APM | Request traces, latency, error rates, distributed tracing |
| LLM | LLM Observability | LangGraph workflows, token usage, cost per model |
| Database | DBM | PostgreSQL query performance, slow queries, connection pool |
| Infrastructure | Infrastructure | Redis memory/clients/latency, DO droplet |
| CI/CD | CI Visibility | Pipeline performance, test results, flaky tests |
| Uptime | Synthetics | Health checks, API probes from US-East and SA-East |
| Alerting | Monitors | 8 app monitors + 6 infra monitors + 6 SLO burn rate alerts |

Dashboards

Four custom dashboards are available at us5.datadoghq.com. Run bash datadog/dashboards.sh to create them.

| Dashboard | Purpose |
|---|---|
| Billing Health | Stripe webhook success rate, payment latency p50/p95/p99, payment failures |
| Auth Health | Login success rate, callback latency, SCIM provisioning events, auth errors |
| API Performance | Request rate by route, p99 latency toplist, error rate, DBM slow queries |
| AI/LLM Performance | Request rate by model, latency, token usage, cost (24h), Canvas action distribution |

SLOs

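| SLO | Target | Warning | Window |
|---|---|---|---|
| API p99 Response Time < 200ms | 99.0% | 99.5% | 30 days |
| API Error Rate < 0.1% | 99.9% | 99.95% | 30 days |
| Service Uptime > 99.9% | 99.9% | 99.95% | 30 days |

Burn rate alerts fire when the error budget is being consumed at 14.4× the sustainable rate over 1h (P1, fast burn) or at 6× the sustainable rate over 6h (P2, slow burn).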
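
As a rough check on the burn rate thresholds above, the arithmetic below estimates how much of a 30-day error budget has been spent by the time each alert fires. It assumes the common definition of burn rate (observed error rate divided by the error budget). The symbols e (observed error rate), s (SLO target), w (alert window), W (SLO window), and f (fraction of budget consumed) are introduced here for illustration and do not come from the Datadog configuration.

```latex
\begin{align*}
B &= \frac{e}{1 - s} \\
f &= B \cdot \frac{w}{W}, \qquad W = 30 \text{ days} = 720\,\mathrm{h} \\
f_{\mathrm{P1}} &= 14.4 \cdot \frac{1\,\mathrm{h}}{720\,\mathrm{h}} = 0.02 \\
f_{\mathrm{P2}} &= 6 \cdot \frac{6\,\mathrm{h}}{720\,\mathrm{h}} = 0.05
\end{align*}
```

Under these assumptions the P1 alert corresponds to about 2% of the monthly budget spent within one hour, and the P2 alert to about 5% spent within six hours. For the 99.9% SLOs, a burn rate of 14.4 means an observed error rate of roughly 1.44%, which would exhaust the full budget in 50 hours if sustained.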

Monitors

Application Monitors (8)

| Monitor | Threshold | Priority |
|---|---|---|
| Payment Failures | >2 failures in 5 min | P1 |
| Auth Errors | >5 errors in 5 min | P1 |
| SCIM Provisioning Failures | >1 failure in 10 min | P2 |
| AI Degradation | >500 ms p95 LangGraph latency | P2 |
| API Latency | >200 ms p99 in 5 min | P2 |
| Error Rate | >1% in 5 min | P2 |
| Billing Webhook | >1 webhook failure in 10 min | P1 |
| Auth Log Stream | >3 suspicious auth events in 5 min | P2 |

Infrastructure Monitors (6)

| Monitor | Threshold | Priority |
|---|---|---|
| Redis Memory Usage | >80% of maxmemory | P2 |
| Redis Connected Clients | >90 clients | P2 |
| Redis Command Latency p99 | >10 ms | P3 |
| DBM Slow Queries | >500 ms appearing >5× in 5 min | P2 |
| DBM Connection Pool | >80% of max connections | P2 |
| DBM Lock Contention | >10 active locks | P3 |

Synthetic Tests

Four synthetic tests run from aws:us-east-1 and aws:sa-east-1:

| Test | Request | Frequency | Priority |
|---|---|---|---|
| Health Check | GET /health | 1 min | P1 |
| Readiness Check | GET /ready | 5 min | P2 |
| Search API | POST /api/v1/search | 5 min | P2 |
| MCP Endpoint | GET /api/mcp/sse | 10 min | P3 |
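
To reproduce the Health Check and Readiness Check probes by hand (for example, before changing a synthetic test), something like the following sketch could be used. It is not part of the datadog/ scripts: BASE_URL is a hypothetical variable holding the deployment origin, and the assumption that a healthy endpoint answers HTTP 200 is ours. The Search API and MCP probes are left out because the request body and streaming behavior are not described here.

```bash
#!/usr/bin/env bash
# Manual smoke check for GET /health and GET /ready.
# Assumptions: BASE_URL is a hypothetical variable for the deployment origin,
# and a healthy endpoint responds with HTTP 200.

# probe METHOD URL: print the HTTP status code. curl prints 000 when the
# request cannot be completed, so failures still yield a value to compare.
probe() {
  curl --silent --output /dev/null --write-out '%{http_code}' \
    --max-time 10 --request "$1" "$2" || true
}

# check_endpoint METHOD URL: print OK or FAIL; return non-zero on failure.
check_endpoint() {
  local status
  status="$(probe "$1" "$2")"
  if [ "$status" = "200" ]; then
    echo "OK $1 $2"
  else
    echo "FAIL $1 $2 (HTTP $status)"
    return 1
  fi
}

# Run the checks only when BASE_URL is provided.
if [ -n "${BASE_URL:-}" ]; then
  failures=0
  check_endpoint GET "$BASE_URL/health" || failures=$((failures + 1))
  check_endpoint GET "$BASE_URL/ready" || failures=$((failures + 1))
  echo "$failures check(s) failed"
fi
```

Keeping the status lookup in probe and the pass/fail decision in check_endpoint makes it easy to change the expected status code if the endpoints turn out to respond differently.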

On-Call

Schedule: alana-oncall-primary (24/7, solo, America/Sao_Paulo)

| Priority | Response | Channels |
|---|---|---|
| P1 | Immediate | Email + Slack + Push |
| P2 | 5 minutes | Email + Slack |
| P3 | Daily digest | Email |

Configure at us5.datadoghq.com/on-call. See datadog/oncall/oncall.md for setup instructions.

Bits AI SRE

Bits AI provides natural language queries across APM, logs, infrastructure metrics, and monitors. Enable at us5.datadoghq.com → Settings → Bits AI. Useful queries:
  • “What errors occurred in the last hour?”
  • “Which endpoints are slowest right now?”
  • “What changed before the latency spike at 14:30?”
  • “Show me active monitors and their status”
  • “Summarize SLO burn rate for the past 24h”

Environment Variables

| Variable | Purpose | Where It Is Set |
|---|---|---|
| DD_API_KEY | Datadog API key (server + CI) | Vercel + GitHub Secrets |
| DD_APP_KEY | Datadog application key (scripts) | Local .env.local |
| DD_SITE | Datadog site (us5.datadoghq.com) | Hardcoded in config |
| DD_SERVICE | Service name (alana-shopping-b2b) | Hardcoded in config |
| DD_ENV | Environment (production/preview/ci) | Vercel + GitHub Secrets |
| NEXT_PUBLIC_DD_RUM_APPLICATION_ID | RUM application ID | Vercel env vars |
| NEXT_PUBLIC_DD_RUM_CLIENT_TOKEN | RUM client token | Vercel env vars |

Setup Scripts

All Datadog configuration is scripted in the datadog/ directory:
# Create monitors
DD_API_KEY=<key> DD_APP_KEY=<key> bash datadog/monitors.sh

# Create synthetics + SLOs
DD_API_KEY=<key> DD_APP_KEY=<key> HEALTH_MONITOR_ID=<id> bash datadog/slos.sh

# Create custom dashboards
DD_API_KEY=<key> DD_APP_KEY=<key> bash datadog/dashboards.sh
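
Because each script above depends on variables passed on the command line, a small preflight check can catch a missing key before any API call is made. The sketch below is an illustration only: missing_vars and preflight are hypothetical helper names, not functions provided by the datadog/ scripts, and the variable lists are taken from the invocations shown above.

```bash
#!/usr/bin/env bash
# Preflight check: confirm the variables each setup script is invoked with
# are present. Helper names here are hypothetical, not part of datadog/.

# missing_vars NAME...: print the space-separated names that are unset or empty.
missing_vars() {
  local name missing=""
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  printf '%s' "$missing"
}

# preflight SCRIPT NAME...: report whether SCRIPT has what it needs;
# return non-zero when something is missing.
preflight() {
  local script="$1" missing
  shift
  missing="$(missing_vars "$@")"
  if [ -n "$missing" ]; then
    echo "$script: missing $missing"
    return 1
  fi
  echo "$script: ready"
}

# Report on all three scripts without stopping at the first failure.
preflight datadog/monitors.sh DD_API_KEY DD_APP_KEY || true
preflight datadog/slos.sh DD_API_KEY DD_APP_KEY HEALTH_MONITOR_ID || true
preflight datadog/dashboards.sh DD_API_KEY DD_APP_KEY || true
```

The check only inspects the environment and never runs the scripts themselves, so it is safe to execute before the real keys are exported.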

Runbooks

See docs/runbooks/ for incident response procedures per alert type.
Last modified on March 19, 2026