Golden Signals Dashboard for SRE: Datadog Success Rate, Availability, SLO, Error Budget and Troubleshooting Widgets
A Golden Signals dashboard is not just a chart collection. For an SRE, it should answer one question fast: is the service reliable right now, and if not, where should we investigate first? Every widget placement, every metric formula, and every threshold color should serve that single purpose. This guide covers how to build that dashboard in Datadog from scratch — the exact widget set, the filter structure, the query patterns, the troubleshooting flow, and how to explain it all in an interview or a production incident review.
Why Managers Care About Success Rate and Availability Progression
When a production service degrades, managers do not want theory. They want to know: which endpoint, which region, which version, or which dependency caused the drop? They want to see a trend, not a single number. A dashboard that shows only the current success rate tells you there is a problem. A dashboard that shows the progression of Success Rate and Availability over time — at 1h, 24h, 7d, and 30d — tells you when the problem started, whether it is getting worse, and whether recent deployments or infrastructure changes correlate with the degradation.
Senior engineering managers and VPs of Engineering use these trends to make release decisions, escalation calls, and SLA commitments to customers. That is why Success Rate Over Time and Availability Over Time must be the two primary widgets at the top of any SRE dashboard — not buried three rows down between CPU graphs.
The Four Golden Signals: Practical SRE Language
Google's SRE book defined four signals that, together, give you a complete picture of service health. Here is how each one translates to production decisions:
Latency
Latency is not just the average response time — averages hide the 5% of requests that take 10x longer. Track P50, P95, and P99. P50 tells you what a typical user experiences. P95 tells you what 1 in 20 users experiences. P99 tells you where your worst-case SLA commitments are at risk. When P95 spikes but P50 stays flat, you have a tail latency problem — often caused by a slow database query, a downstream dependency timing out, or a specific endpoint with heavy computation. When both P50 and P95 spike, the problem is systemic — resource saturation or a traffic surge affecting the entire service.
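To see why an average hides tail latency, here is a minimal sketch using only the Python standard library. The sample data is invented for illustration: 90 fast requests and 10 slow ones.

```python
import statistics

# Hypothetical sample: 90 requests at 100 ms and 10 requests at 1000 ms.
latencies_ms = [100.0] * 90 + [1000.0] * 10

mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 99 cut points P1..P99; index k-1 holds Pk.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50_ms, p95_ms, p99_ms = cuts[49], cuts[94], cuts[98]

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

The mean lands at 190 ms, which describes neither the typical request (100 ms) nor the slow tail (1000 ms). That is the gap the three percentile lines make visible.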
Traffic
Traffic is the request rate: how much load the service is currently under. You need this to contextualize every other signal. A success rate drop from 99.9% to 98% at 10 requests per second is about one failed request every five seconds, which is often noise. The same drop at 50,000 requests per second is roughly 1,000 failed requests every second, which is a SEV1 incident. Always show current request volume alongside success rate so no one misreads low-traffic error spikes as a production crisis.
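The traffic-context arithmetic can be sketched in a few lines of Python. The function name is hypothetical; the inputs are the request rates and the 98% success rate used in the paragraph above.

```python
def failed_requests_per_second(requests_per_second: float, success_rate_pct: float) -> float:
    """Return how many requests fail each second at a given success rate."""
    return requests_per_second * (100.0 - success_rate_pct) / 100.0


low_traffic = failed_requests_per_second(10, 98.0)
high_traffic = failed_requests_per_second(50_000, 98.0)

print(f"10 rps at 98%: {low_traffic:.1f} failures/s")
print(f"50,000 rps at 98%: {high_traffic:.0f} failures/s")
```

The same percentage translates to 0.2 failures per second in the first case and 1,000 per second in the second, which is why the request volume widget sits next to the success rate widget.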
Errors
Errors are the direct explanation for why success rate drops. But a single error rate number is not enough. You need to break it down by HTTP status code (to separate 4xx client errors from 5xx service errors), by endpoint (to find which API is failing), by error type or exception class (to find the root cause), and by region or version (to understand blast radius). A dashboard that shows only "2% error rate" is useless during an incident. A dashboard that shows "2% error rate, 98% of it on the /api/v2/checkout endpoint, all 503s, starting at 14:32, correlated with the v3.7.1 deployment" is actionable in 30 seconds.
Saturation
Saturation tells you how close the service is to its resource limits. A dashboard that only shows CPU and memory is not an SRE dashboard; it is an infrastructure dashboard. Real SRE saturation monitoring covers: CPU utilization, memory usage, pod restart count, OOM kill events, replica availability (desired vs. running vs. available), connection pool usage (active/idle/wait/timeout), queue depth and consumer lag, and database connection limits. Saturation is often the root cause hiding behind a success rate or latency symptom: the 5xx rate climbs because the connection pool is exhausted, not because the code is wrong.
Success Rate vs Availability: Two Different Reliability Stories
These two metrics are often confused. Understanding the difference is critical for both production work and SRE interviews.
| Dimension | Success Rate | Availability |
|---|---|---|
| What it measures | Request-level reliability — of all requests the service received, how many succeeded | Reachability and uptime — of all health checks performed, how many returned healthy |
| Formula | Successful Requests / Total Requests × 100 | Successful Availability Checks / Total Availability Checks × 100 |
| Degrades when | Service is receiving traffic and returning errors (5xx, unhandled exceptions) | Service is unreachable, failing health checks, or completely down even with no traffic |
| Data sources in Datadog | APM traces, application metrics, custom StatsD counters | Synthetic monitors, health check endpoints, uptime monitors, SLO availability data |
| Common scenario | Service is up and reachable but returning database errors on checkout — Availability: 100%, Success Rate: 40% | Pod crashes, health checks fail, load balancer marks instance unhealthy — Availability: 60%, Success Rate: N/A (no traffic reaching service) |
| SLO type | Request-based SLO (numerator: good requests, denominator: total requests) | Monitor-based SLO or time-based SLO (uptime as percentage of time in healthy state) |
A good Golden Signals dashboard shows both metrics side by side, because they answer different parts of the reliability question. You need both to understand the complete service health picture.
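A small sketch of the two scenarios from the table, assuming invented counts. The helper returns None when there is no denominator, which mirrors the "N/A (no traffic reaching service)" case.

```python
from typing import Optional


def ratio_pct(good: int, total: int) -> Optional[float]:
    """Percentage of good events, or None when nothing was measured."""
    if total == 0:
        return None
    return 100.0 * good / total


# Scenario 1: service is reachable but checkout requests fail on database errors.
up_but_broken = {
    "availability": ratio_pct(good=60, total=60),      # health checks
    "success_rate": ratio_pct(good=400, total=1000),   # user requests
}

# Scenario 2: pods crash, health checks fail, no traffic reaches the service.
partially_down = {
    "availability": ratio_pct(good=36, total=60),
    "success_rate": ratio_pct(good=0, total=0),
}

print(up_but_broken)
print(partially_down)
```

Neither number alone describes both scenarios, which is the argument for showing them side by side.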
Complete Dashboard Layout
The dashboard is organized into nine rows. Each row has a specific purpose — the layout is designed so an on-call engineer can scan top-to-bottom and know exactly where to look based on the symptom they are investigating.
Row 1: Current Service Health
Six at-a-glance widgets (five query values plus the native SLO widget) showing the current state. These are the first thing an on-call engineer sees.
- Current Success Rate — color-coded: green ≥ 99.9%, yellow 99.5–99.9%, red < 99.5%
- Current Availability — same thresholds as success rate
- Current Error Rate — green < 0.1%, yellow 0.1–0.5%, red > 0.5%
- P95 Latency — green < 500ms, yellow 500ms–1000ms, red > 1000ms
- Current Request Volume — no threshold, provides traffic context
- SLO / Error Budget — Datadog SLO widget showing target, current value, budget remaining, and budget consumed
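The Row 1 color thresholds can be written down as plain functions, which is handy for checking boundary values before entering them as conditional formatting rules. This is only an illustration of the thresholds listed above; the function names are hypothetical and nothing here calls Datadog.

```python
def success_rate_color(pct: float) -> str:
    """green >= 99.9, yellow 99.5 to 99.9, red below 99.5."""
    if pct >= 99.9:
        return "green"
    if pct >= 99.5:
        return "yellow"
    return "red"


def error_rate_color(pct: float) -> str:
    """green below 0.1, yellow 0.1 to 0.5, red above 0.5."""
    if pct < 0.1:
        return "green"
    if pct <= 0.5:
        return "yellow"
    return "red"


def p95_latency_color(ms: float) -> str:
    """green below 500 ms, yellow 500 to 1000 ms, red above 1000 ms."""
    if ms < 500:
        return "green"
    if ms <= 1000:
        return "yellow"
    return "red"


print(success_rate_color(99.7), error_rate_color(0.3), p95_latency_color(750))
```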
Row 2: Progression Over Time
Four timeseries widgets showing historical trends — this is the row managers look at during review meetings and the row SREs look at during incident timelines.
- Success Rate Over Time — line chart with target line at 99.9%, supports 1h/24h/7d/30d
- Availability Over Time — line chart using synthetic monitor, health check, or SLO availability data
- SLO Burn Rate — line chart showing budget consumption rate relative to the sustainable pace
- Error Budget Remaining — trend of remaining budget as the window progresses
Row 3: Failure Analysis
This row answers "why is success rate dropping?" It is the first troubleshooting row.
- Error Rate Over Time (grouped by status code, endpoint, region, version)
- HTTP Status Code Breakdown (2xx, 3xx, 4xx, 5xx)
- Top Failing Endpoints (top list by error count)
- Error Type Breakdown (by exception.type or error.message)
Row 4: Latency
Latency details for user experience and slow-request investigation.
- P50 / P95 / P99 Latency (all on one timeseries or separate query values)
- Slowest Endpoints (top list by P95 latency)
- Dependency Latency (downstream service call durations)
- Latency by Region (to isolate geographic degradation)
Row 5: Traffic
Traffic patterns to understand load and correlate with error and latency behavior.
- Request Rate Over Time (RPS or RPM by service, endpoint, region)
- Traffic by Endpoint (top list)
- Traffic by Region (timeseries grouped by region)
- Traffic vs Error Rate (overlay chart to see if errors track load)
Row 6: Saturation
Resource utilization and capacity — this row explains whether the system is under pressure.
- CPU and Memory (by host, pod, container, cluster)
- Pod Restarts / OOM kills (count by pod and namespace)
- Replica Availability (desired vs. available vs. unavailable)
- Connection Pool Usage (active, idle, wait time, timeout count)
Row 7: Dependencies
Downstream health — because most production incidents are caused by something the service depends on, not the service itself.
- Database Health (latency, error rate, slow queries, lock wait time)
- Cache Health (hit ratio, latency, eviction rate, error rate)
- Queue Health (queue depth, consumer lag, dead letter queue count)
- External API Health (latency, error rate, timeout count, retry rate)
Row 8: Logs and Traces
Evidence-level debugging — the row that takes you from metric to root cause.
- Recent Error Logs (filtered by service, endpoint, error.message, trace_id)
- Top Error Messages (aggregated log patterns)
- Trace Samples (failed spans, slow spans, dependency call chains)
- Logs by Endpoint (to isolate which API is generating errors)
Row 9: Deployment Correlation
Release validation and change correlation — to answer "did this start after the last deployment?"
- Deployment Events overlay (vertical markers on Success Rate and Error Rate timeseries)
- Success Rate by Version (timeseries or top list grouped by container image or version tag)
- Error Rate by Version
- Latency by Version
Recommended Datadog Template Variables (Filters)
Template variables make the dashboard reusable across services, environments, and infrastructure scopes. Configure these at the dashboard level so every widget respects the selected filter automatically.
| Variable | Tag | Purpose |
|---|---|---|
| $env | env | Switch between prod, staging, and lower environments without cloning dashboards |
| $service | service | Reuse the same dashboard for different microservices |
| $region | region | Isolate region-specific reliability issues |
| $availability_zone | availability-zone | Identify AZ-level availability failures in multi-AZ deployments |
| $cluster | kube_cluster_name | Filter to a specific Kubernetes cluster |
| $namespace | kube_namespace | Scope to a specific Kubernetes namespace |
| $pod | pod_name | Drill down to a specific pod during a pod-level incident |
| $host | host | Isolate a specific host or node |
| $version | version | Compare reliability between deployment versions |
Set the default values for $env to prod and $service to * so the dashboard opens in a useful state without requiring the user to configure it first.
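If you manage dashboards as code, the table above can be expressed as data. The sketch below builds a list shaped like the `template_variables` section of a Datadog dashboard JSON export. The field names (`name`, `prefix`, `default`) are an assumption based on that export format and should be checked against the current API reference before use. Only `$env` and `$service` have defaults stated in this guide; the `*` defaults for the rest are assumed.

```python
import json

# (variable name, tag key, default value) taken from the table above.
# Defaults other than env and service are assumptions.
VARIABLES = [
    ("env", "env", "prod"),
    ("service", "service", "*"),
    ("region", "region", "*"),
    ("availability_zone", "availability-zone", "*"),
    ("cluster", "kube_cluster_name", "*"),
    ("namespace", "kube_namespace", "*"),
    ("pod", "pod_name", "*"),
    ("host", "host", "*"),
    ("version", "version", "*"),
]

template_variables = [
    {"name": name, "prefix": tag, "default": default}
    for name, tag, default in VARIABLES
]

print(json.dumps(template_variables, indent=2))
```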
Widget-by-Widget Implementation Guide
Section 1: Success Rate and Availability Progression
Success Rate Over Time
- Widget type: Timeseries (line chart)
- Formula pattern: (sum of successful requests / sum of total requests) * 100
- Scope: filter by $env, $service, $region
- Y-axis: 0–100 with target reference line at 99.9
- Time range presets: support 1h, 24h, 7d, and 30d so managers can review short-term incidents and long-term trends on the same widget
- Color threshold: green ≥ 99.9%, yellow 99.5–99.9%, red < 99.5%
Availability Over Time
- Widget type: Timeseries (line chart)
- Data sources: Datadog Synthetic monitors, health check endpoint metrics, uptime monitor data, or SLO availability rollup
- Formula pattern: (sum of successful health checks / sum of total health checks) * 100
- Group by: region and availability zone for multi-region services
- Y-axis: 99–100 range to make small drops visible; use custom y-axis bounds
Availability by Region / AZ
- Widget type: Timeseries grouped by region, or Top List showing current availability per region
- Purpose: isolate whether an availability drop is global or specific to one region or AZ — critical for multi-region architectures on AWS, GCP, and Azure
SLO / Error Budget Widget
- Widget type: Datadog SLO widget (native widget, not a timeseries)
- Configuration: link the widget to your configured SLO in Datadog
- Display: shows SLO target, current SLO value, error budget remaining (in percentage and in minutes/hours), and budget consumed
- Window: typically 7-day and 30-day rolling windows side by side
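The "budget remaining in minutes" figure the SLO widget displays follows from simple arithmetic. A minimal sketch, assuming a time-based reading of the SLO and a hypothetical helper name:

```python
def error_budget_minutes(slo_target_pct: float, window_days: int) -> float:
    """Minutes of allowed unreliability in a rolling window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_target_pct) / 100.0


budget_30d = error_budget_minutes(99.9, 30)
budget_7d = error_budget_minutes(99.9, 7)

print(f"30-day budget at 99.9%: {budget_30d:.1f} minutes")
print(f"7-day budget at 99.9%: {budget_7d:.2f} minutes")
```

At a 99.9% target this gives about 43.2 minutes over 30 days and about 10.08 minutes over 7 days, which is why the two windows are worth showing side by side.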
SLO Burn Rate
- Widget type: Timeseries line chart
- Formula pattern: error rate for the current window divided by the allowed error rate for the SLO
- Reference lines: add horizontal reference lines at burn rate 1 (sustainable), 6 (slow burn alert), and 14.4 (fast burn / critical alert)
- Alert integration: pair this widget with the Datadog monitor that pages on burn rate (for example, a composite monitor) so the on-call engineer sees the burn rate spike in context with the page they received
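The burn rate formula and the reference lines above can be checked with a short sketch. The function names are hypothetical, and the severity labels simply restate the 6 and 14.4 reference lines from this section.

```python
def burn_rate(error_rate_pct: float, slo_target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_pct = 100.0 - slo_target_pct
    return error_rate_pct / allowed_error_pct


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the full budget is spent if this burn rate is sustained."""
    return window_days / rate


def severity(rate: float) -> str:
    """Map a burn rate onto the reference lines used on the widget."""
    if rate >= 14.4:
        return "fast burn"
    if rate >= 6:
        return "slow burn"
    return "sustainable"


fast = burn_rate(error_rate_pct=2.0, slo_target_pct=99.9)   # about 20
slow = burn_rate(error_rate_pct=0.8, slo_target_pct=99.9)   # about 8

print(severity(fast), severity(slow), severity(0.5))
print(f"At 14.4x, a 30-day budget lasts {days_to_exhaustion(14.4):.2f} days")
```

A sustained 14.4x burn spends a 30-day budget in a little over two days, which matches the rationale for treating it as the critical threshold.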
Section 2: Current Health Widgets
Current Success Rate
- Widget type: Query Value
- Formula: same formula as the timeseries but aggregated to a single value for the selected time window
- Conditional formatting: green ≥ 99.9%, yellow 99.5–99.9%, red < 99.5%
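One detail worth checking when collapsing a timeseries to a single value: the window figure should be the ratio of summed counts, not the average of per-interval percentages. The sketch below uses two invented intervals to show how far apart the two can be.

```python
from typing import List, Tuple

# Hypothetical (successful, total) request counts for two intervals:
# one busy interval and one nearly idle interval with a few failures.
buckets: List[Tuple[int, int]] = [(9_990, 10_000), (5, 10)]


def window_success_rate(data: List[Tuple[int, int]]) -> float:
    """Ratio of sums: every request carries equal weight."""
    good = sum(g for g, _ in data)
    total = sum(t for _, t in data)
    return 100.0 * good / total


def mean_of_interval_rates(data: List[Tuple[int, int]]) -> float:
    """Average of per-interval percentages: every interval carries equal weight."""
    rates = [100.0 * g / t for g, t in data if t > 0]
    return sum(rates) / len(rates)


print(f"ratio of sums:      {window_success_rate(buckets):.2f}%")
print(f"mean of intervals:  {mean_of_interval_rates(buckets):.2f}%")
```

The ratio of sums is about 99.85%, while the interval average is 74.95% because ten requests in the quiet interval count as much as ten thousand in the busy one.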
Current Availability
- Widget type: Query Value or SLO widget in summary mode
- Same thresholds as Success Rate
Current Error Rate
- Widget type: Query Value
- Conditional formatting: green < 0.1%, yellow 0.1–0.5%, red > 0.5%
P95 Latency
- Widget type: Query Value
- Conditional formatting: green < 500ms, yellow 500ms–1000ms, red > 1000ms
- Note: thresholds vary by service — a payment API at 500ms may be acceptable while a search API at 500ms is unacceptable. Set per-service thresholds rather than using the same value globally.
Request Volume
- Widget type: Query Value
- Purpose: never review success rate without traffic context — a 2% error rate on 5 requests/second is noise; the same on 100,000 requests/second is a critical incident
- No threshold color coding — informational widget
Section 3: Failure Analysis Widgets
Error Rate Over Time
- Widget type: Timeseries, grouped by http.status_code or status
- Additional groupings: switch the group-by to endpoint, region, or version to narrow down blast radius
- Use when: Success Rate drops; this is your first investigation widget
HTTP Status Code Breakdown
- Widget type: Timeseries stacked bar or table
- Purpose: separate 4xx client errors from 5xx service errors
- Why it matters: a spike in 4xx can mean a broken client integration or an API contract change; a spike in 5xx means your service is failing — very different root causes and very different escalation paths
Top Failing Endpoints
- Widget type: Top List sorted by error count descending
- Group by: resource_name or http.route
- Purpose: when Success Rate drops across the board, one or two endpoints often account for 80% of failures, so find them immediately
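What the Top List computes can be sketched offline: rank endpoints by error count and show each one's share of all failures. The endpoint names and counts below are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_failing(errors_by_endpoint: Dict[str, int], n: int = 2) -> List[Tuple[str, int, float]]:
    """Return (endpoint, error count, share of all errors in percent) for the top n."""
    total = sum(errors_by_endpoint.values())
    if total == 0:
        return []
    ranked = Counter(errors_by_endpoint).most_common(n)
    return [(endpoint, count, 100.0 * count / total) for endpoint, count in ranked]


# Hypothetical error counts over the selected time window.
errors = {
    "/api/v2/checkout": 800,
    "/api/v2/cart": 120,
    "/api/v2/search": 50,
    "/healthz": 30,
}

for endpoint, count, share in top_failing(errors):
    print(f"{endpoint}: {count} errors ({share:.0f}% of failures)")
```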
Error Type Breakdown
- Widget type: Top List or table
- Group by: exception.type, error.message, or custom error tag
- Purpose: connects the metric symptom to the code-level root cause. Is it a NullPointerException, a database connection timeout, a downstream API 504, or a configuration error?
Section 4: Latency Widgets
P50 / P95 / P99 Latency Timeseries
- Widget type: Timeseries with three lines (P50, P95, P99) on the same chart
- Why three lines: a widening gap between P50 and P99 indicates that a minority of requests are experiencing extreme slowness — often a slow database query, a downstream timeout, or a specific code path with heavy computation
Slowest Endpoints
- Widget type: Top List sorted by P95 latency descending
- Group by: resource_name or http.route
- Use when: P95 is high but it is unclear which API is causing it
Section 5: Saturation Widgets
CPU and Memory
- Widget type: Timeseries grouped by host, pod, or container
- CPU threshold: alert at 80%, critical at 90% sustained for more than 5 minutes
- Memory threshold: alert when approaching the container memory limit (typically 80% of the limit)
Pod Restarts / OOM Kills
- Widget type: Timeseries or change widget showing restart count over time
- Group by: pod_name, kube_namespace
- Why it matters: pod restarts directly explain Availability drops. When pods are restarting, health checks fail and the load balancer marks them unhealthy
Replica Availability
- Widget type: Timeseries showing desired, available, and unavailable replica counts
- Alert threshold: unavailable replicas > 0 for more than 2 minutes
- Purpose: if desired is 5 and available is 2, your service is running at 40% capacity — success rate and latency degradation will follow
Connection Pool Usage
- Widget type: Timeseries showing active, idle, wait time, and timeout count
- Critical pattern: a rising wait time followed by connection timeouts followed by a success rate drop is a classic connection pool exhaustion pattern — very common under traffic spikes or when a slow downstream API holds connections open
Section 6: Dependency Widgets
Database Health
- Metrics to track: query latency (P50/P95), error rate, slow query count, lock wait time, active connections
- Widget types: timeseries for latency and error rate, query value for current connection count
- Purpose: most success rate drops in data-heavy services originate at the database layer — slow queries, connection limits, or lock contention
Cache Health
- Metrics to track: hit ratio, latency, eviction rate, error rate, memory usage
- Critical pattern: a sudden drop in cache hit ratio causes a surge of requests to reach the database — this causes both latency spikes and potential database connection exhaustion
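The cache-miss amplification is easy to underestimate, so here is the arithmetic as a sketch. The request rate and hit ratios are invented, and the model assumes every cache miss becomes exactly one database query.

```python
def database_rps(total_rps: float, cache_hit_ratio_pct: float) -> float:
    """Requests per second that miss the cache and reach the database."""
    return total_rps * (100.0 - cache_hit_ratio_pct) / 100.0


before = database_rps(total_rps=10_000, cache_hit_ratio_pct=95)
after = database_rps(total_rps=10_000, cache_hit_ratio_pct=80)

print(f"95% hit ratio: {before:.0f} rps to the database")
print(f"80% hit ratio: {after:.0f} rps to the database ({after / before:.0f}x)")
```

A 15-point drop in hit ratio quadruples database load in this example (500 to 2,000 requests per second), which is how a cache problem turns into connection exhaustion.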
Queue Health
- Metrics to track: queue depth (messages waiting), consumer lag, dead letter queue (DLQ) message count, consumer throughput
- Critical pattern: a growing DLQ count combined with a growing queue depth means consumers are failing and retrying — the service is producing work it cannot process
External API Health
- Metrics to track: latency, error rate, timeout count, retry rate per downstream service
- Group by: downstream service name or hostname
- Purpose: when your success rate drops but all internal metrics look healthy, the problem is almost certainly a downstream API — this widget surfaces it immediately
Section 7: Logs, Traces, and Deployment Widgets
Recent Error Logs
- Widget type: Log Stream filtered to error-level logs
- Columns: timestamp, service, endpoint, error message, trace_id
- Purpose: takes the investigation from metric to evidence in one click — the trace_id in an error log connects directly to the distributed trace in Datadog APM
Trace Samples
- Widget type: Trace List filtered to failed or slow spans
- Purpose: shows the full request path — which service called which, where time was spent, which span threw the error
- Filter: show only traces where error:true or duration > P95 threshold
Deployment Events
- Widget type: Event Timeline or Event Stream filtered to deployment events
- Integration: overlay deployment markers on the Success Rate Over Time and Error Rate Over Time charts using the @evt.name:deployment tag
- Purpose: the most common question in any post-incident review is "did this start after a deployment?" This widget answers it without leaving the dashboard
Success/Error Rate by Version
- Widget type: Timeseries grouped by version or image_tag
- Purpose: compare old and new versions side by side. If v3.7.0 has 99.9% success rate and v3.7.1 has 94% success rate, the deployment is the cause and a rollback decision can be made in minutes
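The version comparison behind a rollback decision can be sketched as follows. The request counts are invented to reproduce the 99.9% versus 94% example, and the 0.5-point tolerance is an assumed value, not a recommendation.

```python
from typing import Dict, Tuple


def success_rate_by_version(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map version -> success rate in percent from (successful, total) counts."""
    return {
        version: 100.0 * good / total
        for version, (good, total) in counts.items()
        if total > 0
    }


def regressed(baseline_pct: float, candidate_pct: float, tolerance_pct: float = 0.5) -> bool:
    """True when the candidate is worse than the baseline by more than the tolerance."""
    return (baseline_pct - candidate_pct) > tolerance_pct


# Hypothetical (successful, total) request counts per version tag.
rates = success_rate_by_version({
    "v3.7.0": (9_990, 10_000),
    "v3.7.1": (9_400, 10_000),
})

print(rates)
print("rollback candidate:", regressed(rates["v3.7.0"], rates["v3.7.1"]))
```

In practice the comparison also needs enough traffic on the new version to be meaningful, which is another reason to read it alongside request volume.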
Troubleshooting Flow: From Symptom to Root Cause
A good Golden Signals dashboard should help an engineer move from symptom to root cause in minutes. Here is the exact investigation sequence for each common failure pattern:
| Symptom | Investigation Sequence |
|---|---|
| Success Rate drops | Error Rate Over Time → HTTP Status Code Breakdown → Top Failing Endpoints → Error Type Breakdown → Recent Deployments → Dependency Health (DB, Cache, External API) → Logs/Traces → Saturation (Connection Pool, CPU) |
| Availability drops | Availability Monitor → Region/AZ Availability → Health Check Failures → Pod Restarts/OOM → Replica Availability → Load Balancer Health → Dependency Outage → Recent Deployments |
| Latency increases | P95/P99 Latency → Slowest Endpoints → Dependency Latency → Database Latency → Queue Consumer Lag → CPU/Memory Saturation → Traffic Spike → Recent Deployment / Version Latency |
| Traffic spike | Request Rate Over Time → Traffic by Endpoint → Traffic by Region → Error Rate (is the spike causing errors?) → Saturation (is the system keeping up?) → Autoscaling / Replica Count → Connection Pool |
| Errors after deployment | Deployment Events overlay → Error Rate by Version → Success Rate by Version → Latency by Version → Logs/Traces filtered by new version → Rollback decision |
Datadog Query and Formula Examples
These patterns use generic metric names. Your actual metric names will depend on your instrumentation — APM auto-instrumentation generates different metric names than custom StatsD or DogStatsD counters. Adjust the metric names to match your service's instrumentation.
In these queries each template variable is written on its own inside the braces, because Datadog expands a variable such as $env to the full tag pair (for example, env:prod). Use $env.value only where the bare value is needed.
Success Rate (request-based)
Formula: (A / B) * 100
A = sum:<your_success_metric>{$env, $service}.as_count()
B = sum:<your_total_request_metric>{$env, $service}.as_count()
Error Rate
Formula: (A / B) * 100
A = sum:<your_error_metric>{$env, $service}.as_count()
B = sum:<your_total_request_metric>{$env, $service}.as_count()
Availability (health-check based)
Formula: (A / B) * 100
A = sum:<your_health_check_success_metric>{$env, $service}.as_count()
B = sum:<your_health_check_total_metric>{$env, $service}.as_count()
SLO Burn Rate (fast burn window: 1 hour)
Burn Rate = (error_rate_last_1h / allowed_error_rate_for_SLO)
Alert threshold: burn_rate > 14.4 for 1 hour (will exhaust 30-day budget in ~2 days)
Alert threshold: burn_rate > 6 for 6 hours (slow burn)
P95 Latency by Endpoint
p95:<your_latency_metric>{$env, $service} by {resource_name}
Pod Restart Count
sum:kubernetes.containers.restarts{$env, $namespace} by {pod_name}.as_count()
Connection Pool Wait Time
avg:<your_connection_pool_wait_metric>{$env, $service}
Important: always use .as_count() rather than .as_rate() when summing events for ratio calculations — .as_rate() pre-normalizes by time interval and can produce incorrect ratios when combined with another .as_rate() metric.
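For teams that keep dashboards in version control, the success rate pattern can be assembled as a widget definition in code. This is a sketch under stated assumptions: the metric names are hypothetical, and the dictionary layout (`requests`, `queries`, `formulas`, `markers`) follows the shape of a Datadog dashboard JSON export as best understood here, so compare it with an export from your own account before relying on it. Template variables appear bare in the scope because Datadog expands each one to a full tag pair.

```python
import json

# Hypothetical metric names; replace with the counters your service emits.
SUCCESS_METRIC = "myapp.requests.success"
TOTAL_METRIC = "myapp.requests.total"
SCOPE = "$env,$service"


def success_rate_widget(success_metric: str, total_metric: str, scope: str) -> dict:
    """Build a timeseries widget definition for (good / total) * 100."""
    return {
        "definition": {
            "type": "timeseries",
            "title": "Success Rate Over Time",
            "requests": [
                {
                    "response_format": "timeseries",
                    "display_type": "line",
                    "queries": [
                        {
                            "name": "good",
                            "data_source": "metrics",
                            "query": f"sum:{success_metric}{{{scope}}}.as_count()",
                        },
                        {
                            "name": "total",
                            "data_source": "metrics",
                            "query": f"sum:{total_metric}{{{scope}}}.as_count()",
                        },
                    ],
                    "formulas": [{"formula": "(good / total) * 100"}],
                }
            ],
            # Reference line at the 99.9 target used throughout this guide.
            "markers": [
                {"value": "y = 99.9", "display_type": "error dashed", "label": "Target"}
            ],
        }
    }


widget = success_rate_widget(SUCCESS_METRIC, TOTAL_METRIC, SCOPE)
print(json.dumps(widget, indent=2))
```

Swapping the two metric names is enough to reuse the same builder for the error rate and availability formulas.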
SRE Interview Angle: How to Explain This Dashboard
For SRE interviews, this dashboard is one of the strongest ways to prove production thinking. Most candidates answer observability questions in abstract terms. When you describe a specific dashboard layout — with real widget types, real troubleshooting flows, and real threshold rationales — interviewers immediately understand that you have built and used these systems in production.
How to Describe It in an Interview
Frame your answer around decision-making, not feature listing:
"I build Golden Signals dashboards in Datadog organized into two purposes: the top rows answer 'is the service reliable right now' — Success Rate, Availability, SLO, and Error Budget as query value widgets with color coding. The progression rows below answer 'how has reliability changed over time' — timeseries at 1h, 24h, 7d, and 30d so I can show managers a trend, not just a current number. The troubleshooting rows are for when the answer is 'no, it is not reliable' — Error Rate broken down by status code, endpoint, and version, dependency health for DB/cache/queue/external APIs, and Deployment Events overlaid on error rate so I can answer 'did this start after the last deployment' in under 10 seconds."
Common Interview Questions and How to Use This Dashboard
- "How do you define reliability?" — Answer with Success Rate and Availability as two separate metrics, explain the formula for each, and explain why both are needed.
- "Walk me through an incident investigation" — Use the troubleshooting flow table. Start with the symptom, describe each widget you check in sequence, and explain what you are looking for at each step.
- "What is an SLO and how do you manage error budgets?" — Describe the SLO widget, the burn rate timeseries with the 14.4x threshold, and how burn rate alerts work as a tiered alerting system (fast burn = page immediately, slow burn = watch and prepare).
- "How would you know if a deployment caused an incident?" — Describe the Deployment Events overlay and the Success/Error Rate by Version widget — you can compare old and new versions on the same chart within seconds of the deployment completing.
- "What is the difference between a health check and request success rate?" — This is the Availability vs Success Rate question. Use the table above to explain it concisely.
For Real Production Incidents
This dashboard type is the foundation of an effective on-call rotation. When a page fires at 2 AM, the on-call engineer should be able to open this dashboard, see the failing metric in Row 1, understand the historical context in Row 2, find the error source in Row 3, and have a root cause hypothesis within 5 minutes — without switching between multiple dashboards or querying logs manually.
If you are currently in a role where this capability does not exist, building this dashboard is one of the highest-leverage reliability improvements you can make. It pays back in reduced MTTR, cleaner post-incident reviews, and significantly less on-call fatigue.
Complete Widget Checklist
Use this as a build checklist when implementing the dashboard in Datadog. A fully complete Golden Signals dashboard for production SRE use should have all 29 widgets configured and validated.
| Section | Widget | Status |
|---|---|---|
| Current Health | Current Success Rate | — |
| Current Health | Current Availability | — |
| Current Health | Current Error Rate | — |
| Current Health | P95 Latency | — |
| Current Health | Request Volume | — |
| Current Health | SLO / Error Budget | — |
| Progression | Success Rate Over Time | — |
| Progression | Availability Over Time | — |
| Progression | SLO Burn Rate | — |
| Progression | Error Budget Remaining | — |
| Failure Analysis | Error Rate Over Time | — |
| Failure Analysis | HTTP Status Code Breakdown | — |
| Failure Analysis | Top Failing Endpoints | — |
| Failure Analysis | Error Type Breakdown | — |
| Latency | P50 / P95 / P99 Latency | — |
| Latency | Slowest Endpoints | — |
| Traffic | Request Rate Over Time | — |
| Saturation | CPU and Memory | — |
| Saturation | Pod Restarts / OOM | — |
| Saturation | Replica Availability | — |
| Saturation | Connection Pool Usage | — |
| Dependencies | Database Health | — |
| Dependencies | Cache Health | — |
| Dependencies | Queue Health | — |
| Dependencies | External API Health | — |
| Logs & Traces | Recent Error Logs | — |
| Logs & Traces | Trace Samples | — |
| Deployment | Deployment Events | — |
| Deployment | Success/Error Rate by Version | — |
Frequently Asked Questions
What is a Golden Signals Dashboard in SRE?
A Golden Signals Dashboard is a single-pane observability view built around the four metrics Google SRE teams identified as most reliable for service health assessment: Latency, Traffic, Errors, and Saturation. Production SRE teams extend this to include Success Rate, Availability, SLO status, error budget, and burn rate — giving on-call engineers everything needed to move from symptom to root cause in minutes without switching between dashboards.
What are the four golden signals of monitoring?
Latency (how long requests take, tracked at P50/P95/P99), Traffic (request volume per second or minute), Errors (rate of failed requests broken down by status code, endpoint, and type), and Saturation (resource utilization including CPU, memory, connection pools, pod restarts, and queue lag).
How do you calculate Success Rate in Datadog?
Success Rate = (successful requests / total requests) × 100. In Datadog, build this as a formula query: divide the sum of your success metric by the sum of your total request metric, then multiply by 100. Scope both metrics with $env and $service template variables so the formula works across all your services without modification. Use .as_count() aggregation, not .as_rate(), to avoid double-normalization in ratio formulas.
What is the difference between Success Rate and Availability?
Success Rate is request-level reliability — of all requests received, how many succeeded. It degrades when the service is up and receiving traffic but returning errors. Availability is reachability and uptime — of all health checks performed, how many succeeded. It degrades when the service is unreachable, crashing, or failing health checks. A service can have 100% Availability but 40% Success Rate (up but broken), or 60% Availability but 100% Success Rate for the requests that do reach it (partially up). Both metrics are required.
Which Datadog widgets are needed for SRE troubleshooting?
For complete SRE troubleshooting coverage: Success Rate Over Time, Availability Over Time, SLO/Error Budget widget, SLO Burn Rate, Current Success Rate (query value), Current Availability (query value), Error Rate Over Time, HTTP Status Code Breakdown, Top Failing Endpoints, Error Type Breakdown, P50/P95/P99 Latency, Slowest Endpoints, Request Rate, CPU/Memory, Pod Restarts/OOM, Replica Availability, Connection Pool Usage, Database/Cache/Queue/External API Health, Recent Error Logs, Trace Samples, Deployment Events, and Success/Error Rate by Version.
How do SLO, error budget, and burn rate help SRE teams?
The SLO sets the reliability target (e.g., 99.9% success rate over 30 days). The error budget is the tolerated failure margin — at 99.9%, that is approximately 43.2 minutes of failure budget per month. Burn rate measures how fast the budget is being consumed. A burn rate of 1 is sustainable. A burn rate of 14.4 over one hour means the budget will be exhausted in approximately two days — that triggers a critical alert and an immediate incident response. Burn rate is the mechanism that converts an SLO from a passive report into an active alerting system.
How should an SRE explain Golden Signals in an interview?
Explain in terms of production decisions: "I track Latency at P50/P95/P99 to distinguish typical from tail performance. Traffic gives me load context: the same error rate has very different severity on 100 RPS vs 100,000 RPS. Errors I break down by status code, endpoint, and version to find blast radius. For Saturation, I track CPU, memory, pod restarts, connection pool usage, and replica count, not just CPU and memory, because saturation failures usually manifest as application-layer symptoms. I add Success Rate and Availability on top of the four signals because they directly map to SLO commitments. And I use SLO burn rate for alerting: a burn rate over 14.4 sustained for an hour means page immediately."
Conclusion
A Golden Signals dashboard built around Success Rate and Availability progression is not a nice-to-have — it is the foundation of an SRE team's ability to respond to incidents quickly, review reliability trends honestly, and make release decisions with data. The difference between a dashboard that gets used during incidents and one that sits forgotten on a bookmark is whether it answers the right question in the right sequence: current state first, trend second, troubleshooting path third.
The 29-widget layout described here — with Datadog template variables for environment, service, region, cluster, and version — gives you a single dashboard that works for every service, every environment, and every on-call scenario. Build the first two rows first (Current Health and Progression Over Time). Validate that Success Rate and Availability are trending correctly. Then add the troubleshooting rows incrementally, prioritizing the sections that match your service's most common failure patterns.