Applies to this Azure Well-Architected Framework Reliability checklist recommendation:
| RE:10 | Continuously measure and track system health using uptime and reliability indicators across components and critical flows. Ensure this data is retained and accessible to support timely detection, response, and post-incident analysis. |
|---|---|
Reliability monitoring is the practice of measuring how well a system meets its business requirements over time with respect to resiliency and recoverability. A well-architected monitoring system provides a real-time view of system behavior and its trends by establishing visibility across the platform, infrastructure, and workload layers.
By correlating these signals across components and over time, monitoring enables fast, confident analysis of incidents and outages. Create a structured approach so that insights are meaningful, alerts drive the right actions, and learnings feed back into architecture and operations.
The key strategies in this article build on the foundational operational practice of observability, described in OE:07 Architecture strategies for designing a monitoring system. Guidance on implementing the monitoring practice is available in the Monitoring Design Guide. We recommend reviewing those resources first.
Definitions
| Term | Definition |
|---|---|
| SLA (service level agreement) | External commitments you receive from vendors, or that you make to your customers. Failing to meet SLAs can lead to financial penalties, reputational damage, or degraded user experience. |
| SLO (service level objective) | Internal performance and reliability targets used to define thresholds that trigger alerts and to measure system health against business objectives. |
| Health model | A hierarchical representation of system condition using clear health states (healthy, degraded, unhealthy) with real-time signals and drill-down capability from overall system to individual components. |
| Stamp | A deployment unit with defined capacity limits, such as maximum concurrent users, throughput, or resource utilization thresholds. Multiple stamps enable scale-out and regional distribution. |
| FMA (failure mode analysis) | A systematic analysis to identify potential points of failure in a system, used to guide monitoring strategy and reliability improvements. |
| RTO (recovery time objective) | The maximum acceptable time to detect, respond to, and recover from a failure or incident. |
| RPO (recovery point objective) | The maximum acceptable amount of data loss measured in time, representing how much data can be lost during a failure scenario. |
| Synthetic transactions | Automated tests that simulate real user actions and end-to-end interactions to validate system health and detect issues from a customer perspective, providing external validation of system availability. |
| Correlation IDs | Unique identifiers used to trace transactions and requests across multiple services and components, enabling problem identification and analysis in distributed systems. |
| Transient faults | Temporary failures in system dependencies that typically resolve themselves within a short period of time, such as network timeouts or temporary service unavailability. |
| Tail latency | The response time experienced by the slowest requests, typically measured at high percentiles (p95, p99) where performance issues often surface first. |
Monitor workload functionality
Monitor what your system actually delivers. Start by tracking whether critical workflows complete successfully and produce valid results. A system can appear healthy while still producing incorrect or incomplete outputs, so execution alone is not enough.
For example, if a workload generates reports every six hours, monitoring should confirm two things: that the job ran as scheduled, and that it produced a valid result, such as a non-empty report with the expected content and size. This kind of validation helps ensure the system is not only executing, but also delivering the functionality it was designed for.
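To make this concrete, here's a minimal sketch of such a validation check. The report path, size floor, and freshness window are hypothetical placeholders; in practice you'd wire a check like this into your monitoring pipeline or run it as a scheduled probe.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Hypothetical thresholds for a report generated every six hours.
REPORT_PATH = Path("/data/reports/latest_report.csv")  # placeholder path
MAX_AGE = timedelta(hours=6, minutes=30)               # schedule plus a grace period
MIN_SIZE_BYTES = 1024                                  # an empty or truncated report is suspect

def report_is_healthy() -> tuple[bool, str]:
    """Confirm the job ran on schedule AND produced a plausible result."""
    if not REPORT_PATH.exists():
        return False, "report missing: job may not have run"
    age = datetime.now(timezone.utc) - datetime.fromtimestamp(
        REPORT_PATH.stat().st_mtime, tz=timezone.utc)
    if age > MAX_AGE:
        return False, f"report is stale ({age} old): schedule was missed"
    if REPORT_PATH.stat().st_size < MIN_SIZE_BYTES:
        return False, "report too small: content validation failed"
    return True, "report is fresh and non-empty"

healthy, detail = report_is_healthy()
print(f"healthy={healthy}: {detail}")
```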
Monitor user experience
Monitor reliability from a business and user perspective. As part of your failure mode analysis (FMA), you should already have identified key user flows. For each flow, track how failures in any component or dependency affect the user experience and what the expected outcome becomes. For example, in an e-commerce checkout flow, a service outage or overload in payment or inventory systems may prevent customers from completing purchases.
Reliability also reflects quality of service. In a checkout flow, users should be able to complete purchases end-to-end without interruption. Use percentile-based latency metrics such as p50, p95, and p99 to understand real user experience, with special attention to tail latency where performance issues often surface first.
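As a simple illustration of percentile metrics, the following sketch computes nearest-rank percentiles over made-up latency samples. Note how a single slow request dominates p99 while leaving p50 untouched:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[max(rank - 1, 0)]

# Hypothetical request latencies in milliseconds for a checkout flow.
latencies_ms = [82, 84, 85, 87, 88, 90, 91, 92, 93, 95,
                96, 98, 99, 101, 104, 108, 110, 115, 230, 1450]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50 = 95 ms, but p99 = 1450 ms: tail latency reveals the problem a mean would hide.
```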
Important
Performance monitoring provides a view of how your system behaves under real load by breaking down end-to-end latency across system layers. It correlates performance changes with events such as deployments, configuration updates, and scaling operations, so you can understand what influenced a shift in behavior. Together, reliability and performance monitoring provide a complete picture of system behavior and highlight where focused attention has the most impact. For information about performance monitoring, see Monitor Performance.
Track availability targets
Track how well your system meets its defined targets for availability, throughput, and response times. These targets are often formalized as service level agreements (SLAs) and service level objectives (SLOs) and reflect the expectations you've set with your users. Monitoring against them keeps reliability aligned with real business outcomes. For more information, see Reliability targets and Service Level Agreements.
Focus on the key indicators that contribute to those targets and track them over time. When something drifts, you should be able to drill down into the specific components or subsystems involved. Capture all relevant signals, including issues masked by redundancy or failover, so you can understand what actually happened and prevent repeat occurrences.
Combine real-time awareness with historical context. Real-time signals help you respond quickly when targets are at risk, while trends over time reveal patterns and recurring issues. Classifying the causes of target misses and aggregating these metrics also supports clear SLA reporting and helps guide ongoing improvements.
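A common way to quantify tracking against an SLO is an error budget. The following sketch uses hypothetical request counts to show the arithmetic:

```python
# Hypothetical SLO: 99.9% of requests succeed over a 30-day window.
SLO_TARGET = 0.999
total_requests = 10_000_000   # observed over the window (example numbers)
failed_requests = 7_200

availability = 1 - failed_requests / total_requests
error_budget = 1 - SLO_TARGET                        # allowed failure ratio
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"availability:    {availability:.4%}")        # 99.9280%
print(f"error budget:    {error_budget:.3%} of requests")
print(f"budget consumed: {budget_consumed:.0%}")     # >100% would mean the SLO was missed
```

Tracking the consumed budget over time, rather than a single pass/fail result, shows how quickly reliability headroom is being spent and when to slow down risky changes.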
Monitor the key SLAs provided by your suppliers and platform services (from Microsoft and others). You should:
- Track indicators of potential SLA violations in real time
- Capture and retain the evidence required to support an SLA claim if a breach occurs
Track recoverability targets
Track recoverability by treating every test and real incident as a measurable event. Use monitoring data to validate that your system and team can meet defined recovery objectives under real conditions.
Measure key signals such as time to detect, respond, and recover (RTO), along with data loss exposure (RPO). Include indicators like failover readiness and capacity, failover success rates and execution time, backup and restore success, replication lag, and how much manual intervention is required.
These metrics also highlight operational gaps, such as unclear procedures, decision delays, or hard-to-access documentation, which can affect recovery performance. Use these insights to strengthen both system design and incident response practices.
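The arithmetic behind these measurements is straightforward. The following sketch, using hypothetical timestamps from a recovery drill, compares measured detection and recovery times against RTO and RPO targets:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps captured during an incident or recovery drill.
failure_at   = datetime(2024, 5, 1, 10, 0, 0)
detected_at  = datetime(2024, 5, 1, 10, 4, 30)   # first alert fired
recovered_at = datetime(2024, 5, 1, 10, 41, 0)   # service restored
last_backup  = datetime(2024, 5, 1, 9, 45, 0)    # most recent restorable data

rto_target = timedelta(hours=1)
rpo_target = timedelta(minutes=30)

time_to_detect  = detected_at - failure_at
time_to_recover = recovered_at - failure_at       # measured against the RTO
data_loss       = failure_at - last_backup        # measured against the RPO

print(f"time to detect: {time_to_detect}")
print(f"recovery time:  {time_to_recover} (RTO met: {time_to_recover <= rto_target})")
print(f"data loss:      {data_loss} (RPO met: {data_loss <= rpo_target})")
```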
Note
Be careful that cleanup or retention policies aren't so aggressive that they delete logs or telemetry just when you need them most. For each scenario, ask: What data would we need to understand what happened before and during the incident? A useful approach is to think ahead to different types of post-incident investigations, such as:
- Platform or infrastructure outages
- Application availability issues (for example, after a deployment or configuration change)
- Application bugs causing data loss or corruption
- Security incidents
Make alerts actionable with a health model
Design alerts so they clearly point to something worth acting on, and ground them in a health model that represents the system using simple states like healthy, degraded, and unhealthy.
Structure the health model hierarchically, from individual components up to the full system, so you can quickly trace issues to their source. Define thresholds using service level objectives (SLOs), and combine signals such as metrics, logs, traces, and synthetic checks to create a reliable picture of system health. This gives operators a clear view of what's working, what isn't, and where to act without digging through raw data. For more information, see the design guide on Health modeling.
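To illustrate the hierarchical roll-up, here's a minimal sketch of a health model in which each node's effective state is the worst of its own state and its children's. The component names and states are hypothetical:

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Health(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2

@dataclass
class Node:
    """A component in the health model; its state rolls up from its children."""
    name: str
    own_state: Health = Health.HEALTHY      # state derived from this node's own signals
    children: list["Node"] = field(default_factory=list)

    def state(self) -> Health:
        # The effective state is the worst of this node and everything below it,
        # so operators can drill down from the system to the failing component.
        return max([self.own_state] + [c.state() for c in self.children])

checkout = Node("checkout-flow", children=[
    Node("web-frontend"),
    Node("payment-service", own_state=Health.DEGRADED),  # e.g., p99 latency over SLO
    Node("inventory-service"),
])
system = Node("system", children=[checkout])

print(system.state().name)   # DEGRADED: rolled up from payment-service
```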
Tune alerts for real conditions by focusing on end-to-end experience and critical transactions, so they reflect actual user impact. Reduce noise by accounting for transient fluctuations and triggering on meaningful health state changes rather than isolated blips or spikes. Combine real-time alerts with trend insights to catch both immediate issues and gradual degradation, helping teams respond quickly and stay focused.
Risk: Health modeling requires collecting meaningful signals across the system. Relying only on simple metrics like CPU or memory can miss what actually matters. Include user experience data and synthetic checks to get a complete view. Defining “healthy” takes alignment, and poorly tuned thresholds can create noise and reduce effectiveness.
Monitor all layers of the system
Monitor each layer of the system (application, data/storage, and network) to maintain a complete view of reliability signals.
At the application layer, track success, failure, and latency using logs, metrics, and health probes. Use correlation IDs to follow requests across services and make troubleshooting easier. Collect logs asynchronously so they don't impact request performance, and keep diagnostic and audit logs separate for clarity. Add synthetic transactions and endpoint probes to confirm what customers actually experience.
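The following sketch illustrates the correlation ID pattern: reuse an incoming ID if one is present, otherwise generate one, and propagate it on downstream calls and log records. The header name is a common convention rather than a standard, and the service names are illustrative:

```python
import logging
import uuid

logging.basicConfig(format="%(levelname)s corr=%(corr_id)s %(message)s", level=logging.INFO)
logger = logging.getLogger("checkout")

CORRELATION_HEADER = "x-correlation-id"   # conventional header name, not a standard

def handle_request(headers: dict[str, str]) -> dict[str, str]:
    # Reuse the caller's correlation ID if present; otherwise start a new trace.
    corr_id = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    logger.info("checkout started", extra={"corr_id": corr_id})
    # Propagate the same ID on every downstream call so logs can be joined later.
    downstream_headers = {CORRELATION_HEADER: corr_id}
    logger.info("calling payment service", extra={"corr_id": corr_id})
    return downstream_headers

handle_request({})                                        # starts a new trace
handle_request({CORRELATION_HEADER: "existing-id-123"})   # continues an existing trace
```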
Tradeoff: Choosing between asynchronous and synchronous logging involves a balance between performance and the reliability of telemetry.
Asynchronous logging keeps logging off the critical path, reducing latency and improving system performance. However, it introduces a risk of telemetry loss, especially if a failure occurs before logs are flushed or persisted.
Synchronous logging ensures that logs are written before processing continues, which improves data durability and auditability. The tradeoff is increased latency and a tighter coupling between application performance and the logging system.
In most scenarios, asynchronous logging is the preferred approach because of its minimal impact on performance. However, in heavily regulated or audit-sensitive environments, synchronous logging may be required to guarantee that critical events are captured reliably.
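As an illustration of the asynchronous pattern, this sketch uses the QueueHandler and QueueListener classes from Python's standard library to move log I/O off the request path:

```python
import logging
import logging.handlers
import queue

# A background listener drains the queue so the request path never blocks on log I/O.
log_queue: queue.Queue = queue.Queue(maxsize=10_000)  # bound it: unbounded queues hide backpressure
listener = logging.handlers.QueueListener(
    log_queue, logging.StreamHandler())               # swap StreamHandler for a file or remote sink

logger = logging.getLogger("app")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)

listener.start()
logger.info("request handled")   # enqueues and returns immediately
listener.stop()                  # flushes remaining records; skipping this risks telemetry loss
```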
At the data and storage layer, focus on availability, write success rates, query latency, timeouts, locks, and resource pressure. Look at trends over time to identify growing bottlenecks and distinguish short-lived issues from sustained degradation.
At the network layer, monitor connectivity, latency, packet loss, bandwidth, and traffic patterns. Combine flow logs, endpoint checks, and synthetic tests to surface routing issues, anomalies, or security-related behavior. Connect these signals back to application and platform data to understand where problems originate.
Operational logs help diagnose issues, track performance, and understand system behavior. They are not designed to serve as a source of truth for business events, auditing, or regulatory reporting, which typically require stronger traceability.
What to monitor in detail for each layer is covered in the Monitoring Design Guide.
Define and monitor stamp capacity
Define clear capacity limits for each deployment unit, or stamp, and monitor them closely. Every stamp operates within a finite ceiling, whether that's maximum concurrent users, throughput, or resource utilization thresholds. Making those limits explicit gives you a reliable baseline for decision-making.
This visibility helps you identify when a stamp is nearing saturation, well before it affects reliability. It also supports timely scale-out decisions, such as adding new stamps or redistributing load, and confirms that traffic is flowing according to your design.
Defining these limits isn't always straightforward. Capacity can be difficult to measure, especially when it depends on multiple underlying services with different scaling characteristics. You should use platform guidance, such as quotas and limits from Microsoft Azure, as a starting point. In practice, capacity is often determined through load testing, observation, and iterative tuning rather than precise upfront modeling.
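As an illustration, the sketch below checks hypothetical observed metrics against per-stamp ceilings and warns before saturation. The limits and warning threshold are placeholders you'd derive from load testing:

```python
# Hypothetical capacity ceilings for one stamp, derived from load testing.
STAMP_LIMITS = {
    "concurrent_users": 5_000,
    "requests_per_second": 1_200,
    "cpu_percent": 85,
}
SATURATION_WARNING = 0.80   # alert well before the ceiling, not at it

def check_stamp(stamp_name: str, observed: dict[str, float]) -> None:
    for metric, limit in STAMP_LIMITS.items():
        utilization = observed[metric] / limit
        if utilization >= SATURATION_WARNING:
            print(f"WARN {stamp_name}: {metric} at {utilization:.0%} of capacity "
                  f"- consider scale-out or load redistribution")

check_stamp("stamp-westeurope-01", {
    "concurrent_users": 4_300,       # 86% of the ceiling -> triggers a warning
    "requests_per_second": 700,
    "cpu_percent": 62,
})
```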
Monitor load distribution across redundant instances
When you run the workload across multiple redundant instances, including when you distribute instances across different regions or zones, traffic and resource usage should remain balanced across those instances.
Monitoring load distribution helps you spot imbalances that often point to routing issues, configuration problems, or dependency constraints. It also ensures that failover targets have sufficient capacity to absorb traffic when needed and confirms that redundancy mechanisms behave as expected during both steady-state operation and failure scenarios.
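One simple way to surface such imbalances is to flag instances whose traffic share deviates from the mean beyond a tolerance. The instance names, request counts, and tolerance below are hypothetical:

```python
import statistics

def detect_imbalance(requests_per_instance: dict[str, int], tolerance: float = 0.20) -> list[str]:
    """Flag instances whose traffic deviates from the mean by more than the tolerance."""
    mean = statistics.mean(requests_per_instance.values())
    return [name for name, count in requests_per_instance.items()
            if abs(count - mean) / mean > tolerance]

# Hypothetical per-instance request counts across two zones.
observed = {"zone1-a": 10_050, "zone1-b": 9_800, "zone2-a": 9_950, "zone2-b": 4_100}

print(detect_imbalance(observed))   # ['zone2-b']: far less traffic than its peers
```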
Detect failure modes
As part of your failure mode analysis (FMA) exercise, you should have identified the potential points of failure.
In your reliability monitoring practice, keep a continuous watch on those points. Start by focusing on simpler signals such as transient faults. Monitor retry behavior and transient fault rates to understand how your dependencies and underlying services behave under real operating conditions. These signals provide an early view into emerging instability. They help you recognize when retry patterns drift from the expected norm, maintain a sense of whether the system is staying healthy under load, and identify when a dependency or external service begins to degrade before it affects the user experience.
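As a sketch of drift detection, the following example tracks the retry ratio over a sliding window of requests and flags it when the ratio exceeds a multiple of an assumed baseline. The window size, baseline, and drift factor are illustrative:

```python
from collections import deque

class RetryRateMonitor:
    """Track the transient-fault retry ratio over a sliding window of requests."""
    def __init__(self, window: int = 1_000, baseline: float = 0.02, drift_factor: float = 3.0):
        self.outcomes = deque(maxlen=window)   # True = request needed a retry
        self.threshold = baseline * drift_factor

    def record(self, needed_retry: bool) -> None:
        self.outcomes.append(needed_retry)

    def is_drifting(self) -> bool:
        if not self.outcomes:
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.threshold

monitor = RetryRateMonitor()
for _ in range(900):
    monitor.record(False)
for _ in range(100):                  # a dependency starts timing out
    monitor.record(True)
print(monitor.is_drifting())          # True: 10% retry rate vs. a 2% baseline
```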
Also include bigger-impact failures, like availability zone outages that affect a subset of infrastructure, service outages, or regional outages that take an entire Azure region offline. Watch for security scenarios too, such as DDoS attacks or other malicious activity, component misconfiguration, and performance problems, because each of them can affect your solution's overall reliability.
For information about FMA, see Architecture strategies for failure mode analysis.
Be informed about platform reliability status
You need clear insight into platform health to manage reliability effectively. That awareness helps you quickly determine whether an issue originates in your workload or in the underlying cloud platform.
Azure Service Health provides visibility into the state of Azure. Configure alerts on Service Health so that you get notifications when platform conditions change. You receive updates on active outages affecting your resources, planned maintenance events that may introduce disruption, and regional or service-specific degradations.
Azure facilitation
Incorporate cloud platform monitoring and alerting services, including:
- Platform-level health, like Azure Service Health.
- Resource-level health, like Azure Resource Health.
- Azure Monitor is a comprehensive monitoring solution that's used to collect, analyze, and respond to monitoring data from your cloud and on-premises environments.
- Log Analytics is a tool in the Azure portal that's used to edit and run log queries against data in the Log Analytics workspace.
- Application Insights is an extension of Azure Monitor. It provides application performance monitoring (APM) features.
- Azure Monitor insights are advanced analytics tools that help monitor Azure services, like virtual machines, application services, and containers. Insights are built on top of Azure Monitor and Log Analytics.
- Azure Monitor for SAP solutions is an Azure-native monitoring product for SAP landscapes that run on Azure.
- Connection monitor is a tool for continuously tracking network connectivity and performance across Azure resources. It runs synthetic tests and provides real-time alerts and diagnostics to detect failures early. You can build custom workbooks to visualize connectivity health and aggregated performance metrics.
- Virtual network flow logs can be enabled across workloads to monitor network traffic. Traffic analytics can be used to analyze and enrich these flow logs to surface insights such as blocked traffic, malicious flows, and active ports exposed to the internet. Creating workbooks allows teams to monitor live traffic behavior and receive alerts. Use timeline views and topology visualizations to easily monitor traffic patterns that might indicate performance degradation or security threats.
For multiple workspace best practices, see Design a Log Analytics workspace architecture.
Example
For examples of real-world monitoring solutions, see Web application monitoring on Azure and Baseline architecture for an Azure Kubernetes Service cluster.
Community links
- Azure Monitor Baseline Alerts (AMBA) is a central repository of alert definitions that customers and partners can use to improve their observability experience through the adoption of Azure Monitor.
Reliability checklist
Refer to the complete set of recommendations.