Overview
The four Golden Signals of site reliability engineering (SRE) are Latency, Errors, Traffic, and Saturation (LETS). They have become key indicators for monitoring distributed systems effectively. These metrics are closely related to two older methods: USE metrics (Utilization, Saturation, and Errors) and RED metrics (Rate, Errors, and Duration). Monitoring the Golden Signals gives SREs visibility into the performance of their services and helps them maintain high availability.
IBM Cloud App Management (ICAM) simplifies troubleshooting by using the Golden Signals to provide visibility into microservices and USE metrics to provide visibility into traditional resources. We can use these signals as early warning signs that give advance notice of service impacts, which keeps service downtime to a minimum. If the error rate or the latency exceeds the expected threshold, we receive automatic notifications so that we can address the issue before it affects customers.
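To make the threshold idea concrete, the following is a minimal Python sketch of how the four signals might be derived from one window of request samples and compared against limits. It does not use any ICAM API; the `Request` record, the `golden_signals` and `breaches` functions, and the threshold values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One observed request (hypothetical record, not an ICAM type)."""
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request ended in an error


def golden_signals(requests: List[Request], window_s: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Summarize one observation window into the four Golden Signals.

    Assumes window_s > 0 and capacity > 0.
    """
    saturation = in_flight / capacity
    if not requests:
        return {"latency_p95_ms": 0.0, "error_rate": 0.0,
                "traffic_rps": 0.0, "saturation": saturation}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = max(1, (95 * len(durations) + 99) // 100)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": durations[rank - 1],
        "error_rate": errors / len(requests),
        "traffic_rps": len(requests) / window_s,
        "saturation": saturation,
    }


def breaches(signals: Dict[str, float],
             thresholds: Dict[str, float]) -> List[str]:
    """Return the sorted names of signals that exceed their threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if signals.get(name, 0.0) > limit)


# Illustrative window: 18 fast successes and 2 slow failures in 10 seconds.
window = [Request(100.0, True)] * 18 + [Request(900.0, False)] * 2
signals = golden_signals(window, window_s=10.0, in_flight=40, capacity=100)
limits = {"latency_p95_ms": 500.0, "error_rate": 0.05, "saturation": 0.8}
print(breaches(signals, limits))
```

With these sample values, the latency and error-rate limits are exceeded while saturation stays below its limit, so a notification would be raised for the first two signals only.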
IBM Cloud App Management provides a one-hop topology view, which shows the upstream and downstream dependencies that are exactly one hop away from a service.
We can look at the health of these dependencies in the topology to determine whether our service is affected by a dependency that is causing a bottleneck.
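The one-hop lookup can be sketched in a few lines of Python. This is an illustration only, not how ICAM builds its topology: the `calls` map, the `health` map, and both function names are hypothetical, and the sketch assumes that "upstream" means the services that call a given service and "downstream" means the services it calls.

```python
from typing import Dict, List, Set, Tuple


def one_hop_neighbors(calls: Dict[str, Set[str]],
                      service: str) -> Tuple[Set[str], Set[str]]:
    """Return (upstream, downstream) services exactly one hop away.

    `calls` maps each caller to the set of services it calls directly.
    """
    downstream = set(calls.get(service, set()))
    upstream = {caller for caller, callees in calls.items()
                if service in callees}
    return upstream, downstream


def unhealthy_dependencies(calls: Dict[str, Set[str]],
                           health: Dict[str, bool],
                           service: str) -> List[str]:
    """List the directly called services that are currently unhealthy.

    Services missing from `health` are treated as healthy.
    """
    _, downstream = one_hop_neighbors(calls, service)
    return sorted(dep for dep in downstream if not health.get(dep, True))


# Hypothetical call graph: web -> checkout -> {payments, inventory}
calls = {
    "web": {"checkout"},
    "checkout": {"payments", "inventory"},
    "payments": {"bank-gateway"},
}
health = {"payments": False, "inventory": True, "bank-gateway": False}
print(unhealthy_dependencies(calls, health, "checkout"))
```

In this example only `payments` is reported for `checkout`. The `bank-gateway` service is also unhealthy, but it is two hops away and therefore outside the one-hop view.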
The timeline on the service page also provides visibility into deployments and other events, which helps us determine whether a recent code push is causing an issue.
IBM Cloud App Management
IBM Cloud App Management (ICAM) is a container-based platform for monitoring the performance and availability of traditional and modern microservices-based business applications, whether they are deployed in a public cloud or on premises. Its key advantages include the following:
- Utilize Synthetics testing
Monitor application availability and response time proactively using Synthetics testing. Run synthetic tests on a schedule from different locations against different endpoints, and set alerts to be notified when the response time exceeds a threshold or an error response code is returned. Define complex conditions for a warning or critical event.
- Reduce the noise to the SRE
Receiving hundreds of alerts for a single underlying problem creates excessive noise and takes away precious time that is needed to focus on the problem at hand. ICAM aggregates all the events that are tied to an application or cluster into one incident. This aggregation helps the SRE focus on quickly restoring the service.
- Resolve issues using historical knowledge
ICAM runbooks let us take action to resolve incidents, react directly to events, or perform scheduled or unscheduled changes in the data center.
- Integrate with myriad external offerings
Set up both incoming and outgoing integrations between ICAM and external tools. For example, set up an integration with Jenkins projects to receive notifications about job status or deployments. We can also integrate with Prometheus, Azure, New Relic, and many more offerings to receive event notifications. We can send incident details through outgoing integrations to Slack, ServiceNow, GitHub, and more.
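The event aggregation described under "Reduce the noise to the SRE" can be illustrated with a minimal Python sketch. It assumes that events are correlated purely by the application or cluster they belong to; ICAM's actual correlation logic is not described here, and the `Event` and `Incident` types and the `correlate` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """One incoming alert (hypothetical record, not an ICAM type)."""
    resource: str   # application or cluster the event is tied to
    summary: str
    severity: int   # higher means more severe


@dataclass
class Incident:
    """All events for one resource, presented as a single item."""
    resource: str
    events: List[Event] = field(default_factory=list)

    @property
    def severity(self) -> int:
        # An incident is as severe as its most severe event.
        return max(e.severity for e in self.events)


def correlate(events: List[Event]) -> List[Incident]:
    """Group events by resource, one incident per application or cluster."""
    incidents: Dict[str, Incident] = {}
    for event in events:
        incident = incidents.setdefault(event.resource,
                                        Incident(event.resource))
        incident.events.append(event)
    # Dicts preserve insertion order, so incidents appear in the order
    # in which their first event arrived.
    return list(incidents.values())


# Three alerts arrive, but the SRE sees only two incidents.
alerts = [
    Event("checkout", "latency above threshold", 2),
    Event("cluster-a", "node memory pressure", 1),
    Event("checkout", "error rate above threshold", 3),
]
for incident in correlate(alerts):
    print(incident.resource, len(incident.events), incident.severity)
```

Grouping by a single key keeps the sketch short; a real correlation engine would likely also consider time windows and relationships between resources.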