Incident triage with observability intelligence

What’s your observability solution of choice?

What distinguishes an observability solution from a previous-generation monitoring solution is that it does a better job of identifying the root cause of incidents in distributed cloud-native environments. Accordingly, a variety of monitoring, logging, and tracing solutions for distributed microservices environments are being released, and users now have to evaluate which of these many options will actually be effective.

But which criteria are appropriate for evaluating the solutions on the market? Should you combine a solution that analyzes traces in depth with a basic metrics monitoring solution? Or would a single-provider solution that covers metrics, logs, and traces help you find the root of a problem in a more integrated manner? What about building your own observability solution from multiple OSS projects such as Prometheus?

In real life…

Let’s start with an example. Suppose that a microservice written in Java queries the database, the response is delayed, the memory usage of the microservice’s Pod increases, and the observability solution fires an alert. Knowing only that the memory usage of a particular Pod has increased, the system operator does not have enough information to determine the cause of the alert and needs to work with the development team to identify the problem. The development team, in turn, can only pinpoint the specific microservice that caused the problem by analyzing the logs and traces the microservices left behind. If the problem was caused by a recent deployment, the team needs to roll back to the previous software version to reduce downtime or to keep the system from going down entirely.

The key question is: How quickly can you determine the cause of the problem and take appropriate action? In general, once the development team has to be involved, it can take some time to detect the cause of the issue, build a fix, and then deploy and validate it.

This underlines the importance of having a solution in place that can analyze traces between microservices in depth and so help identify problems quickly. Of course, this assumes that a trace library (e.g. Jaeger or OpenTelemetry) has been added to the code, or that a service mesh is in use, so that all microservices can be traced. Such specialized trace tools are of practical help in the development stage or the postmortem stage. To solve an immediate problem, however, it may be more realistic to roll back the deployed services or configurations, and the Kubernetes rollout mechanism helps facilitate this process.
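
As a rough illustration of the rollback path described above, the sketch below assembles a `kubectl rollout undo` command for a Kubernetes Deployment. The deployment name `orders-service`, the namespace `shop`, and the helper names are hypothetical and not part of any NexClipper API. The command is only executed when `dry_run` is set to False.

```python
import subprocess
from typing import List, Optional


def build_rollback_command(deployment: str,
                           namespace: str = "default",
                           to_revision: Optional[int] = None) -> List[str]:
    """Build the argument list for `kubectl rollout undo` on a Deployment."""
    cmd = ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace]
    if to_revision is not None:
        # Without --to-revision, kubectl rolls back to the previous revision.
        cmd.append(f"--to-revision={to_revision}")
    return cmd


def roll_back(deployment: str,
              namespace: str = "default",
              to_revision: Optional[int] = None,
              dry_run: bool = True) -> str:
    """Return the command line; run it against the cluster only if dry_run is False."""
    cmd = build_rollback_command(deployment, namespace, to_revision)
    if not dry_run:
        # Requires kubectl on PATH and credentials for the target cluster.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # Hypothetical service and namespace, shown in dry-run mode only.
    print(roll_back("orders-service", namespace="shop"))
    print(roll_back("orders-service", namespace="shop", to_revision=3))
```

Keeping the command construction separate from its execution lets an operator review exactly what would run before touching the cluster.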

Observability with NexClipper

Now let’s see how NexClipper can help users cope with incidents. With NexClipper’s ExporterHub, users can easily and automatically install validated exporters together with alert configurations and Grafana dashboards. This immediately gives users metric dashboards, exploration options, and alert setup for Kubernetes, nodes, and services. Once the Prometheus Alertmanager generates an alert because of a system abnormality, NexClipper’s Alert Hub delivers the alert to the responsible parties through dedicated channels so that they can start reviewing it. NexClipper supports this review to determine whether the alert should be escalated to an incident for further follow-up and resolution. NexClipper’s incident management keeps a record of past incidents, including the history of alerts, the metrics, the investigation/triage discussion, and the solutions applied to resolve each case. This helps users solve the issue at hand and supports postmortem activities to prevent future incidents.

Figure 1. NexClipper process – from alert to problem-solving

Alert Hub

NexClipper’s Alert Hub works in conjunction with the Prometheus Alertmanager to send alerts generated by a specific cluster, node, or service to a designated group through a selected channel such as Slack or email.

Figure 2. Alert Hub architecture
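
To make the routing idea concrete, here is a minimal sketch of label-based routing over an Alertmanager-style webhook payload (a JSON object with an `alerts` list whose entries carry `status` and `labels`). It is not NexClipper’s implementation: the `Route` type, the group names, and the channel names are all assumptions chosen for illustration, and actual delivery to Slack or email is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Route:
    """Hypothetical routing rule: all labels in `match` must equal the alert's labels."""
    match: Dict[str, str]
    group: str
    channel: str  # e.g. "slack" or "email"


# Fallback used when no rule matches (assumed policy, not taken from the product).
DEFAULT_ROUTE = Route(match={}, group="on-call", channel="email")


def pick_route(labels: Dict[str, str], routes: List[Route]) -> Route:
    """Return the first route whose match labels are all present on the alert."""
    for route in routes:
        if all(labels.get(key) == value for key, value in route.match.items()):
            return route
    return DEFAULT_ROUTE


def route_alerts(payload: dict, routes: List[Route]) -> List[Tuple[str, str, str]]:
    """Return (group, channel, alertname) for every firing alert in the payload."""
    deliveries = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts are skipped in this sketch
        labels = alert.get("labels", {})
        route = pick_route(labels, routes)
        deliveries.append((route.group, route.channel, labels.get("alertname", "unknown")))
    return deliveries


if __name__ == "__main__":
    routes = [Route({"cluster": "prod", "service": "orders"}, "orders-team", "slack")]
    payload = {
        "alerts": [
            {"status": "firing",
             "labels": {"alertname": "PodMemoryHigh", "cluster": "prod", "service": "orders"}},
            {"status": "resolved",
             "labels": {"alertname": "NodeDown", "cluster": "prod"}},
            {"status": "firing",
             "labels": {"alertname": "DiskFull", "cluster": "dev"}},
        ]
    }
    for delivery in route_alerts(payload, routes):
        print(delivery)
```

First-match-wins keeps the rules easy to reason about; a real deployment would likely also need grouping, deduplication, and silencing, which this sketch leaves to Alertmanager itself.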

Incident management

A NexClipper user can create an incident ticket, for example, from a specific alert. Users see not only the incident ticket but also the related metrics, the history of alerts and incidents from the same source, and the history of recently applied deployments. The function also helps users manage the history of actions taken on the incident, so that they can refer to a holistic incident report in any future case.

Figure 3. Incident management – sample UI
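
One way to picture such an incident ticket is as a record that bundles the triggering alert with its context. The sketch below is an assumption-heavy illustration, not NexClipper’s data model: the class names, the fields, the use of a service name as the "source", and the 24-hour lookback window are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Alert:
    name: str
    source: str          # assumed here to be the service the alert came from
    fired_at: datetime


@dataclass
class Deployment:
    service: str
    version: str
    deployed_at: datetime


@dataclass
class Incident:
    title: str
    trigger: Alert
    related_alerts: List[Alert] = field(default_factory=list)
    recent_deployments: List[Deployment] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)

    def log_action(self, note: str) -> None:
        """Append a triage or remediation note to the incident history."""
        self.actions.append(note)


def open_incident(trigger: Alert,
                  alert_history: List[Alert],
                  deployments: List[Deployment],
                  lookback: timedelta = timedelta(hours=24)) -> Incident:
    """Create an incident from an alert and attach context from the same source."""
    since = trigger.fired_at - lookback
    related = [a for a in alert_history
               if a is not trigger and a.source == trigger.source
               and since <= a.fired_at <= trigger.fired_at]
    recent = [d for d in deployments
              if d.service == trigger.source
              and since <= d.deployed_at <= trigger.fired_at]
    return Incident(title=f"{trigger.name} on {trigger.source}", trigger=trigger,
                    related_alerts=related, recent_deployments=recent)


if __name__ == "__main__":
    now = datetime(2021, 1, 1, 12, 0)
    trigger = Alert("PodMemoryHigh", "orders", now)
    history = [Alert("PodRestart", "orders", now - timedelta(hours=2)),
               Alert("PodRestart", "billing", now - timedelta(hours=1))]
    deployments = [Deployment("orders", "v1.4.2", now - timedelta(hours=3))]
    incident = open_incident(trigger, history, deployments)
    incident.log_action("Rolled back orders to the previous version")
    print(incident.title, len(incident.related_alerts), len(incident.recent_deployments))
```

Attaching recent deployments to the ticket at creation time is what lets an operator ask "did a release cause this?" without leaving the incident view.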

Log & Trace

NexClipper’s log and trace functions help customers search for logs and traces recorded during a specific time period and allow them to review service latency and more.

Future plans include adding a function that lets system operators easily roll back a deployment by linking the Kubernetes deployment history to incident management.

Figure 4. NexClipper log & trace architecture

Rapid operation with high-quality information

As described above, NexClipper supports the following:

  1. determining to whom detected anomalies are communicated
  2. making accurate judgments through integrated management of the alerts, incident histories, and metrics related to the symptom
  3. analyzing correlations between the logs and traces of related applications
  4. curating a database of problem-solving processes and history

In conclusion, NexClipper aims to let users solve problems quickly and to provide references for future development and operation. When dashboards, APM, log management, and incident management are well integrated in the operator’s view, operators can respond rapidly in time-sensitive situations.

In future versions, a rollback automation function will be added to the incident management / operation automation function so that the system operator can easily roll back the related service, supporting a more rapid failure response.

And that concludes the outline of how NexClipper aims to help users cope with incidents. If you have any further questions or would like to discuss the topic more, please feel free to contact Jerry Lee (jerry.lee@nexclipper.io).