Date: 2023-05-18

Every admin knows the ominous “ping” in the inbox when a new message arrives via the service desk, or the phone rings and a colleague's gruff voice announces: “Nothing works here anymore.” An incident is an unplanned disruption of an IT service, or one that is about to occur. Such failures can be not only very expensive for companies but can even threaten their existence. Here are some examples:

Admittedly, these are all big companies. Thanks to their financial cushion, they do not have to fear for their existence after such incidents, but the losses are still painful.

What factors influence the amount of loss in an incident?

But now imagine what major IT incidents can do to small and medium-sized businesses. Even if the losses are smaller in absolute terms, they can have a greater impact on the bottom line.

In addition to the size of the company, factors such as the industry or the business model influence the level of losses from incidents. Industries such as banking/financial services, government, healthcare, manufacturing, media and communications services, retail, transportation and utilities are particularly at risk should a major incident occur.

As for the business model: the more it depends on availability, the more you stand to lose when failures occur. For example, an e-commerce site with no physical stores will be hit much harder by a web outage than a business with physical stores or branch offices.

And what is the actual total cost of downtime?

An incident has more far-reaching consequences than you might think: the cost of downtime is by no means limited to lost revenue. Interrupting normal business processes also damages the company's image, users cannot use the product at all or only to a limited extent, and in the worst case the failure drives customers away.

Internal productivity losses also count as costs. This affects the IT team that has to fix the incident, but also other teams that, for example, communicate the incident or now have to deal with customer inquiries. If it is an internal incident, all employees may be unable to do their work.

Software providers in particular may have to dig deep into their pockets in the worst case, since penalties for SLA violations, government fines (for violations of regulatory requirements), legal disputes and compensation payments are heavy financial burdens. For companies that deal in physical products, on the other hand, low inventories are a significant risk.
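To make the cost components above tangible, here is a minimal Python sketch that adds up the quantifiable parts of a downtime estimate. The class name, the field names and all figures are hypothetical and chosen only for illustration. Image damage and customer churn are deliberately left out because they are hard to express as a single number.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEstimate:
    """Rough cost model for one incident; every figure is a hypothetical input."""
    duration_hours: float
    revenue_per_hour: float      # revenue normally earned per hour
    affected_employees: int      # staff who cannot work or are tied up by the incident
    hourly_labor_cost: float     # average cost of one employee hour
    sla_penalties: float = 0.0   # contractual penalties and compensation payments
    other_costs: float = 0.0     # e.g. legal fees or government fines

    def lost_revenue(self) -> float:
        return self.duration_hours * self.revenue_per_hour

    def productivity_loss(self) -> float:
        return self.duration_hours * self.affected_employees * self.hourly_labor_cost

    def total(self) -> float:
        return (self.lost_revenue() + self.productivity_loss()
                + self.sla_penalties + self.other_costs)


# Hypothetical two-hour outage of a small online shop.
estimate = DowntimeEstimate(
    duration_hours=2.0,
    revenue_per_hour=5_000.0,
    affected_employees=40,
    hourly_labor_cost=50.0,
    sla_penalties=3_000.0,
)
print(estimate.lost_revenue())  # 10000.0
print(estimate.total())         # 17000.0
```

In this made-up case, lost revenue accounts for only 10,000 of the 17,000 total, which illustrates why downtime cost should not be equated with lost revenue.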

As you can see, IT failures are a real problem for companies. That is why many ITSM teams have established a fixed process with which they nip incidents in the bud or at least resolve them as quickly as possible. And it is precisely this process that incident management deals with!

Incident Management: What is it exactly?

Atlassian defines incident management as “The process used by development and IT operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.” As part of IT service management, incident management aims to restore the operation of services as quickly as possible by resolving the disruption. The focus is on keeping the effects of an incident on the company as small as possible.

According to Atlassian, incidents are “events of any kind that disrupt or reduce the quality of service (or threaten to do so). […] Incidents can vary widely in severity, ranging from an entire global web service crashing to a small number of users having intermittent errors.” An incident is considered resolved once the affected service is back to normal operation.

But how do ITSM teams typically respond to an incident?

The Incident Management Cycle – a process divided into 5 phases

A fixed process within a team is required so that incidents can be quickly identified, resolved and followed up. This can be viewed as an ongoing cycle and ideally includes the following five phases:

Preparation: Here you play through “what if” scenarios, define processes for them and pack an “emergency kit”. This includes, for example, incident response plans, contact lists, preparedness plans, escalation guidelines, access codes, compliance regulations and technical documentation.

Detection and alerting: How is your team notified of a disruption? Modern incident management tools help you get instant, reliable alerts with automated alerting workflows based on alert types, team schedules and escalation policies.

Containment: After the incident has been identified, it must be isolated. The first priority is to keep the scope of the incident as small as possible; a comprehensive solution comes later. Even in this phase, communication with affected customers should be open and transparent.

Restoration: This phase is about effective, long-term problem solving. What caused the incident? What measures can prevent a recurrence and make the system more secure?

Analysis: A post-mortem analysis gives your team the opportunity to learn from the incident. It covers not only the systemic causes of the incident but also how it was resolved. The results include, for example, improved workflows or reference material for future incident scenarios.
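The escalation policies mentioned under detection and alerting can be pictured with a small Python sketch. It is not the data model of Opsgenie or any other tool. The step names, thresholds and the function `who_to_notify` are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes an alert may stay unacknowledged before this step applies
    notify: str         # hypothetical role to notify


# Hypothetical policy, listed in ascending order so results read in escalation order.
POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(15, "team lead"),
    EscalationStep(45, "incident manager"),
]


def who_to_notify(minutes_unacknowledged: int, policy=POLICY) -> list[str]:
    """Return everyone who should have been notified by now."""
    return [step.notify for step in policy
            if minutes_unacknowledged >= step.after_minutes]


print(who_to_notify(5))   # ['on-call engineer']
print(who_to_notify(20))  # ['on-call engineer', 'team lead']
```

Writing the policy down as data during the preparation phase means nobody has to decide under pressure who gets called next.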

Want to delve deeper into the incident management cycle? This earlier article explains each stage in detail!

What tools do you need for effective incident management?

To offer effective incident management, your ITSM teams should be equipped with the right tools. Depending on which phase of the incident management cycle you are in, different tools are recommended, e.g. Jira Service Management (JSM), Opsgenie or Statuspage. They support you with different functions for quickly identifying, communicating and documenting incidents, which helps you minimize downtime and costs.

The table below shows which tool suits which phase of your process and what it is useful for. If you want to know more about each incident management tool and its features, you can delve deeper into the topic here.

Did you know? Opsgenie offers a Confluence integration for post-mortems. The reports are created directly in Opsgenie and can then be transferred to Confluence. If you want to document an incident or its analysis, or record known errors and workarounds in writing, the Confluence integration for Jira Service Management is a good fit.

5 lessons learned: Minimize downtime costs with good incident management

Effective incident management is essential for every company. Although incidents and the associated downtime cannot be planned, a clearly defined process and suitable tools let you react quickly and effectively and keep costs from skyrocketing. Here are five concrete lessons that help you limit the risk of downtime and minimize its costs:

Create detailed disaster recovery plans: What do you and your team have to do in the event of an outage? These plans contain all the instructions and steps you should take to respond to an incident.

Communicate clearly and often: This applies not only to exchanges within the team, but above all to communication with affected customers. Since transparency is particularly important in (chaotic) emergency situations and builds trust, it makes sense to have a communication plan to use as a guide.

Eliminate single points of failure: Remove components from your infrastructure and processes whose failure would bring down the entire system. For example, balance loads between servers, build failsafe solutions into your deployments, and adhere to good backup practices and peer reviews.

Prioritize prevention: Incidents cannot always be avoided, but you can reduce how often they occur through prevention. Replace outdated systems and security features, and fix problems before they develop into full-blown incidents.

Do not ignore post-mortem analysis: A particular incident should not occur a second time. To prevent that, always conduct a post-mortem analysis. This thorough review also helps you learn for future incidents.
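As an illustration of the lesson on single points of failure, the following Python sketch shows the basic idea of failover: a request is tried against redundant backends in turn instead of depending on a single one. The functions `primary`, `replica` and `call_with_failover` are hypothetical stand-ins. Real setups usually delegate this to a load balancer or the platform.

```python
from typing import Callable, Sequence


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each redundant backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors}")


def primary() -> str:
    raise ConnectionError("primary is down")  # simulated outage


def replica() -> str:
    return "response from replica"


print(call_with_failover([primary, replica]))  # response from replica
```

With only `primary` in the list, the same outage would take the whole service down, which is exactly what a single point of failure means.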

