Differences Explained: Incident Management versus Problem Management
Problem management is an exercise in constant maintenance: identifying and controlling problems before they become incidents.
Storm Dennis - A Major Incident
Storm Dennis, one of the most intense extratropical cyclones on record, hit the United Kingdom with force in February 2020. Heavy rainfall caused severe flooding and broke new records for river levels. Five people died in the storm and the high winds and flooding caused massive disruption to transportation services across the country.
Network Rail, a UK railway company, announced disruptions to travel due to flooding, “high and fast-flowing water,” landslips, vegetation falling on railway tracks and damage to overhead power lines. The Independent reported that in addition to the rail and ferry travel disruption, tens of thousands of people were stranded due to hundreds of UK flights being cancelled.
Storm Dennis was a major incident that affected thousands of UK businesses. In February 2020, when Storm Dennis was raging in the UK, no one yet suspected that an even larger and more disruptive incident was just around the corner - the COVID-19 pandemic. Though Storm Dennis was a less catastrophic incident than the COVID-19 pandemic, both are examples of unpredictable forces whose symptoms need to be urgently treated - in other words, ‘incidents’.
In February 2020, many thousands of people in the UK responsible for their organisation’s business continuity or security management had their hands full due to the disruption that Storm Dennis caused. The large-scale travel disruption and power outages had tangible impacts on business revenue and operations. The way that different organisations managed the storm had huge implications for their reputation, employees, customers and stakeholders.
Climate change is making storms such as Storm Dennis more frequent and more severe, but this is just one of many security risks that are on the rise in today’s globalised world. Civil unrest, terrorism, and public violence are also increasing in frequency. Unforeseen pandemics like COVID-19, for which few organisations had planned, have left no corner of the globe untouched. For this reason, it is more vital than ever for all organisations to have a solid plan in place for both incident and problem management, and to understand the difference between the two.
What is Incident Management?
To understand how to manage an incident, it is first important to distinguish between ‘incident’ and ‘problem’.
The ISO/IEC 20000:2018 service management standard defines an incident as one of the following:
- An unplanned interruption to a service
- A reduction in the quality of a service
- An event that has not yet impacted the service to the customer or user
Speed is of paramount importance in managing incidents. In the Storm Dennis example outlined above, speedy communication with stakeholders and prompt resolution of the issues caused by the storm minimised the scale of its disruption. Incidents have a direct impact on customers and users, which is why restoring normal service delivery as quickly as possible is the top priority. In many cases, every minute of disrupted service results in tangible monetary losses for an organisation.
In the Storm Dennis example, the prompt work of engineers in clearing vegetation off the rails and ensuring that train signalling systems were working correctly illustrates a sensible and quick response to an incident. Network Rail’s communications over social media and on its website about the state of the transportation system illustrate clear communication with stakeholders.
Incident Management: The Lifecycle Approach
The lifecycle approach to incident management is a framework for managing incidents in service management. The lifecycle approach involves the following steps: logging, categorisation, prioritisation, escalation, resolution, and closure. Following these steps can help organisations standardise and effectively manage and mitigate the consequences of a variety of incidents.
Logging, categorisation and prioritisation
- These first three steps focus on identifying and documenting the incident and determining the response.
- Determining the response means deciding both the level of communication and the speed of response. For example, if the incident has enormous repercussions for the business (like Storm Dennis did for Network Rail), the incident will be prioritised above all other activities and projects of the business. Everyone’s time and attention will likely be directed towards resolving the incident.
- However, in other scenarios, this would not be appropriate. Take the example of a bank with thousands of branches around the world. If there were a terrorist attack near one of the bank’s branches in Indonesia, the team focused on determining the response to this incident would likely be the security team that focuses on South East Asia. This incident would be unlikely to have the same "all hands on deck" approach because it is more localised.
Escalation
- Escalation refers to bringing the incident to the attention of people in an organisation who have more specialised functions or greater decision-making power.
- These people will make tough decisions about communicating with stakeholders such as shareholders or regulators. The more senior employees will also be in a position to approve emergency resource allocation or last-minute changes.
Resolution
- Resolving the incident can be done in a myriad of ways depending on what happened. Some common forms of resolution include invoking playbooks or measures designed to resolve a particular incident.
- Other methods involve setting up support teams or even designing a self-service tool that end-users can use to resolve their queries or issues themselves.
Closure
- Closing the incident happens once two conditions are met. Firstly, normal service has resumed. Secondly, all stakeholders, customers or users have been communicated with and have confirmed they are satisfied.
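The lifecycle steps above can be pictured as a simple state machine. The following is a minimal Python sketch rather than a reference implementation: the `Incident` class, the `Stage` names, and the `PRIORITY_MATRIX` values are all hypothetical, and the sketch assumes that every incident passes through each stage in strict order, which a real organisation may well relax (for example, by skipping escalation for a localised incident).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    LOGGED = 1
    CATEGORISED = 2
    PRIORITISED = 3
    ESCALATED = 4
    RESOLVED = 5
    CLOSED = 6


# Hypothetical priority matrix (an assumption): 1 is the most urgent.
PRIORITY_MATRIX = {
    ("high", "organisation-wide"): 1,
    ("high", "localised"): 2,
    ("low", "organisation-wide"): 3,
    ("low", "localised"): 4,
}


@dataclass
class Incident:
    summary: str
    stage: Stage = Stage.LOGGED  # creating the record is the logging step
    category: Optional[str] = None
    priority: Optional[int] = None

    def _advance(self, target: Stage) -> None:
        # Enforce the lifecycle order: each stage must follow the previous one.
        if target != self.stage + 1:
            raise ValueError(f"cannot move from {self.stage.name} to {target.name}")
        self.stage = target

    def categorise(self, category: str) -> None:
        self._advance(Stage.CATEGORISED)
        self.category = category

    def prioritise(self, impact: str, scope: str) -> None:
        # Look the priority up first so an unknown combination changes nothing.
        priority = PRIORITY_MATRIX[(impact, scope)]
        self._advance(Stage.PRIORITISED)
        self.priority = priority

    def escalate(self) -> None:
        self._advance(Stage.ESCALATED)

    def resolve(self) -> None:
        self._advance(Stage.RESOLVED)

    def close(self, service_restored: bool, stakeholders_confirmed: bool) -> None:
        # Closure requires both conditions described in the text.
        if not (service_restored and stakeholders_confirmed):
            raise ValueError("incident cannot be closed yet")
        self._advance(Stage.CLOSED)


# Example: a storm-related incident walked through every stage.
storm = Incident("Flooding closes rail lines")
storm.categorise("severe weather")
storm.prioritise("high", "organisation-wide")
storm.escalate()
storm.resolve()
storm.close(service_restored=True, stakeholders_confirmed=True)
```

Keeping the closure conditions as explicit arguments makes it hard to close an incident while service is still degraded or before stakeholders have confirmed they are satisfied.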
There are two main components to the successful handling of an incident. These are significant, timely and accurate communication and effective cross-functional collaboration.
One common collaboration technique used in incident management is bringing all the key stakeholders together into one space. They collaboratively diagnose and divide responsibilities most appropriately for a quick resolution.
Best practices for communication centre on communicating clearly and consistently throughout the incident. A constant line of communication from the business to its stakeholders and customers, even before resolution or closure, assures customers or stakeholders that the incident has been identified and is being dealt with as a priority.
What is Problem Management?
ISO/IEC 20000:2018 defines a “problem” as “a cause of one or more actual or potential incidents.”
The reason why organisations should “problem manage” in addition to “incident manage” is to “reduce the likelihood and impact of incidents by identifying actual and potential causes of incidents and managing workarounds and known errors.” Essentially, problem management can prevent incidents before they escalate.
Therefore, problem management can be understood as proactive, rather than reactive. While incident management focuses on speed of resolution before the damage becomes even worse, problem management is an exercise in constant maintenance: identifying and controlling problems before they become incidents.
An example of problem management following Storm Dennis would look something like this. Recognising that strong storms are becoming more frequent and more severe, train service providers would put measures in place to prepare for the next storm. These measures would include long-term improvements that make the service more resistant to severe storms, such as building flood barriers or clearing away vegetation that is likely to fall on the tracks. These barriers would not just be put up once and forgotten about; in line with effective problem management, they would be constantly inspected for faults and maintained in good working order.
Problem Management: A Lifecycle Approach
Problem management follows a lifecycle approach just as incident management does, although the steps are slightly different. The main components of the problem management lifecycle are identifying the problem, controlling the problem, and controlling the error.
Identifying the problem
In problem management, identifying the problem involves logging, categorisation and prioritisation of the problem, just as in incident management.
Controlling the problem
Problem control focuses on analysing known errors, or situations that are likely to come up and cause issues or business disruption, and working out how to deal with them. Documenting the playbooks around these issues happens during this step. Having playbooks prepared for a variety of scenarios minimises the negative effects that incidents will have.
Controlling the error
Error control refers to the constant assessment of workaround effectiveness. After every incident, root causes need to be examined and problem management records updated to correct any inaccuracies or gaps in current problem management systems.
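One way to picture problem identification is as a pass over past incident records that groups them by suspected cause. The Python sketch below is illustrative only: the `Problem` class, the `identify_problems` function, the `min_incidents` threshold, and the sample log are all hypothetical, and treating a problem as a known error once a workaround has been documented is a simplifying assumption rather than a definition taken from the standard.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class Problem:
    cause: str
    incident_ids: List[str] = field(default_factory=list)
    workaround: Optional[str] = None

    @property
    def is_known_error(self) -> bool:
        # Simplifying assumption: a documented workaround makes a known error.
        return self.workaround is not None


def identify_problems(
    incidents: Iterable[Tuple[str, str]], min_incidents: int = 2
) -> List[Problem]:
    """Group (incident_id, suspected_cause) pairs and keep recurring causes."""
    by_cause = defaultdict(list)
    for incident_id, cause in incidents:
        by_cause[cause].append(incident_id)
    problems = [
        Problem(cause, ids)
        for cause, ids in by_cause.items()
        if len(ids) >= min_incidents
    ]
    # Prioritise the causes that sit behind the most incidents.
    return sorted(problems, key=lambda p: len(p.incident_ids), reverse=True)


# Hypothetical incident log, made up for illustration.
log = [
    ("INC-1", "vegetation on track"),
    ("INC-2", "flooded track"),
    ("INC-3", "vegetation on track"),
    ("INC-4", "vegetation on track"),
    ("INC-5", "flooded track"),
    ("INC-6", "signal fault"),
]

problems = identify_problems(log)

# Problem control: document a workaround for the most frequent cause.
problems[0].workaround = "Run a clearance train ahead of the first service"
```

The threshold of two incidents is arbitrary. Because a problem can also be the cause of a potential incident, a single serious near miss may justify a problem record in practice.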
There are many techniques organisations use to identify the foundational causes of incidents. The following are common ways to assess the root causes or contributing factors that lead to problems:
The 5 Whys
This is an iterative interrogative technique that focuses on cause-and-effect relationships.
Events and Causal Factors Charting
This method identifies and documents the events in a sequence as they occurred to try to identify which conditions contributed to the incident.
The Kepner-Tregoe Method
This is a systematic technique focused on making the best decision in contexts where the repercussions of choices can be unclear.
Multiple Events Sequencing
Multiple event sequencing is useful in analysing causal chains.
Fishbone (Ishikawa) Diagrams
These are causal diagrams that resemble the skeleton of a fish and aim to identify the factors that lead to a particular outcome.
Fault Tree Analysis
This is a top-down tree-shaped analysis that relies on deductive reasoning to understand why something happened.
Root Cause Analysis
This philosophy focuses strongly on problem management. It values addressing the root causes that led to a problem instead of treating the symptoms of a particular incident.
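Of the techniques listed above, fault tree analysis lends itself most readily to a short sketch. The Python example below is a minimal illustration under stated assumptions: the `Event` class, the `occurs` function, and the storm-related tree are hypothetical and do not model any real railway. The top event is broken down through AND and OR gates into basic events, and the tree is then evaluated from the top down against a set of observed conditions.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    name: str
    gate: str = "BASIC"  # one of "BASIC", "AND", "OR"
    children: List["Event"] = field(default_factory=list)


def occurs(event: Event, observed: Set[str]) -> bool:
    """Return True if the event follows from the observed basic events."""
    if event.gate == "BASIC":
        return event.name in observed
    results = [occurs(child, observed) for child in event.children]
    if event.gate == "AND":
        return all(results)  # every contributing event is required
    if event.gate == "OR":
        return any(results)  # any single contributing event is enough
    raise ValueError(f"unknown gate: {event.gate}")


# Hypothetical fault tree for the top event "line closed".
track_flooded = Event(
    "track flooded",
    gate="AND",
    children=[Event("heavy rainfall"), Event("drainage blocked")],
)
line_closed = Event(
    "line closed",
    gate="OR",
    children=[
        track_flooded,
        Event("vegetation on track"),
        Event("overhead line damaged"),
    ],
)

# Heavy rain alone does not close the line in this tree; blocked drainage
# must also be present, which points to drainage as something to manage.
rain_only = occurs(line_closed, {"heavy rainfall"})
rain_and_blockage = occurs(line_closed, {"heavy rainfall", "drainage blocked"})
```

Reading the tree this way shows which combinations of conditions lead to the top event, and therefore which underlying causes are worth addressing before the next storm.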
Incident management and problem management are both integral parts of ensuring a good reputation and solid business continuity in the face of risks.
Because it is proactive and limits the number and severity of incidents before they have an opportunity to damage the business, problem management is arguably the most important exercise in minimising business disruption caused by incidents. However, its impact is harder to measure, which leads many organisations to put more emphasis on incident management.
It is difficult to know how many incidents good problem management has prevented, because once a problem has been managed effectively, one cannot be sure what would have happened had that problem management not occurred. In scientific terms, there is no "control group." Preventative problem management can also be harder to appreciate because psychologically we are wired to respond to loss more strongly than to gain.
Creating effective reward structures that boost innovation and creative thinking in problem-solving is a crucial component in minimising business disruption. Organisations need to ensure that they are supporting problem management efforts focused on identifying the root causes of incidents and taking measures to mitigate or get rid of those root causes before an incident devastates the business.
Through effective problem and incident management, businesses are made more resilient and stakeholders and customers are more satisfied.
To find out how WatchKeeper can help your organisation become more resilient, please visit https://watchkeeper.com or email us at email@example.com.