Incidents

Grading an incident

Each incident must be assigned a priority level from 1 to 4, based on:

  • the number of users affected
  • the impact on those users
  • whether one or more critical journeys have been disrupted (critical journeys change throughout the year depending on the frameworks’ lifecycles).

The priority determines the required response time.

Roughly speaking, priorities are as follows:

  • P1 is critical: a complete outage, critical journeys are disrupted or there is a major security breach
  • P2 is major: a substantial degradation of service
  • P3 is significant: users are experiencing intermittent or degraded service due to a platform issue
  • P4 is minor: component failure that does not immediately affect the service

The incident lead may change the priority level during the incident if its impact changes or becomes clearer.
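As a rough illustration of the grading rules above, the sketch below maps an assessment to a priority level. The names Priority, Assessment and grade, and all the field names, are hypothetical and only mirror the wording of the list. Real grading also weighs the number of users affected and remains a judgment call for the incident lead.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    P1 = 1  # critical
    P2 = 2  # major
    P3 = 3  # significant
    P4 = 4  # minor


@dataclass
class Assessment:
    # Hypothetical flags, one per condition named in the priority list.
    complete_outage: bool = False
    critical_journey_disrupted: bool = False
    major_security_breach: bool = False
    substantial_degradation: bool = False
    intermittent_or_degraded: bool = False


def grade(a: Assessment) -> Priority:
    """Return the highest priority whose condition is met."""
    if a.complete_outage or a.critical_journey_disrupted or a.major_security_breach:
        return Priority.P1
    if a.substantial_degradation:
        return Priority.P2
    if a.intermittent_or_degraded:
        return Priority.P3
    # Nothing user-facing yet, for example a component failure.
    return Priority.P4


# The priority can be re-graded as the impact becomes clearer.
initial = grade(Assessment(intermittent_or_degraded=True))      # Priority.P3
revised = grade(Assessment(critical_journey_disrupted=True))    # Priority.P1
```

Checking the conditions from most to least severe means that an incident matching several descriptions is always graded at the highest applicable level.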

Examples of priorities and response times

P1 and P2 process

Take a breath. Everything will be fine.

If you’ve just found out about an incident, you may want to be the incident and comms lead. You can pass the responsibility on to a Delivery Manager, Product Manager, Technical Architect or other member of the team at any point.

If you are the incident and comms lead you will:

  1. Ensure people are aware that there is an incident:

    • post on the #dm-incidents Slack channel with an @here with a brief description of what is happening
    • share the post on #dm-framework Slack channel with an @here
  2. Establish a technical lead. This is usually the most experienced person on 2nd line or in the wider team.

If you are the technical lead you will:

  • lead and coordinate the technical investigation
  • support the incident lead and keep them updated on the technical progress
  • add to the incident report once it’s created
  3. Assess the impact of the incident and grade it, based on Digital Marketplace Incident Responses

    • communicate it on the #dm-incidents channel
  4. Start an incident report by making a copy of the Incident Report Template

    • share the link on the #dm-incidents channel so that the incident team can contribute
    • work with the team to fill in the overview and impact sections
    • record important events and changes in the timeline
  5. Establish a person to delegate comms to, who will become the comms lead

If you are the comms lead you will:

  1. Create a Trello card tagged with the appropriate P level label on the 2ndLine Trello Board
  2. Once the incident is resolved:

    • ensure stakeholders have been notified
    • update the Trello card with relevant details
    • together with the rest of the incident team, finalise the incident report, including all relevant information (such as Slack channel conversations)
    • add an entry to the Incidents Summary spreadsheet
    • arrange an incident review meeting with the incident team at a minimum (everyone in the wider team should know that the meeting is taking place and may request to be invited)
    • ensure the incident is raised at the GDS Infrastructure Weekly meeting
    • communicate ongoing actions to the CCS support team where required

Additional information

This guidance is a summary of the GDS - Technical Incident Management Framework and Process. Refer to that document for any clarification or additional detail.

If the incident is a security breach, also refer to the GDS Information Assurance - Process for Handling Security Incidents & Personal Data Breaches for detail on what the Information Assurance team will do.

P3 and P4 process

  1. Create a Trello card tagged with the appropriate P level label on the 2ndLine Trello Board
    • add a description of the issue
    • include as much information as possible: links to status pages, links to email groups and messages, and copy-pasted Slack conversations
    • highlight whether or not there are any imminent risks that might cause this issue to be upgraded (for example, a failure in the backup storage and an impending backup run or a failure in the email service and an impending email run)
  2. Notify relevant developers with an @here in the #dm-2ndline Slack channel
  3. If the service is significantly degraded, notify the #dm-framework Slack channel and the CCS support team
  4. Keep the Trello card up to date
  5. Once the incident is resolved
    • let the CCS support team know
    • create an incident report and hold an informal incident review if there is something to learn from the incident
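The Slack notification steps in the two processes can be summarised in a short sketch. The function name channels_to_notify and its parameters are hypothetical. Only the channel names come from the guidance above, and notifying the CCS support team is not modelled here.

```python
def channels_to_notify(priority: int, significantly_degraded: bool = False) -> list[str]:
    """Return the Slack channels to alert for a given priority level (1 to 4)."""
    if priority not in (1, 2, 3, 4):
        raise ValueError(f"priority must be 1 to 4, got {priority!r}")

    if priority <= 2:
        # P1 and P2: announce on #dm-incidents, then share on #dm-framework.
        return ["#dm-incidents", "#dm-framework"]

    # P3 and P4: notify developers on #dm-2ndline.
    channels = ["#dm-2ndline"]
    if significantly_degraded:
        # Escalate visibility when the service is significantly degraded.
        channels.append("#dm-framework")
    return channels


print(channels_to_notify(1))
# ['#dm-incidents', '#dm-framework']
print(channels_to_notify(3, significantly_degraded=True))
# ['#dm-2ndline', '#dm-framework']
```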