Incidents

Grading an Issue

Each incident must be assigned a priority level based on its impact on users and the number of systems affected. Each level has required response times and actions appropriate to its impact.

Please see the Digital Marketplace Incident Responses for how incidents should be graded, and the corresponding response time.

To summarise:

  • P1 - critical - complete outage
  • P2 - major - substantial degradation of service
  • P3 - significant - users experiencing intermittent or degraded service due to platform issue
  • P4 - minor - component failure that is not immediately service impacting

If the grade is P1 or P2 in particular, please refer to the P1/ P2 - Process section below.
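The grading summary above can be sketched as a small lookup. This is an illustrative sketch only, not part of any Digital Marketplace tooling: the `Priority` enum and the `process_for` helper are hypothetical names, and the descriptions are copied from the list above.

```python
from enum import Enum


class Priority(Enum):
    """Priority levels, with the descriptions summarised on this page."""

    P1 = "critical: complete outage"
    P2 = "major: substantial degradation of service"
    P3 = "significant: users experiencing intermittent or degraded service due to platform issue"
    P4 = "minor: component failure that is not immediately service impacting"


def process_for(priority: Priority) -> str:
    """Return which of the two processes on this page applies (hypothetical helper)."""
    if priority in (Priority.P1, Priority.P2):
        return "P1/P2 process"
    return "P3/P4 process"
```

The authoritative definitions and response times remain those in the Digital Marketplace Incident Responses document.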

P1/ P2

Process

Take a breath. Everything will be fine. Don’t panic.

As you’ve just found out about the issue, you may want to be the Incident & Comms Lead - you can pass the responsibility on to a Delivery Manager, Product Manager, Technical Architect or any other member of the team at any point.

If you are the incident and comms lead:

  1. Ensure your team and other teams at GDS are aware that an incident is ongoing:

    • post on the #dm-incidents channel with an @here, providing a brief description of what is happening
    • share the post on the #dm-framework channel with an @here
    • share the post on the #dm-2ndline channel with an @here
  2. Establish a tech lead - this is usually the most experienced person on 2nd line or in the wider team

If you are the technical lead, you will:

    • lead and coordinate the technical investigation (use the #dm-2ndline Slack channel for internal coordination)
    • support the incident lead and keep them updated on the technical progress
    • add to the incident report once it’s created
  3. Assess the impact of the incident and give it a priority based on Digital Marketplace Incident Responses, and communicate it on #dm-incidents
  4. Start an incident report by making a copy of the Incident Report Template, share the link on the #dm-incidents channel and ensure important events are recorded
  5. Create a Trello card tagged with the appropriate P level label on the 2ndLine Trello Board
  6. Notify stakeholders about the incident priority and impact with a brief description
  7. Discuss with CCS whether users should be notified (and how)
  8. Keep communicating with the stakeholders according to the update frequency of the incident priority (see Digital Marketplace Incident Responses), or whenever there is important news
  9. Communicate a short summary including priority and impact to the GDS #incident channel
  10. Once the incident is resolved:

    • update the Trello card with relevant details
    • notify the stakeholders you contacted at the start of the incident
    • together with the rest of the incident team, finalise the incident report, including as much relevant information as possible (e.g. Slack channel conversations)
    • add a row to the Incidents Summary spreadsheet
    • ensure the incident is brought up at the GDS Infrastructure Weekly meeting
    • communicate ongoing actions to the CCS support team where required
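As a rough illustration of the first step of the process above (making the team aware), the sketch below prepares the same announcement text for each of the three channels named in that step. It only builds strings to paste into Slack and does not call any Slack API; the function name `build_announcements` and the message wording are assumptions for illustration.

```python
from typing import Dict

# Channels named in step 1 of the P1/P2 process.
CHANNELS = ["#dm-incidents", "#dm-framework", "#dm-2ndline"]


def build_announcements(description: str) -> Dict[str, str]:
    """Return the announcement text to post in each channel (hypothetical helper)."""
    message = f"@here Incident in progress: {description.strip()}"
    return {channel: message for channel in CHANNELS}
```

Keeping the text identical across channels mirrors the guidance to post once and then share that post.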

Additional Information

This guidance is a summary of the full guidance published in the GDS - Technical Incident Management Framework and Process. If clarification is needed on any of the above steps, please refer to that document.

If the incident is a security breach, please also refer to the GDS Information Assurance - Process for Handling Security Incidents & Personal Data Breaches for detail on the additional steps that will be taken by the Information Assurance Team.

P3/ P4

Process

  1. Create a Trello card in the inbox column, tagged with the appropriate P level label, on the 2ndLine Trello Board
    • Include as much information as possible: links to status pages, links to email groups/ messages, copy-pasted Slack conversations etc.
    • Add a brief description of the issue
    • Be sure to highlight any imminent risks that might cause this issue to be upgraded (for example, a failure in our backup storage with an impending backup run, or a failure in our email service with an impending email run)
  2. Notify relevant devs with an @here in the #dm-2ndline Slack channel
  3. If the service is significantly degraded, notify the digital-marketplace Slack channel and the CCS support team
  4. Keep the ticket on the 2ndLine Trello Board up to date
  5. Resolve the incident. Again, update the Trello card with relevant details.
    • If we think there’s something we can learn from the incident, or if we think it’s worth documenting, then please do create an incident report.
  6. (Optional/ discretionary) Review the incident and create an incident report using the Incident Report Template
  7. Communicate ongoing actions to the CCS support team where required
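To make the first step of the process above more concrete, here is a minimal sketch of a card description containing the three things that step asks for: a brief description, supporting links, and any imminent risks. It produces plain text to paste into the card and does not use the Trello API; the function name and the layout of the text are assumptions, not an agreed template.

```python
from typing import List


def build_card_description(summary: str, links: List[str], imminent_risks: List[str]) -> str:
    """Assemble a plain-text card description for a P3/P4 incident (hypothetical helper)."""
    lines = [f"Summary: {summary}", "", "Supporting information:"]
    if links:
        lines.extend(f"- {link}" for link in links)
    else:
        lines.append("- (none yet)")

    # Call out risks explicitly, since they may cause the incident to be upgraded.
    lines.extend(["", "Imminent risks that could upgrade this incident:"])
    if imminent_risks:
        lines.extend(f"- {risk}" for risk in imminent_risks)
    else:
        lines.append("- none identified")
    return "\n".join(lines)
```

Listing the risks in their own section keeps them visible to whoever picks up the card next.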

Additional Information

P3 and P4 incidents are classed as not heavily limiting user experience. Often they are caused by a downstream component/ platform issue, which means that the team might not be able to resolve the issue themselves.

As such, the most important thing to do is often to document the issue thoroughly and keep on top of communication.