Incidents

Grading an Issue

Each incident must be assigned a priority level based on its impact on users and the number of systems affected. Each level carries required response times and actions appropriate to the severity of the impact.

Please see the GDS Way Incident Priority Table for how incidents should be graded.

To summarise:

  • P1 - critical - complete outage (very bad)
  • P2 - major - substantial degradation of service (pretty bad)
  • P3 - significant - users experiencing intermittent or degraded service due to platform issue
  • P4 - minor - component failure that is not immediately service impacting
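To illustrate, the summary above can be captured as a small lookup helper (a hypothetical sketch for triage tooling; the function and dictionary names are illustrative, not part of the GDS Way):

```python
# Maps each priority level to its severity and impact, mirroring the
# summary list above. Names here are illustrative, not an official API.
PRIORITY_LEVELS = {
    "P1": ("critical", "complete outage"),
    "P2": ("major", "substantial degradation of service"),
    "P3": ("significant", "intermittent or degraded service due to a platform issue"),
    "P4": ("minor", "component failure that is not immediately service impacting"),
}

def describe_priority(level: str) -> str:
    """Return a one-line summary such as 'P1 - critical - complete outage'."""
    severity, impact = PRIORITY_LEVELS[level]
    return f"{level} - {severity} - {impact}"
```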

If the incident is graded P1 or P2, also follow the P1/P2 Process section below.

P1/P2

Process

  1. Take a breath. Everything will be fine. Don’t panic.
  2. Create a Trello card on the 2ndLine Trello Board, tagged with the appropriate P-level label
  3. Prioritise/triage - grade the card (see: Grading Incidents Trello Card)
  4. Notify the incident lead and the team (see: Programme Contact List)
  5. Post an @here update in the dm-2ndline Slack channel so everyone knows an incident is in progress
  6. Technical enquiry - start investigating. Keep the Trello card up to date with all relevant information
  7. Communicate - in any P1/P2 incident, notify the people listed in the relevant column of the table below. If a data breach is suspected, you must also include those in the ‘Data Breach’ column:

Note

The following links are correct at time of writing. Check the Programme Contact List for up to date contacts.

P1/P2 Incident:

  • Service Head of Product
  • GDS Incident Manager
  • GDS Head of Support Operations
  • Deputy Director of TechOps

Data Breach:

  • Lead Accreditor
  • Security Operations Team
  • Information Assurance Team

  8. Communicate a short summary and the suspected level of disruption to the GDS #incident channel.
  9. Resolve the incident, updating the Trello card with relevant details.
  10. Closure - review the incident and create an incident report using the Incident Report Template. Include relevant parties who can contribute to and learn from the review. Follow a ‘blameless post-mortem’ style.
  11. Save the incident report to the Digital Marketplace Team/Incident Reports Drive
  12. Add a row to the Incidents Summary spreadsheet
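As a hedged example of the ‘Communicate a short summary’ step above, a notification to the GDS #incident channel could be assembled for Slack’s chat.postMessage API. The helper name, channel string and message wording below are illustrative, and actually sending the message would require a real bot token:

```python
import json

def build_incident_message(priority: str, summary: str, disruption: str) -> dict:
    """Build a payload for Slack's chat.postMessage API. The channel name
    and message wording are illustrative, not mandated by this process."""
    return {
        "channel": "#incident",
        "text": (
            f"{priority} incident on the Digital Marketplace: {summary}. "
            f"Suspected level of disruption: {disruption}."
        ),
    }

# To send, POST this as JSON to https://slack.com/api/chat.postMessage
# with an "Authorization: Bearer <bot token>" header.
payload = build_incident_message("P2", "search outage", "substantial degradation of service")
print(json.dumps(payload))
```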

Additional Information

This guidance is a summary of the full guidance published in the GDS - Technical Incident Management Framework and Process. If clarification is needed on any of the above steps, please refer to that document.

If the incident is a security breach please also refer to the GDS Information Assurance - Process for Handling Security Incidents & Personal Data Breaches for detail on the additional steps that will be taken by the Information Assurance Team.

P3/P4

Process

  1. Create a Trello card in the inbox column of the 2ndLine Trello Board, tagged with the appropriate P-level label
    • Include as much information as possible: links to status pages, links to email groups/messages, copy-pasted Slack conversations, etc.
    • Add a brief description of the issue
    • Highlight any imminent risks that might cause the issue to be upgraded (for example, a failure in our backup storage just before a scheduled backup run, or a failure in our email service just before a scheduled email run)
  2. Notify relevant devs with an @here in the dm-2ndline Slack channel
  3. If the service is significantly degraded, notify the digital-marketplace Slack channel and support@digitalmarketplace.service.gov.uk
  4. Keep the ticket on the 2ndLine Trello Board up to date
  5. Resolve the incident, updating the Trello card with relevant details.
    • If there is something to learn from the incident, or it is worth documenting, create an incident report.
  6. (Optional/discretionary) Review the incident, create an incident report using the Incident Report Template and save it to the Digital Marketplace Team/Incident Reports Drive
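The card-creation step above can also be done programmatically. Below is a minimal sketch against the Trello REST API (POST to https://api.trello.com/1/cards); the credentials, list ID and label ID are placeholders that you would replace with the 2ndLine board’s real values:

```python
def build_trello_card_request(list_id: str, name: str, desc: str, label_id: str) -> dict:
    """Assemble the query parameters for creating a card via the Trello
    REST API. All IDs and credentials here are placeholders."""
    return {
        "url": "https://api.trello.com/1/cards",
        "params": {
            "idList": list_id,     # the inbox column on the 2ndLine Trello Board
            "name": name,
            "desc": desc,          # status-page links, Slack excerpts, etc.
            "idLabels": label_id,  # the appropriate P-level label
            "key": "YOUR_API_KEY",      # placeholder - a real Trello API key
            "token": "YOUR_API_TOKEN",  # placeholder - a real Trello token
        },
    }

# To create the card, send this as an HTTP POST, e.g. with the requests
# library: requests.post(req["url"], params=req["params"])
req = build_trello_card_request("inbox-list-id", "P3: slow page loads", "Details...", "p3-label-id")
```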

Additional Information

P3 and P4 incidents are classed as not heavily limiting the user experience. They are often caused by a downstream component or platform issue, which means the team might not be able to resolve the issue themselves.

As such, the most important things to do are often to document the issue thoroughly and to stay on top of communication.