Each incident must be assigned a priority level based on its impact on users and the number of systems affected. Each level has required response times and actions appropriate to its impact.
Please see the GDS Way Incident Priority Table for how incidents should be graded.
- P1 - critical - complete outage
- P2 - major - substantial degradation of service
- P3 - significant - users experiencing intermittent or degraded service due to platform issue
- P4 - minor - component failure that is not immediately service impacting
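The four levels above can be sketched as a small lookup, which is handy if incident tooling needs to decide which process applies. This is a minimal illustration only: the `Priority` enum and the `follows_p1_p2_process` helper are hypothetical names, not part of any existing team tooling, and the GDS Way Incident Priority Table remains the authoritative source for grading.

```python
from enum import Enum


class Priority(Enum):
    """Incident priority levels, copied from the summary list above."""

    P1 = "critical - complete outage"
    P2 = "major - substantial degradation of service"
    P3 = "significant - users experiencing intermittent or degraded service due to platform issue"
    P4 = "minor - component failure that is not immediately service impacting"


def follows_p1_p2_process(priority: Priority) -> bool:
    """Return True when the incident should follow the P1/P2 process."""
    return priority in (Priority.P1, Priority.P2)
```

For example, `follows_p1_p2_process(Priority.P3)` returns `False`, so a P3 incident would follow the lighter P3/P4 process.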
If the grade is P1 or P2 in particular, please refer to the P1/P2 process section below.
- Take a breath. Everything will be fine. Don’t panic.
- Create a Trello card tagged with the appropriate P level label on the 2ndLine Trello Board
- Prioritise/triage - grade the card (see: Grading Incidents Trello Card)
- Notify stakeholders about incident priority and impact with a brief description
  - GDS - send an email to firstname.lastname@example.org
  - programme - send an email to email@example.com
  - team - send an email to the team members on the Digital Marketplace contacts list
  - CCS - send an email to all the contacts on the CCS contacts list
  - if the incident involves a data or security breach, notify the Cyber Security team. Contact them using the #cyber-security-help Slack channel (don’t include any sensitive information!) or via the contact details in the GDS Rotas app https://rotas.cloudapps.digital/teams/cyber-security
  - if a data breach is suspected you must include: Lead Accreditor, Security Operations Team, Information Assurance Team
- Update the dm-2ndline Slack Channel with an @here to ensure everyone knows an incident is happening or has happened
- Technical enquiry - start investigating. Please ensure the Trello card is kept up to date with all relevant information
- Keep communicating with the stakeholders according to the update frequency of the incident priority (every hour for P1 and every two hours for P2), or whenever there is important news
- Communicate a short summary and the suspected level of disruption to the GDS #incident channel.
- Resolve the incident:
  - Update the Trello card with relevant details
  - Notify the stakeholders you contacted at the start of the incident
- Closure - review the incident and create an incident report using the Incident Report Template. When reviewing, please include relevant parties who can contribute to and learn from the review. Follow a ‘blameless post-mortem’ style.
  - Save the incident report to the Digital Marketplace Team/Incident Reports Drive
  - Add a row to the Incidents Summary spreadsheet
  - Bring up the incident at the GDS Infrastructure Weekly meeting.
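The stakeholder update cadence in the process above (every hour for P1, every two hours for P2) can be sketched as a small helper that works out when the next update is due. This is an illustrative sketch under assumptions: `UPDATE_INTERVALS` and `next_update_due` are hypothetical names, and the function returns `None` for P3/P4 because this guidance does not state a fixed interval for them.

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals taken from the P1/P2 process above.
# P3 and P4 are deliberately absent: no fixed interval is stated for them.
UPDATE_INTERVALS = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=2),
}


def next_update_due(priority: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next stakeholder update is due, or None if no interval applies."""
    interval = UPDATE_INTERVALS.get(priority)
    if interval is None:
        return None
    return last_update + interval
```

Important news should still be communicated as soon as it arrives; the interval is only the longest gap between updates.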
This guidance is a summary of the full guidance published in the GDS - Technical Incident Management Framework and Process. If clarification is needed on any of the above steps, please refer to that document.
If the incident is a security breach please also refer to the GDS Information Assurance - Process for Handling Security Incidents & Personal Data Breaches for detail on the additional steps that will be taken by the Information Assurance Team.
- Create a Trello card in the inbox column tagged with the appropriate P level label on the 2ndLine Trello Board
  - Include as much information as possible: links to status pages, links to email groups/messages, copy-pasted Slack conversations, etc.
  - Add a brief description of the issue
  - Be sure to highlight whether or not there are any imminent risks that might cause this issue to be upgraded (for example, a failure in our backup storage and an impending backup run, or a failure in our email service and an impending email run)
- Notify relevant devs with an @here in the dm-2ndline Slack Channel
- If the service is significantly degraded notify the digital-marketplace Slack Channel and firstname.lastname@example.org
- Keep the ticket on the 2ndLine Trello Board up to date
- Resolve the incident. Again, update the Trello card with relevant details.
- If we think there’s something we can learn from the incident or if we think it’s worthy of documentation, then please do create an incident report.
- (Optional/ discretionary) Review the incident, create an incident report using the Incident Report Template and save the incident report to the Digital Marketplace Team/Incident Reports Drive
P3 and P4 incidents are classed as not heavily limiting user experience. Often they will be caused by a downstream component or platform issue, which means that the team might not be able to resolve the issue themselves.
As such, the most important thing is often to document the issue thoroughly and keep on top of communication.
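To help document P3/P4 issues consistently, the card contents listed in the process above (a brief description, supporting links and any imminent risks) could be assembled with a small helper. This is only a sketch: `build_card_description` and its output layout are hypothetical, and it produces plain text to paste into a card rather than calling the Trello API.

```python
from typing import List, Optional


def build_card_description(
    summary: str,
    links: List[str],
    imminent_risk: Optional[str] = None,
) -> str:
    """Build plain-text card contents: summary, supporting links, imminent risks."""
    lines = [f"Summary: {summary}"]
    if links:
        lines.append("Links:")
        lines.extend(f"- {link}" for link in links)
    # Always state the risk position explicitly, even when nothing is known.
    lines.append(f"Imminent risks: {imminent_risk or 'none identified'}")
    return "\n".join(lines)
```

Stating the imminent risks on every card, even as "none identified", makes it easier for whoever picks up the card to judge whether the incident might need upgrading.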