Responding to CloudWatch alerts
We have three types of CloudWatch alerts that get sent to the #dm-2ndline Slack channel. These should be investigated by the 2nd line developers as soon as possible after they appear.
See Alerting for more information on how the alerts are set up.
- On Kibana, use the ‘5xx requests’ shortcut to find the request that caused the error. This should provide information about what the user was doing at the time, and how many requests failed.
- Try to determine the user impact from the logs. Look at whether the request was made by a human user, or a script/smoke test. Also, if possible, check if the request was subsequently retried successfully.
- If you cannot determine the cause from Kibana, check app metrics on the Grafana dashboards. Look for recent crashes or a high number of concurrent requests.
- For slow requests, check Kibana as above (using the ‘Slow requests’ shortcut) and see whether the affected endpoints are unusually slow.
- Check the app metrics on Grafana as above.
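The user-impact check for 5xx errors described above (human user versus script or smoke test, and whether the request was later retried successfully) could be sketched roughly as below, for log records exported from Kibana. This is an illustration only, not part of our tooling: the field names (`time`, `method`, `path`, `status`, `user_agent`, `user`), the `AUTOMATED_MARKERS` list and both function names are assumptions, so adjust them to match the fields our logs actually contain.

```python
from collections import Counter

# Hypothetical user-agent fragments that mark automated traffic; replace
# these with whatever your scripts and smoke tests actually send.
AUTOMATED_MARKERS = ("python-requests", "curl", "smoke")


def is_automated(user_agent):
    """Guess whether a request came from a script or smoke test."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in AUTOMATED_MARKERS)


def summarise_failures(records):
    """Summarise the 5xx responses in a list of log records.

    Each record is a dict with assumed keys: 'time' (sortable), 'method',
    'path', 'status' (int), 'user_agent' and 'user'.
    """
    ordered = sorted(records, key=lambda r: r["time"])
    summary = Counter()
    for i, rec in enumerate(ordered):
        if not 500 <= rec["status"] < 600:
            continue
        summary["failed"] += 1
        kind = "automated" if is_automated(rec.get("user_agent")) else "human"
        summary[kind] += 1
        # Treat a later 2xx/3xx response to the same user, method and path
        # as a successful retry of the failed request.
        retried = any(
            later["user"] == rec["user"]
            and later["method"] == rec["method"]
            and later["path"] == rec["path"]
            and later["status"] < 400
            for later in ordered[i + 1:]
        )
        if retried:
            summary["retried_ok"] += 1
    return dict(summary)
```

A summary with a high `human` count and a low `retried_ok` count suggests real user impact; failures that are mostly `automated` or were retried successfully are usually lower priority.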
For all the error types above, once you’ve identified the cause, reply to the Slack alert (in a thread) to inform the team. Ask for help if you’re stuck!
If the error has a low impact or is intermittent, consider adding it to the ‘watchlist’ on the 2nd line Trello board and monitoring it for a week or two. If the problem gets worse, there will be a record of what has happened so far. If the problem doesn’t recur, the card can be moved to ‘Done’.
For ongoing issues with a high user impact, follow the incident process.