Responding to CloudWatch alerts

We have three types of CloudWatch alerts that get sent to the #dm-2ndline Slack channel. These should be investigated by the 2nd line developers as soon as possible after they appear.

See Alerting for more information on how the alerts are set up.

Investigating alerts

Production-500s

  • In Kibana, use the ‘5xx requests’ shortcut to find the requests that caused the error. The logs should show what the user was doing at the time and how many requests failed (see the sketch after this list for running an equivalent query directly against Elasticsearch).

  • Try to determine the user impact from the logs. Check whether the request was made by a human user or by a script/smoke test and, if possible, whether it was subsequently retried successfully.

  • If you cannot determine the cause from Kibana, check app metrics on the Grafana dashboards. Look for recent crashes or a high number of concurrent requests.
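
If you need to go beyond the Kibana shortcut (for example to script a quick check), the sketch below runs an equivalent query against Elasticsearch directly. This is a minimal sketch: the endpoint URL, index pattern and field names (status, path, user_agent) are assumptions and will need adjusting to match our logging setup.

```python
# Sketch: list recent 5xx responses straight from Elasticsearch.
# The URL, index pattern and field names are assumptions -- adjust them
# to match the real logging configuration.
import requests

ES_URL = "https://logs.example.internal:9200"   # hypothetical endpoint
INDEX = "logstash-*"                            # hypothetical index pattern

query = {
    "size": 20,
    "sort": [{"@timestamp": {"order": "desc"}}],
    "query": {
        "bool": {
            "filter": [
                {"range": {"status": {"gte": 500, "lte": 599}}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("status"), src.get("path"), src.get("user_agent"))
```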

Production-router-slow-requests-gt10s

  • Check Kibana as above (using the ‘Slow requests’ shortcut) and see whether particular endpoints are unusually slow.

  • Check the app metrics on Grafana as above.
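
If the dashboards don’t make it obvious when the slowness started, you can also pull the alarm’s recent state changes with boto3, as in the sketch below. It assumes the CloudWatch alarm name matches the alert title and that the region and credentials come from the usual environment or profile.

```python
# Sketch: inspect the state and recent history of the slow-request alarm.
# Region and credentials are assumed to come from the normal AWS profile;
# the alarm name is taken from the alert title.
import boto3

ALARM_NAME = "Production-router-slow-requests-gt10s"
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # assumed region

alarms = cloudwatch.describe_alarms(AlarmNames=[ALARM_NAME])
for alarm in alarms["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateValue"], alarm["StateReason"])

history = cloudwatch.describe_alarm_history(
    AlarmName=ALARM_NAME,
    HistoryItemType="StateUpdate",
    MaxRecords=10,
)
for item in history["AlarmHistoryItems"]:
    print(item["Timestamp"], item["HistorySummary"])
```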

Production-429s

  • Check Kibana as above, but include the User Agent field in the query to check whether the requests are coming from a human or a bot (see the sketch after this list for breaking down 429s by user agent).

  • If a human user is seeing 429 errors through normal behaviour, you may want to adjust the router app rate limiting settings.
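
A terms aggregation over the user agent field makes it easy to see whether the 429s are coming from a single bot or from many human users. As with the earlier sketch, the endpoint, index pattern and field names (including the user_agent.keyword mapping) are assumptions.

```python
# Sketch: break down recent 429 responses by user agent to spot bots.
# The URL, index pattern and field names are assumptions -- adjust them
# to match the real logging configuration.
import requests

ES_URL = "https://logs.example.internal:9200"   # hypothetical endpoint
INDEX = "logstash-*"                            # hypothetical index pattern

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": 429}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        # Assumes the user agent is indexed with a .keyword sub-field.
        "by_user_agent": {"terms": {"field": "user_agent.keyword", "size": 10}}
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_user_agent"]["buckets"]:
    print(bucket["doc_count"], bucket["key"])
```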

Next steps

For all the error types above, once you’ve identified the cause, reply to the Slack alert (in a thread) to inform the team. Ask for help if you’re stuck!

If the error was due to a bug, create a card either on the 2nd line Trello Board (for urgent problems) or on the Tech Debt Trello board (for non-urgent problems).

If the error has a low impact or is intermittent, consider adding it to the 2nd line Trello Board ‘watchlist’ and monitoring for a week or two. If the problem gets worse, there will be a record of what’s happened so far; if it doesn’t recur, the card can be moved to ‘Done’.

For ongoing issues with a high user impact, follow the incident process.