Run Book: API Production HTTP 5XX responses count above threshold
Severity Guide
After confirming there is no total outage, check whether the increase in 5XX HTTP errors coincides with a known period of constrained resources, such as a recent deployment, a data refresh, or DB maintenance. If the spike is related to one of these events and the alarm stabilizes within a short time, the severity is low.
If the issue is not related to a known recurring event and it persists, the severity is critical. Check whether the API service's dependencies (DB, Redis, Elasticsearch) are available to the API, or whether the problem lies in the API itself. To gather more information, check the log group and use the “Logs Insights” view to query for failed requests. A CloudWatch query similar to the following can give more hints about where the problem is:
fields request, @timestamp, @message
| filter status >= 500
| sort @timestamp desc
| limit 20
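If the console is slow or unavailable during an incident, the same query can be run from a script. The following is a minimal sketch, not an established part of this runbook: it assumes boto3 is installed and credentials are configured, and the function name `fetch_recent_5xx`, the 30-minute window, and the log group name in the usage comment are all hypothetical placeholders.

```python
import time

# Same Logs Insights query as shown in this runbook.
QUERY = """fields request, @timestamp, @message
| filter status >= 500
| sort @timestamp desc
| limit 20"""


def fetch_recent_5xx(logs_client, log_group, minutes=30, poll_seconds=1.0,
                     max_wait_seconds=60.0, now=None):
    """Run QUERY over the last `minutes` and return rows as dicts.

    `logs_client` is expected to behave like boto3.client("logs").
    The window and timeouts are assumptions; adjust them to the incident.
    """
    end = int(time.time() if now is None else now)
    start = end - minutes * 60

    # Logs Insights queries are asynchronous: start, then poll for results.
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=QUERY,
    )["queryId"]

    deadline = time.monotonic() + max_wait_seconds
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_id} did not finish in time")
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended as {response['status']}")

    # Each result row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]


# Example usage (the log group name is a hypothetical placeholder):
#   import boto3
#   rows = fetch_recent_5xx(boto3.client("logs"), "/hypothetical/api-production")
#   for row in rows:
#       print(row.get("@timestamp"), row.get("request"))
```

Passing the client in as a parameter keeps the function independent of any particular AWS profile or region, which are not specified in this runbook.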
Historical false positives
Nothing registered to date.