Stable Run Book Sample

Tip

A real run book should include links to anything relevant, such as dashboards the maintainer should check. This example omits them for the sake of simplicity. Italicised terms identify places that should be linked.

Caution

The information and instructions in this sample are completely fictional and should not be taken as true-to-life documentation of our services or of how to respond to incidents. The sample may also refer to observable metrics that we do not actually have configured for our services.

Run Book: Increased search response time

Metadata

Status: Stable

Alarm links:

Configured downtime:

18:00 UTC to 03:00 UTC, because organic traffic decreases during this window produce low-confidence standard deviation values
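
Because the configured downtime window crosses midnight, a plain "start <= now < end" comparison would never match. The following is a minimal sketch of a wrap-around check, assuming the start of the window is inclusive and the end is exclusive (the run book does not say). The function name `in_configured_downtime` is hypothetical and is not part of any real alarm configuration.

```python
from datetime import datetime, time, timezone

# Window taken from the "Configured downtime" entry above.
DOWNTIME_START = time(18, 0)
DOWNTIME_END = time(3, 0)


def in_configured_downtime(moment: datetime) -> bool:
    """Return True if `moment` falls inside the 18:00-03:00 UTC window.

    Hypothetical helper: start is treated as inclusive and end as
    exclusive, which is an assumption rather than documented behaviour.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    now = moment.astimezone(timezone.utc).time()
    # The window wraps past midnight, so it is the union of
    # [18:00, 24:00) and [00:00, 03:00).
    return now >= DOWNTIME_START or now < DOWNTIME_END
```

An alarm raised at 02:30 UTC would fall inside the window, while one raised at 12:00 UTC would not.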

Severity Guide

After confirming there is not a total outage, identify the source of the slowdown. Historically, response time increases have been caused by issues in dead link filtering. Check dead link timing first, then check for increased 5xx responses on the route, which may indicate problems completing the dead link process.

After that, check Elasticsearch pagination and total query time per search endpoint request to confirm that those are stable.

Finally, check that Postgres response times and CPU usage are stable.

If none of these are abnormal, or they are only slightly elevated, the slowdown is probably caused by an organic increase in traffic. If the increase is substantial and sustained, consider scaling the API by one task. Otherwise, investigate the traffic origin according to the traffic analysis run book and identify potential sources of organic traffic increases (conferences, etc.).
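
The order of checks in this guide can be summarised as a small decision function. This is only an illustrative sketch: the `Observations` fields and the `next_step` function are hypothetical names, the boolean inputs stand in for judgments a maintainer would make from dashboards, and the total outage branch simply notes that this guide does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Hypothetical summary of what the maintainer saw on the dashboards."""
    total_outage: bool
    dead_link_slow: bool            # dead link timing is elevated
    route_5xx_elevated: bool        # increased 5xx responses on the route
    elasticsearch_unstable: bool    # pagination or total query time
    postgres_unstable: bool         # response times or CPU usage
    increase_substantial_and_sustained: bool


def next_step(obs: Observations) -> str:
    """Walk the checks in the same order as the severity guide."""
    if obs.total_outage:
        return "total outage: outside the scope of this guide"
    if obs.dead_link_slow or obs.route_5xx_elevated:
        return "investigate dead link filtering"
    if obs.elasticsearch_unstable:
        return "investigate Elasticsearch"
    if obs.postgres_unstable:
        return "investigate Postgres"
    # Nothing abnormal: probably an organic increase in traffic.
    if obs.increase_substantial_and_sustained:
        return "consider scaling the API by one task"
    return "follow the traffic analysis run book"
```

Dead link filtering is checked first because, per the guide, it has historically been the source of response time increases.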

Related incident reports