Slow requests and connectivity issues

Incident Report for ProcedureFlow

Postmortem

On April 23, 2020, between 11:43 AM and 12:07 PM Eastern, we experienced slow requests, connectivity problems, and a sporadic outage lasting ~13 minutes that affected all customers. We're very sorry this occurred and would like to share what happened, what we learned, and what we're doing to prevent incidents like this in the future.

This incident was caused by cascading problems: reduced capacity during a deployment, normal daily peak load, poor failure modes for certain requests, and several slow requests which caused other requests to be blocked.

We've learned from this incident and have identified several changes that we're making to prevent a recurrence:

We will provision additional capacity during maintenance and automated deployments so that we can handle the maximum possible load.
We will change a few low-priority requests so that they fail gracefully under load, rather than overloading our own service and making the failure worse.
We're going to ensure certain error messages are more user-friendly with links to our status page, support contact information, etc. rather than showing a bleak "504 error".
We've prioritized fixing known slow requests that exacerbated this issue.
We're improving our monitoring and alerting on specific metrics that can give us more insight into issues like this.
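
To illustrate the idea of a better failure mode for low-priority requests, here is a minimal sketch of priority-aware load shedding. It is an illustration under stated assumptions, not a description of ProcedureFlow's actual implementation: the `LoadShedder` class, the `handle` helper, and the limits used are all hypothetical. The sketch caps the number of requests in flight and applies a lower cap to low-priority requests, so that they are rejected quickly with a 503 while headroom remains for everything else.

```python
import threading


class LoadShedder:
    """Hypothetical admission gate: caps in-flight requests and
    rejects low-priority requests before higher-priority ones."""

    def __init__(self, max_in_flight, low_priority_limit):
        # Assumption: low_priority_limit < max_in_flight, which reserves
        # headroom for requests that are not low priority.
        self.max_in_flight = max_in_flight
        self.low_priority_limit = low_priority_limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, low_priority):
        """Return True if the request may proceed, False if it is shed."""
        limit = self.low_priority_limit if low_priority else self.max_in_flight
        with self._lock:
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder, low_priority, work):
    """Run `work` if admitted; otherwise fail fast with a 503 and a hint."""
    if not shedder.try_acquire(low_priority):
        return 503, "Service busy, please retry shortly"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting a request immediately costs far less than letting it queue behind slow requests until it times out, which is the design choice this sketch is meant to show. In practice the limits would need to be tuned against measured capacity.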

We know how much you rely on ProcedureFlow to help your business succeed. We let you down with this incident, but we will continue to analyze this event for opportunities to serve you better and earn the trust you place in us.

Posted May 04, 2020 - 17:52 UTC

Resolved

This incident has been resolved.

Posted Apr 23, 2020 - 17:46 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 23, 2020 - 16:29 UTC

Identified

We have identified a severe load on our service and are attempting to restore access.

Posted Apr 23, 2020 - 16:08 UTC

Investigating

We are currently investigating this issue.

Posted Apr 23, 2020 - 15:54 UTC