On April 28, 2020, between 4:33 PM and 4:39 PM Eastern, we experienced slow requests, connectivity problems, and a sporadic outage lasting about 5 minutes that affected all customers. We’re very sorry this occurred and would like to share with you what happened, what we learned, and what we're doing to prevent incidents like this from happening in the future.
This incident was caused by several problems that compounded one another: reduced capacity during maintenance, normal daily mid-peak load, poor failure modes for certain low-priority requests, and a number of slow, unoptimized requests that blocked other requests.
We were making maintenance-related changes in response to an incident from the previous week. When that earlier incident occurred, we attributed it to a deployment we were making at the time. That assumption was wrong, and the same underlying issue recurred when we made these maintenance changes.
We've written a postmortem for the previous incident, which describes what we learned from it and what we're doing to prevent a recurrence.
We know how much you rely on ProcedureFlow to help your business succeed. Having two incidents in one week is not something we're proud of, and we will continue to analyze this event for opportunities to serve you better and earn the trust you place in us.