Database maintenance
Scheduled Maintenance Report for ProcedureFlow
Postmortem

On September 24, 2020 at 11:00 PM Eastern, we attempted a planned 1-hour database maintenance. The attempt was unsuccessful and caused an additional ~1 hour of downtime affecting all customers. We're very sorry this occurred and would like to share with you what happened, what we learned, and what we're doing to reschedule the maintenance and prevent this from happening in the future.

What happened

We began stopping our database to perform the maintenance around 11:05 PM Eastern. At 11:30 PM Eastern, we noted that the remaining time to finish stopping the database was unusually high: 2 to 8 hours. We opened a case with our hosting provider (AWS), decided to abort the maintenance, and started recovery procedures.

Aborting the planned maintenance required a point-in-time database recovery, which is slower than normal database restoration procedures. The additional hour of downtime was caused by waiting for the recovery to finish. Access to ProcedureFlow was restored around 1 AM Eastern.

What we learned

  • Testing is important but doesn't always highlight potential gaps in a plan. Our prior testing did not identify any problems with stopping the database. We tested using recent backups of our Production database and also performed a full dry-run in a Staging environment.
  • Stopping a database creates a snapshot from the primary database, not the standby database. Snapshots are normally very fast because they are created incrementally from the standby database (like nightly backups); by design, they are not normally taken from the primary database. Because there were no previous snapshots of the primary database to build on, stopping it triggered a very slow snapshot of the full database.
  • In hindsight, the steps to perform the maintenance did not require us to completely stop the database. Our hosting provider informed us that there were alternative steps we could have taken. In early August, we had confirmed the steps for our plan with a different representative. However, at the time, they did not identify this problem as a gap in our plan, nor did they suggest alternatives to the plan we presented.
  • The point-in-time recovery procedure can be triggered even while the database is in the middle of stopping. We confirmed this during the phone call with our hosting provider. This is helpful for other emergency situations.

Prevention in the future

  • We will confirm any major maintenance plans with multiple representatives at our hosting provider to ensure a consistent plan.
  • We will not stop a database until it is no longer needed. Using the new approach that we learned, we can pause traffic to the database, snapshot the standby database while it's running, and launch a new database from the snapshot with any changes applied.
  • We will scrutinize every step of a maintenance plan, including steps that might seem trivial, like stopping a database or server.
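
For readers who want to see the ordering of the new approach (pause traffic, snapshot the standby, launch a new database from the snapshot), here is a minimal sketch in Python. It is an illustration under stated assumptions, not our actual tooling: the `DatabaseHost` interface and all of its method names are hypothetical and do not correspond to real AWS API calls, and `FakeHost` is an in-memory stand-in used only to show the order of steps.

```python
from typing import List, Protocol


class DatabaseHost(Protocol):
    """Hypothetical interface; these method names are NOT real AWS API calls."""

    def pause_traffic(self) -> None: ...
    def resume_traffic(self, target: str) -> None: ...
    def snapshot_standby(self) -> str: ...
    def launch_from_snapshot(self, snapshot_id: str, changes: List[str]) -> str: ...


def run_maintenance(host: DatabaseHost, changes: List[str]) -> str:
    """Apply changes by launching a new database; the original is never stopped."""
    host.pause_traffic()
    try:
        # Snapshot the standby while the original database keeps running.
        snapshot_id = host.snapshot_standby()
        # Launch a new database from that snapshot with the changes applied.
        new_db = host.launch_from_snapshot(snapshot_id, changes)
    except Exception:
        # Abort path: the original database is still running, so send traffic back to it.
        host.resume_traffic("original")
        raise
    host.resume_traffic(new_db)
    return new_db


class FakeHost:
    """In-memory stand-in that records the order of calls (illustration only)."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def pause_traffic(self) -> None:
        self.calls.append("pause")

    def resume_traffic(self, target: str) -> None:
        self.calls.append(f"resume:{target}")

    def snapshot_standby(self) -> str:
        self.calls.append("snapshot")
        return "snap-1"

    def launch_from_snapshot(self, snapshot_id: str, changes: List[str]) -> str:
        self.calls.append(f"launch:{snapshot_id}")
        return "new-db"
```

The point of this ordering is that aborting no longer depends on a slow recovery: if the snapshot or launch step fails, the original database is still running and traffic can simply be pointed back at it.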

We know how much you rely on ProcedureFlow to help your business succeed. We will continue to analyze this event for opportunities to serve you better and continue to earn the trust you place in us.

Posted Oct 01, 2020 - 18:26 UTC

Completed
The scheduled maintenance has completed. We will follow up with a postmortem and reschedule the maintenance for a future date.
Posted Sep 25, 2020 - 05:37 UTC
Verifying
ProcedureFlow has been restored. We are continuing to warm up our database, so you may experience slow requests.
Posted Sep 25, 2020 - 05:04 UTC
Update
We are continuing to roll back our maintenance to restore access to ProcedureFlow.
Posted Sep 25, 2020 - 04:31 UTC
Update
Unfortunately, the maintenance tonight did not go as planned. Certain infrastructure components behaved differently than during our tests. We are in the process of rolling back our maintenance to restore access to ProcedureFlow. After ProcedureFlow is restored, we will schedule this maintenance again for a future date with a better plan in place.
Posted Sep 25, 2020 - 04:06 UTC
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Sep 25, 2020 - 03:00 UTC
Scheduled
On September 24, 2020 at 11 PM Eastern we will be performing database upgrades. This maintenance includes changes to improve the security of ProcedureFlow in preparation for a SOC 2 audit. During this maintenance, ProcedureFlow will be unavailable to all customers.

We anticipate that the maintenance should complete in about an hour. However, there is always the possibility that it takes longer. We will provide updates during the maintenance. If you have any questions, please contact us at help@procedureflow.com.
Posted Sep 04, 2020 - 16:26 UTC
This scheduled maintenance affected: Application (https://app.procedureflow.com).