Last July 21st, DigitalOcean started having issues with their New York-based datacenter. They kept customers updated through their status page, and today they sent an email to all their customers explaining the issue in detail.
This is one of the best examples of post-mortem communication I've seen, and the way they executed it should be the standard for the industry.
The first paragraph starts with a sincere apology. It provides a short summary of the incident followed by a reassurance of their commitment to the customer.
Hi, I would like to take a moment to apologize for the problems you may have experienced accessing your droplets in the NYC2 region July 21st, starting around 6PM Eastern time. Providing a stable infrastructure for all customers is our number one priority, and whenever we fall short we work to understand the problem and take steps to reduce the chance of it happening again.
The next few paragraphs focus on the issue directly. There's no beating around the bush and no attempt to downplay the incident, just three straightforward paragraphs detailing what happened.
In this case, we’ve determined what were a few related events which contributed to the outage:
Next is a clear call to action. It shows that they are not brushing this off, but are instead intent on doing something to prevent it from happening again.
Our network vendor has been engaged, and we’ve been working together to attempt to fully understand the scope of the problem and steps that we can take to address it. Concretely, we’ve begun evaluating some software updates that we believe may improve the situation. If we determine, as we hope, that these changes will improve stability in this type of situation we will build a plan to upgrade our core network to this version as soon as possible. In addition, we continue to look for additional configuration changes that we can make in the mean time to help prevent this type of problem.
Reassurance to the Customer
Near the end of the email is another reassurance of their priorities. It's a strong statement that they understand the gravity of the situation and that they'll do everything needed to validate their findings.
DigitalOcean's top priority is to ensure your droplets are running 24 hours a day, 7 days a week, 365 days a year. We’ve taken the first steps to fully understand this outage and have begun making changes to greatly reduce the likelihood of a similar event in the future. This work is ongoing and we will continue to make changes and validate our infrastructure to ensure that it behaves as expected in adverse conditions.
Backing the Assurance
Finally, they do something that proves they mean business: they issue an SLA credit for the downtime. They also make it clear that they fell short, and that this gesture is their way of standing by their commitment.
We will issue an SLA credit for the downtime you have experienced. We realize this doesn't make up for the interruption but we want to uphold our promise to our users when we fall short.
In a post about Mark Imbriaco on DigitalOcean's blog (from May 19, 2014), he explains how he approaches writing an ideal post-mortem: