A tale of transparency and customer understanding

In a previous life, I was one of the first engineers on the Tech Support team at New Relic as it grew from a small company into a large one. I learned a lot in the process, and I've been reconsidering those lessons in light of the somewhat similar journey my team is on now at my new, smaller company. Today I want to tell the story of how New Relic built a good messaging platform for communicating service issues to customers: a tale of downtime and of what customers expect from a SaaS service. All recommendations below are just my opinion, but they are based on direct experience rolling out a status page and its associated processes at New Relic, with results visible here:

Example 1: http://status.newrelic.com

The problem: Customers expect the service to be up, and if it isn't, they expect us to be working on it, ideally in a way they can discover with minimal effort. There are many ways to attempt to solve this problem, and customers find some more acceptable than others, but one industry standard is a web page displaying a status report for the service.

A minimal set of features for a successful status page

I want to dive into how we did it at New Relic, not with the goal of doing it the same way, but so that the lessons learned can be considered as we design a solution ideal for our own needs here.

Externally hosted

This is a requirement because customer notification must work even when things outside of our control go wrong. What if AWS is down? In-app messaging won't help if an entire data center is offline or experiencing problems. Do customers have someplace to check in if our ticketing system is suffering extended downtime? Both situations are rare, but both have occurred, and we need a solution that does not share a single point of failure with anything else we run.

Updated consistently

Once we convince customers that this will be the source of truth for Jama's hosted status, we need to be sure we live up to that promise. The hope is that it becomes so trustworthy that a hosted customer experiencing what they think is an outage first checks the status page, and if nothing is listed, then checks their network connection before assuming it is a problem on our end.

Clear messaging

The status page is the "face of an outage", so it's key to put our best foot forward. Keeping the following things in mind may be of use:

Lessons learned

All of the above are pretty much best practices and industry standard (or should be... see example 2: How not to do it: Honest Status Page on Twitter). Additionally, there are some other things we figured out along the way at New Relic that it would be awesome to address during spin-up rather than reactively.
  1. Be prepared for reality. We may not discover an outage in time to report it while it is happening. Customer feedback may not be glowing.
    1. DevOps will need a plan for when it is and isn't appropriate or useful to post a status.
    2. Support will need a good script to work from, so they can explain our approach to customers who ask about how we use the status page. For example, at New Relic, we quickly realized that it wasn't useful to post a status page update when one customer out of thousands was affected: it scared the thousands, and the one rarely noticed. However, when that one customer wrote in asking, we needed a good reason why we hadn't updated the page, because as far as they were concerned, there was an outage. We also chose not to post "posthumous" status updates about issues that had been resolved before we could post. I don't know if this was the right choice, but we needed something useful to say when we were called on the carpet, something that wasn't "gee, we didn't actually notice while we were down, so we didn't say anything".
  2. Keeping customers sufficiently, but not overly, informed can reduce concern (and the number of tickets) about issues we are already working to fix. A consistently transparent, technical voice in the updates keeps customer tickets about the updates themselves to a minimum.
  3. Auxiliary messaging in the app can be useful to clarify impact and lead to more actionable customer support tickets. For example, imagine a maintenance banner: "We had a slowdown over the last 12 hours, but everything should be caught up now. Please submit a ticket if things don't look better for you." The tickets that come back help us gauge whether our fix was effective.
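
To make the "when do we post?" plan from the first lesson concrete, here is a minimal Python sketch of how such a policy could be written down so that DevOps and Support work from the same rules. This is not what New Relic actually ran; the Incident fields, the 1% threshold, and the wording of the reasons are all assumptions I made up for illustration, and the right numbers will depend on our own customer base.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Incident:
    affected_customers: int
    total_customers: int
    resolved: bool  # True if the issue was already over when we noticed it


# Hypothetical cutoff: the fraction of customers that must be affected
# before a public post is worth the alarm it causes. Tune to taste.
MIN_AFFECTED_FRACTION = 0.01


def should_post_status(incident: Incident) -> Tuple[bool, str]:
    """Return (post?, reason) so Support has a ready-made explanation."""
    if incident.resolved:
        # Mirrors the "no posthumous updates" choice described above.
        return False, "Resolved before we could post; we don't post retroactively."
    fraction = incident.affected_customers / max(incident.total_customers, 1)
    if fraction < MIN_AFFECTED_FRACTION:
        # One customer out of thousands: contact them directly instead.
        return False, "Impact limited to a few accounts; we reach out directly."
    return True, "Broad, ongoing impact; post now and keep updating."
```

Returning the reason alongside the decision is the point of the exercise: when the one affected customer writes in, Support can give the same answer every time instead of improvising.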

Potential results

We won't get it completely right the first time, but we can follow some of the most useful tenets of the agile design philosophy: try something simple that seems to address the problem, and iterate based on how that goes. A good implementation can still improve over time as we see how customers react and how we grow. If we're having consistent downtimes, a status page won't fix that, but by announcing what we know in an easily consumable format, we can at least head off another 20 customers writing in to tell us things are down. Some explicit goals we reached for at New Relic that helped us get to "good enough":