On October 1, 2014 at 3:31 AM Pacific time I was awoken by a series of Pingdom alerts that were escalated to me via PagerDuty from our Shanghai operations team, including:
Hello Matt Purkeypile,
You are assigned 1 triggered incident in PagerDuty: Please visit the following URL to manage this incident. https://xignite.pagerduty.com/dashboard
1) Incident #5429 Opened on: Oct 1 at 3:20am PDT Service: pingdom Description: Pingdom Alert: incident #18417 is open for XigniteSuperQuotes (AWS US-East-1) (superquotes.xignite.com) Link: https://xignite.pagerduty.com/i/5429
Escalation Policy: 2013 Ops Rotation
Hi Pager Duty, This is a notification sent by Pingdom. Incident 18417, check 1291070 XigniteSuperQuotes (AWS US-East-1) (superquotes.xignite.com), is currently open. It has been open for 0 minutes. Log in to your account at https://my.pingdom.com to see further details and take the necessary actions. Best regards, The Pingdom Team
By the time I was up and online, though, systems were back to normal. Nonetheless, a flood of outage notifications is nothing to brush off, even in the middle of the night and even when the outage is brief. Since this spanned our AWS US-East-1 based services, I checked Amazon's status page, but they reported everything operating normally.
Looking further at these Pingdom notifications on our dashboard, I noticed the outage covered everything we had running in the AWS US-East-1 region. I could see not only our core services being impacted there, but also completely different stacks in the same region, as well as some of the data providers we monitor there. At the same time, services hosted outside US-East-1 were reported as fine. For example, here's what I could see for the above alert:
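This kind of region-correlation can be automated. As a hedged sketch, suppose each monitoring check result carries a region label (the field names and region strings below are illustrative, not Pingdom's actual API schema); a small helper can then flag when every check in one region is down while other regions stay healthy:

```python
# Sketch: flag a region-wide outage from monitoring check results.
# CheckResult and its fields are hypothetical, for illustration only.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    region: str   # e.g. "us-east-1"
    is_up: bool

def region_wide_outages(results):
    """Return regions in which every monitored check is down."""
    by_region = defaultdict(list)
    for r in results:
        by_region[r.region].append(r.is_up)
    return sorted(
        region for region, ups in by_region.items() if not any(ups)
    )

checks = [
    CheckResult("XigniteSuperQuotes", "us-east-1", False),
    CheckResult("OtherStack", "us-east-1", False),
    CheckResult("XigniteSuperQuotes", "eu-west-1", True),
]
print(region_wide_outages(checks))  # ['us-east-1']
```

When the output names exactly one region and everything outside it is green, the evidence points at the provider's region rather than at any single stack.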
Taking a look at our primary stack in that region, I could see everything was up and running fine. However, there was a brief blip where traffic dropped to almost nothing, even though we were up:
- We detected an AWS US-East-1 outage across multiple stacks of our own and other data providers.
- Our infrastructure was up, but traffic dropped to near nothing.
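A simple way to catch that second symptom, healthy instances with traffic gone, is to compare the current request rate against a recent baseline. The threshold below is an illustrative choice for this sketch, not what our monitoring actually uses:

```python
def traffic_anomaly(current_rps, baseline_rps, floor_ratio=0.05):
    """True when traffic has collapsed relative to a recent baseline.

    floor_ratio is an illustrative threshold: fire when requests per
    second fall below 5% of baseline, which is the "up but no traffic"
    signature rather than an ordinary dip.
    """
    if baseline_rps <= 0:
        return False
    return current_rps < baseline_rps * floor_ratio

# Instances pass health checks, but almost no requests are arriving:
print(traffic_anomaly(current_rps=3, baseline_rps=450))    # True
print(traffic_anomaly(current_rps=420, baseline_rps=450))  # False
```

Pairing a check like this with ordinary host health checks is what distinguishes "our servers are down" from "our servers are fine but nothing can reach them."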
The conclusion at the time was that it was an AWS problem, even though they hadn't said so. In fact, I issued an internal alert to our entire engineering and support teams at 03:57 stating as much. A few minutes later, Amazon did acknowledge the problem:
Of course, this raises the question: what if the AWS problems had lasted longer, say a couple of hours instead of a couple of minutes? In fact, we had a problem on our BATS real-time service last week that was restricted to AWS. To fix it for customers, we redirected batsrealtime.xignite.com to an alternate site that wasn't impacted by the problem. This quickly resolved the issue for our customers, and gave us time to be sure it was truly resolved in AWS before sending traffic back.
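A DNS redirect like the one we used can be scripted. The sketch below builds a Route 53 UPSERT change batch that repoints a CNAME; the target hostname, TTL, and hosted-zone ID are placeholders, and actually submitting the change (shown commented out) would require boto3 and real AWS credentials:

```python
def build_failover_change(record_name, target, ttl=60):
    """Build a Route 53 ChangeBatch pointing record_name at target.

    The record name, target, and TTL passed below are illustrative
    placeholders, not our production values.
    """
    return {
        "Comment": f"Failover {record_name} -> {target}",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }],
    }

batch = build_failover_change("batsrealtime.xignite.com.",
                              "alternate-site.example.com.")
# Submitting it would look roughly like this (needs boto3 + credentials):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZPLACEHOLDER", ChangeBatch=batch)
print(batch["Changes"][0]["Action"])  # UPSERT
```

A short TTL on records like this is what makes the redirect take effect quickly; with a long TTL, clients would keep hitting the impaired site until their caches expired.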
This is another demonstration of how operations is a differentiator for Xignite, not just something that has to be done. We were able to quickly detect, troubleshoot, and identify the problem, and issue an internal alert, all before Amazon acknowledged it.
To test drive our market data APIs and experience for yourself how easy they are to integrate into your applications, request a free trial.