DNS DDoS Attack
Incident Report for Vero
Resolved
Today we suffered serious, though sporadic, outages across our entire domain (www, app and api.getvero.com) due to a volumetric DDoS attack on our DNS host, DNSimple. This means that we were not able to track your event or user API calls for parts of today.

For the non-technical: a DDoS attack is when a large volume of requests is made to a server, essentially causing it to shut down (or at least stop serving requests as it should). DDoS attacks may have no purpose other than to prove the 'worth' of the individual(s) running the attack. What's worse is that when a DNS host goes down, simply 'failing over' to a new DNS host can mean the changes take up to 24 hours to fully propagate around the world, so sometimes waiting for your host to come back up can be the better path to take. Ultimately, whilst changes propagate, getvero.com can be visible to some computers and servers and not to others. Fortunately these occurrences are very rare.

We have been on the receiving end of such an attack once in our past; however, the resulting downtime was an hour at most. In that instance it was a matter of the host bringing their service back.

In this instance DNSimple's outage was prolonged by unrelenting attacks that lasted at least eight hours. They have provided details here: http://dnsimplestatus.com.

As DNSimple has been tremendously reliable over the last two years, we waited for over 1.5 hours before making the call to switch to our backup DNS host. It took a little time to migrate the settings, and from that moment on it was a matter of waiting for the name server change to propagate globally. The bad news is that this can take up to 24 hours, and that in the meantime our API and website can appear up for some people and down for others.
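For the technically curious, here is a minimal sketch of why a name server change can be visible to some people and not to others. It assumes a deliberately simplified model in which each resolver caches an answer until its TTL expires. The class name, host names and timings are hypothetical and are not our actual DNS configuration.

```python
from dataclasses import dataclass


@dataclass
class CachingResolver:
    """Toy model of a resolver that caches one answer until its TTL expires."""

    cached_answer: str
    cached_at: float  # seconds, on an arbitrary clock
    ttl: float        # seconds the cached answer stays valid

    def lookup(self, now: float, authoritative_answer: str) -> str:
        # Serve the cached answer while it is still fresh.
        if now - self.cached_at < self.ttl:
            return self.cached_answer
        # Otherwise refresh from the authoritative source and restart the TTL.
        self.cached_answer = authoritative_answer
        self.cached_at = now
        return self.cached_answer


HOUR = 60 * 60

# Two resolvers cached the old host at different times (hypothetical values).
early = CachingResolver("old-host", cached_at=0, ttl=24 * HOUR)
late = CachingResolver("old-host", cached_at=20 * HOUR, ttl=24 * HOUR)

# At hour 25 the authoritative answer is already "backup-host".
now = 25 * HOUR
answers = [r.lookup(now, "backup-host") for r in (early, late)]
```

In this model the first resolver's cache has expired, so it picks up the backup host, while the second keeps serving the old answer until its own TTL runs out. Real resolvers are more varied than this, but the effect is the same: different people see the change at different times.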

We posted updates via Twitter throughout the day (as our status page was also down) and responded to tickets as they came in. Now that the storm has passed I have a chance to do a post-mortem and to let you know what we can do to help if you require it.

It appears that, starting around 19:20 UTC (1 Dec), many customers were affected dramatically for 1-3 hours and then intermittently for 3-4 more hours. The outage did not affect any newsletter sends (thankfully, as it was the night of Cyber Monday in the US and there were a few sends still to go out) but has affected transactional and behavioural emails. The key issue was that our API was unreachable, which meant we could not track data.

Things have returned to normal now and I wanted to reach out and let you know what we can do to help you with this frustrating failure:

1. If you would like, and where there is a substantial drop in API requests for your account, we can give you more details on the change in volume so you can get the specifics of the impact and we can work together to see what else we can do.

2. If you have a logging service that logs API failures then we can work to replay these backend API calls into Vero.
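If you do not yet log failed API calls, the idea in point 2 can be sketched roughly as follows. This is a generic illustration, not our client library: the function names, the JSON-lines log format and the `send` callable (whatever function your backend uses to make the API call) are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def send_or_log(send: Callable[[dict], None], payload: dict, log_path: Path) -> bool:
    """Try to deliver one API call; on failure, append it to a JSON-lines log."""
    try:
        send(payload)
        return True
    except Exception:
        # Keep the payload so it can be replayed once the API is reachable.
        with log_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload) + "\n")
        return False


def replay(send: Callable[[dict], None], log_path: Path) -> int:
    """Re-send every logged payload; return how many were delivered."""
    if not log_path.exists():
        return 0
    delivered = 0
    still_failed = []
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            send(json.loads(line))
            delivered += 1
        except Exception:
            # Leave anything that fails again in the log for a later attempt.
            still_failed.append(line)
    log_path.write_text("".join(f"{line}\n" for line in still_failed), encoding="utf-8")
    return delivered
```

Calls made while the API is unreachable end up in the log file, and running `replay` later delivers them in their original order. Note that replayed events arrive late, so include the original timestamp in the payload if your own data needs it.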

A few points worth noting:

3. Our m.js JavaScript is not hosted on our domain, so it was still served reliably. It also caches customer page views so, for returning customers, these events will be tracked (great for apps where customers log in).

4. If you have your DNS records set up for your Vero emails, opens and clicks were not affected.

--

This sort of outage is rare and extremely frustrating, as there is little we could have done whilst waiting for the DNS changes to propagate (sometimes the good things on the internet are also the bad: caching!). If you feel we could have done more or want to talk about any of the details, we'd love to share more and are keen to work with you to patch any missing data where possible.

Thanks, as always, for your support.

Chris
Posted Dec 02, 2014 - 19:13 AEDT