Application outage

Incident Report for Vero

Postmortem

Hi there,

On Thursday we experienced a significant outage at Vero. I apologise for the inconvenience I know this has caused.

We understand the importance of ensuring Vero is online 24/7 and, in an extremely hard situation, believe we have made the right calls over the last couple of days to restore Vero as promptly as possible and with improved resilience.

Below I have provided the full details of the issue, but I want to start with a brief synopsis. A core component of our hardware was hosted with a managed service provider. Given this is a load-bearing system component, we had identified that we could improve this aspect of our architecture by moving away from this provider. We did not feel this was imminently necessary, but it was prudent to plan for. We have recently been building the systems needed to move away from this service provider and were intending to do so early next week. Unexpectedly, on Thursday this hardware component suffered a major outage with this service provider. Fortunately, we were able to use the work we'd done to cut over to the new system we had designed. As this was unplanned, it resulted in an outage lasting several hours. Again, I want to apologise for the extended and severe service interruption. This is not how we aim to operate here at Vero.

The good news is that, as a result of the work performed in the face of this issue, the new system is resilient in a way that the old one was not.

For those who would like more details, I have provided them below. If you have questions or concerns regarding the issue, the changes made or the improved fault tolerance, I am more than happy to answer them. We take this extremely seriously.

Thank you for being a Vero customer,

Chris

There have been two issues over the last ten days. Although the first was less significant in impact, I wanted to address the details of both here.

Late last week, one of our hardware providers performed a standard upgrade to one of our system components. Unfortunately, this process resulted in a significant error. Whilst this error did not originate with our processes or team, we were left with limited access to a critical piece of our stack. Thanks to our hard work over the last year, we were able to fail over to backup hardware within 30 to 40 minutes. You can read more of the details related to this issue on our status page, but this ensured Vero was operational in all critical areas, albeit performing slower than we'd like in our web UI.

By Sunday we had returned our systems to our normal configuration with this service provider, ensuring things were operating smoothly. Throughout this week we have been monitoring to ensure no further degradation in service.

Although there was no reason to suspect that we'd see further issues from this service provider, given the nature of the outage last week, we decided to bring forward a plan we have been working on to replace the service provider and move to a more resilient system. This was a large task that has taken many months of work and, throughout the first half of this week, we spent all of our time finalising this project and preparing to migrate off this provider. We intended to cut over entirely this weekend, believing this was the first safe opportunity to do so without impacting your Vero experience. This change should have gone unnoticed by you, our customers.

Unfortunately, on Thursday the same provider experienced another (entirely unexpected) outage on the same system component. Whilst this confirms our work toward migrating from this provider was a smart decision, it is extremely frustrating and unprecedented to have had two outages within seven days.

Over several hours on Thursday afternoon, we decided to cut over entirely to our new configuration, rather than failing over to backup architecture temporarily. We deemed this the fastest and most reliable method of recovery given the situation.

This process took longer than we had anticipated and, whilst ultimately going smoothly from a functional perspective, resulted in a large disruption over the following 24 hours, with access to Vero and Vero's processing significantly affected for 12-16 hours.

Although unplanned, the new architecture we have put in place gives us more functionality and control. We believe we have made the right decision to ensure we do not have these issues in the future. We are building Vero for the long term and work to make decisions with this in mind.

The affected component was one of the last few remaining parts of Vero that was not inherently resilient. The majority of Vero now runs on EC2 spot instances and our data stores leverage Cassandra and Redshift, reducing single points of failure in our architecture and enabling us to build fault-tolerant systems at an unprecedented level. Thursday's change moves us one step closer to fault tolerance across all elements of Vero.

Throughout the issue we reported on Vero's status via our status page. We will include a copy of this post-mortem email there. Vero is functioning normally again and we do not expect further outages here in the future.

I want to finish by apologising again for this outage. We understand the importance of ensuring Vero is online 24/7 and, in an extremely hard situation, believe we've made the right call today to ensure we continue to deliver on that expectation.

Thank you for being a Vero customer.

If you have questions, please hit reply and either I or our team will answer. If you have concerns or specific enquiries about the infrastructure issues and our solutions, we'd love to talk about these also. We're excited by the changes we've been making and look forward to sharing more of our knowledge.

Thanks again,

Chris

Posted May 09, 2017 - 10:37 AEST

Resolved

Hi all,

We are resolving this issue at this time.

We have caught up on the vast majority of our backlog now.

Approximately 15 hours ago we re-enabled all email processing and this appears to be proceeding smoothly.

New events, emails and logs have been processing for the last 20 hours.

As mentioned earlier, we will follow up with a detailed post-mortem email and cross-post it here.

If you have further questions please email us at support@getvero.com.

Apologies again for the inconvenience caused with this outage.

The Vero Team

Posted May 06, 2017 - 14:21 AEST

Update

--- Outage update

The majority of our system is fully operational at this time. Since our update around five hours ago, the majority of transactional and behavioral emails have been sending in real time. Shortly after this, newsletters and CSV imports also caught up.

Over the last four hours we've been monitoring our systems and, at this time, everything has been operating smoothly.

There are however two areas still affected by this outage. We will leave this issue open until these are resolved.

1. Delays in conversions: Conversion tracking is delayed. We de-prioritised this queue in order to ensure critical system functionality returned to normal as quickly as possible. This results in `Conversion` metrics being underreported at this time. We anticipate this will have caught up in the next 24 hours.

2. Delays in behavioral and transactional emails: We have had to explicitly disable the processing of behavioral and transactional campaigns that use the conditions `has triggered` and `has not triggered` temporarily. We should be able to begin processing these campaigns in the next hour or two, pending the completion of an update we are waiting on. No data or emails have been lost.

Outside of these delays, things are operating normally. If you are seeing delays beyond the above, please email us at support@getvero.com.
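
For readers curious about what de-prioritising a queue means in practice, the following is a minimal sketch only, not a description of Vero's actual implementation. The job kinds, their priority numbers and the `BacklogQueue` class are all hypothetical names chosen for illustration. It shows a backlog being drained so that time-critical work (such as transactional emails) is processed before lower-priority work (such as conversion tracking), while preserving arrival order within each kind.

```python
import heapq
from itertools import count

# Hypothetical priorities (assumption): a lower number is drained first.
PRIORITY = {"transactional": 0, "newsletter": 1, "conversion": 2}


class BacklogQueue:
    """Drains queued jobs in priority order, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def push(self, kind, payload):
        # The sequence number guarantees payloads are never compared.
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, payload))

    def pop(self):
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Return every queued job in the order it would be processed."""
    order = []
    while len(queue):
        order.append(queue.pop())
    return order
```

In this sketch, a conversion job that arrived first would still be processed after any transactional or newsletter jobs already waiting, which is why conversion metrics can lag while everything else appears up to date.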

--- Erroneous sends

Approximately seven hours ago we identified that old newsletters and some old behavioral campaigns had been re-processing for users who had initially failed to receive the emails.

In response to this, in most cases, we were able to find and stop such jobs processing. We have been doing a post-mortem to ascertain why this occurred and will be emailing affected customers with the details.

We believe this is isolated to the recovery of our systems due to the above incident and will not be ongoing. Given our systems are in order, our focus is 100% on this issue at this time.

------

We hope to have this issue closed out in the next few hours. All components (below) are up to date to accurately reflect processing.

I want to reiterate that the majority of processes are functional and that there has been no data loss:

- Newsletters are sending.
- Logs are updating.
- The UI is functional.
- Segments are calculating.
- Behavioral and transactional emails are processing outside of those with the specific condition detailed above.

We will send an email to all customers and post it here (as a post-mortem) once complete. Again we apologise for the inconvenience this has caused today. Thank you for your patience and for being a Vero customer.

Posted May 05, 2017 - 11:21 AEST

Update

Due to the nature of the outage we experienced, we have had to disable behavioral/transactional campaigns that use the following conditions:

- has triggered event
- has not triggered event

We expect this to be disabled for the next several hours but as soon as we can, we will send all delayed emails.

Posted May 05, 2017 - 07:08 AEST

Update

We are now processing all emails in real time. Any unsent newsletters currently in the queue will be sent out momentarily. Please note that during the outage recovery period we have disabled all logs and campaign statistics tracking. We will re-enable this shortly, after which you will be able to see updated campaign statistics.

We have also had reports of older, previously unsent emails being triggered. The team is investigating to uncover why this happened and ensure that it doesn't continue.

Posted May 05, 2017 - 05:16 AEST

Update

Our API is now processing in real-time. We're currently processing the backlog of transactional emails.

Posted May 05, 2017 - 02:41 AEST

Monitoring

We are currently investigating reports of erroneous newsletters being sent.

The work performed earlier should not have affected any newsletter sends. We will report back as soon as we know more.

Posted May 05, 2017 - 01:49 AEST

Update

Unfortunately we are still working through some processing issues. We are making progress but not as fast as we would like.

We hope to report significant progress in the near future. Apologies for the continued interruption.

Posted May 04, 2017 - 22:49 AEST

Update

We have been processing data for approximately 30 minutes now. We have a large backlog to get through.

We have been working non-stop on this issue and will provide a more verbose post-mortem as soon as we've caught up. We're hoping that we'll get through this backlog promptly.

We will keep the updates coming!

Posted May 04, 2017 - 19:59 AEST

Update

Our web UI is back online.

Backend processing will begin shortly too.

Posted May 04, 2017 - 18:06 AEST

Update

We expect to have the web UI back online within the next 30 minutes.

At that time we should also begin processing the backlog of queued emails from the last couple of hours. Once we see things operating smoothly, we'll be able to post a better ETA for when things will be back to real time.

Apologies for the inconvenience. We will provide a further update as soon as the dust has settled a little. Thank you for your patience.

Posted May 04, 2017 - 16:20 AEST

Identified

We are currently investigating an application outage. We have taken our web UI down whilst we collect more information.

We'll provide more information shortly. Our API is still operational.

Posted May 04, 2017 - 13:35 AEST