Trouble recording API requests and triggering related campaigns in realtime
Incident Report for Vero
Postmortem

Yesterday evening, Vero experienced a major outage that prevented newsletters, transactional and behavioural emails from being sent on time for approximately two and a half (2.5) hours.

The affected period ran from approximately 17:05 GMT on 19 January until around 19:15 GMT. During this period:

  • The API collected data successfully. There was no API downtime.
  • The Vero web application remained accessible. There was no UI downtime.
  • Newsletter emails were delayed for this entire period.
  • Transactional and behavioral emails were not sent during this period.
  • Logs were delayed throughout this period, with a maximum delay of two and a half hours.

Full speed was restored by 20:15 GMT. Our aim is to provide greater than 99.9% uptime on email deliverability, so we consider this a serious incident. Below, we've given an overview of the issue and laid out how we're going to prevent future incidents of this nature.

At 17:05 GMT our engineering team was alerted to an issue with the platform when error rates on behavioural emails increased significantly. An initial investigation revealed that behavioural campaigns were having trouble loading user properties from our databases for insertion into templates, which prevented emails from being sent.

Over the last two weeks we have begun migrating our user properties to a new data infrastructure. This has so far been a seamless process and should not affect our users, as we are operating the datastores in parallel for testing. This upgrade will allow us to continue to handle rapid growth throughout 2016. Yesterday evening an unexpected problem arose at scale that prevented the email workers from finding their related user data and, ultimately, API workers from saving new user data.

To fix this issue promptly, we made the decision to take our email and log workers offline for all campaign types, ultimately affecting newsletter, behavioral and transactional emails. This allowed us to deploy a fix for the code and data structure that were causing the issue.

We have spent the last eight hours monitoring our changes, and everything has been operating smoothly and in real time since 20:15 GMT.

If you would like further details on emails delayed in your account or similar, please get in touch via support@getvero.com.

Thanks for working with us at Vero! Looking forward to a great 2016 here on the Engineering Team – we can't wait to deploy these changes live!

Posted Jan 20, 2016 - 15:07 AEDT

Resolved
This incident has been resolved.
Posted Jan 20, 2016 - 15:05 AEDT
Monitoring
All API requests and emails are processing in real time. Due to the nature of the issue, some services, such as CSV imports, may still be unavailable. If you have any questions, please contact support@getvero.com.
Posted Jan 20, 2016 - 08:24 AEDT
Update
All transactional and behavioural emails are now processing in real time. We are continuing to work through the backlog of newsletter emails.
Posted Jan 20, 2016 - 07:44 AEDT
Update
All API requests are now processing in real time. We are continuing to process the backlog of transactional, behavioural and newsletter emails.
Posted Jan 20, 2016 - 07:20 AEDT
Update
The team has been able to implement a workaround, and is now re-tasking Vero's servers to process the backlog of API requests and emails. This is expected to take some time, but we will keep you up to date as we proceed. Some features will still be delayed until our background processing queues are running in real time once again.
Posted Jan 20, 2016 - 06:33 AEDT
Identified
We have isolated the cause of the issue we are seeing.

To fix these issues as quickly as possible, we have temporarily paused all behavioral emails, logging queues and indexing while we deploy a fix. This is not affecting our API or data collection, nor is it affecting Vero's web interface, apart from Logs being temporarily delayed.

This is the first time in around a year that we've had to take this action, and we hope to have these processes back online promptly. We're working on this as a priority and will keep you updated throughout. We will follow up with a post-mortem once we have resolved this incident.
Posted Jan 20, 2016 - 04:50 AEDT
Investigating
We are currently seeing issues that prevent some API requests from being saved to our database. We are still receiving these requests and queueing them.

We are investigating as we speak and will provide an update as soon as we identify the root cause and plan of action. This is a critical issue.
Posted Jan 20, 2016 - 04:17 AEDT