Slower than normal newsletter sends

Incident Report for Vero

Postmortem

Between 8:00am UTC on 11/22/2016 and 5:00am UTC on 11/23/2016, and again between 2:00pm UTC and 6:00pm UTC on 11/23/2016, emails sent through Vero were intermittently delivered at approximately half the normal rate. Because a backlog builds quickly at that rate, some emails were delayed by over an hour.

We know that, for many customers, even a few minutes' delay is significant.

With this in mind, we want to provide details on these delays as part of our ongoing commitment to consistent performance with every send.

Earlier in 2016 we saw several similar occurrences and delays. We responded by investing heavily in our infrastructure and, in August 2016, released some major updates to the way we process emails. Since then we have been extremely proactive in ensuring the new infrastructure is as fault-tolerant as it was designed to be. This has paid off: over the last four months, we have seen only two minor incidents as part of our transition.

These changes have been working effectively and, with our focus on continual improvement, we're confident this will continue.

What caused the lower send rates

Vero's email queues are processed in parallel. These processes are automatically scaled up and down to meet demand, enabling us to service customers at large scale, as needed.

We have determined that the slow rates were caused by a failure in the system responsible for automatically scaling these processes. The same failure meant that manual overrides were acknowledged by the service but not actually actioned, which made it difficult to track down, isolate and remedy the problem in real time.
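
As an illustration only, here is a minimal sketch of one way to catch a scaling request that is acknowledged but never actioned: compare the number of processes requested with the number actually running, and raise an error if they do not converge. This is not Vero's actual tooling. The callables `request_scale` and `count_running`, and the `ScalingNotApplied` error, are hypothetical stand-ins for whatever a given scaling service exposes.

```python
import time
from typing import Callable


class ScalingNotApplied(RuntimeError):
    """Raised when a scaling request did not take effect."""


def scale_and_verify(
    request_scale: Callable[[int], bool],  # hypothetical: returns True if acknowledged
    count_running: Callable[[], int],      # hypothetical: returns processes actually running
    desired: int,
    timeout_s: float = 60.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Request `desired` processes, then confirm the running count matches."""
    if not request_scale(desired):
        raise ScalingNotApplied(f"request for {desired} processes was rejected")

    # An acknowledgement alone is not trusted: poll the observed state
    # until it matches the request or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while True:
        running = count_running()
        if running == desired:
            return running
        if time.monotonic() >= deadline:
            raise ScalingNotApplied(
                f"acknowledged {desired} processes but {running} are running"
            )
        sleep(poll_s)
```

The design choice here is to treat the observed process count, not the service's acknowledgement, as the source of truth, so a silently ignored override surfaces as an explicit error instead of a slow queue.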

What we have done to address this

After a lot of work in the last 24 hours, we believe we have found and resolved the cause of this issue. Under observation, things have been holding steady for several hours now.

Today we have been leveraging a second service provider we have configured to assist as a fallback in scenarios like this – an important part of our commitment to consistency. Given it is Thanksgiving, we will be operating this infrastructure throughout the weekend alongside our standard setup, providing an extra layer of certainty as we continue to observe that the fixes on our primary architecture have resolved the root cause.

Our goal is always to proactively find edge cases like the one we have seen today before they occur. As with any significant incident, we are both adding to our already robust automated test suite to cover this scenario and reviewing how we can improve our methodology to ensure we cover scenarios like this up front.

Given our new infrastructure and the improvements we have made to our testing and release cycles since August 2016, we believe we are in a good position to learn from this incident and continue to deliver consistent send speeds going forward, as we have over the last few months.

Happy Thanksgiving

If you have any questions or concerns, please let us know via support@getvero.com.

I wanted to get in touch to reiterate our commitment to transparency and quality and give you a clear outline of our proactive response to yesterday's delays. Thank you for your support as customers. We are working incredibly hard to ensure that issues like this do not arise at any time.

We know that this weekend is particularly busy with Thanksgiving, Black Friday and Cyber Monday. Even before the overnight issues, we had planned to have extra operational and customer support staff on call throughout the next several days, and we will be responding promptly to any questions or scenarios to ensure things go smoothly.

Happy Thanksgiving to our US customers 🦃!

Chris

Posted Nov 23, 2016 - 22:24 UTC

Resolved

This incident has been resolved.

Posted Nov 23, 2016 - 14:30 UTC