Hi there,
On Thursday we experienced a significant outage at Vero. I apologise for the inconvenience I know this has caused.
We understand the importance of ensuring Vero is online 24/7 and, in an extremely hard situation, believe we have made the right calls over the last couple of days to restore Vero as promptly as possible and with improved resilience.
Below I have provided the full details of the issue, but I want to start with a brief synopsis: A core component of our hardware was hosted with a managed service provider. Given this is a load-bearing system component, we had identified that we could improve this aspect of our architecture by moving away from this provider. We did not feel this was urgently necessary, but it was prudent to plan for. We have recently been building the systems needed to move away from this service provider and were intending to do so early next week. Unexpectedly, on Thursday this hardware component suffered a major outage with this service provider. Fortunately, we were able to use the work we'd done to cut over to the new system we had designed. As this was unplanned, it resulted in several hours' outage. Again, I want to apologise for the extended and severe service interruption. This is not how we aim to operate here at Vero.
The good news is that, as a result of the work performed in the face of this issue, the new system is resilient in a way that the old one was not.
For those who would like more details, I have provided them below. If you have questions or concerns regarding the issue, the changes made or the improved fault tolerance, I am more than happy to answer them. We take this extremely seriously.
Thank you for being a Vero customer,
Chris
There have been two issues over the last ten days. Although the first was less significant in impact, I wanted to address the details of both here.
Late last week, one of our hardware providers performed a standard upgrade to one of our system components. Unfortunately, this process resulted in a significant error. Whilst this error did not originate with our processes or team, we were left with limited access to a critical piece of our stack. Thanks to our hard work over the last year, we were able to fail over to backup hardware within 30 to 40 minutes. This ensured Vero was operational in all critical areas, albeit performing slower than we'd like in our web UI. You can read more of the details related to this issue on our status page.
By Sunday we had returned our systems to our normal configuration with this service provider, ensuring things were operating smoothly. Throughout this week we have been monitoring to ensure no further degradation in service.
Although there was no reason to suspect that we'd see further issues from this service provider, given the nature of the outage last week, we decided to bring forward a plan we have been working on to replace the service provider and move to a more resilient system. This was a large task that has taken many months of work and, throughout the first half of this week, we spent all of our time finalising this project and preparing to migrate off this provider. We intended to cut over entirely this weekend, believing this was the first safe opportunity to do so without impacting your Vero experience. This change should have gone unnoticed by you, our customers.
Unfortunately, on Thursday the same provider experienced another (entirely unexpected) outage on the same system component. Whilst this confirms our work toward migrating from this provider was a smart decision, it is extremely frustrating and unprecedented to have had two outages within seven days.
Over several hours on Thursday afternoon, we decided to cut over entirely to our new configuration, rather than failing over to backup architecture temporarily. We deemed this the fastest and most reliable method of recovery given the situation.
This process took longer than we had anticipated and, whilst it ultimately went smoothly from a functional perspective, it resulted in a large disruption over the following 24 hours, with access to Vero and Vero's processing significantly affected for 12 to 16 hours.
Although the cutover was unplanned, the new architecture we have put in place gives us more functionality and control. We believe we have made the right decision to ensure we do not have these issues in the future. We are building Vero for the long term and work to make decisions with this in mind.
The affected component was one of the last few remaining parts of Vero that were not inherently resilient. The majority of Vero now runs on EC2 spot instances and our data stores leverage Cassandra and Redshift, reducing single points of failure in our architecture and enabling us to build fault-tolerant systems at an unprecedented level. Thursday's change moves us one step closer to fault tolerance across all elements of Vero.
Throughout the issue we reported on Vero's status via our status page. We will include a copy of this post-mortem email there. Vero is functioning normally again and we do not expect further outages here in the future.
I want to finish by apologising again for this outage. We understand the importance of ensuring Vero is online 24/7 and, in an extremely hard situation, believe we've made the right call today to ensure we continue to deliver on that expectation.
Thank you for being a Vero customer.
If you have questions, please hit reply and either I or someone from our team will answer. If you have concerns or specific enquiries about the infrastructure issues and our solutions, we'd love to talk about these too. We're excited by the changes we've been making and look forward to sharing more of our knowledge.
Thanks again,
Chris