Over the last 24 hours, Vero experienced a partial outage that delayed the processing of a small number of emails and events. Here is an outline of what happened:
Oct 19 2017, 19:25 UTC: The team deployed an update to our email logs feature. This involved storing data in a new structure within a new database. The feature had been rigorously tested prior to deployment to our production environment.
Oct 19 2017, 20:28 UTC: The team monitored for any errors associated with the previous deployment, and identified that events triggering emails scheduled for more than "30 days from now" were failing to save correctly. A patch was deployed which solved this issue. After deploying the patch, error rates reduced to normal levels.
Oct 20 2017, 00:00 UTC: The team started seeing an elevated number of failing API requests, along with failures in our test builds. Although the number of errors in production was higher than usual, the majority of API requests continued to process correctly (>99.5%), and the team began investigating why the failures were inconsistent.
Oct 20 2017, 12:05 UTC: The team discovered and patched a bug in a third-party tool which seemed to explain the errors we were seeing. After deploying this patch we saw a slight drop in the error rate. The team continued to investigate.
Oct 20 2017, 17:05 UTC: The team found another bug in the third-party tool which was related to timezone handling. This explained why previous testing had not raised the issues we were seeing in production, and why the errors began to occur several hours after deploying the feature into production. The team patched the bug and error rates reduced to normal levels.
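The specifics of the third-party bug are not covered here, but the general class of problem can be sketched. The example below is purely illustrative and is not the actual bug: it assumes a hypothetical helper that derives a daily bucket name from the local calendar date, while another part of the system expects the UTC date. The function names and the UTC-7 offset are assumptions made for the sketch. The two dates agree for most of the day and only diverge once UTC passes midnight, which is how a timezone issue can pass testing and then surface hours after a deployment.

```python
from datetime import datetime, timedelta, timezone


def bucket_local(instant, local_tz):
    """Hypothetical helper: name a daily bucket from the local calendar date."""
    return instant.astimezone(local_tz).strftime("%Y-%m-%d")


def bucket_utc(instant):
    """Hypothetical helper: name a daily bucket from the UTC calendar date."""
    return instant.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_agree(instant, local_tz):
    """True when both helpers pick the same daily bucket for this instant."""
    return bucket_local(instant, local_tz) == bucket_utc(instant)


# Assumed offset for illustration only (UTC-7); not taken from the incident.
local_tz = timezone(timedelta(hours=-7))

# Timestamps from the timeline above, used only as sample inputs.
deploy_time = datetime(2017, 10, 19, 19, 25, tzinfo=timezone.utc)
after_midnight = datetime(2017, 10, 20, 0, 30, tzinfo=timezone.utc)

print(buckets_agree(deploy_time, local_tz))     # True: same calendar date
print(buckets_agree(after_midnight, local_tz))  # False: UTC has rolled over
```

A test suite that only runs while both calendar dates match would never exercise the mismatched case, so pinning test clocks to instants on either side of UTC midnight is one way to catch this kind of issue earlier.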
In total, there were 39,576 delayed events which could have resulted in emails, representing about 0.5% of total event volume within the outage period. We can also confirm that no data loss was caused by this incident.
If you are still seeing issues, I encourage you to contact firstname.lastname@example.org and we can investigate your issue.
Have a lovely weekend!