At 01:17 UTC we observed delays across all of our key queues.
We began investigating and found that one of our data caches was not responding normally: its throughput was significantly below normal levels.
We identified the root cause in approximately 30 minutes and determined that we needed to fail over to a recovery server. The failover took around 10 minutes.
Because load was heavy this morning, the backlog of jobs grew quickly. It took around an hour to process the backlog and return to full speed.
All processing has now returned to normal. We are currently reviewing our diagnostic documentation to determine whether we could have identified the core issues faster. Whilst the recovery itself was swift, shortening the time to diagnosis would reduce the overall duration of an incident like this.
We apologise for the inconvenience. As ever, we are working hard to improve Vero's resilience. If you have any questions or would like to discuss this issue, please email us at firstname.lastname@example.org. We welcome your feedback.
Thanks for working with us 🙇.