Between ~7am UTC Monday and ~10pm UTC Monday the Vero system experienced periodic outages/delays affecting:
• Reports in the UI.
• Automated/workflow campaign evaluation speed.
• API ingestion.
• The UI itself.
We know these outages have a major impact on your end-user experience and we apologise for the inconvenience caused.
For those interested, at ~10pm UTC Sunday, we conducted some unplanned maintenance on one of our core data indexes. Whilst the maintenance was unplanned, we made the required changes only after weighing the alternatives and thinking through the risks. The new configuration is designed to be more performant, running on more modern architecture. This should have been a very run-of-the-mill, invisible upgrade.
It took some time to become apparent, but the new configuration was not performing to specification. By 7am UTC Monday this was leading to delays in several services. Throughout Monday (UTC) our team actively managed the situation, prioritising API and automated email processing (as these are the most critical systems).
As of ~10pm UTC Monday, all services except reports had returned to normal processing speeds. As of ~1am UTC Tuesday, reports have also returned to normal processing.
We will be conducting a post-mortem internally to learn and better plan for future changes to this specific service.
At this time we are continuing to monitor the situation. We will mark this issue as resolved once we are comfortable things have been operating as normal for ~24 hours.
We work hard to ensure 99.99%+ uptime on all core campaign processing. All changes we have been making to our infrastructure recently are in the service of:
• Faster automated/workflow campaign processing.
• Delivery of new channels such as SMS (and beyond).
We've encountered issues resulting in degraded performance across the product. API processing, segment calculation, email sending, and the UI have all been impacted. No data has been lost.
At 7:15AM Monday UTC, we were alerted to an unusually high number of unprocessed API jobs. These jobs eventually came back down after intervention by our platform team.
Throughout the morning, the API queue has seen multiple additional spikes, and we've seen degraded performance across the entirety of the application.
We are continuing to investigate the cause and will update when we have determined a course of action.
Posted Aug 22, 2023 - 06:26 AEST
This incident affected: Vero Cloud: Ingestion API, Vero Cloud: Newsletter processing, Vero Cloud: Segment calculation, Vero Connect: Newsletter processing, Vero Connect: Reports data availability, Vero Connect: UI, Vero Cloud: Reports data availability (Reports data availability (Vero default and Mailgun integrations), Reports data availability (Sendgrid and other non-Mailgun integrations)), Vero Cloud: UI (General UI access and speed, Logs page activity, CSV Imports and Exports), and Vero Cloud: Automated email processing (Transactional emails, Behavioral emails, Workflows).