Delays in API processing

Incident Report for Vero

Postmortem

At approximately 00:17 UTC (10:17 AEST) we observed delays across all of our key queues.

We began investigating and found that one of our data caches was not responding normally: its throughput was significantly below normal levels.

We identified the root cause in approximately 30 minutes and determined that we needed to fail over to a recovery server. The failover took around 10 minutes.

Due to heavy load this morning, the backlog of jobs grew quickly during the incident. It took around an hour to process the backlog and return to full speed.

All processing has now returned to normal. We are currently reviewing our diagnostic documentation to determine whether we could have identified the core issue faster. Whilst the failover itself was swift, faster diagnosis would have shortened the overall recovery time.

We apologise for the inconvenience. As ever, we are working hard to improve Vero's resilience. If you have any questions or would like to discuss this issue, please email us at support@getvero.com – we welcome your feedback.

Thanks for working with us 🙇.

Posted Sep 01, 2017 - 13:59 AEST

Resolved

Just confirming that the backlog has been cleared and all processing has been operating at normal capacity for more than 90 minutes now.

Posted Sep 01, 2017 - 13:58 AEST

Monitoring

Our API processing is nearly realtime again – we are finalising the processing of the backlog that accumulated earlier.

We will provide details of the root cause shortly. Our priority is processing this backlog.

Thanks.

Posted Sep 01, 2017 - 11:44 AEST

Update

We have identified the cause and have made changes in response.

Unfortunately we are not yet confident that the system is operating as normal – we are actively working on this issue.

Posted Sep 01, 2017 - 11:16 AEST

Identified

We are currently seeing delays in API processing. We are investigating the situation and will provide updates.

Posted Sep 01, 2017 - 10:17 AEST