At 11:17 UTC (1 April), our team was alerted to delays in automated campaign processing. At this time we marked behavioural, transactional and workflow processing as degraded. This affected all customers.
By 11:54 UTC, our team had isolated the root cause: delayed filter evaluations on one of our data sources.
At 12:30 UTC, processing performance had returned to normal but there was a large backlog of automated campaign jobs to work through. We continued to monitor as the backlog reduced and began work on a fix to the root cause.
At 18:49 UTC, we noticed another spike in processing delays. This was quickly resolved (by 19:30 UTC). We continued to monitor the backlog.
As of 21:30 UTC, the backlog has cleared and workflow processing is 100% realtime for all but a handful of customers (those with larger backlogs). These remaining backlogs should be fully cleared by 00:00 UTC.
At 21:56 UTC, we deployed a patch for the root cause. We believe this fixes the underlying problem and that it will not recur. We will monitor closely over the coming 48 hours to confirm this.
We also want to note that this incident was unrelated to the database upgrades/maintenance conducted over the weekend. These changes have gone smoothly.
--
Processing has returned to normal and we can report the following metrics for the key time windows. These are the p50 to p90 figures for workflow node processing: how long it takes to process a node, measured from when the node was queued for evaluation. We expect a p90 of 5 minutes across all nodes. Note that this includes **all node types** across all workflow types (transactional, non-transactional, etc.). For example, a node with 50 conditions looking back across a large date range will take longer to evaluate than a simple email node.
10:00 - 21:30 UTC 1 April (catching up on backlog)
• p50 = 1 hour
• p75 = 2.45 hours
• p90 = 4.6 hours
21:30 - 00:00 UTC 1 April (tail of backlog, with majority of customers in realtime)
• p50 = 48 minutes
• p75 = 1.8 hours
• p90 = 3.1 hours
00:00 - 01:00 UTC 2 April (backlog fully processed and processing in realtime)
• p50 = 31 seconds
• p75 = 3 minutes
• p90 = 5.4 minutes
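
For readers who want to reproduce this kind of figure on their own data, here is a minimal sketch of how queue-to-processed percentiles can be computed. It is an illustration only, not our internal tooling: the function names, the nearest-rank percentile method, and the sample timestamps are all assumptions made for the example. Only the 5-minute p90 target comes from the report above.

```python
import math
from datetime import datetime, timedelta


def percentile(sorted_values, p):
    """Nearest-rank percentile (an assumed method; other definitions exist)."""
    if not sorted_values:
        raise ValueError("no samples")
    n = len(sorted_values)
    # Smallest rank such that at least p% of the samples are at or below it.
    rank = max(1, math.ceil(p * n / 100))
    return sorted_values[rank - 1]


def latency_report(events, percentiles=(50, 75, 90)):
    """events: iterable of (queued_at, processed_at) datetime pairs.

    Returns a dict of percentile label to latency in seconds.
    """
    latencies = sorted(
        (processed_at - queued_at).total_seconds()
        for queued_at, processed_at in events
    )
    return {f"p{p}": percentile(latencies, p) for p in percentiles}


# Made-up sample: ten nodes queued at an arbitrary time, each processed
# after the given delay in seconds. These are not real incident numbers.
queued = datetime(2000, 1, 1, 0, 0, 0)
delays_seconds = [5, 12, 20, 31, 45, 90, 180, 240, 320, 600]
events = [(queued, queued + timedelta(seconds=d)) for d in delays_seconds]

report = latency_report(events)

# Target from the report above: p90 of 5 minutes across all nodes.
P90_TARGET_SECONDS = 5 * 60
within_target = report["p90"] <= P90_TARGET_SECONDS

print(report)
print("p90 within target:", within_target)
```

Because the figure covers all node types, a small number of slow, condition-heavy nodes can push the p90 above target even when the median is only a few seconds, as in the sample data here.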