Between 13:00 UTC and 15:18 UTC on 9 October, we observed a drop in our overall newsletter throughput.
This manifested in several ways:
• Some email campaigns in this window started sending late. These delays were typically in the order of 15-20 minutes.
• Email campaigns sending during this window sent at around 30% of the typical rate.
• Due to these delays, a portion of some email campaigns was blocked from sending. These campaigns entered a retry loop, in which the jobs were retried at increasing intervals. As a result, a small portion of some email campaigns (2-8%) was sent up to four hours late.
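To illustrate how a retry loop with increasing intervals can stretch a short delay into hours, here is a minimal sketch. The function name `backoff_delays`, the 15-minute starting interval, the doubling factor, and the 240-minute cap are all hypothetical values chosen for illustration; they are not our actual retry configuration.

```python
def backoff_delays(base_minutes, factor, max_total_minutes):
    """Return successive retry waits (in minutes), stopping before the
    cumulative wait would exceed max_total_minutes."""
    delays = []
    total = 0.0
    delay = float(base_minutes)
    while total + delay <= max_total_minutes:
        delays.append(delay)
        total += delay
        delay *= factor  # each retry waits longer than the previous one
    return delays


# Hypothetical parameters, for illustration only.
schedule = backoff_delays(base_minutes=15, factor=2, max_total_minutes=240)
print(schedule)       # the wait before each successive retry
print(sum(schedule))  # total time a job could spend waiting
```

With these assumed values, a job that keeps failing waits 15, 30, 60 and then 120 minutes, so a handful of failed attempts is enough to push a send several hours past its scheduled time.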
We have found the root cause of this issue: a rogue process was killing one of our core email sending engines. We have put two changes in place:
1. We believe we have fixed the root cause by adjusting the way one of our email processors runs. This should prevent this particular failure from recurring.
2. Our alerting system did alert us to this problem, but took around an hour to do so. We have made our monitoring of this component more aggressive, so we should be alerted sooner.
We apologise for the inconvenience caused.
If you have any questions, please email us at firstname.lastname@example.org