I chose this post's image deliberately, to catch your eye. One look at it and you understand the situation: when you rely on an external service, it does not matter whether it sits outside or inside your organisation, or whether it promises 99.9999% availability. When the shit hits the fan, you are going to take a hit as well.
To give some technical context: we use Celery with SQS as the broker, and DLQs (dead-letter queues) for developer intervention.
As part of our production duty schedule, we discovered that ~2500 emails had failed to be sent (i.e. 2500 DLQ messages) because our mailing service was down. We use SQS's visibility timeout mechanism with 7 retries, meaning we give the service 7 × 30 seconds to become available again before we stop retrying. I think it is safe to say that a production incident contained in under 24 hours is rare, let alone in the roughly three and a half minutes those retries cover.
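To make the numbers concrete, here is a sketch of what that queue setup amounts to (the DLQ ARN is made up for illustration; only the 30s timeout and 7 receives come from our actual configuration):

```python
# SQS redelivers a message every time its visibility timeout expires,
# and the redrive policy moves it to the DLQ once it has been received
# maxReceiveCount times.
VISIBILITY_TIMEOUT_SECONDS = 30
MAX_RECEIVE_COUNT = 7

# The redrive policy as it would appear in the queue attributes
# (the ARN below is a placeholder, not a real queue):
redrive_policy = {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:emails-dlq",
    "maxReceiveCount": str(MAX_RECEIVE_COUNT),
}

# Total window the failing service gets before messages land in the DLQ:
total_window = VISIBILITY_TIMEOUT_SECONDS * MAX_RECEIVE_COUNT
print(total_window)  # 210 seconds, i.e. ~3.5 minutes
```

210 seconds of patience against an outage measured in hours: that is the gap the rest of this post is about.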
So the first step was assessing the damage: discussing with the PM whether it was still relevant to send the emails after such a delay. After talking it over with him, we considered writing a migration to execute only the specific DLQ messages we had. Eventually we decided to take all the DLQ messages and reprocess them (once the service came back to life).
Bombarding the service
One of the problems these situations create is that the service, as soon as it is back up, faces multiple clients reprocessing messages and executing a huge number of calls, each trying to minimise the damage to its own users. One of the lessons we learned is that an exponential backoff strategy is good, but adding random jitter to the delay makes reprocessing even safer. At the end of the day, we are talking about shared responsibility.
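A minimal sketch of that idea in Python (the base and cap values are illustrative, not our production numbers): "full jitter" draws a random delay between zero and the exponential bound, so a crowd of recovering clients spreads its retries out instead of arriving in lockstep.

```python
import random

def backoff_with_jitter(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter.

    The deterministic bound doubles each attempt (capped at `cap`);
    the actual delay is drawn uniformly from [0, bound], so retrying
    clients do not synchronise and bombard the service together.
    """
    exp_delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp_delay)

# The delays a single client might wait across 7 attempts:
delays = [backoff_with_jitter(n) for n in range(7)]
```

Two clients with the same failure history will now almost certainly retry at different moments, which is exactly the point.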
Celery based re-drive policy
We decided to implement a smarter re-drive policy than "retry every 30 seconds and die after 7 attempts".
Celery has its own mechanism for retrying tasks (autoretry_for), but it collides with the SQS mechanism and causes duplicates. Additional info can be found here - Celery retry for known exceptions - and here - Celery and SQS caveats.
So we ended up writing our own solution: an exponential backoff mechanism built on top of SQS's visibility timeout mechanism.
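Here is a sketch of the idea, with made-up names and numbers (our production code differs): on each failed receive, the consumer extends the message's visibility timeout exponentially, with jitter, instead of letting the message reappear after a flat 30 seconds. 12 hours is SQS's hard ceiling for a visibility timeout, so the growth is capped there.

```python
import random

BASE_SECONDS = 30
CAP_SECONDS = 12 * 60 * 60  # SQS caps a message's visibility timeout at 12 hours

def next_visibility_timeout(receive_count: int) -> int:
    """Visibility timeout to set after the Nth failed receive:
    exponential growth from the 30s base, jittered, capped at 12h."""
    delay = min(CAP_SECONDS, BASE_SECONDS * (2 ** receive_count))
    return int(delay * random.uniform(0.5, 1.0))

# In the consumer, after a task fails, the retry is pushed out with
# boto3 (not executed here; queue_url and message are placeholders):
#   sqs.change_message_visibility(
#       QueueUrl=queue_url,
#       ReceiptHandle=message["ReceiptHandle"],
#       VisibilityTimeout=next_visibility_timeout(receive_count),
#   )
```

The redrive policy's maxReceiveCount still bounds the total number of attempts before the message lands in the DLQ; this only changes how the attempts are spaced.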