Gal Cohen

Celery + SQS: the good, the bad and the ugly

Updated: Nov 23, 2019

As the previous post suggested, we decided to go with SQS as our Celery broker, for a few reasons:

  1. Compliance - we have some offline jobs that are directly related to our business and regulations. This means that we cannot afford losing jobs (see Redis evictions)

  2. Our project is fully based on Celery, and migrating away from it (i.e. polling SQS ourselves) would take a long time. That said, it still took us quite some time to make it work with SQS

  3. Redrive policy - SQS offers an out-of-the-box feature that gives system reliability a serious boost. If a message is not processed, it will be requeued, regardless of the state of the system. For instance, if there was a power shutdown and our containers died, the jobs would be restored.
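To make the redrive idea concrete, here is a sketch of how such a policy is attached to a queue. The queue ARN and the retry count are illustrative assumptions, not our actual values:

```python
import json

# Hypothetical dead-letter queue ARN for illustration only.
DEAD_LETTER_QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:celery-jobs-dlq"

# After maxReceiveCount failed receives, SQS moves the message to the
# dead-letter queue instead of redelivering it forever.
redrive_policy = {
    "deadLetterTargetArn": DEAD_LETTER_QUEUE_ARN,
    "maxReceiveCount": 5,
}

# With boto3 this would be applied roughly as (not executed here):
#   boto3.client("sqs").set_queue_attributes(
#       QueueUrl=queue_url,
#       Attributes={"RedrivePolicy": json.dumps(redrive_policy)},
#   )
attributes = {"RedrivePolicy": json.dumps(redrive_policy)}
```

The important property for us: the requeueing happens on the SQS side, so it works even when every consumer is down.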

The good

  1. Out-of-the-box integration, meaning it's just a configuration change

  2. Pay as you go - cost depends on throughput; no ElastiCache / RabbitMQ instances sitting there waiting to be used

  3. Visibility and monitoring - using the AWS console we can monitor queue length, messages in flight, and dead-letter messages. Using CloudWatch, it's possible to configure alerts on queue length, etc.

  4. A big community supports the Celery project (compared with the alternative of polling the queues with our own implementation)
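The "configuration change" really is about this small. A sketch of the relevant Celery settings, with illustrative values (region, prefix, timeouts are assumptions, not our production config):

```python
# Point Celery at SQS; AWS credentials come from the environment / IAM role.
broker_url = "sqs://"

broker_transport_options = {
    "region": "us-east-1",
    "queue_name_prefix": "celery-",  # namespaces queues per environment
    "polling_interval": 1,           # seconds between SQS receive calls
    "visibility_timeout": 3600,      # must exceed the longest task's runtime
}
```

These keys go on `app.conf` (or a config module); everything else about the task code stays the same.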

The bad

We had a problem with our Celery config that gave us quite a headache:
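The original snippet is not preserved here, but a hypothetical reconstruction along these lines matches the behavior described below - late acks combined with message prefetching on the SQS transport:

```python
# Hypothetical reconstruction, not the original config.
task_acks_late = True            # ack only after the task finishes,
                                 # so a crash mid-task redelivers the job
worker_prefetch_multiplier = 10  # each worker prefetches many messages
                                 # into a local in-memory queue
```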

Why was this setup so bad? It caused the workers to stop executing any work after a single exception. Why? Because it turned out that when you don't ack a message (on exception, in this case), it is not removed from the local prefetch queue. To give back to the community, we fixed this issue

The ugly

  1. Plenty of IAM / SQS permissions are needed, for instance ListQueues and CreateQueue. You may ask - why does it need to create queues? The answer: to let the workers ping each other. So we removed the Celery health cron, assuming the queue length metric would be enough. As for ListQueues, we disabled it by monkey patching the Kombu code.
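For a sense of scale, here is a sketch of an IAM policy document covering the kinds of SQS actions involved. The account ID and queue-name pattern are placeholders, and the exact action list your setup needs may differ:

```python
import json

# Illustrative IAM policy; resource ARN and action list are assumptions.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "sqs:ListQueues",         # queue discovery on startup
            "sqs:CreateQueue",        # the worker-to-worker ping queues
            "sqs:GetQueueUrl",
            "sqs:GetQueueAttributes",
            "sqs:SendMessage",
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage",      # deleting a message is the SQS "ack"
        ],
        "Resource": "arn:aws:sqs:us-east-1:123456789012:celery-*",
    }],
}
policy_document = json.dumps(policy, indent=2)
```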

  2. As it turned out, we have different jobs with different timing behaviors, meaning our setup needs to be modified to work well with the visibility timeout mechanism (one of the reasons we wanted SQS in the first place)
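The tension is that Celery's SQS transport takes a single `visibility_timeout` for everything, so jobs with very different runtimes have to share one value. A sketch with illustrative numbers:

```python
# Illustrative runtimes, not our actual jobs.
SHORT_TASK_RUNTIME = 30        # seconds, e.g. a notification
LONG_TASK_RUNTIME = 2 * 3600   # seconds, e.g. an offline compliance job

# The timeout must cover the slowest task, or SQS will redeliver it
# while it is still running...
broker_transport_options = {
    "visibility_timeout": LONG_TASK_RUNTIME + 600,
}

# ...which means a short task whose worker crashed also waits hours
# before SQS makes it visible again for redelivery.
```

Splitting jobs across queues with different timeout characteristics is one way to soften this, but it is exactly the kind of rework the section title refers to.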

In retrospect, was it a good move? Yes, it was. Was it as easy as I expected? No. Is management happy? Bottom line, we delivered - though not on a great timeline.