We run lots of background jobs and background workers. Some carry a fairly consistent load and others vary greatly. In particular, we have a background process that can consume 30GB or more of memory and run for over a day. For smaller workloads it can complete in 15 minutes while consuming far less memory. At other times its queue can be empty for days.
Traditionally Resque is run with one or more workers monitoring a queue and picking up jobs as they show up (and as the worker is available). To support jobs that could scale to 30GB that meant allocating 30GB per worker in our cluster. We didn’t want to allocate lots of VMs to run workers that might be idle much of the time. In fact we didn’t really want to allocate any VMs to run any workers when there were no jobs in the queue. So we came up with a solution that uses Kubernetes Jobs to run Resque jobs and scale from zero.
We’ve open sourced our resque-kubernetes gem to do this.
How Does it Work?
Kubernetes has a concept of a “Job”. Unlike a Deployment, which ensures that a certain number of replicas of a Pod (a group of one or more containers) are running, a Job launches a Pod and runs it to completion: when the container exits successfully, the Job is done and the Pod is not restarted. This is great for one-off work, but something has to create the Jobs.
Resque, on the other hand, has the concept of a worker: a process that monitors the queue for new (Resque) jobs and performs them as it picks them up. But this requires that the worker is always running.
To run as a Kubernetes Job, a Resque worker needs to be packaged as a Docker container, and it needs to terminate when there are no items left in the queue. We accomplish this by adding support for an environment variable (TERM_ON_EMPTY) that tells the worker to shut down when there are no more jobs in the queue.
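The terminate-on-empty behavior can be sketched roughly as follows. This is a minimal illustration, not the gem's implementation: SketchWorker, the in-memory array standing in for the queue, and the polling interval are all assumptions. Only the TERM_ON_EMPTY variable name comes from the description above.

```ruby
# Sketch only: a worker loop that exits once the queue is drained.
class SketchWorker
  attr_reader :performed

  # queue: an Array of callables standing in for a Resque queue (assumption).
  # env:   anything that responds to [] the way ENV does.
  def initialize(queue, env: ENV, interval: 5)
    @queue = queue
    @interval = interval
    @performed = []
    # Treat any non-empty value as "on" (an assumption about the flag's format).
    @term_on_empty = !env["TERM_ON_EMPTY"].to_s.empty?
  end

  def work
    loop do
      job = @queue.shift
      if job
        @performed << job.call
      elsif @term_on_empty
        break # nothing left: exit so the container, and the Kubernetes Job, can finish
      else
        sleep @interval # conventional worker: keep polling
      end
    end
    @performed
  end
end

queue = [-> { "job 1 done" }, -> { "job 2 done" }]
worker = SketchWorker.new(queue, env: { "TERM_ON_EMPTY" => "1" })
puts worker.work.inspect
```

With the variable unset, the same loop would keep polling, which is the conventional always-running worker.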
Then, in the Resque job, we add a hook that spawns the Kubernetes Job when you enqueue a new Resque job. This means that there doesn’t need to be any process “monitoring” the queue for autoscaling and we don’t have to run any workers. We can autoscale all the way down to zero workers. (There is a configurable limit to the maximum number of workers it will launch.)
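As a rough sketch of the enqueue hook idea: Resque calls class-level hook methods whose names begin with before_enqueue before it pushes a job onto the queue, so a hook like that is one natural place to create the Kubernetes Job. Everything named below (FakeCluster, SpawnsWorkerOnEnqueue, worker_manifest, BigReport) is hypothetical and stands in for the real Kubernetes client and the gem's own code, which may be structured differently.

```ruby
# Hypothetical stand-in for a Kubernetes API client; it only records Jobs in memory.
class FakeCluster
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  # Jobs that have not finished yet.
  def active_jobs
    @jobs.reject { |job| job[:finished] }
  end

  def create_job(manifest)
    @jobs << { manifest: manifest, finished: false }
  end
end

# Mixed into a Resque-style job class with `extend`.
module SpawnsWorkerOnEnqueue
  attr_accessor :cluster, :max_workers

  # Resque-style hook: runs before the job is pushed onto the queue.
  def before_enqueue_spawn_worker(*_args)
    # Respect the configured ceiling on simultaneous workers.
    cluster.create_job(worker_manifest) if cluster.active_jobs.size < max_workers
    true # a before_enqueue hook that returns false would cancel the enqueue
  end
end

class BigReport
  extend SpawnsWorkerOnEnqueue

  def self.worker_manifest
    { "kind" => "Job", "metadata" => { "name" => "big-report-worker" } }
  end
end

BigReport.cluster = FakeCluster.new
BigReport.max_workers = 2

# Simulate three enqueues by calling the hook directly.
3.times { BigReport.before_enqueue_spawn_worker("report-id") }
puts BigReport.cluster.jobs.size
```

Because the hook runs in the process that enqueues, nothing has to watch the queue, and the cap keeps a burst of enqueues from launching an unbounded number of workers.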
The Kubernetes Job runs the worker, which picks up the Resque job and performs it. If there is another job in the queue it will perform that as well, until there are no more jobs and it terminates.
Now With VM Autoscaling
The above works great when you have enough capacity in your cluster to support the newly added Jobs. But in our original case we needed 30GB allocated for each Job, and we didn’t want to keep that capacity around when we weren’t using it.
We created a secondary node pool in our cluster with 32GB instances. The default node pool was smaller. In the manifest for the high-memory Job, we set a resource request for 30GB of memory. This means that Kubernetes will only schedule the Jobs into nodes with enough available memory — nodes from the high-memory pool. (We could also have used
Finally, we configured the high-memory node pool with autoscaling. The lower limit for node pool autoscaling is 1 node, so we accepted that we’d always have one of these running, but for the most part it would be hosting other Pods from the cluster.
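For illustration, a Job manifest carrying the 30GB memory request described above might look something like this, written as a Ruby hash. The Job name, image, and queue name are placeholders, and the exact manifest we use is not shown here. The apiVersion, restartPolicy, and resources.requests fields are standard Kubernetes Job fields.

```ruby
require "yaml"

# Hypothetical manifest for the high-memory worker Job (names are placeholders).
def high_memory_job_manifest
  {
    "apiVersion" => "batch/v1",
    "kind" => "Job",
    "metadata" => { "name" => "high-memory-worker" },
    "spec" => {
      "template" => {
        "spec" => {
          # Pods run by a Job must use OnFailure or Never.
          "restartPolicy" => "OnFailure",
          "containers" => [
            {
              "name" => "worker",
              "image" => "example.com/our-app:latest", # placeholder image
              "env" => [
                { "name" => "QUEUE", "value" => "high-memory" }, # placeholder queue
                { "name" => "TERM_ON_EMPTY", "value" => "1" }
              ],
              # The scheduler only places the Pod on a node with this much memory free.
              "resources" => { "requests" => { "memory" => "30G" } }
            }
          ]
        }
      }
    }
  }
end

puts high_memory_job_manifest.to_yaml
```

The memory request is what steers these Pods to the larger nodes: the smaller nodes can never satisfy it.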
When the enqueue action creates a Job, the Kubernetes scheduler looks for a node to place its Pod on, doesn’t find one, and marks the Pod as “Pending”. That triggers the autoscaler to add a new node to the pool. When the node is “Ready”, the scheduler places the Pod on it.
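The scheduling and scale-up sequence can be modeled with a toy example. This is a deliberately simplified sketch that tracks memory only, in whole GB. It is not how the real scheduler or cluster autoscaler is implemented, and SketchNode and SketchNodePool are made-up names.

```ruby
# Simplified model: memory only, in GB. Not the real scheduler or autoscaler.
SketchNode = Struct.new(:name, :capacity_gb, :used_gb) do
  def free_gb
    capacity_gb - used_gb
  end
end

class SketchNodePool
  attr_reader :nodes

  def initialize(nodes:, node_size_gb:, max_nodes:)
    @nodes = nodes
    @node_size_gb = node_size_gb
    @max_nodes = max_nodes
  end

  # Returns the node the Pod lands on, or nil if it has to stay Pending.
  def schedule(request_gb)
    node = @nodes.find { |n| n.free_gb >= request_gb }
    if node.nil? && @nodes.size < @max_nodes
      # No room anywhere: the Pod waits as Pending while a new node is added.
      node = SketchNode.new("node-#{@nodes.size + 1}", @node_size_gb, 0)
      @nodes << node
    end
    return nil if node.nil?

    node.used_gb += request_gb
    node
  end
end

pool = SketchNodePool.new(
  nodes: [SketchNode.new("node-1", 32, 10)], # the always-on node, busy with other Pods
  node_size_gb: 32,
  max_nodes: 3
)
placed = pool.schedule(30)
puts "#{placed.name} (pool size #{pool.nodes.size})"
```

In this model the always-on node has only 22GB free, so the 30GB request forces a second node into the pool. Once the pool reaches its maximum size, further requests simply stay Pending.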
This does mean that it can take a minute or more to get a job started. For our use case this is fine. It’s a background job that is expected to take quite a while.
With this autoscaling we can now horizontally scale to handle lots of simultaneous, high-capacity jobs, allocating resources only on demand. This is the promise of cloud computing.
Incidentally, Google’s per-minute VM pricing (after the first 10 minutes) made this autoscaling a lot more cost effective.
We’ve been running this in production for four months and have found it reliable for our needs (on a couple of different job types).
There are certainly places where we can make improvements. Proper support for namespaces and alternate authentication handling are two listed in the gem README.
I’d love to have you try it out and get your feedback for other use cases.