Scheduling based on compute requirements

Hey all,

How do folks deal with running tasks that require obscene amounts of RAM? I’ve been working with a customer who has a legitimate need to run tests that spin up headless browsers and mobile emulators, and so require many GiBs of RAM.

As I understand it, there’s no way to tell Concourse the compute requirements of a task. Hence sometimes all these massive tasks get lumped onto the same worker, whilst (for example) 250 lightweight check containers are on another worker with oodles of spare memory.

Is my understanding incorrect? It normally is.

  • How do folks with similar requirements manage this? Workers with different characteristics, and tagging jobs that need all the RAMs? (Rough sketch of what I mean below.)
  • Are there any plans to implement compute-requirement-based scheduling? That sounds like quite a hassle.
  • Could this be added if Kubernetes-as-a-Container-Runtime is implemented? If containers were scheduled as pods on Kubes, then presumably the clever scheduling in Kubernetes could be leveraged instead.
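
To make the tagging idea in the first bullet concrete, here's roughly what I have in mind (names made up, and this is a sketch rather than something I've actually deployed): register a couple of beefy workers with `--tag big-memory`, then pin the heavy tasks to them in the pipeline:

```yaml
jobs:
- name: browser-tests
  plan:
  - get: source-code
    trigger: true
  - task: headless-browser-suite
    tags: [big-memory]    # only scheduled onto workers registered with --tag big-memory
    file: source-code/ci/browser-tests.yml
```

That works, but it means hand-partitioning the worker pool rather than the scheduler knowing this particular task wants many GiBs of RAM.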

We handle this by separating concerns. The build and unit test of our microservices are all done within the Dockerfile. The integration tests are all done on an external k8s cluster in one of several namespaces we dedicate to that task.

The Concourse worker infrastructure is ill-suited to heavyweight jobs like the ones you describe, and it shouldn't be the thing that has to worry about those sorts of concerns.

So how do we actually solve the problem?

Well, we let Concourse build things natively, but when it comes time to kick off heavy integration test jobs, we hand them off through the standard Concourse resource model.

What we do:

  • Have external kubernetes clusters / namespaces we can target. These are long-lived but easy to update on the fly.
  • Use the Concourse pool resource to allocate a dedicated namespace for the purpose of the test
  • Use the Concourse kubernetes and helm resources to perform the integration test (sketched below)
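
A minimal sketch of how those three pieces can fit together in a pipeline (the lock repo, resource names, and the exact helm-resource fields are placeholders; the details depend on which community kubernetes/helm resource type you use):

```yaml
resource_types:
- name: helm
  type: registry-image
  source:
    repository: linkyard/concourse-helm-resource   # one of several community helm resource types

resources:
- name: test-env-pool
  type: pool                                  # pool resource ships with Concourse
  source:
    uri: git@github.com:example/ci-locks.git  # hypothetical repo of lock files
    branch: master
    pool: integration-namespaces              # one lock file per dedicated namespace
    private_key: ((locks_private_key))

- name: helm-deploy
  type: helm
  source:
    cluster_url: ((k8s_api_url))              # field names vary by resource type
    token: ((k8s_token))

- name: my-chart
  type: git
  source: {uri: https://github.com/example/my-chart.git}

jobs:
- name: integration-test
  plan:
  - get: my-chart
    trigger: true
  - put: test-env-pool                # claim a free "integration test slot"
    params: {acquire: true}
  - load_var: namespace               # the acquired lock's name doubles as the namespace
    file: test-env-pool/name
  - put: helm-deploy                  # install the chart into the claimed namespace
    params:
      chart: my-chart/chart
      namespace: ((.:namespace))
  # ...run the test suite against the deployed release, then tear it down...
  - put: test-env-pool                # hand the slot back for the next build
    params: {release: test-env-pool}
```

Each lock file in the pool corresponds to one namespace, so adding another slot is just committing another lock file. In practice the release step usually goes in an `ensure:` hook so a failed build still returns its slot.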

In this way, we can have as many “integration test slots” as we care to have. We can scale them up or down. We can have them on different clusters or cloud providers. And we don’t have to have them deeply integrated into our shared Concourse infrastructure.

Then we let Concourse do what it does best – manage the criteria by which our artifacts are built, verified, and promoted – and we let Kubernetes do what it does best – run the tests as we describe them natively within our charts.

Overall this is a strategy that we’ve been quite happy with.

We have the advantage of being a hosting provider, so all of our Concourse infrastructure runs on Droplets; our 35 workers each have 20 vCPUs and 64 GB of RAM. We rarely run low on RAM; more often it's high CPU load that sometimes slows builds down.