Concourse-CI Performance Recommendations


#1

Hello,

I have a pipeline that builds 10 Docker images. I ran a few test builds and noticed that build times differ each time: each app build takes anywhere between 5 and 10 minutes. Is there anything in the Concourse configuration I could do to make builds more consistent and potentially faster? I currently have 3 workers; is there a limit per worker on how many containers it can spin up?

Increasing the CPU and memory allocated to the workers doesn’t help, and neither does scaling up.


#2

What might help is changing the container placement strategy, which is the closest thing Concourse has to scheduling. By default Concourse follows “container gravity” (the volume-locality strategy): if a worker already has the volumes needed for a given task, it tends to receive that task. A single worker can become hot for this reason.

If you triggered multiple builds at the same time in your tests, this is what you might have observed: not all workers being fully utilized. This is all speculation on my part; to know for sure you need monitoring that shows the load of all workers side by side, so you can check whether it is uneven.
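
If you don’t have dashboards for this yet, the quickest way to eyeball container spread is `fly workers`, which lists the active container count per worker. A minimal sketch (the target name “ci” is just a placeholder for your own fly target):

    # List workers with their active container counts; very uneven numbers
    # while builds are running suggest one worker has become "hot".
    fly -t ci workers

    # Re-run it every 10 seconds during a burst of builds to watch the spread.
    watch -n 10 'fly -t ci workers'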

If you can confirm that worker utilization is uneven, try changing the placement strategy to “random” via an option to concourse web. See https://github.com/concourse/concourse/issues/1741 for details. In our tests, switching to “random” made worker load more even, and it is the setting we run in production. But we monitor worker load closely :slight_smile:
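
For reference, a minimal sketch of how the strategy is set on the web node (other required flags such as the Postgres connection and session keys are omitted; adjust for however you deploy concourse web):

    # Flag form on the web/ATC node:
    concourse web --container-placement-strategy=random

    # Equivalent environment variable form:
    CONCOURSE_CONTAINER_PLACEMENT_STRATEGY=random concourse web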

The trade-off with “random” is that, the more workers you have, the more often a task will land on a worker that doesn’t have the container image cached. If, for example, you have aggressive autoscaling (workers are disposed of very quickly after they go idle), you might always land on a worker that needs to download the task image. This might or might not be a problem; only close monitoring will tell :slight_smile:
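
One way to sanity-check whether those cache misses are actually happening is to list the volumes the cluster knows about and see which workers hold the resource and image caches. A rough sketch (again, “ci” and the worker name are placeholders):

    # List all volumes, including the worker that holds each one and what it
    # caches (resource caches, task caches, imported images).
    fly -t ci volumes

    # Filter to a single worker to see whether it has anything cached at all.
    fly -t ci volumes | grep worker-0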


#3

Thank you!

This is what we did (based on feedback from the Concourse developers):

  1. Changed the container placement strategy to “random”
  2. Increased worker disk size marginally
  3. Scaled the ATC (web) nodes to two (2)
  4. Applied the ‘ulimit’ BOSH deployment change on all workers and ATC (web) nodes

I think the ulimit change had the largest impact, as we were seeing ‘too many open files’ errors in the logs. The additional ATC nodes handled baggageclaim operations that otherwise may not have been addressed in a timely fashion. The disk space increase gave us just enough headroom to allow (1), (3), and (4) to take effect.
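
For anyone hitting the same ‘too many open files’ errors, a minimal sketch of how to check the limits on a worker VM before and after raising them (plain Linux, nothing Concourse-specific; the process name pattern is an assumption and may differ in your deployment):

    # Soft and hard open-file limits in the current shell:
    ulimit -Sn
    ulimit -Hn

    # Limits actually applied to the running worker process:
    grep "open files" /proc/$(pgrep -f "concourse worker" | head -n 1)/limits

    # Roughly how many file descriptors the worker has open right now
    # (run as root or as the worker's user):
    ls /proc/$(pgrep -f "concourse worker" | head -n 1)/fd | wc -l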

(1), (3), and (4) will be SOPs for our Concourse deployments going forward.


#4

For the record, the upcoming 5.0 will have a much better container placement strategy: fewest-build-containers.

See
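
Assuming it is enabled the same way as the existing strategies, switching to it on 5.0 would presumably look like this (sketch only; verify the exact strategy name against the 5.0 release notes):

    # Same flag as for "random", with the new strategy name:
    concourse web --container-placement-strategy=fewest-build-containers

    # Or via the environment:
    CONCOURSE_CONTAINER_PLACEMENT_STRATEGY=fewest-build-containers concourse web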