Concourse 5.0 experience report


#1

Hello, we deployed a pre-release of Concourse 5 in prod, with least-build-containers scheduling.

Our prod is quite big: 30 beefy workers on AWS, compiling C++ code.

The build and test tasks were able to kill any kind of worker, to the point that we decided to deploy a pre-release just to get least-build-containers.
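
For readers unfamiliar with the strategy, here is a minimal Python sketch of the idea behind least-build-containers: send the next build step to the worker that currently runs the fewest build containers, ignoring check containers. This is an illustration only, not Concourse's actual code; the `Worker` class, the `pick_worker` function and the random tie-breaking are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    build_containers: int  # containers running build steps (tasks, gets, puts)
    check_containers: int  # resource-check containers, ignored by this strategy


def pick_worker(workers, rng=random):
    """Return a worker with the fewest build containers.

    Ties are broken at random (an assumption of this sketch).
    """
    fewest = min(w.build_containers for w in workers)
    candidates = [w for w in workers if w.build_containers == fewest]
    return rng.choice(candidates)


# Hypothetical cluster state: worker-b is full of check containers
# but has the lightest build load, so it gets the next build step.
workers = [
    Worker("worker-a", build_containers=12, check_containers=40),
    Worker("worker-b", build_containers=3, check_containers=350),
    Worker("worker-c", build_containers=7, check_containers=10),
]
print(pick_worker(workers).name)  # worker-b
```

The point of counting only build containers is that heavy compile and test tasks get spread out, even when check containers are unevenly distributed.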

First we got bitten by ‘insufficient subnets remaining in the pool’: https://github.com/concourse/concourse/issues/847

We then allowed for more garden subnets, as explained in that ticket.
The number of check containers on the workers now fluctuates completely randomly, reaching 350 containers, but the beefy workers handle that.
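
A back-of-the-envelope sketch of why the pool runs out, in Python. It assumes that Garden gives each container its own /30 subnet and that the default pool is 10.254.0.0/22; both figures are assumptions here, so check them against your Garden version. The function names are made up for the example.

```python
import ipaddress


def max_container_subnets(pool_cidr, per_container_prefix=30):
    """Number of per-container subnets that fit in the pool.

    per_container_prefix=30 is an assumption (one /30 per container).
    """
    pool = ipaddress.ip_network(pool_cidr)
    return 2 ** (per_container_prefix - pool.prefixlen)


def pool_prefix_for(containers, per_container_prefix=30):
    """Longest pool prefix (smallest pool) that still fits `containers` subnets."""
    bits = max(containers - 1, 0).bit_length()  # ceil(log2(containers))
    return per_container_prefix - bits


print(max_container_subnets("10.254.0.0/22"))  # 256
print(pool_prefix_for(350))                    # 21, i.e. a /21 pool (512 subnets)
```

Under these assumptions a /22 pool tops out at 256 containers per worker, which is below the 350 we are seeing, hence the need for a bigger pool.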

Net result: we have had the 5.0 pre-release in prod for 24h and we observe GREAT IMPROVEMENTS in the overall load thanks to least-build-containers.

Thanks to the Concourse team! :heart:

[we will update this post if something newsworthy comes out]

Update 1

We are using RC54.

We opened a ticket describing the current problem, runaway check containers: https://github.com/concourse/concourse/issues/3251

Update 2

We moved to RC74 and enabled global resources.

We think that the runaway containers happen only if you have lots of pipelines, each pipeline with lots of resources. Proof: we paused a subset of pipelines, the ones with lots of resources (60!) and it seems we managed to stabilize Concourse. The bug is still there, but a “half workaround” is to pause pipelines.

Update 3

The put inputs feature does wonders in reducing useless streaming, flakiness and wasted time! See


#2

Ha, upgrading prod to a pre-release of a major bump is very brave, but you’re @marco-m so I’ll just assume you know what you’re doing. :wink:

But oh god please don’t everyone do this lol. The 5.0 upgrade has a lot of changes and, for example, the version history of all of your resources will be reset. Not a big deal for most, but there will be a lot of release notes to read for this upgrade.

Glad to hear it’s working well though! We’re doing some targeted testing to identify stress points right now so it’s nice to see examples of things working in other real environments - we’re going to be upgrading our own large-scale environment soon, too.

Out of curiosity, which release candidate did you use for the upgrade? Do you have any interesting metrics to show? Have you considered enabling global resources? That may further reduce load if you’ve got a lot of pipelines with the same resources.
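
To illustrate why global resources may help, here is a deliberately simplified Python model: without sharing, every resource in every pipeline is checked on its own; with sharing, resources with an identical type and source are checked once. The real feature has more nuances than this; the `count_checks` function, the pipeline names and the example.com URLs are all hypothetical.

```python
def count_checks(pipelines, global_resources):
    """Count how many distinct resource checks need to run.

    pipelines: dict mapping pipeline name -> list of (type, source) tuples.
    With global_resources=True, identical (type, source) pairs are
    checked once across all pipelines (simplified model).
    """
    if global_resources:
        distinct = {res for resources in pipelines.values() for res in resources}
        return len(distinct)
    return sum(len(resources) for resources in pipelines.values())


# Hypothetical pipelines that all pull the same shared repository.
pipelines = {
    "app-a": [("git", "https://example.com/shared.git"),
              ("git", "https://example.com/a.git")],
    "app-b": [("git", "https://example.com/shared.git"),
              ("git", "https://example.com/b.git")],
    "app-c": [("git", "https://example.com/shared.git")],
}
print(count_checks(pipelines, global_resources=False))  # 5
print(count_checks(pipelines, global_resources=True))   # 3
```

The saving grows with the number of pipelines that share the same resources; if every pipeline has its own unique resources, there is nothing to deduplicate.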

Thanks for the feedback!


#3

…also I’m about to rename it from least-build-containers to fewest-build-containers so be sure to remember that for when the real 5.0 comes out. :stuck_out_tongue:


#4

thanks for sharing your experience Marco!!


#5

We are using RC54 and will update to latest RC soon. We will also try enabling global resources.

(current problems are reported in the “update” section of the first post).


#8

Update 4

We deployed RC78, which contains the bugfix for

Configured with “fewest-build-containers” as before.

Boy, the check container distribution changed drastically for the better. The number of builds has been lower than usual, so we will wait to see how it goes on Monday, but the graph below speaks for itself. The deployment of RC78 happened at 14:45 in the graph.

(We also see an increase in DB lookups, which we think is because the scheduler is now asking the DB for the type of each container.)