Workload balancing


#1

Our deployment has 12 workers. It typically runs around 1200 tasks, an average of 100 tasks per worker. A single worker's hard limit is near 250, so utilisation is around 40%.
By this reasoning it should be safe to restart 6 of the 12 workers at a time during upgrades: the remaining 6 would carry about 200 tasks each, still under the limit.
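To make the arithmetic explicit, here is a minimal sketch of the capacity reasoning. It assumes tasks spread perfectly evenly across the workers that remain, which is exactly the assumption that does not hold in practice; the function names are just illustrative.

```python
import math

def utilisation(total_tasks, workers, per_worker_limit):
    """Fraction of total container capacity in use."""
    return total_tasks / (workers * per_worker_limit)

def min_workers_needed(total_tasks, per_worker_limit):
    """Fewest workers that can hold the load if it is spread evenly."""
    return math.ceil(total_tasks / per_worker_limit)

workers = 12
total_tasks = 1200
per_worker_limit = 250

util = utilisation(total_tasks, workers, per_worker_limit)
needed = min_workers_needed(total_tasks, per_worker_limit)
load_with_six = total_tasks / 6  # tasks per worker with half the fleet down

print(f"utilisation: {util:.0%}")
print(f"minimum workers for this load: {needed}")
print(f"tasks per worker with 6 workers left: {load_with_six:.0f}")
```

On paper, half the fleet has headroom to spare. The numbers only hold if the scheduler actually rebalances work onto the surviving workers.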

In practice we find we are restricted to rotating 1 worker at a time, and even that causes unacceptable outages.

Today, while upgrading from 4.2.2 to 4.2.3, the following problems were reported:
New tasks would ‘hang’ in the pending state for 15 minutes or more.
Resources turned orange and were not restored by fly check-resource.
Get /volumes/ connect: no route to host.
max containers reached (despite 1500 container spaces being available).

Thus the expected 8-minute outage to upgrade haproxy, db and web turned into a 3-hour outage in which it was nearly impossible to keep enough resources available for long enough to complete a single job, let alone a whole pipeline.

I need some help understanding what options are available for configuring container management, and working out how to stop a single worker without causing a cascading, system-wide failure.


#2

I feel your pain.

When we were still on 4.x, what worked for us was to pause all pipelines before upgrading or redeploying the workers, and then slowly unpause them afterwards. We did that with a script that also kept state about pipelines that were already paused (say, by the pipeline owners), so that we would not unpause pipelines that were meant to stay paused.
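For anyone who wants to try this, a rough sketch of the idea in Python follows. This is not our actual script. It assumes you already have a mapping of pipeline name to paused flag (how you collect it, for example by parsing the output of fly pipelines, is up to you), and the state file name, the target name and the delay are made up for illustration. It only relies on the fly pause-pipeline and unpause-pipeline commands.

```python
import json
import subprocess
import time
from pathlib import Path

def split_by_paused(pipelines):
    """Split a {name: paused} mapping into (already paused, running) name lists."""
    already = sorted(name for name, paused in pipelines.items() if paused)
    running = sorted(name for name, paused in pipelines.items() if not paused)
    return already, running

def fly(target, *args):
    """Run one fly command against the given target; raises if it fails."""
    subprocess.run(["fly", "-t", target, *args], check=True)

def pause_all(target, pipelines, state_file="paused-state.json"):
    """Record who was paused beforehand, then pause only the running pipelines."""
    already, running = split_by_paused(pipelines)
    state = {"already_paused": already, "paused_by_script": running}
    Path(state_file).write_text(json.dumps(state, indent=2))
    for name in running:
        fly(target, "pause-pipeline", "-p", name)

def unpause_slowly(target, state_file="paused-state.json", delay_seconds=60):
    """Unpause only what the script paused, one pipeline at a time."""
    state = json.loads(Path(state_file).read_text())
    for name in state["paused_by_script"]:
        fly(target, "unpause-pipeline", "-p", name)
        time.sleep(delay_seconds)  # give the workers time to absorb the load
```

Usage would be something like pause_all("ci", pipelines) before the upgrade and unpause_slowly("ci") once the workers are back. Pipelines listed under already_paused are never touched.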

We then deployed Concourse 5.0 RC and switched container placement to fewest-build-containers, and now we can redeploy everything like cowboys :slight_smile:
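To illustrate why that placement strategy helps, here is a toy model of "pick the worker with the fewest build containers". It is only a sketch of the idea, not Concourse's actual implementation; the worker names, the counts and the per-worker limit are invented for the example.

```python
def pick_worker(build_containers, limit):
    """Return the worker with the fewest build containers, or None if all are full."""
    candidates = {w: n for w, n in build_containers.items() if n < limit}
    if not candidates:
        return None
    # Break ties by name so the choice is deterministic.
    return min(candidates, key=lambda w: (candidates[w], w))

def place(tasks, build_containers, limit):
    """Place tasks one by one and return the resulting container counts."""
    counts = dict(build_containers)
    for _ in range(tasks):
        worker = pick_worker(counts, limit)
        if worker is None:
            raise RuntimeError("max containers reached")
        counts[worker] += 1
    return counts

# A freshly restarted worker starts empty and absorbs new work first.
before = {"worker-1": 200, "worker-2": 200, "worker-3": 0}
after = place(150, before, limit=250)
print(after)
```

In this model new work flows to the emptiest worker, so a restarted worker fills up again instead of the busy ones being pushed towards their limit.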

So maybe as a workaround you could try the pause/unpause trick.

Further information about our experience with 5.0 RC is at https://discuss.concourse-ci.org/t/concourse-5-0-experience-report/


#3

This seems very strange, and very worrying for the robustness of the solution. If the loss of a single worker causes a major outage, it means the system is fragile and unreliable!

This is a major issue, as the worker architecture is supposed to remove single points of failure (SPOF).