Our deployment has 12 workers, typically running around 1,200 tasks in total, or an average of 100 tasks per worker. A single worker's hard limit is near 250 containers, so utilisation is around 40%.
By that arithmetic, it should be safe during upgrades to restart 6 of the 12 workers at a time, since the remaining 6 can absorb the full load.
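A rough sketch of the maths, plus the command we use to inspect per-worker load (the target name `ci` is a placeholder):

```
# Capacity arithmetic (250/worker is the hard limit we observe, which
# matches Garden's default max containers):
#   12 workers x 250 containers = 3000 slots
#   1200 running tasks / 3000 slots ~= 40% utilisation
#   restarting 6 workers leaves 6 x 250 = 1500 slots >= 1200 tasks
fly -t ci workers   # reports per-worker container counts
```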
In practice we find we are restricted to rotating one worker at a time, and even that causes unacceptable outages.
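For reference, this is roughly how we rotate a single worker today (a sketch; the target and worker names are placeholders, and I am assuming land-worker/prune-worker behave as documented):

```
fly -t ci workers                    # list workers and their container counts
fly -t ci land-worker -w worker-0    # drain: finish in-flight builds, take no new work
# ...upgrade and restart the worker host, wait for it to re-register...
fly -t ci prune-worker -w worker-0   # only if the old registration stays stalled
```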
Today, while upgrading from 4.2.2 to 4.2.3, the following failures were reported:
- New tasks would 'hang' in the pending state for 15 minutes or more.
- Resources turned orange and were not restored by running fly check-resource (exact invocation sketched after this list).
- Volume requests failed with errors like "Get /volumes/ connect: no route to host".
- "max containers reached" errors, despite roughly 1,500 container slots being free across the cluster.
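For completeness, the check-resource invocation we used (the target, pipeline, and resource names are examples):

```
fly -t ci check-resource -r my-pipeline/my-resource
```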
As a result, the expected 8-minute outage to upgrade haproxy, db and web turned into a 3-hour outage, during which it was nearly impossible to keep enough resources available for long enough to complete a single job, let alone a whole pipeline.
I need some help understanding what options are available for configuring container management, and how to stop a single worker without causing a cascading, system-wide failure.
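For context, the only container-management knob I have found so far is the Garden pass-through. A sketch of what I believe the worker configuration looks like; the CONCOURSE_GARDEN_* forwarding is my reading of the docs, and the paths and hostnames are placeholders, so corrections welcome:

```
# As I understand it, CONCOURSE_GARDEN_* environment variables are
# forwarded to the embedded Garden (gdn) as flags; 250 is Garden's default.
export CONCOURSE_GARDEN_MAX_CONTAINERS=250

concourse worker \
  --work-dir /opt/concourse/worker \
  --tsa-host web.example.com:2222 \
  --tsa-public-key /etc/concourse/tsa_host_key.pub \
  --tsa-worker-private-key /etc/concourse/worker_key
```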