I did a bunch of work to carefully orchestrate the landing and retirement process in Concourse when running locally.
I first land the worker, wait for it to show as ‘landed’, then stop it. For Docker workers I stop the container (leaving the volume intact), and for a local worker I stop the process.
I then stop the web/ATC.
When starting back up, I start the web/ATC first, then bring the workers back up.
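For reference, here's roughly what my ordering looks like as a script. This is just a sketch: the target name (`local`) and container names (`concourse-worker`, `concourse-web`) are placeholders for my setup, and the awk column position for the worker state (6th in the `fly workers` table) is an assumption about the current output layout.

```shell
#!/bin/sh
# Orderly shutdown/startup for a local single-worker Concourse.
# Names below are placeholders for my local setup.

TARGET=${TARGET:-local}
WORKER=${WORKER:-concourse-worker}
WEB=${WEB:-concourse-web}

# Poll `fly workers` until the named worker reports the wanted state.
wait_for_state() {
  want="$1"
  tries=0
  while [ "$tries" -lt 30 ]; do
    # Assumes state is the 6th column of the `fly workers` table.
    state=$(fly -t "$TARGET" workers | awk -v w="$WORKER" '$1 == w { print $6 }')
    [ "$state" = "$want" ] && return 0
    tries=$((tries + 1))
    sleep 2
  done
  echo "timed out waiting for $WORKER to reach '$want'" >&2
  return 1
}

shutdown_cluster() {
  # Land first so in-flight builds drain and no new work is scheduled.
  fly -t "$TARGET" land-worker --worker "$WORKER"
  wait_for_state landed
  # Stop the worker container; its volume (the work dir) stays intact.
  docker stop "$WORKER"
  # Web/ATC goes down last.
  docker stop "$WEB"
}

startup_cluster() {
  # Web/ATC comes up first so the worker has something to register with.
  docker start "$WEB"
  docker start "$WORKER"
}
```

So the sequence is always: land, wait for ‘landed’, stop worker, stop web; then web first and worker second on the way back up.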
However, I regularly end up with intermittent issues with checks or jobs when the cluster comes back up. Sometimes it comes back with no issues at all, but it has never been consistent.
Am I doing something wrong? The documentation around worker management has always been a little unclear. At the end of the day, it feels like I need to be able to just throw away workers to make them work again.
I can deal with it locally, but it scares me when I imagine running this centralized with a bunch of important jobs, and having to re-deploy workers to fix these state issues.
To be clear, I’ve got Docker containers with consistent names that are getting stopped and started in order, including ‘landing’ the workers first, yet I still end up with broken pipelines and missing volumes or files. I don’t get it.
This gist shows the `fly workers --json` output after landing and then after starting back up: https://gist.github.com/eedwards-sk/87979bf3d4c85f2fae83463ae1b124b8
The resources are the same in each. Yet checks still fail and files are missing from the volumes. The volumes get mounted but come up empty… which looks like the issue I opened before: https://github.com/concourse/concourse/issues/2525