Is 'landing' broken? Does it even work? Or am I stupid?


#1

I did a bunch of work to carefully orchestrate the landing and retirement process in Concourse when running locally.

I first land the worker, wait for it to show as ‘landed’, then stop it. For docker containers I stop the container (leaving the volume intact), and for a local worker I stop the process.
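Roughly, the spin-down half of my script looks like this. The container and worker names are just the ones from my local setup, and I'm relying on SIGUSR1 being the "land" signal and on the name/state fields that fly workers --json reports, so treat it as a sketch rather than gospel:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Names from my local setup -- adjust to yours.
WORKER_NAME="worker-linux"          # the name the worker registered with
WORKER_CONTAINER="concourse-worker" # the docker container running it
FLY_TARGET="local"                  # a fly target that is already logged in

# 1. Ask the worker to land. As far as I know, SIGUSR1 sent to the worker
#    process is what triggers landing; for a containerised worker that means
#    signalling the container's main process.
docker kill --signal=SIGUSR1 "${WORKER_CONTAINER}"

# 2. Wait until the ATC reports the worker as 'landed'.
until fly -t "${FLY_TARGET}" workers --json \
      | jq -e --arg n "${WORKER_NAME}" \
          '.[] | select(.name == $n and .state == "landed")' >/dev/null; do
  echo "waiting for ${WORKER_NAME} to land..."
  sleep 2
done

# 3. Only now stop the container, leaving its volume intact.
docker stop "${WORKER_CONTAINER}"
```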

I then stop the web/atc.

When starting back up, I start the web/atc first, then bring the workers back up.
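And the spin-up half, roughly. Polling the ATC's /api/v1/info endpoint is just my way of checking the web node is actually answering before the workers try to register; the container names are the same placeholders as above:

```bash
#!/usr/bin/env bash
set -euo pipefail

WEB_CONTAINER="concourse-web"
WORKER_CONTAINER="concourse-worker"
ATC_URL="http://localhost:8080"

# 1. Bring the web/ATC node back first.
docker start "${WEB_CONTAINER}"

# 2. Wait until the ATC answers on its API before starting any workers.
until curl -fsS "${ATC_URL}/api/v1/info" >/dev/null; do
  echo "waiting for ATC..."
  sleep 2
done

# 3. Start the worker container; it re-registers with the same name and volumes.
docker start "${WORKER_CONTAINER}"
```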

However, I regularly end up with intermittent issues with checks or jobs when the cluster comes back up. Sometimes it comes back with no issues, but it's never been consistent.

Am I doing something wrong? The documentation around worker management has always been a little unclear. At the end of the day, it feels like I need to be able to just throw away workers to make them work again.

I can deal with it locally, but it scares me when I imagine running this centralized with a bunch of important jobs, and having to re-deploy workers to fix these state issues.

To be clear, I’ve got docker containers with consistent names that are getting stopped and started in order, including ‘landing’ the workers first, yet I still end up with broken pipelines and missing volumes or files. I don’t get it.

This gist shows the fly workers --json output after landing and then after starting: https://gist.github.com/eedwards-sk/87979bf3d4c85f2fae83463ae1b124b8

The resources are the same in each. Yet checks fail and files are missing from the volumes. The volumes get mounted but are empty… which seems to be the same issue I opened before: https://github.com/concourse/concourse/issues/2525

(screenshot: check_failed)


#2

We never bothered with landing. We retire directly :slight_smile:
The advantage is that it removes any state about the worker from the ATC, and it puts you in the mindset of being ready to throw away the VM and redeploy another one, which makes a big difference when the worker is Windows or Mac. Since there is no container isolation on those platforms, retiring ensures no side effects or global state are kept around. Granted, you pay with a cold cache.
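For anyone who wants the concrete version, retiring is just the retire-worker counterpart of land-worker. A rough sketch, with placeholder names and key paths (double-check the exact flags against concourse retire-worker --help for your version):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholders -- point these at your own deployment.
WORKER_NAME="mac-worker-1"
TSA_HOST="web.example.com:2222"
TSA_PUBLIC_KEY="/etc/concourse/tsa_host_key.pub"
WORKER_PRIVATE_KEY="/etc/concourse/worker_key"

# Ask the TSA/ATC to retire the worker: it drains and is then removed from
# the cluster entirely, with no 'landed' state left behind.
concourse retire-worker \
  --name "${WORKER_NAME}" \
  --tsa-host "${TSA_HOST}" \
  --tsa-public-key "${TSA_PUBLIC_KEY}" \
  --tsa-worker-private-key "${WORKER_PRIVATE_KEY}"

# Once the worker has disappeared from `fly workers`, the VM can be
# destroyed and a fresh one deployed in its place.
```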


#3

I spent the morning basically going this route. I rewrote my local orchestration scripts to spin up 'ephemeral'-style Linux and Mac workers, and I now explicitly clear the Mac worker's volumes whenever I spin it down so it will be fresh next time (I don't use VMs for Mac, just a local worker process). I'm also skipping landing and just retiring.
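The spin-down for the Mac local worker now looks roughly like this. The paths, names, and pid file are specific to my setup, and I'm assuming SIGUSR2 is the retire signal and that the work dir is the one the worker was started with via --work-dir:

```bash
#!/usr/bin/env bash
set -euo pipefail

# My local setup -- adjust names and paths to yours.
WORKER_PID_FILE="/usr/local/var/run/concourse-worker.pid"  # wherever you record the worker pid
WORKER_WORK_DIR="/usr/local/var/concourse-worker"          # the --work-dir the worker was started with
FLY_TARGET="local"
WORKER_NAME="mac-worker"

# 1. Retire instead of land: SIGUSR2 asks the worker process to retire itself.
kill -USR2 "$(cat "${WORKER_PID_FILE}")"

# 2. Wait for the worker to drop out of the worker list entirely.
while fly -t "${FLY_TARGET}" workers --json \
      | jq -e --arg n "${WORKER_NAME}" '.[] | select(.name == $n)' >/dev/null; do
  echo "waiting for ${WORKER_NAME} to retire..."
  sleep 2
done

# 3. Clear the work dir so the next worker starts completely fresh.
rm -rf "${WORKER_WORK_DIR:?}"/*
```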

I guess the side benefit of not bothering with landing is that it will work well in an autoscaling group environment: without landing, I can just write a retirement script as a Lambda and trigger it from a lifecycle hook on termination.
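Sketching it out, the script that the Lambda would ultimately run on the instance (e.g. via SSM) before letting the ASG terminate it might look something like the following. The hook name, ASG name, TSA details, and using the instance id as the worker name are all assumptions on my part:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder names for the eventual AWS setup.
ASG_NAME="concourse-workers"
HOOK_NAME="retire-on-terminate"
# Instance id from the EC2 metadata service (IMDSv1 style; adjust for IMDSv2).
INSTANCE_ID="$(curl -fsS http://169.254.169.254/latest/meta-data/instance-id)"

# 1. Retire this worker so the ATC forgets it cleanly (assumes the worker
#    registered under the instance id as its name).
concourse retire-worker \
  --name "${INSTANCE_ID}" \
  --tsa-host web.example.com:2222 \
  --tsa-public-key /etc/concourse/tsa_host_key.pub \
  --tsa-worker-private-key /etc/concourse/worker_key

# 2. Tell the ASG it can go ahead and terminate the instance.
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name "${HOOK_NAME}" \
  --auto-scaling-group-name "${ASG_NAME}" \
  --lifecycle-action-result CONTINUE \
  --instance-id "${INSTANCE_ID}"
```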