How to see how many un-interruptible builds are running on a worker?


#1

Hi,

I’m trying to implement a graceful shutdown of the concourse-worker process and am having trouble determining how many jobs/builds the worker is still running.

The current logic puts the worker into the “RETIRING” state and then waits 10 minutes before calling “stop” on the service. This is not ideal, as I’m not confident that all builds have completed by then.

I would like to be able to poll/query the concourse-worker process to determine what builds (if any) it is still running, so that I can ensure I’m not killing any builds prematurely.
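For reference, the drain logic we have today looks roughly like this (a simplified sketch; the unit name is from our setup, and the SIGUSR2-triggers-retire behaviour is my reading of the Concourse docs):

#!/bin/sh
# Simplified sketch of our current drain script.
# Per the Concourse docs, SIGUSR2 asks the worker to retire.
PID="$(systemctl show -p MainPID --value concourse-worker)"
kill -USR2 "$PID"

# Blindly wait 10 minutes and hope every build has finished...
sleep 600

# ...then stop the service, whether or not builds are still running.
systemctl stop concourse-worker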

Thank you!


#2

Hmm, there’s no API for this at the moment. The ATC, however, will automatically retire a worker once the uninterruptible builds running on it have finished, at which point the concourse worker command should exit on its own. So it’s just up to you whether you want to wait for that arbitrary amount of time or place a timeout on it.
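If you’d rather put a bound on it yourself, it could look something like this (a rough sketch; the unit name and timings are placeholders):

# Ask the worker to retire, then wait for the process to exit on its
# own, with an upper bound, instead of sleeping for a fixed interval.
PID="$(systemctl show -p MainPID --value concourse-worker)"
kill -USR2 "$PID"

for i in $(seq 1 120); do         # up to 120 * 10s = 20 minutes
    kill -0 "$PID" 2>/dev/null || exit 0
    sleep 10
done
echo "worker did not retire within the timeout" >&2
exit 1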

Is that behavior not enough? It could be an interesting API to add.


#3

Thank you for your response! I think my issue was that the concourse worker was exiting with a non-zero exit code after retiring, and because our systemd setup specifies that the service should be restarted if it exits with an error (a non-zero exit code), it just kept getting restarted.

I mistakenly thought the problem was that the service’s stop command was being executed while the worker still had some un-interruptible builds on it, but from your description (after retiring, the worker stops gracefully) it seems that part is behaving correctly.

I think my solution will be simply to remove the “restart on failure” configuration, as it’s not doing what we want here.
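Concretely, something like this should do it (a sketch; Restart= is the standard systemd directive, the unit name is ours):

sudo systemctl edit concourse-worker
# ...then, in the editor, add a drop-in like:
#
#   [Service]
#   Restart=no
#
# Alternatively, SuccessExitStatus= can declare specific exit codes as
# clean, but that risks masking real failures.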


#4

Ah, ok. Wonder why it exited 1 though. Do you see any errors in the logs?


#5

@wagdav Could you please post the contents of our systemd unit file for the concourse worker once you finish your investigation into reliable draining? This could help @serge, and we might get some feedback from @vito.


#6

I would be interested in an API to ask the worker how many builds are still running and how old they are. I’m doing something wrong somewhere, and I think I have a small number of permanently stuck jobs hanging around.
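In the meantime, something like this approximates it from the outside (a sketch; the fly target name is a placeholder, and the worker-name column position is an assumption about my fly version):

# Count containers per worker; build containers hint at running builds.
fly -t ci containers | awk 'NR > 1 { print $2 }' | sort | uniq -c | sort -rn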


#7

Wonder why it exited 1 though. Do you see any errors in the logs?

I wondered that myself. I didn’t see anything in the logs when it shut down, but I wasn’t looking too hard. I will try to capture any warnings/errors in the logs and post them here. Maybe run a worker in “debug” mode or something…
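(Something like this, if I’m reading concourse worker --help right; the other flags are elided:)

# Run the worker with verbose logging.
concourse worker --log-level=debug ...
# or, equivalently, via the environment:
CONCOURSE_LOG_LEVEL=debug concourse worker ...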


#8

Ok, so the logs are below. From what I can see, it’s trying to “sweep” containers and volumes after the services listening on ports 2222 and 7788 have stopped listening; I’m not sure what to make of it.

Sep 05 02:44:37 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115477.106324434","source":"worker","message":"worker.sweep","log_level":1,"data":{"cmd":"sweep-containers"}}
Sep 05 02:44:37 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115477.125513792","source":"worker","message":"worker.sweep","log_level":1,"data":{"cmd":"sweep-volumes"}}
Sep 05 02:44:38 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115478.521609545","source":"guardian","message":"guardian.api.garden-server.run.exited","log_level":1,"data":{"handle":"96c463c2-665d-474e-5a6c-589b3b704c9d","id":"c0a0d6a9-825c-4f84-4a6f-bc1b38269103","session":"3.1.2007","status":0}}
Sep 05 02:44:38 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115478.740438938","source":"worker","message":"worker.beacon.beacon.beacon-client.keepalive.failed","log_level":2,"data":{"error":"read tcp 172.18.43.178:57542-\u003e172.18.32.32:2222: use of closed network connection","session":"4.1.1.1"}}
Sep 05 02:44:43 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115483.810646534","source":"worker","message":"worker.failed-to-destroy-handles","log_level":2,"data":{"error":"Delete http://0.0.0.0:7788/volumes/destroy: dial tcp 0.0.0.0:7788: connect: connection refused"}}
Sep 05 02:44:43 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115483.810684681","source":"worker","message":"worker.failed-to-sweep-volumes","log_level":2,"data":{"error":"Delete http://0.0.0.0:7788/volumes/destroy: dial tcp 0.0.0.0:7788: connect: connection refused"}}
Sep 05 02:44:43 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115483.810698271","source":"worker","message":"worker.sweeper.failed-to-sweep-volumes","log_level":2,"data":{"error":"Delete http://0.0.0.0:7788/volumes/destroy: dial tcp 0.0.0.0:7788: connect: connection refused","session":"6"}}
Sep 05 02:44:43 ip-AC122BB2.foobar.example.com concourse-worker[3480]: {"timestamp":"1536115483.810711384","source":"worker","message":"worker.sweeper.exiting-from-mark-and-sweep","log_level":1,"data":{"session":"6"}}
Sep 05 02:46:13 ip-AC122BB2.foobar.example.com systemd[1]: concourse-worker.service: State 'stop-sigterm' timed out. Killing.
Sep 05 02:46:13 ip-AC122BB2.foobar.example.com systemd[1]: concourse-worker.service: Unit entered failed state.
Sep 05 02:46:13 ip-AC122BB2.foobar.example.com systemd[1]: concourse-worker.service: Failed with result 'timeout'.
Sep 05 02:46:14 ip-AC122BB2.foobar.example.com systemd[1]: concourse-worker.service: Service hold-off time over, scheduling restart.
Sep 05 02:46:14 ip-AC122BB2.foobar.example.com systemd[1]: Stopped concourse-worker.
Sep 05 02:46:14 ip-AC122BB2.foobar.example.com systemd[1]: Started concourse-worker.
Sep 05 02:46:14 ip-AC122BB2.foobar.example.com concourse-worker[21771]: {"timestamp":"1536115574.896947861","source":"worker","message":"worker.setup.already-done","log_level":1,"data":{"session":"1"}}
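Reading the tail of that, systemd’s ‘stop-sigterm’ timeout expired and it killed the worker, which would explain the failed state and the restart. If the drain genuinely needs longer, I suppose the stop timeout could be raised (a sketch; the value is just an example):

sudo systemctl edit concourse-worker
# ...and add:
#
#   [Service]
#   TimeoutStopSec=15min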

#9

Hmm, I guess that’s not too surprising. I think the worker doesn’t really exit gracefully and sort of stumbles over as its different components stop individually. We should fix that.