Worker fails to start

Hi,
Our Concourse is running in GKE (Kubernetes), installed via the Helm chart.
Concourse version: v5.5.1

Currently only a single worker out of 10 constantly fails to run because it:

  1. fails livenessProbe checks
  2. can’t connect to port 7777

When describing the problem worker pod, the failing health check is the main issue:

Warning Unhealthy 10m (x21 over 20m) kubelet, gke-ci-gke-cluster-worker-node-pool-11f790f7-bnc0 Liveness probe failed: Get http://10.20.35.5:8888/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
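I removed the livenessProbe to rule the probe itself out; for reference, that was just a patch against the worker StatefulSet, something like the following (the StatefulSet name, namespace, and container index are assumptions based on my setup):

# drop the livenessProbe from the worker container's pod template
kubectl -n concourse patch statefulset ci-worker --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'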

However, even with the livenessProbe removed, the pod still fails to connect to web. The worker logs show:

{"timestamp":"2019-10-16T18:30:23.258889937Z","level":"info","source":"baggageclaim","message":"baggageclaim.using-driver","data":{"driver":"naive"}}
{"timestamp":"2019-10-16T18:30:23.259848578Z","level":"info","source":"worker","message":"worker.garden.dns-proxy.started","data":{"session":"1.2"}}
{"timestamp":"2019-10-16T18:30:23.259887260Z","level":"info","source":"baggageclaim","message":"baggageclaim.listening","data":{"addr":"127.0.0.1:7788"}}
{"timestamp":"2019-10-16T18:30:23.295004174Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}

When I exec into the pod, the env variables all seem to be there. After installing netstat, I see only a single established TCP connection:

root@ci-worker-6:/# netstat -tna
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.20.35.3:60776        10.23.246.25:2222       ESTABLISHED

This is very different from the other (working) worker pods:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:7776          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:7777          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:7787          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:7788          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:7777          127.0.0.1:48462         TIME_WAIT
tcp        0      0 127.0.0.1:7777          127.0.0.1:48418         TIME_WAIT
tcp        0      0 10.20.30.5:37140        10.23.246.25:2222       TIME_WAIT
tcp        0      0 127.0.0.1:7777          127.0.0.1:48426         TIME_WAIT
tcp        0      0 10.20.30.5:36618        10.23.246.25:2222       ESTABLISHED
tcp        0      0 10.20.30.5:37102        10.23.246.25:2222       TIME_WAIT
tcp        0      0 127.0.0.1:7788          127.0.0.1:59998         TIME_WAIT
tcp        0      0 10.20.30.5:37136        10.23.246.25:2222       TIME_WAIT
tcp        0      0 127.0.0.1:7777          127.0.0.1:48444         TIME_WAIT
tcp        0      0 10.20.30.5:37104        10.23.246.25:2222       TIME_WAIT
tcp        0      0 10.20.30.5:37100        10.23.246.25:2222       TIME_WAIT
tcp        0      0 10.20.30.5:37106        10.23.246.25:2222       TIME_WAIT
tcp        0      0 10.20.30.5:37138        10.23.246.25:2222       TIME_WAIT
tcp        0      0 10.20.30.5:37134        10.23.246.25:2222       TIME_WAIT
tcp        0      0 127.0.0.1:7788          127.0.0.1:60024         TIME_WAIT
tcp6       0      0 :::8888                 :::*                    LISTEN

It seems the problematic worker was able to establish the SSH connection to the web node (port 2222) but nothing else…
In the web pod logs I find
“worker not present in db”
“connect: connection timed out” for the problematic pod.
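The web-side errors were pulled with something along these lines (the web Deployment name and namespace are assumptions based on my release):

kubectl -n concourse logs deployment/ci-web --since=1h | grep -iE 'worker|tsa'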

If anyone has any idea how to debug this, please let me know.

Does fly workers reveal anything interesting, e.g. is it possible a worker isn’t being retired/landed properly before reconnecting or being relaunched?

Yes, I checked fly workers; the problem worker is not in the list of workers. It's not able to register at all. I've also already rebooted all the web nodes.

I checked the db as well and didn't see the worker there either.
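The ATC keeps registered workers in a workers table, so the check was roughly the query below (connection details are placeholders; column names may differ between Concourse versions):

psql -h <db-host> -U <db-user> -d <db-name> -c "SELECT name, state, addr FROM workers;"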

Is it possible that a worker’s state has not been cleaned up properly? (e.g. the persistent volume was still around when you re-created the worker?)

It could be a stale process keeping the port open.
I find that when I kill a worker, it very often does not kill all of its children.
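You can check for leftovers with standard tooling, e.g.:

# anything still bound to the Garden port?
netstat -tlnp | grep ':7777'
# any orphaned worker children still around?
ps -ef | grep -E 'gdn|garden|baggageclaim' | grep -v grep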

Have you tried a reboot?

Or (rough commands below):

  1. Stop the worker
  2. Prune the worker with fly
  3. Regenerate the worker key and add it to the authorised keys on the web node
  4. Delete everything in /opt/concourse/worker and below
  5. Restart the worker
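For a binary install that's something like the following (the systemd unit name, key paths, authorized-keys file location, and fly target are assumptions, adjust to your setup):

systemctl stop concourse-worker
fly -t ci prune-worker -w <worker-name>        # only possible while the worker is still registered
concourse generate-key -t ssh -f /etc/concourse/worker_key
# copy worker_key.pub to the web node and append it to the file that
# CONCOURSE_TSA_AUTHORIZED_KEYS points at
rm -rf /opt/concourse/worker/*
systemctl start concourse-worker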

Yes, I've tried rebooting the pod and also deleting the pod so it's running on a fresh VM.

You can't prune a worker that isn't registered in the first place.

Well, I didn't delete the Helm release, I just restarted the pod. I thought it was only the worker's volumes that were stored in the PVC?

Yeah, you definitely need a way to fully re-create a worker, including destroying its state.

If you can figure that out and still have problems with that worker, let us know.

OK, if that's the case I'll delete the persistent volume while the pod is down; once the pod is recreated it should create a new persistent volume.

Thanks @eedwards-sk, that did it: deleting the PVC was the solution. I'm curious what's in the persistent volume that might have caused this? Is there some sort of state file?

For our setup we have autoscaling in K8s, so what I had to do was:

  1. adjust the max autoscaling to equal the number of VMs running
  2. cordon the node
  3. delete the problem pod; it will now be stuck in the Pending state
  4. delete the PVC, which will delete the PV as well

Once that's done, uncordon the node and delete the pod again so it gets scheduled onto the node and the worker pod recreates the PVC (rough commands below).
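Aside from capping the autoscaler (step 1), the commands looked like this; the pod and node names are from my cluster, the namespace is an assumption, and the actual claim name should be checked with kubectl get pvc:

kubectl cordon gke-ci-gke-cluster-worker-node-pool-11f790f7-bnc0
kubectl -n concourse delete pod ci-worker-6
# the StatefulSet claim is named something like concourse-work-dir-ci-worker-6
kubectl -n concourse delete pvc concourse-work-dir-ci-worker-6
kubectl uncordon gke-ci-gke-cluster-worker-node-pool-11f790f7-bnc0
kubectl -n concourse delete pod ci-worker-6    # reschedules onto the node and recreates the PVC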

Glad you got it sorted. As we discussed in Discord, I'm not privy to the particulars of the volume mapping, only that leaving its “data dir” in place allows it to pick up existing state.

I don't use K8s, so my knowledge of the Helm chart and its lifecycle control is limited.

In general, being able to dispose/re-create a worker (without state) is probably the most useful tool you can have to keep a concourse cluster humming.

I use Concourse v5.5.1 binaries directly, but perhaps had the same problem with one of my three workers:

-- Logs begin at Fri 2019-10-25 13:43:33 PDT, end at Fri 2019-10-25 14:36:18 PDT. --
Oct 25 13:43:39 ubuntu-206 systemd[1]: Started Concourse CI Worker Service.
Oct 25 13:43:40 ubuntu-206 concourse_worker[992]: + export CONCOURSE_GARDEN_LOG_LEVEL=error
Oct 25 13:43:40 ubuntu-206 concourse_worker[992]: + CONCOURSE_GARDEN_LOG_LEVEL=error
Oct 25 13:43:40 ubuntu-206 concourse_worker[992]: + export CONCOURSE_GARDEN_TIME_FORMAT=rfc3339
Oct 25 13:43:40 ubuntu-206 concourse_worker[992]: + CONCOURSE_GARDEN_TIME_FORMAT=rfc3339
Oct 25 13:43:40 ubuntu-206 concourse_worker[992]: + export CONCOURSE_GARDEN_DNS_SERVER=8.8.8.8
Oct 25 13:43:40 ubuntu-206 concourse_worker[992]: + CONCOURSE_GARDEN_DNS_SERVER=8.8.8.8
Oct 25 13:43:40 ubuntu-206 concourse_worker[992]: + /usr/local/concourse/bin/concourse worker --ephemeral --baggageclaim-log-level error --work-dir /opt/concourse/worker --tsa-host surpiviz.ucsf.edu:2222 --tsa-public-key keys/tsa_host_key.pub --tsa-worker-private-key keys/worker_key
Oct 25 13:43:51 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:51.868368986Z","level":"info","source":"worker","message":"worker.garden.forwarding-garden-env-var","data":{"env":"CONCOURSE_GARDEN_TIME_FORMAT","flag":"time-format","session":"1"}}
Oct 25 13:43:51 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:51.868581771Z","level":"info","source":"worker","message":"worker.garden.forwarding-garden-env-var","data":{"env":"CONCOURSE_GARDEN_LOG_LEVEL","flag":"log-level","session":"1"}}
Oct 25 13:43:51 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:51.868621302Z","level":"info","source":"worker","message":"worker.garden.forwarding-garden-env-var","data":{"env":"CONCOURSE_GARDEN_DNS_SERVER","flag":"dns-server","session":"1"}}
Oct 25 13:43:57 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:57.545500897Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.262552887Z","level":"error","source":"guardian","message":"guardian.failed-to-load-properties","data":{"error":"EOF","propertiesPath":"/opt/concourse/worker/garden-properties.json"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: EOF
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: EOF
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264099932Z","level":"error","source":"worker","message":"worker.garden.gdn-runner.logging-runner-exited","data":{"error":"exit status 1","session":"1.2"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264259414Z","level":"error","source":"worker","message":"worker.garden-runner.logging-runner-exited","data":{"error":"Exit trace for group:\ngdn exited with error: exit status 1\n","session":"8"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264353612Z","level":"info","source":"worker","message":"worker.container-sweeper.sweep-cancelled-by-signal","data":{"session":"6","signal":2}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264453011Z","level":"info","source":"worker","message":"worker.container-sweeper.logging-runner-exited","data":{"session":"13"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264487726Z","level":"info","source":"worker","message":"worker.volume-sweeper.sweep-cancelled-by-signal","data":{"session":"7","signal":2}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264493781Z","level":"info","source":"worker","message":"worker.debug-runner.logging-runner-exited","data":{"session":"10"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264498117Z","level":"info","source":"worker","message":"worker.healthcheck-runner.logging-runner-exited","data":{"session":"11"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264526199Z","level":"info","source":"worker","message":"worker.volume-sweeper.logging-runner-exited","data":{"session":"14"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.264607035Z","level":"info","source":"worker","message":"worker.baggageclaim-runner.logging-runner-exited","data":{"session":"9"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.545732770Z","level":"info","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.retrying","data":{"addr":"127.0.0.1:7777","network":"tcp","session":"4.1.4"}}
Oct 25 13:43:58 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:58.545938323Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}
Oct 25 13:43:59 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:59.546064691Z","level":"info","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.retrying","data":{"addr":"127.0.0.1:7777","network":"tcp","session":"4.1.4"}}
Oct 25 13:43:59 ubuntu-206 concourse_worker[992]: {"timestamp":"2019-10-25T20:43:59.546235434Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}

This file was empty: /opt/concourse/worker/garden-properties.json
Deleting the file and restarting the worker fixed the problem.
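The fix, roughly (the systemd unit name is an assumption; use whatever runs your worker):

sudo systemctl stop concourse-worker
sudo rm /opt/concourse/worker/garden-properties.json
sudo systemctl start concourse-worker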