Overlay driver "failed to set up driver" on 5.4.1 -> 5.5.0

We recently bumped the Concourse helm chart from 8.2.3 to 8.2.5, which bumps Concourse from 5.4.1 to 5.5.0.

Our workers were configured to use concourse.worker.baggageclaim.driver: overlay, as follows:

    ephemeral: true
    baggageclaim:
      driver: overlay

During the upgrade of the chart, the worker pods began to fail and crashloop:

    kubectl -n concourse-workers logs -f concourse-workers-worker-7 --tail=100
    {"timestamp":"2019-09-25T16:06:38.178879585Z","level":"error","source":"baggageclaim","message":"baggageclaim.failed-to-set-up-driver","data":{"error":"no such file or directory"}}
    no such file or directory

I’ve never personally seen that baggageclaim.failed-to-set-up-driver error before.
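One thing that might be worth ruling out (this is my speculation, not something confirmed in the thread) is whether the worker node's kernel actually supports overlayfs, since the overlay driver depends on it. A minimal check, run from inside the worker container or on the node:

```shell
# Speculative pre-flight check: is overlayfs listed as a supported
# filesystem by this kernel? baggageclaim's overlay driver needs it.
if grep -qw overlay /proc/filesystems; then
  echo "overlay: supported"
else
  echo "overlay: not listed in /proc/filesystems"
fi
```

If overlay isn't listed, driver: detect would quietly fall back to a different driver, which could explain why detect worked while overlay failed.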

The solution was to set the driver to anything different (in our case we just went with detect, but explicitly setting btrfs worked fine too):

    ephemeral: true
    baggageclaim:
      driver: detect

I rolled the worker deployment back to 5.4.1 with the driver set back to overlay, and things started working again just fine.

(To keep the workers consistent with the web deployment (on 5.5.0), I currently have the workers on 5.5.0 as well, but with the driver set to detect.)

Any thoughts or advice?

possibly related: Random worker fails

Does this ring a bell, @kcmannem? I’m wondering if it could be related to https://github.com/concourse/baggageclaim/pull/30.

I noticed that there’s quite a bit of code without error wrapping, so my guess is that it’s somewhere along those lines :thinking: wdyt?

@aegershman, this might be a bit of a stretch :sweat_smile: but would you mind taking a look at which file it’s failing to read? You can do that either via strace (limiting to openat syscalls, I guess), or something fancier via bpf instrumentation (e.g., https://github.com/iovisor/kubectl-trace + https://github.com/iovisor/bpftrace/blob/master/tools/opensnoop.bt)
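The strace route could look roughly like this (a sketch, not a tested recipe: the pod/namespace names are taken from earlier in this thread, strace must be available in the container image, and the pgrep pattern for the worker process is my assumption):

```shell
# Sketch: trace openat syscalls from the worker process to find which
# path is returning ENOENT. Pod/namespace names are from this thread;
# the process name matched by pgrep is an assumption.
kubectl -n concourse-workers exec -it concourse-workers-worker-7 -- \
  sh -c 'strace -f -e trace=openat -p "$(pgrep -o concourse)" 2>&1 \
    | grep -i "no such file"'
```

The bpftrace/opensnoop route avoids needing strace in the image, but requires kubectl-trace installed and a cluster that permits privileged tracing pods.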


I will try, but I’m kind of a dingus and I’m not able to kubectl exec -it <worker> -- /bin/bash into a Concourse k8s worker pod due to a permissions error:

    kubectl -n concourse-workers exec -it concourse-workers-worker-7 -- /bin/sh
    Error from server (Forbidden): pods "concourse-workers-worker-7" is forbidden: cannot exec into or attach to a privileged container

… I also don’t know how to use bpf instrumentation, so it’ll take a bit for me to figure this out.

I’ll take a look at this a bit later though, apologies for the delay

I believe this might be where the “no such file or directory” error is coming from.

The liveVolumes dir creation logic hasn’t changed in ages, so I’m surprised it would suddenly fail to find it. Are you able to ssh into your pod and check whether that directory exists? I’m not sure where it would live in helm-chart-deployed workers, though. Maybe “/worker-state/work/volumes/live”? Some path like that?
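If you do manage to get a shell in the pod, the check itself is trivial (a sketch; the path below is the guess from above, and the real location depends on the chart's work-dir setting):

```shell
# Sketch: look for baggageclaim's live-volumes directory inside the
# worker. The path is a guess from the discussion above; adjust it to
# match your chart's configured work dir.
VOLUMES_LIVE=/worker-state/work/volumes/live
if [ -d "$VOLUMES_LIVE" ]; then
  echo "exists: $VOLUMES_LIVE"
else
  echo "missing: $VOLUMES_LIVE"
fi
```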

I’m really sorry, but I’m unable to ssh into the worker pod: I can’t figure out what I’m doing wrong with my cluster’s security permissions to warrant the error message Error from server (Forbidden): pods "concourse-workers-worker-7" is forbidden: cannot exec into or attach to a privileged container

Any advice is welcome, but otherwise I’m going to need to spend some time next week figuring out wtf I’m doing wrong that’s preventing me from observing or attaching to the workers.

Interestingly, this appears to have been related to the formatting of the persistent volume attached to the worker.

I destroyed the worker deployment entirely and deleted the PVCs associated with it, thus destroying the underlying PVs. (Keep in mind that simply scaling the worker statefulset down to a lower replica count doesn’t destroy the PVCs/PVs, so when you scale back up, the worker pods re-attach to the existing PVs.)
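Roughly what that teardown looks like (a sketch from memory: the statefulset name and label selector are assumptions based on the pod names in this thread, so check yours with kubectl get statefulset,pvc first):

```shell
# Sketch: tear down the worker statefulset AND its PVCs so fresh PVs
# are created on redeploy. Names/labels are assumed from the pod names
# in this thread; verify them against your own cluster first.
kubectl -n concourse-workers get statefulset,pvc
kubectl -n concourse-workers delete statefulset concourse-workers-worker
kubectl -n concourse-workers delete pvc -l app=concourse-workers-worker
# (simply scaling the statefulset to 0 would leave the PVCs/PVs behind)
```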

I redeployed the workers with driver: overlay and now things are working swimmingly.

(I also decided to remove the use of persistence on the workers in favor of emptyDir, so when the worker pods are recreated, they’re entirely reset.)
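In chart values, disabling persistence looks roughly like this (a sketch; the key names are from my recollection of the chart and should be verified against the chart's values.yaml for your version):

```yaml
# Sketch of helm values: disable the worker PVCs so the chart falls
# back to emptyDir for worker state. Verify key names against your
# chart version's values.yaml.
persistence:
  enabled: false
worker:
  ephemeral: true
```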

Not sure what else I can offer in terms of advice or debugging help, but it appeared to be because the PV outlived the worker pods. Even though the work-dir was cleaned and reset on every pod restart, it appears something else was persisting… Including, in my case, orphaned containers on the workers (https://github.com/concourse/concourse/issues/4513)

Let me know if there’s anything else I can do to help offer insight or debugging for others, but our current mitigation is to simply recreate the worker deployment & set them to use emptyDir