Random worker fails

#1

Hi!

We’ve just updated to Concourse 5.1.0 (via our module on GitHub). The module runs web and worker nodes in an ASG. I currently have three workers. Two of them run beautifully; the third consistently fails to create volumes. I’m happy to give access to anyone who can help us figure this out, because I’m completely stumped. Logs from a “Failed to create Volume” call are below:

Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: {"timestamp":"2019-04-28T01:16:21.265123531Z","level":"info","source":"baggageclaim","message":"baggageclaim.api.volume-server.get-volume.get-volume.volume-not-found","data":{"session":"3.1.4080.1","volume":"abfb34ce-8044-4980-7e8a-37c60fd5bd19"}}
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: {"timestamp":"2019-04-28T01:16:21.265162464Z","level":"info","source":"baggageclaim","message":"baggageclaim.api.volume-server.get-volume.volume-not-found","data":{"session":"3.1.4080","volume":"abfb34ce-8044-4980-7e8a-37c60fd5bd19"}}
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: gzip: stdin: unexpected end of file
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: tar: Unexpected EOF in archive
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: tar: Unexpected EOF in archive
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: tar: Error is not recoverable: exiting now
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: {"timestamp":"2019-04-28T01:16:21.449739215Z","level":"info","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.malformed-archive","data":{"error":"exit status 2","handle":"abfb34ce-8044-4980-7e8a-37c60fd5bd19","session":"3.1.4081.1"}}
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: {"timestamp":"2019-04-28T01:16:21.449972355Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.create-volume.failed-to-materialize-strategy","data":{"error":"exit status 2","handle":"abfb34ce-8044-4980-7e8a-37c60fd5bd19","session":"3.1.4081.1"}}
Apr 28 01:16:21 ip-172-0-5-227 concourse[19984]: {"timestamp":"2019-04-28T01:16:21.450133800Z","level":"error","source":"baggageclaim","message":"baggageclaim.api.volume-server.create-volume-async.failed-to-create","data":{"error":"exit status 2","handle":"abfb34ce-8044-4980-7e8a-37c60fd5bd19","privileged":true,"session":"3.1.4081","strategy":{"type":"import","path":"/etc/concourse/resource-types/docker-image/rootfs.tgz","follow_symlinks":false}}}
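The gzip and tar errors above suggest the import of rootfs.tgz is hitting a truncated or unreadable archive, so one thing worth ruling out is whether the file on this worker is itself intact. A minimal sketch, assuming GNU gzip and tar are on the box; the function name check_archive is hypothetical, and the path in the usage comment is the one from the failed-to-create log line:

```shell
# Hypothetical helper: returns 0 only if both the gzip stream and the
# tar archive inside it can be read to the end.
check_archive() {
    local archive="$1"
    # gzip -t verifies the compressed stream without extracting anything.
    if ! gzip -t "$archive" 2>/dev/null; then
        echo "gzip stream is corrupt or truncated: $archive"
        return 1
    fi
    # tar -tzf lists every entry, which forces a full read of the archive.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "tar archive is malformed: $archive"
        return 2
    fi
    echo "archive looks intact: $archive"
}

# Usage:
#   check_archive /etc/concourse/resource-types/docker-image/rootfs.tgz
```

If the archive passes on the bad worker too, the file itself is probably not the culprit and the problem is more likely in how the volume is being created.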

Disk space is fine; everything is on /dev/nvme0n1p1, as seen below:

ubuntu@ip-172-0-5-227:/concourse-tmp$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.8G     0  3.8G   0% /dev
tmpfs           769M  732K  768M   1% /run
/dev/nvme0n1p1   97G  5.7G   92G   6% /
tmpfs           3.8G     0  3.8G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/loop0       90M   90M     0 100% /snap/core/6673
/dev/loop1       18M   18M     0 100% /snap/amazon-ssm-agent/1068
tmpfs           769M     0  769M   0% /run/user/1000
ubuntu@ip-172-0-5-227:/concourse-tmp$

A call to localhost:7788/volumes does indeed return an empty set:

ubuntu@ip-172-0-5-227:/concourse-tmp$ curl localhost:7788/volumes
[]

The volume Concourse is looking for does exist in the overlays dir (it’s the last one listed):

ubuntu@ip-172-0-5-227:/concourse-tmp/overlays$ ls | grep abf
451ad655-4938-45e4-5701-0abfb3bf6538
abf57ab2-90d9-4c40-62e1-55eb8bfe7ae8
abfb34ce-8044-4980-7e8a-37c60fd5bd19
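To see the mismatch between what is on disk and what the baggageclaim API reports in one go, something like the following could help. This is only a sketch: the function name unlisted_overlays and the /tmp/volumes.json file are hypothetical, and it uses a plain fixed-string match on the directory name rather than assuming anything about the JSON layout.

```shell
# Hypothetical helper: print overlay directory names that do not appear
# anywhere in a saved baggageclaim API response.
# $1 = overlays directory, $2 = file holding the JSON response
unlisted_overlays() {
    local dir="$1" response="$2" entry handle
    for entry in "$dir"/*; do
        # If the directory is empty the glob stays unexpanded; skip it.
        [ -e "$entry" ] || continue
        handle=$(basename "$entry")
        # Fixed-string search, so no JSON field names are assumed.
        grep -qF -- "$handle" "$response" || echo "$handle"
    done
}

# Usage:
#   curl -s localhost:7788/volumes > /tmp/volumes.json
#   unlisted_overlays /concourse-tmp/overlays /tmp/volumes.json
```

With the empty response shown above, this would print every directory under overlays, including the handle from the failing create-volume call.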

Before we file a bug I wanted to see if I’m just doing something dumb. This is the only worker that goes sideways. We start it via systemd:

[Unit]
Description=Concourse Worker Service
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
RestartSec=1
ExecStart=/etc/concourse/bin/concourse worker \
                         --bind-ip 0.0.0.0 \
                         --baggageclaim-bind-ip 0.0.0.0 \
                         --baggageclaim-driver overlay \
                         --garden-config /etc/concourse/gdn-config.ini \
                         --tsa-host redacted:2222 \
                         --tsa-public-key /etc/concourse/keys/worker/tsa_host_key.pub \
                         --tsa-worker-private-key /etc/concourse/keys/worker/worker_key \
                         --work-dir /concourse-tmp

[Install]
WantedBy=multi-user.target

It’s also worth noting that fly -t my-target volumes shows no volumes owned by that worker. I’ll spare folks the spew (there are a lot of volumes).

#2

For anyone following along: swapping to the “naive” baggageclaim driver solved all my issues.
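For reference, the swap amounts to changing the value passed to --baggageclaim-driver in the worker’s unit file. A rough sketch of doing that with GNU sed; the function name set_baggageclaim_driver and the unit file name in the usage comment are hypothetical, so adjust them to your setup:

```shell
# Hypothetical helper: rewrite the value of --baggageclaim-driver in place.
# $1 = path to the systemd unit file, $2 = driver name (e.g. naive)
set_baggageclaim_driver() {
    local unit="$1" driver="$2"
    # Replace only the token that follows the flag; the rest of the line
    # (including any trailing line-continuation backslash) is untouched.
    sed -i -E "s/(--baggageclaim-driver)[[:space:]]+[^[:space:]]+/\1 ${driver}/" "$unit"
}

# Usage (unit file name is an assumption):
#   sudo bash -c "$(declare -f set_baggageclaim_driver); \
#       set_baggageclaim_driver /etc/systemd/system/concourse-worker.service naive"
#   sudo systemctl daemon-reload
#   sudo systemctl restart concourse-worker
```

Keep in mind that the naive driver copies volume contents instead of layering them, so it trades disk space and speed for simplicity.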

#3

Confirmed. I switched back to “overlay” and everything broke again. It looks like it’s always the last worker created that fails. Could it be some kind of weird off-by-one error?
