Concourse Disk continues to Balloon

Hey Everyone,

We have deployed Concourse to DigitalOcean via Helm and have recently been running into an issue where the worker disk balloons to 100+Gi in just a few days. The workers are not cleaning up after themselves, so builds fail with the error below.

One thing to note before we get into details: this only happens when I build images. It happened with the docker_image resource and is now happening with the oci_build_task. That may be a red herring, but it's worth noting.

Also worth noting: I have been using the overlay driver. I just tried switching to btrfs to see whether the problem continues.
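
(In case it helps anyone else, the switch was just a Helm values change, roughly the following. I'm assuming the release name, chart, and the concourse.worker.baggageclaim.driver key from the stable chart here, so check your chart's values if it differs.)

helm upgrade concourse stable/concourse --reuse-values \
  --set concourse.worker.baggageclaim.driver=btrfs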

#23 ERROR: error writing layer blob: failed to copy: failed to send write: write /tmp/build/2ebbeb81/cache/ingest/e37e6369f3536c9d6f90383b16e5abc5bd0b885133e8c5182fe234a0e221e4fb/data: no space left on device: unknown

I have done a little spelunking and the disk is definitely full. One worker shows the disk as:

kubectl exec -it -n concourse concourse-worker-0 -- /bin/bash
root@concourse-worker-0:/# df -hT
Filesystem                                                               Type     Size  Used Avail Use% Mounted on
overlay                                                                  overlay   79G   13G   63G  17% /
tmpfs                                                                    tmpfs     64M     0   64M   0% /dev
tmpfs                                                                    tmpfs    2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs                                                                    tmpfs    2.0G  8.0K  2.0G   1% /concourse-keys
/dev/disk/by-id/scsi-0DO_Volume_pvc-361d28ac-9822-4794-b6f7-754f6070a51a ext4      98G   98G     0 100% /concourse-work-dir
/dev/vda1                                                                ext4      79G   13G   63G  17% /etc/hosts
shm                                                                      tmpfs     64M     0   64M   0% /dev/shm
tmpfs                                                                    tmpfs    2.0G   12K  2.0G   1% /run/secrets/kubernetes.io/serviceaccount
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/66c41ac9-bb5f-4a6e-5fc1-264697dc951c/volume
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/1094f166-cc2a-43b7-6a53-43bea8f6b2ff/volume
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/a59a9aa3-e195-41f3-778f-433cd593f25f/volume
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/72eef3d1-3cc0-488d-5759-57b2c078a2c7/volume

Notice that the work dir is full:

root@concourse-worker-0:/concourse-work-dir# du -h -d 1
104K	./depot
111G	./volumes
98G	./overlays
208G	.

Digging into the ATC there are no notable errors, but there is an error in the database:

2019-11-26 17:55:29.481 GMT [240] DETAIL:  Key (id, state)=(2099, created) is still referenced from table "volumes".
2019-11-26 17:55:29.481 GMT [240] STATEMENT:  UPDATE volumes SET state = $1 WHERE (id = $2 AND (state = $3 OR state = $4))
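
That DETAIL line looks like the garbage collector can't transition a volume because child volumes still reference it. A quick way to see how many volumes are stuck in each state is to query the Concourse database directly; this is a sketch, the connection details and database name are placeholders, and the parent_id column is an assumption based on the self-referencing key in the error:

psql -h <db-host> -U concourse <database> -c "SELECT state, count(*) FROM volumes GROUP BY state;"
psql -h <db-host> -U concourse <database> -c "SELECT id, state FROM volumes WHERE parent_id = 2099;"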

I did find this thread Resetting a worker to a "clean" state but have not had any luck with it :confused:
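
For context, what I tried boils down to pruning the worker and recreating its work-dir volume, roughly the following (the fly target and PVC name are illustrative for my setup; check kubectl get pvc -n concourse for the actual PVC name). The StatefulSet should then recreate the pod with a fresh PVC.

fly -t main prune-worker --worker concourse-worker-0
kubectl delete pvc -n concourse concourse-work-dir-concourse-worker-0
kubectl delete pod -n concourse concourse-worker-0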

Has anyone run into a similar issue, or does anyone have suggestions for cleaning up workers?

Thanks

Did you see if there were any errors in the logs on the worker(s) with the full disks? If garbage collection is failing you should see a bunch of errors from baggageclaim. If I think of something else I’ll post again.
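
Something like this should surface them, assuming the standard Helm deployment where the worker logs JSON to stdout:

kubectl logs -n concourse concourse-worker-0 | grep baggageclaim | grep '"level":"error"'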

@taylorsilva I am also running into the exact same issue. We are running Concourse 5.6.0 via Helm.

We are seeing the same thing: the Concourse disk continues to balloon, and eventually the worker gets recreated.
I didn't find any errors from baggageclaim, rather connection refused errors from garden.
baggageclaim is showing a lot of "volume not found" errors, though. Can you help us resolve the problem? Our production system is affected.

Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         58534648 52055924   3981536  93% /
tmpfs              65536        0     65536   0% /dev
tmpfs            8212792        0   8212792   0% /sys/fs/cgroup
/dev/vda9       58534648 52055924   3981536  93% /etc/hosts
tmpfs            8212792        8   8212784   1% /concourse-keys
shm                65536        0     65536   0% /dev/shm
tmpfs            8212792       12   8212780   1% /run/secrets/kubernetes.io/serviceaccount
# command terminated with exit code 137
➜  product-k8s-concourse git:(13159fd) k logs concourse-worker-0 | grep "error"
{"timestamp":"2019-12-04T13:39:07.511837408Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}
{"timestamp":"2019-12-04T13:39:08.512268538Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}

@gowrisankar22 What storage driver are you using?
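
(One quick way to check, assuming the chart passes it through the standard CONCOURSE_BAGGAGECLAIM_DRIVER environment variable:)

kubectl exec -n concourse concourse-worker-0 -- env | grep BAGGAGECLAIM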

I do get this error:

→ kubectl logs -n concourse concourse-worker-0 concourse-worker-init-rm
ERROR: not a btrfs filesystem: /concourse-work-dir
ERROR: can't access '/concourse-work-dir'

→ kubectl logs -n concourse concourse-worker-1 concourse-worker-init-rm
ERROR: not a btrfs filesystem: /concourse-work-dir
ERROR: can't access '/concourse-work-dir' 

I am using overlay. @austinbv