We have deployed Concourse to DigitalOcean via Helm and have recently been running into issues where the disk balloons to 100+Gi in just a few days and workers do not clean up, so builds fail with the error below.
One thing to note before we get into the details: this only happens when I build images. It happened with the `docker_image` resource and is now happening with the `oci_build_task`. That may be a red herring, but it's worth noting.
Also worth noting: I have been using the `overlay` driver. I just tried switching to `btrfs` to see if the problem continues.
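In case it helps anyone reproduce the change, this is roughly what I set. I'm going from memory on the exact key path in the Concourse Helm chart's values, so treat it as a sketch and double-check it against your chart version:

```yaml
concourse:
  worker:
    baggageclaim:
      # was "overlay"; trying "btrfs" to see if volume GC behaves differently
      driver: btrfs
```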
```
#23 ERROR: error writing layer blob: failed to copy: failed to send write: write /tmp/build/2ebbeb81/cache/ingest/e37e6369f3536c9d6f90383b16e5abc5bd0b885133e8c5182fe234a0e221e4fb/data: no space left on device: unknown
```
I have done a little spelunking and the disk is definitely full. One worker shows:

```
kubectl exec -it -n concourse concourse-worker-0 -- /bin/bash
root@concourse-worker-0:/# df -hT
Filesystem                                                               Type     Size  Used Avail Use% Mounted on
overlay                                                                  overlay   79G   13G   63G  17% /
tmpfs                                                                    tmpfs     64M     0   64M   0% /dev
tmpfs                                                                    tmpfs    2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs                                                                    tmpfs    2.0G  8.0K  2.0G   1% /concourse-keys
/dev/disk/by-id/scsi-0DO_Volume_pvc-361d28ac-9822-4794-b6f7-754f6070a51a ext4      98G   98G     0 100% /concourse-work-dir
/dev/vda1                                                                ext4      79G   13G   63G  17% /etc/hosts
shm                                                                      tmpfs     64M     0   64M   0% /dev/shm
tmpfs                                                                    tmpfs    2.0G   12K  2.0G   1% /run/secrets/kubernetes.io/serviceaccount
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/66c41ac9-bb5f-4a6e-5fc1-264697dc951c/volume
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/1094f166-cc2a-43b7-6a53-43bea8f6b2ff/volume
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/a59a9aa3-e195-41f3-778f-433cd593f25f/volume
overlay                                                                  overlay   98G   98G     0 100% /concourse-work-dir/volumes/live/72eef3d1-3cc0-488d-5759-57b2c078a2c7/volume
```
Notice the work dir is full:

```
root@concourse-worker-0:/concourse-work-dir# du -h -d 1
104K    ./depot
111G    ./volumes
98G     ./overlays
208G    .
```
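For anyone else debugging this, here's the kind of one-liner I've been using to see which individual live volumes are the biggest. `WORK_DIR` is just a variable for illustration; the default matches my deployment's work dir:

```shell
#!/bin/sh
# List the 20 largest live volumes on a worker, biggest first.
# WORK_DIR defaults to my deployment's work dir; override as needed.
WORK_DIR="${WORK_DIR:-/concourse-work-dir}"
du -sh "$WORK_DIR"/volumes/live/* 2>/dev/null | sort -rh | head -20
```

Cross-referencing the biggest handles against `fly volumes` can tell you whether they are image resource caches or leftover build outputs.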
Digging into the ATC logs, there are no notable errors, but there is an error in the database logs:
```
2019-11-26 17:55:29.481 GMT  DETAIL:  Key (id, state)=(2099, created) is still referenced from table "volumes".
2019-11-26 17:55:29.481 GMT  STATEMENT:  UPDATE volumes SET state = $1 WHERE (id = $2 AND (state = $3 OR state = $4))
```
I did find this thread, *Resetting a worker to a "clean" state*, but have not had any luck with it.
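For reference, here is roughly the reset flow I tried from that thread, written as a dry run that just prints each step (set `RUN=""` to actually execute them). The `ci` fly target, namespace, and pod name are from my setup, so adjust for yours:

```shell
#!/bin/sh
# Dry-run sketch of resetting a worker to a clean state.
# RUN=echo prints the commands; set RUN="" to execute them for real.
RUN="${RUN:-echo}"
# 1. Land the worker so the ATC stops scheduling builds on it.
$RUN fly -t ci land-worker --worker concourse-worker-0
# 2. Wipe the work dir contents on the worker's persistent volume.
$RUN kubectl -n concourse exec concourse-worker-0 -- \
  sh -c 'rm -rf /concourse-work-dir/volumes /concourse-work-dir/overlays'
# 3. Delete the pod; the StatefulSet recreates it with empty runtime state.
$RUN kubectl -n concourse delete pod concourse-worker-0
# 4. Prune the stale registration so the fresh worker can re-register.
$RUN fly -t ci prune-worker --worker concourse-worker-0
```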
Has anyone run into a similar issue, or does anyone have suggestions for cleaning up workers?