Hey Everyone,
We have deployed Concourse to DigitalOcean via Helm and have recently been running into an issue where the worker disk balloons to 100+Gi in just a few days. The workers are not cleaning up after themselves, so builds fail with the error below.
One thing to note before we get into the details: this only happens when I build images. It happened with the docker_image resource and is now happening with the oci_build_task. That may be a red herring, but it's worth noting.
Also worth noting: I have been using the overlay baggageclaim driver. I just switched it to btrfs to see if the problem continues.
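For reference, the driver switch was a values override along these lines (the key path here is my reading of the concourse/concourse-chart defaults, so treat it as an assumption and check your chart version's values.yaml):

```yaml
# values override passed to `helm upgrade`
# key path assumed from the concourse/concourse-chart defaults
concourse:
  worker:
    baggageclaim:
      # overlay (previous) -> btrfs (current experiment)
      driver: btrfs
```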
#23 ERROR: error writing layer blob: failed to copy: failed to send write: write /tmp/build/2ebbeb81/cache/ingest/e37e6369f3536c9d6f90383b16e5abc5bd0b885133e8c5182fe234a0e221e4fb/data: no space left on device: unknown
I have done a little spelunking and the disk is definitely full. One worker shows:
kubectl exec -it -n concourse concourse-worker-0 -- /bin/bash
root@concourse-worker-0:/# df -hT
Filesystem Type Size Used Avail Use% Mounted on
overlay overlay 79G 13G 63G 17% /
tmpfs tmpfs 64M 0 64M 0% /dev
tmpfs tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
tmpfs tmpfs 2.0G 8.0K 2.0G 1% /concourse-keys
/dev/disk/by-id/scsi-0DO_Volume_pvc-361d28ac-9822-4794-b6f7-754f6070a51a ext4 98G 98G 0 100% /concourse-work-dir
/dev/vda1 ext4 79G 13G 63G 17% /etc/hosts
shm tmpfs 64M 0 64M 0% /dev/shm
tmpfs tmpfs 2.0G 12K 2.0G 1% /run/secrets/kubernetes.io/serviceaccount
overlay overlay 98G 98G 0 100% /concourse-work-dir/volumes/live/66c41ac9-bb5f-4a6e-5fc1-264697dc951c/volume
overlay overlay 98G 98G 0 100% /concourse-work-dir/volumes/live/1094f166-cc2a-43b7-6a53-43bea8f6b2ff/volume
overlay overlay 98G 98G 0 100% /concourse-work-dir/volumes/live/a59a9aa3-e195-41f3-778f-433cd593f25f/volume
overlay overlay 98G 98G 0 100% /concourse-work-dir/volumes/live/72eef3d1-3cc0-488d-5759-57b2c078a2c7/volume
Notice that the work dir (/concourse-work-dir) is at 100% use:
root@concourse-worker-0:/concourse-work-dir# du -h -d 1
104K ./depot
111G ./volumes
98G ./overlays
208G .
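To narrow down which live volumes dominate, a pipeline like this ranks them by size, largest last (the path matches the worker layout shown above; adjust WORK_DIR if your mount differs):

```shell
# Rank the per-volume directories under the work dir by size, largest last.
# WORK_DIR defaults to the mount point shown in the df output above.
WORK_DIR="${WORK_DIR:-/concourse-work-dir}"
du -sh "$WORK_DIR"/volumes/live/* 2>/dev/null | sort -h | tail -n 10
```

The IDs that come out the bottom can then be cross-referenced against the volumes table in the ATC database.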
Digging into the ATC logs, there are no notable errors, but there is an error in the database:
2019-11-26 17:55:29.481 GMT [240] DETAIL: Key (id, state)=(2099, created) is still referenced from table "volumes".
2019-11-26 17:55:29.481 GMT [240] STATEMENT: UPDATE volumes SET state = $1 WHERE (id = $2 AND (state = $3 OR state = $4))
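Since the failed UPDATE suggests volume rows are stuck in a lifecycle state, a quick look at their distribution might help (table and column names are taken straight from the error above; run this inside the Concourse Postgres database):

```sql
-- Count volumes by lifecycle state; a large pile-up in one state
-- would point at rows the garbage collector cannot transition.
SELECT state, COUNT(*) FROM volumes GROUP BY state;
```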
I did find the thread Resetting a worker to a "clean" state, but have not had any luck with it.
Has anyone hit a similar issue, or does anyone have suggestions on cleaning up workers?
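In case it helps anyone searching later, the cleanup I've been attempting is along these lines (fly commands are standard; the target name is a placeholder, and the PVC name is my guess at the StatefulSet's claim-template naming, so verify with `kubectl get pvc -n concourse`):

```shell
# List workers and look for ones stuck in a "stalled" state
fly -t ci workers

# Remove the stalled worker's record from the database
fly -t ci prune-worker -w concourse-worker-0

# Deleting the pod alone is not enough: the StatefulSet recreates it
# and rebinds the same (full) PVC. Delete the claim as well so the
# replacement pod gets a fresh volume. Claim name is assumed to follow
# the <claim-template>-<pod> convention.
kubectl delete pvc -n concourse concourse-work-dir-concourse-worker-0
kubectl delete pod -n concourse concourse-worker-0
```

So far this frees the disk temporarily, but it refills within days, which is why I suspect the garbage collection rather than a one-off leak.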
Thanks