We recently upgraded our fleet from Ubuntu 14.04 (trusty) to 18.04 (bionic) and from Concourse v4.2 to v5.1 in one sweep. Our workers run on DigitalOcean Droplet VMs with over 600GB of storage each, and are configured to use the BTRFS filesystem for volumes. We generally run between 25 and 35 workers depending on workload demand. We have not yet enabled global resources; we plan to turn it on once we have had time to test it on our staging cluster.
Since the upgrade, we have had 3 workers reach 100% disk usage. I do not recall a single instance of this on the old trusty cluster running v4 in probably the last year. As a short-term fix, I have retired each unhealthy worker, then used our Terraform package to destroy it and create a replacement.
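For reference, the replace procedure is roughly the following (sketched as a dry-run wrapper that just prints the commands; the target name, worker name, and Terraform resource address are illustrative, not our actual config):

```shell
# Hypothetical wrapper around the retire-and-replace procedure.
# It only echoes the commands so the steps are visible; in real use
# you would run them directly.
replace_worker() {
  local worker="$1"
  # "fly retire-worker" removes the worker from the cluster so ATC
  # stops scheduling builds on it.
  echo "fly -t prod retire-worker --worker ${worker}"
  # Tainting the droplet makes the next "terraform apply" destroy
  # and recreate it. Resource address is made up for illustration.
  echo "terraform taint 'digitalocean_droplet.worker[\"${worker}\"]'"
}

replace_worker concourse-worker-prod-ac4cbe47fa
```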
When I looked at the latest worker to run out of disk, I saw the following:
```
root@concourse-worker-prod-ac4cbe47fa:/opt/concourse/worker/volumes# df -h
Filesystem      Size  Used  Avail Use% Mounted on
udev             32G     0    32G   0% /dev
tmpfs           6.3G  1.1M   6.3G   1% /run
/dev/vda1       621G  614G   6.5G  99% /
tmpfs            32G     0    32G   0% /dev/shm
tmpfs           5.0M     0   5.0M   0% /run/lock
tmpfs            32G     0    32G   0% /sys/fs/cgroup
/dev/vda15      105M  3.6M   101M   4% /boot/efi
/dev/loop0      611G  609G    64K 100% /opt/concourse/worker/volumes
```
Running du on the volumes dir resulted in:
```
root@concourse-worker-prod-ac4cbe47fa:/opt/concourse/worker/volumes# du -h -d 2 | grep G
1.5G  ./live/c7bb466c-ee98-4ee8-7ba1-9fb0a6349a4a
1.3G  ./live/907e9083-ed61-454d-7383-39dc1bb26e3a
1.7G  ./live/beab15b6-1e6d-42e5-60f9-ec0f3de6d693
1.5G  ./live/816d6ba8-3642-4083-4dc7-59deecfa76bf
1.3G  ./live/70155688-e6e1-41e8-712f-e673908ac1c2
1.3G  ./live/5ee99075-942e-443c-6796-b959409fc8f3
1.3G  ./live/b458e07a-b5c2-4300-4347-168bbe0a774a
1.2G  ./live/206c97cc-524a-458f-7a3a-30be4cfe5bdb
1.8G  ./live/cfcbcb17-60e9-454d-4b4e-ab489c6b894c
2.4G  ./live/074c80ef-3792-423a-5476-e036fef57c0d
1.3G  ./live/c71889b4-94ac-4dae-7565-52fc6013dc72
1.3G  ./live/e82f9ea3-fb67-439b-6996-1b9086beef49
1.3G  ./live/4f6b925d-0741-4cac-7f97-c3b50ec084aa
1.3G  ./live/f6f82d5d-eb5b-44b7-4bdd-ccac58c56a67
1.6G  ./live/9f748409-9108-4e71-5f23-515f0643c3a5
1.3G  ./live/80c848c3-dd25-41af-498a-300f306d338b
1.6G  ./live/3f2bc122-6237-4d59-559d-93f664ae3161
1.6G  ./live/5e901f98-6751-4adb-65c5-111c3cb51bf2
1.5G  ./live/34cbd64f-3b10-45a6-7032-595abed82acb
1.3G  ./live/11b98ccd-c21d-4b23-4124-363d355f2942
1.3G  ./live/aa1de45c-045e-4e6b-769f-60a1acb02ec0
1.3G  ./live/367a427f-a179-4214-5ce7-1c156680363d
# skip a bunch...
106G  ./live
106G  .
```
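The suspicious part is the gap between what `df` reports as used on the loop device (609G) and what `du` can account for (106G). A generic way to quantify that gap (a sketch, not our actual tooling; on BTRFS the unaccounted space can be held by subvolumes and metadata that `du` cannot traverse) is:

```shell
# Compare what df says is used on the filesystem backing a directory
# against what du can account for underneath it. A large positive
# difference suggests space held outside normal file contents
# (deleted-but-open files, or btrfs subvolumes/metadata).
unaccounted_kb() {
  local dir="$1"
  local used accounted
  # df --output=used is GNU coreutils; -k reports in KiB
  used=$(df --output=used -k "$dir" | tail -1 | tr -d ' ')
  # -x keeps du on one filesystem; errors on unreadable paths ignored
  accounted=$(du -skx "$dir" 2>/dev/null | cut -f1)
  echo $(( used - accounted ))
}

unaccounted_kb /tmp   # path is illustrative; use the volumes mount
```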
So, the only subdirectory of volumes to contain any volumes was live, and its total consumption was 106GB, even though the loop device reports 609GB used. We saw similar issues around the time Concourse switched the default to overlay because of performance problems with BTRFS, then back to BTRFS after those were resolved. I think that was during early v3.x?
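In case it helps anyone trying to rule the volume driver in or out, the driver can be pinned explicitly on the worker rather than relying on the default (flag name as I recall it from the worker CLI; the work dir matches our layout, adjust to yours):

```shell
# Pin the baggageclaim volume driver explicitly on worker start.
# Swap "btrfs" for "overlay" to test the alternative driver.
concourse worker \
  --work-dir /opt/concourse/worker \
  --baggageclaim-driver btrfs
```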