Workers running out of disk space

We recently upgraded our fleet from Ubuntu 14.04 (trusty) to 18.04 (bionic) and from Concourse v4.2 to v5.1 in one sweep. Our workers run on DigitalOcean Droplet VMs with over 600GB of storage each and use the BTRFS file system for volumes. We generally have between 25 and 35 workers depending on workload demand. We do not currently have global resources enabled; we plan to turn that on once we have had time to test it on our staging cluster.
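
For context, the workers are launched roughly like this; the paths and key locations below are illustrative rather than our exact service definition:

# Illustrative worker invocation (Concourse v5.x flags); real paths/keys differ in our setup.
concourse worker \
  --work-dir /opt/concourse/worker \
  --baggageclaim-driver btrfs \
  --tsa-host web.example.internal:2222 \
  --tsa-public-key /etc/concourse/tsa_host_key.pub \
  --tsa-worker-private-key /etc/concourse/worker_key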

Since the upgrade, we have had 3 workers reach 100% disk usage. I do not recall one instance of this on the old trusty cluster with v4 in probably the last year. Short term, I have just retired the worker, then used our Terraform package to destroy the unhealthy worker & create a replacement.
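
The workaround looks roughly like this; the worker name, signal handling, and Terraform resource address are placeholders for whatever your deployment uses:

# On the worker VM: ask the worker to retire (if memory serves, SIGUSR2 = retire, SIGUSR1 = land).
pkill -USR2 -f "concourse worker"

# Once it drops out of `fly workers` (or shows as retiring), clean up and replace the VM:
fly -t prod prune-worker -w concourse-worker-prod-ac4cbe47fa
terraform taint 'digitalocean_droplet.concourse_worker[3]'   # illustrative resource address
terraform apply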

When I looked at the latest worker to run out of disk, I saw the following:

root@concourse-worker-prod-ac4cbe47fa:/opt/concourse/worker/volumes# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             32G     0   32G   0% /dev
tmpfs           6.3G  1.1M  6.3G   1% /run
/dev/vda1       621G  614G  6.5G  99% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/vda15      105M  3.6M  101M   4% /boot/efi
/dev/loop0      611G  609G   64K 100% /opt/concourse/worker/volumes

Running du on the volumes dir resulted in:

root@concourse-worker-prod-ac4cbe47fa:/opt/concourse/worker/volumes# du -h -d 2 | grep G
1.5G    ./live/c7bb466c-ee98-4ee8-7ba1-9fb0a6349a4a
1.3G    ./live/907e9083-ed61-454d-7383-39dc1bb26e3a
1.7G    ./live/beab15b6-1e6d-42e5-60f9-ec0f3de6d693
1.5G    ./live/816d6ba8-3642-4083-4dc7-59deecfa76bf
1.3G    ./live/70155688-e6e1-41e8-712f-e673908ac1c2
1.3G    ./live/5ee99075-942e-443c-6796-b959409fc8f3
1.3G    ./live/b458e07a-b5c2-4300-4347-168bbe0a774a
1.2G    ./live/206c97cc-524a-458f-7a3a-30be4cfe5bdb
1.8G    ./live/cfcbcb17-60e9-454d-4b4e-ab489c6b894c
2.4G    ./live/074c80ef-3792-423a-5476-e036fef57c0d
1.3G    ./live/c71889b4-94ac-4dae-7565-52fc6013dc72
1.3G    ./live/e82f9ea3-fb67-439b-6996-1b9086beef49
1.3G    ./live/4f6b925d-0741-4cac-7f97-c3b50ec084aa
1.3G    ./live/f6f82d5d-eb5b-44b7-4bdd-ccac58c56a67
1.6G    ./live/9f748409-9108-4e71-5f23-515f0643c3a5
1.3G    ./live/80c848c3-dd25-41af-498a-300f306d338b
1.6G    ./live/3f2bc122-6237-4d59-559d-93f664ae3161
1.6G    ./live/5e901f98-6751-4adb-65c5-111c3cb51bf2
1.5G    ./live/34cbd64f-3b10-45a6-7032-595abed82acb
1.3G    ./live/11b98ccd-c21d-4b23-4124-363d355f2942
1.3G    ./live/aa1de45c-045e-4e6b-769f-60a1acb02ec0
1.3G    ./live/367a427f-a179-4214-5ce7-1c156680363d
# skip a bunch...
106G    ./live
106G    .

So the only subdirectory of volumes that contained anything was live, and du only accounts for 106GB, even though df reports 609GB used on the 611GB loop device. We saw similar issues around the time that Concourse temporarily defaulted to overlay because of performance problems with BTRFS, then switched back to BTRFS after those were resolved. I think that was during early v3.x?
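
For anyone trying to pin down the same du-vs-df gap, the filesystem's own accounting can be pulled straight off the loop mount, something like:

# du only counts what it can reach through the directory tree; btrfs can also hold space
# in metadata and in extents still referenced by other subvolumes, which only its own
# accounting shows. (qgroup show only works if quotas are enabled on the filesystem.)
btrfs filesystem df /opt/concourse/worker/volumes
btrfs subvolume list /opt/concourse/worker/volumes | wc -l
btrfs qgroup show /opt/concourse/worker/volumes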

Any thoughts?

We just hit this same issue on our Concourse.

BOSH deployed on AWS
Concourse 5.2.0
Stemcell 170.69

Seeing the same thing as @christophermancini

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G  2.1G   14G  14% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/xvda1      2.9G  1.5G  1.3G  54% /
/dev/xvdb2      162G  155G     0 100% /var/vcap/data
tmpfs           1.0M  4.0K 1020K   1% /var/vcap/data/sys/run
/dev/loop0      152G  134G     0 100% /var/vcap/data/worker/work/volumes

with the same du output. We only have the one worker, which was created about 9 days ago.

I’m not sure what else to do to diagnose what’s happening here.
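
One thing that might narrow it down is comparing what the web node thinks the worker is holding against what is actually on disk; something along these lines (target and worker name are placeholders):

# Volumes/containers the ATC believes exist on this worker:
fly -t ci volumes | grep WORKER_NAME | wc -l
fly -t ci containers | grep WORKER_NAME | wc -l

# Volume directories actually present on the worker:
ls /var/vcap/data/worker/work/volumes/live | wc -l

A big mismatch one way or the other would at least say whether this is garbage collection leaving orphans behind or btrfs holding onto space that du can't see.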

I’m still getting started with Concourse CI. My setup is one web node and one worker, running bare-metal on the same host. My worker filled up its 250GB BTRFS partition rather quickly (within a day or two). I changed quite a lot of the configuration, but I think what did the trick was removing the cache configuration from my tasks.

---
platform: linux

image_resource:
  type: registry-image
  source: { repository: concourse/builder-task }

params:
  REPOSITORY: temp-image-name
  CONTEXT: main-repo/((docker-file))

inputs:
- name: main-repo

# Caches were created but never used
#caches:
#- path: cache

outputs:
- name: image

run: { path: build }
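
If you want to keep task caches, it should also be possible to clear them per job instead of recreating the worker; roughly (pipeline, job and step names are placeholders):

# Clears the cache for one step of one job:
fly -t ci clear-task-cache --job my-pipeline/build-image --step build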

We are running into similar issues. Poor man’s solution is killing off our workers regularly :frowning:

We made several changes recently and have not had the issue come back (a rough sketch of the corresponding settings is after the list):

  • Upgraded to v5.2
  • Enabled global resources
  • Switched from random to volume-locality for container placement
  • Began implementing shallow clones on git / pull request resources for big repositories
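
For anyone wanting to copy this, these are roughly the knobs involved; the names come from the Concourse v5 flag-to-environment mapping, and how you actually set them depends on your deployment (BOSH properties, systemd unit, docker-compose, etc.):

# web node (ATC) environment:
export CONCOURSE_ENABLE_GLOBAL_RESOURCES=true
export CONCOURSE_CONTAINER_PLACEMENT_STRATEGY=volume-locality

The shallow clones are just a depth param (e.g. depth: 1) in the params of the get steps for the git resource on the big repositories; the pull request resource we use has a similar option.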