Volumes not being cleaned up


#1

We have a deployment with nearly 20 workers, deployed using the BOSH Concourse release.

Ephemeral disk usage grows more or less consistently over time (2 to 3% per day) until the disk is full.
Today, for example, there were over 900 volumes under /var/vcap/data/baggageclaim/volumes/live, and /var/vcap/data/baggageclaim/volumes.img was huge (86588076032 bytes, or ~80G). Some of the volumes date back to December.
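For anyone wanting to reproduce the numbers above, here is a rough sketch of how the volumes could be counted and sorted by age. It assumes each volume is a directory directly under the live path and uses the directory's modification time as a stand-in for its age, which is only an approximation. The function name `volume_ages` is made up for this example.

```python
import os
import time

def volume_ages(live_dir, now=None):
    """Return (name, age_in_days) for each directory directly under live_dir, oldest first."""
    now = time.time() if now is None else now
    entries = []
    for entry in os.scandir(live_dir):
        # Only directories are treated as volumes; stray files are skipped.
        if entry.is_dir(follow_symlinks=False):
            mtime = entry.stat(follow_symlinks=False).st_mtime
            entries.append((entry.name, (now - mtime) / 86400))
    return sorted(entries, key=lambda e: e[1], reverse=True)

if __name__ == "__main__":
    live = "/var/vcap/data/baggageclaim/volumes/live"
    if os.path.isdir(live):
        ages = volume_ages(live)
        print(f"{len(ages)} volumes")
        # Show the 20 oldest volumes.
        for name, age in ages[:20]:
            print(f"{age:7.1f} days  {name}")
```

Run as root on a worker, this should print the total count followed by the oldest entries, which gives a first idea of how much of the disk is held by stale volumes.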

The current process is to bosh stop the worker VM, delete the subvolumes with /var/vcap/packages/btrfs_tools/sbin/btrfs subvolume delete, reboot, clean out /var/vcap/data/garden, and remove /var/vcap/data/baggageclaim/volumes.img before restarting the worker.

This is less than ideal and can cause resources in the pipelines to go orange and fail to recheck cleanly (error: check failed with exit status ‘70’: unknown handle: 9b87fbb6-0c92-438a-535a-f48681a2bc02). It can take in excess of an hour for Concourse to recover completely.

(How) is it possible to map these volumes to resources/tasks and perform a less aggressive garbage collection? This would also help us understand how much disk space is actually required, and which resources/tasks are consuming too much space, with the aim of reducing usage.
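To make the question concrete, this is roughly the kind of cross-check I have in mind, as a sketch only. It assumes the directory names under the live path are the volume handles, and that the handles Concourse still tracks have been exported by some means (for instance from `fly volumes`, if your fly version provides it) into a text file with one handle per line. The file name `known_handles.txt` and both function names are hypothetical.

```python
import os

def split_volumes(live_dir, known_handles):
    """Split the directories under live_dir into (tracked, untracked) name lists."""
    on_disk = {e.name for e in os.scandir(live_dir) if e.is_dir(follow_symlinks=False)}
    known = set(known_handles)
    return sorted(on_disk & known), sorted(on_disk - known)

def read_handles(path):
    """Read one handle per line, ignoring blank lines and surrounding whitespace."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

if __name__ == "__main__":
    live = "/var/vcap/data/baggageclaim/volumes/live"
    handles_file = "known_handles.txt"  # hypothetical export of tracked handles
    if os.path.isdir(live) and os.path.isfile(handles_file):
        tracked, untracked = split_volumes(live, read_handles(handles_file))
        print(f"{len(tracked)} tracked, {len(untracked)} untracked")
        for name in untracked:
            print(name)
```

The untracked list would only be a set of candidates for cleanup. Whether it is safe to delete them while the worker is running is exactly what I am unsure about.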