Our team uses Concourse heavily and currently runs 40 worker nodes. From time to time we get error messages such as “failed to create volume” or “no space on device disk”.
To work around these errors, we find the nodes where /var/lib/concourse is at 100% usage and terminate them.
I also tried another approach, clearing the subvolumes on the node. Disk usage drops back to 2%, but the node then reports “failed to create volume” errors.
Obviously, this is not an ideal solution. Could anyone give a hint on how to avoid or solve this issue? Both short-term and long-term suggestions are welcome!
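
For reference, the manual check we do today (finding nodes where /var/lib/concourse is full) could be sketched roughly as below. This is only a minimal illustration, not something we run in production: the function names and the 90% threshold are hypothetical, and it only reports usage on the node it runs on. It does not terminate or clean anything.

```python
import shutil

# Hypothetical threshold: flag a worker before it actually reaches 100%.
USAGE_THRESHOLD_PERCENT = 90.0


def usage_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def is_nearly_full(path: str, threshold: float = USAGE_THRESHOLD_PERCENT) -> bool:
    """Return True if the filesystem holding `path` is at or above `threshold` percent.

    Note: this is used/total, which can differ slightly from the
    percentage printed by `df` (df excludes reserved blocks).
    """
    usage = shutil.disk_usage(path)
    return usage_percent(usage.total, usage.used) >= threshold
```

On a worker this would be called as is_nearly_full("/var/lib/concourse"), for example from a periodic job that raises an alert so the node can be handled before builds start failing.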