No space on device, failed to create volume

#1

Hey,

our team uses Concourse a lot and currently runs 40 worker nodes. From time to time we get error messages like “failed to create volume” or “no space on device disk” etc.

To work around these issues, we find the nodes where /var/lib/concourse is at 100% and terminate them.
I also tried another way, clearing the subvolumes on the node: usage goes back to 2%, but the node then gives “failed to create volume” errors.

But obviously this is not an ideal solution. Could anyone give a hint about how to avoid or solve this issue? Short-term and long-term suggestions are both welcome!

#2

What we do is monitor worker disk usage and alert on it. When a disk gets close to full, we take a look. If it is a transient problem (for example, a runaway build), we abort the build, or if something has gone wrong with the worker, we destroy the instance. If it is a legitimate trend (classic examples: the pipelines are taking more space, or we know the number of builds per worker is increasing), we bump the worker disk size. We are at around 120GB. Works fine for us.
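
For anyone who wants a starting point for that kind of check, here is a minimal sketch in Python. It is not our actual monitoring setup: the 80% and 95% thresholds and the function names are made up for illustration, and only the /var/lib/concourse path comes from this thread. It uses just the standard library, so it could run from cron on each worker and feed whatever alerting you already have.

```python
import shutil

# Worker data directory mentioned earlier in the thread.
WORKER_DIR = "/var/lib/concourse"

# Hypothetical thresholds, chosen only for illustration; tune them for your workers.
WARN_THRESHOLD = 0.80
CRIT_THRESHOLD = 0.95


def classify_usage(used_fraction, warn=WARN_THRESHOLD, crit=CRIT_THRESHOLD):
    """Map a used-space fraction (0.0 to 1.0) to a simple status string."""
    if used_fraction >= crit:
        return "critical"
    if used_fraction >= warn:
        return "warning"
    return "ok"


def check_disk(path=WORKER_DIR):
    """Return (used_fraction, status) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    used_fraction = usage.used / usage.total
    return used_fraction, classify_usage(used_fraction)


if __name__ == "__main__":
    fraction, status = check_disk()
    print(f"{WORKER_DIR}: {fraction:.0%} used ({status})")
```

A "warning" would be the point where someone takes a look (runaway build or real trend?), and "critical" the point where the worker probably needs to be drained or replaced.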
