No space left on device

#1

Hello all, we have started seeing this error on a dockerized Concourse v3.8.0:

FAILURE: Build failed with an exception.

  • What went wrong:
    Execution failed for task ':extractor:dockerBuildImage'.

{"message":"Error processing tar file(exit status 1): write /extractor-1.0.185.tar: no space left on device"}

OS: CentOS
kernel: 4.15.12-1.el7.elrepo.x86_64
docker version: 17.12.0-ce, build c97c6d6
storage: overlay2

If you have seen this issue and know the root cause, it would be great if you could share it, along with a probable fix if you found one.

Thanks,
Ashish

#2

This is one of those generic system error messages that really just means what it says: you've run out of disk space on whichever device it's writing to. There's nothing very Concourse-specific about it - you probably just need more disk space. How much you need really depends on your workload. See the worker node docs for more information.
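As a first diagnostic step (not Concourse-specific), a quick sketch for checking where the space actually went - the `/var/lib/docker` path is the assumed default Docker data root and may differ on your hosts:

```shell
# Overall filesystem usage -- look for the device the tar file is written to
df -h

# If the Docker CLI is present, break down image/container/volume usage
if command -v docker >/dev/null 2>&1; then
  docker system df
fi

# Largest directories under Docker's data root (assumed default location)
if [ -d /var/lib/docker ]; then
  du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -n 5
fi
```

If `df -h` shows the device nearly full but `docker system df` accounts for little of it, the space is being consumed outside Docker's storage (e.g. logs or the worker's work dir).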

#3

The error is very generic, but I have enough disk space on the host and for the worker container as well. A docker restart fixes the issue temporarily, but it keeps coming back.

What is the reason behind switching to btrfs in v3.10.0? Anything specific related to a similar issue, or to performance problems?

#4

We were using overlay before, which had major performance issues impacting anyone doing Docker builds or privileged tasks in their pipeline. (So, probably most people.)

We were using btrfs prior to v3.1.0, when we switched due to stability issues, which we think have since been resolved in v3.9.0, so we switched the default back to btrfs in v3.10.0.

This could perhaps be the same issue: Resetting a worker to a "clean" state
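For reference, resetting a worker to a "clean" state usually amounts to retiring it, wiping its work directory, and restarting it so it re-registers with fresh state. A minimal sketch, assuming a standalone binary install with the work dir at `/opt/concourse/worker` (both are assumptions; a dockerized worker would instead be removed and recreated so it starts with a fresh volume):

```shell
# Hypothetical reset sketch -- WORK_DIR is an assumed default, not a
# Concourse-mandated path.
WORK_DIR="${CONCOURSE_WORK_DIR:-/opt/concourse/worker}"

if command -v concourse >/dev/null 2>&1; then
  # Ask the worker to drain and deregister cleanly before wiping state
  concourse retire-worker --name "$(hostname)" || true
fi

# Remove cached volumes and containers accumulated in the work dir
if [ -d "$WORK_DIR" ]; then
  rm -rf "${WORK_DIR:?}"/*
fi

echo "worker state cleared; restart the worker process to re-register"
```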

#5

I just assumed that concourse had some disk utilization target and when it filled up too much it would delete old cache to make room. That seems to be incorrect?

In other words, given workers pulling e.g. updated container images, concourse will always fill up the worker volume?

#6

Caches go away once a pipeline stops using them (i.e. once they're not going to be used for any future builds). When a new image comes out, the old caches will be garbage-collected, unless some other job still needs them.
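To see what is actually being kept around, `fly` can list the cluster's volumes and workers. A sketch, assuming an already-logged-in fly target named `main` and a worker name of your own (both are assumptions, not from this thread):

```shell
# Guarded so this is a no-op where fly is not installed
if command -v fly >/dev/null 2>&1; then
  # List volumes so you can see which caches are still being held
  fly -t main volumes

  # Workers that have gone stale can be pruned so their volumes
  # stop counting against the cluster
  fly -t main workers
  fly -t main prune-worker --worker some-worker-name
fi
```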

#7

So when we see this kind of volume or disk error, what can we do? Restarting worker nodes is just too painful for us. I read two posts on Medium but I still cannot find a way to solve these issues.

#8

If space is an issue, you could enable build_logs_to_retain and set it so you do not keep storing ever-increasing logs for old builds you may never look back on. This can save significant space.
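For reference, `build_logs_to_retain` is a per-job setting in the pipeline YAML. A hypothetical fragment (the job, resource, and task names are made up for illustration):

```yaml
# Keep only the 20 most recent build logs for this job
jobs:
- name: build-extractor
  build_logs_to_retain: 20
  plan:
  - get: source
  - task: build
    file: source/ci/build.yml
```

Note that this only bounds how many build logs are retained per job; it does not affect worker volume caches.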

#9

Hey, thanks for the advice. But I think the logs go to the database, right?