Overlay causes jobs to slow down

Around a month ago, we switched our worker filesystem from btrfs to overlay, based on this issue (https://github.com/concourse/concourse/issues/4570), because our worker CPUs were getting pegged and jobs were stalling or failing. Since then, we’ve noticed that a number of tasks consistently take an additional 6-8 minutes to start after the previous task finishes (previously they took only ~15 seconds to start).

For example, there was an 11 second delay between the create-cf-space and ginkgo-cflinuxfs3 tasks here, and a 6 minute delay between the same tasks the next day (build 563 of that pipeline).

Is this avoidable? Can we shorten the delay between tasks?

There is a known drawback with the overlay driver and privileged containers (e.g. privileged tasks or, typically, the docker-image resource type). Privileged containers require their filesystem permissions to be ‘remapped’ to root (i.e. a recursive chown), which takes time on container start.
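For reference, here is a minimal sketch of what a privileged task step looks like in a pipeline, so you can check whether any of yours match. The job name, task name, and image are hypothetical placeholders, not taken from the pipeline in question.

```yaml
jobs:
- name: example-job              # hypothetical job name
  plan:
  - task: needs-root             # hypothetical task name
    privileged: true             # marks the task container as privileged
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: busybox    # placeholder image
      run:
        path: sh
        args: ["-c", "id"]
```

If none of the slow tasks set `privileged: true`, the remapping described above is less likely to be the cause.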

We have a plan to resolve this by writing a new worker backend that uses containerd directly, but it’ll be a while before it lands.

If there are no privileged tasks/containers involved, I’m not sure what else it would be. :sweat_smile: Would take more digging.

Thanks for the reply, @Vito!

Would switching to the registry-image resource resolve that, since it’s not privileged?

After investigating, it seems that switching to the registry-image resource didn’t help much, likely because our CI image is quite large (~1 GB). What did help quite a lot was fetching the CI image with a get step in the job and then overriding the task’s image_resource configuration with the image field on the task step (per this), which cut several minutes off one task. I’ll try it on more tasks and hopefully see similar gains.
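In case it helps anyone else, the get-then-image approach might look roughly like the sketch below. The resource name, repository, job name, and task are hypothetical; our real pipeline differs. With `image` set on the task step, the fetched image is used in place of whatever `image_resource` the task config declares.

```yaml
resources:
- name: ci-image                     # hypothetical resource name
  type: registry-image
  source:
    repository: example/ci-image     # hypothetical repository

jobs:
- name: example-job                  # hypothetical job name
  plan:
  - get: ci-image                    # fetch the image once as a normal resource
  - task: run-tests                  # hypothetical task name
    image: ci-image                  # overrides the task's image_resource
    config:
      platform: linux
      run:
        path: sh
        args: ["-c", "echo running in the fetched image"]
```

Several tasks in the same job can point `image` at the same fetched resource.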