Hello, we are using Concourse 4.2.1 in our production setup. We have recently had issues with increased CPU usage on the workers. They reach more than 85-90% CPU and do not recover, even after a long time. This results in timeouts in many jobs.
A closer look at the CPU usage showed two things:
In several cases the kswapd0 CPU usage rises above 80% and does not come back down. This leads to an increase in ephemeral disk usage, which also reaches 90-100%. The initial rise might be caused by increased load, which is to be expected. The problem is that even after the load decreases and returns to fairly low levels (< 10), the kswapd0 CPU usage remains very high. The Linux community has reported many kswapd0-related issues; some of them seem to be fixed, others do not.
In other cases kswapd0 looks normal but the buffered/cache usage is quite high (at least 80-90% of total memory). This again leads to timeouts in jobs and inconsistent results.
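For anyone who wants to compare numbers, the buffered/cache share mentioned above can be approximated from /proc/meminfo. This is only a minimal sketch, not the exact tool we used: the function names parse_meminfo and cache_share are made up for illustration, and it assumes the share is simply (Buffers + Cached) / MemTotal. Newer versions of free also count reclaimable slab, so the result may differ slightly from what free reports.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def cache_share(meminfo):
    """Return (Buffers + Cached) as a percentage of MemTotal (assumed definition)."""
    total = meminfo["MemTotal"]
    cached = meminfo.get("Buffers", 0) + meminfo.get("Cached", 0)
    return 100.0 * cached / total
```

On a worker this could be called as cache_share(parse_meminfo(open("/proc/meminfo").read())) and sampled periodically alongside the kswapd0 CPU usage.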
We use the “random” container placement strategy.
We’ve tried scaling the VM size up from 4 CPUs and 8GB of memory to 4 CPUs and 16GB of memory. After a week or two the situation repeated. We then increased it again, to 8 CPUs and 30GB of memory. But this cannot continue, and we want to find a permanent solution.
We changed stemcells from Ubuntu Trusty to Xenial several weeks before the issue started to occur. Before this change we had not faced such issues (but the cause might be something else, since we also run more jobs now).
Please advise what we should do to resolve this. Any help is much appreciated.