High CPU usage on workers

#1

Hello, we are using Concourse 4.2.1 in our production setup. We have recently been having issues with increased CPU usage on the workers. They reach more than 85-90% CPU and this doesn’t drop back down (even after a long time). This results in timeouts in many jobs.

A detailed look at the CPU usage showed two things:

  • in several occurrences, kswapd0 CPU usage increases to more than 80% and does not come back down. This leads to an increase in ephemeral disk usage, which also reaches 90-100%. This might be caused by increased load (which is to be expected). The problem is that even after the load decreases and returns to fairly low levels (< 10), the kswapd0 CPU usage remains very high. In the Linux community there are a lot of reported issues related to kswapd0; some of them seem to be fixed, others not.

  • in other cases kswapd0 seems normal, but the buffer/cache usage is quite high (at least 80-90% of total memory). This again leads to timeouts in jobs and inconsistent results. (A rough monitoring sketch for both symptoms follows this list.)
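
Something like the following could be used to track both symptoms next to the build timeline. It is a minimal sketch, assuming only the standard Linux /proc layout on the worker VM (kswapd0 CPU time from /proc/&lt;pid&gt;/stat, buffers/cache from /proc/meminfo):

```python
#!/usr/bin/env python3
"""Rough illustration: sample kswapd0 CPU usage and buffers/cache from /proc."""
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second


def find_pid(comm_name):
    """Return the pid of the first process whose comm matches comm_name."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == comm_name:
                    return int(entry)
        except OSError:
            continue
    return None


def cpu_seconds(pid):
    """utime + stime of a process in seconds (fields 14/15 of /proc/<pid>/stat)."""
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    return (int(fields[11]) + int(fields[12])) / CLK_TCK


def meminfo():
    """Parse /proc/meminfo into a dict of kB values."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.split()[0])
    return values


if __name__ == "__main__":
    pid = find_pid("kswapd0")
    if pid is None:
        raise SystemExit("kswapd0 not found")
    interval = 10  # seconds between samples
    prev = cpu_seconds(pid)
    while True:
        time.sleep(interval)
        cur = cpu_seconds(pid)
        mem = meminfo()
        cached_pct = 100.0 * (mem["Buffers"] + mem["Cached"]) / mem["MemTotal"]
        print(f"kswapd0 CPU: {100.0 * (cur - prev) / interval:.1f}%  "
              f"buffers+cache: {cached_pct:.1f}% of RAM  "
              f"SwapFree: {mem['SwapFree']} kB")
        prev = cur
```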

We use the “random” container placement strategy.

We tried scaling the VM size up from 4 CPUs and 8 GB of memory to 4 CPUs and 16 GB. After a week or two the situation repeated. We then increased the CPUs to 8 and the memory to 30 GB. But this cannot continue, and we want to find a permanent solution.

We changed stemcells from Ubuntu Trusty to Xenial several weeks before the issue started to occur. Before this change we hadn’t faced such issues (though the cause might be something else - we run more jobs now).

Please advise on what we should do to resolve this. Any help is much appreciated.

#2

Hello, a shot in the dark here: since you mention “we run more jobs now”, this might be the cause. More jobs means more load :slight_smile: Since Concourse doesn’t have a queue, everything that is ready to run is sent to the workers immediately.

To validate this assumption, you can have a look at metrics such as containers per worker and total builds (sorry, I don’t have Prometheus around right now so I cannot give you the exact metric names).
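
If it helps, here is a rough sketch of how such a check could be scripted against the Prometheus HTTP API. The metric names in it are placeholders (as said, I don’t have Prometheus in front of me), so replace them, and the server URL, with whatever your emitter actually exposes:

```python
#!/usr/bin/env python3
"""Rough sketch: pull containers-per-worker and build counts from Prometheus.
The metric names are guesses -- replace them with whatever your Concourse
emitter actually exposes, and point PROM_URL at your Prometheus server."""
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder address


def instant_query(expr):
    """Run an instant query against the Prometheus HTTP API and return the samples."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    # Containers per worker (assumed metric name -- check your emitter's output).
    for sample in instant_query("concourse_workers_containers"):
        worker = sample["metric"].get("worker", "<unknown>")
        print(f"{worker}: {sample['value'][1]} containers")

    # Builds started in the last hour (again, an assumed metric name).
    for sample in instant_query("sum(increase(concourse_builds_started_total[1h]))"):
        print(f"builds started in the last hour: {sample['value'][1]}")
```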

I strongly suggest considering an upgrade to 5.x, which added the fewest-build-containers placement strategy, a great improvement over random. Lastly, scaling horizontally by adding more workers makes much more sense with fewest-build-containers than with random. With random, in my experience, it is better (or less bad) to scale vertically.

#3

Hey marco-m, thanks for the advice. I’ll try to get more information/statistics on the number of jobs we run (before and now). When the issue occurred last time I monitored the system and (strangely) there was no decrease in the number of started and finished builds. But I don’t remember the container numbers, so I’ll check that too.
As for 5.x, we plan to migrate to it, but first we will validate it :). In the meantime I am trying to gather more information about the issue and to reproduce it.
