We are currently running a heavily used (or so we think) deployment of Concourse. There are about 10 teams, each with anywhere between 10 and 100 pipelines, some of which are very large.
We use Helm to run our cluster on EKS (AWS). We are currently on 5.4.1, but this applies to most previously released versions since 3.14, as we have been continuously upgrading.
We used to have a lot of performance issues on the workers, where jobs would be slow to run. That has more or less been solved now that we run 15 m5.2xlarge worker nodes (we tried R and C instances and neither was a good fit). We also used to hit the "max containers reached" error a lot because the "volume-locality" placement strategy always made a few worker nodes hot and left the rest underutilized. That was resolved by switching to the "random" strategy instead.
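For anyone hitting the same hot-worker problem, the fix for us was just one setting on the web nodes. A minimal sketch of what we set (the env var is from the Concourse web node docs; how you wire it into the Helm chart's values.yaml depends on your chart version, so treat the exact values key as an assumption):

```shell
# Set on the *web* nodes -- the web node (ATC) makes container placement
# decisions, not the workers. "random" spreads new containers across all
# workers instead of packing them onto the ones that already hold the
# task's volumes ("volume-locality", the default that burned us).
export CONCOURSE_CONTAINER_PLACEMENT_STRATEGY=random
```

In the Helm chart we set the equivalent value in values.yaml rather than exporting it by hand; check your chart version for the exact key.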
However, we are currently having a lot of issues with the web node being very slow. No matter how many resources we throw at it, it is very slow to load.
Database metrics don't show anything unusual, and the database sits at about 40% CPU utilization. The web nodes likewise sit at about 50% CPU utilization.
To address this, we tried scaling out to 10 web node containers running on different instances, but that made performance much, much worse and caused a lot of other issues (the web nodes losing track of volumes and containers, jobs not being scheduled or run, etc.). So far, horizontal scaling doesn't seem to help and, if anything, makes things worse.
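One thing we suspect (not yet confirmed) is that our multi-web setup wasn't advertising per-replica addresses, so the web nodes couldn't reach each other properly. A sketch of what we believe each replica needs, based on the 5.x web node flags; the env var names and the templating are assumptions to verify against your Concourse version:

```shell
# Each web replica should advertise its *own* address so workers and the
# other web nodes can reach it directly; in Kubernetes we would template
# this from the pod IP rather than sharing one address across replicas.
export CONCOURSE_PEER_ADDRESS="$(hostname -i)"

# The external URL stays the same load-balanced endpoint for all replicas.
# (concourse.example.com is a placeholder, not our real hostname.)
export CONCOURSE_EXTERNAL_URL="https://concourse.example.com"
```

If anyone knows whether a misconfigured peer address explains the "losing track of volumes and containers" symptom, we'd love to hear it.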
I just wanted to get the community's thoughts and recommendations on how to optimize a Concourse cluster, especially in an AWS environment:
- What instance types best fit worker nodes?
- What instance types best fit web nodes?
- How many web node instances do you run?
- Are there any env vars that can significantly improve web performance?
Thanks a lot!