We recently experimented with running two ATCs in our CI instead of one. The first thing we noticed after adding the second ATC was that the average rate of resource checks (the emitted concourse_resource_checks_total Prometheus metric) dropped by a factor of two. The reason was that our ATCs sit behind a load balancer, and Prometheus, which we use for metrics collection, was configured to scrape their common endpoint in our network. Each time Prometheus requested a set of metrics, it received an answer from only one of the two ATCs.
Our conclusion is that the resource check count emitted by an ATC covers that ATC only; to get the total number of resource checks in the whole CI system, we need to collect metrics from both ATCs independently and add them up.
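To make the arithmetic concrete, here is a minimal sketch of what we mean by collecting from each ATC and combining the results. The scrape bodies and their values are made up for illustration, the sample lines are simplified (no labels, no timestamps), and the helper names parse_samples and sum_across_instances are our own, not part of Concourse or Prometheus.

```python
from collections import defaultdict


def parse_samples(text):
    """Parse simple Prometheus text-format lines into {series: value}.

    Assumes each sample line is "<series> <value>" with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def sum_across_instances(scrapes):
    """Add up same-named counters from several per-ATC scrape bodies."""
    totals = defaultdict(float)
    for text in scrapes:
        for series, value in parse_samples(text).items():
            totals[series] += value
    return dict(totals)


# Hypothetical scrape bodies, one per ATC (values invented for this example).
ATC_1 = """\
# TYPE concourse_resource_checks_total counter
concourse_resource_checks_total 1200
"""
ATC_2 = """\
# TYPE concourse_resource_checks_total counter
concourse_resource_checks_total 1150
"""

totals = sum_across_instances([ATC_1, ATC_2])
print(totals["concourse_resource_checks_total"])
```

Scraping through the load balancer would instead return either 1200 or 1150 on any given request, which matches the halved rate we observed. Summing like this is only right for metrics that each ATC counts for itself, which is exactly what we are unsure about for the other metrics.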
Do all metrics work this way? For example, to get the concourse_workers_containers metric or the concourse_db_queries_total metric for the whole CI system, do we always need to collect them from all ATCs? Are there any metrics for which the value collected from a single ATC equals the system-wide value?
We are using Concourse 4.2.1.
Thanks for the help!