I am seeing a few issues with our Concourse deployment, and I think it’s because of how busy it’s getting (on a side note, any useful resources for tuning Concourse’s performance are welcome).
The actual symptoms I am seeing are the following:
- Some pipeline views don’t load; the `/jobs` endpoint sometimes returns 500 and sometimes loads. `fly -t main jobs -p my-pipeline` also returns an internal server error.
- A lot of tasks time out with the error `insufficient subnets in the pool` (although this is not as frequent).
- This comes up in the logs a lot: `pq: remaining connection slots are reserved for non-replication superuser connections`, though as you’ll see later on, connections are not maxed out.
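To make the last symptom concrete: as far as I understand, PostgreSQL raises this error once ordinary connections reach `max_connections` minus `superuser_reserved_connections`. Below is a rough sketch of that headroom calculation. It assumes the stock PostgreSQL default of 3 reserved slots (a managed instance may reserve a different number), and the observed peak of 150 is a made-up illustrative value, not one of our measurements.

```python
def usable_slots(max_connections: int, superuser_reserved: int = 3) -> int:
    """Connections available to ordinary (non-superuser) roles.

    superuser_reserved=3 is the stock PostgreSQL default; this is an
    assumption and may differ on a managed instance.
    """
    return max_connections - superuser_reserved


def headroom(max_connections: int, observed_peak: int,
             superuser_reserved: int = 3) -> int:
    """Slots still free for ordinary roles at the observed peak."""
    return usable_slots(max_connections, superuser_reserved) - observed_peak


if __name__ == "__main__":
    # 200 is the instance's configured max connections;
    # 150 is a hypothetical peak used only for illustration.
    print(usable_slots(200))
    print(headroom(200, 150))
```

If the metrics are sampled at intervals, a short spike between samples could in principle use up the remaining headroom without showing in the graphs, so this only gives an upper bound on how much room there is.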
Concourse runs on GKE via the official stable Helm chart (version 5.5.0), with a GCP-managed Postgres instance (2 vCPUs, 7 GB RAM, 20 GB SSD, 200 max connections). Concourse is configured to run with 6 workers (3 vCPUs and 6 GB per worker) and 2 UI nodes (1.5 vCPUs and 3 GB per UI node). Attached are some relevant Datadog metrics. From those I think I need to scale it up a bit, but Kubernetes did not kill any pods during that period.
I’d like some help figuring out why these pipelines are not loading (the main homepage always works just fine). I can’t see anything enlightening in the logs (Concourse logs are collected centrally, so we have access to all of them).
Any help would be appreciated! Thanks!