Some pipelines don't work in the UI

Hello all!

I am seeing a few issues with our Concourse, and I think it’s because of how busy it’s getting (on a side note, any useful resources for tuning Concourse’s performance are welcome).

The actual symptoms I am seeing are the following:

  • Some pipeline views don’t load: the /jobs endpoint sometimes returns 500 and sometimes loads fine. Running fly -t main jobs -p my-pipeline also intermittently returns "internal server error".
  • A lot of tasks time out with the error "insufficient subnets in the pool" (although this is not as frequent).
  • This comes up in the logs a lot: "pq: remaining connection slots are reserved for non-replication superuser connections", though as you’ll see later on, connections are not maxed out.
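
For reference, here is a minimal Python sketch of when PostgreSQL raises that last error. It assumes the stock default of superuser_reserved_connections = 3; a managed instance may reserve a different number, so treat that value as an assumption. The function names are hypothetical and only illustrate the arithmetic. The point is that ordinary clients are refused slightly before max_connections is reached, so a short spike that a sampled metric misses could still trigger the error.

```python
# Sketch only: the reserved-slot count below is the stock PostgreSQL default,
# not a value confirmed for the managed instance described in this post.
DEFAULT_SUPERUSER_RESERVED = 3


def usable_slots(max_connections: int,
                 superuser_reserved: int = DEFAULT_SUPERUSER_RESERVED) -> int:
    """Connections available to ordinary (non-superuser) clients."""
    return max_connections - superuser_reserved


def new_connection_rejected(open_connections: int,
                            max_connections: int,
                            superuser_reserved: int = DEFAULT_SUPERUSER_RESERVED) -> bool:
    """True if one more ordinary connection would hit the
    'remaining connection slots are reserved' error."""
    return open_connections >= usable_slots(max_connections, superuser_reserved)


if __name__ == "__main__":
    max_connections = 200  # from the setup described in this post
    print(usable_slots(max_connections))
    print(new_connection_rejected(196, max_connections))
    print(new_connection_rejected(197, max_connections))
```

With the values above, ordinary clients get 197 of the 200 slots, so the error can appear while a dashboard still shows fewer than 200 connections.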

Concourse runs on GKE via the official stable Helm chart (version 5.5.0), with a GCP-managed Postgres instance (2 vCPUs, 7 GB RAM, 20 GB SSD, 200 max connections). Concourse is configured with 6 workers (3 vCPUs and 6 GB per worker) and 2 UI nodes (1.5 vCPUs and 3 GB per UI node). Attached are some relevant Datadog metrics. From those I think I need to scale it up a bit, although Kubernetes did not kill any pods during that period.

I’d like some help figuring out why these pipelines are not loading (the main homepage always works just fine). I can see nothing very enlightening in the logs (Concourse logs are collected centrally, so we can access all of them).

Any help would be appreciated! Thanks!