High write IOPS on the database, delayed build starts

Hi,

This is a continuation of a Discord discussion I had with @vito:

We are seeing high write IOPS on our database, and users have recently started complaining that builds take longer to start.

Info on our concourse environment:

Pipelines: ~600
Resources: ~4500
Builds in flight: ~3-5 during office hours (peaks up to 25); ~1 off-hours, with ~10 builds kicking off at midnight from various teams

Web nodes: 2 (c5.xlarge)
Worker nodes: 20 (c5.2xlarge)
Database: RDS Aurora PostgreSQL compatible (db.r4.4xlarge)
Deployment method: BOSH

We can see in RDS Performance Insights that this query produces the most load:

SELECT r.id, r.name, r.type, r.config, rcv.version, r.nonce, r.check_error, ro.check_error
FROM resource_types r
LEFT JOIN resource_configs c ON c.id = r.resource_config_id
LEFT JOIN resource_config_scopes ro ON ro.resource_config_id = c.id
LEFT JOIN LATERAL (
  SELECT rcv.*
  FROM resource_config_versions rcv
  WHERE rcv.resource_config_scope_id = ro.id AND rcv.check_order != 0
  ORDER BY rcv.check_order DESC
  LIMIT 1
) AS rcv ON true
WHERE r.active = $1 AND r.pipeline_id = $2
ORDER BY r.name
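
In case it helps, the lateral part of that query can be profiled on its own with something like the following; 42 is just a placeholder, any real id from resource_config_scopes would do. This should show whether the lookup hits an index on resource_config_versions or falls back to a scan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT rcv.*
FROM resource_config_versions rcv
WHERE rcv.resource_config_scope_id = 42  -- placeholder scope id, substitute a real one
  AND rcv.check_order != 0
ORDER BY rcv.check_order DESC
LIMIT 1;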

RDS Metrics and Performance Insights screenshots: https://cdn.discordapp.com/attachments/604330872992432150/620888944615358484/rds.zip

Looking into the tables accessed by the top query, we can see that resource_configs contains around 2500 rows.

resource_config_versions contains over 3.2 million rows.
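
For reference, row counts and on-disk sizes like these can be pulled with a rough query along the lines of the one below; note that n_live_tup is only the statistics collector's estimate, not an exact count:

SELECT relname,
       n_live_tup AS approx_rows,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname IN ('resource_configs', 'resource_config_versions');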

I am part of a team that provides Concourse as a CI/CD environment to many other teams, and almost every team uses some custom resources.

I’m not sure how relevant it is, but let me briefly explain them:
One of them provides temporary AWS credentials for an AWS account. We always use the put step to retrieve credentials; its check always returns a static version via jq -rnc '[{deprecated: "PLEASE USE put INSTEAD OF get"}]'. We added this after noticing that each check used to write a new version to the database.
Another resource always returns an empty array ([]) on check, as far as I know.

A lot of teams also use the git-multibranch resource, which means that some of the version strings stored in the resource_config_versions table are very long due to the nature of how that resource works.
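
To quantify that, a rough sketch like the one below sums the stored version payloads per resource config scope; it deliberately stops at the scope id rather than guessing at the join back to resource names:

SELECT rcv.resource_config_scope_id,
       count(*) AS versions,
       pg_size_pretty(sum(pg_column_size(rcv.version))::bigint) AS version_bytes
FROM resource_config_versions rcv
GROUP BY rcv.resource_config_scope_id
ORDER BY sum(pg_column_size(rcv.version)) DESC
LIMIT 20;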

We used to see a lot of issues on Concourse 3.x with the versioned_resources table, but things seemed fine after the upgrade to Concourse 5. We now have new complaints about Concourse being slow to start builds again. We can see that the write IOPS to the database (and the general load) are quite high, and there are always items in the queue.
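
If it is useful for narrowing down where those writes land, the per-table write counters should show which tables have received the most row changes since the last stats reset; something along these lines:

SELECT relname, n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd
FROM pg_stat_user_tables
ORDER BY n_tup_ins + n_tup_upd + n_tup_del DESC
LIMIT 10;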

If you need anything else, I’d be more than happy to provide it. I am not sure how to proceed or how to resolve our current issues.

Hi!

It’s interesting that the query for grabbing the resource types for a pipeline is the slowest one in your deployment. Did you run something similar to this

SELECT
  total_time / calls AS avg_time,
  calls,
  total_time,
  rows,
  100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent,
  regexp_replace(query, '[\s\t\n]+', ' ', 'g')
FROM pg_stat_statements
WHERE query NOT LIKE '%EXPLAIN%'
AND query NOT LIKE '%INDEX%'
ORDER BY avg_time DESC LIMIT 20;

to find that query to be the slowest? If so, would you be able to share the full results of that query with me?

Also, for more context: do you know approximately how many custom resource types each of your pipelines or teams uses?

Hey @clarafu, I took the query from RDS Performance Insights, an AWS service. If you need that query info, I will look into enabling pg_stat_statements for the DB, as it is currently turned off. I’m not entirely sure this is the longest-running query; according to Performance Insights it generates the most load on the database, mostly tying up the CPU. It probably also doesn’t explain the high write IOPS directly.
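
If I understand it correctly, on RDS/Aurora PostgreSQL that roughly means making sure pg_stat_statements is listed in shared_preload_libraries via the DB parameter group and then creating the extension in the database, something like:

-- assumes shared_preload_libraries already includes pg_stat_statements
-- (configured through the RDS DB parameter group)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- optional: clear existing statistics before measuring
SELECT pg_stat_statements_reset();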

In terms of custom resources, I’d estimate that at least half (or even two thirds) of the pipelines use 4 or more configs of this custom resource (the AWS one), as we supply teams with 4 AWS accounts. For git-multibranch it is a bit hard to say, as we don’t have much insight into how teams design their pipelines.
A lot of teams also use a Bitbucket build-status resource, so you can add roughly one more custom resource for almost every pipeline.