Concourse DB high CPU load (80-90% all the time)


#1

Hi,

We have constant high load on the Concourse database, and I have found the SQL statement that is consuming all of the CPU on the DB host. Can we add or adjust indexes, or apply some other fix, to correct this performance issue? Or do we need to do some other maintenance on the build or build_inputs tables? The kind of change we have in mind is sketched below the query plan.
Pipelines are slow.

EXPLAIN ANALYSE SELECT COALESCE(MAX(bi.modified_time), 'epoch') as bi_max FROM build_inputs bi LEFT OUTER JOIN versioned_resources v ON v.id = bi.versioned_resource_id LEFT OUTER JOIN resources r ON r.id = v.resource_id WHERE r.pipeline_id = 245;
QUERY PLAN

Aggregate  (cost=100051.05..100051.06 rows=1 width=8) (actual time=0.574..0.575 rows=1 loops=1)
  ->  Hash Join  (cost=99489.25..100050.52 rows=213 width=8) (actual time=0.570..0.570 rows=0 loops=1)
        Hash Cond: (bi.versioned_resource_id = v.id)
        ->  Seq Scan on build_inputs bi  (cost=0.00..458.46 rows=26846 width=12) (actual time=0.007..0.007 rows=1 loops=1)
        ->  Hash  (cost=99226.73..99226.73 rows=21002 width=4) (actual time=0.413..0.413 rows=0 loops=1)
              Buckets: 32768  Batches: 1  Memory Usage: 256kB
              ->  Gather  (cost=1025.56..99226.73 rows=21002 width=4) (actual time=0.412..0.412 rows=0 loops=1)
                    Workers Planned: 2
                    Workers Launched: 0
                    ->  Hash Join  (cost=25.56..96126.53 rows=8751 width=4) (actual time=0.026..0.026 rows=0 loops=1)
                          Hash Cond: (v.resource_id = r.id)
                          ->  Parallel Seq Scan on versioned_resources v  (cost=0.00..82175.70 rows=1103870 width=8) (actual time=0.002..0.002 rows=1 loops=1)
                          ->  Hash  (cost=25.47..25.47 rows=7 width=4) (actual time=0.011..0.011 rows=0 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 8kB
                                ->  Bitmap Heap Scan on resources r  (cost=4.33..25.47 rows=7 width=4) (actual time=0.011..0.011 rows=0 loops=1)
                                      Recheck Cond: (pipeline_id = 245)
                                      ->  Bitmap Index Scan on resources_pipeline_id  (cost=0.00..4.33 rows=7 width=0) (actual time=0.008..0.008 rows=0 loops=1)
                                            Index Cond: (pipeline_id = 245)
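
For reference, this is the kind of change we are wondering about (just a sketch on our side; the index names are made up and similar indexes may already exist, so we would check pg_indexes first):

-- hypothetical index names; check pg_indexes for existing indexes before creating these
create index concurrently if not exists versioned_resources_resource_id_idx
    on versioned_resources (resource_id);
create index concurrently if not exists build_inputs_versioned_resource_id_idx
    on build_inputs (versioned_resource_id);
vacuum analyze versioned_resources;
vacuum analyze build_inputs;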


#2

The pipelines are not slow per se, but with the CPU constantly at 80% on an AWS RDS Postgres 2xlarge instance, the builds are slow, since most of the CPU is consumed by the SQL statement above.
The load stays high during the night as well, even though no builds are running.
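
In case it helps anyone reproduce the measurement, a query along these lines shows the top CPU consumers (this assumes the pg_stat_statements extension is enabled on the instance; total_time and mean_time are the column names in the PostgreSQL 9.x/10 versions of the view):

-- requires pg_stat_statements to be enabled
select calls, total_time, mean_time, left(query, 80) as query
from pg_stat_statements
order by total_time desc
limit 10;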


#3

Hi,

I think we ran into this exact same issue last week. Can you run the following queries and check how many versioned resources there are:

select count(1) from versioned_resources;

The statement above gives you the total number of rows.

select type,count(1) as count from versioned_resources group by type order by count desc;

This will break it down into the resource types.
Are there any resource types with a lot of rows? We had one resource with over 13 million rows in that table. That resource was based on docurated/concourse-vault-resource and created a new random version (based on the current datetime) on every check, causing Concourse to write that version into the table 4 times per pipeline per team in our case (since everyone used the resource 4 times per pipeline).
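
If the type-level breakdown is not conclusive, a per-resource count like the sketch below may help pinpoint the culprit (the resources.name column is an assumption on my part; the id/resource_id join columns are visible in your EXPLAIN output):

-- resources.name is assumed; the join columns match the plan above
select r.id, r.name, count(*) as versions
from versioned_resources v
join resources r on r.id = v.resource_id
group by r.id, r.name
order by versions desc
limit 20;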

We ran this query to remove them:

delete from versioned_resources where modified_time < now() - interval '1 hour' and type='aws-credentials';

This didn't break anything for us. We also changed our resource to emit only an empty array in check and changed every get to a put (where put emits a new random version number), to fix the issue at the source.

Hope this helps!


#4

Hi,

We have about 3 million rows in the versioned_resources table:

  count
---------
 3040689

select type,count(1) as count from versioned_resources group by type order by count desc;
        type        |  count
--------------------+---------
 webhook-resource   | 2818259
 concourse-pipeline |  211813

I see that 4 rows per pipeline are being written into both webhook-resource and concourse-pipeline. We are going to upgrade Concourse to 4.2.1 today. I also want to schedule the delete operation to run every day.
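
This is roughly what I have in mind for the daily cleanup (untested sketch; cron.schedule comes from the pg_cron extension, which may not be available on our RDS version, in which case an external cron job running psql would do the same thing):

-- pg_cron sketch: run the cleanup every night at 03:00; adjust the interval and types as needed
select cron.schedule('0 3 * * *', $$
    delete from versioned_resources
    where modified_time < now() - interval '1 day'
      and type in ('webhook-resource', 'concourse-pipeline')
$$);
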
Have you tested the 4.2.1 version?

> This didn't break anything for us. We also changed our resource to emit only an empty array in check and changed every get to a put (where put emits a new random version number), to fix the issue at the source.

We will also look at whether we can change from get to put.
Thanks for the info!


#5

We're still on 3.14.1, so I can't tell you how well it works with 4.2.1, sorry. We did the removal 6 days ago and everything has been way smoother than before.


#6

This was really helpful for us (also on 3.14.1). We had over 3M records and now have 64k. A few older pipelines that were paused and no longer in use had multiple resources with over 200k versions; just deleting one of those pipelines took 5-10 minutes.

Average CPU has dropped by half, and so has the average full scheduling time.

Thanks! :grin: