High AWS API usage when using SSM credential management


#1

Hi,

I’m seeing a really high rate of API calls to AWS SSM from Concourse. Today it’s fetching things like the SSH key for the Git repositories and some other secrets used for notifications etc. It appears that the retrieved parameters aren’t actually cached within Concourse, nor have I found any option to tweak how often it polls SSM.

Have I missed an option somewhere, or is it just not caching the parameters at all?

Thanks!


#2

I don’t see a configuration option for caching with SSM in the docs. Caching on Vault was only just added in 4.0.0 which required a lot of work in this PR. Adding similar functionality for other credential managers may require another chunk of work - I’m sure a PR would be welcomed.

Outside of caching, you can reduce load by making resource checks less frequent with check_every. From experience I’ve found that for most resources the difference between the default 1m and 2-3m is not really noticeable. You can also make sure credentials are stored at the pipeline level, since Concourse will check /concourse/TEAM_NAME/PIPELINE_NAME/foo_param before /concourse/TEAM_NAME/foo_param.
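To make the lookup-order point concrete, here is a minimal Python sketch (not Concourse's actual code, which is written in Go). It assumes a hypothetical `fetch` function that returns None when a parameter does not exist; the team, pipeline, and parameter names are made up for illustration. It counts how many lookups are needed depending on where the secret is stored.

```python
from typing import Callable, List, Optional, Tuple


def lookup_paths(team: str, pipeline: str, name: str) -> List[str]:
    """Candidate parameter paths, most specific (pipeline-level) first."""
    return [
        f"/concourse/{team}/{pipeline}/{name}",
        f"/concourse/{team}/{name}",
    ]


def resolve(
    fetch: Callable[[str], Optional[str]],  # hypothetical stand-in for an SSM call
    team: str,
    pipeline: str,
    name: str,
) -> Tuple[Optional[str], int]:
    """Return (value, number_of_lookups), stopping at the first path that hits."""
    calls = 0
    for path in lookup_paths(team, pipeline, name):
        calls += 1
        value = fetch(path)
        if value is not None:
            return value, calls
    return None, calls


# Illustrative in-memory store standing in for SSM.
store = {
    "/concourse/main/deploy/git_key": "pipeline-secret",
    "/concourse/main/slack_hook": "team-secret",
}

print(resolve(store.get, "main", "deploy", "git_key"))     # ('pipeline-secret', 1)
print(resolve(store.get, "main", "deploy", "slack_hook"))  # ('team-secret', 2)
```

In this model a team-level secret costs two lookups per resolution and a pipeline-level one costs a single lookup, which is why moving credentials down to the pipeline level can cut the call volume.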


#3

I see load far more frequent than once a minute, but that’s probably because I have quite a few pipelines configured. I currently see multiple calls almost every second, partially due to what you just said about the order in which it looks for the parameters. This will of course not scale when we take this into production.

I’m going to run a large-scale performance test of Concourse pretty soon, since I need to make sure Concourse can handle 100+ teams and 60k+ builds per day. The SSM issue might be a blocker, but I’ll take a look at the code; maybe I could contribute a PR to implement caching.


#4

This was also an issue for Vault until someone PR’d caching support to the Vault credential manager: https://github.com/concourse/atc/pull/236

We mentioned in that PR that it would be nice to have this implemented generically for all credential managers. Maybe the existing implementation could be extracted/generalized in another PR? :slightly_smiling_face:
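As a rough illustration of what a generic cache could look like, here is a small Python sketch. It is not how the Vault PR is implemented; `CachingLookup`, its `fetch` argument, and the 60-second TTL are all hypothetical names and values chosen for the example. The idea is a time-based cache that wraps any lookup function, so the same wrapper would work regardless of the backend.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachingLookup:
    """Wraps any path -> secret lookup with a simple TTL cache (hypothetical sketch)."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # path -> (expiry time, cached value); misses (None) are cached too
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, path: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, no backend call
        value = self._fetch(path)
        self._entries[path] = (now + self._ttl, value)
        return value


# Usage with a fake backend that records every call and a controllable clock.
calls = []


def backend(path: str) -> Optional[str]:
    calls.append(path)
    return {"/concourse/main/git_key": "s3cr3t"}.get(path)


fake_now = [0.0]
cache = CachingLookup(backend, ttl_seconds=60.0, clock=lambda: fake_now[0])

cache.get("/concourse/main/git_key")  # miss: hits the backend
cache.get("/concourse/main/git_key")  # served from the cache
fake_now[0] = 61.0
cache.get("/concourse/main/git_key")  # entry expired: hits the backend again
print(len(calls))  # 2
```

Caching misses as well as hits is a deliberate choice in this sketch, since failed pipeline-level lookups account for a share of the calls when secrets live at the team level. The trade-off is that a newly created or rotated secret can take up to one TTL to become visible.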


#5

We are using exactly the same setup, AWS and SSM (Concourse 3.14.1). We were often throttled by SSM, resulting in transient errors on tens of pipelines.

The solutions proposed by @crsimmons are the best to my knowledge:

  • increasing the resource check interval from the default 1m to 5m works wonders; the SSM throttling almost disappeared for us.
  • if you are still being throttled, then you have to move away from SSM and run your own Vault.
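A back-of-the-envelope way to see why the check interval matters so much is sketched below in Python. It assumes a very simple model in which every resource check resolves each of its secrets with no caching; the function name and all the numbers are illustrative assumptions, not measurements from our setup.

```python
def ssm_calls_per_second(
    resources: int,
    secrets_per_resource: int,
    lookups_per_secret: int,
    check_interval_seconds: float,
) -> float:
    """Estimated steady-state SSM call rate under the simple no-cache model."""
    calls_per_check_cycle = resources * secrets_per_resource * lookups_per_secret
    return calls_per_check_cycle / check_interval_seconds


# Hypothetical installation: 300 resources, one secret each,
# two lookups per secret (pipeline-level miss, then team-level hit).
at_1m = ssm_calls_per_second(300, 1, 2, 60)
at_5m = ssm_calls_per_second(300, 1, 2, 300)
print(at_1m, at_5m)  # 10.0 2.0
```

Under these assumptions, going from 1m to 5m divides the call rate by five, which matches the direction of what we observed even if the real numbers differ.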