Concourse deployment on AWS

Hi all

I am interested in running a Concourse proof of concept and want to deploy it into an AWS environment.

Looking at the options, it seems we can go either the Kubernetes route or EC2?

One thing I’ve noticed with the Kubernetes deployment is a post from last year discussing the fact that Concourse must run with privileged mode set to true, and that it doesn’t take advantage of the pod scheduler for scaling, instead using something called “Garden”? This is a bit of a concern, because I don’t want to run into performance issues or have to manage scaling myself.

On the EC2 side, does Concourse support operating as a cluster, in the sense that I can drop worker nodes into an Auto Scaling group which will scale the stack up and down as and when required? For the DB I was considering RDS, if that is supported.

Just looking for some feedback on the above from people (or the creators of this product) who have experience running it.

I want to try to get my deployment strategy right the first time so that I don’t have to revisit it later due to performance issues.

Thanks

I’ve previously written a post about this which includes links to open-source code (AWS EC2 Terraform) and a (slightly dated) tech talk on EngineerBetter’s YouTube channel.

The open-source Terraform is very specific to our use case, so your kilometerage may vary. We have 12-14 teams and 20-25 workers.

I also run Concourse in my personal k8s cluster, and it works well, but I haven’t used it for production workloads.

I worked with a client once who was building custom AMIs which ran the appropriate concourse web / concourse worker commands at startup. IIRC they were deploying with Terraform and using Auto Scaling groups to manage the VMs. Unfortunately this was a while ago, so I don’t have any code or details to share.

I’d also like to plug control-tower, a CLI we’ve made that deploys a Concourse to AWS EC2 in one command.
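
From memory, usage is roughly as below; treat the deployment name and exact flags as assumptions and check the project README (AWS credentials are picked up from the usual environment variables):

```bash
# Sketch from memory; verify the flags against the control-tower README.
# "my-concourse" is a placeholder deployment name.
control-tower deploy --iaas aws my-concourse
```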

Thanks to you both for your input. tlwr, your link and talk were extremely useful, thank you very much for posting! That has given me a head start on my deployment topology.

I am currently building this entirely with a combination of Packer, Ansible and Terraform code into AWS. I don’t want to do anything manually; it has to be completely automated (if the product supports it, I am soon to find out :smiley:).

What I hope for is a worker node cluster that allows rolling updates, for example scaling up when demand for workers is high and scaling back when it dips (using an ASG and a relevant health check). Again, I don’t know yet whether this is possible, as the product is new to me, but I have done something similar with HashiCorp Nomad and it worked well.

Just checking, tlwr: are you running your web nodes and worker nodes in Docker, or are you running them on top of the instance OS? i.e. 1 x EC2 instance = 1 x web or worker node?

We are running our web and worker processes directly on VMs, using systemd.

A module for workers is here: https://github.com/alphagov/tech-ops/tree/master/reliability-engineering/terraform/modules/concourse-worker-pool. It defines some cloud-init to run Concourse: https://github.com/alphagov/tech-ops/blob/master/reliability-engineering/terraform/modules/concourse-worker-pool/files/worker-init.sh
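
For a rough idea of the shape, here is a condensed sketch of such a bootstrap script. This is not the actual worker-init.sh from that module; the paths, unit name, and TSA address are illustrative, though the CONCOURSE_* variables are real worker settings:

```bash
#!/bin/bash
# Condensed sketch of a worker bootstrap script; values are placeholders.
set -euo pipefail

# Point the worker at the web node(s) and its keys.
cat > /etc/concourse/worker.env <<'ENV'
CONCOURSE_WORK_DIR=/opt/concourse/worker
CONCOURSE_TSA_HOST=ci.example.internal:2222
CONCOURSE_TSA_PUBLIC_KEY=/etc/concourse/tsa_host_key.pub
CONCOURSE_TSA_WORKER_PRIVATE_KEY=/etc/concourse/worker_key
ENV

# Assumes a concourse-worker.service unit that reads the env file above.
systemctl enable --now concourse-worker
```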

We have gone through various iterations of using containers with ECS and of not using containers. We were originally using Docker with ECS, but we stopped because it was unnecessarily complicated, due to our 1:1 mapping between teams and Auto Scaling groups (we call them worker pools).

I think that ECS/Docker is fine to run Concourse, if you are doing less weird stuff than we are.

We have some public documentation for our internal users here if you want to know more about our Concourse “addons”: https://reliability-engineering.cloudapps.digital/continuous-deployment.html#continuous-deployment

Hi tlwr

Thanks for your detailed reply and also the links you provided, that’s very kind and I really appreciate your input!

I am going to follow your path and deploy to EC2 instances. Initially I was planning to run 2 x web nodes (in a highly available configuration behind a load balancer) and 2 x worker nodes (also running as a cluster, I hope; I have not read that far in the docs yet), split across availability zones for redundancy. I am not quite sure yet how the backend storage works, i.e. whether it needs to be shared via something like EFS or whether local storage is fine.

What I want to do is run the worker nodes in an ASG so that they can scale out dynamically when the workload is high and scale back in when it dips, but a question I have is:

If I want to push out a new version of the worker nodes via a launch configuration (with a new AMI) and the ASG, I would like to do it as a rolling update: a new worker node is deployed alongside the existing “old” worker cluster, its health check goes green, an old worker node is terminated, and this repeats until the new workers have completely replaced the old. My question is, how does this affect running jobs? Can they automatically fail over to the new nodes (as HashiCorp Nomad does, for example)? And when terminating the existing “old” worker nodes, do I have to perform some kind of draining exercise to stop jobs running on those workers gracefully?

Thanks

Concourse uses Postgres for storage; you can use RDS. We use it and it works. Worker build volumes live on each worker’s local disk, so you don’t need shared storage like EFS.

I recommend you read https://concourse-ci.org/concourse-worker.html as it is very informative, specifically https://concourse-ci.org/concourse-worker.html#gracefully-removing-a-worker

If you use Docker then the STOPSIGNAL is automatically SIGUSR2, so docker stop works properly.

If you use systemd then you should ensure that your service sends SIGUSR2 when stopping, which will cause the worker to be retired gracefully. You should also ensure that systemd waits an appropriate amount of time for in-flight builds to drain (the stop timeout defaults to 90s upstream).
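
With a systemd-managed worker that might look like the drop-in below; the unit name and the timeout value are assumptions, so tune the timeout to however long your builds need to drain:

```bash
# Hypothetical drop-in for a unit named concourse-worker.service.
sudo mkdir -p /etc/systemd/system/concourse-worker.service.d
sudo tee /etc/systemd/system/concourse-worker.service.d/retire.conf <<'EOF'
[Service]
# SIGUSR2 asks the worker to retire: finish in-flight work, then deregister.
KillSignal=SIGUSR2
# Give the worker time to drain before systemd escalates to SIGKILL.
TimeoutStopSec=300
EOF
sudo systemctl daemon-reload
```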

Thanks!

I will check out those links.

I’ll be running it natively on EC2 via systemd, not inside a container, so it’s useful to know about the SIGUSR2 signalling when stopping the service.

Appreciate your tips!

Hi tlwr / crsimmons

So I have almost finished this, but I am thinking about how the scaling will work with the worker nodes. I have a load balancer in front of the web nodes, with an entry in R53 pointing at the ALB, which serves as the entry point for people to reach the UI via HTTPS.

However, the worker nodes need CONCOURSE_TSA_HOST set, and I can’t use the ALB FQDN for this as an ALB only supports HTTP/HTTPS. So I am guessing I would need to put an NLB in front of the web nodes as well, in order to load balance TCP traffic to nlb_fqdn:2222?

How are you doing this, if you don’t mind me asking? The other option is something like NGINX. I thought about using a classic ELB, but is there a common health check I could use across both the web and worker traffic (as an ELB only allows a single health check), so that I don’t need two load balancers? I was hoping to cut down on cost a bit.

The other option I have is to run a single web node in an Auto Scaling group alongside multiple worker nodes; then I wouldn’t need an ALB. Obviously this could cause issues if the sole web node is down or being replaced, so I’d rather have a minimum of 2.

Thanks

If you want to use a classic ELB, you can use one health check with two forwarding rules.

If you want to use an ALB, then for the worker SSH traffic you need one of:

  • DNS service discovery and no load balancer
  • an ELB (classic)
  • a NLB
  • an instance which acts as a load balancer (as you say)

We use an ALB + ELB, but your approach will depend on preference and pricing.
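
For the NLB option, the shape is roughly as below. Names and IDs are placeholders, and the equivalent Terraform resources (aws_lb, aws_lb_target_group, aws_lb_listener) can express the same thing:

```bash
# Placeholders throughout; an NLB forwarding TCP 2222 to the web nodes.
aws elbv2 create-load-balancer \
  --name concourse-tsa-nlb --type network \
  --subnets subnet-aaaa1111 subnet-bbbb2222

aws elbv2 create-target-group \
  --name concourse-tsa --protocol TCP --port 2222 \
  --vpc-id vpc-0123456789abcdef0 --target-type instance

aws elbv2 create-listener \
  --load-balancer-arn <nlb-arn> --protocol TCP --port 2222 \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>
```

Workers would then set CONCOURSE_TSA_HOST to the NLB’s DNS name plus :2222.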

Thanks tlwr, I appreciate the info! I have gone with an NLB for now and am just testing my deployment. The next challenge is auto-scaling the worker nodes in the ASG. I am going to try plain CPU metrics to begin with and see if that is sufficient.

Not sure if anyone is interested (maybe later, if someone runs a Google search!), but I have this set up now with CPU metrics. The next thing I want to implement, once I’ve tested that the solution actually works, is to add Vault into the mix for secrets management (on a per-job basis) against AWS resources. I checked and it looks like Concourse supports Vault; I’m hoping it integrates in a similar way to Nomad.
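
For that future searcher: a CPU-based target-tracking policy on the worker ASG looks roughly like this. The ASG name and target value are assumptions, and Terraform’s aws_autoscaling_policy can express the same policy:

```bash
# Placeholder ASG name and target; scales the group to keep average CPU near 60%.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name concourse-workers \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 60.0
  }'
```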

Hi guys

My deployment is currently undergoing testing and I’m very happy with it so far; thank you for all your help. It scales well and seems to be all good, except for one tiny issue.

I’ve noticed that I’m getting timeouts with pipeline pulls. Looking into it, I added --dns-server=8.8.8.8 to my Concourse service, which seems to have fixed the issue (I am using the default AWS VPC DNS in resolv.conf on the main EC2 host). This feels like a bit of a dirty fix, though, and I would prefer to let it just use the local VPC DNS. Is there a better method that I am missing? I didn’t see anything in the installation documentation, and looking at the GitHub issues it’s not clear.

If you could help with this final part, it would be much appreciated.

Thanks

Figured this out: I had to pass the environment variable CONCOURSE_GARDEN_DNS_SERVER to Concourse. All good now.
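
For anyone finding this later: with systemd that can be a small drop-in. The unit name and resolver address below are assumptions; the VPC resolver sits at the VPC CIDR base + 2 (e.g. 10.0.0.2 for 10.0.0.0/16):

```bash
# Hypothetical drop-in; point the value at your VPC resolver.
sudo tee /etc/systemd/system/concourse-worker.service.d/dns.conf <<'EOF'
[Service]
Environment=CONCOURSE_GARDEN_DNS_SERVER=10.0.0.2
EOF
sudo systemctl daemon-reload
sudo systemctl restart concourse-worker
```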

Seems I wasn’t the only one to hit this issue.

We now have two clusters running in separate regions, and they’re working very well.