Maximum (deployed) scale?


#1

What’s the largest known Concourse deployment?

Is concourse suitable for general-purpose workflow management?

For example, would it be reasonable to use Concourse to manage a large dependency graph involving large files and high-memory jobs?


#2

I know of a few companies running fairly large Concourses for deploying/managing multiple Cloud Foundries + related infrastructure. Pivotal runs a large internal Concourse installation called “Wings” that spans multiple teams. Some of the pivots on here can tell you more about that. The other one that comes to mind is the Pivotal releng Concourse (https://releng.ci.cf-app.com/), which has exceptionally large pipelines.

Concourse can scale both horizontally (add more workers) and vertically (larger VMs), so I would think it could be adapted to fit the desired jobs, assuming the underlying resources are available.
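As a rough sketch of what that looks like in practice (assuming a BOSH-managed deployment; the instance group names and sizes here are made up), scaling is mostly a matter of editing the deployment manifest:

    # Hypothetical excerpt from a BOSH deployment manifest for Concourse.
    # Horizontal scaling = raise `instances`; vertical = pick a bigger `vm_type`.
    instance_groups:
    - name: web          # runs the ATC
      instances: 2
      vm_type: medium
    - name: worker
      instances: 10      # add more workers to scale out
      vm_type: large     # or use a larger VM type to scale up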


#3

I don’t know about the maximum, but I can tell you about some of the scale we’ve encountered and managed. As @crsimmons mentioned above, the team runs a hosted Concourse called “Wings” that serves 95 teams. It runs with 3 ATCs and 24 workers on GCP. We used to think that was pretty big until we met folks who run 4 ATCs and 30+ workers.

I’m sure there’s a maximum scale, but with our recent features around distributed GC and resource management, we’ve gotten better at scaling workers and web nodes.

Going back to your original question, I guess the answer is “it depends”. The CF teams, Spring teams, and data teams predominantly use Concourse, and they seem to be doing well.


#4

Thanks! I’m going to start investigating using Concourse in that case.

To help me understand the scenarios you’ve mentioned, can you please re-frame them in terms of cores (100% utilized, concurrent)? Secondarily, what’s the maximum-size (RAM) job supported on that cluster? The median? How is persistent data shared between jobs?

Could you please define some terms? I’m new here : )

  • CF – Cloud Foundry? What’s that?
  • ATC

Thanks again!


#5

For the terms:
CF = Cloud Foundry, which is a PaaS. See more at https://www.cloudfoundry.org/
ATC = the Concourse component that serves the web UI/API and schedules pipelines (the name is short for “air traffic control”)

I recommend reading through the official docs at https://concourse-ci.org/index.html. The Concepts overview might be a good place to start.
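To make the pipeline concept concrete, here is a minimal, hypothetical pipeline definition: one Git resource and one job that runs the repo’s tests whenever a new commit lands.

    # Minimal Concourse pipeline: one resource, one job (names are made up).
    resources:
    - name: my-repo
      type: git
      source:
        uri: https://github.com/example/my-repo.git

    jobs:
    - name: run-tests
      plan:
      - get: my-repo
        trigger: true             # run on every new commit
      - task: test
        file: my-repo/ci/test.yml # task config stored in the repo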


#6

My bad; I have a bad tendency to lean on acronyms. @crsimmons has me covered, though.

Our stats and metrics are open source, and you can poke around our Grafana dashboard here: https://metrics.concourse-ci.org/dashboard/db/concourse?refresh=1m&orgId=1&var-deployment=wings. Let me know if you want more specifics.

AFAIK there’s no limit on the maximum RAM a job can use; if you have a VM large enough, the job should be able to run. I’m sure one of the engineers will have more to say on that.

Concourse is intentionally designed so that you need to externalize data between jobs (e.g. put it in an artifact store, git, etc.)
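As a sketch of that pattern (resource and file names here are hypothetical), one job puts an artifact into an S3 bucket and a downstream job gets it back, constrained to versions that passed the first job:

    resources:
    - name: source-code           # hypothetical git repo
      type: git
      source: {uri: https://github.com/example/app.git}
    - name: build-artifact        # hypothetical S3-backed artifact store
      type: s3
      source:
        bucket: my-artifacts
        regexp: app-(.*).tgz

    jobs:
    - name: build
      plan:
      - get: source-code
        trigger: true
      - task: compile
        file: source-code/ci/compile.yml  # assumed to produce an `out` output
      - put: build-artifact
        params:
          file: out/app-*.tgz             # externalize the artifact
    - name: test
      plan:
      - get: source-code
      - get: build-artifact
        passed: [build]                   # only versions produced by `build`
        trigger: true
      - task: run-tests
        file: source-code/ci/test.yml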


#7

I think the memory thing is workable with resource limits.
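For example, a task config can cap its container’s CPU and memory. This is a sketch assuming a Concourse version that supports container_limits; the input and script names are hypothetical:

    # Hypothetical task config with container-level resource limits.
    platform: linux
    image_resource:
      type: docker-image
      source: {repository: ubuntu}
    inputs:
    - name: my-repo
    container_limits:
      cpu: 1024              # CPU shares
      memory: 4294967296     # bytes (4 GiB)
    run:
      path: my-repo/scripts/heavy-job.sh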

Large files are a different beast, though, depending on what counts as “large” (gigabytes? petabytes?). Streaming giant files around is always a performance bottleneck, no matter how it’s sliced. This is where I’d focus on Concourse’s abilities as a work orchestrator – interacting with remote systems via resources.

I’ll use an extreme example to demonstrate. Suppose I have a database with 100 TB of records across hundreds of tables. I could, in theory, stream the raw data into a volume as part of a task, then have my own task code manually perform joins. This might “work”, but it would require a very beefy worker and would take a very long time. It would be flaky, expensive, and, worst of all, slow.

In this situation, I would instead look to create a resource that interacts with my remote database. For online database situations (e.g., Postgres), it would be modelled as a get that takes the query as a parameter and drops the results into the volume as a file. For offline/batch database situations (e.g., Hadoop), I would probably model this as a put that submits the work to be performed, with a check/get that waits on the result of that batch operation.
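A sketch of the online case in pipeline terms, with an entirely hypothetical postgres-query resource type: the get hands the query to the resource, and the results land in the build’s volume as a file.

    resource_types:
    - name: postgres-query                 # hypothetical custom resource type
      type: docker-image
      source: {repository: example/postgres-query-resource}

    resources:
    - name: reporting-db
      type: postgres-query
      source:
        uri: postgres://user:pass@db.example.com/reports  # hypothetical DSN

    jobs:
    - name: monthly-report
      plan:
      - get: reporting-db
        params:
          query: "SELECT * FROM orders"    # result written into the volume as a file
      - task: summarize
        file: ci/summarize.yml             # assumed to read reporting-db/results.csv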


#8

Thanks. There are some many-gigabyte (and some terabyte) temporary files involved. Ideally I could express a preference that the next step runs on the same host as the previous step and the file is “transferred” by simply persisting on local scratch disk. If I can convince Concourse to use this strategy preferentially, but also fall back to transferring files between machines when necessary, that would be quite compelling.


#9

If you’re fine with doing this inside a single job, task inputs and outputs should do the trick. Task caches might be worth looking into as well.
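A sketch of the single-job approach (inline task configs, made-up names): the first task writes into a declared output, and the second task picks it up as an input, so the data stays on the worker’s volumes rather than going through an external store.

    jobs:
    - name: build-and-package
      plan:
      - task: generate
        config:
          platform: linux
          image_resource:
            type: docker-image
            source: {repository: ubuntu}
          outputs:
          - name: scratch                # large temporary files land here
          run:
            path: sh
            args: ["-c", "dd if=/dev/zero of=scratch/big.bin bs=1M count=1024"]
      - task: consume
        config:
          platform: linux
          image_resource:
            type: docker-image
            source: {repository: ubuntu}
          inputs:
          - name: scratch                # same data, passed via worker volumes
          run:
            path: sh
            args: ["-c", "ls -lh scratch/"]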

For the job-to-job case, you’ll still need a resource to do what you want. Some time back (before task caches) I experimented with colocating Minio nodes on Concourse workers using BOSH. This made it possible to put to the colocated Minio cluster using the s3 resource. I think in principle it could be done with other deployment options too.
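The piece that makes that work is the s3 resource’s endpoint setting, which can point at a non-AWS, S3-compatible store like Minio (the address and credential variable names here are hypothetical):

    resources:
    - name: scratch-files
      type: s3
      source:
        bucket: scratch
        regexp: big-file-(.*).bin
        endpoint: http://10.0.0.5:9000          # hypothetical colocated Minio node
        access_key_id: ((minio_access_key))     # hypothetical credential vars
        secret_access_key: ((minio_secret_key))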