Concourse Outputs - Share across multiple tasks

Hi,

Can someone provide details on how outputs work in Concourse?

For instance, I have multiple tasks within the same job where I would like to place files in outputs so I can consume them in another task. I would like to share a single output, update it via multiple tasks, and then consume it. I am only seeing files from one task. Can someone explain if this is expected behavior?

Hi @Dan

In order to accomplish this specifically with task outputs, you have to do a little mapping gymnastics. If all tasks have a commonly named output, say “workspace”, you can rename that output to something unique via the output_mapping property, and then map it to the commonly named input of the next task via input_mapping.

The example below is a little contrived, but it demonstrates how to take the commonly named “workspace” output and map it to an input of the next task. In this example, we are pulling in a repo, “doing some work with it”, then passing that work on to the next task so that it may “do some work” on that same repo.

The pipeline.yml file would look something like this:

# pipeline.yml
resources:
- name: some-repo
  type: git
  source:
    uri: ((repository-uri))
    branch: ((repository-branch))
    private_key: ((repository-private-key))

- name: latest-version
  type: semver
  source:
    driver: s3
    bucket: ((version-bucket))
    key: ((version-bucket-key))
    access_key_id: ((version-access-key-id))
    secret_access_key: ((version-secret-access-key))

jobs:
- name: some-job
  plan:
  - get: some-repo
    trigger: true
  - get: latest-version
  - task: first-task
    input_mapping:
      repo: some-repo
      version: latest-version
    output_mapping:
      # NOTE: We are renaming the _common task output_ "workspace"
      #       to something unique "workspace-first-task"
      workspace: workspace-first-task
    file: some-repo/ci/first-task.yml
  - task: second-task
    input_mapping:
      # NOTE: We are mapping the previous _unique task output_ "workspace-first-task"
      #       to the _common task input_ "repo"
      repo: workspace-first-task
      version: latest-version
    output_mapping:
      # NOTE: We are renaming the _common output_ "workspace"
      #       to something unique "workspace-second-task"
      workspace: workspace-second-task
    file: some-repo/ci/second-task.yml
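
Once saved, you'd load it with something like this (the target name here is a placeholder):

fly -t my-target set-pipeline -p some-pipeline -c pipeline.yml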

The task file for first-task would look something like:

# first-task.yml
---
platform: linux
image_resource:
  type: docker-image
  source:
    repository: ((task-image-repository-uri))
    tag: latest

inputs:
- name: repo
- name: version

outputs: 
- name: workspace

run:
  path: sh
  args:
  # NOTE: -c is required so sh reads the script from the next argument
  - -euxc
  - |
    WORKING_DIR=$(pwd)
    REPO_NAME=repo
    REPO_DIR="${WORKING_DIR}/${REPO_NAME}"
    VERSION_DIR="${WORKING_DIR}/version"
    VERSION=$(cat "${VERSION_DIR}/version")
    # INFO: This directory will be passed as an output, into the next task
    OUTPUT_DIR="${WORKING_DIR}/workspace"
    # INFO: Copy **contents of** incoming repo into output; do all work there
    cp -R "${REPO_DIR}/." "${OUTPUT_DIR}/"
    # NOTE: Move to OUTPUT directory
    cd "${OUTPUT_DIR}"
    # INFO: Do some work in this task
    # ... did some work ...

The task file for second-task would look something like:

# second-task.yml
---
platform: linux
image_resource:
  type: docker-image
  source:
    repository: ((task-image-repository-uri))
    tag: latest

inputs:
- name: repo
- name: version

outputs: 
- name: workspace

run:
  path: sh
  args:
  # NOTE: -c is required so sh reads the script from the next argument
  - -euxc
  - |
    WORKING_DIR=$(pwd)
    REPO_NAME=repo
    # INFO: This directory contains the output from the last task
    REPO_DIR="${WORKING_DIR}/${REPO_NAME}"
    # INFO: This directory could be passed onto another task
    OUTPUT_DIR="${WORKING_DIR}/workspace"
    # INFO: Copy **contents of** incoming repo into output; do all work there
    cp -R "${REPO_DIR}/." "${OUTPUT_DIR}/"
    # NOTE: Move to OUTPUT directory
    cd "${OUTPUT_DIR}"
    # INFO: Do some work in this task
    # ... did some work ...

Thanks for your reply. I ended up doing output mappings but wanted to see if there was another approach. I think that without mappings, if I use the same output name, it probably gets overwritten. I plan to have about 30 tasks, as we use a monorepo with about 30 services, so this means I would need 30 mappings. :frowning:

I am also curious how outputs work internally in Concourse. Since every task starts in its own container, how do outputs get shared across multiple tasks? Each task can start on a separate worker, which might run on a separate EC2 instance (I am using Concourse deployed on k8s via helm on AWS). I would guess they are shared via volumes?

Yes, this is true. If you have an input which is named the same as an output, the input will get overwritten.

# some-task.yml
---
platform: linux
image_resource:
  type: docker-image
  source:
    repository: ((task-image-repository-uri))
    tag: latest

inputs:
- name: repo
- name: version
- name: important-thing

outputs: 
- name: workspace
- name: important-thing

run:
  path: repo/ci/some-task.sh

The contents of the important-thing/ folder in the container will be empty. Concourse “hooks up” the inputs first, and then the outputs (possibly as a race, not serially). Output folders start out empty, since they only contain something once your task runs, so when the output overwrites the input of the same name, you are left with an empty folder instead of the input you thought you would have.
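
One way to sidestep this is the same pattern the workspace examples above use: give the output a distinct name and copy the input's contents into it at the start of your script. A minimal sketch (the “-updated” name is just an example):

# some-task.yml (collision avoided)
inputs:
- name: important-thing
outputs:
# NOTE: distinct name, so the input is no longer masked
- name: important-thing-updated

Then have the run script start with cp -R important-thing/. important-thing-updated/ so the updated copy becomes the output.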

This sort of has a “code smell” to me. We're also using a monorepo: 11 services, 10 lambdas, 26 packages, and we don't have this sort of complexity. I realise the implementation and context are really important here, but it seems to me you might want to rethink the structure of your pipeline. I'd love to lend a helping hand if I can.

Yeah, you're absolutely correct. Each task runs in a container, and each input/output represents a volume that is mounted into it. Volumes are streamed between workers by the ATC/TSA. Say Task1 lands on Worker1 and creates some output in a workspace/ folder, and Task2, which requires this as an input, lands on Worker2: the volume that was hooked up to Task1 (living on Worker1) gets streamed to Worker2 and then mounted into Task2's container on Worker2.
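
If you want to see this machinery, the fly CLI can list the containers and volumes currently on your workers (the target name is a placeholder):

fly -t my-target containers
fly -t my-target volumes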

Thanks for offering help. Let me describe what I am trying to build here.

We use a monorepo approach, which consists of about 30 services. These are mostly Python applications built with pants (as .pex binaries), which I then package into docker images. I used to have separate jobs for each app build, but it was getting complicated to keep things in sync just by using resources, so I decided to consolidate everything into a single job with individual tasks for each app.

Whenever a PR gets created, the very first task in my pipeline determines the list of files changed in that PR, and by using an internal dependency tool I generate a list of apps that are affected by those changes. I create an apps.txt file with all affected apps and pass it along as an output.

Then I have a separate task for each app, which gets that apps.txt file as an input and checks if that particular app exists in the file. If it does, I build it; if not, I just skip the build. That logic lives in the pr-pants-build.yml task.
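
A minimal sketch of what that check might look like inside pr-pants-build.yml's run script (the output name “changed-apps” for the detect-file-changes task is an assumption):

# Skip the build if TARGET is not listed in apps.txt
if ! grep -qx "${TARGET}" changed-apps/apps.txt; then
  echo "No changes affecting ${TARGET}; skipping build."
  exit 0
fi
# ...otherwise run the pants build for ${TARGET}...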

Here’s the snippet from my pipeline:

  - name: pr-pants-build
    plan:
    - { get: pull-request, trigger: true, version: every }
    - task: detect-file-changes
      file: pull-request/utils/concourse-ci/pipelines/tasks/pr-detect-file-changes.yml
    - in_parallel:
      - task: app1-pants-build
        file: pull-request/utils/concourse-ci/pipelines/tasks/pr-pants-build.yml
        output_mapping: {pants_build_artifacts_output: app1_pants_build_artifacts_output}
        params:
          TARGET: app1
      - task: app2-pants-build
        file: pull-request/utils/concourse-ci/pipelines/tasks/pr-pants-build.yml
        output_mapping: {pants_build_artifacts_output: app2_pants_build_artifacts_output}
        params:
          TARGET: app2

In this case I have some flexibility over whether to skip the build or not, since it is bash. Each of my per-app pants-build tasks produces an artifact archive with the Dockerfile, pex binary, etc., which I pass along as an output. Then I was thinking about having another task that runs after all app builds are done, collects all those artifacts (pex1, pex2, etc.), creates a single artifacts.tgz, and uploads it to s3. This way I can trigger the docker build job whenever artifacts.tgz changes, and all of my apps are tightly coupled to a single PR commit.
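
A rough sketch of that collector task's script, assuming the per-app outputs are declared as inputs and “bundle” is declared as an output (all names here are assumptions):

# Bundle every per-app artifact directory into a single archive
tar -czf bundle/artifacts.tgz \
  app1_pants_build_artifacts_output \
  app2_pants_build_artifacts_output
# A put against an s3 resource would then upload bundle/artifacts.tgz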

Because the docker build occurs in a put step, I don't have much flexibility over whether or not to trigger it. For instance, I created another job, pr-docker-build, which downloads this artifacts.tgz from an s3 resource. I can unpack it and place each app's artifacts into an individual per-app output. Then I would have a separate per-app put step to build each docker image. However, in this case I basically need to hard-code puts for all 30 apps, and what happens if the previous step only produced artifacts for 2 apps and I only need to build those 2?

There is not much flexibility in building docker images this way. I can use try to ignore failures whenever the Dockerfile is missing for a specific app, but then I will also be ignoring real build errors.
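
For reference, wrapping the put in a try step would look something like this (resource and artifact names are hypothetical), with the drawback described above that real build errors are swallowed too:

- try:
    put: app1-image
    params:
      build: app1_pants_build_artifacts_output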

Maybe you have some other suggestions for accomplishing it?

There are definitely some efficiencies to be gained here.


Assumptions:

  • Only prep/build services that have actually changed in the pull-request
  • All services are deployed together

Questions:

  • Where do you keep your docker images? AWS ECR or DockerHub?
    • Are they tagged and versioned in this place?
  • If you only build 5 out of the 30 apps, how does the “system” know which new images to deploy into a test environment?

    e.g. app1 is new so take version-x; app2 is not new so take version-latest; etc.

  • Do all apps in the monorepo “build/compile” the same way, in that you could execute some build.sh file in their directory to get the desired output (prep) for the docker build step?

I think with the context above I could potentially help you create a pipeline that is agnostic of the number of services.

Thoughts

  • There should be a way to create a prep-task that produces a single output folder with all the necessary trimmings (Dockerfile, binary, libs, etc.) for each app. You'd have a structure something like:
/output_folder
| - /app1
|   | - /libs
|   | - Dockerfile
|   | - binary
|
| - /app2
|   | - etc.
|
| - etc.
  • If the above is true, then consuming this becomes a little easier, as you could iterate the folders and “do things” (see the loop sketch after this list).
  • Having all the app-image-resources as put steps on this main task might be challenging, because I'm not sure there is a way to “noop” the put step of the docker-image-resource in Concourse. I could be mistaken, though, and there might be a way to leverage this.
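
A sketch of what iterating that folder might look like, assuming the structure above (the per-app action, e.g. a docker build, would need docker available in the task, as discussed below):

# Iterate every app directory in the prepared output folder
for app_dir in output_folder/*/; do
  app="$(basename "${app_dir}")"
  echo "Processing ${app}"
  # e.g. docker build -t "my-registry/${app}:latest" "${app_dir}"
done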

Thanks for your suggestions. After doing some more research and chatting with folks on discord, I decided to go with a Docker-in-Docker approach. I am using this script inside my ubuntu build image to set up docker: https://github.com/Snapkitchen/concourse-docker-compose/blob/master/lib/docker-v1.bash, which lets me build/push/pull images with full docker support directly in my tasks. This gives me a lot of flexibility, as I can do everything within a single task without needing to pass things across tasks, put steps, etc.

Resources work great out of the box, but they lack the control and flexibility I need to work with a monorepo. For instance, we have multiple AWS accounts to which I need to push docker images. With the docker resource I would actually need to specify 4 resources pointing to 4 repos, and multiplying that by 30 services creates a pretty messy pipeline :slight_smile: If I can push directly with docker, I can do it inside my build task with a few lines of bash. I am currently building out a POC for the PR workflow, and so far everything is working out great.
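
For example, a push to several accounts from inside the DinD-enabled task could be a loop like this (account IDs, region, and the TAG variable are placeholders, and cross-account credential handling is elided):

# Push the freshly built image to each account's ECR registry
for account in 111111111111 222222222222; do
  registry="${account}.dkr.ecr.us-east-1.amazonaws.com"
  aws ecr get-login-password --region us-east-1 \
    | docker login --username AWS --password-stdin "${registry}"
  docker tag "app1:${TAG}" "${registry}/app1:${TAG}"
  docker push "${registry}/app1:${TAG}"
done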