Debugging prolonged docker load of image when building


#1

Hi everyone.

I work at Brickshare, where we use Concourse-CI a lot. One issue is bugging us, though. We use https://github.com/concourse/docker-image-resource to build a Docker image and https://github.com/zlabjp/kubernetes-resource to interact with Google Cloud Kubernetes (in order to set the image on a deployment). The challenge is that between the get step (where the docker-image-resource downloads the image from the Google Cloud registry) and the put step (where the docker-image-resource builds the image) there is a 15-20 minute wait. I tried to find out what is going on, and it looks like it takes a very long time to actually load the image. See the attached picture.

docker load…running for a long time. Point-in-time snapshot.

The image is quite large. It is 1.3GB when using cache. So we could likely do something to cut down on the size of the image. However, I can’t see why it should take so much time to load the image even though it might be large.

I configured the get and put steps of the pipeline to:

  - get: brickshare-php-docker-img
    params: { save: true }
  - put: brickshare-php-docker-img
    params: {
      additional_tags: repo-develop/.git/ref,
      build: repo-develop,
      cache_from: [
        brickshare-php-docker-img
      ],
      load_bases: [
        php-docker-img
      ],
      tag_as_latest: 'true'
    }
    get_params: { skip_download: true }

This is in order to use the image cache and so forth.

Any ideas on how to speed up things? And thank you so much!

You might also like to know some specs:

  • Concourse-CI: v4.2.1
  • Running on Kubernetes in Google Cloud
  • One worker

Feel free to ask if more info is needed and I hope to hear from some of you.

Thank you so much.
/Lars Bengtsson


#2

Shot in the dark: for the get part, did you try https://github.com/concourse/registry-image-resource, which is (in a way) the successor of docker-image-resource? It does not use the Docker daemon, so it might help.

To build, you will still have to use docker-image-resource, or try to use a task and registry-image-resource, as explained in the README over there.


#3

It is a good suggestion. I’ll at least try it and get back.


#4

I looked into it. I cannot see that registry-image-resource allows me to do something like params: { save: true } in order for the downloaded image to be saved for a later step. Does anyone know if this is possible?

Also the image I’m getting needs to use authentication via username/password. That does not seem to be available on this resource either.

Thank you


#5

Yeah okay, I found this: https://github.com/concourse/registry-image-resource/issues/1. I think I can safely conclude that the resource is not ready for many use cases yet.

So does anyone have a suggestion for the original question? Maybe you, @marco-m :slight_smile:

Thank you.


#6

I am sorry but I don’t have additional suggestions. I can say that we use huge images also, around 800MB, and it works fine, no delays.

Did you try to ssh onto the worker and docker pull / docker load the image, to verify if the problem goes away without Concourse in the loop?
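
A minimal sketch of that check, assuming a host where both the docker client and a Docker daemon are available. The time_step helper, the IMAGE default, and the TARBALL path are illustrative names, not part of Concourse or the docker-image-resource. It times pull, save, and load separately, since the error output later in this thread shows the resource running docker save on get and docker load on put:

```shell
# Illustrative defaults; override them via the environment.
IMAGE="${IMAGE:-php:7.2.9-apache}"      # image to test with (assumption)
TARBALL="${TARBALL:-/tmp/image.tar}"    # hypothetical scratch file

# Run a command and print how many wall-clock seconds it took.
time_step() {
  label="$1"; shift
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "${label}: $((end - start))s (exit ${status})"
  return "${status}"
}

# Time each phase separately so the slow one stands out.
time_image_roundtrip() {
  time_step "pull" docker pull "${IMAGE}"
  time_step "save" docker save -o "${TARBALL}" "${IMAGE}"
  time_step "load" docker load -i "${TARBALL}"
}
```

After defining the functions, run time_image_roundtrip with IMAGE set to the image in question. If load is slow there as well, Concourse itself is probably not the bottleneck.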


#7

Thank you for the suggestion. However, I’m having a hard time troubleshooting this, as whenever I hijack a worker container I’m thrown out of the container after maybe a minute to a minute and a half. I don’t know if this is normal or if it is because of some incorrect configuration of Concourse-CI on my part.

I would still love a couple of tips on how to troubleshoot this. Thank you


#8

If you are thrown out of an intercepted container that fast, then I think it is a bug. But my suggestion doesn’t require intercepting the container, which is why I mentioned ssh-ing to the worker.


#9

Hmmm. Okay. I did try entering the worker container … that one does not have the docker client available.


#10

I could seriously use some help here on how to troubleshoot performance issues when using Concourse-CI, and on how to pinpoint why it takes so long between getting an image and the start of the build of that image.

When troubleshooting I find that we are also getting errors such as:

  • /opt/resource/out: line 138: 486 Terminated docker load -i "${load_image}/image". This one happens when a put step is building an image. The resource is https://github.com/concourse/docker-image-resource
  • /opt/resource/common.sh: line 77: 173 Terminated timeout -t ${STARTUP_TIMEOUT} bash -ce 'while true; do try_start && break; done'. This also happens when a put step is building an image, on the same resource as the one above.
  • /opt/resource/in: line 92: 499 Terminated docker save -o ${destination}/image "$image_name". This happens on a get step fetching an image from the Google Cloud registry; the image is private. Same resource as the other two above.
  • /opt/resource/common.sh: line 77: 178 Terminated timeout -t ${STARTUP_TIMEOUT} bash -ce 'while true; do try_start && break; done'. A put step building a container image. Same resource.
Pulling eu.gcr.io/"PROJECT"/IMAGE_NAME@sha256:9075a38869c5884a6e416b862b2eea9491e0a1e24c12f2e4b4cc301ac87225cd (attempt 2 of 3)...
	
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Failed to pull image eu.gcr.io/PROJECT/IMAGE_NAME@sha256:9075a38869c5884a6e416b862b2eea9491e0a1e24c12f2e4b4cc301ac87225cd.

/opt/resource/common.sh: line 108: kill: (186) - No such process

The common denominator is:

  • The image is the same in all steps.
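
When collecting errors like the ones above from several builds, it can help to tally which script and command were terminated. A rough sketch, assuming the build output has been captured as plain text in the same "SCRIPT: line N: PID Terminated COMMAND" shape shown above; summarize_terminated is a made-up helper name:

```shell
# Read build output on stdin and count "Terminated" lines,
# grouped by the resource script and the first two words of the command.
summarize_terminated() {
  awk '/ Terminated / {
    script = $1
    sub(/:$/, "", script)              # "/opt/resource/out:" becomes "/opt/resource/out"
    cmd = $0
    sub(/^.* Terminated +/, "", cmd)   # keep only the command that was killed
    split(cmd, words, " ")
    key = script " -> " words[1] " " words[2]
    count[key]++
  }
  END { for (k in count) print count[k], k }' | sort -rn
}
```

For example, saved build output could be piped in with something like summarize_terminated < build.log (build.log being a hypothetical file name). Seeing the same commands killed across get and put steps points away from a single step being misconfigured.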

Here is the entire pipeline that is causing us a headache:

#############
# RESOURCES #
#############
#
# Setup resources
#
resources:
- name: repo-develop
  type: git
  source:
    branch: develop
    uri: https://github.com/Brickshare/some_repo.git
    username: A_USER
    password: ((github-access-token))
- name: repo-master
  type: git
  source:
    branch: master
    uri: https://github.com/Brickshare/some_repo.git
    username: A_USER
    password: ((github-access-token))
- name: brickshare-php-docker-img
  type: docker-image
  source:
    repository: eu.gcr.io/PROJECT/IMAGE_NAME # Defaults to latest
    username: _json_key
    password: ((gcp-gcr-svcaccount-key))
- name: php-docker-img
  type: docker-image
  source:
    repository: php
    tag: 7.2.9-apache
- name: kubernetes-engine
  type: kubernetes
  source:
    kubeconfig: ((bs-kubernetes-bot))
- name: test-deploy-notification
  type: slack-notification
  source:
    url: https://hooks.slack.com/services/NOPE
- name: prod-deploy-notification
  type: slack-notification
  source:
    url: https://hooks.slack.com/services/YES

#
# Setup custom resources
#
resource_types:
- name: pull-request
  type: docker-image
  source:
    repository: teliaoss/github-pr-resource
- name: kubernetes
  type: docker-image
  source:
    repository: zlabjp/kubernetes-resource
    tag: "1.10"
- name: slack-notification
  type: docker-image
  source:
    repository: cfcommunity/slack-notification-resource
    tag: latest

########
# JOBS #
########
#
# Order the jobs into logical groups
#
groups:
- name: Deploys
  jobs:
  - brickshare-php-Deploy-ToTest
  - brickshare-php-Deploy-ToProd

# 
# Setup jobs
#
jobs:

- name: brickshare-php-Deploy-ToTest
  serial: true
  plan:
  - get: repo-develop
    trigger: true
  - get: php-docker-img
    params: { save: true }
  - get: brickshare-php-docker-img
    params: { save: true }
  - put: brickshare-php-docker-img
    params: {
      additional_tags: repo-develop/.git/ref,
      build: repo-develop,
      cache_from: [
        brickshare-php-docker-img
      ],
      load_bases: [
        php-docker-img
      ],
      tag_as_latest: 'true'
    }
    get_params: { skip_download: true }
  - put: kubernetes-engine
    params:
      kubectl: set image deployment/A_DEPLOYMENT AN_IMAGE=eu.gcr.io/PROJECT/AN_IMAGE:"$(cat repo-develop/.git/ref)"
      wait_until_ready: 0
    on_failure:
      put: test-deploy-notification
      params:
        text: |
          :no_entry: I shot the sheriff :gun:! Brickshare ..... failed to be deployed to *test*....
          ----
          > Look into why by going here - https://*/builds/$BUILD_ID
        username: A_USER
        silent: true
  - put: test-deploy-notification
    params:
      text: |
        Yay! :rocket: - Brickshare ..... was deployed to *test*....
        ----
        > It was deployed from Concourse-Ci - https://*/builds/$BUILD_ID
        > Want to give the backend a go? - https://*
      icon_emoji: ":construction:"
      username: A_USER
      silent: true

- name: brickshare-php-Deploy-ToProd
  serial: true
  plan:
  - get: repo-master
    trigger: true
  - get: php-docker-img
    params: { save: true }
  - get: brickshare-php-docker-img
    params: { save: true }
  - put: brickshare-php-docker-img
    params: {
      additional_tags: repo-master/.git/ref,
      build: repo-master,
      cache_from: [
        brickshare-php-docker-img
      ],
      load_bases: [
        php-docker-img
      ],
      tag_as_latest: 'true'
    }
    get_params: { skip_download: true }
  - put: kubernetes-engine
    params:
      kubectl: set image deployment/A_DEPLOYMENT AN_IMAGE=eu.gcr.io/PROJECT/AN_IMAGE:"$(cat repo-master/.git/ref)"
      wait_until_ready: 0
    on_failure:
      put: prod-deploy-notification
      params:
        text: |
          :no_entry: I shot the sheriff :gun:! Brickshare ..... failed to be deployed to *PRODUCTION*....
          ----
          > Look into why by going here - https://*/builds/$BUILD_ID
        username: A_USER
        silent: true
  - put: prod-deploy-notification
    params:
      text: |
        Yay! :rocket: - Brickshare ..... was deployed to *production*....
        ----
        > It was deployed from Concourse-Ci - https://*/builds/$BUILD_ID
        > Want to give the backend a go? - https://*
      icon_emoji: ":100:"
      username: A_USER
      silent: true

#11

Hi there. I kept on troubleshooting this, and it pretty much was a resource exhaustion issue.
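
For anyone landing here with a worker running on Kubernetes, a few kubectl checks along these lines can help confirm or rule out resource exhaustion. This is only a sketch: the namespace and pod name are hypothetical defaults, and it only shows pressure at the pod level, not which resource (memory, CPU, or disk) was the culprit in this particular case.

```shell
# Hypothetical names; adjust to your own deployment.
NAMESPACE="${NAMESPACE:-concourse}"
WORKER_POD="${WORKER_POD:-concourse-worker-0}"

# Per container: name, restart count, and why it last terminated
# (for example OOMKilled when a memory limit was hit).
last_termination_reasons() {
  kubectl get pod "${WORKER_POD}" -n "${NAMESPACE}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Current CPU and memory usage per container (needs a metrics server).
current_usage() {
  kubectl top pod "${WORKER_POD}" -n "${NAMESPACE}" --containers
}

# Free disk space as seen from inside the worker pod.
disk_usage() {
  kubectl exec "${WORKER_POD}" -n "${NAMESPACE}" -- df -h
}
```

Running current_usage while a slow get or put step is in progress is likely the most telling of the three.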

I’m still looking forward to being able to specify resources for put steps.

I think I will close this one. Thank you very much for your help @marco-m

Have a great weekend.


#12

@larssb happy you found the problem!