How to find out what slows down my task?


#1

I have a task and I’d like to find out what slows it down. Actually I have a suspect, but the log output does not help to confirm it. The actual work in the task takes 2m30s to run, but the overall task needs much more time, and the overall job takes 1h46m.
What I found out so far is that it took 1h30m from the last log entry of the task execution until the task was finally marked with a green tick. Meanwhile there was a cp process running in the container.

My suspicion is that the caches I use in my task are the cause; they are quite big (30 GB). It’s the build directory of an embedded Linux, which does cross compiling. Without the cache the task needs 2h to complete by itself. With the cache it’s really fast.

Unfortunately I did not find much documentation about the details of the caches. Do they really need to be copied, or are they just mounted? What else could that long-running copy command be? (Hint: the last copy command of my script copies an embedded Linux image, which is approximately 250 MB in size.)

Here are my build scripts for reference:

build-for-segin.yml:

platform: linux

image_resource:
  type: docker-image
  source: 
    repository: docker-reg:5000/yocto-build-environment
    insecure_registries: [ "docker-reg:5000" ]

inputs:
- name: twms-yocto-base
- name: twms-yocto-base-prepared

caches:
- path: work/build/cache
- path: work/build/downloads
- path: work/build/sstate-cache
- path: work/build/tmp

outputs:
- name: twms-segin

run:
  dir: twms-yocto-base
  path: ci/tasks/build-for-segin.sh

build-for-segin.sh:

#!/usr/bin/env bash
set -e
set -u
set -x
# prepare the shared work directory and hand it over to the non-root 'ci' user
mkdir -p ../work
chown ci:ci ../work
cd ../work
chown ci:ci ./build || :
chown ci:ci ./build/* || :
# remove image output left over from a previous run
rm -rf ./work/build/tmp/deploy/images
# unpack the prepared yocto base as the 'ci' user
sudo -u ci tar -x --no-same-owner -f ../twms-yocto-base-prepared/twms-yocto-base-prepared.tar
# call the script in the root directory
sudo -u ci ./build-for-segin.sh
# copy the built images (the ~250 MB result) into the task output
cp -r build/tmp/deploy/images ../twms-segin/

#2

Interesting problem!

Shot in the dark: what happens if you avoid the chown and sudo bits? I was thinking about this because the original ticket which introduced the task cache (https://github.com/concourse/concourse/issues/230) talks about copy-on-write semantics (I might be far off since I didn’t look at the code!).
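
If you want to see how the cache paths actually show up inside the container, you could intercept a build of the task and look at the mounts. Something like this should work (just a sketch; the target, pipeline, job and step names are placeholders, not from your setup):

# attach to the most recent build of the task
fly -t my-target intercept -j my-pipeline/build-for-segin -s build-for-segin

# inside the container: check whether the cached paths show up as separate mounts
mount | grep build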

Another action that might help is validating whether you really need the 30 GB of cache or whether there is a sort of progressive “leaking” of the cache size.
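
For example, printing the size of every cached path at the start of your task script would show whether the cache keeps growing from run to run (a rough sketch based on the paths from your YAML, placed right after the mkdir):

# report how big each cached directory is at the start of the build
du -sh ../work/build/cache ../work/build/downloads \
       ../work/build/sstate-cache ../work/build/tmp 2>/dev/null || true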

From a diagnostics point of view, in case you are not aware, I found:

  • you can clear a task cache; see the section “fly clear-task-cache” at https://concourse-ci.org/managing-jobs.html
  • fly volumes will show task cache volumes as task-cache and show the name of the task the cache is for (example below).
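
For example, if I remember the flags correctly (the target, pipeline, job and step names here are just placeholders):

# list the volumes on the workers; task caches show up with the type 'task-cache'
fly -t my-target volumes

# drop the cache of one specific task step in one job
fly -t my-target clear-task-cache -j my-pipeline/build-for-segin -s build-for-segin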

#3

Hi marco-m,

Thank you for your response!

  • I also hoped that it was copy-on-write. If that were the case, there would be no bulk copy action at the end.
  • The copy of 250 MB (the last command) should definitely not take ~1h30m. There were no other jobs running on the machine, and in particular no other disk IO. Plenty of RAM was free, so no swap IO either.
  • I cleared the task cache before the previous run. That previous build collected those 30 GB, and the cache should have kept them for the current run. And it was kept; only after the task finished was there this long-running cp command. So the cache size comes from one single run, fetching tons of repositories with sources for building a Yocto Linux for an embedded device.
  • Without the chown and sudo bits everything runs as root inside the container, which would be okay for me. However, some of the configure scripts used in the Yocto build process refuse to run as root “for security reasons”; they were not developed with Docker in mind, and probably some of them were developed even before Docker was in anyone’s mind…
  • Luckily, except for the path of the cached directory itself, the cache seems to keep the ownership of the cached files, so the chown does not have to run recursively and does not touch all the files.

ATM I am doing another run. It’s the following command that consumes that much time:

cp -a /worker-state/volumes/live/bf598b42-7de6-4c27-4f25-028ddf4f22ef/volume/. /worker-state/volumes/init/9844566e-c77b-447b-5831-eb1e73833e7f/volume

Can anyone tell from this whether it belongs to a cache operation, or how I can find out what this copy is for? It’s maxing out disk IO at 95%.
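
In the meantime I will try to map those volume handles back to something meaningful with fly volumes, roughly like this (the target name is a placeholder):

# search the worker volume list for the two handles from the cp command;
# task caches are listed with the type 'task-cache' and the task they belong to
fly -t my-target volumes | grep -e bf598b42 -e 9844566e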