What's the best way to gracefully kill a container on a worker?

We have a bunch of pipelines that were migrated from another CI system where everything was required to run via docker-compose, so we ended up with a lot of dcind jobs.

For the most part this is fine, until it isn't: after the job is marked as successful/failed in the UI, the container is still lingering on the worker. This happens 20-30 times a day for us.

Currently we are running the following on all our workers every 30 minutes:

for PROCESS in $(pgrep garden-init); do
    AGE_IN_SECONDS=$(( $(date +%s) - $(stat --format=%Y "/proc/$PROCESS") ))
    # all our pipelines have all their jobs set to a timeout of 1h, check for containers that have lived >65 mins to give Concourse some time to do its magic
    if (( AGE_IN_SECONDS > 3900 )); then
        # we don't care about check/get/put containers, only task containers
        if [[ $(sudo strings "/proc/$PROCESS/environ") != *"ATC_EXTERNAL_URL"* ]]; then
            echo "$PROCESS has been alive for $AGE_IN_SECONDS seconds"
        fi
    fi
done

Inspecting one of these processes, we sometimes see the following:

root     2892829  0.0  0.0   1120     0 ?        Ss   Nov20   0:00 /tmp/garden-init

and sometimes

root     2892829  0.0  0.0   1120     0 ?        Ss   Nov20   0:00 /tmp/garden-init
root     2893142  0.2  0.0 1256120 24292 ?       Sl   Nov20  11:50  \_ dockerd --data-root /scratch/docker
root     2893154  0.1  0.0 884244  8508 ?        Ssl  Nov20   8:26      \_ containerd --config /var/run/docker/containerd/containerd.toml --log-level info
# With or without some random process under the containerd

This is most likely something wrong on our end with our dcind container, which besides spinning up Docker does other nasty stuff such as mounting an NFS volume for shared caching across workers. We are currently trying to catch and fix all errors as we find them.

But what is the correct way to get rid of these old shabby containers as they appear? Sometimes we find (especially when people are running MongoDB, MarkLogic, or Graphite (yep, that is unfortunately a thing…O_o)) that these orphaned containers consume a lot of resources and in extreme cases cause the workers to go into a sad, sad state.

kill -9 $PROCESS obviously works, but it leaves the entry in the containers table in the DB. I don't know whether that is an issue or not.
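If signals do end up being the fallback, a gentler sequence than going straight to kill -9 is to send SIGTERM first and only escalate after a grace period. Below is a minimal sketch of that idea. The `graceful_kill` name and the 30 second default are made up for illustration, and since garden-init runs as PID 1 inside the container it may simply ignore a SIGTERM sent from the host, in which case this degrades to kill -9 after the wait. It does nothing about the row left in the containers table.

```shell
# graceful_kill is a hypothetical helper, not part of Concourse or Garden.
# Usage: graceful_kill PID [GRACE_SECONDS]
graceful_kill() {
    pid=$1
    grace=${2:-30}
    waited=0

    # Ask politely first; give up if the signal cannot be delivered
    # (process already gone, or insufficient permissions).
    kill -TERM "$pid" 2>/dev/null || return 0

    # Poll once per second until the process exits or the grace period ends.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Still alive after the grace period: escalate to SIGKILL.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null
    fi
    return 0
}
```

Inside the cleanup loop this would be called as `graceful_kill "$PROCESS" 30` (with sudo as needed) in place of a bare kill -9.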

There’s no way to tell Concourse “delete this one container on this one worker”. Right now, the only way to safely delete (from a Concourse system perspective) the container from the worker is to recreate the worker and run fly prune-worker on that worker.

Not sure if you know this part: containers of failed builds hang around indefinitely until a new build of the job is started, while containers of successful builds are eventually deleted. It sounds like you know this already but sharing just in case you don’t :slight_smile:

Pulling back the curtain a bit, you can try deleting the container through the Garden API, which is how Concourse itself deletes containers. I’m no expert on the GC cycle, but I think that if you delete the container through Garden, Concourse will eventually delete the container from its DB.

There’s a garden process on each worker listening on port 7777. No auth. This is the endpoint you’d want: https://github.com/cloudfoundry/garden/blob/master/doc/garden-api.md#destroy-a-container

You’ll need the handle of the container first; not sure how you’ll get that. Hope this helps though!
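A rough sketch of what those Garden calls could look like from a shell on the worker. The function names and the `GARDEN_ADDR` variable are made up for illustration; the DELETE route is the one from the linked Garden API doc, and the address assumes the default port 7777 mentioned above. For the handle, `fly -t <target> containers` should list handles together with the worker they live on, which may be enough to match a lingering container to a build.

```shell
# Assumption: Garden listens on 127.0.0.1:7777 on the worker, with no auth.
GARDEN_ADDR=${GARDEN_ADDR:-http://127.0.0.1:7777}

# Build the URL for a single container handle.
garden_container_url() {
    printf '%s/containers/%s\n' "$GARDEN_ADDR" "$1"
}

# Ask Garden which containers it knows about (prints the raw JSON response).
garden_list() {
    curl -sS "$GARDEN_ADDR/containers"
}

# Destroy one container by handle.
# -f makes curl exit non-zero when Garden answers with an HTTP error.
garden_destroy() {
    curl -sS -f -X DELETE "$(garden_container_url "$1")"
}
```

Usage would be `garden_list` to check that the handle is really there, then `garden_destroy <handle>`. Whether Concourse's GC then cleans up the DB row is the part I'm unsure about, so it is worth trying on one container first.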

Yeah, that threw me off a bit, as I think that at the time we started using Concourse, failed builds' containers would only stick around for 20 mins or so.

Excellent, I'll play around with this, thanks! :slight_smile:

The easiest way I've found is to identify the bad container's garden-init process.

Then check the mounts for that process:

cat /proc/$PID/mounts | grep root | cut -d / -f 6
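To run the same extraction over several PIDs, the pipeline can be wrapped in a small function. This is only a sketch: `handles_from_mounts` is a made-up name, and field 6 matches the path layout on our workers, so the field number may need adjusting if your work dir sits at a different depth.

```shell
# handles_from_mounts is a hypothetical helper.
# Reads the contents of a mounts file on stdin and prints each distinct
# handle once. Field 6 of the "/"-separated line is an assumption tied to
# the path layout used in the one-liner above.
handles_from_mounts() {
    grep root | cut -d / -f 6 | sort -u
}

# Usage: handles_from_mounts < "/proc/$PID/mounts"
```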

Then use the handle to check the db.

SELECT *
FROM volumes v
    JOIN containers c ON (v.container_id = c.id)