Concourse jobs not scheduled or executed regularly

#1

Hello,

We have a pipeline with a job that is supposed to run every 5 minutes. It normally runs fine.
However, during a certain time interval the job was not scheduled or executed (in the Concourse web UI we saw no runs of this job for more than 40 minutes). We have a Grafana dashboard set up in which we monitor job scheduling time and stalled workers. The scheduling time looked normal, but at the beginning of the problematic interval one of the workers went into a stalled state. We suspect that the job was scheduled to run on this worker but for some reason was not executed.

By the time we detected the issue the worker VM was already gone (another one had been created automatically in its place), so we could not connect to the VM to check its logs. Checking workers with fly still shows the problematic worker as stalled. We have also checked the atc logs and could not find any hints or clues.
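For reference, this is roughly how we inspect things with fly in such cases (the target name `ci` is a placeholder):

```
# List workers and their state; a stalled worker shows up with state "stalled"
fly -t ci workers

# List containers together with the worker each one runs on, to see whether
# a check or build container is sitting on the stalled worker
fly -t ci containers
```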
Our Concourse is on version 4.2.1. In the past we’ve often had issues with stalled workers, so we changed the default container placement strategy from “volume locality” to “random”.
Could you tell us how you troubleshoot such cases, and also give us some recommendations?

Kind regards,
Simeon

#2

We have sometimes seen similar behavior. It seems related to https://github.com/concourse/concourse/issues/2581

Unfortunately I cannot give specific recommendations; I suggest considering an upgrade to 5.0. Regarding the container placement strategy, I am not sure whether random makes the situation better with respect to stalled workers. In any case, if you upgrade to 5, the fewest-build-containers strategy (new in 5.0) is always better than random, so for 5 I suggest either the default (volume-locality) or fewest-build-containers. The latter can help when workers are often overloaded, provided you have enough workers to spread the load.
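As a side note, the placement strategy is a setting of the web (ATC) node; a minimal sketch, assuming an environment-variable based deployment (the equivalent command-line flag is --container-placement-strategy):

```
# Accepted values in 5.x: volume-locality (default), random,
# fewest-build-containers
export CONCOURSE_CONTAINER_PLACEMENT_STRATEGY=fewest-build-containers

# Then start the web node as usual (other required flags omitted here), e.g.:
# concourse web --external-url https://ci.example.com ...
```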

#3

Hello,
thank you very much for your reply!
However, I still do not understand why there were no builds (at least from the Concourse web UI's perspective) during that problematic 40-minute interval. Even though the job is scheduled to run every 5 minutes, no builds were triggered for more than 40 minutes.
Kind regards,
Simeon

#4

Hello @marco-m,

I think I have found the root cause of the jobs not being scheduled regularly.
According to the documentation:
“Resource Check Containers are created from the resource type’s image and are used to check for new versions of a resource. There will be one per resource config.” (see https://concourse-ci.org/container-internals.html).
And what happened in our case is that the resource check container was running on the worker that became stalled during that interval.
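For concreteness, the resource in question is a 5-minute time resource; here is a minimal sketch of this kind of pipeline (the pipeline, resource, and job names and the task image are placeholders), for which Concourse keeps a single check container for the resource's source configuration:

```
# Hypothetical minimal pipeline, saved as pipeline.yml
cat > pipeline.yml <<'EOF'
resources:
- name: every-5m
  type: time
  source:
    interval: 5m   # this "source" is the resource config the docs refer to

jobs:
- name: periodic-job
  plan:
  - get: every-5m
    trigger: true  # a new time version roughly every 5 minutes triggers a build
  - task: do-work
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: busybox}
      run: {path: sh, args: ["-c", "date"]}
EOF

fly -t ci set-pipeline -p periodic -c pipeline.yml
```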
I was able to reproduce the issue several times on my dev Concourse setup.
I reproduced it by inspecting the resource check container, which allowed me to determine which worker it was running on. Then I intentionally stalled that worker, which of course made the resource check container unreachable. A rough sketch of these steps is below.
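Roughly (the target name `dev` is a placeholder, and how the worker is stopped depends on how it is deployed):

```
# Find the check container of the time resource and the worker it runs on
# (fly containers prints the container type and the worker name)
fly -t dev containers | grep check

# Stall that worker, e.g. by stopping the worker process on its VM so that it
# stops heartbeating to the ATC; with a systemd-managed worker it could be
# something like:
#   sudo systemctl stop concourse-worker

# After a short while the worker is reported as stalled
fly -t dev workers
```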
As expected, the check for the 5-minute time resource started failing with:
"Get http://127.0.0.1:8080/containers: worker '79c54ab1-53a0-406e-ba74-17a5676f7aaf' is unreachable (state is 'stalled')".
And because there is only one such container per resource config (as stated in the Concourse docs), no builds were scheduled for the next 10 to 25 minutes; on average, the Concourse scheduling mechanism was not able to create a new container for the 5-minute resource for about 18 minutes. All of this led to the situation described in my first comment, where we observed that no builds were scheduled for a 24-minute interval.
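As a side note, a possible way to nudge recovery manually in this situation, sketched with the placeholder names from above (the worker name is taken from the error message):

```
# Remove the stalled worker record so the ATC stops trying to reach it
fly -t dev prune-worker -w 79c54ab1-53a0-406e-ba74-17a5676f7aaf

# Force an immediate check of the time resource instead of waiting for the
# next periodic check
fly -t dev check-resource -r periodic/every-5m
```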
In my opinion, having only one container per resource config is a single point of failure: if the worker running that container becomes stalled (which is what happened in our case), jobs that run on short intervals (like ours, every 5 minutes) are seriously affected.
If there were two or more containers per resource config, spread across different workers, then if one of the workers running a resource check container went into a stalled state, something like a failover procedure could take place.
I was wondering: why is there only one container per resource config?
And if there can be no more than one container per resource config, could the Concourse scheduling mechanism be faster at detecting the issue and create a new container on a different (healthy) worker as soon as possible?

Kind regards,
Simeon

#5

Hello @simeonkorchev,
very good investigation! I have two comments:

  1. If you look at the ticket I already mentioned (https://github.com/concourse/concourse/issues/2581), it has pointers to various similar tickets. Could you go through them, select the one that most closely resembles your problem, and add this analysis to it? I think it will help the Concourse team prioritize.
  2. There is ongoing work that might help: https://github.com/concourse/concourse/issues/3079