Concourse jobs not scheduled or executed regularly


#1

Hello,

We have a pipeline with a job that is supposed to run every 5 minutes. It normally runs fine.
However, during one interval the job was not scheduled or executed: the Concourse web UI shows no runs of this job for more than 40 minutes. We have a Grafana dashboard set up on which we monitor the scheduling time of jobs and stalled workers. The scheduling time was normal, but at the beginning of the problematic interval one of the workers went into the stalled state. We suspect that the job was scheduled to run on this worker but for some reason was not executed. By the time we detected the issue, the worker VM was already gone (another one had been created automatically in its place), so we couldn’t connect to the VM to check its logs. Checking the workers with fly still shows the problematic worker as stalled. We’ve also checked the ATC logs and couldn’t find any hints or clues.
Our Concourse is on version 4.2.1. In the past we’ve often had issues with stalled workers, so we changed the default container placement strategy from “volume locality” to “random”.
Could you tell us how you troubleshoot such cases, and also give us some recommendations?

Kind regards,
Simeon


#2

We have sometimes seen similar behavior. It may be related to https://github.com/concourse/concourse/issues/2581

Unfortunately I cannot give specific recommendations; I suggest considering an upgrade to 5.0. Regarding the container placement strategy, I am not sure whether random improves the situation with respect to stalled workers. In any case, if you upgrade to 5.0, the fewest-build-containers strategy (new in 5.0) is always better than random. So for 5.0 I suggest either the default (volume locality) or fewest-build-containers. The latter can help when workers are often overloaded, provided you have enough workers to spread the load.
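
To pin down exactly when a gap like this starts and ends (so it can be lined up with the moment a worker stalled), something like the following might help. It is only a rough sketch: `find_gaps` is a made-up helper, the timestamps are invented sample data, and how you collect the real build start times from your Concourse is up to you.

```python
from datetime import datetime, timedelta


def find_gaps(start_times, interval=timedelta(minutes=5), tolerance=2.0):
    """Return (previous, next) pairs of build start times that are
    further apart than tolerance * interval.

    Hypothetical helper for illustration; not part of Concourse or fly.
    """
    ordered = sorted(start_times)
    limit = interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Invented sample data: builds every 5 minutes, with one long pause.
sample_starts = [
    datetime(2019, 1, 1, 10, 0),
    datetime(2019, 1, 1, 10, 5),
    datetime(2019, 1, 1, 10, 10),
    datetime(2019, 1, 1, 10, 55),
    datetime(2019, 1, 1, 11, 0),
]

gaps = find_gaps(sample_starts)
for before, after in gaps:
    print(f"no builds between {before} and {after} ({after - before})")
```

The tolerance of 2.0 is an arbitrary choice meaning "more than two missed intervals in a row counts as a gap"; adjust it to how much jitter you normally see.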


#3

Hello,
thank you very much for your reply!
However, I still do not understand why there were no builds at all (at least from the Concourse web UI's perspective) during that problematic 40-minute interval. Even though the job is scheduled to run every 5 minutes, no builds were triggered for more than 40 minutes.
Kind regards,
Simeon