We have a pipeline with a job that is supposed to run every 5 minutes. It normally runs fine.
However, during a certain time interval the job was neither scheduled nor executed (in the Concourse web UI we saw no runs of this job for more than 40 minutes). We have a Grafana dashboard set up in which we monitor job scheduling time and stalled workers. The scheduling time was normal, but right at the beginning of the problematic interval one of the workers went into the stalled state. We suspect that the job was scheduled onto this worker but for some reason never executed. By the time we detected the issue the worker VM was already gone (another one had been created automatically in its place), so we couldn't connect to the VM to check its logs. Checking workers with fly still shows the problematic worker as stalled. We've also checked the ATC logs and couldn't find any hints or clues.
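For reference, this is roughly how we inspect the workers and clean up the leftover registration (the target name and worker name below are placeholders, not our real values):

```shell
# List registered workers and their state; the dead worker shows up as "stalled"
fly -t my-target workers

# Remove the stale registration of a worker whose VM is already gone
fly -t my-target prune-worker --worker some-worker-name
```

Pruning removes the stalled entry from the worker list, but of course it doesn't explain why the job was never started in the first place.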
Our Concourse is on version 4.2.1. In the past we've often had issues with stalled workers, so we changed the default container placement strategy from "volume locality" to "random".
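For context, we made that change on the web node; a minimal sketch of the configuration (as an environment variable, with the equivalent CLI flag shown in a comment; adjust to your deployment method):

```shell
# Concourse web node configuration
CONCOURSE_CONTAINER_PLACEMENT_STRATEGY=random
# equivalent flag form: concourse web --container-placement-strategy=random
```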
Could you tell us how you would troubleshoot such cases, and also give us some recommendations?