Delay before first task starts in jobs

Been using Concourse for about 6mo now and enjoying it quite a bit.

One thing I’ve noticed and am finally starting to dig into is that it can take a job 1-3m to start its first task/resource. The job will show as running, but no tasks/resources are doing anything. I would expect tasks to start executing immediately once a job shows as Running in the UI.

Looking at web logs I’ve noticed the following:

Job Triggered (from me hitting +):

{"timestamp":"2020-03-26T17:14:04.538422123Z","level":"info","source":"atc","message":"atc.checker.start","data":{"session":"23"}}
{"timestamp":"2020-03-26T17:14:04.540511053Z","level":"info","source":"atc","message":"atc.checker.end","data":{"session":"23"}}
{"timestamp":"2020-03-26T17:14:04.546986534Z","level":"info","source":"atc","message":"atc.checker.running","data":{"check":1014,"session":"23"}}
{"timestamp":"2020-03-26T17:14:04.547005164Z","level":"info","source":"atc","message":"atc.checker.running","data":{"check":1013,"session":"23"}}
{"timestamp":"2020-03-26T17:14:05.121453083Z","level":"info","source":"atc","message":"atc.checker.succeeded","data":{"check":1013,"session":"23"}}
{"timestamp":"2020-03-26T17:14:05.147445566Z","level":"info","source":"atc","message":"atc.checker.succeeded","data":{"check":1014,"session":"23"}}

First task starts running:

{"timestamp":"2020-03-26T17:16:18.272315232Z","level":"info","source":"atc","message":"atc.tracker.track.running","data":{"build":4684,"job":"build","pipeline":"publish","session":"20.19"}}
{"timestamp":"2020-03-26T17:16:18.276930595Z","level":"info","source":"atc","message":"atc.tracker.track.get-step.initializing","data":{"build":4684,"job":"build","job-id":585,"pipeline":"publish","session":"20.19.3","step-name":"master"}}

As you can see in the above example, the time between the job being triggered (and showing as Running in the UI) and the first activity in the logs was 2m14s in this case. Once the first task/resource finally gets started, all subsequent tasks start immediately after each other in that build.

This is happening for all jobs. I’ve even tried creating a simple job, and it exhibits the same issue.

I feel like it used to behave as I’d expect long ago, and that I started noticing this behavior after upgrading, but I’m not sure which version I first saw it in. Right now I’m on 5.7.2. When I first started I was on 5.3.0, upgraded to 5.5.3, and later upgraded again to 5.7.2.

Let me know if there is any other information I can provide to help.

Hmm I think I’ve noticed this sort of thing too.

We’ve been doing a lot of internal refactoring to de-couple components from one another, and along the way that de-coupling may have introduced some delays as what used to be one component directly using another has become asynchronous/eventually-consistent.

One thing to keep in mind is that when a job is manually triggered, the build can only start once all of the job’s dependent resources have been checked. This is to ensure that the build runs with the latest available inputs. This has always been the case, but now that part is asynchronous too. A long time ago it would literally perform those checks in the scheduler, which was fast when it worked, but when it didn’t work it would cause big problems (such as all of scheduling grinding to a halt).

That being said, in our quest for better code and reduced risk I don’t want that to come at the expense of responsiveness. We’ve been using Postgres LISTEN / NOTIFY events to short-circuit polling intervals here and there, but it’s very possible that there’s one to be added to make jobs start a bit more quickly. We’d need to figure out where exactly things are hung up, though.
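For anyone unfamiliar with the trick, the general pattern is that each component sleeps for its polling interval but can be woken early by a notification. Here’s a standalone Python sketch of that wakeup idea (just an illustration of the pattern, not Concourse’s actual Go code — a `threading.Event` stands in for a Postgres NOTIFY):

```python
import threading
import time

def wait_for_next_cycle(notify: threading.Event, interval: float) -> float:
    """Block until either the polling interval elapses or a notification
    arrives, returning how long we actually waited."""
    start = time.monotonic()
    notify.wait(timeout=interval)  # returns early if notify.set() fires
    notify.clear()
    return time.monotonic() - start

notify = threading.Event()

# Without a notification we sleep out the full interval...
waited = wait_for_next_cycle(notify, interval=0.2)
assert waited >= 0.15

# ...but a NOTIFY-style wakeup short-circuits the sleep: a 5s interval
# ends almost immediately once the event is set.
threading.Timer(0.01, notify.set).start()
waited = wait_for_next_cycle(notify, interval=5.0)
assert waited < 1.0
```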

If you’ve got time, it would really help if there was an easy way to reproduce this on a local dev machine running against master (or at least v6.0 which just shipped yesterday). It shouldn’t be too difficult to get a local Concourse running from code and start poking around (see CONTRIBUTING.md). You might also want to try upgrading to v6.0 - it’s possible we’ve fixed this already. :thinking:

Thanks for the quick response.

I stood up a local instance from master. I created a simple pipeline that I set in both my existing environment running 5.7.2 and my local environment running master.

The local instance running the latest code didn’t seem to have the issue. The 5.7.2 instance continued to exhibit the issue.

So either it’s fixed or my existing deployment isn’t “as clean” (existing DB entries, upgraded twice, etc) as a fresh local instance.

I may try to re-deploy my test environment directly to 5.7.2 and see if it continues to exhibit the issue … but right now my gut is the issue was fixed on the way to 6.0.0.

Thanks!
Jake

As a followup I think I figured out my issue.

I set up my internal environment to use 6.0.0 and still ran into the issue. So I dug deeper.

I noticed in the debug logs that the last-ran field in the component-interval-not-reached messages was 2m in the future relative to my web node. Things would only kick off once my web node’s time was ~10s past what last-ran was reporting.

So I checked the date/time of the web node and it was indeed 2m in the past. And I didn’t have NTP running on it. :man_facepalming:

So I got NTP running on my Concourse nodes and tried again … and things seem to be working as expected on 6.0.0. So I’ll take back my earlier conclusion and expect it to also work on 5.7.2. Regardless, not having NTP running is a n00b mistake. :smile:

My guess is that some rows get created with the current date/time from the DB node? My DB node did have NTP running and was ~2m ahead of my web node, so that’s the only way I can think of that the time was getting set 2m in the future according to the logs on the web node.
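For what it’s worth, the failure mode can be sketched in a few lines (Python, with a made-up 10s interval and field names, just to illustrate the clock-skew interaction):

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(seconds=10)  # hypothetical component interval

def interval_reached(last_ran: datetime, local_now: datetime) -> bool:
    # The web node compares its own clock against a DB-generated timestamp.
    return local_now - last_ran > INTERVAL

db_now = datetime(2020, 3, 26, 17, 14, 0)  # DB clock (had NTP)
web_now = db_now - timedelta(minutes=2)    # web node clock, 2m behind
last_ran = db_now                          # row written with the DB's now()

# From the web node's point of view last_ran is 2m in the future, so the
# component stalls for roughly skew + interval before running again.
assert not interval_reached(last_ran, web_now)
assert interval_reached(last_ran, web_now + timedelta(minutes=2, seconds=11))
```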

Thanks for the help! Running it locally and seeing it work flawlessly helped to get me to dig deeper.


Haha, I wouldn’t have guessed that. Thanks for figuring it out and posting a follow-up.

Indeed, all timestamps in the database are generated on the database side (using now()). In this case perhaps the right thing to do would be something like SELECT now() - last_ran > '10 seconds'::interval instead of comparing the web node’s local time. :thinking:
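To spell out why the database-side comparison sidesteps the skew (a Python sketch with hypothetical names; the interval value is made up):

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(seconds=10)  # hypothetical component interval

def db_side_interval_reached(last_ran: datetime, db_now: datetime) -> bool:
    # Both operands come from the same (database) clock, so web-node skew
    # can't push last_ran "into the future" relative to the comparison.
    return db_now - last_ran > INTERVAL

db_now = datetime(2020, 3, 26, 17, 14, 0)
last_ran = db_now - timedelta(seconds=11)  # written earlier with the DB's now()

# The check fires on schedule no matter what the web node's clock says.
assert db_side_interval_reached(last_ran, db_now)
assert not db_side_interval_reached(db_now - timedelta(seconds=5), db_now)
```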