Concourse pipelines can't get containers on web node for k8s helm chart

Hey, I'm brand new to Concourse and I'm trying to get it running reliably on my k8s cluster with the latest stable Helm chart. I keep running into something like the following when running a simple pipeline with the git resource (though oddly, once in a while it doesn't happen):

Get http://127.0.0.1:8080/containers: dial tcp 10.2.1.147:35417: connect: network is unreachable

Setup:

  • I'm using MetalLB and setting the service type to LoadBalancer for web.

  • I have the web node / ATC on port 80, which is automatically forwarded to a different container port (not 35417) on 10.2.1.147.
    i.e. from the concourse-web k8s service config:

    ports:
      - name: atc
        nodePort: 30456
        port: 80
        protocol: TCP
        targetPort: atc
      - name: tsa
        nodePort: 32417
        port: 2222
        protocol: TCP
        targetPort: tsa
    

Should there be an extra port in the concourse-web Service spec for the /containers endpoint, or should it be the same port as the UI/ATC? I'm not sure where the http://127.0.0.1:8080 and 35417 are coming from, though the former sounds like a default IP and port. This is happening even with a clean Postgres DB and the worker PersistentVolumes cleared out, so it shouldn't be caused by stale data.
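
One thing I've been using to sanity-check the mapping is the Endpoints object behind the Service, which lists the actual pod IP and container port each service port resolves to (the namespace here is just what I'm using):

    kubectl -n concourse get endpoints concourse-web

I'd expect that to only list the ATC and TSA container ports, so still no obvious source for 35417.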

I'd try connecting first without the LB, forwarding the ports to check that the installation itself is correct:

    kubectl -n yourconcoursenamespace port-forward svc/concourse-web 8080:80

(take care: the default web UI listens on port 8080, not 80, so double-check concourse.web.bindPort in values.yaml)

and then opening your browser at http://localhost:8080.
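
With the port-forward still running, something like this from another shell should also tell you whether the ATC answers (the /api/v1/info endpoint just returns the Concourse version):

    curl http://localhost:8080/api/v1/info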

Just tried this out, but I was still running into the same error with concourse.web.service.type set to NodePort instead of LoadBalancer (since I'm running this on a multi-node k8s cluster rather than a minikube instance).

I've reset the web port, web IP, externalUrl, and web service type values to the chart defaults and retried. It worked fine for the first 2 runs, but the next 2 gave this error after hanging for a while:

Get /volumes/b8975398-eb86-43a6-57c6-0a94cd1d135a: dial tcp 10.2.1.202:35036: connect: network is unreachable

Per the chart default values, I'm not currently using NodePort or LoadBalancer (ClusterIP is the default), and the Postgres DB and PersistentVolumes are clean, so again there should be no stale config:

    $ k get pods -o wide
    NAME                             READY   STATUS    RESTARTS   AGE   IP           NODE         NOMINATED NODE   READINESS GATES
    concourse-web-6cd765ffb5-t9747   1/1     Running   0          13m   10.2.1.202   <redacted>   <none>           <none>
    concourse-worker-0               1/1     Running   0          13m   10.2.3.35    <redacted>   <none>           <none>
    concourse-worker-1               1/1     Running   0          13m   10.2.2.148   <redacted>   <none>           <none>
    $ k get svc
    NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
    concourse-web      ClusterIP   10.3.0.65    <none>        8080/TCP,2222/TCP   13m
    concourse-worker   ClusterIP   None         <none>        <none>              13m
    $ k port-forward svc/concourse-web 8080

I retried the job a couple of times afterward; it succeeded the first time and failed the next with the same volume error but a different port. There are no more errors about the /containers endpoint, just /volumes/, but only about half of the time. As before, I'm still not sure where or why this is trying to connect to a seemingly random port on the web node that's unrelated to the ATC port (8080) or the TSA port (2222).

Where does the volumes error come from? Is it shown in your shell? In the browser? In some log?

I have no experience with minikube. What I would do is fully delete the current Concourse Helm installation (with --purge) and reinstall with all the default values, including PostgreSQL, ideally in a new namespace, in order to check the default installation.
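
For example, something along these lines (Helm 2 syntax to go with --purge; the release and namespace names are just placeholders, and I'm assuming the stable/concourse chart):

    # remove the existing release and its history completely
    helm delete --purge my-concourse
    kubectl delete namespace concourse

    # reinstall with the chart defaults (bundled PostgreSQL included) into a fresh namespace
    helm install stable/concourse --name concourse-test --namespace concourse-test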

This is being shown in the browser for the first step of the pipeline (not the check itself on the first resource).

I'm not using minikube for this, just a small multi-node k8s cluster. I've also been purging the Helm installation, deleting all leftover namespaces, clearing out the associated paths for the PersistentVolumes on the storage machines, and clearing out all Concourse tables in Postgres, so it should be a fresh Helm installation each time. Not sure if I'm missing anything.

Actually, once again I'm getting the same error on the /containers endpoint as mentioned at the top of this thread (in the UI for the pipeline). Still using ClusterIP for the web service and 127.0.0.1:8080 as the web IP and port.

I think I might have found a cause for the randomness of the failures, though. Concourse is reporting workers that shouldn't exist, even though the installation is fresh (Postgres cleared and PVs cleared, new namespace and Helm installation):

$ fly -t dev-cluster-concourse login -c http://127.0.0.1:8080/ -u test -p test                                        
logging in to team 'main'


target saved
$ fly -t dev-cluster-concourse workers                                                                                
name                    containers  platform  tags  team  state    version
concourse-worker-0      7           linux     none  none  running  2.2    
concourse-worker-1      1           linux     none  none  running  2.2    
winning-coral-worker-0  5           linux     none  none  running  2.2    
$ k get pods                                                                        
NAME                                 READY   STATUS    RESTARTS   AGE
winning-coral-web-866c7fd98f-2t2m6   1/1     Running   0          17m
winning-coral-worker-0               1/1     Running   0          17m

Even when I land and prune the old workers from a previous installation, they eventually come back. Not sure why this is happening.

EDIT: I've confirmed these zombie workers are why the jobs are failing, and with an aggressive pruning script running in the background to land and prune them, I can get the pipeline working reliably with ClusterIP + 127.0.0.1. My main question now is: how can I stop these workers from coming back? Once this is taken care of, I believe it should be smooth sailing with the LoadBalancer, since it was originally working some of the time with that service spec.type.
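
For reference, the pruning loop is nothing sophisticated, roughly something like this (the target name and the worker-name pattern are specific to my deployment):

    #!/usr/bin/env bash
    # Keep landing and pruning any worker whose name doesn't belong to this deployment.
    TARGET=dev-cluster-concourse
    while true; do
      fly -t "$TARGET" workers | awk 'NR > 1 && $1 !~ /^winning-coral-worker/ {print $1}' |
        while read -r worker; do
          fly -t "$TARGET" land-worker  -w "$worker" || true
          fly -t "$TARGET" prune-worker -w "$worker" || true
        done
      sleep 30
    done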

Final update: everything is now working with the LoadBalancer spec.type for the service, without issues. The problem was rogue workers coming from a completely different Concourse Helm deployment in another k8s cluster (concourse-worker-0 and -1 above). I deleted that deployment, wiped the main deployment, and re-did the helm install on the desired cluster, and pipelines work reliably now. Thanks for the help!