Web behind ELB - forward-worker not working

I’m having a lot of issues with TSA connection handshakes while migrating from 4.2.1 to 5.2.0.

The current AWS setup is 2 x EC2 web nodes behind an ELB and 1 x EC2 worker. They all run the official Concourse Docker images with the appropriate ports (8080, 2222, 7777, 7788, 7799, etc.).

web configs:

CONCOURSE_EXTERNAL_URL=https://{ some external url } 
CONCOURSE_PEER_ADDRESS=172.*********
CONCOURSE_SESSION_SIGNING_KEY=/concourse-keys/session_signing_key
CONCOURSE_TSA_PEER_ADDRESS={ LB domain }
CONCOURSE_TSA_HOST_KEY=/concourse-keys/tsa_host_key
CONCOURSE_TSA_AUTHORIZED_KEYS=/concourse-keys/authorized_worker_keys
# postgres etc...

worker configs:

CONCOURSE_NAME=i-(instance-id)
CONCOURSE_WORK_DIR=/opt/concourse/worker
CONCOURSE_BIND_IP=0.0.0.0
CONCOURSE_BAGGAGECLAIM_BIND_IP=0.0.0.0
CONCOURSE_GARDEN_BIND_IP=0.0.0.0
CONCOURSE_TSA_HOST={ LB domain }:2222
CONCOURSE_TSA_PUBLIC_KEY=/concourse-keys/tsa_host_key.pub
CONCOURSE_TSA_WORKER_PRIVATE_KEY=/concourse-keys/worker/worker_i-{instance-id}

I get no registered workers and the following errors in my logs:

web logs:

{"level":"info","source":"tsa","message":"tsa.connection.handshake-failed","data":{"error":"EOF","remote":"172.27.99.140:54039","session":"790"}}
{"level":"info","source":"tsa","message":"tsa.connection.handshake-failed","data":{"error":"EOF","remote":"172.27.99.140:23802","session":"791"}}
{"level":"info","source":"tsa","message":"tsa.connection.handshake-failed","data":{"error":"EOF","remote":"172.27.99.140:58480","session":"792"}}

worker logs:

{"level":"error","source":"worker","message":"worker.volume-sweeper.tick.failed-to-dial","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain","session":"7.2"}}
{"level":"error","source":"worker","message":"worker.volume-sweeper.tick.failed-to-get-volumes-to-destroy","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain","session":"7.2"}}
{"level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-dial","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain","session":"4.1"}}
{"level":"error","source":"worker","message":"worker.beacon-runner.beacon.exited-with-error","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain","session":"4.1"}}
{"level":"error","source":"worker","message":"worker.beacon-runner.failed","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain","session":"4"}}
{"level":"error","source":"worker","message":"worker.beacon-runner.beacon.run.command-failed","data":{"command":"forward-worker","error":"wait: remote command exited without exit status or exit signal","session":"4.1.10"}}
{"level":"error","source":"worker","message":"worker.beacon-runner.beacon.exited-with-error","data":{"error":"wait: remote command exited without exit status or exit signal","session":"4.1"}}
{"level":"error","source":"worker","message":"worker.beacon-runner.failed","data":{"error":"wait: remote command exited without exit status or exit signal","session":"4"}}
{"level":"error","source":"worker","message":"worker.beacon-runner.beacon.keepalive.failed-to-disable-keepalive","data":{"error":"set tcp 172.17.0.2:57304-\u003e172.27.99.140:2222: use of closed network connection","session":"4.1.9"}}

Is there something obvious I missed in the configs, like a port? I can see that the EOF error comes from the ELB IP, but the port is ephemeral and I don’t know why it happens or how to fix it. The SSH keys work; I tested them manually.
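
Since the worker log reports that publickey authentication was rejected, one thing worth ruling out is whether the worker's public key is actually listed in the authorized_worker_keys file on every web node behind the ELB. A minimal sketch of that comparison, assuming standard OpenSSH public key lines; the function names are made up for illustration, and it only compares key type and key data, ignoring comments:

```python
def key_fields(line: str):
    """Return (key_type, base64_blob) from an OpenSSH public key line, or None."""
    parts = line.split()
    for i, part in enumerate(parts):
        # Key type tokens look like "ssh-rsa" or "ecdsa-sha2-nistp256".
        # Options may precede the key type, so scan for the first match.
        if part.startswith(("ssh-", "ecdsa-")) and i + 1 < len(parts):
            return part, parts[i + 1]
    return None


def is_authorized(worker_pub: str, authorized_keys: str) -> bool:
    """True if the worker's public key appears in the authorized keys text."""
    wanted = key_fields(worker_pub)
    if wanted is None:
        return False
    for line in authorized_keys.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if key_fields(line) == wanted:
            return True
    return False
```

To use it, read the worker's .pub file and the authorized_worker_keys file from each web node into strings and pass them to is_authorized. If one web node has the key and the other does not, registration would succeed or fail depending on which node the ELB picks.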

Workers always forward now, so all those internal ports (7777, 7788, etc) are irrelevant.

The EOF message is just noise from ELB health checks. To fix that, don’t health check the TSA port directly; health check the ATC’s web port instead. They run in the same process, so it’s effectively still a useful health check for the TSA.
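
As a rough illustration of what an HTTP health check against the web port does, here is a minimal sketch using only the Python standard library. The function name is hypothetical, and the /api/v1/info path in the usage note is an assumption about a suitable endpoint; any path on port 8080 that returns a 2xx status would serve the same purpose.

```python
import http.client
import urllib.request


def web_is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to the web (ATC) port answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP error statuses
        # (URLError and HTTPError are OSError subclasses), and malformed replies.
        return False
```

For example, web_is_healthy("http://172.27.0.10:8080/api/v1/info") with a placeholder IP would probe one web node directly. Unlike a bare TCP check on 2222, this never opens a connection to the TSA, so it should not produce handshake-failed entries.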

I recently posted some details on running Concourse behind ALB/NLB: "Tip: Using Concourse behind AWS ALB and NLB with SSL".

Make sure your TSA port is being load balanced in TCP mode, not HTTP/HTTPS.
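
One way to sanity check the TCP mode point is to connect to the load balancer on 2222 and look at the first line that comes back. An SSH server opens with an identification string starting with "SSH-", so anything else (or nothing at all) suggests the listener is not passing raw TCP through. This is only a sketch with hypothetical function names, and it assumes the identification string is the first line sent:

```python
import socket
from typing import Optional


def read_banner(host: str, port: int, timeout: float = 3.0) -> Optional[str]:
    """Connect over plain TCP and return the first line the peer sends, if any."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with sock.makefile("rb") as stream:
                line = stream.readline(256)
    except OSError:
        # Refused, unreachable, or timed out before a line arrived.
        return None
    return line.decode("ascii", errors="replace").strip() or None


def looks_like_ssh(banner: Optional[str]) -> bool:
    """SSH servers open with an identification string such as 'SSH-2.0-...'."""
    return banner is not None and banner.startswith("SSH-")
```

Calling looks_like_ssh(read_banner("lb.example.internal", 2222)), where the hostname is a placeholder, gives a quick yes/no. Note that the probe disconnects right after reading the banner, so it will itself show up on the web node as a handshake-failed EOF entry, much like a TCP health check does.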

OK, I set TSA_LOG_LEVEL to error to suppress the NLB EOF logs.

I already had the exact setup you described in your ALB/NLB post. It has started working, so I will double-check that I used the same environment as above, in case someone else comes across this.

Are there any tips on how to debug this? The infra guys are on holiday lol
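
One low-effort debugging aid is to collapse the JSON logs into counts per distinct error, which makes it easier to tell health check noise (the repeated EOF entries) apart from the real failure (the publickey rejection). A minimal sketch, assuming one JSON object per line with the source, message, and data.error fields seen in the logs above; the function name is made up:

```python
import json
from collections import Counter


def summarize(lines):
    """Count (source, message, error) triples across JSON log lines."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        data = entry.get("data")
        error = data.get("error") if isinstance(data, dict) else None
        counts[(entry.get("source"), entry.get("message"), error)] += 1
    return counts
```

Feeding it an open log file and printing summarize(f).most_common() lists the most frequent errors first, so a handful of distinct failures stands out from hundreds of repeated lines.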