Containers are not destroyed after 6.1 upgrade

Hi,

we upgraded our Concourse instance from version 6.0.0 to 6.1.0. Since then, containers on our worker are no longer destroyed, and we quickly run out of containers.

Our Installation:
We are running Concourse web and worker on the same Docker host using a docker-compose configuration (see below). The Concourse web instance is not exposed directly; it sits behind an nginx reverse proxy. That nginx instance proxies a couple more applications on different URL paths (not domains or subdomains), so this might be relevant. We are in an enterprise environment, so there is also a corporate proxy, and we set the typical proxy environment variables.
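Since Concourse is written in Go, its HTTP clients should (as far as I understand) resolve these proxy variables through Go's standard net/http logic. Here is a minimal sketch for checking whether a given request would be routed through the corporate proxy; the proxy address, hostnames, and API path below are placeholders for illustration, not our real values:

package main

import (
    "fmt"
    "net/http"
    "os"
)

func main() {
    // Placeholder values standing in for our real proxy settings.
    os.Setenv("https_proxy", "http://corporate-proxy.example.com:3128")
    os.Setenv("no_proxy", "localhost,127.0.0.1,concourse-web")

    for _, target := range []string{
        "https://concourse-web/api/v1/containers",      // host listed in no_proxy
        "https://build.internal.net/api/v1/containers", // host not listed
    } {
        req, err := http.NewRequest("GET", target, nil)
        if err != nil {
            panic(err)
        }
        // ProxyFromEnvironment is what Go's default HTTP transport consults;
        // a nil result means the request bypasses the proxy entirely.
        proxyURL, err := http.ProxyFromEnvironment(req)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s -> proxy: %v\n", target, proxyURL)
    }
}

If no_proxy misses an internal hostname, requests to it go through the corporate proxy, which would answer with an HTML error page rather than JSON.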

What we did:

  1. Upgraded Concourse from version 6.0.0 to 6.1.0; both versions use the official concourse image from Docker Hub.

  2. Replaced the SSL certificate and private key, since the old certificate was no longer valid. The certificate is used for both the nginx and the Concourse TLS connections.

  3. After communication problems appeared, we regenerated and replaced the Concourse keys.

What we see:

Concourse web starts fine, and there are no obvious error messages in the log, neither at info nor at debug log level.

The Concourse worker starts printing error messages like:

{"timestamp":"2020-05-26T06:35:52.848533161Z","level":"error","source":"worker","message":"worker.container-sweeper.tick.failed-to-list-containers","data":{"error":"bad response: invalid character '\u003c' looking for beginning of value","session":"6.1"}}

Later, those messages change to:

{"timestamp":"2020-05-26T06:28:43.672306419Z","level":"error","source":"worker","message":"worker.container-sweeper.tick.failed-to-destroy-container","data":{"error":"bad response: invalid character '\u003c' looking for beginning of value","handle":"f268d838-febe-4329-7d91-faed37365ae8","session":"6.1047"}}

The web UI then shows the typical out-of-containers message:

run check step: run check step: find or create container: insufficient subnets remaining in the pool

Checking the containers on the worker confirms that there are 256 running containers.
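As a side note, the number 256 seems consistent with Garden's defaults, assuming I read them correctly: the default container network pool is 10.254.0.0/22, and each container gets its own /30 subnet. A quick sanity check on that arithmetic:

package main

import "fmt"

func main() {
    // Assumption: Garden's default network pool is 10.254.0.0/22 and
    // each container is handed one /30 subnet.
    poolAddresses := 1 << (32 - 22)   // a /22 pool holds 1024 addresses
    subnetAddresses := 1 << (32 - 30) // each /30 subnet holds 4 addresses
    fmt.Println(poolAddresses / subnetAddresses) // 256: then the pool is exhausted
}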

The \u003c symbol is the JSON-escaped '<' character, which makes me guess that we are getting an HTTP error response with HTML or XML content instead of the expected JSON.
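That guess is easy to support: the error text is exactly what Go's encoding/json package produces when it is handed an HTML document instead of JSON. A minimal reproduction (the HTML body here is made up):

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // Any HTML document starts with '<', which the JSON decoder rejects
    // with the exact error string seen in the worker log.
    var v interface{}
    err := json.Unmarshal([]byte("<html><body>502 Bad Gateway</body></html>"), &v)
    fmt.Println(err) // invalid character '<' looking for beginning of value
}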

Questions:

  1. Since we did not change any configuration besides the steps described above, what might have changed in behavior between 6.0 and 6.1?

  2. How does the communication work, and which component (Concourse web, nginx, or maybe the corporate proxy) might be producing the incorrect response?

… other ideas?

Thanks!
Steffen

Docker compose snippet:

  concourse-web:
    image: concourse/concourse:6.1.0
    links: [postgres-db]
    depends_on: [postgres-db]
    restart: always
    command: web
    volumes:
       - "./concourse/keys/web:/concourse-keys"
       - "./certs:/certs"
       - "/var/lib/ca-certificates/pem:/etc/ssl/certs"
       - "/var/lib/ca-certificates/ca-bundle.pem:/etc/ssl/certs/ca-certificates.crt"
    environment:
      CONCOURSE_LOG_LEVEL: debug
      CONCOURSE_BIND_IP: '0.0.0.0'
      CONCOURSE_TLS_BIND_PORT: '443'
      CONCOURSE_TLS_KEY: '/certs/itfes_nopw_build.key'
      CONCOURSE_TLS_CERT: '/certs/itfes_build.crt'
      CONCOURSE_EXTERNAL_URL: "https://build.internal.net:443/"
      CONCOURSE_COOKIE_SECURE: 'false'
      CONCOURSE_POSTGRES_HOST: postgres-db
      CONCOURSE_POSTGRES_DATABASE: ${DB_CONCOURSE_DB}
      CONCOURSE_POSTGRES_USER: ${DB_CONCOURSE_USER}
      CONCOURSE_POSTGRES_PASSWORD: ${DB_CONCOURSE_PASSWORD}
      CONCOURSE_MAX_CONNS: 150
      CONCOURSE_RESOURCE_CHECKING_INTERVAL: '5m'
      CONCOURSE_ADD_LOCAL_USER: 'admin:secret'
      CONCOURSE_MAIN_TEAM_LOCAL_USER: 'admin'
      CONCOURSE_LDAP_DISPLAY_NAME: ${LDAP_DISPLAY_NAME}
      CONCOURSE_LDAP_HOST: ${LDAP_HOST}
      CONCOURSE_LDAP_INSECURE_NO_SSL: 'true'
      CONCOURSE_LDAP_START_TLS: 'false'
      CONCOURSE_LDAP_BIND_DN: ${LDAP_BIND_DN}
      CONCOURSE_LDAP_BIND_PW: ${LDAP_BIND_PW}
      CONCOURSE_LDAP_USER_SEARCH_BASE_DN: ${LDAP_USER_SEARCH_BASE_DN}
      CONCOURSE_LDAP_USER_SEARCH_SCOPE: ${LDAP_USER_SEARCH_SCOPE}
      CONCOURSE_LDAP_USER_SEARCH_FILTER: ${LDAP_USER_SEARCH_FILTER}
      CONCOURSE_LDAP_USER_SEARCH_USERNAME: ${LDAP_USER_SEARCH_USERNAME}
      CONCOURSE_LDAP_USER_SEARCH_ID_ATTR: ${LDAP_USER_SEARCH_ID_ATTR}
      CONCOURSE_LDAP_USER_SEARCH_EMAIL_ATTR: ${LDAP_USER_SEARCH_EMAIL_ATTR}
      CONCOURSE_LDAP_USER_SEARCH_NAME_ATTR: ${LDAP_USER_SEARCH_NAME_ATTR}
      CONCOURSE_MAIN_TEAM_LDAP_USER: '${LDAP_MAIN_TEAM_USER}'
      http_proxy: ${DOCKER_HTTP_PROXY}
      https_proxy: ${DOCKER_HTTPS_PROXY}
      no_proxy: ${DOCKER_NO_PROXY}

  concourse-worker:
    image: concourse/concourse:6.1.0
    links: [concourse-web]
    depends_on: [concourse-web]
    privileged: true
    restart: always
    command: worker
    stop_signal: SIGUSR2
    volumes:
      - "./concourse/keys/worker:/concourse-keys"
      - "/var/lib/ca-certificates/pem:/etc/ssl/certs"
      - "/var/lib/ca-certificates/ca-bundle.pem:/etc/ssl/certs/ca-certificates.crt"
    environment:
      CONCOURSE_LOG_LEVEL: debug
      CONCOURSE_TSA_HOST: 'concourse-web:2222'
      CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
      CONCOURSE_CERTS_DIR: /etc/ssl/certs
      CONCOURSE_BIND_IP: 0.0.0.0
      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
      http_proxy: ${DOCKER_HTTP_PROXY}
      https_proxy: ${DOCKER_HTTPS_PROXY}
      no_proxy: ${DOCKER_NO_PROXY}

It seems that I found the problem: I removed the following two lines from the concourse-worker config:

CONCOURSE_BIND_IP: 0.0.0.0
CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0

This config was not new; it worked with 6.0.0. I don't know whether this is a bug or expected behavior.