Production DB filling up + jobs stalled

Hi there, our production Concourse (v3.14.1) is having some serious issues at the moment:

  1. Our DB VM is filling up its persistent disk at an alarming rate; it’s currently at 92% of 100GB. The same thing happened last week, and we bumped the disk from 50GB to 100GB in response. As you can see, CPU load is also through the roof (output from bosh vms --vitals):
    Instance                                     Process State  AZ  IPs          VM CID                                   VM Type    Active  VM Created At                 Uptime           Load                 CPU    CPU   CPU   CPU    Memory        Swap        System      Ephemeral   Persistent
                                                                                                                                                                                        (1m, 5m, 15m)        Total  User  Sys   Wait   Usage         Usage       Disk Usage  Disk Usage  Disk Usage
    db/64809726-e701-4911-8256-d7c5e5471f8f      running        z1  vm-4dfa08f1-a2db-4101-6113-28b8c2f152b7  highcpu32  true    Tue Jan 22 19:48:01 UTC 2019  0d 1h 52m 40s    72.58, 65.30, 52.98  -      1.9%  1.2%  55.7%  8% (2.3 GB)   0% (0 B)    44% (33i%)  0% (0i%)    92% (1i%)
  2. Concurrently, we’re seeing all jobs go into a pending state and then make no progress whatsoever, i.e. they’re all stuck in the gray (pending) state.
  3. We attempted to re-create the DB VM, but it failed with the following message:
    Task 35286
    Task 35286 | 19:39:11 | Preparing deployment: Preparing deployment (00:00:01)
    Task 35286 | 19:39:14 | Preparing package compilation: Finding packages to compile (00:00:01)
    Task 35286 | 19:39:15 | Updating instance db: db/64809726-e701-4911-8256-d7c5e5471f8f (0) (canary) (00:11:08)
                     L Error: 'db/64809726-e701-4911-8256-d7c5e5471f8f (0)' is not running after update. Review logs for failed jobs: postgres, pg_janitor
    Task 35286 | 19:50:23 | Error: 'db/64809726-e701-4911-8256-d7c5e5471f8f (0)' is not running after update. Review logs for failed jobs: postgres, pg_janitor
    Task 35286 Started  Tue Jan 22 19:39:11 UTC 2019
    Task 35286 Finished Tue Jan 22 19:50:23 UTC 2019
    Task 35286 Duration 00:11:12
    Task 35286 error
    The logs did not contain anything obviously useful as far as we can tell, though we weren’t sure exactly what to look for.
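One thing that would narrow down the disk issue quickly, if you can still connect to Postgres, is a per-table size breakdown — on Concourse installs of this era the build-events tables are a common culprit for runaway growth. This is a plain Postgres catalog query, nothing Concourse-specific; the database name (often "atc") varies by deployment, so check your manifest:

```sql
-- List the ten largest relations by total on-disk size
-- (table + indexes + TOAST), run via psql against the ATC database.
SELECT relname AS relation,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```

If one or two tables dominate, that points at what’s being retained rather than a general capacity problem.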

We recently re-paved (i.e. redeployed) our worker VMs using BOSH, and this behavior started shortly afterward, which suggests the two may be related (though they may not be). We were under the impression that 50GB should be plenty for the DB disk, and were startled to see not only the 50GB disk fill up, but the 100GB one as well.
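On the stuck-jobs side: re-paving worker VMs can leave the old workers’ registrations behind in a stalled state, with the ATC still trying to schedule onto them. It may be worth listing your workers and pruning any stalled entries — a sketch, assuming a fly target named "prod" (the target and worker name are placeholders) and that your fly version has the prune-worker command:

```shell
# List registered workers; re-paved VMs often leave their old
# registrations behind in the "stalled" state.
fly -t prod workers

# Remove a stalled worker so the ATC stops scheduling against it.
fly -t prod prune-worker -w STALLED_WORKER_NAME
```

If all workers show as stalled or none are listed, that would also explain jobs sitting in pending indefinitely.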

  • Has anyone seen this sort of behavior before?
  • If so, how did you fix it?
  • What else would be useful output to help debug this issue?

This is currently affecting a system in production so any assistance would be very helpful. Thanks!

Take a look at this thread, I have a feeling you’re running into the same issue.