Hi there, our production Concourse (@3.14.1) is having some serious issues at the moment:
- Our db VM is filling up its persistent disk at an alarming rate; it's currently at 92% of 100GB. The same thing happened last week, and we bumped the disk from 50GB to 100GB. As you can see, CPU load is also through the roof (output is from bosh vms --vitals):
Instance:              db/64809726-e701-4911-8256-d7c5e5471f8f
Process State:         running
AZ:                    z1
IPs:                   10.128.0.11
VM CID:                vm-4dfa08f1-a2db-4101-6113-28b8c2f152b7
VM Type:               highcpu32
Active:                true
VM Created At:         Tue Jan 22 19:48:01 UTC 2019
Uptime:                0d 1h 52m 40s
Load (1m, 5m, 15m):    72.58, 65.30, 52.98
CPU Total:             -
CPU User:              1.9%
CPU Sys:               1.2%
CPU Wait:              55.7%
Memory Usage:          8% (2.3 GB)
Swap Usage:            0% (0 B)
System Disk Usage:     44% (33i%)
Ephemeral Disk Usage:  0% (0i%)
Persistent Disk Usage: 92% (1i%)
- Concurrently, we're seeing all jobs start as pending and then make no progress whatsoever, i.e. they are all stuck in the gray (pending) state.
- We attempted to re-create the db VM, but it failed with the message below. The logs did not provide any useful information that we could see, although we weren't sure exactly what to look for.

Task 35286

Task 35286 | 19:39:11 | Preparing deployment: Preparing deployment (00:00:01)
Task 35286 | 19:39:14 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 35286 | 19:39:15 | Updating instance db: db/64809726-e701-4911-8256-d7c5e5471f8f (0) (canary) (00:11:08)
                        L Error: 'db/64809726-e701-4911-8256-d7c5e5471f8f (0)' is not running after update. Review logs for failed jobs: postgres, pg_janitor
Task 35286 | 19:50:23 | Error: 'db/64809726-e701-4911-8256-d7c5e5471f8f (0)' is not running after update. Review logs for failed jobs: postgres, pg_janitor

Task 35286 Started  Tue Jan 22 19:39:11 UTC 2019
Task 35286 Finished Tue Jan 22 19:50:23 UTC 2019
Task 35286 Duration 00:11:12
Task 35286 error
We recently re-paved (i.e. redeployed) our worker VMs using BOSH, and this behavior started shortly afterwards, which suggests the two may be related (though it may be coincidence). We were under the impression that 50GB should be plenty for the DB disk, and were startled to see not only the 50GB disk fill up, but then the 100GB disk as well.
- Has anyone seen this sort of behavior before?
- If so, how did you fix it?
- What else would be useful output to help debug this issue?
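In case it helps anyone reproduce our debugging so far, this is roughly what we plan to run next on the db VM (after bosh ssh into it) to see what is actually consuming the persistent disk. It's only a sketch: the /var/vcap/store path is the usual BOSH persistent-disk mount, but the psql binary location, the vcap role, and the atc database name are assumptions from our manifest and may differ on other deployments:

```shell
#!/bin/sh
# Run on the db VM, e.g. after: bosh -d concourse ssh db/0

# Break down persistent-disk usage by directory. Postgres data, logs,
# and any WAL buildup normally live under /var/vcap/store.
STORE="${STORE:-/var/vcap/store}"
if [ -d "$STORE" ]; then
  du -x --max-depth=2 "$STORE" | sort -n | tail -n 20
fi

# If the disk usage points at the Postgres data directory, list the
# largest relations to see which table (or its indexes) is growing.
# (binary path, user, and database name below are assumptions)
PSQL=$(ls /var/vcap/packages/postgres*/bin/psql 2>/dev/null | head -n 1)
if [ -n "$PSQL" ]; then
  "$PSQL" -U vcap -d atc -c "
    SELECT relname,
           pg_size_pretty(pg_total_relation_size(oid)) AS total_size
    FROM pg_class
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 10;"
fi
```

If nothing under /var/vcap/store accounts for the usage, we'd also check /var/vcap/sys/log on the same mount in case a job is logging heavily.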
This is currently affecting a system in production so any assistance would be very helpful. Thanks!