Woke up this morning to find BigCloset down.

I didn’t have my load-monitoring script running on the front end because I wasn’t that worried about it. But I am now.

My htop session had died, but the load had spiked to about 21.5 when it did. For a general idea, a load of 1 == one fully loaded CPU thread. We have a capacity of 6 threads, so a load of 6 == the server is at capacity. A load OVER 6 == we’re over capacity, but if things slow down a bit, there is a chance to catch up.
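To make that arithmetic concrete, here is a minimal Python sketch that turns a load average into a multiple of capacity. The function names are made up for illustration; the 6-thread figure comes from the front end described here, and `os.getloadavg()` only works on Unix-like systems.

```python
import os


def load_ratio(load_average, threads):
    """Return load as a multiple of capacity (1.0 == every thread busy)."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return load_average / threads


def describe(load_average, threads):
    """Label a load average relative to the number of CPU threads."""
    ratio = load_ratio(load_average, threads)
    if ratio < 1.0:
        return "under capacity"
    if ratio == 1.0:
        return "at capacity"
    return "over capacity"


def current_load_report():
    """Describe the live 1-minute load of the machine running this script."""
    one_minute = os.getloadavg()[0]
    threads = os.cpu_count() or 1
    return f"load {one_minute:.2f} on {threads} threads: {describe(one_minute, threads)}"


if __name__ == "__main__":
    threads = 6  # thread count of the front end, per the post
    for load in (1.0, 6.0, 21.5):
        print(f"load {load} on {threads} threads: {describe(load, threads)}")
    print(current_load_report())
```

By this measure a load of 21.5 on 6 threads is a bit over three and a half times what the box can actually run at once.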

As best I can tell, at 3:11am Central time (4:11am Eastern/my time) logrotate ran and HUP’d all the services so that it could properly rotate the log files, and Apache threw this error:

[Sun Feb 28 03:11:10 2016] [notice] seg fault or similar nasty error detected in the parent process

I think the CPU load was spiked way over 21 by that point. My htop session died about 12:53am, when it showed 21.49 for the immediate average and 14.50 for the 15-minute average load. It was climbing, and I don’t have any confidence that it would have been able to drop far enough in 2 hours for this to not be the issue when logrotate ran.

One front end is NOT going to be enough for BigCloset while using PHP5.6, and we aren’t going to be able to get PHP7/HHVM working for probably a couple of weeks (possibly months) at best. And it won’t be an off-the-shelf solution. At best it would be me tracking down code errors, profiling them, submitting patches to various Drupal code projects, hoping they get accepted upstream, and us running custom-patched code until they do. We can either load a 2nd front end, or load more CPU cores onto the one we have.

I will tweak and install my load-monitoring script later. It was originally written for a situation just like this, where we were throwing too much at a dying server and wrote a script to watch for dead processes and high loads and kill things appropriately. We eventually fixed that by opening our Bridgewater, NJ POP.
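For anyone curious what a load watchdog like that might look like, here is a rough Python sketch. It is not the actual script: the thresholds, the function names, and the `take_action` callback are all assumptions for illustration, and it only reports the load rather than killing anything on its own.

```python
import os
import time

THREADS = 6        # thread count of the front end, per the post
WARN_FACTOR = 1.0  # assumed: warn once load reaches capacity
ACT_FACTOR = 2.5   # assumed: intervene at 2.5x capacity (load 15 on 6 threads)


def decide(load_1min, threads=THREADS):
    """Map a 1-minute load average to 'ok', 'warn', or 'act'."""
    if load_1min >= threads * ACT_FACTOR:
        return "act"
    if load_1min >= threads * WARN_FACTOR:
        return "warn"
    return "ok"


def watch(interval_seconds=60, max_checks=None, take_action=None):
    """Poll the load average; call take_action(load) when the verdict is 'act'.

    take_action is a hypothetical hook (restart a service, kill stuck
    workers, send an alert). Nothing is killed unless the caller supplies it.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        load_1min = os.getloadavg()[0]  # Unix-only
        verdict = decide(load_1min)
        print(f"{time.ctime()} load={load_1min:.2f} verdict={verdict}")
        if verdict == "act" and take_action is not None:
            take_action(load_1min)
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval_seconds)


if __name__ == "__main__":
    watch(interval_seconds=5, max_checks=3)
```

Keeping the decision in `decide()` separate from the polling loop means the thresholds can be tuned and checked without waiting for a real load spike.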


Quick note: we are DEFINITELY spiking usage on the front end. I’ve seen load average spikes over 15.0, which basically means 15 cores’ worth of usage. But it only has 3 real cores and 3 fake cores (6 total threads). It’s been recovering mostly on its own, but I’m afraid that with PHP5.6 we WILL need the 2nd front end.

HHVM and PHP7 do a lot of redundant-call optimizations, which is what makes them MUCH leaner process-wise.


I have officially downgraded the cloud cluster. It’s no longer using PHP7 or HHVM, since neither was providing a stable environment in which we could properly operate BigCloset/Topshelf.

Belle (front end server) is now operationally running PHP5.6.

Belle can be switched over to HHVM for limited testing as needed while we try to track down HHVM issues and fix them in code, but right now the full/operational stack is as follows.

Belle (Front End)

  • Apache 2.2
  • Memcached
  • PHP5.6 (remi stable release)

Ariel (Back End/DB Server)

  • Percona 5.6 (release 76.1 Revision 5759e76)

Things seem to be operating within acceptable limits at the moment. I’ve had things switched over for maybe 10 minutes at this point. The site is still fast in places, a bit slower in others. It is DEFINITELY stressing the front end server more, jumping the load average from consistently under 0.5 to consistently over 2.0.

-Piper

We are currently up and running using HHVM via Cloud in a “cobbled together” way. It is a less than ideal setup so I still need the VM re-imaged.

The method we have cobbled together right now seems to “fail” for HHVM every 8 hours or so in our initial tests, so I’ve written my own set of scripts to monitor the site and take automated action based on results. It’s kinda primitive but works 🙂


Wed Feb 24 15:00:01 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 15:37:08 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 23:43:01 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 23:45:01 CST 2016
HTTP Status when killed was: 503
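
A check that produces log entries like the ones above could be sketched in Python roughly as follows. This is an illustration, not the actual monitoring scripts: the function names, the "restart on any 5xx or no response" rule, and the `restart` callback are all assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as exceptions
    except (urllib.error.URLError, OSError):
        return None      # DNS failure, refused connection, timeout, etc.


def needs_restart(status):
    """Assumed policy: restart on any 5xx, or when nothing answered at all."""
    return status is None or status >= 500


def format_entry(status, when=None):
    """Build a two-line log entry in the same shape as the log above."""
    stamp = time.strftime("%a %b %d %H:%M:%S %Z %Y", when or time.localtime())
    return f"{stamp}\nHTTP Status when killed was: {status}\n"


def check_once(url, restart, log_path):
    """Check the site once; log and call restart() if it looks unhealthy.

    restart is a hypothetical callback supplied by the caller, for example
    a function that restarts the HHVM service.
    """
    status = fetch_status(url)
    if needs_restart(status):
        with open(log_path, "a") as log:
            log.write(format_entry(status))
        restart()
        return True
    return False
```

Run from cron every minute or two, `check_once` would leave one timestamped entry per restart, which is enough to see how often the stack is falling over.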