Woke up this morning to find BigCloset down.

I didn’t have my load-monitoring script running on the front end because I wasn’t that worried about it. But I am now.

My htop session had died, but the load was spiked over 21.5 when it did. For a general idea, a load of 1 == 1 loaded CPU thread. We have a capacity of 6 threads, so a load of 6 == the server is at capacity. A load OVER 6 == we're over capacity, but if things slow down a bit, there is a chance to catch up.
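As a quick worked example with the numbers from this outage (just back-of-the-envelope math, nothing official):

```shell
# A load of 21.49 on a box with 6 threads: the ratio over 1.0 is how far
# over capacity we were when htop died.
load=21.49
threads=6
ratio=$(awk -v l="$load" -v t="$threads" 'BEGIN { printf "%.2f", l / t }')
echo "load/capacity ratio: $ratio"
```

A ratio of 3.58 means the run queue was holding roughly three and a half times more work than the CPUs could actually execute.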

As best I can tell, at 3:11am Central time (4:11am Eastern/my time) logrotate ran and HUP'd all the services so that it could properly rotate the log files, and Apache threw this error:

[Sun Feb 28 03:11:10 2016] [notice] seg fault or similar nasty error detected in the parent process
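If the HUP from log rotation is what tipped Apache over, one mitigation would be to have logrotate do a graceful reload in its postrotate hook instead of a hard signal. A sketch only, with assumed log paths, and I haven't confirmed this matches our actual logrotate config:

```
/var/log/httpd/*log {
    missingok
    notifempty
    sharedscripts
    delaycompress
    postrotate
        # A graceful reload lets workers finish in-flight requests,
        # which is gentler on a box that is already overloaded.
        /usr/sbin/apachectl graceful > /dev/null 2>&1 || true
    endscript
}
```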
I think the CPU load was spiked way over 21 by that point (my htop session died about 12:53am, when it showed 21.49 for the immediate average and 14.50 for the 15-minute aggregated average load). It was climbing, and I don't have any confidence that it would have been able to drop far enough in 2 hours for this NOT to be the issue when logrotate ran.

One front end is NOT going to be enough for BigCloset while using PHP5.6, and we aren't going to be able to get PHP7/HHVM working for probably a couple weeks (possibly months) at best. And it won't be an off-the-shelf solution. At best it would be me tracking down code errors, profiling them, submitting patches to various Drupal code projects, hoping they get accepted to mainstream, and us running custom-patched code till they do. We can either load a 2nd front end, or load more CPU cores onto the one we have.

I will tweak and install my load-monitoring script later. It was originally written for a situation just like this, where we were throwing too much at a dying server: I wrote a script to watch for dead processes and high loads and kill things appropriately. We eventually fixed that problem by opening our Bridgewater, NJ POP.
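The sort of check that script does can be sketched roughly like this. The threshold, the process name, and the kill policy here are illustrative assumptions, not the original script:

```shell
#!/bin/sh
# One-shot load check, meant to be run from cron every minute or so.
# THRESHOLD is the thread count of the box (6 on our front end).
THRESHOLD=${THRESHOLD:-6}
LOAD=$(cut -d' ' -f1 /proc/loadavg)

if awk -v l="$LOAD" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
    # Over capacity: kill the heaviest Apache worker so the box can catch up.
    pid=$(ps -C httpd -o pid= --sort=-pcpu | head -n 1)
    if [ -n "$pid" ]; then
        kill "$pid"
        logger "load $LOAD over $THRESHOLD: killed httpd pid $pid"
    fi
fi
```

The real script also watched for dead processes; this only covers the load side.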


Quick note: we are DEFINITELY spiking usage on the front end. I've seen load average spikes over 15.0, which basically means 15 cores' worth of usage. But it only has 3 real cores and 3 fake cores (6 total threads). It's been recovering mostly on its own, but I'm afraid that while using PHP5.6 we WILL need the 2nd front end.

HHVM and PHP7 do a lot of redundant-call optimizations, which is what makes them MUCH leaner process-wise.


I have officially downgraded the cloud cluster. It's no longer using PHP7 or HHVM, since neither was providing a stable environment in which we could properly operate BigCloset/TopShelf.

Belle (front end server) is now operationally running PHP5.6

Belle can be switched over to HHVM for limited testing as needed as we try to track down HHVM issues and fix them in code, but right now the full/operational stack is as follows.

Belle (Front End)

  • Apache 2.2
  • Memcached
  • PHP5.6 (remi stable release)
Ariel (Back End/DB Server)

  • Percona 5.6 (release 76.1, revision 5759e76)

Things seem to be operating within acceptable limits at the moment. I've had things switched over for maybe 10 minutes at this point. The site is still fast in places, a bit slower in others. It is DEFINITELY stressing the front end server more, jumping the load average from under 0.5 consistently to over 2.0 consistently.

-Piper

We are currently up and running using HHVM via Cloud in a “cobbled together” way. It is a less than ideal setup so I still need the VM re-imaged.

The method we have cobbled together right now seems to "fail" for HHVM every 8 hours or so in our initial tests, so I've written my own set of scripts to monitor the site and take automated action based on the results. It's kinda primitive, but it works 🙂


Wed Feb 24 15:00:01 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 15:37:08 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 23:43:01 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 23:45:01 CST 2016
HTTP Status when killed was: 503
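The watchdog producing log lines like the ones above can be sketched roughly like this. The URL, timeout, and restart command are assumptions; the real script differs:

```shell
#!/bin/sh
# Probe the site; on a 5xx response, log the status that triggered the
# kill (as in the entries above) and bounce HHVM. Meant to run from cron.
URL=${URL:-"http://localhost/"}
STATUS=$(curl -s -o /dev/null -w '%{http_code}' --max-time 20 "$URL" 2>/dev/null) || STATUS=000

case "$STATUS" in
    5??)
        date
        echo "HTTP Status when killed was: $STATUS"
        /sbin/service hhvm restart > /dev/null 2>&1
        ;;
esac
```

(`000` is what curl reports when it can't reach the server at all; the sketch treats that as "no action" rather than restarting on every network blip.)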

One of the longest steps in this whole process is the one known as "Rebuild Permissions". The BIGGEST issue with it was that it seemed DESTINED to fail while the "WALL" was up, so to do it properly, we felt the need to take down the wall and start letting people in.

Until the permissions database is fully rebuilt, you will notice some stuff missing. Some OLDER stories will show up on the front page with new dates. Same with comments and such. But it IS being worked on.

The three of us pulled a marathon 2-3 days getting this done, bulldozing every roadblock we ran into at full speed and then rebuilding using the pieces left behind at each step.

I will post again here, and on BigCloset itself, when the permission rebuild seems to be finished, and hopefully we can then finish the lengthy process of fixing any lingering errors.

-Piper, Erin & Cat.

We’re still hard at work behind the scenes. I managed to grab about 3 hours sleep (not three hours straight through, but basically 3 hours where I blacked out enough that I’m going to call it sleep).

Things are progressing on the database and we are working on other issues as well right now simultaneously.

We are aiming for mid-day west-coast (USA) time for the site to come back online, and while it could be sooner, I want to warn everyone that it very easily could take longer depending on how everything goes.

Right now, we aren’t anticipating any issues that will cause major delays.

-Piper, Cat, Erin and the BigCloset Band

TopShelf is closed while we upgrade the software. This shouldn’t take more than…oh, maybe most of a day? Two days? Frankly, we’re not sure. Check back here now and then for progress reports.


Erin, Piper and Cat

Hi Everyone! Quick Note!

I'm on my way up to the datacenter to install some new equipment.

Part of the Equipment Package I’m delivering includes Infrastructure improvements, namely new network cabling.

While the downtime should not be noticeable (less than 30 secs at a time), you may have periods where the site is shown as offline, or the database is offline, or everything may LOOK fine but, because the slave database server is offline, you get a 404 error when you click on a story.

The outages should be less than 30 secs at a time, and the whole process shouldn’t take me very long at all, so please bear with us!

-Piper, Erin, Cat and the BigCloset Band