Woke up this morning to find BigCloset down.

I didn’t have my load-monitoring script running on the front end because I wasn’t that worried about it. But I am now.

My htop session had died, but the load was spiked over 21.5 when it did. For a general idea, a load of 1 == 1 loaded CPU thread. We have a capacity of 6 threads, so a load of 6 == the server is at capacity. A load OVER 6 == we’re over capacity, but if things slow down a bit, there is a chance to catch up.
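For anyone who wants to check this on their own box, the comparison is simple. Here’s a minimal Python sketch of the math above (not our monitoring tooling, just an illustration):

import os

# Compare the 1-minute load average against the number of
# hardware threads (6 on our front end).
load_1m, load_5m, load_15m = os.getloadavg()
threads = os.cpu_count() or 1

if load_1m > threads:
    print("OVER capacity: load %.2f on %d threads" % (load_1m, threads))
else:
    print("Within capacity: load %.2f on %d threads" % (load_1m, threads))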

As best I can tell, at 3:11am Central time (4:11am Eastern/my time) LogRotate ran and HUP’d all the services so that it could properly rotate the log files, and Apache threw this error:

[Sun Feb 28 03:11:10 2016] [notice] seg fault or similar nasty error detected in the parent process
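For context, Apache log rotation is usually driven by a logrotate stanza along these lines (a generic example rather than our exact config; the postrotate step is the reload/HUP described above):

/var/log/httpd/*log {
    daily
    missingok
    notifempty
    sharedscripts
    postrotate
        # Signal Apache to reopen its log files after rotation;
        # this is the reload step that coincided with the crash.
        /usr/sbin/apachectl graceful > /dev/null 2>&1 || true
    endscript
}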
I think the CPU load was spiked way over 21 by that point (my htop session died about 12:53am, when it showed 21.49 for the immediate average and 14.50 for the 15-minute aggregated average load). It was climbing, and I don’t have any confidence that it would have been able to drop far enough in two hours for this not to be the issue when logrotate ran.

One front end is NOT going to be enough for BigCloset while using PHP 5.6, and we aren’t going to be able to get PHP7/HHVM working for probably a couple of weeks (possibly months) at best. And it won’t be an off-the-shelf solution. At best it would be me tracking down code errors, profiling them, submitting patches to various Drupal code projects, hoping they get accepted upstream, and us running custom-patched code until they do. We can either add a 2nd front end, or add more CPU cores to the one we have.

I will tweak and install my load-monitoring script later. It was originally written for a situation just like this, where we were throwing too much at a dying server, so I wrote a script to watch for dead processes and high loads and kill things appropriately. We eventually fixed that situation by opening our Bridgewater, NJ POP.
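The gist of it looks something like this (a minimal Python sketch of the approach, not the original script; the threshold, interval, and target process name are illustrative assumptions):

import os
import signal
import subprocess
import time

LOAD_LIMIT = 12.0    # trip point, e.g. twice our 6-thread capacity (assumption)
INTERVAL = 30        # seconds between checks
TARGET = "httpd"     # process to cull when the box is drowning (assumption)

def hungriest_pid(name):
    """PID of the highest-CPU process named `name`, or None."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu"],
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines()[1:]:
        pid, _pcpu, comm = line.split(None, 2)
        if comm == name:
            return int(pid)
    return None

while True:
    # If the 1-minute load average is past the limit, kill the
    # hungriest matching process and let the box catch its breath.
    if os.getloadavg()[0] > LOAD_LIMIT:
        pid = hungriest_pid(TARGET)
        if pid:
            os.kill(pid, signal.SIGTERM)
            print("load high, sent SIGTERM to pid %d" % pid)
    time.sleep(INTERVAL)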

-Piper

Quick note: we are DEFINITELY spiking usage on the front end. I’ve seen load average spikes over 15.0, which basically means 15 cores’ worth of usage. But it only has 3 real cores and 3 hyper-threaded “fake” cores (6 total threads). It’s been recovering mostly on its own, but I’m afraid that with PHP 5.6 we WILL need the 2nd front end.

HHVM and PHP7 do a lot of redundant-call optimizations, which is what makes them MUCH leaner process-wise.

-Piper

I have officially downgraded the cloud cluster. It’s no longer using PHP7 or HHVM, since neither was providing a stable environment in which we could properly operate BigCloset/TopShelf.

Belle (front end server) is now operationally running PHP5.6

Belle can be switched over to HHVM for limited testing as needed while we try to track down HHVM issues and fix them in code, but right now the full operational stack is as follows.

Belle (Front End)

  • Apache 2.2
  • Memcached
  • PHP5.6 (remi stable release)

Ariel (Back End/DB Server)

  • Percona 5.6 (release 76.1 Revision 5759e76)

Things seem to be operating within acceptable limits at the moment. I’ve had things switched over for maybe 10 minutes at this point. The site is still fast in places, a bit slower in others. It is DEFINITELY stressing the front end server more, jumping the load average from consistently under 0.5 to consistently over 2.0.

-Piper

We are currently up and running using HHVM via Cloud in a “cobbled together” way. It is a less-than-ideal setup, so I still need the VM re-imaged.

The method we have cobbled together right now seems to “fail” for HHVM every 8 hours or so in our initial tests, so I’ve written my own set of scripts to monitor the site and take automated action based on results. It’s kinda primitive but works 🙂
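Roughly, the watchdog loop looks like this (a minimal Python sketch of the approach, not the actual scripts; the URL, probe interval, and restart command are assumptions for illustration):

import subprocess
import time
import urllib.error
import urllib.request

SITE_URL = "http://localhost/"   # assumption: probed from the front end itself
CHECK_INTERVAL = 120             # seconds between probes (assumption)

def probe(url):
    """Return the HTTP status code for a GET of the site, 0 on no answer."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return 0  # connection refused, timeout, etc.

while True:
    status = probe(SITE_URL)
    if status != 200:
        # Log in the same shape as the entries below, then bounce HHVM.
        print(time.strftime("%a %b %d %H:%M:%S %Z %Y"))
        print("HTTP Status when killed was: %d" % status)
        subprocess.run(["service", "hhvm", "restart"])  # assumed service name
    time.sleep(CHECK_INTERVAL)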

-Piper

Wed Feb 24 15:00:01 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 15:37:08 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 23:43:01 CST 2016
HTTP Status when killed was: 503
Wed Feb 24 23:45:01 CST 2016
HTTP Status when killed was: 503

We are on the downhill run at this point. I’m actually hoping to get some sleep tonight! 🙂

Not all the new features will be available at the launch, but the site should be back shortly 🙂

TopShelf is closed while we upgrade the software. This shouldn’t take more than…oh, maybe most of a day? Two days? Frankly, we’re not sure. Check back here now and then for progress reports.

Hugs,

Erin, Piper and Cat

We are currently transferring backups of BigCloset to outside servers. Once that is complete, we will order the old drive removed and destroyed, and the new drive installed.

Once installed, it may take the data center some time to install the new OS on the drive, and then a bit of time after that for me (Piper) to get things set up just right for BigCloset.

Please hang around, and I will post updates as often as I can.

-Piper, Joyce, Cat and BigCloset Staff.

So we got BigCloset situated on Eva with a couple of new HDDs… We’ve got Eva’s websites on Magda, also with a new HDD… And we’ve got a couple more HDDs to install in another system…

Everything seems stable for the time being, after we spent the better part of two weeks tweaking settings and stabilizing the load as best we can.

We will still need to purchase new servers soon, and we will have a small bit of offline time when we swap over, but for now, I am HAPPY!

With luck, we will have time to get the SOLR/Search Service online this weekend and get BigCloset back to full features!

-HuGgLeS-
-Piper, Erin, Catrina, Sephrena, and all the BC Elves 🙂