Got a ticket about a printer not working. It's at another site, so I call up to troubleshoot.
Me: Is the printer powered on?
Me: What does the screen on the printer say?
User: Nothing, it's blank.
Me: You sure it's powered on?
User: Of course.
Me: Is there an amber light or any color light on the front?
Me: The printer is turned off
User: I told you
We have a client who is one of those rare good clients that are low maintenance, always decent, and best of all, they understand that downtime happens.
They are also our biggest client, so we treat them well, and happily, they reciprocate.
We have an ongoing project where they bought an Intel server from us for a Windows 2008R2 platform to host a single website. This is a high-traffic site, and they provided us with some specs as to what they would run. They also advised us that we could spec according to the amount of traffic their site pulled on one of our co-hosted Windows 2003 servers.
We specced the machine accordingly, to wit:
Intel Xeon 5650 2.67GHz, with 12GB RAM and two 500GB Seagate NS drives in RAID1.
We planned on about 60-70K visitors per month on this site, pulling between 40 and 65GB of traffic. Not high by most standards, I guess.
They OK'd and paid for two of these, one to run live and one to be a standby/testing server. Should one fail, we could swap IP addresses on the interfaces and go on with our lives. MSSQL replication took care of making sure the database was up to date on the standby server.
They would build the actual website themselves. We would just admin the server.
Anyhow, they decided to buy a CMS to build their site in. This CMS could only run if the application pool allocated to it in IIS was set to .NET 2.0 and 32-bit. (The server runs Windows Server 2008R2 64-bit.)
I complained to them about this (they are the kind of client who you can be direct with) and they said that they invested a few hundred K in the project, so we are stuck. Also, two more sites get added to the hosting requirement for this server.
Site gets built, we go through the usual testing phase of two or three months making sure everything works, and then she goes live.
Site runs fine for a few days, and then the CPU pegs at 99%. Log in, check, and sure enough the runaway process is w3wp.exe *32. (The *32 behind the process name indicates that the .NET app pool is 32-bit.)
Kill the process, and IIS restarts it automatically. Report the problem to them and go on with my life. A few hours later this happens again!
I am beginning to smell a rat and look at the traffic going into the site. 10GB of HTTP traffic a day. This means that we underestimated the expected traffic by a factor of five or so! Call them about this. "Oh, yeah, we launched a massive advertising campaign. Could this be why the server is falling over?"
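For anyone who wants to check the "factor of five" against the plan of 40-65GB per month, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and that the 10GB a day holds steady; the function names are made up for illustration.

```python
def monthly_gb(daily_gb, days=30):
    """Project a daily traffic figure onto a month (assumes a 30-day month)."""
    return daily_gb * days


def underestimate_factor(observed_monthly_gb, planned_monthly_gb):
    """How many times larger the observed traffic is than the planned figure."""
    return observed_monthly_gb / planned_monthly_gb


observed = monthly_gb(10)                    # 10GB a day, from the tale
low = underestimate_factor(observed, 65)     # against the top of the planned range
high = underestimate_factor(observed, 40)    # against the bottom of the planned range

if __name__ == "__main__":
    print(f"Observed: {observed}GB a month")
    print(f"Underestimated by {low:.1f}x to {high:.1f}x")
```

So depending on which end of the planned range you compare against, the estimate was off by somewhere between about 4.6 and 7.5 times.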
I reply that I don't think this is the case, since we have smaller servers that handle more hits spread over their various co-hosted sites. We go through about two weeks of this, and even on Sunday evenings at 10PM, when there is almost no traffic, the load would suddenly spike and peg the CPU at 99%, and it would stay there until someone killed the w3wp process manually.
I trawl the logs and find a specific piece of .NET code, called by a form on the site, that sends the server over the edge.
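The tale does not say how the logs were searched, so the following is only a sketch of one way to do it. It assumes IIS is writing W3C extended logs with the `cs-uri-stem` and `time-taken` fields enabled, and it ranks request paths by total time taken. The sample log lines and the `/contact-form.aspx` path are invented for illustration, not taken from the real server.

```python
from collections import defaultdict


def slowest_urls(log_lines, top=3):
    """Rank request paths by total time-taken (ms) in W3C extended log lines."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # path -> [hits, total ms]
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            continue  # other directives, or rows before any #Fields line
        row = dict(zip(fields, line.split()))
        try:
            ms = int(row["time-taken"])
            path = row["cs-uri-stem"]
        except (KeyError, ValueError):
            continue  # field not logged, or not a number
        totals[path][0] += 1
        totals[path][1] += ms
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(path, hits, ms) for path, (hits, ms) in ranked[:top]]


# Hypothetical sample data, made up for this example.
SAMPLE_LOG = """\
#Fields: date time cs-method cs-uri-stem sc-status time-taken
2011-01-01 22:00:01 GET /default.aspx 200 120
2011-01-01 22:00:02 POST /contact-form.aspx 200 95000
2011-01-01 22:00:05 GET /default.aspx 200 110
2011-01-01 22:03:10 POST /contact-form.aspx 500 120000
""".splitlines()

if __name__ == "__main__":
    for path, hits, ms in slowest_urls(SAMPLE_LOG):
        print(f"{path}: {hits} hits, {ms} ms total")
```

A path with few hits but a huge total time is a likely suspect, and its timestamps can then be lined up against the moments the CPU spiked.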
Provide them with logs, timestamps and whatnot and wait for the expected improvement in performance.
A week of babysitting the servers later, the request comes through. "Can't we upgrade the server?"
Long story short, slap in another processor, up the RAM to 32GB, and replace the drives with 10K RPM WD VelociRaptors.
The CPU usage pegged at 99% shortly afterward.
We quoted them on two new machines. They want quad-processor Intel Xeon 5690 3.46GHz with 288GB RAM and SAS drives at 15K RPM for a new project.
To quote the client: "I want to make sure we don't run into performance issues..."
Nice clients, luckily they have deep pockets...
TL;DR Let's slap some more hardware on bad code!