Data Center Junk Science: Thermal Shock \ Cooling Shock

I recently performed an interesting exercise in which I reviewed typical co-location/hosting/data center contracts from a variety of firms around the world. If you ever have a few long plane rides to take and would like an incredible amount of boring legalese to review, I still wouldn’t recommend it. :)

I did learn quite a bit from going through the exercise, but there was one condition that I came across more than a few times. It is one of those things that I put into my personal category of Data Center Junk Science. I have a bunch of these things filed away in my brain, but this one not only raises my stupidity meter from a technological perspective, it also makes me wonder whether those who require it have masochistic tendencies.

I am of course referring to a clause for Data Center Thermal Shock and, as I discovered, its evil, lesser-known counterpart, “Cooling” Shock. For those of you who have not encountered this before, it’s a provision between hosting customer and hosting provider (most often required by the customer) that usually looks something like this:

If the ambient temperature in the data center rises 3 degrees over the course of 10 (sometimes 12, sometimes 15) minutes, the hosting provider will need to remunerate (reimburse) the customer for thermal shock damages experienced by the computer and electronics equipment. The damages range from flat-fee penalties to graduated penalties based on the value of the equipment.
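To make the trigger condition concrete, here is a minimal sketch of how such a clause might be checked against a log of temperature readings. The function name, the data layout, and the sample readings are all hypothetical; the contracts I read do not spell out a measurement method, and the unit of "degrees" is whatever the contract happens to use.

```python
from typing import Sequence, Tuple

# Each reading is (minutes since start, ambient temperature in the contract's unit).
Reading = Tuple[float, float]


def clause_triggered(readings: Sequence[Reading],
                     rise_deg: float = 3.0,
                     window_min: float = 10.0) -> bool:
    """Return True if the temperature rose by at least `rise_deg`
    between any two readings taken no more than `window_min` apart.

    Assumes `readings` is sorted by time. Hypothetical helper for illustration.
    """
    for i, (t_end, temp_end) in enumerate(readings):
        for t_start, temp_start in readings[:i]:
            within_window = (t_end - t_start) <= window_min
            if within_window and (temp_end - temp_start) >= rise_deg:
                return True
    return False


# Made-up sample logs, not measurements from any real facility.
steady = [(0, 22.0), (5, 22.5), (10, 23.0), (15, 23.5)]
chiller_trip = [(0, 22.0), (4, 23.5), (8, 25.5), (12, 26.0)]
slow_drift = [(0, 22.0), (20, 26.0)]

print(clause_triggered(steady))
print(clause_triggered(chiller_trip))
print(clause_triggered(slow_drift))
```

Note that a slow drift of several degrees never trips the check, while a short excursion that stays well inside a server's rated range does. The check says nothing about whether the equipment was ever at risk.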

This clause may be rooted in the fundamental belief (one I subscribe to, given many personally witnessed tests and trials) that it’s not high temperatures that servers dislike, but rather changes in temperature. In my mind this is a basic tenet of where the industry is heading with higher operating temperatures in data center environments. My problem with this clause is directed more at the actual levels, metrics, and durations typically found in this requirement. It smacks of a technical guy gone wild trying to prove to everyone how smart he or she is, all the while giving some insight into how myopic their viewpoint may be.

First, let’s take a look at the 3 degree temperature change. This number ranges anywhere between 3 and 5 degrees in most of the contracts I reviewed that had such a clause. The problem here is that even with strict adherence to the most conservative approach to running and managing data centers today, a 3 to 5 degree delta easily keeps you within even the old ASHRAE recommendations. If we look at the airflow and temperatures at a micro scale within the server itself, the inlet air temperatures are likely to vary within that same range depending upon how heavily the box is being utilized. This ultimately means that a customer who has periods of high compute might themselves be violating this very clause, if only for a few minutes.

Which brings up the next component, which is duration. Whether you are speaking of 10-minute or 15-minute intervals, these are nice, long, leisurely periods of time which could hardly cause a “Shock” to equipment. Also keep in mind the previous point, which is that the environment has not even violated the ASHRAE temperature range. In addition, I would encourage people to actually read the allowed and tested temperatures that the manufacturers specify for server operation. A 3-5 degree swing in temperature would rarely push a server outside the operating temperature range it has been rated to work in or, worse, void the warranty.

The topic of “Chilled Shock” or “Cooling Shock”, which is the same idea applied to cooling intervals and temperatures, is just as ludicrous. Perhaps even more so!

I got to thinking that maybe my experiential knowledge might be flawed. So I went in search of white papers, studies, and technical dissertations on the potential impact and failures associated with these characteristics. I went looking, and looking, and looking, and… guess what? Nothing. There is no scientific data anywhere that I could find to corroborate this ridiculous clause. Sure, there are some papers regarding running consistently hot and the related failures, but the failures in those studies can easily be balanced against a server’s depreciation cycle.

So why would people really require that this clause be added to the contract? Are they really that draconian about it? I went to a bunch of account managers I know (both from my firm and outside) and asked about the customers who typically ask for it. The answer I got was surprising: there was a consistent percentage (albeit small) of customers out there who required this in their contracts and pushed for it aggressively. Even more surprising to me was that these were typically folks on the technical side of the house more than the lawyers or business people. I mean, these are the folks who should be more in tune with logic than, say, business or legal people, who can get bogged down in the letter of the law or dogmatic adherence to how things have been done. Right? I guess not.

But this brings up another important point. Many facilities might experience a chiller failure, a CRAH failure, or some other event which might temporarily have this effect within the facility. Let’s say that twice in one year you trigger this clause for the whole or a portion of your facility (you’re probably not doing preventative maintenance, bad you!). So the contract language around Thermal Shock now claims monetary damages. Based on what? How are these sums defined? The contracts I read through varied wildly on damages, with different means of calculation, and a whole lot more. So what is the basis of this damage assessment? Again, there are no studies that say each event takes .005 minutes off a server’s overall life, or anything like that. So the cost calculations are completely arbitrary and negotiated between provider and customer.

This is where the true foolishness comes in. The providers know that these events, while rare, might happen occasionally. While the event may be within all other service level agreements, they still might have to award damages. So what might they do in response? They increase their prices, of course, to cover their potential risk. It might be in the form of cost per kW or cost per square foot, and it might even be pretty small compared to your overall costs. But in the end, the customer ends up paying more for something that might not happen; if it does happen, there is no concrete proof it has any real impact on the life of the server or equipment, and the clause really only salves the whim of someone who failed to do their homework. If it never happens, the hosting provider is happy to take the additional money.
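As a rough illustration of how that risk might get priced in, here is a back-of-the-envelope sketch. Every number in it is invented for the example (two events a year, a flat penalty per event, a 100 kW deployment), and the function is hypothetical; I am not claiming any provider actually prices this way.

```python
def monthly_surcharge_per_kw(events_per_year: float,
                             penalty_per_event: float,
                             contracted_kw: float) -> float:
    """Spread the provider's expected yearly penalty exposure
    across the customer's contracted power, per month.

    Hypothetical model: expected exposure = events * flat penalty.
    """
    expected_annual_penalty = events_per_year * penalty_per_event
    return expected_annual_penalty / contracted_kw / 12


# Invented inputs for illustration only.
events_per_year = 2          # e.g. two chiller or CRAH incidents
penalty_per_event = 5000.0   # flat-fee penalty, arbitrary currency
contracted_kw = 100.0        # size of the customer's deployment

surcharge = monthly_surcharge_per_kw(events_per_year, penalty_per_event, contracted_kw)
paid_if_nothing_happens = surcharge * contracted_kw * 12

print(f"Surcharge per kW per month: {surcharge:.2f}")
print(f"Extra paid per year with zero events: {paid_if_nothing_happens:.2f}")
```

Under these assumptions the customer pre-pays the full expected penalty every year, whether or not an event ever occurs.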