Data Center Junk Science: Thermal Shock \ Cooling Shock

I recently performed an interesting exercise where I reviewed typical co-location/hosting/data center contracts from a variety of firms around the world. If you ever have a few long plane rides to take and would like an incredible amount of boring legalese to review, I still wouldn’t recommend it.  :)

I did learn quite a bit from going through the exercise, but there was one condition that I came across more than a few times. It is one of those things that I put into my personal category of Data Center Junk Science. I have a bunch of these things filed away in my brain, but this one not only raises my stupidity meter from a technological perspective, it also makes me wonder if those who require it have masochistic tendencies.

I am, of course, referring to a clause for Data Center Thermal Shock and, as I discovered, its evil, lesser-known counterpart, “Cooling” Shock. For those of you who have not encountered this before, it’s a provision between hosting customer and hosting provider (most often required by the customer) that usually looks something like this:

If the ambient temperature in the data center rises 3 degrees over the course of 10 (sometimes 12, sometimes 15) minutes, the hosting provider will need to remunerate (reimburse) the customer for thermal shock damages experienced by the computer and electronics equipment. The damages range from flat-fee penalties to graduated penalties based on the value of the equipment.
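To make the clause concrete, here is a minimal sketch of how such a trigger could be checked against a log of temperature readings. The function name, the reading format, and the default thresholds are my own illustrative assumptions; real contracts vary in the numbers they use and in where the temperature is measured.

```python
from datetime import datetime, timedelta


def clause_triggered(readings, max_rise=3.0, window=timedelta(minutes=10)):
    """Return True if the temperature rises by at least `max_rise`
    between any two readings taken no more than `window` apart.

    `readings` is a list of (timestamp, temperature) tuples. The unit is
    whatever the contract uses; the clause above only says "degrees".
    """
    readings = sorted(readings)  # order by timestamp
    for i, (t_start, temp_start) in enumerate(readings):
        for t_end, temp_end in readings[i + 1:]:
            if t_end - t_start > window:
                break  # later readings fall outside the window
            if temp_end - temp_start >= max_rise:
                return True
    return False


# Hypothetical data: a slow drift over 30 minutes versus a fast rise.
t0 = datetime(2024, 1, 1, 12, 0)
slow_drift = [(t0 + timedelta(minutes=m), temp)
              for m, temp in [(0, 22.0), (10, 23.2), (20, 24.4), (30, 25.5)]]
fast_rise = [(t0, 22.0), (t0 + timedelta(minutes=8), 25.5)]

slow_result = clause_triggered(slow_drift)
fast_result = clause_triggered(fast_rise)
```

Note that the slow drift ends up just as warm as the fast rise, yet only the fast rise trips the clause. The clause penalizes the rate of change, not the temperature reached.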

This clause may be rooted in the fundamental belief (and one I subscribe to, given many personally witnessed tests and trials) that it’s not high temperatures that servers dislike, but rather changes in temperature. In my mind this is a basic tenet of where the industry is evolving to, with higher operating temperatures in data center environments. My problem with this clause is directed more at the actual levels, metrics, and durations typically found in this requirement. It smacks of a technical guy gone wild, trying to prove to everyone how smart he or she is, all the while giving some insight into how myopic his or her viewpoint may be.

First, let’s take a look at the 3 degree temperature change. This number ranges anywhere between 3 and 5 degrees in most of the contracts I reviewed that had the clause. The problem here is that even with strict adherence to the most conservative approach to running and managing data centers today, a 3 to 5 degree delta easily keeps you within even the old ASHRAE recommendations. If we look at the airflow and temperatures at a micro scale within the server itself, the inlet air temperatures are likely to vary within that same range depending upon how heavily the box is being utilized. This ultimately means that a customer who has periods of high compute might themselves be violating this very clause, if only for a few minutes.

Which brings up the next component, which is duration. Whether you are speaking of 10-minute or 15-minute intervals, these are nice, long, leisurely periods of time which could hardly cause a “shock” to equipment. Also keep in mind the previous point, which is that the environment has not even left the ASHRAE temperature range. In addition, I would encourage people to actually read the allowed and tested temperatures that the manufacturers recommend for server operation. A 3 to 5 degree swing in temperature would rarely push a server outside the operating temperature range it has been rated to work in or, worse, void the warranty.

The topic of “Chilled Shock” or “Cooling Shock”, which is the same idea applied to cooling intervals and temperatures, is just as ludicrous. Perhaps even more so!

I got to thinking that maybe my experiential knowledge might be flawed. So I went in search of white papers, studies, and technical dissertations on the potential impact and failures associated with these characteristics. I went looking, and looking, and looking, and… guess what? Nothing. There is no scientific data anywhere that I could find to corroborate this ridiculous clause. Sure, there are some papers regarding running consistently hot and the related failures, but the failure rates in those studies can easily be balanced against a server’s depreciation cycle.

So why would people really require that this clause get added to the contract? Are they really that draconian about it? I went to a bunch of account managers I know (both from my firm and outside) and asked about the customers who typically ask for it. The answer I got was surprising: there was a consistent percentage (albeit small) of customers out there that required this in their contracts and pushed for it aggressively. Even more surprising to me was that these were typically folks on the technical side of the house more than the lawyers or business people. I mean, these are the folks that should be more in tune with logic than, say, business or legal people who can get bogged down in the letter of the law or dogmatic adherence to how things have been done. Right? I guess not.

But this brings up another important point. Many facilities might experience a chiller failure, a CRAH failure, or some other event which might temporarily have this effect within the facility. Let’s say it happens twice in one year that you would potentially trigger this event for the whole or a portion of your facility (you’re probably not doing preventative maintenance – bad you!). So the contract language around thermal shock now claims monetary damages. Based on what? How are these sums defined? The contracts I read through had some wild oscillations in damages, with different means of calculation, and a whole lot more. So what is the basis of this damage assessment? Again, there are no studies that say each event takes 0.005 minutes off a server’s overall life, or anything like that. So the cost calculations are completely arbitrary and negotiated between provider and customer.

This is where the true foolishness then comes in. The providers know that these events, while rare, might happen occasionally. While the event may be within all other service level agreements, they still might have to award damages. So what might they do in response? They increase the costs, of course, to cover their potential risk. It might be in the form of cost per kW or cost per square foot, and it might even be pretty small or minimal compared to your overall costs. But in the end, the customer ends up paying more for something that might not happen, and if it does happen there is no concrete proof it has any real impact on the life of the server or equipment. It really only salves the whim of someone who failed to do their homework. If it never happens, the hosting provider is happy to take the additional money.
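The provider's side of that math can be sketched in a few lines. Everything here is hypothetical: the function names, the penalty amount, the margin, and the contracted load are made-up values for illustration, not figures from any contract I reviewed. The only number taken from above is the two triggering events per year.

```python
def annual_risk_premium(expected_events_per_year, penalty_per_event, margin=0.20):
    """Expected yearly penalty payout, marked up by a contingency margin.

    The 20% default margin is an illustrative assumption.
    """
    expected_payout = expected_events_per_year * penalty_per_event
    return expected_payout * (1 + margin)


def premium_per_kw_month(annual_premium, contracted_kw):
    """Spread the annual premium across the contracted load and 12 months."""
    return annual_premium / (contracted_kw * 12)


# Hypothetical inputs: 2 events per year, a 5,000 flat penalty per event,
# and a customer contracted for 100 kW.
premium = annual_risk_premium(expected_events_per_year=2, penalty_per_event=5000)
uplift = premium_per_kw_month(premium, contracted_kw=100)
```

With these made-up inputs the customer pays about 12,000 extra per year, or roughly 10 per kW per month, whether or not a single event ever occurs.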



Author: mmanos

Infrastructure at Scale Technologist and Cloud Aficionado.

7 thoughts on “Data Center Junk Science: Thermal Shock \ Cooling Shock”

  1. As a former “sales guy” and “product manager”, but one who also holds an engineering degree, I found myself in these types of negotiations all the time. In data centres it is the temperature – where is it measured? Inlet, outlet, or ambient temp in the room? In networking deals it was often latency and packet loss across the IP network. In all these situations we were essentially being asked to write an SLA under which the customer themselves could create a situation that put us, the provider, in breach.

    In the end (as you said) we did the math… how many times did we expect to be in violation of the metric, what was the cost of the penalty, add 20% for profit and contingency and then pass the costs on to the customer.
    (to my knowledge we have never paid a penalty on these metrics)

  2. The only place where something like this is “documented” in any way is in the ASHRAE Thermal Guidelines book. Since the group that wrote this book included all of the major server vendors, it must have been created with some type of justifiable reason. It states that the “maximum rate of temperature change is 5 degrees C (9 degrees F) per hour.”

  3. Mike, as well as Dave, is right. 3C/hr is not a shock; it is an insignificant change in temperature at a glacial pace. Mil-810 shock is a shock, and can cause some real reliability issues, but how many such events do you expect to see in the data center?

    NEBS specifications call for equipment to handle temperature rates of change of 30C/hr. As far as I know (and I have been designing NEBS compliant equipment, including but not limited to servers, for quite some time), we have never had any real issues meeting this requirement in tests, or any failures in the field attributed to it.

    Now, recall that ASHRAE stands for “American Society of Heating, Refrigerating and Air Conditioning Engineers”. Sure, there are server vendors in the bunch, and all are happy to overspecify the environmentals: server vendors “just in case”, and to maybe save a few $$s on cooling solution cost (this applies to max temps, not changes). The “R” people want to keep the box that servers live within as small as possible: the bigger the box, the less revenue and profit there is to be had. Do not expect objective environmental specifications to come out of TC9.9.

    The green movement is good here: it calls for change, or at least forces people to defend the state of affairs with data, which the “keep the box small” folks will not have to offer.

    The EU CoC for data centers refers to a yet-to-be-published revision of the ETSI environmental spec for IT equipment. This calls for Environmental category 3.1 (tops out at 40C), which includes 3.1E (excursion to 45C) for limited time periods. Btw, the only salient revision to the present spec is the addition of two terms, “Data Centres” and “Computer Halls”, to the list of where class 3.1 applies, right after “Telecommunications Centres” and “storage rooms for valuable and sensitive products”. The ETSI spec is ETSI EN 300 019-1-3, and can be downloaded for free from the ETSI web site.

  4. “In data centres it is the temperature – where is it measured?”

    This might be a little cynical, but I’d guess it’s at the instrument which shows the lowest excursion.

  5. “Since the group that wrote this book included all of the major server vendors, it must have been created with some type of justifiable reason.”

    My guess is that it’s the same reason that air temps must be at 17C. Consensus and covering one’s own ass.

    If you were being scientific about it, you’d strap a dozen thermocouples to, say, 10000-20000 servers and actually measure the temperature excursion vs failure rate in half a dozen facilities over the life of the servers. My guess is that unless Google or Mr Manos set this experiment up, it simply hasn’t been done.
