Attacking the Cruft

Today the Uptime Institute announced that AOL won the Server Roundup Award.  The achievement has gotten some press already (at Computerworld, PCWorld, and related sites) and I cannot begin to tell you how proud I am of my teams.  One of the more personal transitions and journeys I have made since my experience scaling the Microsoft environments from tens of thousands of servers to hundreds of thousands of servers has been truly understanding the complexity of a problem most larger, established IT departments have been dealing with for years.  In some respects, scaling infrastructure, while incredibly challenging and hard, is in large part a uni-directional problem space.  You are faced with growth and more growth followed by even more growth.  All sorts of interesting things break when you get to big scale.  Processes, methodologies, and technologies all quickly fall by the wayside as you climb ever up the ladder of scale.

At AOL I faced a multi-directional problem space in that, as a company and as a technology platform, we were still growing.  Added to that were 27 years of what I call “Cruft”.  I define “Cruft” as years of build-up of technology, processes, politics, fiscal orphaning, and poor operational hygiene.  This cruft can act as a huge boat anchor and a barrier to an organization's ability to drive agility in its online and IT operations.  On top of this cruft, a layer of what can best be described as lethargy, or perhaps apathy, can sometimes develop and add even more difficulty to the problem space.

One of the first things I encountered at AOL was the cruft.  In any organization, everyone always wants to work on the new, cool, interesting things.  Mainly because they are new and interesting, out of the norm.  Essentially the fun stuff!  But the organization's ability to really drive the adoption of new technologies and methods was always slowed, gated, or in some cases altogether prevented by years of interconnected systems, lost owners, servers of unknown purpose lost in distant historical memory, and the like.  This I found in healthy populations at AOL.

We initially set about building a plan to attack this cruft.  To earnestly remove as much of the cruft as possible and drive the organization towards agility.  Initially we called this list of properties, servers, equipment, and the like the Operations $/-\!+ list.  As this name was not very user-friendly, it migrated into a series of initiatives grouped under the name of Ops-Surdities.  These programs attacked different types of cruft and were at a high level grouped into three main categories:

The Absurdity List – A list of projects/properties/applications that had questionable value, no owner, no direction, or the like, but were still drawing load and resources from our data centers.  The plan here was to develop action plans for each of the items that appeared on this list.

Power Hog – An effort to audit our data center facilities, equipment, and the like, looking for inefficient servers, installations, and/or technology and migrating them to new, more efficient platforms or our AOL Cloud infrastructure.  You knew you were in trouble, and that you had been marked, when a bronze pig trophy appeared on your desk or in your office.

Ops Hygiene – The sometimes tedious task of tracking down older machines and systems that may have been decommissioned in the past, marked for removal, or fully depreciated, but were never truly removed.  Pure vampiric load.  You may or may not be surprised how much of this exists in modern data centers.  It's a common issue I have discussed with most data center management professionals in the industry.

So here we are, in a timeline measured in under a year, and being told all along the way by “crufty old-timers” that we would never make any progress, my teams have decommissioned almost 10,000 servers from our environments.  (Actually this number is greater now, but the submission deadline for the award was earlier in the year.)  What an amazing accomplishment.  What an amazing team!

So how did we do it?

As we will be presenting this in a lot more detail at the Uptime Symposium, I am not going to give away all of our secrets in a blog post; instead I will give you a good reason to head to the Uptime event and listen to, and ask, the primary leaders of this effort how they did it in person.  It may be a good use of that travel budget your company has been sitting on this year.

What I will share is some guidelines on approach and some things to be wary of if you are facing similar challenges in your organization.

FOCUS AND ATTENTION

I cannot tell you how many people I have spoken with that have tried to go after ‘cruft’ like this time and time again and failed.  One of the key drivers for success in my mind is ensuring that there is focus and attention on this kind of project at all levels, across all organizations, and most importantly from the TOP.  Too often executives give out blind directives with little to no follow-through and assume this kind of thing gets done.  They are generally unaware of the natural resistance to this kind of work that exists in most IT organizations.  Having motivated, engaged, and focused leadership on these types of efforts goes an extraordinarily long way to making headway here.

BEWARE of ORGANIZATIONAL APATHY

The human factors that stack up against a project like this are impressive.  While people may not be openly in revolt over such projects, there is a natural resistance to getting this work done.  This work is not sexy.  This work is hard.  This work is tedious.  It likely means going back and touching equipment and kit that has not been messed with for a long time.  You may have competing organizational priorities which place this kind of work at the bottom of the workload priority list.  In addition to having executive buy-in and focus, make sure you have some really driven people running these programs.  You are looking for CAN DO people, not MAKE DO people.

TECHNOLOGY CAN HELP, BUT IT'S NOT YOUR HEAVY LIFTER

Probably a bit strange for a technology blog to say, but it's true.  We have an incredible CMDB and asset system at AOL.  This was hugely helpful to the effort in really getting to the bottom of the list.  However, no amount of technology will be able to perform the myriad tasks required to actually make material movement on this kind of work.  Some of it requires negotiation, some of it requires strength of will, some of it takes pure persistence in running these issues down… working with the people.  Understanding what is still required, what can be moved.  This requires people.  We had great technologies in place from the perspective of knowing where our stuff was, what it did, and what it was connected to.  We had great technologies like our Cloud to ultimately move some of these platforms to.  However, you need to make sure you don't go too far down the people trap.  I have a saying in my organization: there is a perfect number of project managers and security people in any organization, the number at which the work output and value delivered are highest.  What is that number?  It depends, but you definitely know when you have one too many of each.

MAKE IT FUN IF YOU CAN

From the bronze pigs to the minor celebrations each month as we worked through the process, we ensured that the attention given the effort was not negative.  Sure it can be tough work, but at the end of the day you are substantially investing in the overall agility of your organization.  It's something to be celebrated.  In fact, at the completion of our aggressive goals the primary project leads involved did a great video (which you can see here) to highlight and celebrate the win.  Everyone had a great laugh and a ton of fun doing what was ultimately a tough grind of work.  If you are headed to Symposium I strongly encourage you to reach out to my incredible project leads.  You will be able to recognize them from the video… without the mustaches of course!

\Mm

Breaking the Chrysalis

What has come before

When I first took my position at AOL I knew I was going to be in for some very significant challenges.  This position, perhaps more so than any other in my career, was going to push the bounds of my abilities as a technologist, as an operations professional, as a leader, and as someone who would hold measurable accountability for the operational success of an expansive suite of products and services.  As many of you may know, AOL has been engaged in what used to be called internally a “Start-Around”.  Essentially an effort to try and fundamentally change the company from its historic roots into the premium content provider for the Internet.

We no longer use this term internally, as it is no longer about forming or defining that vision.  It has shifted to something more visceral.  More tangible.  It's a challenge that most companies should be familiar with: it's called Execution.  Execution is a very simple word, but as any good operations professional knows, the devil is in the details, and those details have layers and layers of nuances.  It's where the proverbial rubber meets the road.  For my responsibilities within the company, execution revolves 100% around delivering the technologies and services to ensure our products and content remain available to the world.  It is also about fundamentally transforming the infrastructural technologies and platform systems our products and content are based upon and providing the most agility and mobility we can to our business lines.

One fact that is often forgotten in the fast-paced world of Internet Darlings, is that AOL had achieved a huge scale of infrastructure and technology investment long before many of these companies were gleams in the eyes of their founders.   While it may be fun and “new” to look at the tens of thousands of machines at Facebook, Google, or Microsoft – it is often overlooked that AOL had tens of thousands of machines (and still does!) and solved many of the same problems years ago.  To be honest it was a personal revelation for me when I joined.  There are few companies who have had to grow and operate at this kind of scale and every approach is a bit unique and different.  It was an interesting lesson, even for one who had a ton of experience doing something similar in “Internet Darling” infrastructures.

AOL has been around for over 27 years.  In technology circles, that's like going back almost ten generations.  Almost 3 decades of “stuff”.  The stuff was not only gear and equipment from the natural growth of the business, but included the expansion of features and functionality of long-standing services, increased systems interdependencies, and operational, technological, and programmatic “Cruft” as new systems/processes/technologies were built upon or bolted onto older systems.

This “cruft” adds significant complexity to your operating environment and can truly limit your organization's agility.  As someone tasked with making all this better, it struck me that we actually had at least two problems to solve: the platform and foundation for the future, and a method and/or strategy for addressing the older products, systems, and environments and increasing our overall agility as a company.

These are hard problems.  People have asked why I haven't blogged externally in a while.  This is the kind of challenge, with multiple layers of challenges underneath, that can keep one up at night.  From a strategy perspective, do you target the new first?  Do you target the legacy environments to reduce the operational drag?  Or do you try to define a unified strategy to address both?  That is a lot harder and generally more complex, but the potential payoff is huge.  Luckily I have a world class team at AOL, and together we built and entered our own cocoon and busily went to work.  We have gone down the path of changing out technology platforms, operational processes, and outdated ways of thinking about data centers, infrastructure, and our overall approach, fighting forward every inch on this idea of unified infrastructure.

It was during this process that I came to realize that our particular legacy challenge, while at “Internet” scale, was more closely related to the challenges of most corporate or government environments than to those of the biggest Internet players.  Sure we had big scale, we had hundreds of products and services, but the underlying “how to get there from here” problems were more like universal IT challenges than like scaling out similar applications across commoditized infrastructure.  It ties into all the marketing promises, technological snake oil, and other baloney about the “cloud”.  The difference being that we had to quickly deliver something that worked and would not impact the business.  Whether we wanted to or not, we would be walking down roads similar to those facing most IT organizations today.

As I look at the challenges facing modern IT departments across the world, their ability to “go to the cloud” or make use of new approaches is also securely anchored down by the “cruft” of their past.  Sometimes that cruft is so thick that the organization cannot move forward.  We were there; we were in the same boat.  We aren't out of it yet, but we have made some developments that I think are pretty significant, and I intend to share those learnings where appropriate.

 


ATC IS BORN

Last week we launched a brand new data center facility we call ATC.  This facility is fundamentally built upon the work that we have been doing around our own internal cloud technologies, shifts in operational process and methodology, and targeting our ability to be extremely agile in our new business model.  It represents a model for how to migrate the old, prepare for the new, and provide a platform upon which to build our future.

Most people ignore the soft costs when looking at the adoption of different cloud offerings; operational impacts are typically considered as afterthoughts.  What if you built those requirements in from day one?  How would that change your design?  Your implementation?  Your overall strategy?  I believe that ATC represents that kind of shift in thinking.  At least for us internally.

One of the key foundations for our ATC facility is our cloud platform and automation layer.  I like to think about this layer as a little bit country and a little bit rock and roll.  There is tremendous value in the learnings that have come before, and nowhere is this more self-evident than at AOL.  As I mentioned, the great minds of the past (as well as those in the present) invested in many great systems that made this company a giant in the industry.  There are many such systems here, but one of the key ones in my mind is the Configuration Management System.  All organizations invest significantly in this type of platform.  If done correctly, its uses can span far beyond a rudimentary asset management system to include cost allocation, dependency mapping, detailed configuration and environmental data, and, in some cases like ours, the base foundation for leading us into the cloud.
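To make that span of uses concrete, here is a minimal sketch of the kind of record such a system might carry; the field names and example values are hypothetical illustrations, not AOL's actual CMDB schema.

```python
from dataclasses import dataclass, field

@dataclass
class ConfigItem:
    """Hypothetical configuration management record; all fields are illustrative."""
    asset_id: str                 # basic asset management: unique ID for the box or VM
    hostname: str
    location: str                 # site / rack position and environmental placement
    cost_center: str              # cost allocation back to the owning business line
    depends_on: list = field(default_factory=list)   # dependency mapping to other asset IDs
    config: dict = field(default_factory=dict)       # detailed configuration data

# A record like this lets an automation layer answer "what is this machine, who pays
# for it, and what breaks if I touch it?" before it provisions or decommissions anything.
web01 = ConfigItem("srv-0001", "web01.internal", "DC-A/R12/U20", "homepage",
                   depends_on=["db-0042"], config={"power_kw": 0.35, "os": "linux"})
```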

Many companies I speak with abandon this work altogether or live in a strange split/hybrid model where they treat “Cloud” as different.  In our space, new government regulations, new safe harbor laws, and the like continue to drive the relevance of a universal system acting as a central authority.  The fact that this technology actually sped up our development efforts in automation cannot be ignored.

We went from provisioning servers in days to getting base virtual machines up and running in under 8 seconds.  Want service and application images (for established products)?  Add another 8 seconds or so.  Want to roll it into production globally (changing global DNS, load balancing, and security)?  Let's call that another minute to roll out.  We used Open Source products and added our own development glue to tie it all into our own systems.  I am incredibly proud of my Cloud teams here at AOL, because what they have been able to do in a relatively short period of time is roll out a world class cloud and service provisioning system that can be applied to new efforts and platforms or to our older products.  Better yet, the provisioning systems were built to be universal, so if required we can do the same thing with stand-alone physical boxes or virtual machines.  No difference.  Same system.  This technology platform was recently recognized by the Uptime Institute at its last Symposium in California.
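To give a feel for that sequence, here is a minimal sketch of what such a provisioning flow might look like.  The object methods (`create_vm`, `apply_image`, `add_to_rotation`) and the shape of the APIs are assumptions for illustration only; they are not AOL's actual tooling, which was built from Open Source components plus internal glue.

```python
def provision_capacity(product, count, cloud, cmdb, gslb):
    """Sketch of an end-to-end flow: base VMs, application image, then global traffic.

    `cloud`, `cmdb`, and `gslb` stand in for whatever virtualization, configuration
    management, and DNS/load-balancing layers an organization actually runs."""
    nodes = []
    for _ in range(count):
        vm = cloud.create_vm(flavor=product.flavor)       # base VM in seconds, not days
        cloud.apply_image(vm, product.image)              # service/application image
        cmdb.register(vm, owner=product.cost_center)      # keep the central authority current
        nodes.append(vm)
    gslb.add_to_rotation(product.vip, nodes)              # DNS / load balancing / security step
    return nodes

# Because "give me a node" is an abstract request here, the same flow could hand back
# stand-alone physical boxes instead of virtual machines without the caller noticing.
```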


This technology was recently put to the test with the earthquake that hit the East Coast of the United States.  While thankfully the damage was minimal, the tremor of Internet traffic was incredible.  The AOL homepage, along with our news sites, started to get hammered with traffic and requests.  In the past this would have required a massive people effort to provision more capacity for our users.  With the new technology in place we were able to start adding additional machines to take the load extremely quickly, with very minimal impact to our users.  In this particular case these machines were provisioned from our systems in existing data centers (not ATC), but the technology is the same.

This kind of technology and agility has some interesting side effects too.  It allows your organization to move much more quickly and aggressively than ever before.  I have seen Jevons paradox manifest itself over and over again in the technology world.  For those of you who need a refresher, Jevons paradox is the proposition that technological progress that increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.

It's like when car manufacturers started putting Miles per Gallon (MPG) efficiency ratings on autos: the direct result was not a reduction in driving, but rather an overall increase in travel.

ATC officially launched on October 1, 2011.  It took all of an hour for almost 100 virtual machines to be deployed to it as soon as it was “turned on”.  It has since long passed that mark, and in fact usage of this technology is growing faster than we can coordinate executive schedules for our ribbon cutting ceremony this week.

While the cloud development and technology efforts are cornerstones of the facility, it is not this work alone that provides something unique.  After all, however slick our virtualization and provisioning systems are, however deeply integrated they are into our internal tools and configuration management systems, those characteristics in and of themselves do not reflect the true evolution that ATC represents.

ATC is a 100% lights-out facility.  There are absolutely no employees stationed at the facility full time, contract, or otherwise.  The entire premise is that we have moved from a reactive support model to a proactive, or planned work, support model.  If you compare this with other facilities (including some I built myself in the past) there are always personnel on site, even if only contractors.  This has fundamentally led to significant changes in how we operate our data centers and in how, what, and when we do our work, and it has pushed down the overall cost of operating our environments.  These changes range from efficiencies and approaches I have used before (100% pre-racked, vendor-integrated gear and systems integration) to fundamentally brand new approaches.  They have not been easy, and a ton of credit goes to our operations and engineering staff in the data centers and across the Technology Operations world here at AOL.  It is always culturally tough to be open to fundamentally changing business as usual.  Another key aspect of this facility and infrastructure is that from a network perspective it is nearly 100% non-blocking.  My network engineers, being network engineers, pointed out that it is not completely non-blocking for a few reasons, but I can honestly say that the network topology is the closest I have ever seen to “completely” non-blocking deployed in a real network environment, especially compared to the industry standard of 2:1.

Another incredible aspect of this new data center facility and the technology deployed is our ability to quick-launch compute capacity.  The total time it took to go from idea inception (no data center) to delivering active capacity to our internal users was 90 days.  In my mind this is made even more incredible by the fact that this was the first time all these work-streams came together, including the unified operations deployment model and all of the physical aspects of just getting iron to the floor.  This time frame was made possible by a standardized, modular way of building out our compute capacity in logical segments based upon the infrastructure cloud type being deployed (low tier, mid-tier, etc.).  This approach has given us a predictability in speed of deployment and cost which in my opinion is unparalleled.

The culmination of all of this work is the result of some incredible teams devoted to the desire to effect change, a little dash of renegade engineering, a heaping helping of new perspective, blood, sweat, tears, and vision.  I am extremely proud of the teams here at AOL for delivering this ground-breaking achievement.  But then again, I am more than a bit biased.  I have seen the passion of these teams manifested in some incredible technology.

As with all things like this, it's been a journey and there is still a bunch of work to do.  Still more to optimize.  Deeper analysis and easier aggregation for stubborn legacy environments.  We have already set our sights on the next generation of cloud development.  But for today, we have successfully built a new foundation upon which even more will be built.  For those of you who were not able to attend the Uptime Symposium this year, I will be putting up some videos that give you some flavor of our work driving a low cost cloud compute and provisioning system from Open Source components.

 

\Mm

Private Clouds – Not just a Cost and Technology issue, It's all about trust, the family jewels, corporate value, and identity

I recently read a post by my good friend James Hamilton at Amazon regarding Private Clouds.  James and I worked closely together at Microsoft and he was always a good source of out-of-the-box thinking and challenging the status quo.  While James' post, found here, speaks to the Private Cloud initiative being what amounts to an evolutionary dead end, I would have to respectfully disagree.

James' post starts out by correctly pointing out that at scale the large cloud players have the resources and incentive to achieve some pretty incredible cost savings.  From an infrastructure perspective he is dead on.  But I don't necessarily agree that this innovation will never reach the little guy.  In my role at Digital Realty Trust I think I have a pretty unique perspective on the infrastructure developments both at the “big” guys and in what most corporate enterprises have available to them from a leasing or commercial perspective.

Companies like Digital Realty Trust, Equinix, Terremark, Dupont Fabros, and a host of others in the commercial data center space are making huge advancements in this space as well.  The free market economy has now placed an importance on low-PUE, highly efficient buildings.  You are starting to see these firms commission buildings with commissioned PUEs below 1.4.  Compared to most existing data center facilities this is a huge improvement.  Likewise these firms are incented to hire mechanical and electrical experts.  This means that this same expertise is available to the enterprise through leasing arrangements.  Where James is potentially correct is at that next layer of IT-specific equipment.

This is an area where there is an amazing amount of innovation happening at Amazon, Google, and Microsoft.  But even here there are firms stepping up to provide solutions that bring extensive virtualization and cloud-like capabilities to bear.  Companies like Hexagrid have software solution offerings that are being marketed to typical co-location and hosting firms to do the same thing.  Hexagrid and others are focusing on the software and hardware combinations to deliver full service solutions for companies in this space.  In fact (as some comments on James' blog mention) there is a lack of standards and a fear of vendor lock-in in choosing one of the big firms.  It's an interesting thought to ponder whether a software-plus-hardware solution offered to the hundreds of co-location players and hosting firms might be more of a universal solution without fear of lock-in.  Time will tell.

But this brings up one of the key criticisms: this is not just about cost and technology.  I believe what is really at stake here is much more than that.  James makes great points on the greater resource utilization of the big cloud players and how much more efficient they are at utilizing their infrastructure.  To which I will snarkily (and somewhat tongue-in-cheek) say, “SO WHAT!”  🙂  Do enterprises really care about this?  Do they really optimize for this?  I mean, if you pull back that fine veneer of politically correct answers and “green-suitable” responses, is that what their behavior in REAL LIFE is indicative of?  NO.

This was a huge revelation for me when I moved into my role at Digital.  When I was at Microsoft, I optimized for all of the things that James mentions because it made sense to do so when you owned the whole pie.  In my role at Digital I have visibility into tens of data centers, across hundreds of customers that span just about every industry.  There is not, nor has there been, a massive move (or any move for that matter) to become more efficient in the utilization of their resources.  We have had years of people bantering about how wonderful, cool, and revolutionary a lot of this stuff is, but worldwide data center utilization levels have remained abysmally low.  Some providers bank on this.  Over-subscription of their facilities is part of their business plan.  They know companies will lease and take down what they think they need, and never actually consume it in REALITY.

So if this technology issue is not a motivating factor, what is?  Well, cost is always part of the equation.  The big cloud providers will definitely deliver cost savings, but private clouds could deliver cost savings as well.  More importantly however, private clouds will allow companies to retain their identity and uniqueness, and keep what makes them competitively them – Them.

I don't so much see it as a Private cloud or Public cloud kind of thing but more of a Private Cloud AND Public cloud kind of thing.  To me it looks more like an exercise in data abstraction.  The Public offerings will clearly offer infrastructure benefits in terms of cost, but will undoubtedly lock a company into that single solution.  The IT world has been bitten before by putting all its eggs in a single basket, and the need for flexibility will remain key.  Therefore you might begin to see systems integrators, co-location and hosting firms, and others build their own platforms, or, much more likely, build platforms that umbrella over the big cloud players to give enterprises the best of both worlds.

Additionally we must keep in mind that the biggest resistance to the adoption of the cloud is not technology or cost but RISK and TRUST.  Do you, Mr. CIO, trust Google to run all of your infrastructure?  Your applications?  Do you, Mrs. CIO, trust Microsoft or Amazon to do the same for you?  The answer is not a blind yes or no.  It's a complicated set of minor yes responses and no responses.  They might feel comfortable outsourcing mail operations, but not the data warehouse holding decades of customer information.  The Private cloud approach will allow you to spread your risk.  It will allow you to maintain those aspects of the business that are core to the company.

The cloud is an interesting place today.  It is dominated by technologists.  Extremely smart engineering people who like to optimize and solve for technological challenges.  The actual business adoption of this technology set has yet to be fully explored.  Just wait until the “Business” side of these companies gets its hooks into this technology set and starts placing other artificial constraints, or optimizing around other factors.  There are thousands of different motivators out in the world.  Once that starts to happen in earnest, I think what you will find is a solution that looks more like a hybrid than the pure plays we dream about today.

Even if you think my ideas and thoughts on this topic are complete BS, I would remind you of something that I have told my teams for a very long time: “There is no such thing as a temporary data center.”  This same mantra will hold true for the cloud.  If you believe that the Private Cloud will be a passing and temporary thing, just keep in mind that there will be systems and solutions built to this technology approach, thus imbuing it with a very, very long life.

\Mm

A Practical Guide to the Early Days of Data Center Containers

In my current role (and given my past) I often get asked about the concept of Data Center Containers by many looking at this unique technology application to see if it's right for them.  In many respects we are still in the early days of this technology approach, and any answers one gives definitely have a variable shelf life given the amount of attention the manufacturers and the industry are giving this technology set.  Still, I thought it might be useful to try and jot down a few key things to think about when looking at data center containers and the modularized solutions out there today.

I will do my best to try and balance this view across four different axes: Technology, Real Estate, Financial, and Operational considerations.  A sort of ‘Executive's View’ of this technology.  I do this because containers as a technology cannot and should not be looked at from a technology perspective alone.  To do so is complete folly, and you are asking for some very costly problems down the road if you ignore the other factors.  Many love to focus on the interesting technology characteristics or the benefits in efficiency that this technology can bring to bear for an organization, but to implement this technology (like any technology, really) you need a holistic view of the problem you are actually trying to solve.

So before we get into containers specifically, let's take a quick look at why containers have come about.

The Sad Story of Moore’s Orphan

In technology circles, Moore's law has come to be applied to a number of different technology advancement and growth trends and has come to represent exponential growth curves.  The original Moore's law was actually an extrapolation and forward-looking observation based on the fact that ‘the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented.’  As my good friend and long time Intel Technical Fellow now with Microsoft, Dileep Bhandarkar, routinely states, Moore has now been credited with inventing the exponential.  It's a fruitless battle, so we may as well succumb to the tide.

If we look at the technology trends across all areas of Information Technology, whether processors, storage, memory, or networking, the trend has clearly fallen into this exponential pattern.  In terms of numbers of instructions, amount of storage or memory, network bandwidth, or even tape technology, it's clear that technology has been marching ahead at a staggering pace over the last 20 years.  Isn't it interesting, then, that the place where all of this wondrous growth and technological wizardry has manifested itself, the data center or computer room or data hall, has been moving along at a near evolutionary standstill?  In fact, if one truly looks at the technologies present in most modern data center designs, one would find only small differences from the very first special purpose data room built by IBM over 40 years ago.

Data Centers themselves have a parallel in the beginning of the industrial revolution.  I am positive that Moore's observations would hold true as civilization transitioned from an agriculture-based economy to an industrialized one.  In fact, one might say that the current modularization approach to data centers is really just the industrialization of the data center itself.

In the past, each and every data center was built lovingly by hand by a team of master craftsmen and data center artisans.  Each is a one-of-a-kind tool built to solve a set of problems.  Think of the eco-system that has developed around building these modern day castles: architects, engineering firms, construction firms, specialized mechanical industries, and a host of others all come together to create each and every masterpiece.  So too did those who built plows and hammers, clocks and sextants, and the other tools of the previous era specialize in making each item, one by one.  That is, of course, until the industrial revolution.

The data center modularization movement is not limited to containers, and there is some incredibly ingenious work happening outside of containers today, but one can easily see the industrial benefits of mass producing such technology.  This approach simply creates more value, reduces cost and complexity, and simplifies the whole.  No longer are companies limited to working with the arcane forces of data center design and construction; many of these components are being pre-packaged, pre-manufactured, and aggregated, reducing the complexity of the past.

And why shouldn't it?  Data Centers live at the intersection of Information and Real Estate.  They are more like machines than buildings, but share common elements of both buildings and technology.  All one has to do is look at it from a financial perspective to see how true this is.  In terms of construction, the cost of a data center breaks down to a simple format: roughly 85% of the total cost to build the facility is made up of the components, labor, and technology to distribute and cool the electrical load.


This of course leaves roughly 15% of the costs relegated to land, steel, concrete, bushes, and the more traditional real estate components of the build.  Obviously these percentages differ market to market, but on the whole they are close enough for one to get the general idea.  It also raises an interesting question as to what the big drive for higher density in data centers really is, but that is a post for another day.

As a result of this incredible growth there has been an explosion, a Renaissance if you will, in data center design and approach, and the modularization effort is leading the way in causing people to think differently about data centers themselves.  It's a wonderful time to be part of this industry.  Some claim that this change is driven by the technology.  Others claim that the drivers behind this change have to do with the tough economic times and are more financial.  The true answer (as in all things) is that it's a bit of both, plus some additional factors.

Driving at the intersection of IT Lane and Building Boulevard

From the technology perspective, the driver behind this change is the fact that most existing data centers are not designed or instrumented to handle the demands of the changing technology requirements occurring within the data center today.

Data Center managers are being faced with increasingly varied redundancy and resiliency requirements within the footprints that they manage.  They continue to support environments that rely heavily upon the infrastructure to provide robust reliability and ensure that key applications do not fail.  But applications are changing.  Increasingly there are applications that do not require the same level of infrastructure to be deployed, either because the application is built in such a way that it is geo-diverse or server-diverse, or because internal business units have deployed test servers or lab/R&D environments that do not need this level of infrastructure.  With the number of RFPs out there demanding that software and application developers solve the redundancy issue in software rather than through large capital spend on behalf of the enterprise, this is a trend likely to continue for some time.  Regardless of the reasons, the variability challenges that data center managers are facing are greater than ever before.

Traditional data center design cannot meet these needs without additional waste or significant additional expenditure.  Compounding this is the ever-increasing requirement for higher power density and the resulting cooling requirements.  This is complicated by the fact that there is no uniformity of load across most data centers.  You have certain racks or areas driving incredible power consumption, requiring significant density, and other environments, perhaps legacy, perhaps under-utilized, which run considerably less dense.  In a single room you could see rack power densities vary by as much as 8kw per rack!  You might have a bunch of racks drawing 4kw per rack and an area drawing 12kw per rack or even denser.  This can consume valuable data center resources and make data center planning very difficult.
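To illustrate why that variance makes planning hard, here is a toy calculation; the room budget and rack counts are invented numbers, not figures from any real facility.

```python
def power_used(rack_mix):
    """Sum the draw of a mix of rack densities; rack_mix maps zone -> (racks, kW per rack)."""
    return sum(racks * kw for racks, kw in rack_mix.values())

ROOM_BUDGET_KW = 1000   # hypothetical room power budget
mix = {"legacy": (150, 4), "high_density": (30, 12)}

used = power_used(mix)
print(f"{used} kW drawn, {ROOM_BUDGET_KW - used} kW of headroom left")
# 150 racks at 4 kW plus 30 racks at 12 kW already consume 960 kW of a 1,000 kW room:
# the dense zone strands floor space while the legacy zone strands power, which is
# exactly the planning headache described above.
```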

Additionally, looming on the horizon is the spectre, or opportunity, of commodity cloud services, which might offer additional resources that could significantly change your data center design requirements.  This is generally an unknown at this point, but my money is on the cloud significantly impacting not only what you build, but how you build it.  This ultimately drives a modularized approach to the fore.

From a business and finance perspective companies are faced with some interesting challenges as well.  The first is that the global inventory of data center space (from a leasing or purchase perspective) is sparse at best.  This results from the glut of capacity after the dotcom era, the land grab that occurred after 9/11, and the finance industry chewing up much of the good inventory.  Additive to this is the fact that there is a real reluctance to build these costly facilities speculatively.  This is a combination of how the market was burned in the dotcom days and the general lack of availability of, and access to, large sums of capital.  Both of these factors are making data center space a tight resource.

In my opinion the biggest problem across every company I have encountered is that of capacity planning.  Most organizations cannot accurately predict how much data center capacity they will need next year, let alone 3 or 5 years from now.  It's a challenge that I have invested a lot of time trying to solve, and it's just not that easy.  But this lack of predictability exacerbates the problems for most companies.  By the time they realize they are running out of capacity or need additional capacity, it becomes a time-to-market problem.  Given the inventory challenge I mentioned above, this can put a company in a very uncomfortable position.  Especially if you take the all-in industry average timeline of building a traditional data center yourself: somewhere between 106 and 152 weeks.

The high upfront capital cost of a traditional data center build can also be a significant endeavor and business-impacting event for many companies.  The amount of spending associated with the traditional method of construction could strain a company's resources and/or force it to focus them on something non-core to the business.  Data centers can and do impact the balance sheet.  This is a fact that is not lost on the finance professionals in the organization looking at this type of investment.

Companies need to remain agile and move quickly, and they are looking for the same flexibility from their infrastructure.  An asset like a large data center built to requirements that no longer fit can create a drag on a company's ability to stay responsive as well.

None of this even acknowledges some basic cost factors that are beginning to come into play around the construction itself.  The construction industry is already forecasting that for every 8 people retiring in the key trades (mechanical, electrical, pipe-fitting, etc.) associated with data centers, only one person is replacing them.  This will eventually mean higher construction costs and increased scarcity of construction resources.

Modularized approaches help with all of these issues and challenges and provide the modern data center manager a way to solve both the technology and the business level challenges.  They allow you to move to site integration versus site construction.  Let me quickly point out that this is not some new whiz-bang technology approach.  It has been around in other industries for a long, long time.

Enter the Container Data Center

While it is not the only modularized approach, this is the landscape in which the data center container has made its entry.

First and foremost let me say that while I am a strong proponent of containment in every aspect, containers can add great value or simply not be a fit at all.  They can drive significant cost benefits or end up costing significantly more than traditional space.  The key is that you need to understand what problem you are trying to solve and have a couple of key questions answered first.

So let's explore some of these things to think about in the current state of data center containers today.

What problem are you trying to solve?

The first question to ask yourself when evaluating whether containerized data center space would be a fit is which problem you are trying to solve.  In the past, the driver for me had more to do with solving deployment-related issues.  We had moved the base unit of measure from servers, to racks of servers, and ultimately to containers.  To put it in more general IT terms, it was a move from deploying tens to hundreds of servers per month, to hundreds and thousands of servers per month, to tens of thousands of servers per month.  Some people look at containers as Disaster Recovery or Business Continuity solutions.  Others look at them from the perspective of HPC clusters or large, uniform batch processing and modeling requirements.  You must remember that most vendor container solutions out there today are modeled on hundreds to thousands of servers per “box”.  Is this a scale that is even applicable to your environment?  If you think it's as simple as dropping a container in place and then deploying servers into it as you will, you will have a hard learning curve in the current state of ‘container-world’.  It just does not work that way today.

Additionally, one has to think about the type of ‘IT Load’ they will place inside of a container.  Most containers are built around similar or like machines in bulk.  Rare to non-existent is the container that can take a multitude of different SKUs in different configurations.  Does your use drive uniformity of load or consistent use across a large number of machines?  If so, containers might be a good fit; if not, I would argue you are better off in traditional data center space (whether traditionally built or modularly built).

I will assume for purposes of this document that you feel you have a good reason to use this technology application.

Technical things to think about . . .

For purposes of this document I am going to refrain from getting into a discussion or comparison of particular vendors (except in aggregate) and generalizations, as I will not endorse any vendor over another in this space.  Nor will I get into an in-depth discussion around server densities, compute power, storage, or other IT-specific comparisons for the containers.  I will trust that your organizations have experts, or at least people knowledgeable in which servers, network gear, operating systems, and the like you need for your application.  There is quite a bit of variety out there to choose from, and you are a much better judge of such things for your environments than I.  What I will talk about here from a technical perspective are things that you might not be thinking of when it comes to the use of containers.

Standards – What’s In? What’s Out?

One of the first considerations when looking at containers is to make sure that your facilities experts do a comprehensive review of the vendors you are evaluating in terms of the data center aspects of the container.  Why?  The answer is simple.  There are no set industry standards when it comes to Data Center Containers.  This means that each vendor might have their own approach on what goes in, and what stays out, of the container.  This has some pretty big implications for you as the user.  For example, let's take a look at batteries or UPS solutions.  Some vendors provide this function in the container itself (for ride-through or other purposes), while others assume this is part of the facility you will be connecting the container to.  How are the UPS/batteries configured in your container?  Some configurations might have interesting harmonics issues that will not work for your specific building configuration.  It's best to make sure you have both IT and facilities people look at the solutions you are choosing jointly, and that you know what base services you will need to provide to the containers from the building, what the containers will provide, and the like.

This brings up another interesting point you should probably consider.  Given the variety of container configurations and the lack of an overall industry standard, you might find yourself locked into a specific container manufacturer for the long haul.  If having multiple vendors is important, you will need to ensure that you find vendors compatible with a standard that you define, or wait until there is an industry standard.  Some look to the widely publicized Microsoft C-Blox specification as a potential basis for a standard.  This is their internal container specification that many vendors have configurations for, but you need to keep in mind that it's based on Microsoft's requirements and might not meet yours.  Until the Green Grid, ASHRAE, or another such standards body starts looking to drive standards in this space, it's probably something to be concerned about.  This What's in/What's out conversation becomes important in other areas as well.  In the sections below on finance asset classes and operational items, understanding what is inside has some large implications.

Great Server manufacturers are not necessarily great Data Center Engineers

Related to the previous topic, I would recommend that your facilities people really take a look at the mechanical and electrical distribution configurations of the container manufacturers you are evaluating.  The lack of standards leaves a lot open to interpretation, and you may find that the one-line diagrams or configuration of the container itself will not meet your specifications.  Just because a firm builds great servers does not mean it builds great containers.  Keep in mind, a data center container is a blending of IT and infrastructure that might normally be housed in a traditional data center.  In many cases the actual data center componentry and design might be new to the vendor.  Some vendors are quite good, some are not.  It's worth doing your homework here.

Certification – Yes, it's different from Standards

Another thing you want to look for is whether or not your provider is UL and/or CE certified.  It's not enough that the servers and internal hardware are UL or CE listed; I would strongly recommend the container itself have this certification.  This is very important, as you are essentially talking about a giant metal box that is connected to somewhere between 100kw and 500kw of power.  Believe me, it is in your best interest to ensure that your solution has been tested and certified.  Why?  Well, a big reason can be found down the yellow brick road.

The Wizard of AHJ or Pay attention to the man behind the curtain…

For those of you who do not know who or what an AHJ is, let me explain.  It stands for Authority Having Jurisdiction.  It may sound really technical, but it really breaks down to the local code inspector for wherever you wish to deploy your containers.  This could be one of the biggest things to pay attention to, as your local code inspector could quickly sink your efforts or considerably increase the cost to deploy your container solution, from both an operational and a capital perspective.

Containers are a relatively new technology, and more than likely your AHJ will not have any familiarity with how to interpret this technology in the local market.  Given the fact that there is not a large sample set for them to reference, their interpretation will be very, very important.  It's important to ensure you work with your AHJ early on.  This is where the UL or CE listing can become important.  An AHJ could potentially interpret your container in one of two ways.  The first is that of a big giant refrigerator.  It's a bad example, but what I mean is: a piece of equipment.  UL and CE listing on the container itself will help with that interpretation.  This should ultimately be the correct interpretation, but the AHJ can do what they wish.  They might look at the container as a confined work space.  They might ask you all sorts of interesting questions, like how often people will be going inside to service the equipment; (if there is no UL/CE listing) they might look at the electrical and mechanical installations and distribution and rule that they do not meet local electrical codes for distances between devices, etc.  Essentially, the AHJ is an all-powerful force who could really screw things up for a successful container deployment.  It's important to note that while UL/CE gives you a great edge, your AHJ could still rule against you.  If he rules the container a confined work space, for example, you might be required to suit your IT workers up in hazmat/thermal suits in two-man teams to change out servers or drives.  Funny?  That's a real example and interpretation from an AHJ.  Which brings us to how important the IT configuration and interpretation are for your use of containers.

Is IT really ready for this?

As you read this section, please keep our Wizard of AHJ in the back of your mind.  His influence will still be felt in your IT world, whether your IT folks realize it or not.  Containers are really best suited to environments with a high degree of automation in the IT function for the services and applications to be run inside them.  If you have an extremely ‘high touch’ environment where you do not have the ability to remotely access servers and need physical human beings to do a lot of care and feeding of your server environment, containers are not for you.  Just picture IT folks dressed up like spacemen.  It definitely requires that you have a great deal of automation and think through some key items.

Let's first look at your ability to remotely image brand new machines within the container.  Perhaps you have this capability through virtualization, or perhaps through software provided by your server manufacturer.  One thing is a fact: this is an almost must-have technology with containers.  Given that a container can come with hundreds to thousands of servers, you really don't want Edna from IT in a container with DVDs, manually loading software images.  Or worse, the AHJ might rule unfavorably and you might have to have two people in suits with the DVDs for safety purposes.

So definitely keep in mind that you really need a way to deploy your images from a central image repository, which then leads to integration with your configuration management systems (asset management systems) and network environments.

Configuration Management and Asset Management systems are also essential to a successful deployment so that the right images get to the right boxes.  Unless you have a monolithic application, this is going to be a key problem to solve.  Many solutions in the market today are based upon the server or device ‘ARP’ing out its MAC address and some software layer intercepting that ARP and correlating the MAC address, through a database, to your image repository or configuration management system.  Otherwise you may be back to Edna, her DVDs, and her AHJ-mandated buddy.
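A bare-bones version of that correlation step might look like the sketch below; the MAC addresses, role names, and image paths are made-up placeholders rather than any particular product's data.

```python
# Hypothetical lookup tables a provisioning listener might consult when a new
# machine ARPs/DHCPs on the container network for the first time.
MAC_TO_ASSET = {
    "00:1a:4b:aa:bb:01": {"asset_id": "srv-0001", "role": "web-frontend"},
    "00:1a:4b:aa:bb:02": {"asset_id": "srv-0002", "role": "batch-worker"},
}
ROLE_TO_IMAGE = {
    "web-frontend": "images/web-frontend-v42.img",
    "batch-worker": "images/batch-worker-v17.img",
}

def image_for_mac(mac: str) -> str:
    """Return the image a newly seen machine should boot, or a safe default."""
    asset = MAC_TO_ASSET.get(mac.lower())
    if asset is None:
        return "images/quarantine.img"   # unknown box: park it rather than guess
    return ROLE_TO_IMAGE[asset["role"]]

print(image_for_mac("00:1A:4B:AA:BB:01"))   # -> images/web-frontend-v42.img
```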

Of course the concept of ARPing brings up your network configuration.  Make sure you put plenty of thought into network connectivity for your container.  Will you have one VLAN or multiple VLANs across all your servers?  Can the network equipment you selected handle the number of machines inside the container?  How your container is configured from a network perspective, and your ability to segment out the servers in a container, could be crucial to your success.  Everyone always blames the network guys for issues in IT, so it's worth having the conversation up front with the network teams on how they are going to address the connectivity A) to the container and B) inside the container from a distribution perspective.
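As one small example of the sizing math involved in that conversation, the sketch below estimates the smallest IPv4 subnet a single container VLAN would need; the server counts are purely illustrative.

```python
import math

def smallest_prefix(hosts_needed: int) -> int:
    """Smallest IPv4 prefix length leaving room for the hosts plus network,
    broadcast, and gateway addresses. Purely illustrative sizing math."""
    return 32 - math.ceil(math.log2(hosts_needed + 3))

for servers in (480, 1500):
    print(f"{servers} servers in one VLAN needs at least a /{smallest_prefix(servers)}")
# 480 servers fit in a /23, while 1,500 need a /21, a very large broadcast domain,
# which is one reason to think hard about one VLAN versus several per container.
```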

As long as I have all this IT stuff, Containers are cheaper than traditional DCs, right?

Maybe.  This blends a little with the next section, specifically around finance things to think about for containers, but it's really sourced from a technical perspective.  Today you purchase containers in terms of total power draw for the container itself: 150kw, 300kw, 500kw, and like denominations.  This ultimately means that you want to optimize your server environments for the load you are using.  Not utilizing the entire power allocation could easily flip the economic benefits of going to containers.  I know what you're thinking: Mike, this is the same problem you have in a traditional data center, so this should really be a push and a non-issue.

The difference here is that you have a higher upfront cost with the containers.  Let's say you are deploying 300kw containers as a standard.  If you never really drive those containers to 300kw, and let's say your average is 100kw, you are only getting 33% of the cost benefit.  If you then add a second container and drive it to like capacity, you may find yourself paying a significant premium for that capacity, at a much higher price point than deploying those servers to traditional raised-floor space, for example.  Since we are brushing up against the economic and financial aspects, let's take a quick look at things to keep an eye on in that space.
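Here is that arithmetic spelled out as a sketch; the capital figure is invented purely to show the shape of the math.

```python
def capital_per_utilized_kw(container_capital: float, average_draw_kw: float) -> float:
    """Effective capital cost per kW you actually use; inputs are hypothetical."""
    return container_capital / average_draw_kw

CAPITAL = 3_000_000   # made-up all-in cost for a 300kw container
print(f"at 300kw: ${capital_per_utilized_kw(CAPITAL, 300):,.0f}/kW")   # $10,000/kW
print(f"at 100kw: ${capital_per_utilized_kw(CAPITAL, 100):,.0f}/kW")   # $30,000/kW
# Driving the container at a third of its rating triples the effective cost of every
# useful kW, which is the "33% of the cost benefit" point made above.
```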

Finance Friendly?

Most people have the idea that containers are ultimately cheaper and therefore those finance guys are going to love them.  They may actually be cheaper or they may not; regardless, there are other things your finance teams will definitely want to take a look at.


The first challenge for your finance teams is to figure out how to classify this new asset called a container.  Traditional asset classifications for IT and data center investments typically fall into 3 categories, from which the rules for depreciation are set.  The first is software; the second is server-related infrastructure such as servers, hardware, racks, and the like; the last is the data center components themselves.  Software investments might be capitalized over anywhere from 1 to 10 years.  Servers and the like typically range from 3 to 5 years, and data center components (UPS systems, etc.) are depreciated closer to 15 to 30 years.  Containers represent an asset that is really a mixed asset class.  The container obviously houses servers that have a useful life (presumably shorter than that of the container housing itself), and it also contains components that would normally be found in the data center and therefore traditionally carry a longer depreciation cycle.  Remember our What's in? What's out? conversation?  So your finance teams are going to have to figure out how they deal with a mixed asset class technology.  There is no easy answer to this.  Some finance systems are set up for this, others are not.  An organization could move to treat it in an all-or-nothing fashion.  For example, if the entire container is depreciated over a server life cycle, it will dramatically increase the depreciation hit for the business.  If you opt to depreciate it over the longer-lived items, then you will need to figure out how to account for the fact that the servers within will be rotated much more frequently.  I don't have an easy answer to this, but I can tell you one thing: if your finance folks are not looking at containers along with your facilities and IT folks, they should be.  They might have some work to do to accommodate this technology.
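A toy comparison of those all-or-nothing treatments might look like the following; the dollar figures and useful lives are invented for illustration, and straight-line depreciation is used only because it is the simplest method to show.

```python
def annual_straight_line(cost: float, years: int) -> float:
    """Straight-line depreciation: equal expense in each year of the useful life."""
    return cost / years

# Hypothetical mixed asset: $2M of servers inside, $1M of container shell, power,
# and cooling gear that would normally be treated as longer-lived plant.
servers, shell = 2_000_000, 1_000_000

all_as_servers = annual_straight_line(servers + shell, 4)   # whole container on a server cycle
all_as_plant = annual_straight_line(servers + shell, 15)    # whole container on a facility cycle
split = annual_straight_line(servers, 4) + annual_straight_line(shell, 15)

print(f"all on 4 yrs: ${all_as_servers:,.0f}/yr, all on 15 yrs: ${all_as_plant:,.0f}/yr, "
      f"split: ${split:,.0f}/yr")
# Roughly $750,000 vs $200,000 vs $567,000 per year: the classification decision alone
# swings the annual depreciation hit by hundreds of thousands of dollars per container.
```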

Related to this, you might also want to think about containers from an insurance perspective.   How is your insurer looking at containers, and how do they allocate cost versus risk for this technology set?  You’re likely going to have some detailed conversations to bring them up to speed on the technology by and large.  You might find they require you to put in additional fire suppression (it’s a metal box; if something catches on fire inside, it should naturally be contained, right?).  What about the burning plastics?  How is water delivered to the container for cooling?  Where and how does electrical distribution take place?   These are all questions that could adversely affect the cost or operation of your container deployment, so make sure you loop them in as well.

Operations and Containers

Another key area to keep in mind is how your operational environments are going to change as a result of the introduction of containers.   Let’s jump back a second and return to our insurance examples.   A container could weigh as much as 60,000 pounds (US).  That is pretty heavy.  Now imagine you accidentally smack into a load-bearing wall or column as you try to push it into place.  That is one area where Operations and Insurance are going to have to work together.   Is your company licensed and bonded for moving containers around?  Does your area have union regulations requiring that only certified and bonded union personnel do that kind of work?   These are important questions and things you will need to figure out from an Operations perspective.

Going back to our What’s in and What’s out conversation – you will need to ensure that you have the proper maintenance regimen in place to facilitate the success of this technology.    Perhaps the stuff inside is part of the contract you have with your container manufacturer.  Perhaps it’s not.   What work will need to take place to properly support that environment?   If you have batteries in your container, how do you service them?  What’s the Wizard of AHJ ruling on that?

The point here is that an evaluation of containers must be multi-faceted.  If you only look at this solution from a technology perspective, you are creating a very large blind spot for yourself that will likely have a significant impact on the success of containers in your environment.

This document is really meant to be the first installment of an evolving watch of the industry as it stands today. I will add observations as I think of them and repost accordingly over time. Likely (and hopefully) many of these challenges and things to think about will get solved over time, and I remain a strong proponent of this technology application.   The key is that you cannot look at containers purely from a technology perspective.  There is a multitude of other factors that will make or break the use of this technology.  I hope this post helped answer some questions, or at least forced you to think a bit more holistically about the use of this interesting and exciting technology.

\Mm

Data Center Junk Science: Thermal Shock \ Cooling Shock

I recently performed an interesting exercise where I reviewed typical co-location/hosting/data center contracts from a variety of firms around the world.    If you ever have a few long plane rides to take and would like an incredible pile of boring legalese to review, I still wouldn’t recommend it.  :)

I did learn quite a bit from going through the exercise, but there was one condition that I came across more than a few times.   It is one of those things that I put into my personal category of Data Center Junk Science.   I have a bunch of these things filed away in my brain, but this one not only raises my stupidity meter from a technological perspective, it makes me wonder whether those who require it have masochistic tendencies.

I am of course referring to a clause for Data Center Thermal Shock and, as I discovered, its evil, lesser-known counterpart, “Cooling” Shock.    For those of you who have not encountered this before, it’s a provision between hosting customer and hosting provider (most often required by the customer) that usually looks something like this:

If the ambient temperature in the data center rises 3 degrees over the course of 10 (sometimes 12, sometimes 15) minutes, the hosting provider will need to remunerate (reimburse) the customer for thermal shock damages experienced by the computer and electronics equipment.  The damages range from flat-fee penalties to graduated penalties based on the value of the equipment.
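
For a sense of what “enforcing” that language actually means, here is a minimal sketch that scans a time-stamped inlet temperature log for any window in which the temperature rises more than the contractual delta. The threshold, window length, and sample readings are all hypothetical.

```python
# A minimal sketch of checking a temperature log against a "thermal shock"
# clause: flag any window of N minutes in which the temperature rises more
# than the contractual delta. Threshold, window, and data are hypothetical.

from datetime import datetime, timedelta

def thermal_shock_events(samples, max_delta_f=3.0, window=timedelta(minutes=10)):
    """samples: list of (datetime, temp_f) sorted by time. Returns offending windows."""
    events = []
    for i, (t_start, temp_start) in enumerate(samples):
        for t_end, temp_end in samples[i + 1:]:
            if t_end - t_start > window:
                break
            if temp_end - temp_start > max_delta_f:
                events.append((t_start, t_end, temp_end - temp_start))
                break
    return events

log = [
    (datetime(2010, 6, 1, 14, 0), 72.0),
    (datetime(2010, 6, 1, 14, 5), 73.5),
    (datetime(2010, 6, 1, 14, 9), 75.5),  # +3.5 F in 9 minutes -> a "violation"
]
for start, end, delta in thermal_shock_events(log):
    print(f"{start:%H:%M}-{end:%H:%M}: rose {delta:.1f} F inside the window")
```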

This clause may be rooted in the fundamental belief (and one I subscribe to, given many personally witnessed tests and trials) that it’s not high temperature that servers dislike, but rather rapid change of temperature.   In my mind this is a basic tenet of where the industry is evolving, toward higher operating temperatures in data center environments.    My problem with this clause is more directed at the actual levels, metrics, and durations typically found in the requirement.  It smacks of a technical guy gone wild trying to prove to everyone how smart he or she is, all the while giving some insight into how myopic their viewpoint may be.

First let’s take a look at the 3-degree temperature change.  This number ranges anywhere between 3 and 5 degrees in most contracts I reviewed that had the clause.   The problem here is that even with strict adherence to the most conservative approach to running and managing data centers today, a 3-to-5-degree delta easily keeps you within even the old ASHRAE recommendations.  If we look at airflow and temperatures at a micro scale within the server itself, the inlet air temperatures are likely to vary within that range depending on the level of utilization the box might be at.   This ultimately means that a customer who has periods of high compute might themselves be violating this very clause, even if only for a few minutes.

Which brings up the next component: duration.   Whether you are speaking of 10-minute or 15-minute intervals, these are nice, long, leisurely periods of time that could hardly cause a “shock” to equipment.   Also keep in mind the previous point, which is that the environment has not even violated the ASHRAE temperature range.   In addition, I would encourage people to actually read the allowed and tested temperature ranges that manufacturers recommend for server operation.   A 3-to-5-degree swing in temperature would rarely push a server into an operating temperature range that would violate the range the server has been rated to work in or, worse, void the warranty.

The topic of “Chilled Shock” or “Cooling Shock”, which is the same thing but having to do with cooling intervals and temperatures, is just as ludicrous.  Perhaps even more so!

I got to thinking that maybe my experiential knowledge might be flawed.  So I went in search of white papers, studies, and technical dissertations on the potential impact and failures associated with these characteristics.   I went looking, and looking, and looking, and ….guess what?   Nothing.   There is no scientific data anywhere that I could find to corroborate this ridiculous clause.   Sure, there are some papers regarding running consistently hot and the related failures, but in those studies the effects can easily be balanced against a server’s depreciation cycle.

So why would people really require that this clause get added to the contract?  Are they really that draconian about it?   I went and asked a bunch of account managers I know (both from my firm and outside) about the customers who typically ask for it.   The answer I got was surprising: there was a consistent percentage (albeit small) of customers out there who required this in their contracts and pushed for it aggressively.  Even more surprising to me was that these were typically folks on the technical side of the house rather than the lawyers or business people.  I mean, these are the folks who should be more in tune with logic than, say, business or legal people who can get bogged down in the letter of the law or dogmatic adherence to how things have been done.  Right?  I guess not.

But this brings up another important point.  Many facilities might experience a chiller failure, a CRAH failure, or some other event which might temporarily have this effect within the facility.    Let’s say it happens twice in one year that you potentially trigger this event for the whole or a portion of your facility (you’re probably not doing preventative maintenance – bad you!).  So the contract language around thermal shock now claims monetary damages.   Based on what?   How are these sums defined?   The contracts I read through had some wild oscillations on damages, with different means of calculation, and a whole lot more.   So what is the basis of this damage assessment?   Again, there are no studies that say each event takes .005 minutes off a server’s overall life, or anything like that.   So the cost calculations are completely arbitrary and negotiated between provider and customer.

This is where the true foolishness comes in.   The providers know that these events, while rare, might happen occasionally.   While the event may be within all other service level agreements, they still might have to award damages.   So what might they do in response?   They increase their prices, of course, to cover their risk.   It might be in the form of cost per kW, or cost per square foot, and it might even be pretty small or minimal compared to your overall costs.  But in the end, the customer ends up paying more for something that might never happen, and if it does there is no concrete proof it has any real impact on the life of the server or equipment; it really only salves the whim of someone who failed to do their homework.  If it never happens, the hosting provider is happy to take the additional money.

 

\Mm

Modular Evolution, Uptime Resolution, and Software Revolution

It’s a little-known fact, but software developers are costing enterprises millions of dollars, and I don’t think in many cases either party realizes it.   I am not referring to the actual cost of purchase for the programs and applications or even the resulting support costs.   Those are easily calculated and can be hard-bounded by budgets.   But what of the resulting costs of the facility in which the software resides?

The Tier System introduced by the Uptime Institute was an important step in our industry in that it gave us a common language or nomenclature in which to actually begin having a dialog on the characteristics of the facilities that were being built. It created formal definitions and classifications from a technical perspective that grouped up redundancy and resiliency targets, and ultimately defined a hierarchy in which to talk about those facilities that were designed to those targets.   For its time it was revolutionary and to a large degree even today the body of work is still relevant. 

There is a lot of criticism that its relevancy is fading fast due to the model’s greatest weakness, which resides in its lack of significant treatment of the application.    The basic premise of the Tier System is essentially to take your most restrictive and constrained application requirements (i.e., the application that’s least robust) and augment that resiliency with infrastructure and what I call big iron.   If only 5% of your applications are this restrictive, then the other 95% of your applications, which might be able to live with less resiliency, will still reside in the castle built for the minority of needs.  But before you call out an indictment of the Uptime Institute or this “most restrictive” design approach, you must first look at your own organization.   The Uptime Institute was coming at this from a purely facilities perspective.  The mysterious workload and wizardry of the application is a world mostly foreign to them.   Ask yourself this question – ‘In my organization, how often do IT and facilities talk to one another about end-to-end requirements?’  My guess, based on asking this question hundreds of times of customers and colleagues, ranges between not often and not at all.  But the winds of change are starting to blow.

In fact, I think the general assault on the Tier System really represents a maturing of the industry, looking at our problem space with more combined wisdom.   I often laughed at the fact that human nature (or at least management human nature) used to hold a belief that a Tier 4 data center was better than a Tier 2 data center, effectively because the number was higher and it was built with more redundancy.   More redundancy essentially equaled better facility.    A company might not have had the need for that level of physical systems redundancy (if one were to look at it from an application perspective), but Tier 4 was better than Tier 3, therefore we should build the best.   It’s not better, just different.

By the way, that’s not a myth that the design firms and construction firms were all that interested in dispelling either.   Besides Tier 4 having the higher number and more redundancy, it also cost more to build, required significantly more engineering, and took longer to work out the kinks.   So the myth of Tier 4 being the best has propagated for quite a long time.  I’ll say it again: it’s not better, it’s just different.

One of the benefits of the recent economic downturn (there are not many, I know) is that the definition of ‘better’ is starting to change.  With capital budgets frozen or shrinking, the willingness of enterprises to redefine ‘better’ is also changing significantly.   Better today means a smarter, more economical approach.   This has given rise to the boom in the modular data center approach, and it’s not surprising that this approach begins with what I call an application-level inventory.

This application-level inventory first looks specifically at the makeup and resiliency of the software and applications within the data center environments.  Does this application need the level of physical fault tolerance that my enterprise CRM needs?  Do servers that support testing or internal labs need the same level of redundancy?  This is the right behavior and the one that I would argue should have been used from the beginning.  The data center doesn’t drive the software; it’s the software that drives the data center.

One interesting and good side effect of this is that enterprise firms are now pushing harder on the software development firms.    They are beginning to ask some very interesting questions that the software providers have never been asked before.    For example, I sat in one meeting where an end customer asked their financial systems application provider a series of questions on the inter-server latency requirements and transaction timeout lengths for database access in their solution suite.  The reason behind this line of questioning was a setup for the next series of questions.   Once the numbers were provided, it became abundantly clear that this application would only truly work from one location, from one data center, and could not be made redundant across multiple facilities.  This led to questions around the provider’s intentions to build more geo-diverse and extra-facility capabilities into their product.   I am now even seeing these questions in official Requests for Information (RFIs) and Requests for Proposal (RFPs).   The market is maturing and is starting to ask an important question – why should your sub-million-dollar (or euro) software application drive tens of millions in capital investment by me?  Why aren’t you architecting your software to solve this issue?  The power of software can be brought to bear to easily solve this issue, and my money is on the fact that this will be a real battlefield in software development in the coming years.
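
The arithmetic behind that line of questioning is not complicated, which is what makes it so effective. Here is a sketch of the go/no-go math a customer might run with the vendor’s answers in hand; all of the numbers are hypothetical and not drawn from any real product.

```python
# A sketch of the feasibility check behind those vendor questions (all figures
# are hypothetical): given the application's tolerated database round-trip
# time and its transaction timeout, can it be split across two data centers?

def can_span_sites(app_max_db_rtt_ms: float,
                   app_txn_timeout_ms: float,
                   inter_site_rtt_ms: float,
                   db_calls_per_txn: int) -> bool:
    """True only if both the per-call latency budget and the overall
    transaction timeout survive the added inter-site round trips."""
    per_call_ok = inter_site_rtt_ms <= app_max_db_rtt_ms
    txn_ok = db_calls_per_txn * inter_site_rtt_ms < app_txn_timeout_ms
    return per_call_ok and txn_ok

# Vendor-supplied numbers (hypothetical): 2 ms max DB round trip, 500 ms
# transaction timeout, 40 chatty DB calls per transaction; the two candidate
# facilities are 18 ms apart.
print(can_span_sites(app_max_db_rtt_ms=2, app_txn_timeout_ms=500,
                     inter_site_rtt_ms=18, db_calls_per_txn=40))   # -> False
```

With 40 chatty database calls per transaction and an 18 ms round trip between sites, the per-call latency budget is blown before the transaction timeout even enters the picture, so the application lives in one facility.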

Blending software expertise with operational and facility knowledge will be at the center of a whole new line of software development, in my opinion.  One that really doesn’t exist today, and given the dollar amounts involved, I believe it will be a very impactful and fruitful line of development as well.    But it has a long way to go.    Most programmers coming out of universities today rarely question the impact of their code outside of the functions they are providing, and the number of colleges and universities that teach a holistic approach can be counted on less than one hand’s worth of fingers worldwide.   But that’s up a finger or two from last year, so I am hopeful.

Regardless, while there will continue to be work on data center technologies at the physical layer, there is a looming body of work yet to be tackled facing the development community.  Companies like Oracle, Microsoft, SAP, and hosts of others will be thrust into the fray to solve these issues as well.   If they fail to adapt to the changing face and economics of the data center, they may just find themselves as an interesting footnote in data center texts of the future.

 

\Mm

More Chiller Side Chat Redux….

I have been getting continued feedback on the Chiller Side Chat that we did live on Monday, September 14th.  I wanted to take a quick moment and discuss one of the recurring themes of emails I have been receiving on the topic of data center site selection and the decisions that result at the intersection of data center technology, process and cost.    One of the key things that we technical people often forget is that the data center is first and foremost a business decision.  The business (whatever kind of business it is) has a requirement to improve efficiency through automation, store information, or whatever it is that represents the core function of that business.  The data center is at the heart of those technology decisions and the ultimate place where those solutions will reside.  

As the primary technical folks in an organization, whether you represent IT or Facilities, we can find ourselves in the position of getting deeply involved with the technical aspects of the facility – the design, the construction or retrofit, the amount of power or cooling required, the amount of redundancy we need, and the like.  Those in upper management, however, view this in a substantially different way.    It’s all about business.  As I have gotten a slew of these mails recently, I decided to try and post my own response.  As I thought about how I would go about this, I kept going back to Chris Crosby’s discussion at Data Center Dynamics about two years ago.   As you know, I was at Microsoft at the time and felt that he did an excellent job of outlining the way the business person views data center decisions.    So I went digging around and found this video of Chris talking about it.  Hopefully this helps!  If not, let me know and I am happy to discuss further or more specifically.

\Mm

Miss the “Live” Chiller Side Chat? Hear it here!

The folks who were recording the “Live” Chiller Side Chat have sent me a link to the recording.    If you were not able to make the event live but are still interested in hearing how it went, feel free to have a listen at the following link:

 

LIVE CHILLER SIDE CHAT

 

\Mm

“We Can’t Afford to measure PUE”

One of the more interesting phenomena that I experience as I travel and talk with customers and industry peers is that there is a significant number of folks out there who believe they cannot measure PUE because they cannot afford, or lack the funding for, the type of equipment and systems needed to properly measure their infrastructure.  As a rule I believe this to be complete hogwash, as there are ways to measure PUE without any additional equipment (I call it SneakerNet, or Manual SCADA).   One can easily look at the power draw off the UPS and compare that to the information in their utility bills.  It’s not perfect, but it gives you a measure that you can use to improve your efficiency.  As long as you are consistent in your measurement rigor (regular intervals, same time of day, etc.) you can definitely achieve better and better efficiency within your facility.
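
Here is what that SneakerNet math looks like as a sketch, with hypothetical readings: the utility bill supplies total facility energy, and UPS output readings taken on a consistent schedule stand in for IT energy.

```python
# A "SneakerNet" PUE sketch (readings are hypothetical): total facility energy
# comes off the utility bill, IT energy is approximated from UPS output
# readings sampled on a consistent schedule, and PUE is the ratio.

def manual_pue(utility_kwh_for_period: float,
               ups_output_kw_readings: list[float],
               hours_in_period: float) -> float:
    """PUE = total facility energy / IT energy."""
    avg_it_kw = sum(ups_output_kw_readings) / len(ups_output_kw_readings)
    it_kwh = avg_it_kw * hours_in_period
    return utility_kwh_for_period / it_kwh

# One month of hypothetical data: 720 hours, 450,000 kWh on the bill, and
# weekly UPS output readings taken at the same time of day.
print(round(manual_pue(450_000, [330, 345, 340, 335], 720), 2))  # ~1.85
```

It is crude, and it leans on the assumption that UPS output is a fair proxy for IT load, but tracked month over month with the same rigor it absolutely tells you whether you are trending in the right direction.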

Many people have pushed back on me, saying that measurement closer to the PDU or rack is more important, and for that one needs a full-blown branch circuit monitoring solution.   While to me increased efficiency is more about vigilance in understanding your environment, I have struggled to find an affordable solution for folks who want better granularity.

Now that I have been in management for the better part of my career, I have had to closet the inner geek in me to home and weekend efforts.   Most of my friends laugh when they find out I essentially have a home data center comprised of a hodgepodge of equipment I have collected over the years.  This includes things like racks of different sizes (it has been at least a decade since I have seen a half-height rack in a facility, but I have two!), my personal Cisco certification lab, Linux servers, Microsoft servers, and a host of other odds and ends.  It’s a personal playground for me to try and remain technical despite my management responsibilities.

It’s also a great place for me to try out new gear from time to time, and I have to say I found something that might fit the bill for those folks who want a deeper understanding of power consumption in their facilities.   I rarely give product endorsements, but this is something that budget-minded facilities folks might really like to take a look at.

I received a CL-AMP IT package from the Noble Vision Group to review and give them some feedback on their kit.   The first thing that struck me was that this was essentially a “power metering for dummies” kit.    There were a couple of really neat characteristics out of the box that took many of the arguments I usually hear right off the table.


First, the “clamp” itself is a non-intrusive, non-invasive way to get accurate power metering and results.   This means that, contrary to other solutions, I did not have to unplug existing servers and gear to get readings, or try to install the device inline.  I simply clamped the power coming into the rack (or a server) and POOF! I had power information. It was amazingly simple. Next up – I had heard that clamp-like devices were not as accurate, so I did some initial tests using an older IP-addressable power strip which allowed me to get power readings for my gear.   I then used the CL-AMP device to compare, and the two were consistently within +/- 2% of each other.  As far as accuracy goes, I am calling it a draw because, to be honest, it’s a garage-based data center and I am not really sure how accurate my old power strips are.   Regardless, the CL-AMPS allowed me a very easy way to get my power readings without disrupting the network.  Additionally, it’s mobile, so I can move it around if I want to.  This is important for those who might be budget-challenged, as the price point for this kit would be far cheaper than a full-blown branch circuit solution.

While my experiment was far from completely scientific, and I am the last person to call myself a full-blown facilities engineer, one thing was clear: this solution can easily fill a role as a mobile, quick-hit way to measure power usage (and ultimately PUE) in your facility that doesn’t break the bank or force you to disrupt operations or install devices inline.

\Mm

Chiller-Side Chats : Is Power Capping Ready for PrimeTime?

I was very pleased at the great many responses to my data center capacity planning chat.  They came in both public and private notes, with more than a healthy portion of them centered on my comments on power capping and disagreement with why I don’t think the technology/applications/functionality is 100% there yet. So I decided to throw up an impromptu, ad-hoc follow-on chat on power capping.  How’s that for service?

What’s your perspective?

In a nutshell, my resistance can be summed up in the exploration of two phrases.  The first is ‘prime time’ and how I define it from where I come at the problem.  The second is the definition of the term ‘data center’ and in what context I am using it as it relates to power capping.

I think to adequately address my position I will answer it from the perspective of the three groups that these Chiller Side Chats are aimed at, namely the facility side, the IT side, and ultimately the business side of the problem.

Let’s start with the latter phrase: ‘data center’.  To the facility manager this term refers to the actual building, room, and infrastructure that IT gear sits in.   His definition of data center includes things like remote power panels, power whips, power distribution units, Computer Room Air Handlers (CRAHs), generators, and cooling towers.   It all revolves around the distribution and management of power.

From an IT perspective the term is usually represented or thought of in terms of servers, applications, or network capabilities.  It sometimes blends in some aspects of the facility definition, but only as they relate to servers and equipment.   I have even heard it applied to “information,” which is even more ethereal.  Its base units could be servers, storage capacity, network capacity, and the like.

From a business perspective the term ‘data center’ is usually lumped together to include both IT and facilities, but at a very high level.  Where the currency for our previous two groups is technical in nature (power, servers, storage, etc.), the currency for the business side is cold, hard cash.   It involves things like OPEX costs, CAPEX costs, and return on investment.

So from the very start, one has to ask: which data center are you referring to?  Power capping is a technical issue and can be implemented from either of the two technical perspectives.   It will also have an impact on the business aspect, and that business aspect can also be a barrier to adoption.

We believe these truths to be self-evident

Here are some of the things that I believe to be inalienable truths about data centers today – and, for some of them, probably forever if history is any indication.

  1. Data Centers are heterogeneous in the makeup of their facilities equipment, with different brands of equipment across the various functions.
  2. Data Centers are largely heterogeneous in the makeup of their server population, network population, etc.
  3. Data Centers house non-server equipment like routers, switches, tape storage devices, and the like.
  4. Data Centers generally have differing designs, redundancy, floor layouts, and PDU distribution configurations.
  5. Today most racks are unintelligent; those that are not are vendor-specific and/or proprietary – and also expensive compared to plain bent steel.
  6. Except in a very few cases, there is NO integration of asset management, change management, incident management, and problem management systems across IT *AND* facilities.

These will be important in a second, so mark this spot on the page, as it ties into my thoughts on the definition of prime time.  You see, to me in this context, prime time means that when a solution is deployed it will actually solve problems and reduce the number of things a data center manager has to do or worry about.   This is important because, notice, I did not say anything about making something easier.  Sometimes easier doesn’t solve the problem.

There is some really incredible work going on at some of the server manufacturers in the area of power capping.   After all, they know their products better than anyone.  For gratuitous purposes, because he posts and comments here, I refer you to the Eye on Blades blog at HP by Tony Harvey.  In his post responding to the previous Chiller-Side Chat, he talked up the amazing work that HP is doing, which is already available on some G5 boxes and all G6 boxes, along with additional functionality available in the blade enclosures.

Most of the manufacturers are doing a great job here.  The dynamic load stuff is incredibly cool as well.    However, the business side of my brain requires that I state that this level of super-cool wizardry usually comes at additional cost.   Let’s compare that with Howard, the everyday data center manager who does this work today and who, from a business perspective, is a sunk cost.   He’s essentially free.   Additionally, simple things like performing an SNMP poll for power draw on a box (which used to be available in some server products for free) have been removed or can only be accessed through additional operating licenses.  Read: more cost.    So the average business is faced with getting this capability for servers at an additional cost, or making Howard the data center manager do it for free and knowing that his general fear of losing his job if things blow up is a good incentive for doing it right.

Aside from that, it still has challenges with Truth #2.  Extremely rare is the data center that uses only one server manufacturer.  While it’s the dream of most server manufacturers, it’s more common to find Dell servers alongside HP servers alongside Rackable.  Add to that the fact that even within the same family you are likely to see multiple generations of gear.  Does the business have to buy into the proprietary solutions of each to get the functionality it needs for power capping?  Is there an industry standard in power capping that ensures we can all live in peace and harmony?  No.  Again, that pesky business part of my mind says cost-cost-cost.  Hey Howard – go do your normal manual thing.

Now let’s tackle Truth #3 from a power capping perspective.   Solving the problem from the server side is only solving part of the problem.   How many network gear manufacturers have power capping features?  You would be hard pressed to need more than one hand to count them.   In a related thought, one of the standard connectivity trends in the industry is top-of-rack switching.  Essentially, for purposes of distribution, a network switch is placed at the top of the rack to handle server connectivity to the network.     Does our proprietary power capping software catch the power draw of that switch?  Any network gear, for that matter?  Doubtful.  So while I may have super-cool power capping on my servers, I am still screwed at the rack layer – which is where data center managers manage from, as one of their base units.   Howard may have some level of surety that his proprietary server power capping stuff is humming along swimmingly, but he still has to do the work manually.   It’s definitely simpler for Howard, and he may get that task done quicker, but we have not actually reduced steps in the process.   Howard is still manually walking the floor.

Which brings up a good point: Howard the data center manager manages by his base unit, the rack.  In most data centers, racks can house different server manufacturers, different equipment types (servers, routers, switches, etc.), and can even be of different sizes.    While some manufacturers have built state-of-the-art racks specific to their equipment, that doesn’t solve the problem.  We have now stumbled upon Truth #5.

Since we have been exploring how current power capping technologies meet at the intersection of IT and facilities, it brings up the last point I will touch on: tools. I will get there by asking some basic questions about the operations of a typical data center.  In terms of operations, does your IT asset management system provide for racks as an item of configuration?  Does your data center manager use the same system?  Does your system provide for multiple power variables?  Does it track power at all?  Does the rack have a power configuration associated with it?  Or does your version of Howard use spreadsheets?  I know where my bet is on your answers.  Tooling has a long way to go in this space.   Facilities vendors are trying to approach it from their perspective, and IT tools providers are doing the same, along with tools and mechanisms from the equipment manufacturers as well. There are a few tools that have been custom developed to do this kind of thing, but they have been built for use in very specific environments.  We have finally arrived at power capping and Truth #6.

Please don’t get me wrong, I think that ultimately power capping will fulfill its great promise and do tremendous wonders.  It’s one of those rare areas which will have a very big impact on this industry.   If you have the ability to deploy the vendor-specific solutions (which are indeed very good), you should. It will make things a bit easier, even if it doesn’t remove steps.   However, I think that ultimately, in order to have real effect, it’s going to have to compete with the cost of free.   Today this work is done by the data center managers with no apparent additional cost from a business perspective.   If I had some kind of authority, I would call for a standard to be put in place around power capping.  Even if it’s quite minimal it would have a huge impact.   It could be as simple as providing three things.  First, provide free and unfiltered access to an SNMP MIB that exposes the current power usage of any IT-related device.  Second, provide a MIB which, through the use of a SET command, could place a hard upper limit on power usage; this setting could be read by the box and/or the operating system, which would start to slow things down or starve resources on the box for a time.  Lastly, provide the ability to read back that same cap.    This would allow the poor, cheap Howards of the world to at least simplify their environments tremendously.  It would still allow software and hardware manufacturers to build and charge for the additional, dynamic features they would require.
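
To show just how minimal that standard could be, here is a sketch of those three operations expressed as plain Python rather than real SNMP calls. No such standard MIB exists today, and the OIDs below are made up purely for illustration.

```python
# A sketch of the three proposed operations: a readable current-draw value, a
# settable cap, and the ability to read that cap back. Plain Python stand-in,
# not real SNMP; the OIDs are hypothetical and do not exist in any vendor MIB.

from typing import Optional

HYPOTHETICAL_OID_CURRENT_DRAW = "1.3.6.1.4.1.99999.1.1"  # read-only, watts
HYPOTHETICAL_OID_POWER_CAP    = "1.3.6.1.4.1.99999.1.2"  # read/write, watts

class PowerMib:
    """Stand-in for the minimal standard: one gettable draw value and one
    settable/gettable cap value per IT device."""
    def __init__(self, current_draw_w: int, cap_w: Optional[int] = None):
        self._draw = current_draw_w
        self._cap = cap_w

    def get(self, oid: str) -> Optional[int]:
        if oid == HYPOTHETICAL_OID_CURRENT_DRAW:
            return self._draw
        if oid == HYPOTHETICAL_OID_POWER_CAP:
            return self._cap
        raise KeyError(oid)

    def set(self, oid: str, watts: int) -> None:
        if oid != HYPOTHETICAL_OID_POWER_CAP:
            raise PermissionError("only the cap is writable")
        self._cap = watts  # the device/OS would throttle itself toward this value

# Howard's walk of a mixed rack, servers and switch alike (hypothetical draws):
rack = {"dell-1": PowerMib(410), "hp-2": PowerMib(385), "tor-switch": PowerMib(120)}
total = sum(dev.get(HYPOTHETICAL_OID_CURRENT_DRAW) for dev in rack.values())
rack["dell-1"].set(HYPOTHETICAL_OID_POWER_CAP, 350)
print(f"Rack draw: {total} W; dell-1 capped at {rack['dell-1'].get(HYPOTHETICAL_OID_POWER_CAP)} W")
```

That is the whole ask: one readable draw value, one settable cap, and the ability to read the cap back, on every device in the rack including the switch.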

\Mm