Lots of interest in the MicroDC, but do you know what I am getting the most questions about?

 Scott Killian of AOL talks about the MicroDC

Last week I put up a post about how 25% of all AOL.com traffic now runs through our MicroDC infrastructure. There was a great follow-up post by James LaPlaine, our VP of Operations, on his blog Mental Effort, which goes into even greater detail. While many of the email inquiries I get have been about the technology itself, surprisingly a large majority of the notes have been questions about how to make your software, applications, and development efforts ready for such an infrastructure, and what a realistic timeline for doing so looks like.

The general response, of course, is that it depends. If you are a web-based platform or property focused solely on Internet-based consumers, or a firm that needs a diversified presence in different regions without the hefty price tag of renting or taking down additional space, this may be an option. However, many enterprise applications have been written in a way that is highly dependent upon localized infrastructure and short, application-level latencies, and they lack adequate scaling. So for more corporate data center applications this may not be a great fit. It will take some time for the big traditional application firms to truly build out their software to work in an environment like this (they may never do so). I suspect most will take an easier approach and try to 'cloudify' their own applications and run them within their own infrastructure or data centers under their control. This essentially allows them to control the access portion of users' needs, but continue to rely on the same kinds of infrastructure you might have in your own data center to support it. It's much easier to build a web-based application which then connects to a traditional IT environment than to truly build out infrastructure capable of accommodating scale. I am happy to continue answering questions as they come up, but since the response on this topic has been overwhelming I thought I would throw something quick up here that will hopefully help.

 

\Mm

On Micro Datacenters, Sandy, Supercomputing 2012, and Coding for Containerized Data Centers….


As everyone is painfully aware, last week the United States saw the devastation caused by Superstorm Sandy. My original intention was to talk about yet another milestone with our Micro Data Center approach. As the storm slammed into the East Coast, I felt it was probably a bad time to talk about achieving something significant, especially as people were suffering through the storm's aftermath. In fact, after the storm AOL kicked off an incredible supplies drive and sent truckloads of goods up to the worst of the affected areas.

So, here we are a week after the storm, and while people are still in need and suffering, it is clear that the worst is over and the clean-up and healing have begun. It turns out that Superstorm Sandy also let us test another interesting case in the journey of the Micro Data Center, which I will touch on below.

25% of ALL AOL.COM Traffic runs through Micro Data Centers

I have talked about the potential value of our use of Micro Data Centers and the pure agility and economics the platform will provide for us. Up until this point we had used this technology in pockets; think of our explorations as focusing on beta and demo environments. But that all changed in October when we officially flipped the switch and began taking production traffic for AOL.COM with the Micro Data Center. We are currently (and have been since flipping the switch) running about 25% of all traffic coming to our main web site. This is an interesting achievement in many ways. First, from a performance perspective we are manually limiting the platform (it could do more!) to ~65,000 requests per minute and a traffic volume of about 280 Mbits per second. To date I haven't seen many people post performance statistics about applications running on modular capacity, so hopefully this is relevant and interesting in terms of the volume of load an approach like this can take. We celebrated this at a recent All-Hands by plugging an internal version of our MDC into the conference room. To prove our point we added it to the global pool of capacity for AOL.com and started taking production traffic right there at the conference facility. This demonstrates in large part the value, agility, and mobility a platform like this can bring to bear.

Scott Killian, AOL's data center guru, talks about the deployment of AOL's Micro Data Center. An internal version went 'live' during the talk.

 

As I mentioned before, Superstorm Sandy threw us another curveball as the hurricane crashed into the Mid-Atlantic. While Virginia was not hit anywhere near as hard as New York and New Jersey, there were incredible sustained winds, tumultuous rains, and storm-related damage everywhere. Through it all, our outdoor version of the MDC weathered the storm just fine and continued serving traffic for AOL.com without fail.

 

This kind of Capability is not EASY or Turn-Key

That's not to say there isn't a ton of work to do to get an application to work in an environment like this. If you take the problem space at its different levels, whether it be DNS, load balancing, network redundancy, configuration management, underlying application-level timeouts, or system dependencies like databases and other information stores, the non-infrastructure work and coding is not insignificant. There is a huge amount of complexity in running a site like AOL.com: lots of interdependencies, sophistication, advertising-related collection and distribution, and the like. It's safe to say that this is not as simple as throwing an Apache/Tomcat instance into a VM.
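To make just one of those layers concrete, here is a minimal sketch of the kind of application-level timeout and failover logic a service needs before it can comfortably live across many small, geo-distributed sites. This is illustrative Python only; the endpoint names and timeout values are made up, not anything from AOL's actual code base.

    # Illustrative only: endpoint URLs and timeouts are invented values,
    # not AOL's actual configuration.
    import urllib.request

    # Ordered list of regional endpoints; any one MicroDC may disappear at any time.
    ENDPOINTS = [
        "http://mdc-us-east.example.com/api/profile",
        "http://mdc-us-west.example.com/api/profile",
        "http://legacy-dc.example.com/api/profile",
    ]

    def fetch_profile(user_id, per_try_timeout=0.5):
        """Try each endpoint in turn with a short timeout instead of
        assuming the 'local' backend is always up."""
        last_error = None
        for url in ENDPOINTS:
            try:
                with urllib.request.urlopen(f"{url}?user={user_id}",
                                            timeout=per_try_timeout) as resp:
                    return resp.read()
            except OSError as err:      # connection refused, timeout, DNS failure...
                last_error = err        # remember it and move on to the next site
        raise RuntimeError(f"all endpoints failed: {last_error}")

It looks trivial here, but multiply that pattern across every dependency a large site has and you start to see where the real engineering effort goes.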

I have talked for quite a while about the idea Netflix engineers originally coined with their Chaos Monkey: the ability, the development paradigm, or even the rogue processes that force your applications to survive significant infrastructure and application-level outages. It's essentially taking the redundancy out of the infrastructure and putting it into the code. While extremely painful at the start, the long-term savings are proving hugely beneficial. For most companies this is still something futuristic, very far out there. They may be beholden to software manufacturers and developers to start thinking this way, which may take a very, very long time. Infrastructure is the easy way to solve it. It may be easy, but it's not cheap. Nor, if you care about the environmental angle, is it very 'sustainable' or green. Limit the infrastructure, limit the waste. While we haven't really thought about it in terms of rolling it up into our environmental positions, perhaps we should.
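If that still sounds abstract, the pattern below is a toy illustration of what 'redundancy in the code' means in practice. It is not Netflix's Chaos Monkey or anything we run at AOL; the failure rates and function names are invented purely to show the shape of the idea: inject failures on purpose, and make the application, not the infrastructure, own the recovery.

    # A toy illustration of the 'redundancy in the code' idea.
    # Rates and names are made up for the example.
    import random

    def chaotic(failure_rate=0.05):
        """Decorator that randomly fails calls in test environments so the
        caller is forced to handle infrastructure loss as a normal event."""
        def wrap(fn):
            def inner(*args, **kwargs):
                if random.random() < failure_rate:
                    raise ConnectionError(f"chaos: simulated outage in {fn.__name__}")
                return fn(*args, **kwargs)
            return inner
        return wrap

    @chaotic(failure_rate=0.10)
    def read_from_store(key):
        return {"key": key, "value": "..."}   # stand-in for a real datastore call

    def read_with_fallback(key):
        # The application, not the infrastructure, owns the recovery path.
        try:
            return read_from_store(key)
        except ConnectionError:
            return {"key": key, "value": None, "stale": True}   # degrade gracefully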

The point is that getting to this level of redundancy is going to take work, and to that end it will continue to be a regulator or anchor slowing down broader adoption of more modular approaches. But at least in my mind the future is set; directionally, it will be hard to ignore the economics of this type of approach for long. Of course, as an industry we need to start training or re-training developers to think in this kind of model, to build code in a way that takes into account the Chaos Monkey potential out there.

 

Want to see One Live?


We have been asked to provide an AOL Micro Data Center for the Supercomputing 2012 conference next week in Salt Lake City, Utah, with our partner Penguin Computing. If you want to see one of our internal versions live and up close, feel free to stop by and take a look. Jay Moran (my Distinguished Engineer here at AOL) and Scott Killian (the leader of our data center operations teams) will be onsite to discuss the technologies and our use cases.

 

\Mm

Pointy Elbows, Bags of Beans, and a little anthill excavation…A response to the New York Times Data Center Articles

I have been following with some interest the series of articles in the New York Times by Jim Glanz. The series premiered on Sunday with an article entitled Power, Pollution and the Internet, which was followed up today with a deeper dive into some specific examples. Today's examples (Data Barns in a Farm Town, Gobbling Power and Flexing Muscle) focused on the Microsoft program, a program with which I have more than some familiarity since I ran it for many years. After just two articles, reading the feedback in comments, and seeing some of the reaction in the blogosphere, it is very clear that there is a significant amount of misunderstanding and over-simplification, and a lack of detail I think is probably important. In responding I want to be very clear that I am not representing AOL, Microsoft, or any other organization; these are my own personal observations and opinions.

As mentioned in both of the articles, I was one of hundreds of people interviewed by the New York Times for this series. In those conversations with Jim Glanz a few things became very apparent. First – he has been on this story for a very long time, at least a year. As far as journalists go, he was incredibly deeply engaged and armed with tons of facts. In fact, he had a trove of internal emails, meeting minutes, and a mountain of data from government filings that must have taken him months to collect. Secondly, he had the very hard job of turning this very complex space into a format the uneducated masses can begin to understand. Therein lies much of the problem – this is an incredibly complex space to try to communicate to those not tackling it day to day, or who may not even understand the technological and regulatory forces involved. This is not an area or topic that can be sifted down to a sound bite. If this were easy, there really wouldn't be a story, would there?

At issue for me is that the complexity of the forces involved gets scant attention, with the articles aiming instead for the larger "data centers are big bad energy vampires hurting the environment" story. It's clearly evident reading through the comments on both of the articles so far, which claim the sources and causes are everything from poor web page design to conspiracies by governments or multi-national companies to corner the market on energy.

So I thought I would take a crack, article by article, at shedding some light (the kind that doesn't burn energy) on some of the topics and just call out where I disagree completely. In full transparency, the "Data Barns" article doesn't necessarily paint me as a "nice guy". Sometimes I am. Sometimes I am not. I am not an apologist, nor do I intend to become one in this post. I am paid to get stuff done. To execute. To deliver. Quite frankly the PUD missed deadlines (the progenitor event to my email quoted in the piece) and sometimes people (even utility companies) have to live in the real world of consequences. I think my industry reputation, work, and fundamental stances around driving energy efficiency and environmental conservancy in this industry can stand on their own, both publicly and with those who have worked for me.

There is an inherent irony here: these articles were published both in print and electronically to maximize the audience and readership. To do that, they made "multiple trips" through a data center, and ultimately reside in one (or more). They seem to suggest that keeping things online is bad, which cuts against the availability needs of the articles themselves. Doesn't the New York Times expect to make these articles available online for people to read? They are posted online already. Perhaps they expect their microfiche experts to be able to serve the demand for these articles in the future? I do not think so.

This is a complex eco-system of users, suppliers, technology, software, platforms, content creators, data (both BIG and small), regulatory forces, utilities, governments, financials, energy consumption, people, personalities, politics, company operating tenets, and community outreach, to name a very few. On top of managing all of these variables, operators also have to keep things running with no downtime.

\Mm

The AOL Micro-DC adds new capability

Back in July, I announced AOL's Data Center Independence Day with the release of our new 'Micro Data Center' approach. In that post we highlighted the terrific work the teams put in to revolutionize our data center approach and align it completely not only to technology goals but to business goals as well. It was an incredible amount of engineering and work to get to that point, and it would be foolish to think that the work represented a 'one and done' type of effort.

So today I am happy to announce the roll-out of a new capability for our Micro-DC – an indoor version of the Micro-DC.


While the first instantiations of our new capability were focused on outdoor environments, we were also hard at work on an indoor version with the same set of goals. Why work on an indoor version as well? Well, you might recall that in the original post I stated:

We are no longer tied to traditional data center facilities or colocation markets. That doesn't mean we won't use them; it means we now have a choice. Of course this is only possible because of the internally developed cloud infrastructure, but we have freed ourselves from having to be bolted onto or into existing big infrastructure. It allows us to have an incredible amount of geo-distributed capacity at a very low cost point in terms of upfront capital and ongoing operational expense.

We need to maintain a portfolio of options for our products and services. In this case – having an indoor version of our capabilities ensures that our solution can live absolutely anywhere. This will allow our footprint, automation and all, to live inside any data center or co-location environment, or the interior of any office building anywhere around the planet, and retain the extremely low maintenance profile that we were targeting from an operational cost perspective. In a sense you can think of it as "productizing" our infrastructure. Could we have just deployed racks of servers, network kit, etc. like we have always done? Sure. But by continuing to productize our infrastructure we continue to drive down our short-term and long-term infrastructure costs. In my mind, productizing your infrastructure is actually the next evolution in standardizing it. You can have infrastructure standards in place – server model, RAM, HD space, access switches, core switches, and the like. But until you get to that next phase of standardizing, automating, and "productizing" it into a discrete set of capabilities – you only get a partial win.
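To illustrate the difference (and this is purely a hypothetical sketch, not our actual MDC specification or tooling), "productizing" means the thing you order and deploy is a versioned product definition rather than a pile of individually standardized parts:

    # Hypothetical sketch of 'productized' infrastructure: a versioned
    # product spec is the unit of deployment. None of these numbers are
    # AOL's actual MDC specifications.
    MICRO_DC_SKU = {
        "product": "micro-dc",
        "version": "2.1",
        "variant": "indoor",                 # or "outdoor"
        "power_envelope_kw": 90,             # must fit a normal office building feed
        "compute_nodes": 320,
        "storage_tb": 500,
        "network": {"uplinks": 2, "uplink_gbps": 10},
        "automation": ["bare-metal provisioning", "cloud enrollment", "remote hands-free ops"],
    }

    def validate_order(sku, site_power_kw):
        """Ordering capacity becomes checking a product against a site,
        rather than engineering every deployment from scratch."""
        return sku["power_envelope_kw"] <= site_power_kw

The win comes when deployment, validation, and support are written against the product version rather than against each one-off build.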

Some people have asked me, "Why didn't you begin with the interior version to start with? It seems like it would be the easier one to accomplish." Indeed I cannot argue with them; it probably would have been easier, as there were far fewer challenges to solve. You can make basic assumptions about where this kind of indoor solution will live, and reduce much of the complexity. I guess it all nets out to a philosophy of solving the harder problems first. Once you prove the more complicated use case, the easier ones come much faster. That is definitely the situation here.

While this new capability continues the success we are seeing in re-defining the cost and operations of our particular engineering environments, the real challenge here (as with all sorts of infrastructure and cloud automation) is whether or not we can get our applications and services to work correctly in that space with similar success. On that note, I should have more to post soon. Stay tuned!

 

\Mm

AOL’s Data Center Independence Day

Yesterday we celebrated Independence Day here in the United States. It's a day when we embrace the freedoms we enjoy as a country, look back on how far we have come, and celebrate the promise of the future. Yesterday was also a different kind of Independence Day for my teams at AOL. A Data Center Independence Day, if you will.

You may or may not have been following the progress of the work that we have been doing here at AOL over the last 14 or so months, but the pace of change has been simply breathtaking. One of the first things I did when I joined the company was to deeply review all aspects of Operations: from Data Centers to Network Engineering, to the engineering teams supporting the products and services, and everything in between. The net of the exercise was that AOL was probably similar to most companies out there in terms of technology mix, from the CRUFT that I mentioned in a previous post to the latest technologies. There were some incredible technologies built over the last three decades, some outdated processes and procedures, and, if I am honest, traces of a culture where the past had more meaning than the present or the future.

In a very short period of time all of that changed. We aggressively made changes to the organization, re-aligned priorities, and, perhaps most of all, created and defined a powerful collection of changes and evolutions we would need to bring about on very aggressive timelines. These changes were part of a defined Technology Roadmap that broke the work we needed to accomplish into three categories. The first category focused on the internal technical challenges and the tools we needed to build to enhance our own internal efficiencies. The second category focused on the technical challenges and aggressive things we could do to enhance and bring greater scalability to our products and services. This included things like adding services and technology suites to our internally developed cloud infrastructure, and other items that would allow for more rapid delivery of our products and services. The last category of work was for the incredibly aggressive "wish list" types of changes – items that could be so disruptive, so incredibly game-changing for us, that they could redefine our work on the whole. In fact we named this group of work "Nibiru", after a mythical planet that is said to cross into our solar system, wreak havoc, and bring about great change.

On July 4, 2012, one of our Nibiru items arrived, and I am ecstatic to state that we achieved our "Data Center Independence Day". Our primary "Nibiru" goal was to develop and deliver a data center environment without the need for a physical building. The environment needed to require as little physical "touch" as possible and allow us the ultimate flexibility in how we delivered capacity for our products and services. We called this effort the Micro Data Center. If you think about the number of things that have to change to evolve to this type of strategy, it's a bit mind-boggling.


Here are just a few of the things we had to look at, change, and automate to even make this kind of achievement possible:

  • Developing an entirely new Technology Suite and the ability to deliver that capacity anywhere in the world with minimal to no staffing.
  • Delivering extremely dense compute capacity (think the latest technology) to give us the longest possible use of these assets once deployed into the field.
  • The ability to deliver a “Microdata Center” anywhere on the planet regardless of temperature and humidity settings
  • The ability to support/maintain/and administer remotely.
  • The ability to fit into the power envelope of a normal office building
  • Participation in our cloud environment and capabilities
  • The processes by which these facilities are maintained and serviced
  • and much much more…

In my mind, it's one thing to claim a technical achievement; it's quite another to operationalize that achievement and make the process of supporting it repeatable. That's my measure of when you can REALLY declare victory. Science experiments don't count. It has to just plain work. To that end, our first "beta" site for the technology was the AOL campus in Dulles, Virginia. Out on a lonely slab of concrete in the back of one of the buildings, our future has taken shape.

Thanks in part to a lot of the work going on in the data center containerization space, we were able to jump-start much of the work in a relatively quick fashion. In fact, the pace set by the Data Center and Technology Operations teams to deliver this achievement is more than a bit astounding. Most, if not all, of the existing AOL data centers would fall somewhere around a traditional Tier III / Tier II Uptime Institute definition. The teams really pushed way outside their comfort zones to deliver some incredible evolutions in a very short period of time. Of course there were steps along the way to get here, but those steps now seem to be in double time. A few months back we announced the launch of ATC, our first completely automated facility. The work that went into ATC was foundational to our achievement yesterday. It allowed us to start working on the hard stuff first – that is to say, the 'operationalization' of these kinds of environments. It set the stage for how we could evolve to this next tier. Below is a summary of some of the achievements of our ATC launch, but if you are curious about the specifics of our work there feel free to click through to the 'Breaking the Chrysalis' post I did at that time. You can see how the work that we have been driving in our own internal cloud environments, the changes in operational procedure, and the change in thinking are additive and fundamental to our latest achievement. It's especially interesting to note that with all of the blips and hiccups occurring in the 'cloud industry', like the leap second and the terrible storms on the East Coast this week which affected many data centers, ATC, our completely unmanned facility, just kept humming along with no issues (to be fair, so did our traditional facilities), despite much of the initial negative feedback we received being based solely on the supposed unreliability of such moves. It goes to show how important engineering FOR Operation is. At AOL we have built this in from the start.

What does this actually buy AOL?

Ok, we stuck some computers in a box and we made sure it requires very little care and feeding – what does this buy us? Quite a bit actually. Jay Moran, the Distinguished Engineer who was in charge of driving this effort, is always quick to point out that the problem space here is not just about the technology; it has to be a marriage with the business side as well. Obviously the inherent flexibility of the design allows us a greater number of places around the planet where we can deploy capacity, and that in and of itself is pretty revolutionary. We are no longer tied to traditional data center facilities or colocation markets. That doesn't mean we won't use them; it means we now have a choice. Of course this is only possible because of the internally developed cloud infrastructure, but we have freed ourselves from having to be bolted onto or into existing big infrastructure. It allows us to have an incredible amount of geo-distributed capacity at a very low cost point in terms of upfront capital and ongoing operational expense. This is a huge game changer. So much so that I'll do a bit of 'back of the napkin' math with you. Let's call the global capacity in terms of compute, storage, etc. that we have today in our traditional environments the Total Compute Capability, or TCC. It's essentially the bandwidth for the work that we can get done. Inside the cost for TCC you have operating costs such as power, lease costs, data center facility maintenance costs, support staff, etc. You additionally have the depreciation for the facilities themselves (or the specific build-outs, if colocating), the server and other equipment depreciation, and the rest. Let's call that baseline X. The Micro Data Center strategy, built out with our latest and most dense server standards and infrastructure, would allow us to have 5X the amount of total TCC in less than 10% of the cost and physical footprint. If you think about how this will allow us to aggregate and grow over time, it ultimately drives us to a VERY LOW operational cost structure for delivering our products and services. Additionally it positions us for the future in very significant ways:

  • It redefines software architecture for greater resiliency
  • It allows us an incredibly flexible platform for driving and addressing privacy laws, regulatory oversight, and other such concerns allowing us to respond rapidly.
  • It further reduces energy consumption and carbon footprint emissions (important as taxation evolves around the world, as well as ongoing operational costs)
  • Gives us the ability to drive Edge Computing delivery to potentially bypass CDNs for certain content.
  • Gives us the capability to drive 'Community-in-a-box', whereby we can quickly launch new products in new markets, expand existing footprints like Patch on a low-cost but still hyper-local platform, and give the Huffington Post a platform to rapidly partner and enter new markets with minimal-cost turn-ups.
  • The fact that the technology mix in our SKUs comprises compute, storage, and network capacity maximizes the number of products and services we can deploy to it.
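For those who like to see the napkin math written down, here is the arithmetic behind the claim above, with purely illustrative numbers (the real baseline X is not something I am publishing here):

    # Back-of-the-napkin math with purely illustrative numbers.
    X = 100.0                 # annual cost of today's Total Compute Capability (TCC), normalized
    tcc_today = 1.0           # today's TCC, normalized

    mdc_cost = 0.10 * X       # "less than 10% of the cost and physical footprint"
    mdc_tcc = 5.0 * tcc_today # "5X the amount of total TCC"

    cost_per_tcc_today = X / tcc_today        # 100.0
    cost_per_tcc_mdc = mdc_cost / mdc_tcc     # 2.0

    print(f"cost per unit of TCC drops ~{cost_per_tcc_today / cost_per_tcc_mdc:.0f}x")  # ~50x

Five times the capability at a tenth of the cost works out to roughly a 50x improvement in cost per unit of compute capability, which is why I call it a game changer.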

As Always it's really about the People

I cannot let a post about this huge win for us go by without mentioning the teams involved in delivering this capability. This is not just a win for AOL, or even, to a lesser degree, for the industry at large as another proof point that it can evolve if it puts its mind to changing; it is above all a win for the Technology Teams at AOL. When I was first approached about joining AOL, my slightly sarcastic and comedic response was probably much like yours – 'Are they still around?' But the fact of the matter is that AOL has a vision of where it wants to go, and what it wants to be. That was compelling for me personally, compelling enough for me to make the move. What has truly amazed me, however, is the dedication and tenacity of its employees. These achievements would not be possible without the outright aggressiveness the organization has taken in moving the company forward. It's always hard to assess from the outside just how hard an effort is to achieve internally. In the case of our Micro Data Center strategy, the teams had just about every kind of barrier to delivering this capability, and every kind of excuse not to make it, or even not to try. They put all of those things aside and just plain executed. If you allow me a small moment of bravado – not only did my teams simply kick ass, they did it in a way that moved the needle for the company, and in my mind once again catapulted themselves to the forefront of operations and technology at scale. We still have a bunch of Nibiru projects to deliver, so my guess is we haven't heard the last of these big wins.

\Mm

Sites and Sounds of DataCentre2012: My Presentation, Day 2, and Final Observations


Today marked the closing set of sessions for DataCentres2012 and my keynote session to the attendees. After sitting through a series of product, technology, and industry trend presentations over the last two days, I felt that my conversation would at the very least be something different. Before I get to that – I wanted to share some observations from the morning.

It all began with an interesting run-down of the data center and infrastructure industry trends across Europe from Steve Wallage of The BroadGroup. It contained some really compelling information and highlighted some interesting divergence between the European market and the US market in terms of adoption and trends of infrastructure. It looks like they have a way for those interested to get their hands on the detailed data (for purchase). The part I found particularly interesting was the significant slowdown of the wholesale data center market across Europe while colocation providers continued to do well. Additionally, the percentage changes within the customer base of those providers by category were compelling and demonstrated a fundamental shift and movement of content-related customers across the board.

This presentation was followed by a panel of European thought leaders made up mostly of those same colocation providers. Given the presentation by Wallage I was expecting some interesting data points to emerge. While there was a range of ideas and perspectives represented by the panel, I have to say it really got me worked up, and not in a good way. In many ways I felt the responses from many (not all) on the panel highlighted a continued resistance to change in thinking around everything from efficiency to technology approach. It represented the things I despise most about our industry at large: the slow adoption of change, the warm embrace of the familiar, the outright resistance to new ideas. At one point, a woman in the front row, who I believe was from Germany, got up to ask whether the panelists had any plans to move their facilities outside of the major metros. She referenced Christian Belady's presentation around the idea of data as energy and remote locations like Quincy, Washington or Lulea, Sweden. She referred to the overall approach and the different thinking as quite visionary. Now, the panel could easily have referred to the fact that companies like Microsoft, Google, Facebook and the like have much greater software-level control than a colo provider could have. Or perhaps they could have referenced that most of their customers are limited by distance to existing infrastructure deployments due to inefficiencies in commercial or custom internally deployed applications – databases with response times architected for in-rack or in-facility latencies. They did at least reference that most customers tend to be server huggers and want their infrastructure close by.

Instead, the initial response was quite strange in my mind. It was to go after the word "innovative" and to imply that nothing was really innovative about what Microsoft had done, and that the fact that they built a "mega data center" in Dublin shows there is nothing innovative really happening. Really? The adoption of 100% air-side economization is something everyone does? The deployment of containerized compute capacity is run of the mill? The concept of the industrialization of compute is old hat? I had to do a mental double take and question whether they were even listening during ANY of the earlier sessions. Don't get me wrong, I am not trying to be an apologist for the Microsoft program; in fact there are some tenets of the program I find myself not in agreement with. However – you cannot deny that they are doing VERY different things. It illustrated an interesting undercurrent I felt during the entire event (and maybe even our industry): I definitely got the sensation of a growing gap between users' requirements, forward roadmaps, and desires, and what manufacturers and service providers are providing. This panel, and a previous panel on modularization, really highlighted these gulfs pretty demonstrably. At a minimum I definitely walked away with an interesting new perspective on some of the companies represented.

It was then time for me to give my talk. Every discussion up until this point had really focused on technology or industry trends. I was going to talk about something else, something more important: the one thing seemingly missing from the entire event – the people attending. All the technology in the world, all of the understanding of the trends in our industry, are nothing unless the people in the room are willing to act. Willing to step up and take active roles in their companies to drive strategy. As I have said before – to get out of the basement and into the penthouse. The pressures on our industry and our job roles have never been more complicated. So I walked through regulations, technologies, and cloud discussions. Using the work that we did at AOL as a backdrop and example, I really tried to drive my main point: that our industry – specifically the people doing all the work – is moving to a role of managing a complex portfolio of technologies, contracts, and a continuum of solutions. Gone are the days when we can hide sheltered in our data center facilities. We need to shed our resistance to change and evolve with the industry, or it will evolve around us. I walked through specific examples of how AOL has had to broaden its own perspective and approach to this widening view of our work roles at all levels. I even pre-announced something we are calling Data Center Independence Day – an aggressive adoption of modularized compute capacity that we call Micro Data Centers – to help solve many of the issues we are facing as a business, along with the rough business case for why it makes sense for us to move to this model. I will speak more of that in the weeks to come with a greater degree of specifics, but I stressed again the need for a wider perspective to manage a large portfolio of technologies and approaches to be successful in the future.

In closing – the event was fantastic.   The ability this event provides to network with leaders and professionals across the industry was first rate.   If I had any real constructive feedback it would be to either lengthen sessions, or reduce panel sizes to encourage more active and lively conversations.  Or both!

Perhaps at the end of the day, it's truly the best measure of a good conference if you walk away wishing that more time could have been spent on the topics. As for me, I am headed back Stateside to dig into the challenges of my day job. To the wonderful host city of Nice, I say Adieu!

 

\Mm

Sites and Sounds of DataCentre2012: Thoughts and my Personal Favorite presentations Day 1

We wrapped our first full day of talks here at DataCentre2012 and I have to say the content was incredibly good. One of the key highlights that really stuck out in my mind was the talk given by Christian Belady, who covered some interesting bits of the Microsoft data center strategy moving forward. Of course I have a personal interest in that program, having been there for Generation 1 through Generation 4 of its evolution. Christian covered some of the technology trends that they are incorporating into their Generation 5 facilities. It was some very interesting stuff, and he went into deeper detail than I have heard so far around the concept of co-generation of power at data center locations. While I personally have some doubts about the all-in costs and the immediacy of its applicability, it was great to see some deep, meaningful thought and differentiation out of the Microsoft program. He also went into some interesting "future" visions which talked about data being the next energy source. While he took this concept to an entirely new level, I do feel he is directionally correct. His correlations around the delivery of "data" in a utility model rang very true to me, as I have long preached – for over 5 years now – that we are at the dawning of the Information Utility.

Another fascinating talk came from Oliver J Jones of a company called Chayora. Few people and companies really understand the complexities and idiosyncrasies of doing business in China, let alone dealing with the development and deployment of large-scale infrastructure there. The presentation done by Mr. Jones was incredibly well done. Articulating the size, opportunity, and challenges of working in China through the lens of the data center market, he nimbly worked in the benefits of working with a company with this kind of expertise. It was a great way to quietly sell Chayora's value proposition, and looking around I could tell the room was enthralled. His thoughts and data points had me thinking and running through scenarios all day long. Having been to many infrastructure conferences and seen hundreds if not thousands of presentations, anyone who can capture that much of my mindshare for the day is a clear winner.

Tom Furlong and Jay Park of Facebook gave a great talk on OCP with a particular focus on their new facility in Sweden. They also talked a bit about their other facilities in Prineville and North Carolina. With Furlong taking the mechanical innovations and Park going through the electrical, it was a great talk that created lots of interesting questions. An incredibly captivating portion of the talk was around calculating data center availability. In all honesty it was the first time I had ever seen this topic taken head-on at a data center conference. In my experience, like PUE, availability calculations can fall under the spell of marketing departments who truly don't understand that there SHOULD be real math behind the calculation. There were two interesting takeaways for me. The first was just how impactful this portion of the talk was on the room in general. There was an incredible number of people taking notes as Jay Park went through the equation and the way to think about it. That led me to my second revelation – there are large parts of our industry that don't know how to do this. In private conversations after the talk, some people confided that they had never truly understood how to calculate this. It was an interesting wake-up call for me to ensure I cover the basics even in my own talks.
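I won't try to reproduce the exact equation Jay Park walked through, but for anyone who has never done the math, the textbook building blocks look something like the sketch below (in Python, with invented component numbers). The point is simply that an availability figure should fall out of component MTBF/MTTR data and the series/parallel topology, not out of a marketing department.

    # Textbook building blocks of an availability calculation -- not
    # necessarily the exact equation presented in the talk.
    def availability(mtbf_hours, mttr_hours):
        """Steady-state availability of a single component."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def series(*avails):
        """Components in series: all of them must work (e.g. utility -> UPS -> PDU)."""
        a = 1.0
        for x in avails:
            a *= x
        return a

    def parallel(*avails):
        """Redundant components: the group fails only if all of them fail."""
        f = 1.0
        for x in avails:
            f *= (1.0 - x)
        return 1.0 - f

    # Illustrative numbers only:
    ups = availability(mtbf_hours=250_000, mttr_hours=8)
    feed = series(availability(50_000, 4), parallel(ups, ups))
    print(f"{feed:.6f}")   # expected downtime/year ~= (1 - feed) * 8766 hours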

After the Facebook talk it was time for me to mount the stage for the Global Thought Leadership Panel. I was joined on stage by some great industry thinkers, including Christian Belady of Microsoft; Len Bosack (founder of Cisco Systems), now CEO of XKL Systems; Jack Tison, CTO of Panduit; Kfir Godrich, VP and Chief Technologist at HP; John Corcoran, Executive Chairman of Global Switch; and Paul-Francois Cattier, Global VP of Data Centers at Schneider Electric. That's a lot of people and brainpower to fit on a single stage. We really needed three times the time allotted for this panel, but that is the way these things go. Perhaps the most interesting recurring theme from question to question was the general agreement that, at the end of the day, great technology means nothing without the will to do something different. There was an interesting debate on the differences between enterprise users and large-scale users like Microsoft, Google, Facebook, Amazon, and AOL. I was quite chagrined and a little proud to hear AOL named in that list of luminaries (it wasn't me who brought it up). But I was quick to point out that AOL is a bit different in that it has been around for 30 years and our challenges are EXACTLY like enterprise data center environments. More on that tomorrow in my keynote, I guess.

All in all, it was a good day – there were lots of moments of brilliance in the panel discussions throughout the day. One regret I have was on the panel regarding DCIM. They ran out of time for questions from the audience, which was unfortunate. People continue to confuse DCIM with BMS version 2.0 and really miss capturing the work and soft costs, let alone the ongoing commitment to the effort once started. Additionally, there is the question of what you do with the mountains of data once you have collected them. I had a bunch of questions on this topic for the panel, including whether any of the major manufacturers were thinking about building a decision engine on top of the data collection. To me it's a natural outgrowth and the next phase of DCIM. The one case study they discussed was InterXion. It was a great effort, but I think in the end it maintained the confusion between a BMS with a web interface and true Facilities and IT integration. Another panel on modularization got some really lively discussion on feature/functionality, differentiation, and lack of adoption. To a real degree it highlighted an interesting gulf between manufacturers (mostly represented on the panel), who need to differentiate their products, and the users, who require vendor interoperability across the solution space. It probably doesn't help to have Microsoft or myself in the audience when it comes to discussions around modular capacity. On to tomorrow!

\Mm

Uptime, Cowgirls, and Success in California

This week my teams have descended upon the Uptime Institute Symposium in Santa Clara. The moment is extremely bittersweet for me, as this is the first Symposium in quite some time that I have been unable to attend. With my responsibilities at AOL expanding beginning this week, there was simply too much going on for me to make the trip out. It's a downright shame too. Why?

We (AOL) will be featured in two key parts of Symposium this time around for some incredibly ground-breaking work happening at the company. The first is a recap of the incredible work going on in the development of our own cloud platforms. You may recall that last year we were asked to talk about some of the wins and achievements we were able to accomplish with the development of our cloud platform. The session was extremely well received. We were asked to come back, one year on, to discuss how that work has progressed even further. Aaron Lake, the primary developer of our cloud platforms on my Infrastructure Development Team, will be talking about the continued success, features, and functionality, and the launch of our ATC cloud-only data center. It's been an incredible breakneck pace for Aaron and his team, and they have delivered world-class capabilities for us internally.

Much of Aaron's work has also enabled us to win the Uptime Institute's first annual Server Roundup Award. I am especially proud of this particular honor as it is the result of an amazing amount of hard work within the organization on a problem faced by companies all over the planet. Essentially this is operations hygiene at a huge scale: getting rid of old servers, driving consolidation, moving platforms to our cloud environments, and more. This talk will be led by Julie Edwards, our Director of Business Operations, and Christy Abramson, our Director of Service Management. Together these two teams led the effort to drive out "Operational Absurdities" and ran our "Power Hog" programs. We have sent along Lee Ann Macerelli and Rachel Paiva, the primary project managers instrumental in making this initiative such a huge success. These "Cowgirls" drove an insane amount of work across the company, resulting in over 5 million dollars of un-forecasted operational savings and proving that there is always room for good operational practices. They even starred in a funny internal video to celebrate their win, which can be found here using the AOL Studio Now service.

If you happen to be attending Symposium this year feel free to stop by and say hello to these amazing individuals.   I am incredibly proud of the work that they have driven within the company.

 

\Mm

The Cloud Cat and Mouse Papers–Site Selection Roulette and the Insurance Policies of Mobile infrastructure


It's always hard to pick exactly where to start in a conversation like this, especially since this entire process really represents a changing life-cycle. It's more of a circular spiral that moves out (or evolves) as new data is introduced than a traditional life-cycle, because new data can fundamentally shift the technology or approach. That being said, I thought I would start our conversations at a logical starting point: where does one place your infrastructure? Even in its embryonic "idea phase", the intersection of government and technology begins its delicate dance to a significant degree. These decisions will ultimately have an impact on more than just where a company's capital investments are located. They affect the products and services it offers and, as I propose, ultimately have an impact on the customers that use the services at those locations.

As I think back to the early days of building out a global infrastructure, the Site Selection phase started at a very interesting place. In some ways we approached it with a level of sophistication that has still to be matched today, and in other ways we were children playing a game whose rules had not yet been defined.

I remember sitting across numerous tables with government officials talking about making an investment (largely just land purchase decisions) in their local community. Our Site Selection methodology had brought us to these areas – a Site Selection process which continued to evolve as we got smarter, and as we started to truly understand the dynamics of the system we were being introduced to. In these meetings we always sat stealthily behind a third-party real estate partner. We never divulged who we were, nor were they allowed to ask us that directly. We would pepper them with questions, and they in turn would return the favor. It was all cloak and dagger, with the real estate entity taking all action items to follow up with both parties.

Invariably during these early days, these locales would walk away with the firm belief that we were a bank or financial institution. When they delved into our financial viability (for things like power loads, commitment to capital build-out, etc.) we always stated that any capital commitments and longer-term operational cost commitments were not a problem. In large part the cloak and dagger aspect was to keep land costs down (as we matured, we discovered this was quite literally the last thing we needed to worry about), as we feared that once our name became attached to the deal our costs would go up. These were the early days of seeding global infrastructure, and it was not just us. I still laugh at the fact that one of our competitors bound a locality up so much in secrecy that the community referred to the data center as Voldemort – He Who Shall Not Be Named – in reference to the Harry Potter book series.

This of course was not the only criterion that we used. We had over 56 criteria, with various levels of importance and weighting, by the time I left that particular effort. Some Internet companies today use fewer, some about the same, and some don't use any – they ride on the backs of others who have trail-blazed a certain market or locale. I have long called this effect Data Center Clustering. The rewards for being the first mover are big; they are smaller if you follow, but ultimately still positive.

If you think about most of the criteria used to find a location, they almost always focus on current conditions, with only some of the criteria acknowledging the look forward. This is true, for example, when looking at power costs. Power costs today are important to siting a data center, but so is understanding the generation mix of that power, the corresponding price volatility, and modeling that ahead to predict (as best as possible) longer-term power costs.
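To show what "modeling that ahead" can mean in the simplest possible terms, here is a toy Python sketch. Every rate, escalation figure, and generation-mix share below is invented for illustration; the point is only that a site's power cost gets projected over the facility's horizon per generation source, rather than taken as today's snapshot.

    # Toy look-forward on power cost rather than a snapshot.
    # All numbers below are invented for illustration.
    def projected_rate(base_rate_kwh, mix, escalation, years):
        """Blend today's rate by generation mix and escalate each source
        independently over the facility's planning horizon."""
        rate = 0.0
        for source, share in mix.items():
            rate += base_rate_kwh * share * (1 + escalation[source]) ** years
        return rate

    mix = {"hydro": 0.70, "gas": 0.20, "coal": 0.10}
    escalation = {"hydro": 0.01, "gas": 0.05, "coal": 0.07}   # assumed annual drift
    for horizon in (1, 10, 20):
        print(horizon, round(projected_rate(0.045, mix, escalation, horizon), 4))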

What many miss is the more subtle political layer that comes into play once a data center has been placed or a cluster has developed – specifically, that the political and regulatory landscape can change very quickly relative to the life of a data center facility, which is typically measured in 20, 30, or 40 year lifetimes. It's a risk that places a large amount of capital assets potentially in play and vulnerable to these kinds of changes, and it's something that is very hard to plan or model against. That being said, there are indicators and clues that one can use to at least weigh the risk factors, or, as some are doing, one can ensure that the technology deployed limits the exposure. In cloud environments the question remains open – how much are companies using cloud infrastructure in these facilities at risk? We will explore this a little later.

That's not to say that this process is all downside either. As we matured in our approach, we came to realize that governments (local or otherwise) were strongly incented to work with us on getting us a great deal, and in fact competed over this kind of business. Soon you started to see the offers changing materially. It became less about the land or location and quickly evolved into what types of tax incentives, power deals, and other mechanisms could be put in play. You saw (and continue to see) deals structured around sales tax breaks, real estate and real estate tax deals, economic incentives around breaks in power rates, specialized rate structures for Internet and cloud companies, and the like. The goal here, of course, was to create the public equivalent of "golden handcuffs" for the tech companies and try to marry them to a particular region, state, or country – in many cases, all three. The benefits here are self-apparent. But can they (or, more specifically, will they) be passed on in some way to the small companies who make use of cloud infrastructure in these facilities? While definitely not part of the package deals done today, I could easily see site selection negotiations evolving to incent local adoption of cloud technology in these facilities, or provisions being put in place tying adoption and hosting to tax breaks and other deal structures in the mid to longer timeframe for hosting and cloud companies.

There is still a learning curve out there, as most governments mistakenly try to tie these investments to job creation. Data centers, operations, and the like represent the cost of goods sold (COGS) to the cloud business. Therefore there is a constant drive towards efficiency and the reduction of the highest-cost components of delivering those products and services. Generally speaking, people are the primary targets in these environments. Driving automation in these environments is job one for any global infrastructure player. One of the big drivers for us investing in and developing a 100% lights-out data center at AOL was eliminating those kinds of costs. Those governments that highlight job creation targets over other types typically don't get the site selection. Having commissioned an economic study after a few of my previous big data center builds, I can tell you that the value to a region or a state does not come from the up-front jobs the data center employs. After a local radio station called into question the value of having such a facility in their backyard, we used an internationally recognized university to perform a third-party "neutral" assessment of the economic benefits (sans direct people), and the numbers were telling. We surrendered all construction costs and other related material to them, and over the course of a year, through regional interviews and the like, they investigated the direct impacts of a data center on the local community and the overall impact of the addition. The results of that study are owned by a previous employer, but I can tell you with certainty – these facilities can be beneficial to local regions.

No one likes constraints, and as such you are beginning to see technology companies use their primary weapon – technology – to mitigate their risks even in these scenarios. One cannot deny, for example, that while container-based data centers offer some interesting benefits in terms of energy and cost efficiencies, there is also a certain mobility to that kind of infrastructure that has never been available before. Historically, data centers are viewed as large capital anchors to a location. Once in place, hundreds of millions to billions (depending on the size of the company) of dollars of capital investment are tied to that region for its lifespan. It's as close to permanent in the tech industry as building a factory was during the industrial revolution.

In some ways the modularization of the data center industry is having, can have, and will have the same effect as the shipping container did in manufacturing – all puns intended. If you are unaware of how the shipping container revolutionized the world, I would highly recommend the book "The Box" by Marc Levinson; it's a quick read and very interesting if you read it through the lens of IT infrastructure and the parallels of modularization in the data center industry at large.

It gives the infrastructure companies more exit options and mobility in the future than they would have had in the past under large capital build-outs. It's an insurance policy, if you will, for potential changes in legislation or regulation that might negatively impact the technology companies over time. Just another move in the cat and mouse games that we will see evolving here over the next decade or so in terms of the interactions between governments and global infrastructure.

So what about the consumers of cloud services? How much of a concern should this represent for them? You don't have to be a big infrastructure player to understand that there are potential risks in where your products and services live. Whether you are building a data center or hosting inside a real estate or co-location provider, these are issues that will affect you. Even in cases where you only use the cloud provisioning capabilities within your chosen provider, you will typically be given options for what region or area you would like your gear hosted in. Typically this is done for performance reasons – reaching your customers – but perhaps this information might cause you to think about the larger ramifications to your business. It might even drive requirements back into the infrastructure providers to make this more transparent in the future.

These evolutions in the relationship between governments and technology companies, and the technology options available to those companies, will continue to shape site selection policy for years to come. How this will ultimately affect those that use this infrastructure, whether directly or indirectly, remains to be seen. In the next paper we will explore this interaction more deeply as it relates to the customers of cloud services and the risks and challenges specifically for them in this environment.

\Mm

Open Sourcing our Operational Scale Tools–Meet Trigger

Not that long ago we made a decision to begin open-sourcing some of our internal products and tools to the community at large. There are some really interesting benefits to open-sourcing these internally developed tools, and as a company we have begun to do some very interesting work in this space. Some of our work has gotten quite a bit of attention, such as SocketStream, a very fast, real-time web framework.

Today I am very pleased to announce that we are open-sourcing one of the tools that we use to manage and maintain our network infrastructure.   We call it Trigger.  

Trigger is a Python framework and suite of tools for interfacing with network devices and managing network configuration and security policy. Trigger was designed internally, specifically to increase the speed and efficiency of network configuration management.

Trigger’s core device interaction utilizes the freely available Twisted event-driven networking engine. The libraries can connect to network devices by any available method (e.g. telnet, SSH), communicate with them in their native interface (e.g. Juniper JunoScript, Cisco IOS), and return output. Trigger is able to manage any number of jobs in parallel and handle output or errors as they return.
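To give a feel for what that looks like in practice, here is a rough usage sketch. Treat it as hypothetical: the device names are invented, and the exact class names and call signatures should be checked against the repository and documentation linked below rather than taken from this example.

    # Hypothetical usage sketch only -- check the repository and documentation
    # linked below for Trigger's actual classes and call signatures.
    # The device names here are invented.
    from trigger.cmds import Commando

    class ShowVersion(Commando):
        """Run a read-only command against many devices in parallel
        and collect the output as it returns."""
        commands = ['show version']

    if __name__ == '__main__':
        job = ShowVersion(devices=['edge1.example.net', 'edge2.example.net'])
        job.run()                  # Twisted drives the device sessions concurrently
        print(job.results)         # output keyed by device name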

If you think a tool like this would be interesting for you or your company, feel free to give it a try. The open source repository can be found at: https://github.com/aol/trigger

The complete set of documentation is hosted at Read the Docs:

http://readthedocs.org/docs/trigger

This is just the first of many internal efforts and tools that we plan to Open Source.  I will announce more tools in the coming months that have helped AOL to scale over the years.  Stay tuned!

\Mm