Lots of interest in the MicroDC, but do you know what I am getting the most questions about?

 Scott Killian of AOL talks about the MicroDC

Last week I put up a post about how AOL.com now has 25% of all its traffic running through our MicroDC infrastructure.   There was a great follow-up post by James LaPlaine, our VP of Operations, on his blog Mental Effort, which goes into even greater detail.   While many of the email inquiries I get have been about the technology itself, a surprisingly large majority of the notes have been questions about how to make your software, applications, and development efforts ready for such an infrastructure, and what realistic timelines for doing so would look like.

The general response, of course, is that it depends.  If you are a web-based platform or property focused solely on Internet-based consumers, or a firm that needs a diversified presence in different regions without the hefty price tag of renting and taking down additional space, this may be an option.  However, many enterprise applications have been written in a way that is highly dependent upon localized infrastructure and short application-level latency, and they lack adequate scaling.  So for more corporate data center applications this may not be a great fit.  It will take some time for those big traditional application firms to truly build out their infrastructure to work in an environment like this (they may never do so).   I suspect most will take an easier approach and try to 'cloudify' their own applications and run them within their own infrastructure or data centers under their control.   This essentially allows them to control the access portion of users' needs, but continue to rely on the same kinds of infrastructure you might have in your own data center to support it.   It's much easier to build a web-based application which then connects to a traditional IT environment than to truly build out infrastructure capable of accommodating scale.   I am happy to continue answering questions as they come up, but since I had an overwhelming response of questions about this I thought I would throw something quick up here that will hopefully help.
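To make that distinction a bit more concrete, here is a minimal sketch of one of the first changes this usually requires: keeping the web tier stateless and pushing state out behind a narrow store interface, so any instance in any facility can serve any request. The store interface and names below are hypothetical, purely for illustration, and not a description of our actual stack.

```python
from abc import ABC, abstractmethod


class SessionStore(ABC):
    """Narrow interface to a replicated session store (hypothetical).

    The web tier never touches local disk or in-process memory for state,
    so any instance in any facility can serve any request.
    """

    @abstractmethod
    def get(self, session_id: str) -> dict: ...

    @abstractmethod
    def put(self, session_id: str, data: dict) -> None: ...


def handle_request(store: SessionStore, session_id: str, update: dict) -> dict:
    # All state lives behind the store; the handler itself is stateless.
    session = store.get(session_id)
    session.update(update)
    store.put(session_id, session)
    return session
```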

 

\Mm

On Micro Datacenters, Sandy, Supercomputing 2012, and Coding for Containerized Data Centers….


As everyone is painfully aware, last week the United States saw the devastation caused by Superstorm Sandy.   My original intention was to talk about yet another milestone with our Micro Data Center approach, but as the storm slammed into the East Coast I felt it was probably a bad time to talk about achieving something significant, especially as people were suffering through the storm's aftermath.  In fact, after the storm AOL kicked off an incredible supplies drive and sent truckloads of goods up to the worst of the affected areas.

So, here we are a week after the storm, and while people are still in need and suffering, it is clear that the worst is over and the cleanup and healing have begun.   It turns out that Superstorm Sandy also allowed us to test another interesting case in the journey of the Micro Data Center, which I will touch on below.

25% of ALL AOL.COM Traffic runs through Micro Data Centers

I have talked about the potential value of our use of Micro Data Centers and the pure agility and economics the platform will provide for us.   Up until this point we had used this technology in pockets; think of our explorations as focusing on beta and demo environments.  But that all changed in October, when we officially flipped the switch and began taking production traffic for AOL.COM with the Micro Data Center.  We are currently (and have been since flipping the switch) running about 25% of all traffic coming to our main web site.   This is an interesting achievement in many ways.  First, from a performance perspective, we are manually limiting the platform (it could do more!) to ~65,000 requests per minute and a traffic volume of about 280 Mbit/s.   To date I haven't seen many people post performance statistics about applications running in modular data centers, so hopefully this is relevant and interesting to folks in terms of the volume of load an approach such as this can take.   We recently celebrated this at an All-Hands with an internal version of our MDC being plugged into the conference room.  To prove our point we added it to the global pool of capacity for AOL.com and started taking production traffic right there at the conference facility.   This proves in large part the value, agility, and mobility a platform like this could bring to bear.

Scott Killian, AOL's Data Center guru talks about the deployment of AOLs Micro Data Center. An internal version went 'live' during the talk.
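For a rough sense of what those numbers mean per request (an approximation, since the 280 Mbit/s figure includes protocol overhead), the quick math looks like this:

```python
# Back-of-the-envelope math on the MDC traffic figures quoted above.
requests_per_minute = 65_000
traffic_mbits_per_second = 280

requests_per_second = requests_per_minute / 60                        # ~1,083 req/s
bits_per_request = (traffic_mbits_per_second * 1_000_000) / requests_per_second
kilobytes_per_request = bits_per_request / 8 / 1_000                  # ~32 KB average

print(f"{requests_per_second:.0f} req/s, roughly {kilobytes_per_request:.0f} KB per request")
```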

 

As I mentioned before, Superstorm Sandy threw us another curveball as the hurricane crashed into the Mid-Atlantic.   While Virginia was not hit anywhere near as hard as New York and New Jersey, there were incredible sustained winds, tumultuous rains, and storm-related damage everywhere.  Through it all, our outdoor version of the MDC weathered the storm just fine and continued serving traffic for AOL.com without fail.

 

This kind of Capability is not EASY or Turn-Key

That's not to say there isn't a ton of work to do to get an application to work in an environment like this.   If you take the problem space at its different levels, whether that be DNS, load balancing, network redundancy, configuration management, underlying application-level timeouts, or system dependencies like databases and other information stores, the non-infrastructure work and coding is not insignificant.   There is a huge amount of complexity in running a site like AOL.com: lots of interdependencies, sophistication, advertising-related collection and distribution, and the like.   It's safe to say that this is not as simple as throwing an Apache/Tomcat instance into a VM.
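To illustrate just one slice of that application-level work, here is a minimal sketch, using only the Python standard library and hypothetical endpoint names, of the explicit timeout and failover handling a dependency call needs once the service it depends on may live in another facility:

```python
import urllib.request

# Ordered preference: try the local facility first, then fall back to remote ones.
# These endpoint names are hypothetical, for illustration only.
PROFILE_ENDPOINTS = [
    "http://profile.local-mdc.example.com/v1/user/123",
    "http://profile.dulles.example.com/v1/user/123",
    "http://profile.west.example.com/v1/user/123",
]


def fetch_profile(timeout_seconds: float = 0.5) -> bytes:
    """Fetch a user profile, failing over across facilities with a tight timeout."""
    last_error = None
    for url in PROFILE_ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                return response.read()
        except OSError as error:        # connect failures and timeouts
            last_error = error          # note the failure and try the next facility
    raise RuntimeError(f"all profile endpoints failed: {last_error}")
```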

I have talked for quite a while about what Netflix engineers originally coined as the Chaos Monkey: the ability, development paradigm, or even set of rogue processes that lets your applications survive significant infrastructure and application-level outages.  It's essentially taking the redundancy out of the infrastructure and putting it into the code. While extremely painful at the start, the long-term savings are proving hugely beneficial.    For most companies, this is still something futuristic, very far out there.  They may be beholden to software manufacturers and developers to start thinking this way, which may take a very, very long time.  Infrastructure is the easy way to solve it.   It may be easy, but it's not cheap.  Nor, if you care about the environmental angle, is it very 'sustainable' or green.   Limit the infrastructure, limit the waste.   While we haven't really thought about rolling this up into our environmental positions, perhaps we should.
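For anyone who hasn't seen the idea in practice, the mechanism is simpler than it sounds. Here is a deliberately bare-bones sketch; the instance list and terminate call are placeholders rather than a real API, and it only shows the shape of the idea:

```python
import random
import time

# Placeholder inventory; in practice this would come from your cloud or
# service-discovery layer.
INSTANCES = ["web-01", "web-02", "web-03", "cache-01"]


def terminate(instance_id: str) -> None:
    # Placeholder for the real kill call (cloud API, orchestration hook, etc.).
    print(f"terminating {instance_id}")


def chaos_loop(interval_seconds: int = 3600) -> None:
    """Periodically kill one randomly chosen instance so that the application's
    failover paths are exercised continuously rather than only during outages."""
    while True:
        terminate(random.choice(INSTANCES))
        time.sleep(interval_seconds)
```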

The point is that getting to this level of redundancy is going to take work, and to that end it will continue to be a regulator or anchor slowing broader adoption of more modular approaches.  But at least in my mind the future is set; directionally, it will be hard to ignore the economics of this type of approach for long.   Of course, as an industry we need to start training, or re-training, developers to think in this kind of model: to build code in a way that takes into account the Chaos Monkey potential out there.

 

Want to see One Live?


We have been asked to provide an AOL Micro Data Center for the Supercomputing 2012 conference next week in Salt Lake City, Utah with our partner Penguin Computing.  If you want to see one of our internal versions live and up close, feel free to stop by and take a look.  Jay Moran (my Distinguished Engineer here at AOL) and Scott Killian (the leader of our data center operations teams) will be onsite to discuss the technologies and our use cases.

 

\Mm

AOL’s Data Center Independence Day

Yesterday we celebrated Independence Day here in the United States.   It’s a day where we embrace the freedoms we enjoy as a country, look back on where we have come, and celebrate the promise of the future.   Yesterday was also a different kind of Independence Day for my teams at AOL.  A Data Center Independence Day, if you will. 

You may or may not have been following the progress of the work that we have been doing here at AOL over the last 14 or so months, but the pace of change has been simply breathtaking.  One of the first things I did when I joined the company was deeply review all aspects of Operations, from Data Centers to Network Engineering, to the engineering teams supporting the products and services, and everything in between.   The net of the exercise was that AOL was probably similar to most companies out there in terms of technology mix, from the CRUFT that I mentioned in a previous post to the latest technologies.  There were some incredible technologies built over the last three decades, some outdated processes and procedures, and, if I am honest, traces of a culture where the past had more meaning than the present or future.

In a very short period of time all of that changed.  We aggressively made changes to the organization,  re-aligned priorities, and, perhaps most of all, created and defined a powerful collection of changes and evolutions we would need to bring about on very aggressive timelines.    These changes were part of a defined Technology Roadmap that broke the work we needed to accomplish into three categories.   The first category focused on the internal technical challenges and tools we needed to build to enhance our own internal efficiencies.  The second focused on the technical challenges and aggressive things we could do to enhance and bring greater scalability to our products and services.   This would include things like adding services and technology suites to our internally developed cloud infrastructure, and other items that would allow for more rapid delivery of our products and services.   The last category of work was for the incredibly aggressive "wish list" types of changes: items that could be so disruptive, so incredibly game-changing for us, that they could redefine our work on the whole.  In fact we named this group of work "Nibiru" after a mythical planet that is said to cross into our solar system, wreak havoc, and bring about great change.

On July 4, 2012, one of our Nibiru items arrived and I am ecstatic to state that we achieved our "Data Center Independence Day".  Our primary Nibiru goal was to develop and deliver a data center environment without the need for a physical building.  The environment needed to require as little physical "touch" as possible and allow us the ultimate flexibility in how we deliver capacity for our products and services. We called this effort the Micro Data Center.   If you think about the number of things that need to change to evolve to this type of strategy, it's a bit mind-boggling.


Here are just a few of the things we had to look at, change, and automate to make this kind of achievement possible:

  • Developing an entirely new technology suite and the ability to deliver that capacity anywhere in the world with minimal to no staffing
  • Delivering extremely dense compute capacity (think the latest technology) to give us the longest possible use of these assets once deployed into the field
  • The ability to deliver a Micro Data Center anywhere on the planet regardless of external temperature and humidity conditions
  • The ability to support, maintain, and administer it remotely (a small sketch of what that looks like follows this list)
  • The ability to fit into the power envelope of a normal office building
  • Participation in our cloud environment and capabilities
  • The processes by which these facilities are maintained and serviced
  • and much, much more…
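On the remote support and administration point, here is a rough sketch of the shape that automation takes: a simple polling loop against a facility's health endpoints. The endpoint names, response format, and alert hook are hypothetical, for illustration only, and not a description of our actual tooling.

```python
import json
import time
import urllib.request

# Hypothetical health endpoints exposed by an unmanned Micro Data Center.
HEALTH_ENDPOINTS = {
    "power": "http://mdc-dulles.example.com/health/power",
    "cooling": "http://mdc-dulles.example.com/health/cooling",
    "hosts": "http://mdc-dulles.example.com/health/hosts",
}


def alert(subsystem: str, payload: dict) -> None:
    # Placeholder for paging / ticketing integration.
    print(f"ALERT [{subsystem}]: {payload}")


def poll_once(timeout_seconds: float = 5.0) -> None:
    """Poll each subsystem once and raise an alert on anything unhealthy."""
    for subsystem, url in HEALTH_ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                status = json.loads(response.read())
        except OSError as error:
            alert(subsystem, {"error": str(error)})
            continue
        if not status.get("healthy", False):
            alert(subsystem, status)


if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)  # poll every minute
```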

In my mind, it's one thing to claim a technical achievement; it's quite another to operationalize that achievement and make the process of supporting it repeatable. That's my measure of when you can REALLY declare victory.  Science experiments don't count.   It has to just plain work.    To that end, our first "beta" site for the technology was the AOL campus in Dulles, Virginia.  Out on a lonely slab of concrete in the back of one of the buildings, our future has taken shape.

Thanks in part to a lot of the work going on in the data center containerization space, we were able to jump-start much of the work in a relatively quick fashion.  In fact, the pace set by the Data Center and Technology Operations teams to deliver this achievement is more than a bit astounding.   Most, if not all, of the existing AOL data centers would fall somewhere around a traditional Tier III / Tier II Uptime Institute definition.   The teams really pushed way outside their comfort zones to deliver some incredible evolutions in a very short period of time.   Of course there were steps along the way to get here, but those steps now seem to be in double time.  A few months back we announced the launch of ATC, our first completely automated facility.   The work that went into ATC was foundational to our achievement yesterday.   It allowed us to start working on the hard stuff first, that is to say, the 'operationalization' of these kinds of environments, and it set the stage for how we could evolve to this next tier of evolution.   For a summary of the achievements of our ATC launch and the specifics of our work there, feel free to click through to the 'Breaking the Chrysalis' post I did at that time.  You can see how the work that we have been driving in our own internal cloud environments, the changes in operational procedure, and the change in thought are additive and fundamental to our latest achievement.

It's especially interesting to note that through all of the blips and hiccups occurring in the 'cloud industry', like the leap second and the terrible storms on the East Coast this week that affected many data centers, ATC, our completely unmanned facility, just kept humming along with no issues (to be fair, so did our traditional facilities), despite much of the initial negative feedback we received being based solely on doubts about the reliability of such a move.   It goes to show how important engineering FOR operation is.  At AOL we have built this in from the start.

What does this actually buy AOL?

Ok, so we stuck some computers in a box and made sure it requires very little care and feeding. What does this buy us?  Quite a bit, actually.  Jay Moran, the Distinguished Engineer who was in charge of driving this effort, is always quick to point out that the problem space here is not just about the technology; it has to be a marriage with the business side as well.  Obviously the inherent flexibility of the design allows us a greater number of places around the planet to which we can deploy capacity, and that in and of itself is pretty revolutionary.   We are no longer tied to traditional data center facilities or colocation markets.   That doesn't mean we won't use them; it means we now have a choice.  Of course this is only possible because of the internally developed cloud infrastructure, but we have freed ourselves from having to be bolted onto or into existing big infrastructure.   It allows us to have an incredible amount of geo-distributed capacity at a very low cost point in terms of upfront capital and ongoing operational expense.   This is a huge game changer.  So much so that I'll do a bit of back-of-the-napkin math with you (a quick worked version appears after the list below).   Let's call the global capacity, in terms of compute, storage, and the like, that we have today in our traditional environments the Total Compute Capability, or TCC. It's essentially the bandwidth for the work that we can get done.   Inside the cost for TCC you have operating costs such as power, lease costs, data center facility maintenance costs, support staff, and so on.  You additionally have the depreciation for the facilities themselves (or the specific build-outs, if colocating), the server and other equipment depreciation, and the rest.   Let's call that baseline cost X.   The Micro Data Center strategy, built out with our latest and most dense server standards and infrastructure, would allow us to have five times the TCC at less than 10% of that cost and physical footprint.   If you think about how this will allow us to aggregate and grow over time, it ultimately drives us to a VERY LOW operational cost structure for delivering our products and services.   Additionally, it positions us for the future in very significant ways:

  • It redefines software architecture for greater resiliency.
  • It gives us an incredibly flexible platform for addressing privacy laws, regulatory oversight, and other such concerns, allowing us to respond rapidly.
  • It further reduces energy consumption and carbon emissions (important as taxation evolves around the world, as well as for ongoing operational costs).
  • It gives us the ability to drive Edge Computing delivery and potentially bypass CDNs for certain content.
  • It gives us the capability to drive 'Community-in-a-box', whereby we can quickly launch new products in new markets, expand existing footprints like Patch on a low-cost but still hyper-local platform, and give the Huffington Post a platform to rapidly partner and enter new markets with minimal-cost turn-ups.
  • The technology mix in our SKUs, comprising compute, storage, and network capacity, maximizes the number of products and services we can deploy to it.
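To put rough numbers on that back-of-the-napkin math (illustrative figures only, normalized to the baseline above, not our actual costs):

```python
# Illustrative only: normalize today's traditional footprint to a baseline.
baseline_capacity = 1.0   # today's Total Compute Capability (TCC)
baseline_cost = 1.0       # today's all-in cost, "X"

mdc_capacity = 5.0 * baseline_capacity   # five times the TCC
mdc_cost = 0.10 * baseline_cost          # at under 10% of the baseline cost

cost_per_unit_today = baseline_cost / baseline_capacity   # 1.0
cost_per_unit_mdc = mdc_cost / mdc_capacity               # 0.02

print(f"cost per unit of capacity drops to {cost_per_unit_mdc / cost_per_unit_today:.0%} of baseline")
# -> roughly 2% of today's cost per unit of compute capability
```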

As Always, it's really about the People

I cannot let a post about this huge win for us go by without mentioning the teams involved in delivering this capability.  This is not just a win for AOL, or, to a lesser degree, for the industry at large as another proof point that it can evolve if it puts its mind to changing, but above all a win for the Technology Teams at AOL.  When I was first approached about joining AOL, my slightly sarcastic and comedic response was probably much like yours: 'Are they still around?' But the fact of the matter is that AOL has a vision of where it wants to go, and what it wants to be.   That was compelling for me personally, compelling enough for me to make the move.   What has truly amazed me, however, is the dedication and tenacity of its employees.  These achievements would not be possible without the outright aggressiveness the organization has taken to moving the company forward.  It's always hard to assess from the outside just how hard an effort is to achieve internally.  In the case of our Micro Data Center strategy, the teams faced just about every kind of barrier to delivering this capability, every kind of excuse not to make it, or even not to try.   They put all of those things aside and just plain executed.  If you allow me a small moment of bravado: not only did my teams simply kick ass, they did it in a way that moved the needle for the company, and in my mind once again catapulted themselves to the forefront of operations and technology at scale.   We still have a bunch of Nibiru projects to deliver, so my guess is we haven't heard the last of these big wins.

\Mm