The AOL Micro-DC adds new capability

Back in July, I announced AOL’s Data Center Independence Day with the release of our new ‘Micro Data Center’ approach.   In that post we highlighted the terrific work that the teams put in to revolutionize our data center approach and align it completely to not only technology goals but business goals as well.   It was an incredible amount of engineering and work to get to that point and it would be foolish to think that the work represented a ‘One and Done’ type of effort.  

So today I am happy to announce the rollout of a new capability for our Micro-DC: an indoor version.

[Photo: the indoor version of the AOL Micro-DC]

While the first instantiations of our new capability were focused on outdoor environments, we were also hard at work on an indoor version with the same set of goals. Why work on an indoor version as well? Well, you might recall that in the original post I stated:

We are no longer tied to traditional data center facilities or colocation markets. That doesn’t mean we won’t use them; it means we now have a choice. Of course this is only possible because of the internally developed cloud infrastructure, but we have freed ourselves from having to be bolted onto or into existing big infrastructure. It allows us to have an incredible amount of geo-distributed capacity at a very low cost point in terms of upfront capital and ongoing operational expense.

We need to maintain a portfolio of options for our products and services. In this case, that means having an indoor version of our capabilities to ensure that our solution can live absolutely anywhere. This will allow our footprint, automation and all, to live inside any data center colocation environment or the interior of any office building anywhere around the planet, and retain the extremely low maintenance profile we were targeting from an operational cost perspective. In a sense you can think of it as “productizing” our infrastructure. Could we have just deployed racks of servers, network kit, etc. like we have always done? Sure. But by continuing to productize our infrastructure we continue to drive down our short-term and long-term infrastructure costs. In my mind, productizing your infrastructure is actually the next evolution in the standardization of your infrastructure. You can have infrastructure standards in place – server model, RAM, HD space, access switches, core switches, and the like. But until you get to that next phase of standardizing, automating, and “productizing” it into a discrete set of capabilities, you only get a partial win.
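To make that concrete, here is a minimal sketch of the difference between a parts-list standard and a productized capability. All of the SKU names and specs below are hypothetical, purely for illustration; this is not our actual tooling.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class RackSpec:
    """The 'parts list' level of standardization: models, RAM, disk, switches."""
    server_model: str
    servers: int
    ram_gb: int
    disk_tb: float
    access_switches: int

@dataclass(frozen=True)
class InfrastructureProduct:
    """The 'product' level: a discrete, versioned capability ordered as one unit."""
    sku: str                                   # hypothetical, e.g. "MDC-INDOOR-V1"
    racks: List[RackSpec] = field(default_factory=list)
    power_kw: float = 0.0
    ops_model: str = "lights-out"              # remote operation assumed

    def total_servers(self) -> int:
        return sum(r.servers for r in self.racks)

# Ordering capacity becomes ordering a product, not designing a deployment:
indoor = InfrastructureProduct(
    sku="MDC-INDOOR-V1",
    racks=[RackSpec("web-std", 40, 96, 2.0, 2),
           RackSpec("storage-std", 20, 48, 24.0, 2)],
    power_kw=150.0)
print(indoor.total_servers())   # 60 servers, delivered and operated as one unit
```

The win is that everything below the SKU line (racking, cabling, burn-in, automation hookup) is decided once, at design time, instead of at every deployment.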

Some people have asked me, “Why didn’t you begin with the interior version to start with? It seems like it would be the easier one to accomplish.” Indeed I cannot argue with them; it would probably have been easier, as there were far fewer challenges to solve. You can make basic assumptions about where this kind of indoor solution would live, and reduce much of the complexity. I guess it all nets out to a philosophy of solving the harder problems first. Once you prove the more complicated use case, the easier ones come much faster. That is definitely the situation here.

While this new capability continues the success we are seeing in redefining the cost and operations of our particular engineering environments, the real challenge here (as with all sorts of infrastructure and cloud automation) is whether we can map that same success onto our applications and services so they work correctly in that space. On that note, I should have more to post soon. Stay tuned!

 

\Mm

AOL’s Data Center Independence Day

Yesterday we celebrated Independence Day here in the United States. It’s a day when we embrace the freedoms we enjoy as a country, look back on where we have come from, and celebrate the promise of the future. Yesterday was also a different kind of Independence Day for my teams at AOL. A Data Center Independence Day, if you will.

You may or may not have been following the progress of the work we have been doing here at AOL over the last 14 or so months, but the pace of change has been simply breathtaking. One of the first things I did when I entered the company was to deeply review all aspects of Operations, from Data Centers to Network Engineering, to the engineering teams supporting the products and services, and everything in between. The net of the exercise was that AOL was probably similar to most companies out there in terms of technology mix, from the CRUFT that I mentioned in a previous post to the latest technologies. There were some incredible technologies built over the last three decades, some outdated processes and procedures, and, if I am honest, traces of a culture where the past had more meaning than the present or the future.

In a very short period of time all of that changed. We aggressively made changes to the organization, re-aligned priorities, and perhaps most of all created and defined a powerful collection of changes and evolutions we would need to bring about on very aggressive timelines. These changes were part of a defined Technology Roadmap that broke the work we needed to accomplish into three categories. The first category focused on the internal technical challenges and tools we needed to build to enhance our own internal efficiencies. The second focused on the technical challenges and aggressive things we could do to bring greater scalability to our products and services. This would include things like adding services and technology suites to our internally developed cloud infrastructure, and other items that would allow for more rapid delivery of our products and services. The last category of work was for the incredibly aggressive “wish list” types of changes. Items that could be so disruptive, so incredibly game-changing for us, that they could redefine our work on the whole. In fact we named this group of work “Nibiru”, after a mythical planet that is said to cross into our solar system, wreak havoc, and bring about great change.

On July 4, 2012, one of our Nibiru items arrived, and I am ecstatic to state that we achieved our “Data Center Independence Day”. Our primary Nibiru goal was to develop and deliver a data center environment without the need of a physical building. The environment needed to require as little physical “touch” as possible and allow us the ultimate flexibility in how we delivered capacity for our products and services. We called this effort the Micro Data Center. If you think about the number of things that need to change to evolve to this type of strategy, it’s a bit mind-boggling.


Here are just a few of the things we had to look at, change, and automate to make this kind of achievement possible:

  • Developing an entirely new Technology Suite and the ability to deliver that capacity anywhere in the world with minimal to no staffing.
  • Delivering extremely dense compute capacity (think the latest technology) to give us the longest possible use of these assets once deployed into the field.
  • The ability to deliver a Micro Data Center anywhere on the planet regardless of external temperature and humidity
  • The ability to support, maintain, and administer the environment remotely (see the sketch after this list)
  • The ability to fit into the power envelope of a normal office building
  • Participation in our cloud environment and capabilities
  • The processes by which these facilities are maintained and serviced
  • and much much more…
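On the remote administration point referenced in the list above, the underlying principle is that anything a technician would normally do on-site has to be expressible as an automated, remotely triggered action. A rough sketch of that idea follows, with entirely hypothetical endpoint names and URLs; a real deployment would go through out-of-band management interfaces.

```python
import json
import urllib.request

# Hypothetical sites and management endpoints, purely for illustration.
SITES = ["mdc-dulles-01", "mdc-remote-02"]

def health(site: str) -> dict:
    """Poll a site's aggregate health endpoint (hypothetical URL scheme)."""
    with urllib.request.urlopen(f"https://{site}.mgmt.example.com/health") as r:
        return json.load(r)

def power_cycle(site: str, node: str) -> None:
    """Remotely power-cycle a failed node instead of dispatching a person."""
    req = urllib.request.Request(
        f"https://{site}.mgmt.example.com/nodes/{node}/power-cycle",
        method="POST")
    urllib.request.urlopen(req)

for site in SITES:
    for node in health(site).get("failed_nodes", []):
        power_cycle(site, node)   # reactive work becomes scheduled, remote work
```

The real shift is less about the mechanics and more about the operating model: failures accumulate into planned, batched site visits instead of reactive dispatches.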

In my mind, it’s one thing to claim a technical achievement; it’s quite another to operationalize that achievement and make the process of supporting it repeatable. That’s my measure of when you can REALLY declare victory. Science experiments don’t count. It has to just plain work. To that end, our first “beta” site for the technology was the AOL campus in Dulles, Virginia. Out on a lonely slab of concrete in the back of one of the buildings, our future has taken shape.

Thanks in part to a lot of the work going on in the data center containerization space, we were able to jump-start much of the work relatively quickly. In fact, the pace set by the Data Center and Technology Operations teams to deliver this achievement is more than a bit astounding. Most, if not all, of the existing AOL Data Centers would fall somewhere around a traditional Tier III / Tier II Uptime Institute definition. The teams pushed way outside their comfort zones to deliver some incredible evolutions in a very short period of time. Of course there were steps along the way to get here, but those steps now seem to be in double time. A few months back we announced the launch of ATC, our first completely automated facility. The work that went into ATC was foundational to our achievement yesterday. It allowed us to start working on the hard stuff first, that is to say, the “operationalization” of these kinds of environments, and it set the stage for how we could evolve to this next tier of evolution. Below is a summary of some of the achievements of our ATC launch, but if you are curious about the specifics of our work there, feel free to read the “Breaking the Chrysalis” post I wrote at that time. You can see how the work we have been driving in our own internal cloud environments, the changes in operational procedure, and the change in thinking are additive and fundamental to our latest achievement. It’s especially interesting to note that with all of the blips and hiccups occurring in the “cloud industry”, like the leap second and the terrible storms on the East Coast this week that affected many data centers, ATC, our completely unmanned facility, just kept humming along with no issues (to be fair, neither did our traditional facilities), even though much of the initial negative feedback we received was based squarely on doubts about the reliability of such a move. It goes to show how important engineering FOR Operation is. For AOL, we have built this in from the start.

What does this actually buy AOL?

OK, we stuck some computers in a box and we made sure it requires very little care and feeding – what does this buy us? Quite a bit, actually. Jay Moran, the Distinguished Engineer who was in charge of driving this effort, is always quick to point out that the problem space here is not just about the technology; it has to be a marriage with the business side as well. Obviously the inherent flexibility of the design allows us a greater number of places around the planet where we can deploy capacity, and that in and of itself is pretty revolutionary. We are no longer tied to traditional data center facilities or colocation markets. That doesn’t mean we won’t use them; it means we now have a choice. Of course this is only possible because of the internally developed cloud infrastructure, but we have freed ourselves from having to be bolted onto or into existing big infrastructure. It allows us to have an incredible amount of geo-distributed capacity at a very low cost point in terms of upfront capital and ongoing operational expense. This is a huge game changer. So much so that I’ll do a bit of “back of the napkin” math with you. Let’s call the global capacity in terms of compute, storage, etc. that we have today in our traditional environments the Total Compute Capability, or TCC. It’s essentially the bandwidth for the work that we can get done. Inside the cost for TCC you have operating costs such as power, lease costs, data center facility maintenance costs, support staff, and so on. You additionally have the depreciation for the facilities themselves (or the specific buildouts, if colocating), the server and other equipment depreciation, and the rest. Let’s call that baseline X. The Micro Data Center strategy, built out with our latest, most dense server standards and infrastructure, would allow us to have 5X the total TCC at less than 10% of the cost and physical footprint. If you think about how this will allow us to aggregate and grow over time, it ultimately drives us to a VERY LOW operational cost structure for delivering our products and services.
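To put toy numbers on that napkin (illustrative round figures, not our actual costs):

```python
# Normalize today's capacity and cost to 1.0 each.
tcc_today, cost_today = 1.0, 1.0                       # baseline TCC at cost "X"
tcc_mdc,   cost_mdc   = 5 * tcc_today, 0.10 * cost_today  # 5X TCC, <10% of X

compute_per_dollar_today = tcc_today / cost_today      # 1.0
compute_per_dollar_mdc   = tcc_mdc / cost_mdc          # 50.0

print(compute_per_dollar_mdc / compute_per_dollar_today)  # 50x compute per dollar
```

Additionally, it positions us for the future in very significant ways.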

  • It redefines software architecture for greater resiliency
  • It allows us an incredibly flexible platform for driving and addressing privacy laws, regulatory oversight, and other such concerns allowing us to respond rapidly.
  • It further reduces energy consumption and carbon footprint emissions (important as taxation evolves around the world, as well as ongoing operational costs)
  • Gives us the ability to drive Edge Computing delivery to potentially bypass CDNs for certain content.
  • Gives us the capability to drive “Community-in-a-box”, whereby we can quickly launch new products in new markets, quickly expand existing footprints like Patch on a low-cost but still hyper-local platform, and give the Huffington Post a platform to rapidly partner and enter new markets with minimal-cost turn-ups.
  • The fact that the technology mix in our SKUs comprises compute, storage, and network capacity maximizes the number of products and services we can deploy to it.

As Always, it’s really about the People

I cannot let a post about this huge win for us go by without mentioning the teams involved in delivering this capability. This is a win not just for AOL, or, to a lesser degree, for the industry at large (another proof point that it can evolve if it puts its mind to changing), but above all for the Technology Teams at AOL. When I was first approached about joining AOL, my slightly sarcastic and comedic response was probably much like yours – “Are they still around?” But the fact of the matter is that AOL has a vision of where it wants to go and what it wants to be. That was compelling for me personally, compelling enough for me to make the move. What has truly amazed me, however, is the dedication and tenacity of its employees. These achievements would not be possible without the outright aggressiveness the organization has brought to moving the company forward. It’s always hard to assess from the outside just how hard an effort is to achieve internally. In the case of our Micro Data Center strategy, the teams faced just about every kind of barrier to delivering this capacity, and every kind of excuse not to make it, or even not to try. They put all of those things aside and just plain executed. If you allow me a small moment of bravado: not only did my teams simply kick ass, they did it in a way that moved the needle for the company, and in my mind once again catapulted themselves into the forefront of operations and technology at scale. We still have a bunch of Nibiru projects to deliver, so my guess is we haven’t heard the last of these big wins.

\Mm

ATC Ribbon Cutting

[Photo: ATC grand opening]

In my previous post I mentioned how extremely proud I was of the Technology teams here at AOL for delivering a truly state-of-the-art data center facility with some incredible groundbreaking technology. As I mentioned, the facility was actually in production use faster than we could get the ribbon cutting ceremony scheduled. I thought I would share a small slice of the pictures from the internal ribbon cutting event.

[Photo: Alex Gounares and me with one of our cloud engineers]

Alex Gounares, fellow former Microsoft alum and now AOL’s CTO, and I presided over the celebration. In this photo, Alex and I talk over some of the technologies used in our cloud with one of our cloud engineers. As the facility is based upon pre-racked technologies and modular facility and network build components, it allows for significant cost and capital optimization: we build only when demand and growth dictate the need. All machines in the background are live and have been live for a few weeks.

[Photo: cutting the ribbon]

After receiving two very large scissors, which were remarkably sharp and precise for their size, we were ready to go. A few short words about the phenomenal job our teams performed, and it was time for some ribbon to kiss raised floor.

 

 


At the end of the day, the reason this project was such a success breaks down to the team responsible for this incredible win. An effort like this took incredibly smart people from different organizations working together to make it a reality. The achievement is even more impressive in my mind when you consider that in many cases our 90-day idea-to-live timeframe included design and execution on the go! My guess is our next one may be significantly faster without all that design time. The true heroes of ATC are below!

[Photo: the ATC team]

 

\Mm

(Special thanks goes out to Krysta Scharlach for permission to use her pictures in this post)

Breaking the Chrysalis

What has come before

When I first took my position at AOL I knew I was going to be in for some very significant challenges. This position, perhaps more so than any other in my career, was going to push the bounds of my abilities: as a technologist, as an operations professional, as a leader, and as someone who would hold measurable accountability for the operational success of an expansive suite of products and services. As many of you may know, AOL has been engaged in what used to be called internally a “Start-Around”, essentially an effort to fundamentally change the company from its historic roots into the premium content provider for the Internet.

We no longer use this term internally, as it is no longer about forming or defining that vision. It has shifted to something more visceral. More tangible. It’s a challenge that most companies should be familiar with: it’s called Execution. Execution is a very simple word, but as any good operations professional knows, the devil is in the details, and those details have layers and layers of nuances. It’s where the proverbial rubber meets the road. For my responsibilities within the company, execution revolves 100% around delivering the technologies and services to ensure our products and content remain available to the world. It is also about fundamentally transforming the infrastructural technologies and platform systems our products and content are based upon, and providing the most agility and mobility we can to our business lines.

One fact that is often forgotten in the fast-paced world of Internet Darlings is that AOL had achieved a huge scale of infrastructure and technology investment long before many of these companies were gleams in the eyes of their founders. While it may be fun and “new” to look at the tens of thousands of machines at Facebook, Google, or Microsoft, it is often overlooked that AOL had tens of thousands of machines (and still does!) and solved many of the same problems years ago. To be honest, it was a personal revelation for me when I joined. There are few companies that have had to grow and operate at this kind of scale, and every approach is a bit unique and different. It was an interesting lesson, even for one who had a ton of experience doing something similar in “Internet Darling” infrastructures.

AOL has been around for over 27 years. In technology circles, that’s like going back almost ten generations. Almost three decades of “stuff”. The stuff was not only gear and equipment from the natural growth of the business; it included the expansion of features and functionality of long-standing services, increased systems interdependencies, and operational, technological, and programmatic “Cruft” as new systems, processes, and technologies were built upon or bolted onto older systems.

This “cruft” adds significant complexity to your operating environment and can truly limit your organization’s agility. As someone tasked with making all this better, it struck me that we actually had at least two problems to solve: the platform and foundation for the future, and a method and/or strategy for addressing the older products, systems, and environments while increasing our overall agility as a company.

These are hard problems. People have asked why I haven’t blogged externally in a while. This is the kind of challenge, with multiple layers of challenges underneath, that can keep one up at night. From a strategy perspective, do you target the new first? Do you target the legacy environments to reduce the operational drag? Or do you try to define a unified strategy to address both? The last is a lot harder and generally more complex, but the potential payoff is huge. Luckily I have a world-class team at AOL, and together we built and entered our own cocoon and busily went to work. We have gone down the path of changing out technology platforms, operational processes, and outdated ways of thinking about data centers, infrastructure, and our overall approach, fighting forward every inch on this idea of unified infrastructure.

It was during this process that I came to realize that our particular legacy challenge, while at “Internet” scale, was more closely related to the challenges of most corporate or government environments than to those of the biggest Internet players. Sure we had big scale, we had hundreds of products and services, but the underlying “how to get there from here” problems were more universally like IT challenges than like scaling out similar applications across commoditized infrastructure. It ties into all the marketing promises, technological snake oil, and other baloney about the “cloud”. The difference being that we had to quickly deliver something that worked and would not impact the business. Whether we wanted to or not, we would be walking down roads similar to those facing most IT organizations today.

As I look at the challenges facing modern IT departments across the world, their ability to “go to the cloud” or make use of new approaches is also held firmly in place by the “cruft” of their past. Sometimes that cruft is so thick that the organization cannot move forward. We were there; we were in the same boat. We aren’t out of it yet, but we have made some developments that I think are pretty significant, and I intend to share those learnings where appropriate.

 


ATC IS BORN

Last week we launched a brand new data center facility we call ATC. This facility is fundamentally built upon the work that we have been doing around our own internal cloud technologies, shifts in operational process and methodology, and our ability to be extremely agile in our new business model. It represents a model for how to migrate the old, prepare for the new, and provide a platform upon which to build our future.

Most people ignore the soft costs when looking at adoption of different cloud offerings; operational impacts are typically considered as afterthoughts. What if you built those requirements in from day one? How would that change your design? Your implementation? Your overall strategy? I believe that ATC represents that kind of shift in thinking. At least for us internally.

One of the key foundations for our ATC facility is our cloud platform and automation layer. I like to think about this layer as a little bit country and a little bit rock and roll. There is tremendous value in the learnings that have come before, and nowhere is this more self-evident than at AOL. As I mentioned, the great minds of the past (as well as those in the present) invested in many great systems that made this company a giant in the industry. There are many such systems here, but one of the key ones in my mind is the Configuration Management System. All organizations invest significantly in this type of platform. If done correctly, its uses can span far beyond a rudimentary asset management system to include cost allocation, dependency mapping, detailed configuration and environmental data, and, in some cases like ours, the base foundation for leading us into the cloud.

Many companies I speak with abandon this work altogether or live in a strange split/hybrid model where they treat “Cloud” as different. In our space, new government regulations, new safe harbor laws, and the like continue to drive the relevance of a universal system that acts as a central authority. The fact that this technology actually sped our development efforts in this automation cannot be ignored.
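As a concrete illustration of what “central authority” means in practice, a single record can serve asset management, cost allocation, dependency mapping, and provisioning all at once. The field names below are hypothetical, not our actual schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CMDBRecord:
    """One host's entry in a hypothetical configuration management system.

    The same record feeds multiple consumers: finance reads cost_center,
    operations reads location, and the provisioning automation reads
    role and config to build the machine without human intervention.
    """
    hostname: str
    asset_tag: str                                        # asset management
    cost_center: str                                      # cost allocation
    location: str                                         # placement
    role: str                                             # which service image to apply
    depends_on: List[str] = field(default_factory=list)   # dependency mapping
    config: Dict[str, str] = field(default_factory=dict)  # detailed settings

web01 = CMDBRecord(
    hostname="web01.example.com", asset_tag="A-104233",
    cost_center="homepage", location="atc-pod1", role="frontend",
    depends_on=["db03.example.com"], config={"os": "linux", "ram_gb": "96"})
```

Because provisioning reads the same record that finance and operations read, the cloud never drifts away from the system of record, which is exactly the split/hybrid trap described above.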

We went from provisioning servers in days to getting base virtual machines up and running in under 8 seconds. Want service and application images (for established products)? Add another 8 seconds or so. Want to roll it into production globally (global DNS, load balancing, and security changes)? Let’s call that another minute. We used Open Source products and added our own development glue into our own systems to make all this happen. I am incredibly proud of my Cloud teams here at AOL, because what they have been able to do in such a relatively short period of time is roll out a world-class cloud and service provisioning system that can be applied to new efforts and platforms or to our older products. Better yet, the provisioning systems were built to be universal, so that if required we can do the same thing with stand-alone physical boxes or virtual machines. No difference. Same system. This technology platform was recently recognized by the Uptime Institute at its last Symposium in California.
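At a very high level, the flow behind those numbers looks something like the sketch below. The function bodies are stand-ins (they just print), and the names are invented; the real system is Open Source components plus our own glue.

```python
def create_base_vm(hostname: str) -> None:
    print(f"{hostname}: base VM up")            # ~8s: clone a base image

def apply_service_image(hostname: str, role: str) -> None:
    print(f"{hostname}: {role} image applied")  # ~8s: layer on the app stack

def publish(hostname: str) -> None:
    print(f"{hostname}: in global rotation")    # ~1 min: DNS, LB, security

def provision(hostname: str, role: str) -> None:
    """Three-stage flow matching the timings described above (~75s total)."""
    create_base_vm(hostname)
    apply_service_image(hostname, role)
    publish(hostname)

provision("web42.example.com", "frontend")
```

The same entry point works whether the target is a virtual machine or a stand-alone physical box; the caller never knows the difference.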


This technology was put to the test recently with the earthquake that hit the East Coast of the United States. While thankfully the damage was minimal, the tremor of Internet traffic was incredible. The AOL homepage, along with our news sites, started to get hammered with traffic and requests. In the past this would have required a massive human effort to provision more capacity for our users. With the new technology in place we were able to start adding additional machines to take the load extremely quickly, with very minimal impact to our users. In this particular case these machines were provisioned from our systems in existing data centers (not ATC), but the technology is the same.
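In control-loop terms, the earthquake response boils down to something like this. The 60% utilization target and the notion of “machine-equivalents” of load are invented for illustration:

```python
import math

def machines_to_add(current_load: float, capacity: int,
                    target_util: float = 0.6) -> int:
    """How many machines to add so utilization returns to the target level.

    current_load is expressed in machine-equivalents of work; the target
    and figures are illustrative only.
    """
    needed = math.ceil(current_load / target_util)
    return max(0, needed - capacity)

# Traffic spike: 90 machine-equivalents of load hit a 60-machine pool.
# Provision 90 more, each roughly 16 seconds away per the flow above:
print(machines_to_add(current_load=90.0, capacity=60))  # prints 90
```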

This kind of technology and agility has some interesting side effects too. It allows your organization to move much more quickly and aggressively than ever before. I have seen Jevons paradox manifest itself over and over again in the technology world. For those of you who need a refresher, Jevons paradox is the proposition that technological progress that increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.

It’s like when car manufacturers started improving the miles per gallon (MPG) efficiency of autos: the direct result was not a reduction in driving, but rather an overall increase in travel.
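A toy numeric version of the paradox, with invented figures: treat an efficiency gain as a drop in the effective price of the service, let demand respond to that price, and watch raw resource use climb whenever demand is elastic enough.

```python
def resource_use(efficiency: float, elasticity: float,
                 baseline: float = 100.0) -> float:
    """Toy Jevons model, all numbers illustrative.

    An efficiency gain divides the effective price of the service by
    `efficiency`; demand for the service grows as price falls
    (price ** -elasticity); raw resource use is service demand divided
    by efficiency.
    """
    service_demand = baseline * efficiency ** elasticity
    return service_demand / efficiency

print(resource_use(efficiency=2.0, elasticity=1.5))  # ~141: 2x efficiency, MORE use
print(resource_use(efficiency=2.0, elasticity=0.5))  # ~71: inelastic demand, use falls
```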

ATC officially launched on October 1, 2011. It took all of an hour for almost 100 virtual machines to be deployed to it once it was “turned on”. It has since long passed that mark; in fact, adoption is happening faster than we can coordinate executive schedules for our executive ribbon cutting ceremony this week.

While the cloud development and technology efforts are cornerstones of the facility, it is not this work alone that provides something unique. After all, however slick our virtualization and provisioning systems are, however deeply integrated they are into our internal tools and configuration management systems, those characteristics in and of themselves do not reflect the true evolution that ATC represents.

ATC is a 100% lights-out facility. There are absolutely no employees stationed at the facility, full-time, contract, or otherwise. The entire premise is that we have moved from a reactive support model to a proactive, planned-work support model. If you compare this with other facilities (including some I built myself in the past), there are always personnel on site, even if contractors. This has fundamentally led to significant changes in how we operate our data centers and in how, what, and when we do our work, and it has driven down the overall costs to operate our environments. These changes range from efficiencies and approaches I have used before (100% pre-racked, vendor-integrated gear and systems integration) to fundamentally brand-new approaches. They have not been easy, and a ton of credit goes to our operations and engineering staff in the Data Centers and across the Technology Operations world here at AOL. It’s always culturally tough to be open to fundamentally changing business as usual. Another key aspect of this facility and infrastructure is that, from a network perspective, it is nearly 100% non-blocking. My network engineers, being network engineers, pointed out that it’s not completely non-blocking for a few reasons, but I can honestly say that this topology is the closest to “completely” non-blocking I have ever seen deployed in a real network environment, especially compared to the industry standard of 2:1 oversubscription.
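For context on that 2:1 figure: oversubscription is just the ratio of server-facing bandwidth to fabric-facing bandwidth on a switch, and non-blocking means that ratio is 1:1. A quick calculation with generic port counts (not our actual gear):

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of server-facing to fabric-facing bandwidth on a switch.
    1.0 means non-blocking: every server can burst at line rate at once."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# A common aggressive design: 48 x 10G server ports, 4 x 40G uplinks -> 3:1
print(oversubscription(48, 10, 4, 40))   # 3.0

# The industry-standard 2:1: 48 x 10G down, 6 x 40G up
print(oversubscription(48, 10, 6, 40))   # 2.0

# Non-blocking needs uplink bandwidth equal to downlink: 12 x 40G up
print(oversubscription(48, 10, 12, 40))  # 1.0
```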

Another incredible aspect of this new data center facility and the technology deployed is our ability to quick-launch compute capacity. The total time it took to go from idea inception (no data center) to delivering active capacity to our internal users was 90 days. In my mind this is made even more incredible by the fact that this was the first time all these work-streams came together, including the unified operations deployment model and all of the physical aspects of just getting iron to the floor. This timeframe was made possible by a standardized, modular way of building out our compute capacity in logical segments based upon the infrastructure cloud tier being deployed (low tier, mid-tier, etc.). This approach has given us a predictability in speed of deployment and cost that, in my opinion, is unparalleled.

The culmination of all of this work is the result of some incredible teams devoted to the desire to effect change, a little dash of renegade engineering, a heaping helping of new perspective, blood, sweat, tears, and vision. I am extremely proud of the teams here at AOL for delivering this groundbreaking achievement. But then again, I am more than a bit biased. I have seen the passion of these teams manifested in some incredible technology.

As with all things like this, it’s been a journey and there is still a bunch of work to do. Still more to optimize. Deeper analysis and easier aggregation for stubborn legacy environments. We have already set our sights on the next generation of cloud development. But for today, we have successfully built a new foundation upon which even more will be built. For those of you who were not able to attend the Uptime Symposium this year, I will be putting up some videos that give you a flavor of our work driving a low-cost cloud compute and provisioning system from Open Source components.

 

\Mm