Breaking the Chrysalis

What has come before

When I first took my position at AOL, I knew I was in for some very significant challenges. This position, perhaps more so than any other in my career, was going to push the bounds of my abilities: as a technologist, as an operations professional, as a leader, and as someone who would hold measurable accountability for the operational success of an expansive suite of products and services. As many of you may know, AOL has been engaged in what we used to call internally a “Start-Around”: essentially, an effort to fundamentally change the company from its historic roots into the premium content provider for the Internet.

We no longer use this term internally, as the effort is no longer about forming or defining that vision. It has shifted to something more visceral, more tangible. It’s a challenge that most companies should be familiar with: it’s called Execution. Execution is a very simple word, but as any good operations professional knows, the devil is in the details, and those details have layers and layers of nuance. It’s where the proverbial rubber meets the road. For my responsibilities within the company, execution revolves 100% around delivering the technologies and services that ensure our products and content remain available to the world. It is also about fundamentally transforming the infrastructural technologies and platform systems our products and content are built upon, and providing the most agility and mobility we can to our business lines.

One fact that is often forgotten in the fast-paced world of Internet Darlings is that AOL had achieved a huge scale of infrastructure and technology investment long before many of these companies were gleams in the eyes of their founders. While it may be fun and “new” to look at the tens of thousands of machines at Facebook, Google, or Microsoft, it is often overlooked that AOL had tens of thousands of machines (and still does!) and solved many of the same problems years ago. To be honest, it was a personal revelation for me when I joined. There are few companies that have had to grow and operate at this kind of scale, and every approach is a bit unique and different. It was an interesting lesson, even for one who had a ton of experience doing something similar in “Internet Darling” infrastructures.

AOL has been around for over 27 years. In technology circles, that’s like going back almost ten generations. Almost three decades of “stuff.” The stuff was not only gear and equipment from the natural growth of the business, but also the expansion of features and functionality of long-standing services, increased systems interdependencies, and operational, technological, and programmatic “cruft” as new systems, processes, and technologies were built upon or bolted onto older systems.

This “cruft” adds significant complexity to your operating environment and can truly limit your organization’s agility. As someone tasked with making all this better, it struck me that we actually had at least two problems to solve: building the platform and foundation for the future, and finding a method or strategy for addressing the older products, systems, and environments to increase our overall agility as a company.

These are hard problems. People have asked why I haven’t blogged externally in a while. This is the kind of challenge, with multiple layers of challenges underneath, that can keep one up at night. From a strategy perspective, do you target the new first? Do you target the legacy environments to reduce the operational drag? Or do you try to define a unified strategy to address both? The last is a lot harder and generally more complex, but the potential payoff is huge. Luckily I have a world-class team at AOL, and together we built and entered our own cocoon and busily went to work. We have gone down the path of changing out technology platforms, operational processes, and outdated ways of thinking about data centers, infrastructure, and our overall approach, fighting forward every inch on this idea of unified infrastructure.

It was during this process that I came to realize that our particular legacy challenge, while at “Internet” scale, was more closely related to the challenges of most corporate or government environments than to those of the biggest Internet players. Sure, we had big scale, and we had hundreds of products and services, but the underlying “how do we get there from here” problems looked more like universal IT challenges than like scaling out similar applications across commoditized infrastructure. It ties into all the marketing promises, technological snake oil, and other baloney about the “cloud.” The difference is that we had to quickly deliver something that worked and would not impact the business. Whether we wanted to or not, we would be walking down roads similar to those facing most IT organizations today.

As I look at the challenges facing modern IT departments across the world, their ability to “go to the cloud” or make use of new approaches is likewise securely anchored by the “cruft” of their past. Sometimes that cruft is so thick that the organization cannot move forward. We were there; we were in the same boat. We aren’t out of it yet, but we have made some developments that I think are significant, and I intend to share those learnings where appropriate.

 


ATC IS BORN

Last week we launched a brand-new data center facility we call ATC. This facility is fundamentally built upon the work we have been doing around our own internal cloud technologies, shifts in operational process and methodology, and our drive to be extremely agile in our new business model. It represents a model for how to migrate the old, prepare for the new, and provide a platform upon which to build our future.

Most people ignore the soft costs when looking at adoption of different cloud offerings; operational impacts are typically considered as afterthoughts. What if you built those requirements in from day one? How would that change your design? Your implementation? Your overall strategy? I believe that ATC represents that kind of shift in thinking, at least for us internally.

One of the key foundations for our ATC facility is our cloud platform and automation layer. I like to think about this layer as a little bit country and a little bit rock and roll. There is tremendous value in the learnings that have come before, and nowhere is this more evident than at AOL. As I mentioned, the great minds of the past (as well as those in the present) invested in many great systems that made this company a giant in the industry. There are many such systems here, but one of the key ones in my mind is the Configuration Management System. Organizations invest significantly in this type of platform. If done correctly, its uses can extend far beyond a rudimentary asset management system to include cost allocation, dependency mapping, detailed configuration and environmental data, and, in cases like ours, the base foundation for leading us into the cloud.

Many companies I speak with abandon this work altogether or live in a strange split/hybrid model where they treat “Cloud” as different. In our space, new government regulations, new safe harbor laws, and the like continue to drive the relevance of a universal system acting as a central authority. The fact that this technology actually sped our automation development efforts cannot be ignored.
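To make that idea concrete, here is a minimal sketch, in Python, of provisioning that treats a configuration management system as the single source of truth. Every name and field here is a hypothetical illustration, not our actual internal API.

```python
# Hypothetical sketch: a configuration management system (CMDB) as the
# central authority. Every asset, physical or virtual, is registered
# before it can serve traffic. Names and fields are illustrative only.

from dataclasses import dataclass, field

@dataclass
class CmdbRecord:
    hostname: str
    owner: str                    # business line, for cost allocation
    service: str                  # product/service this node belongs to
    environment: str              # prod / staging / dev
    depends_on: list = field(default_factory=list)  # dependency mapping

class Cmdb:
    """A toy central authority shared by cloud and legacy assets alike."""
    def __init__(self):
        self._records = {}

    def register(self, record: CmdbRecord) -> None:
        self._records[record.hostname] = record

    def lookup(self, hostname: str) -> CmdbRecord:
        return self._records[hostname]

def provision(cmdb: Cmdb, hostname: str, service: str, owner: str) -> CmdbRecord:
    # The key idea: provisioning *starts* with the CMDB, so there is no
    # split/hybrid inventory where "Cloud" machines live somewhere else.
    record = CmdbRecord(hostname, owner, service, "prod")
    cmdb.register(record)
    # ... the actual build steps (VM clone, image, network) follow ...
    return record
```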

We went from provisioning servers in days to getting base virtual machines up and running in under 8 seconds. Want service and application images (for established products)? Add another 8 seconds or so. Want to roll it into production globally (global DNS, load balancing, and security changes)? Let’s call that another minute. We used Open Source products and added our own development glue into our own systems to make all this happen. I am incredibly proud of my Cloud teams here at AOL, because what they have been able to do in such a relatively short period of time is roll out a world-class cloud and service provisioning system that can be applied to new efforts and platforms or to our older products. Better yet, the provisioning systems were built to be universal, so that if required we can do the same thing with stand-alone physical boxes or virtual machines. No difference. Same system. This technology platform was recently recognized by the Uptime Institute at its last Symposium in California.
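As a rough illustration of that flow, here is a sketch of the three stages and their approximate timings. The function names and the commented-out calls are assumptions for illustration only; the real system is Open Source components plus our own glue.

```python
# Illustrative sketch of the provisioning flow described above; all
# names are hypothetical and the timed steps are stubs.

import time

def create_base_vm(tier: str) -> str:
    """Stage 1: clone a base virtual machine (~8 seconds in practice)."""
    vm_id = f"vm-{int(time.time() * 1000)}"
    # hypervisor.clone(base_image_for(tier), vm_id)    # hypothetical call
    return vm_id

def apply_service_image(vm_id: str, product: str) -> None:
    """Stage 2: layer an established product's application image
    (roughly another 8 seconds)."""
    # image_store.apply(vm_id, product)                # hypothetical call

def promote_to_production(vm_id: str, product: str) -> None:
    """Stage 3: roll into production globally -- DNS, load balancing,
    and security changes (call it about a minute end to end)."""
    # dns.update(product, vm_id); lb.add_member(product, vm_id)  # hypothetical

def provision_service(product: str, tier: str = "mid-tier") -> str:
    vm_id = create_base_vm(tier)
    apply_service_image(vm_id, product)
    promote_to_production(vm_id, product)
    return vm_id
```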


This technology was put to the test recently with the earthquake that hit the East Coast of the United States. While thankfully the damage was minimal, the tremor of Internet traffic was incredible. The AOL homepage, along with our news sites, started to get hammered with traffic and requests. In the past this would have required a massive manual effort to provision more capacity for our users. With the new technology in place, we were able to start adding additional machines to take the load extremely quickly, with minimal impact to our users. In this particular case the machines were provisioned from our systems in existing data centers (not ATC), but the technology is the same.
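The control loop behind that kind of response can be surprisingly small. Here is a hedged sketch of the pattern; the threshold, batch size, poll interval, and metrics call are all assumptions, not our production values.

```python
# Sketch of a scale-out loop: while load stays above a threshold, keep
# provisioning capacity in batches. All constants are invented.

import time

LOAD_THRESHOLD = 0.80    # fraction of serving capacity in use
BATCH_SIZE = 10          # machines added per pass
POLL_SECONDS = 30        # time to let new capacity absorb traffic

def current_load(product: str) -> float:
    """Placeholder for a real metrics query (demand vs. capacity)."""
    raise NotImplementedError

def absorb_spike(product: str) -> None:
    while current_load(product) > LOAD_THRESHOLD:
        for _ in range(BATCH_SIZE):
            provision_service(product)   # the ~8-second pipeline above
        time.sleep(POLL_SECONDS)
```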

This kind of technology and agility has some interesting side effects too. It allows your organization to move much more quickly and aggressively than ever before. I have seen Jevons paradox manifest itself over and over again in the technology world. For those of you who need a refresher, Jevons paradox is the proposition that technological progress which increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.

It’s like when car manufacturers started putting Miles per Gallon (MPG) ratings on autos: the direct result was not a reduction in driving, but rather an overall increase in travel.

ATC officially launched on October 1, 2011. It took all of an hour for almost 100 virtual machines to be deployed to it once it was “turned on.” It has long since passed that mark; in fact, usage of this technology is growing faster than we can coordinate executive schedules for our ribbon-cutting ceremony this week.

While the cloud development and technology efforts are cornerstones of the facility, it is not this work alone that makes it unique. After all, however slick our virtualization and provisioning systems are, and however deeply integrated they are into our internal tools and configuration management systems, those characteristics in and of themselves do not reflect the true evolution that ATC represents.

ATC is a 100% lights-out facility. There are absolutely no employees stationed at the facility full time, contract, or otherwise. The entire premise is that we have moved from a reactive support model to a proactive, planned-work support model. If you compare this with other facilities (including some I built myself in the past), there are always personnel on site, even if contractors. This has fundamentally changed how we operate our data centers and how, what, and when we do our work, and it has driven down the overall costs of operating our environments. These changes range from efficiencies and approaches I have used before (100% pre-racked, vendor-integrated gear and systems integration) to fundamentally brand-new approaches. They have not been easy, and a ton of credit goes to our operations and engineering staff in the data centers and across the Technology Operations world here at AOL. It is always culturally tough to be open to fundamentally changing business as usual.

Another key aspect of this facility and infrastructure is that, from a network perspective, it is nearly 100% non-blocking. My network engineers, being network engineers, pointed out that it is not completely non-blocking for a few reasons, but I can honestly say that the topology is the closest to “completely” non-blocking I have ever seen deployed in a real network environment, especially compared to the industry standard of 2:1 oversubscription.
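For readers unfamiliar with that 2:1 figure: oversubscription is the ratio of server-facing bandwidth a switch accepts to the uplink bandwidth it offers toward the rest of the fabric. A quick illustrative calculation follows; the port counts are invented and are not ATC’s actual design.

```python
# Illustrative oversubscription math; port counts are made up.

def oversubscription(downlinks: int, down_gbps: float,
                     uplinks: int, up_gbps: float) -> float:
    """Ratio of server-facing to fabric-facing bandwidth.
    1.0 means fully non-blocking; 2.0 is the common 2:1 standard."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)

# A typical 2:1 top-of-rack design: 48 x 1G down, 24G of uplink.
print(oversubscription(48, 1, 24, 1))   # -> 2.0 (can block at full load)

# A non-blocking design: uplink capacity matches downlink capacity.
print(oversubscription(48, 1, 48, 1))   # -> 1.0 (non-blocking)
```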

Another incredible aspect of this new data center facility and the technology deployed in it is our ability to quick-launch compute capacity. The total time it took to go from idea inception (no data center) to delivering active capacity to our internal users was 90 days. In my mind this is made even more incredible by the fact that this was the first time all of these work-streams came together, including the unified operations deployment model and all of the physical aspects of just getting iron to the floor. This time frame was made possible by a standardized, modular way of building out our compute capacity in logical segments based upon the infrastructure cloud type being deployed (low-tier, mid-tier, etc.). This approach has given us a predictability in deployment speed and cost that, in my opinion, is unparalleled.
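One way to picture the modular approach: describe each cloud tier once, then stamp out identical blocks of it. The tier names echo the ones above, but the specifications are invented for illustration.

```python
# Sketch of standardized capacity blocks; all specs are invented.

CAPACITY_BLOCKS = {
    "low-tier":  {"racks": 4,  "servers_per_rack": 40, "redundancy": "N"},
    "mid-tier":  {"racks": 8,  "servers_per_rack": 40, "redundancy": "N+1"},
    "high-tier": {"racks": 12, "servers_per_rack": 32, "redundancy": "2N"},
}

def order_block(tier: str, count: int = 1) -> dict:
    """Because every block of a tier is identical (pre-racked and
    vendor-integrated), both cost and lead time become predictable."""
    spec = CAPACITY_BLOCKS[tier]
    return {
        "tier": tier,
        "blocks": count,
        "total_servers": count * spec["racks"] * spec["servers_per_rack"],
        "redundancy": spec["redundancy"],
    }

print(order_block("mid-tier", 2))   # -> 640 servers at N+1, fixed lead time
```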

The culmination of all of this work is the result of some incredible teams devoted to the desire to effect change, a little dash of renegade engineering, a heaping helping of new perspective, and blood, sweat, tears, and vision. I am extremely proud of the teams here at AOL for delivering this ground-breaking achievement. But then again, I am more than a bit biased. I have seen the passion of these teams manifested in some incredible technology.

As with all things like this, it’s been a journey, and there is still a bunch of work to do. Still more to optimize. Deeper analysis and easier aggregation for stubborn legacy environments. We have already set our sights on the next generation of cloud development. But for today, we have successfully built a new foundation upon which even more will be built. For those of you who were not able to attend the Uptime Symposium this year, I will be putting up some videos that give you some flavor of our work driving a low-cost cloud compute and provisioning system from Open Source components.

 

\Mm

Author: mmanos

Infrastructure at Scale Technologist and Cloud Aficionado.

30 thoughts on “Breaking the Chrysalis”

  1. Go AOL…. Well done to all teams…. Still makes me proud to hear about the progress. D.W. Need more photos of you prancing around ATC. 😉

      1. This is clearly a win for the teams involved. There was a ton of great work going on here before I got here, and there is a lot of great work continuing and evolving.

      2. If I somehow insinuated that this effort was all me, that was not my intention. The Technology teams here are first rate.

      3. You are of course both right… a real case of nanos gigantium humeris insidentes (standing on the shoulders of giants)… so respect to Mike and team for delivering the vision

    1. Hardware failures are handled with routine and scheduled visits. If infrastructure fails we move the actual ‘properties’ (our products and services) to other cloud instances in the same data center, or across to our other facilities in the region or beyond. – Hope this helps!

  2. Sure wish you were back at Microsoft. What most people reading this will fail to understand is that you know how to lead and drive a vision. I was not surprised by the quote that highlighted that some of this stuff existed before you got there. Your genius is packaging up solutions to be unique and then driving the execution to the end. This is a great story and I am happy for you. Congratulations to your new team. Your walk-about is now over, come back to Seattle and fix what has been broken since you left.

  3. It is evident Mr. Manos has an “abundant personality” and hopefully that rubs off on the entire leadership team. If AOL concentrates on doing what their competitor is not doing and doing what they are doing better, AOL will be a leader again in the internet space. Good Luck!

    1. Hey Brad – Is that “abundant” thing some kind of fat joke? 🙂 j/k

      I personally think that from a technology perspective there are some really great things going on. But as I said – I am a bit biased.

      \Mm

  4. I’d like to double down on that sentiment to bring Manos back to Microsoft. It looks like he is doing great things there at AOL and I would expect nothing less. The dude is the hardest-charging smart guy I ever met. He knows what it takes to just move things forward. Very atypical for most execs I have met. By the way, what’s all this Open Source nonsense? 🙂

  5. Hey Mike. Thanks for the post. Quick question: what is the physical size of the facility? Just looking through the balance sheet and didn’t see PP&E numbers budge. Read about 100 virtual machines. Just a PhD researcher here wondering how small-caps operate/fund data centers. Thanks again

    1. Adeel – The physical size of the ATC facility is not large unto itself. It houses a few thousand servers. You would not likely see the impact on the balance sheet for a couple of reasons. First, this infrastructure purchase (as it relates to this specific post) would really be lumped into capital avoidance: this capacity was used to provision mostly against new growth rather than consolidation of the existing footprint. That said, we have another project underway focused on cost reduction through infrastructure consolidation, and we will have much more data coming out at the Uptime Institute regarding our work there. In that effort we realized some significant savings as measured through power costs, license costs, and more. Hope this helps!
      \Mm
