The Weight of Technical Liberty…Cutting the Cruft

Over the next few months, it’s my sincere desire to share with you some of the amazing technology accomplishments currently underway at First Data and how we are attempting to change the industry.  In any conversation about the future, you must begin by framing the past.  As you may or may not know, First Data was founded in 1971.  Its early years were hallmarked by significant technology innovation, with a number of ‘firsts’ in the enablement of credit card processing across the globe. 

Throughout the years the company grew both organically and through a large number of mergers and acquisitions on a global scale, which ultimately enabled it to become the international leader it is today.  I will spare folks a deeper commercial for the company and only state that today it has more scale and technology reach than any other company like it in the #Fintech space. 

I share this information because it’s that unencumbered growth over decades of acquisition, an ever-evolving field of regulatory and compliance requirements, and a historically growing list of platforms and services that ultimately led to the largest trove of ‘Cruft’ I have ever been challenged with in my career. It’s a challenge 45 years in the making. 

As you may recall, I first defined ‘Cruft’ while engaged in the turnaround at AOL:

Cruft is defined as years of build-up of technology, processes, politics, fiscal orphaning, and poor operational hygiene that ultimately impede technical agility and operation.  Additionally, Cruft can create an acidic cloud of lethargy or apathy in the workforce that ultimately sucks the energy out of innovation from within.

When I originally defined the term I was referring to the work we accomplished attacking the Cruft in a different organization, which ultimately led to the company winning the Uptime Institute’s “Server Roundup” Award. That award was created to promote full IT and Facilities integration and improve overall energy efficiency.  While we were recognized for the energy efficiency improvement, it was really a by-product of other technological and organizational wins for the company.

Our work on ‘attacking the Cruft’ at First Data has resulted in similar, in fact greater, energy cost savings, but more importantly it has reduced, and continues to reduce, the operational complexity of our environments.  Attacking the Cruft problem along the technology, process, and hygiene axes has produced some very powerful and significant results.  While we are far from completing the task, the last twenty-four (24) months have yielded some mind-numbing progress.

Is this really my metric? So Not Technical…

The first challenge I had was finding a way to truly quantify the reductions in a metric everyone could understand.  Simply counting servers was not enough; it could not account for other devices like storage equipment, network equipment, and other kit that does not easily fold into that definition.  Measuring decreases in power usage, while absolutely telling from a purely technical perspective, obscured the tremendous amount of work and passion the teams poured into modernizing our plant.  Many of the consumers of the information about our modernization efforts are not technology or energy wonks.  We had to come up with a metric that was universal, one that everyone, even non-technical people, could understand and visualize.  In the end, we settled on the ‘ton’. 

I know what you are thinking…the ton?  As in…like…weight?

Yes. 

It’s not as cool as measuring in megawatts, or measured computational capacity, or MIPS, or IOPS, or whatever metric is fashionable these days, but it is universal.  In those specialized metrics, the scale of the work output would just get lost.  So what did we achieve over the last 24 months?

  1. We removed 220+ tons of IT Equipment from our global data centers.
  2. We consolidated and shut down 5 data centers across the world, and have an aggressive plan to consolidate more.
  3. We employed large-scale internal virtualization technology and open source cloud technologies, and are building a hybridized cloud controller; together these efforts have moved nearly 75% of our physical distributed server environments to a virtualized footprint. (I will share more on that in a different post.)

There were other significant achievements as well, which we can discuss at a later date.  But as I said, we had to set the framework of what the starting position was.  We still have a mountain of work to do in this space, but the momentum has started and passions have been ignited.  Those passions are blowing away that “acidic cloud” that results from Cruft.  The results speak for themselves; that is an incredible amount of work to achieve in just 24 months.  It’s not just about establishing a set of technical goals for an organization to achieve.  As a leader, it’s about ensuring that you have created the fertile soil for those changes to take place and have empowered your people to make decisions along that alignment. 

Of course, none of this could have been achieved unless the firm, from the top down, was dedicated to driving this kind of significant change.  First Data is truly blessed with a board and leadership team who not only understand technology; they have lived it, they have managed it, they have won with it.  It’s a unique set of variables that have been toggled.

While tonnage may be an easier metric for non-techies to understand how much equipment was removed, it is still hard to grasp just how much 220 tons actually represents.  As these efforts over the last two years have created more operational simplicity, giving us the freedom and liberty to expand and explore new technology approaches, it is only fitting to associate the total with the Statue of Liberty, which by coincidence also weighs 220 tons.  Visualize that.

\Mm

The AOL Micro-DC adds new capability

Back in July, I announced AOL’s Data Center Independence Day with the release of our new ‘Micro Data Center’ approach.   In that post we highlighted the terrific work the teams put in to revolutionize our data center approach and align it completely not only to technology goals but to business goals as well.   It was an incredible amount of engineering and work to get to that point, and it would be foolish to think that the work represented a ‘One and Done’ type of effort.  

So today I am happy to announce the rollout of a new capability for our Micro-DC: an indoor version.

[Image: the indoor version of the AOL Micro-DC]

While the first instantiations of our new capability were focused on outdoor environments, we were also hard at work on an indoor version with the same set of goals.   Why work on an indoor version as well?   Well, you might recall that in the original post I stated:

We are no longer tied to traditional data center facilities or colocation markets.   That doesn’t mean we won’t use them; it means we now have a choice.  Of course this is only possible because of the internally developed cloud infrastructure, but we have freed ourselves from having to be bolted onto or into existing big infrastructure.   It allows us to have an incredible amount of geo-distributed capacity at a very low cost point in terms of upfront capital and ongoing operational expense.

We need to maintain a portfolio of options for our products and services.  In this case, that means having an indoor version of our capabilities to ensure that our solution can live absolutely anywhere.   This will allow our footprint, automation and all, to live inside any data center colocation environment or the interior of any office building anywhere on the planet, and retain the extremely low maintenance profile that we were targeting from an operational cost perspective.  In a sense you can think of it as “productizing” our infrastructure.  Could we have just deployed racks of servers, network kit, etc. like we have always done?  Sure.   But by continuing to productize our infrastructure, we continue to drive down our short-term and long-term infrastructure costs.  In my mind, productizing your infrastructure is actually the next evolution in the standardization of your infrastructure.   You can have infrastructure standards in place – server model, RAM, HD space, access switches, core switches, and the like.  But until you get to that next phase of standardizing, automating, and ‘productizing’ it into a discrete set of capabilities, you only get a partial win.
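To make that distinction concrete, here is a minimal, purely hypothetical sketch (not our actual tooling; every name and number in it is invented for illustration). Component standards alone still leave you assembling parts; a productized infrastructure exposes a small catalog of discrete, orderable capability bundles:

```python
# Hypothetical sketch of "productized" infrastructure: instead of
# ordering servers, switches, and storage piecemeal, you order a named
# SKU that bundles a discrete, pre-validated set of capabilities.
# All SKU names and figures below are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class InfrastructureSKU:
    name: str
    compute_cores: int   # usable cores per unit
    ram_gb: int          # usable RAM per unit
    storage_tb: int      # usable storage per unit
    network_gbps: int    # uplink capacity per unit
    power_kw: float      # power envelope per unit

# The catalog is deliberately small: a discrete set of capabilities,
# not an open-ended menu of parts.
CATALOG = {
    "mdc-compute": InfrastructureSKU("mdc-compute", 512, 4096, 40, 80, 30.0),
    "mdc-storage": InfrastructureSKU("mdc-storage", 128, 1024, 400, 40, 22.0),
}

def plan_order(sku_name: str, units: int) -> dict:
    """Procurement, automation, and capacity planning all key off the
    SKU rather than off individual components."""
    sku = CATALOG[sku_name]
    return {
        "sku": sku.name,
        "units": units,
        "total_cores": sku.compute_cores * units,
        "total_storage_tb": sku.storage_tb * units,
        "total_power_kw": sku.power_kw * units,
    }

print(plan_order("mdc-compute", 4))
```

Everything downstream (procurement, deployment automation, capacity planning) then keys off the SKU rather than off individual components, which is where the full win shows up.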

Some people have asked me, “Why didn’t you begin with the interior version to start with? It seems like it would be the easier one to accomplish.”  Indeed, I cannot argue with them; it would probably have been easier, as there were far fewer challenges to solve.  You can make basic assumptions about where this kind of indoor solution would live, and reduce much of the complexity.   I guess it all nets out to a philosophy of solving the harder problems first.   Once you prove the more complicated use case, the easier ones come much faster.   That is definitely the situation here.  

While this new capability continues the success we are seeing in redefining the cost and operations of our particular engineering environments, the real challenge here (as with all sorts of infrastructure and cloud automation) is whether we can map similar success in getting our applications and services to work correctly in that space.   On that note, I should have more to post soon. Stay tuned!

 

\Mm

AOL’s Data Center Independence Day

Yesterday we celebrated Independence Day here in the United States.   It’s a day when we embrace the freedoms we enjoy as a country, look back on how far we have come, and celebrate the promise of the future.   Yesterday was also a different kind of Independence Day for my teams at AOL.  A Data Center Independence Day, if you will. 

You may or may not have been following the progress of the work we have been doing here at AOL over the last 14 or so months, but the pace of change has been simply breathtaking.  One of the first things I did when I entered the company was to deeply review all aspects of Operations, from Data Centers to Network Engineering to the engineering teams supporting the products and services, and everything in between.   The net of the exercise was that AOL was probably similar to most companies out there in terms of technology mix, from the Cruft that I mentioned in a previous post to the latest technologies.  There were some incredible technologies built over the last three decades, some outdated processes and procedures, and, if I am honest, traces of a culture where the past had more meaning than the present or the future.

In a very short period of time all of that changed.  We aggressively made changes to the organization, re-aligned priorities, and perhaps most of all defined a powerful collection of changes and evolutions we would need to bring about on very aggressive timelines.    These changes were part of a defined Technology Roadmap that broke the work we needed to accomplish into three categories.   The first category focused on the internal technical challenges and the tools we needed to build to enhance our own internal efficiencies.  The second category focused on the technical challenges and aggressive things we could do to enhance and bring greater scalability to our products and services.   This would include things like adding services and technology suites to our internally developed cloud infrastructure, and other items that would allow for more rapid delivery of our products and services.   The last category of work was for the incredibly aggressive “wish list” types of changes, items that could be so disruptive, so incredibly game-changing for us, that they could redefine our work on the whole.  In fact we named this group of work “Nibiru”, after a mythical planet that is said to cross into our solar system, wreak havoc, and bring about great change. 

On July 4, 2012, one of our Nibiru items arrived, and I am ecstatic to state that we achieved our “Data Center Independence Day”.  Our primary “Nibiru” goal was to develop and deliver a data center environment without the need of a physical building.  The environment needed to require as little physical “touch” as possible and allow us the ultimate flexibility in how we delivered capacity for our products and services. We called this effort the Micro Data Center.   If you think about the number of things that have to change to evolve to this type of strategy, it’s a bit mind-boggling. 


Here are just a few of the things we had to look at, change, and automate to even make this kind of achievement possible:

  • Developing an entirely new Technology Suite and the ability to deliver that capacity anywhere in the world with minimal to no staffing.
  • Delivering extremely dense compute capacity (think the latest technology) to give us the longest possible use of these assets once deployed into the field.
  • The ability to deliver a Micro Data Center anywhere on the planet regardless of temperature and humidity conditions.
  • The ability to support, maintain, and administer remotely.
  • The ability to fit into the power envelope of a normal office building.
  • Participation in our cloud environment and capabilities.
  • The processes by which these facilities are maintained and serviced.
  • And much, much more…

In my mind, it’s one thing to claim a technical achievement; it’s quite another to operationalize that achievement and make the process of supporting it repeatable. That’s my measure of when you can REALLY declare victory.  Science experiments don’t count.   It has to just plain work.    To that end, our first “beta” site for the technology was the AOL campus in Dulles, Virginia.  Out on a lonely slab of concrete in the back of one of the buildings, our future has taken shape.

Thanks in part to a lot of the work going on in the data center containerization space, we were able to jump-start much of the work in relatively quick fashion.  In fact, the pace set by the Data Center and Technology Operations teams to deliver this achievement is more than a bit astounding.   Most, if not all, of the existing AOL Data Centers would fall somewhere around a traditional Tier III / Tier II Uptime Institute definition.   The teams really pushed way outside their comfort zones to deliver some incredible evolutions in a very short period of time.   Of course there were steps along the way to get here, but those steps now seem to be in double time.  A few months back we announced the launch of ATC, our first completely automated facility.   The work that went into ATC was foundational to our achievement yesterday.   It allowed us to start working on the hard stuff first, that is to say, the ‘operationalization’ of these kinds of environments, and it set the stage for how we could evolve to this next tier.   I summarized some of the achievements of our ATC launch in the ‘Breaking the Chrysalis’ post I did at that time; if you are curious about the specifics of our work there, feel free to click through.  You can see how the work we have been driving in our own internal cloud environments, the changes in operational procedure, and the changes in thinking are additive and fundamental to our latest achievement.

It’s especially interesting to note that with all of the blips and hiccups occurring in the ‘cloud industry’, like the leap second and the terrible storms on the East Coast this week that affected many data centers, ATC, our completely unmanned facility, just kept humming along with no issues (to be fair, neither did our traditional facilities), despite the fact that much of the initial negative feedback we received was based solely on doubts about the reliability of such moves.   It goes to show how important engineering FOR Operation is.  At AOL we have built this in from the start.

What does this actually buy AOL?

Ok, we stuck some computers in a box and we made sure it requires very little care and feeding – what does this buy us?  Quite a bit, actually.  Jay Moran, the Distinguished Engineer who was in charge of driving this effort, is always quick to point out that the problem space here is not just about the technology; it has to be a marriage with the business side as well.  Obviously the inherent flexibility of the design allows us a greater number of places around the planet to which we can deploy capacity, and that in and of itself is pretty revolutionary.   We are no longer tied to traditional data center facilities or colocation markets.   That doesn’t mean we won’t use them; it means we now have a choice.  Of course this is only possible because of the internally developed cloud infrastructure, but we have freed ourselves from having to be bolted onto or into existing big infrastructure.   It allows us to have an incredible amount of geo-distributed capacity at a very low cost point in terms of upfront capital and ongoing operational expense.   This is a huge game changer.

So much so that I will do a bit of ‘back of the napkin’ math with you.   Let’s call the global capacity, in terms of compute, storage, etc., that we have today in our traditional environments the Total Compute Capability, or TCC. It’s essentially the bandwidth for the work that we can get done.   Inside the cost for TCC you have operating costs such as power, lease costs, data center facility maintenance costs, support staff, and the like.  You additionally have the depreciation for the facilities themselves (or the specific build-outs, if colocating), the server and other equipment depreciation, and the rest.   Let’s call that baseline X.   The Micro Data Center strategy, built out with our latest, most dense server standards and infrastructure, would allow us to have 5X the total TCC at less than 10% of the cost and physical footprint.   If you think about how this will allow us to aggregate and grow over time, it ultimately drives us to a VERY LOW operational cost structure for delivering our products and services.   Additionally, it positions us for the future in very significant ways:

  • It redefines software architecture for greater resiliency.
  • It allows us an incredibly flexible platform for addressing privacy laws, regulatory oversight, and other such concerns, letting us respond rapidly.
  • It further reduces energy consumption and carbon emissions (important as taxation evolves around the world, as well as for ongoing operational costs).
  • It gives us the ability to drive Edge Computing delivery to potentially bypass CDNs for certain content.
  • It gives us the capability to drive ‘Community-in-a-box’, whereby we can quickly launch new products in new markets, expand existing footprints like Patch on a low-cost but still hyper-local platform, and give the Huffington Post a platform to rapidly partner and enter new markets with minimal-cost turn-ups.
  • The technology mix in our SKUs, comprising compute, storage, and network capacity, maximizes the number of products and services we can deploy to it.
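To make the back-of-the-napkin math above concrete, here is a minimal sketch. The baseline values are placeholders rather than actual AOL figures; the only relationships taken from the post are the 5X capacity and the sub-10% cost claims:

```python
# Back-of-the-napkin math from the post, made explicit. Baseline values
# are placeholders; only the "5X the TCC" and "under 10% of the cost"
# relationships come from the post itself.
baseline_tcc = 1.0    # Total Compute Capability of the traditional footprint
baseline_cost = 1.0   # X: opex (power, leases, staff, maintenance) + depreciation

mdc_tcc = 5.0 * baseline_tcc     # 5X the total compute capability
mdc_cost = 0.10 * baseline_cost  # less than 10% of the cost and footprint

# Cost per unit of compute capability under each model:
baseline_unit_cost = baseline_cost / baseline_tcc  # 1.00
mdc_unit_cost = mdc_cost / mdc_tcc                 # 0.02

print(f"Cost per unit of TCC drops {baseline_unit_cost / mdc_unit_cost:.0f}x")
# -> Cost per unit of TCC drops 50x
```

In other words, if both claims hold, the cost per unit of compute capability drops by roughly a factor of fifty, which is exactly why this is such a game changer.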

As Always it’s really about the People

I cannot let a post about this huge win for us go by without mentioning the teams involved in delivering this capability.  This is not just a win for AOL, or, to a lesser degree, for the industry at large as another proof point that it can evolve if it puts its mind to changing; it is above all a win for the Technology Teams at AOL.  When I was first approached about joining AOL, my slightly sarcastic and comedic response was probably much like yours: ‘Are they still around?’ But the fact of the matter is that AOL has a vision of where it wants to go, and what it wants to be.   That was compelling for me personally, compelling enough for me to make the move.   What has truly amazed me, however, is the dedication and tenacity of its employees.  These achievements would not be possible without the outright aggressiveness the organization has taken to moving the company forward.  It’s always hard to assess from the outside just how hard an effort is to achieve internally.  In the case of our Micro Data Center strategy, the teams faced just about every kind of barrier to delivering this capacity, every kind of excuse not to make it, or even not to try.   They put all of those things aside and just plain executed.  If you allow me a small moment of bravado: not only did my teams simply kick ass, they did it in a way that moved the needle for the company and, in my mind, once again catapulted themselves into the forefront of operations and technology at scale.   We still have a bunch of Nibiru projects to deliver, so my guess is we haven’t heard the last of these big wins.

\Mm

Attacking the Cruft

Today the Uptime Institute announced that AOL won the Server Roundup Award.  The achievement has gotten some press already (at Computerworld, PCWorld, and related sites) and I cannot begin to tell you how proud I am of my teams.   One of the more personal transitions and journeys I have made since my experience scaling the Microsoft environments from tens of thousands of servers to hundreds of thousands of servers has been truly understanding the complexity of a problem most large, established IT departments have been dealing with for years.  In some respects, scaling infrastructure, while incredibly challenging and hard, is in large part a uni-directional problem space.   You are faced with growth and more growth, followed by even more growth.  All sorts of interesting things break when you get to big scale. Processes, methodologies, and technologies all quickly fall by the wayside as you climb ever upward on the ladder of scale.

At AOL I faced a multi-directional problem space in that, as a company and as a technology platform, we were still growing.  Added to that, there were 27 years of what I call “Cruft”.   I define “Cruft” as years of build-up of technology, processes, politics, fiscal orphaning, and poor operational hygiene.  This cruft can act as a huge boat anchor and a barrier to an organization’s ability to drive agility in its online and IT operations.  On top of this Cruft, a layer of what can best be described as lethargy, or perhaps apathy, can sometimes develop and add even more difficulty to the problem space.

One of the first things I encountered at AOL was the cruft.  In any organization, everyone always wants to work on the new, cool, interesting things, mainly because they are new and interesting, out of the norm.  Essentially the fun stuff!  But the organization’s ability to really drive the adoption of new technologies and methods was always slowed, gated, or in some cases altogether prevented by years of interconnected systems, lost owners, servers of unknown purpose lost to distant historical memory, and the like.   This I found in healthy populations at AOL. 

We set about building a plan to attack this cruft, to earnestly remove as much of it as possible and drive the organization toward agility.  Initially we called this list of properties, servers, equipment, and the like the Operations $/-\!+ list. As this name was not very user-friendly, it migrated into a series of initiatives grouped under the name Ops-Surdities.   These programs attacked different types of cruft and were, at a high level, grouped into three main categories:

The Absurdity List – A list of projects/properties/applications that had questionable value, no owner, no direction, or the like, but were still drawing load and resources from our data centers.   The plan here was to develop an action plan for each of the items that appeared on this list.

Power Hog – An effort to audit our data center facilities, equipment, and the like, looking for inefficient servers, installations, and/or technologies and migrating them to newer, more efficient platforms or to our AOL Cloud infrastructure.  You knew you were in trouble, and that you had been marked, when a bronze pig trophy appeared on your desk or in your office. 

Ops Hygiene – The sometimes tedious task of tracking down older machines and systems that may have been decommissioned in the past, marked for removal, or fully depreciated, but were never truly removed.  Pure vampiric load.  You may or may not be surprised how much of this exists in modern data centers; it’s a common issue I have discussed with most data center management professionals in the industry.
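As a purely hypothetical illustration of the Ops Hygiene hunt (this is not our actual tooling, and every field name and threshold below is invented), flagging candidates can start as a simple join of asset records against utilization data:

```python
# Hypothetical sketch of an Ops Hygiene pass: flag fully depreciated,
# ownerless machines that draw power while doing essentially no work.
# Field names and thresholds are invented; a real pass would join CMDB
# records against metering and utilization systems.
from datetime import date

assets = [  # stand-in for a CMDB export
    {"host": "web-legacy-07", "depreciated_on": date(2008, 3, 1),
     "avg_cpu_pct": 0.4, "watts": 450, "owner": None},
    {"host": "db-prod-02", "depreciated_on": date(2015, 6, 1),
     "avg_cpu_pct": 61.0, "watts": 700, "owner": "payments"},
]

def vampiric_load(assets, today=None, cpu_threshold=2.0):
    """Return decommission candidates: past depreciation, no owner of
    record, and near-zero utilization. Pure vampiric load."""
    today = today or date.today()
    return [a for a in assets
            if a["depreciated_on"] < today
            and a["owner"] is None
            and a["avg_cpu_pct"] < cpu_threshold]

for a in vampiric_load(assets):
    print(f"{a['host']}: ~{a['watts']} W doing no useful work; review for removal")
```

Generating the list is the easy part; as noted below, the real work is in the negotiation, persistence, and physical removal that follow.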

So here we are, on a timeline measured in under a year, having been told all along the way by “crufty old-timers” that we would never make any progress, and my teams have decommissioned almost 10,000 servers from our environments. (The actual number is greater now, but the submission deadline for the award was earlier in the year.)  What an amazing accomplishment.  What an amazing team!

So how did we do it?

As we will be presenting this in a lot more detail at the Uptime Symposium, I am not going to give away all of our secrets in a blog post; consider that a good reason to head to the Uptime event and ask the primary leaders of this effort, in person, how they did it.  It may be a good use of that travel budget your company has been sitting on this year.

What I will share is some guidelines on approach and some things to be wary of if you are facing similar challenges in your organization.

FOCUS AND ATTENTION

I cannot tell you how many people I have spoken with who have tried to go after ‘cruft’ like this time and time again and failed.   One of the key drivers for success, in my mind, is ensuring that there is focus and attention on this kind of project at all levels, across all organizations, and most importantly from the TOP.   Too often executives issue blind directives with little to no follow-through and assume this kind of thing gets done.   They are generally unaware of the natural resistance to this kind of work in most IT organizations.    Having motivated, engaged, and focused leadership on these types of efforts goes an extraordinarily long way to making headway here.  

BEWARE of ORGANIZATIONAL APATHY

The human factors that stack up against a project like this are impressive.  While people may not be openly in revolt over such projects, there is a natural resistance to getting these things done.  This work is not sexy.  This work is hard.  This work is tedious.  It likely means going back and touching equipment and kit that has not been messed with for a long time.   You may have competing organizational priorities which place this kind of work at the bottom of the workload priority list.   In addition to having executive buy-in and focus, make sure you have some really driven people running these programs.  You are looking for CAN DO people, not MAKE DO people.

TECHNOLOGY CAN HELP, BUT IT’S NOT YOUR HEAVY LIFTER

Probably a bit strange for a technology blog to say, but it’s true.  We have an incredible CMDB and asset system at AOL.  This was hugely helpful to the effort in really getting to the bottom of the list.   However, no amount of technology will be able to perform the myriad tasks required to actually make material progress on this kind of work.   Some of it requires negotiation, some of it requires strength of will, and some of it takes pure persistence in running these issues down…working with the people, understanding what is still required and what can be moved.  This requires people.   We had great technologies in place for knowing where our stuff was, what it did, and what it was connected to.  We had great technologies, like our Cloud, to which we could ultimately move some of these platforms.    However, you need to make sure you don’t fall too far into the people trap.  I have a saying in my organization: there is a perfect number of project managers and security people in any organization, the number at which the work output and value delivered are highest.   What is that number?  It depends – but you definitely know when you have one too many of each.

MAKE IT FUN IF YOU CAN

From the bronze pigs to the minor celebrations each month as we worked through the process, we ensured that the attention given to the effort was not negative. Sure, it can be tough work, but at the end of the day you are substantially investing in the overall agility of your organization.  It’s something to be celebrated.    In fact, at the completion of our aggressive goals, the primary project leads involved made a great video (which you can see here) to highlight and celebrate the win.   Everyone had a great laugh and a ton of fun doing what was ultimately a tough grind of work.  If you are headed to Symposium I strongly encourage you to reach out to my incredible project leads.  You will be able to recognize them from the video…without the mustaches, of course!

\Mm