Today the Uptime Institute announced that AOL won the Server Roundup Award. The achievement has gotten some press already (at Computerworld, PCWorld, and related sites) and I cannot begin to tell you how proud I am of my teams. One of the more personal transitions and journeys I have made since my experience scaling the Microsoft environments from tens of thousands of servers to hundreds of thousands of servers has been truly understanding the complexity of a problem that most larger, established IT departments have been dealing with for years. In some respects, scaling infrastructure, while incredibly challenging and hard, is in large part a uni-directional problem space. You are faced with growth and more growth, followed by even more growth. All sorts of interesting things break when you get to big scale. Processes, methodologies, technologies: all quickly fall by the wayside as you climb ever up the ladder of scale.
At AOL I faced a multi-directional problem space in that, as a company and as a technology platform, we were still growing. Added to that, there were 27 years of what I call “Cruft”. I define “Cruft” as years of build-up of technology, processes, politics, fiscal orphaning and poor operational hygiene. This cruft can act as a huge boat anchor and a barrier to an organization trying to drive agility in its online and IT operations. On top of this Cruft, a layer of what can best be described as lethargy or perhaps apathy can sometimes develop and add even more difficulty to the problem space.
One of the first things I encountered at AOL was the cruft. In any organization, everyone always wants to work on the new, cool, interesting things, mainly because they are new and interesting and out of the norm. Essentially the fun stuff! But the ability of the organization to really drive the adoption of new technologies and methods was always slowed, gated or in some cases altogether prevented by years of interconnected systems, lost owners, servers of unknown purpose lost in the distant historical memory and the like. This I found in healthy populations at AOL.
We initially set about building a plan to attack this cruft: to earnestly remove as much of it as possible and drive the organization towards agility. Initially we called this list of properties, servers, equipment and the like the Operations $/-\!+ list. As this name was not very user-friendly, it migrated into a series of initiatives grouped under the name of Ops-Surdities. These programs attacked different types of cruft and were at a high level grouped into three main categories:
The Absurdity List – A list of projects/properties/applications that had questionable value, no owner, no direction, or the like but were still drawing load and resources from our data centers. The plan here was to develop action plans for each of the items that appeared on this list.
Power Hog – An effort to audit our data center facilities, equipment, and the like, looking for inefficient servers, installations, and/or technology and migrating them to new, more efficient platforms or our AOL Cloud infrastructure. You knew you were in trouble, and that you were marked, when a trophy of a bronze pig appeared on your desk or in your office.
Ops Hygiene – The sometimes tedious task of tracking down older machines and systems that may have been decommissioned in the past, marked for removal, or fully depreciated, and were never truly removed. Pure Vampiric load. You may or may not be surprised how much of this exists in modern data centers. It’s a common issue I have had with most data center management professionals in the industry.
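For readers who like to see this kind of triage in concrete terms, here is a minimal Python sketch of how an asset inventory might be sorted into the three categories above. This is an illustration only, not how our CMDB or programs actually worked: the `Asset` fields, the `POWER_HOG_WATTS` threshold and the `classify` function are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Asset:
    """A hypothetical inventory record; real CMDB schemas will differ."""
    name: str
    owner: Optional[str]      # None means nobody claims the asset
    status: str               # "active" or "decommissioned" (assumed values)
    powered_on: bool          # still drawing load in the data center
    watts: float              # measured or estimated power draw
    fully_depreciated: bool


# Assumed cut-off for an "inefficient" server; pick a value that fits your fleet.
POWER_HOG_WATTS = 400.0


def classify(asset: Asset) -> List[str]:
    """Return the cruft categories an asset falls into (possibly none)."""
    tags: List[str] = []
    if not asset.powered_on:
        return tags  # nothing is drawing load, so nothing to chase
    if asset.owner is None:
        tags.append("absurdity")    # no owner, still consuming resources
    if asset.watts > POWER_HOG_WATTS:
        tags.append("power_hog")    # candidate for a more efficient platform
    if asset.status == "decommissioned" or asset.fully_depreciated:
        tags.append("ops_hygiene")  # should be gone, but is still running
    return tags
```

A list like this only tells you where to start looking. Each tagged machine still needs a person to confirm what it does and who depends on it before anything is unplugged.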
So here we are: in a timeline measured in under a year, and while being told all along the way by “crufty old-timers” that we would never make any progress, my teams have decommissioned almost 10,000 servers from our environments. (Actually this number is greater now, but the submission deadline for the award was earlier in the year.) What an amazing accomplishment. What an amazing team!
So how did we do it?
As we will be presenting this in a lot more detail at the Uptime Symposium, I am not going to give away all of our secrets in a blog post. Instead, I will give you a good reason to head to the Uptime event, listen to the primary leaders of this effort, and ask them in person how they did it. It may be a good use of that travel budget your company has been sitting on this year.
What I will share are some guidelines on approach and some things to be wary of if you are facing similar challenges in your organization.
FOCUS AND ATTENTION
I cannot tell you how many people I have spoken with who have tried to go after ‘cruft’ like this time and time again and failed. One of the key drivers for success in my mind is ensuring that there is focus and attention on this kind of project at all levels, across all organizations, and most importantly from the TOP. Too often executives give out blind directives with little to no follow-through and assume this kind of thing gets done. They are generally unaware of the natural resistance to this kind of work that exists in most IT organizations. Having motivated, engaged, and focused leadership on these types of efforts goes an extraordinarily long way to making headway here.
BEWARE OF ORGANIZATIONAL APATHY
The human factors that stack up against a project like this are impressive. While people may not be openly in revolt over such projects, there is a natural resistance to getting things done. This work is not sexy. This work is hard. This work is tedious. It likely means going back and touching equipment and kit that has not been messed with for a long time. You may have competing organizational priorities which place this kind of work at the bottom of the workload priority list. In addition to having executive buy-in and focus, make sure you have some really driven people running these programs. You are looking for CAN DO people, not MAKE DO people.
TECHNOLOGY CAN HELP, BUT IT’S NOT YOUR HEAVY LIFTER
Probably a bit strange for a technology blog to say, but it’s true. We have an incredible CMDB and Asset System at AOL. This was hugely helpful to the effort in really getting to the bottom of the list. However, no amount of technology in place will be able to perform the myriad of tasks required to actually make material movement on this kind of work. Some of it requires negotiation, some of it requires strength of will, some of it takes pure persistence in running these issues down…working with the people. Understanding what is still required, what can be moved. This requires people. We had great technologies in place from the perspective of knowing where our stuff was, what it did, and what it was connected to. We had great technologies like our Cloud to ultimately move some of these platforms to. However, you need to make sure you don’t go too far down the people trap. I have a saying in my organization – There is a perfect number of project managers and security people in any organization, where the work output and value delivered is highest. What is that number? Depends – but you definitely know when you have one too many of each.
MAKE IT FUN IF YOU CAN
From the brass pigs to minor celebrations each month as we worked through the process, we ensured that the attention given to the effort was not negative. Sure, it can be tough work, but at the end of the day you are substantially investing in the overall agility of your organization. It’s something to be celebrated. In fact, at the completion of our aggressive goals the primary project leads involved did a great video (which you can see here) to highlight and celebrate the win. Everyone had a great laugh and a ton of fun doing what was ultimately a tough grind of work. If you are headed to Symposium I strongly encourage you to reach out to my incredible project leads. You will be able to recognize them from the video…without the mustaches of course!