Coding for Redundancy

Scott Killian of AOL talks about the MicroDC

Last week I put up a post about how AOL.com has 25% of all traffic now running through our MicroDC infrastructure. There was a great follow up post by James LaPlaine our VP of Operations on his blog Mental Effort, which goes into even greater detail. While many of the email inquiries I get have been based around the technology itself, surprisingly a large majority of the notes have been questions around how to make your software. applications, and development efforts ready for such an infrastructure and what the timelines for realistically doing so would be.

The general response of course is that it depends. If you are a web-based platform or property focused solely on Internet based consumers, or a firm that needs diversified presence in different regions without the hefty price tag of renting and taking down additional space this may be an option. However many of the enterprise based applications have been written in a way that is highly dependent upon localized infrastructure, short application based latency, and lack adequate scaling. So for more corporate data center applications this may not be a great fit. It will take sometime for those big traditional application firms to be able to truly build out their infrastructure to work in an environment like this (they may never do so). I suspect most will take an easier approach and try to ‘cloudify’ their own applications and run it within their own infrastructure or data centers under their control. This essentially will allow them to control the access portion of users needs, but continue to rely on the same kinds of infrastructure you might have in your own data center to support it. Its much easier to build a web based application which then connects to a traditional IT based environment, than to truly build out infrastructure capable of accommodating scale. I am happy to continue answer questions as they come up, but as I had an overwhelming response of questions about this I thought I would throw something quick up here that will hopefully help.

\Mm

As everyone has been painfully aware last week the United States saw the devastation caused by the Superstorm Sandy. My original intention was to talk about yet another milestone with our Micro Data Center approach. As the storm slammed into the East Coast I felt it was probably a bad time to talk about achieving something significant especially as people were suffering through the storms outcome. In fact, after the storm AOL kicked off an incredible supplies drive and sent truckloads of goods up to the worst of the affected areas.

So, here we are a week after the storm, and while people are still in need and suffering, it is clear that the worst is over and the clean up and healing has begun. It turns out that Super Storm Sandy also allowed us to test another interesting case in the journey of the Micro Data Center as well that I will touch on.

25% of ALL AOL.COM Traffic runs through Micro Data Centers

I have talked about the potential value of our use of Micro Data Centers and the pure agility and economics the platform will provide for us. Up until this point we had used this technology in pockets. Think of our explorations as focusing on beta and demo environments. But that all changed in October when we officially flipped the switch and began taking production traffic for AOL.COM with the Micro Data Center. We are currently (and have been since flipping the switch) running about 25% of all traffic coming to our main web site. This is an interesting achievement in many ways. First, from a performance perspective we are manually limiting the platform (it could do more!) to ~65,000 requests per minute and a traffic volume of about 280mbits per second. To date I haven’t seen many people post performance statistics about applications in modular use, so hopefully this is relevant and interesting to folks in terms of the volume of load an approach such as this could take. We recently celebrated this at a recent All-Hands with an internal version of our MDC being plugged into the conference room. To prove our point we added it to the global pool of capacity for AOL.com and started taking production traffic right there at the conference facility. This proves in large part the value, agility and mobility a platform like this could bring to bear.

As I mentioned before, Super Storm Sandy threw us another curveball as the hurricane crashed into the Mid-Atlantic. While Virginia was not hit anywhere near as hard as New York and New Jersey, there were incredible sustained winds, tumultuous rains, and storm related damage everywhere. Through it all, our outdoor version of the MDC weathered the storm just fine and continued serving traffic for AOL.com without fail.

This kind of Capability is not EASY or Turn-Key

That’s not to say there isn’t a ton of work to do to get an application to work in an environment like this. If you take the problem space at different levels whether it be DNS, Load Balancing, network redundancy, configuration management, underlying application level timeouts, systems dependencies like databases, other information stores and the like the non-infrastructure related work and coding is not insignificant. There is a huge amount of complexity in running a site like AOL.Com. Lots of interdependencies, sophistication, advertising related collection and distribution and the like. It’s safe to say that this is not as simple as throwing up an Apache/Tomcat instance into a VM.

I have talked for quite awhile about what Netflix engineers originally coined as Chaos Monkeys. The ability, development paradigm, or even rogue processes for your applications to survive significant infrastructure and application level outages. Its essentially taking the redundancy out of the infrastructure and putting into the code. While extremely painful at the start, the savings long term are proving hugely beneficial. For most companies, this is still something futuristic, very far out there. They may be beholden to software manufacturers and developers to start thinking this way which may take a very very long time. Infrastructure is the easy way to solve it. It may be easy, but its not cheap. Nor, if you care about the environmental angle on it, is it very ‘sustainable’ or green. Limit the infrastructure. Limit the Waste. While we haven’t really thought about in terms of rolling it up into our environmental positions, perhaps we should.

The point is that getting to this level of redundancy is going to take work and to that end will continue to be a regulator or anchor slowing down a greater adoption of more modular approaches. But at least in my mind, the future is set, directionally it will be hard to ignore the economics of this type of approach for long. Of course as an industry we need to start training or re-training developers to think in this kind of model. To build code in such a way that it takes into effect the Chaos Monkey Potential out there.

Want to see One Live?

We have been asked to provide an AOL MicroData Center for the Super Computing 12 conference next week in Salt Lake City, Utah with our partner Penguin Computing. If you want to see one of our Internal versions live and up-close feel free to stop by and take a look. Jay Moran (my Distinguished Engineer here at AOL) and Scott Killian (The leader of our data center operations teams) will be onsite to discuss the technologies and our use cases.

\Mm

Tag: Coding for Redundancy

Lots of interest in the MicroDC, but do you know what I am getting the most questions about?

On Micro Datacenters, Sandy, Supercomputing 2012, and Coding for Containerized Data Centers….

25% of ALL AOL.COM Traffic runs through Micro Data Centers

This kind of Capability is not EASY or Turn-Key

Want to see One Live?