As everyone is painfully aware, last week the United States saw the devastation caused by Superstorm Sandy. My original intention was to talk about yet another milestone in our Micro Data Center approach. As the storm slammed into the East Coast, I felt it was a bad time to talk about achieving something significant, especially as people were suffering through the storm's aftermath. In fact, after the storm AOL kicked off an incredible supplies drive and sent truckloads of goods to the worst of the affected areas.
So here we are a week after the storm, and while people are still in need and suffering, it is clear that the worst is over and the cleanup and healing have begun. It turns out that Superstorm Sandy also let us test another interesting case in the journey of the Micro Data Center, which I will touch on below.
25% of ALL AOL.COM Traffic runs through Micro Data Centers
I have talked before about the potential value of our Micro Data Centers and the pure agility and economics the platform will provide for us. Up until this point we had used the technology in pockets; think of our explorations as focusing on beta and demo environments. That all changed in October, when we officially flipped the switch and began taking production traffic for AOL.com on the Micro Data Center. We are currently running (and have been since flipping the switch) about 25% of all traffic coming to our main web site. This is an interesting achievement in many ways. First, from a performance perspective, we are deliberately limiting the platform (it could do more!) to roughly 65,000 requests per minute at a traffic volume of about 280 Mbit/s. To date I haven't seen many people publish performance statistics for applications running in modular data centers, so hopefully these numbers give folks a useful sense of the load an approach like this can take. We celebrated the milestone at a recent All-Hands by plugging an internal version of our MDC into the conference room. To prove the point, we added it to the global capacity pool for AOL.com and started taking production traffic right there at the conference facility, which demonstrates in large part the value, agility and mobility a platform like this can bring to bear.
As I mentioned before, Superstorm Sandy threw us another curveball as the hurricane crashed into the Mid-Atlantic. While Virginia was not hit anywhere near as hard as New York and New Jersey, there were incredible sustained winds, tumultuous rains, and storm-related damage everywhere. Through it all, our outdoor version of the MDC weathered the storm just fine and continued serving traffic for AOL.com without fail.
This kind of Capability is not EASY or Turn-Key
That’s not to say there isn’t a ton of work required to get an application running in an environment like this. Take the problem space at its different levels: DNS, load balancing, network redundancy, configuration management, underlying application-level timeouts, and systems dependencies like databases and other information stores. The non-infrastructure work and coding is not insignificant. There is a huge amount of complexity in running a site like AOL.com: lots of interdependencies, sophistication, advertising-related collection and distribution, and the like. It’s safe to say this is not as simple as throwing an Apache/Tomcat instance into a VM.
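To make one of those levels concrete, consider application-level timeouts. The sketch below is hypothetical (the address and port are placeholders, not anything from AOL's stack); it shows the general idea of putting a hard budget on a dependency call so a slow or unreachable backend fails fast instead of hanging the page.

```python
import socket

def fetch_with_budget(host, port, request, timeout_s=0.25):
    """Call a dependency with a hard time budget.

    Bounds both the connect and the read so a dead or slow backend
    costs ~timeout_s, not a stalled request thread. Returns None on
    failure so the caller can fall back to a cached or default value.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)   # also bound sends and reads
            sock.sendall(request)
            return sock.recv(4096)
    except (socket.timeout, OSError):
        return None                      # fail fast; caller degrades gracefully

# 192.0.2.1 is a TEST-NET address that never answers: the call should
# return None after ~250 ms rather than hanging.
print(fetch_with_budget("192.0.2.1", 5432, b"ping"))
```

The design choice worth noting is that the failure is returned as a value rather than raised, which forces the calling code to decide, at every call site, what a degraded page looks like.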
I have talked for quite a while about what Netflix engineers originally coined the Chaos Monkey: the ability, the development paradigm, or even the rogue processes that let your applications survive significant infrastructure and application-level outages. It is essentially taking the redundancy out of the infrastructure and putting it into the code. While extremely painful at the start, the long-term savings are proving hugely beneficial. For most companies this is still something futuristic and very far out there; they may be beholden to software manufacturers and developers to start thinking this way, which may take a very, very long time. Infrastructure is the easy way to solve the problem. It may be easy, but it’s not cheap. Nor, if you care about the environmental angle, is it very ‘sustainable’ or green. Limit the infrastructure; limit the waste. While we haven’t really thought about rolling this into our environmental positions, perhaps we should.
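A minimal sketch of what "redundancy in the code" can mean in practice (the replica names and the failure are invented for illustration; this is not AOL's implementation): a read helper that tries replicas in a shuffled order and treats a dead site as routine rather than fatal.

```python
import random

# Hypothetical replica sites; in a real system these would come from
# DNS or a service registry rather than a hard-coded list.
REPLICAS = ["mdc-va", "mdc-ca", "mdc-tx"]

def fetch(key, read_replica, replicas=REPLICAS):
    """Read `key`, surviving individual replica outages.

    `read_replica(name, key)` is the caller-supplied I/O function;
    it may raise when a site is down (the 'chaos' we tolerate).
    """
    candidates = random.sample(replicas, k=len(replicas))  # spread the load
    last_err = None
    for name in candidates:
        try:
            return read_replica(name, key)
        except Exception as err:     # a downed replica is expected, not fatal
            last_err = err
    raise RuntimeError(f"all replicas failed for {key!r}") from last_err

# Simulated chaos: one site is unreachable, yet the call still succeeds.
def flaky_read(name, key):
    if name == "mdc-va":             # pretend this site was knocked out
        raise ConnectionError("mdc-va unreachable")
    return f"{key}@{name}"

print(fetch("front-page", flaky_read))
```

The point of the sketch is the shape of the code, not the specifics: failover lives in the application loop, so no load balancer or standby hardware is needed to survive the loss of a site.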
The point is that getting to this level of redundancy takes work, and to that end it will continue to act as a regulator, or anchor, slowing broader adoption of modular approaches. But at least in my mind the direction is set: it will be hard to ignore the economics of this type of approach for long. Of course, as an industry we need to start training (or re-training) developers to think in this model, to build code in a way that takes into account the Chaos Monkey potential out there.
Want to see One Live?
We have been asked to provide an AOL Micro Data Center for the Supercomputing 12 (SC12) conference next week in Salt Lake City, Utah, with our partner Penguin Computing. If you want to see one of our internal versions live and up close, feel free to stop by and take a look. Jay Moran (my Distinguished Engineer here at AOL) and Scott Killian (the leader of our data center operations teams) will be onsite to discuss the technology and our use cases.