Monday, 11 October 2010

MapReduce with an embedded distributed file system

When I wrote my C++ MapReduce library, I had quite a bit of interest from potential users. One of the most frequent questions was about scaling across multiple machines. Doing this requires a lot of infrastructure to manage the execution of MapReduce jobs, merge and sort intermediate results, manage cross-machine communication and provide resilience when machines fail. One of the biggest components is the distribution of the data files to be processed, and the consolidation of result data files, across the network. This is typically done with a subsystem called a Distributed File System (DFS). Google's original MapReduce implementation used GFS, the Google File System. Hadoop has HDFS, the Hadoop Distributed File System.

I didn't want to go down the same route, because my idea for the MapReduce library is that it should be lightweight and easy to deploy. I don't want a dependency on complex configuration across multiple machines, and I want it to be easy to use on all platforms, including Windows, without a dependent software stack and configuration overhead.

My solution to this is an embedded DFS: the DFS infrastructure is bound into the client application and runs without configuration. I want to be able to build a MapReduce program, run it on any number of machines in a network, and have it "just work". No configuration, no messing; the subsystem takes care of it all.

Will this bloat client applications? No. The subsystems for MapReduce and DFS are very small, so the footprint overhead is minimal.

Early prototypes have proved the concept: I can run multiple instances on multiple machines, and they all find each other, communicate with each other and cope when one or more become unavailable, whether they were shut down cleanly or closed forcibly.
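
To illustrate the kind of bookkeeping this involves, here is a minimal sketch of a peer table that tells a clean shutdown apart from a forced close. It is not the library's code: the class name PeerTable, its member functions and the timeout-based expiry are all assumptions made for the example, and the network layer that would deliver the heartbeats is left out entirely.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical peer table (an assumption for illustration, not the library's API).
// Each instance records when it last heard from every other instance.
class PeerTable {
public:
    using Clock = std::chrono::steady_clock;
    using TimePoint = Clock::time_point;

    explicit PeerTable(std::chrono::milliseconds timeout) : timeout_(timeout) {
        assert(timeout.count() > 0);
    }

    // Called when a peer announces itself or sends a periodic heartbeat.
    void heartbeat(const std::string& peer_id, TimePoint now) {
        last_seen_[peer_id] = now;
    }

    // Clean shutdown: the peer said goodbye, so forget it immediately.
    void leave(const std::string& peer_id) {
        last_seen_.erase(peer_id);
    }

    // Forced close: the peer simply went silent. Drop every peer that has
    // not been heard from for longer than the timeout and return their ids.
    std::vector<std::string> expire(TimePoint now) {
        std::vector<std::string> dropped;
        for (auto it = last_seen_.begin(); it != last_seen_.end();) {
            if (now - it->second > timeout_) {
                dropped.push_back(it->first);
                it = last_seen_.erase(it);
            } else {
                ++it;
            }
        }
        return dropped;
    }

    bool is_alive(const std::string& peer_id) const {
        return last_seen_.count(peer_id) != 0;
    }

    std::size_t size() const { return last_seen_.size(); }

private:
    std::chrono::milliseconds timeout_;
    std::map<std::string, TimePoint> last_seen_;
};
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the sketch easy to test. A real instance would call heartbeat() whenever a message arrives from a peer, and expire() on a timer.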
