Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” EuroSys, 2007. [PDF]
Summary
Dryad is Microsoft’s answer to the MapReduce paradigm, albeit at a (slightly) lower level and with greater flexibility. Like MapReduce, Dryad lets developers think about what to do with the data, while Dryad itself takes care of distribution, fault tolerance, stragglers, etc. Unlike MapReduce, Dryad lets developers build extensive data flow models as directed acyclic graphs (DAGs). Dryad also adds flexibility in how computation nodes in a DAG communicate: via disk, TCP pipes, or shared-memory queues, as opposed to the disk-only communication promoted by MapReduce. This could allow fully in-memory data flow models for faster data mining and iterative jobs.
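To make the DAG model concrete, here is a minimal single-process sketch in Python. It is not Dryad's API (Dryad is a distributed system, and its channels are files, TCP pipes, or shared-memory queues); the names `run_dag`, `count_words`, `merge_counts`, and `top_word` are hypothetical, and each "channel" is just a Python value handed from one vertex to the next. It only illustrates the idea of vertices wired into an acyclic graph and run in dependency order.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, edges, sources):
    """Run each vertex once, in dependency order.

    vertices: name -> function taking a list of input values
    edges:    (producer, consumer) pairs; each pair stands for one channel
    sources:  name -> initial input for vertices that have no producer
    """
    # Map every vertex to the list of vertices that feed it.
    preds = {name: [] for name in vertices}
    for producer, consumer in edges:
        preds[consumer].append(producer)

    outputs = {}
    # static_order() yields producers before consumers and raises
    # graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(preds).static_order():
        if preds[name]:
            inputs = [outputs[p] for p in preds[name]]
        else:
            inputs = [sources[name]]
        outputs[name] = vertices[name](inputs)
    return outputs


def count_words(inputs):
    # Stage 1: count the words in one input partition.
    return Counter(inputs[0].split())


def merge_counts(inputs):
    # Stage 2: combine the partial counts from all upstream vertices.
    total = Counter()
    for partial in inputs:
        total.update(partial)
    return total


def top_word(inputs):
    # Stage 3: pick the most frequent word from the merged counts.
    return inputs[0].most_common(1)[0]


vertices = {
    "count_a": count_words,
    "count_b": count_words,
    "merge": merge_counts,
    "top": top_word,
}
edges = [("count_a", "merge"), ("count_b", "merge"), ("merge", "top")]
sources = {"count_a": "dryad runs a graph", "count_b": "a graph of vertices a"}

result = run_dag(vertices, edges, sources)
print(result["top"])  # ('a', 3)
```

The three stages chain directly, without writing intermediate results to disk between them. This is the kind of multi-stage flow that a single map-and-reduce pass cannot express without running several jobs back to back.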
Comments
By giving more power to developers, Dryad sacrificed simplicity. This is the exact opposite of the tradeoff made in the MapReduce design. Dryad works best as a substrate for higher-level systems that are more usable. This became apparent when Microsoft later built DryadLINQ and SCOPE on top of Dryad; both are easy to use and powerful, but take unnecessary flexibility away from developers. Dryad has been influential, at least within Microsoft. As far as we know, they heavily use DryadLINQ and SCOPE for most of their data-intensive workloads.
The paper itself has some obvious flaws, and its evaluation is one of them. A comparison against Microsoft SQL Server is nothing to be too excited about. It would have been nice if they had compared Dryad against some implementation of MapReduce (Hadoop was not publicly available when Dryad was being built, but they could easily have used Dryad to imitate MapReduce). That said, Dryad is expected to win over MapReduce for multi-stage workloads. I found the paper neither easy to read nor easy to follow, and often boring. Nevertheless, it is an interesting piece of work and the obvious next step after MapReduce.