Yahoo! Research, "PNUTS: Yahoo!’s Hosted Data Serving Platform," PVLDB, 2008. [PDF]
Category Archives: Reviews
Data-parallel pipelines using high-level languages
Microsoft, "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language," OSDI, 2008. [PDF]
Google, "FlumeJava: Easy, Efficient Data-Parallel Pipelines," PLDI, 2010. [LINK]
Background
Data-parallel computing systems expose high-level abstractions that let users reason about distributed computations, while automatically handling the low-level tasks of scheduling and fault tolerance without any user input. At …
Dremel: Interactive Analysis of Web-Scale Datasets
Google, "Dremel: Interactive Analysis of Web-Scale Datasets," VLDB, 2010. [PDF]
Summary
Dremel is Google's interactive ad hoc query system for analysis of read-only nested data. Unlike MapReduce, Dremel is aimed toward data exploration, monitoring, and debugging, where near real-time performance is of utmost importance. To achieve scalability and performance, Dremel builds upon three key ideas: It …
Dynamo: Amazon’s Highly Available Key-value Store
Amazon, "Dynamo: Amazon's Highly Available Key-value Store," SOSP, 2007. [PDF]
Summary
Dynamo is a highly available (99.9th percentile) key-value storage mechanism that sacrifices traditional consistency models for eventual consistency to achieve availability. Dynamo works with a simple query model, where read/write (get() and put()) operations are performed on data items uniquely identified by their keys. …
Bigtable: A Distributed Storage System for Structured Data
Google, "Bigtable: A Distributed Storage System for Structured Data," OSDI, 2006. [PDF]
Summary
Bigtable is a large-scale (petabytes of data across thousands of machines) distributed storage system for managing structured data. It is built on top of several existing Google technologies (e.g., GFS, Chubby, and Sawzall) and used by many of Google's online …
SCADS: Scale-Independent Storage for Social Computing Applications
Michael Armbrust, Armando Fox, David A. Patterson, Nick Lanham, Beth Trushkowsky, Jesse Trutna, Haruki Oh, "SCADS: Scale-Independent Storage for Social Computing Applications," CIDR, 2009. [PDF]
Summary
SCADS (Scalable Consistency Adjustable Data Storage) is a proposal for a collection of components leveraging database, control theory, and machine learning techniques to achieve data scale independence for rapidly …
High-level platforms on top of Hadoop
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," SIGMOD, 2008. [PDF]
Facebook Data Team, "Hive: Data Warehousing and Analytics on Hadoop." [LINK]
Summary
Pig and Hive are higher-level programming interfaces to Hadoop with corresponding data management tools and related optimizations developed by …
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," EuroSys, 2007. [PDF]
Summary
Dryad is Microsoft's answer to the MapReduce paradigm, albeit at a (slightly) lower level with greater flexibility. Like MapReduce, Dryad allows developers to think about what to do with the data, and Dryad …
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004. [PDF]
Summary
MapReduce is a programming model and associated implementation for processing and generating large data sets in a parallel, fault-tolerant, distributed, and load-balanced manner. There are two main functions (both user-provided) in this programming model. The map function takes an input …
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
Google, "Megastore: Providing Scalable, Highly Available Storage for Interactive Services," CIDR, 2011. [PDF]