Congratulations to the whole Spark community, its 1800+ contributors, and its innumerable users on the prestigious award. As I look back, memories from years ago continue to remind me how fortunate I was to be able to make small contributions during the infancy of this juggernaut.
Tag Archives: Spark
Orchestra is the Default Broadcast Mechanism in Apache Spark
With its recent release, Apache Spark has promoted Cornet, the BitTorrent-like broadcast mechanism proposed in Orchestra (SIGCOMM'11), to be its default broadcast mechanism. It's great to see our research see the light of day in the real world! Many thanks to Reynold and others for making it happen.
MLlib, the machine learning library of Spark, will enjoy the biggest boost from this change because of the broadcast-heavy nature of … Continue Reading ››
Spark wins the Best Paper Award at NSDI’2012
Spark (Resilient Distributed Datasets/RDDs) has won the Best Paper award at NSDI 2012. Woohoo! We were also nominated for the inaugural Community Award for open-sourcing the project.
Spark has been accepted at NSDI’2012
Our paper "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" has been accepted at NSDI'2012. This is Matei's brainchild and the joint work of many people including, but not limited to, TD, Ankur, Justin, Murphy, and professors Ion Stoica, Scott Shenker, and Michael Franklin. Unlike many other systems papers, Spark is … Continue Reading ››
Distributed in-memory datasets
AMPLab, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," UCB/EECS-2011-82, 2011. [PDF]
Russell Power, Jinyang Li, "Piccolo: Building Fast, Distributed Programs with Partitioned Tables," OSDI, 2010. [PDF]
Summary
MapReduce and similar frameworks, while widely applicable, are limited to directed acyclic data flow models, do not expose global state, and are generally slow due … Continue Reading ››
Technical report on Spark is available online
A technical report describing the key concepts behind Spark is available online. The abstract follows:
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce. RDDs are motivated by two types of applications … Continue Reading ››
Spark’s in the wild
We have been working on the Spark cluster computing framework for the last couple of years. It has always been open source under the BSD license on GitHub. But yesterday Matei announced the official launch of the Spark website (spark-project.org) and mailing lists, along with the 0.2 release, to everyone during the AMPLab summer retreat … Continue Reading ››
Orchestra has been accepted at SIGCOMM’2011
Update: The camera-ready version of the paper should be available on the publications page very soon!
Our paper "Managing Data Transfers in Computer Clusters with Orchestra" has been accepted at SIGCOMM'2011. This is joint work with Matei, Justin, and professors Mike Jordan and Ion Stoica. The project started as part of … Continue Reading ››
Spark short paper has been accepted at HotCloud’10
An initial overview of our ongoing work on Spark, an iterative and interactive framework for cluster computing, has been accepted at HotCloud'10. I joined the project last February, while Matei has been working on it since last Fall. I will upload the paper to the publications page once … Continue Reading ››