Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Matei Zaharia (University of California, Berkeley), Mosharaf Chowdhury (University of California, Berkeley), Tathagata Das (University of California, Berkeley), Ankur Dave (University of California, Berkeley), Justin Ma (University of California, Berkeley), Murphy McCauley (University of California, Berkeley), Michael J. Franklin (University of California, Berkeley), Scott Shenker (University of California, Berkeley), Ion Stoica (University of California, Berkeley)
April 25, 2012
Cited by 3,577

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
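
The contrast between coarse-grained transformations and fine-grained updates can be illustrated with a toy, single-process sketch. It is not Spark's API: the class ToyRDD and its methods are hypothetical names invented for this illustration. It also assumes that fault tolerance comes from remembering which transformation produced each dataset and reapplying it when an in-memory copy is lost, a mechanism the abstract itself does not spell out.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self,
                 source: Optional[List] = None,
                 parent: Optional["ToyRDD"] = None,
                 transform: Optional[Callable[[List], List]] = None):
        self._source = source        # base data, assumed to sit in reliable storage
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # one operation applied to the whole parent
        self._cache: Optional[List] = None  # in-memory copy; may be lost
        self.computations = 0        # how many times this dataset was materialized

    def map(self, fn: Callable) -> "ToyRDD":
        # Coarse-grained: the same function is applied to every element.
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List:
        # Materialize on demand; if the cached copy is gone, rebuild it
        # by reapplying the recorded transformation to the parent.
        if self._cache is None:
            if self._parent is None:
                rows = list(self._source)
            else:
                rows = self._transform(self._parent.collect())
            self._cache = rows
            self.computations += 1
        return list(self._cache)

    def lose_cache(self) -> None:
        # Simulate a failure that wipes the in-memory copy.
        self._cache = None


base = ToyRDD(source=list(range(10)))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()   # computed and kept in memory
evens_squared.lose_cache()        # simulated failure
second = evens_squared.collect()  # rebuilt from the recorded transformation
```

Because each dataset is defined by a single operation over its parent, the sketch only needs to store that operation in order to recover lost data. A design that allowed arbitrary per-element writes would instead have to log or replicate every update.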

