Tradeoffs in scalable data routing for deduplication clusters

Wei Dong; Fred Douglis; Kai Li; Hugo Patterson; Sazzala Reddy; Philip Shilane

doi:10.5555/1960475.1960477

Wei Dong(Princeton University), Fred Douglis, Kai Li(Princeton University), Hugo Patterson, Sazzala Reddy, Philip Shilane

February 15, 2011

Cited by 135

Abstract

As data have been growing rapidly in data centers, deduplication storage systems continuously face challenges in providing the throughput and capacity necessary to move backup data within backup and recovery windows. One approach is to build a cluster deduplication storage system with multiple deduplication storage system nodes. The goal is to achieve scalable throughput and capacity using extremely high-throughput (e.g., 1.5 GB/s) nodes, with a minimal loss of compression ratio. The key technical issue is to route data intelligently at an appropriate granularity. We present a cluster-based deduplication system that can deduplicate with high throughput, support deduplication ratios comparable to those of a single system, and maintain a low variation in the storage utilization of individual nodes. In experiments with dozens of nodes, we examine tradeoffs between stateless data routing approaches with low overhead and stateful approaches that have higher overhead but avoid imbalances that can adversely affect deduplication effectiveness for some datasets in large clusters. The stateless approach has been deployed in a two-node commercial system that achieves 3 GB/s for multi-stream deduplication throughput and currently scales to 5.6 PB of storage (assuming 20X total compression).
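
To make the idea of stateless routing concrete, the following is a minimal sketch and not the system described above. It assumes data arrives as groups of chunks and that each group is routed by a feature derived only from its content; the choice of SHA-1 fingerprints, the minimum-fingerprint feature, and the names `fingerprint` and `route_stateless` are illustrative assumptions, none of which the abstract specifies.

```python
import hashlib
from typing import Sequence


def fingerprint(chunk: bytes) -> int:
    """Content fingerprint of one chunk as an integer (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def route_stateless(chunks: Sequence[bytes], num_nodes: int) -> int:
    """Pick a node for a group of chunks using only the group's own content.

    No per-node state is consulted, so any client computes the same answer.
    The feature used here (the smallest chunk fingerprint) is hypothetical.
    """
    if not chunks:
        raise ValueError("cannot route an empty group of chunks")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    feature = min(fingerprint(c) for c in chunks)
    return feature % num_nodes


# Example: the same group of chunks always maps to the same node.
group = [b"block-a", b"block-b", b"block-c"]
node = route_stateless(group, num_nodes=4)
```

Because the decision depends only on content, identical groups land on the same node and can be deduplicated there without any lookup traffic. A stateful router, by contrast, would consult information about what each node already stores before choosing, which costs more per decision but leaves room to correct imbalances.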

