An Information-Theoretic Analysis of Deduplication

Deduplication finds and removes long-range data duplicates. It is commonly used in cloud and enterprise server settings and has been successfully applied to primary, backup, and archival storage. Despite its practical importance as a source-coding technique, its analysis from the point of view of in...

Full description

Saved in:
Bibliographic Details
Published in:IEEE transactions on information theory Vol. 65; no. 9; pp. 5688 - 5704
Main Author: Niesen, Urs
Format: Journal Article
Language:English
Published: New York IEEE 01-09-2019
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Deduplication finds and removes long-range data duplicates. It is commonly used in cloud and enterprise server settings and has been successfully applied to primary, backup, and archival storage. Despite its practical importance as a source-coding technique, its analysis from the point of view of information theory is missing. This paper provides such an information-theoretic analysis of data deduplication. It introduces a new source model adapted to the deduplication setting. It formalizes the two standard fixed-length and variable-length deduplication schemes, and it introduces a novel multi-chunk deduplication scheme. It then provides an analysis of these three deduplication variants, emphasizing the importance of boundary synchronization between source blocks and deduplication chunks. In particular, under fairly mild assumptions, the proposed multi-chunk deduplication scheme is shown to be order optimal.
ISSN:0018-9448
1557-9654
DOI:10.1109/TIT.2019.2916037