An Information-Theoretic Analysis of Deduplication
Deduplication finds and removes long-range data duplicates. It is commonly used in cloud and enterprise server settings and has been successfully applied to primary, backup, and archival storage. Despite its practical importance as a source-coding technique, its analysis from the point of view of in...
Saved in:
Published in: | IEEE transactions on information theory Vol. 65; no. 9; pp. 5688 - 5704 |
---|---|
Main Author: | |
Format: | Journal Article |
Language: | English |
Published: |
New York
IEEE
01-09-2019
The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Subjects: | |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Deduplication finds and removes long-range data duplicates. It is commonly used in cloud and enterprise server settings and has been successfully applied to primary, backup, and archival storage. Despite its practical importance as a source-coding technique, its analysis from the point of view of information theory is missing. This paper provides such an information-theoretic analysis of data deduplication. It introduces a new source model adapted to the deduplication setting. It formalizes the two standard fixed-length and variable-length deduplication schemes, and it introduces a novel multi-chunk deduplication scheme. It then provides an analysis of these three deduplication variants, emphasizing the importance of boundary synchronization between source blocks and deduplication chunks. In particular, under fairly mild assumptions, the proposed multi-chunk deduplication scheme is shown to be order optimal. |
---|---|
ISSN: | 0018-9448 1557-9654 |
DOI: | 10.1109/TIT.2019.2916037 |