Tracking Join and Self-Join Sizes in Limited Storage

This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the lar...

Full description

Saved in:

Bibliographic Details
Published in:	Journal of computer and system sciences Vol. 64; no. 3; pp. 719 - 747
Main Authors:	Alon, Noga, Gibbons, Phillip B., Matias, Yossi, Szegedy, Mario
Format:	Journal Article
Language:	English
Published:	Elsevier Inc 01-05-2002
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we consider two approaches proposed in N. Alon et al. (The space complexity of approximating the frequency moments, J. Comput. System Sci.58 (1999), 137–147), which we denote tug-of-war and sample-count . We present fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the first experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4-256 memory words, depending on the data set, in order to estimate the self-join size to within a 15% relative error; on average, this is over 4 times (50 times) fewer memory words than needed by sample-count (standard sampling, resp.) to obtain a similar accuracy. For joins, we propose schemes based on maintaining a small signature of each relation independently, such that join sizes can be quickly and accurately estimated between any pair of relations using only these signatures. We show that taking random samples for join signatures can lead to an inaccurate estimation unless the sample size is quite large; moreover, we show that no other signature scheme can significantly improve upon sampling without further assumptions. These negative results are shown to hold even in the presence of sanity bounds. On the other hand, we present a fast join signature scheme based on tug-of-war signatures that provides guarantees on join size estimation as a function of the self-join sizes of the joining relations; this scheme can significantly improve upon the sampling scheme.
ISSN:	0022-0000 1090-2724
DOI:	10.1006/jcss.2001.1813