Generalized linear models for massive data via doubly-sketching



Bibliographic Details
Published in:	Statistics and Computing, Vol. 33, no. 5
Main Authors:	Hou-Liu, Jason; Browne, Ryan P.
Format:	Journal Article
Language:	English
Published:	New York: Springer US, 01-10-2023; Springer Nature B.V.
Subjects:	Algorithms; Artificial Intelligence; Asymptotic methods; Asymptotic properties; Computer Science; Computing costs; Cost control; Data transfer (computers); Database systems; Datasets; Empirical analysis; Generalized linear models; Infrastructure; Iterative methods; Original Paper; Personal computers; Probability and Statistics in Computer Science; Regression coefficients; Sketches; Sketching; Statistical models; Statistical Theory and Methods; Statistics and Computing/Statistics Programs; Stochastic approximation; Subsampling

Description
Summary:	Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but require iterative estimation procedures that impose data transfer and computational costs that can be problematic under some infrastructure constraints. We propose a doubly-sketched approximation of the iteratively re-weighted least squares algorithm to estimate generalized linear model parameters using a sequence of surrogate datasets. The procedure sketches once to reduce data transfer costs, and sketches again to reduce data computation costs, yielding wall-clock time savings. Regression coefficients and standard errors are produced, with comparison against literature methods. Asymptotic properties of the proposed procedure are shown, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across almost 1.7 billion observations on a personal computer in 25 min.
ISSN:	0960-3174, 1573-1375
DOI:	10.1007/s11222-023-10274-8
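
Illustrative sketch (not part of the catalog record or the paper): the summary describes running iteratively re-weighted least squares on a sequence of surrogate datasets, sketching once to cut data transfer and again to cut computation. The Python below is a minimal toy version of that idea for a Poisson-log model, under assumptions that are not taken from the paper: both sketches are plain uniform row subsamples, column 0 of the design matrix is an intercept, and the final estimate is the average of the later iterates. The names `uniform_sketch`, `doubly_sketched_irls_poisson`, `m1` and `m2` are hypothetical, and the authors' actual sketching scheme and standard-error computation may differ.

```python
import numpy as np


def uniform_sketch(X, y, m, rng):
    """Return m rows sampled uniformly without replacement (a stand-in sketch)."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    return X[idx], y[idx]


def doubly_sketched_irls_poisson(X, y, m1, m2, n_iter=30, seed=0):
    """Toy Poisson-log IRLS that draws a fresh surrogate dataset per iteration.

    Assumptions (not from the paper): uniform subsampling for both sketches,
    column 0 of X is an intercept, y has a positive mean, and the result is
    the average of the second half of the iterates.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start at the intercept-only fit
    kept = []
    for t in range(n_iter):
        # First sketch: the m1 rows that would be transferred from storage.
        Xa, ya = uniform_sketch(X, y, m1, rng)
        # Second sketch: the m2 <= m1 rows actually used in the computation.
        Xb, yb = uniform_sketch(Xa, ya, m2, rng)
        eta = Xb @ beta                 # linear predictor
        mu = np.exp(eta)                # Poisson mean under the log link
        z = eta + (yb - mu) / mu        # IRLS working response
        XtW = Xb.T * mu                 # X^T W, with working weights W = diag(mu)
        beta = np.linalg.solve(XtW @ Xb, XtW @ z)  # weighted least squares step
        if t >= n_iter // 2:
            kept.append(beta)
    return np.mean(kept, axis=0)


# Small simulated example with known coefficients.
rng = np.random.default_rng(1)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = doubly_sketched_irls_poisson(X, y, m1=5_000, m2=2_000)
print(beta_hat)
```

Averaging the later iterates is a design choice made here only to damp the noise from using a different subsample at every step; with `m2` fixed, each weighted least squares solve costs the same regardless of the full dataset size.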