Generalized linear models for massive data via doubly-sketching
Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but require iterative estimation procedures that impose data transfer and computational costs that can be problematic under some infrastructure constraints. We propose a doubly-sketched approxi...
Saved in:
Published in: | Statistics and computing Vol. 33; no. 5 |
---|---|
Main Authors: | , |
Format: | Journal Article |
Language: | English |
Published: |
New York
Springer US
01-10-2023
Springer Nature B.V |
Subjects: | |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but require iterative estimation procedures that impose data transfer and computational costs that can be problematic under some infrastructure constraints. We propose a doubly-sketched approximation of the iteratively re-weighted least squares algorithm to estimate generalized linear model parameters using a sequence of surrogate datasets. The procedure sketches once to reduce data transfer costs, and sketches again to reduce data computation costs, yielding wall-clock time savings. Regression coefficients and standard errors are produced, with comparison against literature methods. Asymptotic properties of the proposed procedure are shown, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across almost 1.7 billion observations on a personal computer in 25 min. |
---|---|
ISSN: | 0960-3174 1573-1375 |
DOI: | 10.1007/s11222-023-10274-8 |