Finding Decision Tree Splits in Streaming and Massively Parallel Models
Main Authors:
Format: Journal Article
Language: English
Published: 28-03-2024
Subjects:
Online Access: Get full text
Summary: In this work, we provide data stream algorithms that compute optimal splits in decision tree learning. In particular, given a data stream of observations $x_i$ and their labels $y_i$, the goal is to find the optimal split $j$ that divides the data into two sets such that the mean squared error (for regression) or the misclassification rate and Gini impurity (for classification) are minimized. We provide several fast streaming algorithms that use sublinear space and a small number of passes for these problems. These algorithms can also be extended to the massively parallel computation model. Our work, while not directly comparable, complements the seminal work of Domingos-Hulten (KDD 2000) and Hulten-Spencer-Domingos (KDD 2001).
DOI: 10.48550/arxiv.2403.19867
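
The summary states the split objective but, as an abstract, omits the algorithms themselves. As a point of reference only, the sketch below illustrates one common reading of that objective for the regression case: an exact, offline computation of the MSE-optimal threshold over a single numeric feature, using sorting and prefix sums. This is not the paper's sublinear-space streaming algorithm; the function name `best_mse_split` and the toy data are illustrative assumptions.

```python
from typing import List, Tuple


def best_mse_split(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (threshold, sse): the split x <= threshold minimizing the sum of
    squared errors of each side around its own mean (exact, offline sketch)."""
    pts = sorted(points)                      # sort by feature value x
    n = len(pts)
    ys = [y for _, y in pts]

    # Prefix sums of y and y^2 let us score each candidate split in O(1).
    pref, pref_sq = [0.0], [0.0]
    for y in ys:
        pref.append(pref[-1] + y)
        pref_sq.append(pref_sq[-1] + y * y)

    def sse(lo: int, hi: int) -> float:
        """Sum of squared errors of ys[lo:hi] around its own mean."""
        m = hi - lo
        if m == 0:
            return 0.0
        s, s2 = pref[hi] - pref[lo], pref_sq[hi] - pref_sq[lo]
        return s2 - s * s / m                 # sum (y - mean)^2 = sum y^2 - (sum y)^2 / m

    best_thr, best_err = pts[0][0], float("inf")
    # Candidate splits: left side = first k sorted points, right side = the rest.
    for k in range(1, n):
        err = sse(0, k) + sse(k, n)
        if err < best_err:
            best_thr, best_err = pts[k - 1][0], err
    return best_thr, best_err


if __name__ == "__main__":
    data = [(0.1, 1.0), (0.4, 1.2), (0.5, 0.9), (2.0, 5.1), (2.3, 4.8)]
    print(best_mse_split(data))               # best split falls near x = 0.5
```

The classification objectives mentioned in the summary (misclassification rate and Gini impurity) admit the same left-to-right scan, with per-side class counts taking the place of the per-side sums of squared errors.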