Optimizing Prediction Serving on Low-Latency Serverless Dataflow
Main Authors: Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, Joseph M. Hellerstein
Format: Journal Article
Language: English
Published: 11-07-2020
Online Access: Get full text
Summary: Prediction serving systems are designed to provide large volumes of low-latency inferences from machine learning models. These systems mix data processing and computationally intensive model inference, and benefit from multiple heterogeneous processors and distributed computing resources. In this paper, we argue that a familiar dataflow API is well-suited to this latency-sensitive task, and amenable to optimization even with unmodified black-box ML models. We present the design of Cloudflow, a system that provides this API and realizes it on an autoscaling serverless backend. Cloudflow transparently implements performance-critical optimizations including operator fusion and competitive execution. Our evaluation shows that Cloudflow's optimizations yield significant performance improvements on synthetic workloads and that Cloudflow outperforms state-of-the-art prediction serving systems by as much as 2x on real-world prediction pipelines, meeting latency goals of demanding applications like real-time video analysis.
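
The summary names two optimizations, operator fusion and competitive execution, without showing the API; as a rough Python sketch of the fusion idea (the Dataflow, map, and fuse names below are hypothetical, not Cloudflow's actual interface):

```python
# Illustrative sketch only: a linear dataflow pipeline whose map operators
# are fused into a single callable, so a serverless backend can run the
# whole chain in one invocation instead of one invocation per operator.
# Class and method names are assumptions, not Cloudflow's real API.
from typing import Any, Callable, List

class Dataflow:
    def __init__(self) -> None:
        self.operators: List[Callable[[Any], Any]] = []

    def map(self, fn: Callable[[Any], Any]) -> "Dataflow":
        # Append a stage (e.g., preprocessing or a black-box model call).
        self.operators.append(fn)
        return self

    def fuse(self) -> Callable[[Any], Any]:
        # Operator fusion: compose the stages into one function, avoiding
        # per-operator scheduling and intermediate data movement.
        def fused(x: Any) -> Any:
            for op in self.operators:
                x = op(x)
            return x
        return fused

# Usage: normalize an input, then apply a stand-in "model".
pipeline = Dataflow().map(lambda v: v / 255.0).map(lambda v: f"label({v:.3f})")
predict = pipeline.fuse()
print(predict(128))  # -> label(0.502)
```

Competitive execution, by contrast, would run replicas of a variable-latency stage in parallel and keep the first result to finish, mitigating tail latency rather than changing the API.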
DOI: 10.48550/arxiv.2007.05832