CheckBullet: A Lightweight Checkpointing System for Robust Model Training on Mobile Networks

Bibliographic Details
Published in: IEEE Transactions on Mobile Computing, Vol. 23, no. 12, pp. 14946-14958
Main Authors: Jeon, Youbin, Choi, Hongrok, Jeong, Hyeonjae, Jung, Daeyoung, Pack, Sangheon
Format: Magazine Article
Language: English
Published: IEEE 01-12-2024
Description
Summary: Training on time-series data generated from mobile networks is a resource-intensive and time-consuming task that encounters various training failures. To cope with this issue, we propose CheckBullet, a lightweight checkpoint system that minimizes storage requirements and enables fast recovery in mobile networks. First, CheckBullet determines a checkpointing interval based on the characteristics of the model and the timing of failure occurrences. This approach ensures fast recovery while preserving the existing training runtime. Second, CheckBullet quantizes the weight tensor and eliminates duplicate weights, which significantly reduces the overall checkpoint size and therefore the storage requirements. Third, CheckBullet selects the checkpoints with the minimum training loss among the deduplicated checkpoints and merges them. This approach reduces recovery time while preserving the existing training loss. The experimental results show that CheckBullet can reduce the recovery time by 6× to 11× while barely increasing the training runtime. Furthermore, CheckBullet can reduce storage requirements by up to 70% while maintaining the minimum training loss.
ISSN: 1536-1233, 1558-0660
DOI: 10.1109/TMC.2024.3450283
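
Illustrative sketch: the second step in the summary (quantizing the weight tensor and eliminating duplicate weights) can be pictured roughly as below. This is not the authors' implementation. The 8-bit uniform quantizer, the SHA-256 content addressing, and the names quantize, dequantize, and DedupStore are all assumptions made for illustration; the record does not say how CheckBullet quantizes or detects duplicates.

```python
import hashlib


def quantize(weights):
    """Uniform 8-bit quantization (an assumed scheme, not from the paper).

    Returns the byte codes plus the offset and step needed to dequantize.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = bytes(round((w - lo) / scale) for w in weights)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map byte codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


class DedupStore:
    """Hypothetical content-addressed checkpoint store.

    Quantized tensors with identical bytes are stored once and shared
    by every checkpoint that references them.
    """

    def __init__(self):
        self.blobs = {}        # digest -> quantized bytes
        self.checkpoints = {}  # step -> {layer name: (digest, lo, scale)}

    def save(self, step, layers):
        manifest = {}
        for name, weights in layers.items():
            codes, lo, scale = quantize(weights)
            digest = hashlib.sha256(codes).hexdigest()
            self.blobs.setdefault(digest, codes)  # skip if already stored
            manifest[name] = (digest, lo, scale)
        self.checkpoints[step] = manifest

    def load(self, step):
        return {
            name: dequantize(self.blobs[digest], lo, scale)
            for name, (digest, lo, scale) in self.checkpoints[step].items()
        }
```

In this sketch, a layer whose weights do not change between two checkpoints contributes only one stored blob, while the restored values differ from the originals by at most half a quantization step.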