CheckBullet: A Lightweight Checkpointing System for Robust Model Training on Mobile Networks
Published in: | IEEE transactions on mobile computing Vol. 23; no. 12; pp. 14946 - 14958 |
---|---|
Main Authors: | , , , , |
Format: | Magazine Article |
Language: | English |
Published: | IEEE 01-12-2024 |
Subjects: | |
Summary: | Training on time-series data generated from mobile networks is a resource-intensive and time-consuming task that encounters various training failures. To cope with this issue, we propose CheckBullet, a lightweight checkpoint system to minimize storage requirements and enable fast recovery in mobile networks. First, CheckBullet determines a checkpointing interval based on the characteristics of the model and the timing of failure occurrences. This approach ensures fast recovery while preserving the existing training runtime. Second, CheckBullet quantizes the weight tensor and eliminates duplicate weights, which significantly reduces the overall checkpoint size, leading to a substantial decrease in storage requirements. Third, CheckBullet selects the checkpoints with the minimum training loss among the deduplicated checkpoints and merges them. This approach reduces recovery time while preserving existing training loss. The experimental results show that CheckBullet can reduce the recovery time by 6× to 11× while barely increasing the training runtime. Furthermore, CheckBullet can reduce storage requirements by up to 70% while maintaining the minimum training loss. |
---|---|
ISSN: | 1536-1233 1558-0660 |
DOI: | 10.1109/TMC.2024.3450283 |
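
Illustrative sketch (not taken from the article): the summary states that CheckBullet quantizes weight tensors and eliminates duplicate weights to shrink checkpoints. The Python below shows one generic way such a step could look, assuming uniform 8-bit quantization and content-hash deduplication across checkpoints. The names `quantize`, `dequantize`, and `DedupStore` are hypothetical, and the article's actual algorithm may differ.

```python
import hashlib
import struct


def quantize(weights):
    """Uniformly quantize a flat list of floats to 8-bit codes (assumed scheme)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    # Clamp to 255 to guard against floating-point overshoot at the maximum.
    codes = bytes(min(255, round((w - lo) / scale)) for w in weights)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map 8-bit codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


class DedupStore:
    """Hypothetical checkpoint store that keeps each distinct quantized tensor once."""

    def __init__(self):
        self.blobs = {}        # digest -> (codes, lo, scale)
        self.checkpoints = {}  # training step -> {tensor name: digest}

    def save(self, step, state):
        manifest = {}
        for name, weights in state.items():
            codes, lo, scale = quantize(weights)
            # Hash the codes together with their range so equal digests mean equal tensors.
            key = hashlib.sha256(codes + struct.pack("<dd", lo, scale)).hexdigest()
            self.blobs.setdefault(key, (codes, lo, scale))
            manifest[name] = key
        self.checkpoints[step] = manifest

    def load(self, step):
        manifest = self.checkpoints[step]
        return {name: dequantize(*self.blobs[key]) for name, key in manifest.items()}


store = DedupStore()
store.save(100, {"frozen": [0.0, 0.5, 1.0], "head": [0.1, 0.2, 0.3]})
store.save(200, {"frozen": [0.0, 0.5, 1.0], "head": [0.1, 0.25, 0.3]})
restored = store.load(200)
```

In this toy run the unchanged `frozen` tensor is stored once and referenced by both checkpoints, so only the tensors that changed between steps add to storage. The restored values are approximations, accurate to within half a quantization step.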