Understanding Self-Distillation and Partial Label Learning in Multi-Class Classification with Label Noise
Self-distillation (SD) is the process of training a student model using the outputs of a teacher model, with both models sharing the same architecture. Our study theoretically examines SD in multi-class classification with cross-entropy loss, exploring both multi-round SD and SD with refined teacher...
Saved in:
Main Authors: | , |
---|---|
Format: | Journal Article |
Language: | English |
Published: |
16-02-2024
|
Subjects: | |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Self-distillation (SD) is the process of training a student model using the
outputs of a teacher model, with both models sharing the same architecture. Our
study theoretically examines SD in multi-class classification with
cross-entropy loss, exploring both multi-round SD and SD with refined teacher
outputs, inspired by partial label learning (PLL). By deriving a closed-form
solution for the student model's outputs, we discover that SD essentially
functions as label averaging among instances with high feature correlations.
Initially beneficial, this averaging helps the model focus on feature clusters
correlated with a given instance for predicting the label. However, it leads to
diminishing performance with increasing distillation rounds. Additionally, we
demonstrate SD's effectiveness in label noise scenarios and identify the label
corruption condition and minimum number of distillation rounds needed to
achieve 100% classification accuracy. Our study also reveals that one-step
distillation with refined teacher outputs surpasses the efficacy of multi-step
SD using the teacher's direct output in high noise rate regimes. |
---|---|
DOI: | 10.48550/arxiv.2402.10482 |