Paper title
Reducing the Teacher-Student Gap via Spherical Knowledge Distillation
Paper authors
Abstract
Knowledge distillation aims to obtain a compact and effective model by learning the mapping function from a much larger one. Because the student has limited capacity, it underfits the teacher. As a result, student performance unexpectedly drops when distilling from an oversized teacher, a phenomenon termed the capacity gap problem. We investigate this problem by studying the gap in confidence between teacher and student. We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it. We propose Spherical Knowledge Distillation to eliminate this gap explicitly, which eases the underfitting problem. We find that this novel knowledge representation can improve compact models with much larger teachers and is robust to temperature. We conducted experiments on both CIFAR100 and ImageNet and achieved significant improvements. Specifically, we train ResNet18 to 73.0 accuracy, a substantial improvement over the previous SOTA and on par with ResNet34, which is almost twice the student's size. The implementation has been shared at https://github.com/forjiuzhou/Spherical-Knowledge-Distillation.
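The abstract does not spell out the loss, so the following is only a minimal sketch of the idea of removing the confidence (logit magnitude) gap before distilling. It assumes that each logit vector is centered and rescaled to a common norm, and that the student then matches the teacher through a KL divergence on the resulting softmax outputs. The function names and the `scale` and `temperature` parameters are hypothetical and are not taken from the paper or its repository.

```python
import math


def center_and_normalize(logits):
    """Subtract the mean and divide by the L2 norm (assumed projection onto a sphere)."""
    mean = sum(logits) / len(logits)
    centered = [z - mean for z in logits]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        return centered  # all logits equal: there is no direction to preserve
    return [c / norm for c in centered]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def spherical_kd_loss(student_logits, teacher_logits, scale=1.0, temperature=1.0):
    """KL(teacher || student) after both logit vectors share the same norm.

    `scale` is a hypothetical common radius; how the paper chooses it is not
    stated in the abstract.
    """
    s = [scale * z for z in center_and_normalize(student_logits)]
    t = [scale * z for z in center_and_normalize(teacher_logits)]
    p_s = softmax(s, temperature)
    p_t = softmax(t, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_t, p_s)
        if pt > 0.0
    )


# Same ranking and direction, very different magnitude: the loss is near zero.
same_direction = spherical_kd_loss([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# Reversed ranking: the loss is strictly positive.
opposite_direction = spherical_kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Under these assumptions the student is penalized only for disagreeing with the direction of the teacher's logits, not for being less confident, which is one way to read the abstract's claim that the magnitude of confidence is unnecessary for distillation.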
