Faster Knowledge Distillation Using Uncertainty-Aware Mixup
Author(s): Tata Ganesh. Originally published on Towards AI.

Photo by Jaredd Craig on Unsplash

In this article, we will review the paper titled "Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup" [1], which aims to reduce the computational cost of distilling the knowledge of computer vision models.

Disclaimer: This paper's arXiv draft was published in 2020, so some of the teacher models mentioned in the results are small by today's standards.

Knowledge Distillation

Knowledge distillation (KD) is the process of transferring knowledge from a larger model (called the teacher) to a smaller model (called the student). It is used to create compressed models that can run in resource-constrained environments. Further, KD typically yields a more accurate student than training the same model from scratch. In the original knowledge distillation paper by Hinton et al. [2], the student model is trained using the output logits from the teacher model for each training sample. The ground-truth labels are also included during training if they are available. This process is illustrated below.

Knowledge Distillation framework. Figure by author. Dog image from CIFAR-10 dataset [3]

Computational Cost of Knowledge Distillation

First, let us define the floating point operations that contribute to KD's computational cost. Note that these operations are defined per image.

F_t = Teacher forward pass (to get output logits from the teacher model)
F_s = Student forward pass (to get output logits from the student model)
B_s = Student backward pass (to update the weights of the student model)

The breakdown of the typical KD process for a mini-batch of N images is as follows:

1. A mini-batch of N images is passed through the teacher and the student models. The cost of this forward pass is F_t + F_s.
2. A distillation loss is applied between the teacher and the student models for different layers.
3. The student model's weights are updated during the backward pass.
The cost of this backward pass is B_s.

Note: Since the teacher model is much larger than the student model, we can assume that F_t >> F_s, F_t >> B_s and F_s = B_s.

This process can be summarized using the following figure:

Framework of Knowledge Distillation [1]

Hence, the total cost of KD for a mini-batch of N images is:

Cost_KD = N × (F_t + F_s + B_s)

Computational Cost of KD [1]

Reducing the number of images passed to the teacher model can reduce the overall computational cost of KD. So, how can we sample images from each mini-batch to reduce the cost of the teacher model's forward pass? Katharopoulos et al. [4] claim that not all samples in a dataset are equally important for neural network training. They propose an importance sampling technique that focuses computation on informative examples during training. Similarly, the informativeness of the examples in a mini-batch can be used to select only the informative ones and pass them to the teacher model. In the next section, we will discuss how the proposed method, named UNIX, performs this sampling.

UNcertainty-aware mIXup (UNIX)

UNIX Framework [1]

The sequence of steps for each mini-batch in UNIX is as follows:

Step 1: Student forward pass
Each mini-batch of images is fed to the student model to obtain the predicted class probabilities for each image.

Step 2: Uncertainty Estimation
For each image, the predicted probabilities are used to generate an uncertainty estimate. The uncertainty value loosely indicates the prediction confidence of the student model for each image: the higher the uncertainty, the lower the confidence. Based on the active learning literature [5], uncertainty can be used to estimate the informativeness of each image. For example, the authors use the entropy of the student model's predicted probability distribution to quantify uncertainty.

Uncertainty quantification using entropy [1]

Step 3: Shuffling and Sorting the mini-batch
The mini-batch is then sorted in decreasing order of sample uncertainties.
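
As a rough illustration of the uncertainty estimate and the sort described so far, the following Python sketch computes the entropy of each predicted distribution and orders the mini-batch from most to least uncertain. The probabilities are made up for the example, and the names entropy and sort_by_uncertainty are hypothetical helpers, not code from the paper [1]. Plain Python lists are used only to keep the sketch dependency-free.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    # Skip zero-probability classes: p * log(p) tends to 0 as p -> 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sort_by_uncertainty(batch_probs):
    """Return image indices ordered from most to least uncertain."""
    uncertainties = [entropy(p) for p in batch_probs]
    order = sorted(range(len(batch_probs)),
                   key=lambda i: uncertainties[i], reverse=True)
    return order, uncertainties


# Hypothetical student outputs for a mini-batch of three images, three classes.
student_probs = [
    [0.98, 0.01, 0.01],  # confident prediction -> low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction -> high entropy
    [0.70, 0.20, 0.10],  # somewhere in between
]
order, uncertainties = sort_by_uncertainty(student_probs)
print(order)  # [1, 2, 0]
```

The near-uniform prediction comes first because its entropy is close to the maximum of log(3), while the confident prediction comes last.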
Let us name the sorted mini-batch B_sorted. Further, the original mini-batch is shuffled. Let us name the shuffled mini-batch B_shuffled.

Step 4: Uncertainty-Aware Mixup
Mixup [6] is a data augmentation technique that performs a convex combination of two images and their corresponding labels in a mini-batch. Mixup has been shown to improve the generalization of neural networks.

Mixup Data Augmentation [6]. λ is used to control the magnitude of mixup.

The authors propose to use mixup as a way to compress information from two images into one, and then feed the mixed image to the teacher and student models for KD. An element-wise mixup is performed between the images in B_sorted and B_shuffled. Specifically,

Performing mixup based on sample uncertainty [1]

Here, c is a correction factor, which is a function of each sample's uncertainty. c ensures that mixup is mild for uncertain samples and strong for confident samples. Note that labels are NOT mixed.

Step 5: Sampling and Teacher forward pass
After performing mixup, k images are sampled from the N mixed images. These k mixed images are fed as input to the teacher and student models for KD.

Comparing Computational Costs

Consider the case where the batch size N = 64 and k = 40. Then, the computational cost of a forward pass for a mini-batch with and without UNIX is as follows (note that the final cost is expressed with respect to the student model):

Example of Computation Cost of KD with and without UNIX. Figure by Author.

In our example, KD with UNIX yields a ~25% reduction in computational cost, improving the computational efficiency of the distillation process.

Results

CIFAR-100 Results

Results of different model architectures on the CIFAR-100 [3] image classification dataset are shown below.

KD results on CIFAR-100 [1]. WRN means Wide ResNet [7].

In most cases, the performance of UNIXKD is on par with original KD. Specifically, UNIXKD with k=36 provides a good tradeoff between accuracy and computational cost.
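
To make the role of k concrete, here is a minimal Python sketch of the mixing and sampling steps together with the cost accounting from the previous section. Several parts are assumptions rather than the paper's method [1]: mix_weight is an illustrative stand-in for the correction factor c (it only reproduces the "mild for uncertain, strong for confident" behaviour), keeping the first k mixed images is one possible sampling rule, and the teacher-to-student cost ratio of 6 is a hypothetical value picked only because it reproduces the ~25% figure for N = 64 and k = 40. All function names are hypothetical.

```python
import math
import random


def mix_weight(uncertainty, num_classes):
    """Illustrative stand-in for the correction factor c (an assumption).

    Maps entropy in [0, log(num_classes)] to a weight in [0.5, 1.0]:
    uncertain samples keep most of their own pixels (mild mixup),
    confident samples are blended close to 50/50 (strong mixup).
    """
    return 0.5 + 0.5 * uncertainty / math.log(num_classes)


def unix_mix_and_select(images, uncertainties, num_classes, k, rng):
    """Sort by uncertainty, mix with a shuffled copy, keep k mixed images."""
    n = len(images)
    sorted_idx = sorted(range(n), key=lambda i: uncertainties[i], reverse=True)
    shuffled_idx = list(range(n))
    rng.shuffle(shuffled_idx)
    mixed = []
    for s, r in zip(sorted_idx, shuffled_idx):
        w = mix_weight(uncertainties[s], num_classes)
        # Element-wise convex combination of pixels; labels are not mixed.
        mixed.append([w * a + (1.0 - w) * b
                      for a, b in zip(images[s], images[r])])
    # Assumption: keep the k images built from the most uncertain samples.
    return mixed[:k]


def relative_cost(n, k, teacher_to_student):
    """Cost of UNIX divided by the cost of plain KD for one mini-batch.

    Costs are in units of one student forward pass, and the student
    backward pass is taken to cost the same as its forward pass.
    """
    # Teacher forward + student forward + student backward, per image.
    per_image_kd = teacher_to_student + 1.0 + 1.0
    plain_kd = n * per_image_kd
    # UNIX adds a student forward pass on all n images, then runs KD on k.
    unix_kd = n * 1.0 + k * per_image_kd
    return unix_kd / plain_kd


rng = random.Random(0)
images = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]  # toy "images"
uncertainties = [0.1, 1.0, 0.6, 0.3]  # made-up entropies, 3 classes
kept = unix_mix_and_select(images, uncertainties, num_classes=3, k=2, rng=rng)
print(len(kept))  # 2
print(relative_cost(n=64, k=40, teacher_to_student=6.0))  # 0.75
```

Under these assumptions, lowering k shrinks the teacher's share of the work, but the extra student pass over all N images puts a floor on the savings, which is why the choice of k is a tradeoff rather than a free parameter.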
Further, random sampling with KD (Random+KD) performs on par with or worse than UNIXKD for all model architectures, highlighting the importance of uncertainty-based sampling in improving computational efficiency with minimal reduction in accuracy.

ImageNet Results

Results on the ImageNet [8] dataset are shown below.

KD results on ImageNet [1].

The columns with "+label" specify KD with ground-truth labels. For experiments with and without ground-truth labels, UNIXKD performs on par with original KD while reducing the total computational cost by ~23%.

Conclusion

Knowledge distillation is a technique for transferring the knowledge of a large teacher model into a small student model. However, the high computational cost of performing a forward pass through the teacher model makes the distillation process computationally expensive. To tackle this problem, UNcertainty-aware mIXup (UNIX) uses uncertainty sampling and the mixup augmentation technique to pass a smaller number of images to the teacher model. Experiments on the CIFAR-100 and ImageNet datasets show that UNIX can reduce the computational cost of knowledge distillation by about 25% with minimal reduction in classification performance.

References

[1] G. Xu, Z. Liu, and C. C. Loy. Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup (2020), arXiv preprint arXiv:2012.09413.
[2] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network (2015), arXiv preprint arXiv:1503.02531.
[3] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images (2009).
[4] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling (2018), International Conference on Machine Learning. PMLR.
[5] B. Settles. Active learning literature survey (2010), University of Wisconsin, Madison, 52(55-66):11.
[6] H. Zhang, M. Cisse, Y. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization (2018), 6th International Conference on Learning Representations.
[7] S. Zagoruyko and N. Komodakis. Wide Residual Networks (2017), arXiv preprint arXiv:1605.07146.
[8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database (2009), IEEE Conference on Computer Vision and Pattern Recognition.