EFFICIENT ARCHITECTURE AND DATA MANIPULATION FOR DEEP LEARNING SYSTEMS

By Yu Zheng

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering—Doctor of Philosophy

2024

ABSTRACT

The significant progress of deep learning models in recent years can be attributed primarily to the growth of model scale and the volume of data on which they are trained. Although scaling up a model with sufficient training data typically provides enhanced performance, the amount of memory and the GPU hours used for training pose great challenges for deep learning infrastructures. Another challenge for training a good deep learning model is the quantity of the data it is trained on. To achieve state-of-the-art performance, it has become standard practice to train or fine-tune deep neural networks on a dataset augmented with well-designed augmentation transformations. This introduces difficulties in efficiently identifying the best data augmentation strategies for training. Furthermore, there has been a noticeable increase in dataset size across many learning tasks, making it the third challenge of modern deep learning systems. Datasets have grown larger and larger over the years, posing great burdens on storage and training cost. Moreover, it can be prohibitive to perform hyperparameter optimization and neural architecture search on networks trained on such massive datasets.

In this dissertation, we address the first challenge from a model-centric perspective. We propose MSUNet, which is designed with three key techniques: 1) ternary convolution layers, 2) sparse squeeze-excitation layers, and 3) a self-supervised consistency regularizer along with mixed precision training. These techniques allow faster training and inference of deep learning models without significant accuracy loss. We then look at deep learning systems from a data-centric perspective. To deal with the second challenge, we propose Deep AutoAugment (DeepAA), a multi-layer data augmentation search method which aims to remove the need to craft augmentation strategies manually. DeepAA fully automates the data augmentation process by searching for a deep data augmentation policy on an expanded set of transformations. We formulate the search of the data augmentation policy as a regularized gradient matching problem, maximizing the cosine similarity of the gradients between augmented data and original data with regularization. To avoid exponential growth of the dimensionality of the search space when more augmentation layers are used, we incrementally stack augmentation layers based on the data distribution transformed by all the previous augmentation layers. DeepAA achieves the best performance compared to existing automatic augmentation search methods evaluated on various models and datasets. To tackle the third challenge, we propose a dataset condensation method that distills the information of a large dataset into a small condensed dataset. The condensation is realized by matching the training trajectories of the original dataset with those of the condensed dataset. Experiments show that our proposed method outperforms the baseline methods.

Copyright by YU ZHENG 2024

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest appreciation to my academic advisor, Dr. Mi Zhang. He has been supporting me with profound expertise and insightful mentorship throughout my academic journey.
His mentorship has not only led to the success of our research projects, but has also inspired me to strive for excellence. I would also like to thank Dr. Yiming Deng, Dr. Guan-Hua Tu and Dr. Zhichao Cao, who serve as my committee members and provide valuable guidance for my research. I would like to express my profound gratitude to Dr. Tongtong Li for her support of my academic career. I am also deeply grateful for the opportunities to work alongside my fellows in the lab, including Shen Yan, Biyi Fang, Xiao Zeng, Dong Chen, Jiajia Li, Wei Ao, Zhe Wang, Yuan Liang, Zixiao Yu and Kaixiang Lin. Their encouragement and friendship have been a great support during my life at Michigan State University. I am also grateful to all the staff and faculty in the ECE department. My internships provided me with great learning and networking opportunities. I want to express my sincere appreciation to Dr. Zhi Zhang and Dr. Yi Zhu for hosting me at Amazon Web Services. I am grateful for the valuable experiences and collaboration opportunities they provided. Last but certainly not least, I must acknowledge the unwavering love and support from my parents. Their constant belief in my abilities and their emotional support have been the cornerstone of my academic and personal achievements.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Motivation and Challenges
  1.2 Summary of Research Contributions
  1.3 Thesis Organization
CHAPTER 2 MSUNET
  2.1 Introduction
  2.2 Related Work
  2.3 MSUNet Design
  2.4 Experiment
  2.5 Conclusion
CHAPTER 3 DEEP AUTOAUGMENT
  3.1 Introduction
  3.2 Related Work
  3.3 Deep AutoAugment
  3.4 Experiments and Analysis
  3.5 Conclusion
CHAPTER 4 DATASET CONDENSATION VIA IMPORTANCE AWARE TRAJECTORY MATCHING
  4.1 Introduction of Dataset Condensation
  4.2 Related Work
  4.3 Method
  4.4 Experiments
  4.5 Conclusion
CHAPTER 5 CONCLUSION
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Configuration of MSUNet series.
Table 2.2 The performance of MSUNet series on the CIFAR-100 dataset.
Table 2.3 Comparison between binary and ternary quantization of the MixNet-M base model.
Table 3.1 List of operations in the search space and the corresponding range of magnitudes in the standard augmentation space. Note that some operations do not use magnitude parameters. We add flip and crop to the search space, which were found in the default augmentation pipeline in previous works. Flip operates by randomly flipping the images with 50% probability. In line with previous works, crop denotes pad-and-crop and resize-and-crop transforms for CIFAR-10/100 and ImageNet respectively. We set the Cutout magnitude to 16 for CIFAR-10/100 to be the same as the Cutout in the default augmentation pipeline. We set the Cutout magnitude to 60 pixels for ImageNet, which is the upper limit of the magnitude used in AA [1].
Table 3.2 Top-1 test accuracy on CIFAR-10/100 for Wide-ResNet-28-10 and Shake-Shake-2x96d. The results of DeepAA are averaged over four independent runs with different initializations. The 95% confidence interval is denoted by ±.
Table 3.3 Top-1 test accuracy (%) on ImageNet for ResNet-50 and ResNet-200. The results of DeepAA are averaged over four independent runs with different initializations. The 95% confidence interval is denoted by ±.
Table 3.4 Top-1 test accuracy (%) on CIFAR-10/100 with WRN-28-10 and Batch Augmentation (BA), where eight augmented instances were drawn for each image. The results of DeepAA are averaged over four independent runs with different initializations. The 95% confidence interval is denoted by ±.
Table 3.5 Policy search time on CIFAR-10/100 and ImageNet in GPU hours.
Table 3.6 Top-1 test accuracy of DeepAA on CIFAR-10/100 for different numbers of augmentation layers. The results are averaged over 4 independent runs with different initializations with the 95% confidence interval denoted by ±.
Table 3.7 Top-1 test accuracy of DeepAA on ImageNet with ResNet-50 for different numbers of augmentation layers. The results are averaged over 4 independent runs with different initializations with the 95% confidence interval denoted by ±.
Table 3.8 Model hyperparameters of Batch Augmentation on CIFAR-10/100 for TA (Wide) and DeepAA. Learning rate, weight decay and number of epochs are found via grid search.
Table 4.1 Top-1 test accuracy on CIFAR-10/100 compared with previous work on coreset selection and dataset condensation. Consistent with prior work, we use a 3-layer ConvNet for both distillation and evaluation. The results are averaged over four independent runs with different initializations. The standard deviation is denoted by ±.
Table 4.2 Top-1 test accuracy on Tiny ImageNet compared with previous work on coreset selection and dataset condensation. For Tiny ImageNet, we use a 4-layer ConvNet for both distillation and evaluation, since it has a larger resolution. The results are averaged over four independent runs with different initializations. The standard deviation is denoted by ±.
Table 4.3 Hyper-parameters for different datasets.
Table 4.4 Top-1 test accuracy evaluated on CIFAR-100 with 50 images per class on different architectures.

LIST OF FIGURES

Figure 3.1 (A) Existing automated data augmentation methods with a shallow augmentation policy followed by hand-picked transformations. (B) DeepAA with a deep augmentation policy and no hand-picked transformations.
Figure 3.2 Top-1 test accuracy (%) on ImageNet of DeepAA-simple, DeepAA, and other automatic augmentation methods on ResNet-50.
Figure 3.3 The distribution of operations at each layer of the policy for CIFAR-10/100 and ImageNet. The probability of each operation is summed up over all 12 discrete intensity levels (see Appendix 3.4.5 and 3.4.6) of the corresponding transformation.
Figure 3.4 Illustration of the search trajectory of DeepAA in comparison with DeepTA on CIFAR-10.
Figure 3.5 The distribution of discrete magnitudes of each augmentation transformation in each layer of the policy for CIFAR-10/100. The x-axis represents the discrete magnitudes and the y-axis represents the probability. The magnitude is discretized to 12 levels with each transformation having its own range. A large absolute value of the magnitude corresponds to high transformation intensity. Note that we do not show identity, autoContrast, invert, equalize, flips, Cutout and crop because they do not have intensity parameters.
Figure 3.6 The distribution of discrete magnitudes of each augmentation transformation in each layer of the policy for ImageNet. The x-axis represents the discrete magnitudes and the y-axis represents the probability. The magnitude is discretized to 12 levels with each transformation having its own range. A large absolute value of the magnitude corresponds to high transformation intensity. Note that we do not show identity, autoContrast, invert, equalize, flips, Cutout and crop because they do not have intensity parameters.
Figure 3.7 Comparison of the policy of DeepAA and some publicly available augmentation policies found by other methods including AA, FastAA and DADA on the CIFAR-10 dataset. Since the compared methods have varied numbers of augmentation layers, we accumulate the probability of each operation over all the augmentation layers. Thus, the cumulative probability can be larger than 1. For AA, Fast AA and DADA, we add an additional 1.0 probability to flip, Cutout and crop, since they are applied by default. In addition, we normalize the magnitude to the range [-5, 5], and use color to distinguish different magnitudes.
Figure 4.1 An overview of dataset condensation. Dataset condensation focuses on generating new, condensed training data such that models trained on such a dataset have similar performance to models trained on the original dataset.
Figure 4.2 Illustration of trajectory matching. The yellow blocks indicate the expert trajectory of the teacher network, while the green blocks represent the trajectory of the student network. The objective is to minimize the distance between $\theta^*_{t+M}$ and $\hat{\theta}_{t+N}$.
Figure 4.3 Illustration of importance aware trajectory matching. The yellow blocks indicate the expert trajectory of the teacher network, while the green blocks represent the trajectory of the student network. The gray nodes in the neural network indicate that the corresponding weight is truncated by the importance aware factorization. The objective is to minimize the distance between the truncated version of $\theta^*_{t+M}$ and $\hat{\theta}_{t+N}$.
Figure 4.4 Visualization of the distilled Tiny ImageNet dataset with IPC=1 (1/2).
Figure 4.5 Visualization of the distilled Tiny ImageNet dataset with IPC=1 (2/2).
Figure 4.6 Visualization of the distilled Tiny ImageNet dataset with IPC=50 (1/2).
Figure 4.7 Visualization of the distilled Tiny ImageNet dataset with IPC=50 (2/2).

CHAPTER 1 INTRODUCTION

In this chapter, I first introduce the motivation of this thesis and the challenges for efficient deep learning systems. Then I introduce the specific research objectives of this thesis and provide a summary of the key contributions.

1.1 Motivation and Challenges

The remarkable performance of deep learning models has revolutionized a wide spectrum of areas including computer vision, speech recognition, and natural language processing. The significant progress of deep learning models in recent years can be attributed primarily to the growth of model scale and the volume of data on which they are trained. Although scaling up a model with sufficient training data typically provides enhanced performance, the amount of memory and the GPU hours used for training pose great challenges for deep learning infrastructures. Even for inference purposes, deploying large models on devices with limited resources remains challenging due to their high memory and computational requirements at inference time.

Another challenge for training a good deep learning model is the quantity of the data it is trained on. To achieve state-of-the-art performance, it has become standard practice to train or fine-tune deep neural networks on a dataset augmented with well-designed augmentation transformations. This introduces difficulties in efficiently identifying the best data augmentation strategies for training.

Furthermore, there has been a noticeable increase in dataset size across many learning tasks, making it the third challenge of modern deep learning systems. Datasets have grown larger and larger over the years, posing great burdens on storage and training cost. Moreover, it can be prohibitive to perform hyperparameter optimization and neural architecture search on networks trained on such massive datasets.

1.2 Summary of Research Contributions

This thesis studies efficient architecture and data manipulation techniques for deep learning systems. The principal contributions of this thesis are outlined as follows:

1. A ternary quantization method, which outperforms binary quantization (https://github.com/AIoT-MLSys-Lab/MSUNet).

2. A self-supervised consistency regularizer, which helps improve the sparsity of pruning and ternary quantization without significantly sacrificing test performance (https://github.com/AIoT-MLSys-Lab/MSUNet).

3. An efficient multi-layer data augmentation search method that finds a state-of-the-art deep data augmentation policy without using any hand-picked default transformations (https://github.com/AIoT-MLSys-Lab/DeepAA).

4.
An importance aware weight factorization, which ranks weight importance based on each weight's contribution to the layer output.

5. An importance aware trajectory matching method for dataset condensation that builds upon the importance aware weight factorization.

1.3 Thesis Organization

The remainder of this dissertation is organized as follows:

Chapter 2: MSUNet
In this chapter, I propose MSUNet, which is designed with three key techniques: 1) ternary convolution layers, 2) sparse squeeze-excitation layers, and 3) a self-supervised consistency regularizer along with mixed precision training. Our framework shows a great improvement in parameter size and computational cost, measured by #Parameters and #Flops, over the baseline model. Lastly, we show that ternary quantization outperforms binary quantization in all aspects, including accuracy, parameter size, and computation cost.

Chapter 3: Deep AutoAugment
In this chapter, I present Deep AutoAugment (DeepAA), a multi-layer data augmentation search method that finds a deep data augmentation policy without using any hand-picked default transformations. We formulate data augmentation search as a regularized gradient matching problem, which maximizes the gradient similarity between augmented data and original data along the direction with low variance. Our experimental results show that DeepAA achieves strong performance without using default augmentations, indicating that regularized gradient matching is an effective search method for data augmentation policies.

Chapter 4: Dataset Condensation via Importance Aware Trajectory Matching
Neural networks are typically over-parameterized with substantial redundancy, so not all weights are equally important. Based on this insight, in this chapter we first propose an importance aware weight factorization, which factorizes the weight matrix based on its influence on the layer output. Building on that, we propose an importance aware trajectory matching algorithm that matches only the most important weight components early in the training trajectory and gradually adds more components until the full weights are matched later in the trajectory. Our method removes the interference of redundant weights when matching the training trajectories. The results show improvements over previous work on the CIFAR and Tiny ImageNet datasets.

Chapter 5: Conclusion
This chapter concludes the whole thesis.

CHAPTER 2 MSUNET

2.1 Introduction

The breakthrough of deep learning in recent years can be partially attributed to the growth of model scale along with higher memory and computational requirements. These models have shown remarkable performance in a wide range of applications, such as image classification, object detection, image segmentation and natural language processing. However, deploying these models on devices with limited resources remains challenging due to their high memory and computational requirements.

There has been significant progress in developing neural network architectures and model compression strategies that enhance the memory and computational efficiency of deep neural networks. Efficient convolutional neural network designs such as SqueezeNet [2] and MobileNet [3] use 1x1 convolutions and depthwise separable convolutions, respectively. The technique of Block-term Tensor Decomposition [4] has been adopted in [5] to compress transformer-based language models.
Model compression methods that are agnostic to architecture, including knowledge distillation [6], network pruning [7], and trained quantization [7], are employed to enhance neural network performance or reduce model size. Recently, research efforts have also been directed towards automatically generating efficient models [8, 9, 10] and creating specialized hardware to run these compressed models [11, 12].

Network quantization [13] is also a major technique to address the memory and computational challenges of deep neural networks. The key idea of quantization is to convert the network weights or activations from higher precision floating point numbers to lower precision floating point numbers or even lower-bit integers. This brings significant reductions in memory overhead and computational cost. While quantization yields light-weight models with enhanced efficiency, it introduces noise into the network, resulting in worse performance than the original model. The challenge is to design a better quantization technique that brings less quantization noise to the model.

In light of the techniques for model compression and quantization, in this chapter we propose MSUNet, which is designed with three key techniques: 1) ternary convolution layers, 2) sparse squeeze-excitation layers, and 3) a self-supervised consistency regularizer along with mixed precision training.

2.2 Related Work

Efficient CNN Architectures: ResNet, proposed in [14], is the pioneering work on efficient convolutional neural network architectures. Compared with prior work such as AlexNet [15] and VGG [16], ResNet achieves better performance with far less memory and far fewer FLOPs for processing each image. The keys to the reduced parameter size with enhanced performance are parameter-free global average pooling layers and skip connections that enable deeper networks to be trained. Based on the ResNet architecture, many variants have since been proposed, such as Wide-ResNet [17] and ResNeXt [18]. MobileNets [19, 20, 21] transform 𝑘 × 𝑘 convolutions into separate depthwise and pointwise convolutions. ShuffleNets [22, 23] improve efficiency by incorporating group convolutions and channel shuffling into pointwise convolution. MixConv, introduced in [24], integrates multiple kernel sizes within a single convolutional layer. The work of [25] employs the butterfly transform as a surrogate for pointwise convolution. EfficientNet [26, 27] innovates with a compound scaling technique for balanced scaling across depth, width, and resolution. AdderNet [28] optimizes computational efficiency by substituting multiplications with additions. GhostNet [29] uses inexpensive linear operations to produce additional feature maps. Introduced in [30], a structural revision of the inverted residual block was shown to be effective in reducing information loss and gradient confusion.

Neural Network Quantization: Depending on whether the quantization is applied during training or at inference time, network quantization can be divided into two categories: 1) quantization aware training (QAT) and 2) post training quantization (PTQ). Half precision floating point arithmetic has become a standard technique for network training and inference, and is supported by NVIDIA GPUs and major deep learning frameworks [31]. It was shown in [32] that mixed precision training can achieve lossless performance while speeding up training and inference.
The key technique of mixed precision training is to store weights, activations, and gradients in FP16 while keeping an FP32 copy of the weights for gradient accumulation. It was reported in [33, 34, 35] that the parameters can be reduced to 8-bit integers after training without a significant drop in inference accuracy. Jacob et al. [36] proposed an 8-bit quantization technique for both training and inference, which shows an accuracy loss of 1.5% on ResNet-50. Xilinx [37] proposed an 8-bit lossless quantization that does not require retraining.

2.3 MSUNet Design

In this section, we present the detailed design of MSUNet. The key components of MSUNet include 1) ternary convolutional layers, 2) sparse squeeze-excitation layers and 3) a self-supervised consistency regularizer.

2.3.1 Ternary Convolutional Layers

For our proposed ternary convolutional layers, the weights are quantized into three levels {−1, 0, 1} according to a predefined threshold 𝛼. For a convolution layer with weight 𝑊 and a predefined threshold 𝛼, we modify the weight as ˜𝑊 = Ternary(𝑊, 𝛼), where the ternary operator Ternary(·, 𝛼) is a point-wise operation defined as:

$$\text{Ternary}(w, \alpha) = \begin{cases} 1, & \text{if } w > \alpha \\ 0, & \text{if } -\alpha \le w \le \alpha \\ -1, & \text{if } w < -\alpha \end{cases}$$   (2.1)

However, the quantization leads to a mismatch between the scale of the quantized weight ˜𝑊 and the original weight 𝑊, which needs to be mitigated. Suppose that the weight 𝑊 is initialized using a normal distribution with standard deviation 𝜎 according to [38]. The ternarized weight ˜𝑊 has standard deviation

$$\tilde{\sigma} = \sqrt{n / N}$$   (2.2)

where 𝑁 and 𝑛 are the number of all parameters and the number of non-zero parameters of ˜𝑊, respectively. Empirically, we found that the mismatch between 𝜎 and ˜𝜎 slows down convergence. Thus we modify the forward pass with a compensation factor 𝜎/˜𝜎 for the scale mismatch:

$$\tilde{Y} = \sigma \sqrt{\frac{N}{n}} \, \tilde{W} X$$   (2.3)

Similar to the approach of mixed precision training [32], we keep a copy of the original weights (with FP32 precision) during training, and allow the gradient to be accumulated into the original weights. Different from the normal gradient calculation, we use the ternary weight ˜𝑊 during the forward and backward passes; that is, the gradient is computed based on ˜𝑊 but accumulated on 𝑊. The code that reflects the ternary operation is shown in the code snippet below.

    class ForwardSign(torch.autograd.Function):

        @staticmethod
        def forward(ctx, x):
            # `alpha` and `args` are defined in the surrounding training script.
            global alpha
            # Standardize the weights, then ternarize with threshold alpha.
            x_ternary = (x - x.mean()) / x.std()
            ones = (x_ternary > alpha).type(torch.cuda.FloatTensor)
            neg_ones = -1 * (x_ternary < -alpha).type(torch.cuda.FloatTensor)
            x_ternary = ones + neg_ones
            # Compensation factor sigma * sqrt(N / n) from Eq. (2.3),
            # with sigma taken as the He-initialization standard deviation.
            multiplier = math.sqrt(2. / (x.shape[1] * x.shape[2] * x.shape[3])
                                   * x_ternary.numel() / x_ternary.nonzero().size(0))
            if args.amp:
                return (x_ternary.type(torch.cuda.HalfTensor),
                        torch.tensor(multiplier).type(torch.cuda.HalfTensor))
            else:
                return (x_ternary.type(torch.cuda.FloatTensor),
                        torch.tensor(multiplier).type(torch.cuda.FloatTensor))

Listing 2.1 The code snippet that reflects the ternary operation.
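Listing 2.1 only shows the forward pass. To make the mechanism concrete, the following is a minimal, self-contained sketch of how the custom autograd function and a ternary convolution layer could be wired together, assuming a straight-through estimator for the backward pass so that gradients computed with the ternary weight ˜𝑊 are accumulated on the FP32 master weight 𝑊. The names TernarizeSTE and TernaryConv2d, the fixed alpha value, and the single-tensor return format are illustrative assumptions and not part of the released implementation.

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    alpha = 0.05  # illustrative threshold; the dissertation treats alpha as predefined


    class TernarizeSTE(torch.autograd.Function):
        """Ternarize a weight tensor and rescale it as in Eq. (2.3)."""

        @staticmethod
        def forward(ctx, w):
            w_std = (w - w.mean()) / w.std()
            w_ternary = (w_std > alpha).float() - (w_std < -alpha).float()
            n = max(int(w_ternary.nonzero().size(0)), 1)
            # Compensation factor sigma * sqrt(N / n), with He-initialization sigma.
            fan_in = w.shape[1] * w.shape[2] * w.shape[3]
            scale = math.sqrt(2.0 / fan_in * w_ternary.numel() / n)
            return w_ternary * scale

        @staticmethod
        def backward(ctx, grad_output):
            # Straight-through estimator: pass the gradient unchanged to the
            # FP32 master weights, which the optimizer actually updates.
            return grad_output


    class TernaryConv2d(nn.Conv2d):
        """Convolution whose forward pass uses the ternarized weights."""

        def forward(self, x):
            return F.conv2d(x, TernarizeSTE.apply(self.weight), self.bias,
                            self.stride, self.padding, self.dilation, self.groups)

A layer such as TernaryConv2d(64, 128, kernel_size=3, padding=1) keeps its full-precision weights in self.weight while convolving with the ternarized copy, which mirrors the gradient-accumulation scheme described above and is compatible with the mixed precision setup of [32].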
2.3.2 Sparse Squeeze-excitation Layers

Empirically, we found that applying quantization to some specific network layers leads to a significant accuracy drop; these include the squeeze-excitation layers and the depth-wise convolutional layers. Therefore, we keep the layers that are sensitive to quantization in their original format (FP32) during the initial training. After the initial training, we freeze the quantized layers and prune the remaining quantization-sensitive layers using the magnitude pruner provided by the Neural Network Distiller toolbox [39].

2.3.3 Self-supervised Consistency Regularizer

During training, we use mixup [40] as a data augmentation technique. The self-supervised consistency regularizer is built upon mixup [40], enforcing feature-level consistency between mixup data points in the feature space without using the label information. Denoting a pair of sample-label tuples as (𝑥𝑖, 𝑦𝑖), (𝑥𝑗, 𝑦𝑗), mixup generates augmented tuples with a simple weighted sum [40]:

$$x' = \lambda x_i + (1 - \lambda) x_j$$   (2.4)
$$y' = \lambda y_i + (1 - \lambda) y_j$$   (2.5)

where 𝜆 ∈ [0, 1]. By using the constructed data point (𝑥′, 𝑦′) for training, mixup encourages linear behavior of the model, where linear interpolation of the raw data leads to linear interpolation of the predictions. Denoting Z as the space of features extracted by the network 𝑓 and 𝑧 ∈ Z, we define the regularizer term as:

$$z_{ij} = \lambda f_\theta(x_i) + (1 - \lambda) f_\theta(x_j)$$   (2.6)
$$\mathcal{L}_z = \frac{1}{B} \sum_i \left\| z_{ij} - f_\theta(x') \right\|_2^2$$   (2.7)

where 𝑥′ is from Eq. (2.4). That is, by minimizing L𝑧 we push the mixed feature 𝑧𝑖𝑗 closer to the feature of the mixed input 𝑓𝜃(𝑥′) through an MSE loss between the two vectors. In this way, we impose the linearity constraint at the feature level as well, without using the label information.
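To make Eqs. (2.4)-(2.7) concrete, the sketch below computes the feature-level consistency loss for one mini-batch. It assumes the network exposes a features() method that returns 𝑓𝜃(𝑥); that method name, the helper name, and the random pairing of samples are illustrative assumptions rather than the exact released implementation.

    import torch


    def consistency_loss(model, x, lam):
        # Pair every sample x_i with a randomly shuffled partner x_j.
        idx = torch.randperm(x.size(0), device=x.device)
        x_i, x_j = x, x[idx]

        # Eq. (2.4): mixed input x'.
        x_mix = lam * x_i + (1.0 - lam) * x_j

        # Eq. (2.6): mixture of the two features, z_ij.
        z_ij = lam * model.features(x_i) + (1.0 - lam) * model.features(x_j)

        # Eq. (2.7): squared L2 distance between the feature of the mixed input
        # and the mixed feature, averaged over the batch of size B.
        diff = model.features(x_mix) - z_ij
        return diff.pow(2).flatten(1).sum(dim=1).mean()

During training, this term would typically be added with a weighting coefficient to the usual mixup cross-entropy loss computed on (𝑥′, 𝑦′) from Eqs. (2.4) and (2.5).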
2.4 Experiment

2.4.1 Experiment Setup

We participated in the NeurIPS 2019 MicroNet Challenge (CIFAR-100 Track) with our proposed MSUNet. Thus, we evaluate the performance on the CIFAR-100 dataset. The backbone architecture for MSUNet is MixNet-M [41], which utilizes mixed depthwise convolution (MixConv) and squeeze-excitation layers to boost performance. MSUNet is a series of efficient neural networks with different building blocks. The configuration of each model is listed in Table 2.1.

Model       Ternary Convolutional   Sparse Squeeze-excite   Consistency Regularizer   Mixed-precision Training
MSUNet-V0   ✓
MSUNet-V1   ✓                       ✓
MSUNet-V2   ✓                       ✓                       ✓
MSUNet-V3   ✓                       ✓                       ✓                         ✓

Table 2.1 Configuration of MSUNet series.

Besides the techniques that we developed in Section 2.3, we also employ mixed-precision training that casts the weights from FP32 to FP16 for training and inference, which further boosts model efficiency.

2.4.2 Results on CIFAR-100

The performance of the MSUNet series is reported in Table 2.2. In addition to Top-1 accuracy, we also report #Flops and #Parameters at inference time for comparison. To calculate #Flops, a 32-bit operation counts as one operation. If quantization is performed, an operation on data of less than 32 bits is counted as a fraction of one operation. To calculate #Parameters, a 32-bit parameter counts as one parameter. Quantized parameters of less than 32 bits are counted as a fraction of one parameter. As required by the NeurIPS 2019 MicroNet Challenge, the Score is the sum of #Flops and #Parameters normalized relative to WideResNet-28-10, which has 36.5M parameters and 10.49B math operations:

$$\text{Score} = \frac{\#\text{Flops}}{10.49\,\text{B}} + \frac{\#\text{Parameters}}{36.5\,\text{M}}$$   (2.8)

As seen from Table 2.2, by applying ternary quantization alone, MSUNet-V0 achieves a very large improvement over the base model with a normalized score of 0.01836. By further adding the sparse squeeze-excitation component in MSUNet-V1, we see a slight accuracy drop along with a drop in #Flops and #Parameters. By adding the consistency regularizer component in MSUNet-V2, we did not see an improvement in accuracy; instead, we obtained a significant improvement in #Flops and #Parameters. This indicates that the consistency regularizer helps improve the quantization and pruning process. Lastly, as we add mixed-precision training in MSUNet-V3, an accuracy drop of 0.17% is coupled with further improvements in #Flops and #Parameters. The final MSUNet-V3 won 4th place in the NeurIPS 2019 MicroNet Challenge (CIFAR-100 Track).

Model       Top-1 acc   #Flops     #Parameters   Score
MSUNet-V0   80.61%      122.6 M    0.2437 M      0.01836
MSUNet-V1   80.47%      118.6 M    0.2199 M      0.01711
MSUNet-V2   80.30%      97.01 M    0.1204 M      0.01255
MSUNet-V3   80.13%      85.27 M    0.09853 M     0.01083

Table 2.2 The performance of MSUNet series on the CIFAR-100 dataset.

2.4.3 Comparison of Ternary and Binary Quantization

Even though the ternary quantization technique proposed in Section 2.3 admits binary quantization (i.e., quantizing to {−1, 1} instead of {−1, 0, 1}) without significant modification of the implementation, we found that ternary quantization performs significantly better than binary quantization in terms of accuracy, #Flops and #Parameters, as shown in Table 2.3. The major reason is that ternarization introduces many 0s, which can be efficiently represented by sparse data structures. Moreover, the additional quantization level of ternary weights brings more freedom for performing quantization.

Quantization   Top-1 acc   #Flops    #Parameters
Binary         80.43%      170.0 M   0.2564 M
Ternary        80.61%      122.6 M   0.2437 M

Table 2.3 Comparison between binary and ternary quantization of the MixNet-M base model.

2.5 Conclusion

In this chapter, we propose MSUNet, which is designed with three key techniques: 1) ternary convolution layers, 2) sparse squeeze-excitation layers, and 3) a self-supervised consistency regularizer along with mixed precision training. Our framework shows a great improvement in parameter size and computational cost, measured by #Parameters and #Flops, over the baseline model. Lastly, we show that ternary quantization outperforms binary quantization in all aspects, including accuracy, parameter size, and computation cost.

CHAPTER 3 DEEP AUTOAUGMENT

3.1 Introduction

Data augmentation (DA) is a powerful technique for machine learning since it effectively regularizes the model by increasing the number and the diversity of data points [42, 43]. A large body of data augmentation transformations has been proposed [44, 40, 45, 46, 47, 48] to improve model performance. While applying a set of well-designed augmentation transformations can yield considerable performance enhancement, especially in image recognition tasks, manually selecting high-quality augmentation transformations and determining how they should be combined still require strong domain expertise and prior knowledge of the dataset of interest. With the recent trend of automated machine learning (AutoML), data augmentation search flourishes in the image domain [1, 49, 50, 51, 52, 53, 54], yielding significant performance improvements over hand-crafted data augmentation methods.

Figure 3.1 (A) Existing automated data augmentation methods with a shallow augmentation policy followed by hand-picked transformations. (B) DeepAA with a deep augmentation policy and no hand-picked transformations.
Although data augmentation policies in previous works [1, 49, 50, 51, 52, 53] contain multiple transformations applied sequentially, only one or two transformations of each sub-policy are found through searching, whereas the remaining transformations are hand-picked and applied by default in addition to the found policy (Figure 3.1(A)). From this perspective, we believe that previous automated methods are not entirely automated, as they are still built upon hand-crafted default augmentations.

In this work, we propose Deep AutoAugment (DeepAA), a multi-layer data augmentation search method which aims to remove the need for hand-crafted default transformations (Figure 3.1(B)). DeepAA fully automates the data augmentation process by searching for a deep data augmentation policy on an expanded set of transformations that includes the widely adopted search space and the default transformations (e.g., flips, Cutout, crop). We formulate the search of the data augmentation policy as a regularized gradient matching problem by maximizing the cosine similarity of the gradients between augmented data and original data with regularization. To avoid exponential growth of the dimensionality of the search space when more augmentation layers are used, we incrementally stack augmentation layers based on the data distribution transformed by all the previous augmentation layers.

We evaluate the performance of DeepAA on three datasets – CIFAR-10, CIFAR-100, and ImageNet – and compare it with existing automated data augmentation search methods including AutoAugment (AA) [1], PBA [50], Fast AutoAugment (FastAA) [51], Faster AutoAugment (Faster AA) [52], DADA [53], RandAugment (RA) [49], UniformAugment (UA) [55], TrivialAugment (TA) [56], and Adversarial AutoAugment (AdvAA) [57]. Our results show that, without any default augmentations, DeepAA achieves the best performance compared to existing automatic augmentation search methods on CIFAR-10 and CIFAR-100 with Wide-ResNet-28-10 and on ImageNet with ResNet-50 and ResNet-200, using the standard augmentation space and training procedure.

We summarize our main contributions below:

• We propose Deep AutoAugment (DeepAA), a fully automated data augmentation search method that finds a multi-layer data augmentation policy from scratch.

• We formulate such multi-layer data augmentation search as a regularized gradient matching problem. We show that maximizing cosine similarity along the direction of low variance is effective for data augmentation search when augmentation layers go deep.

• We address the issue of exponential growth of the dimensionality of the search space when more augmentation layers are added by incrementally adding augmentation layers based on the data distribution transformed by all the previous augmentation layers.

• Our experimental results show that, without using any default augmentations, DeepAA achieves stronger performance compared with prior works.

3.2 Related Work

Automated Data Augmentation. Automating data augmentation policy design has recently emerged as a promising paradigm for data augmentation. The pioneering work on automated data augmentation was proposed in AutoAugment [1], where the search is performed under a reinforcement learning framework. AutoAugment requires training the neural network repeatedly, which takes thousands of GPU hours to converge.
Subsequent works [51, 53, 54] aim at reducing the computation cost. Fast AutoAugment [51] treats data augmentation as inference-time density matching, which can be implemented efficiently with Bayesian optimization. Differentiable Automatic Data Augmentation (DADA) [53] further reduces the computation cost through a reparameterized Gumbel-softmax distribution [58]. RandAugment [49] introduces a simplified search space containing two interpretable hyperparameters, which can be optimized simply by grid search. Adversarial AutoAugment (AdvAA) [57] searches for the augmentation policy in an adversarial and online manner. It also incorporates the concept of Batch Augmentation [59, 60], where multiple adversarial policies run in parallel. Although many automated data augmentation methods have been proposed, the use of default augmentations still imposes strong domain knowledge.

Gradient Matching. Our work is also related to gradient matching. In [61], the authors showed that the cosine similarity between the gradients of different tasks provides a signal to detect when an auxiliary loss is helpful to the main loss. In [62], the authors proposed to use cosine similarity as the training signal to optimize data usage via weighting data points. A similar approach was proposed in [63], which uses the gradient inner product as a per-example reward for optimizing data distribution and data augmentation under a reinforcement learning framework. Our approach also utilizes cosine similarity to guide the data augmentation search. However, our implementation of cosine similarity differs from the above in two aspects: we propose a Jacobian-vector product form to backpropagate through the cosine similarity, which is computationally and memory efficient and does not require computing higher order derivatives; we also propose a sampling scheme that effectively allows the cosine similarity to increase with added augmentation stages.

3.3 Deep AutoAugment

3.3.1 Overview

Data augmentation can be viewed as a process of filling in missing data points in the dataset with the same data distribution [52]. By augmenting a single data point multiple times, we expect the resulting data distribution to be close to the full dataset under a certain type of transformation. For example, by augmenting a single image with proper color jittering, we obtain a batch of augmented images which has a similar distribution of lighting conditions as the full dataset. As the distribution of augmented data gets closer to the full dataset, the gradient of the augmented data should be steered towards that of a batch of original data sampled from the dataset.

In DeepAA, we formulate the search of the data augmentation policy as a regularized gradient matching problem, which steers the gradient towards that of a batch of original data by augmenting a single image multiple times. Specifically, we construct the augmented training batch by augmenting a single training data point multiple times following the augmentation policy. We construct a validation batch by sampling a batch of original data from the validation set. We expect that, through augmentation, the gradient of the augmented training batch can be steered towards the gradient of the validation batch. To do so, we search for the data augmentation that maximizes the cosine similarity between the gradients of the validation data and the augmented training data.
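The search signal just described can be summarized in a few lines: augment one training example many times, average the resulting gradient, and compare it with the gradient of a batch of original data. The sketch below is a simplified, monitoring-only version of that signal; the helper apply_policy, the number of augmented draws, and the use of cross-entropy are illustrative assumptions, and the actual search differentiates this quantity with respect to the policy parameters as derived in Section 3.3.3.

    import torch
    import torch.nn.functional as F


    def gradient_cosine_similarity(model, x_train, y_train, x_val, y_val,
                                   apply_policy, num_draws=32):
        """Cosine similarity between the averaged gradient of one augmented
        training example (x_train, y_train) and the gradient of a batch of
        original data (x_val, y_val)."""
        params = [p for p in model.parameters() if p.requires_grad]

        # Averaged gradient of the single, repeatedly augmented training example.
        aug_batch = torch.stack([apply_policy(x_train) for _ in range(num_draws)])
        loss_aug = F.cross_entropy(model(aug_batch), y_train.repeat(num_draws))
        g = torch.cat([t.flatten() for t in torch.autograd.grad(loss_aug, params)])

        # Gradient of a batch of original (validation) data.
        loss_val = F.cross_entropy(model(x_val), y_val)
        v = torch.cat([t.flatten() for t in torch.autograd.grad(loss_val, params)])

        return F.cosine_similarity(g, v, dim=0)

A higher value indicates that the sampled augmentations steer the training gradient towards the gradient of the original data distribution.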
The intuition is that an effective data augmentation should preserve the data distribution [64]: the distribution of the augmented images should align with the distribution of the validation set, such that the training gradient direction is close to the validation gradient direction.

Another challenge for augmentation policy search is that the search space can be prohibitively large with deep augmentation layers (𝐾 ≥ 5). This was not a problem in previous works, where the augmentation policies are shallow (𝐾 ≤ 2). For example, in AutoAugment [1], each sub-policy contains 𝐾 = 2 transformations to be applied sequentially, and the search space of AutoAugment contains 16 image operations and 10 discrete magnitude levels. The resulting number of combinations of transformations in AutoAugment is roughly (16 × 10)² = 25,600, which is handled well in previous works. However, when discarding the default augmentation pipeline and searching for data augmentations from scratch, deeper augmentation layers are required in order to perform well. For a data augmentation with 𝐾 = 5 sequentially applied transformations, the number of sub-policies is (16 × 10)⁵ ≈ 10¹¹, which is prohibitively large for the following two reasons. First, it becomes less likely to encounter a good policy by exploration, as good policies become sparser in a high dimensional search space. Second, the number of parameters in the policy also grows with 𝐾, making it more computationally challenging to optimize. To tackle this challenge, we propose to build up the full data augmentation by progressively stacking augmentation layers, where each augmentation layer is optimized on top of the data distribution transformed by all previous layers. This avoids sampling sub-policies from such a large search space, and the number of parameters of the policy is reduced from |T|^𝐾 to |T| for each augmentation layer.

3.3.2 Search Space

Let O denote the set of augmentation operations (e.g., identity, rotate, brightness), 𝑚 denote an operation magnitude in the set M, and 𝑥 denote an image sampled from the space X. We define the set of transformations as the set of operations with a fixed magnitude, T := {𝑡 | 𝑡 = 𝑜(· ; 𝑚), 𝑜 ∈ O and 𝑚 ∈ M}. Under this definition, every 𝑡 is a map 𝑡 : X → X, and there are |T| = |M| · |O| possible transformations. In previous works [1, 51, 53, 52], a data augmentation policy P consists of several sub-policies. As explained above, the number of candidate sub-policies grows exponentially with depth 𝐾. Therefore, we propose a practical method that builds up the full data augmentation by progressively stacking augmentation layers. The final data augmentation policy hence consists of 𝐾 layers of sequentially applied policies P = {P₁, · · · , P𝐾}, where policy P𝑘 is optimized conditioned on the data distribution augmented by all previous (𝑘 − 1) layers of policies. Thus we write the policy as a conditional distribution P𝑘 := 𝑝𝜃𝑘(𝑛 | {P₁, · · · , P𝑘₋₁}), where 𝑛 denotes the index of a transformation in T. For clarity, we use the simplified notation 𝑝𝜃𝑘 in place of 𝑝𝜃𝑘(𝑛 | {P₁, · · · , P𝑘₋₁}).
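As a concrete illustration of the definition of T, the sketch below enumerates {operation, magnitude} pairs in the way described above, pairing each magnitude-bearing operation with uniformly spaced discrete levels (12 levels and the ranges of Table 3.1 are used in the experiments of Section 3.4). The operation names follow the text, while the actual image transformations (e.g., PIL-based implementations) are left abstract; the helper name is an illustrative assumption.

    import numpy as np

    # Operations without a magnitude parameter.
    PARAMETER_FREE = ["identity", "autoContrast", "invert", "equalize",
                      "flips", "cutout", "crop"]

    # Operations with a magnitude range, as in Table 3.1.
    MAGNITUDE_RANGES = {
        "shearX": (-0.3, 0.3),
        "shearY": (-0.3, 0.3),
        "translateX": (-0.45, 0.45),
        "translateY": (-0.45, 0.45),
        "rotate": (-30, 30),
        "solarize": (0, 256),
        "posterize": (4, 8),
        "contrast": (0.1, 1.9),
        "color": (0.1, 1.9),
        "brightness": (0.1, 1.9),
        "sharpness": (0.1, 1.9),
    }


    def build_transformation_set(num_levels=12):
        """Enumerate the set T of {operation, magnitude} pairs."""
        transforms = [(op, None) for op in PARAMETER_FREE]
        for op, (lo, hi) in MAGNITUDE_RANGES.items():
            for m in np.linspace(lo, hi, num_levels):
                transforms.append((op, float(m)))
        return transforms


    T = build_transformation_set()
    # 7 parameter-free operations + 11 x 12 discretized ones = 139 transformations.
    print(len(T))  # 139

With this enumeration, the policy of each augmentation layer is simply a categorical distribution over the 139 entries of T, which is what the regularized gradient matching in Section 3.3.3 optimizes.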
3.3.3 Augmentation Policy Search via Regularized Gradient Matching

Assume that a single data point 𝑥 is augmented multiple times following the policy 𝑝𝜃. The resulting average gradient of such augmentation is denoted as 𝑔(𝑥, 𝜃), which is a function of the data 𝑥 and the policy parameters 𝜃. Let 𝑣 denote the gradient of a batch of the original data. We optimize the policy by maximizing the cosine similarity between the gradients of the augmented data and a batch of the original data as follows:

$$\theta = \arg\max_{\theta} \; \text{cosineSimilarity}(v, g(x, \theta)) = \arg\max_{\theta} \; \frac{v^T \cdot g(x, \theta)}{\|v\| \cdot \|g(x, \theta)\|}$$   (3.1)

where ∥·∥ denotes the L2-norm. The parameters of the policy can be updated via gradient ascent:

$$\theta \leftarrow \theta + \eta \nabla_{\theta}\, \text{cosineSimilarity}(v, g(x, \theta))$$   (3.2)

where 𝜂 is the learning rate.

3.3.3.1 Policy Search for One Layer

We start with the case where the data augmentation policy only contains a single augmentation layer, i.e., P = {𝑝𝜃}. Let 𝐿(𝑥; 𝑤) denote the classification loss of data point 𝑥, where 𝑤 ∈ R^𝐷 represents the flattened weights of the neural network. Consider applying augmentation to a single data point 𝑥 following the distribution 𝑝𝜃. The resulting averaged gradient can be calculated analytically by averaging over all possible transformations in T with the corresponding probabilities 𝑝𝜃:

$$g(x; \theta) = \sum_{n=1}^{|\mathbb{T}|} p_{\theta}(n) \, \nabla_w L(t_n(x); w) = G(x) \cdot p_{\theta}$$   (3.3)

where $G(x) = \left[ \nabla_w L(t_1(x); w), \cdots, \nabla_w L(t_{|\mathbb{T}|}(x); w) \right]$ is a 𝐷 × |T| Jacobian matrix, and $p_{\theta} = [p_{\theta}(1), \cdots, p_{\theta}(|\mathbb{T}|)]^T$ is a |T|-dimensional categorical distribution. The gradient of the cosine similarity in Eq. (3.2) can be derived as:

$$\nabla_{\theta}\, \text{cosineSimilarity}(v, g(x; \theta)) = \nabla_{\theta} p_{\theta} \cdot r$$   (3.4)

where

$$r = G(x)^T \left( \frac{v}{\|g(\theta)\|} - \frac{v^T g(\theta)}{\|g(\theta)\|^2} \cdot \frac{g(\theta)}{\|g(\theta)\|} \right)$$   (3.5)

which can be interpreted as a reward for each transformation. Therefore, $p_{\theta} \cdot r$ in Eq. (3.4) represents the average reward under policy 𝑝𝜃.
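The single-layer quantities in Eqs. (3.3)-(3.5) translate almost directly into code. The sketch below computes the Jacobian G(x) column by column, the averaged gradient g, and the per-transformation reward r; it is a naive illustration that ignores batching, memory constraints, and the Jacobian-vector-product trick mentioned in Section 3.2, and all helper names are assumptions.

    import torch
    import torch.nn.functional as F


    def flat_grad(loss, params):
        return torch.cat([g.flatten() for g in torch.autograd.grad(loss, params)])


    def one_layer_reward(model, x, y, v, transforms, p_theta):
        """Eqs. (3.3)-(3.5): averaged gradient g = G(x) p_theta and reward r.

        x is a single image, y its label, v the flattened gradient of a batch of
        original data, transforms a list of callables t: image -> image, and
        p_theta a |T|-dimensional probability vector.
        """
        params = [p for p in model.parameters() if p.requires_grad]

        # Columns of the D x |T| Jacobian G(x): one gradient per transformation.
        cols = []
        for t in transforms:
            loss = F.cross_entropy(model(t(x).unsqueeze(0)), y.view(1))
            cols.append(flat_grad(loss, params))
        G = torch.stack(cols, dim=1)                              # D x |T|

        g = G @ p_theta                                           # Eq. (3.3)
        g_norm = g.norm()
        r = G.t() @ (v / g_norm - (v @ g) / g_norm**2 * g / g_norm)   # Eq. (3.5)
        return r, g

Treating p_theta as the softmax of a logit vector, the update in Eq. (3.4) can then be implemented by backpropagating the scalar (p_theta * r.detach()).sum() and taking a gradient ascent step on the logits.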
3.3.3.2 Policy Search for Multiple Layers

The above derivation is based on the assumption that 𝑔(𝜃) can be computed analytically by Eq. (3.3). However, when 𝐾 ≥ 2, it becomes impractical to compute the average gradient of the augmented data, given that the search space dimensionality grows exponentially with 𝐾; consequently, we would need to average the gradients of all |T|^𝐾 possible sub-policies. To reduce the number of parameters of the policy to |T| for each augmentation layer, we propose to incrementally stack augmentations based on the data distribution transformed by all the previous augmentation layers. Specifically, let P = {P₁, · · · , P𝐾} denote the 𝐾-layer policy. The policy P𝑘 modifies the data distribution on top of the data distribution augmented by the previous (𝑘 − 1) layers. Therefore, the policy at the 𝑘-th layer is a distribution P𝑘 = 𝑝𝜃𝑘(𝑛) conditioned on the policies {P₁, · · · , P𝑘₋₁}, where each one is a |T|-dimensional categorical distribution. Given that, the Jacobian matrix at the 𝑘-th layer can be derived by averaging over the previous (𝑘 − 1) layers of policies as follows:

$$G_k(x) = \sum_{n_{k-1}=1}^{|\mathbb{T}|} \cdots \sum_{n_1=1}^{|\mathbb{T}|} p_{\theta_{k-1}}(n_{k-1}) \cdots p_{\theta_1}(n_1) \left[ \nabla_w L\big((t_1 \circ t_{n_{k-1}} \circ \cdots \circ t_{n_1})(x); w\big), \cdots, \nabla_w L\big((t_{|\mathbb{T}|} \circ t_{n_{k-1}} \circ \cdots \circ t_{n_1})(x); w\big) \right]$$   (3.6)

where $G_k$ can be estimated via the Monte Carlo method as:

$$\tilde{G}_k(x) = \sum_{\tilde{n}_{k-1} \sim p_{\theta_{k-1}}} \cdots \sum_{\tilde{n}_1 \sim p_{\theta_1}} \left[ \nabla_w L\big((t_1 \circ t_{\tilde{n}_{k-1}} \circ \cdots \circ t_{\tilde{n}_1})(x); w\big), \cdots, \nabla_w L\big((t_{|\mathbb{T}|} \circ t_{\tilde{n}_{k-1}} \circ \cdots \circ t_{\tilde{n}_1})(x); w\big) \right]$$   (3.7)

where $\tilde{n}_{k-1} \sim p_{\theta_{k-1}}(n), \cdots, \tilde{n}_1 \sim p_{\theta_1}(n)$. The average gradient at the 𝑘-th layer can be estimated by the Monte Carlo method as:

$$\tilde{g}(x; \theta_k) = \sum_{\tilde{n}_k \sim p_{\theta_k}} \cdots \sum_{\tilde{n}_1 \sim p_{\theta_1}} \nabla_w L\big((t_{\tilde{n}_k} \circ \cdots \circ t_{\tilde{n}_1})(x); w\big).$$   (3.8)

Therefore, the reward at the 𝑘-th layer is derived as:

$$\tilde{r}_k(x) = \big(\tilde{G}_k(x)\big)^T \left( \frac{v}{\|\tilde{g}_k(x; \theta_k)\|} - \frac{v^T \tilde{g}_k(x; \theta_k)}{\|\tilde{g}_k(x; \theta_k)\|^2} \cdot \frac{\tilde{g}_k(x; \theta_k)}{\|\tilde{g}_k(x; \theta_k)\|} \right).$$   (3.9)

To prevent the augmentation policy from overfitting, we regularize the optimization by avoiding optimizing towards directions with high variance. Thus, we penalize the average reward with its standard deviation:

$$r_k = E_x\{\tilde{r}_k(x)\} - c \cdot \sqrt{E_x\big\{(\tilde{r}_k(x) - E_x\{\tilde{r}_k(x)\})^2\big\}}$$   (3.10)

where we use 16 randomly sampled images to calculate the expectation. The hyperparameter 𝑐 controls the degree of regularization and is set to 1.0. With such regularization, we prevent the policy from converging to transformations with high variance. Therefore, the parameters of policy P𝑘 (𝑘 ≥ 2) can be updated as:

$$\theta_k \leftarrow \theta_k + \eta \nabla_{\theta_k} \text{cosineSimilarity}(v, g(\theta_k))$$   (3.11)

where

$$\nabla_{\theta_k} \text{cosineSimilarity}(v, g_k(x; \theta_k)) = \nabla_{\theta_k} p_{\theta_k} \cdot r_k.$$   (3.12)
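Combining Eqs. (3.9)-(3.12), one search step for the 𝑘-th augmentation layer could look like the sketch below. It assumes a reward function that returns the |T|-dimensional reward of Eq. (3.9) for a single image (for example, built on the previous snippet with the earlier layers sampled), a logit parameterization of the layer-𝑘 policy, and 𝑐 = 1.0 with 16 sampled images as in the text; all names and the optimizer choice are illustrative assumptions.

    import torch


    def regularized_policy_step(reward_fn, logits_k, images, labels, v,
                                optimizer, c=1.0):
        """One update of the layer-k policy parameters via Eqs. (3.10)-(3.12)."""
        p_theta_k = torch.softmax(logits_k, dim=0)

        # Per-image rewards r_k(x) from Eq. (3.9); they carry no gradient with
        # respect to the policy parameters and are treated as constants.
        rewards = torch.stack([reward_fn(x, y, v, p_theta_k.detach())
                               for x, y in zip(images, labels)])   # num_images x |T|

        # Eq. (3.10): mean reward penalized by its standard deviation.
        r_k = rewards.mean(dim=0) - c * rewards.std(dim=0, unbiased=False)

        # Eqs. (3.11)-(3.12): ascend along grad_theta p_theta . r_k by
        # minimizing the negative surrogate objective.
        loss = -(p_theta_k * r_k).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return p_theta_k.detach()

After the 512 update iterations used in the experiments, the converged layer-𝑘 policy is frozen and the (𝑘+1)-th layer is optimized on top of the data distribution induced by the layers found so far, which is exactly the incremental stacking described in Section 3.3.2.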
3.4 Experiments and Analysis

Benchmarks and Baselines. We evaluate the performance of DeepAA on three standard benchmarks, CIFAR-10, CIFAR-100 and ImageNet, and compare it against a baseline based on standard augmentations (i.e., flip left-right and pad-and-crop for CIFAR-10/100, and Inception-style preprocessing [65] for ImageNet) as well as nine existing automatic augmentation methods: (1) AutoAugment (AA) [1], (2) PBA [50], (3) Fast AutoAugment (Fast AA) [51], (4) Faster AutoAugment [52], (5) DADA [53], (6) RandAugment (RA) [49], (7) UniformAugment (UA) [55], (8) TrivialAugment (TA) [56], and (9) Adversarial AutoAugment (AdvAA) [57].

Search Space. We set up the operation set O to include 16 commonly used operations (identity, shear-x, shear-y, translate-x, translate-y, rotate, solarize, equalize, color, posterize, contrast, brightness, sharpness, autoContrast, invert, Cutout) as well as two operations (i.e., flips and crop) that are used as the default operations in the aforementioned methods. The list of operations and the range of magnitudes in the standard augmentation space are summarized in Table 3.1. Among the operations in O, 11 operations are associated with magnitudes. We then discretize the range of magnitudes into 12 uniformly spaced levels and treat each operation with a discrete magnitude as an independent transformation. Therefore, the policy in each layer is a 139-dimensional categorical distribution corresponding to |T| = 139 {operation, magnitude} pairs.

Operation      Magnitude
Identity       -
ShearX         [-0.3, 0.3]
ShearY         [-0.3, 0.3]
TranslateX     [-0.45, 0.45]
TranslateY     [-0.45, 0.45]
Rotate         [-30, 30]
AutoContrast   -
Invert         -
Equalize       -
Solarize       [0, 256]
Posterize      [4, 8]
Contrast       [0.1, 1.9]
Color          [0.1, 1.9]
Brightness     [0.1, 1.9]
Sharpness      [0.1, 1.9]
Flips          -
Cutout         16 (60)
Crop           -

Table 3.1 List of operations in the search space and the corresponding range of magnitudes in the standard augmentation space. Note that some operations do not use magnitude parameters. We add flip and crop to the search space, which were found in the default augmentation pipeline in previous works. Flip operates by randomly flipping the images with 50% probability. In line with previous works, crop denotes pad-and-crop and resize-and-crop transforms for CIFAR-10/100 and ImageNet respectively. We set the Cutout magnitude to 16 for CIFAR-10/100 to be the same as the Cutout in the default augmentation pipeline. We set the Cutout magnitude to 60 pixels for ImageNet, which is the upper limit of the magnitude used in AA [1].

3.4.1 Performance on CIFAR-10 and CIFAR-100

Policy Search. Following [1], we conduct the augmentation policy search based on Wide-ResNet-40-2 [17]. We first train the network on a subset of 4,000 randomly selected samples from CIFAR-10. We then progressively update the policy network parameters 𝜃𝑘 (𝑘 = 1, 2, · · · , 𝐾) for 512 iterations for each of the 𝐾 augmentation layers. We use the Adam optimizer [66] and set the learning rate to 0.025 for policy updating.

Policy Evaluation. Using the publicly available repository of Fast AutoAugment [51], we evaluate the found augmentation policy on both CIFAR-10 and CIFAR-100 using Wide-ResNet-28-10 and Shake-Shake-2x96d models. The evaluation configurations are kept consistent with those of Fast AutoAugment.

Results. Table 3.2 reports the Top-1 test accuracy on CIFAR-10/100 for Wide-ResNet-28-10 and Shake-Shake-2x96d, respectively. The results of DeepAA are the average of four independent runs with different initializations. We also show the 95% confidence interval of the mean accuracy. As shown, DeepAA achieves the best performance compared against previous works using the standard augmentation space. Note that TA (Wide) uses a wider (stronger) augmentation space on this dataset.

Method       CIFAR-10, WRN-28-10   CIFAR-10, Shake-Shake (26 2x96d)   CIFAR-100, WRN-28-10   CIFAR-100, Shake-Shake (26 2x96d)
Baseline     96.1                  97.1                               81.2                   82.9
AA           97.4                  98.0                               82.9                   85.7
PBA          97.4                  98.0                               83.3                   84.7
FastAA       97.3                  98.0                               82.7                   85.1
FasterAA     97.4                  98.0                               82.7                   85.0
DADA         97.3                  98.0                               82.5                   84.7
RA           97.3                  98.0                               83.3                   -
UA           97.33                 98.1                               82.82                  -
TA(RA)       97.46                 98.05                              83.54                  -
TA(Wide)^1   97.46                 98.21                              84.33                  86.19
DeepAA       97.56 ± 0.14          98.11 ± 0.12                       84.02 ± 0.18           85.19 ± 0.28

Table 3.2 Top-1 test accuracy on CIFAR-10/100 for Wide-ResNet-28-10 and Shake-Shake-2x96d. The results of DeepAA are averaged over four independent runs with different initializations. The 95% confidence interval is denoted by ±.
^1 On CIFAR-10/100, TA (Wide) uses a wider (stronger) augmentation space, while the other methods, including TA (RA), use the standard augmentation space.

3.4.2 Performance on ImageNet

Policy Search. We conduct the augmentation policy search based on ResNet-18 [14]. We first train the network on a subset of 200,000 randomly selected samples from ImageNet for 30 epochs. We then use the same settings as in CIFAR-10 for updating the policy parameters.

Policy Evaluation. We evaluate the performance of the found augmentation policy on ResNet-50 and ResNet-200 based on the public repository of Fast AutoAugment [51]. The training parameters are the same as those of [51]. In particular, we use a step learning rate scheduler with a reduction factor of 0.1, and we train and evaluate with images of size 224x224.

Results. The performance on ImageNet is presented in Table 3.3. As shown, DeepAA achieves the best performance compared with previous methods without the use of a default augmentation pipeline. In particular, DeepAA performs better on larger models (i.e., ResNet-200), as the performance of DeepAA on ResNet-200 is the best within the 95% confidence interval. Note that while we train DeepAA using image resolution 224×224, we report the best results of RA and TA, which are trained with a larger image resolution (244×224) on this dataset.

Method       ResNet-50       ResNet-200
Baseline     76.3            78.5
AA           77.6            80.0
Fast AA      77.6            80.6
Faster AA    76.5            -
DADA         77.5            -
RA           77.6            -
UA           77.63           80.4
TA(RA)^1     77.85           -
TA(Wide)^2   78.07           -
DeepAA       78.30 ± 0.14    81.32 ± 0.17

Table 3.3 Top-1 test accuracy (%) on ImageNet for ResNet-50 and ResNet-200. The results of DeepAA are averaged over four independent runs with different initializations. The 95% confidence interval is denoted by ±.
^1 TA (RA) achieves 77.55% top-1 accuracy with image resolution 224×224.
^2 TA (Wide) achieves 77.97% top-1 accuracy with image resolution 224×224.
3.4.3 Performance with Batch Augmentation

Batch Augmentation (BA) is a technique that draws multiple augmented instances of the same sample in one mini-batch. It has been shown to improve the generalization performance of the network [59, 60]. AdvAA [57] directly searches for the augmentation policy under the BA setting, whereas for TA and DeepAA we apply BA with the same augmentation policy used in Table 3.2. Since the performance of BA is sensitive to the hyperparameters [67], we conducted a grid search on the hyperparameters of both TA and DeepAA (details are included in Appendix 3.4.7). As shown in Table 3.4, after tuning the hyperparameters, the performance of TA (Wide) using BA is already better than the performance reported in the original paper. DeepAA with BA outperforms both AdvAA and TA (Wide) with BA.

Dataset     AdvAA          TA(Wide) (original paper)   TA(Wide) (ours)   DeepAA
CIFAR-10    98.1 ± 0.15    98.04 ± 0.06                98.06 ± 0.23      98.21 ± 0.14
CIFAR-100   84.51 ± 0.18   84.62 ± 0.14                85.40 ± 0.15      85.61 ± 0.17

Table 3.4 Top-1 test accuracy (%) on CIFAR-10/100 with WRN-28-10 and Batch Augmentation (BA), where eight augmented instances were drawn for each image. The results of DeepAA are averaged over four independent runs with different initializations. The 95% confidence interval is denoted by ±.

Figure 3.2 Top-1 test accuracy (%) on ImageNet of DeepAA-simple, DeepAA, and other automatic augmentation methods on ResNet-50.

3.4.4 Understanding DeepAA

Effectiveness of Gradient Matching. One uniqueness of DeepAA is the regularized gradient matching objective. To examine its effectiveness, we remove the impact coming from multiple augmentation layers, and only conduct the search for a single layer of augmentation policy. When evaluating the searched policy, we apply the default augmentations in addition to the searched policy. We refer to this variant as DeepAA-simple. Figure 3.2 compares the Top-1 test accuracy on ImageNet using ResNet-50 between DeepAA-simple, DeepAA, and other automatic augmentation methods. While there is a 0.22% performance drop compared to DeepAA, DeepAA-simple, with a single augmentation layer, still outperforms the other methods and achieves similar performance to TA (Wide) while using the standard augmentation space and training on a smaller image size (224×224 vs 244×224).

Policy Search Cost. Table 3.5 compares the policy search time on CIFAR-10/100 and ImageNet in GPU hours. DeepAA has a search time comparable to PBA, Fast AA, and RA, but is slower than Faster AA and DADA. Note that Faster AA and DADA relax the discrete search space to a continuous one, similar to DARTS [68]. While such relaxation leads to shorter search time, it inevitably introduces a discrepancy between the true and relaxed augmentation spaces.

Dataset        AA      PBA   Fast AA   Faster AA   DADA   RA     DeepAA
CIFAR-10/100   5000    5     3.5       0.23        0.1    25     9
ImageNet       15000   -     450       2.3         1.3    5000   96

Table 3.5 Policy search time on CIFAR-10/100 and ImageNet in GPU hours.

Impact of the Number of Augmentation Layers. Another uniqueness of DeepAA is its multi-layer search space that can go beyond the two layers which existing automatic augmentation methods were designed upon. We examine the impact of the number of augmentation layers on the performance of DeepAA.
Table 3.6 and Table 3.7 show the performance on CIFAR-10/100 and ImageNet, respectively, as the number of augmentation layers increases. As shown, for CIFAR-10/100 the performance gradually improves as more augmentation layers are added, until we reach five layers. The performance does not improve when the sixth layer is added. For ImageNet, we make a similar observation: the performance stops improving when more than five augmentation layers are included.

              1 layer        2 layers       3 layers       4 layers       5 layers        6 layers
CIFAR-10      96.3 ± 0.21    96.6 ± 0.18    96.9 ± 0.12    97.4 ± 0.14    97.56 ± 0.14    97.6 ± 0.12
CIFAR-100     80.9 ± 0.31    81.7 ± 0.24    82.2 ± 0.21    83.7 ± 0.24    84.02 ± 0.18    84.0 ± 0.19

Table 3.6 Top-1 test accuracy of DeepAA on CIFAR-10/100 for different numbers of augmentation layers. The results are averaged over 4 independent runs with different initializations, with the 95% confidence interval denoted by ±.

              1 layer         3 layers        5 layers        7 layers
ImageNet      75.27 ± 0.19    78.18 ± 0.22    78.30 ± 0.14    78.30 ± 0.14

Table 3.7 Top-1 test accuracy of DeepAA on ImageNet with ResNet-50 for different numbers of augmentation layers. The results are averaged over 4 independent runs with different initializations, with the 95% confidence interval denoted by ±.

Figure 3.3 illustrates the distributions of operations in the policy for CIFAR-10/100 and ImageNet, respectively. As shown in Figure 3.3(a), the augmentation of CIFAR-10/100 converges to the identity transformation at the sixth augmentation layer, which is a natural indication of the end of the augmentation pipeline. We make a similar observation in Figure 3.3(b) for ImageNet, where the identity transformation dominates in the sixth augmentation layer. These observations match the results listed in Table 3.6 and Table 3.7. We also include the distribution of the magnitudes within each operation for CIFAR-10/100 and ImageNet in Appendix 3.4.5 and Appendix 3.4.6.

Figure 3.3 The distribution of operations at each layer of the policy for CIFAR-10/100 and ImageNet. The probability of each operation is summed over all 12 discrete intensity levels (see Appendix 3.4.5 and 3.4.6) of the corresponding transformation.

Validity of Optimizing Gradient Matching with Regularization. To evaluate the validity of optimizing gradient matching with regularization, we designed a search-free baseline named "DeepTA". In DeepTA, we stack multiple layers of TA on the same augmentation space as DeepAA without using default augmentations. As stated in Eq. (3.10) and Eq. (3.12), we explicitly optimize the gradient similarity via the average reward minus its standard deviation. The first term – the average reward $E_x\{\tilde r_k(x)\}$ – encourages directions of high cosine similarity. The second term – the standard deviation of the reward, $\sqrt{E_x\{(\tilde r_k(x) - E_x\{\tilde r_k(x)\})^2\}}$ – acts as a regularization that penalizes directions with high variance. These two terms jointly maximize the gradient similarity along directions with low variance. To illustrate the optimization trajectory, we design two metrics that are closely related to the two terms in Eq. (3.10): the mean and the standard deviation of the improvement of gradient similarity.

Figure 3.4 Illustration of the search trajectory of DeepAA in comparison with DeepTA on CIFAR-10: (a) mean of the gradient similarity improvement; (b) standard deviation of the gradient similarity improvement; (c) mean accuracy over different augmentation depths.
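These diagnostics can be computed directly from model gradients. The following is a minimal PyTorch sketch, assuming the cosine similarity is measured against the flattened gradient of a reference batch of original data (as in the gradient-matching reward); the model, the batches, and the `augment` function are placeholders supplied by the caller rather than parts of the DeepAA implementation.

```python
import torch
import torch.nn.functional as F

def flat_grad(model, images, labels):
    # Flatten the cross-entropy gradient of all trainable parameters into one vector.
    loss = F.cross_entropy(model(images), labels)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def similarity_improvement(model, batch, augment, reference_grad):
    # Improvement = cos(grad of augmented batch, reference) - cos(grad of original batch, reference).
    images, labels = batch
    g_orig = flat_grad(model, images, labels)
    g_aug = flat_grad(model, augment(images), labels)
    cos = F.cosine_similarity
    return (cos(g_aug, reference_grad, dim=0) - cos(g_orig, reference_grad, dim=0)).item()

def mean_std_improvement(model, batches, augment, reference_grad):
    # Mean and standard deviation of the improvement over independently sampled batches.
    vals = torch.tensor([similarity_improvement(model, b, augment, reference_grad)
                         for b in batches])
    return vals.mean().item(), vals.std().item()
```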
The improvement of gradient similarity is obtained by subtracting the cosine similarity of the original image batch from that of the augmented batch. In our experiment, the mean and standard deviation of the gradient similarity improvement are calculated over 256 independently sampled original images. As shown in Figure 3.4(a), the cosine similarity of DeepTA reaches its peak at the fifth layer, and stacking more layers decreases the cosine similarity. In contrast, for DeepAA, the cosine similarity increases consistently until it converges to the identity transformation at the sixth layer. In Figure 3.4(b), the standard deviation of DeepTA increases significantly when more layers are stacked. In contrast, since DeepAA optimizes the gradient similarity along directions of low variance, its standard deviation does not grow as fast as that of DeepTA. In Figure 3.4(c), both DeepAA and DeepTA reach peak performance at the sixth layer, but DeepAA achieves better accuracy than DeepTA. Therefore, we empirically show that DeepAA effectively scales up the augmentation depth by increasing cosine similarity along directions with low variance, leading to better results.
Comparison with Other Policies. In Figure 3.7 in Appendix 3.4.8, we compare the policy of DeepAA with the policies found by other data augmentation search methods, including AA, FastAA and DADA. We have three interesting observations:
• AA, FastAA and DADA assign high cumulative probability (over 1.0) to flip, Cutout and crop, as those transformations are hand-picked and applied by default. DeepAA finds a similar pattern that assigns high probability to flip, Cutout and crop.
• Unlike AA, which mainly focuses on color transformations, DeepAA assigns high probability to both spatial and color transformations.
• FastAA has evenly distributed magnitudes, while DADA has low magnitudes (a common issue in DARTS-like methods). Interestingly, DeepAA assigns high probability to the stronger magnitudes.

3.4.5 The distribution of magnitudes for CIFAR-10/100
Figure 3.5 The distribution of discrete magnitudes of each augmentation transformation in each layer of the policy for CIFAR-10/100. The x-axis represents the discrete magnitudes and the y-axis represents the probability. The magnitude is discretized to 12 levels, with each transformation having its own range. A large absolute value of the magnitude corresponds to high transformation intensity. Note that we do not show identity, autoContrast, invert, equalize, flips, Cutout and crop because they do not have intensity parameters.

3.4.6 The distribution of magnitudes for ImageNet
Figure 3.6 The distribution of discrete magnitudes of each augmentation transformation in each layer of the policy for ImageNet. The x-axis represents the discrete magnitudes and the y-axis represents the probability. The magnitude is discretized to 12 levels, with each transformation having its own range. A large absolute value of the magnitude corresponds to high transformation intensity. Note that we do not show identity, autoContrast, invert, equalize, flips, Cutout and crop because they do not have intensity parameters.

3.4.7 Hyperparameters for Batch Augmentation
The performance of BA is sensitive to the training settings [67, 69]. Therefore, we conduct a grid search on the learning rate, weight decay and number of epochs for TA and DeepAA with Batch Augmentation. The best-found parameters are summarized in Table 3.8.
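For completeness, the batch-augmentation input pipeline that these hyperparameters correspond to can be approximated as below. This is a minimal sketch assuming a torchvision-style `transform` that returns a tensor; the wrapper class and the way the extra views are folded into the batch dimension are illustrative, not the exact implementation used in our experiments.

```python
import torch
from torch.utils.data import Dataset

class BatchAugmentedDataset(Dataset):
    """Wraps a dataset so that each item yields several augmented views.

    Every image in a mini-batch contributes `num_views` independently
    augmented copies (eight in Table 3.4).  `transform` can be any policy,
    e.g. the searched augmentation followed by normalization, and must
    return a tensor.
    """

    def __init__(self, base, transform, num_views=8):
        self.base, self.transform, self.num_views = base, transform, num_views

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]
        views = torch.stack([self.transform(img) for _ in range(self.num_views)])
        return views, label

# In the training loop, the extra view dimension is folded into the batch:
#   views, labels = batch                 # views: (B, num_views, C, H, W)
#   B, V = views.shape[:2]
#   logits = model(views.flatten(0, 1))   # effective batch size B * V
#   loss = F.cross_entropy(logits, labels.repeat_interleave(V))
```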
We did not tune the hyperparameters of AdvAA [57], since AdvAA claims to be adaptive to the training process.

Dataset      Augmentation   Model       Batch Size   Learning Rate   Weight Decay   Epochs
CIFAR-10     TA (Wide)      WRN-28-10   128 × 8      0.2             0.0005         100
CIFAR-10     DeepAA         WRN-28-10   128 × 8      0.2             0.001          100
CIFAR-100    TA (Wide)      WRN-28-10   128 × 8      0.4             0.0005         35
CIFAR-100    DeepAA         WRN-28-10   128 × 8      0.4             0.0005         35

Table 3.8 Model hyperparameters of Batch Augmentation on CIFAR-10/100 for TA (Wide) and DeepAA. The learning rate, weight decay and number of epochs are found via grid search.

3.4.8 Comparison of data augmentation policy
Figure 3.7 Comparison of the policy of DeepAA and the publicly available augmentation policies found by other methods, including AA, FastAA and DADA, on the CIFAR-10 dataset. Since the compared methods have varied numbers of augmentation layers, we cumulate the probability of each operation over all the augmentation layers; thus, the cumulative probability can be larger than 1. For AA, Fast AA and DADA, we add an additional probability of 1.0 to flip, Cutout and crop, since they are applied by default. In addition, we normalize the magnitude to the range [-5, 5] and use color to distinguish different magnitudes. Each panel shows the sampling probability of each transformation cumulated over all augmentation layers: (a) DeepAA, (b) AA, (c) FastAA, (d) DADA.

3.5 Conclusion
In this chapter, I present Deep AutoAugment (DeepAA), a multi-layer data augmentation search method that finds a deep data augmentation policy without using any hand-picked default transformations. We formulate data augmentation search as a regularized gradient matching problem, which maximizes the gradient similarity between augmented data and original data along the direction with low variance. Our experimental results show that DeepAA achieves strong performance without using default augmentations, indicating that regularized gradient matching is an effective search method for data augmentation policies.

CHAPTER 4
DATASET CONDENSATION VIA IMPORTANCE AWARE TRAJECTORY MATCHING

4.1 Introduction of Dataset Condensation
Over recent years, deep learning has achieved significant success across various domains, including computer vision [70], natural language processing [71], and speech recognition [72]. Landmark models such as AlexNet [70] in 2012, ResNet [14] in 2016, and BERT [71] in 2018, as well as more recent innovations like ViT [73], CLIP [74], and DALL-E [75], all depend on large-scale datasets for their training. However, managing such large amounts of data, including collection, storage, transmission, and preprocessing, requires significant effort. Moreover, the computational demands of training on these large datasets often require a large amount of GPU resources for optimal performance. This creates challenges for applications that need to train on a dataset multiple times, such as hyperparameter optimization [76, 77, 78] and neural architecture search [79, 80, 81]. The situation is exacerbated by the rapid growth of dataset scale, where training solely on new data risks catastrophic forgetting [82, 83] and maintaining all historical data becomes impractical. Thus, there is a conflict between the need for high-accuracy models and the limitations of computational and storage resources. A natural solution is dataset condensation, which distills the original datasets into smaller, information-rich subsets to ease storage demands while maintaining model performance at test time.
A direct method to achieve such data condensation is through coreset selection, which selects the most representative samples from the original datasets to ensure that models trained on these subsets perform comparably to those trained on the full datasets. Despite its effectiveness, this approach discards a significant portion of the data, which may overlook valuable training information and lead to suboptimal outcomes. Moreover, using the unmodified original data directly may raise privacy concerns. To address the above challenges, dataset condensation (DC), or dataset distillation (DD), has emerged, focusing on generating new, condensed training data. An overview of dataset condensation is illustrated in Fig. 4.1. Unlike coreset selection, DD aims to synthesize a limited number of samples encapsulating the essence of the original datasets. Originally proposed by Wang et al. [84], this method iteratively updates the synthetic samples so that models trained on them perform well on the original datasets. This foundational work has inspired numerous subsequent studies, significantly advancing DD performance [85, 86, 87, 88, 89, 90] and extending its application to fields like continual learning [91, 92, 93, 94, 95] and federated learning [96, 97, 98, 99, 100, 101, 102].
Neural networks are typically overparameterized with a lot of redundancy, so not all weights are equally important. Based on this insight, in this chapter we first propose an importance-aware weight factorization, which factorizes the weight matrix based on its influence on the layer output. Building on that, we propose an importance-aware trajectory matching algorithm that matches only the most important weight components early in the training trajectory and gradually adds more components until the full weights are matched later in the trajectory. Our method removes the interference of the redundant weights when matching the training trajectories. The results show improvements over previous work on various datasets.

4.2 Related Work
Coreset selection. A relatively straightforward way to reduce the dataset size is to maintain a subset that contains only a few of the most representative samples selected from the original dataset. A key challenge in achieving a good trade-off between training performance and subset size lies in determining the importance of each sample. This type of method is known as coreset selection [103, 104, 105, 106].
Dataset condensation. The task of dataset condensation is to learn a small synthetic dataset that retains the knowledge of the original dataset. A deep neural network trained on the small synthetic dataset should obtain performance similar to that of the network trained on the original dataset. A stricter dataset condensation method also requires that the weights of the network trained on the small synthetic dataset be close to those of the network trained on the original dataset.

Figure 4.1 An overview of dataset condensation. Dataset condensation focuses on generating new, condensed training data such that models trained on such a dataset have performance similar to models trained on the original dataset.

• Performance matching. The performance matching methods are designed to ensure that neural networks trained on the condensed dataset perform comparably to the network trained on the original dataset. A simple proof-of-concept algorithm was introduced in [77], where the gradient with respect to the condensed dataset is computed by chaining derivatives backwards through the entire training procedure.
In [84], a bilevel optimization technique is used: the inner-loop optimization trains a network on the condensed dataset, while the outer loop optimizes the validation performance of the network trained in the inner loop and backpropagates through the unrolled computation graph of the inner-loop optimization. However, the unrolled backpropagation requires storing the entire inner-loop training trajectory, which can easily exceed the memory available in the hardware. To address this challenge, [78] leverages the Implicit Function Theorem (IFT) in conjunction with an efficient inverse-Hessian approximation to approximate the unrolled differentiation, which requires only a constant amount of memory.
• Weight space matching. The methodology of weight space matching was initially introduced in prior work and subsequently expanded through a series of studies [107, 108, 109, 110]. Distinct from performance matching, which aims at optimizing the performance of networks trained on synthetic data, weight space matching involves training an identical network on both the synthetic and original datasets for a predefined number of steps and then aligning the weights of the two networks. Early works on weight space matching [88, 107, 108, 111, 112] focused on matching the trajectory of a single gradient-update step. While this strategy is computationally efficient, errors may accumulate when the models are updated on the synthetic datasets for multiple steps. To address this issue, [110] introduced a long-range trajectory matching technique that transfers the knowledge from a pre-trained network through multi-step parameter updates.
• NTK/functional space matching. The performance matching and weight space matching methods with multi-step parameter updates involve unrolling the gradient by traversing back through the entire update process. This requires higher-order gradient computation, which demands considerable computational resources. Recent studies of the neural tangent kernel (NTK) show that gradient descent in the infinite-width limit is fully equivalent to kernel gradient descent with the NTK. Inspired by the connection between infinitely wide neural networks and kernel ridge regression (KRR), the Kernel Inducing Point (KIP) algorithm was proposed in [86, 87]. It relies on the closed-form solution of linear models to avoid the repetitive inner-loop optimization steps that lead to unrolled gradient updates. To mitigate the computational complexity associated with calculating the neural tangent kernel, [109] introduced RFAD, leveraging the empirical Neural Network Gaussian Process (NNGP) kernel [113, 114]. To further improve performance, they adopt Platt scaling [115], applying a cross-entropy loss to the labels of real data instead of a mean squared error loss.
• Feature space matching. The objective of the feature space matching strategy is to generate synthetic data whose features closely mimic the distribution of real data features. Maximum Mean Discrepancy (MMD) is used in [90] to align the features extracted prior to the final layer between the distilled dataset and the original dataset. Instead of aligning only the features prior to the last linear layer, [116] proposes CAFE, which further improves the performance by ensuring the consistency of features across all intermediate layers.
4.3 Method
In this section, we first provide a brief introduction to trajectory-matching-based condensation, which serves as the background for our method. We then show that the model weights are not equally important, which serves as the motivation and mathematical foundation for our proposed method. Finally, we introduce our proposed method, named importance-aware trajectory matching, in detail.

4.3.1 Background on Trajectory Matching based Dataset Condensation
Dataset distillation focuses on generating a compact dataset $\mathcal{D}_{syn}$ from a larger, original dataset $\mathcal{D}_{real}$, with the goal that models trained on $\mathcal{D}_{syn}$ exhibit test performance comparable to those trained on $\mathcal{D}_{real}$. For methods based on trajectory matching (TM), this process involves aligning the training trajectories of surrogate models trained on $\mathcal{D}_{real}$ and on $\mathcal{D}_{syn}$. Specifically, we define $\tau^*$ as an expert training trajectory, represented by a sequence of model parameters $\{\theta^*_t\}_{t=0}^{n}$ acquired during training on $\mathcal{D}_{real}$. Similarly, $\hat{\theta}_t$ represents the parameters at training step $t$ of a network trained on the synthetic dataset $\mathcal{D}_{syn}$. During each distillation iteration, $\theta^*_t$ and $\theta^*_{t+M}$ are randomly chosen from the collection of expert trajectories $\tau^*$ as the initial and target parameters for matching, with $M$ being a predetermined hyperparameter. TM-based methods then refine $\mathcal{D}_{syn}$ by minimizing the loss

$$\mathcal{L} = \frac{\| \hat{\theta}_{t+N} - \theta^*_{t+M} \|_2^2}{\| \theta^*_t - \theta^*_{t+M} \|_2^2}, \qquad (4.1)$$

where $N$ is a hyperparameter and $\hat{\theta}_{t+N}$ results from the inner optimization using the cross-entropy (CE) loss $\ell$ and a learnable learning rate $\alpha$:

$$\hat{\theta}_{t+i+1} = \hat{\theta}_{t+i} - \alpha \nabla \ell(\hat{\theta}_{t+i}, \mathcal{D}_{syn}), \qquad (4.2)$$

where $\hat{\theta}_t := \theta^*_t$.

Figure 4.2 Illustration of trajectory matching. The yellow blocks indicate the expert trajectory of the teacher network, while the green blocks represent the trajectory of the student network. The objective is to minimize the distance between $\theta^*_{t+M}$ and $\hat{\theta}_{t+N}$.

4.3.2 Importance Aware Weight Factorization
Deep networks with a large number of trainable parameters within each layer are capable of achieving remarkable inference accuracy. While a large number of parameters contributes to higher classification precision, it has been discovered that these parameters often exhibit considerable redundancy. This redundancy allows techniques such as pruning and quantization to be employed to decrease the overall size of the network. In this section, we show that the significance of weights within deep networks is not uniform; that is, a subset of these weights plays a far more critical role in determining the ultimate performance of the model than the rest. We hypothesize that matching the most important subset of weights will lead to good performance, while matching the less important weights contributes minimally or even negatively to the final performance. Motivated by this, we start by exploring the importance of each component of the model weights.
Consider a weight matrix $\Theta \in \mathbb{R}^{M \times N}$ with input activation $X \in \mathbb{R}^{N \times B}$, where $M$, $N$ and $B$ denote the output feature dimension, the input feature dimension and the batch size, respectively. The output pre-activation is computed as $Y = \Theta X$, which is equivalent to

$$Y = (\Theta S)(S^{-1} X) = \tilde{\Theta} \tilde{X}, \qquad (4.3)$$

with $\tilde{\Theta} = \Theta S$ and $\tilde{X} = S^{-1} X$ being the transformed weight and activation, and $S \in \mathbb{R}^{N \times N}$ an arbitrary invertible matrix. We construct $S^{-1}$ to be a whitening matrix of the input activation $X$, with $S$ satisfying $S S^T = X X^T$. Thus the channels of the transformed input activation $\tilde{X}$ are decorrelated from each other, i.e., $\tilde{X} \tilde{X}^T = S^{-1} X X^T (S^{-1})^T = I$.
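The construction of $S$, and the fact that the singular values of the transformed weight measure how much each component contributes to the layer output, can be checked numerically. The following is a minimal NumPy sketch with random stand-in weights and activations; it anticipates the SVD-based ranking derived next (Eq. (4.4) and Theorem 1) and is an illustration rather than the implementation used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, B = 64, 128, 256                   # output dim, input dim, batch size
theta = rng.standard_normal((M, N))      # stand-in layer weight matrix
x = rng.standard_normal((N, B))          # stand-in input activations

# Whitening matrix: S S^T = X X^T via Cholesky, so S^{-1} X has identity
# second moment (the transformed activation channels are decorrelated).
S = np.linalg.cholesky(x @ x.T)
x_tilde = np.linalg.solve(S, x)          # S^{-1} X without forming the inverse
assert np.allclose(x_tilde @ x_tilde.T, np.eye(N), atol=1e-6)

# Transformed weight and its SVD.
theta_tilde = theta @ S
u, sigma, vt = np.linalg.svd(theta_tilde, full_matrices=False)

# Numerical check of Theorem 1: dropping the i-th SVD component changes the
# layer output by exactly sigma_i in Frobenius norm.
i = 3
theta_drop = theta_tilde - sigma[i] * np.outer(u[:, i], vt[i])
loss_i = np.linalg.norm((theta_tilde - theta_drop) @ x_tilde, ord="fro")
assert np.isclose(loss_i, sigma[i])

# The singular values therefore rank how much each component of the
# whitened weight matters for this layer's output.
```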
To find the subset of weights that influences the final performance most, we decompose the transformed weight $\tilde{\Theta}$ via the singular value decomposition:

$$\tilde{\Theta} = \Theta S = \tilde{U} \tilde{\Sigma} \tilde{V}^T = \sum_{n=1}^{r} \tilde{\sigma}_n \tilde{u}_n \tilde{v}_n^T, \qquad (4.4)$$

where $\tilde{U}$, $\tilde{V}$ and $\tilde{\Sigma}$ contain the left singular vectors, the right singular vectors and the singular values, respectively. We denote $\tilde{U} = [\tilde{u}_1, \tilde{u}_2, \tilde{u}_3, \ldots, \tilde{u}_r]$, $\tilde{\Sigma} = \mathrm{diag}(\tilde{\sigma}_1, \tilde{\sigma}_2, \tilde{\sigma}_3, \cdots, \tilde{\sigma}_r)$ and $\tilde{V} = [\tilde{v}_1, \tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_r]$, where $r = \min\{M, N\}$, supposing the weight $\Theta$ has full rank. In the following, we provide the theoretical results that are useful for determining the importance of the weight components.

Lemma 1. The Frobenius norm of a matrix $A$ of dimension $m \times n$ equals the square root of the trace of its Gram matrix:

$$\| A \|_F \triangleq \Bigg( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \Bigg)^{\frac{1}{2}} = \Big[ \mathrm{trace}\big( A^T A \big) \Big]^{\frac{1}{2}}. \qquad (4.5)$$

Using Lemma 1, we obtain the loss $L_i$ incurred when removing the $i$-th singular value and singular vectors of $\tilde{\Theta}$:

$$L_i = \Big\| \Big( \tilde{\Theta} - \sum_{n \neq i} \tilde{\sigma}_n \tilde{u}_n \tilde{v}_n^T \Big) \tilde{X} \Big\|_F = \| \tilde{\sigma}_i \tilde{u}_i \tilde{v}_i^T \tilde{X} \|_F = \tilde{\sigma}_i \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T \tilde{X} \tilde{X}^T \tilde{v}_i \tilde{u}_i^T \big)^{\frac{1}{2}}. \qquad (4.6)$$

Since both $\tilde{U} = [\tilde{u}_1, \ldots, \tilde{u}_r]$ and $\tilde{V} = [\tilde{v}_1, \ldots, \tilde{v}_r]$ have orthonormal columns, we have

$$\tilde{v}_i^T \tilde{v}_i = \tilde{u}_i^T \tilde{u}_i = 1, \quad \tilde{v}_i^T \tilde{v}_j = \tilde{u}_i^T \tilde{u}_j = 0 \;\; \forall i \neq j, \quad \mathrm{trace}(\tilde{v}_i \tilde{v}_i^T) = \mathrm{trace}(\tilde{u}_i \tilde{u}_i^T) = 1. \qquad (4.7)$$

Theorem 1. If the whitening matrix $S$ satisfies $S S^T = X X^T$, the compression loss $L_i$ equals $\tilde{\sigma}_i$.

Proof. Since $\tilde{X} = S^{-1} X$ and $S S^T = X X^T$, we can further simplify $L_i$:

$$\begin{aligned} L_i &= \Big\| \Big( \tilde{\Theta} - \sum_{n \neq i} \tilde{\sigma}_n \tilde{u}_n \tilde{v}_n^T \Big) \tilde{X} \Big\|_F = \| \tilde{\sigma}_i \tilde{u}_i \tilde{v}_i^T \tilde{X} \|_F \\ &= \tilde{\sigma}_i \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T \tilde{X} \tilde{X}^T \tilde{v}_i \tilde{u}_i^T \big)^{\frac{1}{2}} = \tilde{\sigma}_i \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T S^{-1} X X^T (S^{-1})^T \tilde{v}_i \tilde{u}_i^T \big)^{\frac{1}{2}} \\ &= \tilde{\sigma}_i \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T S^{-1} S S^T (S^T)^{-1} \tilde{v}_i \tilde{u}_i^T \big)^{\frac{1}{2}} = \tilde{\sigma}_i \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T \tilde{v}_i \tilde{u}_i^T \big)^{\frac{1}{2}} = \tilde{\sigma}_i. \end{aligned} \qquad (4.8)$$

We can find such an $S$ using the Cholesky decomposition of $X X^T$, which satisfies $S S^T = X X^T$; the loss $L_i$ of removing $\tilde{\sigma}_i$ thus equals the singular value $\tilde{\sigma}_i$ itself. □

Theorem 2. If the whitening matrix $S$ satisfies $S S^T = X X^T$, then removing the two components corresponding to the singular values $\tilde{\sigma}_i$ and $\tilde{\sigma}_j$ gives a squared loss $L^2_{i,j}$ equal to the sum of $\tilde{\sigma}^2_i$ and $\tilde{\sigma}^2_j$.

Proof. Suppose we remove $\tilde{\sigma}_i$ and $\tilde{\sigma}_j$ from the SVD of $\tilde{\Theta}$. We calculate the square of the loss $L_{i,j}$:

$$\begin{aligned} L^2_{i,j} &= \Big\| \Big( \tilde{\Theta} - \sum_{n \notin \{i,j\}} \tilde{\sigma}_n \tilde{u}_n \tilde{v}_n^T \Big) \tilde{X} \Big\|_F^2 = \big\| \tilde{\sigma}_i \tilde{u}_i \tilde{v}_i^T \tilde{X} + \tilde{\sigma}_j \tilde{u}_j \tilde{v}_j^T \tilde{X} \big\|_F^2 \\ &= \tilde{\sigma}^2_i \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T \tilde{X} \tilde{X}^T \tilde{v}_i \tilde{u}_i^T \big) + 2 \tilde{\sigma}_i \tilde{\sigma}_j \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T \tilde{X} \tilde{X}^T \tilde{v}_j \tilde{u}_j^T \big) + \tilde{\sigma}^2_j \, \mathrm{trace}\big( \tilde{u}_j \tilde{v}_j^T \tilde{X} \tilde{X}^T \tilde{v}_j \tilde{u}_j^T \big) \\ &= \tilde{\sigma}^2_i \, \mathrm{trace}\big( \tilde{u}_i \tilde{v}_i^T \tilde{X} \tilde{X}^T \tilde{v}_i \tilde{u}_i^T \big) + \tilde{\sigma}^2_j \, \mathrm{trace}\big( \tilde{u}_j \tilde{v}_j^T \tilde{X} \tilde{X}^T \tilde{v}_j \tilde{u}_j^T \big) \\ &= L_i^2 + L_j^2 = \tilde{\sigma}^2_i + \tilde{\sigma}^2_j. \end{aligned} \qquad (4.9)$$

The squared loss is thus equal to the sum of the squared singular values. □

Combining Theorem 1 and Theorem 2, we conclude that the singular value $\tilde{\sigma}_i$ of the transformed weight $\tilde{\Theta} = \Theta S = \sum_{n=1}^{r} \tilde{\sigma}_n \tilde{u}_n \tilde{v}_n^T$ can be used to rank the importance of the SVD component $\tilde{\sigma}_i \tilde{u}_i \tilde{v}_i^T$.

4.3.3 Importance Aware Trajectory Matching
We observe that a network tends to learn easy patterns early in training and harder ones later on. The key idea of our proposed method is to match only the few most important weight components of the expert trajectory $\tau^* = \{\theta^*_0, \cdots, \theta^*_t, \cdots, \theta^*_T\}$ in the early training stage, and to gradually add more weight components to the matching target as $t$ grows large. In this section, $\theta$ denotes the vectorized version of the weight matrix $\Theta$, i.e., $\theta = \mathrm{vec}(\Theta)$.
With a slight abuse of notation, we use $\theta \cdot S$ to denote the vectorized version of $\Theta S$, that is, $\theta \cdot S = \mathrm{vec}(\Theta S)$. We rank the singular values in Eq. (4.4) in descending order, $\tilde{\sigma}_1 \geq \tilde{\sigma}_2 \geq \tilde{\sigma}_3 \geq \cdots \geq \tilde{\sigma}_r$. For epoch $t$ in the expert trajectory $\tau^*$, we set a threshold $\tau(t) = \sqrt{\frac{t}{T} \sum_{n=1}^{r} \tilde{\sigma}^2_n}$ to truncate the singular values, where $T$ is the total number of epochs in the expert trajectory. The truncated weight is

$$\mathrm{Trunc}(\theta \cdot S, t) = \mathrm{vec}\Bigg( \sum_{i=1}^{k(t)} \tilde{\sigma}_i \tilde{u}_i \tilde{v}_i^T \Bigg), \qquad (4.10)$$

where $k(t)$ satisfies

$$\sum_{i=1}^{k(t)} \tilde{\sigma}^2_i \leq \tau(t)^2 \quad \text{and} \quad \sum_{i=1}^{k(t)+1} \tilde{\sigma}^2_i \geq \tau(t)^2. \qquad (4.11)$$

Figure 4.3 Illustration of importance-aware trajectory matching. The yellow blocks indicate the expert trajectory of the teacher network, while the green blocks represent the trajectory of the student network. The gray nodes in the neural network indicate that the corresponding weights are truncated by the importance-aware factorization. The objective is to minimize the distance between the truncated version of $\theta^*_{t+M}$ and $\hat{\theta}_{t+N}$.

Instead of matching the raw weight $\theta^*_{t+M}$ in the expert trajectory $\tau^*$, we match only the truncated weight $\mathrm{Trunc}(\theta^*_{t+M} \cdot S, t)$. The trajectory matching target in Eq. (4.1) is modified as:

$$\mathcal{L} = \frac{\| \hat{\theta}_{t+N} \cdot S - \mathrm{Trunc}(\theta^*_{t+M} \cdot S, t) \|_2^2}{\| \mathrm{Trunc}(\theta^*_t \cdot S, t) - \mathrm{Trunc}(\theta^*_{t+M} \cdot S, t) \|_2^2}, \qquad (4.12)$$

which ensures that when $t \ll T$ only the most important weight components are matched, and when $t \to T$ all the weight components are matched. The full procedure is illustrated in Algorithm 4.1.

Algorithm 4.1 Importance Aware Trajectory Matching
Require: $\{\tau^*_i\}$: set of expert parameter trajectories trained on $\mathcal{D}_{real}$, each with $\tau^* = \{\theta^*_t\}_0^T$.
Require: $M$: # of updates between starting and target expert params.
Require: $N$: # of updates to the student network per distillation step.
Require: $T^+ < T$: maximum start epoch.
1: Initialize distilled data $\mathcal{D}_{syn} \sim \mathcal{D}_{real}$
2: Initialize trainable learning rate $\alpha := \alpha_0$ for applying $\mathcal{D}_{syn}$
3: for each distillation step do
4:   Sample an expert trajectory: $\tau^* \sim \{\tau^*_i\}$ with $\tau^* = \{\theta^*_t\}_0^T$
5:   Choose a random start epoch $t \leq T^+$
6:   Initialize the student network with expert params: $\hat{\theta}_t := \theta^*_t$
7:   Gather the input features $X$ and compute $S = \mathrm{cholesky}(X X^T)$
8:   for $n = 0 \to N - 1$ do
9:     Sample a mini-batch of distilled images: $b_{t+n} \sim \mathcal{D}_{syn}$
10:    Update the student network w.r.t. the classification loss: $\hat{\theta}_{t+n+1} = \hat{\theta}_{t+n} - \alpha \nabla \ell(\mathcal{A}(b_{t+n}); \hat{\theta}_{t+n})$
11:  end for
12:  Compute the loss between the ending student and expert params: $\mathcal{L} = \| \hat{\theta}_{t+N} \cdot S - \mathrm{Trunc}(\theta^*_{t+M} \cdot S, t) \|_2^2 \,/\, \| \mathrm{Trunc}(\theta^*_t \cdot S, t) - \mathrm{Trunc}(\theta^*_{t+M} \cdot S, t) \|_2^2$
13:  Update $\mathcal{D}_{syn}$ and $\alpha$ with respect to $\mathcal{L}$
14: end for
Ensure: distilled data $\mathcal{D}_{syn}$ and learning rate $\alpha$

4.4 Experiments
4.4.1 Experiments Setup
Datasets. We evaluate the performance of our proposed method on various datasets, including:
• the CIFAR-10 dataset, which consists of 60,000 32 × 32 images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images;
• the CIFAR-100 dataset, which consists of 60,000 32 × 32 images in 100 classes, with 600 images per class; there are 50,000 training images and 10,000 test images;
• the Tiny ImageNet dataset, which consists of 100,000 64 × 64 images in 200 classes; for each class, there are 500 training images, 50 validation images, and 50 test images.
Architectures. For a fair comparison, we stay consistent with previous works [117, 118, 119, 87].
The expert and student networks utilize a simple ConvNet architecture introduced in [120]: for CIFAR-10/100 a 3-layer ConvNet is employed, and for Tiny ImageNet a depth-4 ConvNet is employed. We use networks with instance normalization by default. We also use AlexNet [15], VGG11 [16], and ResNet18 [14] for the cross-architecture experiments.
Baseline methods. We compare our method to recent work on dataset condensation as well as several works on coreset selection.
• Dataset condensation: Dataset Distillation [121] (DD), Flexible Dataset Distillation [122] (LD), Dataset Condensation [117] (DC), Differentiable Siamese Augmentation [118] (DSA), Matching Training Trajectories [110] (MTT), the neural-tangent-kernel-based methods [86, 87] (KIP), and the feature-matching-based methods Distribution Matching [119] (DM) and Aligning Features [116] (CAFE).
• Coreset selection: random selection (random), herding methods [123] (herding), and example forgetting [124] (forgetting).

4.4.2 Performance on CIFAR-10 and CIFAR-100
We first evaluate our proposed method on the low-resolution (32 × 32) datasets, CIFAR-10 and CIFAR-100. Consistent with prior work, we use a 3-layer ConvNet for both distillation and evaluation. We also employ ZCA whitening [125] on the training dataset, as was done in previous work [86, 87]. Table 4.1 reports the Top-1 test accuracy on CIFAR-10/100 for a ConvNet trained on the condensed dataset with a given number of images per class (IPC). We also show the mean accuracy and the standard deviation. As shown, our method achieves the best performance except on CIFAR-10 with IPC=50. The remaining hyperparameters of this experiment can be found in Table 4.3.

                 CIFAR-10                                     CIFAR-100
IPC              1             10            50              1             10            50
Random           14.4 ± 2.0    26.0 ± 1.2    43.4 ± 1.0      4.2 ± 0.3     14.6 ± 0.5    30.0 ± 0.4
Herding          21.5 ± 1.2    31.6 ± 0.7    40.4 ± 0.6      8.4 ± 0.3     17.3 ± 0.3    33.7 ± 0.5
Forgetting       13.5 ± 1.2    23.3 ± 1.0    23.3 ± 1.1      4.5 ± 0.2     15.1 ± 0.3    30.5 ± 0.3
DD               -             36.8 ± 1.2    -               -             -             -
LD               25.7 ± 0.7    38.8 ± 0.4    42.5 ± 0.4      11.5 ± 0.4    -             -
DC               28.3 ± 0.5    44.9 ± 0.5    53.9 ± 0.5      12.8 ± 0.3    25.2 ± 0.3    -
DSA              28.8 ± 0.7    52.1 ± 0.5    60.6 ± 0.5      13.9 ± 0.3    32.3 ± 0.3    42.8 ± 0.4
DM               26.0 ± 0.8    48.9 ± 0.6    63.0 ± 0.4      11.4 ± 0.3    29.7 ± 0.3    43.6 ± 0.4
CAFE             30.3 ± 1.1    40.6 ± 0.6    55.5 ± 0.6      12.9 ± 0.3    27.9 ± 0.3    37.9 ± 0.3
CAFE+DSA         31.6 ± 0.8    50.9 ± 0.5    62.3 ± 0.4      14.0 ± 0.3    31.5 ± 0.2    42.9 ± 0.2
MTT              43.6 ± 0.8    65.3 ± 0.7    71.6 ± 0.2      24.3 ± 0.3    40.1 ± 0.4    47.7 ± 0.2
Ours             46.5 ± 0.6    65.9 ± 0.6    71.2 ± 0.4      25.2 ± 0.4    40.9 ± 0.3    48.4 ± 0.3

Table 4.1 Top-1 test accuracy on CIFAR-10/100 compared with previous work on coreset selection and dataset condensation. Consistent with prior work, we use a 3-layer ConvNet for both distillation and evaluation. The results are averaged over four independent runs with different initializations. The standard deviation is denoted by ±.

4.4.3 Performance on Tiny ImageNet
We further evaluate our proposed method on the higher-resolution (64 × 64) Tiny ImageNet dataset. To account for the larger resolution, we use a 4-layer ConvNet for both distillation and evaluation. We do not apply ZCA whitening [125] on the training dataset. Table 4.2 reports the Top-1 test accuracy on Tiny ImageNet for a ConvNet trained on the condensed dataset with a given number of images per class (IPC). We also show the mean accuracy and the standard deviation. As shown, our method achieves significantly better performance on Tiny ImageNet with different IPCs.
Note that many dataset condensation algorithms that are effective for CIFAR-10/100 are unable to work on the Tiny ImageNet dataset due to their high memory and computational resource requirements. Thus, we only include DM and MTT in this experiment. The remaining hyperparameters of this experiment can be found in Table 4.3.

                   Tiny ImageNet
IPC                1             10            50
Random             1.4 ± 0.1     5.0 ± 0.2     1.5 ± 0.4
Herding            2.8 ± 0.2     6.3 ± 0.2     16.7 ± 0.3
Forgetting         1.6 ± 0.1     5.1 ± 0.2     15.0 ± 0.3
DM                 3.9 ± 0.2     12.9 ± 0.4    24.1 ± 0.3
MTT                8.8 ± 0.3     23.2 ± 0.2    28.0 ± 0.3
Ours               12.0 ± 0.3    27.3 ± 0.3    33.1 ± 0.2

Table 4.2 Top-1 test accuracy on Tiny ImageNet compared with previous work on coreset selection and dataset condensation. For Tiny ImageNet we use a 4-layer ConvNet for both distillation and evaluation, since it has a larger resolution. The results are averaged over four independent runs with different initializations. The standard deviation is denoted by ±.

Dataset         IPC   N    M   T⁻   T    T⁺   Interval   Synthetic Batch Size   Learning Rate
CIFAR-10        1     80   2   0    4    4    -          10                     100
CIFAR-10        10    80   2   0    10   20   100        100                    100
CIFAR-10        50    80   2   0    20   40   100        500                    1000
CIFAR-100       1     40   3   0    10   20   100        100                    1000
CIFAR-100       10    80   2   0    30   50   100        1000                   1000
CIFAR-100       50    80   2   20   70   70   -          1000                   1000
Tiny ImageNet   1     60   2   0    15   20   400        200                    10000
Tiny ImageNet   10    60   2   10   50   50   -          250                    100
Tiny ImageNet   50    80   2   40   70   70   -          250                    100

Table 4.3 Hyper-parameters for different datasets.

4.4.4 Cross Architecture Generalization
The distilled dataset is condensed using a simple ConvNet architecture, where we match the training trajectories of teacher and student networks of the same architecture. In this experiment, we therefore evaluate the performance of the condensed dataset on unseen neural architectures. We evaluate the performance on three architectures, AlexNet [15], VGG11 [16], and ResNet18 [14], together with the 3-layer ConvNet architecture used for dataset condensation. We synthesize the condensed dataset of CIFAR-100 with IPC=50 and the hyperparameters shown in Table 4.3. The resulting Top-1 test accuracy is reported in Table 4.4. Despite the data being distilled only from the trajectory of a 3-layer ConvNet, our synthetic dataset performs best on the three unseen neural architectures. This indicates that the distilled dataset does not suffer from over-fitting to a particular model.

Method     ConvNet   ResNet18   VGG    AlexNet
Random     30.0      31.9       32.2   26.7
MTT        47.7      42.6       41.2   40.3
Ours       48.4      46.3       45.1   45.4

Table 4.4 Top-1 test accuracy evaluated on CIFAR-100 with 50 images per class on different architectures.

4.4.5 Visualize the condensed dataset
Figures 4.4 - 4.5 and Figures 4.6 - 4.7 show the condensed Tiny ImageNet dataset with IPC=1 and IPC=50, respectively. In the low-IPC case (IPC=1), we see that the generated patterns are a superposition of many images. This indicates that the algorithm is trying to distill as many patterns as possible into the condensed dataset. For the high-IPC case (IPC=50), however, the distilled images have sharp edges and are very close to the images in the original dataset.

Figure 4.4 Visualization of the distilled Tiny ImageNet dataset with IPC=1 (1/2).
Figure 4.5 Visualization of the distilled Tiny ImageNet dataset with IPC=1 (2/2).
Figure 4.6 Visualization of the distilled Tiny ImageNet dataset with IPC=50 (1/2).
Figure 4.7 Visualization of the distilled Tiny ImageNet dataset with IPC=50 (2/2).

4.5 Conclusion
Neural networks are typically overparameterized with a lot of redundancy, so not all weights are equally important.
Based on this insight, in this chapter we first proposed an importance-aware weight factorization, which factorizes the weight matrix based on its influence on the layer output. Building on that, we proposed an importance-aware trajectory matching algorithm that matches only the most important weight components early in the training trajectory and gradually adds more components until the full weights are matched later in the trajectory. Our method removes the interference of the redundant weights when matching the training trajectories. The results show improvements over previous work on the CIFAR and Tiny ImageNet datasets.

CHAPTER 5
CONCLUSION

The significant progress of deep learning models in recent years can be primarily attributed to the growth of the model scale and the volume of data on which they are trained. In this dissertation, my goal is to study efficient architecture and data manipulation techniques for deep learning systems.
Chapter 2 - MSUNet deals with the problem of efficient architecture. In this chapter, I propose MSUNet, which is designed with three key techniques: 1) ternary convolution layers, 2) sparse squeeze-excitation layers, and 3) a self-supervised consistency regularizer along with mixed-precision training. Our framework shows a great improvement in parameter size and computational cost, measured by #Parameters and #FLOPs, over the baseline model. Lastly, we show that ternary quantization outperforms binary quantization in all aspects, including accuracy, parameter size and computation cost.
Chapter 3 - Deep AutoAugment deals with the problem of efficient data manipulation for deep learning systems. In particular, I focus on automated data augmentation. In this chapter, I present Deep AutoAugment (DeepAA), a multi-layer data augmentation search method that finds a deep data augmentation policy without using any hand-picked default transformations. We formulate data augmentation search as a regularized gradient matching problem, which maximizes the gradient similarity between augmented data and original data along the direction with low variance. Our experimental results show that DeepAA achieves strong performance without using default augmentations, indicating that regularized gradient matching is an effective search method for data augmentation policies.
Chapter 4 - Dataset Condensation via Importance Aware Trajectory Matching deals with the problem of efficient data manipulation. Different from Chapter 3, we focus on distilling a large dataset into a condensed dataset. Observing that neural networks are typically overparameterized with a lot of redundancy, so that not all weights are equally important, we first proposed an importance-aware weight factorization, which factorizes the weight matrix based on its influence on the layer output. Building on it, we proposed an importance-aware trajectory matching algorithm that matches only the most important weight components early in the training trajectory and gradually adds more components until the full weights are matched later in the trajectory. Our method removes the interference of the redundant weights when matching the training trajectories. The results show improvements over previous work on the CIFAR and Tiny ImageNet datasets.

BIBLIOGRAPHY
[1] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V.
Le, “Autoaugment: Learning aug- mentation strategies from data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 113–123. F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size,” CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360 A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861 L. Lathauwer, “Decompositions of a higher-order tensor in block terms—part ii: Definitions and uniqueness,” SIAM J. Matrix Analysis Applications, vol. 30, pp. 1033–1066, 01 2008. X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, D. Song, and M. Zhou, “A tensorized transformer for language modeling,” CoRR, vol. abs/1906.09777, 2019. [Online]. Available: http://arxiv.org/abs/1906.09777 G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” 2015. and W. Compressing compression: S. Han, H. Mao, deep neural networks with pruning, trained quantization and huffman coding,” International Conference on Learning Representations (ICLR), 2016. [Online]. Available: https://arxiv.org/abs/1510.00149 J. Dally, “Deep Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model the European compression and acceleration on mobile devices,” in Proceedings of Conference on Computer Vision (ECCV), 2018, pp. 784–800. [Online]. Available: https://arxiv.org/pdf/1802.03494.pdf [9] D. R. So, C. Liang, and Q. V. Le, “The evolved transformer,” arXiv preprint arXiv:1901.11117, 2019. [Online]. Available: https://arxiv.org/abs/1901.11117 [10] T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, H. Wang, Y. Lin, and S. Han, “Apq: Joint search for network architecture, pruning and quantization policy,” in Conference on Computer Vision and Pattern Recognition, 2020. [11] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 27–40, 2017. [Online]. Available: https://arxiv.org/abs/1708.04485 [12] Z. Zhang, H. Wang, S. Han, and W. J. Dally, “Sparch: Efficient architecture for sparse matrix multiplication,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020. [Online]. Available: https://arxiv.org/abs/2002.08947 56 [13] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X.-s. Hua, “Quantization networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 7308–7316. [14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convo- lutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012. [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [17] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in British Machine Vision Conference 2016. 
British Machine Vision Association, 2016. [18] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500. [19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applica- tions,” in CVPR, 2017. [20] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520. [21] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. [22] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. [23] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for effi- cient cnn architecture design,” in The European Conference on Computer Vision (ECCV), September 2018. [24] M. Tan and Q. V. Le, “Mixconv: Mixed depthwise convolutional kernels,” in 30th British Machine Vision Conference 2019, 2019. [25] K. A. vahid, A. Prabhu, A. Farhadi, and M. Rastegari, “Butterfly transform: An efficient fft based neural architecture design,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 57 [26] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural net- works,” in ICML, Long Beach, California, USA, 09–15 Jun 2019, pp. 6105–6114. [27] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [28] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “Addernet: Do we really need multiplications in deep learning?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [29] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “Ghostnet: More features from cheap operations,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [30] D. Zhou, Q.-B. Hou, Y. Chen, J. Feng, and S. Yan, “Rethinking bottleneck structure for efficient mobile network design,” in ECCV, August 2020. [31] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. Pirogov, “Mixed Precision Training of Convolutional Neural Networks using Integer Operations,” in International Conference on Learning Representations(ICLR), vol. abs/1802.0, 2 2018, pp. 1–11. [Online]. Available: https://www.anandtech.com/show/11741/hot-chips-intel-knights-mill-live-blog- 445pm-pt-1145pm-utc http://arxiv.org/abs/1802.00930 [32] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed Precision Training,” in International Conference on Learning Representations(ICLR), 10 2017. [Online]. Available: http://arxiv.org/abs/1710.03740 [33] Y. Ma, N. Suda, Y. Cao, J. S. 
Seo, and S. Vrudhula, “Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA,” FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications, 2016. [34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 7, IEEE, 12 2015, pp. 171–180. [Online]. Available: http://arxiv.org/abs/1512.03385 no. 3. http://ieeexplore.ieee.org/document/7780459/ [35] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” 2011. [Online]. Available: https://research.google/pubs/pub37631/ [36] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, for Efficient Integer-Arithmetic-Only Inference,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. abs/1712.0. IEEE, 6 2018, pp. 2704–2713. [Online]. Available: https://ieeexplore.ieee.org/document/8578384/ “Quantization and Training of Neural Networks 58 [37] S. O. Settle, M. Bollavaram, P. D’Alberto, E. Delaye, O. Fernandez, N. Fraser, A. Ng, A. Sirasao, and M. Wu, “Quantizing Convolutional Neural Networks for Low-Power High-Throughput [Online]. Available: http://arxiv.org/abs/1805.07941 Inference Engines,” ArXiv preprint, 5 2018. [38] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV, 2015. [39] N. Zmora, G. Jacob, L. Zlotnik, B. Elharar, and G. Novik, “Neural network distiller: A python package for dnn compression research,” October 2019. [Online]. Available: https://arxiv.org/abs/1910.12232 [40] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” International Conference on Learning Representations, 2018. [41] M. Tan and Q. V. Le, “Mixconv: Mixed depthwise convolutional kernels,” arXiv preprint arXiv:1907.09595, 2019. [42] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press Cam- bridge, 2016. [43] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning re- quires rethinking generalization,” in International Conference on Learning Representations, 2017. [44] H. Inoue, “Data augmentation by pairing samples for images classification,” arXiv preprint arXiv:1801.02929, 2018. [45] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017. [46] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strat- egy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6023–6032. [47] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “Augmix: A simple data processing method to improve robustness and uncertainty,” International Conference on Learning Representations, 2020. [48] S. Yan, H. Song, N. Li, L. Zou, and L. Ren, “Improve unsupervised domain adaptation with mixup training,” in arXiv preprint arXiv: 2001.00677, 2020. [49] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 702–703. [50] D. Ho, E. Liang, X. Chen, I. Stoica, and P. 
Abbeel, “Population based augmentation: Effi- cient learning of augmentation policy schedules,” in International Conference on Machine Learning. PMLR, 2019, pp. 2731–2741. 59 [51] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim, “Fast autoaugment,” Advances in Neural Information Processing Systems, vol. 32, 2019. [52] R. Hataya, J. Zdenek, K. Yoshizoe, and H. Nakayama, “Faster autoaugment: Learning augmentation strategies using backpropagation,” in European Conference on Computer Vision. Springer, 2020, pp. 1–16. [53] Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M. Robertson, and Y. Yang, “Differentiable Springer, automatic data augmentation,” in European Conference on Computer Vision. 2020, pp. 580–595. [54] A. Liu, Z. Huang, Z. Huang, and N. Wang, “Direct differentiable augmentation search,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 219–12 228. [55] T. C. LingChen, A. Khonsari, A. Lashkari, M. R. Nazari, J. S. Sambee, and M. A. Nascimento, “Uniformaugment: A search-free probabilistic data augmentation approach,” arXiv preprint arXiv:2003.14348, 2020. [56] S. G. Müller and F. Hutter, “Trivialaugment: Tuning-free yet state-of-the-art data augmen- tation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 774–782. [57] X. Zhang, Q. Wang, J. Zhang, and Z. Zhong, “Adversarial autoaugment,” in International Conference on Learning Representations, 2019. [58] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” in International Conference on Learning Representations, 2017. [59] M. Berman, H. Jégou, A. Vedaldi, I. Kokkinos, and M. Douze, “Multigrain: a unified image embedding for classes and instances,” arXiv preprint arXiv:1902.05509, 2019. [60] E. Hoffer, T. Ben-Nun, I. Hubara, N. Giladi, T. Hoefler, and D. Soudry, “Augment your batch: Improving generalization through instance repetition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8129–8138. [61] Y. Du, W. M. Czarnecki, S. M. Jayakumar, M. Farajtabar, R. Pascanu, and B. Lak- shminarayanan, “Adapting auxiliary losses using gradient similarity,” arXiv preprint arXiv:1812.02224, 2018. [62] X. Wang, H. Pham, P. Michel, A. Anastasopoulos, J. Carbonell, and G. Neubig, “Optimizing data usage via differentiable rewards,” in International Conference on Machine Learning. PMLR, 2020, pp. 9983–9995. [63] S. Müller, A. Biedenkapp, and F. Hutter, “In-loop meta-learning with gradient-alignment reward,” arXiv preprint arXiv:2102.03275, 2021. [64] S. Chen, E. Dobriban, and J. H. Lee, “A group-theoretic framework for data augmentation,” Journal of Machine Learning Research, vol. 21, no. 245, pp. 1–71, 2020. 60 [65] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9. [66] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015. [67] S. Fort, A. Brock, R. Pascanu, S. De, and S. L. Smith, “Drawing multiple augmenta- tion samples per image during training efficiently decreases test error,” arXiv preprint arXiv:2105.13343, 2021. [68] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” in Interna- tional Conference on Learning Representations, 2018. [69] R. Wightman, H. Touvron, and H. 
Jégou, “Resnet strikes back: An improved training procedure in timm,” vol. 34, 2021. [70] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolu- tional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017. [71] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [72] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning. PMLR, 2016, pp. 173–182. [73] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Trans- formers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. [74] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748– 8763. [75] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022. [76] C. Chen, Y. Zhang, J. Fu, X. Liu, and M. Coates, “Bidirectional learning for offline infinite- width model-based optimization,” in Thirty-Sixth Conference on Neural Information Pro- cessing Systems, 2022. [Online]. Available: https://openreview.net/forum?id=_j8yVIyp27Q [77] D. Maclaurin, D. Duvenaud, and R. Adams, “Gradient-based hyperparameter optimization PMLR, through reversible learning,” in International conference on machine learning. 2015, pp. 2113–2122. 61 [78] J. Lorraine, P. Vicol, and D. Duvenaud, “Optimizing millions of hyperparameters by implicit differentiation,” in International conference on artificial intelligence and statistics. PMLR, 2020, pp. 1540–1552. [79] F. P. Such, A. Rawal, J. Lehman, K. Stanley, and J. Clune, “Generative teaching networks: Accelerating neural architecture search by learning to generate synthetic training data,” in ICML, 2020. [80] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” in Uncertainty in artificial intelligence. PMLR, 2020, pp. 367–377. [81] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 1997–2017, 2019. [82] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, “An empirical in- vestigation of catastrophic forgetting in gradient-based neural networks,” arXiv preprint arXiv:1312.6211, 2013. [83] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” in CVPR, 2017. [84] T. Wang, J.-Y. Zhu, A. Torralba, and A. A. Efros, “Dataset distillation,” arXiv preprint arXiv:1811.10959, 2018. [85] Y. Zhou, E. Nezhadarya, and J. Ba, “Dataset distillation using neural feature regression,” in NeurIPS, 2022. [86] T. Nguyen, Z. Chen, and J. Lee, “Dataset meta-learning from kernel ridge-regression,” arXiv preprint arXiv:2011.00050, 2020. [87] T. Nguyen, R. Novak, L. Xiao, and J. Lee, “Dataset distillation with infinitely wide convo- lutional networks,” in NeurIPS, 2021. [88] B. Zhao and H. 
Bilen, “Dataset condensation with differentiable siamese augmentation,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 674–12 685. [89] ——, “Dataset condensation with differentiable siamese augmentation,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 674–12 685. [90] Bo Zhao and Hakan Bilen, “Dataset condensation with distribution matching,” CoRR, vol. abs/2110.04181, 2021. [91] Y. Liu, Y. Su, A.-A. Liu, B. Schiele, and Q. Sun, “Mnemonics training: Multi-class in- cremental learning without forgetting,” in Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2020, pp. 12 245–12 254. [92] A. Rosasco, A. Carta, A. Cossu, V. Lomonaco, and D. Bacciu, “Distilled replay: Over- coming forgetting through synthetic samples,” in International Workshop on Continual Semi-Supervised Learning. Springer, 2022, pp. 104–117. 62 [93] M. Sangermano, A. Carta, A. Cossu, and D. Bacciu, “Sample condensation in online continual learning,” in 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022, pp. 01–08. [94] F. Wiewel and B. Yang, “Condensed composite memory continual learning,” in 2021 Inter- national Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8. [95] W. Masarczyk and I. Tautkute, “Reducing catastrophic forgetting with learning on synthetic data,” in CVPR Workshop, 2020. [96] J. Goetz and A. Tewari, “Federated learning via synthetic data,” arXiv preprint arXiv:2008.04489, 2020. [97] Y. Zhou, G. Pu, X. Ma, X. Li, and D. Wu, “Distilled one-shot federated learning,” arXiv preprint arXiv:2009.07999, 2020. [98] Y. Xiong, R. Wang, M. Cheng, F. Yu, and C.-J. Hsieh, “Feddm: Iterative distribution matching for communication-efficient federated learning,” arXiv preprint arXiv:2207.09653, 2022. [99] R. Song, D. Liu, D. Z. Chen, A. Festag, C. Trinitis, M. Schulz, and A. Knoll, “Federated learning via decentralized dataset distillation in resource-constrained edge environments,” arXiv preprint arXiv:2208.11311, 2022. [100] P. Liu, X. Yu, and J. T. Zhou, “Meta knowledge condensation for federated learning,” arXiv preprint arXiv:2209.14851, 2022. [101] S. Hu, J. Goetz, K. Malik, H. Zhan, Z. Liu, and Y. Liu, “Fedsynth: Gradient compression via synthetic data in federated learning,” arXiv preprint arXiv:2204.01273, 2022. [102] R. Pi, W. Zhang, Y. Xie, J. Gao, X. Wang, S. Kim, and Q. Chen, “Dynafed: Tackling client data heterogeneity with global dynamics,” arXiv preprint arXiv:2211.10878, 2022. [103] M. Welling, “Herding dynamical weights to learn,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1121–1128. [104] Y. Chen, M. Welling, and A. Smola, “Super-samples from kernel herding,” arXiv preprint arXiv:1203.3472, 2012. [105] D. Feldman, M. Faulkner, and A. Krause, “Scalable training of mixture models via coresets,” Advances in neural information processing systems, vol. 24, 2011. [106] O. Bachem, M. Lucic, and A. Krause, “Coresets for nonparametric estimation-the case of dp-means,” in International Conference on Machine Learning. PMLR, 2015, pp. 209–217. [107] S. Lee, S. Chun, S. Jung, S. Yun, and S. Yoon, “Dataset condensation with contrastive signals,” in Proceedings of the International Conference on Machine Learning (ICML), 2022, pp. 12 352–12 364. 63 [108] Z. Jiang, J. Gu, M. Liu, and D. Z. Pan, “Delving into effective gradient matching for dataset condensation,” arXiv preprint arXiv:2208.00311, 2022. [109] N. Loo, R. Hasani, A. Amini, and D. 
Rus, “Efficient dataset distillation using random feature approximation,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022. [110] G. Cazenavette, T. Wang, A. Torralba, A. A. Efros, and J.-Y. Zhu, “Dataset distillation by matching training trajectories,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4750–4759. [111] J.-H. Kim, J. Kim, S. J. Oh, S. Yun, H. Song, J. Jeong, J.-W. Ha, and H. O. Song, “Dataset con- densation via efficient synthetic-data parameterization,” arXiv preprint arXiv:2205.14959, 2022. [112] L. Zhang, J. Zhang, B. Lei, S. Mukherjee, X. Pan, B. Zhao, C. Ding, Y. Li, and X. Dongkuan, “Accelerating dataset distillation via model augmentation,” arXiv preprint arXiv:2212.06152, 2022. [113] R. M. Neal, Bayesian learning for neural networks. Springer Science & Business Media, 2012, vol. 118. [114] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, “Deep neural networks as gaussian processes,” arXiv preprint arXiv:1711.00165, 2017. [115] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regu- larized likelihood methods,” Advances in large margin classifiers, vol. 10, no. 3, pp. 61–74, 1999. [116] K. Wang, B. Zhao, X. Peng, Z. Zhu, S. Yang, S. Wang, G. Huang, H. Bilen, X. Wang, and Y. You, “Cafe: Learning to condense dataset by aligning features,” arXiv preprint arXiv:2203.01531, 2022. [117] B. Zhao, K. R. Mopuri, and H. Bilen, “Dataset condensation with gradient matching,” in ICLR, 2020. [118] B. Zhao and H. Bilen, “Dataset condensation with differentiable siamese augmentation,” in ICML, 2021. [119] ——, “Dataset condensation with distribution matching,” in WACV, 2023. [120] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in CVPR, 2018. [121] T. Wang, J.-Y. Zhu, A. Torralba, and A. A. Efros, “Dataset distillation,” arXiv preprint arXiv:1811.10959, 2018. [122] O. Bohdal, Y. Yang, and T. Hospedales, “Flexible dataset distillation: Learn labels instead of images,” in NeurIPS Workshop, 2020. 64 [123] Y. Chen, M. Welling, and A. Smola, “Super-samples from kernel herding,” in UAI, 2010. [124] M. Toneva, A. Sordoni, R. T. des Combes, A. Trischler, Y. Bengio, and G. J. Gordon, “An empirical study of example forgetting during deep neural network learning,” in ICLR, 2018. [125] E. Riba, D. Mishkin, D. Ponsa, E. Rublee, and G. Bradski, “Kornia: an open source differentiable computer vision library for pytorch,” in WACV, 2020. [Online]. Available: https://arxiv.org/pdf/1910.02190.pdf 65