NEW PERSPECTIVES IN NEURAL ARCHITECTURE SEARCH: ARCHITECTURE EMBEDDINGS, EFFICIENT PERFORMANCE ESTIMATIONS, AND THEIR APPLICATIONS

By

Shen Yan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science — Doctor of Philosophy

2022

ABSTRACT

Understanding neural architecture representations and their associated learning curves through theoretical analysis and empirical evaluation is crucial for achieving stable and scalable neural architecture search (NAS). Despite recent advances in one-shot NAS, the stability of search performance remains an issue for users. Sampling-based NAS can eliminate the weight coupling problem by running multiple search trials separately, but it requires significant computational resources. In response, researchers have begun studying architecture representations as a potential solution to the search bias caused by the joint optimization of network representations and search methods. These representations are learned to encourage neural architectures with similar structures or computations to cluster together, which helps map architectures with similar performance to the same regions of the latent space and leads to smoother transitions in the latent space. This benefits downstream search and can be further accelerated by learning curve extrapolation, where the final validation accuracy of a neural network is estimated from the learning curve of a partially trained network.

This dissertation presents our contributions to the field of neural architecture search (NAS), which push the limits of NAS and achieve state-of-the-art performance. Our contributions include efficient one-shot NAS via hierarchical masking [1], addressing the joint optimization problem of architecture representations and search using unsupervised pre-training [2], improving the generalization ability of architecture representations with computation-aware embeddings [3], and developing a method that facilitates multi-fidelity NAS research and demonstrates the power of partial learning curve extrapolation [4].

Copyright by
SHEN YAN
2022

ACKNOWLEDGMENTS

First and foremost, I would like to express my deepest appreciation to my advisor, Professor Mi Zhang. Mi has been an inspiring and encouraging mentor, providing profound insights and guidance throughout my research. Not only has he taught me how to pursue top-quality research, but he has also shared invaluable lessons about life with me. I am incredibly grateful for his support and guidance.

I would also like to express my deep gratitude to several individuals and organizations who provided support and guidance during my PhD studies. This includes Dr. Xuehan Xiong, Dr. Anurag Arnab, Zhichao Lu, Professor Chen Sun, and Professor Cordelia Schmid at Google Research’s Perception Team, as well as Dr. Jiahui Yu, Dr. Tao Zhu, Dr. Zirui Wang, Dr. Yuan Cao, and Dr. Yonghui Wu at Google Research’s Brain Team. I also want to thank Dr. Colin White for hosting me at Abacus.AI, Ming Chen and Youlong Cheng for hosting me at Bytedance AML, and Dr. Huan Song, Lincan Zou, and Dr. Liu Ren for hosting me at Bosch Research. Finally, I want to express my deep appreciation to Dr. David Ross and Dr. Rahul Sukthankar at Google Research for their support during the job search process. I am grateful for the valuable experiences and opportunities that these individuals and organizations have provided.
I am deeply grateful for the guidance and support of my master’s advisor, Professor Hermann Ney, who introduced me to the world of machine learning and has been a role model for me ever since. I also want to thank Dr. Evgeny Matusov, Dr. Shahram Khadivi, Dr. Daniel Bug, Dr. Harald Hanselmann, and Dr. Abin Jose for their advice during my master’s studies. These individuals have played a crucial role in shaping my education and career in machine learning.

I would like to express my deep gratitude to my colleagues and collaborators, who have provided inspiration and support throughout my research. This includes Yu Zheng, Wei Ao, Dr. Biyi Fang, and Dr. Xiao Zeng. I am also thankful to other supportive individuals, including Dr. Kaiqiang Song, Dr. Yandong Li, Dr. Ji Hou, Dr. Yang Liu, Taojiannan Yang, Lemeng Wu, Dr. Li Erran Li, Professor Chen Chen, Professor Fei Liu, and Professor Frank Hutter. I am grateful for the contributions of these individuals, who have helped make this thesis possible.

I would like to express my appreciation to my committee and faculty members, including Professor Xiaoming Liu, Professor Jiliang Tang, Professor Arun Ross, Professor Sandeep Kulkarni, and Professor Wolfgang Banzhaf. I am grateful for their support and guidance throughout my academic career. These individuals have played a crucial role in shaping my education and research interests.

Lastly, I want to thank my parents, grandparents, and fiancé for their unwavering support and love. Their encouragement has been a constant source of motivation and inspiration throughout my life.

TABLE OF CONTENTS

Chapter 1  Introduction
Chapter 2  Efficient One-Shot NAS via Hierarchical Masking
Chapter 3  Arch2Vec
Chapter 4  Computation-aware Neural Architecture Encoding
Chapter 5  NAS-Bench-x11 and the Power of Learning Curves
Chapter 6  Conclusion
BIBLIOGRAPHY

Chapter 1

Introduction

NAS can be seen as a subfield of AutoML, where the architecture configuration itself can be viewed as a hyperparameter that can be optimized via meta-learning. In [5], NAS is categorized along three dimensions: search space, search strategy, and performance estimation strategy. Figure 1.1 shows an overview of the NAS process, where architecture representations are continuous encodings of the search space, and learning curve extrapolation can be viewed as a speedy performance estimation method.

Figure 1.1: A search strategy selects an architecture A from a predefined search space A. The architecture is passed to a performance estimation strategy, which returns the estimated performance of A to the search strategy. Source: [5].

Search Space & Architecture Encoding

The search space defines which architectures can be represented and how they are represented. The two most commonly used search spaces are the Inception cell-based search space [7] and the ResNet block-based search space [8]. The cell-based search space consists of two different kinds of cells: a normal cell that learns the patterns of the input and a reduction cell that reduces the spatial dimension. The overall architecture is then built by stacking these cells in a repeated manner.
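To make the cell-based construction concrete, below is a minimal PyTorch sketch that stacks a repeated normal cell with reduction cells inserted at one third and two thirds of the depth. The cell itself is a placeholder assumption: in an actual cell-based search space, the internal DAG of operations inside each cell is what NAS searches, and the channel count is typically doubled at each reduction cell; both are simplified here for illustration.

```python
# A toy macro-architecture: a repeated "normal" cell with "reduction" cells
# inserted at 1/3 and 2/3 of the depth to halve the spatial resolution.
# The cell internals are placeholders; a real cell-based search space searches
# the DAG of operations inside each cell and typically doubles the channel
# count at each reduction cell (omitted here for brevity).
import torch
import torch.nn as nn

def make_cell(channels, reduction=False):
    stride = 2 if reduction else 1
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

def build_network(num_cells=9, channels=16):
    reduction_positions = {num_cells // 3, 2 * num_cells // 3}
    cells = [make_cell(channels, reduction=(i in reduction_positions))
             for i in range(num_cells)]
    stem = nn.Conv2d(3, channels, 3, padding=1, bias=False)
    return nn.Sequential(stem, *cells)

net = build_network()
out = net(torch.randn(1, 3, 32, 32))   # CIFAR-sized input -> shape (1, 16, 8, 8)
```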
The ResNet block-based search space is inspired by MobileNet [9], which is based on an inverted residual structure where the shortcut connections are between the bottleneck layers.

When designing a NAS algorithm, a key question is how we should encode the neural architecture to maximize performance.

Figure 1.2: The one-hot adjacency matrix encoding is created by flattening the architecture adjacency matrix and concatenating it with a list of node operation labels. Each position in the operation list is a single integer-valued feature or a one-hot matrix.

Architectures sampled from the same search space share similar encoding properties. The way DAG-based architectures are represented may significantly change the outcome of NAS subroutines such as perturbing or manipulating architectures. In [10], a cell-based architecture is encoded using an adjacency matrix-based encoding, a categorical encoding, and a continuous encoding. Figure 1.2 shows an architecture sampled from the NAS-Bench-101 search space [10] and an architecture sampled from NAS-Bench-201 [11], each using the adjacency encoding. Figure 1.3 shows an example of the cell encoding in the DARTS search space. To convert the DARTS search space [12] into one with the same input format as NAS-Bench-101, we can add a summation node so that nodes represent operations and edges represent data flow. To this end, a 15 × 15 upper-triangular binary matrix is used to encode edges and a 15 × 11 one-hot matrix is used to encode operations.

Figure 1.3: A 15 × 15 upper-triangular binary matrix is used to encode edges and a 15 × 11 one-hot matrix is used to encode operations.

Similarly, the categorical adjacency encoding can be derived by first flattening the adjacency matrix; it is then defined as a list of indices, each of which specifies one of the possible edges in the adjacency matrix. Each architecture is represented by the maximum number of possible edges to ensure a fixed-length encoding. The benefit is that the encoding does not scale quadratically with the number of nodes. Similarly, the continuous adjacency encoding can be created by taking the fixed number of edges with the highest continuous values, which is similar to the encodings used in DARTS [12].

Figure 1.4: The categorical encoding is created to ensure fixed-length encodings. Source: [6].

Search Algorithms

The search algorithm describes how to explore and exploit the search space. It is expected to find well-performing architectures quickly while avoiding premature convergence to a region of sub-optimal architectures.

NAS using REINFORCE [13] is the first work that systematically investigated large-scale neural architecture search based on an individual sampling process. It encodes neural networks as macro and micro architectures. The densely connected macro network has up to N layers, where each layer may have different predecessors. In addition to the direct connections with the previous layers, other connections can also be present or absent, resulting in exponential complexity of the search space. Within each layer, there are multiple hyperparameter choices, e.g., strides, filter height/width, and the number of filters. A recurrent neural network (RNN) controller is trained to sequentially sample a string that in turn encodes the neural architecture, inspired by the success of neural machine translation. The controller is optimized with the REINFORCE algorithm [14], and later work uses Proximal Policy Optimization (PPO) [10].
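As an illustration of the policy-gradient idea behind this controller, here is a heavily simplified, self-contained sketch: instead of an RNN, it keeps one softmax distribution over candidate operations per layer and updates it with REINFORCE against a stand-in reward. NUM_LAYERS, NUM_OPS, and the random reward are hypothetical placeholders; in practice the reward is the validation accuracy of the trained child network.

```python
# A heavily simplified policy-gradient controller: one softmax over candidate
# operations per layer (instead of an RNN), updated with REINFORCE against a
# placeholder reward.
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, NUM_OPS = 4, 3
logits = np.zeros((NUM_LAYERS, NUM_OPS))     # controller parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_architecture():
    """Sample one operation index per layer from the controller's distributions."""
    return [rng.choice(NUM_OPS, p=softmax(row)) for row in logits]

def reward(arch):
    """Stand-in for the validation accuracy of the trained child network."""
    return rng.random()

baseline, lr = 0.0, 0.1
for step in range(100):
    arch = sample_architecture()
    R = reward(arch)
    baseline = 0.9 * baseline + 0.1 * R      # moving-average baseline reduces variance
    for layer, op in enumerate(arch):
        p = softmax(logits[layer])
        grad_log_prob = -p
        grad_log_prob[op] += 1.0             # d log p(op) / d logits
        logits[layer] += lr * (R - baseline) * grad_log_prob
```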
An alternative to using RL is to use evolutionary algorithms to optimize the neural architecture. Approaches of this kind have been used to design neural networks since the 1990s [15–17], starting with genetic algorithms. Many neuro-evolutionary approaches [18, 19] have since used genetic algorithms to optimize the neural architecture. Evolutionary methods differ in how they sample parents, update populations, and generate the next generation. For example, [18, 20] use tournament selection to sample parents, while [19] samples parents from a multi-objective Pareto front. The worst or oldest individual in the population is removed as a form of regularization, helping the search escape local minima.

Bayesian optimization is also a popular method for hyperparameter optimization. [21] derive kernel functions for architecture search spaces in order to use a classic Gaussian Process (GP) as the surrogate model, but this does not perform well in high-dimensional spaces. [22] uses an MLP as the surrogate model and is able to perform well on the high-dimensional DARTS search space. Furthermore, [3] uses DNGO as the surrogate and shows strong results on the NAS-Bench-101 and DARTS search spaces.

Performance Estimation

The objective of NAS is typically to find architectures that achieve high predictive performance on unseen data. Performance estimation refers to the process of estimating an architecture's final performance from only a few training iterations. Techniques include fitting the partial curve to an ensemble of parametric functions [23], predicting the performance based on the partially trained neural network configurations [24], summing the training losses [25], using basis functions as the output layer of a Bayesian neural network [26], using previous learning curves as basis-function extrapolators [27], using a positive-definite covariance kernel to capture a variety of training curves [28], or using a Bayesian recurrent neural network [29]. While in this work we focus on multi-fidelity optimization utilizing learning curve-based extrapolation, another main category of methods lies in bandit-based algorithm selection [30–34], and the fidelities can be further adjusted according to previous observations or a learning rate scheduler [35–37]. One-shot methods can also be viewed as a promising approach for speeding up NAS due to their computational efficiency [1, 12, 38–43]. Recent advances in performance prediction [2, 44–50] and other iterative techniques [31, 51] have reduced the runtime gap between iterative and weight sharing techniques. For detailed surveys, we refer the reader to [5, 52]. A minimal sketch of learning curve extrapolation is shown below.

Thesis Outline

Despite all the aforementioned advantages and potential, current NAS methods still have many drawbacks. In this thesis, I highlight two problems concerning architecture encodings and speedy performance estimation, namely how unsupervised architecture representation learning helps downstream search methods, and how to efficiently estimate an architecture's performance using learning curve extrapolation. Each chapter is devoted to solving one of these problems and describes how to make NAS more effective and efficient. In brief, this thesis is organized as follows. I start off with my first work on one-shot NAS [1] in Chapter 2. I then analyze the drawbacks of this line of research and detail my research on architecture representations [2, 3] and on learning curve extrapolation and its applications [4] in Chapters 3, 4, and 5, respectively.
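The following is the minimal learning-curve extrapolation sketch referred to above: it fits a single power-law parametric form to the first 20 epochs of a synthetic validation-accuracy curve and extrapolates to epoch 100. This is only an assumption-laden toy version of the idea, not the ensemble method of [23] or the surrogate benchmarks of Chapter 5.

```python
# A toy learning-curve extrapolation: fit a power-law saturation curve to the
# first k epochs of validation accuracy and extrapolate to the final epoch.
# The "observed" curve is synthetic; in practice it comes from partial training.
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, a, b, c):
    # acc(t) ~ a - b * t^(-c): rises quickly at first, then saturates toward a.
    return a - b * np.power(t, -c)

rng = np.random.default_rng(0)
epochs = np.arange(1, 21, dtype=float)                  # observed: first 20 of 100 epochs
observed = 0.9 - 0.5 * epochs ** -0.7 + rng.normal(0, 0.005, epochs.size)

params, _ = curve_fit(power_law, epochs, observed, p0=[0.9, 0.5, 0.5], maxfev=10000)
predicted_final = power_law(100.0, *params)             # estimate of the epoch-100 accuracy
print(f"predicted final accuracy ~ {predicted_final:.3f}")
```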
Chapter 6 wraps up and discusses remaining challenges in NAS research.

HM-NAS: Efficient Neural Architecture Search via Hierarchical Masking

The use of automatic methods, often referred to as Neural Architecture Search (NAS), in designing neural network architectures has recently drawn considerable attention. In this work, we present an efficient NAS approach, named HM-NAS, that generalizes existing weight sharing based NAS approaches. Existing weight sharing based NAS approaches still adopt hand designed heuristics to generate architecture candidates. As a consequence, the space of architecture candidates is constrained to a subset of all possible architectures, making the architecture search results sub-optimal. HM-NAS addresses this limitation via two innovations. First, it incorporates a multi-level architecture encoding scheme to enable searching for more flexible network architectures. Second, it discards the hand designed heuristics and incorporates a hierarchical masking scheme that automatically learns and determines the optimal architecture. Compared to state-of-the-art weight sharing based approaches, HM-NAS is able to achieve better architecture search performance and competitive model evaluation accuracy. This chapter is based on the following paper [1].

Does Unsupervised Architecture Representation Learning Help Neural Architecture Search?

Existing Neural Architecture Search (NAS) methods either encode neural architectures using discrete encodings that do not scale well, or adopt supervised learning-based methods to jointly learn architecture representations and optimize architecture search on such representations, which incurs search bias. Despite their widespread use, architecture representations learned in NAS are still poorly understood. We observe that the structural properties of neural architectures are hard to preserve in the latent space if architecture representation learning and search are coupled, resulting in less effective search performance. In this work, we find empirically that pre-training architecture representations using only neural architectures, without their accuracies as labels, improves the downstream architecture search efficiency. To explain this finding, we visualize how unsupervised architecture representation learning better encourages neural architectures with similar connections and operators to cluster together. This helps map neural architectures with similar performance to the same regions in the latent space and makes the transition of architectures in the latent space relatively smooth, which considerably benefits diverse downstream search strategies. This chapter is based on the paper [2], which addresses the drawbacks of jointly optimizing architecture representations and search algorithms, as in [1].

Computation-aware Architecture Encodings with Transformers

Recent works [2, 6] demonstrate the importance of architecture encodings in Neural Architecture Search (NAS). These encodings capture either structure or computation information of the neural architectures. Compared to structure-aware encodings, computation-aware encodings map architectures with similar accuracies to the same region, which improves the downstream architecture search performance [6, 53]. In this work, we introduce a Computation-Aware Transformer-based Encoding method called CATE.
Different from existing computation-aware encodings based on fixed transformations (e.g. path encoding), CATE employs a pairwise pre-training scheme to learn computation-aware encodings using Transformers with cross-attention. Such learned encodings contain dense and contextualized computation information of neural architectures. We compare CATE with eleven encodings under three major encoding-dependent NAS subroutines in both small and large search spaces. Our experiments show that CATE is beneficial to the downstream search, especially in the large search space. Moreover, the outside-search-space experiment shows its superior generalization ability beyond the search space on which it was trained. This chapter is based on the following paper [3], which takes inspiration from my earlier work [2].

NAS-Bench-x11 and the Power of Learning Curves

While early research in neural architecture search (NAS) required extreme computational resources, the recent releases of tabular and surrogate benchmarks have greatly increased the speed and reproducibility of NAS research. However, two of the most popular benchmarks do not provide the full training information for each architecture. As a result, on these benchmarks it is not possible to run many types of multi-fidelity techniques, such as learning curve extrapolation, that require evaluating architectures at arbitrary epochs. In this work, we present a method using singular value decomposition and noise modeling to create surrogate benchmarks, NAS-Bench-111, NAS-Bench-311, and NAS-Bench-NLP11, that output the full training information for each architecture, rather than just the final validation accuracy. We demonstrate the power of using the full training information by introducing a learning curve extrapolation framework to modify single-fidelity algorithms, showing that it leads to improvements over popular single-fidelity algorithms which claimed to be state-of-the-art upon release. This chapter is based on the following paper [4], which builds the first multi-fidelity NAS surrogate benchmark and provides speedy performance estimation given different architecture representations such as [2] and [3].

Summary

In summary, this thesis touches on all aspects of the NAS pipeline illustrated in Figure 1.1, each of which offers room for NAS to be significantly improved. We hope to convince the reader, by the end of this thesis, that neural architecture encoding is an important design decision in NAS and that the efficiency of NAS can be significantly improved by leveraging the power of partial learning curves. Still, there are many challenging and rewarding problems to be explored, which I summarize in the concluding Chapter 6.

The material is based on the AutoML workshop tutorial at ICML 2021 (the tutorial website is at https://sites.google.com/corp/view/automl2021). All code, data, and models used in this thesis can be found at https://github.com/MSU-MLSys-Lab.

Chapter 2

Efficient One-Shot NAS via Hierarchical Masking

Neural architecture search (NAS) has recently attracted significant interest due to its capability of automating neural network architecture design and its success in outperforming hand-crafted architectures in many important tasks such as image classification [54], object detection [55], and semantic segmentation [56]. In early NAS approaches, architecture candidates are first sampled from the search space; the weights of each candidate are learned independently and are discarded if the performance of the architecture candidate is not competitive [18, 54, 57, 58].
Despite their remarkable performance, since each architecture candidate requires a full training, these approaches are computationally expensive, consuming hundreds or even thousands of GPU days in order to find high-quality architectures. To overcome this bottleneck, a majority of recent efforts focuses on improving the computation efficiency of NAS using the weight sharing strategy [8, 12, 57, 59, 60]. Specifically, rather than training each architecture candidate independently, the architecture search space is encoded within a single over-parameterized supernet which includes all the possible connections (i.e., wiring patterns) and operations (e.g., convolution, pooling, identity). The supernet is trained only once. All the architecture candidates inherit their weights directly from the supernet without training from scratch. By doing this, the computation cost of NAS is significantly reduced. Unfortunately, although the supernet subsumes all the possible architecture candidates, existing weight sharing based NAS approaches still adopt hand designed heuristics to extract architecture candidates from the supernet. As an example, in many existing weight sharing based NAS approaches such as DARTS [12], the supernet is organized as stacked cells and each cell contains multiple nodes connected with edges. However, when extracting architecture candidates from the supernet, each candidate is hard coded to have exactly two input edges for each node with equal importance and to associate each edge with exactly one operation. As such, the space of architecture candidates is constrained to a subset of all possible architectures, making the architecture search results sub-optimal.

Given the constraint of existing weight sharing approaches, it is natural to ask the question: will we be able to improve architecture search performance if we loosen this constraint? To this end, we present HM-NAS, an efficient neural architecture search approach that effectively addresses this limitation of existing weight sharing based NAS approaches to achieve better architecture search performance and competitive model evaluation accuracy. As illustrated in Figure 2.1, to loosen the constraint, HM-NAS incorporates a multi-level architecture encoding scheme which enables an architecture candidate extracted from the supernet to have arbitrary numbers of edges and operations associated with each edge. Moreover, it allows each operation and edge to have different weights which reflect their relative importance across the entire network. Based on the multi-level encoded architecture, HM-NAS formulates neural architecture search as a model pruning problem: it discards the hand designed heuristics and employs a hierarchical masking scheme to automatically learn the optimal numbers of edges and operations and their corresponding importance, as well as mask out unimportant network weights. Moreover, the addition of these learned hierarchical masks on top of the supernet also provides a mechanism to help correct the architecture search bias caused by the bilevel optimization of architecture parameters and network weights during supernet training [38, 61, 62]. Because of this benefit, HM-NAS is able to use the unmasked network weights to speed up the training process.
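To give a feel for how a real-valued mask can be learned and yet applied as a hard binary mask, here is a minimal PyTorch sketch assuming a deterministic 0.5 threshold and a straight-through gradient estimator. This is an illustrative assumption, not HM-NAS's actual hierarchical masking scheme over operations, edges, and network weights, which is detailed later in this chapter.

```python
# A toy binary mask learned through a deterministic threshold, assuming a
# straight-through gradient estimator so the real-valued mask still receives
# gradients despite the hard 0/1 forward pass.
import torch

class BinaryMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, real_mask):
        # Deterministic thresholding: keep entries whose real-valued mask exceeds 0.5.
        return (real_mask > 0.5).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient to the real-valued mask unchanged.
        return grad_output

real_valued_mask = torch.rand(6, requires_grad=True)   # e.g. one entry per candidate op or edge
binary_mask = BinaryMask.apply(real_valued_mask)       # hard 0/1 mask used in the forward pass
loss = (binary_mask * torch.randn(6)).sum()            # toy loss over the masked quantities
loss.backward()                                        # real_valued_mask.grad is populated
```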
We evaluate HM-NAS on both CIFAR-10 and ImageNet and our results are promising: HM-NAS is able to achieve competitive accuracy on CIFAR-10 with 1.6× to 1.8× fewer parameters and a 2.7× total training time speed-up compared with state-of-the-art weight sharing approaches. Similar results are also achieved on ImageNet. Moreover, we have conducted a series of ablation studies that demonstrate the superiority of our multi-level architecture encoding and hierarchical masking schemes over randomly searched architectures, as well as over the single-level architecture encoding and hand designed heuristics used in existing weight sharing based NAS approaches. Finally, we have conducted an in-depth analysis of the best-performing network architectures found by HM-NAS. Our results show that without the constraint imposed by the hand designed heuristics, our searched networks contain more flexible and meaningful architectures that existing weight sharing based NAS approaches are not able to discover.

In summary, our work makes the following contributions:
1) We present HM-NAS, an efficient neural architecture search approach that loosens the constraint of existing weight sharing based NAS approaches.
2) We introduce a multi-level architecture encoding scheme which enables an architecture candidate to have arbitrary numbers of edges and operations with different importance. We also introduce a hierarchical masking scheme which is able to not only automatically learn the optimal numbers of edges and operations and the important network weights, but also help correct the architecture search bias caused by bilevel optimization during supernet training.
3) Extensive experiments show that compared to state-of-the-art weight sharing based NAS approaches, HM-NAS is able to achieve better architecture search efficiency and competitive model evaluation accuracy.

Related Work

Designing high-quality neural networks requires domain knowledge and extensive experience. To reduce the labor intensity, there has been growing interest in developing automated neural network design approaches through NAS. Pioneering works on NAS employ reinforcement learning (RL) or evolutionary algorithms to find the best architecture based on nested optimization [18, 54, 57, 58]. However, these approaches are incredibly expensive in terms of computation cost. For example, in [54], it takes 450 GPUs for four days to search for the best network architecture. To reduce the computation cost, many works adopt the weight sharing strategy where the weights of architecture candidates are inherited from a supernet that subsumes all the possible architecture candidates.

Figure 2.1: The pipelines of (a) existing weight sharing based NAS approaches; and (b) HM-NAS.

To further reduce the computation cost, recent weight sharing based approaches such as DARTS [12] and SNAS [59] replace the discrete architecture search space with a continuous one and employ gradient descent to find the optimal architecture. However, these approaches restrict the continuous search space with hand designed heuristics, which could jeopardize the architecture search performance.
Moreover, as discussed in [38, 61, 62], the bilevel optimization of architecture parameters and network weights used in existing weight sharing based approaches inevitably introduces bias to the architecture search process, making their architecture search results sub-optimal. Our approach is related to DARTS and SNAS in the sense that we also build upon the weight sharing strategy. However, our goal is to address the above limitations of existing approaches to achieve better architecture search performance. Our approach is also related to ProxylessNAS [8]. ProxylessNAS formulates NAS as a model pruning problem. In our approach, the employed hierarchical masking scheme also prunes the redundant parts of the supernet to generate the optimal network architecture.

NAS Approach       Architecture Encoding   Retrain from Scratch   Use Proxy
ENAS [57]          Operations              Yes                    Yes
NASNet [54]        Operations              Yes                    Yes
AmoebaNet [18]     Operations              Yes                    Yes
NAONet [63]        Operations              Yes                    Yes
ProxylessNAS [8]   Operations              Yes                    No
FBNet [64]         Operations              Yes                    No
DARTS [12]         Operations              Yes                    Yes
SNAS [59]          Operations              Yes                    Yes
HM-NAS             Operations & Edges      No                     No

Table 2.1: Comparison between HM-NAS and other NAS approaches.

The distinction is that ProxylessNAS focuses on pruning operations (referred to as paths in [8]) of the supernet, while HM-NAS provides a more generalized model pruning mechanism which prunes the redundant operations, edges, and network weights of the supernet to derive the optimal architecture. Our approach is also similar to ProxylessNAS in being a proxyless approach. Rather than adopting a proxy strategy like [12, 59], which transfers the searched architecture to another larger network, both HM-NAS and ProxylessNAS directly search the architectures on target datasets without architecture transfer. However, unlike ProxylessNAS, which involves retraining as the last step, HM-NAS eliminates the prolonged retraining process and replaces it with a fine-tuning process that reuses the unmasked pretrained supernet weights. Table 2.1 provides a comparison between HM-NAS and relevant approaches on a number of important dimensions. The combination of the proposed multi-level architecture encoding and hierarchical masking techniques makes HM-NAS superior to many existing approaches.

Approach

Supernet Design

Following [12, 18, 59], we use a cell structure with an ordered sequence of nodes as our search space. The network is then composed of several identical cells which are stacked on top of each other. Specifically, a cell is represented using a directed acyclic graph (DAG) where each node x in the DAG is a latent representation (e.g., a feature map in a convolutional network). A cell is set to have two input nodes, one or more intermediate nodes, and one output node. Specifically, the two input nodes are connected to the outputs of the two previous cells; each intermediate node is connected to all of its predecessors; and the output node is the concatenation of all the intermediate nodes within the cell. To build the supernet that subsumes all the possible architectures in the search space, existing works such as DARTS [12] and SNAS [59] associate each edge in the DAG with a mixture of candidate operations (e.g., convolution, pooling, identity) instead of a definite one. Moreover, each candidate operation in the mixture is assigned a learnable variable (i.e., an operation mixing weight) which encodes the importance of this candidate operation.
As such, the mixture of candidate operations associated with a particular edge is represented as the softmax over all candidate operations:

o(x) = \sum_{i=1}^{N} \frac{\exp(\alpha_i)}{\sum_{j=1}^{N} \exp(\alpha_j)} o_i(x)    (2.1)

where {o_i} denote the set of N candidate operations and {\alpha_i} denote the set of N real-valued operation mixing weights. Although this supernet encodes the importance of different candidate operations within each edge, it does not provide a mechanism to encode the importance of different edges across the entire DAG. Instead, all the edges across the DAG are constrained to have the same importance. However, as we observed in our experiments (Section 2.5), loosening this constraint is able to help NAS find better architectures.

Figure 2.2: In this example, each edge has 3 candidate operations marked using red, yellow, and blue color respectively. In each iteration, the real-valued hierarchical masks {M_\alpha^r, M_\beta^r, M_w^r} are passed through a deterministic thresholding function to obtain the corresponding binary masks (highlighted grids represent ‘1’, the rest represent ‘0’) that mask out redundant operations, edges, and weights of the supernet.

Motivated by this observation, in HM-NAS’s supernet, besides encoding the importance of each candidate operation within an edge, we introduce a separate set of learnable variables (i.e., edge mixing weights) to independently encode the importance of each edge across the DAG. As such, each intermediate node x^{(i)} in the DAG is computed based on all of its predecessors as:

x^{(i)} = \sum_{j<i} \frac{\exp(\beta^{(i,j)})}{\sum_{k<i} \exp(\beta^{(i,k)})} o(x^{(j)})    (2.2)
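As a concrete reading of Eqs. (2.1) and (2.2), the following is a minimal PyTorch sketch of one mixed operation with operation mixing weights α and of a node that aggregates its incoming edges with edge mixing weights β. The candidate operations, channel sizes, and tensor shapes are illustrative assumptions, not HM-NAS's actual supernet configuration.

```python
# A sketch of Eq. (2.1) (mixed operation with operation mixing weights alpha) and
# Eq. (2.2) (node aggregation with edge mixing weights beta). The candidate
# operations and tensor shapes below are illustrative, not HM-NAS's supernet.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softmax-weighted sum of candidate operations on one edge (Eq. 2.1)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # candidate operation 1
            nn.MaxPool2d(3, stride=1, padding=1),          # candidate operation 2
            nn.Identity(),                                 # candidate operation 3
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # operation mixing weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

class Node(nn.Module):
    """Intermediate node x^(i): softmax-weighted sum over incoming edges (Eq. 2.2)."""
    def __init__(self, num_predecessors, channels):
        super().__init__()
        self.edges = nn.ModuleList(MixedOp(channels) for _ in range(num_predecessors))
        self.beta = nn.Parameter(torch.zeros(num_predecessors))  # edge mixing weights

    def forward(self, predecessors):
        w = F.softmax(self.beta, dim=0)
        return sum(wi * edge(x) for wi, edge, x in zip(w, self.edges, predecessors))

# Toy usage: a node with two predecessor feature maps of 8 channels each.
x0, x1 = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
node = Node(num_predecessors=2, channels=8)
out = node([x0, x1])            # same shape as the inputs: (1, 8, 16, 16)
```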