INTERPRETABLE ARTIFICIAL INTELLIGENCE USING NONLINEAR DECISION TREES By Yashesh Deepakkumar Dhebar A DISSERTATION Michigan State University in partial fulfillment of the requirements Submitted to for the degree of Mechanical Engineering – Doctor of Philosophy 2020 ABSTRACT INTERPRETABLE ARTIFICIAL INTELLIGENCE USING NONLINEAR DECISION TREES By Yashesh Deepakkumar Dhebar The recent times have observed a massive application of artificial intelligence (AI) to automate tasks across various domains. The back-end mechanism with which automation occurs is generally black-box. Some of the popular black-box AI methods used to solve an automation task include decision trees (DT), support vector machines (SVM), artificial neural networks (ANN), etc. In the past several years, these black-box AI methods have shown promising performance and have been widely applied and researched across industries and academia. While the black-box AI models have been shown to achieve high performance, the inherent mechanism with which a decision is made is hard to comprehend. This lack of interpretability and transparency of black-box AI methods makes them less trustworthy. In addition to this, the black-box AI models lack in their ability to provide valuable insights regarding the task at hand. Following these limitations of black-box AI models, a natural research direction of developing interpretable and explainable AI models has emerged and has gained an active attention in the machine learning and AI community in the past three years. In this dissertation, we will be focusing on interpretable AI solutions which are being currently developed at the Computational Optimization and Innovation Laboratory (COIN Lab) at Michigan State University. We propose a nonlinear decision tree (NLDT) based framework to produce transparent AI solutions for automation tasks related to classification and control. The recent advancement in non-linear optimization enables us to efficiently derive interpretable AI solutions for various automation tasks. The interpretable and transparent AI models induced using customized optimization techniques show similar or better performance as compared to complex black-box AI models across most of the benchmarks. The results are promising and provide directions to launch future studies in developing efficient transparent AI models. Copyright by YASHESH DEEPAKKUMAR DHEBAR 2020 Dedicated to my parents and elder brother Paurav. iv ACKNOWLEDGEMENTS The PhD journey for me has been really fulfilling and transformational and I would be like to acknowledge those who have played instrumental role during this endeavour. I thought that writing this Acknowledgements section of my dissertation would be easy and quick, however its not the case. While I was typing this, I came to a realization that this expression of gratitude would be very limited and always incomplete for certain people including my parents, my brother Paurav, my friends with whom I spent most of my PhD duration with and Prof. Kalyanmoy Deb. The role of my family has been extremely important and encouraging in bringing me to the point I am at right now. Contributions and sacrifices made by my father Mr. Deepak Dhebar and my mother Mrs. Deepika Dhebar are among the things which cannot be quantified and most of which I believe are still outside my conscious realm. If it would not have been my brother Paurav, who spotted my technical aptitude and envisioned me getting into the IIT, my trajectory of life would be have been completely different. The vision, the support and the encouragement provided to me by my family has played a big part in turning this dream into reality. I would like to thank my friends who I see as my family far from my hometown, the friends with whom I stayed and created a permanent bond with during my PhD – thank you Tarang, Swati, Mayank, Kokil, Ashish, Saptarshi, Sabhyasachi, Thrilok and Vikram Prajapati. I would like to also thank Abhinav and Kamala for showing their concern and warmth and being like elder siblings in the town. The stay has been made equally joyous because of the indoor board games amidst the harsh weather of Michigan and I would like to thank Kanchan, Aritra, Rahul and Bakul for being the part of the clan. I would like to thank Pratap Bhanu Solanki for helping me during my transition from IIT Kanpur to Michigan State University. I would like to thank Nilay Kant for his valuable inputs during my recent job search and providing me with some fundamental insights across numerous aspects during my PhD. I was really fortunate to have great lab-mates as my colleagues at the Computational Optimiza- v tion and Innovation Laboratory. I got to learn a lot from them. Thank you Abhinav, Haitham, Proteek, Rayan, Julian, Zhichao, Khaled, Airuddh, Ali, Abhiroop, Yash, Shuvei, Mohamed and visiting scholars viz. Ankur, Flavio, Pablo, Sukrit, Sparsh and Hemant. I would like to thank Mrs. Debjani Deb for organizing lab gatherings and enriching the social culture of the lab. It was indeed my pleasure to interact with Prof. Ranjan Mukherjee and Mrs. Moushumi Mukherjee and am thankful to them for introducing me and involving me in the Durga Pooja festivals and other social gatherings. It was indeed great to experience the Indian traditional culture on the foreign land. I would like to thank Prof. Bhaskar Dasgupta and Prof. Bishakh Bhattacharya of IIT Kanpur and Prof. G. K. Ananthasuresh of IISc Bangalore who wrote my recommendation letter for PhD. I would like to thank Prof. Niraj Sinha, Prof. Anindya Chatterjee and Prof. Harish Karnick of IIT Kanpur for giving valuable advice regarding how to choose a career after bachelors and what to consider while opting for a PhD. I still remember this line from Prof. Chatterjee’s homepage and it reads: The PhD advisor is more than merely a source of information and funding. Ideally, your PhD advisor will profoundly influence not only how you view your subject, but how you view I could experience this my case. Working under Prof. Kalyanmoy Deb shaped me the world. fundamentally and his guidance showed me the way to conduct the life: both professional and personal. Thank you Prof. Deb for everything and thank you Prof. Erik Goodman, Prof. Ronald Averill and Dr. Vishnu Boddeti for being the part of my PhD committee and for providing valuable feedback and suggestions in regards to my research. Warm Regards, Yashesh Dhebar vi TABLE OF CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . LIST OF TABLES . LIST OF FIGURES . LIST OF ALGORITHMS . CHAPTER 1 THE NEED AND THE START . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii 1 2 1.1 Our Work . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 3.2.2 3.1 Past Studies . . 3.2 Proposed Approach . CHAPTER 2 A HIGH LEVEL VIEW OF OUR INTREPRETABLE AI MODEL . . 4 CHAPTER 3 CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 . . 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Classifier Representation Using Nonlinear Decision Tree . . . . . . . . . . 12 Split-Rule Discovery Using Bilevel Optimization . . . . . . . . . . . . . . 14 3.3 Bilevel Approach for Split-Rule Identification . . . . . . . . . . . . . . . . . . . . 17 3.3.1 The Hierarchical Objectives to derive split-rule . . . . . . . . . . . . . . . 17 3.3.1.1 Computation of  . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.1.2 Computation of  . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3.2 Upper Level Optimization (ULGA) . . . . . . . . . . . . . . . . . . . . . . 20 . 21 3.3.2.1 Custom Initialization for Upper Level GA . . . . . . . . . . . . 3.3.2.2 Ranking of Upper Level Solutions . . . . . . . . . . . . . . . . . 21 3.3.2.3 Custom Crossover Operator for Upper Level GA . . . . . . . . . 22 3.3.2.4 Custom Mutation Operator for Upper Level GA . . . . . . . . . . 24 3.3.2.5 Duplicate Update Operator . . . . . . . . . . . . . . . . . . . . . 27 . . . . . . . . . . . . . . . . . . . . . 27 Lower Level Optimization (LLGA) 3.3.3.1 Custom Initialization for Lower Level GA . . . . . . . . . . . . . 28 Selection, Crossover, and Mutation for Lower Level GA . . . . . 30 3.3.3.2 . 31 Termination Criteria for Lower level GA . . . . . . . . . . . . 3.3.3.3 3.4 Ablation Studies and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.4.1 Ablation Studies on Lower Level GA . . . . . . . . . . . . . . . . . . . . . 31 3.4.2 Ablation Studies on the Proposed Bilevel GA . . . . . . . . . . . . . . . . 33 3.5 Visualization of split-rule: X-Space and B-Space . . . . . . . . . . . . . . . . . . 35 3.6 Overall Tree Induction and Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.7.1 Customized Datasets: DS1 to DS4 . . . . . . . . . . . . . . . . . . . . . . 39 3.7.2 Breast Cancer Wisconsin Dataset . . . . . . . . . . . . . . . . . . . . . . . 40 3.7.3 Wisconsin Diagnostic Breast Cancer Dataset (WDBC) . . . . . . . . . . . 41 3.7.4 Real World Auto-Industry Problem (RW-problem) . . . . . . . . . . . . . . 42 3.7.5 Results on Multi-Objective Optimization Problems . . . . . . . . . . . . . 43 Truss 2D and Welded Beam Problems . . . . . . . . . . . . . . . 44 3.7.5.1 3.3.3 . . . . . . . . . vii Support Vector Machines (SVMs) 3.7.5.2 Modified ZDT (m-ZDT) and DLTZ (m-DTLZ) Problems . . . . 3.7.5.3 m-ZDT and m-DTLZ Results: . . . . . . . . . . . . . . . . . . . 47 . 50 3.8 Additional Comparisons and Results . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.8.1 . . . . . . . . . . . . . . . . . . . . . . 53 3.8.2 Generalized Additive Models (GAMs) . . . . . . . . . . . . . . . . . . . . 55 . 58 3.8.3 Genetic Programming (GP) . . . . . . . . . . . . . . . . . . . . . . . . . 3.8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.9 Conclusions and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.10 Parameter Settings . . . . . . . . . . 3.10.1 Termination Criteria and other Parameter Settings for Inducing a Non- linear Decision Tree (NLDT) . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.10.2 Parameter Setting for NSGA-II for multi-objective data creation . . . . . . 65 3.10.3 Parameter Setting for Upper Level GA . . . . . . . . . . . . . . . . . . . . 66 3.10.4 Parameter Setting for Lower Level GA . . . . . . . . . . . . . . . . . . . . 66 3.10.5 Creation of Customized 2D Datasets: DS1- DS4 . . . . . . . . . . . . . . 67 . . CHAPTER 4 CONTROL: INTERPRETABLE POLICY FOR DISCRETE AC- . . . . . . Introduction . 4.6 Overall Approach . TION SPACES . . . . . 4.5 Nonlinear Decision Trees (NLDTs) as Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.2 Motivation for the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.3 Related Past Studies . 4.4 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.4.1 Open-Loop Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.4.2 Closed-loop Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 . 76 4.5.1 Binary-split NLDT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.5.2 Multi-split NLDT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.6.1 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6.2 Open-loop Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6.2.1 Open-loop training for Binary-split NLDT . . . . . . . . . . . . 80 4.6.2.2 Open-loop training for Multi-split NLDT . . . . . . . . . . . . . 81 4.6.3 Closed-loop Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.7 Experiments: AIM and Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 . . . . . . . . . . . . . . . . . . . . 84 . . . . . . . . . . . . . . . . . . . 85 . 85 4.8.1 CartPole Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.8.1.1 NLDT for CartPole Problem . . . . . . . . . . . . . . . . . . . . 86 4.8.2 CarFollowing Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.8.2.1 NLDT for CarFollowing Problem . . . . . . . . . . . . . . . . . 90 . 92 4.9.1 MountainCar Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.9.2 NLDT for MountainCar Problem . . . . . . . . . . . . . . . . . . . . . . . 93 4.9 Experiments and Analysis on Multiple Discrete Action Space . . . . . . . . . . . 4.8 Experiments and Analysis on Control Tasks with Binary Action Spaces . . . . . . 4.7.1.1 Creation of Regular Dataset 4.7.1.2 Creation of Balanced Dataset viii 4.9.3 LunarLander Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.9.3.1 NLDT for LunarLander Problem . . . . . . . . . . . . . . . . . 95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.10 Conclusions . . . . . . . CHAPTER 5 SCALE-UP STUDY AND IMPROVISATION . . . . . . . . . . . . . . . 101 5.1 Ablation Study for Open-loop Training . . . . . . . . . . . . . . . . . . . . . . . . 103 5.2 Closed-loop Visualization . . 105 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.3 Reengineering NLDT* . . 111 . . 5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 6 EXTENSION TO REGRESSION AND CONTINUOUS CONTROL . . . . . . . PROBLEMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Introduction . . 113 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1 6.2 Interpretable AI for Regression Problems using NLDT . . . . . . . . . . . . . . . 113 6.3 Dual Bilevel Algorithm (DBA) for Regression . . . . . . . . . . . . . . . . . . . . 115 6.3.1 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.3.2 Bilevel Regression Algorithm to Obtain (x) . . . . . . . . . . . . . . . . 117 6.3.2.1 Lower Level Regression Optimization . . . . . . . . . . . . . . . 118 6.3.2.2 Upper Level Regression Optimization . . . . . . . . . . . . . . . 122 6.3.3 Bilevel Partition Algorithm (BPA) to Obtain (x) . . . . . . . . . . . . . . 123 Lower Level Partition Optimization . . . . . . . . . . . . . . . . 124 6.3.4 Results on a Customized Benchmark Problem . . . . . . . . . . . . . . . . 125 Interpretable AI for Control Problem with Continuous Action Space . . . . . . . . 128 6.4.1 NLDT Representation and Training . . . . . . . . . . . . . . . . . . . . . 131 6.4.2 Data generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.4.3 Results on Car-Following Problem . . . . . . . . . . . . . . . . . . . . . . 132 6.3.3.1 6.4 . CHAPTER 7 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . 136 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 BIBLIOGRAPHY . . . . . . . . . ix LIST OF TABLES Table 3.1: Results on DS1 to DS4 datasets (2 features). . . . . . . . . . . . . . . . . . . . . 39 Table 3.2: Results on breast cancer Wisconsin dataset (10 features). . . . . . . . . . . . . . 40 Table 3.3: Results on WDBC dataset (30 features). . . . . . . . . . . . . . . . . . . . . . . 41 Table 3.4: Results on the real-world auto-industry problem (36 features). . . . . . . . . . . 43 Table 3.5: Results on Truss-2D with  = 6. . . . . . . . . . . . . . . . . . . . . . . . . 44 Table 3.6: 2D Truss problem with  = 6 and  = 9. . . . . . . . . . . . . . . . . . 45 Table 3.7: Results on welded beam design with   = 1 and   = 10.  = 3 is kept fixed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Table 3.8: Parameter setting to generate datasets for m-ZDT and m-DTLZ problems. We generate 1000 datapoints for each class. . . . . . . . . . . . . . . . . . . . . . . 49 Table 3.9: Results on multi-objective problems for classifying dominated and non-dominated solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Table 3.10: SVM Result for different values of penalty parameter . For each dataset, the first row represents the testing accuracy and the second row represents complexity (number of support vectors).  = 1000 gives overall best performance. 56 Table 3.11: Details regarding parametric study for GAMs. . . . . . . . . . . . . . . . . . . . 58 Table 3.12: GP Result for different values of parsimony coefficient . For each dataset, the first row represents the testing accuracy and the second row represents complexity (number of internal nodes).  = 0.001 produces better results. . . . 60 Table 3.13: Summary of results obtained using various methods. For each dataset, the first row indicates testing accuracy and the second row indicates complexity. Italicized entries are statistically insignificant (according to 95% confidence in Wilcoxon rank-sum test) compared to the best entry in the same row. . . . . . . . 63 Table 3.14: Parameter setting to create customized datasets D1-D4. For Class-1 data- points (, ) = (0, 0.01). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 x Table 4.1: Class Distribution of Regular Training Datasets for different problems. For each row, the -th number in second column represents the number of datapoints belonging to -th action. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Table 4.2: Effect of training data size on performance of NLDT  on CartPole problem. . . 86 Table 4.3: Results on CarFollowing problem correspoing to open-loop training (NLDT ). 89 Table 4.4: Closed-loop performance analysis after re-optimizing NLDT for CarFollowing problem (k = 103). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Table 4.5: Mountain car results. The numbers indicate average scores. . . . . . . . . . . . . 93 Table 4.6: LunarLander open-loop training (NLDT ) results. The numbers indicate average scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Table 4.7: Closed-loop performance on LunarLander problem with and without re-optimization on 26-rule NLDT . Number of rules are specified in brackets for each NLDT and total parameters for the DNN is marked. . . . . . . . . . . . . . . . . . . . . 97 Table 4.8: NLDT rules before and after the closed-loop training for LunarLander problem, for which NLDT* is shown in Figure 4.16. Video showing the simulation output of the performance of NLDTs with rule-sets mentioned in this table can be found at https://youtu.be/DByYWTQ6X3E. Respective minimum and maximum state variables are min = [-0.38, -0.08, -0.80, -0.88, -0.42, -0.85, 0.00, 0.00], max = [0.46, 1.52, 0.80, 0.50, 0.43, 0.95, 1.00, 1.00], respectively. . 99 Table 5.1: Details regarding custom designed Planar Serial Manipulator environments. . . . 102 Table 5.2: Comparing performance of different lower-level optimization algorithms. For comparison, closed-loop performance of the original DNN policy is also reported.104 xi LIST OF FIGURES Figure 2.1: NLDT for classification: At each conditional node (white colored), a nonlin- ear condition (x) ≤ 0 is checked for a datapoint x. The datapoint x traverses the NLDT by following conditions (x) at each conditional node and reaches a leaf node (colored) where it is assigned to a class represented by the leaf node. 4 Figure 2.2: NLDT for Regression: Each terminal leaf node (green colored) here repre- sents a regression equation (x) to predict the value of a dependent variable . However, these regression rules ( = (x)) are valid only for certain part of the feature space. These regions of the feature space are defined using the partition rules (x) at each condition node (white). . . . . . . . . . . . . . . . 5 Figure 3.1: An illustration of a Nonlinear Decision Tree (NLDT) for a classification problem. For a given conditional node, the split-rule function (x) is derived using a dedicated bilevel-optimization procedure. At first, the algorithm is applied to the entire data on Node 1 to obtain split-rule 1(x) ≤ 0. If this split-rule is not able to partition the data perfectly, then a similar bilevel optimization is invoked at Node 2 and Node 3 to determine 2(x) and 3(x), respectively, on the subset of data present in Node 1 (wherein the Node 2 will have the data satisfying the split-rule 1(x) ≤ 0 and Node 3 will have data of Node 1 which violates split-rule 1(x) ≤ 0). The process continues until a certain termination criteria is met. Terminal leaf-nodes are assigned with a class-label based on the distribution of data in that node. . . . . . . . . . . . . . 12 Figure 3.2: In this illustration, a two-class data comprising of two features 1 and 2 is provided. ,  and  are three different split-rules. Split-rule  is able to optimally partition the data, but is complex and may not be interpretable. Rule  is simple, however it does not produce accurate classification. Rule  is simple as well as classifies the data accurately. . . . . . . . . . . . . . . . . 18 Figure 3.3: Illustrations of meaningful and meaningless crossover operations. In the meaningful crossover operation, children (green dots) will carry information of parents (red dots). Hence, they will be located with high probability at the corners of the rectangle . On the other hand, the meaningless crossover operation creates children at random locations, without taking leverage of parents. 23 Figure 3.4: Upper Level Crossover Operation . . . . . . . . . . . . . . . . . . . . . . . . . 24 Figure 3.5: Mutation for Upper Level GA . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 xii Figure 3.6: Illustration of Data in B-Space. The split function  (x, B, , w, ) is a straight line in B-space. Dotted lines represent the convex hull engulfing the data. Different possible split functions are shown by straight lines , ,  and . 28 Figure 3.7: Mixed dipole (xA, xB) and a hyperplane (wh, Ò). Figure 3.8: Customized datasets. DS1 is linear and balanced, DS2 is linear but unbal- anced with minority class having 10x less points. DS3 is nonlinear and DS4 has a sandwiched distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 . . . . . . . . . . . . . . . 30 Figure 3.9: Results on DS1 dataset to benchmark LLGA. Numbers in first parenthesis indicate Type-1 and Type-2 error on training data and the numbers in second parenthesis indicate Type-1 and Type-2 error on testing data. . . . . . . . . . . . 33 Figure 3.10: Results on DS2 dataset to benchmark LLGA. . . . . . . . . . . . . . . . . . . . 34 Figure 3.11: Bilevel GA results on DS3 and DS4. . . . . . . . . . . . . . . . . . . . . . . . 35 Figure 3.12: Results on DS3 dataset to benchmark the overall bilevel algorithm. . . . . . . . 36 Figure 3.13: Results on DS4 dataset to benchmark the overall bilevel algorithm. . . . . . . . 37 Figure 3.14: Feature Transformation. A point in three-dimensional X-space is mapped to a point in a two-dimensional B-space for which  =  1 1  2 2  3 3 . . . . . . . . 38 Figure 3.15: Breast Cancer Wisconsin NLDT. For each node, number of datapoints  present in the node, impurity of the node (Gini) and class distribution (in square parenthesis) is reported. . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Figure 3.16: B-space plot for Wisconsin breast cancer dataset. . . . . . . . . . . . . . . . . . 41 Figure 3.17: Tree for WDBC dataset. For each node, number of datapoints  present in the node, impurity of the node (Gini) and class distribution (in square parenthesis) is reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Figure 3.18: B-space plot for WDBC dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 42 Figure 3.19: NLDT for the auto-industry problem. The first split-rule uses five variables and the second one uses 12. For each node, number of datapoints  present in the node, impurity of the node (Gini) and class distribution (in square parenthesis) is reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 xiii Figure 3.20: Truss design problem data visualization.   = 1 is kept fixed. For a fixed value of   , larger value of  implies better separation between datapoints belonging to two classes. . . . . . . . . . . . . . . . . . . . . . . . . 45 Figure 3.21: Comparison of bilevel and CART methods on truss problem with  = 6 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Figure 3.22: Welded Beam Design Problem data visualization.  = 3 is kept fixed. Problem at (b) is more difficult to solve than at (a). . . . . . . . . . . . . . . . . 47 Figure 3.23: Welded beam classifier with one rule for   = 10. . . . . . . . . . . . . . . . 48 Figure 3.24: m-ZDT Datasets. . Figure 3.25: m-DTLZ Datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Figure 3.26: Original DS1 and its modified version. . . . . . . . . . . . . . . . . . . . . . . 53 Figure 3.27: SVM on separable datasets with a hard margin. . . . . . . . . . . . . . . . . . . 54 Figure 3.28: SVM with non-separable datasets with a soft margin. . . . . . . . . . . . . . . 54 Figure 3.29: Effective degree of freedom (EoDF) V/s Accuracy for Cancer-10 dataset. The best (, ) parameter setting for this dataset is found to be ∗ = [8, 3, 8, 13, 8, 8, 13, 3, 8, 21] and ∗ = [2, 2, 5, 5, 2, 3, 3, 2, 2, 2]. . . . . . . . . . 59 Figure 3.30: A sample genetic program (GP) tree. The above GP translates to this equation:  (x) = (5 − 7) + 32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Figure 3.31: Classifiers for Cancer data:  = 0.005:  (x) = 9+ −0.537 (0.1716)(0.17132) and  = 0.01:  (x) = 2 + −0.502 (0.0776) . . . . . . . . . . . . . . . . . . . . . . . . . . 61 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 4.1: Control Loop . . . . . Figure 4.2: Mountain Car problem. It comprises of two state variables ,  and is con- -1 for deceleration, 0 for nothing and +1 for trolled using three actions: acceleration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Figure 4.3: State-action combinations for MountainCar prob. . . . . . . . . . . . . . . . . 71 Figure 4.4: Performance measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Figure 4.5: Binary-split NLDT for Discrete action control systems. . . . . . . . . . . . . . 77 xiv Figure 4.6: Multi-split NLDT configuration. Numbers in square brackets indicate class distribution of datapoints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Figure 4.7: A schematic of the proposed overall approach. . . . . . . . . . . . . . . . . . . 79 Figure 4.8: Three control problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Figure 4.9: CartPole NLDT  induced using 10,000 training samples. It is 91.45% accurate on the testing dataset but has 100% closed loop performance. Nor- malization constants are: xmin = [-0.91, -0.43, -0.05, -0.40], xmax = [1.37, 0.88, 0.10, 0.45]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Figure 4.10: Reward function for CarFollowing environment. . . . . . . . . . . . . . . . . . 88 Figure 4.11: Relative distance plot for CarFollowing problem. . . . . . . . . . . . . . . . . . 89 Figure 4.12: NLDT  for the CarFollowing problem. Normalization constants are: min = [0.25, -7.93, -1.00], max = [30.30, 0.70, 1.00]. . . . . . . . . . . . . . . . . . 90 Figure 4.13: NLDT  for MountainCar problem. Normalization constants: min = [-1.20, -0.06], max = [0.50, 0.06]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Figure 4.14: NLDT-6 (with 26 rules) and other lower depth NLDTs for the LunarLander problem. Lower depth NLDTs are extracted from the depth-6 NLDT. Each node has an associated node-id (on top) and a node-class (mentioned in bottom within parenthesis). Table 4.7 in provides results on closed-loop performance obtained using these trees before and after applying re-optimization on rule- sets using the closed-loop training procedure. Figure 4.15: Final NLDT*-3 for LunarLander prob. (cid:98) is a normalized state variable (see Section 4.6.1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 . . . . . . . . . . . . . . . . . . 96 Figure 4.16: Topology of Depth-3 NLDT() obtained from a different run on the Lu- narLander problem. The equations corresponding the conditional-nodes be- fore and after re-optimization are provided in Table 4.8. . . . . . . . . . . . . . 98   Figure 4.17: Closed-loop training plot for finetuning the rule-set corresponding to depth-3 (Table 4.8) to obtain NLDT* for LunarLander problem. . . . . . . . 100 NLDT()   Figure 5.1: Acrobot and a customized Planar Serial Manipulator benchmark problems. . . . 101 xv Figure 5.2: Action Vs. Time plot for 5-Link manipulator problem. Figure 5.2b provides the plot for NLDT* which is obtained from the NLDT  trained using SQP algorithm in lower-level. Similarly, Figure 5.2c provides the plot for NLDT* which is obtained from the NLDT  trained using RGA algorithm in lower-level.106 Figure 5.3: Action Vs. Time plot for 10-Link manipulator problem. Figure 5.3b provides the plot for NLDT* which is obtained from the NLDT  trained using SQP algorithm in lower-level. Similarly, Figure 5.3c provides the plot for NLDT* which is obtained from the NLDT  trained using RGA algorithm in lower-level.108 Figure 5.4: NLDTs for 5-Link Manipulator problem. . . . . . . . . . . . . . . . . . . . . . 110 Figure 5.5: Pruned version of NLDT* (Figure 5.4b) for 5-link manipulator problem. . . . . 111 Figure 6.1: Piecewise linear regression tree with two predictors from [1]. At each leaf node, features involved in the expression of two-regressor linear model is shown. Splits use only one feature variable. . . . . . . . . . . . . . . . . . . . 114 Figure 6.2: Conceptual layout of NLDT for regression task. . . . . . . . . . . . . . . . . . 114 Figure 6.3: Three Island Regression Problem. . . . . . . . . . . . . . . . . . . . . . . . . . 118 Figure 6.4: Computation of Lower Level Objective function  based on Algorithm 9. . . 120 Figure 6.5: Regression algorithm is applied on all datapoints. The obtained regression rule 0(x) = 0.911 − 0.900 + 0.08 is able to fit the subset of datapoints which are represented by Y-predicted (in green circles). Partition rule 0(x) will be now derived to identify the domain in -space (i.e. 0(x) ≤ 0)) where this regression rule is applicable. . . . . . . . . . . . . . . . . . . . . . . . . . 123 Figure 6.6: Result on the Pure Three Island Dataset. (b) Intertpretable AI representation using NLDT. Normalization constants are: Xmean = [0.49, 0.49], Xstd = [0.25, 0.25], Ymean = 0.05 and Ystd = 0.28. . . . 126 (a) Visualization of result. Figure 6.7: Result on the Noisy Three Island Dataset. (b) Intertpretable AI representation using NLDT. Normalization constants are: Xmean = [0.52, 0.52], Xstd = [0.28, 0.27], Ymean = 0.03 and Ystd = 0.29. . . . 127 (a) Visualization of result. Figure 6.8: A Car-following problem with continuous acceleration. The rear car is con- trolled by an AI while the car in the front moves with a random acceleration profile. This problem is similar to the problem shown in Figure 4.8b with the only difference being that the cars can now assume a value of acceleration in range −2/2 to 2/2 (unlike in the previous case where the rear car could only have an acceleration from pre-specified discrete values). . . . . . . . . . . 128 xvi Figure 6.9: ANN Output for the continuous car-following problem described in Fig- ure 6.8. Figures (a), (b) and (c) are the plots of different state variables w.r.t to the time-step. The output of the ANN is shown in Figure (d). . . . . . . . . . 129 Figure 6.10: NLDT Agent representation for continuous control task involving one action. . . 131 Figure 6.11: Plots from different simulation runs with different initial relative velocity () between cars. The NLDT’s acceleration output (red line) matches with the ANN’s acceleration output (blue line). NLDT agent behaviour is almost the same as that of ANN and can thus be used to explain the behaviour of ANN. 133 Figure 6.12: NLDT for car following problem with continuous action space. . . . . . . . . . 134 xvii LIST OF ALGORITHMS Algorithm 1: Pseudo Code to Recursively Induce NLDT. . . . . . . . . . . . . . . . . . . . 15 Algorithm 2: ObtainSplitRule subroutine. Implements a Bilevel optimization algorithm of determine a split-rule. The UpperLevel searches the space of Block Matrix B and modulus-flag . Constraint violation value for an upper level individuals comes by executing lower-level GA (LLGA) as shown in Algorithm 3. . . . . . . . . . . . 16 Algorithm 3: EvaluateUpperLevelPop subroutine. A dedicated LowerLevel optimization is executed for each upper level population member. . . . . . . . . . . . . . . . . . 17 Algorithm 4: Crossover operation in upper level GA. . . . . . . . . . . . . . . . . . . . . . 25 Algorithm 5: CartPole Rules. Normalization constants are:  = [-0.91, -0.43, -0.05, -0.40],  = [1.37, 0.88, 0.10, 0.45]. . . . . . . . . . . . . . . . . . . . . . . . . . 87 Algorithm 6: Ruleset corresponding to NLDT  (Figure 4.12) of the CarFollowing problem. 90 Algorithm 7: MountainCar NLDT . Normalization constants are:  = [-1.20, -0.06],  = [0.50, 0.06]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Algorithm 8: Pseudo-code of dual bilevel algorithm (DBA) . . . . . . . . . . . . . . . . . . 116 Algorithm 9: Algorithm to compute the lower level fitness  . . . . . . . . . . . . . . . . . 119 xviii CHAPTER 1 THE NEED AND THE START The year was 2016 and our lab was offered with a project from the Dow Chemical Company. The goal of that project was to develop an automation system to automatically predict the harmonized tariff schedule code (HS-code) given the chemical recipe of a product which the Dow Chemical Company imports or exports. Prof. Deb, Prof. Goodman and I operated as a team from MSU and collaborated with concerned experts from the Dow who were involved in manually solving the task of HS-code assignment. At the end of the project, we were able to successfully develop an automation system to predict the HS-codes of chemical products with a human-level accuracy. This work attracted attention from various corporate organizations and is currently under the process of getting patented. During the process of developing a classifier, the team at Dow used to casually intervene into those slides where I used to show them how the AI doing HS-code prediction looks like. They were intrigued by the fact that the task being handled manually at Dow was now getting done by the AI, with the same performance accuracy as that of manual HS-code assignment. More than that, they were interested to know how the AI is thinking in the background and is coming up with a HS-code for a supplied chemical product. The AI developed however was fairly complicated and no such human interpretable logic could be borrowed through the visual inspection of AI, but we got the signal that a human mind is simply not satisfied with answers an AI is providing, and it longs to understand “why" the thing worked. The implicit signal to develop an interpretable AI from Dow was made explicit a few months later. Our lab was approached by General Motors (GM), which wanted to develop a classifier to distinguish between good and bad designs. The task here was to develop a classifier which could be read by design engineers at GM. The readable AI would then serve as a tool to develop better insights regarding the design process and could be used as a recipe by design engineers to develop and innovate future car designs. Later, a similar problem of developing an explainable AI was floated to us by Ford Motors, 1 where the task was to decipher the complicated logic learnt by a deep neural network (DNN) which is controlling a car for autonomous driving. The research presented in this dissertation is a byproduct of the process we went through in answering the questions laid out by two automotive giants. It has triggered many possibilities and we believe that foundation set by our work would lead to a further advancement in the field of interpretable and explainable AI. 1.1 Our Work The field of interpretable and explainable AI has received active attention in the past four years. Broadly speaking, the interpretable AI approaches are categorized into two main groups • Intrinsic (or model based) and • Post-hoc. Within the framework of intrinsically interpretable AI, the AI model itself is transparent and interpretable. The task of generating a less complex AI is generally realized using decision trees, rule-sets [2, 3] or additive models [4, 5, 6]. The aim here is to express an AI (or a subpart of an AI) in as simple format as possible. The post-hoc interpretable methods are aimed towards analyzing the behavior of an already trained AI, mostly using visualization techniques such as partial dependence plots, histogram plots, heat-maps, attention maps, etc. A more comprehensive explanation to these is provided in [7, 8]. In this dissertation, we focus at developing an intrinsically interpretable AI solutions in form of a nonlinear decision trees (NLDT). The main criteria we consider in our research is to reduce the complexity of the transparent AI while ensuring it has a good performance for the machine learning task under consideration. Additionally, the approach being developed assumes that the features (or attributes) in a given automation task are interpretable. The optimization algorithm is then developed to induce an AI which is visually simple, involves less number of terms or functionals and can be expressed in a format which is humanly comprehensible. 2 The dissertation is organized as follows. First, a birds-eye view of the overall interpretable AI framework is provided in Chapter 2. In Chapter 3, we formally introduce the algorithm to derive NLDT for classification problems. Next, the approach developed to induce NLDT in Chapter 3 is extended to solve control problems involving discrete actions. In the same chapter, a reinforcement learning algorithm in form of closed-loop training is proposed to enhance the control performance of NLDT. The approach to arrive at interpretable controllers is improvised and tested for scalability on custom designed benchmarks in Chapter 5. A methodology to extend the concept of NLDT to solve regression and continuous control problems is discussed in Chapter 6. Concluding remarks and possible directions for future work are provided in Chapter 7. 3 CHAPTER 2 A HIGH LEVEL VIEW OF OUR INTREPRETABLE AI MODEL Before we dig into the details, I would like to provide you with a bird’s eye view of the overall approach. The interpretable and explainable AI we are developing assumes the framework of a Nonlinear Decision Tree (NLDT). A high-level perspective of NLDT for a classification task and regression task is provided in Figures 2.1 and 2.2 respectively. Figure 2.1: NLDT for classification: At each conditional node (white colored), a nonlinear condition (x) ≤ 0 is checked for a datapoint x. The datapoint x traverses the NLDT by following conditions (x) at each conditional node and reaches a leaf node (colored) where it is assigned to a class represented by the leaf node. Here (x) and (x) can be any linear or non-linear functions on the input feature vector x. Conceptually, all AI models can be represented in the NLDT format. For instance, artificial neural networks (ANNs) or support vector machines (SVMs) designed for binary classification tasks will have one single condition 0(x) ≤ 0 which will be used to predict if the supplied input datapoint x belongs to Class 1 or Class 2. A natural extension to handle multi class problems is possible either by increasing number of conditions (x) or by increasing number of splits per condition by allowing more branches from a conditional node. We shall visit the topic of handling multiple classes with our proposed NLDT approach later in the thesis. A complicated axis parallel 4 Figure 2.2: NLDT for Regression: Each terminal leaf node (green colored) here represents a regression equation (x) to predict the value of a dependent variable . However, these regression rules ( = (x)) are valid only for certain part of the feature space. These regions of the feature space are defined using the partition rules (x) at each condition node (white). decision tree (decision tree involving only rules such as  ≤ , where  is the -th component of the feature vector x) are by default in the NLDT format. Similar to classification problems, all the AI models developed to handle regression problems can be represented with the NLDT framework. In case of ANNs, the NLDT representation will involve only one node (which will be also a terminal node) with regression rule  = 0(x), where 0(x) will represent an ANN. While ANNs, SVMs or traditional CART decision trees can be used to model classification decision boundaries or regression surfaces, they are expressed with either a very complicated expression for functions (x) or (x) (like in ANNs or SVMs) but simpler topology (only upto 1 layer of NLDT is sufficient), OR are topologically very complex (like CART based decision trees) but have simple expressions for (x) and (x). Hence, existing AI models lie in either of the two extremes: • Either they have a very complicated function representation for (x) and (x), OR • They are topologically complicated and involves many nodes in the overall structure of the NLDT. 5 The above mentioned two aspects make these traditional AI models non-comprehensible and thus non-interpretable. However, what if we allow a controlled non-linearity for (x) and (x)? Maybe allowing some degree of non-linearity for (x) and (x) can help us induce a decision tree with fewer nodes while also ensuring the simplicity of rule expression for (x) and (x) as compared to black-box AI counterparts like ANN, DNN, SVM 0r CART based DT1. In this dissertation, we develop algorithms which navigate through the search space of equations to obtain visually simple linear or non-linear rules ( (x), (x)) as compared to black-box AI counterparts and eventually induce decision tree with fewer nodes. A more in-depth discussion will follow in subsequent chapters where we design dedicated algorithms to solve classification and regression problems and apply them to generate a relatively interpretable translator to policy networks for reinforcement learning tasks. The interpretable AI (IAI) models developed are relatively simple, easy to read and thus humanly more comprehensible thank black-box AI models. 1In this work, we will use black-box term for AI models represented as DNN, ANN, SVM and topologically complex CART Based trees. 6 CHAPTER 3 CLASSIFICATION In a classification task, usually a set of labelled data involving a set of features and its belonging to a specific class are provided. The task in a classification problem solving is to design an algorithm which works on the given dataset and arrives at one or more classification rules (expressed as a function of problem features) which are able to make predictions with maximum classification accuracy. A classifier can be expressed in many different ways suitable for the purpose of the application. It can be a procedure in which a feature vector is supplied to the procedure as an input and the procedure determines the class in which the feature vector belongs. It can also be a mathematical function of the feature vector, considering only a few features, instead of all features, that when takes a value within a non-overlapping range in the real space, it belongs to a specific class. It can also assume the form of a rule-set system with several “if-then-else" conditional statements. Each rule in the rule-set system helps to generate an overall classification logic. Decision Trees (DT) fall under this category, wherein the rule-set is represented in an inverted tree based structure. The non-terminal conditional nodes comprise of conditional statements and the terminal leaf-nodes have a class label associated with them. A datapoint traverses through the DT by following the rules at conditional nodes and lands at a particular leaf-node. The classifier under consideration involves several variables which are supposed to be learned (or optimized) using an optimization algorithm to achieve an optimal classification performance. In case of mathematical single-rule based classifiers like artificial neural networks (ANNs) or support- vector-machines (SVM), these variables are associated with connection-weights or coefficients and location of support-vectors respectively. If the classifier is represented as a rule-set like in DT, each of the conditional split-rules involve some variables which dictate the equation of the split-rule. The optimization problem of determining these variables of classifier involves one or more objectives specifying the quality of classification. Due to their importance in many practical problems involving design, control, identification, 7 and other machine learning related tasks, researchers have spent a lot of attention to develop efficient optimization-based classification algorithms [9, 10, 11]. While most algorithms are developed for classifying two-class data sets, the developed methods can be extended to multi-class classification problems as a hierarchical two-class classification problem or by extending them to constitute a simultaneous multi-class classification algorithm. In this chapter, we will consider the binary classification task. An extension to multi-class classification will be discussed later in Chapter 4. In most classification problem solving tasks, maximizing classification accuracy or minimizing classification error on the labelled dataset is usually used as the sole objective. However, besides the classification accuracy, in many applications, users are also interested in finding an easily interpretable classifier for various practical reasons: (i) it will help identify most important rules which are responsible for the classification task, (ii) it will help provide a more direct relationship of features to have a better insight for the underlying classification task for knowledge enhancement and future developmental purposes. The definition of an easily interpretable classifier will largely depend on the context. In terms of mathematically expressed classifier, this may mean a linear, polynomial, or posynomial function involving only a few features. In the case of a DT-based classifier, this may additionally mean a low-depth tree involving only a few branches of the tree. In our work, we propose a number of novel ideas. First, instead of a DT, we propose to develop a nonlinear decision tree (NLDT) as a classifier, in which every non-terminal conditional node will represent a nonlinear function of features (  (x)) to express a split-rule. Each of these split-rules will split the data into two non-overlapping subsets. Successive hierarchical splits of these subsets are carried out and the tree is allowed to grow until one of the termination criteria is met. We argue that flexibility of allowing nonlinear split-rule at conditional nodes (instead of a single-variable based rule, which is found in tradition ID3 based DTs [12]) will result in a more compact DT (i.e. it will have fewer nodes). Second, to derive the split-rule at a given conditional node, a dedicated bilevel-optimization algorithm is applied. The upper level optimization focuses at determining the structure of the split-rule, while the lower level optimization searches for the necessary coefficients (weights) and biases of the corresponding rule-structure as supplied by the upper level. Third, 8 our proposed methodology uses some generic classification problem information to make the overall bilevel optimization algorithm computationally efficient. Fourth, we emphasize simplistic rule structures in our bilevel optimization method so that obtained rules are also relatively more interpretable than the black-box AI counterparts like SVM or ANN. In the remainder of this chapter, we provide a brief survey of the existing approaches of inducing DTs in Section 3.1. A detailed description about the problem of inducing interpretable DTs from optimization perspective and a high-level view on proposed approach is provided in Section 3.2. Next, we provide an in-depth discussion of the bilevel-optimization algorithm which is adopted to derive interpretable split-rules at each conditional node of NLDT in Section 3.3. Section 3.6 provides a brief overview on the post-processing method which is used to prune the tree to simplify its topology. Compilation of results on standard classification and engineering problems is provided in section 3.7. The chapter ends with concluding remarks and some highlights on future work in Section 3.9. 3.1 Past Studies There exist many studies involving machine learning and data analytics methods for discovering rules from data. Here, we provide a brief mention of some studies which are close to our proposed study. Traditional induction of DTs is done in an axis-parallel fashion [12], wherein each split-rule is of type  ≤  or  ≥ . Algorithms such as ID3, CART and C4.5 are among the popular ones in this category. The work done in the past to generate non-axis-parallel trees can be found in [13, 14, 15], where the researchers rely on randomized algorithms to search for multi-variate hyperplanes. Work done in [16, 17] use evolutionary algorithms to induce oblique DTs. The idea of OCI [15] is extended in [18] to generate nonlinear quadratic split-rules in a DT. Bennett Et al. [19] uses SVM to generate linear/nonlinear split-rules. However, these works do not address certain key practical considerations, such as the complexity of the combined split-rules and handling of biased data. Some works which take the aspect of 9 complexity of rule into consideration are discussed next. In [20], ellipsoidal and interval based rules are determined using the set of support-vectors and prototype points/vertex points. The authors there primarily focus at coming up with compact set of if-then-else rules which are comprehensible. Despite its intuitiveness, the approach proposed in [20] doesn’t result into comprehensible set of rules on high-dimensional datasets. Another approach suggested in [21] uses the decompositional based technique of deriving rules using the output of a linear-SVM classifier. The rules here are expressed as hypercubes. The method proposed is computationally fast, but it lacks in its scope to address nonlinear decision boundaries and its performance is limited by the performance of a regular linear-SVM. On the other hand, this approach has a tendency to generate more rules if the data is distributed parallel to the decision boundary. The study conducted in [22] uses a trained neural-network as an oracle to develop a DT of at least m-of-n type rule-sets (similar to the one described in ID2-of-3 [23]). The strength of this approach lies in its potential to scale up. Its pedagogical approach of inducing DT by referring to the oracle empowers it to create as many synthetic datapoints as desired using the oracle neural-network. However, its accuracy on unseen dataset usually falls by about 3% from the corresponding oracle neural-network counterpart. Authors in [24] use a pedagogical technique to evolve comprehensible rules using genetic-programming. The algorithm G-REX proposed in this work considers the fidelity (i.e. how closely can the evolved AI agent mimic the behaviour of the oracle NN) as the primary objective and the compactness of the rule expression is penalized using a parameter to evolve interpretable set of fuzzy rules for a classification problem. The approach is good enough to produce comprehensible rule-set, but it needs tweaking and fine tuning of the penalty parameter. A nice summary of the above mentioned methods is provided in [25]. in [26] implemented a three-objective strategy to evolve fuzzy set of rules using a multi-objective genetic algorithm. The objectives considered in that research were the classification accuracy, number of rules in the evolved rule-set and the total number of antecedent conditions. In [27], an artificial neural network (ANN) is used as a final classifier and a multi-objective optimization approach is employed to find simple and accurate classifier. The simplicity is expressed by the number of nodes. Ishibuchi et al. 10 Hand calculations are then used to analytically express the neural network with a mathematical function. This procedure however becomes intractable when the evolved neural network has a large number of nodes. Genetic programming (GP) based methods have been found efficient in deriving relevant features from the set of original features which are then used to generate a classifier [28, 29]. In some studies, the entire classifier is encoded with a genetic-representation and the genome is then evolved using GP. Some works conducted in this regard also simultaneously consider complexity of the classifier [30, 31, 32, 33], but those have been limited to axis-parallel or oblique splits. Application of GP to induce nonlinear multivariate decision tree can be found in [34, 35]. Our approach of inducing nonlinear decision tree is conceptually similar to the idea discussed in [34], where the DT is induced in the top-down way and at each internal conditional node, the nonlinear split-rule is derived using a GP. Here, the fitness of a GP solution is expressed as a weighted-sum of misclassification-rates and generalizability term. However, the interpretability aspect of the split-rule does not get captured anywhere in the fitness assignment and authors do not report the complexity of the final classifier. No further extension of this study is found in the literature. In our proposed approach, we attempt to evolve nonlinear DTs which are robust to different biases in the dataset and simultaneously target in evolving nonlinear yet simpler polynomial split-rules at each conditional node of NLDT with a dedicated bilevel genetic algorithm (shown pictorially in Figure 3.1). In oppose to the method discussed in [34], where the fitness of GP individual doesn’t capture the notion of complexity of rule expression, the bilevel-optimization proposed in our work deals with the aspects of interpretability and performance of split-rule in a logically hierarchical manner (conceptually illustrated in Figure 3.2). Results indicate that the proposed bilevel algorithm for evolving nonlinear split-rules eventually generates classifiers which are simpler than other black-box AI and traditional machine learning (ML) based classifiers and have high or comparable classification accuracy on all the test problems used in this study. 11 3.2 Proposed Approach 3.2.1 Classifier Representation Using Nonlinear Decision Tree As mentioned before, the classifier developed in this work is represented in the form of a nonlinear decision tree (NLDT) as depicted in Figure 3.1. Figure 3.1: An illustration of a Nonlinear Decision Tree (NLDT) for a classification problem. For a given conditional node, the split-rule function (x) is derived using a dedicated bilevel- optimization procedure. At first, the algorithm is applied to the entire data on Node 1 to obtain split-rule 1(x) ≤ 0. If this split-rule is not able to partition the data perfectly, then a similar bilevel optimization is invoked at Node 2 and Node 3 to determine 2(x) and 3(x), respectively, on the subset of data present in Node 1 (wherein the Node 2 will have the data satisfying the split-rule 1(x) ≤ 0 and Node 3 will have data of Node 1 which violates split-rule 1(x) ≤ 0). The process continues until a certain termination criteria is met. Terminal leaf-nodes are assigned with a class-label based on the distribution of data in that node. The decision tree (DT) comprises of conditional (or non-terminal) nodes and terminal leaf- nodes. Each conditional-node of the DT has a rule associated to it. In NLDT, we allow this rule to assume a nonlinear equation. A datapoint x traverses the DT based on conditions defined by split-rules at conditional nodes and eventually lands at a particular terminal leaf node. To make the DT more interpretable, two aspects are considered: 12 1. Simplicity of split-rule (x) at each conditional nodes (see Figure 3.2) and 2. Simplicity of the topology of overall DT, which is computed by the counting total number of conditional-nodes in the DT. Under an ideal scenario, the simplest split-rule will involve just one attribute (or feature) and can be expressed as  (x) :  −  ≤ 0. Here, the split occurs based on the Ò component of the overall feature vector x. Since for most of the problems, just one such simple split-rule is not sufficient to partition the data into two-classes, many such splits are used in hierarchical fashion to partition the dataset, wherein the first split is done on the entire training-dataset and the subsequent splits are conducted on the subsets of the original training-dataset. This exercise usually resolves into a topologically complicated DT. DTs induced using algorithms such as ID.3 and C4.5 fall under this category. In the present work, we refer to these trees with CART. On the other extreme, a topologically-simplest tree will correspond to the DT involving only one conditional node, but the associated rule is complex to interpret. SVM based classifiers fall in this category, wherein decision-boundary is expressed in form of a complicated mathematical equation. Another way to represent a classifier is through an artificial neural network (ANN), which when attempted to express analytically, will resort into a complicated nonlinear function  (x) without any easy interpretability. In this work, we propose a novel and compromise approach to above two extreme cases so that the resulting DT is not deep and associated split-rule function at each conditional node (x) assume a nonlinear form with controlled complexity and is easily interpretable as compared to the rule equation corresponding to SVM (or ANN). Allowing the flexibility to have nonlinear split-rules is believed to induce the tree which is topologically simpler than the CART counterpart, while simultaneously ensuring simplicity of split-functions (x). If the split at the root-node is not sufficient to partition the dataset into two classes, subsequent nonlinear splits are determined in a hierarchical fashion, similar to popular ID3 and C4.5 approaches of inducing CART based 13 decision trees. This process continues until one of the termination criteria is met1. The challenging task of obtaining nonlinear split-rule  (x) ≤ 0 for each conditional-node is carried out using a dedicated bilevel-optimization, which we discuss in Section 3.2.2. A high-level perspective of this is provided in Figure 3.1. During the training phase, NLDT is induced using a recursive algorithm as shown in Algorithm 1. Brief description about subroutines used in Algorithm 1 is provided below and relevant pseudo codes for ObtainSplitRule subroutine and an evaluator for upper-level GA are provided in Algorithm 2 and 3 respectively. • ObtainSplitRule(Data) – Input: ( × ( + 1))-matrix representing the dataset comprising of  datapoints with  features. The last column indicates the class-label. – Output: Nonlinear split-rule  (x) ≤ 0. – Method: Bilevel optimization is used to determine  (x). Details regarding this are discussed in Sec 3.3. • SplitNode(Data, SplitRule) – Input: Data, split-rule  (x) ≤ 0. – Output: LeftNode and RightNode which are node-data-structures, where LeftN- ode.Data represents datapoints in the input Data satisfying  (x) ≤ 0 while RightN- ode.Data represents datapoints in the input Data violating the split-rule. 3.2.2 Split-Rule Discovery Using Bilevel Optimization In this chapter, we restrict the expression of split-rule at a conditional node of the decision tree operating on a feature vector x to assume the following structure:  (x, w, , B) ≤ 0, 1Details regarding termination criteria can be found in the Section 3.10 Rule : (3.1) 14 Algorithm 1: Pseudo Code to Recursively Induce NLDT. Input: Dataset Function UpdatedNode = InduceNLDT(Node, depth): Node.depth = depth; if TerminationSatisfied(Node) then Node.Type = ‘leaf’; else Node.Type = ‘conditional’; Node.SplitRule = ObtainSplitRule(Node.Data); [LeftNode, RightNode] = SplitNode(Node.Data, Node.SplitRule); Node.LeftNode = InduceNLDT(LeftNode, depth + 1); Node.RightNode = InduceNLDT(RightNode, depth + 1); end UpdatedNode = Node; // NLDT Induction Algorithm RootNode.Data = Dataset; Tree = InduceNLDT(RootNode, 0) where  (x, w, , B) can be expressed in two different forms depending on whether a modulus operator  is sought or not:  (x, w, , B) = 1 + 11 + . . . +   , (cid:12)(cid:12)1 + 11 + . . . +   (cid:12)(cid:12) −(cid:12)(cid:12)2(cid:12)(cid:12) , if  = 0, if  = 1. (3.2) Here, ’s are coefficients or weights of several power-laws (’s), ’s are biases,  is the modulus- flag which indicates the presence or absence of the modulus operator,  is a user-specified parameter to indicate the maximum number of power-laws () which can exist in the expression of  (x), and  represents a power-law rule of type:  =  1 ×  1 2 × . . . ×  2   , (3.3) 15 Algorithm 2: ObtainSplitRule subroutine. Implements a Bilevel optimization algorithm of determine a split-rule. The UpperLevel searches the space of Block Matrix B and modulus-flag . Constraint violation value for an upper level individuals comes by executing lower-level GA (LLGA) as shown in Algorithm 3. Input: Data Output : SplitRule // split-rule  (x) Function SplitRule = ObtainSplitRule(Data): Initialize:  // Upper Level population // Execute LLGA (Algorithm 3)  = EvaluateUpperLevelPop(); // Upper Level GA Loop for gen = 1:MaxGen do  ); = Selection(); = Crossover(   Ò   );  = Mutation(Ò // Execute LLGA (Algorithm 3)  = EvaluateUpperLevelPop(); // Elite preservation  = SelectElite( if TerminationConditionSatisfied then , );  break; end end // Extract best solution in  SplitRule = ObtainBestInd(); B is a block-matrix of exponents  , given as follows:  B = 11 12 13 . . . 1 21 22 23 ... ... ... . . . 2 ... . . .  1  2  3 . . .    . (3.4) Exponents  ’s are allowed to assume values from a specified discrete set E. In this work, we set  = 3 and E = {−3,−2,−1, 0, 1, 2, 3} to limit the maximum complexity of the rule, however value of  and set E can be changed by the user. Parameters  and  are real-valued variables in [−1, 1]. The feature vector x is a datapoint in a -dimensional space. Another user-defined parameter max 16 Algorithm 3: EvaluateUpperLevelPop subroutine. A dedicated LowerLevel optimization is executed for each upper level population member. Input:  // Upper Level population (cid:48) Output  // Evaluated Population : Function (cid:48)  = EvaluateUpperLevelPop(): for i = 1:PopSize do // Execute LLGA (see Section 3.3.3) [, w, ] = LLGA(Data, B, ); [[i].w, [i].] = [w, ]; // Constraint and Fitness Value [i].CV =  − ; [i]. = MaxActiveTerms(B); end (cid:48)  =  controls the maximum number of variables that can be present in each power-law . The default is max =  (i.e. dimension of the feature space). 3.3 Bilevel Approach for Split-Rule Identification 3.3.1 The Hierarchical Objectives to derive split-rule Here, we illustrate the need of formulating the problem of split-rule identification as a bilevel- problem using Figure 3.2. The geometry and shape of split-rules is defined by exponent terms   appearing in its expres- sion (Eq. 3.2 and 3.3) while the orientation and location of the split-rule in the feature-space is dictated by the values of coefficients  and biases  (Eq. 3.2). Thus, the above optimization task of estimating split-rule  (x) involves two types of variables: 1. Discrete: -matrix representing exponents of -terms (i.e.   as shown in Eq. 3.4) and the modulus flag  indicating the presence or absence of a modulus operator in the expression of  (x), and 2. Continuous: weights w and biases  in each rule function  (x). 17 Figure 3.2: In this illustration, a two-class data comprising of two features 1 and 2 is provided. ,  and  are three different split-rules. Split-rule  is able to optimally partition the data, but is complex and may not be interpretable. Rule  is simple, however it does not produce accurate classification. Rule  is simple as well as classifies the data accurately. Identification of a good structure for  terms and value of  is a more difficult task, compared to the weight and bias identification. We argue that if both types of variables are concatenated in a single genome, a good (B, ) combination may be associated with a not-so-good (w, ) combination (like split-rule  in Fig. 3.2), thereby making the whole solution vulnerable to deletion during evolution. It may be better to separate the search of a good structure of (B, ) combination from the weight-bias search at the same level, and search for the best weight-bias combination for every (B, ) pair as a detailed task. This hierarchical structure of the variables motivates us to employ a bilevel optimization approach [36] to handle above variables. The upper level optimization algorithm searches the space of B and . Then, for each (B, ) pair, the lower level optimization algorithm is invoked to determine the optimal values of w and . Referring to Fig. 3.2, the upper- level will search for the structure of  (x) which might have a nonlinearity (like in rule ) or might be linear (like rule ). Then, for each upper-level solution (for instance a solution corresponding to a linear rule structure), the lower-level will adjust its weights and biases to determine its optimal location . 18 The bilevel optimization problem can be then formulated as shown below: Min. (B, , w∗, ∗), s.t. (w∗, ∗) ∈ argmin(cid:8)(w, )|(B,)(cid:12)(cid:12) −1 ≤  ≤ 1, ∀,  ∈ [−1, 1]+1(cid:9) , (w, )|(B,) ≤  , (3.5)  ∈ {0, 1},   ∈ {−3,−2,−1, 0, 1, 2, 3}, where the upper level objective  is quantifies the simplicity of the split-rule, and, the lower-level objective  quantifies quality of split resulting due to split-rule  (x) ≤ 0. An upper-level solution is considered feasible only if it is able to partition the data within some acceptable limit which is set by a parameter . A more detailed explanation regarding the upper level objective  and the lower-level objective  is provided in next sections. A pseudo code of the bilevel algorithm to obtain split-rule is provided in Alogrithm 2 and the pseudo code provided in Algorithm 3 gives on overview on the evaluation of population-pool in upper level GA. In next sections, we provide a mathematical insight on the procedure to compute lower-level and upper level objective functions,  and  respectively. 3.3.1.1 Computation of  The impurity of a node in a decision tree is defined by the distribution of data present in the node. In this work, we use gini-score to gauge the impurity of a node. Thus, for a given parent node  in a decision tree and two child nodes  and  resulting from it, the net-impurity of child nodes () can be computed as follows: (cid:18)   (w, )|(B,) = (cid:19) Gini() +   Gini() , (w,,B,) (3.6) where  is the total number of datapoints present in the parent node , and  and  indicate the total number of points present in left () and right () child nodes, respectively. Datapoints in (x) ≤ 0, x ∈ ) (Eq 3.1) go to left node, while the  which satisfy the split-rule at node  (i.e. rest go to the right node. The objective  of minimizing the net-impurity of child nodes favors 19 the creation of purer child nodes (i.e. nodes with a low gini-score value). For an ideal split, the left and right child nodes will have completely homogeneous data, i.e. all datapoints present in the node will belong to one class (say left node will have all points from Class-1 while right node will have all points from Class-2). 3.3.1.2 Computation of  The objective  is subjective in its form since it targets at dealing with a subjective notion of generating visually simple equations of split-rule. Usually, equations with more variables and terms seem visually complicated. Taking this aspect into consideration,  is set as the total number of non-zero exponent terms present in the overall equation of the split-rule (Eq. 3.1). Mathematically, this can be represented with the following equation: (B, , w∗, ∗) = ( ), (3.7)   =1 =1  where () = 1,  ࣔ 0, 0,  = 0. Here, we use only B to define , but another more comprehensive simplicity term involving presence or absence of modulus operators and relative optimal weight/bias values of the rule can also be used. 3.3.2 Upper Level Optimization (ULGA) A genetic algorithm (GA) is implemented to explore the search space of power-laws ’s and modulus flag  in the upper level. The genome is represented as a tuple (B, ) wherein B is a matrix as shown in Equation 3.4 and  assumes a Boolean value of 0 or 1. The upper level GA focuses at estimating a simple equation of the split-rule within a desired value of net impurity () of child nodes. Thus, the optimization problem for upper level is 20 formulated as a single objective constrained optimization problem as shown in Eq. 3.5. The constraint function (w, )|(B,) is evaluated at the lower level of our bilevel optimization framework. The threshold value  indicates the desired value of net-impurity of resultant child nodes (Eq. 3.6). In our experiments, we set  to be 0.05. As mentioned before, a solution satisfying this constraint implies creation of purer child nodes. Minimization of objective function  should result in a simplistic structure of the rule and the optimization will also reveal key variables needed in the overall expression. 3.3.2.1 Custom Initialization for Upper Level GA Minimization of objective  (given in Eq 3.7) requires to have less number of active (or non-zero) exponents in the expression of split-rule (Eq 3.1). To facilitate this, the population is initialized with a restriction of having only one active (i.e. non-zero) exponent in the expression of split-rule, i.e., any-one of the  ’s in the block matrix B is set to a non-zero value from a user-specified set E and the rest of the elements of matrix B are set to zero. Note here that only 2 number of unique individuals ( individuals with  = 0 and  individuals with  = 1) can exist which satisfy the above mentioned restriction. If the population size for upper level GA exceeds 2, then remaining individuals are initialized with two non-zero active-terms (i.e. randomly, two  ’s are set to a non-zero value, while rest of the elements in B are fixed to zero) and the process continues until all population members in the upper level are initialized uniquely. As the upper level GA progresses, incremental enhancement in rule complexity is realized through crossover and mutation operations, which are described next. 3.3.2.2 Ranking of Upper Level Solutions The binary-tournament selection operation and (+) survival selection strategies are implemented for the upper level GA. Selection operators use the following hierarchical ranking criteria to perform selection: 21 Definition 3.3.1 For two individuals  and  in the upper level, rank() is better than rank( ), when any of the following is true: •  and  are both infeasible AND () < ( ), •  is feasible (i.e. () ≤ ) AND  is infeasible (i.e. ( ) > ), •  and  both are feasible AND () < ( ), •  and  both are feasible AND () = ( ) AND () < ( ), •  and  both are feasible AND () = ( ) AND () = ( ) AND () < ( ). 3.3.2.3 Custom Crossover Operator for Upper Level GA The main challenging aspect of developing an evolutionary algorithm is to design meaningful genetic operators. Crossover operation is one such stage in GAs where there is an information exchange between two (or more) individual species in the parent population pool. The crossover operation between two (or more) participating parents creates one or more children. For a real coded GA involving only real continous variables, Simulated Binary Crossover (SBX) [37] is one such meaningful genetic operator for exchanging information between two parents to create two children. A 2D illustration of the concept of meaningful crossover is provided in Figure 3.3. In our case, in the upper level, the crossover operator needs to be designed to meaningfully crossover two parent equations. To do that, population members are first clustered according to their -value – all individuals with  = 0 belong to one cluster and all individuals with  = 1 belong to another cluster. The crossover operation is then restricted to individuals belonging to the same cluster. The crossover operation selects two parents 1 and 2 from the same cluster having block matrix BP1 . As mentioned before, each row of a block matrix B represents a power-law  (Eq. 3.3). The crossover to create two children with block matrices BC1 and BP2 and BC2 22 (a) Meaningful Crossover (SBX) example (b) Meaningless Crossover example Figure 3.3: Illustrations of meaningful and meaningless crossover operations. In the meaningful crossover operation, children (green dots) will carry information of parents (red dots). Hence, they will be located with high probability at the corners of the rectangle . On the other hand, the meaningless crossover operation creates children at random locations, without taking leverage of parents. operation is executed separately on each row of block matrices of participating parents to generate corresponding rows of child block matrices. First, rows of block matrices of participating parents are rearranged in descending order of the magnitude of their corresponding coefficient values (i.e. weights  of Eq. 3.2). This way, the most influential power-law in the equation of parent individual will be shuffled to the first row of the rearranged block matrix B(cid:48). The second row of the rearranged block matrix B(cid:48) will have exponents corresponding to the second most influential power-law and and B(cid:48) so on. Let the parent block matrices be represented as B(cid:48) after this rearrangement. The P2 P1 crossover operation is then conducted element wise on each row of B(cid:48) . Doing so allows P1 us to mate those rows of the participating parents which had similar importance in the respective equations of the split-rule. For better understanding of the cross-over operation, a psuedo code is provided in Algorithm 4 and a schematic is provided in Figure 3.4. and B(cid:48) P2 23 (a) Sorting of B-mat (b) Crossover Operation Figure 3.4: Upper Level Crossover Operation 3.3.2.4 Custom Mutation Operator for Upper Level GA In genetic algorithms, the purpose of mutation operation is to conduct local perturbations to parent solutions. A meaningful and controlled mutation generates the mutated individual in the vicinity of the unmutated original value2. In real coded GAs, polynomial mutation [38] is popularly used to execute this task. An illustrative plot of polynomial mutation operator is provided in Figure 3.5a. Since the upper level of our GA focuses on the search space involving discrete variables, we use the discretized version of mutation to mutate the values of exponents   (Eq. 3.3) as shown in Figure 3.5. For a given upper level solution with block matrix B and modulus flag , the probability with which  ’s and  gets mutated is controlled by parameter  . From the experiments, value  = 1/ is found to work well. The mutation operation then changes the value of exponents of  2while most of the times mutation is desired to create local perturbations, under some scenario, a randomized mutation is favoured to increase the diversity of the population pool. 24 Child block matrices BC1 input : Block matrices BP1 output : B(cid:48) P1 B(cid:48) P2  = Size(BP1,1);  = Size(BP1,2); for  ← 1 to  do = SortRows(BP1,wP1); = SortRows(BP2, wP2); for  ← 1 to  do  = rand(); // random no. if  ≤ 0.5 then else BC1(, ) = BP1(, ); BC2(, ) = BP2(, ); BC1(, ) = BP2(, ); BC2(, ) = BP1(, ); between 0 and 1. Algorithm 4: Crossover operation in upper level GA. and BP2 , and weight vectors wP1 and BC2 . and wP2 . end end end   and the modulus flag . Let the domain of   be given by E, which is a sorted finite set of allowable exponents. Since E is sorted, its elements can be accessed using an integer-id , with id-value of  = 1 representing the smallest exponent and  =  representing the largest exponent. In our case, E = [−3,−2,−1, 0, 1, 2, 3] (making  = 7). The mechanism with which the mutation operator mutates the value of   for any arbitrary sorted array E is illustrated in Fig. 3.5. Here, the red tile indicates the index () of exponent   in array E. Red shaded vertical bars indicate the probability distribution for obtaining mutated-values. The   can be mutated to either of  − 2,  − 1,  + 1, or  + 2 id-values with a probability of , , , and  respectively. The parameter  is preferred to be greater than one and is supplied by the user. The parameter  is then computed using the following equation:  = 1 2(1 + ) . 25 (3.8) (a) Polynomial Mutation on the -th component of an individual x. The probability distribution curve (, ) dictates the location where the -th ) will be after mutating the parent component component of child (i.e.   has a higher probability of getting (i.e.   . The steepness of the probability curve created near the parent value   (, ) can be adjusted by parameter , with higher values of  giving steeper curve. ). Here, the mutated value   (b) Discretized Mutation mimics the behaviour of real polynomial mutation discussed in figure above. Red vertical bars indicate the probability distri- bution of obtaining mutated values. The parent unmutated value is located at -th index value. Here, we intentionally avoid creating mutated value on the parent value and encourage creation of mutated values at immediate two neighboring allowable values to facilitate the creation of unique solutions. Figure 3.5: Mutation for Upper Level GA The above formula to compute  is derived by equating the sum of probabilities to 1. In our experiments, we have set  = 3. The value of the modulus flag  is mutated randomly to either 0 or 1 with 50% probability. In order to bias the search to create simpler rules (i.e. split-rules with a small number of non- zero  ’s), we introduce a parameter . The value of parameter  indicates the probability with which a variable   participating in mutation is set to zero (i.e.   ← 0). In our case, we use  = 0.75. Thus,   → 0 with a net probability of   × . 26 3.3.2.5 Duplicate Update Operator After creating the offspring population, we attempt to eliminate duplicate population members. For each duplicate found in the child population, the block-matrix B of that individual is randomly mutated using the following equation B( , ) = E(), (3.9) where  ,  and  are randomly chosen integers within specified limits. This process is repeated for all members of child population, until each member is unique within the child population. This operation allows to maintain diversity and encourages the search to generate multiple novel and optimized equations of the split-rule. (+) survival selection operation is then applied on the combined child and parent population. The selected elites then go to the next generation as parents and the process repeats. The parameter setting for upper level GA can be found in Section 3.10. 3.3.3 Lower Level Optimization (LLGA) For a given population member of the upper level GA (with block matrix B and modulus flag  as variables), the lower level optimization problem determines the coefficients  (Eq. 3.2) and biases  such that (w, )|(B,) (Eq. 3.6) is minimized. Thus, the lower level optimization problem can be stated as below: Minimize: (w, )|(B,), w ∈ [−1, 1] ,  ∈ [−1, 1]+1. (3.10) We describe the details of the real-parameter GA used for solving the lower level problem next. 27 3.3.3.1 Custom Initialization for Lower Level GA One of the most crucial and effective operator to determine optimal values of variables of lower level optimization was the initialization operation. As stated before in Section 3.3.1.1, the lower level objective function  provides the quantification to the net impurity of child nodes (Eq. 3.6). The number of points going to left child node  and right child node  depends on the sign of the split function  (x) of the parent node, where  (x) ≤ 0 =⇒ x →  and  (x) > 0 =⇒ x → . Geometrically the split function  (x, B, , w, ) is linear in the transformed B-space since 3. Hence, the lower level searches the optimal orientation (dictated by w) and location (dictated by ) of the linear straight line in the mapped B-space as shown in Figure 3.6. =1      (cid:12)(cid:12)(cid:12) =1   + 0 =1   + 0  (cid:12)(cid:12)(cid:12) − |1| if  = 0 if  = 1  (x, B, , w, ) = where  = Figure 3.6: Illustration of Data in B-Space. The split function  (x, B, , w, ) is a straight line in B-space. Dotted lines represent the convex hull engulfing the data. Different possible split functions are shown by straight lines , ,  and . Since no separation of data is evident if the straight line representing the split function in 3technically, with  = 1, there will be two linear lines which will represent the split fuction. However, each line will be separating two classes. 28 B-space lies outside of the convex hull (dotted lines in Figure 3.6), the function landscape of (w, )|B, is flat in the region outside the convex hull (since either  = 0 or  = 0 in Eq. 3.6). This constitutes a very significant portion of the domain of . If the initial population pool of straight lines lies in the region outside of the convex hull, then the optimization algorithm gets no signal and motivation to move and reorient pool of initialized straight lines. Under certain scenarios, crossover and mutation operators might generate solutions representing straight lines passing through the dataset (shown by  in Figure 3.6), thereby partitioning it into two subsets  (x) ≤ 0 and  (x) > 0. In practice, a lot of computation is wasted while arriving at a situation where most of the individuals in the population pool represent straight lines in B-space which crosses the convex hull and partition the data. In a bilevel optimization, the lower level problem must be solved for every upper level variable vector, thereby requiring a computationally fast algorithm for the lower level problem. Considering these aspects, instead of creating every population member in lower level randomly, we use the mixed dipole concept [39, 17, 40] to initialize the population pool of straight lines in B-space to ensure each of the initialized member partitions atleast one point in the dataset from the rest as shown in Figure 3.7. This smart initialization facilitates faster convergence towards optimal (or near-optimal) values of w and . The dipole based initialization is done in the following way: Step 1: Randomly pick two datapoints xA and xB such that xA ∈ Class-1 and xB ∈ Class-2. Step 2: Weights wh and bias Ò of the hyperplane  (where (wh, Ò) : wh · x + Ò = 0) separating xA and xB can be computed with equation mentioned below wh = xA − xB, Ò = Ɗ × (−wh · xB) − (1 − Ɗ) × (wh · xA), (3.11) where Ɗ is a random number between 0 and 1. This is pictorially represented in Fig. 3.7. 29 Step 3: Set values of variables w and  of a population member of lower level GA as shown below w = wh, 1 = Ò, 2 = min(Ɗ, 1 − Ɗ) if  = 1. (3.12) (3.13) (3.14) Step 4: Repeat Steps 1-3 until all population members of lower level GA are initialized. Figure 3.7: Mixed dipole (xA, xB) and a hyperplane (wh, Ò). As mentioned before, this method of initialization is found to be more efficient than randomly initializing w and Ǝ since in the region beyond the convex-hull bounding the training dataset (which constitutes major volume of the domain space), the landscape of  is flat thereby making any optimization algorithm to stagnate. 3.3.3.2 Selection, Crossover, and Mutation for Lower Level GA The binary tournament selection [38], SBX crossover [37] and polynomial mutation [41] were used to create the offspring solutions. ( + ) survival selection strategy was then adopted to preserve elites. 30 3.3.3.3 Termination Criteria for Lower level GA The lower level GA is terminated after a maximum of 50 generations were reached or when the change in the lower level objective value (i.e. ) is less than 0.01% in the past 10 consecutive generations. Other parameters setting for lower-level GA can be found in Section 3.10. 3.4 Ablation Studies and Comparison In this section, we test the efficacy of lower-level and upper-level optimization algorithm by applying them on four customized problems (DS1-DS4) which are shown pictorially in Fig. 3.8. Their performances are compared against two standard classifiers: CART and support-vector- machines (SVM)4. 3.4.1 Ablation Studies on Lower Level GA Since the lower level GA (LLGA) focuses at determining coefficients  and bias  of linear split-rule in B-space, DS1 and DS2 datasets are used to guage its efficacy. The block matrix B is fixed to 2 × 2 identity matrix and the modulus-flag  is set to 0. Hence, the split-rule (classifier) to be evolved is  (x) = 1 + 11 + 22, where 1, 2 and 1 are to be determined by the LLGA. The classifier generated by our LLGA is compared with that obtained using the standard SVM algorithm (without any kernel-trick) on DS1 dataset. Since DS2 dataset is unbalanced, SMOTE algorithm [42] is first applied to over-sample datapoints of the minority class before generating the classifier using SVM. For the sake of completeness, classifier generated by SVM on DS2 dataset, without any oversampling, is also compared against the ones obtained using LLGA and 4Here, the support vector machine is applied without any kernal-trick. 31 (a) DS1 Dataset. (b) DS2 Dataset. (c) DS3 Dataset. (d) DS4 Dataset. Figure 3.8: Customized datasets. DS1 is linear and balanced, DS2 is linear but unbalanced with minority class having 10x less points. DS3 is nonlinear and DS4 has a sandwiched distribution. SMOTE+SVM. The results are shown in Fig. 3.9 and Fig. 3.10, respectively, for DS1 and DS2 datasets. Type-1 and Type-2 errors are reported for each experiment5. It is clearly evident from the results shown in Fig. 3.9 that the proposed customized LLGA is more efficient than SVM in arriving at the desired split-rule as a classifier. The LLGA is able to find the decision boundary with 100% prediction accuracy on training and testing datasets. However, the classifier generated by SVM has an overlap with the cloud of datapoints belonging to the scattered class and there-by resulting into the accuracy of 89% on the testing dataset. On the DS2 dataset, SVM is required to rely on SMOTE to synthetically generate datapoints 5Type-1 error indicates the percentage of datapoints belonging to Class-1 getting classified as Class-2. Type-2 error indicates the percentage of points belonging to Class-2 getting classified as Class-1). 32 (a) LLGA: (0%, 0%); (0%, 1%) (b) SVM: (0%, 17%); (0%, 10%) Figure 3.9: Results on DS1 dataset to benchmark LLGA. Numbers in first parenthesis indicate Type-1 and Type-2 error on training data and the numbers in second parenthesis indicate Type-1 and Type-2 error on testing data. in order to achieve respectable performance as can be seen from Fig. 3.10b (without SMOTE) and 3.10c (with SMOTE). However, LLGA performs better without SMOTE, as shown in Fig. 3.10a. Ablation study conducted above on LLGA clearly indicates that the customized optimization algorithm developed for estimating weights w and bias  in the expression of a split-rule is reliable and efficient and could easily outperform the classical SVM algorithm. 3.4.2 Ablation Studies on the Proposed Bilevel GA As mentioned before, the upper level of our bilevel algorithm aims at determining optimal power- law structures (i.e. s) and the value of modulus flag . Experiments are performed on DS3 and DS4 datasets to guage the efficacy of upper level (and thus the overall bilevel) GA. No prior information about the optimal values of block matrix B and modulus flag  is supplied. Thus, the structure of the equation of the split-rule is unknown to the algorithm. Results of these experiments are shown in Figs. 3.11a and 3.11b. As can be seen from Fig. 3.11a, the proposed bilevel algorithm is able to find quatratic split-rule which is able to partition the DS3 dataset with no Type-I or Type-II error. Results on DS4 dataset validated the importance of involving modulus-flag  as a search variable for upper level GA in 33 (a) LLGA : (0%, 0%); (0%, 0%) (b) SVM: (100%, 0%); (100%, 0%) (c) SVM+SMOTE: (0%, 1.5%); (0%, 1.5%) Figure 3.10: Results on DS2 dataset to benchmark LLGA. addition to the block-matrix B. The algorithm is able to converge at the optimal tree topology (Fig. 3.11b) with 100% classification accuracy. It is to note here that the algorithm had no prior knowledge about optimal block matrix B of exponents  . This observation suggests that the upper level GA is successfully able to navigate through the search space of block-matrices B and modulus-flag  (which is a binary variable) to arrive at the optimal power-laws and split-rule. Fig. 3.12 and Fig. 3.13 shows plots of a decision boundary (split-rule) obtained using our bilevel-algorithm on DS3 and DS4 datasets, their corresponding NLDTs and its comparison against the decision-tree obtain using traditional CART algorithm on DS3 and DS4 datasets. 34 11.522.533.5-0.200.20.40.60.811.2 (a) DS3: (0%, 0%); (0%, 0%) (b) DS4: (0%, 0%); (0%, 0%) Figure 3.11: Bilevel GA results on DS3 and DS4. 3.5 Visualization of split-rule: X-Space and B-Space A comprehensible visualization of the decision boundary is possible if dimension of the feature space is up to three. However, if the dimension  of the original feature space (or, X-space) is larger than three, the decision-boundary can be visualized on the transformed-feature-space (or B-space). In our experiments, the maximum number of allowable power-law-rules () is fixed to three (i.e.  = 3). Thus, any datapoint x in a -dimensional X-space can be mapped to a three-dimensional B-space (Eq. 3.3). A conceptual illustration of this mapping is provided in Fig. 3.14, where, for the sake of simplicity, a three-dimensional X-space (1 to 3) is mapped to a two-dimensional B-space using two power-law-rules (1 and 2). It is to note here that the power-law rules ’s are not known a priori. The proposed bilevel optimization method determines them so that the data becomes linearly separable in B-space. Results and discussions provided in Section 3.7 provide clarity on this novel aspect of our nonlinear decision tree representation. 35 (a) Bilevel GA results: (0%, 0%) and (1%, 0%). (b) Obtained NLDT for DS3 problem – One rule. (c) CART results: (0%, 4%) and (4%, 3%) – 13 rules. Figure 3.12: Results on DS3 dataset to benchmark the overall bilevel algorithm. 3.6 Overall Tree Induction and Pruning Once the split-rule for a conditional node is determined using the above bilevel optimization procedure, resulting child nodes are checked for the following termination criteria: • Depth of the Node > Maximum allowable depth, • Number of datapoints in node < , and • Node Impurity ≤ . 36 x1≤1.01x0≤0.753x0≤0.34x1≤0.461x1≤0.84x0≤0.64x0≤0.40x1≤0.82x1≤0.6011x1≤0.58311x1≤0.89313x0≤1.013x1≤0.2731x0≤0.8613 (a) Bilevel GA results: (0%, 0%) and (0%, 0%) – No error. (b) Obtained NLDT for DS4 problem – One rule. (c) CART results: (4%, 3%) and (5%, 18%), 27 rules. Figure 3.13: Results on DS4 dataset to benchmark the overall bilevel algorithm. If any of the above criteria is satisfied, then that node is declared as a terminal leaf node. Otherwise, the node is flagged as an additional conditional node. It then undergoes another split (using a split- rule which is derived by running the proposed bilevel-GA on the data present in the node) and the process is repeated. This overall process is illustrated in Algorithm 1. This procedure eventually produces the main nonlinear decision tree  . The fully grown decision tree then undergoes pruning to produce the pruned decision tree  . During the pruning phase, splits after the root-node are systematically removed until the training accuracy of the pruned tree does not fall below a pre-specified threshold value of  = 3% 37 x0≤1.03x0≤0.743x1≤1.52x1≤1.503x0≤0.53x1≤1.93x1≤1.953x0≤0.43x1≤2.20x1≤2.20x1≤2.17x0≤−0.01313x0≤0.07x1≤2.84x1≤2.863113x1≤1.95331x0≤0.6331x1≤1.7313x1≤0.97x0≤0.803x0≤1.01x0≤0.903x1≤1.14x1≤1.243x0≤0.78x1≤1.38x1≤1.4331131313 Figure 3.14: Feature Transformation. A point in three-dimensional X-space is mapped to a point in a two-dimensional B-space for which  =  3 3 . 1 1  2 2  (set here after trial-and-error runs). This makes the resultant tree topologically less complex and provides a better generalizability. In subsequent sections, we provide results on final NLDT obtained after pruning (i.e.  ), unless otherwise specified. 3.7 Results This section summarizes results obtained on four customized test problems, two real-world classification problems, three real-world optimization problems and eight custom designed multi- objective problems. For each experiment, 50 runs are performed with random training and testing sets. A dataset is split into training and testing sets with a ratio of 7:3, respectively. Mean and standard deviation of training and testing accuracy-scores across all 50 runs are evaluated to gauge the performance of the classifier. Statistics about the number of conditional nodes (given by #Rules) in NLDT, average number of active terms in split-rules of decision tree (i.e. /Rule) and rule length (which gives total number of active-terms appearing in the entire NLDT) is provided to quantify the simplicity and interpretability of classifier (lesser this number, simpler is the classifier). The comparison is made against classical CART tree solution [12] and SVM [43] results. For SVM, the Gaussian kernel is used and along with overall accuracy-scores; statistics about the number of support vectors (which is also equal to the rule-length) is also reported. It is to note that the decision boundary computed by SVM can be expressed with a single equation, however the length 38 of the equation usually increases with the number of support vectors [44]. We used MATLAB’s SVM routine with default parameter settings. For each set of experiments, best scores are highlighted in bold and Wilcoxon signed-rank test is performed on overall testing accuracy scores to check the statistical similarity with other classifiers in terms of classification accuracy. Statistically similar classifiers (which will have their -value greater than 0.05) are italicized. 3.7.1 Customized Datasets: DS1 to DS4 The compilation of results of 50 runs on datasets DS1-DS4 is presented in Table 3.1. The results clearly indicate the superiority of the proposed nonlinear, bilevel based decision tree approach over classical CART and SVM based classification methods. Bilevel approach finds a single rule with Table 3.1: Results on DS1 to DS4 datasets (2 features). DS1 DS2 Training Dataset Method Accuracy 99.78 ± 0.51 Bilevel 97.99 ± 0.96 CART SVM 98.67 ± 0.31 99.80 ± 0.40 Bilevel 98.80 ± 0.50 CART SVM 96.95 ± 0.73 99.91 ± 0.35 Bilevel 99.42 ± 0.57 CART SVM 98.96 ± 0.42 99.34 ± 1.21 Bilevel 96.98 ± 1.19 CART SVM 98.19 ± 0.44 DS3 DS4 Testing Accuracy 99.55 ± 1.08 90.32 ± 4.06 98.50 ± 1.09 99.44 ± 0.87 95.43 ± 1.50 95.24 ± 0.16 99.77 ± 0.67 95.00 ± 2.35 98.38 ± 1.15 98.88 ± 1.65 88.68 ± 3.60 97.28 ± 1.19 p-value – 7.34e-10 1.29e-05 – 5.90e-10 2.43e-10 – 7.00e-10 7.16e-09 – 7.31e-10 1.31e-05 #Rules 1.0 ± 0.0 14.5 ± 1.7 1 1 1 1 1.0 ± 0.0 11.0 ± 1.4 1.0 ± 0.0 11.5 ± 1.3 1.2 ± 0.4 31.3 ± 4.2 1 1 FU/Rule 2.3 ± 0.6 71.5 ± 3.3 2.3 ± 0.7 44.7 ± 1.9 2.2 ± 0.5 62.1 ± 3.4 2.6 ± 0.6 89.8 ± 3.3 1 1 Rule Length 2.3 ± 0.6 14.5 ± 1.7 71.5 ± 3.3 2.3 ± 0.7 11.0 ± 1.4 44.7 ± 1.9 2.2 ± 0.5 11.5 ± 1.3 62.1 ± 3.4 3.1 ± 1.4 31.3 ± 4.2 89.8 ± 3.3 two to three appearances of variables in the rule, whereas CART requires 11 to 31 rules involving a single variable per rule, and SVM requires only one rule but involving 44 to 90 appearances of variables in the rule. Moreover, the bilevel approach achieves this with the best accuracy. Thus, classifiers obtained by bilevel approach are more simplistic and simultaneously more accurate. 39 3.7.2 Breast Cancer Wisconsin Dataset This problem was originally proposed in 1991. It has two classes: benign and malignant, with 458 (or 65.5%) datapoints belonging to benign class and 241 (or 34.5%) belonging to malignant class. Each datapoint is represented with 10 attributes. Results are tabulated in Table 3.2. The bilevel method and SVM had similar performance (p-value> 0.05), but SVM requires about 90 variable appearances in its rule, compared to only about 6 variable appearances in the single rule obtained by the bilevel approach. The proposed approach outperformed the classifiers generated using techniques proposed in [20, 25, 24, 21, 22] in terms of both: accuracy and comprehensibil- ity/compactness. The NLDT classifier obtained by a specific run of the bilevel approach has five variable appearances and is presented in Fig. 3.15. Table 3.2: Results on breast cancer Wisconsin dataset (10 features). Training Method Accuracy 98.07 ± 0.39 Bilevel 98.21 ± 0.49 CART SVM 97.65 ± 0.39 Testing Accuracy 96.50 ± 1.16 94.34 ± 1.92 96.64 ± 1.16 p-value 0.308 8.51e-09 – #Rules 1.0 ± 0.0 11.6 ± 2.4 1 FU/Rule 6.4 ± 1.7 89.4 ± 14.8 1 Rule Length 6.4 ± 1.7 11.6 ± 2.4 89.4 ± 14.8 Figure 3.15: Breast Cancer Wisconsin NLDT. For each node, number of datapoints  present in the node, impurity of the node (Gini) and class distribution (in square parenthesis) is reported. Figure 3.16 provides B-Space visualization of decision-boundary obtained by the bilevel ap- proach, which is able to identify two nonlinear -terms involving variables to split the data linearly 40 to obtain high accuracy. Figure 3.16: B-space plot for Wisconsin breast cancer dataset. 3.7.3 Wisconsin Diagnostic Breast Cancer Dataset (WDBC) This dataset is an extension to the dataset of the previous section. It has 30 features with total 356 datapoints belonging to benign class and 212 to malign class. Results shown in Table 3.3 indicate that the bilevel-based NLDT is able to outperform standard CART and SVM algorithms. The NLDT generated by a run of the bilevel approach requires seven out of 30 variables (or features) and is shown in Fig. 3.17. It is almost as accurate as that obtained by SVM and is more interpretable (seven versus about 107 variable appearances). Table 3.3: Results on WDBC dataset (30 features). Training Method Accuracy 98.24 ± 0.64 Bilevel 98.76 ± 0.60 CART SVM 98.65 ± 0.37 Testing Accuracy 96.20 ± 1.49 92.11 ± 2.07 97.39 ± 1.37 p-value 4.65e-05 1.09e-09 – #Rules 1.0 ± 0.0 10.8 ± 2.1 1 FU/Rule 9.2 ± 4.1 106.7 ± 6.6 1 Rule Length 9.2 ± 4.1 10.8 ± 2.1 106.7 ± 6.6 The -space plot (Fig. 3.18) shows an efficient discovery of -functions to linearly classify the supplied data with a high accuracy. 41 0500100015002000250030003500400000.20.40.60.81 Figure 3.17: Tree for WDBC dataset. For each node, number of datapoints  present in the node, impurity of the node (Gini) and class distribution (in square parenthesis) is reported. Figure 3.18: B-space plot for WDBC dataset. 3.7.4 Real World Auto-Industry Problem (RW-problem) This real-world engineering design optimization problem has 36 variables, eight constraints, and one objective function. The dataset is highly biased, with 188 points belonging to the good-class and 996 belonging to the bad-class. Results obtained using the bilevel GA are shown in Table 3.4. The proposed algorithm is able to achieve near 90% accuracy scores requiring only two split-rules. The best performing NLDT has the testing-accuracy of 93.82% and is shown in Fig. 3.19. SVM performed the best in terms of accuracy, but the resulting classifier is complicated with 42 about 241 variable appearances in the rule. However, bilevel GA requires only about two rules, each having about only 10 variable appearances per rule to achieve slightly less-accurate classification. CART requires about 30 rules with a deep DT, making the classifier difficult to interpret easily. Table 3.4: Results on the real-world auto-industry problem (36 features). Training Method Accuracy 94.36 ± 1.47 Bilevel 98.00 ± 0.55 CART SVM 94.98 ± 0.73 Testing Accuracy 89.93 ± 2.04 91.13 ± 1.32 93.24 ± 1.38 p-value 4.35e-09 9.80e-08 – #Rules 1.9 ± 0.5 29.6 ± 3.9 1 FU/Rule 10.0 ± 2.9 240.8 ± 9.4 1 Rule Length 18.2 ± 5.9 29.6 ± 3.9 240.8 ± 9.4 Figure 3.19: NLDT for the auto-industry problem. The first split-rule uses five variables and the second one uses 12. For each node, number of datapoints  present in the node, impurity of the node (Gini) and class distribution (in square parenthesis) is reported. 3.7.5 Results on Multi-Objective Optimization Problems After bench-marking the proposed bilevel GA algorithm on standard classical benchmarks and a single-objective engineering problem, we now evaluate its performance on two real-world multi- objective problems: welded-beam design and 2D truss design and eight modified ZDT and DTLZ problems. We will briefly discuss the procedure used to generate datasets corresponding to these problems followed by results obtained on them using our approach. 43 3.7.5.1 Truss 2D and Welded Beam Problems Data creation: For each problem, NSGA-II [45] algorithm is first applied to evolve the population of  individuals for  generations. Population for each generation is stored. Naturally, population from later generations are closer to Pareto-front than the population of initial generations. We artificially separate the entire dataset into two classes – good and bad – using two parameters   (indicating an intermediate generation for data collection) and  (indicating the minimum non-dominated rank for defining the bad cluster). First, population members from   and  generations are merged. Next, non-dominated sorting is executed on the merged population to determine their non- domination rank. Points belonging to same non-domination rank are further sorted based on their crowding-distance (from highest to lowest) [45]. For the good-class, top  points from rank-1 front are chosen and for the bad-class, top  points from  front onward are chosen. Increasing the  increases the separation between good and bad classes, while the   parameter has the inverse effect. Parameter setting for NSGA-II algorithm which is used to generate multi-objective datasets can be found in Section 3.10. Truss 2D Results: Truss 2D problem [46] is a three-variable multi objective problem involving one constraint. Visu- alization of the Truss dataset in the F-space and X-space is provided in Figure 3.20 for  = 6 and  = 9, with   = 1. Compilation of results obtained using  = 6 is provided in Table 3.5. Table 3.5: Results on Truss-2D with  = 6. Training Method Accuracy 99.77 ± 0.72 Bilevel 99.34 ± 0.32 CART SVM 99.66 ± 0.15 Testing Accuracy 99.54 ± 0.75 98.33 ± 1.10 99.46 ± 0.50 p-value – 6.04e-08 0.135 #Rules 1.2 ± 0.5 11.06 ± 3.15 1 FU/Rule 2.9 ± 0.5 62.5 ± 2.9 1 Rule Length 3.3 ± 0.9 11.06 ± 3.15 62.5 ± 2.9 44 (a) F-space plot with  = 6. (b) X-space plot with  = 6. (c) F-space plot with  = 9. (d) X-space plot with  = 9. Figure 3.20: Truss design problem data visualization.   = 1 is kept fixed. For a fixed value of   , larger value of  implies better separation between datapoints belonging to two classes. Clearly, the bilevel NLDT has the best accuracy and fewer variable appearances (meaning easier intrepretability) compared to CART and SVM generated classifiers. Fig. 3.21a shows a 100% correct classifier with a single rule obtained by our bilevel NLDT, whereas Fig. 3.21b shows a typical CART classifier with 19 rules, making it difficult to interpret easily. A comparison of results obtained using NLDT approach on truss problem with  = 6 and  = 9 is provided in Table 3.6. Table 3.6: 2D Truss problem with  = 6 and  = 9.  6 9 Training Accuracy 99.86 ± 0.36 100 ± 0 Testing Accuracy 99.6 ± 0.58 99.92 ± 0.19 #Rules 1.20 ± 0.53 1.36 ± 0.48 FU/Rule 2.92 ± 0.52 2.26 ± 0.52 Clearly, since the data is more separated for  = 9, the results for  = 9 are more accurate and relatively simpler. 45 00.010.020.030.040.050.060.07F10246810F2104ParetoNon-Pareto10.011.52x32.53x20.0050.01x10.00500ParetoNon-Pareto00.010.020.030.040.050.060.07F10246810F2104ParetoNon-Pareto10.011.52x32.53x20.0050.01x10.00500ParetoNon-Pareto (a) Bilevel classifier requiring only one rule. Figure 3.21: Comparison of bilevel and CART methods on truss problem with  = 6 dataset. (b) CART classifier with 11 rules. Welded Beam Design Problem Results: This bi-objective optimization problem has four variables and four constraints [46]. Here, two sets of experiments are conducted for two different values of   , keeping the value of  fixed to 3. Visualization of these datasets in the objective space (or F-space) is provided in Figure 3.22. A higher value of   results in good and bad datapoints too close to each other (Figure 3.22b), thereby making it difficult for the classification algorithm to determine a suitable classifier. This can be validated from the results presented in Table 3.7. However, bilevel NLDTs have produced 46 x0≤0.01x2≤1.8233x2≤2.19x1≤0.01x1≤0.01x0≤0.00x0≤0.00x0≤0.003113x0≤0.001x1≤0.01313x0≤0.0031 (a) F-space plot with   = 1. (b) F-space plot with   = 10. Figure 3.22: Welded Beam Design Problem data visualization.  = 3 is kept fixed. Problem at (b) is more difficult to solve than at (a). one rule with about 2 to 4 variable appearances compared to 39.5 to 126 (on average) for SVM classifiers. CART does well in this problem with, on average, 2.94 to 8.42 rules. The NLDT (having a single rule) obtained with   = 10 is shown in Fig. 3.23. Interestingly, the bilevel NLDTs have similar accuracy to that of CART and SVM. Table 3.7: Results on welded beam design with   = 1 and   = 10.  = 3 is kept fixed. Method Training Accuracy Testing Accuracy 99.98 ± 0.06 Bilevel 99.97 ± 0.07 CART SVM 99.37 ± 0.17 99.39 ± 0.38 Bilevel 99.46 ± 0.27 CART SVM 99.46 ± 0.19 99.38 ± 0.49 99.50 ± 0.59 99.40 ± 0.40 98.58 ± 1.13 97.72 ± 1.04 98.97 ± 0.54 p-value   = 1 0.158 – 0.331   = 10 3.42e-02 4.63e-08 – #Rules FU/Rule Rule Length 1.0 ± 0.0 2.94 ± 0.71 1 1.0 ± 0.0 8.42 ± 1.42 1 2.6 ± 0.7 39.5 ± 2.5 1 2.6 ± 0.7 2.94 ± 0.71 39.5 ± 2.5 3.9 ± 1.0 126.0 ± 6.8 1 3.9 ± 1.0 8.42 ± 1.42 126.0 ± 6.8 3.7.5.2 Modified ZDT (m-ZDT) and DLTZ (m-DTLZ) Problems Problem Definition and Dataset Creation: These are the modified versions of ZDT and DLTZ problems [45, 47]. For m-ZDT problems, the g-function is modified to the following: 47 01020304050F100.511.522.533.54F210-3ParetoNon-Pareto01020304050F100.511.522.533.54F210-3ParetoNon-Pareto Figure 3.23: Welded beam classifier with one rule for   = 10. (2, . . . , ) =1 + 18  − 1   = 2 and even ( + +1 − 1)2. (3.15) Many two-variable relationships ( + +1 = 1) must be set to be on the Pareto set. For m-DTLZ problems, the g-function for xm variables is (xm) =100 ×   ∈ xm and  is even ( + +1 − 1)2. (3.16) Pareto points for m-ZDT and m-DTLZ problems are generated by using the exact analytical expression of Pareto-set (locations where  = 0). The non-Pareto set is generated by using two parameters  and    . To compute the location of a point xnp belonging to non-Pareto set from the location of the point xp on the Pareto set, following equation is used:  + 1(    + 2) ()  =() where 1 ∈ −1, 1, 2 ∈ [0, 1]. (3.17) (3.18) Here, 1 and 2 are randomly generated for each () . For m-ZDT problems,  = 1, 2, . . . , 48 Table 3.8: Parameter setting to generate datasets for m-ZDT and m-DTLZ problems. We generate 1000 datapoints for each class. Prob. m-ZDT1 m-ZDT1 m-ZDT2 m-ZDT2 m-DTLZ1 m-DTLZ1 m-DTLZ2 m-DTLZ2 nvars spead offset 30 0.1 0.1 500 0.1 30 500 0.1 0 30 0 500 0 30 500 0 0.3 0.3 0.3 0.3 0.05 0.05 0.05 0.05 , () while for m-DTLZ problems ()  ∈ xm. Parameter setting for generating m-ZDT and m-DTLZ datasets is provided in Table 3.8. For 30 variable problems, all  ∈ xm are changed according to Eq. 3.18 for generate non-pareto points. However, for 500 vars problems, only first 28 of variables in xm are modified according to Eq. 3.18 to generate non-pareto data-points. Visualization of Datasets for m-ZDT and m-DTLZ problems in provided in Figure 3.24 and Figure 3.25. (a) m-ZDT1, 30 vars (b) m-ZDT1, 500 vars (c) m-ZDT2, 30 vars (d) m-ZDT2, 500 vars Figure 3.24: m-ZDT Datasets. 49 (a) m-DTLZ1, 30 vars (b) m-DTLZ1, 500 vars (c) m-DTLZ2, 30 vars (d) m-DTLZ2, 500 vars Figure 3.25: m-DTLZ Datasets. 3.7.5.3 m-ZDT and m-DTLZ Results: Experiments conducted on datasets involving 500 features (see Table 3.9) and for two and three- objective optimization problems confirm the scalability aspect of the proposed approach. In all these problems, traditional methods like CART and SVM find it difficult to conduct a proper classification task. The provision of allowing controlled non-linearity at each conditional-node provides the proposed NLDT approach with necessary flexibility to make an appropriate overall classification. 50 Table 3.9: Results on multi-objective problems for classifying dominated and non-dominated solutions. Method Training Acc. Testing Acc. NLDT CART SVM NLDT CART SVM NLDT CART SVM NLDT CART SVM NLDT CART SVM NLDT CART SVM NLDT CART SVM NLDT CART SVM 99.18 ± 0.40 97.48 ± 0.37 84.14 ± 2.42 99.23 ± 0.38 97.41 ± 0.36 99.28 ± 0.18 97.21 ± 2.52 89.72 ± 0.88 52.44 ± 0.63 97.76 ± 1.88 85.64 ± 3.33 54.82 ± 0.98 99.20 ± 0.29 98.76 ± 0.27 100.00 ± 0.00 99.18 ± 0.38 98.61 ± 0.36 100.00 ± 0.00 94.36 ± 4.03 83.23 ± 3.52 51.56 ± 0.52 95.89 ± 3.83 89.09 ± 1.75 51.30 ± 0.55 98.96 ± 0.59 94.78 ± 1.02 82.84 ± 2.77 98.96 ± 0.57 94.72 ± 0.89 98.44 ± 0.57 96.65 ± 2.86 71.74 ± 2.49 45.86 ± 1.28 97.22 ± 2.25 63.41 ± 7.59 49.61 ± 1.62 98.93 ± 0.60 93.48 ± 0.94 100.00 ± 0.00 98.88 ± 0.71 93.95 ± 1.33 100.00 ± 0.00 93.77 ± 4.24 58.26 ± 7.37 47.13 ± 1.19 95.33 ± 4.46 70.04 ± 3.46 47.01 ± 1.20 # Rules m-ZDT1-2-30 1.80 ± 0.60 26.42 ± 1.89 1.00 ± 0.00 m-ZDT2-2-30 1.92 ± 0.56 27.80 ± 2.19 1.00 ± 0.00 m-DTLZ1-3-30 3.12 ± 0.59 93.88 ± 3.23 1.00 ± 0.00 m-DTLZ2-3-30 3.02 ± 0.62 100.82 ± 6.11 1.00 ± 0.00 m-ZDT1-2-500 1.78 ± 0.54 20.58 ± 1.39 1.00 ± 0.00 m-ZDT2-2-500 1.80 ± 0.69 20.98 ± 1.98 1.00 ± 0.00 m-DTLZ1-3-500 3.00 ± 0.45 104.88 ± 5.49 1.00 ± 0.00 m-DTLZ2-3-500 3.04 ± 0.56 96.08 ± 3.93 1.00 ± 0.00 /Rule Rule Length 4.47 ± 2.21 1.00 ± 0.00 1192.38 ± 11.34 7.64 ± 3.50 26.42 ± 1.89 1192.38 ± 11.34 4.37 ± 1.82 1.00 ± 0.00 315.54 ± 6.12 6.14 ± 2.18 1.00 ± 0.00 1381.56 ± 8.25 5.81 ± 1.95 1.00 ± 0.00 1367.42 ± 8.44 5.66 ± 3.23 1.00 ± 0.00 240.88 ± 4.62 5.30 ± 2.45 1.00 ± 0.00 248.06 ± 4.37 7.61 ± 2.36 1.00 ± 0.00 1382.38 ± 8.84 7.28 ± 3.16 1.00 ± 0.00 1382.62 ± 8.46 8.14 ± 3.36 27.80 ± 2.19 315.54 ± 6.12 19.10 ± 7.03 93.88 ± 3.23 1381.56 ± 8.25 17.50 ± 6.73 100.82 ± 6.11 1367.42 ± 8.44 9.36 ± 4.15 20.58 ± 1.39 240.88 ± 4.62 8.88 ± 3.98 20.98 ± 1.98 248.06 ± 4.37 22.76 ± 7.57 104.88 ± 5.49 1382.38 ± 8.84 22.14 ± 10.45 96.08 ± 3.93 1382.62 ± 8.46 3.8 Additional Comparisons and Results In previous sections, we focused at fundamental procedure of inducing the NLDT and com- pared its performance against two standard algorithms: SVM and CART. In this section, we extend our experimental study to compare our approach against other methods which are strong poten- 51 tial candidates to generate an interpretable AI: generalized additive models (GAMs) and genetic programming (GP). We also re-iterate experiments using the default parameter settings for NLDT and CART, while we fine tune the results corresponding to SVM using the best possible parameter values. New problems (m-DS1 to m-DS3) are introduced here to obtain further insight into the working principles of above mentioned algorithms. The customized problems DS1 to DS4, m-DS1 to m- DS3, m-ZDTs and m-DTLZs are designed to benchmark the performance of classifiers on various properties of datasets such as: • Data Distribution: For DS1-DS4 datasets, degree of scatter in data varies across classes. For m-DS1, m-DS2 and m-DS3 the scattering of data for each class is more similar than that in original DS datasets. A visualisation of feature spaces for DS1 and m-DS1 dataset is provided in Figure 3.26a and 3.26b, respectively to demonstrate this. • Geometry of Decision Boundary: Here, the effect of the nature of the simplest possible de- cision boundary is considered. Decision boundary corresponding to DS1-DS2 and modified DS1-DS2 is linear, DS3 and m-DS3 have decision boundary involving nonlinearity of order 2 and DS4 have two disjoint linear decision boundaries. • Data Bias: Here, effect of bias in class representation is considered. All datasets except DS2 and m-DS2 are balanced. For DS2 and m-DS2, minority class has 5 times less number of data points as the majority class. In next few sections, we briefly discuss SVM, GAM and GP to gain a conceptual insight towards critical parameters which can influence the classification performance and the complexity of the classifier. A more detailed discussion is provided in [48]. 52 (a) DS1 Dataset. (b) m-DS1 Dataset. Figure 3.26: Original DS1 and its modified version. 3.8.1 Support Vector Machines (SVMs) For a separable dataset, support vector machine (SVM) algorithm attempts to derive a decision boundary in the form of a single mathematical equation as shown below: (x) = w (x) + , (3.19) where (x) is a set of feature transformation functions which can be either linear or non-linear functions of feature vector x, w is a weight vector and  is a bias term. A conceptual understanding of SVM is provided in Figure 3.27. For a binary classification task involving class labels  = −1 or  = 1, an optimal hyper-surface is derived by maximizing the margin between two classes, as shown by  = 0 line in the figure. Points with  ≤ −1 belong to one class and points with  ≥ 1 belong to another class. The points which fall on  = 1 and  = −1 are called support vectors, as they alone decide the classifier. However, for non-separable datasets, such as the scenario shown in Figure 3.28, a soft margin approach is used to allow some data points within || < 1 (margin) while training the SVM. These points are also declared as support vectors in addition to the points on the margin. 53 Figure 3.27: SVM on separable datasets with a hard margin. Figure 3.28: SVM with non-separable datasets with a soft margin. Minimize: subject to : =1 2||||2 +  1 (w (x) + ) ≥ 1 − ,  ≥ 0,  = 1, 2, . . . , , To identify the classifier and the support vectors, the underlying optimization problem is solved:  , (3.20) where  is the true class label (either 1 or -1) of the datapoint,  is the distance of -th data point from its representative margin, thus  = max [0, 1 −  (x)] (where value of (x) is estimated from Eq. 3.19).  is a penalty parameter which is used to enhance generalizability by compromising It is also aimed to balance the complexity of the classifier (described with training accuracy. 54 with the number of non-zero terms of w) and soft support vectors within the margin and is an important parameter. With lower values of , broader margin (with some misclassification of training datapoints) is achieved while for large values of , misclassification of training datapoints is heavily penalized and so narrower margin is achieved. Using a kernel trick [49] (x, x) = (x) (x) Eq. 3.19 is transformed into the following: (x) =  (x, x) + , (3.21)  =1     =1 =1 =1  = 0. where  is a Lagrange multiplier which is obtained by converting the optimization problem of maximizing the margin (Eq. 3.20) to a dual Lagrangian representation [49]: Min: (a) =  − 1 2     (x, x ), =1 s.t.  ∈ [0, ], (3.22) Classical gradient based algorithms can then be employed to find . In Eq. 3.21, data points x for which  = 0 do not contribute in the equation of the split rule (Eq. 3.21) and data points for which  > 0 are called support vectors for the SVM classifier and they dictate the overall length of the classifier’s equation (Eq. 3.21). The penalty parameter  has to be tuned to efficiently derive the decision boundary. Lower value of  makes the classifier more generalizable.  = ∞ (hard margin) attempts to achieve near 100% training accuracy and hence is prone to overfitting. In our case, we use scikit-learn’s [50] SVM module and set  = 1, 000. We use RBF (or Gaussian) kernel function. Table 3.10 shows results for various settings of  on some datasets considered in our study. 3.8.2 Generalized Additive Models (GAMs) For a binary classification task involving two classes: Class 1 ( = 0) and Class 2 ( = 1), the GAM based classifier [5, 6] estimates the probability of a data point belonging to class  = 1 (i.e. 55 Table 3.10: SVM Result for different values of penalty parameter . For each dataset, the first row represents the testing accuracy and the second row represents complexity (number of support vectors).  = 1000 gives overall best performance. DS2 Pen. Param. DS1 DS4 DS3 96.93 ± 1.87 64.68 ± 2.16 99.32 ± 0.70 26.68 ± 1.88 99.75 ± 0.58 10.60 ± 0.92 m-DS3 99.97 ± 0.23 36.54 ± 1.72 100.00 ± 0.00 12.60 ± 0.98 100.00 ± 0.00 8.82 ± 1.01 45.20 ± 4.24 138.76 ± 1.99 68.77 ± 4.00 262.60 ± 4.76 96.63 ± 1.35 56.70 ± 3.13 Cancer-10 97.15 ± 1.08 69.98 ± 6.51 95.98 ± 1.13 56.22 ± 6.43 95.23 ± 1.09 52.36 ± 4.91  = 1  = 10  = 1, 000 Pen. Param.  = 1  = 10  = 1, 000 94.75 ± 1.97 191.94 ± 4.38 98.42 ± 1.16 58.70 ± 2.90 99.88 ± 0.33 8.36 ± 0.87 m-DS1 99.77 ± 0.67 70.22 ± 2.23 100.00 ± 0.00 26.42 ± 1.46 99.93 ± 0.33 7.38 ± 0.75 95.24 ± 0.00 16.56 ± 0.80 95.24 ± 0.00 30.46 ± 0.64 99.70 ± 0.50 8.56 ± 0.67 m-DS2 95.24 ± 0.00 16.18 ± 0.59 98.89 ± 0.85 14.40 ± 0.89 99.97 ± 0.22 5.34 ± 0.55 Pen. Param.  = 1  = 10  = 1, 000 Truss 77.31 ± 2.16 343.76 ± 9.51 81.29 ± 2.19 258.52 ± 10.85 88.54 ± 1.60 176.22 ± 7.87 Welded Beam 98.83 ± 0.70 47.26 ± 3.00 99.53 ± 0.42 17.86 ± 1.73 99.63 ± 0.38 7.88 ± 0.86 Cancer-30 90.83 ± 1.83 106.88 ± 4.44 91.94 ± 1.36 81.66 ± 4.54 95.08 ± 1.65 58.74 ± 5.18 ( = 1|x))6 as ˆ(x) using the following equation ˆ(x) = 1 1 + −(x) , (3.23) where (x) is referred to as link function [51]. The link function (x) in GAM is expressed as a sum of non-linear functions as shown below: (x) = 1(x) + 2(x) + · · · + (x) + 0, (3.24) 6probability of datapoint belonging to other class (i.e.  = 0) will be 1 − ˆ. 56 where 0 is a constant and (x) are scalar valued nonlinear functions. The functional form of (x) and total number of such nonlinear functions is pre-specified by the user. Modelling of link function (x) using Eq. 3.24 makes GAMs more generalizable than its precursor: generalized linear models (GLMs) [4], which involves only linear terms. In our experiments, we use penalized B-splines to model non-linearity of each feature separately (i.e. referring to Eq. 3.24, (x) = ()). Thus, the -function in our case is given by (x) = 1(1) + 2(2) . . . () + 0, () = ()   = B(cid:48) () () .   =1 where: (3.25)  Here, () denotes a spline function corresponding to -th feature, () () indicates the basis function of order ,   are scalar coefficients and  is the total number of basis functions used to model the spline. The order of spline (i.e. ) and the number of basis-functions  is user-specified. Once the structure of link function (x) is specified, an optimization algorithm is invoked to learn parameters corresponding to basis functions () () and coefficients   with an objective to minimize the error between the estimated value of probability ( ˆ(x) Eq. 3.23) and the actual  values across the dataset. To make the resulting model more generalize and simple, a second-order smoothing is employed. Thus, using Eq. 3.23 and 3.25, the overall optimization problem translates to minimizing the following function:   =1 Þ  =1 Min: (B(cid:48), ) = ( − ˆ(B(cid:48)))2 +   ((cid:48)(cid:48)  ( |B(cid:48)    ))2  , (3.26) where  is the actual class of the -th datapoint (which can have value of either 0 or 1) and ˆ is the probability of -th point belonging to class  = 1 (i.e. ( = 1|xi)) as predicted by the GAM classifier using Eq. 3.23.   are the penalty parameters which are prespecified. In our case, we use   = 0.6 for all features. The rule complexity of a GAM classifier can be tuned using  , where higher values of   imposes heavy penalty on non-linearities with more than second order. Additionally, the complexity can also be controlled by regulating the degree () and number of 57 basis-functions  (Eq. 3.25). In our experimental setup, we conduct series of experiments using different combinations of (, ) to model splines for each feature. Values of  and  are picked from the one listed in Table 3.11. Table 3.11: Details regarding parametric study for GAMs. # Basis Functions () Degree () 2, 3, 5, 8, 13, 21 2, 3, 5 Total number of terms arising from the expression of rule (x) (Eq. 3.25) is  =1 Total Terms = (  + 1) ×   +   + 1. (3.27) which are not contributing in minimizing the error However, due to second-order smoothening effect (Eq. 3.26), non-linearities greater than 2nd order =1( − ˆ(B(cid:48)))2 will get removed from the rule and thus, the effective degree of freedom (EoDF) will be far less than the total length of the rule. Effective degrees of freedom versus accuracy plot for GAM classifiers obtained using various combinations of (, ) on Cancer-10 dataset is shown in Figure 3.29. It is clear that a high training accuracy is achieved with a large EoDF, but makes an over-fitting and produces less testing accuracy. About 500 such experiments are performed and the best combinations of (, ) are used to generate results (Table 3.13) for a given dataset. Note here that generating classifiers using GAM is computationally expensive for high-dimensional datasets and so, we do not run experiments on datasets involving 500 features. 3.8.3 Genetic Programming (GP) Genetic Programming has been extensively used to derive non-linear and interpretable classifiers [52, 53, 54, 55, 56, 57]. A GP algorithm evolves programs (or equations of classifier’s decision boundary in our case) using genetic operators like crossover and mutation. Programs in GP are usually represented with tree architecture as shown in Figure 3.30. Internal nodes of this tree can involve mathematical operations, like +,×,−,÷, log, sin. Allowable set of mathematical operations 58 (a) Training Accuracy (b) Testing Accuracy Figure 3.29: Effective degree of freedom (EoDF) V/s Accuracy for Cancer-10 dataset. The best (, ) parameter setting for this dataset is found to be ∗ = [8, 3, 8, 13, 8, 8, 13, 3, 8, 21] and ∗ = [2, 2, 5, 5, 2, 3, 3, 2, 2, 2]. Figure 3.30: A sample genetic program (GP) tree. The above GP translates to this equation:  (x) = (5 − 7) + 32. are pre-specified by the user. In our case, we use {+,×,−,÷} only. Terminal leaf nodes of a GP program either have one of the input feature  or a constant term . It is to note here that a GP tree (T) represents one non-linear equation and is fundamentally different from the decision tree which involves assembly of split-rule equations which are organized in a hierarchical format. The optimal structure of tree, operators used, features  involved and value of constants  are all unknown and are determined through an evolutionary algorithm. The evolution is conducted with an objective to minimize the cross-entropy loss. However, if unchecked, the size of GP trees grows as the evolution progress and the GP algorithm suffers from bloating [58]. To counter this effect of bloating and encourage evolution of simpler trees (trees with less number of nodes), a parsimony coefficient  is used to penalize the fitness of a GP tree (T) as shown below: Min: (T) =  +  × , (3.28) 59  where  = − 1 =1[(x) log( ˆ(x))+(1− (x)) log(1− ˆ(x))], and ˆ(x) = Sigmoid(  (x)). In Eq. 3.28,  represents size of the tree and is computed by counting total number of nodes in the tree.  (x) is the value the GP tree outputs for a given feature vector x (see Figure 3.30).  It is important to choose a suitable parsimony coefficient  for a problem. Smaller value of  will encourage bloating and will evolve complex equations while the higher value of  will evolve simpler equations at an expense of reduced classification accuracy. In our case, we perform experiments using three values : 0.01, 0.005 and 0.001, and conduct 50 runs on each dataset shown in Table 3.12 after randomly splitting the dataset into 70% training and 30% testing for each run. Statistics regarding testing accuracy and complexity (measured as the total number of internal nodes) is reported in the table. It is clear from the table that while a small  produces a better Table 3.12: GP Result for different values of parsimony coefficient . For each dataset, the first row represents the testing accuracy and the second row represents complexity (number of internal nodes).  = 0.001 produces better results. Pars. coeff.  = 0.01  = 0.005  = 0.001 Pars. coeff.  = 0.01  = 0.005  = 0.001 DS1 61.07 ± 9.91 3.40 ± 3.70 77.3 ± 11.29 16.18 ± 9.99 91.70 ± 6.91 67.72 ± 26.72 m-DS1 89.53 ± 3.27 8.34 ± 1.98 93.37 ± 4.57 16.32 ± 9.55 98.83 ± 1.88 55.38 ± 22.39 DS2 95.24 ± 0.00 1.98 ± 0.14 95.24 ± 0.00 3.86 ± 1.23 95.37 ± 0.63 15.14 ± 13.55 m-DS2 95.65 ± 0.70 3.58 ± 1.07 95.65 ± 0.70 3.76 ± 1.22 96.67 ± 1.93 14.08 ± 9.11 DS3 65.37 ± 11.57 4.44 ± 2.37 86.27 ± 11.41 19.86 ± 11.45 96.50 ± 3.3 76.74 ± 33.36 m-DS3 96.33 ± 4.68 15.04 ± 6.36 98.4 ± 1.99 19.88 ± 9.94 99.27 ± 1.22 49.80 ± 21.69 DS4 49.93 ± 1.43 1.12 ± 3.40 50.37 ± 2.96 2.06 ± 3.88 58.00 ± 11.22 18.76 ± 23.94 Cancer-10 94.03 ± 4.59 5.56 ± 2.06 95.04 ± 1.76 7.88 ± 3.07 96.13 ± 1.29 15.80 ± 5.66 Pars. coeff.  = 0.01  = 0.005  = 0.001 Truss 82.78 ± 11.28 5.20 ± 3.30 90.03 ± 8.50 11.98 ± 7.12 97.36 ± 3.81 36.02 ± 16.99 WeldedBeam 84.88 ± 13.08 9.32 ± 5.15 92.35 ± 6.06 14.08 ± 5.35 96.46 ± 4.14 35.90 ± 18.28 Cancer-30 90.47 ± 4.54 4.78 ± 2.30 90.96 ± 6.29 5.74 ± 1.84 92.40 ± 4.98 14.58 ± 7.14 60 accuracy, a large  produces smaller sized GPs. To demonstrate, we present two GP classifiers for  = 0.005 and 0.01 obtained for the breast cancer Wisconsin dataset (involving total 10 features) in Figure 3.31. Training () and testing () accuracy are better for  = 0.005. (a)  = 0.005,  = 96.44,  = 99.02, Complexity = 6. (b)  = 0.01,  = 95.60,  = 98.05, Complexity = 3. Figure 3.31: Classifiers for Cancer data:  = 0.005:  = 0.01:  (x) = 2 + −0.502 (0.0776) .  (x) = 9 + −0.537 (0.1716)(0.17132) and Table 3.12 indicates that GP does not perform well on certain problems even in small-sized problems, such as DS1 and DS4. In a mathematical classifier search, there are two hierarchical aspects which must be learnt: (i) structure of the classifier, and (ii) coefficient of each term in the structure. GP attempts to learn both aspects in a single optimization task. We argue that while a “good" structure may have evolved at a generation, if its associated coefficients are not proper, the whole classifier will be judged as “bad". We attempt to alleviate this aspect in the next procedure by using a bilevel optimization framework. 61 3.8.4 Results A comprehensive compilation of results obtained on all 19 datasets is shown in Table 3.13. The results are segmented into four parts: Sr. 1-7 for synthetic DS datasets, Sr. 8-9 for traditional cancer datasets, Sr. 10-11 for real-world multi-objective datasets, Sr. 12-19 for high-dimensional modified multi-objective ZDT and DTLZ benchmarks. For each method, a parametric study is performed on each problem and the setting which obtained the best testing accuracy is used to generate the final results. Statistics of 50 runs (with random data-split of 70% training and 30% testing in each) on each dataset for two performance metrics is presented in Table 3.13. For CART, the complexity metric is defined as total number of nodes; for SVM, it is defined as total number of support vectors; for GAM, it is defined as the effective degrees of freedom (EoDF); for GP, it is defined as the total number of internal nodes; and for NLDT, it is defined as total number of variable occurrences in the entire tree. It is clear that a method with high testing accuracy and low complexity is better. The table clearly indicates that NLDT performs well in terms of both metrics. Also, the performance of NLDT scales well with an increase in feature size. CART produces a good compromise on accuracy and complexity, but performs worse than NLDT on both metrics. While SVM achieves a high accuracy, in general, the complexity of its classifiers is large, thereby making them not easy to interpret for any explainability purposes. The performance of GP is poor for achieving a high accuracy. GAM is clearly not suitable for problems with a large number of features and cannot be run due to impractical computational time requirement for some problems (marked with a dash). GP cannot match both accuracy and complexity obtained by NLDT. GP’s performance is also depended on the way the data is scattered which is evident from results on DS1 and m-DS1 problem. This is mostly because of the fact that in GP, the rule-structure and coefficients are evolved simultaneously, thereby making it difficult to efficiently navigate through the search-space of equations. A future study could be launched to incorporate bilevel search strategy with customized initialization to evolve rules within GP framework. In most problems, NLDT classifiers require fewer conditional rules (albeit with restricted nonlinearities) and still 62 achieve near 100% correct testing accuracy. Table 3.13: Summary of results obtained using various methods. For each dataset, the first row indicates testing accuracy and the second row indicates complexity. Italicized entries are statistically insignificant (according to 95% confidence in Wilcoxon rank-sum test) compared to the best entry in the same row. Sr. 1 2 3 4 5 6 7 8 9 Problem DS1 DS2 DS3 DS4 m-DS1 m-DS2 m-DS3 Cancer-10 Cancer-30 10 Welded Beam 11 12 13 14 15 16 Truss m-ZDT1-30 m-ZDT1-500 m-ZDT2-30 m-ZDT2-500 m-DTLZ1-30 17 m-DTLZ1-500 18 m-DTLZ2-30 19 m-DTLZ2-500 NLDT 99.55 ± 1.08 2.3 ± 0.6 99.44 ± 0.87 2.3 ± 0.7 99.77 ± 0.67 2.2 ± 0.5 98.88 ± 1.65 3.1 ± 1.4 99.10 ± 1.54 2.00 ± 0.00 99.46 ± 1.08 2.10 ± 0.30 99.20 ± 1.30 2.02 ± 0.14 96.50 ± 1.16 6.4 ± 1.7 96.20 ± 1.49 9.2 ± 4.1 98.58 ± 1.13 3.9 ± 1.0 99.54 ± 0.75 3.30 ± 0.90 98.97 ± 0.57 7.60 ± 3.50 98.93 ± 0.60 9.34 ± 4.15 98.96 ± 0.57 8.10 ± 3.35 98.87 ± 0.72 8.84 ± 3.95 98.77 ± 0.87 11.98 ± 5.85 93.76 ± 4.24 22.72 ± 7.50 97.22 ± 2.25 17.48 ± 6.75 95.32 ± 4.45 22.02 ± 10.62 CART 90.32 ± 4.06 14.5 ± 1.7 95.43 ± 1.50 11.0 ± 1.4 95.00 ± 2.35 11.5 ± 1.3 88.68 ± 3.60 31.3 ± 4.2 89.73 ± 4.53 7.90 ± 1.22 96.25 ± 1.92 5.96 ± 0.81 92.87 ± 4.35 5.78 ± 1.11 94.34 ± 1.92 11.6 ± 2.4 92.11 ± 2.07 10.8 ± 2.1 97.72 ± 1.04 8.42 ± 1.42 98.33 ± 1.10 11.06 ± 3.15 97.77 ± 0.58 30.26 ± 4.65 95.96 ± 0.80 21.02 ± 1.55 97.88 ± 0.70 28.22 ± 2.35 95.96 ± 0.80 21.02 ± 1.55 78.52 ± 7.94 128.40 ± 22.39 78.31 ± 7.21 126.94 ± 20.07 69.83 ± 6.16 156.00 ± 16.09 76.68 ± 5.44 133.22 ± 15.95 SVM 99.87 ± 0.45 8.16 ± 0.88 99.33 ± 1.10 7.64 ± 0.87 99.63 ± 0.69 10.22 ± 1.42 93.97 ± 2.35 43.70 ± 2.69 99.90 ± 0.40 7.50 ± 0.75 99.94 ± 0.44 5.44 ± 0.67 100.00 ± 0.00 8.82 ± 0.89 95.07 ± 1.23 51.26 ± 5.02 95.24 ± 1.29 58.88 ± 4.46 99.58 ± 0.45 7.86 ± 1.27 88.21 ± 1.62 174.28 ± 8.49 99.39 ± 0.35 82.08 ± 4.19 100.00 ± 0.00 140.58 ± 4.25 99.51 ± 0.33 80.98 ± 3.50 100.00 ± 0.00 140.56 ± 4.00 94.22 ± 0.95 615.54 ± 9.24 64.32 ± 1.76 1236.82 ± 13.26 94.25 ± 1.04 615.44 ± 10.67 64.22 ± 1.48 1245.92 ± 11.45 GAM 100.0 ± 0.00 2.89 ± 0.00 100.0 ± 0.00 2.89 ± 0.00 99.47 ± 1.03 4.98 ± 0.14 48.63 ± 6.50 3.80 ± 0.99 100.0 ± 0.00 2.90 ± 0.00 99.94 ± 0.31 2.90 ± 0.00 99.17 ± 1.48 3.24 ± 0.22 95.32 ± 1.49 22.14 ± 10.36 93.74 ± 5.83 32.47 ± 12.41 99.53 ± 0.48 11.06 ± 0.81 96.18 ± 1.20 19.19 ± 1.06 85.31 ± 1.35 220.20 ± 11.73 84.97 ± 1.18 233.69 ± 6.56 — — — — — — — — 55.96 ± 3.19 33.89 ± 0.01 54.52 ± 2.96 35.84 ± 0.02 GP 91.70 ± 6.91 67.72 ± 26.72 95.37 ± 0.63 15.14 ± 13.55 96.50 ± 3.30 76.74 ± 33.36 59.63 ± 10.81 24.70 ± 26.40 98.83 ± 1.88 55.38 ± 22.39 96.67 ± 1.93 14.08 ± 9.11 99.27 ± 1.22 49.8 ± 21.69 96.13 ± 1.29 15.80 ± 5.66 92.40 ± 4.98 14.58 ± 7.14 96.46 ± 4.14 35.90 ± 18.28 97.36 ± 3.81 36.02 ± 16.99 93.58 ± 10.21 45.34 ± 26.09 83.21 ± 18.42 52.14 ± 24.47 91.57 ± 11.92 48.44 ± 23.69 85.06 ± 16.82 50.64 ± 21.24 81.59 ± 17.32 16.08 ± 21.89 80.49 ± 11.54 8.66 ± 16.19 79.81 ± 19.46 12.32 ± 14.79 78.46 ± 14.08 8.08 ± 15.21 3.9 Conclusions and Future work In this chapter, we have addressed the problem of generating interpretable and accurate non- linear decision trees for binary classification problems. Split-rules at the conditional nodes of a decision tree have been represented as a weighted sum of power-laws of feature variables. A provision of integrating modulus operation in the split-rule has also been formulated, particularly 63 to handle sandwiched classes of datapoints. The proposed algorithm of deriving split-rule at a given conditional-node in a decision tree has two levels of hierarchy, wherein the upper-level is focused at evolving the power-law structures and operates on a discrete variable space, while the lower-level is focused at deriving values of optimal weights and biases by searching a continuous variable space to minimize the net impurity of child nodes after the split. Efficacy of upper-level and lower-level evolutionary algorithm have been tested on customized datasets. The lower-level GA is able to robustly and efficiently determine weights and biases to minimize the net-impurity. A test conducted on imbalanced dataset has also revealed the superiority of the proposed lower-level GA to achieve a high classification accuracy without relying on any synthetic data. Ablation studies conducted on the upper-level GA also demonstrate the efficacy of the algorithm to obtain simpler split-rules without using any prior knowledge. Results obtained on standard classification bench- mark problems and a real-world single-objective problem has amply demonstrated the efficacy and usability of the proposed algorithm. The classification problem has been extended to discriminate Pareto-data from non-Pareto data in a multi-objective problem. Experiments conducted on multi-objective engineering problems have resulted in determining simple polynomial rules which can assist in determining if the given solution is Pareto-optimal or not. These rules can then be used as design-principles for generating future optimal designs. As further future studies, we plan to integrate non-polynomial and generic terms in the expression of the split-rules. The proposed bilevel framework can also be tested for regression and reinforcement-learning related tasks which we shall discuss in upcoming chapters. The upper-level problem can be converted to a bi-objective problem for optimizing both  and  simultaneously, so that a set of trade-off solutions can be obtained in a single run. Nevertheless, this proof-of-principle study on the use of a customized bilevel optimization method for classification tasks is encouraging for us to launch such future studies. 64 3.10 Parameter Settings 3.10.1 Termination Criteria and other Parameter Settings for Inducing a Non-linear Deci- sion Tree (NLDT) • Number of power-laws per split-rule:  = 3, • Allowable set of exponents: E = {−3,−2, . . . , 3}, • Impurity Metric: Gini Score. • Minimum Impurity to conduct a split:  = 0.05 • Minimum number of points required in a node conduct a split:  = 10 • Maximum allowable Depth = 5 • Pruning threshold:  = 3% 3.10.2 Parameter Setting for NSGA-II for multi-objective data creation We used the implementation of NSGA-II Algorithm as provided in pymoo [59]. pymoo is an official python package for multiobjective optimization algorithms and is developed under supervision of Prof. Deb who is the original developer of NSGA-II and NSGA-III algorithms. Following parameter setting was adopted to create Pareto and non-Pareto datasets for two objective Truss and Welded beam problems: • Population Size = 500 • Maximum Generations = 1000 • Cross Over = SBX • SBX  = 15 • SBX probabililty = 0.9 65 • Mutation type = Polynomial Mutation • Mutation  = 20 • Mutation probability  = 1/ (where  is total number of variables) 3.10.3 Parameter Setting for Upper Level GA This section provides the general parameter setting used for the upper level of our bilevel GA. • Population size = 10 × , where  is the dimension of the feature space of the original problem. • Selection = Binary tournament selection. • Crossover probability ( ) = 0.9 • Mutation parameters: – Mutation probability ( ) = (0.33, 1/). – Distribution parameter  = 3 (any value > 1 will be good). •  = 0.75 • Maximum number of generations = 100 3.10.4 Parameter Setting for Lower Level GA Following parameter setting is used for lower level GA: • Population size = 50, • Maximum generations = 50, • Variable range = [−1, 1], • Crossover type = Simulated Binary Crossover (SBX) 66 • Mutation type = Polynomial Mutation • Selection type = Binary Tournament Selection • Crossover probability = 0.9, • Mutation probability = 1/, • Number of variables  =  + 1, if  = 0 or  + 2 if  = 1, where  is the total number of power-laws allowed and  is the modulus flag, • SBX and polynomial mutation parameters: (, ) = (2, 15). 3.10.5 Creation of Customized 2D Datasets: DS1- DS4 As mentioned before, the customized datasets DS1-DS4 (Figure 3.8) were synthetically generated. This section explains the procedure which was adopted to generate these customized datasets. Datapoints in 2 space were created using a reference curve (x) and were assigned to either the Class-1 or Class-2. Two additional parameters  and  controlled the location of a datapoint relative to the reference curve (x). Datapoints (x(i)) for a given dataset were generated using following steps: • First  points falling on curve (x) = 0 were initialized for reference. Lets denote these reference points on the curve with x(i) r (where  = 1, 2, . . . , ). Thus, (x(i) r ) = 0,  = 1, 2, . . . , . •  points (x(i)) for a dataset were then generated using the following equation x(i) = x(i) r +  + ,  = 1, 2, . . . , , (3.29) where  is a random number between 0 and 1,  represents the offset and  indicates the amount of spread. 67 Four datasets generated for conducting ablation studies are graphically represented in Fig. 3.8. Parameter setting which was used to generate these datasets is provided Table 3.14.  and  indicate the number of datapoints belonging to Class-1 and Class-2 respectively. Table 3.14: Parameter setting to create customized datasets D1-D4. For Class-1 data-points (, ) = (0, 0.01). Dataset DS1 DS2 DS3 DS4 (x) 21 + 2 − 3 = 0 21 + 2 − 3 = 0 1 + 2 − 1 = 0 2 21 + 2 − 3 = 0 Class-1 NA 100 10 100 100 (, ) Class-2 (0.025, 0.2) (0.025, 0.2) (0.025, 0.2) (0.025, 0.2) (−0.015,−0.2) NB 100 200 100 50 50 68 CHAPTER 4 CONTROL: INTERPRETABLE POLICY FOR DISCRETE ACTION SPACES 4.1 Introduction Control system problems are increasingly being solved by using modern reinforcement learning (RL) and other machine learning (ML) methods to find an autonomous agent (or controller) to provide an optimal action  for every state variable combination  in a given environment at every time-step . Execution of the output action  takes the object to the next state +1 in the environment and the process is repeated until a termination criteria is met. This is conceptually demonstrated in Figure 4.1 Figure 4.1: Control Loop The mapping between input state  and output action  is usually captured through an artificial intelligence (AI) method. In the RL literature, this mapping is referred to as policy (() : S → A), where S is the state space and A is the action space. Sufficient literature exists in efficient training of these RL policies [60, 61, 62, 63]. While these methods are efficient at training the AI policies for a given control system task, the developed AI policies, captured through complicated networks, are complex and non-interpretable. Interpretability of AI policies is important to a human mind due to several reasons: (i) they help 69 provide a better insight and knowledge to the working principles of the derived policies, (ii) they can be easily deployed with a low fidelity hardware, (iii) they may also allow an easier way to extend the control policies for more complex versions of the problem. While defining interpretability is a subjective matter, a number of past efforts have attempted to find interpretable AI policies with limited success. In the remainder of this chapter, we first present the main motivation behind finding interpretable policies in Section 4.2. A few past studies in arriving at interpretable AI policies is presented in Section 4.3. In Section 4.5, we briefly discuss an extension to our nonlinear decision tree (NLDT) approach in the context of arriving at interpretable AI policies. The overall open-loop and closed- loop NLDT policy generation methods are described in Section 4.6. Results on some control system problems are presented in Section 4.7. Finally, conclusions and future studies are presented in Section 4.10. 4.2 Motivation for the Study As mentioned in the earlier chapters, various data analysis tasks, such as classification, controller design, regression, image processing, etc., are increasingly being solved using artificial intelligence (AI) methods.These are done, not because they are new and interesting, but because they have been demonstrated to solve complex data analysis tasks without much change in their usual frameworks. With more such studies over the past few decades, they are faced with a huge challenge. Achieving a high-accuracy solution does not necessarily satisfy a curious domain expert, particularly if the solution is not interpretable or explainable. A technique (whether AI-based or otherwise) to handle data well is no more enough, researchers now demand an explanation of why and how they work. Consider the MountainCar control system problem (see Figure 4.2, which has been extensively studied using various AI methods [64, 65, 66]. The problem has two state variables (position  along -axis and velocity  along positive -axis) at every time instant  which would describe the state of the car at . Based on the state vector  = (, ), a policy () must decide on one of the three actions : decelerate ( = 0) along positive -axis with a pre-defined value −, do nothing 70 Figure 4.2: Mountain Car problem. It comprises of two state variables ,  and is controlled using three actions: -1 for deceleration, 0 for nothing and +1 for acceleration. ( = 1), or accelerate ( = 2) with  in positive -axis direction. The goal of the control policy () is to take the under-powered car (it does not have enough fuel to directly climb the mountain and reach the destination) over the right hump in a maximum of 200 time-steps starting anywhere at the trough of the landscape. Physical laws of motion are applied and a policy () has been trained to solve the problem. The RL produces a black-box policy () for which an action  ∈ [0, 1, 2] will be produced for a given input  = (, ) ∈ R2. Figure 4.3a shows the state- (a) Using . (b) Using NLDT. Figure 4.3: State-action combinations for MountainCar prob. action combinations obtained from 92 independent successful trajectories (amounting to total of 71 10,000 time-steps) leading to achieving the goal using a pre-trained deterministic black-box policy . The -location of the car and its velocity can be obtained from a point on the 2D plot. The color of the point  = (, ) indicates the action  suggested by the oracle policy  ( = 0: blue,  = 1: orange, and  = 2: green). If a user is now interested in understanding how the policy  chooses a correct  for a given , one way to achieve this would be to represent and analyze the policy function () as shown below:  () =  0,  1,  2, if 0() is true, if 1() is true, if 2() is true, (4.1) where () : 2 → {0, 1} is a Boolean function which partitions the state space S into two sub-domains based on its output value and for a given state , exactly one of () is true, thereby making the policy  deterministic. If we re-look at Figure 4.3a we notice that the three actions are quite mixed at the bottom part of the - plot (state space). Thus, the partitioning Boolean functions  of Eq. 4.1 need to be quite complex in order to have 0() = true for all blue points, 1() = true for all orange points and 2() = true for all green points in a mutually exclusive manner. What we address in this study is an attempt to find an approximated policy function () which may not explain all 100% time instance data corresponding to the oracle black-box policy () (Figure 4.3a), but it is fairly interpretable to capture and explain most of the behavior of . Consider the state-action plot in Figure 4.3b, which is generated with a simpler and a relatively more interpretable policy () = {|() is true,  = 0, 1, 2} obtained by our proposed procedure as shown below 0() = ¬1(), 1() = (1() ∧ ¬2()) , 2() = (1() ∧ 2()) , 72 (4.2) where 1() = |0.96 − 0.63/(cid:98) 0.30(cid:98) 2 + 0.28/(cid:98) − 0.22(cid:98)(cid:98)| ≤ 0.36, and 2() = |1.39 − 0.28(cid:98) 2| ≤ 0.53. Here,(cid:98) and(cid:98) are normalized state variables (see Section 4.6.1). The action  2 − predicted using the above policy does not match the output of  at some states (about 8.1%), but from our experiments we observe that it is still able to drive the mountain-car to the destination goal located on the right hill in 99.8% episodes. Importantly, the policies are simplistic and amenable to an easier understanding of the relation- ships between  and  to make a near perfect control. Since the explanation process used the data from  as the universal truth, the derived relationships can also provide an explanation of the working of the black-box policy . A more gross approximation to Figure 4.3a by more simplified relationships () may reduce the overall open-loop accuracy (see Section 4.4) of matching the output of . Hence, a balance between a good interpretability and a high open-loop accuracy in searching for Boolean functions () becomes an important matter for such an interpretable AI-policy development study. In this work, we focus on developing a search procedure for arriving at the -functions (see Eq. 4.2) for discrete action systems. The structure of the policy () shown in Eq. 4.1 resembles a decision tree (DT), but unlike a standard DT, it involves a nonlinear function at every non-leaf node, requiring an efficient nonlinear optimization method to arrive at reasonably succinct and accurate functionals. The procedure we propose here is generic and is independent of the AI method used to develop the black-box policy . 4.3 Related Past Studies A few studies exist which are focused at generating interpretable policies. In [67], an in- terpretable orchestrator is developed to choose from two RL-policies:  (modelled for reward maximization) and  for maximizing an ethical consideration. The orchestrator is dependent on only one of the state-variables and despite it being interpretable, the policies:  and  are still black-box and convoluted. [68] constructs a set of interpretable index based policies and uses multi-arm bandit procedure to select a high performing index based policy. The search space of 73 interepretable policies is much smaller and the procedure suggested for finding an interpretable policy is computationally heavy, taking about hours to several days of computational time on simple control problems. In [69], genetic programming (GP) is used to obtain interpretable policies on control tasks involving continuous actions space through model-based policy learning. However the interpretability was not captured in the design of the fitness function and a large archive was created passively to store every policy for each complexity encountered during the evolutionary search. A linear decision tree (DT) based model is used in [70] to approximate the Q-values of trained neural network. In that work, the split in DT occurs based on only one feature, and at each terminal node the Q-function is fitted using a linear model on all features. [71] uses a program sketch  to define the domain of interpretable policies . Interpretable policies are found using a trained black-box oracle  as a reference by first conducting a local search in the sketch space  to mimic the behaviour of the oracle  and then fine-tuning the policy parameters through online Bayesian optimization. The bias towards generating interpretable programs is done through controlled initialization and local search rather than explicitly capturing interpretability as one of the fitness measure. Particle swarm optimization [72] is used to generate interpretable fuzzy rule set in [73] and is demonstrated on classic control problems involving continuous actions. Works on DT [12] based policies through imitation learning has been carried out in [74]. [75] extends this to utilize Q-values and eventually render DT policies involving < 1, 000 nodes on some toy games and CartPole environment with an ultimate aim to have the induced policies verifiable. [76] used axis-aligned DTs to develop interpretable models for black-box classifiers and RL-policies. They first derive a distribution function P by fitting the training data through axis-aligned Gaussian dis- tributions. P is then used to compute the loss function for splitting the data in the DT. [77] attempts to generate interpretable DTs from an ensemble using a genetic algorithm. In [78], regression trees are derived using classical methods such as CART [12] and Kd-tree [79] to model Q-function through supervised training on batch of experiences and comparative study is made with ensemble techniques. In [80], a gradient based approach is developed to train the DT of pre-fixed topology involving linear split-rules. These rules are later simplified to allow only one feature per split node 74 (a) Open-loop (b) Closed-loop Figure 4.4: Performance measures. and resulting DTs are pruned to generate simplified rule-set. While the above methods attempt to generate an interpretable policy, the search process does not use complexity of policy in the objective function, instead, they rely on initializing the search with certain interpretable policies. In our approach described below, we use to concept of NLDT induction discussed in Chapter 3 to build an efficient search algorithm to directly find relatively simple and interpretable policies than black-box DNN or tabular-tile based policies using recent advances in nonlinear optimization. 4.4 Performance Measures Before we proceed into formally discussing the NLDT based policies, we will quickly discuss two performance metrics we used to measure and compare performance of policies. 4.4.1 Open-Loop Accuracy Open-loop accuracy quantifies how accurately does a given policy  mimics the reference black- box oracle policy . This is conceptually shown in Figure 4.4a. Here, under ideal scenario (100% open-loop accuracy), for every state (), the output (cid:48)() of policy  will match the output () of the black-box policy . In our case, since the action-space is discrete, the open-loop accuracy is identical to the classification accuracy on the labelled state-action dataset 75 generated using the black-box policy . 4.4.2 Closed-loop Performance Under the close-loop setting, the AI is directly used as a controller to control the object by interacting with the environment or dynamics of the control system as shown in Figure 4.4b. For each transition from state () to state ( + 1) using action (), a reward () is collected. The loop is repeated until a termination criteria is reached. The entire sequence of state-action-state transitions from start to termination is referred to as episode. The cumulative reward value collected during entire episode is given by   =0  = (), (4.3) where   is the time-step at which an episode terminated. The episode is called a success if the desired goal is reached or a target is achieved upon the termination. Else, the episode is declared as a failure. In the mountain-car example (Figure 4.2), an episode is considered as a success if by 200 time-steps, the AI is able to drive the car to the flag-post located on the right up-hill, else it’s a failure. We measure closed-loop performance of an AI using two metrics • Cumulative Reward which is measured using Eq. 4.3, and • Task Completion Rate, which quantifies number of successful episodes across 100 closed- loop simulations. 4.5 Nonlinear Decision Trees (NLDTs) as Policies Similar to classification problems where NLDT is used as a classifier to predict the class of a given datapoint, in control systems involving discrete actions, NLDT can represent a policy wherein for a given input state x, the corresponding action  can be determined by traversing the tree and reaching the leaf node. 76 We implement two frameworks to represent non-linear decision trees: binary-split and multi- split. The binary-split NLDT is identical to the one discussed in Chapter 3. The multi-split NLDT can be obtained by allowing more than two splits for a given conditional node. We will briefly discuss binary-split and multi-split NLDT frameworks. 4.5.1 Binary-split NLDT In binary-split NLDT, a conditional node undergoes exactly two splits. Unlike in Chapter 3 where we mainly discussed problems involving two classes, in case of discrete-action control problems, more than two actions can appear. This is conceptually illustrated in Figure 4.5 for control problem involving three actions. Figure 4.5: Binary-split NLDT for Discrete action control systems. Each conditional node represents a non-linear control logic  (x) ≤ 0, where the non-linear function  (x) assumes the similar form as the one discussed in Chapter 3 and is shown below for quick reference   (cid:12)(cid:12)(cid:12) =1   + 1, =1   + 1 The  are power-laws of type  =  (x) =    =1  operator. 77 if  = 0, (cid:12)(cid:12)(cid:12) − |2|, if  = 1, (4.4) and  indicates presence or absence of absolute 4.5.2 Multi-split NLDT Unlike the binary-split NLDT, in multi-split NLDT, a node is allowed to have more than two partitions or splits. If the number of classes (or discrete action values in our case) in the dataset is , then a node in the multi-split NLDT can have upto -splits as shown in Figure 4.6a. (a) Schematic of Multi-split NLDT. (b) Biases in (x) for Multi-split NLDT. Figure 4.6: Multi-split NLDT configuration. Numbers in square brackets indicate class distribution of datapoints. The number within square brackets in a node in multi-split NLDT of Figure 4.6a indicates class distribution. The split function  (x) for a node in multi-split NLDT is given by the following equation  =1  (x) =  . (4.5) Notice the difference between split-rule for binary-split NLDT (Eq. 4.4) and split-rule for multi-split NLDT (Eq. 4.5). The former involves bias terms  and a modulus parameter  while the latter is expressed as the weighted sum of power-laws  (Eq. 3.3) only. In multi-split NLDT, for a given feature vector x, the value of its split function  (x) is compared against biases  to determine child node where the datapoint will belong. This is schematically shown in Figure 4.6b for a node undergoing 4 splits. Thus, for a given feature vector x, If − ∞ <  (x) ≤ 1, Move to Split 1 Node, If 1 <  (x) ≤ 2, Move to Split 2 Node, · · · · · · 78 4.6 Overall Approach The overall approach is illustrated in Figure 4.7. First, a dedicated black-box policy  is Figure 4.7: A schematic of the proposed overall approach. trained from the actual environment/physics of the problem. Training of black-box policy  is not the focus of our current work and we use standard approaches from literature to obtain . Next, the trained policy  (Block 1 in the figure) is used to generate labelled training and testing datasets of state-action pairs from different time-steps. We generate two types of training datasets: Regular – as they are recorded from multiple episodes, and Balanced – selected from multiple episodes to have almost equal number of states for each action, where an episode is a complete simulation of controlling an object with a policy over multiple time-steps. Third, the labelled training dataset (Block 2) is used to find the NLDT (Block 3) using the recursive bilevel evolutionary algorithm described in Section 4.6.2. We call this an open-loop NLDT (or, NLDT ), since it is derived from a labelled state-action dataset generated from , without using any overall reward or any final goal objective in its search process, which is typically a case while doing reinforcement learning. Use of labelled state-action data in supervised manner allows a faster search of NLDT even with a large dataset as compared to constructing the NLDT from scratch through reinforcement learning by interacting with the environment to maximize the cumulative rewards [71]. Next, in an effort to make the overall NLDT relatively more interpretable while simultaneously ensuring better closed-loop performance, we prune the NLDT by taking only the 79   top part of NLDT  (we call NLDT() in Block 4) and re-optimize all non-linear rules within it for the weights and biases using an efficient evolutionary optimization procedure to obtain final NLDT* (Block 5). The re-optimization is done here with closed-loop objectives, such as the cumulative reward function or closed-loop completion rate. We briefly discuss the open-loop training procedure of inducing NLDT  and the closed-loop training procedure to generate NLDT* in next sections. 4.6.1 Data Normalization First, we provide the exact normalization of state variables performed before the open-loop learning task is executed. Before training and inducing the non-linear decision tree (NLDT), features in the dataset are normalized using the following equation: (cid:98) = 1 + ( − min  )/(max  − min ),  (4.6) where  is the original value of the -th feature,(cid:98) is the normalized value of the -th feature, min and max are minimum and maximum value of -th feature as observed in the training dataset. This normalization will make every feature  to lie within [1, 2]. This is done to ensure that  = 0 is avoided to not cause a division by zero.   4.6.2 Open-loop Training As a first step for open-loop training, the pre-trained black-box oracle  is used to generate labelled state-action dataset (step A in Figure 4.7). Once the labeled dataset is generated, the problem of inducing NLDT to fit the data translates to the supervised learning problem of developing a classifier. 4.6.2.1 Open-loop training for Binary-split NLDT The open-loop training for binary-split is identical to the procedure discussed in Chapter 3 wherein an evolutionary bilevel algorithm is employed to derive non-linear split-rule at each conditional 80 node. The NLDT is grown using recursive splitting of training data. 4.6.2.2 Open-loop training for Multi-split NLDT Similar to binary-split NLDT, a dedicated bilevel optimization algorithm is applied to derive the split-rule  (x) of Eq. 4.5 and its associated biases  at a given conditional node. The optimization formulation to obtain the split-rule for a conditional node in multi-split NLDT and values of corresponding  is as given below Min. (B, w∗, ∗), s.t. (w∗, ∗) ∈ argmin(cid:8)(w, )|(B,)(cid:12)(cid:12)  ∈ [−1, 1]+1(cid:9) , (w, −1 ≤  ≤ 1, ∀, )|(B,) ≤  ,  <   ,  < , ∀, ,   ∈ . (4.7) Here, the upper level objective function  is identical to the one given in Eq. 3.7. It quantifies the complexity of the rule by counting the number of non-zero exponents in power-laws . The lower level objective function  quantifies the quality of split by computing weighted sum of impurity scores of resulting child nodes as shown in the equation below Ò =1  = Gini(),   (4.8) where Ò indicates total number of child nodes created from the split. The number of splits (and hence the number of child nodes) depends on the distribution of datapoints in the given conditional node. If the number of datapoints belonging to -th class in the given conditional node is  and there are total  classes in the original dataset, then the number of splits  (or equivalently Ò) a given conditional node undergoes is  =1  = max(0,  − )  −  , 81 (4.9) where  is a user specified parameter. The optimization formulation for lower-level optimization in multi-split NLDT is given as under Min. (w, )(cid:12)(cid:12)(B,), s.t.  <   ,  ∈ {1, . . . ,  − 2} and  =  + 1, (4.10)  ∈ [−1, 1] ,  ∈ [−1, 1]−1, where  is the total number of power-laws  (Eq. 4.5). 4.6.3 Closed-loop Training The intention behind the closed-loop training is to enhance the closed-loop performance of NLDT. It will be discussed in Section 4.7 that while closed-loop performance of NLDT  is at par with  on control tasks involving two to three discrete actions, like CartPole and MountainCar, the NLDT  struggles to autonomously control the agent for control problems such as LunarLander having more states and actions. In closed-loop training, we fine-tune and re-optimize the weights W and biases  of an entire NLDT  (or pruned NLDT , i.e. NLDT() – block 4 in Figure 4.7) to maximize its closed-loop fitness ( ), which is expressed as the average of the cumulative reward collected on  episodes:    (W, ), Maximize  (W, ) = Subject to W ∈ [−1, 1] ,  ∈ [−1, 1] , =1 1  (4.11) where  and  are total number of weights and biases appearing in entire NLDT and  = 20 in our case. 4.7 Experiments: AIM and Procedure In this work, we conduct experiments to demonstrate the performance of NLDT on control problems involving discrete actions. Efficacy of the proposed approach of inducing NLDT is 82 shown on two types of control tasks as listed below: • Binary Discrete Action Space (B-DAS): The agent in these control tasks is allowed to have exactly two discrete actions. Environments considered for this task are CartPole and CarFollowing as shown in Figures 4.8a and 4.8b, respectively. • Multiple Discrete Action Space (M-DAS): Action space of agent in these control tasks in- volves three or more discrete actions. Environments considered for this task are MountainCar and LunarLander as shown in Figures 4.2 and 4.8c, respectively. (b) CarFollowing environment. (a) CartPole environment. (c) LunarLander environment. Figure 4.8: Three control problems. Through our experiments, we try to empirically demonstrate effects of training data size, data distribution (regular or balanced) and NLDT framework (binary-split or multi-split) on the performance of NLDT. 4.7.1 Experimental Setup At first, a dedicated black-box AI is trained for each control tasks listed above. The method used to derive the black-box AI for each control task is provided in the corresponding sections. It is 83 to note here that training of the black-box AI is not the focus of our research. Our aim here is to decipher complicated and incomprehensible (but efficiently functioning) black-box AI controller by representing it in a relatively more interpretable format through Non-linear Decision Tree (NLDT). Once the black-box AI is trained, it is used to generate labelled training and testing datasets of state-action pairs (block 2 in Figure 4.7). The labelled training dataset is used to induce the NLDT using the recursive bilevel evolutionary algorithm [81] (Section 4.6.2). The training dataset is generated by storing state-action values from the sequential interaction of black-box AI (such as DNN) with the corresponding environment. We generate two types of training datasets: Regular and Balanced. Discussion regarding the procedure used to generate regular and balanced dataset is explained next. 4.7.1.1 Creation of Regular Dataset For creating a regular dataset of  datapoints, the sequential state-action data of different episodes is stored from the interaction of trained black-box AI with the corresponding environment. Since the datapoint (state-action pair) is stored for each time-step in sequential manner (if the episode terminates and the collected number of datapoints is less than , then a new episode is invoked and the process of data collection repeats), there is no explicit control on the number of datapoints which will fall into a particular action category. Hence, the resulting data distribution might have an inherent bias. This bias in data distribution is more evident for environments involving more than two actions as shown in Table 4.1. Table 4.1: Class Distribution of Regular Training Datasets for different problems. For each row, the -th number in second column represents the number of datapoints belonging to -th action. Problem CartPole CarFollowing MountainCar LunarLander #Classes 2 2 3 4 Class Distribution 4997; 5003 5237; 4763 3452; 495; 6053 3661; 1650; 3779; 910 Since some of the actions have lesser representation in the training dataset, it might negatively affect the closed-loop performance of the NLDT. To investigate this effect of bias, we create 84 balanced datasets with near-uniform distribution across all actions/classes. The procedure to create a balanced (or almost balanced) dataset is discussed next. 4.7.1.2 Creation of Balanced Dataset In order to create a uniformly (or near-uniformly) distributed data across all actions in the dataset of  datapoints, data creation procedure similar to the one discussed above is implemented, i.e. data is stored from sequential time-steps. However, if the number of datapoints for a particular (cid:101) (where  is the total number of discrete actions action reaches a threshold limit of (cid:100)   and (cid:100)(cid:101) indicates the round-up function), no extra datapoints are collected for that action. It is to note that data for other actions (for which number of collected datapoints is still less than the prescribed threshold value) is still collected from sequential time-steps until total  datapoints are collected. The regular and balanced data creation method is used to create training datasets only. Testing datasets are created with the regular dataset creation approach. Also, since for binary action spaces, the distribution is fairly uniform (Table 4.1), we don’t create balanced training datasets for these problems (i.e. CartPole and CarFollowing). 4.8 Experiments and Analysis on Control Tasks with Binary Action Spaces In this section, we conduct experiments and discuss results obtained by using our approach for control tasks involving two discrete actions, namely: 1) CartPole, and 2) CarFollowing. 4.8.1 CartPole Problem As shown in Figure 4.8a, the CartPole problem comprises of four state variables: 1) -position ( → 0), velocity in +ve  direction ( → 1), angular position from vertical ( → 2) and angular velocity ( → 3) and is controlled by applying force towards left (action 0) or right (action 1) to the cart. The objective is to balance the inverted pendulum (i.e. −24 deg ≤  ≤ 24 deg) while also ensuring that the cart doesn’t fall off from the platform (i.e. −4.8 ≤  ≤ 4.8). For every time-step, 85 a reward value of 1 is received while  is within ±24 deg. The maximum episode length is set to 200 time-steps. A deep neural network (DNN) controller is trained on the CartPole environment using the PPO algorithm [61]. Table 4.2 shows the performance of NLDT on the training data sets of different sizes. It is observed that NLDT trained with 5,000 and 10,000 data points shows a robust open-loop performance and also produces 100% closed-loop performance. Keeping this in mind, we keep the training data size of 10,000 fixed across all control problems discussed in this dissertation. It is observed that NLDT  trained with at least 5,000 data points shows a robust open-loop performance. The obtained NLDT  has a about two rules with on an average three terms in the derived policy function. Table 4.2: Effect of training data size on performance of NLDT  on CartPole problem. Compl. Rate Training Data Size (Closed-loop) Training Accuracy Testing Accuracy Cumulative 100 500 1,000 5,000 10,000 97.00 95.5 91.90 92.07 91.86 82.79 79.66 90.59 92.02 92.05 # Rules 1.50 1.90 1.80 1.70 1.30 Rule Length 3.30 3.88 4.05 4.25 4.45 Reward 199.73 175.38 200.00 200.00 200.00 95.0 51.00 100 100 100 Interestingly, the same NLDT (without closed-loop training) also produces 100% closed-loop performance by achieving the maximum cumulative reward value of 200. 4.8.1.1 NLDT for CartPole Problem One of the NLDT  obtained for the CartPole environment is shown in Figure 4.9 in terms of normalized state variable vector(cid:98)x. The respective policy can be alternatively stated using the programmable if-then-else rule- structure as shown in Algorithm 5: A little manipulation will reveal that for a correct control startegy, Action 0 must be invoked if following condition is true: 2.39 ≤ (cid:33) (cid:32) (cid:98)0(cid:98)2 2 + 3.50(cid:98)3 2 86 ≤ 5.06, Figure 4.9: CartPole NLDT  induced using 10,000 training samples. It is 91.45% accurate on the testing dataset but has 100% closed loop performance. Normalization constants are: xmin = [-0.91, -0.43, -0.05, -0.40], xmax = [1.37, 0.88, 0.10, 0.45]. Algorithm 5: CartPole Rules. Normalization constants are:  = [-0.91, -0.43, -0.05, -0.40],  = [1.37, 0.88, 0.10, 0.45]. (cid:12)(cid:12)(cid:12)−0.18(cid:98)0(cid:98)2−2 − 0.63(cid:98)3−2 + 0.67(cid:12)(cid:12)(cid:12) − 0.24 ≤ 0 then if Action = 0 Action = 1 else end otherwise, Action 1 must be invoked. First, notice that the above policy does not require the current velocity ((cid:98)1) to determine the left or right action movement. Second, for small values of angular position ((cid:98)2 ≈ 1) and angular velocity ((cid:98)3 ≈ 1), i.e. of (cid:98)2 ≈ 2 and (cid:98)3 ≈ 2), the term in bracket will be smaller than 2.39 for all (cid:98)0 ∈ [1, 2], and the the pole is falling towards left, the above condition is always true. That is, the cart should be pushed towards left, thereby trying to stabilize the pole to vertical position. On the other hand, if the pole is falling towards right (large values above policy suggests that Action 1 (push the cart towards right) must be invoked. When the pole is falling right, a push of the cart towards right helps to stabilize the pole towards its vertical position. These extreme case analyses are intuitive and our policy can be explained for its proper working, but what our NLDT approach is able to find is a precise rule for all situations of the state variables to control the Cart-Pole to a stable configuration, mainly using the blackbox-AI data. 4.8.2 CarFollowing Problem This problem simulates a one dimensional car chasing scenario. We have developed a discretized version of the car following problem discussed in [82] (illustrated in Figure 4.8b), wherein the task 87 is to follow the car in the front which moves with a random acceleration profile (between −1/2 and +1/2) and maintain a safe distance of    = 30 from it. The rear car is controlled using two discrete acceleration values of −1/2 (Action 0) and +1/2 (Action 1). The car-chase episode terminates when the relative distance  =    −  is either zero (i.e. collision case) or is greater than 150 m. At the start of the simulation, both cars start with the initial velocity of zero. A DNN policy for CarFollowing problem was obtained using a double Q-learning algorithm [83]. The reward function for the CarFollowing problem is shown in Figure 4.10, indicating that a relative distance close to 30 m produces the highest reward. Figure 4.10: Reward function for CarFollowing environment. It is to note here that unlike the CartPole control problem, where the dynamics of the system was deterministic, the dynamics of the CarFollowing problem is not deterministic due to the random acceleration profile with which the car in the front moves. This randomness introduced by the unpredictable behaviour of the front car makes this problem more challenging. Results for the CarFollowing problem are shown in Table 4.3. An average open-loop accuracy of 96.53% is achieved with at most three rules, each having 3.28 terms on an average. For this problem, we apply the closed-loop re-optimization (Blocks 4 and 5 to produce Block 6 in Figure 4.7) on the entire NLDT . As shown Table 4.4, NLDT* is able to achieve better closed- 88 Table 4.3: Results on CarFollowing problem correspoing to open-loop training (NLDT ). Rule Length Compl. Rate Train. Acc. 96.41 ± 1.97 3.28 ± 0.65 100 ± 0.00 Test. Acc. 96.53 ± 1.90 # Rules 2.40 ± 0.66 Depth 1.90 ± 0.30 loop performances (100% completion rate and better average cumulative reward). Figure 4.11 shows that NLDT* adheres the 30 gap between the cars more closely than original DNN or NLDT . Table 4.4: Closed-loop performance analysis after re-optimizing NLDT for CarFollowing problem (k = 103). Cumulative Reward AI Avg ± Std Best 174.16k 173.75k ±20.95 DNN NLDT  174.15k 173.87k ±16.48 179.76k 179.71k ±0.95 NLDT* Compl. Rate 100 ± 0.00 100 ± 0.00 100 ± 0.00 Figure 4.11: Relative distance plot for CarFollowing problem. 89 0100200300400500600700800Time Steps26.527.027.528.028.529.029.530.030.531.0Relative Distancedsafe=30DNNNLDTNLDT* 4.8.2.1 NLDT for CarFollowing Problem The NLDT  obtained for the CarFollowing problem is shown in Figure 4.12. The rule-set is Figure 4.12: NLDT  for the CarFollowing problem. Normalization constants are: min = [0.25, -7.93, -1.00], max = [30.30, 0.70, 1.00]. provided in its natural if-then-else form in Algorithm 6. Algorithm 6: Ruleset corresponding to NLDT  (Figure 4.12) of the CarFollowing problem. if 0.63(cid:98)0 − 0.87(cid:98)1−2(cid:98)2 − 1.00 ≤ 0 then if 0.96(cid:98)1−3 − 0.58(cid:98)0 + 1.00 ≤ 0 then Action = 1 else end Action = 0 else end Action = 1 Recall that the physical meaning of state variables is: 0 →  (relative distance between front car and rear car), 1 →  (relative velocity between front car and rear car) and 2 →  (acceleration value (−1 or +1 m/s2) at the previous time step). Action = 1 stands for acceleration and Action = 0 denotes deceleration of the rear car in the next time step. From the first rule (Node 0), it is clear that if the rear car is close to the front car ((cid:98)0 ≈ 1), acceleration of the rear car (both(cid:98)1 and(cid:98)2 lying in [1,2]). Thus, Node 4 (Action = 1, indicating the root function 0(x) is never going to be positive for any value of relative velocity or previous 90 acceleration of the rear car in the next time step) will never be invoked when the rear car is too close to the front car. Thus for(cid:98)0 ≈ 1, the control always passes to Node 1. A little analysis will also reveal that for(cid:98)0 ≈ 1, the rule 1(x) > 0 for any relative velocity(cid:98)1 ∈ [1, 2]. This means that when the two cars are relatively close, only Node 3 gets fired to decelerate (Action = 0) the rear car. This policy is intuitively correct, as the only way to increase the gap between the cars is for the controlled rear car to be decelerating. However, when the rear car is far away for which (cid:98)0 ≈ 2, Action 1 (Node 4) gets fired if (cid:98)1 > 1.829(cid:112)(cid:98)2. If the rear car was decelerating in the previous time step (meaning (cid:98)2 = 1), the obtained NLDT recommends that the rear car should accelerate if (cid:98)1 ∈ [1.829, 2], or when already accelerating in the previous time step ((cid:98)2 = 2), Node 4 does not fire, as (cid:98)1 can never be the magnitude of the relative velocity is small, or when 1 ∈ [−0.776, 0.700]/. This will help maintain the requisite distance between the cars. On the other hand, if the rear car was √ 2 and the control goes to Node 1 for another check. Thus, the rule in Node 0 more than 1.829 makes a fine balance of the rear car’s movement to keep it a safe distance away from the front car, based on the relative velocity, position, and previous acceleration status. When the control comes to Node 1, Action 1 (acceleration) is invoked if(cid:98)1 ≥ 0.96/(0.58(cid:98)0 − 1). For(cid:98)0 ≈ 2, this happens when (cid:98)1 > 1.817 (meaning that when the magnitude of the relative velocity is small, or 1 ∈ [−0.879, 0.700]/), the rear car should accelerate in the next time step. For all other negative but large relative velocities 1 ∈ [−7.930, 0.879]/), meaning the rear car is rushing to catch up the front car, the rear car should decelerate in the next time step. From the black-box AI data, our proposed methodology is able to obtain a simple decision tree with two nonlinear rules to make a precise balance of movement of the rear car and also allowing us to understand the behavior of a balanced control strategy. Results of NLDT’s performance on problems with two discrete actions (Tables 4.2, 4.3 and 4.4) indicate that despite having a noticeable mismatch with the open-loop output of the oracle black-box policy , the closed-loop performance of NLDT is at par or at times better than . This observation suggests that certain state-action pairs are not of crucial importance 91 when it comes to executing the closed-loop control and, therefore, errors made in predicting these state-action events do not affect or deteriorate the closed-loop performance. 4.9 Experiments and Analysis on Multiple Discrete Action Space In this section, we investigate the performance of proposed NLDT method to approximate the behavior of the parent ANN/black-box AI for control tasks involving more than two discrete actions. Here, as mentioned before, we compare the open-loop and closed-loop performances across different representations of NLDT, i.e. binary-split NLDT and multi-split NLDT, and two different training dataset distributions, namely, regular and balanced. 4.9.1 MountainCar Problem A schematic of this environment is shown in Figure 4.2. The car starts somewhere near the bottom of the valley and the goal of the task is to reach the flag-post located on the right up-hill with non-negative velocity. The fuel is not enough to directly climb the hill and hence a control strategy needs to be devised to move car back (left up-hill), leverage the potential energy and then accelerate it to eventually reach the flag-post within 200 time-steps. The car receives the reward value of -1 for each time-step, until it reaches the flag-post where the reward value is zero. The car is controlled using three actions: accelerate left (action 0), do nothing (action 1) and accelerate right (action 2) by observing its state which is given by two state-variables:  position → 0 and velocity  → 1. We use the SARSA algorithm [84] with tile encoding to derive the black-box AI controller, which is represented in form of a tensor and has total 151, 941 elements. Compilation of results of the NLDT controller induced using different NLDT-representations and training dataset distributions is presented in Table 4.5. A state-action plot of the black-box AI and the NLDT controller corresponding to the first row of Table 4.5 is provided in Figures 4.3a and 4.3b, respectively. It can be seen from the plots that about 8% mismatch in the open-loop performance (i.e. testing accuracy in Table 4.5) comes from the lower region of state-action plot (Figures 4.3a and 4.3b) due to highly non-linear output of the black-box AI controller. Again, 92 despite having this mismatch, the NDLT controller is able to achieve close to 100% closed-loop control performance. The regular dataset is able to produce a multi-split NLDT with three control Table 4.5: Mountain car results. The numbers indicate average scores. Testing Accuracy 91.92 88.70 Balanced Training Accuracy No Yes No Yes 90.61 86.11 91.66 88.53 Depth # Rules Binary-split NLDT 2 4 2 3 Multi-split NLDT 2 2 3 4 91.50 88.44 Avg Rule Length Completion Rate 3.00 3.00 1.67 2.50 99.8 100.0 100.0 100.0 rules with an average 1.67 terms in each rule to achieve 100% closed-loop performance. The multi-split NLDT has a comparable performance to that of binary-split NLDT. 4.9.2 NLDT for MountainCar Problem The NLDT  obtained for the MountainCar problem is shown below in Figure 4.13. The resulting rule-set is also shown in if-then-else statements in Algorithm 7. Figure 4.13: NLDT  for MountainCar problem. Normalization constants: min = [-1.20, -0.06], max = [0.50, 0.06]. This rule-set corresponds to the plot shown in Figure 4.3b. A detail analysis of the two rules can be made to have a deeper understanding of the control policy. 93 Algorithm 7: MountainCar NLDT . Normalization constants are:  = [-1.20, -0.06],  = [0.50, 0.06]. (cid:12)(cid:12)(cid:12)−0.22(cid:98)0(cid:98)1 + 0.28(cid:98)1−1 − 0.63(cid:98)0−2 + 0.96(cid:12)(cid:12)(cid:12) − 0.36 ≤ 0 then (cid:12)(cid:12)(cid:12)−0.30(cid:98)1 2 + 1.39(cid:12)(cid:12)(cid:12) − 0.53 ≤ 0 then 2 − 0.28(cid:98)0 if if else end Action = 2 Action = 1 else end Action = 0 4.9.3 LunarLander Problem This problem is motivated from a classic problem of design of a rocket-controller. Here, the state of the lunar-lander is expressed with eight state variables, of which six can assume continuous real values, while the rest two are categorical, and can assume a Boolean value. The first six state variables indicate the (, ) position, and velocity and angular orientation and angular velocity of the lunar-lander. The two Boolean state variables provide the indication regarding the left-leg and right-leg contact of lunar-lander with the ground terrain. The lunar-lander is controlled using four actions: action 0 → do nothing, action 1 → fire left engine, action 2 → fire main engine and action 3 → fire right engine as shown schematically in Figure 4.8c. The black-box DNN based controller for this problem is trained using the PPO algorithm [61] and involves two hidden layers of 64 nodes. Table 4.6 provides the compilation of results obtained using open-loop training. It is evident from the table that the binary-split NLDT  representation is superior to the multi- split NLDT  representation in terms of both open-loop and closed-loop performances. In this problem, a better closed-loop performance of NLDT  is observed when the open-loop training is done on the balanced dataset. This indicates that under-represented actions in regular training dataset (see Table 4.1) limits the exposure to some crucial states during the training phase of NLDT. The crucial state-action pairs (or experience) necessary for closed-loop control gets captured while creating the balanced dataset, and so, despite having the lower open-loop performance on the testing 94 Table 4.6: LunarLander open-loop training (NLDT ) results. The numbers indicate average scores. Testing Accuracy Depth # Rules Binary-split NLDT Balanced Training Accuracy No Yes No Yes No Yes No Yes 79.17 69.83 87.43 81.74 75.84 68.31 86.95 82.17 76.36 66.58 81.74 71.52 69.68 70.79 79.40 71.08 Multi-split NLDT 3 3 6 6 2 2 6 6 5.60 4.40 34.70 25.70 4 5 37 41 Avg Rule Length Completion Rate 5.59 5.79 4.94 5.17 3.50 5.80 3.84 3.95 14.00 42.00 48.00 93.00 0.11 21.60 17.70 89.50 dataset (which is regular), the overall closed-loop performance improves. The best performance is observed with a binary-split NLDT  trained on the balanced dataset. 4.9.3.1 NLDT for LunarLander Problem The topology of one of the binary-split NLDT  obtained after open-loop training using balanced dataset is shown in Figure 4.14. This NLDT  successfully lands the lander in 93.4% episodes and has in total 26 rules each having about 4.15 terms involving state variables. It is understandable that a complex control task involving many state variables cannot be simplified or made interpretable with just one or two control rules. A separate study focusing at identifying crucial state-action pairs would help us understand this phenomenon better and we plan to do this in our future work. Next, we use a part of the NLDT  from the root node to obtain the pruned NLDT() (step ‘B’ in Figure 4.7) and re-optimize all weights (W) and biases () using the procedure discussed in Section 4.6.3 (shown by orange box in Figure 4.7) to find closed-loop NLDT*.   Table 4.7 shows that for the pruned NLDT-3 which comprises of the top three layers and involves only four rules of original 26-rule NLDT  (i.e. NLDT-6), the closed-loop performance increases 95 Figure 4.14: NLDT-6 (with 26 rules) and other lower depth NLDTs for the LunarLander problem. Lower depth NLDTs are extracted from the depth-6 NLDT. Each node has an associated node-id (on top) and a node-class (mentioned in bottom within parenthesis). Table 4.7 in provides results on closed-loop performance obtained using these trees before and after applying re-optimization on rule-sets using the closed-loop training procedure. from 51% to 96% (NLDT*-3 results in Table 4.7) after re-optimizing its weights and biases with closed-loop training. The resulting NLDT with its associated four rules are shown in Figure 4.15. Figure 4.15: Final NLDT*-3 for LunarLander prob. (cid:98) is a normalized state variable (see Sec- tion 4.6.1). As shown in Table 4.7, the NLDT* with just two rules (NLDT-2) is too simplistic and does 96 not recover well after re-optimization. However, the NLDT*s with four and seven rules achieve a near 100% closed-loop performance. Clearly, an NLDT* with more rules (NLDT-5 and NLDT-6) are not worth considering since both closed-loop performances and the size of rule-sets are worse than NLDT*-4. Note that DNN produces a better reward, but not enough completion rate, and the policy is more complex with 4,996 parameters. Table 4.7: Closed-loop performance on LunarLander problem with and without re-optimization on 26-rule NLDT . Number of rules are specified in brackets for each NLDT and total parameters for the DNN is marked. Re-Opt. NLDT-2 (2) Before −1675.77 ± 164.29 −133.95 ± 2.51 After Before After 0.00 ± 0.00 48.00 ± 7.38 NLDT-3 (4) 42.96 ± 13.83 231.42 ± 17.95 51.00 ± 3.26 96.00 ± 2.77 56.16 ± 23.50 182.87 ± 21.92 NLDT-5 NLDT-4 (7) (13) Cumulative Reward 54.24 ± 27.44 234.98 ±22.25 Completion Rate 82.00 ± 9.80 99.00 ±1.71 79.00 ± 7.66 93.00 ± 7.59 DNN (4,996) 247.27 ± 3.90 94.00 ±1.96 NLDT-6 (26) 169.43 ±23.96 214.94 ± 17.31 93.00 ±3.30 94.00 ± 4.45   To demonstrate the efficacy and repeatability of our proposed approach, we perform another run of the open-loop and closed-loop training and obtain a slightly different NLDT*-3, which is shown in Figure 4.16. This NLDT also has four rules, which are shown in Table 4.8. Four rules corresponding to the pruned NLDT() (Depth 3) are also shown in the table for comparison. It can be noticed that the re-optimization of NLDT through closed-loop training (Section 4.6.3) modifies the values of coefficients and biases, however the basic structure of all four rules remains intact. Figure 4.17 shows the closed-loop training curve for generating NLDT* from Depth-3 NLDT() .   The objective is to maximize the closed-loop fitness (reward)   (Eq. 4.11) which is expressed as the average of the cumulative reward  collected over  episodes. It is evident that the cumulative reward for the best-population member climbs to the target reward of 200 at around 25-th generation and the average cumulative reward of the population also catches up the best cumulative reward value with generations. 97 Figure 4.16: Topology of Depth-3 NLDT() obtained from a different run on the LunarLander problem. The equations corresponding the conditional-nodes before and after re-optimization are provided in Table 4.8.   A visualization of the real-time closed-loop performance obtained using this new NLDT (Fig- ure 4.16) for two different rule-sets (i.e. before applying re-optimization and after applying the re-optimization through closed-loop training) is shown in https://youtu.be/DByYWTQ6X3E. It can be observed in the video that the closed-loop control executed using the Depth-3 NLDT()   comprising of rules obtained directly from the open-loop training (i.e. without any re-optimization) is able to bring the LunarLander close to the target. However the LunarLander hovers above the landing pad and the Depth-3 NLDT() is unable to land it in most occasions. Episodes in these cases are terminated after the flight-time runs out. On the other hand, the Depth-3 NLDT* compris- ing of rule-sets obtained after re-optimization through closed-loop training is able to successfully land the LunarLander.   4.10 Conclusions In this work, we have proposed a two-step strategy to arrive at hierarchical and relatively interpretable rulesets using a nonlinear decision tree (NLDT) concept to facilitate an explanation of the working principles of AI-based policies. The NLDT training phases use recent advances in 98 Rules before Re-optimization (Depth-3 NLDT()  ) 2 + 0.83(cid:12)(cid:12)(cid:12) − 0.85 (cid:12)(cid:12)(cid:12)−0.23(cid:98)0(cid:98)2−1(cid:98)6−1(cid:98)7−1 − 1.00(cid:98)1−1(cid:98)6 − 0.79(cid:98)0−1(cid:98)1−1(cid:98)6 0.17(cid:98)2−1 − 0.64(cid:98)3(cid:98)7−1 + 0.90(cid:98)1−2(cid:98)6−2(cid:98)7−3 + 0.29 0.82(cid:98)7−1 + 0.52(cid:98)0−1(cid:98)4(cid:98)6−1 − 0.59(cid:98)4−1 − 0.95 (cid:12)(cid:12)(cid:12)−0.16(cid:98)4−3(cid:98)6−3(cid:98)7 − 0.86(cid:98)0(cid:98)5−1(cid:98)6−3 + 1.00(cid:98)4(cid:98)6−1 − 0.70(cid:12)(cid:12)(cid:12) − 0.26 (cid:12)(cid:12)(cid:12)−0.39(cid:98)0(cid:98)2−1(cid:98)6−1(cid:98)7−1 − 0.96(cid:98)1−1(cid:98)6 − 0.12(cid:98)0−1(cid:98)1−1(cid:98)6 2 + 0.89(cid:12)(cid:12)(cid:12) − 0.80 0.17(cid:98)2−1 − 0.78(cid:98)3(cid:98)7−1 + 0.90(cid:98)1−2(cid:98)6−2(cid:98)7−3 + 0.35 0.82(cid:98)7−1 + 0.52(cid:98)0−1(cid:98)4(cid:98)6−1 − 0.59(cid:98)4−1 − 0.96 (cid:12)(cid:12)(cid:12)−(cid:16)1.3 × 10−3(cid:17)(cid:98)4−3(cid:98)6−3(cid:98)7 − 0.86(cid:98)0(cid:98)5−1(cid:98)6−3 + 0.65(cid:98)4(cid:98)6−1 − 0.42(cid:12)(cid:12)(cid:12) − 0.26 Node 0 1 2 6 Node 0 1 2 6 Table 4.8: NLDT rules before and after the closed-loop training for LunarLander problem, for which NLDT* is shown in Figure 4.16. Video showing the simulation output of the performance of NLDTs with rule-sets mentioned in this table can be found at https://youtu.be/DByYWTQ6X3E. Respective minimum and maximum state variables are min = [-0.38, -0.08, -0.80, -0.88, -0.42, -0.85, 0.00, 0.00], max = [0.46, 1.52, 0.80, 0.50, 0.43, 0.95, 1.00, 1.00], respectively. Rules after Re-optimization (Depth-3 NLDT*) nonlinear optimization to focus its search on rule structure and details describing weights and biases of the rules by using a bilevel optimization algorithm. Starting with an open-loop training, which is relatively fast (due computationally fast fitness evaluation) but uses only time-instant state-action data, we have proposed a final closed-loop training phase in which the complete or a part of the open-loop NLDT is re-optimized for weights and biases using complete episode data. Results on popular discrete action problems have amply demonstrated the usefulness of the proposed overall approach. This proof-of-principle study encourages us to pursue a number of further studies. First, the scalability of the NLDT approach to large-dimensional state-action space problems must now be explored. A study conducted in Chapter 3 on binary classification of dominated versus non- dominated data in multi-objective problems was successfully extended to 500-variable problems. While it is encouraging, the use of customization methods for initialization and genetic operators using problem heuristics and/or recently proposed innovization methods [46] in the upper level problem can be tried. Second, this study has used a computationally fast open-loop accuracy measure as the fitness for evolution of the NLDT . This is because, in general, an NLDT  with 99 Figure 4.17: Closed-loop training plot for finetuning the rule-set corresponding to depth-3 NLDT() (Table 4.8) to obtain NLDT* for LunarLander problem.   a high open-loop accuracy is likely to achieve a high closed-loop performance. However, we have observed here that a high closed-loop performance is achievable with a NLDT  having somewhat degraded open-loop performance, but re-optimized using closed-loop performance metrics. Thus, a method to identify the crucial (open-loop) states from the AI-based simulation dataset that improves the closed-loop performance would be another interesting step for deriving NLDT . This may eliminate the need for re-optimization through closed-loop training. Third, a more comprehensive study using closed-loop performance and respective complexity as two conflicting objectives for a bi-objective NLDT search would produce multiple trade-off control rule-sets. Such a study can, not only make the whole search process faster due to the expected similarities among multiple policies, they will also enable users to choose a single policy solution from a set of accuracy-complexity trade-off solutions. 100 051015202530Generation5004003002001000100200RewardClosed-Loop TrainingBest RewardAvg. Reward CHAPTER 5 SCALE-UP STUDY AND IMPROVISATION In this chapter, we focus at how the overall algorithm developed in previous chapter can be made more efficient in terms of – training time and scalability. To this purpose, we introduce a benchmark problem of planar serial robotic manipulator. This problem is inspired from the classical Acrobot control problem [64]. A schematic of the acrobot problem and serial manipulator problem is provided in Figure 5.1a and 5.1b respectively. (a) Acrobot (b) Planar Serial Manipulator Figure 5.1: Acrobot and a customized Planar Serial Manipulator benchmark problems. The state-space for acrobot environment (Figure 5.1a) is six dimensional with following state variables: (1), (1), (2), (2), 1 and 2, where 1 is the angle the first link makes with the vertical axis and 2 is the angle the longitudinal axis of the second link makes with that of the first one. The second joint between the first and second link is actuated with a motor. In case of planar manipulator (Figure 5.1b), the sate space comprises of angular position  and angular velocity  of each joint. Thus, for a -link manipulator involving  revolute joints, the state space would be 2 dimensional. The motor is located at the last joint of the manipulator and is actuated 101 using three torque values: −, 0 and . Each link is of 1 unit length and has its center-of-mass at its geometric center. The motor is assumed to be massless for the sake of simplicity. The base of the planar-manipulator is located at (0, 0, 0) and the motion of the planar manipulator is limited to the XZ plane. There is a downward gravitational pull () of 10 units (i.e. -10 along vertical Z axis). Torque is applied along the Y-axis. The task in this problem is to take the end-effector (i.e. tip of the last link) of the serial-manipulator to a desired height of  units by supplying torque to the motor located at the last joint (joint between ( − 1)-th and -th link). The difficulty of this benchmark problem can be adjusted by 1. changing the number of links, 2. changing the value of desired height level , 3. changing the value of torque  and 4. placing extra motors at other joints. We simulate the mechanics of the planar serial manipulator using PyBullet [85]: a Python based physics engine. In our work, we provide two case-scenarios by focusing at the first three points of the above list. As mentioned before, by changing the number of links, the dimension of the state-space changes. The dimension of the action-space depends on the number of motors used. In the present work, we keep the number of motors fixed to one. The details regarding two environments which are created and studied in this chapter are summarized in Table 5.1. Table 5.1: Details regarding custom designed Planar Serial Manipulator environments. Env. Name 5-Link Manipulator 10-Link Manipulator # Links () 5 10 Motor Torque () 1,000 2,000 Desired Height () +2 +2 # State Vars. 10 20 The reward function (x, ) is given by the following equation 102 (cid:17)2 −1 −(cid:16) +1− ++1 100 if  ≥ H,  (x) = if  < , (5.1) where  is the vertical location of the end-effector. The minimium value for  is − when the entire manipulator is stretched to its full length and all joint angles (i.e. ) are 0. At the beginning of an episode, joint angles  of the manipulator are randomly initialized between −5 deg and +5 deg, and the angular velocities  are initialized to a value between −0.5 rad/sec and +0.5 rad/sec. In next sections we discuss results obtained on the above two custom designed environments by using different procedures of inducing NLDE . The black-box AI (DNN) is trained using the PPO algorithm [61]. 5.1 Ablation Study for Open-loop Training In this section, we launch two separate studies related to open-loop training procedure (see Figure 4.7). Following our results from Chapter 4, we restrict our discussion to binary-split NLDTs. It was seen in the previous chapter that the open-loop training is conducted using the hierarchical bilevel-optimization algorithm, which is discussed at length in Chapter 3. A dedicated bilevel-optimization algorithm is invoked to derive the split-rule  (x) at a given conditional node. The upper-level search is executed using a discrete version of a genetic algorithm and the lower- level search is realized through an efficient real-coded genetic algorithm (RGA). Evolutionary algorithms are in general considered robust and have a potential to conduct more global search. However, being population driven, their search-speed is often less that of classical optimization algorithms. In this section, we study the effect of replacing the real-coded algorithm with a classical sequential quadratic programming (SQP) optimization algorithm in the lower-level of the overall bilevel algorithm for obtaining NLDT . Later, closed-loop training based on real-code genetic algorithm is applied to NLDT  to obtain NLDT* by re-optimizing the real valued coefficients of 103 NLDT  (section 4.6.3). We use the SciPy [86] implementation of SQP. The initial point required for this algorithm is obtained using the dipole concept (Figure 3.7, Eq. 3.11). For analysis, we induce the NLDT  of depth-3 on the balanced training dataset (section 4.7.1.2) of 10,000 datapoints. The testing dataset is regular (see section 4.7.1.1) and comprises of 10,000 datapoints. Comparison of accuracy scores and average training time of inducing NLDT  by using SQP and RGA algorithm at lower-level is provided in Table 5.2. For a given procedure (SQP or RGA) the best NLDT  from 10 independent runs is chosen and is re-optimized using closed-loop training (section 4.6.3). Statistics regarding closed-loop performance of NLDT* is shown in the last two columns of Table 5.2. It is to note here that the closed-loop training is done using the real-coded genetic algorithm (RGA) discussed in section 4.6.3. Table 5.2: Comparing performance of different lower-level optimization algorithms. For compari- son, closed-loop performance of the original DNN policy is also reported. Open-Loop NLDT  Training Time (s) 5-Link Manipulator 15.29 ± 4.95 Closed-Loop NLDT* Cumulative Reward Algo. Name Training Accuracy 62.46 ± 2.01 SQP RGA 71.14 ± 1.77 DNN NA 56.57 ± 1.00 SQP RGA 65.84 ± 0.85 DNN NA Testing Accuracy 69.34 ± 5.39 69.17 ± 4.39 NA 55.64 ± 3.58 62.60 ± 3.09 NA −91.52 ± 14.42 1091.56 ± 319.18 −123.48 ± 25.11 −170.90 ± 42.57 NA 10-Link Manipulator 39.27 ± 11.63 2860.96 ± 789.49 NA −318.82 ± 15.52 −281.88 ± 9.38 −325.86 ± 4.63 Completion Rate 97.92 ± 1.93 97.00 ± 5.13 89.00 ± 5.19 96.00 ± 3.92 95.00 ± 4.31 85.88 ± 1.94 It can be observed from the results that open-loop training done with SQP in lower-level is about 70 times faster than the training done using RGA at lower level. The training done with RGA has a better overall open-loop performance. Thus, if the task is to closely mimic the behavior of black-box AI or if only a high classification accuracy is desired (in case of classification problems), then RGA is the recommended algorithm for lower-level optimization to obtain NLDT . However, NLDT* obtained after re-optimizing NLDT  corresponding to SQP and RGA have similar closed-loop completion rate (last column of Table 5.2). This implies that despite low open-loop accuracy scores, the open-loop training done using SQP in lower-level was able to successfully determine 104 the template of split-rules  (x) and the topology of NLDT, which upon re-optimization via closed- loop training algorithm could fetch a decent performing NLDT*. During open-loop training, the search on weights and coefficient using SQP was possibly not as perfect as compared to the one obtained through RGA, however, the re-optimization done through closed-loop training could compensate this shortcoming of SQP algorithm and produce NLDT* with a respectable closed-loop performance. Additionally, in either cases, the NLDT* obtained always had a better closed-loop performance than the original black-box DNN policy. This observation suggests that it is preferable to use SQP in lower-level during open-loop training to quickly arrive at a rough structure of NLDT  and then use closed-loop training to derive a high performing NLDT*. Another important thing to note here is that the open-loop training time for obtaining NLDT  for 10-link manipulator problem is about 2.5 times less than the corresponding training time for obtaining NLDT  for the 5-link manipulator problem. This is because, the population size in upper-level GA for 10-link manipulator problem (20 state variables) is twice that of the one for the 5-link manipulator problem (10 state variables). Also, the upper-level search in the high-dimensional space could possibly take more generations than it will take for lower-dimensional search spaces to converge. 5.2 Closed-loop Visualization In this section we provide a visual insight into the closed-loop performance of NLDT* and DNN which we derived in the previous section. In our case, the frequency of the simulation is set to 240Hz, meaning that the transition to the next state is calculated using the time-step of 1/240 seconds. Geometrically speaking, this implies that the Euclidean distance between states from neighboring time-steps would be small. The AI (DNN or NLDT*) outputs the action value of 0 (− torque), 1 (0 torque) or 2 (+ torque) for a given input state. Action Vs Time plots corresponding to different closed-loop simulation runs obtained by using DNN, NLDT* (SQP)1 and NLDT* (RGA)2 as controllers is shown in Figure 5.2 and 5.3 for 5-link and 10-link manipulator problems respectively. 1NLDT* (SQP) indicates that the corresponding NLDT  was derived using the SQP algorithm in the lower-level 2NLDT* (RGA) indicates that the corresponding NLDT  was derived using the RGA algorithm in the lower-level 105 (a) DNN (b) NLDT* (SQP) (c) NLDT* (RGA) Figure 5.2: Action Vs. Time plot for 5-Link manipulator problem. Figure 5.2b provides the plot for NLDT* which is obtained from the NLDT  trained using SQP algorithm in lower-level. Similarly, Figure 5.2c provides the plot for NLDT* which is obtained from the NLDT  trained using RGA algorithm in lower-level. 106 020406080100120Time Steps012Action020406080100120140Time Steps012Action020406080100120140Time Steps012Action Certain key observations can be made by looking at the plots in Figure 5.2. The control output for DNN is more erratic, with sudden jerks as compared to the control output of NLDT* (SQP) and NLDT* (RGA). The performance of NLDT* in Figures 5.2b and 5.2c is smooth and regular. This behaviour can be due to the involvement of a relatively simpler non-linear rules (as compared to the complicated non-linear rule represented by DNN) which are captured inside NLDT*. This is equivalent to the observation we made for the mountain car problem in Figures 4.3a and 4.3b, wherein the black-box AI had a very erratic behavior for the region of state-space in the lower-half of the state-action plot, while the output of NLDT was more smooth. Additionally, it was seen in Table 5.2 that the NLDT* (irrespective of how its predecessor NLDT  was obtained, i.e. either through SQP or RGA in lower-level) showed better closed-loop performance than the parent DNN policy. This observation implies that simpler rules expressed in the form of a nonlinear decision tree have better generalizability, thereby giving more robust performance for randomly initialized control problems. A careful investigation to the plots in Figure 5.2b and 5.2c reveals that only two out of three allowable actions are required to efficiently execute the given control task of lifting the end-effector of a 5-link serial manipulator. This concept will be used to re-engineer the NLDT*, a discussion regarding which is provided in the next section. A similar argument can be made from plots in Figure 5.3 for 10-link manipulator problem. This is a slightly difficult problem to solve than the 5-link version since it involves twice the number of state variables. Interestingly, the search for NLDT* provides us with the simplest solution to this problem to lift the end-effector to the desired height of 2 units above the base. The simplest solution here is to give a constant torque in one direction as shown in Figure 5.3b. However, for this problem the best closed-loop performance in terms of cumulative reward (second last column of Table 5.2) is obtained using the control strategy corresponding to NLDT* (RGA) (Figure 5.3c). Here too, for most of the states, only one action is required, and occasionally other actions are invoked. 107 (a) DNN (b) NLDT* (SQP) (c) NLDT* (RGA) Figure 5.3: Action Vs. Time plot for 10-Link manipulator problem. Figure 5.3b provides the plot for NLDT* which is obtained from the NLDT  trained using SQP algorithm in lower-level. Similarly, Figure 5.3c provides the plot for NLDT* which is obtained from the NLDT  trained using RGA algorithm in lower-level. 108 050100150200250Time Steps012Action050100150200250Time Steps012Action050100150200250300Time Steps012Action As mentioned before, in the next section we will discuss a post-processing approach to further simplify the NLDT*. 5.3 Reengineering NLDT* It was seen in action-time plots in Figure 5.2b, 5.2c, 5.3b and 5.3c that not all actions are required to perform a given control task. Also, it might be possible that while performing a closed-loop control using NLDT, not all branches and nodes of NLDT are visited. Thus, the portion of the NLDT which is not being utilized or is getting utilized very rarely can be pruned and the overall NLDT architecture can be made simpler. To illustrate this idea, we consider the NLDT which is derived for the 5-link manipulator problem. The topology of the best performing NLDT  (SQP) for the 5-link manipulator problem is shown Figure 5.4a. As mentioned before, this NLDT  is trained on a balanced training dataset which is generated by collecting state-action pairs using parent DNN controller. In the figure, for each node, the information regarding its node-id, class distribution (given in square parenthesis) and the most dominating class is provided. Other than the root-node (Node 0), all nodes are colored to indicate the dominating class, however, it is to note that only the class associated to leaf-nodes carry the actual meaning while predicting the action for a given input state. This NLDT  comprises of three split-rules in total. The class-distribution for each node is obtained by counting how many datapoints from the balanced training dataset visited a given node. Thus, the root node comprises of all datapoints (total 10,000), which are then scattered according to the split-rules present at each conditional node. Figure 5.4b provides topology of NLDT* which is obtained after re-optimizing NLDT  of Figure 5.4a using closed-loop training. As discussed in section 4.6 and 4.6.3, the topology of the tree and the structure of non-linear rules is identical for both: NLDT  and NDLT*. However, the weights and biases of NLDT* are updated to enhance the closed-loop control performance. Similar to NLDT  of Figure 5.4a, the information regarding node id and class-distribution is provided for all the nodes of NLDT* in Figure 5.4b. However, the data distribution in NLDT* 109 (a) NLDT  (b) NLDT* Figure 5.4: NLDTs for 5-Link Manipulator problem. is obtained by using the actual state-action data from closed-loop simulations, wherein NLDT* is used as a controller. Total 10,000 datapoints are collected in form of sequential states-action pairs from closed-loop simulation runs which are executed using NLDT*. As can be seen in the root node of NLDT* (Figure 5.4b), out of 10,000 states visited during closed-loop control, action 0 (− torque) was chosen by NLDT* for total 7736 states and action 2 (+ torque) was chosen for 2264 states. In none of the states visited during closed-loop control was action 1 (no torque) chosen. This is consistent with what we have observed in the Action Vs Time plot in Figure 5.2b, wherein most of the time action 0 was executed, while there was no event where action 1 was executed. The flow of these 10,000 state-action pairs through NLDT* and their corresponding distribution 110 in each node of NLDT* is provided in Figure 5.4b. It can be observed that Node 5, Node 4 and Node 8 of NLDT* are never visited during closed-loop control. This implies that splits at Node 2, Node 1 and Node 6 are redundant. Thus, the part of NLDT* shown in red-box in Figure 5.4b can be pruned and the overall topology of the tree can be simplified. The pruned NLDT* will involve only one split (occurring at Node 0) and two leaf nodes: Node 1 and Node 6. However, it is to note here that we need to re-assign class-labels to the newly formed leaf nodes (i.e. Node 1 and Node 6) based on the data-distribution from closed-loop simulations. The old class-labelling for the Node 1 and Node 6 was done based on the open-loop data (Figure 5.4a). Using the new class distribution corresponding to NLDT* (Figure 5.4b), Node 1 is re-labelled with Class-2 and Node 6 with Class-0. The pruned version of NLDT* of Figure 5.4b is provided in Figure 5.5 (here nodes are re-numbered, with Node 6 of NLDT* in Figure 5.4b re-numbered to Node 2 in the pruned NLDT* as shown in Figure 5.5). The split-rule corresponding to the root-node is also shown. Interestingly, out of 10 total state-variables, only 2 are used to decide which action to execute for closed-loop control. 1 corresponds to the angular position of the second link and 9 variable corresponds to the angular velocity of the last joint (i.e. the joint between link-4 and link-5 where the motor is connected). Figure 5.5: Pruned version of NLDT* (Figure 5.4b) for 5-link manipulator problem. 5.4 Conclusion In this chapter we introduced benchmark problems to conduct scalable study and investigate and compare algorithms to efficiently conduct open-loop training. The lower level optimization 111 done using SQP during open-loop training reduces the overall open-loop training time by 70 times. The NLDT  induced using RGA in lower level shows robust and relatively better open-loop performance. However the closed-loop performance observed after re-optimizing NLDT  to NLDT* is similar in either case. This observation suggests the use of SQP in control tasks to quickly generate NLDT . An extension of this work can be made for action-spaces involving more than 3 actions. This can be achieved by allowing more joints of the serial-manipulator to get actuated using motors. Also, a hybrid approach combining SQP and RGA for open-loop training can be derived to efficiently induce high performing NLDT . This aspect will be particularly useful for classification problems. 112 EXTENSION TO REGRESSION AND CONTINUOUS CONTROL PROBLEMS CHAPTER 6 6.1 Introduction In this chapter, we provide a conceptual path on how to extend the idea of nonlinear decision tree to solve problems pertaining to regression and control systems involving continuous action as output. Regression problems involves several independent features which are represented with feature vector x, and one dependent variable . The task here is to learn the relationship between dependent and independent variables ( =  (x)) Traditionally, linear regression models and neural networks can be trained to predict the value of a dependent variable  from the input features x. However, these regression models are inherently complex since they translate to long equations to represent the mathematical relationship between output  and input features x. Regression trees have been popularly used to find piece-wise constant curves to fit the regression surface. The idea behind regression trees is to partition the feature space into sub-regions using one or more splits and then have a constant term for each such sub-region to approximate the value of the dependent variable . CART and M5 based trees fall in this category. An improvisation to this concept is suggested in [1], where, instead of having the leaf node to represent a constant valued function as in CART, leaf nodes represent a linear or quadratic functions comprising of  features (where  is a user-specified parameter). Partitioning splits however occur on only one feature. An example of interpretable tree generated in this fashion is shown in Figure 6.1. 6.2 Interpretable AI for Regression Problems using NLDT The interpretable AI (IAI) developed for regression in our case is represented in the form of a nonlinear decision tree (NLDT), with terminal leaf nodes representing regression equations and the non-terminal conditional split-nodes representing conditional rules. The set of conditions at split-nodes define the domain in the feature space where the regression rule at a follower terminal 113 Figure 6.1: Piecewise linear regression tree with two predictors from [1]. At each leaf node, features involved in the expression of two-regressor linear model is shown. Splits use only one feature variable. node is applicable. A conceptual illustration of the NLDT framework for regression task is provided in Figure 6.2. Figure 6.2: Conceptual layout of NLDT for regression task. The rule  at each terminal node is a function of feature vector x and provides the value of the dependent variable  (i.e.  = (x)). Each conditional rule  partitions the feature space into 114 two parts: (x) ≤ 0 and (x) > 0. Thus mathematically, for each -th leaf node: If:  (x) > 0, for all  < , and (x) ≤ 0, Rule:  = (x). (6.1) For example, for Node 1 ( = 0) leaf node, the rule indicates: If 0 ≤ 0, then  = 0(x). Node 3 rule ( = 2) indicates: If 0 > 0 ∧ 2 ≤ 0, then  = 2(x). The nonlinear decision tree is induced by hierarchically applying a Dual bilevel algorithm at each conditional node. For an -th conditional node, Stage 1 of the proposed dual bilevel algorithm focuses at deriving the regression rule  = (x). Stage 2 of the algorithm is then invoked to derive the partition rule (x) to split the data in conditional node  into two subsets:   follows the regression rule  = (x) and have a mean squared error (MSE) value within a user defined threshold value . Thus,   is declared as a terminal leaf node. A new dedicated dual bilevel algorithm is reapplied  to the datapoints present in node    and process repeats. This procedure of hierarchically inducing the decision tree continues until one of the termination criteria is met. We now discuss the dual bilevel algorithm which is used to derive regression and partition rule at a given conditional node .  . Points in    and    6.3 Dual Bilevel Algorithm (DBA) for Regression The dual bilevel algorithm is used to derive two rules at a given conditional node : 1) regression rule  = (x) and 2) partition rule (x). The objective of deriving regression rule (x) is to fit as many datapoints present in the conditional node  as possible. If the regression rule  = (x) is able to fit all datapoints present in node  within an acceptable limit on mean squared error, the dual bilevel algorithm is terminated and the node  is declared as the leaf node. However, if regression rule  = (x) is not able to fit all datapoints in , the second stage of dual bilevel algorithm is invoked to derive the partition rule (x). As mentioned before, the partition rule (x) partitions the data in node  into two subsets  ) (for which (x) ≤ 0), and • Left child node (  115 • Right child node (   ) (for which (x) > 0),  have their MSE value w.r.t to regression curve  = (x) within such that datapoints in subset   a pre-specified threshold value . Thus, for a new test datapoint x, conditions listed in Eq. 6.1 are first checked before applying regression rule  = (x) to estimate the value of the dependent variable . A high level pseudo-code for the dual bilevel algorithm is provided in Algorithm 8. Algorithm 8: Pseudo-code of dual bilevel algorithm (DBA) Input : Data in Conditional node :  datapoints with  independent variables ((1, 2, . . . , ) = x) and one dependent variable . Regression Rule  = (x) and partition rule (x). Output : // Bilevel Algorithm to derive (x) Step 1 derive regression rule:  = (x); if MSE(, (x)) ≤  then terminate; Derive Partition Rule (x); else end We will next discuss the bilevel algorithm which is used to derive the regression rule  = (x). 6.3.1 Data Normalization Before training and inducing the NLDT for regression, we normalize the data. Each feature  and dependent variable  is normalized using the following equation:  =  −  2 , (6.2) where  is the normalized value of the variable ,  and  are mean and standard deviation of variable , respectively. 116 6.3.2 Bilevel Regression Algorithm to Obtain (x) The regression rule to be evolved is expressed as a weighted sum of power-laws as given in equation below: where ’s are power-laws of type  = (x) = . Like in the standard classification task discussed in Chapter 3, the genome representing the template of the regression rule is represented with a block matrix B as shown below =1     11 + 22 + · · · +    + 0 1 , (6.3)  B = 11 12 13 . . . 1 21 22 23 ... ... ... . . . 2 ... . . .  1  2  3 . . .    . (6.4) The upper level of the bilevel algorithm conducts the search in the discrete space of exponents  . The lower level of the bilevel algorithm searches the coefficients  corresponding to power- laws  and the bias terms 0 and 1. The upper level algorithm is similar to the upper level algorithm developed for classification problems (see section 3.3.2), however the objective and constraint evaluation is different as shown in Eq. 6.5. For the lower level, we use the sequential quadratic program (SQP) to estimate the values of ’s and ’s. The optimization formulation corresponding to the problem of estimating the regression function  = (x) is given below: Minimize (B, w∗, ∗) = (w∗, ∗)|B, Subject to (w∗, ∗) ∈ argmin {(w, )|B} , −1 ≤  ≤ 1, ∀, 0 ∈ [−1, 1], 1 ∈ (0, 1],   ∈ , (6.5) where for our study here we choose  = {−3,−2,−1, 0, 1, 2, 3}. The allowable powers indicated by set  controls the maximum complexity achievable by our procedure. The fitness of the upper 117 level individual corresponds to the fitness of the optimal solution of the corresponding lower level problem. We will next discuss how the lower level fitness function is designed to facilitate the regression fit. 6.3.2.1 Lower Level Regression Optimization Since we are targeting at evolving a simple rule, it is highly likely that one simple rule won’t fit the entire dataset. Figure 6.3 illustrates one such situation involving two independent variables and one dependent variable. Here, datapoints are scattered across three disjoint islands. Most of Figure 6.3: Three Island Regression Problem. them belong to island 2 while island 1 has the least number of datapoints. There might exist one complicated regression surface which passes through all the datapoints. However, since we are interested in having simple rule, one of the simple solution is to have three simple linear rules, one for each island. Furthermore, in our overall algorithm, we would be interested to determine the rule corresponding to island 2 first since it has majority number of points. The fitness function  for the lower-level optimization is formulated to capture these aspects. The algorithm to compute  is provided in Algorithm 9. We will discuss the motivation behind different segments of Algorithm 9 here and explain the idea with the help of Figure 6.4. 118 Algorithm 9: Algorithm to compute the lower level fitness  Input : Dataset D in B-space of size  ×  and array  of length  comprising of values of dependent variable , weight vector w, bias valus 0 and 1. Lower level fitness value  Output : // predict the value of  1 (cid:48) = PredictY(D, w, 0, 1); // Compute the absolute difference between predicted and actual  2  = | − (cid:48)|; // Get a Boolean array of good points 3  =  ≤ ; // Compute the total number of good points values  []; 4  =  5 if  == 0 then // No good point found  = ComputeRMSE((cid:48),  ); // Root Mean Squared Error if  <  − 2 then 6 7 else 8 (cid:48). // Sorted array of error in ascending order  = Sort(   ); /* Get the error value of the best point from set of bad points */ (cid:48) = [ + 1]; if (cid:48) < 0.5 then +1 +1 =1 []2;  = 1 9 10 11 12 13 14 15 16 17 18 19 20 end else  = ComputeRMSE([], (cid:48)[]); else end  = ComputeRMSE([], (cid:48)[]); end  =  −  The value of  is computed based on which stage the regression surface is relative to the training data. Stage 1 Here, the regressed surface is far off from the location of training datapoints as shown in Figure 6.4a. This implies that no points are within a  distance from the regressed 119 (a) Stage 1 (b) Stage 2 (c) Stage 3 Figure 6.4: Computation of Lower Level Objective function  based on Algorithm 9. surface, i.e. ∀x ∈ , |(x) − (cid:48)(x)| > , where  and (cid:48) are actual and regressed values of x respectively. The value of parameter  is user specified. Under this scenario, the objective  is evaluated by computing mean squared error (MSE) between actual and predicted values. Minimization of MSE in Stage 1 brings the regression surface closer to the location of training dataset. This is shown pictorially in Fig. 6.4a. This segment is represented by lines 5-6 in Algorithm 9. Eventually, the regressed surface/curve will come in the vicinity of one (or more) datapoints (shown by red in Fig. 6.4a) and will result into  ≥ 0, where  is the number of datapoints for which the difference between predicted and actual  value is less than . Stage 2 During Stage 2, we already have atleast one point with | − (cid:48)| ≤  (i.e.  ≥ 1). In Stage 2, we design the objective function  such that its minimization will ensure maximization of number of good points (i.e. points with | − (cid:48)| ≤ ). One such way to enforce this is to set  = −. However, setting  this way resembles a discontinuous 120 many-to-one mapping. This will create local flat regions (similar to the situation discussed in Figure 3.6) thereby making it difficult for an optimization solver to navigate the search space of w and . We thus do the following modification to avoid creation of flat zones in the fitness landscape of : Step 1 Determine the closest point x(cid:48) in dataset D to the regressed surface such that |(cid:48)(x(cid:48)) − (x(cid:48))| >  and compute its error as (cid:48) = |(cid:48)(x(cid:48))− (x(cid:48))| (line 9-10 in Algorithm 9). If (cid:48) < 0.5, go to Step 2, else go to Stage 3 Step 2 Since (cid:48) < 0.5, x(cid:48) is potentially a next good point. Thus, we append datapoint x(cid:48) to the list of good points and compute the MSE on this modified set of good points. Let the MSE value of this set (good points + potentially good point) be denoted with .  is then computed as  =  −  as shown in line 12 and line 19 in Algorithm 9. Note that at first, the term  will remain constant and only  will get minimized while minimizing . This will result in tilting of the regressed surface as shown in Figure 6.4b (where regressed surface  is tilted to ). Eventually, minimization of  this way will successively cover more points (shown by shaded red colors in Figure 6.4b) and will bring the regression surface to location  (shown in Figure 6.4c). Stage 3 Here, the next closest point with (cid:48) >  is far away from the regression surface (marked with green in Figure 6.4c). Hence,  will be computed as  =  − , where  is the mean squared error of good points (i.e. all those points for which |(x) − (cid:48)(x)| ≤ ). In the example shown in Figure 6.4, values assumed by  during Stage 3 will be certainly less than those in Stage 2 since  term of  in 121 Stage 3 will be more by atleast 1 unit than the  term in Stage 2. Also,  term in Stage 3 will remain constant since the next possible good point (green colored in Figure 6.4c) is located very far away. Thus, minimization of  will imply minimization of . This will steer to re-position the line  (for which all red points of Figure 6.4c were within  distance) to line . Based on the above explanation and illustrations shown in Figure 6.4, in a particular lower level optimization run using a point based classical optimization approach, the values of  as observed in Stage 1 will be more than those observed in Stage 2 and values observed in Stage 2 will be more than those observed in Stage 3. This would be true in general, but will also depend on several other factors such as noise in the dataset, value of  and relative locations of datapoints in training set. Eventually, the regression curve will fit certain segment of the training data (represented by  in Figure 6.4c). In our experiments, we set  = 0.03. Computing  using the procedure discussed above removes the bottleneck of having flat fitness landscape. This allows us to use point based classical optimization algorithms to quickly determine optimal values of w and . Having a fast and reliable lower level optimization is really important for the efficient design of the overall bilevel algorithm to estimate the regression function (x). Next we discuss the upper level optimization which is used to estimate the structure and template of the regression function (x). 6.3.2.2 Upper Level Regression Optimization Fitness of the optimal solution of lower level optimization becomes the fitness of the upper level optimization as shown in Eq. 6.5. Since genetic operators in the upper level are designed to incrementally complexify the expression of the power-law rule (Eq. 6.3), starting from a very basic rule involving only one variable appearance, we observe a natural preference and bias for simpler rules over complex rules. Output from one run of bilevel regression algorithm is shown in Figure 6.5. It can be observed that the algorithm is able to successfully fetch us with a linear 122 regression rule (x) = 0.911 − 0.900 + 0.08 to fit the island comprising of maximum number of datapoints (Figure 6.5). Figure 6.5: Regression algorithm is applied on all datapoints. The obtained regression rule 0(x) = 0.911 − 0.900 + 0.08 is able to fit the subset of datapoints which are represented by Y-predicted (in green circles). Partition rule 0(x) will be now derived to identify the domain in -space (i.e. 0(x) ≤ 0)) where this regression rule is applicable. 6.3.3 Bilevel Partition Algorithm (BPA) to Obtain (x) The regression rule ((cid:48) = (x)) derived in the previous section fits certain portion of the dataset in node  (as an illustration, this is highlighted with green circles in Figure 6.5). The bilevel partition algorithm is now employed to derive a partitioning boundary (x) = 0 which separates (for which (x) > 0). The the dataset in  into two subsets:     have their mean squared error w.r.t. the regression curve (x) within fraction of datapoins in   a user specified threshold value . Hence, for these datapoints, the depended variable  can be predicted using the regression function (x). The partition rule (x) is derived to maximize the (for which (x) ≤ 0) while keeping their mean squared error value number of datapoints in    (where the error is measured between the actual  value and the predicted (cid:48) value) within . The optimization formulation to derive the partition rule (x) is organized in the bilevel (for which (x) ≤ 0) and    structure as shown below 123 Minimize (B, , w∗, ∗) = (w∗, ∗)|(B,), Subject to (B, , w∗, ∗) = (w∗, ∗)|(B,) ≤ 0, (w∗, ∗) ∈ argmin(cid:8)(w, )(cid:12)(cid:12)(w, ) ≤ 0(cid:9)(cid:12)(cid:12)(B,), −1 ≤  ≤ 1, ∀,  ∈ [−1, 1]+1,  ∈ {0, 1},   ∈ . (6.6) The overall bilevel algorithm is identical to the one developed for the classification problem to derive split-rules for a classification task (as explained in Section 3.3), with the only change in the objective and constraint functions. The partition rule (x) (where  is the node number (see Figure 6.2)) being derived is of the following type (x, w, , B) = 1 + 11 + . . . +   , (cid:12)(cid:12)1 + 11 + . . . +   (cid:12)(cid:12) −(cid:12)(cid:12)2(cid:12)(cid:12) , if  = 0, if  = 1. (6.7)  where all symbols carry similar meanings to those discussed in the equation of split-rule for classification problem (Eq. 3.2). x is the input feature vector,  is the power-law of type (x) =    , ’s are the coefficients and ’s are the biases. =1  Since both objective and constraint value for upper level optimization are borrowed from the optimal solution of the lower level optimization, we discuss the optimization formulation for lower level optimization in the next section. 6.3.3.1 Lower Level Partition Optimization The lower level optimization estimates optimal values of weights w and biases  which minimize the following objective function Min. (w, )(cid:12)(cid:12)(B,) = −   s.t. (w, )(cid:12)(cid:12)(B,) =    −  ≤ 0 −1 ≤  ≤ 1, ∀,  ∈ [−1, 1]+1 124 (6.8) after partition.    Here,    indicates the number of points in node  for which (x)|(w,,B,) ≤ 0 (Eq. 6.7). Thus, this objective function tends to maximize the number of points going to the left node is computed by measuring the mean squared error of the ac-     with the predicted (cid:48) values obtained using tual  value of datapoints in the left node   the regression function (cid:48) = (x) derived in the previous section. The constraint satisfaction (w, )|(B,) =    −  ≤ 0(cid:17) ensures that  value of all datapoints coming to the left (cid:16) (i.e. (x) > 0). Hence, the right node which contains datapoints from subset    after the split (i.e. (x) ≤ 0) can be reliably predicted by using the regression function node   (cid:48) = (x). In our experiments, we set the value of  parameter to 2  (where  = 0.03). It is to note here that the regression rule  = (x) is not applicable to points belonging to subset  becomes the    new non-terminal node and a dedicated dual bilevel algorithm is applied to it to derive its corre- sponding regression and partition rules (2(x) and 2(x) in Figure 6.2). This process is repeated until maximum depth of tree is reached or the regression rule is able to fit all datapoints belonging to a non-terminal node (in which case it is declared as a terminal leaf node). 6.3.4 Results on a Customized Benchmark Problem In this section, we show results obtained on different versions of customized benchmark three island problem (Figure 6.3). In the first experiment, we generate a pure dataset using the following set of equations for islands 1, 2 and 3: Island 1: Island 2: Island 3: Concatenate: 1 = 0 + 1 + 0.2, 2 = −0 + 1 + 0.1, 3 = 0 − 1 − 0.3,  = [1; 2; 3] [0, 1] ∈ [0, 0.2]2 [0, 1] ∈ [0.2, 0.7]2 [0, 1] ∈ [0.7, 1]2 (6.9) In the second experiment, we add noise to the pure dataset of Eq. 6.9 as shown below  ()  =  ()  + ,  = 1, 2, . . . ,  ∈ [−0.1, 0.1] (6.10) 125 Results obtained on pure and noisy three island datasets are provided in Figures 6.6 and 6.7 respectively. Very small MSE values for each of the three leaf nodes indicate the efficacy of our proposed Dual bilevel optimization algorithm to find regression and partition rules. (a) (b) Figure 6.6: Result on the Pure Three Island Dataset. (b) Intert- pretable AI representation using NLDT. Normalization constants are: Xmean = [0.49, 0.49], Xstd = [0.25, 0.25], Ymean = 0.05 and Ystd = 0.28. (a) Visualization of result. As can be seen from results shown in Figure 6.6 and Figure 6.7, the proposed approach of deriv- ing piece-wise non-linear regression equations fetches us with a relatively simple and interpretable regressor, which otherwise would have assumed a complicated form with other regression models like artificial neural networks or regression CART. The rules evolved are simplistic in structure and 126 (a) (b) Figure 6.7: Result on the Noisy Three Island Dataset. (b) Intert- pretable AI representation using NLDT. Normalization constants are: Xmean = [0.52, 0.52], Xstd = [0.28, 0.27], Ymean = 0.03 and Ystd = 0.29. (a) Visualization of result. the induced NLDT is also topologically simple. Currently, we are required to properly set the parameter  ( = 2 ) to get optimal performance. One of the possible way to efficiently handle this is to re-formulate the upper-level problem as multi-objective and search for the knee in the bi-objective pareto front of MSE Vs  (where  is the number of points in left node) and use that solution to conduct the partition. In the next section, we will apply the proposed dual bilevel algorithm to arrive at a relatively interpretable policy for a control problem involving continuous action. 127 6.4 Interpretable AI for Control Problem with Continuous Action Space In this section, we implement the dual bilevel algorithm we developed in the previous section to explain control logic of an ANN (artificial neural network) agent which is used to control an object using an action value from a continuous real domain. The problem considered here is borrowed from [87] and is pictorially shown in Figure 6.8. Figure 6.8: A Car-following problem with continuous acceleration. The rear car is controlled by an AI while the car in the front moves with a random acceleration profile. This problem is similar to the problem shown in Figure 4.8b with the only difference being that the cars can now assume a value of acceleration in range −2/2 to 2/2 (unlike in the previous case where the rear car could only have an acceleration from pre-specified discrete values). Here, the car in the rear is autonomously controlled using an AI. The car in the front moves with a random acceleration profile. The state of the rear car at a given time  is determined by three quantities (the state is denoted by a real-parameter vector x): 1. Its relative distance from the front car (0 =    − ), 2. its relative velocity with respect to the front car (1 =    − ) and 3. its acceleration at time  − 1 (2 = ( − 1)). The control task requires to have a safe distance    between two cars, while simultaneously ensuring a comfortable and smooth ride. The ANN agent is trained using the DDPG algorithm [62]. For a given time-step , the ANN agent takes the state x of the rear car as an input and outputs the acceleration value  (i.e. ()) in the continuous range from -2/2 to 2/2. Figure 6.9 shows a sample run where the trained ANN agent is used to control the rear car while the car in the front operates with a random acceleration. 128 (a) (b) Figure 6.9: ANN Output for the continuous car-following problem described in Figure 6.8. Figures (a), (b) and (c) are the plots of different state variables w.r.t to the time-step. The output of the ANN is shown in Figure (d). 129 Figure 6.9 (cont’d) (c) (d) To explain the performance of the ANN, we first collect the data comprising of series of states and corresponding action value as determined by the ANN. The tabulated data is then used to train and induce the NLDT. This problem of training and deriving NLDT from the ANN generated dataset translates to a regression problem of predicting the value of a continuous dependent variable given a set of continuous input features (or state variables in our case). The challenge here is to obtain the NLDT regressor with simple rules so that it can act as a logic-translator to explain the inner-logic of the ANN, RL or any more compelex CNN/DNN used earlier to solve the problem. 130 6.4.1 NLDT Representation and Training As mentioned in the last section, the problem of developing interpretable AI for a continuous control task involving one action translates to a regression rule finding problem. We thus adopt the dual bilevel approach we discussed in section 6.3 to derive the NLDT for this regression problem. For the control task, the regression rules represent different control logic and the partition rules represent different conditions (or scenario) under which a given control logic is applicable. An illustrative sketch of the same is provided in Figure 6.10. Figure 6.10: NLDT Agent representation for continuous control task involving one action. The conditional rules (x) and control logic (x) are expressed as weighed sum of power-laws as mentioned in Eq. 6.3 and 6.7 respectively. The induction and training of the non-linear decision tree (NLDT) of Figure 6.10 is done using the methodology discussed in the previous section in Section 6.3. 6.4.2 Data generation To derive the NLDT, the data is first generated using the trained artificial neural network (ANN) controller. For our task, the ANN controller takes state variables as input and outputs the value In our experiments, we generate 5,000 datapoints to populate the training of the acceleration. 131 data. Once the training data is generated, the dataset is normalized using the method discussed in Section 6.3.1. The normalized data is then used to train the NLDT. As mentioned before, each regression rule and partition rule is derived using the proposed dual bilevel algorithm (Section 6.3). 6.4.3 Results on Car-Following Problem In this section we discuss some results obtained on the 1D car following problem involving continuous action space (Figure 6.8). As mentioned before, the car in the front follows a random acceleration profile. The rear car is controlled using a trained ANN controller. The NLDT takes the same state information as the ANN counterpart and predicts the value of acceleration. It is desired to have the predicted acceleration value of NLDT to be as close as possible to the corresponding ANN output for a given input state. Plots from different simulation runs are shown in Figure 6.11 to provide the visualization of how closely the NLDT is able to mimic the parent ANN controller’s output. The performance seen in Figure 6.11 is obtained by the NLDT agent shown in Figure 6.12. From the results shown in Figure 6.11, it can be concluded that our proposed algorithm is able to induce a relatively interpretable NLDT (Figure 6.12) which is able to match the predictions of the complicated ANN controller. It is important to highlight all components of the NLDT (Figure 6.12) which is obtained by our procedure: 1. A two-depth decision tree is adequate to explain control strategies of the ANN with a small MSE error. Our dual bilevel approach is able to evolve this simple tree. 2. The tree involves one non-linear partition rule (0(x)) at the root node, and two non-linear regression rules ( = (x)) at left leaf node ( = 1, applicable for 0(x) ≤ 0) and right leaf node ( = 2, applicable for 0(x) > 0). 3. The structures of the regression and partition rules are evolved by our dual bilevel optimizer involving simple expressions. 132 (a) ( = 0) = 0 (b) ( = 0) = 5 Figure 6.11: Plots from different simulation runs with different initial relative velocity () between cars. The NLDT’s acceleration output (red line) matches with the ANN’s acceleration output (blue line). NLDT agent behaviour is almost the same as that of ANN and can thus be used to explain the behaviour of ANN. 133 Figure 6.11 (cont’d) (c) ( = 0) = 10 Figure 6.12: NLDT for car following problem with continuous action space. While the original ANN can control the rear car well, it certainly cannot explain the reasons behind choosing the action for each combination of three input states. Above partition and regression rules paves the way of arriving at a relatively humanly-interpretable explanation to the data derived from the ANN controller. Notice that a regular decision tree (DT) is incapable of coming up with such a nonlinear combinations of variables. This is true even for the generalized linear models (GLMs) and generalized additive models (GAMs). Due to hurdles of optimizing non-linear optimization problems with guaranteed optimality, nonlinear decision trees were not studied in the 134 past. With the advancements of efficient evolutionary (and bilevel) optimization methods, we are able to open up this avenue to come up with as few as possible and as interpretable as possible nonlinear decision rules. We believe that this strategy can be extended to more complex control problems. 135 CHAPTER 7 CONCLUSION AND FUTURE WORK In this work, we primarily focused on inducing an intrinsically interpretable AI which is relatively interpretable and simple than ANN, CART and SVM. The AI in our case is modelled using a nonlinear decision tree (NLDT) wherein, unlike traditional decision trees, the conditional nodes involve non-linear rules on input features. The non-linear rules at conditional nodes are expressed as weighted sum of power-laws with exponents assuming values from a discrete set while the coefficients and biases take real values between -1 to 1. The search for the non-linear rules is carried out using a custom-designed bilevel approach. The upper level spans the discrete search space corresponding to exponents while the lower level conducts a search on continuous variables to optimize the split criteria. The NLDT is grown through a recursive splitting procedure, and for each new conditional node, the non-linear rule is efficiently derived using a bilevel algorithm. This implies that there is no need to pre-supply the topology of the NLDT since it gets organically determined. The NLDTs produced are simple in topology as compared to traditional CART trees and have much simpler rules than ANN or SVM counterparts at each conditional node, thereby making the overall AI relatively more interpretable. A comparison with classical classification algorithms on some standard and custom-designed benchmarks reveals the superiority of the proposed NLDT algorithm in efficiently deriving simple yet high-performing classifiers. This idea is extended to develop relatively interpretable controllers for control problems involv- ing discrete actions. The open-loop training determines the structure of the NLDT and non-linear rules, which are then re-optimized using an evolutionary-algorithm-based closed-loop training pro- cedure to enhance the closed-loop control performance of the NLDT. The objective evaluation for closed-loop training is computationally expensive; however, a speed-up to the overall evolutionary closed-loop training algorithm is possible by distributing the evaluation of population individuals to different computer cores. The proposed algorithm is able to induce an NLDT* controller which is relatively simple than DNN controller and has better closed-loop control than the parent black-box 136 AI controller. One of our key observations in this study was it was not required to have an NLDT  with 100% open-loop accuracy to achieve a competent closed-loop performance. We extended our study of inducing relatively interpretable controllers in Chapter 5 to conduct a scale-up study and proposed a benchmark-problem. The open-loop training is made considerably fast by replacing the lower-level RGA algorithm with a classical SQP algorithm. This compromises the open-loop accu- racy, however the re-optimization through the follow-up closed-loop training successfully readjusts weights and coefficients of NLDT . The resulting NLDT* has a better closed-loop performance than the original black-box AI. The NLDT* is then pruned and simplified by eliminating nodes which are rarely visited during closed-loop control runs. A conceptual layout on the possible extension of NLDT idea to solve regression and control problems with continuous action space is demonstrated in Chapter 6. Here, a dual-bilevel algorithm to proposed to systematically derive regression functions (x) and partition functions (x). The proposed approach is first tested on a customized three-island regression problem involving three separate disjoint linear regression surfaces. Next, the dual-bilevel algorithm of inducing NLDT is applied to derive an interpretable translator to the DNN for a CarFollowing problem involving continuous action. The derived NLDT involves only one partition rule and two non-linear regression control rules and is able to successfully mimic the behaviour of the parent DNN controller. The results obtained are very encouraging and open up a number of possible research paths. For classification problems, the parameter  is required (Eq. 3.5). A parameter-free algorithm could be developed to conduct splits. Currently, the rule structure is fixed to be a weighted sum of power-laws. A genetic-programming-based approach can be used with the bilevel idea to derive rules with different non-linearity. Another possible extension to this work would involve developing an algorithm to handle categorical features in addition to ordinal or continuous features. It would also be interesting to see how the dual-bilevel approach can be applied to any generic regression or continuous control task. Currently, this approach is sensitive to the parameters  and  (Algo. 9 and Eq. 6.8). Another parallel research can be launched focused on automating the reasoning and explanation process of developed NLDTs. Currently, we could do it for simple 137 problems manually. This work on producing a transparent AI system which is relatively interpretable used several concepts from the non-linear optimization literature, such as bilevel-optimization and constraint- handling. The training of inducing NLDT  and NLDT* can be made better and faster by using concepts from surrogate optimization and metamodelling. A multi-objective-optimization-based approach can also be developed to induce NLDT wherein the interpretability and performance are modelled as separate objectives. As mentioned before, interpretability is subjective, so it calls for a joint collaborative research to develop transparent AI systems by integrating domain expertise of human experts and requirements of end-users. We hope that our work helps to advance the research in the field of interpretable artificial intelligence and beyond. 138 BIBLIOGRAPHY 139 BIBLIOGRAPHY [1] H. Kim, W.-Y. Loh, Y.-S. Shih, and P. Chaudhuri, “Visualizable and interpretable regression models with good prediction power,” IIE Transactions, vol. 39, no. 6, pp. 565–579, 2007. [3] [2] B. Letham, C. Rudin, T. H. McCormick, D. Madigan et al., “Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model,” The Annals of Applied Statistics, vol. 9, no. 3, pp. 1350–1371, 2015. J. H. Friedman, B. E. Popescu et al., “Predictive learning via rule ensembles,” The Annals of Applied Statistics, vol. 2, no. 3, pp. 916–954, 2008. J. A. Nelder and R. W. Wedderburn, “Generalized linear models,” Journal of the Royal Statistical Society: Series A, vol. 135, no. 3, pp. 370–384, 1972. [4] [5] T. J. Hastie and R. J. Tibshirani, Generalized additive models. CRC press, 1990, vol. 43. [6] S. N. Wood, Generalized additive models: An introduction with R. CRC press, 2017. [7] W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, “Interpretable machine learning: definitions, methods, and applications,” arXiv preprint arXiv:1901.04592, 2019. [8] Z. C. Lipton, “The mythos of model interpretability,” Queue, vol. 16, no. 3, pp. 31–57, 2018. [9] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: A review of clas- sification techniques,” Emerging artificial intelligence applications in computer engineering, vol. 160, pp. 3–24, 2007. [10] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter opti- mization,” in Advances in neural information processing systems, 2011, pp. 2546–2554. [11] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-weka: Combined selection and hyperparameter optimization of classification algorithms,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013, pp. 847–855. [12] L. Breiman, Classification and regression trees. Routledge, 2017. [13] D. Heath, S. Kasif, and S. Salzberg, “Induction of oblique decision trees,” in IJCAI, vol. 1993, 1993, pp. 1002–1007. [14] S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,” Journal of artificial intelligence research, vol. 2, pp. 1–32, 1994. [15] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, “Oc1: A randomized algorithm for building oblique decision trees,” in Proceedings of AAAI, vol. 93. Citeseer, 1993, pp. 322–327. 140 [16] E. Cantú-Paz and C. Kamath, “Using evolutionary algorithms to induce oblique decision trees,” in Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computa- tion. Morgan Kaufmann Publishers Inc., 2000, pp. 1053–1060. [17] M. Kretowski, “An evolutionary algorithm for oblique decision tree induction,” in Inter- Springer, 2004, pp. national Conference on Artificial Intelligence and Soft Computing. 432–437. [18] A. Ittner and M. Schlosser, “Non-linear decision trees-ndt,” in ICML. Citeseer, 1996, pp. 252–257. [19] K. P. Bennett and J. A. Blue, “A support vector machine approach to decision trees,” in Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, vol. 3. IEEE, 1998, pp. 2396–2401. [20] H. Núñez, C. Angulo, and A. Català, “Rule extraction from support vector machines.” in Esann, 2002, pp. 107–112. [21] G. Fung, S. Sandilya, and R. B. Rao, “Rule extraction from linear support vector machines,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005, pp. 32–40. [22] M. Craven and J. W. Shavlik, “Extracting tree-structured representations of trained networks,” in Advances in neural information processing systems, 1996, pp. 24–30. [23] P. M. Murphy and M. J. Pazzani, “Id2-of-3: Constructive induction of m-of-n concepts for discriminators in decision trees,” in Machine Learning Proceedings 1991. Elsevier, 1991, pp. 183–187. [24] U. Johansson, R. König, and L. Niklasson, “The truth is in there-rule extraction from opaque models using genetic programming.” in FLAIRS Conference. Miami Beach, FL, 2004, pp. 658–663. [25] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen, “Comprehensible credit scoring models using rule extraction from support vector machines,” European journal of operational research, vol. 183, no. 3, pp. 1466–1476, 2007. [26] H. Ishibuchi, T. Nakashima, and T. Murata, “Three-objective genetics-based machine learning for linguistic rule extraction,” Information Sciences, vol. 136, no. 1-4, pp. 109–133, 2001. [27] Y. Jin and B. Sendhoff, “Pareto-based multiobjective machine learning: An overview and case studies,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 3, pp. 397–415, 2008. [28] H. Guo and A. K. Nandi, “Breast cancer diagnosis using genetic programming generated feature,” Pattern Recognition, vol. 39, no. 5, pp. 980–987, 2006. [29] M. Muharram and G. D. Smith, “Evolutionary constructive induction,” IEEE transactions on knowledge and data engineering, vol. 17, no. 11, pp. 1518–1528, 2005. 141 [30] M. Shirasaka, Q. Zhao, O. Hammami, K. Kuroda, and K. Saito, “Automatic design of binary decision trees based on genetic programming,” in Proc. The Second Asia-Pacific Conference on Simulated Evolution and Learning (SEAL’98. Citeseer, 1998. [31] S. Haruyama and Q. Zhao, “Designing smaller decision trees using multiple objective opti- mization based gps,” in IEEE International Conference on Systems, Man and Cybernetics, vol. 6. IEEE, 2002, pp. 5–pp. [32] C.-S. Kuo, T.-P. Hong, and C.-L. Chen, “Applying genetic programming technique in classi- fication trees,” Soft Computing, vol. 11, no. 12, pp. 1165–1172, 2007. [33] M. C. Bot, “Improving induction of linear classification trees with genetic programming,” in Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation, 2000, pp. 403–410. [34] R. E. Marmelstein and G. B. Lamont, “Pattern classification using a hybrid genetic program- decision tree approach,” Genetic Programming, pp. 223–231, 1998. [35] E. M. Mugambi, A. Hunter, G. Oatley, and L. Kennedy, “Polynomial-fuzzy decision tree structures for classifying medical data,” in International Conference on Innovative Techniques and Applications of Artificial Intelligence. Springer, 2003, pp. 155–167. [36] A. Sinha, P. Malo, and K. Deb, “A review on bilevel optimization: From classical to evo- lutionary approaches and applications,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 2, pp. 276–295, 2017. [37] K. Deb and R. B. Agrawal, “Simulated binary crossover for continuous search space,” Complex systems, vol. 9, no. 2, pp. 115–148, 1995. [38] K. Deb, Multi-objective optimization using evolutionary algorithms. Wiley, 2005. [39] L. Bobrowski and M. Kretowski, “Induction of multivariate decision trees by using dipolar criteria,” in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2000, pp. 331–336. [40] M. Kretowski and M. Grześ, “Global induction of oblique decision trees: an evolutionary Springer, 2005, pp. approach,” in Intelligent Information Processing and Web Mining. 309–318. [41] K. Deb, Multi-objective optimization using evolutionary algorithms. 2001, vol. 16. John Wiley & Sons, [42] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002. [43] V. Vapnik, The nature of statistical learning theory. 2013. Springer science & business media, 142 [44] J.-P. Vert, K. Tsuda, and B. Schölkopf, “A primer on kernel methods,” Kernel methods in computational biology, vol. 47, pp. 35–70, 2004. [45] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182– 197, 2002. [46] K. Deb and A. Srinivasan, “Innovization: Innovating design principles through optimization,” in Proceedings of the 8th annual conference on Genetic and evolutionary computation. ACM, 2006, pp. 1629–1636. [47] K. Deb, L. Thiele, M. Laumanns, and E. Zitzler, “Scalable multi-objective optimization test problems,” in Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), vol. 1. IEEE, 2002, pp. 825–830. [48] Y. Dhebar, S. Gupta, and K. Deb, “Evaluating nonlinear decision trees for binary classification tasks with other existing methods,” ArXiv, vol. abs/2008.10753, 2020. [49] C. M. Bishop, Pattern recognition and machine learning. [50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. springer, 2006. [51] K. Larsen, “Gam: the predictive modeling silver bullet,” Multithreaded Stitch Fix, vol. 30, pp. 1–27, 2015. [52] K. Neshatian, M. Zhang, and P. Andreae, “A filter approach to multiple feature construction for symbolic learning classifiers using genetic programming,” IEEE Transactions on Evolutionary Computation, vol. 16, no. 5, pp. 645–661, 2012. [53] A. Cano, A. Zafra, and S. Ventura, “An interpretable classification rule mining algorithm,” Information Sciences, vol. 240, pp. 1–20, 2013. [54] I. De Falco, A. D. Cioppa, and E. Tarantino, “Discovering interesting classification rules with genetic programming,” Applied Soft Computing, vol. 1, no. 4, pp. 257–269, 2002. [55] M. C. J. Bot and W. B. Langdon, “Application of genetic programming to induction of linear classification trees,” in European Conference on Genetic Programming. Springer, 2000, pp. 247–258. [56] J. Eggermont, J. N. Kok, and W. A. Kosters, “Genetic programming for data classification: Partitioning the search space,” in Proceedings of the 2004 ACM symposium on Applied computing, 2004, pp. 1001–1005. [57] K. C. Tan, A. Tay, T. H. Lee, and C. M. Heng, “Mining multiple comprehensible classification rules using genetic programming,” in Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02, vol. 2. IEEE, 2002, pp. 1302–1307. 143 [58] H. Iba, “Bagging, boosting, and bloating in genetic programming,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation, 1999, pp. 1053–1060. [59] J. Blank and K. Deb, “pymoo - Multi-objective Optimization in Python,” https://pymoo.org. [60] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimiza- tion,” in International Conference on Machine Learning (ICML), 2015, pp. 1889–1897. [61] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimiza- tion algorithms,” arXiv preprint arXiv:1707.06347, 2017. [62] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015. [63] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), 2016, pp. 1928–1937. [64] R. S. Sutton, “Generalization in reinforcement learning: Successful examples using sparse coarse coding,” in Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. MIT Press, 1996, pp. 1038–1044. [65] J. Peters, K. Mülling, and Y. Altun, “Relative entropy policy search,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), vol. 10. Atlanta, 2010, pp. 1607–1612. [66] W. D. Smart and L. P. Kaelbling, “Practical reinforcement learning in continuous spaces,” in International Conference on Machine Learning (ICML), 2000, pp. 903–910. [67] R. Noothigattu, D. Bouneffouf, N. Mattei, R. Chandra, P. Madan, K. Varshney, M. Campbell, M. Singh, and F. Rossi, “Interpretable multi-objective reinforcement learning through policy orchestration,” arXiv preprint arXiv:1809.08343, 2018. [68] F. Maes, R. Fonteneau, L. Wehenkel, and D. Ernst, “Policy search in a space of simple closed-form formulas: towards interpretability of reinforcement learning,” in International Conference on Discovery Science. Springer, 2012, pp. 37–51. [69] D. Hein, S. Udluft, and T. A. Runkler, “Interpretable policies for reinforcement learning by genetic programming,” Engineering Applications of Artificial Intelligence, vol. 76, pp. 158–169, 2018. [70] G. Liu, O. Schulte, W. Zhu, and Q. Li, “Toward interpretable deep reinforcement learning with linear model u-trees,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 414–429. [71] A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri, “Programmatically interpretable reinforcement learning,” arXiv preprint arXiv:1804.02477, 2018. 144 [72] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN’95- International Conference on Neural Networks, vol. 4. IEEE, 1995, pp. 1942–1948. [73] D. Hein, A. Hentschel, T. Runkler, and S. Udluft, “Particle swarm optimization for generating interpretable fuzzy reinforcement learning policies,” Engineering Applications of Artificial Intelligence, vol. 65, pp. 87–98, 2017. [74] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured predic- tion to no-regret online learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635. [75] O. Bastani, Y. Pu, and A. Solar-Lezama, “Verifiable reinforcement learning via policy extrac- tion,” in Advances in neural information processing systems (NIPS), 2018, pp. 2494–2504. [76] O. Bastani, C. Kim, and H. Bastani, “Interpretability via model extraction,” arXiv preprint arXiv:1706.09773, 2017. [77] G. Vandewiele, O. Janssens, F. Ongenae, F. De Turck, and S. Van Hoecke, “Genesim: Genetic extraction of a single, interpretable model,” arXiv preprint arXiv:1611.05722, 2016. [78] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode reinforcement learning,” Journal of Machine Learning Research, vol. 6, no. Apr, pp. 503–556, 2005. [79] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commu- nications of the ACM, vol. 18, no. 9, pp. 509–517, 1975. [80] A. Silva, M. Gombolay, T. Killian, I. Jimenez, and S.-H. Son, “Optimization methods for interpretable differentiable decision trees applied to reinforcement learning,” in International Conference on Artificial Intelligence and Statistics, 2020, pp. 1855–1865. [81] Y. Dhebar and K. Deb, “Interpretable rule discovery through bilevel optimization of split-rules of nonlinear decision trees for classification problems,” arXiv preprint arXiv:2008.00410, 2020. [82] S. Nageshrao, B. Costa, and D. Filev, “Interpretable approximation of a deep reinforcement learning agent as a set of if-then rules,” in 18th IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, 2019, pp. 216–221. [83] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning” arXiv preprint arXiv:1509.06461, 2015. [84] G. A. Rummery and M. Niranjan, “On-line q-learning using connectionist systems,” Uni- versity of Cambridge, Department of Engineering Cambridge, UK, Tech. Rep. CUED/F- INFENG/TR166, 1994. [85] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2020. 145 [86] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wil- son, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Hen- riksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020. [87] N. Subramanya, C. Bruno, and D. Filev, “Interpretable approximation of a deep reinforcement learning agent as a set of if-then rules.” 146