ONLINE INNOVIZATION: TOWARDS KNOWLEDGE DISCOVERY AND ACHIEVING FASTER CONVERGENCE IN MULTI-OBJECTIVE OPTIMIZATION

By

Abhinav Gaur

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering — Doctor of Philosophy

2020

ABSTRACT

ONLINE INNOVIZATION: TOWARDS KNOWLEDGE DISCOVERY AND ACHIEVING FASTER CONVERGENCE IN MULTI-OBJECTIVE OPTIMIZATION

By

Abhinav Gaur

"Innovization" is the task of learning common principles that exist among some or all of the Pareto-optimal solutions of a multi-objective optimization problem. Except for a few earlier studies, most innovization-related studies were performed on the final non-dominated solutions found by an evolutionary multi-objective optimization algorithm, either manually or by using a machine learning method. Recent studies have shown that these principles can be learned during intermediate iterations of an optimization run and simultaneously utilized in the same run to repair variables and achieve faster convergence to the Pareto-optimal set. We call this "online innovization" because it is performed online, during the run of an evolutionary multi-objective optimization algorithm. Special attention is paid to learning rules that are easy to interpret, such as short algebraic expressions, instead of complex decision trees or kernel-based black-box rules.

We begin by showing how to learn fixed-form rules that are encountered frequently in multi-objective optimization problems. We also show how we can learn free-form rules, which are linear combinations of non-linear terms, using a custom genetic programming algorithm. We show how the concept of a 'knee' in the PO set of solutions, along with a custom dimensional penalty calculator, can be used to discard rules that are overly complex, inaccurate, or dimensionally incorrect. The results of rules learned using this custom genetic programming algorithm show that it is beneficial to let evolution learn the structure of the rules, while the constituent weights should be learned using a classical learning algorithm such as linear regression or linear support vector machines. When the rules are implicit functions of the problem variables, we use a computationally inexpensive way of repairing the variables by turning the repair problem into a single-variable golden-section search.

We show the proof of concept on test problems by learning fixed-form rules among variables of the problem, which we then use during the same optimization run to repair variables. Different principles learned during an optimization run can involve different numbers of variables and/or variables that are common to several principles. Moreover, a preference order for repairing variables may play an important role in proper convergence. Thus, when multiple principles exist, it is important to use a repair strategy that is most beneficial for the evolving population of solutions.

The above methods are applied to a mix of test problems and engineering design problems. The results are encouraging and strongly support the use of the innovization task in enhancing the convergence of evolutionary multi-objective optimization algorithms. Moreover, the custom genetic program developed in this work can be a useful machine learning tool for practitioners to learn human-interpretable rules in the form of algebraic expressions.

Dedicated to Maa.
ACKNOWLEDGMENTS

I first want to thank my family for the bulwark of support they have provided me ever since I decided to leave my job and pursue a career in research eleven years ago. Without them, I could never have mustered so much courage in my decisions. A big thank you to my wife for supporting me with the voice of hope and courage at times when I found it hard to believe in myself. I owe her so many days in kitchen and house chores. I want to take a moment to thank my friends from my undergraduate days, Asjad, Akshat, Aman and Gandhi, who paid so many of my bills and university application fees when I left my job to pursue a career in research, who always kept me in high spirits, and who even provided support to my parents in my absence from India.

Secondly, thanks to Prof. Kalyanmoy Deb for his invaluable time and advice throughout my doctoral studies. In terms of the human qualities of intelligence, humility and fairness, he surely is a Utopian point: quite literally unachievable, yet a point of aspiration for all his students. I am very thankful to him for putting so much effort into arranging my funding every semester. I aspire to nurture this relationship in the future and hopefully be part of the growth of the COIN lab and the Q-Lyst story.

These acknowledgements would not be complete without mentioning the immense support and camaraderie I received from my colleagues at the COIN lab. Most notably Haitham, who selflessly helped me with so many discussions on optimization, Java programming and just life in general; I am so glad that he is my friend and now colleague at Ford Motor Company. Yashesh, who has been like a younger brother, whom I can call in the middle of the night for help and who always made me feel more important than I probably should be. Rayan, whose smile in the face of difficulties is enviable; I wish everyone could have the fighting spirit that he possesses. Mohammad's and Julian's hard work is very inspiring, and I absorbed some of that quality from them. Thanks to Proteek and Zhichao, who were always happy to discuss research problems and ideas. Not to forget Khaled, probably one of the best programmers I have seen, with whom I learned so much about C++ programming in such a short time.

I want to thank General Motors' Vehicle Optimization Group and Manufacturing Analytics Group, who at various points during my doctoral studies provided me with the opportunity to work on real-world optimization problems and gave me invaluable experience of working on industry research problems. I want to acknowledge BEACON's support in providing me with many generous travel grants to attend international conferences, and also the opportunity to meet and work with world-class researchers of Evolutionary Computation, all within a stone's throw of my lab.

I am sure there are more people who are part of this work in different ways and yet I am forgetting them. I want to assure them that this omission is not deliberate and a mere sign of my aging.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
KEY TO ABBREVIATIONS
Chapter 1 Introduction
1.1 Search for Knowledge in MOO Problems
1.2 Using Discovered Knowledge in Expediting Convergence of EMO Algorithms
1.3 Organization of Dissertation
Chapter 2 Literature Survey and Proposed Online Innovization
2.1 Multi-objective Optimization
2.2 Evolutionary Multi-objective Optimization Algorithms
2.3 Innovization
2.4 A Taxonomy of Innovization
2.4.1 Manual Innovization
2.4.2 Automated Innovization
2.4.3 Higher Level Innovization
2.4.4 Lower Level Innovization
2.4.5 Temporal Innovization
2.5 Proposed Online Innovization
2.5.1 Target Data Set(s)
2.5.2 Type of Knowledge Representation
2.5.3 Using Extracted Knowledge to Expedite the Convergence
2.6 Dimensional Awareness in Rule Learning
Part I Rule Learning
Chapter 3 Learning Fixed Form Rules
3.1 Introduction
3.2 Learning Constant Rules
3.2.1 Estimating Rule Parameters and Quality
3.3 Learning Power Law Rules
3.3.1 Estimating Parameters
3.3.2 Comparison with Automated Innovization
3.3.3 Learning Multiple Rules Simultaneously
Chapter 4 Learning Free Form Rules Using a Symbolic Regression Task
4.1 Form of Rules
4.2 A Primer on Genetic Programming
4.3 A Custom GP (CGP)
4.3.1 Two Objectives: Prediction Error and Rule Complexity
4.3.2 Using Multiple Small Trees
4.3.3 Learning Weights Using a Faster Method
4.3.4 Diversity Preserving Mechanism
4.3.5 Higher and Lower Level Crossovers
4.3.6 CGP Flowchart for Rule Learning
4.4 Using CGP for Symbolic Regression Task
4.4.1 Evaluating Fitness of a CGP Individual for a Symbolic Regression Task
4.5 CGP Results on Test Problems
4.5.1 Test Problem-1
4.5.2 Test Problem-2
4.5.3 Test Problem-3
4.5.4 Test Problem-4
4.6 Noise Study
4.7 Choosing a Solution
Chapter 5 Using Dimensional Awareness with Rule Learning
5.1 Measuring Dimension Mismatch Penalty
5.1.1 Case-I: Terms with Only Product and Division Operations
5.1.2 Dimensionally Inconsistent Example
5.1.2.1 First Term
5.1.2.2 Second Term
5.1.2.3 Third Term
5.1.3 Dimensionally Consistent Example
5.1.4 Case-II: More Complex Terms
Chapter 6 Learning Free Form Rules Using a Classification Task
6.1 Target Classification Problem
6.2 Using CGP for a Classification Task
6.2.1 Evaluating Fitness of a CGP Individual for a Binary Classification Task
6.3 Performance on Small Feature Space
6.3.1 Results on Production Data Set-1
6.3.2 Results on Production Data Set-2
6.4 Results on Larger Feature Space with Dimension Check
6.4.1 Data and Results
6.5 Concluding Remarks
Part II Rule Based Repair
Chapter 7 Performing Repairs Based on Fixed Form Rules for Expedited Convergence
7.1 Rule Based Repair
7.1.1 Rule Basis and Quality Block
7.1.2 Decision Block-L
7.1.3 Learn Block
7.1.3.1 Constant Type Rules
7.1.3.2 Power Law Type of Rules
7.1.4 Decision Block-R
7.1.5 Repair Block
7.1.5.1 Repairing Variables Based on Constant Rule
7.1.5.2 Repairing Variables Based on Power Law Rules
7.2 Repair Strategies
7.2.1 Rule Preference Strategies
7.2.2 Variable Preference Strategies
7.3 Test Problems
7.3.1 ZDT1-1
7.3.2 ZDT1-2
7.3.3 ZDT1-3
7.4 Results on Test Problems
7.4.1 ZDT1-1 Results
7.4.2 ZDT1-2 Results
7.4.3 ZDT1-3 Results
7.4.3.1 Part-A: Strategies Preferring Short Rules
7.4.3.2 Part-B: Strategies with no Preference Based on Length of Rules
7.4.3.3 Part-C: Strategies Preferring Long Rules
7.4.4 Summary of Results on Test Problems
7.5 Engineering Problems
7.5.1 Two Bar Truss Problem (TBT)
7.5.2 Metal Cutting Problem (MC)
7.5.3 Welded Beam Problem (WB)
7.6 Results on Engineering Problems
7.6.1 Two Bar Truss Problem Results
7.6.1.1 Rules Found in TBT Problem
7.6.2 Metal Cutting Problem Results
7.6.2.1 Rules Found in MC Problem
7.6.3 Weld Beam Problem Results
7.6.3.1 Rules Found in WB Problem
7.7 Concluding Remarks
Chapter 8 Issues of Transition Points and Repairs Based on Complex Rules
8.1 Handling Transition Points
8.1.1 Identifying Regions Separated by Transition Points (Active Set Partitioning)
8.1.2 Results on Problems with Transition Points
8.1.2.1 Modified Two Bar Truss Problem (TBT-2)
8.1.2.2 Modified Metal Cutting Problem (MC-2)
8.2 Power Law Rule Involving Functions of Variables
8.2.1 Results on Truss Problem
8.3 Concluding Remarks
Chapter 9 Conclusion and Future Studies
9.1 Contributions of this Thesis
9.2 Future Studies
9.2.1 GP with in Tandem Dimensional Consistency Check
9.2.2 Non-linear Decision Trees
9.2.3 Experimenting with Different Definitions of Rule Complexity
9.2.4 Measuring Efficacy of Repairs
BIBLIOGRAPHY

LIST OF TABLES

Table 4.1: List of parameters in CGP.
Table 4.2: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.1.
Table 4.3: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.2.
Table 4.4: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.3.
Table 4.5: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.4.
Table 4.6: Results of noise study performed on test problem of Section 4.5.3.
Table 6.1: Production data details used for testing CGP classifier.
Table 6.2: Small feature set and their details.
Table 6.3: List of CGP parameters used for solving the binary classification problem of Section 6.3.1.
Table 6.4: Summary of classification rules found for the binary classification problem of Section 6.3.1 and their corresponding error rates on training and test sets.
Table 6.5: List of CGP parameters used for solving the binary classification problem of Section 6.3.2.
Table 6.6: Summary of classification rules found for the binary classification problem of Section 6.3.2 and their corresponding error rates on training and test sets.
Table 6.7: Larger feature set and their dimensions.
Table 6.8: The set of physical constants which were considered relevant to the underlying physics of the production process.
Table 6.9: Details of PO classifiers shown in Figure 6.7.
Table 6.10: Details of PO classifiers shown in Figure 6.7 after dimensional check.
Table 7.1: An example of candidate rule information available in RBQ block.
Table 7.2: Different variable repair strategies for EMO/I studied in this work.
Table 7.3: EMO parameters used in the test problems discussed in Section 7.3.
Table 7.4: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-1 problem at 36,000 function evaluations and 5% significance level.
Table 7.5: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-2 problem at 36,400 function evaluations and 5% significance level.
Table 7.6: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NI, SN, SC and SU at 50,400 function evaluations and 5% significance level.
Table 7.7: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NN, NC, NU, SN and NI at 50,400 function evaluations and 5% significance level.
Table 7.8: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies LN, LC, LU, SN and NI at 50,400 function evaluations and 5% significance level.
Table 7.9: EMO parameters used for solving the engineering problems discussed in Section 7.5.
Table 7.10: Results of Wilcoxon rank sum test for GD of TBT, MC and WB problems at their respective maximum function evaluations and 5% significance level.
Table 8.1: An example of partitioning of solution space based on constraint activity.
Table 8.2: EMO/I parameters used for solving the modified engineering design problems that contain transition points in the PO front.
Table 8.3: Results of Wilcoxon rank sum test for GD of TBT-2 problem comparing EMO/I repair strategies NI, SN and SNasp at 10,036 function evaluations and 5% significance level.
Table 8.4: Results of Wilcoxon rank sum test for GD and IGD of MC-2 problem at 100,000 function evaluations and 5% significance level.
Table 8.5: EMO/I parameters used for solving the truss problem when only rules involving objective functions are learned.
Table 8.6: Results of Wilcoxon rank sum test for GD of truss problem comparing EMO/I repair strategies NI and SNobj at 25,024 function evaluations and 5% significance level.

LIST OF FIGURES

Figure 1.1: Number of publications in the last two decades using search key "multi-objective optimization" (Source: Scopus).
Figure 1.2: Number of publications in the last two decades that have used at least one of the following MOO software packages: HEEDS, modeFrontier, iSight, Optimus, optiSLang (Source: Scopus).
Figure 1.3: Number of publications per year in the last decade which have the keywords "machine learning" and "interpretable" in their title, abstract or keywords (Source: Scopus).
Figure 2.1: Illustration of the objective and design space of a MOO problem.
Figure 2.2: An illustration of the concept of dominance.
Figure 2.3: An illustration of the concept of innovization.
Figure 2.4: An illustration of the concept of higher level innovization.
Figure 2.5: An illustration of the concept of lower level innovization.
Figure 2.6: The concept of Online Innovization.
Figure 2.7: An example of DimTransform() function.
Figure 3.1: Data availability when solving a MOO using an EMO algorithm.
Figure 3.2: Data availability when solving a MOO using an EMO algorithm.
Figure 4.1: Example of two GP solutions using tree representation.
Figure 4.2: An example of a sub-tree crossover between two GP individuals.
Figure 4.3: An example of a sub-tree Koza-style mutation of a GP individual.
Figure 4.4: The two conflicting objectives in any machine learning algorithm.
Figure 4.5: An MGGP individual composed of many small binary expression trees instead of one big binary expression tree.
Figure 4.6: Our GP learns the rule structure and their weights separately.
Figure 4.7: Fitness adjustment by penalizing duplicate solutions to promote population diversity in CGP. (A) shows the original state with two non-dominated fronts present in the population and a total of seven solutions. (B) Solution-4, which is a duplicate of solution-3, after penalization. (C) Solution-6, which is a duplicate of solution-5, after penalization.
Figure 4.8: Example of the high-level crossover used in CGP.
Figure 4.9: Example of the low-level crossover used in CGP.
Figure 4.10: The flowchart for Custom Genetic Program or CGP.
Figure 4.11: Graphical representation of test problem of Section 4.5.1.
Figure 4.12: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.1.
Figure 4.13: The three trees corresponding to the chosen knee solution shown in Figure 4.12.
Figure 4.14: Graphical representation of test problem of Section 4.5.2.
Figure 4.15: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.2.
Figure 4.16: The tree corresponding to the chosen knee solution shown in Figure 4.15.
Figure 4.17: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3.
Figure 4.18: The two trees corresponding to the chosen knee solution shown in Figure 4.17.
Figure 4.19: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.4.
Figure 4.20: The three trees corresponding to the chosen knee solution shown in Figure 4.19.
Figure 5.1: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3 followed by a dimensional penalty calculation for each case. Only the knee solution, which is also the exact solution, has a dimensional mismatch penalty of zero.
Figure 6.1: A hand-crafted example of binary class data.
Figure 6.2: Binary classification tree model for classifying the two-class data shown in Figure 6.1.
Figure 6.3: The binary class data of Figure 6.1 shown in a derived feature space where the data is linearly separable.
Figure 6.4: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.1.
Figure 6.5: Decision boundary for the knee solution classifier of the PO set of classifiers shown in Figure 6.4.
Figure 6.6: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.2.
Figure 6.7: PO set of classifiers found by CGP for two days' worth of production data described in Section 6.4.1.
Figure 7.1: The flowchart of an EMO/I algorithm.
Figure 7.2: Pareto-optimal fronts for the test problems.
Figure 7.3: Median GD and IGD results for ZDT1-1 problem over 30 runs.
Figure 7.4: Median GD and IGD results for ZDT1-2 problem over 30 runs.
Figure 7.5: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, SC and SU repair strategies of EMO/I.
Figure 7.6: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, NN, NC and NU repair strategies of EMO/I.
Figure 7.7: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing LN, LC, LU, SN and NI repair strategies of EMO/I.
Figure 7.8: A two-membered truss structure.
Figure 7.9: Pareto-optimal front for the two-member truss problem.
Figure 7.10: A diagram of the metal turning process.
Figure 7.11: Pareto-optimal front for the Metal Cutting problem.
Figure 7.12: Pareto-optimal front for the WB problem.
Figure 7.13: Median GD and IGD results for TBT problem over 30 runs.
Figure 7.14: Front coverage by EMO/I method in TBT problem for the best GD run.
Figure 7.15: Rule between variables x1 and x2 identified by EMO/I in TBT problem at the end of the best GD run.
Figure 7.16: Rule for variable x3 identified by EMO/I in TBT problem at the end of the best GD run.
Figure 7.17: Median GD and IGD results for MC problem over 30 runs.
Figure 7.18: Front coverage by EMO/I method in MC problem for the best GD run.
Figure 7.19: Rule for variable f identified by EMO/I in MC problem at the end of the best GD run.
Figure 7.20: Rule among variables v and a identified by EMO/I in MC problem at the end of the best GD run.
Figure 7.21: Median GD and IGD results for WB problem over 30 runs.
Figure 7.22: Front coverage by EMO/I method in WB problem for the best GD run.
Figure 7.23: Rule between variables h and l identified by EMO/I in WB problem at the end of the best GD run.
Figure 7.24: Rule between variables h and b identified by EMO/I in WB problem at the end of the best GD run.
Figure 7.25: Rule between variables l and b identified by EMO/I in WB problem at the end of the best GD run.
Figure 7.26: Rule for variable t identified by EMO/I in WB problem at the end of the best GD run.
Figure 8.1: Transition point encountered in Truss problem shown in objective space.
Figure 8.2: Transition point encountered in Truss problem shown in variable space.
Figure 8.3: Median GD and IGD results for TBT-2 problem over 30 runs.
Figure 8.4: Transition point encountered in Metal Cutting problem shown in objective space.
Figure 8.5: Transition point encountered in Metal Cutting problem shown in variable space.
Figure 8.6: Median GD and IGD results for MC-2 problem over 30 runs.
Figure 8.7: Median GD and IGD results for the Truss problem over 30 runs when only rules involving objective functions are learned.
Figure 9.1: The idea of combining CGP with a Decision Tree classification algorithm.

LIST OF ALGORITHMS

Algorithm 7.1: RepairPop()
Algorithm 7.2: wrShuffle()
Algorithm 8.1: activeSetPartition()
KEY TO ABBREVIATIONS

CGP Customized Genetic Program
EA Evolutionary Algorithm
EMO Evolutionary Multi-objective Optimization
EMO/I Evolutionary Multi-objective Optimization with Innovization
GA Genetic Algorithm
GD Generational Distance
GP Genetic Programming
IGD Inverse Generational Distance
ML Machine Learning
MOO Multi-Objective Optimization
OLSR Ordinary Least Squares Regression
PO Pareto Optimal
SVM Support Vector Machines

Chapter 1 Introduction

These are exciting times for academicians, practitioners and entrepreneurs working in the field of multi-objective optimization (MOO). Figure 1.1 shows the increasing attention this field has been receiving from academia over the last two decades.

Figure 1.1: Number of publications in the last two decades using search key "multi-objective optimization" (Source: Scopus).

Commercial MOO software, especially that catering to the computer-aided engineering industry, has seen a lot of growth in the past decade. Practitioners now have many options for commercial MOO software: HEEDS [1], modeFrontier [2], iSight [3], Optimus [4] and optiSLang [5], to name a few. Figure 1.2 shows the rising use of these software packages in research.

Figure 1.2: Number of publications in the last two decades that have used at least one of the following MOO software packages: HEEDS, modeFrontier, iSight, Optimus, optiSLang (Source: Scopus).

Evolutionary algorithms (EAs) are a popular method of solving MOO problems in practice. In this work, we refer to EAs designed to solve MOO problems as evolutionary multi-objective optimization (EMO) algorithms. The key reason for the widespread adoption of EMO algorithms in tackling MOO problems is their flexibility in handling real-world problem complexities such as non-linearity, non-convexity and non-differentiability [6].

As the adoption of MOO increases in industry, so do the expectations of practitioners from MOO software in terms of what it can do apart from providing optimal or near-optimal solution(s). Two key areas where researchers can try to address those expectations are:

1. offering some insight or knowledge about the problem, and
2. using the aforementioned knowledge to make an EMO algorithm converge faster towards the PO front.

EMO algorithms, or any population-based MOO algorithms, are uniquely positioned to deliver on both of these needs of MOO software users. Let us discuss the two requirements in the context of this work.

1.1 Search for Knowledge in MOO Problems

The Oxford dictionary defines knowledge as "the theoretical or practical understanding of a subject". In older Artificial Intelligence (AI) literature [7], "having knowledge" corresponds to recognizing something as information about the world or part of it. In the context of an engineering design and optimization problem, this is equivalent to understanding the optimal behaviour in relation to the objectives and the design variables of the problem. A closely related concept, "knowledge representation", is defined as fundamentally a surrogate, a substitute for the thing itself, used to enable an entity to determine consequences by reasoning about the world rather than taking action in it [8]. Let us further discuss what we mean by knowledge in the context of MOO problems. MOO problems with two or more conflicting objectives have more than one optimal solution.
These Pareto-optimal (PO) solutions may possess certain special properties that make them PO in the first place. For example, in a mathematically well-behaved MOO problem with convex, continuous and differentiable functions, the PO solutions must adhere to at least the Fritz-John or Karush-Kuhn-Tucker necessary conditions for Pareto optimality [9]. Even in real-world engineering design problems that lack the aforementioned regularities around optima, the PO solutions may still possess certain special patterns or "rules" that set them apart from non-optimal or even random solutions. This idea of looking for rules in the PO solutions of a MOO problem was coined innovization by the authors of [10]. The authors showed the existence and manual extraction of such rules from the PO solutions of many engineering design problems. Since then, the idea of innovization has been applied to many MOO problems in engineering design [11–13].

A recent work [14] provides an extensive survey of the various forms of knowledge sought after in MOO problems. Knowledge representation in a MOO problem can be categorized into implicit or explicit forms [15]. Although very popular, knowledge in implicit form has no formal notation and may require the user to have specific experience. Most of the visual data mining methods, such as parallel coordinate plots [16], value paths [17], heat-maps [18] and self-organizing maps [19], fall in this category. Explicit knowledge, on the other hand, has a crisp mathematical form and can be interpreted by humans unambiguously. A couple of examples of explicit forms of knowledge representation in MOO problems are decision trees [20] and regression rules [21]. In this work, we have decided to pursue knowledge in explicit form, which has crisp mathematical notation and is amenable to consistent human "interpretation". Although we could not find a mathematical definition of interpretability, [22] defines it as the degree to which a human can consistently predict a model's result. For example, a model having a power law structure in terms of design variables, which is common in problems from engineering and physics [23], is easier to interpret than a model based on deep neural networks [24]. Christoph [25] presents a nice case for the need for interpretability in machine learning models. Figure 1.3 shows a sharp increase over the last five years in the amount of research trying to address interpretability in machine learning.

Figure 1.3: Number of publications per year in the last decade which have the keywords "machine learning" and "interpretable" in their title, abstract or keywords (Source: Scopus).

In this work, rules refer to explicit mathematical expressions, involving MOO problem variables or functions thereof, that are adhered to by some or all of the PO solutions.

Another important aspect of rule learning in practical MOO problems is adherence to the principle of dimensional homogeneity [26]. In any rule representing some aspect of a physical or engineering system, adding or subtracting two dimensionally incommensurate quantities can never produce physically meaningful knowledge. For example, a rule should not be adding physical quantities having the dimensions of length and time. This is another aspect that we have paid attention to while deciding the kind of rules that we want to learn from PO data.
It is for these reasons of maintaining interpretability and easy verifiability of dimensional consistency that we chose to learn only certain kinds of rules in this work, such as power law rules or rules involving the basic operations {+, −, ×, ÷} over problem variables. For the latter, we developed a bi-objective genetic programming (GP) [27] based method, which we will go over in the coming chapters.

1.2 Using Discovered Knowledge in Expediting Convergence of EMO Algorithms

EMO algorithms tend to be computationally inefficient because they evaluate many alternative solutions in the population before converging to PO solutions. There have been examples [28, 29] in which researchers learned some heuristics about the problem from initial EMO runs and then adopted those heuristics as part of the problem-solving EMO to expedite its convergence. This method is known as derived heuristics in the literature. In the above examples, the authors waited for the EMO algorithm to run to completion, first obtaining a PO solution set and then conducting a manual innovization procedure over the solutions to come up with the rules. In this work, we intend to learn rules in explicit mathematical form from a select set of solutions (say, the non-dominated set) during an EMO algorithm run and then use those rules to directly repair variables, expediting convergence to the PO solution set. There are many challenging questions in this approach, including:

• When to begin the learning process?
• What are some computationally efficient ways of repairing solution variables without making extra evaluations of the objective function? (one such repair mechanism is sketched after this list)
• How to learn rules in MOO problems where different parts of the PO front adhere to different sets of rules?

We have tried to answer these questions in the latter half of this dissertation.
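The repair mechanism alluded to in the second question casts the repair of a single variable as a one-dimensional search: when a learned rule is an implicit function of the variables, the value of one chosen variable can be moved within its bounds so as to minimize the rule violation using a golden-section search, with no extra objective function evaluations. The following Python sketch is a minimal illustration of that idea under an invented rule; it is not the exact repair operator developed later in this dissertation.

    import math

    def golden_section_min(f, lo, hi, tol=1e-6):
        # Minimize a unimodal scalar function f on the interval [lo, hi]
        invphi = (math.sqrt(5.0) - 1.0) / 2.0      # 1/golden-ratio, ~0.618
        a, b = lo, hi
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        while (b - a) > tol:
            if f(c) < f(d):
                b, d = d, c                        # minimum lies in [a, old d]
                c = b - invphi * (b - a)
            else:
                a, c = c, d                        # minimum lies in [old c, b]
                d = a + invphi * (b - a)
        return 0.5 * (a + b)

    # Hypothetical learned rule 0.4 * x_k^2 = 2.5 (all other variables held
    # fixed): repairing x_k means driving the rule violation to zero within
    # the variable's bounds.
    def rule_violation(xk):
        return abs(0.4 * xk ** 2 - 2.5)

    print(round(golden_section_min(rule_violation, 0.0, 5.0), 4))   # -> 2.5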
1.3 Organization of Dissertation

This dissertation is organized as follows. Chapter 2 presents relevant literature on the subject of innovization, on interpretable rule learning from data, and on the idea of finding dimensionally consistent rules; dimensional awareness here refers to the adherence of the learned rules to the law of dimensional homogeneity. The dissertation is then presented in two parts. Part I focuses on learning interpretable mathematical rules from data and consists of four chapters. Chapter 3 illustrates how prevalent power laws are in engineering problems and how they can be learned using the ordinary least squares method of linear regression [30]. Chapters 4 and 6 show how we can learn free form rules using a bi-objective GP for non-linear regression and classification tasks; we developed this GP as part of solving a real-world industry problem. Chapter 5 then illustrates how the principle of dimensional homogeneity can be useful in generating or choosing physically meaningful rules from a set of rules which may all be acceptable to a user in terms of regression/classification accuracy. Part II then focuses on how we can use rules learned during an optimization task to expedite the convergence of an EMO algorithm. Chapter 7 shows results of online innovization on test and engineering MOO problems when we learn power law rules based on design variables only. Chapter 8 shows results of online innovization on test and engineering MOO problems when we learn power law rules based on design variables as well as functions of them, such as the objective functions. Furthermore, a methodology is devised to handle transition points in PO fronts and still be able to expedite the convergence of the EMO algorithm. We present the main conclusions of this work in Chapter 9, along with some interesting research threads worth pursuing in future research.

Chapter 2 Literature Survey and Proposed Online Innovization

2.1 Multi-objective Optimization

Many real-world design problems from engineering can be formulated as a multi-objective optimization (MOO) problem [6, 31]. This approach is particularly useful when it is difficult to capture the decision makers' preference among the objectives in terms of some convex set of weights and thereby turn the problem into a single-objective problem using the weighted-sum approach. The goal of MOO is to provide the decision maker with a set of optimal trade-off solutions in terms of the objectives, let the decision maker develop a preference, and finally make a choice. Equation (2.1) shows a formulation of a MOO problem with $n_f$ objectives (all to be minimized), $n_x$ design variables with known lower and upper bounds, and $n_g$ inequality constraints:

$$
\begin{aligned}
\text{Minimize} \quad & \mathbf{f}(\mathbf{x}), \\
\text{subject to} \quad & \mathbf{g}(\mathbf{x}) \leq \mathbf{0}, \\
\text{where} \quad & \mathbf{x} = [x_1\ x_2\ \ldots\ x_{n_x}]^{\top}, \\
& \mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x})\ f_2(\mathbf{x})\ \ldots\ f_{n_f}(\mathbf{x})]^{\top}, \\
& \mathbf{g}(\mathbf{x}) = [g_1(\mathbf{x})\ g_2(\mathbf{x})\ \ldots\ g_{n_g}(\mathbf{x})]^{\top}, \\
& l_i \leq x_i \leq u_i, \quad i \in \{1, 2, \ldots, n_x\}.
\end{aligned}
\tag{2.1}
$$

A solution $\mathbf{x}^* = [x_1^*\ x_2^*\ \ldots\ x_{n_x}^*]^{\top}$ to such a problem is a vector of $n_x$ decision variables and must satisfy the variable bounds and the $n_g$ inequality constraints; otherwise, it is an infeasible solution. Figure 2.1 illustrates the objective, design and feasible spaces for a bi-objective minimization problem having three design variables.

Figure 2.1: Illustration of the objective and design space of a MOO problem.

Concept of Dominance

In a single-objective optimization problem, it is straightforward to compare two solutions simply based on their corresponding objective values; when minimizing, the solution with the lower objective value is considered better. In the case of a MOO problem, however, the comparison is based on the concept of dominance [31].

Definition 2.1.1. A solution $x^{(1)}$ is said to dominate a solution $x^{(2)}$ if the following conditions are true:

1. The solution $x^{(1)}$ is no worse than $x^{(2)}$ in all objectives.
2. The solution $x^{(1)}$ is strictly better than $x^{(2)}$ in at least one objective.

We denote $x^{(1)}$ dominating $x^{(2)}$ as $x^{(1)} \preceq x^{(2)}$, and $x^{(2)}$ dominating $x^{(1)}$ as $x^{(2)} \preceq x^{(1)}$. If neither of the two solutions dominates the other, we denote this condition as $x^{(1)} \parallel x^{(2)}$.

Figure 2.2: An illustration of the concept of dominance.

In Figure 2.2, upon doing pairwise comparisons of solution $x^{(2)}$ with all others, $x^{(2)} \preceq x^{(4)}$, $x^{(2)} \preceq x^{(5)}$ and $x^{(2)} \preceq x^{(6)}$, while $x^{(2)} \parallel x^{(1)}$ and $x^{(2)} \parallel x^{(3)}$. Another important concept is that of the non-dominated set [31].

Definition 2.1.2. Among a set of solutions $P$, the non-dominated set of solutions $P'$ consists of those that are not dominated by any member of the set $P$.

In Figure 2.2, set $P = \{1, 2, 3, 4, 5, 6\}$ and set $P' = \{1, 2, 3\}$. Furthermore, when the set $P$ is the entire search space, the resulting non-dominated set $P'$ is called the Pareto-optimal set. Sometimes we are interested in dividing a set $P$ into different non-domination levels or ranks. If a set $P$ contains $J = \{1, 2, \ldots, j\}$ non-domination levels, it means that the set $P$ can be partitioned into $j$ levels, i.e.,

$$
P = \bigcup_{i \in J} P_i \quad \text{and} \quad P_r \cap P_s = \emptyset \quad \forall\, r, s \in J,\ r \neq s,
$$

such that a point in a set with a lower non-domination rank is never dominated by a point in any set of a higher non-domination rank. In Figure 2.2, there are three non-domination ranks present: $P_1 = \{1, 2, 3\}$, $P_2 = \{4, 5\}$ and $P_3 = \{6\}$.
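To make these definitions concrete, the dominance test of Definition 2.1.1 and the repeated non-dominated filtering that yields the non-domination levels can be coded in a few lines of Python. The objective vectors below are invented so as to reproduce the relations described for Figure 2.2 (the figure's actual coordinates are not given), so this is an illustrative sketch only:

    def dominates(f1, f2):
        # Definition 2.1.1 for minimization: no worse everywhere, better somewhere
        return (all(a <= b for a, b in zip(f1, f2)) and
                any(a < b for a, b in zip(f1, f2)))

    def nondomination_levels(objs):
        # Peel off successive non-dominated fronts P1, P2, ... (Definition 2.1.2)
        remaining = dict(objs)
        levels = []
        while remaining:
            front = [i for i in remaining
                     if not any(dominates(remaining[j], remaining[i])
                                for j in remaining if j != i)]
            levels.append(sorted(front))
            for i in front:
                del remaining[i]
        return levels

    # Invented objective values for solutions 1-6, consistent with Figure 2.2
    objs = {1: (1.0, 6.0), 2: (2.0, 3.0), 3: (5.0, 1.0),
            4: (3.0, 4.0), 5: (4.0, 3.5), 6: (5.0, 5.0)}
    print(dominates(objs[2], objs[4]))   # True:  x(2) dominates x(4)
    print(dominates(objs[2], objs[1]))   # False: x(2) || x(1)
    print(nondomination_levels(objs))    # [[1, 2, 3], [4, 5], [6]]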
2.2 Evolutionary Multi-objective Optimization Algorithms

Real-world optimization problems have many complexities built into them, such as non-linearity, discontinuity, non-differentiability, discrete variables, etc. These complexities make it impossible to apply known classical optimization algorithms, such as the gradient descent algorithm [32], without making strong simplifying assumptions about the problem. An evolutionary algorithm (EA) mimics natural evolutionary principles to conduct search and optimization tasks. Some of the popular EAs are genetic algorithms [33], evolution strategies [34] and genetic programming [35]. EAs have been shown to have an edge in solving MOO problems for finding multiple trade-off solutions, simply because EAs are capable of finding and storing multiple solutions from generation to generation. The population approach of EAs and their ability to build multiple niches within a population enabled EA researchers to develop evolutionary multi-objective optimization (EMO) algorithms [31, 36]. EMO algorithms are extensively applied to MOO problems from the real world [37–44]. However, EMO algorithms are considered computationally inefficient because they sample the space stochastically and require many function evaluations to reach a high-quality solution set. This inefficiency can be partly reduced if the algorithm is made to take advantage of some problem knowledge discovered, during or at the end of the optimization, from the PO or non-dominated solution set.

2.3 Innovization

While multiple PO solutions allow decision makers to choose a single preferred solution by making a comparative analysis of them, multiple PO (or high-performing) solutions may also provide users with valuable information about the problem being solved. These PO solutions may possess certain special properties that make them PO in the first place. For example, in a mathematically well-behaved MOO problem with convex, continuous and differentiable functions, the PO solutions must adhere to at least the Fritz-John or Karush-Kuhn-Tucker necessary conditions for Pareto optimality [9]. Even in real-world engineering design problems that lack the aforementioned regularities around optima, the PO solutions may still possess certain special patterns or "rules" that set them apart from non-optimal or even random solutions. The idea of learning from the PO solutions and deciphering design principles that are unique to the optimization problem was first presented in [10]. This concept is called innovization, and it is pictorially illustrated in Figure 2.3. Originally, the innovization task was conceived to be performed once at the end of an EMO run in order to decipher principles that are common to most or all PO solutions [10].

Figure 2.3: An illustration of the concept of innovization.

Since an innovization task involves learning from a set of equivalent trade-off solutions, it is appealing to couple this idea with EMO algorithms.
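To give a flavour of what such a learned principle looks like: automated innovization [11] searches for power-law relations of the form $x_i x_j^b = c$ among the trade-off solutions and judges a candidate rule by how nearly constant $c$ remains across them. The following sketch of such a constancy check is illustrative only; the data, the fixed exponent and the acceptance threshold are all invented:

    import statistics

    def rule_constancy(solutions, i, j, b):
        # For the candidate rule x_i * x_j**b = c, measure the spread of c
        # across the solution set; a value near zero means the rule holds
        cs = [s[i] * s[j] ** b for s in solutions]
        return statistics.stdev(cs) / statistics.mean(cs)

    # Invented trade-off solutions that (nearly) satisfy x0 * x1^2 = 2
    solutions = [(2.0, 1.0), (0.5, 2.0), (0.22, 3.01), (0.125, 4.0)]
    print(rule_constancy(solutions, i=0, j=1, b=2.0) < 0.05)   # True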
There are examples in the litera- ture, where the innovization task has been conducted on the results of an EMO algorithm to decipher interesting knowledge about the problem. For example, [45] uses innovization to discover new design principles in three engineering case studies. Another example [46] proposes a novel recommendation tool for software re-factoring based on innovization. [11] suggests a way to automate the finding of salient problem design principles for engineering problems. [13] uses the innovization task in a simulation based MOO problem and suggest different ways to couple the innovization with the EMO that can throw light on important principles of the problem. In recent literature [14], the idea of innovization has also been referred to as knowledge discovery in MOO problems. Before moving on to discuss some 13 EMOInnovizationproblemMOOStep IStep IIConvergedDesignPrinciplesFrontPOf2f1 examples of knowledge extracted from MOO problems, let us look at a taxonomy of the innovization procedures. 2.4 A Taxonomy of Innovization In the following sections, we will try to cover the various kinds of innovization procedures that exist in the literature. 2.4.1 Manual Innovization This was first proposed in [10] as a way of manually extracting innovative design principles from PO data of MOO design problems. As shown in Figure 2.3, it requires first solving a MOO problem and obtaining its PO set of solutions followed by manually plotting dif- ferent variables, objectives and constraints for PO solutions against each other one by one in 2-dimensional and 3-dimensional plots and then manually deciphering innovative design knowledge from the same based on the plots. 2.4.2 Automated Innovization Automated innovization was first proposed in [11] as a way to automate the labour of manual innovization but at the cost of fixing the form of rules to power law form. This approach has produced some interesting case studies in engineering design evident in [45]. Note that both manual and automated innovization are based on the approach shown in Figure 2.3. 14 Figure 2.4: An illustration of the concept of higher level innovization. 2.4.3 Higher Level Innovization Upon changing certain parameters of a MOO problem, it is possible that the shape or location or both may change for the PO front of the problem. Figure 2.4 illustrates this for a hypothetical problem where changing some parameter results in some shift in the PO front as well. In such a case, if an innovization study is performed for multiple fronts resulting from multiple parametric studies, it is possible that we may encounter certain rules or principles that remain invariant across the fronts. Such a procedure is called a higher level innovization procedure. Both [47, 48] present some good engineering design case studies of higher level innovization. 2.4.4 Lower Level Innovization Lower level innovization task is performed when we can divide the solutions of a problem into two sets, namely the preferred set and the not preferred set. Figure 2.5 shows a hypothetical case where the user may be interested in patterns that are present in the preferred set of 15 Parametervaluechangingf1(minimize)f2(minimize) Figure 2.5: An illustration of the concept of lower level innovization. solutions but not otherwise. A lower level innovization procedure learns the most discrimi- natory rules that completely adhered to by the preferred class of solutions but not so much by the not preferred class of solutions. 
[49] presents a case study of a lower level innovization procedure applied to the car side impact problem [50].

2.4.5 Temporal Innovization

Temporal innovization is an innovization procedure aimed at learning the relative hierarchy of design principles in an overall solution to a MOO problem. The design principles found in the PO front are searched for in all previous generations and their significance is recorded. When viewed over the timeline of all the generations of an EMO, it can be seen which principles evolve earlier in the solutions and which ones appear later. The ones that appear earlier are hierarchically more important than the ones that appear later in the evolution of solutions. The authors of [51] present an interesting analogy between the evolution of the design of a MEMS resonator using an EMO and the evolution of human beings.

2.5 Proposed Online Innovization

Figure 2.6: The concept of Online Innovization.

Figure 2.6 illustrates the idea of online innovization. An online innovization procedure has three components, namely:

1. the target data set(s) for extracting knowledge,
2. the type of knowledge representation being extracted, and
3. how to use the extracted knowledge to expedite the convergence of the EMO algorithm.

A schematic sketch of the overall loop is given below; the following subsections then look at each of these aspects in turn.
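The sketch fixes only the order of the three steps within an otherwise generic EMO generation loop. It is a schematic, not the exact EMO/I algorithm of Part II; every operator here (variation, evaluation, survival selection, non-dominated filtering, rule learning and repair) is an assumed callable supplied by the caller:

    # Schematic of the online innovization loop of Figure 2.6; all operators
    # are placeholders passed in by the caller, not a specific implementation.
    def online_innovization_emo(pop, n_gen, variation, evaluate, survival,
                                nondominated, learn_rules, repair,
                                learn_every=10):
        rules = []
        for gen in range(n_gen):
            offspring = variation(pop)                  # recombination + mutation
            if gen > 0 and gen % learn_every == 0:
                target = nondominated(pop)              # 1. target data set
                rules = learn_rules(target)             # 2. knowledge extraction
            # 3. apply the knowledge; repair(x, []) should return x unchanged
            offspring = [repair(x, rules) for x in offspring]
            evaluate(offspring)
            pop = survival(pop + offspring)             # elitist selection
        return pop, rules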
Explicit knowledge on the other hand has crisp mathematical form and can be interpreted by humans unambiguously. Furthermore, it can be implemented as a computer program [55]. 18 A couple of examples of explicit form of knowledge representation in MOO problems are decision trees [20] and regression rules [21]. Kernel based Support vector machines [56], artificial neural networks [57, 58] are other examples but rules learned through them are not very amenable to easy human interpretation. [59] applies classification rule mining [60] to extract rules of the form A −→ B where A is a design feature and B is a class-label that determines the quality of the design (e.g. non-dominated). However the rules in disjunctive normal form combining many design features and permutations thereof become difficult to interpret by human unless we limit the maximum number of features allowed in such rules. [11, 61] have developed a custom unsupervised learning algorithm based on a genetic algorithm to learn power law relations from the data of PO solutions of a MOO problem. This method does produce very interpretable rules in power law form, however it may fail to capture knowledge if it exists in any other form, say a rule with addition or substration operations. [62] employs classification trees to extract decision rules that distinguish between dominated and non-dominated solutions. These rules also stay easy to interpret only until the number of levels in the decision tree stays under five. In this work, we have decided to pursue knowledge of the following two explicit forms: ˆ Power law form and ˆ Free form or algebraic expressions containing {+,−,×,÷} operations on operands. We call these mathematical expressions as “rules” and target them as they are amenable to consistent human “interpretation”. Although we could not find a mathematical definition of interpretability, [22] defines it as the degree to which a human can consistently predict a model’s result. For example, a model having power law structure in terms of design variables, which are common in the problems from engineering and physics [23], are easier to interpret 19 compared to a more accurate model based on, say deep neural networks [24]. Christoph [25] presents a nice case for the need of interpretability in machine learning models. Although [11,61] have used a custom unsupervised learning algorithm to learn power law relations, such a method can be computationally very expensive, given that power laws can be learned using a combination of log-linear modeling along with ordinary least square regression method [63]. Such a method can be applied repeatedly without much cost, however we need to find the transition points in the MOO set via some other method. A transition point in a PO front is a point across which there is a quantitative change in the nature of rules that are adhered to by the PO solutions [10]. Encountering a constraint boundary is a common cause for appearance of transition points in PO fronts for MOO problems from engineering. Furthermore, we have also developed a custom bi-objective genetic programming (GP) method that can be used for learning simple “free form rules from data involving only +,−,×,÷ operations. This is because adding rule parsimony as another objective in GP [64,65] has been known to control the problem of bloating [66] when using GP for regressing rules from data. 
Furthermore, we designed our GP to be able to learn the constants/coefficients involved in the rules accurately without the constants being provided by the user. This capability is otherwise only known to be present in a commercial software named Eureqa [67], which is also a GP-based product. For target rules involving only {+, -, ×, ÷} operations, our GP is quite competitive with the Eureqa software. In addition, our GP has the further capability of conducting dimensional consistency checks on the obtained rules.

2.5.3 Using Extracted Knowledge to Expedite the Convergence

There are EAs that are hybridized with machine learning methods to perform knowledge extraction from a set of solutions and knowledge application to affect the convergence, for example Bayesian optimization [68, 69] and Estimation of Distribution Algorithms [70]. The probabilistic models constructed and updated in these algorithms guide the algorithm on how to sample the decision space to improve the objective values. However, the constructed probabilistic models are relatively difficult for humans to interpret compared to, say, algebraic expressions. The authors of [29] first discover the salient principles by performing a manual innovization task on the solutions found at the end of an EMO run and then use those principles as a heuristic for local search, obtaining a faster convergence than the EMO applied alone. [71] suggests learning the 'innovized' rules using decision trees and then adding the learned rules as if-then-else statement type constraints during the EMO run: it learns distance-based regression trees to extract rules differentiating between solutions near and far from the PO region of interest, and introduces these regression trees as constraints to guide the EMO algorithm towards the region of interest. Learnable Evolution Models (LEM) [72] generate multiple logical rules that relate specific design features to the high-quality solutions in the population. The rules are combined into one logical sentence in disjunctive normal form, and future solutions must satisfy the sentence by conforming to at least one of the rules. [59] uses adaptive operator selection to track the efficacy of each evolutionary operator and each manually introduced knowledge-dependent operator at consistently creating improved solutions, and focuses on using the most effective ones via a credit assignment strategy.

Since we are learning rules in the form of mathematical expressions, we try to use them for directly repairing/modifying solutions and thereby possibly expedite the convergence of the EMO algorithm. Furthermore, we have investigated different repair strategies for rule and repair-variable selection. We will learn more about this in Part-II of this dissertation.

2.6 Dimensional Awareness in Rule Learning

For rules to be physically meaningful, especially the ones coming from problems in engineering and physics, the rules must at least adhere to the law of dimensional homogeneity [26]. The first work to address the issue of learning dimensionally consistent rules from data using Genetic Programming (GP) was [73]. The authors modified the definitions of the terminal set and function set to include the physical dimensions of the quantities, such as length, mass and time. Furthermore, an additional function DimTransform() is defined to guarantee closure.
In addition to the goodness-of-fit objective, an additional objective, "goodness of dimension", is introduced to reduce the number of applications of the DimTransform() function necessary to make dimension-based repairs. Figure 2.7 shows how the DimTransform() function is applied with an example. Here the input tree is adding two dimensionally inconsistent terms, namely "F" with physical dimensions M^1 L^1 T^-2 and another quantity "m" with dimensions M^1 L^0 T^0. The DimTransform() function multiplies the term "m" with a random physical constant "c" having physical dimensions M^0 L^1 T^-2 and a numeric value randomly chosen from the set of allowed random terminal constants. Consequent to this dimension matching operation, the transform adds a dimensional penalty of three units to the output tree. The work showed that the addition of dimension information as another objective produced more parsimonious results instead of results having high fitness and low interpretability. Nevertheless, the algorithm did not perform well when exact scientific constants (e.g. 9.81 m/s^2) were not provided to the algorithm.

Figure 2.7: An example of the DimTransform() function.

[74] is another work that tries to address the issue of finding meaningful rules using dimension information, based on the idea of grammar-based GPs [75, 76]. The authors do so by enforcing dimensional constraints through formal grammars in the GP tree construction and manipulation rules. Furthermore, the authors also suggest a way of initialization such that feasible trees can be generated within a pre-specified maximum tree depth. This approach is impractical for real problems because of the size of the grammar that needs to be defined even for problems of small size. As an example, consider three elementary units, mass (M), length (L) and time (T), to be present in the variables. Further assume that the powers of these basic units are restricted to the integer set {-2, -1, 0, 1, 2} and no fractional powers of basic units are allowed. Thus one needs to exclude operators that yield fractional powers, e.g. the square root operator. Using a restricted function set of {+, -, ÷, ×}, a full specification of the grammar requires 5^3 = 125 symbols, each having many rules. For example, the symbol S_{0,0,0}, representing expressions which are dimensionless, can be obtained with the multiplication function using the following 125 rules:

<S_{0,0,0}> ::= <S_{0,0,0}> × <S_{0,0,0}>
             | <S_{1,0,0}> × <S_{-1,0,0}>
             | <S_{2,0,0}> × <S_{-2,0,0}>
             | ... [122 more such definitions for multiplication.]

Correctly so, the authors of [77] have termed this context-free-grammar approach unfit or impractical for modeling the "units of measurement" or uom system. It is very difficult to produce a feasible initial population with so many constraints, even with a limited set of allowed exponents.

The authors of [78] suggest using a set of dimensionless quantities as the terminal set of the GP. From an initial set of variables, they generate a set of normalized variables which are dimensionless and then use those as part of the terminal set of the subsequent GP. This method, although guaranteed to produce dimensionally consistent expressions, is impractical when the number of variables of interest is large. In such a case, finding a set of dimensionless quantities that can be produced from those variables of interest with elementary operations such as multiplication or division is not trivial and itself becomes an optimization problem.
The authors of [79] try to address the dimensional consistency of GP individuals (trees) in the following way. They first assume an upper bound, u (say), on the largest possible exponent of any fundamental unit (length, mass or time). Then, for any tree with maximum depth D, the maximum exponent of any fundamental quantity can be u × 2^D. For each fundamental quantity, they then evaluate the resultant exponent, which will exceed u × 2^D for trees combining dimensionally incommensurate quantities. Such trees become infeasible in the subsequent optimization procedure used to induce rules from data. However, such a penalty approach makes it almost impossible to find rules having addition or subtraction operations, which are the only operations that distinguish them from power law rules. This is the reason that, out of the forty rules found by the author over three engineering design problems, only one rule has an addition operation.

In the rest of the chapters, we will focus on applying the idea of online innovization to MOO problems from the engineering design domain, where the problem variables are continuous in nature. In the following chapters, we will see how we can:
- Part-I: learn mathematically explicit rules from data that are simple to interpret, and
- Part-II: use them to make variable repairs to EMO population members to bring them closer to the PO front faster.

This concludes our literature survey chapter.

Part I
Rule Learning

Chapter 3
Learning Fixed Form Rules

3.1 Introduction

A MOO problem being solved by an EMO or any other population-based optimization algorithm generates a myriad of data related to the design variables, objective functions and constraint values/violations. Recall from Section 2.5.2 that we are interested in discovering knowledge from this data that can be expressed in the form of mathematical algebraic expressions, which we will refer to as rules in this work. Thus we are looking for rules of the form

ψ(x, f(x), g(x)) = c,   (3.1)

where x is the set of design variables, f(x) are the objective functions, g(x) are the set of constraint functions for the problem, and c is some constant. We can re-write Equation (3.1) as

ψ(x_1, ..., x_{n_x}, f_1, ..., f_{n_f}, g_1, ..., g_{n_g}) = c.   (3.2)

Borrowing the terminology from [11], let us call each design variable and any function of the design variables (including objective functions and constraint functions) a basis function, and let us represent a basis function using the symbol 'φ'. Thus, we can re-write Equation (3.2) as

ψ(φ_1, φ_2, ...) = c.   (3.3)

Figure 3.1 shows a typical scenario of data availability while solving a MOO problem of two objectives, three variables and one inequality constraint.

Figure 3.1: Data availability when solving a MOO problem using an EMO algorithm.

In some MOO problems, the user may want to use additional basis functions other than the ones included in Equation (3.1), e.g. some composite function of the first objective and a constraint. This is acceptable and would require us to calculate this additional basis function during the optimization run. We will focus our attention on the general case of including only variables, objectives and constraints in our study. An underlying assumption in applying these methods is that the values of each basis function are always non-negative, which is generally the case for MOO problems from the engineering design domain. Note that the methodology to learn the two types of rules being presented in this chapter already exists in the literature. We are presenting them for completeness, to maintain coherency of the thesis structure, and for their suitability for computationally efficient application to MOO engineering design problems.
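To make the data layout of Figure 3.1 concrete, the following minimal Python sketch assembles a basis-function matrix from a target data set; the function name and array shapes are illustrative assumptions, not part of the original text, while the sample numbers mirror the four solutions shown in Figure 3.1.

    import numpy as np

    def basis_matrix(X, F, G):
        """Stack variables, objectives and constraints column-wise so that
        column j holds the values of basis function phi_{j+1} for every
        solution in the target data set."""
        # X: (n, n_x) variables, F: (n, n_f) objectives, G: (n, n_g) constraints
        return np.hstack([X, F, G])

    # Hypothetical target data set: 4 solutions, 3 variables,
    # 2 objectives, 1 constraint (values as in Figure 3.1).
    X = np.array([[0.2, 0.8, 0.8], [1.0, 0.1, 1.0],
                  [1.0, 0.4, 0.7], [0.5, 0.9, 0.0]])
    F = np.array([[4.5, 2.8], [6.5, 6.8], [7.1, 6.6], [7.5, 1.6]])
    G = np.array([[1.2], [5.0], [9.6], [3.4]])
    Phi = basis_matrix(X, F, G)   # Phi[:, j] holds phi_{j+1}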
3.2 Learning Constant Rules

We call a rule of the form

ψ(φ_1, φ_2, ..., φ_{n_φ}) ≡ φ_i = c   (3.4)

a "constant" rule, where φ_i is some basis function from the set of n_φ basis functions being considered for rule learning and c is some constant. These are among the most commonly encountered design rules in engineering MOO problems. Examples of a constant rule include a variable or a constraint reaching its boundary value and staying the same across a set of high-quality solutions.

3.2.1 Estimating Rule Parameters and Quality

In Equation (3.4), there is only one parameter, c, to be estimated. Consider a hypothetical MOO problem which requires its PO solutions to have some variable x_1 = 0.5. Let x_1 be named the first basis function, φ_1. Figure 3.2 shows this scenario as the EMO algorithm progresses through the optimization run. Since EMO algorithms work with stochastic sampling, the non-dominated front at the end of an EMO run is very likely not the PO front of the problem but an approximation of it. EMO algorithms rarely reach the PO front of MOO problems unless they are followed by some local search from a set of non-dominated solutions which are close to the PO solutions. Hence, the value of x_1 cannot be expected to be exactly 0.5 for all solutions at the end. Figure 3.2 reflects this: the values of the variable x_1 can be seen to move, on average, closer and closer to the value of 0.5, but they are never exactly 0.5.

Figure 3.2: Values of the basis function φ_1 = x_1 concentrating around 0.5 as the EMO generations progress (e.g. 0.6, 0.2, 0.8, 0.3 in the starting generation versus 0.51, 0.50, 0.49, 0.52 in the last generation).

Hence, the value of the parameter c in Equation (3.4) can be estimated using the mean of basis function φ_i as

ĉ = μ_{φ_i},   (3.5)

which in this example is just the mean of variable x_1 over the target data set. Furthermore, we use the coefficient of variation, C_v, of the φ_i values of the target data set to assess the quality of a constant rule. The coefficient of variation of a data set is defined as

C_v = σ / μ,   (3.6)

where σ is the standard deviation and μ is the mean of the data. The coefficient of variation is a dimensionless quantity and is useful for comparing the variability of samples drawn from distributions with different units. For example, the C_v of the weights of a group of students will be the same whether the data is in kilograms or pounds. The lower the value of C_v for the data of some basis function, the closer it is to being called a constant rule.
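As a sketch of how Equations (3.5) and (3.6) could be applied in code (a minimal illustration, not the dissertation's implementation; the acceptance threshold is an assumption):

    import numpy as np

    def constant_rule(phi, cv_threshold=0.05):
        """Estimate c-hat for the candidate constant rule phi_i = c and
        judge its quality via the coefficient of variation."""
        c_hat = phi.mean()               # Eq. (3.5): c-hat = mean of phi_i
        cv = phi.std(ddof=0) / c_hat     # Eq. (3.6): C_v = sigma / mu
        return c_hat, cv, cv < cv_threshold

    # x1 values from a late generation, clustered near 0.5 (cf. Figure 3.2)
    phi1 = np.array([0.51, 0.50, 0.49, 0.52])
    c_hat, cv, is_constant = constant_rule(phi1)

The division by the mean is safe under the stated assumption that basis function values are non-negative (and, for a candidate constant rule, bounded away from zero).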
Let us now look at the other type of rule we focus on in this chapter, i.e. power law rules.

3.3 Learning Power Law Rules

A power law rule involving basis functions from a MOO problem can be represented as

ψ(φ_1, φ_2, ..., φ_{n_φ}) ≡ ∏_{i=1}^{n_φ} φ_i^{b_i} = c,   (3.7)

where n_φ is the number of basis functions being considered for learning a power law rule from data, b_i is the exponent of the i-th basis function and c is some constant. Equation (3.7) is a multi-variate power law. Such laws are very common in many natural phenomena [23] and have been commonly found in MOO problems from engineering design [10, 11, 45, 80, 81]. By using such a universally occurring functional form for the design principles, it is expected that most correlations between the various basis functions can be captured.

3.3.1 Estimating Parameters

Given the data for n_φ basis functions, if we want to learn a multi-variate power law relation involving all k basis functions, we can use log-linear modeling followed by the ordinary least squares regression method [82]. If

φ_1^{b_1} · φ_2^{b_2} ··· φ_k^{b_k} = c,   (3.8)

then, taking the log of both sides,

b_1 log φ_1 + b_2 log φ_2 + ··· + b_k log φ_k = log c,   (3.9)

which is a linear equation. If log φ_1 is chosen as the regressand and the others as regressors, then

log φ_1 = (-b_2/b_1) log φ_2 + (-b_3/b_1) log φ_3 + ··· + (-b_k/b_1) log φ_k + (log c)/b_1
        = β̂_2 log φ_2 + β̂_3 log φ_3 + ··· + β̂_k log φ_k + γ̂.   (3.10)

The parameters β̂_2, β̂_3, ..., β̂_k and γ̂ are estimates returned by ordinary least squares linear regression (OLSR). OLSR also returns the R²_adj value, which can be used as a metric signifying the quality of such a learned rule. Of course, we can choose any logarithmic term in Equation (3.9) as the regressand and the rest as regressors. Say the i-th log term in Equation (3.9) is chosen as the regressand for applying OLSR. In that case, Equation (3.10) can be re-written as

log φ_i = Σ_{j=1, j≠i}^{k} β̂_j log φ_j + γ̂,   (3.11)

which, upon taking the anti-log of both sides, becomes

φ_i = e^{γ̂} · ∏_{j=1, j≠i}^{k} φ_j^{β̂_j}.   (3.12)
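As an illustration of Equations (3.8)-(3.12), here is a minimal sketch of the log-linear fit (assuming strictly positive basis-function data; variable names and the synthetic example are assumptions for illustration):

    import numpy as np

    def power_law_fit(Phi, i=0):
        """Fit phi_i = e^gamma * prod_j phi_j^beta_j by OLS on log-data,
        following Eqs. (3.9)-(3.12). Phi is (n, k) with positive entries."""
        L = np.log(Phi)
        y = L[:, i]                                  # regressand: log phi_i
        Z = np.delete(L, i, axis=1)                  # regressors: log phi_j, j != i
        Z1 = np.hstack([np.ones((len(y), 1)), Z])    # prepend intercept column
        coef, *_ = np.linalg.lstsq(Z1, y, rcond=None)
        gamma_hat, beta_hat = coef[0], coef[1:]
        return np.exp(gamma_hat), beta_hat           # e^gamma and the exponents

    # Synthetic check: phi_1 = 2.0 * phi_2^0.5 * phi_3^-1
    rng = np.random.default_rng(0)
    phi2, phi3 = rng.uniform(1, 10, 100), rng.uniform(1, 10, 100)
    phi1 = 2.0 * np.sqrt(phi2) / phi3
    c, b = power_law_fit(np.column_stack([phi1, phi2, phi3]), i=0)
    # c ~= 2.0 and b ~= [0.5, -1.0] up to numerical precision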
3.3.2 Comparison with Automated Innovization

The authors of [11] developed an unsupervised machine learning method, combining a grid-based clustering method [83] with a genetic algorithm, to learn rules of the forms given by Equations (3.4) and (3.7). As part of automated innovization (see Section 2.4.2), the authors of [11] needed to learn such rules only once, at the end of a MOO task performed by an EMO algorithm. However, in the case of online innovization (see Section 2.5), we may have to learn such rules multiple times while a MOO task is in progress. If we used the method suggested in [11] to learn these rules, it would be computationally very inefficient because of an EA (for learning rules from the MOO problem data) nested inside an EMO algorithm (for solving the MOO problem). Although our method is computationally much faster at obtaining power law rules than automated innovization, it holds a disadvantage as well: it cannot identify rules that are applicable to only part of the PO front, say 50% of it, rather than the entire front. We will come back to this point in Section 7.2, where we have tried to address this shortcoming.

3.3.3 Learning Multiple Rules Simultaneously

It is possible that the non-dominated solutions of a MOO problem carry multiple design rules simultaneously. For example, not only is some constraint g_1 at a constant upper bound, but two variables x_1 and x_2 are also related in a power law form. Using the method given in Section 3.3, we can efficiently learn as many rules as our computational budget allows.

Consider the MOO problem shown in Equation (2.1) having n_f objectives, n_x design variables and n_g inequality constraints. If we consider only variables, objective functions and constraint functions as basis functions, we have a minimum of n_φ = n_x + n_f + n_g basis functions for learning rules. With n_φ basis functions, we exhaustively have 2^{n_φ} - 1 candidate constant and power law rules for which we would have to estimate the parameters and quality. This number of candidate rules can become prohibitively large very quickly. Furthermore, the interpretability and usefulness of such design rules for an engineer quickly diminish as the number of basis functions in a rule increases. For these reasons, we limit ourselves to learning rules with a maximum of two or three basis functions.

This chapter has covered the rule forms that are commonly encountered in MOO engineering design problems and whose quality can be ascertained quickly. However, we are also interested in learning rules where the form is not constrained a priori. In the next chapter, we will look at an efficient way of learning "free form" non-linear rules from data.

Chapter 4
Learning Free Form Rules Using a Symbolic Regression Task

In Chapter 3, we saw how we can efficiently learn rules of two fixed forms, constant rules and power law rules, from MOO data. These two forms are commonly encountered in MOO engineering design problems and can be learned quickly from data. In this chapter, however, we wish to expand our scope of knowledge extraction to rules with a weaker set of constraints limiting their form. Yet we want to stay in the territory of mathematical/algebraic expressions that are simpler to interpret than, say, some kernel-based rules [84], and that still have the ability to capture complex physical behavior.

4.1 Form of Rules

We are focusing on learning rules of the form

ψ(φ_1, φ_2, ..., φ_{n_φ}) ≡ φ_i = w_0 + Σ_{j=1, j≠i}^{n_t} w_j · t_j,   (4.1)

where ψ represents one such rule; the φ are the basis functions explained in Section 3.1 that represent the data from the MOO problem; φ_i is a basis function that can either be the regressand in a symbolic regression problem or a class label in a classification problem; w_0 and the w_j are numerical weights; n_t is the number of terms in the rule; and each term t_j is some function of the φ_j, j ∈ {1, 2, ..., n_φ} \ i, using the operations {+, -, ×, ÷}. Such rules are good candidates for learning using genetic programming [35].

4.2 A Primer on Genetic Programming

A GP can be considered an application of GAs in which the space of solutions to search consists of programs or equations for solving a task [27, 35, 85]. Instead of decision variables, an individual is a program or an equation that solves the task. In order to create an initial population, a terminal set T and a function set F must be pre-specified. A terminal set consists of operands (constants and variables), whereas a function set consists of operators or basic functions. Like other evolutionary computation techniques, a typical GP starts with a population of randomly created individuals, which in this case are mathematical expressions. The fitness of each individual is determined by evaluating the rules. High-fitness individuals are selected to form the mating pool, on which the primary genetic operations, namely crossover and mutation, are applied to create a new population of programs. The process is repeated until some termination criterion (such as a maximum number of generations) is met. An individual in a GP can be represented using different data structures, such as strings of words [86], trees [87] or graphs [88]. We will be using the tree data structure to represent a rule; hence we will discuss a few important concepts in the context of tree-based GPs.
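As a minimal sketch of this tree representation (an illustration, not the dissertation's implementation), using the terminal and function sets of the example that follows:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """One node of a GP expression tree: an operator from the function
        set with two children, or a terminal (constant or variable)."""
        symbol: str
        children: list = field(default_factory=list)

        def evaluate(self, env):
            ops = {'+': lambda a, b: a + b,
                   '-': lambda a, b: a - b,
                   '*': lambda a, b: a * b}
            if self.symbol in ops:
                left, right = (c.evaluate(env) for c in self.children)
                return ops[self.symbol](left, right)
            # Terminal: a variable looked up in env, or a numeric constant.
            return env[self.symbol] if self.symbol in env else float(self.symbol)

        def size(self):
            """Number of nodes; used later as the complexity measure."""
            return 1 + sum(c.size() for c in self.children)

    # y(x) = 1 + x, the first tree of Figure 4.1
    tree = Node('+', [Node('1'), Node('x')])
    assert tree.evaluate({'x': 3.0}) == 4.0 and tree.size() == 3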
Consider a GP with terminal set T = {1, 2, x} and function set F = {+, -, *}. Figure 4.1 shows two candidate solutions, y(x) = 1 + x and y(x) = 2x² - 2x + 1, that belong to the set of valid GP individuals for such a GP.

Figure 4.1: Example of two GP solutions using the tree representation.

Furthermore, sub-tree swap crossover [89] is a popular crossover mechanism used in tree-based GPs; an example is shown in Figure 4.2. A sub-tree to be exchanged between two GP individuals (parents) is first chosen at random in each parent. The sub-tree crossover operation is then completed by exchanging the chosen sub-trees between the two parents. A Koza-style sub-tree mutation [35] involves swapping either a terminal with another element from the terminal set or a function with another element from the function set. Figure 4.3 shows an example of sub-tree mutation of an individual. When swapping functions, care must be taken that the arities of the functions being swapped are the same.

Figure 4.2: An example of a sub-tree crossover between two GP individuals (parents y(x) = 2x² - 2x + 1 and y(x) = 1 + x producing children y(x) = 2x² + 1 and y(x) = x).

Figure 4.3: An example of a Koza-style sub-tree mutation of a GP individual (mutating y(x) = 2x² + 2 into y(x) = 2x² + 1).
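Continuing the Node sketch from above, a sub-tree swap crossover could look like the following (illustrative; the random selection of crossover points is simplified):

    import copy
    import random

    def collect(node, out):
        """Depth-first list of all nodes in a tree."""
        out.append(node)
        for child in node.children:
            collect(child, out)
        return out

    def subtree_crossover(parent1, parent2, rng=random):
        """Swap one randomly chosen sub-tree between deep copies of the
        parents, as in Figure 4.2."""
        c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
        n1 = rng.choice(collect(c1, []))
        n2 = rng.choice(collect(c2, []))
        # Exchanging the contents of the chosen nodes in place swaps the
        # two sub-trees rooted at them.
        n1.symbol, n2.symbol = n2.symbol, n1.symbol
        n1.children, n2.children = n2.children, n1.children
        return c1, c2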
4.3 A Custom GP (CGP)

GPs are used extensively for inducing mathematical models from observations. Currently, two of the most popular commercial products for inducing mathematical models from data, Eureqa [67] and DataModeler [90], are GP-based products. We developed our own customized GP, or CGP, by experimenting with and combining many different ideas, from the literature as well as our own, to improve its performance. Some of these ideas exist in the literature; however, they have not been tried all together in one GP implementation. This section describes the important ideas implemented in CGP, along with a flowchart description towards the end.

4.3.1 Two Objectives: Prediction Error and Rule Complexity

The designer of any classification or regression technique faces a dilemma: whether to learn rules that are accurate, or rules that are simple for human interpretation [91]. Prediction error and rule complexity are conflicting objectives, and we usually decide (often unknowingly) a priori which one is more important for us by choosing a method. For a given problem, it is critical to have a clear idea of which is the priority, accuracy or interpretability, so that this trade-off can be made explicitly rather than implicitly. Figure 4.4 shows this typical trade-off when selecting a machine learning method.

We decided to minimize both simultaneously by basing our GP on the bi-objective optimization algorithm NSGA-II [92]. Furthermore, having more than one objective in a GP is known to reduce the problem of bloat [93]. The complexity of a rule can be defined in many ways [94]; we define it as the number of nodes in the binary tree needed to represent the rule.

Figure 4.4: The two conflicting objectives, prediction error (minimize) and rule complexity (minimize), in any machine learning algorithm; methods such as linear regression, linear SVMs and decision trees sit at the simple end, while random forests, kernel-based regression and deep neural nets sit at the complex end.

4.3.2 Using Multiple Small Trees

As the complexity of the rules to be learned increases, so does the size of their binary tree representation. This tree representation, which resides in the phenotype space of the GP, is the object that undergoes the important crossover and mutation operations. If the size of this tree is big, the possibility of a beneficial crossover reduces drastically. For this reason, [95] suggested having multiple depth-limited small binary trees as the phenotype of a GP individual instead of a single big tree, and called the resulting GP a multi-gene genetic program, or MGGP. Unlike traditional GP, each symbolic model (and each member of the GP population) in MGGP is a weighted linear combination of the outputs of a number of GP trees. Each of these trees may be considered a "gene". Figure 4.5 shows an example of a multi-gene GP individual. This fits very nicely with the type of rules shown in Equation (4.1) that we are aiming to learn.

Figure 4.5: An MGGP individual, e.g. y = w_0 + w_1 · (x_1 × x_2) + w_2 · (x_1 + x_2²), composed of several small binary expression trees (genes) instead of one big binary expression tree.

MGGP, however, treats the problem as a single-objective optimization problem and faces the problem of horizontal bloat [95]. Horizontal bloating is the tendency of multi-gene models to acquire genes that are either performance-neutral (i.e. deliver no improvement in prediction error on the training data) or offer only very small incremental performance improvements. In the case of CGP, the bi-objective formulation helps allay this issue.

4.3.3 Learning Weights Using a Faster Method

Unlike traditional GPs, which try to learn the constants/coefficients/weights of a rule along with the rule structure, we separate that task completely from the GP. Figure 4.6 shows a sample individual with multiple genes. Like MGGP [95], we let evolution optimize the structure of the rule, but we use a classical optimization method, such as OLSR or linear support vector machines, to learn the weights in the rule.

Figure 4.6: Our GP learns the rule structure (with a custom bi-objective GP) and the rule weights (with a classical method, e.g. regression or SVM) separately.

4.3.4 Diversity Preserving Mechanism

In the bi-objective problem formulation of Section 4.3.1, the rule complexity objective is discrete in nature. This highly discretized objective space causes good individuals encountered early in a CGP run to be reproduced and selected multiple times, producing many copies of these early good solutions. This leads to the population reducing to a few individuals and their duplicates, and thus to a complete loss of diversity and premature convergence of the algorithm [93]. To counter this, we need to adopt a strategy that can:
- penalize all but one copy of each solution having duplicates (in both objectives), so that the ranking of the duplicate solutions worsens; and
- push the non-domination rankings of duplicate solutions higher through this penalization, while ensuring that the penalized duplicate(s) of a solution with a lower non-domination rank continue to have a lower non-domination rank than the penalized duplicate(s) of a solution with a higher non-domination rank.

Figure 4.7 illustrates this idea with the help of a hypothetical example.

Figure 4.7: Fitness adjustment by penalizing duplicate solutions to promote population diversity in CGP. (A) shows the original state, with two non-dominated fronts and a total of seven solutions. (B) Solution-4, a duplicate of solution-3, after penalization.
(C) Solution-6, a duplicate of solution-5, after penalization.

Part-A of Figure 4.7
- Part-A of Figure 4.7 shows the starting state of the solutions. There are a total of seven solutions, including two duplicates.
- Solutions 1, 2, 3, 5 and 7 are shown with black dots. Solution-4, which is a duplicate of solution-3, is shown with a blue circle. Similarly, solution-6, which is a duplicate of solution-5, is shown with a red circle.
- The green dot represents the worst objective vector [31].
- The non-dominated front F1 comprises solutions {1, 2, 3, 4}. The non-dominated front F2 comprises solutions {5, 6, 7}.
- We begin by penalizing solution-4, the duplicate solution present in front F1. Let solution-4 be represented by the objective vector (f_1^{(4)}, f_2^{(4)}) and the worst objective vector by (f_1^{(w)}, f_2^{(w)}). Then the updated objective vector of solution-4 is given by

(f_1^{(4)}, f_2^{(4)}) <- (f_1^{(4)} + f_1^{(w)}, f_2^{(4)} + f_2^{(w)}).   (4.2)

This results in the creation of another non-dominated front, F3, for the updated solution-4.

Part-B of Figure 4.7
- Part-B of Figure 4.7 shows the updated state of the solutions, where updated solution-4 now belongs to a new non-dominated front F3 and the worst objective vector is updated to the penalized objective vector of solution-4, i.e.

(f_1^{(w)}, f_2^{(w)}) <- (f_1^{(4)}, f_2^{(4)}).   (4.3)

- The next solution to be penalized is solution-6 (shown with a red circle), which is a duplicate of solution-5 (shown with a black dot). As in the case of solution-4, solution-6 is penalized as

(f_1^{(6)}, f_2^{(6)}) <- (f_1^{(6)} + f_1^{(w)}, f_2^{(6)} + f_2^{(w)}).   (4.4)

This results in the creation of another non-dominated front, F4, for the updated solution-6.

Part-C of Figure 4.7
- Part-C of Figure 4.7 shows the updated state of the solutions, where penalized solution-4 now belongs to the new non-dominated front F3, penalized solution-6 now belongs to the new non-dominated front F4, and the worst objective vector is updated to the penalized objective vector of solution-6, i.e.

(f_1^{(w)}, f_2^{(w)}) <- (f_1^{(6)}, f_2^{(6)}).   (4.5)

- Note that the relative non-domination ranking of solutions 4 and 6, both before and after penalization, remains the same. Thus the non-domination hierarchy of the penalized solutions is preserved by this method.
- At this stage, no more duplicate solutions are present in any of the non-dominated fronts, which prompts the penalization algorithm to stop.

This penalization of duplicate solutions helps reduce the clout of duplicates of good solutions, and gives poorer solutions in higher non-dominated ranks a better chance of surviving and participating in the parent selection of the evolutionary algorithm.
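A compact sketch of this duplicate-penalization step (an illustration of Eqs. (4.2)-(4.5); the tuple representation, the per-objective initialization of the worst vector, and the rank-sorted input are assumptions, not the dissertation's implementation):

    def penalize_duplicates(objs):
        """Add the running worst objective vector to every duplicate
        (all copies beyond the first). `objs` is a list of (f1, f2)
        tuples, assumed sorted by non-domination rank so that better-
        ranked duplicates are penalized first."""
        worst = (max(f1 for f1, _ in objs), max(f2 for _, f2 in objs))
        seen, out = set(), []
        for f1, f2 in objs:
            if (f1, f2) in seen:                       # a duplicate: penalize it
                f1, f2 = f1 + worst[0], f2 + worst[1]  # Eq. (4.2)/(4.4)
                worst = (f1, f2)                       # update worst, Eq. (4.3)/(4.5)
            else:
                seen.add((f1, f2))
            out.append((f1, f2))
        return out

Because the worst vector is updated after each penalization, a later-penalized duplicate always lands on a worse front than an earlier-penalized one, preserving the non-domination hierarchy described above.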
4.3.5 Higher and Lower Level Crossovers

Recall from Section 4.3.2 that the genotypic space of CGP consists of many small trees, or genes. Borrowing the idea from [95], CGP uses two types of crossovers, namely a low-level crossover and a high-level crossover. Any two parent individuals chosen to reproduce undergo a crossover with probability p_c. With the (preferably small) remaining probability, the individuals do not go through a crossover operation, and the outcome is two child individuals that are identical copies of their parents. When a crossover does happen, it can be either of the high-level type, with probability p_ch, or of the low-level type, with probability p_cl = 1 - p_ch.

Consider two individuals from the CGP population, having three and two terms respectively, as shown in the left half of Figure 4.8. For a high-level crossover to occur between these two individuals, CGP randomly chooses one term from each individual and then swaps the chosen terms between the individuals to create two children. This process is shown pictorially in Figure 4.8, where the second term of parent-1 is swapped with the second term of parent-2.

Figure 4.8: Example of the high-level crossover used in CGP.

If a low-level crossover is to be carried out, CGP first chooses one term from each parent and then carries out a sub-tree crossover between those two terms. This process is shown in Figure 4.9.

Figure 4.9: Example of the low-level crossover used in CGP.
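A minimal sketch of the high-level (gene-swap) crossover, treating an individual as a list of gene trees (illustrative names; builds on the Node sketch above, which also provides the sub-tree crossover used at the low level):

    import copy
    import random

    def high_level_crossover(genes1, genes2, rng=random):
        """Swap one randomly chosen gene (a whole tree) between two
        multi-gene individuals, as in Figure 4.8."""
        c1, c2 = copy.deepcopy(genes1), copy.deepcopy(genes2)
        i, j = rng.randrange(len(c1)), rng.randrange(len(c2))
        c1[i], c2[j] = c2[j], c1[i]
        return c1, c2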
4.3.6 CGP Flowchart for Rule Learning

Figure 4.10 shows the flowchart for the rule-learning part of CGP. Except for one block, the penalization of duplicates, everything else is the same as in the NSGA-II algorithm. The algorithm begins with the initialization of a population, say of N individuals, composed of tree structures as explained in Section 4.3.2, each with no more than n_t trees. Each individual represents a rule of the form given in Equation (4.1). The maximum depth [96] of each tree, say d_max, is also specified at the time of initialization. The fitness functions are then invoked to evaluate both the prediction error on the training set and the complexity objective for the entire initial population. These individuals are then assigned non-domination ranks and crowding distances [31], just as in NSGA-II [92].

Figure 4.10: The flowchart for the Custom Genetic Program, or CGP.

Once this parent population is ranked, the parent selection process produces a list of parents that are allowed to reproduce children for the next generation. CGP uses binary tournament selection [31] for selecting parents; such a parent selection process lets the fittest individuals in the population mate more often. Once these parents are selected, they go through the genetic operations of crossover and mutation to produce a child population of N individuals. The crossover operation transpires via one of the two alternatives, either a high-level crossover or a low-level crossover, both of which are explained in Section 4.3.5. After the crossover operation, the N child individuals undergo the mutation operation. For each individual, a mutation is carried out with probability p_m; otherwise the child individual is left unchanged. In CGP, to mutate an individual (composed of many trees), one of the trees is first randomly selected and then a Koza-style sub-tree mutation [35] is carried out on the chosen tree.

After undergoing the crossover and mutation operations, CGP evaluates the fitness of the N child individuals. These N children are then combined with the N parent individuals of the current generation to obtain a merged population of size 2N. This population of 2N individuals is passed on to the survivor selection procedure, where all 2N individuals are again ranked and assigned crowding distances before N individuals are selected using the crowded tournament selection operator [31] of NSGA-II. This population of N individuals may then contain some duplicate solutions. These duplicate solutions are penalized using the method given in Section 4.3.4, and the entire population is again assigned rank and crowding distance values. If the termination condition is not met, these N individuals become the parent population for the next generation. This process continues until the termination condition is met, and the final PO set of solutions is reported. Table 4.1 lists the parameters of CGP. We now look at how we can use CGP for a symbolic regression task.

Table 4.1: List of parameters in CGP.

Symbol | Description                                                            | Suggested Value
N      | Population size                                                        | 50-200
G      | Number of generations                                                  | 100-500
n_t    | Maximum number of terms in a CGP individual                            | 3-10
d_max  | Maximum depth of trees representing individual terms of an individual  | 4-10
p_c    | Probability of crossover                                               | 0.8-0.95
p_ch   | Probability of high-level crossover                                    | 0.2-0.3
p_cl   | Probability of low-level crossover                                     | 1 - p_ch
p_m    | Probability of mutation                                                | 0.05-0.2

4.4 Using CGP for a Symbolic Regression Task

Now that we are familiar with the working of CGP, recall that we wish to use CGP to learn rules of the form given by Equation (4.1) from MOO data as part of online innovization. This is a problem of symbolic regression, a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given data set. It is a different and more difficult task than conventional regression analysis, which seeks to optimize the parameters of a pre-specified mathematical model structure involving the regressand and regressors.

4.4.1 Evaluating the Fitness of a CGP Individual for a Symbolic Regression Task

Recall from Section 4.3.1 that CGP approaches the rule-learning problem as a bi-objective optimization problem, minimizing prediction error and rule complexity. Let us look at how these two objectives are calculated, one by one. The prediction error fitness function in CGP for the symbolic regression task is an extension of the ordinary least squares regression (OLSR) estimation method of linear regression. Consider a linear regression model

y = Zw + ε,   (4.6)

where Z ∈ R^{n×(k+1)} is a matrix with n observations on k independent variables whose first column contains only ones, to include the bias term in the regression model; y ∈ R^{n×1} is a vector of observations of the dependent variable; ε ∈ R^{n×1} is a vector of errors; and w ∈ R^{(k+1)×1} is the vector of unknown regression coefficients, including the bias term w_0, that we wish to estimate. Equation (4.6) in expanded form can be written as

[y_1; y_2; ...; y_n] = [1  Z_{11}  ···  Z_{k1};  1  Z_{12}  ···  Z_{k2};  ...;  1  Z_{1n}  ···  Z_{kn}] · [w_0; w_1; ...; w_k] + [ε_1; ε_2; ...; ε_n],   (4.7)

where rows are separated by semicolons and Z_{ji} denotes the i-th observation of the j-th variable. Then the OLS estimate of the regression coefficients [97] w_i, i ∈ {0, 1, ..., k}, is given by

ŵ = (Zᵀ Z)⁻¹ Zᵀ y.   (4.8)
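For illustration, the OLS estimate of Eq. (4.8) takes only a few lines (a sketch; in practice a least-squares solver is usually preferred over an explicit normal-equation inverse for numerical stability):

    import numpy as np

    def ols_weights(Z, y):
        """OLS estimate w-hat = (Z^T Z)^{-1} Z^T y of Eq. (4.8).
        Z must already contain the leading column of ones."""
        return np.linalg.solve(Z.T @ Z, Z.T @ y)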
Consider a five-variable data set having n observations, where we wish to symbolically regress a relation of the form x_4 = f(x_1, x_2, x_3, x_5). When we solve this problem using CGP, the CGP is initialized with a number of random CGP individuals. Consider an example of such an individual, as shown in Figure 4.6, having three terms: x3./(x1 + x2), x2./x3 and x1.*x5, where the operations './' and '.*' represent element-wise division and multiplication of two vectors. For finding rules of the form shown in Equation (4.1) using this individual, the problem boils down to finding the appropriate weights w_i, i ∈ {0, 1, 2, 3}, in the relation

[x_{4,1}; x_{4,2}; ...; x_{4,n}] = [1  x_{3,1}/(x_{1,1}+x_{2,1})  x_{2,1}/x_{3,1}  x_{1,1}·x_{5,1};  ...;  1  x_{3,n}/(x_{1,n}+x_{2,n})  x_{2,n}/x_{3,n}  x_{1,n}·x_{5,n}] · [w_0; w_1; w_2; w_3].   (4.9)

Comparing Equation (4.9) with (4.7), we can obtain the weight estimates using Equation (4.8). The x̂_4 calculated for all observations can then be compared with the actual x_4 values to get the prediction error e = x_4 - x̂_4. Note that e ∈ R^{n×1} is a vector of prediction errors, where n is the number of observations. Then, for this individual, the error objective, say f_error, can be calculated as the root mean square of e over all observations:

f_error = sqrt(eᵀe / n).   (4.10)

The complexity fitness function simply counts the total number of nodes in the trees of all terms of a CGP individual, i.e.

f_complexity = Σ (nodes in all terms of the CGP individual).   (4.11)

Again considering the example CGP individual shown in Figure 4.6 with three terms, f_complexity = 11 for that individual.
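Putting Section 4.4.1 together, here is a minimal sketch of evaluating both objectives of a multi-gene individual (term values are assumed pre-computed; names are illustrative). Note that the node counts in the commented example, 5 + 3 + 3, reproduce the f_complexity = 11 of the Figure 4.6 individual.

    import numpy as np

    def evaluate_individual(term_values, y, node_counts):
        """Bi-objective fitness of a CGP individual: fit the weights by
        OLS (Eq. (4.8)) and return (f_error, f_complexity) per
        Eqs. (4.10)-(4.11). term_values is an (n, n_t) matrix whose
        columns are the evaluated terms t_j."""
        Z = np.hstack([np.ones((len(y), 1)), term_values])  # bias column, Eq. (4.7)
        w, *_ = np.linalg.lstsq(Z, y, rcond=None)           # OLS weights
        e = y - Z @ w                                       # prediction errors
        f_error = float(np.sqrt(e @ e / len(y)))            # Eq. (4.10)
        f_complexity = sum(node_counts)                     # Eq. (4.11)
        return f_error, f_complexity

    # For the three-term individual of Figure 4.6 on data x1..x5:
    # term_values = np.column_stack([x3/(x1+x2), x2/x3, x1*x5])
    # node_counts = [5, 3, 3]   ->   f_complexity = 11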
4.5 CGP Results on Test Problems

Let us look at some results of symbolic regression on test problems using CGP.

4.5.1 Test Problem-1

This problem has a single variable x_1 as the regressor for the regressand y. The data for this problem was generated using the equation

y = x_1 - x_1^3/3! + x_1^5/5!,   (4.12)

where 100 values of x_1 are sampled from a uniform distribution over [-π, π]. Figure 4.11 shows the relation between x_1 and y graphically. The CGP parameters used in solving this problem are given in Table 4.2. The PO solutions obtained by CGP for this problem are shown in Figure 4.12, in which each point represents a regressed non-linear relation between y and x_1. Also shown in Figure 4.12 is the regressed rule for a knee [98] solution. The trees corresponding to the three terms of this knee solution are shown in Figure 4.13. The expression for the chosen knee solution is almost the same as the true relation shown in Equation (4.12), except for a small bias term of 1.155 × 10^-14, which is a numerical error and can be ignored.

Figure 4.11: Graphical representation of the test problem of Section 4.5.1.

Table 4.2: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.1.

Parameter | N  | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 10 | 30 | 20  | 10    | 0.95 | 0.20 | 0.80 | 0.05

Figure 4.12: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.1.

Figure 4.13: The three trees, t_1, t_2 and t_3, corresponding to the chosen knee solution shown in Figure 4.12.

4.5.2 Test Problem-2

This problem has a single variable x_1 as the regressor for the regressand y. The data for this problem was generated using the equation

y = (x_1 + 1) / (x_1^2 + x_1 + 1),   (4.13)

where 100 values of x_1 are sampled from a uniform distribution over [-2, 2]. Figure 4.14 shows the relation between x_1 and y graphically. The CGP parameters used in solving this problem are given in Table 4.3. The PO solutions obtained by CGP for this problem are shown in Figure 4.15, in which each point represents a regressed non-linear relation between y and x_1. Also shown in Figure 4.15 is the regressed rule for the solution with the least error and highest complexity among the PO solutions. The tree corresponding to this solution is shown in Figure 4.16. The expression for the chosen solution is exactly the same as the true relation shown in Equation (4.13).

Figure 4.14: Graphical representation of the test problem of Section 4.5.2.

Table 4.3: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.2.

Parameter | N  | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 10 | 50 | 50  | 6     | 0.85 | 0.20 | 0.80 | 0.05

Figure 4.15: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.2.

Figure 4.16: The tree corresponding to the chosen solution shown in Figure 4.15.

4.5.3 Test Problem-3

The Bernoulli equation [99] is a famous equation from the subject of fluid mechanics. Consider an incompressible fluid flowing under steady flow conditions. Then, at any arbitrary point along a streamline, the following holds:

y + v^2/(2g) + p/(ρg) = c,   (4.14)

where
y is the elevation of the point above a reference plane,
v is the fluid flow speed at a point on the streamline,
p is the pressure at the chosen point,
g is the acceleration due to gravity,
ρ is the density of the fluid at all points in the fluid, and
c is a constant for the chosen streamline, also called the energy head.

To generate data to test CGP, we assumed the fluid to be water (ρ = 1000 kg/m^3), took g to be 9.81 m/s^2, and chose the value c = 20 m of energy head for some streamline of water flowing under steady flow conditions. We then randomly sampled 100 values each for the variables v and p from uniform distributions over [0, 10] m/s and [101325, 400000] Pa respectively (the atmospheric pressure at the surface of the Earth is approximately 101325 Pa). For these values of v, p, g, ρ and c, we then calculated 100 values of y using the relation

y = 20 - 0.051 · v^2 - 1.0194 × 10^-4 · p.   (4.15)

Note that Equation (4.15) is obtained by substituting the values of g, ρ and c assumed above into Equation (4.14). Considering y as the regressand variable and v and p as regressors, CGP was supplied with this data to learn the non-linear relation among these variables. The CGP parameters used in solving this problem are given in Table 4.4. The PO solutions obtained by CGP for this problem are shown in Figure 4.17, in which each point represents a regressed non-linear relation between y and (v, p). Also shown in Figure 4.17 is the regressed rule for a knee solution. The trees corresponding to this solution are shown in Figure 4.18. The expression for the chosen solution is exactly the same as the true relation shown in Equation (4.14).

Table 4.4: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.3.

Parameter | N  | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 28 | 30 | 6   | 4     | 0.95 | 0.20 | 0.80 | 0.05

Figure 4.17: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3.

Figure 4.18: The two trees, t_1 = v × v and t_2 = p, corresponding to the chosen knee solution shown in Figure 4.17.
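For reproducibility, the data for this Bernoulli test problem could be generated as follows (a sketch using the values stated above; the random seed is an arbitrary assumption):

    import numpy as np

    rng = np.random.default_rng(42)            # arbitrary seed
    g, rho, c = 9.81, 1000.0, 20.0             # m/s^2, kg/m^3, m of energy head
    v = rng.uniform(0.0, 10.0, 100)            # flow speed, m/s
    p = rng.uniform(101325.0, 400000.0, 100)   # pressure, Pa
    y = c - v**2 / (2 * g) - p / (rho * g)     # Eq. (4.14) solved for y
    # Numerically, y ~= 20 - 0.051*v**2 - 1.0194e-4*p, i.e. Eq. (4.15),
    # the target relation CGP is asked to recover.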
4.5.4 Test Problem-4

Let us look at the equation for the deflection of a simply supported beam [100]. Consider a simply supported beam of length l having a mass per unit length of w. If E is the Young's modulus of the material of the beam and I is the area moment of inertia of the beam (about the axis of bending), then the deflection of the beam at a distance x from one end is given by

Δ_x = (w x / (24 E I)) · (l^3 - 2 l x^2 + x^3).   (4.16)

To generate data to test CGP, we assumed the beam to be made of steel with the following known parameters:

E = 29,000,000 psi, w = 0.568 lb/inch, I = 0.167 inch^4, l = 39.37 inch.   (4.17)

Substituting these values in Equation (4.16), we get

Δ_x = 4.89 × 10^-9 x^4 - 3.85 × 10^-7 x^2 + 2.98 × 10^-4 x.   (4.18)

Using this relation, we then calculated 100 values of Δ_x based on 100 values of x ∈ (0, l]. Considering Δ_x as the regressand variable and x as the regressor, CGP was supplied with this data to learn the non-linear relation between these variables. The CGP parameters used in solving this problem are given in Table 4.5. The PO solutions obtained by CGP for this problem are shown in Figure 4.19, in which each point represents a regressed non-linear relation between Δ_x and x. Also shown in Figure 4.19 is the regressed rule for a knee solution. The trees corresponding to this solution are shown in Figure 4.20. The expression for the chosen solution is exactly the same as the true relation shown in Equation (4.18).

Table 4.5: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.4.

Parameter | N | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 6 | 30 | 30  | 6     | 0.95 | 0.20 | 0.80 | 0.05

Figure 4.19: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.4.

Figure 4.20: The three trees corresponding to the chosen knee solution shown in Figure 4.19.

4.6 Noise Study

For the test problem of Section 4.5.3, a noise study was also performed. The purpose of this study was to understand how much noise needs to be added to the regressand before the true relation is no longer present among the PO set of solutions of CGP. Gaussian white noise was added to the value of the original regressand y as

ỹ = y + a · N(0, 1),   (4.19)

where a, the noise factor, is some numeric value used to change the level of noise and N(0, 1) is the standard normal distribution. The noise levels are represented in terms of the signal-to-noise ratio (SNR) and measured in decibels (dB) using the formula

SNR = 10 log_10 [ (Σ_{n=0}^{L-1} y^2(n)) / (L · a^2) ] dB,   (4.20)

where n is the index of the observed value, L is the number of observations (100 in this case), and a^2 is the variance of the added noise. This formula is taken from the work [101]. A high SNR means low noise. The values of the regressors v and p were not changed. Effectively, CGP attempted to learn the relation

ỹ = 20 - 0.051 · v^2 - 1.0194 × 10^-4 · p   (4.21)

instead of the one given by Equation (4.15). Note that the regressands are different in the two equations.
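A sketch of the noise injection of Eq. (4.19) and the SNR computation of Eq. (4.20) (illustrative; y is the clean regressand from the data-generation sketch above):

    import numpy as np

    def add_noise(y, a, rng):
        """Corrupt the regressand with Gaussian white noise, Eq. (4.19)."""
        return y + a * rng.standard_normal(len(y))

    def snr_db(y, a):
        """Signal-to-noise ratio of Eq. (4.20), in decibels."""
        return 10.0 * np.log10(np.sum(y**2) / (len(y) * a**2))

    # With y from the Bernoulli data-generation sketch and a = 0.6090,
    # snr_db(y, 0.6090) should land near the 20 dB row of Table 4.6.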
Table 4.6 shows the results of this study. The first column carries the value of the noise factor a, the second column the SNR in dB, the third column the rule corresponding to the knee solution from the PO set obtained by CGP on the noisy data, and the last column the R²_adj (goodness-of-fit) value of the knee solution. This study shows that the true solution is present among the PO set of solutions even up to a noise level as high as SNR = 15 dB, although the goodness of fit suffers considerably at that level of noise.

Table 4.6: Results of the noise study performed on the test problem of Section 4.5.3.

a      | SNR (dB) | Knee solution rule                                                          | R²_adj
0.0010 | 100      | ỹ = 20 - 0.051v^2 - 1.019 × 10^-4 p                                         | 1
0.1925 | 30       | ỹ = 20 - 0.051v^2 - 1.000 × 10^-4 p                                         | 0.9882
0.3431 | 25       | ỹ = 20 - 0.051v^2 - 9.9 × 10^-5 p                                           | 0.9622
0.6090 | 20       | ỹ = 19 - 0.051v^2 - 9.7 × 10^-5 p                                           | 0.8484
1.0825 | 15       | ỹ = 19 - 0.052v^2 - 9.2 × 10^-5 p                                           | 0.6993
5.4315 | 1        | ỹ = 28 - 6.3v + 1.9 × 10^-4 p + 2.6 × 10^-4 vp - 4.7v^2 + 0.72v^3 - 0.036v^4 | 0.0125

4.7 Choosing a Solution

The results in the previous section show that CGP is competitive at symbolically regressing a set of PO rules that provide a good trade-off between prediction error and rule complexity. However, recall from Sections 2.5 and 2.5.3 that to perform online innovization we not only have to learn the rules present in the MOO data but also use that knowledge to modify solutions and expedite the EMO algorithm's convergence. Hence, if we use CGP for rule learning, once we have a set of PO solutions from CGP we also need to choose a solution before passing control back to the EMO algorithm. This can be done in two ways:

1. Choosing a knee solution: Among the PO solutions of a bi-objective optimization problem, a knee point is a special solution, because choosing any solution other than the knee requires a large sacrifice in one objective to gain a small amount in the other. A good amount of literature is available on detecting and choosing a knee solution in a PO front [98, 102]. It can be seen from the PO fronts shown in Figures 4.12, 4.15, 4.17 and 4.19 that these problems show a strong propensity for knee points. Hence, choosing a knee solution in an automated way can be one way to choose one of the many PO solutions of CGP.

2. Checking dimensional consistency: Recall from Section 2.6 that one of the advantages of learning rules in the form of mathematical algebraic expressions is that one can check whether a learned rule adheres to the law of dimensional homogeneity. A rule that adds a physical quantity having the dimensions of length to a physical quantity having the dimensions of mass cannot add anything to the knowledge of a designer, even if the rule has a low prediction error on the training data. Hence, once CGP provides a set of PO solutions, we can check the dimensional consistency of those rules and discard the ones that violate it. We call this a serial dimensional awareness check because we apply the dimensionality check at the end of the CGP algorithm. We can thus use this idea to shortlist some candidates from the PO set of CGP and then follow it up with a knee-solution-choosing method.

Even in the absence of a knee in the PO front, it may be possible to parametrize the user's preference in terms of rule complexity and prediction error so that a rule can be chosen automatically based on that preference. In the next chapter, we look at how we can check the dimensional consistency of the rules obtained by CGP.

Chapter 5
Using Dimensional Awareness with Rule Learning

For any physical law, adherence to the law of dimensional homogeneity is of utmost importance. The law of dimensional homogeneity [26] states that "only commensurable quantities (physical quantities having the same dimension) may be compared, equated, added or subtracted".
Although the rule learning part of CGP can learn rules that accurately fit the data, if any rule adds or subtracts two incommensurable quantities, then such a rule is physically meaningless. Hence, we need to quantify the degree of dimensional mismatch of a rule found by CGP. Such a quantification of dimensional mismatch for the PO rules found by the rule learning part of CGP gives us additional information when we need to choose only one or a few solutions out of the PO solutions of CGP.

5.1 Measuring the Dimension Mismatch Penalty

Let us figure out how we can quantify the dimensional mismatch penalty of a rule found by CGP. Say the rule learning part of CGP is used for solving a symbolic regression problem relating a regressand (y) and regressors (x_k, k ∈ {1, 2, ..., n_x}), which yields a set of PO rules. Consider one such PO rule having the form

r ≡ y = w_0 + Σ_{i=1}^{n_t} w_i · t_i,   (5.1)

where w_0 is a bias term, n_t is the total number of terms, w_i is the regression coefficient of term t_i, and t_i is some function of the regressors x_k, k ∈ {1, 2, ..., n_x}.

5.1.1 Case-I: Terms with Only Product and Division Operations

In this case, we show how the dimensional mismatch penalty, say P, is calculated when the terms t_i in Equation (5.1) are comprised of only multiplication and division operations among the regressors x_k, k ∈ {1, 2, ..., n_x}. Let us further assume that:
- The number of fundamental dimensions present in the data is three, and they are the fundamental physical dimensions of mass (M), length (L) and time (T). Of course there can be more (for example temperature (θ), current (I), etc.), but we choose these three for ease of representation.
- The derived physical dimensions of a term t_i are M^{α_i} L^{β_i} T^{γ_i}.
- C_j, j ∈ {1, 2, ..., n_c}, are a set of n_c physical constants relevant to the process generating the data. These have to be chosen by subject matter experts. For example, when studying a fluid flow problem, some of the physical constants that may be considered important for the process are the acceleration due to gravity, the density of the fluid, and the fluid viscosity. These constants C_j have dimensionless numeric values c_j, j ∈ {1, 2, ..., n_c}, and derived dimensions M^{λ_j} L^{η_j} T^{θ_j}. The symbol C_j encapsulates both the numeric value c_j and the units information.
- To mend the dimensional inconsistencies between y and the terms t_i in Equation (5.1) using the constants C_j, the constants may appear in the dimensionally consistent form of Equation (5.1) with only a limited set of exponents, say E = {e_1, e_2, ..., e_k}, where k is the number of such chosen exponents. For example,

E = {-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0}.   (5.2)

- The fundamental units of the left-hand side, i.e. the regressand y of the regressed rule shown in Equation (5.1), are M^{ε} L^{ϕ} T^{ω}.

If the derived physical dimensions of term t_i and regressand y are different, then a natural question to ask is: in what way can the different physical constants C_j of the process be multiplied with the term t_i such that dimensional homogeneity is maintained between t_i and y? One way to achieve this is as follows. For the i-th term t_i, there may exist a set of real values {z_{(i,1)}, z_{(i,2)}, ..., z_{(i,n_c)}} such that the product of t_i with ∏_{j=1}^{n_c} C_j^{z_{(i,j)}} yields dimensional equivalence between term t_i and y. This can be represented as

M^{ε} L^{ϕ} T^{ω} = (M^{α_i} L^{β_i} T^{γ_i}) · ∏_{j=1}^{n_c} (M^{λ_j} L^{η_j} T^{θ_j})^{z_{(i,j)}},   (5.3)

which can be re-written as

M^{(ε-α_i)} L^{(ϕ-β_i)} T^{(ω-γ_i)} = M^{Σ_{j=1}^{n_c} λ_j z_{(i,j)}} · L^{Σ_{j=1}^{n_c} η_j z_{(i,j)}} · T^{Σ_{j=1}^{n_c} θ_j z_{(i,j)}}.   (5.4)
Solving Equation (5.4) is equivalent to solving the system of simultaneous linear equations

A z_i = b_i,   where   A = [λ_1  λ_2  ···  λ_{n_c};  η_1  η_2  ···  η_{n_c};  θ_1  θ_2  ···  θ_{n_c}],   z_i = [z_{(i,1)}; z_{(i,2)}; ...; z_{(i,n_c)}],   b_i = [ε - α_i; ϕ - β_i; ω - γ_i].   (5.5)

From the theory of linear algebra [103], the solution z_i to Equation (5.5) is:
I. the exact solution if A is a full-rank square matrix,
II. the least squares solution if A is a full-rank skinny matrix (more fundamental dimensions than constants, n_f > n_c),
III. the least squares, least norm solution if A is a full-rank fat matrix (n_f < n_c), and
IV. the least squares solution obtained via the Singular Value Decomposition method if A is singular.

Let the solution to Equation (5.5) be ẑ_i, where

ẑ_i = [ẑ_{(i,1)}, ẑ_{(i,2)}, ..., ẑ_{(i,n_c)}]ᵀ, where ẑ_{(i,j)} ∈ R for all i, j.   (5.6)

Recall from Equation (5.3) that the z_{(i,j)} represent the exponents of the chosen physical constants C_j. Also, we are looking to quantize these exponents to a select few values given by some set E, an example of which is given in Equation (5.2). Hence, all the elements of ẑ_i are quantized to their nearest value in the set E. This quantization of the elements of ẑ_i yields the set z̄_i, where

z̄_i = [z̄_{(i,1)}, z̄_{(i,2)}, ..., z̄_{(i,n_c)}]ᵀ, where z̄_{(i,j)} ∈ E for all i, j.   (5.7)

For example, if ẑ_i = [-0.95, 0.55, 1.89]ᵀ and the set E is given by Equation (5.2), then the quantized set will be z̄_i = [-1.0, 0.5, 2.0]ᵀ.

Once a quantized set of exponents z̄_i is obtained, we then obtain the residue vector d_i corresponding to the dimensional inconsistency of the i-th term in Equation (5.1) as

d_i = [d_{(i,1)}; d_{(i,2)}; d_{(i,3)}] = A z̄_i - b_i = [λ_1  ···  λ_{n_c};  η_1  ···  η_{n_c};  θ_1  ···  θ_{n_c}] z̄_i - [ε - α_i; ϕ - β_i; ω - γ_i].   (5.8)

The root mean square (RMS) value of the residue vector d_i can then be treated as the penalty P_i for the dimensional mismatch of the i-th term in Equation (5.1):

P_i = sqrt( (d_{(i,1)}^2 + d_{(i,2)}^2 + d_{(i,3)}^2) / 3 ).   (5.9)

This dimensional mismatch penalty is a non-negative value, and it is zero if the physical dimensions of y and term t_i in Equation (5.1) can be matched exactly by multiplying t_i with just the right combination of the physical constants C_j, j ∈ {1, 2, ..., n_c}, raised to a particular set of exponents given by z̄_i in Equation (5.7). Once this penalty value has been calculated for all n_t terms of Equation (5.1), we can calculate the overall dimensional mismatch penalty P of the expression given by Equation (5.1) as

P = ( Σ_{i=1}^{n_t} P_i ) / ( n_t - Σ_{i=1}^{n_t} δ_{P_i,0} )   if   Σ_{i=1}^{n_t} δ_{P_i,0} < n_t,   and   P = 0   otherwise,   (5.10)

where δ_{i,j} is the Kronecker delta function such that

δ_{i,j} = 1 if i = j, and 0 otherwise.   (5.11)

In words, P is the average of the P_i over the dimensionally inconsistent terms, and zero when every term is consistent. If Equation (5.10) evaluates to zero, it implies that the original equation can be made dimensionally consistent using the set of chosen constants C_j and a set of exponents such as the one given by Equation (5.7). In such a case, it is beneficial to modify the original equation given by (5.1) to include the chosen constants as

r ≡ y = w_0 + Σ_{i=1}^{n_t} ŵ_i · ( t_i ∏_{j=1}^{n_c} C_j^{z̄_{(i,j)}} ),   (5.12)

where

ŵ_i = w_i / ∏_{j=1}^{n_c} c_j^{z̄_{(i,j)}}.   (5.13)

This is done so that we do not affect the regression or classification accuracy of the original equation while making it dimensionally consistent at the same time. Recall that c_j represents the dimensionless numerical value of some physical constant C_j, and that the symbol C_j encapsulates both the numeric value c_j and the units information.
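A minimal sketch of this penalty computation (an illustration of Eqs. (5.5)-(5.10) under the assumptions stated above; function names are our own, not the dissertation's):

    import numpy as np

    E = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])  # allowed exponents, Eq. (5.2)

    def term_penalty(A, b, E=E):
        """Penalty P_i of one term, following Eqs. (5.5)-(5.9).
        A (n_dims x n_c): dimension exponents of the chosen constants C_j.
        b (n_dims,): dimension gap between the regressand y and term t_i."""
        z_hat, *_ = np.linalg.lstsq(A, b, rcond=None)           # solve Eq. (5.5)
        z_bar = E[np.abs(E[:, None] - z_hat).argmin(axis=0)]    # quantize, Eq. (5.7)
        d = A @ z_bar - b                                       # residue, Eq. (5.8)
        return float(np.sqrt(np.mean(d**2))), z_bar             # RMS, Eq. (5.9)

    def rule_penalty(penalties):
        """Overall penalty P of a rule, Eq. (5.10): average of P_i over
        the dimensionally inconsistent terms; zero if all are consistent."""
        bad = [p for p in penalties if p > 0]
        return sum(bad) / len(bad) if bad else 0.0

np.linalg.lstsq covers cases I-IV above uniformly, since it returns the exact solution, the least squares solution, or the SVD-based least squares, least norm solution, depending on the shape and rank of A.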
5.1.2 Dimensionally Inconsistent Example

Let us understand the above procedure with an example. Consider the symbolic regression problem (the Bernoulli equation case) presented in Section 4.5.3. We chose variable y as the regressand and variables v and p as regressors. The rule learning part of CGP returns a set of PO rules as shown in Figure 4.17. Suppose we want to check the dimensional consistency of one of the PO solutions, given by

    r ≡ y = 20 − 0.05097 v^2 − 0.0001019 p + 3.9 × 10^−8 v.    (5.14)

The rule shown in Equation (5.14) is different from the knee solution shown in Figure 4.17. In this rule, there are three terms (nt = 3), namely t1 = v^2, t2 = p and t3 = v, and their corresponding weights are w1 = −0.05097, w2 = −0.0001019 and w3 = 3.9 × 10^−8 respectively. There is also a bias term w0 = 20. Recall that the units of v and p are m/s and Pa respectively. Therefore, the derived units of t1 are M^0 L^2 T^−2, of t2 are M^1 L^−1 T^−2 and of t3 are M^0 L^1 T^−1.

Let us choose two physical constants (nc = 2) in the Bernoulli problem, namely the acceleration due to gravity, g, and the density of the fluid (water here), ρ. These are measured in m/s^2 and kg/m^3 respectively. Hence,

    g ≡ C1 = 9.81 (= c1) m/s^2,    (5.15)
    ρ ≡ C2 = 1000 (= c2) kg/m^3.    (5.16)

The derived units of these physical constants of the problem are M^0 L^1 T^−2 for g and M^1 L^−3 T^0 for ρ. Furthermore, let us choose the set given in (5.2) as the set of allowed exponents for the constants. Since y is the elevation of a point on a streamline in the fluid, it is measured in meters; thus, the derived dimensions of the regressand y are M^0 L^1 T^0. This information is summarized below.

    α1 = 0    β1 = 2     γ1 = −2
    α2 = 1    β2 = −1    γ2 = −2
    α3 = 0    β3 = 1     γ3 = −1
    λ1 = 0    η1 = 1     θ1 = −2
    λ2 = 1    η2 = −3    θ2 = 0
    ε = 0     ϕ = 1      ω = 0

Let us look at the dimensional consistency of each term one by one.

5.1.2.1 First Term

For term t1 in Equation (5.14), Equation (5.5) reads

    A z1 = b1,  with  A = [ 0 1 ; 1 −3 ; −2 0 ],  z1 = [z(1,1), z(1,2)]ᵀ,  b1 = [0, −1, 2]ᵀ,

which upon solving and quantizing to the allowed exponent values yields

    z̄1 = [−1 0]ᵀ.

Substituting z̄1 in Equation (5.8) and then using Equation (5.9), we obtain the dimensional mismatch penalty for the first term of Equation (5.14) as

    P1 = 0.    (5.17)

5.1.2.2 Second Term

For term t2 in Equation (5.14), Equation (5.5) reads

    A z2 = b2,  with  z2 = [z(2,1), z(2,2)]ᵀ,  b2 = [−1, 2, 2]ᵀ,

which upon solving and quantizing to the allowed exponent values yields

    z̄2 = [−1 −1]ᵀ.

Substituting z̄2 in Equation (5.8) and then using Equation (5.9), we obtain the dimensional mismatch penalty for the second term of Equation (5.14) as

    P2 = 0.    (5.18)
5.1.2.3 Third Term

For term t3 in Equation (5.14), Equation (5.5) reads

    A z3 = b3,  with  z3 = [z(3,1), z(3,2)]ᵀ,  b3 = [0, 0, 1]ᵀ,

which upon solving yields

    ẑ3 = [−0.4878 −0.1463]ᵀ.    (5.19)

Since we chose the allowed set of exponents to be E = {−2.0, −1.0, −0.5, 0.0, 0.5, 1.0, 2.0}, quantizing the exponents in (5.19) to values in E yields

    z̄3 = [−0.5 0.0]ᵀ.

Substituting z̄3 in Equation (5.8) and then using Equation (5.9), we obtain the dimensional mismatch penalty for the third term of Equation (5.14) as

    P3 = 0.2887.    (5.20)

The total dimensional penalty for the symbolic regression solution given by (5.14) is found using Equation (5.10), i.e.

    P = (Σ_{i=1}^{nt} Pi) / (nt − Σ_{i=1}^{nt} δ_{Pi,0}) = (0 + 0 + 0.2887) / (3 − (1 + 1 + 0)) = 0.2887 ≠ 0.    (5.21)

This should be the case, because if we compare Equation (5.14) with the correct Bernoulli equation of Equation (4.15), the third term of Equation (5.14) is not commensurate with the physical dimensions of the regressand y.

5.1.3 Dimensionally Consistent Example

Again consider the symbolic regression problem presented in Section 4.5.3. Suppose that the rule learning part of CGP returns a set of PO rules, and we want to check the dimensional consistency of one such rule, given by

    r ≡ y = 20 − 0.05097 v^2 − 0.0001019 p,    (5.22)

where w0 = 20, w1 = −0.05097 with t1 = v^2, and w2 = −0.0001019 with t2 = p. Equation (5.22) is actually a knee region solution among the PO solutions found by CGP on the Bernoulli test data, shown in Figure 4.17. From Sections 5.1.2.1 and 5.1.2.2, we know that this equation can be made dimensionally consistent using the chosen physical constants. The first chosen constant (acceleration due to gravity) C1 is symbolically represented by g and is numerically equal to c1 = 9.81 m/s^2 in SI units. The second chosen constant (density of water) C2 is symbolically represented by ρ and is numerically equal to c2 = 1000 kg/m^3 in SI units. Hence, we can re-evaluate the weights in Equation (5.22) to include these chosen physical constants using Equations (5.12) and (5.13). Substituting values from Equations (5.17), (5.18) and (5.22) in Equation (5.13), we get

    ŵ1 = w1 / (c1^{z̄(1,1)} c2^{z̄(1,2)}) = −0.05097 / (9.81^−1 × 1000^0) = −0.50001,  and    (5.23)
    ŵ2 = w2 / (c1^{z̄(2,1)} c2^{z̄(2,2)}) = −0.0001019 / (9.81^−1 × 1000^−1) = −0.99964.    (5.24)

Using Equation (5.12), the modified form of Equation (5.22) that includes the chosen physical constants can be written as

    r ≡ y = 20 + ŵ1 · v^2 C1^{z̄(1,1)} C2^{z̄(1,2)} + ŵ2 · p C1^{z̄(2,1)} C2^{z̄(2,2)}
          = 20 + ŵ1 · v^2 g^−1 ρ^0 + ŵ2 · p g^−1 ρ^−1
          = 20 − 0.50001 v^2/g − 0.99964 p/(ρg)  (upon substituting ŵ1 and ŵ2).    (5.25)

Compare Equation (5.25) with the known form of Bernoulli's equation given in Equation (4.14) and notice that the two are the same (within a small tolerance) if c = 20. Figure 5.1 shows the PO solutions of the Bernoulli equation problem of Figure 4.17 along with the dimension mismatch penalty information. The solutions having a non-zero penalty value are shown in red. Note that only the knee solution has a zero dimension mismatch penalty in this case. Hence, we can use the dimensional mismatch penalty at the end of a CGP run to discard PO solutions that do not adhere to the law of dimensional homogeneity.
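To close the loop on this worked example, the short sketch below reuses the illustrative term_penalty() and rule_penalty() helpers from the Section 5.1.1 sketch to reproduce the three per-term penalties and the total of Equation (5.21):

    import numpy as np

    # Columns of A: exponents of g (M0 L1 T-2) and rho (M1 L-3 T0); rows: M, L, T.
    A = np.array([[0.0, 1.0], [1.0, -3.0], [-2.0, 0.0]])
    E = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
    # Dimension gaps b_i for t1 = v^2, t2 = p and t3 = v against y (M0 L1 T0).
    b1, b2, b3 = [0.0, -1.0, 2.0], [-1.0, 2.0, 2.0], [0.0, 0.0, 1.0]
    penalties = [term_penalty(A, np.array(b), E)[0] for b in (b1, b2, b3)]
    print(penalties)                # approximately [0.0, 0.0, 0.2887]
    print(rule_penalty(penalties))  # approximately 0.2887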
Figure 5.1: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3, followed by a dimensional penalty calculation for each case. Only the knee solution, which is also the exact solution, has a dimensional mismatch penalty of zero.

5.1.4 Case-II : More Complex Terms

In Section 5.1.1, we limited our study to cases where the terms ti in Equation (5.1) are composed only of product and division operators among the regressors. However, as shown in Sections 4.5.1, 4.5.2 and 4.5.3, CGP is capable of much more, including rational functions and functions involving addition and subtraction operators. In such a case, we have to make a slight change to the procedure outlined in Section 5.1.1 for calculating the dimensional penalty value of a term ti of Equation (5.1). We know that in Equation (5.1),

    ti = f(x1, x2, . . . , xnx),    (5.26)

where xk, k ∈ {1, 2, . . . , nx}, are the regressors and f() is some function involving the operations of addition, subtraction, division and multiplication. Upon substituting the physical dimensions of the regressors in (5.26), let us say that the derived physical dimensions of term ti are obtained as some rational function of the fundamental dimensions,

    P(M, L, T) / Q(M, L, T),    (5.27)

where P() and Q() are polynomials in M, L and T. Using partial fractions, we can decompose (5.27) as

    P(M, L, T) / Q(M, L, T) = Σ_i pi(M, L, T) / qi(M, L, T),    (5.28)

such that

• the denominator qi(M, L, T) of each fraction is a power of an irreducible polynomial (one not factorable into polynomials of positive degree), and
• the numerator pi(M, L, T) is a polynomial of smaller degree than that of qi(M, L, T).

In such a case, both pi() and qi() are polynomials in M, L and T that cannot be broken down further. If we fully expand both of these polynomials as sums of product terms, then we can re-write Equation (5.28) as

    Σ_i pi(M, L, T) / qi(M, L, T) = Σ_i ( Σ_j ri,j(M, L, T) ) / ( Σ_k si,k(M, L, T) ).    (5.29)

In Equation (5.29), each term ri,j(M, L, T) and si,k(M, L, T) is a pure product term in M, L and T. In the CGP code, we implement this using the Symbolic Math Toolbox of Matlab. Now we can use the method of Section 5.1.1 to find the dimension mismatch penalty value for each product term separately. The only difference is that now the dimensions of the numerator terms ri,j(M, L, T) have to be matched with those of the numerator of the left hand side of Equation (5.1) (i.e. y), and similarly the dimensions of the denominator terms si,k(M, L, T) have to be matched with those of the denominator of the left hand side of Equation (5.1) (i.e. 1). Summing together the penalty values of all the numerator and denominator terms gives the total dimension mismatch penalty associated with a term ti that combines addition and subtraction operators with product and division operators.

Until now, we have looked at how we can use CGP for a symbolic regression task and then follow it up with a dimension mismatch check on the PO solutions. In the next chapter, we will look at using CGP as a binary classifier that produces its decision boundaries as algebraic expressions. Note that the procedure explained in this chapter is applicable even when CGP is being used for a classification task; the only difference in that case is that the regressand y is replaced by a class label Y, which is dimensionless.
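Although our implementation relies on Matlab's Symbolic Math Toolbox, the Case-II decomposition is easy to sketch with any symbolic algebra package. The snippet below is an illustrative Python/SymPy version on a made-up derived dimension (the expression and variable names are ours, not taken from the CGP code), decomposing with respect to one dimension symbol at a time:

    import sympy as sp

    M, L, T = sp.symbols('M L T', positive=True)

    # A hypothetical derived dimension of a term t_i, as a rational function (5.27).
    expr = (M*L**2 + L**3*T) / (T**2 * (L + T))

    # Partial-fraction decomposition (5.28), taken here with respect to T.
    decomposed = sp.apart(sp.together(expr), T)

    # Expand each fraction into pure product terms (5.29); the numerators are
    # then matched against the dimensions of y, the denominators against 1.
    for frac in sp.Add.make_args(decomposed):
        num, den = sp.fraction(sp.together(frac))
        print(sp.expand(num), '/', sp.expand(den))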
Let us now move on to another interesting capability of CGP, i.e. solving binary classification tasks.

Chapter 6

Learning Free Form Rules Using a Classification Task

In this chapter, we will learn how to use the CGP developed in Chapter 4 for a classification task. Recall from Sections 2.4 and 2.5 that when solving a MOO problem using an EMO algorithm, knowledge can be derived in one of two ways, namely:

• by searching for rules in some preferred set of solutions, say the non-dominated set — the methods of rule learning developed in Chapters 3 and 4 can be used in this case; or
• by searching for discriminatory rules between a preferred and an unpreferred set of solutions, say the non-dominated versus the dominated solution set — this is the case of rule learning that we address in this chapter.

We will present how the CGP developed in Chapter 4 can be used for a classification task. This work was developed as part of an industry project. The methodology developed in this project is directly applicable to the task of online innovization as well, for learning the decision boundary of a classification task as an algebraic expression in terms of features. Let us quickly describe the classification problem before discussing how CGP is used for it.

6.1 Target Classification Problem

The goal of this project was to develop a computationally efficient machine learning methodology that can:

• automate the process of selecting a few important features from a set of features and then building a classifier using those features, and
• learn the classifier (decision boundary) as an algebraic expression involving the selected important features.

The data is produced by a fast manufacturing process, in which it is captured as multi-variate time series data via many sensors. This time series data is then processed to extract many features using basic mathematical functions, such as differentiation and integration, without any expert knowledge for feature creation. This feature extraction step is not shared in this work because of a non-disclosure agreement with the industry partner; on similar grounds, we do not describe the exact manufacturing process either. However, our method of extracting rule-based classifiers is applicable to any manufacturing process where a lot of data about the process is collected while the process is still going on — for example, collecting multiple sensor signals for a welding process, such as the welding current, the voltage and the distance of the electrode from the weld region.

Reiterating, the term 'interpretable rules' in the context of this research refers to rules in the form of mathematical expressions/equations involving the process features, process constants and simple operations such as addition, subtraction, multiplication and division. The term 'meaningful rules' refers to the idea of such expressions being physically meaningful by being dimensionally consistent. Now, let us look at how we can use the CGP developed in Chapter 4 for a classification task.

6.2 Using CGP for a Classification Task

In Chapter 4, we saw how CGP learns a rule of the form given by Equation (4.1), i.e.

    ψ(φ1, φ2, . . . , φnφ) ≡ φi = w0 + Σ_{j=1, j≠i}^{nt} wj · tj,
where ψ represents one such rule, the φ are the basis functions explained in Section 3.1 that represent the data from the MOO problem, φi is the basis function chosen as the regressand, the rest of the basis functions are regressors, and CGP is used as a symbolic regression tool to learn rules of this form. In this case, CGP optimizes the structure of the rules and learns the weights using OLSR. However, if the basis function φi is a class label instead of a regressand, then we can use a linear support vector machine (SVM) [56] learning algorithm for learning the weights, because the results of a linear SVM are considered very interpretable. The challenge lies in finding the right number of higher dimensions (of feature space), and the right features or derived features corresponding to those dimensions, in which the data is linearly separable. In such a space, a linear SVM will be able to find an appropriate separating plane with relative ease, provided that the decision boundary is not discontinuous. By derived features, we mean features that are composed from the initial set of features provided to CGP using the basic operations of addition, subtraction, multiplication and division.

Let us look at an example. Consider the binary data shown in Figure 6.1, which is generated using the following function, whose zero level set is an ellipse:

    y = −x1^2 + 2.02 x1·x2 − 3.05 x2^2 + 1.98,    (6.1)

where x1 and x2 are the two features of this data. The data of the hypothetical Go class (y < 0) is shown in green and the data of the hypothetical NoGo class (y ≥ 0) is shown in red. Clearly, y = 0 in Equation (6.1) defines the decision boundary for this problem.

Figure 6.1: A hand crafted example of binary class data.

Upon trying to classify this data set using a tree classifier, the learned tree model has 13 levels and 147 nodes, as shown in Figure 6.2. What is interesting to note is that if we provide only the features x1 and x2 to a linear SVM algorithm, it performs very poorly, since the data is not linearly separable. Now consider the following three features: x1^2, x2^2 and x1·x2. We call these derived features, as they were not provided among the original features of the problem but are derived from them. If we provide these three features to a linear SVM algorithm, it performs exceedingly well on the same data, because in this modified three-dimensional feature space the data is linearly separable. This can be seen in Figure 6.3, where the sub-figures show the same data in the derived feature space from three different camera angles. As can be seen in Figures 6.3a, 6.3b and 6.3c, a linear SVM could find a plane (in blue) clearly separating the Go and NoGo data. Not surprisingly, the equation of the plane is the same as Equation (6.1).

Figure 6.2: Binary classification tree model for classifying the two class data shown in Figure 6.1.

One may argue: why not use a quadratic or polynomial kernel based SVM for this problem? That may work here, but what if the problem requires a rational function as one of the dimensions of the expanded feature space? Furthermore, the choice of which kernel to select while keeping the interpretability of the classifier/decision boundary in mind is not very straightforward. A short sketch of the derived-feature idea follows below.
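The following minimal sketch (Python, with scikit-learn assumed to be available; the sample size, the value of C and the random seed are arbitrary choices of ours) reproduces this behaviour — a linear SVM trained on the three derived features recovers, up to scale, the coefficients of Equation (6.1):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.uniform(-2.0, 2.0, size=(500, 2))
    x1, x2 = X[:, 0], X[:, 1]

    # Label with the hand-crafted ellipse of Eq. (6.1): NoGo where y >= 0.
    y = (-x1**2 + 2.02*x1*x2 - 3.05*x2**2 + 1.98 >= 0).astype(int)

    # Derived features in which the decision boundary becomes a plane.
    Phi = np.column_stack([x1**2, x2**2, x1*x2])

    clf = LinearSVC(C=10.0, max_iter=100000).fit(Phi, y)
    print(clf.coef_, clf.intercept_)  # proportional to (-1, -3.05, 2.02) and 1.98
    print(clf.score(Phi, y))          # near-perfect accuracy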
Figure 6.3: The binary class data of Figure 6.1 shown in a derived feature space where the data is linearly separable (viewed from three camera angles in sub-figures 6.3a–6.3c).

Also, having the decision boundary as an algebraic equation in terms of the features, rather than as a kernel corresponding to an infinite-dimensional feature space such as radial basis functions, helps us later in checking the dimensional consistency of the learned model. Maintaining dimensional consistency is an important handle available to engineers for sanity-checking models learned for a physical process or system.

6.2.1 Evaluating Fitness of a CGP Individual for a Binary Classification Task

Consider a classification problem with no observations, nx features xi, and no binary class labels (yj ∈ {0, 1}) provided with the problem. When solving a classification problem using CGP, consider a CGP individual with the same rule structure as shown in Equation (5.1), i.e.

    r ≡ y = w0 + Σ_{i=1}^{nt} wi · ti,

where nt is the number of terms in the rule. The terms ti can be considered as derived features obtained by the simple operations {+, −, ×, ÷} on the original features. The weights of this individual are then learned using a linear SVM method, and the misclassification error rate at the end of the weight optimization by the SVM is assigned as the error fitness of the individual. This is measured as

    ferror = e_I^trn + e_II^trn,    (6.2)

where e_I^trn and e_II^trn represent the type-I and type-II error rates (in %) [104] on the training set. The complexity fitness is calculated the same way as in the symbolic regression case, given by Equation (4.11), i.e.

    fcomplexity = Σ (nodes in all terms of the CGP individual).

Furthermore, in the case of the industry's classification problem, the cost of misclassifying a NoGo product is much higher than the cost of misclassifying a Go product. For this reason, the cost matrix used by the linear SVM for arriving at the weights is kept as

    C = [ 0   1 ]
        [ 25  0 ],

i.e. the cost of making a type-II error on the training set is set 25 times higher than the cost of making a type-I error.
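A sketch of this fitness assignment is given below (Python, with scikit-learn assumed). The 25:1 cost matrix is approximated here with scikit-learn's class_weight option; the function name and data layout are illustrative rather than the project code:

    import numpy as np
    from sklearn.svm import LinearSVC

    def classifier_fitness(Phi, y, n_nodes):
        # Phi: (n_o x n_t) matrix of derived-feature values t_i; y: 0 = Go, 1 = NoGo.
        # Missing a NoGo (type-II error) is costed 25x via the class weights.
        clf = LinearSVC(class_weight={0: 1, 1: 25}, max_iter=100000).fit(Phi, y)
        pred = clf.predict(Phi)
        e1 = 100.0 * np.mean(pred[y == 0] == 1)  # type-I error rate (%), training set
        e2 = 100.0 * np.mean(pred[y == 1] == 0)  # type-II error rate (%), training set
        f_error = e1 + e2                        # Eq. (6.2)
        f_complexity = n_nodes                   # Eq. (4.11): node count of all terms
        return f_error, f_complexity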
6.3 Performance on Small Feature Space

Let us now look at some results obtained for classifying real production data. We chose production data from two dates for our study, naming the two resulting data sets data set-1 and data set-2. The details of the production data from these two days are given in Table 6.1. A total of ten features, xi with i ∈ {1, . . . , 10}, were extracted.

Table 6.1: Production data details used for testing the CGP classifier.

    Data set     # of Go products   # of NoGo products
    Data set-1   2381               6
    Data set-2   1882               6

Table 6.2 shows the physical dimensions of these ten features. Due to the highly imbalanced data sets, we used the adaptive smote method [105, 106] to oversample the minority NoGo class.

Table 6.2: Small feature set and their details.

    Feature(s)      Physical Dimension
    x1, x2, x3      L2 M1 T−2
    x4              L1 M0 T0
    x5, x6          L1 M0 T−1
    x7              L0 M0 T0
    x8, x9, x10     L0 M0 T−1

6.3.1 Results on Production Data Set-1

Here we discuss the CGP results for production data set-1. Training was conducted over 70% of the data. Table 6.3 shows the CGP parameters used in this case.

Table 6.3: List of CGP parameters used for solving the binary classification problem of Section 6.3.1.

    Parameter   N     G     nt   dmax   pc     pch    pcl    pm
    Value       200   200   10   6      0.85   0.20   0.80   0.15

Figure 6.4 shows the set of PO classifiers obtained for the production data of data set-1. The vertical axis of the figure represents the misclassification error as a percentage of the training data set, and the horizontal axis represents the complexity of a decision rule. Three solutions are highlighted with different colors, with some extra information about the corresponding classifiers. These three solutions/classifiers represent three different trade-offs between accuracy and complexity, from a classifier which is the simplest but most inaccurate (shown in blue), through a solution with intermediate values of classification error and complexity (shown in red), to a solution which is very complex but highly accurate (shown in brown). For each of these solutions, we have also shown the type-I and type-II errors obtained on the test data set. Table 6.4 shows this information in tabular form. Note that for all three solutions shown in Table 6.4, only six of the original ten features appear in the discovered classification rules; the features x7, x8, x9 and x10 do not appear in any of these solutions. Furthermore, Figure 6.5 shows the decision boundary and the segregation of the Go and NoGo classes in the feature space for the knee solution classifier of Figure 6.4. The NoGo class shown in Figure 6.5 includes both the real and the synthetic NoGo class data of data set-1. Interestingly, the CGP algorithm is able to find the right feature space, namely t1 = x3/x6, t2 = x5/x4 and t3 = x1/(x3·x4), in which the production data is linearly separable.

Table 6.4: Summary of classification rules found for the binary classification problem of Section 6.3.1 and their corresponding error rates on the training and test sets.

    Soln Desc.      Rule for NoGo products                                          e_I^trn   e_II^trn   e_I^test   e_II^test
    Simplest        −20.08 + 7.385 x2/x4 > 0                                        30.90%    0.00%      32.18%     0.00%
    Knee            −25.72 + 3.028 x5/x4 + 1.348 x3/x6 + 3.141 x1/(x3 x4) > 0       3.61%     0.00%      4.57%      0.00%
    Most Accurate   −34.76 − 0.53 x3 + 4.07 x1/(x4(x3 − x4)) − 0.65 x2/(x4 x6)
                    − 0.23 x5/(x1 x4^2) + 0.17 x5/(x4 x6) > 0                       0.36%     0.00%      1.25%      0.00%

Figure 6.4: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.1 (misclassification error e_I^trn + e_II^trn in % versus rule complexity).

6.3.2 Results on Production Data Set-2

Here we discuss the CGP results for production data set-2. Training was conducted over 70% of the data, and the rest of the data was kept unseen during the training stage of CGP. Table 6.5 shows the CGP parameters used in this case.

Table 6.5: List of CGP parameters used for solving the binary classification problem of Section 6.3.2.

    Parameter   N     G     nt   dmax   pc     pch    pcl    pm
    Value       100   100   10   6      0.85   0.20   0.80   0.15

Figure 6.6 shows the set of PO classifiers obtained for the production data of data set-2. The vertical axis of the figure represents the misclassification error as a percentage of the training data set, and the horizontal axis represents the complexity of a decision rule. Notice that CGP has performed relatively better at finding very accurate classifiers on the data from data set-2 as compared to the classifiers learned on the data from data set-1.

Figure 6.5: Decision boundary for the knee solution classifier of the PO set of classifiers shown in Figure 6.4 (Go training data, real plus synthetic NoGo training data, and the separating plane in the derived feature space).

Again, we have highlighted three solutions with different colors, and we provide some extra information about the corresponding classifiers.
These three solutions/classifiers again represent three different trade-offs between accuracy and complexity, from the simplest but most inaccurate classifier (shown in blue), through a solution with intermediate values of classification error and complexity (shown in red), to a solution which is very complex but highly accurate (shown in brown). For each of these solutions, we have also shown the type-I and type-II errors obtained on the test data set. Table 6.6 shows the same information in tabular form.

Figure 6.6: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.2 (misclassification error e_I^trn + e_II^trn in % versus rule complexity).

One result that may bother a careful eye is the error rates obtained for the knee solution classifier shown in Table 6.6. As shown, both the type-I and type-II errors on the test set for the knee classifier discovered in the data set-2 data are zero. One reason for such a result can be that these results are based on single runs of CGP on the data. If multiple runs are made and the classification errors are calculated by averaging the results of the multiple runs, then we should at least see a non-zero type-I error on the test set, as has been the case with the other results.

Table 6.6: Summary of classification rules found for the binary classification problem of Section 6.3.2 and their corresponding error rates on the training and test sets.

    Soln Desc.      Rule for NoGo products                                                e_I^trn   e_II^trn   e_I^test   e_II^test
    Simplest        2.68 + 3.92 x5 > 0                                                    1.51%     0.00%      1.96%      0.00%
    Knee            −0.039 − 0.6 x5/x2 + x5 > 0                                           0.15%     0.00%      0.00%      0.00%
    Most Accurate   −3.88 + 6.06 x1/x4 + (9 × 10^−6) x2(x1 + x4)/(x4 x6 + x5 x10) > 0     0.08%     0.00%      0.18%      0.00%

6.4 Results on Larger Feature Space with Dimension Check

In the previous section, we presented results showing the performance of CGP as a classifier on real production data, but with just ten features. After these initial encouraging results, our collaborators were much more interested in knowing:

• how CGP performs with a larger set of features, and furthermore,
• if a set of suitable constants for the process is provided, whether dimensional analysis can identify a set of dimensionally consistent classifiers from the PO set obtained from CGP.

6.4.1 Data and Results

Although we conducted experiments on five years' worth of production data (more than 6 terabytes), here we present the results for two successive days of production, where the first day's data is used as the training set and the next day's data as the test set. In this case, we extracted a total of 56 features from the data, xi with i ∈ {1, . . . , 56}; these are shown in Table 6.7, and we were also provided the units of measurement of these features.

Table 6.7: Larger feature set and their dimensions.

    Feature(s)                       Physical Dimension
    x1                               L0 M0 T1
    x2, x3                           L2 M1 T−2
    xi, i ∈ {4, 5, . . . , 13}       L2 M1 T−4
    xi, i ∈ {14, 15, . . . , 23}     L2 M1 T−2
    x24                              L1 M0 T0
    xi, i ∈ {25, 26, . . . , 36}     L1 M0 T−1
    xi, i ∈ {37, 38, . . . , 46}     L1 M0 T1
    xi, i ∈ {47, 48, . . . , 56}     L0 M0 T−1

Furthermore, a list of four physical constants relevant to the physics of the process was also provided. These are listed in Table 6.8.
Table 6.8: The set of physical constants which were considered relevant to the underlying physics of the production process.

    Symbol   Name                 Numeric Value   SI units   Physical Dimension
    ρ        Material Density     2.7 × 10^3      kg/m^3     L−3 M1 T0
    E        Young's Modulus      7.0 × 10^10     Pa         L−1 M1 T−2
    Hv       Vickers Hardness     2.55 × 10^8     Pa         L−1 M1 T−2
    τ        Material thickness   2.0 × 10^−4     m          L1 M0 T0

Upon applying CGP as a classifier, followed by the dimensional consistency check, the results are shown in Figure 6.7. Note that the classifiers found to be dimensionally inconsistent are marked in red. Table 6.9 shows the details of the 14 PO classifiers obtained by CGP for this case, along with their type-I and type-II errors on the test set. Table 6.10 shows the classifiers of Table 6.9 after conducting the dimensional consistency check.

Figure 6.7: PO set of classifiers found by CGP for the two days' worth of production data described in Section 6.4.1.

Table 6.9: Details of the PO classifiers shown in Figure 6.7 (NoGo := y > 0).

    Soln ID   Expression                                                                                              e_I^test (%)   e_II^test (%)
    1         y = 2.531 (x4 + x12)/(x25 + x50) − 0.2903 x7 − 0.3398 x24 − 2.302 x6 − 1.142 x5/(x3 x4) + 0.9329        0.7            0.0
    2         y = 2.446 (x4 + x12)/(x25 + x50) − 0.3075 x7 − 2.446 x6 − 1.482 x5/(x3 x4) + 0.7842                     0.9            0.0
    3         y = 0.7485 x4 − 2.999 x6 − 0.4648 x7 − 0.6235 x8/x25 + 2.999 (x4 + x12)/(x25 + x50) − 1.432             1.4            0.0
    4         y = 0.4964 x4 − 1.465 x5 − 2.401 x6 − 0.2713 x7 + 2.401 (x4 + x12)/(x25 + x50) + 0.5066                 1.0            0.0
    5         y = 0.4421 x4 − 1.301 x5 − 1.803 x6 + 3.468 x25 + 2.917 x45 − 0.5354 x8/x25 + 0.0814                    1.9            0.0
    6         y = 0.7815 x4 − 2.063 x6 + 3.934 x25 + 4.339 x45 − 0.8135 x8/x25 + 3.309                                2.6            0.0
    7         y = 1.124 x4 − 0.1635 x7 + 7.317 x25 + 0.4802 x8/x6 − 2.6                                               2.8            0.0
    8         y = 0.9842 x4 − 0.1513 x7 + 6.852 x25 + 0.4642 x8/x6 + 3.21                                             3.2            0.0
    9         y = 4.768 x25 − 0.4487 x7 − 1.274 x5 + 0.3278 x4 + 4.121                                                2.4            0.0
    10        y = 5.591 x25 − 1.631 x5 + 0.1578 x8/x6 + 5.252                                                         3.9            0.0
    11        y = 0.7103 x4 − 1.761 x6 − 1.342 x7 + 4.998 x25 + 4.906                                                 3.5            0.0
    12        y = 4.894 x25 − 1.677 x7 − 1.747 x6 + 5.896                                                             4.2            0.0
    13        y = 6.097 x25 − 2.237 x6 + 6.353                                                                        8.8            0.0
    14        y = 6.516 x25 + 6.321                                                                                   52.7           0.0
Table 6.10: Details of the PO classifiers shown in Figure 6.7 after the dimensional check. The original expressions are as listed in Table 6.9; dimensionally inconsistent classifiers have no updated expression (NA).

    ID   Dim. Penalty   Updated expression (with the physical constants of Table 6.8)
    1    2.309          NA
    2    2.309          NA
    3    2.374          NA
    4    4.041          NA
    5    0.707          NA
    6    0.707          NA
    7    0.0            y = 1.48 × 10^12 (ρ/(E Hv τ)) x4 − 2.12 × 10^11 (ρ/(E Hv τ)) x7 + 2249.54 √(ρ/Hv) x25 + 0.48 x8/x6 − 2.6
    8    0.0            y = 1.30 × 10^12 (ρ/(E Hv τ)) x4 − 2.0 × 10^11 (ρ/(E Hv τ)) x7 + 2105.69 √(ρ/Hv) x25 + 0.4642 x8/x6 + 3.21
    9    0.0            y = 1465.25 √(ρ/Hv) x25 − 5.93 × 10^11 (ρ/(E Hv τ)) x7 − 1.68 × 10^12 (ρ/(E Hv τ)) x5 + 4.34 × 10^11 (ρ/(E Hv τ)) x4 + 4.121
    10   0.0            y = 1718.17 √(ρ/Hv) x25 − 2.16 × 10^12 (ρ/(E Hv τ)) x5 + 0.1578 x8/x6 + 5.252
    11   0.0            y = 9.39 × 10^11 (ρ/(E Hv τ)) x4 − 2.33 × 10^12 (ρ/(E Hv τ)) x6 − 1.77 × 10^12 (ρ/(E Hv τ)) x7 + 1535.93 √(ρ/Hv) x25 + 4.906
    12   0.0            y = 1503.97 √(ρ/Hv) x25 − 2.22 × 10^12 (ρ/(E Hv τ)) x7 − 2.31 × 10^12 (ρ/(E Hv τ)) x6 + 5.896
    13   0.0            y = 1874.62 √(ρ/Hv) x25 − 2.96 × 10^12 (ρ/(E Hv τ)) x6 + 6.35
    14   0.0            y = 2003.69 √(ρ/Hv) x25 + 6.32

6.5 Concluding Remarks

In this chapter, we showed how CGP, followed by dimensional analysis, can be applied to learn simple and interpretable rules from binary-labeled data. We could obtain dimensionally consistent rules from real-world production data from industry. Such models bring valuable insight into the underlying physics of the process and can be used both for validating known analytical laws against data and for aiding the construction of analytical models of physical processes that are not yet completely understood by researchers.

This concludes Part I of this thesis, in which we presented various ways of learning simple and interpretable rules from data. Now, we move on to the other important aspect of online innovization, i.e. using this derived knowledge to intervene during an EMO run and, hopefully, expedite its convergence to the PO front. We look at this aspect in the next part.

Part II

Rule Based Repair

Chapter 7

Performing Repairs Based on Fixed Form Rules for Expedited Convergence

Part I of this dissertation discussed ways in which we can distill fixed and free form rules, or mathematical expressions, from data. Recall from Figure 2.6 that when we use an EMO algorithm to solve a MOO problem, applying an online innovization procedure involves:

• first choosing a set of solutions to learn rules from, then
• learning rules or patterns that exist in those solutions, and then
• performing an intervention in the EMO algorithm to expedite its convergence rate.

The two chapters in Part II focus on using the learned rules of fixed form discussed in Chapter 3 to intervene in the EMO algorithm and possibly expedite its convergence towards the PO front of the MOO problem.

7.1 Rule Based Repair

Recall from Section 3.1 that the basis function(s) for a rule, in the context of innovization, can be any scalar function φ of the design variables, including the variables, objectives, constraints and any function thereof. Fixed form rules of constant type, given by Equation (3.4), and of power-law form, given by Equation (3.7), can be represented by the single form

    ψi(φ1, φ2, . . . , φnφ) ≡ Π_{j=1}^{nφ} φj^{aij·bij} = ci,    (7.1)

where aij ∈ {0, 1}, ci, bij ∈ R, and nφ is the number of basis functions. In Equation (7.1), ψi represents the ith candidate rule, φj represents the jth basis function in the list of nφ basis functions being considered for learning rules, aij is a binary variable that decides whether the jth basis function is included in the ith candidate rule ψi, and bij, ci are rule parameters that are estimated during an EMO/I run. If ψi is a constant rule, say φj = ci, then, for k ∈ {1, 2, . . . , nφ},

    aik = 1 if k = j, and 0 otherwise.    (7.2)
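As a concrete illustration of this representation (an illustrative sketch; the helper names are ours), candidate rules can be stored as a binary inclusion mask a, an exponent vector b and a constant c, and enumerated over all non-empty subsets of the basis functions:

    import numpy as np
    from itertools import combinations

    def candidate_rule_masks(n_phi, max_len=None):
        # All non-empty inclusion masks a_i of Eq. (7.1): 2**n_phi - 1 in total,
        # optionally capped at a maximum number of interacting basis functions.
        max_len = max_len or n_phi
        for k in range(1, max_len + 1):
            for idx in combinations(range(n_phi), k):
                mask = np.zeros(n_phi, dtype=int)
                mask[list(idx)] = 1
                yield mask

    def rule_residual(phi, mask, b, c):
        # Evaluate prod_j phi_j**(a_ij * b_ij) - c_i for one solution's basis values.
        return float(np.prod(phi ** (mask * b)) - c)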
This chapter considers candidate rules involving only the design variables as basis functions, so Equation (7.1) takes the form

    ψi(x1, x2, . . . , xnx) ≡ Π_{j=1}^{nx} xj^{aij·bij} = ci,    (7.3)

where nx is the number of design variables in the MOO problem. We refer to the set of rules represented by Equation (7.3) as candidate rules. In a problem with nx design variables, a maximum of Σ_{k=1}^{nx} C(nx, k) = 2^nx − 1 candidate rules of fixed form involving just the variables as basis functions can be evaluated. Of those, nx are constant type rules, one corresponding to each variable, and the rest are power law type rules.

Figure 7.1 shows a flowchart of online innovization applied to EMO algorithms, i.e. the "EMO/I" method proposed in Section 2.5. Except for three regular blocks, namely the Rule Basis and Quality block, the Learn block and the Repair block, and two decision blocks, namely the L and R blocks, the rest of the flowchart is that of a generic EMO algorithm. In this work, we used the NSGA-II algorithm [92] as the EMO algorithm of choice, but it can be any population based optimization method. The following sections provide a brief description of the flowchart, with a focus on the aforementioned blocks that are specific to EMO/I.

Figure 7.1: The flowchart of an EMO/I algorithm (decision blocks: T — terminate?, L — learn?, R — repair?).

7.1.1 Rule Basis and Quality Block

The rule basis and quality (RBQ) block carries information on:

1. which rules, out of the maximum of 2^nx − 1, we wish to learn, and
2. what threshold quality measure a candidate rule must meet to be considered suitable for variable repair.

Information for both of these questions is sought from the user at the beginning of the EMO/I procedure. For example, if a MOO problem has three variables but the user wants to discover rules with at most two-variable interactions, then we can set the aij as shown in Table 7.1.

Table 7.1: An example of candidate rule information available in the RBQ block.

    Rule Id (i)   aij values                    Rule type
    1             a11 = 1, a12 = 0, a13 = 0     Constant rule
    2             a21 = 0, a22 = 1, a23 = 0     Constant rule
    3             a31 = 0, a32 = 0, a33 = 1     Constant rule
    4             a41 = 1, a42 = 1, a43 = 0     Power law rule
    5             a51 = 1, a52 = 0, a53 = 1     Power law rule
    6             a61 = 0, a62 = 1, a63 = 1     Power law rule

Each candidate rule is learned and evaluated for its quality. For a constant rule, this quality is measured using the coefficient of variation (Cv) metric (see Section 3.2); a constant rule is considered good quality and ready for making repairs if Cv ≤ Cv^(max). Similarly, for power law type rules, we use the R²adj metric from OLSR to assess quality (see Section 3.3); a power law rule is considered good quality and ready for making repairs if R²adj ≥ R²adj^(min). We have kept the default values of these thresholds at Cv^(max) = 0.05 and R²adj^(min) = 0.95; the user may change these values as per his/her requirement.

A Note on Transition Points

A transition point in a bi-objective PO front is a point across which the nature of the mathematical relations among the PO solutions changes; the innovized principles on either side of a transition point are different.
Some reasons for encountering transition points are:

(a) some constraint or variable bound becomes active (or inactive) and forces (or eases) the PO solutions to adhere to additional (or fewer) rules across the transition boundary, or
(b) the nature of one or more of the objective functions changes significantly across the transition boundary.

In this chapter, we consider MOO problems that do not have a transition point; we return to transition points in Chapter 8. Thus, for the problems in this chapter, the rules of the form given by Equation (7.1), if discovered, are applicable to all the PO solutions.

7.1.2 Decision Block-L

Once EMO/I begins and a population of individuals has been initialized, evaluated, assigned fitness and operated on by the genetic operators (crossover and mutation), the decision block 'L' decides how long the algorithm should wait before it begins to learn the parameters and quality of the candidate rules. This is necessary because in the initial phases of an EMO algorithm, the solutions may still be far away from the PO front and may not adhere to any rule. Such a waiting decision can be implemented in many ways, for example:

• if some minimum fraction of the maximum function evaluations (MFEs) allowed for a problem have passed, or
• if the number of solutions in the non-dominated set is above a minimum threshold, or
• if the increase in hypervolume has slowed down to some pre-specified level.

In this work, we use the first criterion, i.e. some fraction of the MFEs, as the deciding criterion to start learning rules.

7.1.3 Learn Block

This block attempts to learn the rule parameters bij and ci shown in Equation (7.3), as well as the quality of each candidate rule. Once EMO/I meets the criterion set by decision block 'L', a copy of the parent population, P(t)copy, is sent to the Learn block. The rule parameters and rule quality for constant rules and power law rules are estimated as follows.

7.1.3.1 Constant Type Rules

Consider a rule of the form given in Equation (7.3) for a MOO problem with nx variables. For a candidate rule ψi composed of the jth variable xj, the constant rule is given by

    ψi(φj) ≡ xj = ci.    (7.4)

Comparing with Equation (7.3), we know in this case that, for k ∈ {1, 2, . . . , nx},

    aik = 1 if k = j, and 0 otherwise;   bik = 1 if k = j, and 0 otherwise.    (7.5)

From Section 3.2.1, we know that the value of the parameter ci can be estimated as

    ĉi = µj,    (7.6)

where µj is the mean of the jth decision variable over the non-dominated solutions set. Furthermore, the coefficient of variation (Cv) of the jth variable over the non-dominated solutions set is taken as the measure of quality for such a rule. This quality parameter is later used in decision block R of Figure 7.1.

7.1.3.2 Power Law Type of Rules

As shown in Section 3.3.1, to estimate the parameters and quality of power law rules, we use log-linear modeling followed by OLSR on the data of the non-dominated solutions. Consider an example of a three variable MOO problem in which we want to learn the parameters of the two variable power law rule

    x1^b1 x3^b3 = c.    (7.7)

Taking the log of both sides gives b1 log x1 + b3 log x3 = log c, which is a linear equation. If log x1 is chosen as the regressand, then

    log x1 = (−b3/b1) log x3 + (log c)/b1 = β̂ log x3 + γ̂,   =⇒   x1 = e^γ̂ x3^β̂.    (7.8)

The parameters β̂ and γ̂ are the estimates returned by an ordinary least squares linear regression (OLSR) method using the logs of the 1st and 3rd variables from the non-dominated set. OLSR also returns the R²adj value, which is later used to assess the quality of such a candidate rule in the repair stage.
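Both estimators are simple in practice. The sketch below (illustrative Python/NumPy; the function names are ours) fits the log-linear model of Equation (7.8) with OLS and reports R²adj, along with the Cv quality measure for constant rules:

    import numpy as np

    def constant_rule_quality(xj):
        # Constant rule x_j = c_i: estimate c_i as the mean, quality C_v = sigma/mu.
        return np.mean(xj), np.std(xj) / np.mean(xj)

    def learn_power_law(x1, x3):
        # Log-linear OLS fit of x1 = e^gamma * x3^beta, Eq. (7.8).
        X = np.column_stack([np.ones_like(x3), np.log(x3)])
        y = np.log(x1)
        (gamma, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
        y_hat = X @ np.array([gamma, beta])
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        n, p = len(y), 1                     # n points, one regressor
        r2 = 1.0 - ss_res / ss_tot
        r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
        return beta, gamma, r2_adj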
7.1.4 Decision Block-R

This block decides whether a rule is good enough in quality to qualify for the repair stage. As mentioned in Section 7.1.1,

• the quality of one-variable rules is measured using the coefficient of variation (Cv), and
• the quality of rules with more than one variable is measured using the R²adj value returned by OLSR.

EMO/I is provided with the threshold quality parameters, namely the maximum coefficient of variation Cv^(max) for constant type rules and R²adj^(min) for power law rules, in the RBQ block at the start of EMO/I. A rule qualifies for repair in the next stage as follows:

• for constant rules, Cv ≤ Cv^(max), and
• for power law rules, R²adj ≥ R²adj^(min).

We refer to the candidate rules whose quality parameters surpass the threshold values as Qualifying Rules. A set of u such qualifying rules in generation t of EMO/I, say Ψ(t)Q = {ψq1, ψq2, . . . , ψqu}, along with a copy of the parent population, P(t)copy, is passed on to the Repair block, as shown in Figure 7.1. In this work, we refer to the union of the sets of variables constituting the qualifying rules as Qualifying Variables.

7.1.5 Repair Block

In some generation t of EMO/I, the repair block receives a copy of the parent population (P(t)copy) and a set of qualifying rules (Ψ(t)Q = {ψq1, ψq2, . . . , ψqu}), and yields a population with repaired variables (R(t)). The procedure for making such a repair is given by Algorithm 7.1 and is named RepairPop(). Next, we briefly describe the overall algorithm, followed by a detailed explanation of its important components.

Algorithm 7.1 RepairPop( )
input: P(t)copy, Ψ(t)Q = {ψq1, ψq2, . . . , ψqu}   (parent population, list of qualifying rules)
output: R(t)   (repaired population)
 1: R(t) ← ∅
 2: for k ← 1 to |P(t)copy| do
 3:     Pk ← kth individual of P(t)copy
 4:     I ← ∅                              // initialize set of repaired variables in Pk
 5:     Ψ̃(t)Q ← wrShuffle(Ψ(t)Q)           // wtd. random shuffle w.r.t. rule length preference
 6:     for m ← 1 to |Ψ̃(t)Q| do
 7:         ψ ← mth rule of Ψ̃(t)Q
 8:         Iψ ← getVars(ψ)                // get set of variables in rule ψ
 9:         J ← Iψ \ I                     // variables in rule ψ not yet repaired in Pk
10:         if J ≠ ∅ then
11:             v ← chooseVar(J)           // choose variable w.r.t. frequency preference
12:             makeRepair(Pk, v)          // repair variable v of individual Pk
13:             I ← I ∪ {v}
14:     R(t) ← R(t) ∪ {Pk}                 // update repaired population set
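For readers who prefer imperative code, a compact Python sketch of Algorithm 7.1 follows. Everything here is illustrative: wr_shuffle and choose_var stand in for the preference strategies of Section 7.2, make_repair for the variable repairs of Section 7.1.5, and a rule object is assumed to expose its variable set via get_vars():

    def repair_pop(pop, qualifying_rules, wr_shuffle, choose_var, make_repair):
        # Sketch of Algorithm 7.1: repair each parent individual in turn.
        repaired = []
        for ind in pop:
            done = set()                               # I: variables repaired in ind
            for rule in wr_shuffle(qualifying_rules):  # rule-preference order
                todo = set(rule.get_vars()) - done     # J = I_psi \ I
                if todo:
                    v = choose_var(todo)               # variable-preference choice
                    make_repair(ind, v, rule)          # Secs. 7.1.5.1 and 7.1.5.2
                    done.add(v)
            repaired.append(ind)
        return repaired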
The RepairPop() procedure takes P(t)copy and Ψ(t)Q as input; a line-wise description follows.

• Lines 1–4: The output R(t) is initialized as an empty set. The procedure then loops over all individuals of P(t)copy to repair them one by one. For the kth individual, Pk, the set I of repaired variables in Pk is initialized to the empty set.
• Lines 5–6: The rules in Ψ(t)Q are shuffled using the wrShuffle() function with respect to a probability distribution based on the lengths of the rules; this is discussed in Section 7.2.1. We refer to the number of variables in a rule as the length of that rule; for example, the rule ψ ≡ x1 · x2 = 3 has length two. The procedure then sequentially goes over all rules in the shuffled list Ψ̃(t)Q.
• Lines 7–9: For the mth rule ψ in the shuffled rule list Ψ̃(t)Q, the set of variables involved in ψ is stored in Iψ using the getVars() function; for example, for the rule ψ ≡ x1 · x2 = 3, Iψ = {x1, x2}. Subsequently, the set of variables that are involved in rule ψ but have not yet been repaired in the individual Pk is stored in J.
• Lines 10–13: If J = ∅, control passes back to the for loop of line 6. Otherwise, some variable v ∈ J is chosen using the chooseVar() function. This choice depends on a probability distribution based on the frequencies of the variables in the qualifying rules, as explained in Section 7.2.2. Subsequently, variable v of individual Pk is repaired using the makeRepair() function, whose operation is explained in Sections 7.1.5.1 and 7.1.5.2. After repairing variable v, the set I is updated.
• Line 14: Once all possible repairs have been performed on the individual Pk, it is added to the repaired population R(t). This continues until the for loop of line 2 has covered all individuals of P(t)copy.

Next we discuss the working of the makeRepair() function of Algorithm 7.1.

7.1.5.1 Repairing Variables Based on Constant Rules

Consider a rule of the form given in Equation (7.4), corresponding to a candidate rule ψi composed of the jth variable xj,

    ψi ≡ xj = ci,    (7.9)

where ci is estimated as given in Equation (7.6). Then the jth variable in an individual of the P(t)copy population is repaired as

    x̂j = U(µj − σj, µj + σj),    (7.10)

where x̂j is the repaired value of variable xj, µj and σj are the mean and standard deviation, respectively, of the decision variable xj over the non-dominated solutions set, and U(a, b) represents a uniform random distribution between a and b.

7.1.5.2 Repairing Variables Based on Power Law Rules

To explain the repair method in this case, we take the example shown in Section 7.1.3.2. From Equation (7.8), we can get the repaired value of variable x1 as

    x̂1 = e^γ̂ x3^β̂.    (7.11)

Similarly, the repaired value of the regressor variable x3 can be obtained as

    x̂3 = (x1 / e^γ̂)^(1/β̂).    (7.12)

A similar logic follows for cases with more than two variables. Note that if the repaired value of a variable lies outside its a priori defined bounds, then the repaired variable is set to its nearest bound value.
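A minimal sketch of these repairs (illustrative Python/NumPy; the names are ours, and the bound handling follows the nearest-bound note above):

    import numpy as np

    rng = np.random.default_rng()

    def repair_constant(mu, sigma, lo, hi):
        # Eq. (7.10): resample x_j uniformly in [mu - sigma, mu + sigma], then clip.
        return float(np.clip(rng.uniform(mu - sigma, mu + sigma), lo, hi))

    def repair_power_law_regressand(x3, beta, gamma, lo, hi):
        # Eq. (7.11): x1 = e^gamma * x3^beta, clipped to the variable bounds.
        return float(np.clip(np.exp(gamma) * x3 ** beta, lo, hi))

    def repair_power_law_regressor(x1, beta, gamma, lo, hi):
        # Eq. (7.12): x3 = (x1 / e^gamma)^(1/beta), clipped to the variable bounds.
        return float(np.clip((x1 / np.exp(gamma)) ** (1.0 / beta), lo, hi))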
7.2 Repair Strategies

This section discusses three rule-preference strategies, corresponding to the wrShuffle() function, and three variable-preference strategies, corresponding to the chooseVar() function, used in Algorithm 7.1. Recall that we call the set of rules whose quality parameters surpass the threshold quality values set in the RBQ block of Figure 7.1 the qualifying rules, and the union of the sets of variables constituting these qualifying rules the qualifying variables. The following two choices need to be made before any repair of the kinds illustrated in Section 7.1.5 is made to the solution individuals of P(t)copy:

• choose one of the possibly many qualifying rules on which to base the repair, and
• choose one of the possibly many variables for repair from the chosen qualifying rule.

For example, choosing a random rule from the qualifying rules pool and then choosing a random variable from the variables of the chosen qualifying rule is one such strategy. Both these choices have an effect on:

1. the parameters used to make a repair, and
2. the sequence in which variables are repaired.

7.2.1 Rule Preference Strategies

We investigate three rule-preference strategies. Recall that we call the number of variables in a rule the length of that rule. Before discussing these strategies, let us look at the wrShuffle() function, as it is used in implementing all three rule preference strategies in line 5 of Algorithm 7.1.

The wrShuffle() function stands for weighted random shuffle (WRS), and it is the same as weighted random sampling from a sequence without replacement. Let a = (a1, a2, . . . , an) and w = (w1, w2, . . . , wn) be sequences of n objects and unnormalized weights, respectively, such that weight wi corresponds to element ai. Algorithm 7.2 gives pseudo code for the wrShuffle() function.

Algorithm 7.2 wrShuffle( )
input: a = (a1, a2, . . . , an), w = (w1, w2, . . . , wn)
output: r   (weighted random shuffled sequence)
 1: r ← ∅
 2: n ← sizeOf(a)
 3: while n > 0 do
 4:     s ← Σ_{k=1}^{n} wk
 5:     P ← (p1, p2, . . . , pn), where pi = wi/s                  // probability mass values
 6:     C ← (0, p1, p1 + p2, . . . , Σ_{k=1}^{n} pk)               // cumulative distribution values
 7:     u ← U(0, 1)                                                // uniform random number
 8:     i ← 1
 9:     while true do
10:         if Ci ≤ u < Ci+1 then break
11:         else i ← i + 1
12:     r ← r ⌢ ⟨ai⟩                                               // append ai to sequence r
13:     a ← a \ ⟨ai⟩                                               // remove element ai from a
14:     n ← n − 1
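In plain Python, the same procedure can be sketched as follows (an illustrative implementation; random.random() plays the role of U(0, 1)):

    import random

    def wr_shuffle(items, weights):
        # Weighted random shuffle (Algorithm 7.2): repeatedly sample without
        # replacement with probability proportional to the remaining weights.
        items, weights = list(items), list(weights)
        out = []
        while items:
            total = sum(weights)
            u, acc, i = random.random(), 0.0, 0
            # Walk the cumulative distribution until its interval covers u.
            while i < len(items) - 1 and acc + weights[i] / total <= u:
                acc += weights[i] / total
                i += 1
            out.append(items.pop(i))
            weights.pop(i)
        return out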
Let us now look at the rule preference strategies. Let there be m rules in the qualifying rules list Ψ(t)Q = (ψ1, ψ2, . . . , ψm) of Algorithm 7.1, with lengths (l1, l2, . . . , lm) respectively. The three rule-preference strategies assign different weight sequences to the m qualifying rules in the wrShuffle() function:

(i) No preference: no rule is given preference over the others based on its length, and the weight sequence w = (1, 1, . . . , 1) is used by wrShuffle().
(ii) Prefer long rules: lengthier rules are preferred over shorter rules, and the weight sequence w = (l1, l2, . . . , lm) is used by wrShuffle().
(iii) Prefer short rules: shorter rules are preferred over lengthier rules, and the weight sequence w = (1/l1, 1/l2, . . . , 1/lm) is used by wrShuffle().

7.2.2 Variable Preference Strategies

We investigate three variable-preference strategies, which are based on the frequency of a variable among the qualifying rules; they are implemented via the chooseVar() function of Algorithm 7.1. Let there be n qualifying variables, V(t)Q = {v1, v2, . . . , vn}, in the list of qualifying rules Ψ(t)Q = (ψ1, ψ2, . . . , ψm). Furthermore, let the frequencies of the qualifying variables in V(t)Q be F(t)Q = {f1, f2, . . . , fn}, where fi is the frequency of qualifying variable vi. Let some rule ψi ∈ Ψ(t)Q be under consideration for making a variable repair to population individual Pk at some step of the for loop of line 6 of Algorithm 7.1. Then J ⊆ V(t)Q represents the set of variables in rule ψi that have not yet been repaired in the individual Pk up to that instant. Let

    J = {vj1, vj2, . . . , vjs}, and F = {fj1, fj2, . . . , fjs},    (7.13)

where fju ∈ F is the frequency of variable vju ∈ J in the qualifying variable set V(t)Q, u ∈ {1, 2, . . . , s}. The variable-preference strategies are:

(i) No preference: no variable is given preference over the others based on its frequency among the qualifying variables. In this case, the chooseVar() function randomly picks one variable from the set J defined in Equation (7.13).
(ii) Prefer common variables: qualifying variables that are more frequent among the qualifying variables are given a higher preference of being selected first for repair. In this case, the chooseVar() function picks variable vju ∈ J of Equation (7.13) with probability

    pu = fju / Σ_{r=1}^{s} fjr.    (7.14)

(iii) Prefer less-common variables: qualifying variables that are less frequent among the qualifying variables are given a higher preference of being selected first for repair. In this case, the chooseVar() function picks variable vju ∈ J of Equation (7.13) with probability

    pu = (1/fju) / Σ_{r=1}^{s} (1/fjr).    (7.15)
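A sketch of chooseVar() under the three preferences (illustrative Python; here freq is assumed to map each qualifying variable to its frequency among the qualifying rules):

    import random

    def choose_var(candidates, freq, prefer="none"):
        # Frequency-based variable choice of Sec. 7.2.2.
        cand = list(candidates)
        if prefer == "none":
            w = [1.0] * len(cand)              # uniform choice
        elif prefer == "common":
            w = [freq[v] for v in cand]        # Eq. (7.14)
        else:  # "uncommon"
            w = [1.0 / freq[v] for v in cand]  # Eq. (7.15)
        return random.choices(cand, weights=w, k=1)[0]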
These strategies of Sections 7.2.1 and 7.2.2, when permuted together, form a total of nine strategy combinations. They are listed in the first nine rows of Table 7.2; the tenth strategy is a pure EMO algorithm without any innovization based repairs.

Table 7.2: Different variable repair strategies for EMO/I studied in this work.

    ID   Rule Preference   Variable Preference    Abbrev.
    1    None              None                   NN
    2    None              Common variables       NC
    3    None              Uncommon variables     NU
    4    Long rules        None                   LN
    5    Long rules        Common variables       LC
    6    Long rules        Uncommon variables     LU
    7    Short rules       None                   SN
    8    Short rules       Common variables       SC
    9    Short rules       Uncommon variables     SU
    10   — NSGA-II with no repair —               NI

7.3 Test Problems

All test problems in this work are derived from the ZDT1 problem [107] and have the following form:

    Minimize f1(x) = x1,  f2(x) = g(x) h(f1(x), g(x)),    (7.16)

where h(f1, g) = 1 − √(f1/g). Every problem has a different g function, variable bounds and Pareto-optimal set, described in the following sections.

7.3.1 ZDT1-1

This problem is designed to have rules in the PO set such that:

• no variable is common between any two rules, and
• different rules may have different numbers of variables.

The problem is given by Equation (7.16), where g(x) is

    g(x) = 1 + |x2 − 0.5| + |x3 x4 − 0.5| + |x5 x6 x7 − 0.5|,
    x1,2 ∈ [0, 1], x3,4,5,6 ∈ [0.5, 1], x7 ∈ [0.5, 2].    (7.17)

Equation (7.18) shows the PO solutions set for this problem and Figure 7.2a shows the corresponding PO front:

    0.0 ≤ x1* ≤ 1.0,  x2* = 0.5,  x3* · x4* = 0.5,  and  x5* · x6* · x7* = 0.5.    (7.18)

This problem is specifically designed to test the repair strategies NI, NN, SN and LN; refer to Table 7.2 for a description of these strategies.

Figure 7.2: Pareto-optimal fronts for the test problems: (a) the ZDT1-1 problem; (b) the ZDT1-2 and ZDT1-3 problems.

7.3.2 ZDT1-2

This problem is designed to have rules in the PO set such that:

• some variables are common among two or more rules, and
• each rule has the same number of variables.

The ZDT1-2 problem is given by Equation (7.16), where g(x) is

    g(x) = 1 + |x1 x2 − 0.5| + |x2^0.5 x3 − √0.5| + |x2^2 x4 − 0.25| + |x2^−0.5 x5 − √0.5|,
    x1,2 ∈ [0.5, 1], x3 ∈ [√0.5, 1], x4 ∈ [0.25, 1], x5 ∈ [0.5, √0.5].    (7.19)

The Pareto-optimal region for this problem is given by Equation (7.20) and Figure 7.2b shows the corresponding PO front:

    x1* · x2* = 0.5,  (x2*)^0.5 · x3* = √0.5,  (x2*)^2 · x4* = 0.25,  and  (x2*)^−0.5 · x5* = √0.5.    (7.20)

This problem is designed to test the repair strategies NI, NN, NC and NU of Table 7.2.

7.3.3 ZDT1-3

This problem is designed to have rules in the PO set such that:

• some variables may be common among two or more rules, and
• each rule may have a different number of variables.

The ZDT1-3 problem is given by Equation (7.16), where g(x) is

    g(x) = 1 + |x1 x2 − 0.5| + |x1^0.5 x2 x3 − 0.5| + |x5 − 0.5| + |x1^−0.5 x2^2 x3 x4 − 0.25|,
    x1,2 ∈ [0.5, 1], x3 ∈ [√0.5, 1], x4 ∈ [0.25, 1], x5 ∈ [0, 1].    (7.21)

The Pareto-optimal region for this problem is given by Equation (7.22) and Figure 7.2b shows the corresponding PO front:

    x1* · x2* = 0.5,  (x1*)^0.5 · x2* · x3* = 0.5,  x5* = 0.5,  and  (x1*)^−0.5 · (x2*)^2 · x3* · x4* = 0.25.    (7.22)

This problem is designed to test and compare all the repair strategies given in Table 7.2.

7.4 Results on Test Problems

This section compares the performance of an EMO algorithm, NSGA-II in this work, with that of an EMO/I algorithm using the different variable repair strategies of Table 7.2 on the test problems of Section 7.3. The tables and figures in this section refer to the different contending algorithms by their respective abbreviations from Table 7.2; for example, an EMO algorithm without any variable repair strategy is referred to as 'NI'.

Furthermore, the different algorithms are compared on the metrics of median Generational Distance (GD) and median Inverse Generational Distance (IGD) [31] over thirty runs. All claims of one algorithm being better than another are backed by the results of the Wilcoxon Rank Sum (WRS) test [108] of statistical significance. We use the following convention to represent a test hypothesis: HA≺B represents the left tailed hypothesis test whose alternative hypothesis states that the median of distribution A is lower than the median of distribution B at some significance level α. A significance level of α = 5% is used in all the statistical tests, and both the h and p values are shown in the results. An h-value of 'No' means that the alternative hypothesis cannot be accepted at the desired significance level, and an h-value of 'Yes' means otherwise. Also, for the alternative hypothesis to be accepted, the corresponding p-value must be lower than the chosen α. The EMO parameters used for solving the test problems are given in Table 7.3.

Table 7.3: EMO parameters used for the test problems discussed in Section 7.3.

    Parameter                           ZDT1-1   ZDT1-2   ZDT1-3
    Population Size                     72       52       72
    Max Func. Evals                     36,000   36,400   50,400
    Prob. of Crossover                  0.9      0.9      0.9
    Prob. of Mutation                   1/7      1/5      1/5
    Crossover Index                     15       15       15
    Mutation Index                      20       20       20
    Start of Learning (% of Max FEs)    5        5        5
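For reference, one common way of computing the two comparison metrics against a sampled reference set of PO points is sketched below (illustrative Python/NumPy; the definitions follow the usual averaged nearest-distance forms):

    import numpy as np

    def gd(F, ref):
        # Generational Distance: mean distance from each obtained objective
        # vector in F to its nearest point of the reference PO set ref.
        d = np.linalg.norm(F[:, None, :] - ref[None, :, :], axis=2)
        return d.min(axis=1).mean()

    def igd(F, ref):
        # Inverse Generational Distance: mean distance from each reference
        # point to its nearest obtained objective vector.
        return gd(ref, F)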
Table 7.4: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-1 problem at 36,000 function evaluations and 5% significance level.

  Hypothesis     H_{NN≺NI}   H_{LN≺NI}   H_{SN≺NI}
  GD    h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}

[Figure 7.3: Median GD and IGD results for ZDT1-1 problem over 30 runs. (a) GD results; (b) IGD results.]

7.4.2 ZDT1-2 Results

This problem is designed to test the variable repair strategies NN, NC and NU of Table 7.2 against each other as well as their performance relative to the no-repair case, i.e. NI. Figure 7.4 shows the median GD and IGD results for the ZDT1-2 problem. The plots show that EMO/I with any of the three repair strategies NN, NC and NU performs better than the NI strategy in both GD and IGD for the same number of objective function evaluations. This observation is supported by the WRS results shown in Table 7.5. The table shows that the three alternative hypotheses, namely H_{NN≺NI}, H_{NC≺NI} and H_{NU≺NI}, can be accepted at the 5% significance level for GD as well as IGD in the ZDT1-2 problem at the end of the maximum function evaluations (36,400 for ZDT1-2). Furthermore, Figure 7.4 shows that the variable repair strategies NN, NC and NU have very similar performance and none can claim to be better than the others on this problem.

[Figure 7.4: Median GD and IGD results for ZDT1-2 problem over 30 runs. (a) GD results; (b) IGD results.]

Table 7.5: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-2 problem at 36,400 function evaluations and 5% significance level.

  Hypothesis     H_{NN≺NI}   H_{NC≺NI}   H_{NU≺NI}
  GD    h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}

7.4.3 ZDT1-3 Results

This problem is designed to test all nine variable repair strategies of Table 7.2, namely NN, NC, NU, LN, LC, LU, SN, SC and SU, against each other as well as their performance relative to the no-repair case, i.e. NI. To avoid illegible plots, these results are discussed in three parts.

7.4.3.1 Part-A: Strategies Preferring Short Rules

Figure 7.5 shows the median GD and IGD results for the ZDT1-3 problem comparing the variable repair strategies SN, SC and SU against each other as well as the no-repair case NI. The plots show that EMO/I with any of the three repair strategies SN, SC and SU performs better than the NI strategy in both GD and IGD for the same number of objective function evaluations. This observation is supported by the WRS test results shown in Table 7.6. The table shows that the three alternative hypotheses, namely H_{SN≺NI}, H_{SC≺NI} and H_{SU≺NI}, can be accepted at the 5% significance level for GD as well as IGD in the ZDT1-3 problem at the end of the corresponding maximum function evaluations (50,400 for ZDT1-3). Furthermore, Figure 7.5 shows that the variable repair strategies SN, SC and SU have very similar performance and none can claim to be better than the others on this problem.

[Figure 7.5: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, SC and SU repair strategies of EMO/I. (a) GD results; (b) IGD results.]
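For reference, one common way of computing the GD and IGD metrics used throughout these comparisons is sketched below, assuming F holds the obtained non-dominated objective vectors and Z a dense sampling of the true PO front. Variants using a different power mean also exist [31], so this is illustrative rather than the exact implementation used in this work.

```python
import numpy as np

def gd(F, Z):
    """Generational Distance: mean distance from each obtained point
    in F to its nearest reference point in Z (arrays of shape (n, m))."""
    d = np.linalg.norm(F[:, None, :] - Z[None, :, :], axis=2)
    return d.min(axis=1).mean()

def igd(F, Z):
    """Inverse Generational Distance: mean distance from each reference
    point in Z to its nearest obtained point in F."""
    d = np.linalg.norm(Z[:, None, :] - F[None, :, :], axis=2)
    return d.min(axis=1).mean()
```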
Table 7.6: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NI, SN, SC and SU at 50,400 function evaluations and 5% significance level.

  Hypothesis     H_{SN≺NI}   H_{SC≺NI}   H_{SU≺NI}
  GD    h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}

7.4.3.2 Part-B: Strategies with no Preference Based on Length of Rules

Figure 7.6 shows the median GD and IGD results for the ZDT1-3 problem comparing the variable repair strategies NN, NC, NU and SN against each other as well as the no-repair case NI. The plots show that:

- The repair strategies NN, NC and NU perform better than NI in both GD and IGD. This is backed by the results shown in the Hypothesis (group-I) columns of Table 7.7. Repair strategy SN also performs better than NI, but the corresponding results were already shown in Section 7.4.3.1.
- Furthermore, the repair strategy SN performs better than NN, NC and NU in terms of GD, whereas their performances in terms of the IGD metric are similar. The corresponding WRS test results are shown in the Hypothesis (group-II) columns of Table 7.7.

[Figure 7.6: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, NN, NC and NU repair strategies of EMO/I. (a) GD results; (b) IGD results.]

Table 7.7: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NN, NC, NU, SN and NI at 50,400 function evaluations and 5% significance level.

                 Hypothesis (group-I)                    Hypothesis (group-II)
               H_{NN≺NI}   H_{NC≺NI}   H_{NU≺NI}   H_{SN≺NN}       H_{SN≺NC}       H_{SN≺NU}
  GD    h      Yes         Yes         Yes         Yes             Yes             Yes
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}   1.1 × 10^{-5}   2.0 × 10^{-3}   3.6 × 10^{-5}
  IGD   h      Yes         Yes         Yes         --- Inconclusive ---
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}

7.4.3.3 Part-C: Strategies Preferring Long Rules

Figure 7.7 shows the median GD and IGD results for the ZDT1-3 problem comparing the variable repair strategies LN, LC, LU and SN against each other as well as the no-repair case NI. The plots show that:

- The repair strategies LN, LC and LU perform better than NI in both GD and IGD. This is backed by the results shown in the Hypothesis (group-I) columns of Table 7.8. Repair strategy SN also performs better than NI, but the corresponding results were already shown in Section 7.4.3.1.
- Furthermore, the repair strategy SN performs better than LN, LC and LU in terms of GD, whereas their performances in terms of the IGD metric are similar. The corresponding WRS test results are shown in the Hypothesis (group-II) columns of Table 7.8.

[Figure 7.7: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing LN, LC, LU, SN and NI repair strategies of EMO/I. (a) GD results; (b) IGD results.]

Table 7.8: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies LN, LC, LU, SN and NI at 50,400 function evaluations and 5% significance level.

                 Hypothesis (group-I)                    Hypothesis (group-II)
               H_{LN≺NI}   H_{LC≺NI}   H_{LU≺NI}   H_{SN≺LN}   H_{SN≺LC}   H_{SN≺LU}
  GD    h      Yes         Yes         Yes         Yes         Yes         Yes
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}   < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h      Yes         Yes         Yes         --- Inconclusive ---
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}
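Before summarizing, it may help to see how the nine strategy combinations of Table 7.2 can be realized in code. The sketch below combines a rule preference (short, long or none) with the variable-selection probabilities of Equations (7.14) and (7.15); rules are represented simply as lists of the variable indices they involve, and all names are illustrative rather than taken from an existing implementation.

```python
import random
from collections import Counter

def choose_rule(rules, rule_pref):
    """Pick a qualifying rule: shortest ('S'), longest ('L'),
    or uniformly at random ('N', no preference)."""
    if rule_pref == 'S':
        return min(rules, key=len)
    if rule_pref == 'L':
        return max(rules, key=len)
    return random.choice(rules)

def choose_var(rule, rules, var_pref):
    """Pick the variable of `rule` to repair; f[v] is the frequency of
    variable v over all qualifying rules, as in Eqs. (7.14) and (7.15)."""
    f = Counter(v for r in rules for v in r)
    if var_pref == 'C':                       # prefer common variables
        weights = [f[v] for v in rule]
    elif var_pref == 'U':                     # prefer uncommon variables
        weights = [1.0 / f[v] for v in rule]
    else:                                     # no preference
        weights = [1.0] * len(rule)
    return random.choices(rule, weights=weights, k=1)[0]
```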
7.4.4 Summary of Results on Test Problems

Based on the results shown in Sections 7.4.1, 7.4.2 and 7.4.3, we can conclude the following:

- EMO/I with any of the nine variable repair strategies, namely NN, NU, NC, SN, SU, SC, LN, LU and LC, performs better than the no-variable-repair case (NI) in terms of GD and IGD on all three test problems.
- The results of Section 7.4.3 show that EMO/I with the variable repair strategy SN performs better than the repair strategies NN, NU, NC, LN, LU and LC in terms of GD. In terms of IGD, there is no single clear winner.

Following the implication of the above summary, we compare the performance of EMO/I with the SN repair strategy against that of the no-repair NI strategy on three engineering design MOO problems, presented in the following sections.

7.5 Engineering Problems

This section describes three MOO problems from engineering design. All three problems are slightly modified from their original forms so that there are no transition points in the final PO front. Doing so makes all the rules among the PO solution set applicable to the entire PO front.

7.5.1 Two Bar Truss Problem (TBT)

[Figure 7.8: A two membered truss structure.]

In this problem, a two bar truss structure shown in Figure 7.8 is designed to carry a certain load without elastic failure. The truss members have cross-sectional areas x_1 m² and x_2 m², respectively, and the point of loading is at a vertical distance of x_3 m from the truss member hinge. The two objectives, both to be minimized, are the volume of the structure and the maximum stress developed in its two members. Its mathematical formulation is given by Equation (7.23). The original formulation of the problem given in [31] did not contain the constraint g_2 of Equation (7.23); it is added here to avoid any transition points in the PO front, so that the rules found are applicable to all the PO solutions. The transition point in this problem is encountered because the cross-sectional area x_2 hits its upper limit in order to minimize both objectives in a PO way. The reader is referred to [10] for more details on the reason for, and location of, the transition point at V ≈ 0.044 m³.

    Minimize  V = x_1 \sqrt{16 + x_3^2} + x_2 \sqrt{1 + x_3^2},
              S = max(σ_{AC}, σ_{BC}),
    Subject to  g_1 ≡ max(σ_{AC}, σ_{BC}) ≤ 10^5,
                g_2 ≡ V ≤ 0.044,                                              (7.23)
    where  σ_{AC} = \frac{20 \sqrt{16 + x_3^2}}{x_3 x_1},   σ_{BC} = \frac{80 \sqrt{1 + x_3^2}}{x_3 x_2},
           x_1, x_2 ∈ [0, 0.01],  x_3 ∈ [1.0, 3.0].

Figure 7.9 shows the PO front for this problem obtained using NSGA-II. To ascertain that the black curve shown in Figure 7.9 is indeed the true PO front, a local search is run from six points (shown with gray circles) using the ε-constraint method [109]. An ε-constraint is set on the volume objective while the maximum stress objective is minimized with local search. As can be seen, the local search is unable to find any point very different from the start points. Furthermore, the new front obtained by running a local search from all the points of the NSGA-II front amounts to a minor increase of 0.0132% in the hypervolume [31]. Thus, the front of Figure 7.9 can safely be assumed to be very close to the ideal PO front and hence is used for evaluating the GD and IGD results in later sections.

[Figure 7.9: Pareto-optimal front for the two member truss problem.]
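A minimal sketch of the TBT evaluation of Equation (7.23) is given below; the function name tbt is illustrative, and the constraints are returned in a 'feasible when non-positive' form.

```python
import numpy as np

def tbt(x):
    """Objectives and constraints of Equation (7.23); x = (x1, x2, x3)."""
    x1, x2, x3 = x
    sig_ac = 20.0 * np.sqrt(16.0 + x3**2) / (x3 * x1)
    sig_bc = 80.0 * np.sqrt(1.0 + x3**2) / (x3 * x2)
    V = x1 * np.sqrt(16.0 + x3**2) + x2 * np.sqrt(1.0 + x3**2)
    S = max(sig_ac, sig_bc)
    g1 = S - 1e5          # g1 <= 0 when the stress constraint holds
    g2 = V - 0.044        # g2 <= 0 when the volume constraint holds
    return (V, S), (g1, g2)
```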
7.5.2 Metal Cutting Problem (MC)

[Figure 7.10: A diagram of the metal turning process, with cutting speed v (m/min), feed rate f (mm/rev) and depth of cut a (mm).]

This MOO problem involves optimizing the twin objectives of operation time (T_p) and used tool life (ξ) during the turning process of a steel bar on a CNC lathe with a P20 carbide tool, as shown in Figure 7.10. Equation (7.24) presents the mathematical formulation of this problem. The formulation given in Equation (7.24) contains an additional constraint g_4 compared to the original formulation of the problem given in [110]. This constraint on the operation time objective is added to avoid transition points in the PO front, so that the rules found are applicable to all PO solutions. The location of this transition point, at T_p ≈ 0.9545 minutes, was discovered manually in [29].

    Minimize  T_p(x), ξ(x),
    Subject to  g_1(x) ≡ 1 − \frac{P(x)}{η P^{max}} ≥ 0,   g_2(x) ≡ 1 − \frac{F_c(x)}{F_c^{max}} ≥ 0,
                g_3(x) ≡ 1 − \frac{R(x)}{R^{max}} ≥ 0,     g_4(x) ≡ T_p(x) ≤ 0.9545,
    where  T_p(x) = 0.15 + \frac{219{,}912}{MRR(x)} \left(1 + \frac{0.20}{T(x)}\right) + 0.05,
           ξ(x) = \frac{21{,}991{,}200}{MRR(x)\,T(x)},   T(x) = \frac{5.48 × 10^9}{v^{3.46} f^{0.696} a^{0.46}},        (7.24)
           P(x) = \frac{v F_c(x)}{60{,}000},   F_c(x) = \frac{6.56 × 10^3 f^{0.917} a^{1.10}}{v^{0.286}},
           MRR(x) = 1000\,v f a,   R(x) = \frac{125 f^2}{r_n},
           P^{max} = 10 kW,  F_c^{max} = 5000 N,  R^{max} = 50 µm,  r_n = 0.8 mm,  η = 0.75,
           v ≡ x_1 ∈ [250.0, 400.0] m/min,  f ≡ x_2 ∈ [0.15, 0.55] mm/rev,  a ≡ x_3 ∈ [0.5, 6.0] mm.

Figure 7.11 shows the PO front for this problem obtained using NSGA-II. To ascertain that the black curve shown in Figure 7.11 is indeed the true PO front, a local search is run from six points (shown with gray circles) using the ε-constraint method. An ε-constraint is set on the operation time objective, T_p, while the used tool life objective, ξ, is minimized with local search. As can be seen, the local search is unable to find any point very different from the start points. Furthermore, the new front obtained by running a local search from all the points of the NSGA-II front amounts to a minor increase of 0.0692% in the hypervolume. Thus the front of Figure 7.11 can safely be assumed to be very close to the ideal PO front and hence is used for evaluating the GD and IGD results in later sections.

[Figure 7.11: Pareto-optimal front for the Metal Cutting problem.]

7.5.3 Welded Beam Problem (WB)

In this problem, a beam needs to be welded to another beam and must carry a load F [10]. It is desired to find four design parameters (thickness of the beam b, width of the beam t, length of the weld l, and weld thickness h) for which both the cost of the beam C and the vertical deflection at the end of the beam D are minimized. Its mathematical formulation is given by Equation (7.25). The original formulation of the problem did not contain the constraint g_5 of Equation (7.25); it is added here to avoid any transition points in the PO front, so that the rules found are applicable to all the PO solutions. The transition point in this problem is encountered because, beyond a deflection of 0.004 inches, the weld thickness h reaches its maximum possible value, i.e. the beam thickness b. This value is obtained using manual innovization as outlined in [10].

    Minimize  C(x) = 1.10471 h^2 l + 0.04811 t b (14.0 + l),
              D(x) = \frac{2.1952}{t^3 b},
    Subject to  g_1(x) ≡ 13{,}600 − τ(x) ≥ 0,   g_2(x) ≡ 30{,}000 − σ(x) ≥ 0,
                g_3(x) ≡ b − h ≥ 0,   g_4(x) ≡ P_c(x) − 6000 ≥ 0,   g_5(x) ≡ 0.004 − D(x) ≥ 0,
    where  τ(x) = \sqrt{(τ')^2 + (τ'')^2 + \frac{l τ' τ''}{\sqrt{0.25(l^2 + (h + t)^2)}}},
           τ' = \frac{6000}{\sqrt{2}\, h l},                                                     (7.25)
           τ'' = \frac{6000 (14 + 0.5 l) \sqrt{0.25(l^2 + (h + t)^2)}}{2 \{0.707 h l (l^2/12 + 0.25 (h + t)^2)\}},
           σ(x) = \frac{504{,}000}{t^2 b},   P_c(x) = 64{,}746.022 (1 − 0.0282346 t) t b^3,
           0.125 ≤ h, b ≤ 5.0 inches,  0.1 ≤ l, t ≤ 10.0 inches.

Figure 7.12 shows the PO front for this problem obtained using NSGA-II. To ascertain that the black curve shown in Figure 7.12 is indeed the true PO front, a local search is run from six points (shown with gray circles) using the ε-constraint method. An ε-constraint is set on the cost objective, C, while the deflection objective, D, is minimized with local search. As can be seen, the local search is unable to find any point very different from the start points. Furthermore, the new front obtained by running a local search from all the points of the NSGA-II front amounts to a minor increase of 0.0017% in the hypervolume. Thus the front of Figure 7.12 can safely be assumed to be very close to the ideal PO front and hence is used for evaluating the GD and IGD results in later sections.

[Figure 7.12: Pareto-optimal front for the WB problem.]
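The ε-constraint verification used for all three fronts can be sketched as follows, assuming callable objectives f1 and f2. SciPy's SLSQP stands in here for whichever local search is actually used; in practice, the problem's own constraints g_i would be appended to cons as well.

```python
import numpy as np
from scipy.optimize import minimize

def eps_constraint_check(start, f1, f2, bounds, eps):
    """Minimize f2 subject to f1(x) <= eps, starting from a point on the
    NSGA-II front. If the optimizer cannot improve on the start point,
    the front is locally non-dominated there."""
    cons = [{'type': 'ineq', 'fun': lambda x: eps - f1(x)}]
    res = minimize(f2, np.asarray(start, dtype=float), method='SLSQP',
                   bounds=bounds, constraints=cons)
    return res.x, res.fun
```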
The EMO parameters used for these engineering design problems are shown in Table 7.9. Next, we discuss the performance of the no-repair strategy NI and the variable repair strategy SN from Table 7.2 in solving these engineering design problems.

Table 7.9: EMO parameters used for solving the engineering problems discussed in Section 7.5.

  Parameter                          TBT      MC       WB
  Population size                    52       72       92
  Max. function evaluations          10,036   36,000   25,024
  Prob. of crossover                 0.9      0.9      0.9
  Prob. of mutation                  1/3      1/3      1/4
  Crossover index                    10       10       10
  Mutation index                     50       50       50
  Start of learning (% of max FEs)   5        5        5

7.6 Results on Engineering Problems

The repair strategy SN was shown at the end of Section 7.4 to be the best performer among the repair strategies of Table 7.2 on the test problems. Hence, this section compares the performance of an EMO algorithm with that of an EMO/I algorithm with repair strategy SN. As in Section 7.4, the algorithms are compared on the metrics of median GD and median IGD over 30 runs. All claims of one algorithm being better than another are backed by results of the WRS test of statistical significance. As before, H_{A≺B} represents the left-tailed hypothesis test whose alternative hypothesis states that the median of distribution A is lower than the median of distribution B at some significance level α. A significance level of α = 5% is used in all the statistical tests. For all the statistical tests, both h and p values are shown in the results. An h-value of No means that the alternative hypothesis cannot be accepted at the desired significance level; an h-value of Yes means otherwise. For the alternative hypothesis to be accepted, the corresponding p-value must be lower than the chosen α.

7.6.1 Two Bar Truss Problem Results

Figure 7.13 shows the median GD and IGD results for the TBT problem. The plots show that EMO/I with the SN repair strategy performs better than the NI strategy in terms of GD. This observation is supported by the WRS test results shown in the TBT column of Table 7.10. In terms of IGD, the NI strategy performs marginally better. This is expected, as EMO/I is designed to focus more on convergence than diversity. Figure 7.14 shows the PO front for the best GD case out of the thirty runs of EMO/I with the SN strategy on the TBT problem, overlaid with the PO front of the TBT problem taken from Figure 7.9.

[Figure 7.13: Median GD and IGD results for TBT problem over 30 runs. (a) GD results; (b) IGD results.]
Table 7.10: Results of Wilcoxon rank sum test for GD of TBT, MC and WB problems at their respective maximum function evaluations and 5% significance level.

  Hypothesis     TBT: H_{SN≺NI}   MC: H_{SN≺NI}   WB: H_{SN≺NI}
  GD    h        Yes              Yes             Yes
        p        < 10^{-6}        < 10^{-6}       1.2 × 10^{-4}

7.6.1.1 Rules Found in TBT Problem

The rules detected and used in repairing the PO solutions of the TBT problem towards the end of EMO/I (with SN) in the best GD run are:

    x_1^{−1.0082} · x_2 = 2.075   and   x_3 = 1.9449 ± 0.0674.    (7.26)

Figures 7.15 and 7.16 show the PO solutions (corresponding to the PO front shown in Figure 7.9) as black points, overlaid with the learned variable relations of Equation (7.26) shown with a gray line. The two overlaid plots are very close; thus a close-up of a part of each plot is shown in the inset for clarity. The two learned rules of Equation (7.26) are very close to the analytical relations that exist among the PO solutions of the TBT problem, i.e. x_2/x_1 = 2.0 and x_3 = 2.0 [10].

[Figure 7.14: Front coverage by the EMO/I method in the TBT problem for the best GD run.]
[Figure 7.15: Rule between variables x_1 and x_2 identified by EMO/I in the TBT problem at the end of the best GD run.]
[Figure 7.16: Rule for variable x_3 identified by EMO/I in the TBT problem at the end of the best GD run.]

7.6.2 Metal Cutting Problem Results

Figure 7.17 shows the median GD and IGD results for the MC problem. The plots show that EMO/I with the SN repair strategy performs better than the NI strategy in terms of GD. This observation is supported by the WRS test results shown in the MC column of Table 7.10. In terms of IGD, the NI strategy performs marginally better. This is confirmed in Figure 7.18, which shows the PO front for the best GD case out of the thirty runs of EMO/I (with the SN strategy) on the MC problem, overlaid with the PO front of the MC problem taken from Figure 7.11. This is expected, as EMO/I is designed to focus more on convergence than diversity.

[Figure 7.17: Median GD and IGD results for MC problem over 30 runs. (a) GD results; (b) IGD results.]
[Figure 7.18: Front coverage by the EMO/I method in the MC problem for the best GD run.]

7.6.2.1 Rules Found in MC Problem

The rules detected and used in repairing the PO solutions of the MC problem towards the end of EMO/I (with SN) in the best GD run are:

    f = 0.55 ± 5.4 × 10^{−6}   and   v · a^{1.5420} = 804.6058.    (7.27)

Figures 7.19 and 7.20 show the PO solutions (corresponding to the PO front shown in Figure 7.11) as black points, overlaid with the learned variable relations of Equation (7.27) shown with a gray line. The two overlaid plots are very close; thus a close-up of a part of each plot is shown in the inset for clarity.

[Figure 7.19: Rule for variable f identified by EMO/I in the MC problem at the end of the best GD run.]
[Figure 7.20: Rule among variables v and a identified by EMO/I in the MC problem at the end of the best GD run.]
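Power-law rules of the kind reported in Equations (7.26) and (7.27) can be estimated from the non-dominated data by linear regression in log space, in the spirit of Section 7.1.3.2. The following is an illustrative ordinary-least-squares sketch, not the exact implementation used in this work, together with the explicit repair that such a rule enables.

```python
import numpy as np

def fit_power_law(xi, xj):
    """Fit the rule x_i^b * x_j = c from non-dominated solutions by
    ordinary least squares in log space: log x_j = log c - b log x_i."""
    A = np.vstack([-np.log(xi), np.ones_like(xi)]).T
    (b, log_c), *_ = np.linalg.lstsq(A, np.log(xj), rcond=None)
    return b, np.exp(log_c)

def repair_xj(xi_val, b, c):
    """Explicit repair as in Section 7.1.5.2: given x_i, reset x_j so
    that the learned rule x_i^b * x_j = c holds exactly."""
    return c / xi_val**b
```

For the TBT data, for instance, such a fit would return values close to b = −1.0082 and c = 2.075, reproducing the first rule of Equation (7.26).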
7.6.3 Welded Beam Problem Results

Figure 7.21 shows the median GD and IGD results for the WB problem. The plots show that EMO/I with the SN repair strategy performs better than the NI strategy in terms of GD. This observation is supported by the WRS test results shown in the WB column of Table 7.10. In terms of IGD, the SN strategy appears to perform better (Figure 7.21b), but NI catches up towards the end; moreover, the apparently lower IGD of SN at intermediate function evaluations is not supported by the WRS test results. Figure 7.22 shows the PO front for the best GD case out of the thirty runs of EMO/I (with the SN strategy) on the WB problem, overlaid with the PO front of the WB problem taken from Figure 7.12.

[Figure 7.21: Median GD and IGD results for WB problem over 30 runs. (a) GD results; (b) IGD results.]
[Figure 7.22: Front coverage by the EMO/I method in the WB problem for the best GD run.]

7.6.3.1 Rules Found in WB Problem

The rules detected and used in repairing the PO solutions of the WB problem towards the end of EMO/I (with SN) in the best GD run are:

    h · l^{0.8842} = 0.9386,   h · b^{−0.5600} = 0.7239,   l · b^{−0.6323} = 1.3408,   t = 9.9939 ± 0.0087.    (7.28)

Figures 7.23, 7.24, 7.25 and 7.26 show the PO solutions (corresponding to the PO front shown in Figure 7.12) as black points, overlaid with the learned variable relations of Equation (7.28) shown with a gray line. The two overlaid plots are very close; thus a close-up of a part of each plot is shown in the inset for clarity.

[Figure 7.23: Rule between variables h and l identified by EMO/I in the WB problem at the end of the best GD run.]
[Figure 7.24: Rule between variables h and b identified by EMO/I in the WB problem at the end of the best GD run.]
[Figure 7.25: Rule between variables l and b identified by EMO/I in the WB problem at the end of the best GD run.]
[Figure 7.26: Rule for variable t identified by EMO/I in the WB problem at the end of the best GD run.]

7.7 Concluding Remarks

The results in Sections 7.4 and 7.6 show that the idea of online innovization can be useful in expediting the convergence of EMO algorithms; however, this may come at the expense of the diversity achieved in the final PO front. The PO fronts of all the problems considered so far do not have any transition points. Also, the rules on which we tested the EMO/I idea were composed solely of the MOO problem's variables; no rules contained any other possible basis functions, such as the MOO problem's objective values or constraints. In the coming chapter, we look at ideas that address these issues.
Chapter 8

Issues of Transition Points and Repairs Based on Complex Rules

8.1 Handling Transition Points

Until now, we have avoided dealing with transition points in PO fronts and assumed that whatever rules we find are applicable to all the solutions of a PO front. However, this is often not the case in practice, and the PO solutions of MOO problems usually contain one or more transition points [10]. A transition point is a point on a PO front across which the design rules adhered to by the solutions change significantly. Although such "points" can be lines or regions when the PO front is a surface in a three or more dimensional objective space, in bi-objective problems they are invariably points, and hence we will stick with the term transition point, or TP, from here on. In problems with transition points present in the PO front, the methods developed in Chapter 7 may not work, because there we assumed that the rules being learned apply to the entire front. In this section, we will see how we can learn rules and repair solutions as part of EMO/I in the presence of transition points.

Let us first try to understand the concept of a transition point with an example. Equation (8.1) shows another version of the two bar truss problem. Recall that this truss problem was also presented earlier in Section 7.5.1 as Equation (7.23). Note that the problem description given in Equation (8.1) has one less constraint, namely g_2 ≡ V ≤ 0.044, than the problem given in Equation (7.23).

    Minimize  V = x_1 \sqrt{16 + x_3^2} + x_2 \sqrt{1 + x_3^2},
              S = max(σ_{AC}, σ_{BC}),
    Subject to  g_1 ≡ max(σ_{AC}, σ_{BC}) ≤ 10^5,                             (8.1)
    where  σ_{AC} = \frac{20 \sqrt{16 + x_3^2}}{x_3 x_1},   σ_{BC} = \frac{80 \sqrt{1 + x_3^2}}{x_3 x_2},
           x_1, x_2 ∈ [0, 0.01],  x_3 ∈ [1.0, 3.0].

This change in the problem description introduces a transition point in the PO front of the problem. Figure 8.1 shows this transition point in the objective space and Figure 8.2 shows it in the variable space.

[Figure 8.1: Transition point encountered in the Truss problem, shown in objective space.]
[Figure 8.2: Transition point encountered in the Truss problem, shown in variable space.]

As is analytically and numerically shown in [10], the location of this transition point in the objective space is (V = 0.04472 m³, S = 8.94426 MPa) and in the variable space is (x_1 = 0.005 m, x_2 = 0.01 m, x_3 = 2.0 m). The reason for such a transition point to appear is as follows. Until variable x_2 reaches its upper bound of 0.01 m, to achieve a solution with a smaller maximum stress S (and a larger volume V) optimally, both cross sections x_1 and x_2 need to be increased linearly in the ratio x_1/x_2 = 0.5 while keeping x_3 = 2.0. Hence Equation (8.2) shows the rules that exist before the transition point is encountered. This region is shown in blue in Figures 8.1 and 8.2.

    x_1/x_2 = 0.5,   x_3 = 2.0 m.    (8.2)

Beyond this critical point (T), since x_2 cannot be increased any further, the only way to reduce the stresses is to increase the length x_3 in a manner that makes the stresses in both members equal. An increase of x_3 increases the length of the members, but decreases the component of the applied load on each member. Thus, a smaller cross-sectional area can be used to withstand the smaller load, causing a smaller developed stress.
Equation (8.3) shows the rules that the variables adhere to in order to reduce the maximum stress (while increasing volume) in an optimal manner. This region is shown in red in Figures 8.1 and 8.2.

    x_2 = 0.01 m,   x_1 = 0.0025 \sqrt{\frac{16 + x_3^2}{1 + x_3^2}} m.    (8.3)

Now, if we attempt to apply the method of rule learning and variable repair discussed in Chapter 7 directly to all non-dominated solutions of a MOO problem using EMO/I, we will learn the wrong rule parameters, because we will be trying to learn a rule using the data of all non-dominated solutions instead of over the individual partitions. As seen in the case of the two bar truss problem, the rules are very different across the transition point shown in Figures 8.1 and 8.2. Furthermore, we will also make poor repairs, waste function evaluations and worsen the algorithm's performance. To avoid this, the following is necessary:

- detect disjoint regions of PO solutions separated by transition points,
- learn rules separately for each of these regions, and
- repair solutions belonging to different regions using the learned rules of the corresponding regions. If no rules of the desired quality can be learned from a region, then do not attempt to repair the solutions of that region.

To apply the above line of thought, it becomes imperative to first detect such regions in the PO solutions that are separated by transition points. This is discussed in the next section.

8.1.1 Identifying Regions Separated by Transition Points (Active Set Partitioning)

Transition points usually appear in the PO fronts of MOO problems for one of two reasons: (a) a constraint becoming active or inactive, or (b) a sudden change in the nature of the objective function(s). The former is the more common cause of transition points and is the one addressed in this work. Recall from Section 8.1 that the transition point encountered in the truss problem arose from meeting the upper bound of variable x_2. The latter is not only less common in practice but also needs a more subtle mathematical treatment to first define what is meant by a "sudden" change in the objective function(s). Although a change in the nature of a function can be defined in terms of its higher order derivatives, this has not been studied in this work.

For every solution in a population of non-dominated solutions, we can find out which constraints are active and which ones are inactive. Then we can partition the solutions into groups based on proximity to different constraint boundaries. If all the transition points in a problem are a result of meeting some constraint bounds and not any other reason, then such a partitioning is all we need to be able to learn the right rules for the solutions of each partition, rather than trying to learn a single rule for all of the solutions.

Consider a MOO problem with n_g inequality constraints {g_1, g_2, ..., g_{n_g}} and n_x continuous variables {x_1, x_2, ..., x_{n_x}}. Without loss of generality, let all the inequality constraints be of the greater-than-or-equal-to type, g_i ≥ g_i^u, with corresponding bounds g_i^u. These inequality constraints can be represented as

    g_i^n = \frac{g_i}{g_i^u} − 1 ≥ 0,    (8.4)

where g_i^n is the normalized constraint value for constraint g_i. Furthermore, let x_i^l and x_i^u, where i ∈ {1, 2, ..., n_x}, represent the lower and upper bounds on the variables.
In a manner similar to Equation (8.4), we can obtain constraints g_i^l and g_i^u, which are given by

    g_i^l = \frac{x_i}{x_i^l} − 1 ≥ 0   and   g_i^u = 1 − \frac{x_i}{x_i^u} ≥ 0,    (8.5)

where g_i^l and g_i^u are the normalized box constraints for variable x_i.

In total, there are n_c = n_g + 2 n_x normalized constraints for this problem. Let c_{tol} be a tolerance value such that if a normalized constraint value lies between 0 and c_{tol}, we call that constraint active; otherwise we call it inactive. Let h_i, i ∈ {1, 2, ..., n_g}, be a boolean variable showing whether the normalized constraint g_i^n is active (1) or not (0). Similarly, let h_j^l and h_j^u, j ∈ {1, 2, ..., n_x}, be boolean variables showing whether the normalized constraints g_j^l and g_j^u are active or not.

With this information, let us look at an example to see how we can partition a set of solutions based on the active/inactive constraints of the solutions. Consider a MOO problem with one inequality constraint g_1 and two variables x_1 and x_2. Based on the aforementioned notation, the normalized constraints for this problem are g_1^n, g_1^l, g_1^u, g_2^l and g_2^u, and the corresponding boolean variables showing whether a constraint is active or not for a solution are h_1^n, h_1^l, h_1^u, h_2^l and h_2^u.

Table 8.1: An example of partitioning of the solution space based on constraint activity.

  Solution ID   h_1^n   h_1^l   h_1^u   h_2^l   h_2^u   Partition ID
  1             1       0       1       0       0       5
  2             1       0       1       0       0       5
  3             1       0       1       0       0       5
  4             1       0       0       1       0       9
  5             1       0       0       1       0       9
  6             0       0       0       1       0       8
  7             0       0       0       1       0       8
  8             0       0       0       1       0       8
  9             0       0       0       0       0       0
  10            0       0       0       0       0       0

Table 8.1 shows an example of ten non-dominated solutions for such a hypothetical MOO problem, together with the constraint activity of these ten solutions through the boolean variables h_1^n, h_1^l, h_1^u, h_2^l and h_2^u. For a solution, a boolean value of 1 means that the corresponding constraint is active; for example, h_1^n = 1 means that the constraint g_1^n is active, otherwise it is inactive. The boolean string of every solution can then be converted to its corresponding decimal value. For example, the boolean string for Solution ID 1 in the table is {1, 0, 1, 0, 0}, and the corresponding decimal value for this string is calculated as

    1 · 2^0 + 0 · 2^1 + 1 · 2^2 + 0 · 2^3 + 0 · 2^4 = 5.

We choose to call this decimal value the partition ID, because it is unique for a given set of active/inactive constraints. The solutions shown in Table 8.1 can thus be partitioned into four groups: three solutions belong to partition ID 5, two to partition ID 9, three to partition ID 8 and two to partition ID 0. The rule learning method discussed in Section 7.1.3 needs some minimum number of solutions to learn meaningful rules; hence, we can additionally require a minimum number of solutions to be present in a partition for it to be considered for rule learning. All this is encapsulated in Algorithm 8.1.
Algorithm 8.1 activeSetPartition()
input:  S (non-dominated set), ctol (constraint tolerance), minPartSize (minimum partition size)
output: P (partition IDs), Pv (partition validity)
 1: P, Pv ← ∅
 2: for k ← 1 to |S| do
 3:   s ← S_k
 4:   h ← getBool(s, ctol)          ▷ h = {h_1, ..., h_{n_g}, h_1^l, ..., h_{n_x}^l, h_1^u, ..., h_{n_x}^u}
 5:   p ← getPartitionId(h)
 6:   P ← P ⌢ p                     ▷ append to set P
 7: [P_u, P_c] ← getUniqueCount(P)
 8: for i ← 1 to |P| do
 9:   p ← P_i
10:   for j ← 1 to |P_u| do
11:     u ← P_u^j
12:     c ← P_c^j
13:     if (u = p) ∧ (c ≥ minPartSize) then
14:       v ← 1
15:     else if (u = p) then
16:       v ← 0
17:   Pv ← Pv ⌢ v

Let us look at a line-wise explanation of this algorithm.

- Input: The algorithm receives the non-dominated solution set S of a MOO problem, the normalized constraint tolerance ctol for a constraint to be considered active/inactive at a solution, and minPartSize, the minimum number of members a partition must have to be considered valid.
- Output: The output contains the partition IDs P and partition validities Pv, both of which have the same size as S. The set P carries the partition ID of each solution of S, and the set Pv carries the validity of the partition corresponding to each solution.
- Lines 1-6: The sets P and Pv are initialized to null sets. For each non-dominated solution s ∈ S, we obtain the boolean string h carrying the information on which constraints are active and which are inactive for s, based on the constraint values and the ctol parameter. This boolean string is then converted into a decimal value p, which we call the partition ID to which solution s belongs, and p is appended to the set P.
- Line 7: From the set of partition IDs obtained, the unique partition IDs along with their respective counts are stored in P_u and P_c, respectively.
- Lines 8-17: For each solution s ∈ S, the validity v of its partition is ascertained based on whether the corresponding partition ID p has at least minPartSize copies. The partition validity value of each solution is appended to the set Pv.

Once such active set partitioning is achieved, both rule learning and variable repair based on those rules are done individually for each partition. We can apply the rule learning and variable repair methods described in Section 7.1 to MOO problems having transition points if we learn rules (Section 7.1.3) and then repair variables (Section 7.1.5) for the solutions of each partition separately.
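A compact realization of Algorithm 8.1 is sketched below, assuming the normalized constraint values of Equations (8.4) and (8.5) are already assembled into a single array per solution; the function name active_set_partition is illustrative.

```python
import numpy as np
from collections import Counter

def active_set_partition(G, ctol=0.01, min_part_size=5):
    """Sketch of Algorithm 8.1. G is an (N, nc) array of normalized
    constraint values for N non-dominated solutions; a constraint is
    active when its normalized value lies in [0, ctol]."""
    ids = []
    for g in G:
        h = (g >= 0) & (g <= ctol)      # boolean activity string
        # decimal partition ID, matching the 2^k encoding of the text
        ids.append(sum(int(b) << k for k, b in enumerate(h)))
    counts = Counter(ids)
    valid = [int(counts[p] >= min_part_size) for p in ids]
    return ids, valid
```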
Let us now look at some results based on this idea.

8.1.2 Results on Problems with Transition Points

In this section, we present results for slightly modified versions of two of the engineering design problems of Section 7.5. In each problem, the modification amounts to the removal of the constraint that earlier allowed us to get PO solutions without any transition points. In each problem:

- EMO/I learns the rules for each active set partition (ASP) separately and then attempts to repair the solutions using the corresponding rules, and
- the repair strategy "Short Rules, No Preference on Variables", or SN from Table 7.2, is used.

In the text, we refer to this strategy of combining the SN repair strategy with active set partitioning as 'SNasp'. We compare SNasp against the repair strategies SN and NI (refer to Table 7.2) in the results. The EMO/I parameters used for the two problems are shown in Table 8.2.

Table 8.2: EMO/I parameters used for solving the modified engineering design problems that contain transition points in the PO front.

  Parameter                          TBT-2       MC-2
  Population size                    52          200
  Max. function evaluations          10,036      100,000
  Prob. of crossover                 0.9         0.9
  Prob. of mutation                  1/3         1/3
  Crossover index                    10          10
  Mutation index                     50          50
  Start of learning (% of max FEs)   20          20
  c_tol                              0.001       0.01
  minPartSize                        5% of pop   5% of pop

8.1.2.1 Modified Two Bar Truss Problem (TBT-2)

This problem is obtained from the problem description given in Equation (7.23) by removing the constraint g_2 ≡ V ≤ 0.044. This leads to the introduction of a transition point in the PO front, as shown in Figures 8.1 and 8.2.

Figure 8.3 shows that the SNasp repair strategy does better than both the SN and NI strategies in terms of GD, whereas, because of the presence of the transition point, the SN repair strategy is unable to do better than the NI strategy. The WRS results in Table 8.3 show that this difference in the performance of SNasp and SN is indeed significant. In the case of IGD, both SN and SNasp perform worse than NI. This is understandable, as the repair strategies are aimed at improving convergence to the PO front, not convergence and diversity at the same time.

[Figure 8.3: Median GD and IGD results for TBT-2 problem over 30 runs. (a) GD results; (b) IGD results.]

Table 8.3: Results of Wilcoxon rank sum test for GD of TBT-2 problem comparing EMO/I repair strategies NI, SN and SNasp at 10,036 function evaluations and 5% significance level.

  Hypothesis     H_{SNasp≺NI}     H_{SNasp≺SN}
  GD    h        Yes              Yes
        p        2.26 × 10^{-4}   0.047

8.1.2.2 Modified Metal Cutting Problem (MC-2)

This problem is obtained from the problem description given in Equation (7.24) by removing the constraint g_4 ≡ T_p(x) ≤ 0.9545. This leads to the introduction of a transition point in the PO front, as shown in Figures 8.4 and 8.5. The transition point divides the PO solutions into two regions, shown in red and blue; the solutions of the two regions adhere to different sets of rules, as shown in Figure 8.5.

[Figure 8.4: Transition point encountered in the Metal Cutting problem, shown in objective space.]
[Figure 8.5: Transition point encountered in the Metal Cutting problem, shown in variable space.]

Figure 8.6 shows that the SNasp repair strategy does better than both the SN and NI strategies in terms of GD, whereas, because of the presence of the transition point, the SN repair strategy is not able to do better than the NI strategy. The WRS results in Table 8.4 show that this difference in the performance of SNasp and SN is indeed significant. Although Figure 8.6a suggests that strategy SN performs better than NI in terms of GD, the WRS results in Table 8.4 do not back this observation, which means the difference is not statistically significant. In the case of IGD, although Figure 8.6b suggests relative differences between the SNasp, SN and NI repair strategies, the WRS results in Table 8.4 do not back this observation either, meaning that in terms of IGD the apparent difference between the three strategies is statistically insignificant.

[Figure 8.6: Median GD and IGD results for MC-2 problem over 30 runs. (a) GD results; (b) IGD results.]

Table 8.4: Results of Wilcoxon rank sum test for GD and IGD of MC-2 problem at 100,000 function evaluations and 5% significance level.

  Hypothesis     H_{SNasp≺NI}   H_{SNasp≺SN}   H_{SN≺NI}
  GD    h        Yes            Yes            No
        p        0.0082         0.0306         -
  IGD   h        No             No             No
        p        -              -              -

Now that we have studied one way of handling MOO problems with transition points, let us look at another idea: learning power laws that involve functions of the design variables as basis functions, and then repairing variables based on those rules.
8.2 Power Law Rules Involving Functions of Variables

In Section 7.1, we looked at ways of learning fixed form rules involving only variables from non-dominated solutions and then repairing the non-dominated solutions based on those learned rules. Whether we made a repair based on a constant rule (see Section 7.1.5.1) or based on a power law rule (see Section 7.1.5.2), making the repair was straightforward because the functions being learned were explicit in terms of all their constituent variables. If the learned rule is an implicit function of the variable that needs to be changed for solution repair, then this task may not be so straightforward; for example, if we learn a fixed form rule based on the problem's objectives or constraint functions, or if we use the CGP from Chapter 4 or Chapter 6 to find an algebraic expression involving problem variables that fits the data of the non-dominated solutions but is implicit in terms of the variable to be repaired.

Consider a bi-objective optimization problem with n_x variables, given in Equation (8.6):

    Minimize  f_1(x), f_2(x),
    Subject to  g_i(x) ≤ 0  ∀ i ∈ {1, 2, ..., n_g},
    where  x = {x_1, x_2, ..., x_{n_x}},                          (8.6)
           x_j ∈ [x_j^l, x_j^u]  ∀ j ∈ {1, 2, ..., n_x}, and
           x_j ∈ R  ∀ j.

If we use f_1 and f_2 as our basis functions for learning power laws, we may learn functions of the form

    f_1 · f_2^b = c    (8.7)

from the data of the non-dominated solutions. Estimating the parameters b and c is the same as explained in Section 7.1.3.2. However, for repairing variables, Equation (8.7) cannot be used in the same way a power law among variables was used in Section 7.1.5.2. This is because the left hand side of Equation (8.7) may no longer be assumed to be an explicit expression in terms of the problem variables. In this section, we discuss a way of using a rule that is implicitly defined in terms of the variables to make variable repairs.

Recall from Section 3.1 that in a MOO problem, the basis functions can be the design variables or functions thereof, including the objectives and the constraints. Let us re-write Equation (8.7) as

    f(x) = f(x_1, x_2, ..., x_{n_x}) = c,    (8.8)

where f(x) = f_1(x) · (f_2(x))^b. Since the data of the non-dominated solutions is used to arrive at this equation, it is in effect an implicit equation (in terms of the problem variables) that describes the dominance relation. Without loss of generality, let us say that we wish to repair variable x_r in an individual I of the non-dominated set. Let the individual I be represented as

    x_I = {a_1, a_2, ..., a_r, ..., a_{n_x}}    (8.9)

in the variable space. The candidate individual for repair, say I+, can then be represented as

    x_{I+} = {a_1, a_2, ..., x_r, ..., a_{n_x}},    (8.10)

because only variable x_r of individual I is under consideration for repair. The problem of repairing variable x_r can then be turned into a one-variable minimization problem:

    Minimize  | f(x_{I+}) − c |,
    Subject to  x_r ∈ [x_r^{l+}, x_r^{u+}].    (8.11)

Note that the bounds on x_r in Equation (8.11) are not the original bounds of variable x_r in Equation (8.6). The bounds [x_r^{l+}, x_r^{u+}] are different from [x_r^l, x_r^u]; they represent the updated bounds of variable x_r derived from the non-dominated set utilized in learning the rule of Equation (8.7). This helps in efficiently reducing the search space.

The single-variable bounded minimization problem of Equation (8.11) can be solved using an inexpensive solver such as the Golden Section method [111], with x_r = a_r available as a starting point for the algorithm. Such a method is a computationally inexpensive way of improving the performance of a non-dominated solution.
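A sketch of this repair is given below. SciPy's bounded scalar minimizer is used here in place of a hand-written golden section search, with the same iteration and tolerance limits as used later in Section 8.2.1; the rule argument is any callable implementing f(x) of Equation (8.8), and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def repair_implicit(x, r, rule, c, lo, hi):
    """Solve the one-variable problem of Equation (8.11): find x_r in
    the updated bounds [lo, hi] minimizing |f(x+) - c|, where `rule`
    evaluates the learned implicit rule f at a full variable vector."""
    def err(xr):
        xp = np.array(x, dtype=float)
        xp[r] = xr                     # only variable x_r is changed
        return abs(rule(xp) - c)
    res = minimize_scalar(err, bounds=(lo, hi), method='bounded',
                          options={'maxiter': 100, 'xatol': 1e-6})
    return res.x
```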
It is true that the repaired individual I+ may turn out to be infeasible, because the problem constraints are not considered in Equation (8.11), but we expect this to happen rarely, since the rule is learned from a set of non-dominated solutions in the first place. In the next section, we see how this method performs on a simple engineering design problem.

8.2.1 Results on Truss Problem

To test the method described in the previous section, we chose the two bar truss problem given by Equation (7.23). This is the problem without any transition points in the PO front. Furthermore, from [47] we know that the following relation exists between the two objectives in the PO solutions:

    V · S = 400.    (8.12)

To avoid wrongly attributing improvements in convergence to repairs based on rules involving variables only, which we know from Section 7.6.1 to work, we decided to use only the objectives as basis functions for these results. Again, the rule repair strategy SN (refer to Table 7.2) was used; only this time the learned rules involved objective functions only. Once an individual was chosen for repair, we used the Golden Section method to repair one of the chosen variables by solving the minimization problem given by Equation (8.11). Every golden section search was limited to a maximum of 100 iterations or a convergence tolerance of 10^{-6}, whichever was reached earlier. We call this repair strategy 'SNobj', i.e. learning and repairing based on rules involving objective functions only, combined with the SN repair strategy. Table 8.5 shows the other EMO/I parameters used in this work.

Table 8.5: EMO/I parameters used for solving the truss problem when only rules involving objective functions are learned.

  Parameter                          Truss problem
  Population size                    92
  Max. function evaluations          25,024
  Prob. of crossover                 0.9
  Prob. of mutation                  1/3
  Crossover index                    10
  Mutation index                     50
  Start of learning (% of max FEs)   5

Table 8.6: Results of Wilcoxon rank sum test for GD of the truss problem comparing EMO/I repair strategies NI and SNobj at 25,024 function evaluations and 5% significance level.

  Hypothesis     H_{SNobj≺NI}
  GD    h        Yes
        p        4.5617 × 10^{-13}

Figure 8.7a shows that this strategy of learning only rules involving objective functions and then making repairs based on the golden section method indeed helps the algorithm converge faster to the PO front. This is backed by the WRS test results shown in Table 8.6. In terms of IGD, there is not much difference between the repair strategies SNobj and NI.

[Figure 8.7: Median GD and IGD results for the Truss problem over 30 runs when only rules involving objective functions are learned. (a) GD results; (b) IGD results.]

8.3 Concluding Remarks

In this chapter, we looked at a way of handling transition points in the PO front using the active set partitioning method. Towards the end, we also discussed how we can learn, and make solution repairs based on, rules that are implicitly defined in terms of the repaired variable, using an inexpensive Golden Section method.
Chapter 9

Conclusion and Future Studies

This thesis identified and developed methods that can help with extracting human-interpretable knowledge from EMO algorithms while solving a MOO problem, and with using the same knowledge to repair the EMO solutions to expedite the algorithm.

The first half of this dissertation begins by showing a couple of fixed form rules that are prevalent in MOO engineering design problems and ways in which we can learn them on the fly during an EMO run more efficiently than the approach used in [11]. We then developed a customized GP that can be used for symbolically regressing rules from data as well as for identifying the decision boundary in binary classification problems, with the learned rules constrained to be in the form of mathematical algebraic expressions involving the regressands (features in the case of classification) to maintain easy interpretability of the rules. Here we use the following definition of the interpretability of rules/models: "how consistently can a human predict the results of the models" [112]. The use of a bi-objective optimization approach (minimization of classification error and minimization of rule complexity) has enabled us to find not one but multiple rule structures exhibiting a trade-off between the two objectives. This allows a user to analyze a number of alternative trade-off rule structures before choosing one or more of them for practice. In the case of the industry problem, we used basic mathematical operations such as integration, differentiation and the Fourier transform to arrive at a set of features from time series manufacturing data. This is very different from the standard painstaking method of handcrafting features, where each feature is created using an expert's knowledge [113]. Our dimensional awareness procedure has been able to convert some of the obtained trade-off rules into dimensionally meaningful rules by using relevant problem constants. This technology allows a user to have a much better understanding of, and insight into, the underlying process producing the data. The dimensional analysis, along with searching for a knee solution, provides the user with an effective decision support system when choosing from a PO set of regressors or classifiers obtained as a result of CGP. Both of these ideas can also be automated when rule learning is followed within an EMO by rule-based repair.

The second half of this dissertation developed methods of expediting convergence in EMO algorithms by using the aforementioned learned rules for making direct variable repairs during an EMO algorithm run. Unlike [71, 72], which learn rules in the form of decision trees and disjunctive normal form logical rules and add them as constraints to guide the algorithm, we learn rules in the form of algebraic expressions that can be used for direct variable repair to expedite the algorithm. A total of nine variable repair strategies for rule and variable selection were tested on test problems and engineering design problems. The results showed that, when faced with multiple qualifying rules to choose from, we should choose the shorter rules for making repairs. We did not see any effect of choosing the variable (for repair) based on its frequency in the qualifying rule set. This was followed by developing a method to learn rules and repair solutions in the presence of transition points in the PO front.
Furthermore, we also proposed a computationally inexpensive method of repairing variables when the learned rule is implicitly defined in terms of the variable to be repaired.

9.1 Contributions of this Thesis

The key contributions of this thesis, presented in their order of appearance, can be summarized as follows:

1. We developed a custom GP for the symbolic regression task that can learn rules from data in the form of algebraic expressions. This custom GP implements many interesting ideas from the literature, such as a bi-objective formulation to control bloat, multiple small expression trees instead of a single expression tree to make genetic operations on expression trees more efficient, and weight learning using a fast classical method such as OLSR to learn the constants in a rule much more efficiently than regular GPs. We also developed a custom diversity preserving mechanism in GP that penalizes duplicate solutions in the population while maintaining the relative non-domination rank order between penalized duplicates belonging to different non-domination ranks.

2. We developed a custom dimensional inconsistency penalty metric that can calculate the level of dimensional inconsistency in an algebraic expression. This metric is very helpful in identifying rules in the PO set learned by CGP that are not dimensionally meaningful.

3. We developed a custom GP for the classification task that can learn the decision boundary between two classes as an algebraic expression of the problem features. This custom GP performed well on binary class data from industry. The CGP could provide multiple PO rules with varying trade-offs between classification error and rule complexity. It could also select the six most important features out of more than fifty to build the classifiers. These important features were in line with the understanding of the process that our industry partners had obtained by conducting physical experiments and analytic studies of the same process in a controlled environment. Furthermore, the knowledge of the dimensional consistency of these classifiers was found to be beneficial for planning future experimental studies to understand the physics of the process.

4. We developed the EMO/I framework that combines the idea of innovization with that of an EMO algorithm. We evaluated various repair strategies for repairing solutions based on multiple rules in the form of algebraic expressions. Results on test problems and engineering design problems show that choosing the smaller of the qualifying rules first for repair is more beneficial for faster convergence of the EMO.

5. We developed a custom methodology, active set partitioning, to address the issue of learning rules and repairing solutions in the presence of transition points in the PO front. Furthermore, a computationally efficient method of repairing solutions when a rule is defined implicitly in terms of the variables is also suggested, and has been shown to work on an engineering design problem.

9.2 Future Studies

This study has opened doors to many exciting ideas and questions to be explored next. A few of them are mentioned below.

9.2.1 GP with In-Tandem Dimensional Consistency Check

In Chapters 4 to 6, we have shown that a dimensional consistency check on the final PO solutions of CGP is a powerful tool for shortlisting dimensionally meaningful rules. Instead of using this check at the end of CGP, i.e. serially, it may be beneficial to use it during the run of CGP.
A few ways to use this dimensional inconsistency penalty are:

- modifying the tree crossover operator in CGP to avoid crossover operations between trees whose dimensional inconsistency penalties are very different, and similarly, making sure not to mutate trees that have zero dimensional inconsistency penalty;
- using the dimensional inconsistency penalty value in the survivor selection operation of the CGP instead of the currently used crowded tournament selection operator [31].

9.2.2 Non-linear Decision Trees

Figure 9.1 shows the idea of combining CGP with a decision tree based classifier to capture complex and even disconnected decision boundaries in the feature space. The figure shows an arbitrary binary classification problem where the green patches represent the Go class data and the red patches represent the NoGo class data.

[Figure 9.1: The idea of combining CGP with a Decision Tree classification algorithm; the node functions f, g and h are functions of the features or decision variables, each obtained using one complete run of DAGP.]

If one uses only a decision tree based classifier on this problem, the decision tree obtained will be very deep, as a decision tree tries to partition the feature space parallel to the coordinate axes representing the features. However, when combined with CGP, such that each decision node of the tree is obtained using a CGP run minimizing node impurity, the final tree will be smaller, with each node of the tree being an algebraic expression of the original features. Such a method can be utilized to derive a classifier that is both effective and interpretable.

9.2.3 Experimenting with Different Definitions of Rule Complexity

The complexity aspect of a learned model can be defined in many ways [94]. In CGP, we defined the complexity of a CGP individual simply as the number of nodes in the corresponding expression tree. It can also be calculated differently, by assigning the various operators in the function set of CGP different complexities based on some a-priori hierarchy; for example, the set of operations {+, −, ÷, ×} can be assigned a lower complexity value than, say, the {√, exp(), log()} functions, as sketched below. These hierarchies can be based on the user's preference and knowledge, to aid the CGP in finding the appropriate rules from data.
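As an illustration, such an operator-weighted complexity measure is sketched below. The Node structure and the particular weights are hypothetical choices, not values from this work; with all weights equal to 1 the measure reduces to the node count used in CGP.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str                               # operator name, or a leaf symbol
    children: List['Node'] = field(default_factory=list)

# illustrative user-chosen hierarchy: transcendental operators cost more
WEIGHTS = {'+': 1, '-': 1, '*': 1, '/': 1, 'sqrt': 2, 'exp': 2, 'log': 2}

def complexity(node):
    """Sum of per-operator weights over the expression tree; leaves
    (variables or constants) count 1."""
    if not node.children:
        return 1
    w = WEIGHTS.get(node.op, 1)
    return w + sum(complexity(ch) for ch in node.children)
```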
Finally, we should apply the methods developed in this thesis to more real-world problems to make them more generally applicable. The ideas developed in this thesis can be very useful for practitioners in MOO as well as in machine learning. I hope that this thesis contributes, in however small a way, towards bridging the gap between human expertise in optimal design and its digital analog.

BIBLIOGRAPHY

[1] Siemens-PLM-HEEDS, “We help you discover better designs, faster,” https://www.plm.automation.siemens.com/global/en/products/simcenter/simcenter-heeds.html, 2004. [Online].
[2] Esteco-modeFrontier, “Process automation and optimization in the engineering design process,” https://www.esteco.com/modefrontier, 1998. [Online].
[3] Dassault-iSight, “Automate design exploration and optimization,” https://www.3ds.com/products-services/simulia/products/isight-simulia-execution-engine/, 1996. [Online].
[4] Noesis-Optimus, “The industry-leading PIDO software platform,” https://www.noesissolutions.com/our-products/optimus/, 2008. [Online].
[5] Dynardo-optiSLang, “Robust design optimization (RDO) in virtual product development,” https://www.dynardo.de/en/software/optislang.html, 2002. [Online].
[6] C. A. C. Coello, G. B. Lamont, D. A. Van Veldhuizen et al., Evolutionary Algorithms for Solving Multi-Objective Problems. Springer, 2007, vol. 5.
[7] H. J. Levesque, “Knowledge representation and reasoning,” Annual Review of Computer Science, vol. 1, no. 1, pp. 255–287, 1986.
[8] R. Davis, H. Shrobe, and P. Szolovits, “What is a knowledge representation?” AI Magazine, vol. 14, no. 1, pp. 17–33, 1993.
[9] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, 2003.
[10] K. Deb and A. Srinivasan, “Innovization: Innovating design principles through optimization,” in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 1629–1636.
[11] S. Bandaru and K. Deb, “Automated discovery of vital knowledge from Pareto-optimal solutions: First results from engineering design,” in 2010 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2010, pp. 1–8.
[12] S. Bandaru and K. Deb, “Higher and lower-level knowledge discovery from Pareto-optimal sets,” Journal of Global Optimization, vol. 57, no. 2, pp. 281–298, 2013.
[13] A. Ng, K. Deb, and C. Dudas, “Simulation-based innovization for production systems improvement: An industrial case study,” in Proceedings of the 3rd International Swedish Production Symposium (SPS’09), Göteborg, Sweden, 2–3 December 2009. The Swedish Production Academy, 2009, pp. 278–286.
[14] S. Bandaru, A. H. Ng, and K. Deb, “Data mining methods for knowledge discovery in multi-objective optimization: Part A: Survey,” Expert Systems with Applications, vol. 70, pp. 139–159, 2017.
[15] I. Nonaka and H. Takeuchi, The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press, 1995.
[16] A. Inselberg, “The plane with parallel coordinates,” The Visual Computer, vol. 1, no. 2, pp. 69–91, 1985.
[17] A. M. Geoffrion, J. S. Dyer, and A. Feinberg, “An interactive approach for multi-criterion optimization, with an application to the operation of an academic department,” Management Science, vol. 19, no. 4, part 1, pp. 357–368, 1972.
[18] K. J. Chichakly and M. J. Eppstein, “Discovering design principles from dominated solutions,” IEEE Access, vol. 1, pp. 275–289, 2013.
[19] S. Obayashi and D. Sasaki, “Visualization and data mining of Pareto solutions using self-organizing map,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2003, pp. 796–809.
[20] A. H. Ng, C. Dudas, J. Nießen, and K. Deb, “Simulation-based innovization using data mining for production systems analysis,” in Multi-objective Evolutionary Optimisation for Product Design and Manufacturing. Springer, 2011, pp. 401–429.
[21] T. W. Simpson, J. Poplinski, P. N. Koch, and J. K. Allen, “Metamodels for computer-based engineering design: Survey and recommendations,” Engineering with Computers, vol. 17, no. 2, pp. 129–150, 2001.
[22] B. Kim, R. Khanna, and O. O. Koyejo, “Examples are not enough, learn to criticize! Criticism for interpretability,” in Advances in Neural Information Processing Systems, 2016, pp. 2280–2288.
[23] M. E. Newman, “Power laws, Pareto distributions and Zipf’s law,” Contemporary Physics, vol. 46, no. 5, pp. 323–351, 2005.
[24] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural network architectures and their applications,” Neurocomputing, vol. 234, pp. 11–26, 2017.
[25] C. Molnar, Interpretable Machine Learning, 2019. Accessed: 2019-04-16.
[26] P. W. Bridgman, Dimensional Analysis. Yale University Press, 1922.
[27] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction. San Francisco: Morgan Kaufmann, 1998, vol. 1.
[28] C. Myburgh and K. Deb, “Derived heuristics-based consistent optimization of material flow in a gold processing plant,” Engineering Optimization, vol. 50, no. 1, pp. 1–18, 2018.
[29] K. Deb and R. Datta, “Hybrid evolutionary multi-objective optimization and analysis of machining operations,” Engineering Optimization, vol. 44, no. 6, pp. 685–706, 2012.
[30] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis. John Wiley & Sons, 2012, vol. 821.
[31] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, 2001, vol. 16.
[32] J. Nocedal and S. Wright, Numerical Optimization. Springer Science & Business Media, 2006.
[33] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Boston, MA: Addison-Wesley Longman, 1989.
[34] H.-P. Schwefel, Evolution and Optimum Seeking: The Sixth Generation. John Wiley & Sons, 1993.
[35] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992, vol. 1.
[36] C. A. C. Coello, D. A. Van Veldhuizen, and G. Lamont, Evolutionary Algorithms for Solving Multi-Objective Problems. Boston, MA: Kluwer, 2002.
[37] S. M. Sait, H. Youssef, and J. A. Khan, “Fuzzy evolutionary algorithm for VLSI placement,” in Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation. Morgan Kaufmann, 2001, pp. 1056–1063.
[38] J. E. Rodríguez, A. L. Medaglia, and J. P. Casas, “Approximation to the optimum design of a motorcycle frame using finite element analysis and evolutionary algorithms,” in 2005 IEEE Design Symposium, Systems and Information Engineering. IEEE, 2005, pp. 277–285.
[39] Y.-J. Kim and J. Ghaboussi, “A new genetic algorithm based control method using state space reconstruction,” in Proceedings of the Second World Conference on Structural Control, 1998, pp. 2007–2014.
[40] K. Harada, K. Ikeda, and S. Kobayashi, “Hybridization of genetic algorithm and local search in multiobjective function optimization: Recommendation of GA then LS,” in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 667–674.
[41] S. P. Harris and E. C. Ifeachor, “Nonlinear FIR filter design by genetic algorithm,” in Proceedings of the First Online Workshop on Soft Computing (WSC1), 1996, pp. 216–221.
[42] F. Y. Cheng and D. Li, “Multiobjective optimization design with Pareto genetic algorithm,” Journal of Structural Engineering, vol. 123, no. 9, pp. 1252–1261, 1997.
[43] M. S. Bright, “Evolutionary strategies for the high-level synthesis of VLSI-based DSP systems for low power,” Ph.D. dissertation, University of Wales, Cardiff, 1998.
[44] A. Belegundu, E. Constans, R. Salagame, and D. Murthy, “Multi-objective optimization of laminated ceramic composites using genetic algorithms,” in 5th Symposium on Multidisciplinary Analysis and Optimization, 1994, p. 4363.
[45] K. Deb, S. Bandaru, D. Greiner, A. Gaspar-Cunha, and C. C. Tutum, “An integrated approach to automated innovization for discovering useful design principles: Case studies from engineering,” Applied Soft Computing, vol. 15, pp. 42–56, 2014.
[46] M. W. Mkaouer, M. Kessentini, S. Bechikh, K. Deb, and M. Ó Cinnéide, “Recommendation system for software refactoring using innovization and interactive dynamic optimization,” in Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. ACM, 2014, pp. 331–336.
[47] K. Deb and A. Srinivasan, “Innovization: Discovery of innovative design principles through multiobjective evolutionary optimization,” in Multiobjective Problem Solving from Nature. Springer, 2008, pp. 243–262.
[48] S. Bandaru, C. C. Tutum, K. Deb, and J. H. Hattel, “Higher-level innovization: A case study from friction stir welding process optimization,” in 2011 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2011, pp. 2782–2789.
[49] S. Bandaru, “Automated innovization: Knowledge discovery through multi-objective optimization,” Ph.D. dissertation, Dept. of Mechanical Engineering, Indian Institute of Technology Kanpur, India, 2012.
[50] K. Deb, S. Gupta, D. Daum, J. Branke, A. K. Mall, and D. Padmanabhan, “Reliability-based optimization using evolutionary algorithms,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 5, pp. 1054–1074, 2009.
[51] S. Bandaru and K. Deb, “Temporal innovization: Evolution of design principles using multi-objective optimization,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2015, pp. 79–93.
[52] A. H. Ng, S. Bandaru, and M. Frantzén, “Innovative design and analysis of production systems by multi-objective optimization and data mining,” Procedia CIRP, vol. 50, pp. 665–671, 2016.
[53] B. Meyer and K. Sugiyama, “The concept of knowledge in KM: A dimensional model,” Journal of Knowledge Management, vol. 11, no. 1, pp. 17–35, 2007.
[54] P. Hoffman, G. Grinstein, K. Marx, I. Grosse, and E. Stanley, “DNA visual and analytic data mining,” in Proceedings of Visualization ’97 (Cat. No. 97CB36155). IEEE, 1997, pp. 437–441.
[55] I. Hatzilygeroudis and J. Prentzas, “Integrating (rules, neural networks) and cases for knowledge representation and reasoning in expert systems,” Expert Systems with Applications, vol. 27, no. 1, pp. 63–75, 2004.
[56] J. Shawe-Taylor and N. Cristianini, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[57] M. H. Hassoun, Fundamentals of Artificial Neural Networks. Cambridge, MA: MIT Press, 1995.
[58] G. Hinton and T. J. Sejnowski, Unsupervised Learning: Foundations of Neural Computation, 1st ed. MIT Press, 1999.
[59] N. Hitomi, H. Bang, and D. Selva, “Extracting and applying knowledge with adaptive knowledge-driven optimization to architect an earth observing satellite system,” in AIAA Information Systems - AIAA Infotech @ Aerospace, 2017, p. 0794.
[60] B. Liu, W. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998, pp. 24–25.
[61] S. Bandaru and K. Deb, “Towards automating the discovery of certain innovative design principles through a clustering-based optimization technique,” Engineering Optimization, vol. 43, no. 9, pp. 911–941, 2011.
[62] A. H. Ng, C. Dudas, J. Nießen, and K. Deb, “Simulation-based innovization using data mining for production systems analysis,” in Multi-objective Evolutionary Optimisation for Product Design and Manufacturing. Springer, 2011, pp. 401–429.
[63] N. R. Draper and H. Smith, Applied Regression Analysis. John Wiley & Sons, 2014, vol. 326.
[64] Y. Bernstein, X. Li, V. Ciesielski, and A. Song, “Multiobjective parsimony enforcement for superior generalisation performance,” in Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat. No. 04TH8753), vol. 1. IEEE, 2004, pp. 83–89.
[65] E. D. De Jong and J. B. Pollack, “Multi-objective methods for tree size control,” Genetic Programming and Evolvable Machines, vol. 4, no. 3, pp. 211–233, 2003.
[66] H. Iba, “Bagging, boosting, and bloating in genetic programming,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 2. Morgan Kaufmann, 1999, pp. 1053–1060.
[67] DataRobot LLC, “Eureqa: The A.I.-powered modeling engine by DataRobot,” 2018. [Online]. Available: https://www.nutonian.com/products/eureqa/
[68] C. W. Ahn and R. S. Ramakrishna, “On the scalability of real-coded Bayesian optimization algorithm,” IEEE Transactions on Evolutionary Computation, vol. 12, no. 3, pp. 307–322, 2008.
[69] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz, “BOA: The Bayesian optimization algorithm,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 1. Morgan Kaufmann, 1999, pp. 525–532.
[70] A. Zhou, Q. Zhang, and Y. Jin, “Approximating the set of Pareto-optimal solutions in both the decision and objective spaces by an estimation of distribution algorithm,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 5, pp. 1167–1189, 2009.
[71] A. H. Ng, C. Dudas, H. Boström, and K. Deb, “Interleaving innovization with evolutionary multi-objective optimization in production system simulation for faster convergence,” in Learning and Intelligent Optimization. Springer, 2013, pp. 1–18.
[72] R. S. Michalski, G. Cervone, and K. Kaufman, “Speeding up evolution through learning: LEM,” in Intelligent Information Systems. Springer, 2000, pp. 243–256.
[73] M. Keijzer and V. Babovic, “Dimensionally aware genetic programming,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 2. Morgan Kaufmann, 1999, pp. 1069–1076.
[74] A. Ratle and M. Sebag, “Genetic programming and domain knowledge: Beyond the limitations of grammar-guided machine discovery,” in Parallel Problem Solving from Nature PPSN VI. Springer, 2000, pp. 211–220.
[75] R. I. McKay, N. X. Hoai, P. A. Whigham, Y. Shan, and M. O’Neill, “Grammar-based genetic programming: A survey,” Genetic Programming and Evolvable Machines, vol. 11, no. 3–4, pp. 365–396, 2010.
[76] D. J. Montana, “Strongly typed genetic programming,” Evolutionary Computation, vol. 3, no. 2, pp. 199–230, 1995.
[77] M. Keijzer and V. Babovic, “Declarative and preferential bias in GP-based scientific discovery,” Genetic Programming and Evolvable Machines, vol. 3, no. 1, pp. 41–79, 2002.
[78] A. Ashour, L. Alvarez, and V. Toropov, “Empirical modelling of shear strength of RC deep beams by genetic programming,” Computers & Structures, vol. 81, no. 5, pp. 331–338, 2003.
[79] S. Bandaru and K. Deb, “A dimensionally-aware genetic programming architecture for automated innovization,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2013, pp. 513–527.
[80] N. Padhye and K. Deb, “Multi-objective optimisation and multi-criteria decision making in SLS using evolutionary approaches,” Rapid Prototyping Journal, vol. 17, no. 6, pp. 458–478, 2011.
[81] K. Deb and K. Sindhya, “Deciphering innovative principles for optimal electric brushless DC permanent magnet motor design,” in 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence). IEEE, 2008, pp. 2283–2290.
[82] G. A. Seber and A. J. Lee, Linear Regression Analysis. John Wiley & Sons, 2012, vol. 329.
[83] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping Multidimensional Data. Springer, 2006, pp. 25–71.
[84] T. S. Jaakkola and D. Haussler, “Probabilistic kernel regression models,” in AISTATS, 1999.
[85] J. R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA: MIT Press, 1994.
[86] M. F. Brameier and W. Banzhaf, Linear Genetic Programming. Springer Science & Business Media, 2007.
[87] S. Silva and J. Almeida, “GPLAB - a genetic programming toolbox for MATLAB,” in Proceedings of the Nordic MATLAB Conference, 2003, pp. 273–278.
[88] M. Schmidt and H. Lipson, “Distilling free-form natural laws from experimental data,” Science, vol. 324, no. 5923, pp. 81–85, 2009.
[89] P. J. Angeline, “Subtree crossover: Building block engine or macromutation,” Genetic Programming, vol. 97, pp. 9–17, 1997.
[90] Evolved Analytics, “DataModeler,” 2018. [Online]. Available: http://www.evolved-analytics.com/?q=about
[91] M. Kuhn and K. Johnson, Applied Predictive Modeling. Springer, 2013, vol. 26.
[92] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[93] S. Bleuler, J. Bader, and E. Zitzler, “Reducing bloat in GP with multiple objectives,” in Multiobjective Problem Solving from Nature. Springer, 2008, pp. 177–200.
[94] B.-T. Zhang and H. Mühlenbein, “Balancing accuracy and parsimony in genetic programming,” Evolutionary Computation, vol. 3, no. 1, pp. 17–38, 1995.
[95] D. Searson, M. Willis, and G. Montague, “Co-evolution of non-linear PLS model components,” Journal of Chemometrics, vol. 21, no. 12, pp. 592–603, 2007.
[96] J. Doe, “Height, depth and level of a tree,” 2014. [Online]. Available: http://typeocaml.com/2014/11/26/height-depth-and-level-of-a-tree/
[97] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman, Applied Linear Statistical Models. Chicago: Irwin, 1996, vol. 4.
[98] K. Deb and S. Gupta, “Understanding knee points in bicriteria problems and their implications as preferred solution principles,” Engineering Optimization, vol. 43, no. 11, pp. 1175–1204, 2011.
[99] B. R. Munson, T. H. Okiishi, A. P. Rothmayer, and W. W. Huebsch, Fundamentals of Fluid Mechanics. John Wiley & Sons, 2014.
[100] J. E. Shigley, C. R. Mischke, and R. G. Budynas, Mechanical Engineering Design, International ed. Singapore: McGraw-Hill, 1989.
[101] D. Chakraborty, S. Soni, J. Wei, N. Kovvali, A. Papandreou-Suppappola, D. Cochran, and A. Chattopadhyay, “Physics based modeling for time-frequency damage classification,” in Modeling, Signal Processing, and Control for Smart Structures 2008, vol. 6926. International Society for Optics and Photonics, 2008, p. 69260M.
[102] J. Branke, K. Deb, H. Dierolf, and M. Osswald, “Finding knees in multi-objective optimization,” in International Conference on Parallel Problem Solving from Nature. Springer, 2004, pp. 722–731.
[103] G. Strang, Introduction to Linear Algebra. Wellesley, MA: Wellesley-Cambridge Press, 1993, vol. 3.
[104] G. Casella and R. L. Berger, Statistical Inference. Pacific Grove, CA: Duxbury, 2002, vol. 2.
[105] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[106] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 2008, pp. 1322–1328.
[107] E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evolutionary Computation, vol. 8, no. 2, pp. 173–195, 2000.
[108] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[109] K. Miettinen, Nonlinear Multiobjective Optimization. Springer Science & Business Media, 2012, vol. 12.
[110] R. Q. Sardiñas, M. R. Santana, and E. A. Brindis, “Genetic algorithm-based multi-objective optimization of cutting parameters in turning processes,” Engineering Applications of Artificial Intelligence, vol. 19, no. 2, pp. 127–133, 2006.
[111] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, “Golden section search in one dimension,” in Numerical Recipes in C: The Art of Scientific Computing, 1992.
[112] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” arXiv preprint arXiv:1702.08608, 2017.
[113] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics. Springer, 2009.