ONLINE INNOVIZATION: TOWARDS KNOWLEDGE DISCOVERY AND ACHIEVING FASTER CONVERGENCE IN MULTI-OBJECTIVE OPTIMIZATION

By

Abhinav Gaur

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering — Doctor of Philosophy

2020

ABSTRACT

ONLINE INNOVIZATION: TOWARDS KNOWLEDGE DISCOVERY AND ACHIEVING FASTER CONVERGENCE IN MULTI-OBJECTIVE OPTIMIZATION

By

Abhinav Gaur

"Innovization" is the task of learning common principles that exist among some or all of the Pareto-optimal solutions of a multi-objective optimization problem. Except for a few earlier studies, most innovization-related studies were performed on the final non-dominated solutions found by an evolutionary multi-objective optimization algorithm, either manually or by using a machine learning method. Recent studies have shown that these principles can be learned during intermediate iterations of an optimization run and simultaneously utilized in the same run to repair variables and achieve faster convergence to the Pareto-optimal set. We call this "online innovization" because it is performed online, during the run of an evolutionary multi-objective optimization algorithm. Special attention is paid to learning rules that are easy to interpret, such as short algebraic expressions, instead of complex decision trees or kernel-based black-box rules.

We begin by showing how to learn fixed-form rules that are encountered frequently in multi-objective optimization problems. We also show how we can learn free-form rules, which are linear combinations of non-linear terms, using a custom genetic programming algorithm. We show how the concept of a 'knee' in the PO set of solutions, along with a custom dimensional penalty calculator, can be used to discard rules that are overly complex, inaccurate, or dimensionally incorrect. The results of rules learned using this custom genetic programming algorithm show that it is beneficial to let evolution learn the structure of the rules, while the constituent weights should be learned using a classical learning algorithm such as linear regression or linear support vector machines. When the rules are implicit functions of the problem variables, we use a computationally inexpensive way of repairing the variables by turning the repair problem into a single-variable golden-section search.

We show the proof of concept on test problems by learning fixed-form rules among variables of the problem, which we then use during the same optimization run to repair variables. Different principles learned during an optimization run can involve different numbers of variables and/or variables that are common to several principles. Moreover, a preference order for repairing variables may play an important role in proper convergence. Thus, when multiple principles exist, it is important to use a repair strategy that is most beneficial for the evolving population of solutions.

The above methods are applied to a mix of test problems and engineering design problems. The results are encouraging and strongly support the use of the innovization task in enhancing the convergence of evolutionary multi-objective optimization algorithms. Moreover, the custom genetic program developed in this work can be a useful machine learning tool for practitioners to learn human-interpretable rules in the form of algebraic expressions.

Dedicated to Maa.
ACKNOWLEDGMENTS

I first want to thank my family for the bulwark of support they have provided me ever since I decided to leave my job and pursue a career in research eleven years ago. Without them, I could never have mustered so much courage in my decisions. A big thank you to my wife for supporting me with the voice of hope and courage at times when I found it hard to believe in myself. I owe her so many days in kitchen and house chores. I want to take a moment to thank my friends from my undergraduate days, Asjad, Akshat, Aman and Gandhi, who paid so many of my bills and university application fees when I left my job to pursue a career in research, who always kept me in high spirits, and who even provided support to my parents in my absence from India.

Secondly, thanks to Prof. Kalyanmoy Deb for his invaluable time and advice throughout my doctoral studies. In terms of the human qualities of intelligence, humility and fairness, he surely is a Utopian point: quite literally unachievable, yet a point of aspiration for all his students. I am very thankful to him for putting so much effort into arranging my funding every semester. I aspire to nurture this relationship in the future and hopefully be part of the growth of the COIN lab and the Q-Lyst story.

These acknowledgements would not be complete without mentioning the immense support and camaraderie I received from my colleagues at the COIN lab. Most notably Haitham, who selflessly helped me with so many discussions on optimization, Java programming and just life in general; I am so glad that he is my friend and now colleague at Ford Motor Company. Yashesh, who has been like a younger brother, whom I can call in the middle of the night for help and who always made me feel more important than I probably should be. Rayan, whose smile in the face of difficulties is enviable; I wish everyone could have the fighting spirit that he possesses. Mohammad's and Julian's hard work is very inspiring, and I absorbed some of that quality from them. Thanks to Proteek and Zhichao, who were always happy to discuss research problems and ideas. Not to forget Khaled, probably one of the best programmers I have seen, with whom I learned so much about C++ programming in such a short time.

I want to thank General Motors' Vehicle Optimization Group and Manufacturing Analytics Group, who at various points during my doctoral studies provided me with the opportunity to work on real-world optimization problems and gave me invaluable experience of working on industry research problems. I want to acknowledge BEACON's support in providing me with many generous travel grants to attend international conferences, and also the opportunity to meet and work with world-class researchers of Evolutionary Computation, all within a stone's throw of my lab.

I am sure there are more people who are part of this work in different ways and yet I am forgetting them. I want to assure them that this omission is not deliberate and a mere sign of my aging.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
KEY TO ABBREVIATIONS
Chapter 1 Introduction
1.1 Search for Knowledge in MOO Problems
1.2 Using Discovered Knowledge in Expediting Convergence of EMO Algorithms
1.3 Organization of Dissertation
Chapter 2 Literature Survey and Proposed Online Innovization
2.1 Multi-objective Optimization
2.2 Evolutionary Multi-objective Optimization Algorithms
2.3 Innovization
2.4 A Taxonomy of Innovization
2.4.1 Manual Innovization
2.4.2 Automated Innovization
2.4.3 Higher Level Innovization
2.4.4 Lower Level Innovization
2.4.5 Temporal Innovization
2.5 Proposed Online Innovization
2.5.1 Target Data Set(s)
2.5.2 Type of Knowledge Representation
2.5.3 Using Extracted Knowledge to Expedite the Convergence
2.6 Dimensional Awareness in Rule Learning
Part I Rule Learning
Chapter 3 Learning Fixed Form Rules
3.1 Introduction
3.2 Learning Constant Rules
3.2.1 Estimating Rule Parameters and Quality
3.3 Learning Power Law Rules
3.3.1 Estimating Parameters
3.3.2 Comparison with Automated Innovization
3.3.3 Learning Multiple Rules Simultaneously
Chapter 4 Learning Free Form Rules Using a Symbolic Regression Task
4.1 Form of Rules
4.2 A Primer on Genetic Programming
4.3 A Custom GP (CGP)
4.3.1 Two Objectives: Prediction Error and Rule Complexity
4.3.2 Using Multiple Small Trees
4.3.3 Learning Weights Using a Faster Method
4.3.4 Diversity Preserving Mechanism
4.3.5 Higher and Lower Level Crossovers
4.3.6 CGP Flowchart for Rule Learning
4.4 Using CGP for Symbolic Regression Task
4.4.1 Evaluating Fitness of a CGP Individual for a Symbolic Regression Task
4.5 CGP Results on Test Problems
4.5.1 Test Problem-1
4.5.2 Test Problem-2
4.5.3 Test Problem-3
4.5.4 Test Problem-4
4.6 Noise Study
4.7 Choosing a Solution
Chapter 5 Using Dimensional Awareness with Rule Learning
5.1 Measuring Dimension Mismatch Penalty
5.1.1 Case-I: Terms with Only Product and Division Operations
5.1.2 Dimensionally Inconsistent Example
5.1.2.1 First Term
5.1.2.2 Second Term
5.1.2.3 Third Term
5.1.3 Dimensionally Consistent Example
5.1.4 Case-II: More Complex Terms
Chapter 6 Learning Free Form Rules Using a Classification Task
6.1 Target Classification Problem
6.2 Using CGP for a Classification Task
6.2.1 Evaluating Fitness of a CGP Individual for a Binary Classification Task
6.3 Performance on Small Feature Space
6.3.1 Results on Production Data Set-1
6.3.2 Results on Production Data Set-2
6.4 Results on Larger Feature Space with Dimension Check
6.4.1 Data and Results
6.5 Concluding Remarks
Part II Rule Based Repair
Chapter 7 Performing Repairs Based on Fixed Form Rules for Expedited Convergence
7.1 Rule Based Repair
7.1.1 Rule Basis and Quality Block
7.1.2 Decision Block-L
7.1.3 Learn Block
7.1.3.1 Constant Type Rules
7.1.3.2 Power Law Type of Rules
7.1.4 Decision Block-R
7.1.5 Repair Block
7.1.5.1 Repairing Variables Based on Constant Rule
7.1.5.2 Repairing Variables Based on Power Law Rules
7.2 Repair Strategies
7.2.1 Rule Preference Strategies
7.2.2 Variable Preference Strategies
7.3 Test Problems
7.3.1 ZDT1-1
7.3.2 ZDT1-2
7.3.3 ZDT1-3
7.4 Results on Test Problems
7.4.1 ZDT1-1 Results
7.4.2 ZDT1-2 Results
7.4.3 ZDT1-3 Results
7.4.3.1 Part-A: Strategies Preferring Short Rules
7.4.3.2 Part-B: Strategies with no Preference Based on Length of Rules
7.4.3.3 Part-C: Strategies Preferring Long Rules
7.4.4 Summary of Results on Test Problems
7.5 Engineering Problems
7.5.1 Two Bar Truss Problem (TBT)
7.5.2 Metal Cutting Problem (MC)
7.5.3 Welded Beam Problem (WB)
7.6 Results on Engineering Problems
7.6.1 Two Bar Truss Problem Results
7.6.1.1 Rules Found in TBT Problem
7.6.2 Metal Cutting Problem Results
7.6.2.1 Rules Found in MC Problem
7.6.3 Weld Beam Problem Results
7.6.3.1 Rules Found in WB Problem
7.7 Concluding Remarks
Chapter 8 Issues of Transition Points and Repairs Based on Complex Rules
8.1 Handling Transition Points
8.1.1 Identifying Regions Separated by Transition Points (Active Set Partitioning)
8.1.2 Results on Problems with Transition Points
8.1.2.1 Modified Two Bar Truss Problem (TBT-2)
8.1.2.2 Modified Metal Cutting Problem (MC-2)
8.2 Power Law Rule Involving Functions of Variables
8.2.1 Results on Truss Problem
8.3 Concluding Remarks
Chapter 9 Conclusion and Future Studies
9.1 Contributions of this Thesis
9.2 Future Studies
9.2.1 GP with in Tandem Dimensional Consistency Check
9.2.2 Non-linear Decision Trees
9.2.3 Experimenting with Different Definitions of Rule Complexity
9.2.4 Measuring Efficacy of Repairs
BIBLIOGRAPHY

LIST OF TABLES

Table 4.1: List of parameters in CGP.
Table 4.2: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.1.
Table 4.3: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.2.
Table 4.4: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.3.
Table 4.5: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.4.
Table 4.6: Results of noise study performed on test problem of Section 4.5.3.
Table 6.1: Production data details used for testing CGP classifier.
Table 6.2: Small feature set and their details.
Table 6.3: List of CGP parameters used for solving the binary classification problem of Section 6.3.1.
Table 6.4: Summary of classification rules found for the binary classification problem of Section 6.3.1 and their corresponding error rates on training and test sets.
Table 6.5: List of CGP parameters used for solving the binary classification problem of Section 6.3.2.
Table 6.6: Summary of classification rules found for the binary classification problem of Section 6.3.2 and their corresponding error rates on training and test sets.
Table 6.7: Larger feature set and their dimensions.
Table 6.8: The set of physical constants which were considered relevant to the underlying physics of the production process.
Table 6.9: Details of PO classifiers shown in Figure 6.7.
Table 6.10: Details of PO classifiers shown in Figure 6.7 after dimensional check.
Table 7.1: An example of candidate rule information available in RBQ block.
Table 7.2: Different variable repair strategies for EMO/I studied in this work.
Table 7.3: EMO parameters used in the test problems discussed in Section 7.3.
Table 7.4: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-1 problem at 36,000 function evaluations and 5% significance level.
Table 7.5: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-2 problem at 36,400 function evaluations and 5% significance level.
Table 7.6: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NI, SN, SC and SU at 50,400 function evaluations and 5% significance level.
Table 7.7: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NN, NC, NU, SN and NI at 50,400 function evaluations and 5% significance level.
Table 7.8: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies LN, LC, LU, SN and NI at 50,400 function evaluations and 5% significance level.
Table 7.9: EMO parameters used for solving the engineering problems discussed in Section 7.5.
Table 7.10: Results of Wilcoxon rank sum test for GD of TBT, MC and WB problems at their respective maximum function evaluations and 5% significance level.
Table 8.1: An example of partitioning of solution space based on constraint activity.
Table 8.2: EMO/I parameters used for solving the modified engineering design problems that contain transition points in the PO front.
Table 8.3: Results of Wilcoxon rank sum test for GD of TBT-2 problem comparing EMO/I repair strategies NI, SN and SNasp at 10,036 function evaluations and 5% significance level.
Table 8.4: Results of Wilcoxon rank sum test for GD and IGD of MC-2 problem at 100,000 function evaluations and 5% significance level.
Table 8.5: EMO/I parameters used for solving the truss problem when only rules involving objective functions are learned.
Table 8.6: Results of Wilcoxon rank sum test for GD of truss problem comparing EMO/I repair strategies NI and SNobj at 25,024 function evaluations and 5% significance level.

LIST OF FIGURES

Figure 1.1: Number of publications in the last two decades using search key "multi-objective optimization" (Source: Scopus).
Figure 1.2: Number of publications in the last two decades that have used at least one of the following MOO software packages: HEEDS, modeFrontier, iSight, Optimus, optiSLang (Source: Scopus).
Figure 1.3: Number of publications per year in the last decade which have the keywords "machine learning" and "interpretable" in their title, abstract or keywords (Source: Scopus).
Figure 2.1: Illustration of the objective and design space of a MOO problem.
Figure 2.2: An illustration of the concept of dominance.
Figure 2.3: An illustration of the concept of innovization.
Figure 2.4: An illustration of the concept of higher level innovization.
Figure 2.5: An illustration of the concept of lower level innovization.
Figure 2.6: The concept of Online Innovization.
Figure 2.7: An example of DimTransform() function.
Figure 3.1: Data availability when solving a MOO using an EMO algorithm.
Figure 3.2: Data availability when solving a MOO using an EMO algorithm.
Figure 4.1: Example of two GP solutions using tree representation.
Figure 4.2: An example of a sub-tree crossover between two GP individuals.
Figure 4.3: An example of a sub-tree Koza-style mutation of a GP individual.
Figure 4.4: The two conflicting objectives in any machine learning algorithm.
Figure 4.5: An MGGP individual composed of many small binary expression trees instead of one big binary expression tree.
Figure 4.6: Our GP learns the rule structure and their weights separately.
Figure 4.7: Fitness adjustment by penalizing duplicate solutions to promote population diversity in CGP. (A) shows the original state with two non-dominated fronts present in the population and a total of seven solutions. (B) Solution-4, which is a duplicate of solution-3, after penalization. (C) Solution-6, which is a duplicate of solution-5, after penalization.
Figure 4.8: Example of the high-level crossover used in CGP.
Figure 4.9: Example of the low-level crossover used in CGP.
Figure 4.10: The flowchart for Custom Genetic Program or CGP.
Figure 4.11: Graphical representation of test problem of Section 4.5.1.
Figure 4.12: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.1.
Figure 4.13: The three trees corresponding to the chosen knee solution shown in Figure 4.12.
Figure 4.14: Graphical representation of test problem of Section 4.5.2.
Figure 4.15: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.2.
Figure 4.16: The tree corresponding to the chosen knee solution shown in Figure 4.15.
Figure 4.17: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3.
Figure 4.18: The two trees corresponding to the chosen knee solution shown in Figure 4.17.
Figure 4.19: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.4.
Figure 4.20: The three trees corresponding to the chosen knee solution shown in Figure 4.19.
Figure 5.1: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3 followed by a dimensional penalty calculation for each case. Only the knee solution, which is also the exact solution, has a dimensional mismatch penalty of zero.
Figure 6.1: A hand-crafted example of binary class data.
Figure 6.2: Binary classification tree model for classifying the two-class data shown in Figure 6.1.
Figure 6.3: The binary class data of Figure 6.1 shown in a derived feature space where the data is linearly separable.
Figure 6.4: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.1.
Figure 6.5: Decision boundary for the knee solution classifier of the PO set of classifiers shown in Figure 6.4.
Figure 6.6: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.2.
Figure 6.7: PO set of classifiers found by CGP for two days' worth of production data described in Section 6.4.1.
Figure 7.1: The flowchart of an EMO/I algorithm.
Figure 7.2: Pareto-optimal fronts for the test problems.
Figure 7.3: Median GD and IGD results for ZDT1-1 problem over 30 runs.
Figure 7.4: Median GD and IGD results for ZDT1-2 problem over 30 runs.
Figure 7.5: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, SC and SU repair strategies of EMO/I.
Figure 7.6: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, NN, NC and NU repair strategies of EMO/I.
Figure 7.7: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing LN, LC, LU, SN and NI repair strategies of EMO/I.
Figure 7.8: A two-membered truss structure.
Figure 7.9: Pareto-optimal front for the two-member truss problem.
Figure 7.10: A diagram of the metal turning process.
Figure 7.11: Pareto-optimal front for the Metal Cutting problem.
Figure 7.12: Pareto-optimal front for the WB problem.
Figure 7.13: Median GD and IGD results for TBT problem over 30 runs.
Figure 7.14: Front coverage by EMO/I method in TBT problem for the best GD run.
Figure 7.15: Rule between variables x1 and x2 identified by EMO/I in TBT problem at the end of the best GD run.
Figure 7.16: Rule for variable x3 identified by EMO/I in TBT problem at the end of the best GD run.
Figure 7.17: Median GD and IGD results for MC problem over 30 runs.
Figure 7.18: Front coverage by EMO/I method in MC problem for the best GD run.
Figure 7.19: Rule for variable f identified by EMO/I in MC problem at the end of the best GD run.
Figure 7.20: Rule among variables v and a identified by EMO/I in MC problem at the end of the best GD run.
Figure 7.21: Median GD and IGD results for WB problem over 30 runs.
Figure 7.22: Front coverage by EMO/I method in WB problem for the best GD run.
Figure 7.23: Rule between variables h and l identified by EMO/I in WB problem at the end of the best GD run.
Figure 7.24: Rule between variables h and b identified by EMO/I in WB problem at the end of the best GD run.
Figure 7.25: Rule between variables l and b identified by EMO/I in WB problem at the end of the best GD run.
Figure 7.26: Rule for variable t identified by EMO/I in WB problem at the end of the best GD run.
Figure 8.1: Transition point encountered in Truss problem shown in objective space.
Figure 8.2: Transition point encountered in Truss problem shown in variable space.
Figure 8.3: Median GD and IGD results for TBT-2 problem over 30 runs.
Figure 8.4: Transition point encountered in Metal Cutting problem shown in objective space.
Figure 8.5: Transition point encountered in Metal Cutting problem shown in variable space.
Figure 8.6: Median GD and IGD results for MC-2 problem over 30 runs.
Figure 8.7: Median GD and IGD results for the Truss problem over 30 runs when only rules involving objective functions are learned.
Figure 9.1: The idea of combining CGP with a Decision Tree classification algorithm.

LIST OF ALGORITHMS

Algorithm 7.1: RepairPop()
Algorithm 7.2: wrShuffle()
Algorithm 8.1: activeSetPartition()
KEY TO ABBREVIATIONS

CGP Customized Genetic Program
EA Evolutionary Algorithm
EMO Evolutionary Multi-objective Optimization
EMO/I Evolutionary Multi-objective Optimization with Innovization
GA Genetic Algorithm
GD Generational Distance
GP Genetic Programming
IGD Inverse Generational Distance
ML Machine Learning
MOO Multi-Objective Optimization
OLSR Ordinary Least Squares Regression
PO Pareto Optimal
SVM Support Vector Machines

Chapter 1 Introduction

These are exciting times for academicians, practitioners and entrepreneurs working in the field of multi-objective optimization (MOO). Figure 1.1 shows the increasing attention this field has been receiving from academia over the last two decades.

Figure 1.1: Number of publications in the last two decades using search key "multi-objective optimization" (Source: Scopus).

Commercial MOO software, especially that catering to the computer-aided engineering industry, has seen a lot of growth in the past decade. Practitioners now have many options for commercial MOO software: HEEDS [1], modeFrontier [2], iSight [3], Optimus [4] and optiSLang [5], to name a few. Figure 1.2 shows the rising use of these software packages in research.

Figure 1.2: Number of publications in the last two decades that have used at least one of the following MOO software packages: HEEDS, modeFrontier, iSight, Optimus, optiSLang (Source: Scopus).

Evolutionary algorithms (EAs) are a popular method of solving MOO problems in practice. In this work, we refer to EAs designed to solve MOO problems as evolutionary multi-objective optimization (EMO) algorithms. The key reason for the widespread adoption of EMO algorithms in tackling MOO problems is their flexibility in handling real-world problem complexities such as non-linearity, non-convexity and non-differentiability [6].

As the adoption of MOO increases in industry, so do the expectations of practitioners from MOO software in terms of what it can do apart from providing optimal or near-optimal solution(s). Two key areas where researchers can try to address those expectations are:

1. offering some insight or knowledge about the problem, and
2. using the aforementioned knowledge to make an EMO algorithm converge faster towards the PO front.

EMO algorithms, or any population-based MOO algorithms, are uniquely positioned to deliver on both of these needs of MOO software users. Let us discuss the two requirements in the context of this work.

1.1 Search for Knowledge in MOO Problems

The Oxford dictionary defines knowledge as "the theoretical or practical understanding of a subject". In older Artificial Intelligence (AI) literature [7], "having knowledge" corresponds to recognizing something as information about the world or part of it. In the context of an engineering design and optimization problem, this is equivalent to understanding the optimal behaviour in relation to the objectives and the design variables of the problem. A closely related concept, "knowledge representation", is defined as fundamentally a surrogate, a substitute for the thing itself, used to enable an entity to determine consequences by reasoning about the world rather than taking action in it [8]. Let us further discuss what we mean by knowledge in the context of MOO problems. MOO problems with two or more conflicting objectives have more than one optimal solution.
These Pareto-optimal (PO) solutions may possess certain special properties that make them PO in the first place. For example, in a mathematically well-behaved MOO problem with convex, continuous and differentiable functions, the PO solutions must adhere to at least the Fritz-John or Karush-Kuhn-Tucker necessary conditions for Pareto optimality [9]. Even in real-world engineering design problems that lack the aforementioned regularities around optima, the PO solutions may still possess certain special patterns or "rules" that set them apart from non-optimal or even random solutions. This idea of looking for rules in the PO solutions of a MOO problem was coined innovization by the authors of [10]. The authors showed the existence and manual extraction of such rules from the PO solutions of many engineering design problems. Since then, the idea of innovization has been applied to many MOO problems in engineering design [11–13].

A recent work [14] provides an extensive survey of the various forms of knowledge sought after in MOO problems. Knowledge representation in a MOO problem can be categorized into implicit or explicit forms [15]. Although very popular, knowledge in implicit form has no formal notation and may require the user to have specific experience. Most of the visual data mining methods, such as parallel coordinate plots [16], value paths [17], heat-maps [18] and self-organizing maps [19], fall in this category. Explicit knowledge, on the other hand, has a crisp mathematical form and can be interpreted by humans unambiguously. A couple of examples of explicit forms of knowledge representation in MOO problems are decision trees [20] and regression rules [21]. In this work, we have decided to pursue knowledge in explicit form, which has crisp mathematical notation and is amenable to consistent human "interpretation". Although we could not find a mathematical definition of interpretability, [22] defines it as the degree to which a human can consistently predict a model's result. For example, a model having a power law structure in terms of design variables, which is common in problems from engineering and physics [23], is easier to interpret than a model based on deep neural networks [24]. Christoph [25] presents a nice case for the need for interpretability in machine learning models. Figure 1.3 shows a sharp increase over the last five years in the amount of research trying to address interpretability in machine learning.

Figure 1.3: Number of publications per year in the last decade which have the keywords "machine learning" and "interpretable" in their title, abstract or keywords (Source: Scopus).

In this work, rules refer to explicit mathematical expressions, involving MOO problem variables or functions thereof, that are adhered to by some or all of the PO solutions.

Another important aspect of rule learning in practical MOO problems is adherence to the principle of dimensional homogeneity [26]. In any rule representing some aspect of a physical or engineering system, adding or subtracting two dimensionally incommensurate quantities can never produce physically meaningful knowledge. For example, a rule should not be adding physical quantities having the dimensions of length and time. This is another aspect that we have paid attention to while deciding the kind of rules that we want to learn from PO data.
It is for these reasons of maintaining interpretability and easy verifiability of dimensional consistency that we chose to learn only certain kinds of rules in this work, such as power law rules or rules involving the basic operations {+, −, ×, ÷} over problem variables. For the latter, we developed a bi-objective genetic programming (GP) [27] based method, which we will go over in the coming chapters.

1.2 Using Discovered Knowledge in Expediting Convergence of EMO Algorithms

EMO algorithms tend to be computationally inefficient because they evaluate many alternative solutions in the population before converging to PO solutions. There have been examples [28, 29] in which researchers learned some heuristics about the problem from initial EMO runs and then adopted those heuristics as part of the problem-solving EMO to expedite its convergence. This method is known as derived heuristics in the literature. In the above examples, the authors waited for the EMO algorithm to run to completion, first obtaining a PO solution set and then conducting a manual innovization procedure over the solutions to come up with the rules. In this work, we intend to learn rules in explicit mathematical form from a select set of solutions (say, the non-dominated set) during an EMO algorithm run and then use those rules to directly repair variables, expediting convergence to the PO solution set. There are many challenging questions in this approach, including:

• When to begin the learning process?
• What are some computationally efficient ways of repairing solution variables without making extra evaluations of the objective function? (one such repair mechanism is sketched after this list)
• How to learn rules in MOO problems where different parts of the PO front adhere to different sets of rules?

We have tried to answer these questions in the latter half of this dissertation.
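The repair mechanism alluded to in the second question casts the repair of a single variable as a one-dimensional search: when a learned rule is an implicit function of the variables, the value of one chosen variable can be moved within its bounds so as to minimize the rule violation using a golden-section search, with no extra objective function evaluations. The following Python sketch is a minimal illustration of that idea under an invented rule; it is not the exact repair operator developed later in this dissertation.

    import math

    def golden_section_min(f, lo, hi, tol=1e-6):
        # Minimize a unimodal scalar function f on the interval [lo, hi]
        invphi = (math.sqrt(5.0) - 1.0) / 2.0      # 1/golden-ratio, ~0.618
        a, b = lo, hi
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        while (b - a) > tol:
            if f(c) < f(d):
                b, d = d, c                        # minimum lies in [a, old d]
                c = b - invphi * (b - a)
            else:
                a, c = c, d                        # minimum lies in [old c, b]
                d = a + invphi * (b - a)
        return 0.5 * (a + b)

    # Hypothetical learned rule 0.4 * x_k^2 = 2.5 (all other variables held
    # fixed): repairing x_k means driving the rule violation to zero within
    # the variable's bounds.
    def rule_violation(xk):
        return abs(0.4 * xk ** 2 - 2.5)

    print(round(golden_section_min(rule_violation, 0.0, 5.0), 4))   # -> 2.5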
1.3 Organization of Dissertation

This dissertation is organized as follows. Chapter 2 presents relevant literature on the subject of innovization, on interpretable rule learning from data, and on the idea of finding dimensionally consistent rules; dimensional awareness here refers to the adherence of the learned rules to the law of dimensional homogeneity. The dissertation is then presented in two parts. Part I focuses on learning interpretable mathematical rules from data and consists of four chapters. Chapter 3 illustrates how prevalent power laws are in engineering problems and how they can be learned using the ordinary least squares method of linear regression [30]. Chapters 4 and 6 show how we can learn free form rules using a bi-objective GP for non-linear regression and classification tasks; we developed this GP as part of solving a real-world industry problem. Chapter 5 then illustrates how the principle of dimensional homogeneity can be useful in generating or choosing physically meaningful rules from a set of rules which may all be acceptable to a user in terms of regression/classification accuracy. Part II then focuses on how we can use rules learned during an optimization task to expedite the convergence of an EMO algorithm. Chapter 7 shows results of online innovization on test and engineering MOO problems when we learn power law rules based on design variables only. Chapter 8 shows results of online innovization on test and engineering MOO problems when we learn power law rules based on design variables as well as functions of them, such as the objective functions. Furthermore, a methodology is devised to handle transition points in PO fronts and still be able to expedite the convergence of the EMO algorithm. We present the main conclusions of this work in Chapter 9, along with some interesting research threads worth pursuing in future research.

Chapter 2 Literature Survey and Proposed Online Innovization

2.1 Multi-objective Optimization

Many real-world design problems from engineering can be formulated as a multi-objective optimization (MOO) problem [6, 31]. This approach is particularly useful when it is difficult to capture the decision makers' preference among the objectives in terms of some convex set of weights and thereby turn the problem into a single-objective problem using the weighted-sum approach. The goal of MOO is to provide the decision maker with a set of optimal trade-off solutions in terms of the objectives, let the decision maker develop a preference, and finally make a choice. Equation (2.1) shows a formulation of a MOO problem with $n_f$ objectives (all to be minimized), $n_x$ design variables with known lower and upper bounds, and $n_g$ inequality constraints:

$$
\begin{aligned}
\text{Minimize} \quad & \mathbf{f}(\mathbf{x}), \\
\text{subject to} \quad & \mathbf{g}(\mathbf{x}) \leq \mathbf{0}, \\
\text{where} \quad & \mathbf{x} = [x_1\ x_2\ \ldots\ x_{n_x}]^{\top}, \\
& \mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x})\ f_2(\mathbf{x})\ \ldots\ f_{n_f}(\mathbf{x})]^{\top}, \\
& \mathbf{g}(\mathbf{x}) = [g_1(\mathbf{x})\ g_2(\mathbf{x})\ \ldots\ g_{n_g}(\mathbf{x})]^{\top}, \\
& l_i \leq x_i \leq u_i, \quad i \in \{1, 2, \ldots, n_x\}.
\end{aligned}
\tag{2.1}
$$

A solution $\mathbf{x}^* = [x_1^*\ x_2^*\ \ldots\ x_{n_x}^*]^{\top}$ to such a problem is a vector of $n_x$ decision variables and must satisfy the variable bounds and the $n_g$ inequality constraints; otherwise, it is an infeasible solution. Figure 2.1 illustrates the objective, design and feasible spaces for a bi-objective minimization problem having three design variables.

Figure 2.1: Illustration of the objective and design space of a MOO problem.

Concept of Dominance

In a single-objective optimization problem, it is straightforward to compare two solutions simply based on their corresponding objective values; when minimizing, the solution with the lower objective value is considered better. In the case of a MOO problem, however, the comparison is based on the concept of dominance [31].

Definition 2.1.1. A solution $x^{(1)}$ is said to dominate a solution $x^{(2)}$ if the following conditions are true:

1. The solution $x^{(1)}$ is no worse than $x^{(2)}$ in all objectives.
2. The solution $x^{(1)}$ is strictly better than $x^{(2)}$ in at least one objective.

We denote $x^{(1)}$ dominating $x^{(2)}$ as $x^{(1)} \preceq x^{(2)}$, and $x^{(2)}$ dominating $x^{(1)}$ as $x^{(2)} \preceq x^{(1)}$. If neither of the two solutions dominates the other, we denote this condition as $x^{(1)} \parallel x^{(2)}$.

Figure 2.2: An illustration of the concept of dominance.

In Figure 2.2, upon doing pairwise comparisons of solution $x^{(2)}$ with all others, $x^{(2)} \preceq x^{(4)}$, $x^{(2)} \preceq x^{(5)}$ and $x^{(2)} \preceq x^{(6)}$, while $x^{(2)} \parallel x^{(1)}$ and $x^{(2)} \parallel x^{(3)}$. Another important concept is that of the non-dominated set [31].

Definition 2.1.2. Among a set of solutions $P$, the non-dominated set of solutions $P'$ consists of those that are not dominated by any member of the set $P$.

In Figure 2.2, set $P = \{1, 2, 3, 4, 5, 6\}$ and set $P' = \{1, 2, 3\}$. Furthermore, when the set $P$ is the entire search space, the resulting non-dominated set $P'$ is called the Pareto-optimal set. Sometimes we are interested in dividing a set $P$ into different non-domination levels or ranks. If a set $P$ contains $J = \{1, 2, \ldots, j\}$ non-domination levels, it means that the set $P$ can be partitioned into $j$ levels, i.e.,

$$
P = \bigcup_{i \in J} P_i \quad \text{and} \quad P_r \cap P_s = \emptyset \quad \forall\, r, s \in J,\ r \neq s,
$$

such that a point in a set with a lower non-domination rank is never dominated by a point in any set of a higher non-domination rank. In Figure 2.2, there are three non-domination ranks present: $P_1 = \{1, 2, 3\}$, $P_2 = \{4, 5\}$ and $P_3 = \{6\}$.
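To make these definitions concrete, the dominance test of Definition 2.1.1 and the repeated non-dominated filtering that yields the non-domination levels can be coded in a few lines of Python. The objective vectors below are invented so as to reproduce the relations described for Figure 2.2 (the figure's actual coordinates are not given), so this is an illustrative sketch only:

    def dominates(f1, f2):
        # Definition 2.1.1 for minimization: no worse everywhere, better somewhere
        return (all(a <= b for a, b in zip(f1, f2)) and
                any(a < b for a, b in zip(f1, f2)))

    def nondomination_levels(objs):
        # Peel off successive non-dominated fronts P1, P2, ... (Definition 2.1.2)
        remaining = dict(objs)
        levels = []
        while remaining:
            front = [i for i in remaining
                     if not any(dominates(remaining[j], remaining[i])
                                for j in remaining if j != i)]
            levels.append(sorted(front))
            for i in front:
                del remaining[i]
        return levels

    # Invented objective values for solutions 1-6, consistent with Figure 2.2
    objs = {1: (1.0, 6.0), 2: (2.0, 3.0), 3: (5.0, 1.0),
            4: (3.0, 4.0), 5: (4.0, 3.5), 6: (5.0, 5.0)}
    print(dominates(objs[2], objs[4]))   # True:  x(2) dominates x(4)
    print(dominates(objs[2], objs[1]))   # False: x(2) || x(1)
    print(nondomination_levels(objs))    # [[1, 2, 3], [4, 5], [6]]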
2.2 Evolutionary Multi-objective Optimization Algorithms

Real-world optimization problems have many complexities built into them, such as non-linearity, discontinuity, non-differentiability, discrete variables, etc. These complexities make it impossible to apply known classical optimization algorithms, such as the gradient descent algorithm [32], without making strong simplifying assumptions about the problem. An evolutionary algorithm (EA) mimics natural evolutionary principles to conduct search and optimization tasks. Some of the popular EAs are genetic algorithms [33], evolution strategies [34] and genetic programming [35]. EAs have been shown to have an edge in solving MOO problems for finding multiple trade-off solutions, simply because EAs are capable of finding and storing multiple solutions from generation to generation. The population approach of EAs and their ability to build multiple niches within a population enabled EA researchers to develop evolutionary multi-objective optimization (EMO) algorithms [31, 36]. EMO algorithms are extensively applied to MOO problems from the real world [37–44]. However, EMO algorithms are considered computationally inefficient because they sample the space stochastically and require many function evaluations to reach a high-quality solution set. This inefficiency can be partly reduced if the algorithm is made to take advantage of some problem knowledge discovered, during or at the end of the optimization, from the PO or non-dominated solution set.

2.3 Innovization

While multiple PO solutions allow decision makers to choose a single preferred solution by making a comparative analysis of them, multiple PO (or high-performing) solutions may also provide users with valuable information about the problem being solved. These PO solutions may possess certain special properties that make them PO in the first place. For example, in a mathematically well-behaved MOO problem with convex, continuous and differentiable functions, the PO solutions must adhere to at least the Fritz-John or Karush-Kuhn-Tucker necessary conditions for Pareto optimality [9]. Even in real-world engineering design problems that lack the aforementioned regularities around optima, the PO solutions may still possess certain special patterns or "rules" that set them apart from non-optimal or even random solutions. The idea of learning from the PO solutions and deciphering design principles that are unique to the optimization problem was first presented in [10]. This concept is called innovization, and it is pictorially illustrated in Figure 2.3. Originally, the innovization task was conceived to be performed once at the end of an EMO run in order to decipher principles that are common to most or all PO solutions [10].

Figure 2.3: An illustration of the concept of innovization.

Since an innovization task involves learning from a set of equivalent trade-off solutions, it is appealing to couple this idea with EMO algorithms.
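To give a flavour of what such a learned principle looks like: automated innovization [11] searches for power-law relations of the form $x_i x_j^b = c$ among the trade-off solutions and judges a candidate rule by how nearly constant $c$ remains across them. The following sketch of such a constancy check is illustrative only; the data, the fixed exponent and the acceptance threshold are all invented:

    import statistics

    def rule_constancy(solutions, i, j, b):
        # For the candidate rule x_i * x_j**b = c, measure the spread of c
        # across the solution set; a value near zero means the rule holds
        cs = [s[i] * s[j] ** b for s in solutions]
        return statistics.stdev(cs) / statistics.mean(cs)

    # Invented trade-off solutions that (nearly) satisfy x0 * x1^2 = 2
    solutions = [(2.0, 1.0), (0.5, 2.0), (0.22, 3.01), (0.125, 4.0)]
    print(rule_constancy(solutions, i=0, j=1, b=2.0) < 0.05)   # True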
There are examples in the litera- ture, where the innovization task has been conducted on the results of an EMO algorithm to decipher interesting knowledge about the problem. For example, [45] uses innovization to discover new design principles in three engineering case studies. Another example [46] proposes a novel recommendation tool for software re-factoring based on innovization. [11] suggests a way to automate the finding of salient problem design principles for engineering problems. [13] uses the innovization task in a simulation based MOO problem and suggest different ways to couple the innovization with the EMO that can throw light on important principles of the problem. In recent literature [14], the idea of innovization has also been referred to as knowledge discovery in MOO problems. Before moving on to discuss some 13 EMOInnovizationproblemMOOStep IStep IIConvergedDesignPrinciplesFrontPOf2f1 examples of knowledge extracted from MOO problems, let us look at a taxonomy of the innovization procedures. 2.4 A Taxonomy of Innovization In the following sections, we will try to cover the various kinds of innovization procedures that exist in the literature. 2.4.1 Manual Innovization This was first proposed in [10] as a way of manually extracting innovative design principles from PO data of MOO design problems. As shown in Figure 2.3, it requires first solving a MOO problem and obtaining its PO set of solutions followed by manually plotting dif- ferent variables, objectives and constraints for PO solutions against each other one by one in 2-dimensional and 3-dimensional plots and then manually deciphering innovative design knowledge from the same based on the plots. 2.4.2 Automated Innovization Automated innovization was first proposed in [11] as a way to automate the labour of manual innovization but at the cost of fixing the form of rules to power law form. This approach has produced some interesting case studies in engineering design evident in [45]. Note that both manual and automated innovization are based on the approach shown in Figure 2.3. 14 Figure 2.4: An illustration of the concept of higher level innovization. 2.4.3 Higher Level Innovization Upon changing certain parameters of a MOO problem, it is possible that the shape or location or both may change for the PO front of the problem. Figure 2.4 illustrates this for a hypothetical problem where changing some parameter results in some shift in the PO front as well. In such a case, if an innovization study is performed for multiple fronts resulting from multiple parametric studies, it is possible that we may encounter certain rules or principles that remain invariant across the fronts. Such a procedure is called a higher level innovization procedure. Both [47, 48] present some good engineering design case studies of higher level innovization. 2.4.4 Lower Level Innovization Lower level innovization task is performed when we can divide the solutions of a problem into two sets, namely the preferred set and the not preferred set. Figure 2.5 shows a hypothetical case where the user may be interested in patterns that are present in the preferred set of 15 Parametervaluechangingf1(minimize)f2(minimize) Figure 2.5: An illustration of the concept of lower level innovization. solutions but not otherwise. A lower level innovization procedure learns the most discrimi- natory rules that completely adhered to by the preferred class of solutions but not so much by the not preferred class of solutions. 
[49] presents a case study of a lower level innovization procedure applied to the car side impact problem [50].

2.4.5 Temporal Innovization

Temporal innovization is an innovization procedure aimed at learning the relative hierarchy of design principles in an overall solution to a MOO problem. The design principles found in the PO front are searched for in all previous generations and their significance is recorded. When viewed over the timeline of all the generations of an EMO, it can be seen which principles evolve earlier in the solutions and which ones appear later. The ones that appear earlier are hierarchically more important than the ones that appear later in the evolution of solutions. The authors of [51] present an interesting analogy between the evolution of the design of a MEMS resonator using an EMO and the evolution of human beings.

2.5 Proposed Online Innovization

Figure 2.6: The concept of Online Innovization.

Figure 2.6 illustrates the idea of online innovization. An online innovization procedure has three components, namely:

1. the target data set(s) for extracting knowledge,
2. the type of knowledge representation being extracted, and
3. how to use the extracted knowledge to expedite the convergence of the EMO algorithm.

A schematic sketch of the overall loop is given below; the following subsections then look at each of these aspects in turn.
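The sketch fixes only the order of the three steps within an otherwise generic EMO generation loop. It is a schematic, not the exact EMO/I algorithm of Part II; every operator here (variation, evaluation, survival selection, non-dominated filtering, rule learning and repair) is an assumed callable supplied by the caller:

    # Schematic of the online innovization loop of Figure 2.6; all operators
    # are placeholders passed in by the caller, not a specific implementation.
    def online_innovization_emo(pop, n_gen, variation, evaluate, survival,
                                nondominated, learn_rules, repair,
                                learn_every=10):
        rules = []
        for gen in range(n_gen):
            offspring = variation(pop)                  # recombination + mutation
            if gen > 0 and gen % learn_every == 0:
                target = nondominated(pop)              # 1. target data set
                rules = learn_rules(target)             # 2. knowledge extraction
            # 3. apply the knowledge; repair(x, []) should return x unchanged
            offspring = [repair(x, rules) for x in offspring]
            evaluate(offspring)
            pop = survival(pop + offspring)             # elitist selection
        return pop, rules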
Explicit knowledge on the other hand has crisp mathematical form and can be interpreted by humans unambiguously. Furthermore, it can be implemented as a computer program [55]. 18 A couple of examples of explicit form of knowledge representation in MOO problems are decision trees [20] and regression rules [21]. Kernel based Support vector machines [56], artificial neural networks [57, 58] are other examples but rules learned through them are not very amenable to easy human interpretation. [59] applies classification rule mining [60] to extract rules of the form A −→ B where A is a design feature and B is a class-label that determines the quality of the design (e.g. non-dominated). However the rules in disjunctive normal form combining many design features and permutations thereof become difficult to interpret by human unless we limit the maximum number of features allowed in such rules. [11, 61] have developed a custom unsupervised learning algorithm based on a genetic algorithm to learn power law relations from the data of PO solutions of a MOO problem. This method does produce very interpretable rules in power law form, however it may fail to capture knowledge if it exists in any other form, say a rule with addition or substration operations. [62] employs classification trees to extract decision rules that distinguish between dominated and non-dominated solutions. These rules also stay easy to interpret only until the number of levels in the decision tree stays under five. In this work, we have decided to pursue knowledge of the following two explicit forms: ˆ Power law form and ˆ Free form or algebraic expressions containing {+,−,×,÷} operations on operands. We call these mathematical expressions as “rules” and target them as they are amenable to consistent human “interpretation”. Although we could not find a mathematical definition of interpretability, [22] defines it as the degree to which a human can consistently predict a model’s result. For example, a model having power law structure in terms of design variables, which are common in the problems from engineering and physics [23], are easier to interpret 19 compared to a more accurate model based on, say deep neural networks [24]. Christoph [25] presents a nice case for the need of interpretability in machine learning models. Although [11,61] have used a custom unsupervised learning algorithm to learn power law relations, such a method can be computationally very expensive, given that power laws can be learned using a combination of log-linear modeling along with ordinary least square regression method [63]. Such a method can be applied repeatedly without much cost, however we need to find the transition points in the MOO set via some other method. A transition point in a PO front is a point across which there is a quantitative change in the nature of rules that are adhered to by the PO solutions [10]. Encountering a constraint boundary is a common cause for appearance of transition points in PO fronts for MOO problems from engineering. Furthermore, we have also developed a custom bi-objective genetic programming (GP) method that can be used for learning simple “free form rules from data involving only +,−,×,÷ operations. This is because adding rule parsimony as another objective in GP [64,65] has been known to control the problem of bloating [66] when using GP for regressing rules from data. 
Furthermore, we designed our GP to be able to learn the constants/coefficients involved in the rules accurately without the constants being provided by the user. This capability is otherwise only known to be present in a commercial software named Eureqa [67], which is also a GP-based product. For target rules involving only {+, -, ×, ÷} operations, our GP is quite competitive with the Eureqa software. In addition, our GP has the further capability of conducting dimensional consistency checks on the obtained rules.

2.5.3 Using Extracted Knowledge to Expedite the Convergence

There are EAs that are hybridized with machine learning methods to perform knowledge extraction from a set of solutions and knowledge application to affect the convergence, for example Bayesian optimization [68, 69] and Estimation of Distribution Algorithms [70]. The probabilistic models constructed and updated in these algorithms guide the algorithm on how to sample the decision space to improve the objective values. However, the constructed probabilistic models are relatively difficult for humans to interpret compared to, say, algebraic expressions. The authors of [29] first discover the salient principles by performing a manual innovization task on the solutions found at the end of an EMO run and then use those principles as a heuristic for local search, obtaining a faster convergence than the EMO applied alone. [71] suggests learning the 'innovized' rules using decision trees and then adding the learned rules as if-then-else statement type constraints during the EMO run: it learns distance-based regression trees to extract rules differentiating between solutions near and far from the PO region of interest, and introduces these regression trees as constraints to guide the EMO algorithm towards the region of interest. Learnable Evolution Models (LEM) [72] generate multiple logical rules that relate specific design features to the high-quality solutions in the population. The rules are combined into one logical sentence in disjunctive normal form, and future solutions must satisfy the sentence by conforming to at least one of the rules. [59] uses adaptive operator selection to track the efficacy of each evolutionary operator and each manually introduced knowledge-dependent operator at consistently creating improved solutions, and focuses on using the most effective ones via a credit assignment strategy.

Since we are learning rules in the form of mathematical expressions, we try to use them for directly repairing/modifying solutions and thereby possibly expedite the convergence of the EMO algorithm. Furthermore, we have investigated different repair strategies for rule and repair-variable selection. We will learn more about this in Part-II of this dissertation.

2.6 Dimensional Awareness in Rule Learning

For rules to be physically meaningful, especially the ones coming from problems in engineering and physics, the rules must at least adhere to the law of dimensional homogeneity [26]. The first work to address the issue of learning dimensionally consistent rules from data using Genetic Programming (GP) was [73]. The authors modified the definitions of the terminal set and function set to include the physical dimensions of the quantities, such as length, mass and time. Furthermore, an additional function DimTransform() is defined to guarantee closure.
In addition to the goodness-of-fit objective, an additional objective, "goodness of dimension", is introduced to reduce the number of applications of the DimTransform() function necessary to make dimension-based repairs. Figure 2.7 shows how the DimTransform() function is applied with an example. Here the input tree is adding two dimensionally inconsistent terms, namely "F" with physical dimensions M^1 L^1 T^-2 and another quantity "m" with dimensions M^1 L^0 T^0. The DimTransform() function multiplies the term "m" with a random physical constant "c" having physical dimensions M^0 L^1 T^-2 and a numeric value randomly chosen from the set of allowed random terminal constants. Consequent to this dimension matching operation, the transform adds a dimensional penalty of three units to the output tree. The work showed that the addition of dimension information as another objective produced more parsimonious results instead of results having high fitness and low interpretability. Nevertheless, the algorithm did not perform well when exact scientific constants (e.g. 9.81 m/s^2) were not provided to the algorithm.

Figure 2.7: An example of the DimTransform() function.

[74] is another work that tries to address the issue of finding meaningful rules using dimension information, based on the idea of grammar-based GPs [75, 76]. The authors do so by enforcing dimensional constraints through formal grammars in the GP tree construction and manipulation rules. Furthermore, the authors also suggest a way of initialization such that feasible trees can be generated within a pre-specified maximum tree depth. This approach is impractical for real problems because of the size of the grammar that needs to be defined even for problems of small size. As an example, consider three elementary units, mass (M), length (L) and time (T), to be present in the variables. Further assume that the powers of these basic units are restricted to the integer set {-2, -1, 0, 1, 2} and no fractional powers of basic units are allowed. Thus one needs to exclude operators that yield fractional powers, e.g. the square root operator. Using a restricted function set of {+, -, ÷, ×}, a full specification of the grammar requires 5^3 = 125 symbols, each having many rules. For example, the symbol S_{0,0,0}, representing expressions which are dimensionless, can be obtained with the multiplication function using the following 125 rules:

<S_{0,0,0}> ::= <S_{0,0,0}> × <S_{0,0,0}>
             | <S_{1,0,0}> × <S_{-1,0,0}>
             | <S_{2,0,0}> × <S_{-2,0,0}>
             | ... [122 more such definitions for multiplication.]

Correctly so, the authors of [77] have termed this context-free-grammar approach unfit or impractical for modeling the "units of measurement" or uom system. It is very difficult to produce a feasible initial population with so many constraints, even with a limited set of allowed exponents.

The authors of [78] suggest using a set of dimensionless quantities as the terminal set of the GP. From an initial set of variables, they generate a set of normalized variables which are dimensionless and then use those as part of the terminal set of the subsequent GP. This method, although guaranteed to produce dimensionally consistent expressions, is impractical when the number of variables of interest is large. In such a case, finding a set of dimensionless quantities that can be produced from those variables of interest with elementary operations such as multiplication or division is not trivial and itself becomes an optimization problem.
The authors of [79] try to address the dimensional consistency of GP individuals (trees) in the following way. They first assume an upper bound, u (say), on the largest possible exponent of any fundamental unit (length, mass or time). Then, for any tree with maximum depth D, the maximum exponent of any fundamental quantity can be u × 2^D. For each fundamental quantity, they then evaluate the resultant exponent, which will exceed u × 2^D for trees combining dimensionally incommensurate quantities. Such trees become infeasible in the subsequent optimization procedure used to induce rules from data. However, such a penalty approach makes it almost impossible to find rules having addition or subtraction operations, which are the only operations that distinguish them from power law rules. This is the reason that, out of the forty rules found by the author over three engineering design problems, only one rule has an addition operation.

In the rest of the chapters, we will focus on applying the idea of online innovization to MOO problems from the engineering design domain, where the problem variables are continuous in nature. In the following chapters, we will see how we can:
- Part-I: learn mathematically explicit rules from data that are simple to interpret, and
- Part-II: use them to make variable repairs to EMO population members to bring them closer to the PO front faster.

This concludes our literature survey chapter.

Part I
Rule Learning

Chapter 3
Learning Fixed Form Rules

3.1 Introduction

A MOO problem being solved by an EMO or any other population-based optimization algorithm generates a myriad of data related to the design variables, objective functions and constraint values/violations. Recall from Section 2.5.2 that we are interested in discovering knowledge from this data that can be expressed in the form of mathematical algebraic expressions, which we will refer to as rules in this work. Thus we are looking for rules of the form

ψ(x, f(x), g(x)) = c,   (3.1)

where x is the set of design variables, f(x) are the objective functions, g(x) are the set of constraint functions for the problem, and c is some constant. We can re-write Equation (3.1) as

ψ(x_1, ..., x_{n_x}, f_1, ..., f_{n_f}, g_1, ..., g_{n_g}) = c.   (3.2)

Borrowing the terminology from [11], let us call each design variable and any function of the design variables (including objective functions and constraint functions) a basis function, and let us represent a basis function using the symbol 'φ'. Thus, we can re-write Equation (3.2) as

ψ(φ_1, φ_2, ...) = c.   (3.3)

Figure 3.1 shows a typical scenario of data availability while solving a MOO problem of two objectives, three variables and one inequality constraint.

Figure 3.1: Data availability when solving a MOO problem using an EMO algorithm.

In some MOO problems, the user may want to use additional basis functions other than the ones included in Equation (3.1), e.g. some composite function of the first objective and a constraint. This is acceptable and would require us to calculate this additional basis function during the optimization run. We will focus our attention on the general case of including only variables, objectives and constraints in our study. An underlying assumption in applying these methods is that the values of each basis function are always non-negative, which is generally the case for MOO problems from the engineering design domain. Note that the methodology to learn the two types of rules being presented in this chapter already exists in the literature. We are presenting them for completeness, to maintain coherency of the thesis structure, and for their suitability for computationally efficient application to MOO engineering design problems.
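To make the data layout of Figure 3.1 concrete, the following minimal Python sketch assembles a basis-function matrix from a target data set; the function name and array shapes are illustrative assumptions, not part of the original text, while the sample numbers mirror the four solutions shown in Figure 3.1.

    import numpy as np

    def basis_matrix(X, F, G):
        """Stack variables, objectives and constraints column-wise so that
        column j holds the values of basis function phi_{j+1} for every
        solution in the target data set."""
        # X: (n, n_x) variables, F: (n, n_f) objectives, G: (n, n_g) constraints
        return np.hstack([X, F, G])

    # Hypothetical target data set: 4 solutions, 3 variables,
    # 2 objectives, 1 constraint (values as in Figure 3.1).
    X = np.array([[0.2, 0.8, 0.8], [1.0, 0.1, 1.0],
                  [1.0, 0.4, 0.7], [0.5, 0.9, 0.0]])
    F = np.array([[4.5, 2.8], [6.5, 6.8], [7.1, 6.6], [7.5, 1.6]])
    G = np.array([[1.2], [5.0], [9.6], [3.4]])
    Phi = basis_matrix(X, F, G)   # Phi[:, j] holds phi_{j+1}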
3.2 Learning Constant Rules

We call a rule of the form

ψ(φ_1, φ_2, ..., φ_{n_φ}) ≡ φ_i = c   (3.4)

a "constant" rule, where φ_i is some basis function from the set of n_φ basis functions being considered for rule learning and c is some constant. These are among the most commonly encountered design rules in engineering MOO problems. Examples of a constant rule include a variable or a constraint reaching its boundary value and staying the same across a set of high-quality solutions.

3.2.1 Estimating Rule Parameters and Quality

In Equation (3.4), there is only one parameter, c, to be estimated. Consider a hypothetical MOO problem which requires its PO solutions to have some variable x_1 = 0.5. Let x_1 be named the first basis function, φ_1. Figure 3.2 shows this scenario as the EMO algorithm progresses through the optimization run. Since EMO algorithms work with stochastic sampling, the non-dominated front at the end of an EMO run is very likely not the PO front of the problem but an approximation of it. EMO algorithms rarely reach the PO front of MOO problems unless they are followed by some local search from a set of non-dominated solutions which are close to the PO solutions. Hence, the value of x_1 cannot be expected to be exactly 0.5 for all solutions at the end. Figure 3.2 reflects this: the values of the variable x_1 can be seen to move, on average, closer and closer to the value of 0.5, but they are never exactly 0.5.

Figure 3.2: Values of the basis function φ_1 = x_1 concentrating around 0.5 as the EMO generations progress (e.g. 0.6, 0.2, 0.8, 0.3 in the starting generation versus 0.51, 0.50, 0.49, 0.52 in the last generation).

Hence, the value of the parameter c in Equation (3.4) can be estimated using the mean of basis function φ_i as

ĉ = μ_{φ_i},   (3.5)

which in this example is just the mean of variable x_1 over the target data set. Furthermore, we use the coefficient of variation, C_v, of the φ_i values of the target data set to assess the quality of a constant rule. The coefficient of variation of a data set is defined as

C_v = σ / μ,   (3.6)

where σ is the standard deviation and μ is the mean of the data. The coefficient of variation is a dimensionless quantity and is useful for comparing the variability of samples drawn from distributions with different units. For example, the C_v of the weights of a group of students will be the same whether the data is in kilograms or pounds. The lower the value of C_v for the data of some basis function, the closer it is to being called a constant rule.
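As a sketch of how Equations (3.5) and (3.6) could be applied in code (a minimal illustration, not the dissertation's implementation; the acceptance threshold is an assumption):

    import numpy as np

    def constant_rule(phi, cv_threshold=0.05):
        """Estimate c-hat for the candidate constant rule phi_i = c and
        judge its quality via the coefficient of variation."""
        c_hat = phi.mean()               # Eq. (3.5): c-hat = mean of phi_i
        cv = phi.std(ddof=0) / c_hat     # Eq. (3.6): C_v = sigma / mu
        return c_hat, cv, cv < cv_threshold

    # x1 values from a late generation, clustered near 0.5 (cf. Figure 3.2)
    phi1 = np.array([0.51, 0.50, 0.49, 0.52])
    c_hat, cv, is_constant = constant_rule(phi1)

The division by the mean is safe under the stated assumption that basis function values are non-negative (and, for a candidate constant rule, bounded away from zero).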
Let us now look at the other type of rule we focus on in this chapter, i.e. power law rules.

3.3 Learning Power Law Rules

A power law rule involving basis functions from a MOO problem can be represented as

ψ(φ_1, φ_2, ..., φ_{n_φ}) ≡ ∏_{i=1}^{n_φ} φ_i^{b_i} = c,   (3.7)

where n_φ is the number of basis functions being considered for learning a power law rule from data, b_i is the exponent of the i-th basis function and c is some constant. Equation (3.7) is a multi-variate power law. Such laws are very common in many natural phenomena [23] and have been commonly found in MOO problems from engineering design [10, 11, 45, 80, 81]. By using such a universally occurring functional form for the design principles, it is expected that most correlations between the various basis functions can be captured.

3.3.1 Estimating Parameters

Given the data for n_φ basis functions, if we want to learn a multi-variate power law relation involving all k basis functions, we can use log-linear modeling followed by the ordinary least squares regression method [82]. If

φ_1^{b_1} · φ_2^{b_2} ··· φ_k^{b_k} = c,   (3.8)

then, taking the log of both sides,

b_1 log φ_1 + b_2 log φ_2 + ··· + b_k log φ_k = log c,   (3.9)

which is a linear equation. If log φ_1 is chosen as the regressand and the others as regressors, then

log φ_1 = (-b_2/b_1) log φ_2 + (-b_3/b_1) log φ_3 + ··· + (-b_k/b_1) log φ_k + (log c)/b_1
        = β̂_2 log φ_2 + β̂_3 log φ_3 + ··· + β̂_k log φ_k + γ̂.   (3.10)

The parameters β̂_2, β̂_3, ..., β̂_k and γ̂ are estimates returned by ordinary least squares linear regression (OLSR). OLSR also returns the R²_adj value, which can be used as a metric signifying the quality of such a learned rule. Of course, we can choose any logarithmic term in Equation (3.9) as the regressand and the rest as regressors. Say the i-th log term in Equation (3.9) is chosen as the regressand for applying OLSR. In that case, Equation (3.10) can be re-written as

log φ_i = Σ_{j=1, j≠i}^{k} β̂_j log φ_j + γ̂,   (3.11)

which, upon taking the anti-log of both sides, becomes

φ_i = e^{γ̂} · ∏_{j=1, j≠i}^{k} φ_j^{β̂_j}.   (3.12)
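As an illustration of Equations (3.8)-(3.12), here is a minimal sketch of the log-linear fit (assuming strictly positive basis-function data; variable names and the synthetic example are assumptions for illustration):

    import numpy as np

    def power_law_fit(Phi, i=0):
        """Fit phi_i = e^gamma * prod_j phi_j^beta_j by OLS on log-data,
        following Eqs. (3.9)-(3.12). Phi is (n, k) with positive entries."""
        L = np.log(Phi)
        y = L[:, i]                                  # regressand: log phi_i
        Z = np.delete(L, i, axis=1)                  # regressors: log phi_j, j != i
        Z1 = np.hstack([np.ones((len(y), 1)), Z])    # prepend intercept column
        coef, *_ = np.linalg.lstsq(Z1, y, rcond=None)
        gamma_hat, beta_hat = coef[0], coef[1:]
        return np.exp(gamma_hat), beta_hat           # e^gamma and the exponents

    # Synthetic check: phi_1 = 2.0 * phi_2^0.5 * phi_3^-1
    rng = np.random.default_rng(0)
    phi2, phi3 = rng.uniform(1, 10, 100), rng.uniform(1, 10, 100)
    phi1 = 2.0 * np.sqrt(phi2) / phi3
    c, b = power_law_fit(np.column_stack([phi1, phi2, phi3]), i=0)
    # c ~= 2.0 and b ~= [0.5, -1.0] up to numerical precision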
3.3.2 Comparison with Automated Innovization

The authors of [11] developed an unsupervised machine learning method, combining a grid-based clustering method [83] with a genetic algorithm, to learn rules of the forms given by Equations (3.4) and (3.7). As part of automated innovization (see Section 2.4.2), the authors of [11] needed to learn such rules only once, at the end of a MOO task performed by an EMO algorithm. However, in the case of online innovization (see Section 2.5), we may have to learn such rules multiple times while a MOO task is in progress. If we used the method suggested in [11] to learn these rules, it would be computationally very inefficient because of an EA (for learning rules from the MOO problem data) nested inside an EMO algorithm (for solving the MOO problem). Although our method is computationally much faster at obtaining power law rules than automated innovization, it holds a disadvantage as well: it cannot identify rules that are applicable to only part of the PO front, say 50% of it, rather than the entire front. We will come back to this point in Section 7.2, where we have tried to address this shortcoming.

3.3.3 Learning Multiple Rules Simultaneously

It is possible that the non-dominated solutions of a MOO problem carry multiple design rules simultaneously. For example, not only is some constraint g_1 at a constant upper bound, but two variables x_1 and x_2 are also related in a power law form. Using the method given in Section 3.3, we can efficiently learn as many rules as our computational budget allows.

Consider the MOO problem shown in Equation (2.1) having n_f objectives, n_x design variables and n_g inequality constraints. If we consider only variables, objective functions and constraint functions as basis functions, we have a minimum of n_φ = n_x + n_f + n_g basis functions for learning rules. With n_φ basis functions, we exhaustively have 2^{n_φ} - 1 candidate constant and power law rules for which we would have to estimate the parameters and quality. This number of candidate rules can become prohibitively large very quickly. Furthermore, the interpretability and usefulness of such design rules for an engineer quickly diminish as the number of basis functions in a rule increases. For these reasons, we limit ourselves to learning rules with a maximum of two or three basis functions.

This chapter has covered the rule forms that are commonly encountered in MOO engineering design problems and whose quality can be ascertained quickly. However, we are also interested in learning rules where the form is not constrained a priori. In the next chapter, we will look at an efficient way of learning "free form" non-linear rules from data.

Chapter 4
Learning Free Form Rules Using a Symbolic Regression Task

In Chapter 3, we saw how we can efficiently learn rules of two fixed forms, constant rules and power law rules, from MOO data. These two forms are commonly encountered in MOO engineering design problems and can be learned quickly from data. In this chapter, however, we wish to expand our scope of knowledge extraction to rules with a weaker set of constraints limiting their form. Yet we want to stay in the territory of mathematical/algebraic expressions that are simpler to interpret than, say, some kernel-based rules [84], and that still have the ability to capture complex physical behavior.

4.1 Form of Rules

We are focusing on learning rules of the form

ψ(φ_1, φ_2, ..., φ_{n_φ}) ≡ φ_i = w_0 + Σ_{j=1, j≠i}^{n_t} w_j · t_j,   (4.1)

where ψ represents one such rule; the φ are the basis functions explained in Section 3.1 that represent the data from the MOO problem; φ_i is a basis function that can either be the regressand in a symbolic regression problem or a class label in a classification problem; w_0 and the w_j are numerical weights; n_t is the number of terms in the rule; and each term t_j is some function of the φ_j, j ∈ {1, 2, ..., n_φ} \ i, using the operations {+, -, ×, ÷}. Such rules are good candidates for learning using genetic programming [35].

4.2 A Primer on Genetic Programming

A GP can be considered an application of GAs in which the space of solutions to search consists of programs or equations for solving a task [27, 35, 85]. Instead of decision variables, an individual is a program or an equation that solves the task. In order to create an initial population, a terminal set T and a function set F must be pre-specified. A terminal set consists of operands (constants and variables), whereas a function set consists of operators or basic functions. Like other evolutionary computation techniques, a typical GP starts with a population of randomly created individuals, which in this case are mathematical expressions. The fitness of each individual is determined by evaluating the rules. High-fitness individuals are selected to form the mating pool, on which the primary genetic operations, namely crossover and mutation, are applied to create a new population of programs. The process is repeated until some termination criterion (such as a maximum number of generations) is met. An individual in a GP can be represented using different data structures, such as strings of words [86], trees [87] or graphs [88]. We will be using the tree data structure to represent a rule; hence we will discuss a few important concepts in the context of tree-based GPs.
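As a minimal sketch of this tree representation (an illustration, not the dissertation's implementation), using the terminal and function sets of the example that follows:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """One node of a GP expression tree: an operator from the function
        set with two children, or a terminal (constant or variable)."""
        symbol: str
        children: list = field(default_factory=list)

        def evaluate(self, env):
            ops = {'+': lambda a, b: a + b,
                   '-': lambda a, b: a - b,
                   '*': lambda a, b: a * b}
            if self.symbol in ops:
                left, right = (c.evaluate(env) for c in self.children)
                return ops[self.symbol](left, right)
            # Terminal: a variable looked up in env, or a numeric constant.
            return env[self.symbol] if self.symbol in env else float(self.symbol)

        def size(self):
            """Number of nodes; used later as the complexity measure."""
            return 1 + sum(c.size() for c in self.children)

    # y(x) = 1 + x, the first tree of Figure 4.1
    tree = Node('+', [Node('1'), Node('x')])
    assert tree.evaluate({'x': 3.0}) == 4.0 and tree.size() == 3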
Consider a GP with terminal set T = {1, 2, x} and function set F = {+, -, *}. Figure 4.1 shows two candidate solutions, y(x) = 1 + x and y(x) = 2x² - 2x + 1, that belong to the set of valid GP individuals for such a GP.

Figure 4.1: Example of two GP solutions using the tree representation.

Furthermore, sub-tree swap crossover [89] is a popular crossover mechanism used in tree-based GPs; an example is shown in Figure 4.2. A sub-tree to be exchanged between two GP individuals (parents) is first chosen at random in each parent. The sub-tree crossover operation is then completed by exchanging the chosen sub-trees between the two parents. A Koza-style sub-tree mutation [35] involves swapping either a terminal with another element from the terminal set or a function with another element from the function set. Figure 4.3 shows an example of sub-tree mutation of an individual. When swapping functions, care must be taken that the arities of the functions being swapped are the same.

Figure 4.2: An example of a sub-tree crossover between two GP individuals (parents y(x) = 2x² - 2x + 1 and y(x) = 1 + x producing children y(x) = 2x² + 1 and y(x) = x).

Figure 4.3: An example of a Koza-style sub-tree mutation of a GP individual (mutating y(x) = 2x² + 2 into y(x) = 2x² + 1).
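Continuing the Node sketch from above, a sub-tree swap crossover could look like the following (illustrative; the random selection of crossover points is simplified):

    import copy
    import random

    def collect(node, out):
        """Depth-first list of all nodes in a tree."""
        out.append(node)
        for child in node.children:
            collect(child, out)
        return out

    def subtree_crossover(parent1, parent2, rng=random):
        """Swap one randomly chosen sub-tree between deep copies of the
        parents, as in Figure 4.2."""
        c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
        n1 = rng.choice(collect(c1, []))
        n2 = rng.choice(collect(c2, []))
        # Exchanging the contents of the chosen nodes in place swaps the
        # two sub-trees rooted at them.
        n1.symbol, n2.symbol = n2.symbol, n1.symbol
        n1.children, n2.children = n2.children, n1.children
        return c1, c2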
4.3 A Custom GP (CGP)

GPs are used extensively for inducing mathematical models from observations. Currently, two of the most popular commercial products for inducing mathematical models from data, Eureqa [67] and DataModeler [90], are GP-based products. We developed our own customized GP, or CGP, by experimenting with and combining many different ideas, from the literature as well as our own, to improve its performance. Some of these ideas exist in the literature; however, they have not been tried all together in one GP implementation. This section describes the important ideas implemented in CGP, along with a flowchart description towards the end.

4.3.1 Two Objectives: Prediction Error and Rule Complexity

The designer of any classification or regression technique faces a dilemma: whether to learn rules that are accurate, or rules that are simple for human interpretation [91]. Prediction error and rule complexity are conflicting objectives, and we usually decide (often unknowingly) a priori which one is more important for us by choosing a method. For a given problem, it is critical to have a clear idea of which is the priority, accuracy or interpretability, so that this trade-off can be made explicitly rather than implicitly. Figure 4.4 shows this typical trade-off when selecting a machine learning method.

We decided to minimize both simultaneously by basing our GP on the bi-objective optimization algorithm NSGA-II [92]. Furthermore, having more than one objective in a GP is known to reduce the problem of bloat [93]. The complexity of a rule can be defined in many ways [94]; we define it as the number of nodes in the binary tree needed to represent the rule.

Figure 4.4: The two conflicting objectives, prediction error (minimize) and rule complexity (minimize), in any machine learning algorithm; methods such as linear regression, linear SVMs and decision trees sit at the simple end, while random forests, kernel-based regression and deep neural nets sit at the complex end.

4.3.2 Using Multiple Small Trees

As the complexity of the rules to be learned increases, so does the size of their binary tree representation. This tree representation, which resides in the phenotype space of the GP, is the object that undergoes the important crossover and mutation operations. If the size of this tree is big, the possibility of a beneficial crossover reduces drastically. For this reason, [95] suggested having multiple depth-limited small binary trees as the phenotype of a GP individual instead of a single big tree, and called the resulting GP a multi-gene genetic program, or MGGP. Unlike traditional GP, each symbolic model (and each member of the GP population) in MGGP is a weighted linear combination of the outputs of a number of GP trees. Each of these trees may be considered a "gene". Figure 4.5 shows an example of a multi-gene GP individual. This fits very nicely with the type of rules shown in Equation (4.1) that we are aiming to learn.

Figure 4.5: An MGGP individual, e.g. y = w_0 + w_1 · (x_1 × x_2) + w_2 · (x_1 + x_2²), composed of several small binary expression trees (genes) instead of one big binary expression tree.

MGGP, however, treats the problem as a single-objective optimization problem and faces the problem of horizontal bloat [95]. Horizontal bloating is the tendency of multi-gene models to acquire genes that are either performance-neutral (i.e. deliver no improvement in prediction error on the training data) or offer only very small incremental performance improvements. In the case of CGP, the bi-objective formulation helps allay this issue.

4.3.3 Learning Weights Using a Faster Method

Unlike traditional GPs, which try to learn the constants/coefficients/weights of a rule along with the rule structure, we separate that task completely from the GP. Figure 4.6 shows a sample individual with multiple genes. Like MGGP [95], we let evolution optimize the structure of the rule, but we use a classical optimization method, such as OLSR or linear support vector machines, to learn the weights in the rule.

Figure 4.6: Our GP learns the rule structure (with a custom bi-objective GP) and the rule weights (with a classical method, e.g. regression or SVM) separately.

4.3.4 Diversity Preserving Mechanism

In the bi-objective problem formulation of Section 4.3.1, the rule complexity objective is discrete in nature. This highly discretized objective space causes good individuals encountered early in a CGP run to be reproduced and selected multiple times, producing many copies of these early good solutions. This leads to the population reducing to a few individuals and their duplicates, and thus to a complete loss of diversity and premature convergence of the algorithm [93]. To counter this, we need to adopt a strategy that can:
- penalize all but one copy of each solution having duplicates (in both objectives), so that the ranking of the duplicate solutions worsens; and
- push the non-domination rankings of duplicate solutions higher through this penalization, while ensuring that the penalized duplicate(s) of a solution with a lower non-domination rank continue to have a lower non-domination rank than the penalized duplicate(s) of a solution with a higher non-domination rank.

Figure 4.7 illustrates this idea with the help of a hypothetical example.

Figure 4.7: Fitness adjustment by penalizing duplicate solutions to promote population diversity in CGP. (A) shows the original state, with two non-dominated fronts and a total of seven solutions. (B) Solution-4, a duplicate of solution-3, after penalization.
(C) Solution-6, a duplicate of solution-5, after penalization.

Part-A of Figure 4.7
- Part-A of Figure 4.7 shows the starting state of the solutions. There are a total of seven solutions, including two duplicates.
- Solutions 1, 2, 3, 5 and 7 are shown with black dots. Solution-4, which is a duplicate of solution-3, is shown with a blue circle. Similarly, solution-6, which is a duplicate of solution-5, is shown with a red circle.
- The green dot represents the worst objective vector [31].
- The non-dominated front F1 comprises solutions {1, 2, 3, 4}. The non-dominated front F2 comprises solutions {5, 6, 7}.
- We begin by penalizing solution-4, the duplicate solution present in front F1. Let solution-4 be represented by the objective vector (f_1^{(4)}, f_2^{(4)}) and the worst objective vector by (f_1^{(w)}, f_2^{(w)}). Then the updated objective vector of solution-4 is given by

(f_1^{(4)}, f_2^{(4)}) <- (f_1^{(4)} + f_1^{(w)}, f_2^{(4)} + f_2^{(w)}).   (4.2)

This results in the creation of another non-dominated front, F3, for the updated solution-4.

Part-B of Figure 4.7
- Part-B of Figure 4.7 shows the updated state of the solutions, where updated solution-4 now belongs to a new non-dominated front F3 and the worst objective vector is updated to the penalized objective vector of solution-4, i.e.

(f_1^{(w)}, f_2^{(w)}) <- (f_1^{(4)}, f_2^{(4)}).   (4.3)

- The next solution to be penalized is solution-6 (shown with a red circle), which is a duplicate of solution-5 (shown with a black dot). As in the case of solution-4, solution-6 is penalized as

(f_1^{(6)}, f_2^{(6)}) <- (f_1^{(6)} + f_1^{(w)}, f_2^{(6)} + f_2^{(w)}).   (4.4)

This results in the creation of another non-dominated front, F4, for the updated solution-6.

Part-C of Figure 4.7
- Part-C of Figure 4.7 shows the updated state of the solutions, where penalized solution-4 now belongs to the new non-dominated front F3, penalized solution-6 now belongs to the new non-dominated front F4, and the worst objective vector is updated to the penalized objective vector of solution-6, i.e.

(f_1^{(w)}, f_2^{(w)}) <- (f_1^{(6)}, f_2^{(6)}).   (4.5)

- Note that the relative non-domination ranking of solutions 4 and 6, both before and after penalization, remains the same. Thus the non-domination hierarchy of the penalized solutions is preserved by this method.
- At this stage, no more duplicate solutions are present in any of the non-dominated fronts, which prompts the penalization algorithm to stop.

This penalization of duplicate solutions helps reduce the clout of duplicates of good solutions, and gives poorer solutions in higher non-dominated ranks a better chance of surviving and participating in the parent selection of the evolutionary algorithm.
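A compact sketch of this duplicate-penalization step (an illustration of Eqs. (4.2)-(4.5); the tuple representation, the per-objective initialization of the worst vector, and the rank-sorted input are assumptions, not the dissertation's implementation):

    def penalize_duplicates(objs):
        """Add the running worst objective vector to every duplicate
        (all copies beyond the first). `objs` is a list of (f1, f2)
        tuples, assumed sorted by non-domination rank so that better-
        ranked duplicates are penalized first."""
        worst = (max(f1 for f1, _ in objs), max(f2 for _, f2 in objs))
        seen, out = set(), []
        for f1, f2 in objs:
            if (f1, f2) in seen:                       # a duplicate: penalize it
                f1, f2 = f1 + worst[0], f2 + worst[1]  # Eq. (4.2)/(4.4)
                worst = (f1, f2)                       # update worst, Eq. (4.3)/(4.5)
            else:
                seen.add((f1, f2))
            out.append((f1, f2))
        return out

Because the worst vector is updated after each penalization, a later-penalized duplicate always lands on a worse front than an earlier-penalized one, preserving the non-domination hierarchy described above.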
4.3.5 Higher and Lower Level Crossovers

Recall from Section 4.3.2 that the genotypic space of CGP consists of many small trees, or genes. Borrowing the idea from [95], CGP uses two types of crossovers, namely a low-level crossover and a high-level crossover. Any two parent individuals chosen to reproduce undergo a crossover with probability p_c. With the (preferably small) remaining probability, the individuals do not go through a crossover operation, and the outcome is two child individuals that are identical copies of their parents. When a crossover does happen, it can be either of the high-level type, with probability p_ch, or of the low-level type, with probability p_cl = 1 - p_ch.

Consider two individuals from the CGP population, having three and two terms respectively, as shown in the left half of Figure 4.8. For a high-level crossover to occur between these two individuals, CGP randomly chooses one term from each individual and then swaps the chosen terms between the individuals to create two children. This process is shown pictorially in Figure 4.8, where the second term of parent-1 is swapped with the second term of parent-2.

Figure 4.8: Example of the high-level crossover used in CGP.

If a low-level crossover is to be carried out, CGP first chooses one term from each parent and then carries out a sub-tree crossover between those two terms. This process is shown in Figure 4.9.

Figure 4.9: Example of the low-level crossover used in CGP.
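A minimal sketch of the high-level (gene-swap) crossover, treating an individual as a list of gene trees (illustrative names; builds on the Node sketch above, which also provides the sub-tree crossover used at the low level):

    import copy
    import random

    def high_level_crossover(genes1, genes2, rng=random):
        """Swap one randomly chosen gene (a whole tree) between two
        multi-gene individuals, as in Figure 4.8."""
        c1, c2 = copy.deepcopy(genes1), copy.deepcopy(genes2)
        i, j = rng.randrange(len(c1)), rng.randrange(len(c2))
        c1[i], c2[j] = c2[j], c1[i]
        return c1, c2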
4.3.6 CGP Flowchart for Rule Learning

Figure 4.10 shows the flowchart for the rule-learning part of CGP. Except for one block, the penalization of duplicates, everything else is the same as in the NSGA-II algorithm. The algorithm begins with the initialization of a population, say of N individuals, composed of tree structures as explained in Section 4.3.2, each with no more than n_t trees. Each individual represents a rule of the form given in Equation (4.1). The maximum depth [96] of each tree, say d_max, is also specified at the time of initialization. The fitness functions are then invoked to evaluate both the prediction error on the training set and the complexity objective for the entire initial population. These individuals are then assigned non-domination ranks and crowding distances [31], just as in NSGA-II [92].

Figure 4.10: The flowchart for the Custom Genetic Program, or CGP.

Once this parent population is ranked, the parent selection process produces a list of parents that are allowed to reproduce children for the next generation. CGP uses binary tournament selection [31] for selecting parents; such a parent selection process lets the fittest individuals in the population mate more often. Once these parents are selected, they go through the genetic operations of crossover and mutation to produce a child population of N individuals. The crossover operation transpires via one of the two alternatives, either a high-level crossover or a low-level crossover, both of which are explained in Section 4.3.5. After the crossover operation, the N child individuals undergo the mutation operation. For each individual, a mutation is carried out with probability p_m; otherwise the child individual is left unchanged. In CGP, to mutate an individual (composed of many trees), one of the trees is first randomly selected and then a Koza-style sub-tree mutation [35] is carried out on the chosen tree.

After undergoing the crossover and mutation operations, CGP evaluates the fitness of the N child individuals. These N children are then combined with the N parent individuals of the current generation to obtain a merged population of size 2N. This population of 2N individuals is passed on to the survivor selection procedure, where all 2N individuals are again ranked and assigned crowding distances before N individuals are selected using the crowded tournament selection operator [31] of NSGA-II. This population of N individuals may then contain some duplicate solutions. These duplicate solutions are penalized using the method given in Section 4.3.4, and the entire population is again assigned rank and crowding distance values. If the termination condition is not met, these N individuals become the parent population for the next generation. This process continues until the termination condition is met, and the final PO set of solutions is reported. Table 4.1 lists the parameters of CGP. We now look at how we can use CGP for a symbolic regression task.

Table 4.1: List of parameters in CGP.

Symbol | Description                                                            | Suggested Value
N      | Population size                                                        | 50-200
G      | Number of generations                                                  | 100-500
n_t    | Maximum number of terms in a CGP individual                            | 3-10
d_max  | Maximum depth of trees representing individual terms of an individual  | 4-10
p_c    | Probability of crossover                                               | 0.8-0.95
p_ch   | Probability of high-level crossover                                    | 0.2-0.3
p_cl   | Probability of low-level crossover                                     | 1 - p_ch
p_m    | Probability of mutation                                                | 0.05-0.2

4.4 Using CGP for a Symbolic Regression Task

Now that we are familiar with the working of CGP, recall that we wish to use CGP to learn rules of the form given by Equation (4.1) from MOO data as part of online innovization. This is a problem of symbolic regression, a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given data set. It is a different and more difficult task than conventional regression analysis, which seeks to optimize the parameters of a pre-specified mathematical model structure involving the regressand and regressors.

4.4.1 Evaluating the Fitness of a CGP Individual for a Symbolic Regression Task

Recall from Section 4.3.1 that CGP approaches the rule-learning problem as a bi-objective optimization problem, minimizing prediction error and rule complexity. Let us look at how these two objectives are calculated, one by one. The prediction error fitness function in CGP for the symbolic regression task is an extension of the ordinary least squares regression (OLSR) estimation method of linear regression. Consider a linear regression model

y = Zw + ε,   (4.6)

where Z ∈ R^{n×(k+1)} is a matrix with n observations on k independent variables whose first column contains only ones, to include the bias term in the regression model; y ∈ R^{n×1} is a vector of observations of the dependent variable; ε ∈ R^{n×1} is a vector of errors; and w ∈ R^{(k+1)×1} is the vector of unknown regression coefficients, including the bias term w_0, that we wish to estimate. Equation (4.6) in expanded form can be written as

[y_1; y_2; ...; y_n] = [1  Z_{11}  ···  Z_{k1};  1  Z_{12}  ···  Z_{k2};  ...;  1  Z_{1n}  ···  Z_{kn}] · [w_0; w_1; ...; w_k] + [ε_1; ε_2; ...; ε_n],   (4.7)

where rows are separated by semicolons and Z_{ji} denotes the i-th observation of the j-th variable. Then the OLS estimate of the regression coefficients [97] w_i, i ∈ {0, 1, ..., k}, is given by

ŵ = (Zᵀ Z)⁻¹ Zᵀ y.   (4.8)
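For illustration, the OLS estimate of Eq. (4.8) takes only a few lines (a sketch; in practice a least-squares solver is usually preferred over an explicit normal-equation inverse for numerical stability):

    import numpy as np

    def ols_weights(Z, y):
        """OLS estimate w-hat = (Z^T Z)^{-1} Z^T y of Eq. (4.8).
        Z must already contain the leading column of ones."""
        return np.linalg.solve(Z.T @ Z, Z.T @ y)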
Consider a five-variable data set having n observations, where we wish to symbolically regress a relation of the form x_4 = f(x_1, x_2, x_3, x_5). When we solve this problem using CGP, the CGP is initialized with a number of random CGP individuals. Consider an example of such an individual, as shown in Figure 4.6, having three terms: x3./(x1 + x2), x2./x3 and x1.*x5, where the operations './' and '.*' represent element-wise division and multiplication of two vectors. For finding rules of the form shown in Equation (4.1) using this individual, the problem boils down to finding the appropriate weights w_i, i ∈ {0, 1, 2, 3}, in the relation

[x_{4,1}; x_{4,2}; ...; x_{4,n}] = [1  x_{3,1}/(x_{1,1}+x_{2,1})  x_{2,1}/x_{3,1}  x_{1,1}·x_{5,1};  ...;  1  x_{3,n}/(x_{1,n}+x_{2,n})  x_{2,n}/x_{3,n}  x_{1,n}·x_{5,n}] · [w_0; w_1; w_2; w_3].   (4.9)

Comparing Equation (4.9) with (4.7), we can obtain the weight estimates using Equation (4.8). The x̂_4 calculated for all observations can then be compared with the actual x_4 values to get the prediction error e = x_4 - x̂_4. Note that e ∈ R^{n×1} is a vector of prediction errors, where n is the number of observations. Then, for this individual, the error objective, say f_error, can be calculated as the root mean square of e over all observations:

f_error = sqrt(eᵀe / n).   (4.10)

The complexity fitness function simply counts the total number of nodes in the trees of all terms of a CGP individual, i.e.

f_complexity = Σ (nodes in all terms of the CGP individual).   (4.11)

Again considering the example CGP individual shown in Figure 4.6 with three terms, f_complexity = 11 for that individual.
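Putting Section 4.4.1 together, here is a minimal sketch of evaluating both objectives of a multi-gene individual (term values are assumed pre-computed; names are illustrative). Note that the node counts in the commented example, 5 + 3 + 3, reproduce the f_complexity = 11 of the Figure 4.6 individual.

    import numpy as np

    def evaluate_individual(term_values, y, node_counts):
        """Bi-objective fitness of a CGP individual: fit the weights by
        OLS (Eq. (4.8)) and return (f_error, f_complexity) per
        Eqs. (4.10)-(4.11). term_values is an (n, n_t) matrix whose
        columns are the evaluated terms t_j."""
        Z = np.hstack([np.ones((len(y), 1)), term_values])  # bias column, Eq. (4.7)
        w, *_ = np.linalg.lstsq(Z, y, rcond=None)           # OLS weights
        e = y - Z @ w                                       # prediction errors
        f_error = float(np.sqrt(e @ e / len(y)))            # Eq. (4.10)
        f_complexity = sum(node_counts)                     # Eq. (4.11)
        return f_error, f_complexity

    # For the three-term individual of Figure 4.6 on data x1..x5:
    # term_values = np.column_stack([x3/(x1+x2), x2/x3, x1*x5])
    # node_counts = [5, 3, 3]   ->   f_complexity = 11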
4.5 CGP Results on Test Problems

Let us look at some results of symbolic regression on test problems using CGP.

4.5.1 Test Problem-1

This problem has a single variable x_1 as the regressor for the regressand y. The data for this problem was generated using the equation

y = x_1 - x_1^3/3! + x_1^5/5!,   (4.12)

where 100 values of x_1 are sampled from a uniform distribution over [-π, π]. Figure 4.11 shows the relation between x_1 and y graphically. The CGP parameters used in solving this problem are given in Table 4.2. The PO solutions obtained by CGP for this problem are shown in Figure 4.12, in which each point represents a regressed non-linear relation between y and x_1. Also shown in Figure 4.12 is the regressed rule for a knee [98] solution. The trees corresponding to the three terms of this knee solution are shown in Figure 4.13. The expression for the chosen knee solution is almost the same as the true relation shown in Equation (4.12), except for a small bias term of 1.155 × 10^-14, which is a numerical error and can be ignored.

Figure 4.11: Graphical representation of the test problem of Section 4.5.1.

Table 4.2: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.1.

Parameter | N  | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 10 | 30 | 20  | 10    | 0.95 | 0.20 | 0.80 | 0.05

Figure 4.12: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.1.

Figure 4.13: The three trees, t_1, t_2 and t_3, corresponding to the chosen knee solution shown in Figure 4.12.

4.5.2 Test Problem-2

This problem has a single variable x_1 as the regressor for the regressand y. The data for this problem was generated using the equation

y = (x_1 + 1) / (x_1^2 + x_1 + 1),   (4.13)

where 100 values of x_1 are sampled from a uniform distribution over [-2, 2]. Figure 4.14 shows the relation between x_1 and y graphically. The CGP parameters used in solving this problem are given in Table 4.3. The PO solutions obtained by CGP for this problem are shown in Figure 4.15, in which each point represents a regressed non-linear relation between y and x_1. Also shown in Figure 4.15 is the regressed rule for the solution with the least error and highest complexity among the PO solutions. The tree corresponding to this solution is shown in Figure 4.16. The expression for the chosen solution is exactly the same as the true relation shown in Equation (4.13).

Figure 4.14: Graphical representation of the test problem of Section 4.5.2.

Table 4.3: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.2.

Parameter | N  | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 10 | 50 | 50  | 6     | 0.85 | 0.20 | 0.80 | 0.05

Figure 4.15: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.2.

Figure 4.16: The tree corresponding to the chosen solution shown in Figure 4.15.

4.5.3 Test Problem-3

The Bernoulli equation [99] is a famous equation from the subject of fluid mechanics. Consider an incompressible fluid flowing under steady flow conditions. Then, at any arbitrary point along a streamline, the following holds:

y + v^2/(2g) + p/(ρg) = c,   (4.14)

where
y is the elevation of the point above a reference plane,
v is the fluid flow speed at a point on the streamline,
p is the pressure at the chosen point,
g is the acceleration due to gravity,
ρ is the density of the fluid at all points in the fluid, and
c is a constant for the chosen streamline, also called the energy head.

To generate data to test CGP, we assumed the fluid to be water (ρ = 1000 kg/m^3), took g to be 9.81 m/s^2, and chose the value c = 20 m of energy head for some streamline of water flowing under steady flow conditions. We then randomly sampled 100 values each for the variables v and p from uniform distributions over [0, 10] m/s and [101325, 400000] Pa respectively (the atmospheric pressure at the surface of the Earth is approximately 101325 Pa). For these values of v, p, g, ρ and c, we then calculated 100 values of y using the relation

y = 20 - 0.051 · v^2 - 1.0194 × 10^-4 · p.   (4.15)

Note that Equation (4.15) is obtained by substituting the values of g, ρ and c assumed above into Equation (4.14). Considering y as the regressand variable and v and p as regressors, CGP was supplied with this data to learn the non-linear relation among these variables. The CGP parameters used in solving this problem are given in Table 4.4. The PO solutions obtained by CGP for this problem are shown in Figure 4.17, in which each point represents a regressed non-linear relation between y and (v, p). Also shown in Figure 4.17 is the regressed rule for a knee solution. The trees corresponding to this solution are shown in Figure 4.18. The expression for the chosen solution is exactly the same as the true relation shown in Equation (4.14).

Table 4.4: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.3.

Parameter | N  | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 28 | 30 | 6   | 4     | 0.95 | 0.20 | 0.80 | 0.05

Figure 4.17: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3.

Figure 4.18: The two trees, t_1 = v × v and t_2 = p, corresponding to the chosen knee solution shown in Figure 4.17.
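For reproducibility, the data for this Bernoulli test problem could be generated as follows (a sketch using the values stated above; the random seed is an arbitrary assumption):

    import numpy as np

    rng = np.random.default_rng(42)            # arbitrary seed
    g, rho, c = 9.81, 1000.0, 20.0             # m/s^2, kg/m^3, m of energy head
    v = rng.uniform(0.0, 10.0, 100)            # flow speed, m/s
    p = rng.uniform(101325.0, 400000.0, 100)   # pressure, Pa
    y = c - v**2 / (2 * g) - p / (rho * g)     # Eq. (4.14) solved for y
    # Numerically, y ~= 20 - 0.051*v**2 - 1.0194e-4*p, i.e. Eq. (4.15),
    # the target relation CGP is asked to recover.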
4.5.4 Test Problem-4

Let us look at the equation for the deflection of a simply supported beam [100]. Consider a simply supported beam of length l having a mass per unit length of w. If E is the Young's modulus of the material of the beam and I is the area moment of inertia of the beam (about the axis of bending), then the deflection of the beam at a distance x from one end is given by

Δ_x = (w x / (24 E I)) · (l^3 - 2 l x^2 + x^3).   (4.16)

To generate data to test CGP, we assumed the beam to be made of steel with the following known parameters:

E = 29,000,000 psi, w = 0.568 lb/inch, I = 0.167 inch^4, l = 39.37 inch.   (4.17)

Substituting these values in Equation (4.16), we get

Δ_x = 4.89 × 10^-9 x^4 - 3.85 × 10^-7 x^2 + 2.98 × 10^-4 x.   (4.18)

Using this relation, we then calculated 100 values of Δ_x based on 100 values of x ∈ (0, l]. Considering Δ_x as the regressand variable and x as the regressor, CGP was supplied with this data to learn the non-linear relation between these variables. The CGP parameters used in solving this problem are given in Table 4.5. The PO solutions obtained by CGP for this problem are shown in Figure 4.19, in which each point represents a regressed non-linear relation between Δ_x and x. Also shown in Figure 4.19 is the regressed rule for a knee solution. The trees corresponding to this solution are shown in Figure 4.20. The expression for the chosen solution is exactly the same as the true relation shown in Equation (4.18).

Table 4.5: List of CGP parameters used for solving the symbolic regression problem of Section 4.5.4.

Parameter | N | G  | n_t | d_max | p_c  | p_ch | p_cl | p_m
Value     | 6 | 30 | 30  | 6     | 0.95 | 0.20 | 0.80 | 0.05

Figure 4.19: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.4.

Figure 4.20: The three trees corresponding to the chosen knee solution shown in Figure 4.19.

4.6 Noise Study

For the test problem of Section 4.5.3, a noise study was also performed. The purpose of this study was to understand how much noise needs to be added to the regressand before the true relation is no longer present among the PO set of solutions of CGP. Gaussian white noise was added to the value of the original regressand y as

ỹ = y + a · N(0, 1),   (4.19)

where a, the noise factor, is some numeric value used to change the level of noise and N(0, 1) is the standard normal distribution. The noise levels are represented in terms of the signal-to-noise ratio (SNR) and measured in decibels (dB) using the formula

SNR = 10 log_10 [ (Σ_{n=0}^{L-1} y^2(n)) / (L · a^2) ] dB,   (4.20)

where n is the index of the observed value, L is the number of observations (100 in this case), and a^2 is the variance of the added noise. This formula is taken from the work [101]. A high SNR means low noise. The values of the regressors v and p were not changed. Effectively, CGP attempted to learn the relation

ỹ = 20 - 0.051 · v^2 - 1.0194 × 10^-4 · p   (4.21)

instead of the one given by Equation (4.15). Note that the regressands are different in the two equations.
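A sketch of the noise injection of Eq. (4.19) and the SNR computation of Eq. (4.20) (illustrative; y is the clean regressand from the data-generation sketch above):

    import numpy as np

    def add_noise(y, a, rng):
        """Corrupt the regressand with Gaussian white noise, Eq. (4.19)."""
        return y + a * rng.standard_normal(len(y))

    def snr_db(y, a):
        """Signal-to-noise ratio of Eq. (4.20), in decibels."""
        return 10.0 * np.log10(np.sum(y**2) / (len(y) * a**2))

    # With y from the Bernoulli data-generation sketch and a = 0.6090,
    # snr_db(y, 0.6090) should land near the 20 dB row of Table 4.6.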
Table 4.6 shows the results of this study. The first column carries the value of the noise factor a, the second column the SNR in dB, the third column the rule corresponding to the knee solution from the PO set obtained by CGP on the noisy data, and the last column the R²_adj (goodness-of-fit) value of the knee solution. This study shows that the true solution is present among the PO set of solutions even up to a noise level as high as SNR = 15 dB, although the goodness of fit suffers considerably at that level of noise.

Table 4.6: Results of the noise study performed on the test problem of Section 4.5.3.

a      | SNR (dB) | Knee solution rule                                                          | R²_adj
0.0010 | 100      | ỹ = 20 - 0.051v^2 - 1.019 × 10^-4 p                                         | 1
0.1925 | 30       | ỹ = 20 - 0.051v^2 - 1.000 × 10^-4 p                                         | 0.9882
0.3431 | 25       | ỹ = 20 - 0.051v^2 - 9.9 × 10^-5 p                                           | 0.9622
0.6090 | 20       | ỹ = 19 - 0.051v^2 - 9.7 × 10^-5 p                                           | 0.8484
1.0825 | 15       | ỹ = 19 - 0.052v^2 - 9.2 × 10^-5 p                                           | 0.6993
5.4315 | 1        | ỹ = 28 - 6.3v + 1.9 × 10^-4 p + 2.6 × 10^-4 vp - 4.7v^2 + 0.72v^3 - 0.036v^4 | 0.0125

4.7 Choosing a Solution

The results in the previous section show that CGP is competitive at symbolically regressing a set of PO rules that provide a good trade-off between prediction error and rule complexity. However, recall from Sections 2.5 and 2.5.3 that to perform online innovization we not only have to learn the rules present in the MOO data but also use that knowledge to modify solutions and expedite the EMO algorithm's convergence. Hence, if we use CGP for rule learning, once we have a set of PO solutions from CGP we also need to choose a solution before passing control back to the EMO algorithm. This can be done in two ways:

1. Choosing a knee solution: Among the PO solutions of a bi-objective optimization problem, a knee point is a special solution, because choosing any solution other than the knee requires a large sacrifice in one objective to gain a small amount in the other. A good amount of literature is available on detecting and choosing a knee solution in a PO front [98, 102]. It can be seen from the PO fronts shown in Figures 4.12, 4.15, 4.17 and 4.19 that these problems show a strong propensity for knee points. Hence, choosing a knee solution in an automated way can be one way to choose one of the many PO solutions of CGP.

2. Checking dimensional consistency: Recall from Section 2.6 that one of the advantages of learning rules in the form of mathematical algebraic expressions is that one can check whether a learned rule adheres to the law of dimensional homogeneity. A rule that adds a physical quantity having the dimensions of length to a physical quantity having the dimensions of mass cannot add anything to the knowledge of a designer, even if the rule has a low prediction error on the training data. Hence, once CGP provides a set of PO solutions, we can check the dimensional consistency of those rules and discard the ones that violate it. We call this a serial dimensional awareness check because we apply the dimensionality check at the end of the CGP algorithm. We can thus use this idea to shortlist some candidates from the PO set of CGP and then follow it up with a knee-solution-choosing method.

Even in the absence of a knee in the PO front, it may be possible to parametrize the user's preference in terms of rule complexity and prediction error so that a rule can be chosen automatically based on that preference. In the next chapter, we look at how we can check the dimensional consistency of the rules obtained by CGP.

Chapter 5
Using Dimensional Awareness with Rule Learning

For any physical law, adherence to the law of dimensional homogeneity is of utmost importance. The law of dimensional homogeneity [26] states that "only commensurable quantities (physical quantities having the same dimension) may be compared, equated, added or subtracted".
Although the rule learning part of CGP can learn rules that accurately fit the data, if any rule adds or subtracts two incommensurable quantities, then such a rule is physically meaningless. Hence, we need to quantify the degree of dimensional mismatch of a rule found by CGP. Such a quantification of dimensional mismatch for the PO rules found by the rule learning part of CGP gives us additional information when we need to choose only one or a few solutions out of the PO solutions of CGP.

5.1 Measuring the Dimension Mismatch Penalty

Let us figure out how we can quantify the dimensional mismatch penalty of a rule found by CGP. Say the rule learning part of CGP is used for solving a symbolic regression problem relating a regressand (y) and regressors (x_k, k ∈ {1, 2, ..., n_x}), which yields a set of PO rules. Consider one such PO rule having the form

r ≡ y = w_0 + Σ_{i=1}^{n_t} w_i · t_i,   (5.1)

where w_0 is a bias term, n_t is the total number of terms, w_i is the regression coefficient of term t_i, and t_i is some function of the regressors x_k, k ∈ {1, 2, ..., n_x}.

5.1.1 Case-I: Terms with Only Product and Division Operations

In this case, we show how the dimensional mismatch penalty, say P, is calculated when the terms t_i in Equation (5.1) are comprised of only multiplication and division operations among the regressors x_k, k ∈ {1, 2, ..., n_x}. Let us further assume that:
- The number of fundamental dimensions present in the data is three, and they are the fundamental physical dimensions of mass (M), length (L) and time (T). Of course there can be more (for example temperature (θ), current (I), etc.), but we choose these three for ease of representation.
- The derived physical dimensions of a term t_i are M^{α_i} L^{β_i} T^{γ_i}.
- C_j, j ∈ {1, 2, ..., n_c}, are a set of n_c physical constants relevant to the process generating the data. These have to be chosen by subject matter experts. For example, when studying a fluid flow problem, some of the physical constants that may be considered important for the process are the acceleration due to gravity, the density of the fluid, and the fluid viscosity. These constants C_j have dimensionless numeric values c_j, j ∈ {1, 2, ..., n_c}, and derived dimensions M^{λ_j} L^{η_j} T^{θ_j}. The symbol C_j encapsulates both the numeric value c_j and the units information.
- To mend the dimensional inconsistencies between y and the terms t_i in Equation (5.1) using the constants C_j, the constants may appear in the dimensionally consistent form of Equation (5.1) with only a limited set of exponents, say E = {e_1, e_2, ..., e_k}, where k is the number of such chosen exponents. For example,

E = {-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0}.   (5.2)

- The fundamental units of the left-hand side, i.e. the regressand y of the regressed rule shown in Equation (5.1), are M^{ε} L^{ϕ} T^{ω}.

If the derived physical dimensions of term t_i and regressand y are different, then a natural question to ask is: in what way can the different physical constants C_j of the process be multiplied with the term t_i such that dimensional homogeneity is maintained between t_i and y? One way to achieve this is as follows. For the i-th term t_i, there may exist a set of real values {z_{(i,1)}, z_{(i,2)}, ..., z_{(i,n_c)}} such that the product of t_i with ∏_{j=1}^{n_c} C_j^{z_{(i,j)}} yields dimensional equivalence between term t_i and y. This can be represented as

M^{ε} L^{ϕ} T^{ω} = (M^{α_i} L^{β_i} T^{γ_i}) · ∏_{j=1}^{n_c} (M^{λ_j} L^{η_j} T^{θ_j})^{z_{(i,j)}},   (5.3)

which can be re-written as

M^{(ε-α_i)} L^{(ϕ-β_i)} T^{(ω-γ_i)} = M^{Σ_{j=1}^{n_c} λ_j z_{(i,j)}} · L^{Σ_{j=1}^{n_c} η_j z_{(i,j)}} · T^{Σ_{j=1}^{n_c} θ_j z_{(i,j)}}.   (5.4)
Solving Equation (5.4) is equivalent to solving the system of simultaneous linear equations

A z_i = b_i,   where   A = [λ_1  λ_2  ···  λ_{n_c};  η_1  η_2  ···  η_{n_c};  θ_1  θ_2  ···  θ_{n_c}],   z_i = [z_{(i,1)}; z_{(i,2)}; ...; z_{(i,n_c)}],   b_i = [ε - α_i; ϕ - β_i; ω - γ_i].   (5.5)

From the theory of linear algebra [103], the solution z_i to Equation (5.5) is:
I. the exact solution if A is a full-rank square matrix,
II. the least squares solution if A is a full-rank skinny matrix (more fundamental dimensions than constants, n_f > n_c),
III. the least squares, least norm solution if A is a full-rank fat matrix (n_f < n_c), and
IV. the least squares solution obtained via the Singular Value Decomposition method if A is singular.

Let the solution to Equation (5.5) be ẑ_i, where

ẑ_i = [ẑ_{(i,1)}, ẑ_{(i,2)}, ..., ẑ_{(i,n_c)}]ᵀ, where ẑ_{(i,j)} ∈ R for all i, j.   (5.6)

Recall from Equation (5.3) that the z_{(i,j)} represent the exponents of the chosen physical constants C_j. Also, we are looking to quantize these exponents to a select few values given by some set E, an example of which is given in Equation (5.2). Hence, all the elements of ẑ_i are quantized to their nearest value in the set E. This quantization of the elements of ẑ_i yields the set z̄_i, where

z̄_i = [z̄_{(i,1)}, z̄_{(i,2)}, ..., z̄_{(i,n_c)}]ᵀ, where z̄_{(i,j)} ∈ E for all i, j.   (5.7)

For example, if ẑ_i = [-0.95, 0.55, 1.89]ᵀ and the set E is given by Equation (5.2), then the quantized set will be z̄_i = [-1.0, 0.5, 2.0]ᵀ.

Once a quantized set of exponents z̄_i is obtained, we then obtain the residue vector d_i corresponding to the dimensional inconsistency of the i-th term in Equation (5.1) as

d_i = [d_{(i,1)}; d_{(i,2)}; d_{(i,3)}] = A z̄_i - b_i = [λ_1  ···  λ_{n_c};  η_1  ···  η_{n_c};  θ_1  ···  θ_{n_c}] z̄_i - [ε - α_i; ϕ - β_i; ω - γ_i].   (5.8)

The root mean square (RMS) value of the residue vector d_i can then be treated as the penalty P_i for the dimensional mismatch of the i-th term in Equation (5.1):

P_i = sqrt( (d_{(i,1)}^2 + d_{(i,2)}^2 + d_{(i,3)}^2) / 3 ).   (5.9)

This dimensional mismatch penalty is a non-negative value, and it is zero if the physical dimensions of y and term t_i in Equation (5.1) can be matched exactly by multiplying t_i with just the right combination of the physical constants C_j, j ∈ {1, 2, ..., n_c}, raised to a particular set of exponents given by z̄_i in Equation (5.7). Once this penalty value has been calculated for all n_t terms of Equation (5.1), we can calculate the overall dimensional mismatch penalty P of the expression given by Equation (5.1) as

P = ( Σ_{i=1}^{n_t} P_i ) / ( n_t - Σ_{i=1}^{n_t} δ_{P_i,0} )   if   Σ_{i=1}^{n_t} δ_{P_i,0} < n_t,   and   P = 0   otherwise,   (5.10)

where δ_{i,j} is the Kronecker delta function such that

δ_{i,j} = 1 if i = j, and 0 otherwise.   (5.11)

In words, P is the average of the P_i over the dimensionally inconsistent terms, and zero when every term is consistent. If Equation (5.10) evaluates to zero, it implies that the original equation can be made dimensionally consistent using the set of chosen constants C_j and a set of exponents such as the one given by Equation (5.7). In such a case, it is beneficial to modify the original equation given by (5.1) to include the chosen constants as

r ≡ y = w_0 + Σ_{i=1}^{n_t} ŵ_i · ( t_i ∏_{j=1}^{n_c} C_j^{z̄_{(i,j)}} ),   (5.12)

where

ŵ_i = w_i / ∏_{j=1}^{n_c} c_j^{z̄_{(i,j)}}.   (5.13)

This is done so that we do not affect the regression or classification accuracy of the original equation while making it dimensionally consistent at the same time. Recall that c_j represents the dimensionless numerical value of some physical constant C_j, and that the symbol C_j encapsulates both the numeric value c_j and the units information.
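A minimal sketch of this penalty computation (an illustration of Eqs. (5.5)-(5.10) under the assumptions stated above; function names are our own, not the dissertation's):

    import numpy as np

    E = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])  # allowed exponents, Eq. (5.2)

    def term_penalty(A, b, E=E):
        """Penalty P_i of one term, following Eqs. (5.5)-(5.9).
        A (n_dims x n_c): dimension exponents of the chosen constants C_j.
        b (n_dims,): dimension gap between the regressand y and term t_i."""
        z_hat, *_ = np.linalg.lstsq(A, b, rcond=None)           # solve Eq. (5.5)
        z_bar = E[np.abs(E[:, None] - z_hat).argmin(axis=0)]    # quantize, Eq. (5.7)
        d = A @ z_bar - b                                       # residue, Eq. (5.8)
        return float(np.sqrt(np.mean(d**2))), z_bar             # RMS, Eq. (5.9)

    def rule_penalty(penalties):
        """Overall penalty P of a rule, Eq. (5.10): average of P_i over
        the dimensionally inconsistent terms; zero if all are consistent."""
        bad = [p for p in penalties if p > 0]
        return sum(bad) / len(bad) if bad else 0.0

np.linalg.lstsq covers cases I-IV above uniformly, since it returns the exact solution, the least squares solution, or the SVD-based least squares, least norm solution, depending on the shape and rank of A.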
5.1.2 Dimensionally Inconsistent Example

Let us understand the above procedure with an example. Consider the symbolic regression problem (the Bernoulli equation case) presented in Section 4.5.3. We chose variable y as the regressand and variables v and p as regressors. The rule learning part of CGP returns a set of PO rules as shown in Figure 4.17. Suppose we want to check the dimensional consistency of one of the PO solutions, given by

    r ≡ y = 20 − 0.05097 v^2 − 0.0001019 p + 3.9 × 10^−8 v.    (5.14)

The rule shown in Equation (5.14) is different from the knee solution shown in Figure 4.17. In this rule, there are three terms (nt = 3), namely t1 = v^2, t2 = p and t3 = v, and their corresponding weights are w1 = −0.05097, w2 = −0.0001019 and w3 = 3.9 × 10^−8 respectively. There is also a bias term w0 = 20. Recall that the units of v and p are m/s and Pa respectively. Therefore, the derived units of t1 are M^0 L^2 T^−2, of t2 are M^1 L^−1 T^−2 and of t3 are M^0 L^1 T^−1.

Let us choose two physical constants (nc = 2) in the Bernoulli problem, namely the acceleration due to gravity, g, and the density of the fluid (water here), ρ. These are measured in m/s^2 and kg/m^3 respectively. Hence,

    g ≡ C1 = 9.81 (= c1) m/s^2,    (5.15)
    ρ ≡ C2 = 1000 (= c2) kg/m^3.    (5.16)

The derived units of these physical constants of the problem are M^0 L^1 T^−2 for g and M^1 L^−3 T^0 for ρ. Furthermore, let us choose the set given in (5.2) as the set of allowed exponents for the constants. Since y is the elevation of a point on a streamline in the fluid, it is measured in meters; thus, the derived dimensions of the regressand y are M^0 L^1 T^0. This information is summarized below.

    α1 = 0    β1 = 2     γ1 = −2
    α2 = 1    β2 = −1    γ2 = −2
    α3 = 0    β3 = 1     γ3 = −1
    λ1 = 0    η1 = 1     θ1 = −2
    λ2 = 1    η2 = −3    θ2 = 0
    ε = 0     ϕ = 1      ω = 0

Let us look at the dimensional consistency of each term one by one.

5.1.2.1 First Term

For term t1 in Equation (5.14), Equation (5.5) reads

    A z1 = b1,  with  A = [ 0 1 ; 1 −3 ; −2 0 ],  z1 = [z(1,1), z(1,2)]ᵀ,  b1 = [0, −1, 2]ᵀ,

which upon solving and quantizing to the allowed exponent values yields

    z̄1 = [−1 0]ᵀ.

Substituting z̄1 in Equation (5.8) and then using Equation (5.9), we obtain the dimensional mismatch penalty for the first term of Equation (5.14) as

    P1 = 0.    (5.17)

5.1.2.2 Second Term

For term t2 in Equation (5.14), Equation (5.5) reads

    A z2 = b2,  with  z2 = [z(2,1), z(2,2)]ᵀ,  b2 = [−1, 2, 2]ᵀ,

which upon solving and quantizing to the allowed exponent values yields

    z̄2 = [−1 −1]ᵀ.

Substituting z̄2 in Equation (5.8) and then using Equation (5.9), we obtain the dimensional mismatch penalty for the second term of Equation (5.14) as

    P2 = 0.    (5.18)
5.1.2.3 Third Term

For term t3 in Equation (5.14), Equation (5.5) reads

    A z3 = b3,  with  z3 = [z(3,1), z(3,2)]ᵀ,  b3 = [0, 0, 1]ᵀ,

which upon solving yields

    ẑ3 = [−0.4878 −0.1463]ᵀ.    (5.19)

Since we chose the allowed set of exponents to be E = {−2.0, −1.0, −0.5, 0.0, 0.5, 1.0, 2.0}, quantizing the exponents in (5.19) to values in E yields

    z̄3 = [−0.5 0.0]ᵀ.

Substituting z̄3 in Equation (5.8) and then using Equation (5.9), we obtain the dimensional mismatch penalty for the third term of Equation (5.14) as

    P3 = 0.2887.    (5.20)

The total dimensional penalty for the symbolic regression solution given by (5.14) is found using Equation (5.10), i.e.

    P = (Σ_{i=1}^{nt} Pi) / (nt − Σ_{i=1}^{nt} δ_{Pi,0}) = (0 + 0 + 0.2887) / (3 − (1 + 1 + 0)) = 0.2887 ≠ 0.    (5.21)

This should be the case, because if we compare Equation (5.14) with the correct Bernoulli equation of Equation (4.15), the third term of Equation (5.14) is not commensurate with the physical dimensions of the regressand y.

5.1.3 Dimensionally Consistent Example

Again consider the symbolic regression problem presented in Section 4.5.3. Suppose that the rule learning part of CGP returns a set of PO rules, and we want to check the dimensional consistency of one such rule, given by

    r ≡ y = 20 − 0.05097 v^2 − 0.0001019 p,    (5.22)

where w0 = 20, w1 = −0.05097 with t1 = v^2, and w2 = −0.0001019 with t2 = p. Equation (5.22) is actually a knee region solution among the PO solutions found by CGP on the Bernoulli test data, shown in Figure 4.17. From Sections 5.1.2.1 and 5.1.2.2, we know that this equation can be made dimensionally consistent using the chosen physical constants. The first chosen constant (acceleration due to gravity) C1 is symbolically represented by g and is numerically equal to c1 = 9.81 m/s^2 in SI units. The second chosen constant (density of water) C2 is symbolically represented by ρ and is numerically equal to c2 = 1000 kg/m^3 in SI units. Hence, we can re-evaluate the weights in Equation (5.22) to include these chosen physical constants using Equations (5.12) and (5.13). Substituting values from Equations (5.17), (5.18) and (5.22) in Equation (5.13), we get

    ŵ1 = w1 / (c1^{z̄(1,1)} c2^{z̄(1,2)}) = −0.05097 / (9.81^−1 × 1000^0) = −0.50001,  and    (5.23)
    ŵ2 = w2 / (c1^{z̄(2,1)} c2^{z̄(2,2)}) = −0.0001019 / (9.81^−1 × 1000^−1) = −0.99964.    (5.24)

Using Equation (5.12), the modified form of Equation (5.22) that includes the chosen physical constants can be written as

    r ≡ y = 20 + ŵ1 · v^2 C1^{z̄(1,1)} C2^{z̄(1,2)} + ŵ2 · p C1^{z̄(2,1)} C2^{z̄(2,2)}
          = 20 + ŵ1 · v^2 g^−1 ρ^0 + ŵ2 · p g^−1 ρ^−1
          = 20 − 0.50001 v^2/g − 0.99964 p/(ρg)  (upon substituting ŵ1 and ŵ2).    (5.25)

Compare Equation (5.25) with the known form of Bernoulli's equation given in Equation (4.14) and notice that the two are the same (within a small tolerance) if c = 20. Figure 5.1 shows the PO solutions of the Bernoulli equation problem of Figure 4.17 along with the dimension mismatch penalty information. The solutions having a non-zero penalty value are shown in red. Note that only the knee solution has a zero dimension mismatch penalty in this case. Hence, we can use the dimensional mismatch penalty at the end of a CGP run to discard PO solutions that do not adhere to the law of dimensional homogeneity.
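To close the loop on this worked example, the short sketch below reuses the illustrative term_penalty() and rule_penalty() helpers from the Section 5.1.1 sketch to reproduce the three per-term penalties and the total of Equation (5.21):

    import numpy as np

    # Columns of A: exponents of g (M0 L1 T-2) and rho (M1 L-3 T0); rows: M, L, T.
    A = np.array([[0.0, 1.0], [1.0, -3.0], [-2.0, 0.0]])
    E = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
    # Dimension gaps b_i for t1 = v^2, t2 = p and t3 = v against y (M0 L1 T0).
    b1, b2, b3 = [0.0, -1.0, 2.0], [-1.0, 2.0, 2.0], [0.0, 0.0, 1.0]
    penalties = [term_penalty(A, np.array(b), E)[0] for b in (b1, b2, b3)]
    print(penalties)                # approximately [0.0, 0.0, 0.2887]
    print(rule_penalty(penalties))  # approximately 0.2887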
Figure 5.1: PO set of solutions found by CGP on solving the symbolic regression test problem of Section 4.5.3, followed by a dimensional penalty calculation for each case. Only the knee solution, which is also the exact solution, has a dimensional mismatch penalty of zero.

5.1.4 Case-II : More Complex Terms

In Section 5.1.1, we limited our study to cases where the terms ti in Equation (5.1) are composed only of product and division operators among the regressors. However, as shown in Sections 4.5.1, 4.5.2 and 4.5.3, CGP is capable of much more, including rational functions and functions involving addition and subtraction operators. In such a case, we have to make a slight change to the procedure outlined in Section 5.1.1 for calculating the dimensional penalty value of a term ti of Equation (5.1). We know that in Equation (5.1),

    ti = f(x1, x2, . . . , xnx),    (5.26)

where xk, k ∈ {1, 2, . . . , nx}, are the regressors and f() is some function involving the operations of addition, subtraction, division and multiplication. Upon substituting the physical dimensions of the regressors in (5.26), let us say that the derived physical dimensions of term ti are obtained as some rational function of the fundamental dimensions,

    P(M, L, T) / Q(M, L, T),    (5.27)

where P() and Q() are polynomials in M, L and T. Using partial fractions, we can decompose (5.27) as

    P(M, L, T) / Q(M, L, T) = Σ_i pi(M, L, T) / qi(M, L, T),    (5.28)

such that

• the denominator qi(M, L, T) of each fraction is a power of an irreducible polynomial (one not factorable into polynomials of positive degree), and
• the numerator pi(M, L, T) is a polynomial of smaller degree than that of qi(M, L, T).

In such a case, both pi() and qi() are polynomials in M, L and T that cannot be broken down further. If we fully expand both of these polynomials as sums of product terms, then we can re-write Equation (5.28) as

    Σ_i pi(M, L, T) / qi(M, L, T) = Σ_i ( Σ_j ri,j(M, L, T) ) / ( Σ_k si,k(M, L, T) ).    (5.29)

In Equation (5.29), each term ri,j(M, L, T) and si,k(M, L, T) is a pure product term in M, L and T. In the CGP code, we implement this using the Symbolic Math Toolbox of Matlab. Now we can use the method of Section 5.1.1 to find the dimension mismatch penalty value for each product term separately. The only difference is that now the dimensions of the numerator terms ri,j(M, L, T) have to be matched with those of the numerator of the left hand side of Equation (5.1) (i.e. y), and similarly the dimensions of the denominator terms si,k(M, L, T) have to be matched with those of the denominator of the left hand side of Equation (5.1) (i.e. 1). Summing together the penalty values of all the numerator and denominator terms gives the total dimension mismatch penalty associated with a term ti that combines addition and subtraction operators with product and division operators.

Until now, we have looked at how we can use CGP for a symbolic regression task and then follow it up with a dimension mismatch check on the PO solutions. In the next chapter, we will look at using CGP as a binary classifier that produces its decision boundaries as algebraic expressions. Note that the procedure explained in this chapter is applicable even when CGP is being used for a classification task; the only difference in that case is that the regressand y is replaced by a class label Y, which is dimensionless.
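Although our implementation relies on Matlab's Symbolic Math Toolbox, the Case-II decomposition is easy to sketch with any symbolic algebra package. The snippet below is an illustrative Python/SymPy version on a made-up derived dimension (the expression and variable names are ours, not taken from the CGP code), decomposing with respect to one dimension symbol at a time:

    import sympy as sp

    M, L, T = sp.symbols('M L T', positive=True)

    # A hypothetical derived dimension of a term t_i, as a rational function (5.27).
    expr = (M*L**2 + L**3*T) / (T**2 * (L + T))

    # Partial-fraction decomposition (5.28), taken here with respect to T.
    decomposed = sp.apart(sp.together(expr), T)

    # Expand each fraction into pure product terms (5.29); the numerators are
    # then matched against the dimensions of y, the denominators against 1.
    for frac in sp.Add.make_args(decomposed):
        num, den = sp.fraction(sp.together(frac))
        print(sp.expand(num), '/', sp.expand(den))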
Let us now move on to another interesting capability of CGP, i.e. solving binary classification tasks.

Chapter 6

Learning Free Form Rules Using a Classification Task

In this chapter, we will learn how to use the CGP developed in Chapter 4 for a classification task. Recall from Sections 2.4 and 2.5 that when solving a MOO problem using an EMO algorithm, knowledge can be derived in one of two ways, namely:

• by searching for rules in some preferred set of solutions, say the non-dominated set — the methods of rule learning developed in Chapters 3 and 4 can be used in this case; or
• by searching for discriminatory rules between a preferred and an unpreferred set of solutions, say the non-dominated versus the dominated solution set — this is the case of rule learning that we address in this chapter.

We will present how the CGP developed in Chapter 4 can be used for a classification task. This work was developed as part of an industry project. The methodology developed in this project is directly applicable to the task of online innovization as well, for learning the decision boundary of a classification task as an algebraic expression in terms of features. Let us quickly describe the classification problem before discussing how CGP is used for it.

6.1 Target Classification Problem

The goal of this project was to develop a computationally efficient machine learning methodology that can:

• automate the process of selecting a few important features from a set of features and then building a classifier using those features, and
• learn the classifier (decision boundary) as an algebraic expression involving the selected important features.

The data is produced by a fast manufacturing process, in which it is captured as multi-variate time series data via many sensors. This time series data is then processed to extract many features using basic mathematical functions, such as differentiation and integration, without any expert knowledge for feature creation. This feature extraction step is not shared in this work because of a non-disclosure agreement with the industry partner; on similar grounds, we do not describe the exact manufacturing process either. However, our method of extracting rule-based classifiers is applicable to any manufacturing process where a lot of data about the process is collected while the process is still going on — for example, collecting multiple sensor signals for a welding process, such as the welding current, the voltage and the distance of the electrode from the weld region.

Reiterating, the term 'interpretable rules' in the context of this research refers to rules in the form of mathematical expressions/equations involving the process features, process constants and simple operations such as addition, subtraction, multiplication and division. The term 'meaningful rules' refers to the idea of such expressions being physically meaningful by being dimensionally consistent. Now, let us look at how we can use the CGP developed in Chapter 4 for a classification task.

6.2 Using CGP for a Classification Task

In Chapter 4, we saw how CGP learns a rule of the form given by Equation (4.1), i.e.

    ψ(φ1, φ2, . . . , φnφ) ≡ φi = w0 + Σ_{j=1, j≠i}^{nt} wj · tj,
where ψ represents one such rule, the φ are the basis functions explained in Section 3.1 that represent the data from the MOO problem, φi is the basis function chosen as the regressand, the rest of the basis functions are regressors, and CGP is used as a symbolic regression tool to learn rules of this form. In this case, CGP optimizes the structure of the rules and learns the weights using OLSR. However, if the basis function φi is a class label instead of a regressand, then we can use a linear support vector machine (SVM) [56] learning algorithm for learning the weights, because the results of a linear SVM are considered very interpretable. The challenge lies in finding the right number of higher dimensions (of feature space), and the right features or derived features corresponding to those dimensions, in which the data is linearly separable. In such a space, a linear SVM will be able to find an appropriate separating plane with relative ease, provided that the decision boundary is not discontinuous. By derived features, we mean features that are composed from the initial set of features provided to CGP using the basic operations of addition, subtraction, multiplication and division.

Let us look at an example. Consider the binary data shown in Figure 6.1, which is generated using the following function, whose zero level set is an ellipse:

    y = −x1^2 + 2.02 x1·x2 − 3.05 x2^2 + 1.98,    (6.1)

where x1 and x2 are the two features of this data. The data of the hypothetical Go class (y < 0) is shown in green and the data of the hypothetical NoGo class (y ≥ 0) is shown in red. Clearly, y = 0 in Equation (6.1) defines the decision boundary for this problem.

Figure 6.1: A hand crafted example of binary class data.

Upon trying to classify this data set using a tree classifier, the learned tree model has 13 levels and 147 nodes, as shown in Figure 6.2. What is interesting to note is that if we provide only the features x1 and x2 to a linear SVM algorithm, it performs very poorly, since the data is not linearly separable. Now consider the following three features: x1^2, x2^2 and x1·x2. We call these derived features, as they were not provided among the original features of the problem but are derived from them. If we provide these three features to a linear SVM algorithm, it performs exceedingly well on the same data, because in this modified three-dimensional feature space the data is linearly separable. This can be seen in Figure 6.3, where the sub-figures show the same data in the derived feature space from three different camera angles. As can be seen in Figures 6.3a, 6.3b and 6.3c, a linear SVM could find a plane (in blue) clearly separating the Go and NoGo data. Not surprisingly, the equation of the plane is the same as Equation (6.1).

Figure 6.2: Binary classification tree model for classifying the two class data shown in Figure 6.1.

One may argue: why not use a quadratic or polynomial kernel based SVM for this problem? That may work here, but what if the problem requires a rational function as one of the dimensions of the expanded feature space? Furthermore, the choice of which kernel to select while keeping the interpretability of the classifier/decision boundary in mind is not very straightforward. A short sketch of the derived-feature idea follows below.
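The following minimal sketch (Python, with scikit-learn assumed to be available; the sample size, the value of C and the random seed are arbitrary choices of ours) reproduces this behaviour — a linear SVM trained on the three derived features recovers, up to scale, the coefficients of Equation (6.1):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.uniform(-2.0, 2.0, size=(500, 2))
    x1, x2 = X[:, 0], X[:, 1]

    # Label with the hand-crafted ellipse of Eq. (6.1): NoGo where y >= 0.
    y = (-x1**2 + 2.02*x1*x2 - 3.05*x2**2 + 1.98 >= 0).astype(int)

    # Derived features in which the decision boundary becomes a plane.
    Phi = np.column_stack([x1**2, x2**2, x1*x2])

    clf = LinearSVC(C=10.0, max_iter=100000).fit(Phi, y)
    print(clf.coef_, clf.intercept_)  # proportional to (-1, -3.05, 2.02) and 1.98
    print(clf.score(Phi, y))          # near-perfect accuracy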
Figure 6.3: The binary class data of Figure 6.1 shown in a derived feature space where the data is linearly separable (viewed from three camera angles in sub-figures 6.3a–6.3c).

Also, having the decision boundary as an algebraic equation in terms of the features, rather than as a kernel corresponding to an infinite-dimensional feature space such as radial basis functions, helps us later in checking the dimensional consistency of the learned model. Maintaining dimensional consistency is an important handle available to engineers for sanity-checking models learned for a physical process or system.

6.2.1 Evaluating Fitness of a CGP Individual for a Binary Classification Task

Consider a classification problem with no observations, nx features xi, and no binary class labels (yj ∈ {0, 1}) provided with the problem. When solving a classification problem using CGP, consider a CGP individual with the same rule structure as shown in Equation (5.1), i.e.

    r ≡ y = w0 + Σ_{i=1}^{nt} wi · ti,

where nt is the number of terms in the rule. The terms ti can be considered as derived features obtained by the simple operations {+, −, ×, ÷} on the original features. The weights of this individual are then learned using a linear SVM method, and the misclassification error rate at the end of the weight optimization by the SVM is assigned as the error fitness of the individual. This is measured as

    ferror = e_I^trn + e_II^trn,    (6.2)

where e_I^trn and e_II^trn represent the type-I and type-II error rates (in %) [104] on the training set. The complexity fitness is calculated the same way as in the symbolic regression case, given by Equation (4.11), i.e.

    fcomplexity = Σ (nodes in all terms of the CGP individual).

Furthermore, in the case of the industry's classification problem, the cost of misclassifying a NoGo product is much higher than the cost of misclassifying a Go product. For this reason, the cost matrix used by the linear SVM for arriving at the weights is kept as

    C = [ 0   1 ]
        [ 25  0 ],

i.e. the cost of making a type-II error on the training set is set 25 times higher than the cost of making a type-I error.
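A sketch of this fitness assignment is given below (Python, with scikit-learn assumed). The 25:1 cost matrix is approximated here with scikit-learn's class_weight option; the function name and data layout are illustrative rather than the project code:

    import numpy as np
    from sklearn.svm import LinearSVC

    def classifier_fitness(Phi, y, n_nodes):
        # Phi: (n_o x n_t) matrix of derived-feature values t_i; y: 0 = Go, 1 = NoGo.
        # Missing a NoGo (type-II error) is costed 25x via the class weights.
        clf = LinearSVC(class_weight={0: 1, 1: 25}, max_iter=100000).fit(Phi, y)
        pred = clf.predict(Phi)
        e1 = 100.0 * np.mean(pred[y == 0] == 1)  # type-I error rate (%), training set
        e2 = 100.0 * np.mean(pred[y == 1] == 0)  # type-II error rate (%), training set
        f_error = e1 + e2                        # Eq. (6.2)
        f_complexity = n_nodes                   # Eq. (4.11): node count of all terms
        return f_error, f_complexity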
6.3 Performance on Small Feature Space

Let us now look at some results obtained for classifying real production data. We chose production data from two dates for our study, naming the two resulting data sets data set-1 and data set-2. The details of the production data from these two days are given in Table 6.1. A total of ten features, xi with i ∈ {1, . . . , 10}, were extracted.

Table 6.1: Production data details used for testing the CGP classifier.

    Data set     # of Go products   # of NoGo products
    Data set-1   2381               6
    Data set-2   1882               6

Table 6.2 shows the physical dimensions of these ten features. Due to the highly imbalanced data sets, we used the adaptive smote method [105, 106] to oversample the minority NoGo class.

Table 6.2: Small feature set and their details.

    Feature(s)      Physical Dimension
    x1, x2, x3      L2 M1 T−2
    x4              L1 M0 T0
    x5, x6          L1 M0 T−1
    x7              L0 M0 T0
    x8, x9, x10     L0 M0 T−1

6.3.1 Results on Production Data Set-1

Here we discuss the CGP results for production data set-1. Training was conducted over 70% of the data. Table 6.3 shows the CGP parameters used in this case.

Table 6.3: List of CGP parameters used for solving the binary classification problem of Section 6.3.1.

    Parameter   N     G     nt   dmax   pc     pch    pcl    pm
    Value       200   200   10   6      0.85   0.20   0.80   0.15

Figure 6.4 shows the set of PO classifiers obtained for the production data of data set-1. The vertical axis of the figure represents the misclassification error as a percentage of the training data set, and the horizontal axis represents the complexity of a decision rule. Three solutions are highlighted with different colors, with some extra information about the corresponding classifiers. These three solutions/classifiers represent three different trade-offs between accuracy and complexity, from a classifier which is the simplest but most inaccurate (shown in blue), through a solution with intermediate values of classification error and complexity (shown in red), to a solution which is very complex but highly accurate (shown in brown). For each of these solutions, we have also shown the type-I and type-II errors obtained on the test data set. Table 6.4 shows this information in tabular form. Note that for all three solutions shown in Table 6.4, only six of the original ten features appear in the discovered classification rules; the features x7, x8, x9 and x10 do not appear in any of these solutions. Furthermore, Figure 6.5 shows the decision boundary and the segregation of the Go and NoGo classes in the feature space for the knee solution classifier of Figure 6.4. The NoGo class shown in Figure 6.5 includes both the real and the synthetic NoGo class data of data set-1. Interestingly, the CGP algorithm is able to find the right feature space, namely t1 = x3/x6, t2 = x5/x4 and t3 = x1/(x3·x4), in which the production data is linearly separable.

Table 6.4: Summary of classification rules found for the binary classification problem of Section 6.3.1 and their corresponding error rates on the training and test sets.

    Soln Desc.      Rule for NoGo products                                          e_I^trn   e_II^trn   e_I^test   e_II^test
    Simplest        −20.08 + 7.385 x2/x4 > 0                                        30.90%    0.00%      32.18%     0.00%
    Knee            −25.72 + 3.028 x5/x4 + 1.348 x3/x6 + 3.141 x1/(x3 x4) > 0       3.61%     0.00%      4.57%      0.00%
    Most Accurate   −34.76 − 0.53 x3 + 4.07 x1/(x4(x3 − x4)) − 0.65 x2/(x4 x6)
                    − 0.23 x5/(x1 x4^2) + 0.17 x5/(x4 x6) > 0                       0.36%     0.00%      1.25%      0.00%

Figure 6.4: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.1 (misclassification error e_I^trn + e_II^trn in % versus rule complexity).

6.3.2 Results on Production Data Set-2

Here we discuss the CGP results for production data set-2. Training was conducted over 70% of the data, and the rest of the data was kept unseen during the training stage of CGP. Table 6.5 shows the CGP parameters used in this case.

Table 6.5: List of CGP parameters used for solving the binary classification problem of Section 6.3.2.

    Parameter   N     G     nt   dmax   pc     pch    pcl    pm
    Value       100   100   10   6      0.85   0.20   0.80   0.15

Figure 6.6 shows the set of PO classifiers obtained for the production data of data set-2. The vertical axis of the figure represents the misclassification error as a percentage of the training data set, and the horizontal axis represents the complexity of a decision rule. Notice that CGP has performed relatively better at finding very accurate classifiers on the data from data set-2 as compared to the classifiers learned on the data from data set-1.

Figure 6.5: Decision boundary for the knee solution classifier of the PO set of classifiers shown in Figure 6.4 (Go training data, real plus synthetic NoGo training data, and the separating plane in the derived feature space).

Again, we have highlighted three solutions with different colors, and we provide some extra information about the corresponding classifiers.
These three solutions/classifiers again represent three different trade-offs between accuracy and complexity, from the simplest but most inaccurate classifier (shown in blue), through a solution with intermediate values of classification error and complexity (shown in red), to a solution which is very complex but highly accurate (shown in brown). For each of these solutions, we have also shown the type-I and type-II errors obtained on the test data set. Table 6.6 shows the same information in tabular form.

Figure 6.6: PO set of solutions found by CGP on solving the binary classification problem of Section 6.3.2 (misclassification error e_I^trn + e_II^trn in % versus rule complexity).

One result that may bother a careful eye is the error rates obtained for the knee solution classifier shown in Table 6.6. As shown, both the type-I and type-II errors on the test set for the knee classifier discovered in the data set-2 data are zero. One reason for such a result can be that these results are based on single runs of CGP on the data. If multiple runs are made and the classification errors are calculated by averaging the results of the multiple runs, then we should at least see a non-zero type-I error on the test set, as has been the case with the other results.

Table 6.6: Summary of classification rules found for the binary classification problem of Section 6.3.2 and their corresponding error rates on the training and test sets.

    Soln Desc.      Rule for NoGo products                                                e_I^trn   e_II^trn   e_I^test   e_II^test
    Simplest        2.68 + 3.92 x5 > 0                                                    1.51%     0.00%      1.96%      0.00%
    Knee            −0.039 − 0.6 x5/x2 + x5 > 0                                           0.15%     0.00%      0.00%      0.00%
    Most Accurate   −3.88 + 6.06 x1/x4 + (9 × 10^−6) x2(x1 + x4)/(x4 x6 + x5 x10) > 0     0.08%     0.00%      0.18%      0.00%

6.4 Results on Larger Feature Space with Dimension Check

In the previous section, we presented results showing the performance of CGP as a classifier on real production data, but with just ten features. After these initial encouraging results, our collaborators were much more interested in knowing:

• how CGP performs with a larger set of features, and furthermore,
• if a set of suitable constants for the process is provided, whether dimensional analysis can identify a set of dimensionally consistent classifiers from the PO set obtained from CGP.

6.4.1 Data and Results

Although we conducted experiments on five years' worth of production data (more than 6 terabytes), here we present the results for two successive days of production, where the first day's data is used as the training set and the next day's data as the test set. In this case, we extracted a total of 56 features from the data, xi with i ∈ {1, . . . , 56}; these are shown in Table 6.7, and we were also provided the units of measurement of these features.

Table 6.7: Larger feature set and their dimensions.

    Feature(s)                       Physical Dimension
    x1                               L0 M0 T1
    x2, x3                           L2 M1 T−2
    xi, i ∈ {4, 5, . . . , 13}       L2 M1 T−4
    xi, i ∈ {14, 15, . . . , 23}     L2 M1 T−2
    x24                              L1 M0 T0
    xi, i ∈ {25, 26, . . . , 36}     L1 M0 T−1
    xi, i ∈ {37, 38, . . . , 46}     L1 M0 T1
    xi, i ∈ {47, 48, . . . , 56}     L0 M0 T−1

Furthermore, a list of four physical constants relevant to the physics of the process was also provided. These are listed in Table 6.8.
Table 6.8: The set of physical constants which were considered relevant to the underlying physics of the production process.

    Symbol   Name                 Numeric Value   SI units   Physical Dimension
    ρ        Material Density     2.7 × 10^3      kg/m^3     L−3 M1 T0
    E        Young's Modulus      7.0 × 10^10     Pa         L−1 M1 T−2
    Hv       Vickers Hardness     2.55 × 10^8     Pa         L−1 M1 T−2
    τ        Material thickness   2.0 × 10^−4     m          L1 M0 T0

Upon applying CGP as a classifier, followed by the dimensional consistency check, the results are shown in Figure 6.7. Note that the classifiers found to be dimensionally inconsistent are marked in red. Table 6.9 shows the details of the 14 PO classifiers obtained by CGP for this case, along with their type-I and type-II errors on the test set. Table 6.10 shows the classifiers of Table 6.9 after conducting the dimensional consistency check.

Figure 6.7: PO set of classifiers found by CGP for the two days' worth of production data described in Section 6.4.1.

Table 6.9: Details of the PO classifiers shown in Figure 6.7 (NoGo := y > 0).

    Soln ID   Expression                                                                                              e_I^test (%)   e_II^test (%)
    1         y = 2.531 (x4 + x12)/(x25 + x50) − 0.2903 x7 − 0.3398 x24 − 2.302 x6 − 1.142 x5/(x3 x4) + 0.9329        0.7            0.0
    2         y = 2.446 (x4 + x12)/(x25 + x50) − 0.3075 x7 − 2.446 x6 − 1.482 x5/(x3 x4) + 0.7842                     0.9            0.0
    3         y = 0.7485 x4 − 2.999 x6 − 0.4648 x7 − 0.6235 x8/x25 + 2.999 (x4 + x12)/(x25 + x50) − 1.432             1.4            0.0
    4         y = 0.4964 x4 − 1.465 x5 − 2.401 x6 − 0.2713 x7 + 2.401 (x4 + x12)/(x25 + x50) + 0.5066                 1.0            0.0
    5         y = 0.4421 x4 − 1.301 x5 − 1.803 x6 + 3.468 x25 + 2.917 x45 − 0.5354 x8/x25 + 0.0814                    1.9            0.0
    6         y = 0.7815 x4 − 2.063 x6 + 3.934 x25 + 4.339 x45 − 0.8135 x8/x25 + 3.309                                2.6            0.0
    7         y = 1.124 x4 − 0.1635 x7 + 7.317 x25 + 0.4802 x8/x6 − 2.6                                               2.8            0.0
    8         y = 0.9842 x4 − 0.1513 x7 + 6.852 x25 + 0.4642 x8/x6 + 3.21                                             3.2            0.0
    9         y = 4.768 x25 − 0.4487 x7 − 1.274 x5 + 0.3278 x4 + 4.121                                                2.4            0.0
    10        y = 5.591 x25 − 1.631 x5 + 0.1578 x8/x6 + 5.252                                                         3.9            0.0
    11        y = 0.7103 x4 − 1.761 x6 − 1.342 x7 + 4.998 x25 + 4.906                                                 3.5            0.0
    12        y = 4.894 x25 − 1.677 x7 − 1.747 x6 + 5.896                                                             4.2            0.0
    13        y = 6.097 x25 − 2.237 x6 + 6.353                                                                        8.8            0.0
    14        y = 6.516 x25 + 6.321                                                                                   52.7           0.0
Table 6.10: Details of the PO classifiers shown in Figure 6.7 after the dimensional check. The original expressions are as listed in Table 6.9; dimensionally inconsistent classifiers have no updated expression (NA).

    ID   Dim. Penalty   Updated expression (with the physical constants of Table 6.8)
    1    2.309          NA
    2    2.309          NA
    3    2.374          NA
    4    4.041          NA
    5    0.707          NA
    6    0.707          NA
    7    0.0            y = 1.48 × 10^12 (ρ/(E Hv τ)) x4 − 2.12 × 10^11 (ρ/(E Hv τ)) x7 + 2249.54 √(ρ/Hv) x25 + 0.48 x8/x6 − 2.6
    8    0.0            y = 1.30 × 10^12 (ρ/(E Hv τ)) x4 − 2.0 × 10^11 (ρ/(E Hv τ)) x7 + 2105.69 √(ρ/Hv) x25 + 0.4642 x8/x6 + 3.21
    9    0.0            y = 1465.25 √(ρ/Hv) x25 − 5.93 × 10^11 (ρ/(E Hv τ)) x7 − 1.68 × 10^12 (ρ/(E Hv τ)) x5 + 4.34 × 10^11 (ρ/(E Hv τ)) x4 + 4.121
    10   0.0            y = 1718.17 √(ρ/Hv) x25 − 2.16 × 10^12 (ρ/(E Hv τ)) x5 + 0.1578 x8/x6 + 5.252
    11   0.0            y = 9.39 × 10^11 (ρ/(E Hv τ)) x4 − 2.33 × 10^12 (ρ/(E Hv τ)) x6 − 1.77 × 10^12 (ρ/(E Hv τ)) x7 + 1535.93 √(ρ/Hv) x25 + 4.906
    12   0.0            y = 1503.97 √(ρ/Hv) x25 − 2.22 × 10^12 (ρ/(E Hv τ)) x7 − 2.31 × 10^12 (ρ/(E Hv τ)) x6 + 5.896
    13   0.0            y = 1874.62 √(ρ/Hv) x25 − 2.96 × 10^12 (ρ/(E Hv τ)) x6 + 6.35
    14   0.0            y = 2003.69 √(ρ/Hv) x25 + 6.32

6.5 Concluding Remarks

In this chapter, we showed how CGP, followed by dimensional analysis, can be applied to learn simple and interpretable rules from binary-labeled data. We could obtain dimensionally consistent rules from real-world production data from industry. Such models bring valuable insight into the underlying physics of the process and can be used both for validating known analytical laws against data and for aiding the construction of analytical models of physical processes that are not yet completely understood by researchers.

This concludes Part I of this thesis, in which we presented various ways of learning simple and interpretable rules from data. Now, we move on to the other important aspect of online innovization, i.e. using this derived knowledge to intervene during an EMO run and, hopefully, expedite its convergence to the PO front. We look at this aspect in the next part.

Part II

Rule Based Repair

Chapter 7

Performing Repairs Based on Fixed Form Rules for Expedited Convergence

Part I of this dissertation discussed ways in which we can distill fixed and free form rules, or mathematical expressions, from data. Recall from Figure 2.6 that when we use an EMO algorithm to solve a MOO problem, applying an online innovization procedure involves:

• first choosing a set of solutions to learn rules from, then
• learning rules or patterns that exist in those solutions, and then
• performing an intervention in the EMO algorithm to expedite its convergence rate.

The two chapters in Part II focus on using the learned rules of fixed form discussed in Chapter 3 to intervene in the EMO algorithm and possibly expedite its convergence towards the PO front of the MOO problem.

7.1 Rule Based Repair

Recall from Section 3.1 that the basis function(s) for a rule, in the context of innovization, can be any scalar function φ of the design variables, including the variables, objectives, constraints and any function thereof. Fixed form rules of constant type, given by Equation (3.4), and of power-law form, given by Equation (3.7), can be represented by the single form

    ψi(φ1, φ2, . . . , φnφ) ≡ Π_{j=1}^{nφ} φj^{aij·bij} = ci,    (7.1)

where aij ∈ {0, 1}, ci, bij ∈ R, and nφ is the number of basis functions. In Equation (7.1), ψi represents the ith candidate rule, φj represents the jth basis function in the list of nφ basis functions being considered for learning rules, aij is a binary variable that decides whether the jth basis function is included in the ith candidate rule ψi, and bij, ci are rule parameters that are estimated during an EMO/I run. If ψi is a constant rule, say φj = ci, then, for k ∈ {1, 2, . . . , nφ},

    aik = 1 if k = j, and 0 otherwise.    (7.2)
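As a concrete illustration of this representation (an illustrative sketch; the helper names are ours), candidate rules can be stored as a binary inclusion mask a, an exponent vector b and a constant c, and enumerated over all non-empty subsets of the basis functions:

    import numpy as np
    from itertools import combinations

    def candidate_rule_masks(n_phi, max_len=None):
        # All non-empty inclusion masks a_i of Eq. (7.1): 2**n_phi - 1 in total,
        # optionally capped at a maximum number of interacting basis functions.
        max_len = max_len or n_phi
        for k in range(1, max_len + 1):
            for idx in combinations(range(n_phi), k):
                mask = np.zeros(n_phi, dtype=int)
                mask[list(idx)] = 1
                yield mask

    def rule_residual(phi, mask, b, c):
        # Evaluate prod_j phi_j**(a_ij * b_ij) - c_i for one solution's basis values.
        return float(np.prod(phi ** (mask * b)) - c)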
This chapter considers candidate rules involving only the design variables as basis functions, so Equation (7.1) takes the form

    ψi(x1, x2, . . . , xnx) ≡ Π_{j=1}^{nx} xj^{aij·bij} = ci,    (7.3)

where nx is the number of design variables in the MOO problem. We refer to the set of rules represented by Equation (7.3) as candidate rules. In a problem with nx design variables, a maximum of Σ_{k=1}^{nx} C(nx, k) = 2^nx − 1 candidate rules of fixed form involving just the variables as basis functions can be evaluated. Of those, nx are constant type rules, one corresponding to each variable, and the rest are power law type rules.

Figure 7.1 shows a flowchart of online innovization applied to EMO algorithms, i.e. the "EMO/I" method proposed in Section 2.5. Except for three regular blocks, namely the Rule Basis and Quality block, the Learn block and the Repair block, and two decision blocks, namely the L and R blocks, the rest of the flowchart is that of a generic EMO algorithm. In this work, we used the NSGA-II algorithm [92] as the EMO algorithm of choice, but it can be any population based optimization method. The following sections provide a brief description of the flowchart, with a focus on the aforementioned blocks that are specific to EMO/I.

Figure 7.1: The flowchart of an EMO/I algorithm (decision blocks: T — terminate?, L — learn?, R — repair?).

7.1.1 Rule Basis and Quality Block

The rule basis and quality (RBQ) block carries information on:

1. which rules, out of the maximum of 2^nx − 1, we wish to learn, and
2. what threshold quality measure a candidate rule must meet to be considered suitable for variable repair.

Information for both of these questions is sought from the user at the beginning of the EMO/I procedure. For example, if a MOO problem has three variables but the user wants to discover rules with at most two-variable interactions, then we can set the aij as shown in Table 7.1.

Table 7.1: An example of candidate rule information available in the RBQ block.

    Rule Id (i)   aij values                    Rule type
    1             a11 = 1, a12 = 0, a13 = 0     Constant rule
    2             a21 = 0, a22 = 1, a23 = 0     Constant rule
    3             a31 = 0, a32 = 0, a33 = 1     Constant rule
    4             a41 = 1, a42 = 1, a43 = 0     Power law rule
    5             a51 = 1, a52 = 0, a53 = 1     Power law rule
    6             a61 = 0, a62 = 1, a63 = 1     Power law rule

Each candidate rule is learned and evaluated for its quality. For a constant rule, this quality is measured using the coefficient of variation (Cv) metric (see Section 3.2); a constant rule is considered good quality and ready for making repairs if Cv ≤ Cv^(max). Similarly, for power law type rules, we use the R²adj metric from OLSR to assess quality (see Section 3.3); a power law rule is considered good quality and ready for making repairs if R²adj ≥ R²adj^(min). We have kept the default values of these thresholds at Cv^(max) = 0.05 and R²adj^(min) = 0.95; the user may change these values as per his/her requirement.

A Note on Transition Points

A transition point in a bi-objective PO front is a point across which the nature of the mathematical relations among the PO solutions changes; the innovized principles on either side of a transition point are different.
Some reasons for encountering transition points are:

(a) some constraint or variable bound becomes active (or inactive) and forces (or eases) the PO solutions to adhere to additional (or fewer) rules across the transition boundary, or
(b) the nature of one or more of the objective functions changes significantly across the transition boundary.

In this chapter, we consider MOO problems that do not have a transition point; we return to transition points in Chapter 8. Thus, for the problems in this chapter, the rules of the form given by Equation (7.1), if discovered, are applicable to all the PO solutions.

7.1.2 Decision Block-L

Once EMO/I begins and a population of individuals has been initialized, evaluated, assigned fitness and operated on by the genetic operators (crossover and mutation), the decision block 'L' decides how long the algorithm should wait before it begins to learn the parameters and quality of the candidate rules. This is necessary because in the initial phases of an EMO algorithm, the solutions may still be far away from the PO front and may not adhere to any rule. Such a waiting decision can be implemented in many ways, for example:

• if some minimum fraction of the maximum function evaluations (MFEs) allowed for a problem have passed, or
• if the number of solutions in the non-dominated set is above a minimum threshold, or
• if the increase in hypervolume has slowed down to some pre-specified level.

In this work, we use the first criterion, i.e. some fraction of the MFEs, as the deciding criterion to start learning rules.

7.1.3 Learn Block

This block attempts to learn the rule parameters bij and ci shown in Equation (7.3), as well as the quality of each candidate rule. Once EMO/I meets the criterion set by decision block 'L', a copy of the parent population, P(t)copy, is sent to the Learn block. The rule parameters and rule quality for constant rules and power law rules are estimated as follows.

7.1.3.1 Constant Type Rules

Consider a rule of the form given in Equation (7.3) for a MOO problem with nx variables. For a candidate rule ψi composed of the jth variable xj, the constant rule is given by

    ψi(φj) ≡ xj = ci.    (7.4)

Comparing with Equation (7.3), we know in this case that, for k ∈ {1, 2, . . . , nx},

    aik = 1 if k = j, and 0 otherwise;   bik = 1 if k = j, and 0 otherwise.    (7.5)

From Section 3.2.1, we know that the value of the parameter ci can be estimated as

    ĉi = µj,    (7.6)

where µj is the mean of the jth decision variable over the non-dominated solutions set. Furthermore, the coefficient of variation (Cv) of the jth variable over the non-dominated solutions set is taken as the measure of quality for such a rule. This quality parameter is later used in decision block R of Figure 7.1.

7.1.3.2 Power Law Type of Rules

As shown in Section 3.3.1, to estimate the parameters and quality of power law rules, we use log-linear modeling followed by OLSR on the data of the non-dominated solutions. Consider an example of a three variable MOO problem in which we want to learn the parameters of the two variable power law rule

    x1^b1 x3^b3 = c.    (7.7)

Taking the log of both sides gives b1 log x1 + b3 log x3 = log c, which is a linear equation. If log x1 is chosen as the regressand, then

    log x1 = (−b3/b1) log x3 + (log c)/b1 = β̂ log x3 + γ̂,   =⇒   x1 = e^γ̂ x3^β̂.    (7.8)

The parameters β̂ and γ̂ are the estimates returned by an ordinary least squares linear regression (OLSR) method using the logs of the 1st and 3rd variables from the non-dominated set. OLSR also returns the R²adj value, which is later used to assess the quality of such a candidate rule in the repair stage.
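Both estimators are simple in practice. The sketch below (illustrative Python/NumPy; the function names are ours) fits the log-linear model of Equation (7.8) with OLS and reports R²adj, along with the Cv quality measure for constant rules:

    import numpy as np

    def constant_rule_quality(xj):
        # Constant rule x_j = c_i: estimate c_i as the mean, quality C_v = sigma/mu.
        return np.mean(xj), np.std(xj) / np.mean(xj)

    def learn_power_law(x1, x3):
        # Log-linear OLS fit of x1 = e^gamma * x3^beta, Eq. (7.8).
        X = np.column_stack([np.ones_like(x3), np.log(x3)])
        y = np.log(x1)
        (gamma, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
        y_hat = X @ np.array([gamma, beta])
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        n, p = len(y), 1                     # n points, one regressor
        r2 = 1.0 - ss_res / ss_tot
        r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
        return beta, gamma, r2_adj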
7.1.4 Decision Block-R

This block decides whether a rule is good enough in quality to qualify for the repair stage. As mentioned in Section 7.1.1,

• the quality of one-variable rules is measured using the coefficient of variation (Cv), and
• the quality of rules with more than one variable is measured using the R²adj value returned by OLSR.

EMO/I is provided with the threshold quality parameters, namely the maximum coefficient of variation Cv^(max) for constant type rules and R²adj^(min) for power law rules, in the RBQ block at the start of EMO/I. A rule qualifies for repair in the next stage as follows:

• for constant rules, Cv ≤ Cv^(max), and
• for power law rules, R²adj ≥ R²adj^(min).

We refer to the candidate rules whose quality parameters surpass the threshold values as Qualifying Rules. A set of u such qualifying rules in generation t of EMO/I, say Ψ(t)Q = {ψq1, ψq2, . . . , ψqu}, along with a copy of the parent population, P(t)copy, is passed on to the Repair block, as shown in Figure 7.1. In this work, we refer to the union of the sets of variables constituting the qualifying rules as Qualifying Variables.

7.1.5 Repair Block

In some generation t of EMO/I, the repair block receives a copy of the parent population (P(t)copy) and a set of qualifying rules (Ψ(t)Q = {ψq1, ψq2, . . . , ψqu}), and yields a population with repaired variables (R(t)). The procedure for making such a repair is given by Algorithm 7.1 and is named RepairPop(). Next, we briefly describe the overall algorithm, followed by a detailed explanation of its important components.

Algorithm 7.1 RepairPop( )
input: P(t)copy, Ψ(t)Q = {ψq1, ψq2, . . . , ψqu}   (parent population, list of qualifying rules)
output: R(t)   (repaired population)
 1: R(t) ← ∅
 2: for k ← 1 to |P(t)copy| do
 3:     Pk ← kth individual of P(t)copy
 4:     I ← ∅                              // initialize set of repaired variables in Pk
 5:     Ψ̃(t)Q ← wrShuffle(Ψ(t)Q)           // wtd. random shuffle w.r.t. rule length preference
 6:     for m ← 1 to |Ψ̃(t)Q| do
 7:         ψ ← mth rule of Ψ̃(t)Q
 8:         Iψ ← getVars(ψ)                // get set of variables in rule ψ
 9:         J ← Iψ \ I                     // variables in rule ψ not yet repaired in Pk
10:         if J ≠ ∅ then
11:             v ← chooseVar(J)           // choose variable w.r.t. frequency preference
12:             makeRepair(Pk, v)          // repair variable v of individual Pk
13:             I ← I ∪ {v}
14:     R(t) ← R(t) ∪ {Pk}                 // update repaired population set
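For readers who prefer imperative code, a compact Python sketch of Algorithm 7.1 follows. Everything here is illustrative: wr_shuffle and choose_var stand in for the preference strategies of Section 7.2, make_repair for the variable repairs of Section 7.1.5, and a rule object is assumed to expose its variable set via get_vars():

    def repair_pop(pop, qualifying_rules, wr_shuffle, choose_var, make_repair):
        # Sketch of Algorithm 7.1: repair each parent individual in turn.
        repaired = []
        for ind in pop:
            done = set()                               # I: variables repaired in ind
            for rule in wr_shuffle(qualifying_rules):  # rule-preference order
                todo = set(rule.get_vars()) - done     # J = I_psi \ I
                if todo:
                    v = choose_var(todo)               # variable-preference choice
                    make_repair(ind, v, rule)          # Secs. 7.1.5.1 and 7.1.5.2
                    done.add(v)
            repaired.append(ind)
        return repaired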
The RepairPop() procedure takes P(t)copy and Ψ(t)Q as input; a line-wise description follows.

• Lines 1–4: The output R(t) is initialized as an empty set. The procedure then loops over all individuals of P(t)copy to repair them one by one. For the kth individual, Pk, the set I of repaired variables in Pk is initialized to the empty set.
• Lines 5–6: The rules in Ψ(t)Q are shuffled using the wrShuffle() function with respect to a probability distribution based on the lengths of the rules; this is discussed in Section 7.2.1. We refer to the number of variables in a rule as the length of that rule; for example, the rule ψ ≡ x1 · x2 = 3 has length two. The procedure then sequentially goes over all rules in the shuffled list Ψ̃(t)Q.
• Lines 7–9: For the mth rule ψ in the shuffled rule list Ψ̃(t)Q, the set of variables involved in ψ is stored in Iψ using the getVars() function; for example, for the rule ψ ≡ x1 · x2 = 3, Iψ = {x1, x2}. Subsequently, the set of variables that are involved in rule ψ but have not yet been repaired in the individual Pk is stored in J.
• Lines 10–13: If J = ∅, control passes back to the for loop of line 6. Otherwise, some variable v ∈ J is chosen using the chooseVar() function. This choice depends on a probability distribution based on the frequencies of the variables in the qualifying rules, as explained in Section 7.2.2. Subsequently, variable v of individual Pk is repaired using the makeRepair() function, whose operation is explained in Sections 7.1.5.1 and 7.1.5.2. After repairing variable v, the set I is updated.
• Line 14: Once all possible repairs have been performed on the individual Pk, it is added to the repaired population R(t). This continues until the for loop of line 2 has covered all individuals of P(t)copy.

Next we discuss the working of the makeRepair() function of Algorithm 7.1.

7.1.5.1 Repairing Variables Based on Constant Rules

Consider a rule of the form given in Equation (7.4), corresponding to a candidate rule ψi composed of the jth variable xj,

    ψi ≡ xj = ci,    (7.9)

where ci is estimated as given in Equation (7.6). Then the jth variable in an individual of the P(t)copy population is repaired as

    x̂j = U(µj − σj, µj + σj),    (7.10)

where x̂j is the repaired value of variable xj, µj and σj are the mean and standard deviation, respectively, of the decision variable xj over the non-dominated solutions set, and U(a, b) represents a uniform random distribution between a and b.

7.1.5.2 Repairing Variables Based on Power Law Rules

To explain the repair method in this case, we take the example shown in Section 7.1.3.2. From Equation (7.8), we can get the repaired value of variable x1 as

    x̂1 = e^γ̂ x3^β̂.    (7.11)

Similarly, the repaired value of the regressor variable x3 can be obtained as

    x̂3 = (x1 / e^γ̂)^(1/β̂).    (7.12)

A similar logic follows for cases with more than two variables. Note that if the repaired value of a variable lies outside its a priori defined bounds, then the repaired variable is set to its nearest bound value.
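A minimal sketch of these repairs (illustrative Python/NumPy; the names are ours, and the bound handling follows the nearest-bound note above):

    import numpy as np

    rng = np.random.default_rng()

    def repair_constant(mu, sigma, lo, hi):
        # Eq. (7.10): resample x_j uniformly in [mu - sigma, mu + sigma], then clip.
        return float(np.clip(rng.uniform(mu - sigma, mu + sigma), lo, hi))

    def repair_power_law_regressand(x3, beta, gamma, lo, hi):
        # Eq. (7.11): x1 = e^gamma * x3^beta, clipped to the variable bounds.
        return float(np.clip(np.exp(gamma) * x3 ** beta, lo, hi))

    def repair_power_law_regressor(x1, beta, gamma, lo, hi):
        # Eq. (7.12): x3 = (x1 / e^gamma)^(1/beta), clipped to the variable bounds.
        return float(np.clip((x1 / np.exp(gamma)) ** (1.0 / beta), lo, hi))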
7.2 Repair Strategies

This section discusses three rule-preference strategies, corresponding to the wrShuffle() function, and three variable-preference strategies, corresponding to the chooseVar() function, used in Algorithm 7.1. Recall that we call the set of rules whose quality parameters surpass the threshold quality values set in the RBQ block of Figure 7.1 the qualifying rules, and the union of the sets of variables constituting these qualifying rules the qualifying variables. The following two choices need to be made before any repair of the kinds illustrated in Section 7.1.5 is made to the solution individuals of P(t)copy:

• choose one of the possibly many qualifying rules on which to base the repair, and
• choose one of the possibly many variables for repair from the chosen qualifying rule.

For example, choosing a random rule from the qualifying rules pool and then choosing a random variable from the variables of the chosen qualifying rule is one such strategy. Both these choices have an effect on:

1. the parameters used to make a repair, and
2. the sequence in which variables are repaired.

7.2.1 Rule Preference Strategies

We investigate three rule-preference strategies. Recall that we call the number of variables in a rule the length of that rule. Before discussing these strategies, let us look at the wrShuffle() function, as it is used in implementing all three rule preference strategies in line 5 of Algorithm 7.1.

The wrShuffle() function stands for weighted random shuffle (WRS), and it is the same as weighted random sampling from a sequence without replacement. Let a = (a1, a2, . . . , an) and w = (w1, w2, . . . , wn) be sequences of n objects and unnormalized weights, respectively, such that weight wi corresponds to element ai. Algorithm 7.2 gives pseudo code for the wrShuffle() function.

Algorithm 7.2 wrShuffle( )
input: a = (a1, a2, . . . , an), w = (w1, w2, . . . , wn)
output: r   (weighted random shuffled sequence)
 1: r ← ∅
 2: n ← sizeOf(a)
 3: while n > 0 do
 4:     s ← Σ_{k=1}^{n} wk
 5:     P ← (p1, p2, . . . , pn), where pi = wi/s                  // probability mass values
 6:     C ← (0, p1, p1 + p2, . . . , Σ_{k=1}^{n} pk)               // cumulative distribution values
 7:     u ← U(0, 1)                                                // uniform random number
 8:     i ← 1
 9:     while true do
10:         if Ci ≤ u < Ci+1 then break
11:         else i ← i + 1
12:     r ← r ⌢ ⟨ai⟩                                               // append ai to sequence r
13:     a ← a \ ⟨ai⟩                                               // remove element ai from a
14:     n ← n − 1
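In plain Python, the same procedure can be sketched as follows (an illustrative implementation; random.random() plays the role of U(0, 1)):

    import random

    def wr_shuffle(items, weights):
        # Weighted random shuffle (Algorithm 7.2): repeatedly sample without
        # replacement with probability proportional to the remaining weights.
        items, weights = list(items), list(weights)
        out = []
        while items:
            total = sum(weights)
            u, acc, i = random.random(), 0.0, 0
            # Walk the cumulative distribution until its interval covers u.
            while i < len(items) - 1 and acc + weights[i] / total <= u:
                acc += weights[i] / total
                i += 1
            out.append(items.pop(i))
            weights.pop(i)
        return out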
Let us now look at the rule preference strategies. Let there be m rules in the qualifying rules list Ψ(t)Q = (ψ1, ψ2, . . . , ψm) of Algorithm 7.1, with lengths (l1, l2, . . . , lm) respectively. The three rule-preference strategies assign different weight sequences to the m qualifying rules in the wrShuffle() function:

(i) No preference: no rule is given preference over the others based on its length, and the weight sequence w = (1, 1, . . . , 1) is used by wrShuffle().
(ii) Prefer long rules: lengthier rules are preferred over shorter rules, and the weight sequence w = (l1, l2, . . . , lm) is used by wrShuffle().
(iii) Prefer short rules: shorter rules are preferred over lengthier rules, and the weight sequence w = (1/l1, 1/l2, . . . , 1/lm) is used by wrShuffle().

7.2.2 Variable Preference Strategies

We investigate three variable-preference strategies, which are based on the frequency of a variable among the qualifying rules; they are implemented via the chooseVar() function of Algorithm 7.1. Let there be n qualifying variables, V(t)Q = {v1, v2, . . . , vn}, in the list of qualifying rules Ψ(t)Q = (ψ1, ψ2, . . . , ψm). Furthermore, let the frequencies of the qualifying variables in V(t)Q be F(t)Q = {f1, f2, . . . , fn}, where fi is the frequency of qualifying variable vi. Let some rule ψi ∈ Ψ(t)Q be under consideration for making a variable repair to population individual Pk at some step of the for loop of line 6 of Algorithm 7.1. Then J ⊆ V(t)Q represents the set of variables in rule ψi that have not yet been repaired in the individual Pk up to that instant. Let

    J = {vj1, vj2, . . . , vjs}, and F = {fj1, fj2, . . . , fjs},    (7.13)

where fju ∈ F is the frequency of variable vju ∈ J in the qualifying variable set V(t)Q, u ∈ {1, 2, . . . , s}. The variable-preference strategies are:

(i) No preference: no variable is given preference over the others based on its frequency among the qualifying variables. In this case, the chooseVar() function randomly picks one variable from the set J defined in Equation (7.13).
(ii) Prefer common variables: qualifying variables that are more frequent among the qualifying variables are given a higher preference of being selected first for repair. In this case, the chooseVar() function picks variable vju ∈ J of Equation (7.13) with probability

    pu = fju / Σ_{r=1}^{s} fjr.    (7.14)

(iii) Prefer less-common variables: qualifying variables that are less frequent among the qualifying variables are given a higher preference of being selected first for repair. In this case, the chooseVar() function picks variable vju ∈ J of Equation (7.13) with probability

    pu = (1/fju) / Σ_{r=1}^{s} (1/fjr).    (7.15)
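A sketch of chooseVar() under the three preferences (illustrative Python; here freq is assumed to map each qualifying variable to its frequency among the qualifying rules):

    import random

    def choose_var(candidates, freq, prefer="none"):
        # Frequency-based variable choice of Sec. 7.2.2.
        cand = list(candidates)
        if prefer == "none":
            w = [1.0] * len(cand)              # uniform choice
        elif prefer == "common":
            w = [freq[v] for v in cand]        # Eq. (7.14)
        else:  # "uncommon"
            w = [1.0 / freq[v] for v in cand]  # Eq. (7.15)
        return random.choices(cand, weights=w, k=1)[0]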
These strategies of Sections 7.2.1 and 7.2.2, when permuted together, form a total of nine strategy combinations. They are listed in the first nine rows of Table 7.2; the tenth strategy is a pure EMO algorithm without any innovization based repairs.

Table 7.2: Different variable repair strategies for EMO/I studied in this work.

    ID   Rule Preference   Variable Preference    Abbrev.
    1    None              None                   NN
    2    None              Common variables       NC
    3    None              Uncommon variables     NU
    4    Long rules        None                   LN
    5    Long rules        Common variables       LC
    6    Long rules        Uncommon variables     LU
    7    Short rules       None                   SN
    8    Short rules       Common variables       SC
    9    Short rules       Uncommon variables     SU
    10   — NSGA-II with no repair —               NI

7.3 Test Problems

All test problems in this work are derived from the ZDT1 problem [107] and have the following form:

    Minimize f1(x) = x1,  f2(x) = g(x) h(f1(x), g(x)),    (7.16)

where h(f1, g) = 1 − √(f1/g). Every problem has a different g function, variable bounds and Pareto-optimal set, described in the following sections.

7.3.1 ZDT1-1

This problem is designed to have rules in the PO set such that:

• no variable is common between any two rules, and
• different rules may have different numbers of variables.

The problem is given by Equation (7.16), where g(x) is

    g(x) = 1 + |x2 − 0.5| + |x3 x4 − 0.5| + |x5 x6 x7 − 0.5|,
    x1,2 ∈ [0, 1], x3,4,5,6 ∈ [0.5, 1], x7 ∈ [0.5, 2].    (7.17)

Equation (7.18) shows the PO solutions set for this problem and Figure 7.2a shows the corresponding PO front:

    0.0 ≤ x1* ≤ 1.0,  x2* = 0.5,  x3* · x4* = 0.5,  and  x5* · x6* · x7* = 0.5.    (7.18)

This problem is specifically designed to test the repair strategies NI, NN, SN and LN; refer to Table 7.2 for a description of these strategies.

Figure 7.2: Pareto-optimal fronts for the test problems: (a) the ZDT1-1 problem; (b) the ZDT1-2 and ZDT1-3 problems.

7.3.2 ZDT1-2

This problem is designed to have rules in the PO set such that:

• some variables are common among two or more rules, and
• each rule has the same number of variables.

The ZDT1-2 problem is given by Equation (7.16), where g(x) is

    g(x) = 1 + |x1 x2 − 0.5| + |x2^0.5 x3 − √0.5| + |x2^2 x4 − 0.25| + |x2^−0.5 x5 − √0.5|,
    x1,2 ∈ [0.5, 1], x3 ∈ [√0.5, 1], x4 ∈ [0.25, 1], x5 ∈ [0.5, √0.5].    (7.19)

The Pareto-optimal region for this problem is given by Equation (7.20) and Figure 7.2b shows the corresponding PO front:

    x1* · x2* = 0.5,  (x2*)^0.5 · x3* = √0.5,  (x2*)^2 · x4* = 0.25,  and  (x2*)^−0.5 · x5* = √0.5.    (7.20)

This problem is designed to test the repair strategies NI, NN, NC and NU of Table 7.2.

7.3.3 ZDT1-3

This problem is designed to have rules in the PO set such that:

• some variables may be common among two or more rules, and
• each rule may have a different number of variables.

The ZDT1-3 problem is given by Equation (7.16), where g(x) is

    g(x) = 1 + |x1 x2 − 0.5| + |x1^0.5 x2 x3 − 0.5| + |x5 − 0.5| + |x1^−0.5 x2^2 x3 x4 − 0.25|,
    x1,2 ∈ [0.5, 1], x3 ∈ [√0.5, 1], x4 ∈ [0.25, 1], x5 ∈ [0, 1].    (7.21)

The Pareto-optimal region for this problem is given by Equation (7.22) and Figure 7.2b shows the corresponding PO front:

    x1* · x2* = 0.5,  (x1*)^0.5 · x2* · x3* = 0.5,  x5* = 0.5,  and  (x1*)^−0.5 · (x2*)^2 · x3* · x4* = 0.25.    (7.22)

This problem is designed to test and compare all the repair strategies given in Table 7.2.

7.4 Results on Test Problems

This section compares the performance of an EMO algorithm, NSGA-II in this work, with that of an EMO/I algorithm using the different variable repair strategies of Table 7.2 on the test problems of Section 7.3. The tables and figures in this section refer to the different contending algorithms by their respective abbreviations from Table 7.2; for example, an EMO algorithm without any variable repair strategy is referred to as 'NI'.

Furthermore, the different algorithms are compared on the metrics of median Generational Distance (GD) and median Inverse Generational Distance (IGD) [31] over thirty runs. All claims of one algorithm being better than another are backed by the results of the Wilcoxon Rank Sum (WRS) test [108] of statistical significance. We use the following convention to represent a test hypothesis: HA≺B represents the left tailed hypothesis test whose alternative hypothesis states that the median of distribution A is lower than the median of distribution B at some significance level α. A significance level of α = 5% is used in all the statistical tests, and both the h and p values are shown in the results. An h-value of 'No' means that the alternative hypothesis cannot be accepted at the desired significance level, and an h-value of 'Yes' means otherwise. Also, for the alternative hypothesis to be accepted, the corresponding p-value must be lower than the chosen α. The EMO parameters used for solving the test problems are given in Table 7.3.

Table 7.3: EMO parameters used for the test problems discussed in Section 7.3.

    Parameter                           ZDT1-1   ZDT1-2   ZDT1-3
    Population Size                     72       52       72
    Max Func. Evals                     36,000   36,400   50,400
    Prob. of Crossover                  0.9      0.9      0.9
    Prob. of Mutation                   1/7      1/5      1/5
    Crossover Index                     15       15       15
    Mutation Index                      20       20       20
    Start of Learning (% of Max FEs)    5        5        5
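For reference, one common way of computing the two comparison metrics against a sampled reference set of PO points is sketched below (illustrative Python/NumPy; the definitions follow the usual averaged nearest-distance forms):

    import numpy as np

    def gd(F, ref):
        # Generational Distance: mean distance from each obtained objective
        # vector in F to its nearest point of the reference PO set ref.
        d = np.linalg.norm(F[:, None, :] - ref[None, :, :], axis=2)
        return d.min(axis=1).mean()

    def igd(F, ref):
        # Inverse Generational Distance: mean distance from each reference
        # point to its nearest obtained objective vector.
        return gd(ref, F)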
Table 7.4: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-1 problem at 36,000 function evaluations and 5% significance level.

  Hypothesis     H_{NN≺NI}   H_{LN≺NI}   H_{SN≺NI}
  GD    h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}

[Figure 7.3: Median GD and IGD results for ZDT1-1 problem over 30 runs. (a) GD results; (b) IGD results.]

7.4.2 ZDT1-2 Results

This problem is designed to test the variable repair strategies NN, NC and NU of Table 7.2 against each other as well as their performance relative to the no-repair case, i.e. NI. Figure 7.4 shows the median GD and IGD results for the ZDT1-2 problem. The plots show that EMO/I with any of the three repair strategies NN, NC and NU performs better than the NI strategy in both GD and IGD for the same number of objective function evaluations. This observation is supported by the WRS results shown in Table 7.5. The table shows that the three alternative hypotheses, namely H_{NN≺NI}, H_{NC≺NI} and H_{NU≺NI}, can be accepted at the 5% significance level for GD as well as IGD in the ZDT1-2 problem at the end of the maximum function evaluations (36,400 for ZDT1-2). Furthermore, Figure 7.4 shows that the variable repair strategies NN, NC and NU have very similar performance and none can claim to be better than the others on this problem.

[Figure 7.4: Median GD and IGD results for ZDT1-2 problem over 30 runs. (a) GD results; (b) IGD results.]

Table 7.5: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-2 problem at 36,400 function evaluations and 5% significance level.

  Hypothesis     H_{NN≺NI}   H_{NC≺NI}   H_{NU≺NI}
  GD    h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}

7.4.3 ZDT1-3 Results

This problem is designed to test all nine variable repair strategies of Table 7.2, namely NN, NC, NU, LN, LC, LU, SN, SC and SU, against each other as well as their performance relative to the no-repair case, i.e. NI. To avoid illegible plots, these results are discussed in three parts.

7.4.3.1 Part-A: Strategies Preferring Short Rules

Figure 7.5 shows the median GD and IGD results for the ZDT1-3 problem comparing the variable repair strategies SN, SC and SU against each other as well as the no-repair case NI. The plots show that EMO/I with any of the three repair strategies SN, SC and SU performs better than the NI strategy in both GD and IGD for the same number of objective function evaluations. This observation is supported by the WRS test results shown in Table 7.6. The table shows that the three alternative hypotheses, namely H_{SN≺NI}, H_{SC≺NI} and H_{SU≺NI}, can be accepted at the 5% significance level for GD as well as IGD in the ZDT1-3 problem at the end of the corresponding maximum function evaluations (50,400 for ZDT1-3). Furthermore, Figure 7.5 shows that the variable repair strategies SN, SC and SU have very similar performance and none can claim to be better than the others on this problem.

[Figure 7.5: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, SC and SU repair strategies of EMO/I. (a) GD results; (b) IGD results.]
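For reference, one common way of computing the GD and IGD metrics used throughout these comparisons is sketched below, assuming F holds the obtained non-dominated objective vectors and Z a dense sampling of the true PO front. Variants using a different power mean also exist [31], so this is illustrative rather than the exact implementation used in this work.

```python
import numpy as np

def gd(F, Z):
    """Generational Distance: mean distance from each obtained point
    in F to its nearest reference point in Z (arrays of shape (n, m))."""
    d = np.linalg.norm(F[:, None, :] - Z[None, :, :], axis=2)
    return d.min(axis=1).mean()

def igd(F, Z):
    """Inverse Generational Distance: mean distance from each reference
    point in Z to its nearest obtained point in F."""
    d = np.linalg.norm(Z[:, None, :] - F[None, :, :], axis=2)
    return d.min(axis=1).mean()
```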
Table 7.6: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NI, SN, SC and SU at 50,400 function evaluations and 5% significance level.

  Hypothesis     H_{SN≺NI}   H_{SC≺NI}   H_{SU≺NI}
  GD    h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h        Yes         Yes         Yes
        p        < 10^{-6}   < 10^{-6}   < 10^{-6}

7.4.3.2 Part-B: Strategies with no Preference Based on Length of Rules

Figure 7.6 shows the median GD and IGD results for the ZDT1-3 problem comparing the variable repair strategies NN, NC, NU and SN against each other as well as the no-repair case NI. The plots show that:

- The repair strategies NN, NC and NU perform better than NI in both GD and IGD. This is backed by the results shown in the Hypothesis (group-I) columns of Table 7.7. Repair strategy SN also performs better than NI, but the corresponding results were already shown in Section 7.4.3.1.
- Furthermore, the repair strategy SN performs better than NN, NC and NU in terms of GD, whereas their performances in terms of the IGD metric are similar. The corresponding WRS test results are shown in the Hypothesis (group-II) columns of Table 7.7.

[Figure 7.6: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing NI, SN, NN, NC and NU repair strategies of EMO/I. (a) GD results; (b) IGD results.]

Table 7.7: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies NN, NC, NU, SN and NI at 50,400 function evaluations and 5% significance level.

                 Hypothesis (group-I)                    Hypothesis (group-II)
               H_{NN≺NI}   H_{NC≺NI}   H_{NU≺NI}   H_{SN≺NN}       H_{SN≺NC}       H_{SN≺NU}
  GD    h      Yes         Yes         Yes         Yes             Yes             Yes
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}   1.1 × 10^{-5}   2.0 × 10^{-3}   3.6 × 10^{-5}
  IGD   h      Yes         Yes         Yes         --- Inconclusive ---
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}

7.4.3.3 Part-C: Strategies Preferring Long Rules

Figure 7.7 shows the median GD and IGD results for the ZDT1-3 problem comparing the variable repair strategies LN, LC, LU and SN against each other as well as the no-repair case NI. The plots show that:

- The repair strategies LN, LC and LU perform better than NI in both GD and IGD. This is backed by the results shown in the Hypothesis (group-I) columns of Table 7.8. Repair strategy SN also performs better than NI, but the corresponding results were already shown in Section 7.4.3.1.
- Furthermore, the repair strategy SN performs better than LN, LC and LU in terms of GD, whereas their performances in terms of the IGD metric are similar. The corresponding WRS test results are shown in the Hypothesis (group-II) columns of Table 7.8.

[Figure 7.7: Median GD and IGD results for ZDT1-3 problem over 30 runs comparing LN, LC, LU, SN and NI repair strategies of EMO/I. (a) GD results; (b) IGD results.]

Table 7.8: Results of Wilcoxon rank sum test for GD and IGD of ZDT1-3 problem comparing EMO/I repair strategies LN, LC, LU, SN and NI at 50,400 function evaluations and 5% significance level.

                 Hypothesis (group-I)                    Hypothesis (group-II)
               H_{LN≺NI}   H_{LC≺NI}   H_{LU≺NI}   H_{SN≺LN}   H_{SN≺LC}   H_{SN≺LU}
  GD    h      Yes         Yes         Yes         Yes         Yes         Yes
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}   < 10^{-6}   < 10^{-6}   < 10^{-6}
  IGD   h      Yes         Yes         Yes         --- Inconclusive ---
        p      < 10^{-6}   < 10^{-6}   < 10^{-6}
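Before summarizing, it may help to see how the nine strategy combinations of Table 7.2 can be realized in code. The sketch below combines a rule preference (short, long or none) with the variable-selection probabilities of Equations (7.14) and (7.15); rules are represented simply as lists of the variable indices they involve, and all names are illustrative rather than taken from an existing implementation.

```python
import random
from collections import Counter

def choose_rule(rules, rule_pref):
    """Pick a qualifying rule: shortest ('S'), longest ('L'),
    or uniformly at random ('N', no preference)."""
    if rule_pref == 'S':
        return min(rules, key=len)
    if rule_pref == 'L':
        return max(rules, key=len)
    return random.choice(rules)

def choose_var(rule, rules, var_pref):
    """Pick the variable of `rule` to repair; f[v] is the frequency of
    variable v over all qualifying rules, as in Eqs. (7.14) and (7.15)."""
    f = Counter(v for r in rules for v in r)
    if var_pref == 'C':                       # prefer common variables
        weights = [f[v] for v in rule]
    elif var_pref == 'U':                     # prefer uncommon variables
        weights = [1.0 / f[v] for v in rule]
    else:                                     # no preference
        weights = [1.0] * len(rule)
    return random.choices(rule, weights=weights, k=1)[0]
```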
7.4.4 Summary of Results on Test Problems

Based on the results shown in Sections 7.4.1, 7.4.2 and 7.4.3, we can conclude the following:

- EMO/I with any of the nine variable repair strategies, namely NN, NU, NC, SN, SU, SC, LN, LU and LC, performs better than the no-variable-repair case (NI) in terms of GD and IGD on all three test problems.
- The results of Section 7.4.3 show that EMO/I with the variable repair strategy SN performs better than the repair strategies NN, NU, NC, LN, LU and LC in terms of GD. In terms of IGD, there is no single clear winner.

Following the implication of the above summary, we compare the performance of EMO/I with the SN repair strategy against that of the no-repair NI strategy on three engineering design MOO problems, presented in the following sections.

7.5 Engineering Problems

This section describes three MOO problems from engineering design. All three problems are slightly modified from their original forms so that there are no transition points in the final PO front. Doing so makes all the rules among the PO solution set applicable to the entire PO front.

7.5.1 Two Bar Truss Problem (TBT)

[Figure 7.8: A two membered truss structure.]

In this problem, a two bar truss structure shown in Figure 7.8 is designed to carry a certain load without elastic failure. The truss members have cross-sectional areas x_1 m² and x_2 m², respectively, and the point of loading is at a vertical distance of x_3 m from the truss member hinge. The two objectives, both to be minimized, are the volume of the structure and the maximum stress developed in its two members. Its mathematical formulation is given by Equation (7.23). The original formulation of the problem given in [31] did not contain the constraint g_2 of Equation (7.23); it is added here to avoid any transition points in the PO front, so that the rules found are applicable to all the PO solutions. The transition point in this problem is encountered because the cross-sectional area x_2 hits its upper limit in order to minimize both objectives in a PO way. The reader is referred to [10] for more details on the reason for, and location of, the transition point at V ≈ 0.044 m³.

    Minimize  V = x_1 \sqrt{16 + x_3^2} + x_2 \sqrt{1 + x_3^2},
              S = max(σ_{AC}, σ_{BC}),
    Subject to  g_1 ≡ max(σ_{AC}, σ_{BC}) ≤ 10^5,
                g_2 ≡ V ≤ 0.044,                                              (7.23)
    where  σ_{AC} = \frac{20 \sqrt{16 + x_3^2}}{x_3 x_1},   σ_{BC} = \frac{80 \sqrt{1 + x_3^2}}{x_3 x_2},
           x_1, x_2 ∈ [0, 0.01],  x_3 ∈ [1.0, 3.0].

Figure 7.9 shows the PO front for this problem obtained using NSGA-II. To ascertain that the black curve shown in Figure 7.9 is indeed the true PO front, a local search is run from six points (shown with gray circles) using the ε-constraint method [109]. An ε-constraint is set on the volume objective while the maximum stress objective is minimized with local search. As can be seen, the local search is unable to find any point very different from the start points. Furthermore, the new front obtained by running a local search from all the points of the NSGA-II front amounts to a minor increase of 0.0132% in the hypervolume [31]. Thus, the front of Figure 7.9 can safely be assumed to be very close to the ideal PO front and hence is used for evaluating the GD and IGD results in later sections.

[Figure 7.9: Pareto-optimal front for the two member truss problem.]
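A minimal sketch of the TBT evaluation of Equation (7.23) is given below; the function name tbt is illustrative, and the constraints are returned in a 'feasible when non-positive' form.

```python
import numpy as np

def tbt(x):
    """Objectives and constraints of Equation (7.23); x = (x1, x2, x3)."""
    x1, x2, x3 = x
    sig_ac = 20.0 * np.sqrt(16.0 + x3**2) / (x3 * x1)
    sig_bc = 80.0 * np.sqrt(1.0 + x3**2) / (x3 * x2)
    V = x1 * np.sqrt(16.0 + x3**2) + x2 * np.sqrt(1.0 + x3**2)
    S = max(sig_ac, sig_bc)
    g1 = S - 1e5          # g1 <= 0 when the stress constraint holds
    g2 = V - 0.044        # g2 <= 0 when the volume constraint holds
    return (V, S), (g1, g2)
```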
7.5.2 Metal Cutting Problem (MC)

[Figure 7.10: A diagram of the metal turning process, with cutting speed v (m/min), feed rate f (mm/rev) and depth of cut a (mm).]

This MOO problem involves optimizing the twin objectives of operation time (T_p) and used tool life (ξ) during the turning process of a steel bar on a CNC lathe with a P20 carbide tool, as shown in Figure 7.10. Equation (7.24) presents the mathematical formulation of this problem. The formulation given in Equation (7.24) contains an additional constraint g_4 compared to the original formulation of the problem given in [110]. This constraint on the operation time objective is added to avoid transition points in the PO front, so that the rules found are applicable to all PO solutions. The location of this transition point, at T_p ≈ 0.9545 minutes, was discovered manually in [29].

    Minimize  T_p(x), ξ(x),
    Subject to  g_1(x) ≡ 1 − \frac{P(x)}{η P^{max}} ≥ 0,   g_2(x) ≡ 1 − \frac{F_c(x)}{F_c^{max}} ≥ 0,
                g_3(x) ≡ 1 − \frac{R(x)}{R^{max}} ≥ 0,     g_4(x) ≡ T_p(x) ≤ 0.9545,
    where  T_p(x) = 0.15 + \frac{219{,}912}{MRR(x)} \left(1 + \frac{0.20}{T(x)}\right) + 0.05,
           ξ(x) = \frac{21{,}991{,}200}{MRR(x)\,T(x)},   T(x) = \frac{5.48 × 10^9}{v^{3.46} f^{0.696} a^{0.46}},        (7.24)
           P(x) = \frac{v F_c(x)}{60{,}000},   F_c(x) = \frac{6.56 × 10^3 f^{0.917} a^{1.10}}{v^{0.286}},
           MRR(x) = 1000\,v f a,   R(x) = \frac{125 f^2}{r_n},
           P^{max} = 10 kW,  F_c^{max} = 5000 N,  R^{max} = 50 µm,  r_n = 0.8 mm,  η = 0.75,
           v ≡ x_1 ∈ [250.0, 400.0] m/min,  f ≡ x_2 ∈ [0.15, 0.55] mm/rev,  a ≡ x_3 ∈ [0.5, 6.0] mm.

Figure 7.11 shows the PO front for this problem obtained using NSGA-II. To ascertain that the black curve shown in Figure 7.11 is indeed the true PO front, a local search is run from six points (shown with gray circles) using the ε-constraint method. An ε-constraint is set on the operation time objective, T_p, while the used tool life objective, ξ, is minimized with local search. As can be seen, the local search is unable to find any point very different from the start points. Furthermore, the new front obtained by running a local search from all the points of the NSGA-II front amounts to a minor increase of 0.0692% in the hypervolume. Thus the front of Figure 7.11 can safely be assumed to be very close to the ideal PO front and hence is used for evaluating the GD and IGD results in later sections.

[Figure 7.11: Pareto-optimal front for the Metal Cutting problem.]

7.5.3 Welded Beam Problem (WB)

In this problem, a beam needs to be welded to another beam and must carry a load F [10]. It is desired to find four design parameters (thickness of the beam b, width of the beam t, length of the weld l, and weld thickness h) for which both the cost of the beam C and the vertical deflection at the end of the beam D are minimized. Its mathematical formulation is given by Equation (7.25). The original formulation of the problem did not contain the constraint g_5 of Equation (7.25); it is added here to avoid any transition points in the PO front, so that the rules found are applicable to all the PO solutions. The transition point in this problem is encountered because, beyond a deflection of 0.004 inches, the weld thickness h reaches its maximum possible value, i.e. the beam thickness b. This value is obtained using manual innovization as outlined in [10].

    Minimize  C(x) = 1.10471 h^2 l + 0.04811 t b (14.0 + l),
              D(x) = \frac{2.1952}{t^3 b},
    Subject to  g_1(x) ≡ 13{,}600 − τ(x) ≥ 0,   g_2(x) ≡ 30{,}000 − σ(x) ≥ 0,
                g_3(x) ≡ b − h ≥ 0,   g_4(x) ≡ P_c(x) − 6000 ≥ 0,   g_5(x) ≡ 0.004 − D(x) ≥ 0,
    where  τ(x) = \sqrt{(τ')^2 + (τ'')^2 + \frac{l τ' τ''}{\sqrt{0.25(l^2 + (h + t)^2)}}},
           τ' = \frac{6000}{\sqrt{2}\, h l},                                                     (7.25)
           τ'' = \frac{6000 (14 + 0.5 l) \sqrt{0.25(l^2 + (h + t)^2)}}{2 \{0.707 h l (l^2/12 + 0.25 (h + t)^2)\}},
           σ(x) = \frac{504{,}000}{t^2 b},   P_c(x) = 64{,}746.022 (1 − 0.0282346 t) t b^3,
           0.125 ≤ h, b ≤ 5.0 inches,  0.1 ≤ l, t ≤ 10.0 inches.

Figure 7.12 shows the PO front for this problem obtained using NSGA-II. To ascertain that the black curve shown in Figure 7.12 is indeed the true PO front, a local search is run from six points (shown with gray circles) using the ε-constraint method. An ε-constraint is set on the cost objective, C, while the deflection objective, D, is minimized with local search. As can be seen, the local search is unable to find any point very different from the start points. Furthermore, the new front obtained by running a local search from all the points of the NSGA-II front amounts to a minor increase of 0.0017% in the hypervolume. Thus the front of Figure 7.12 can safely be assumed to be very close to the ideal PO front and hence is used for evaluating the GD and IGD results in later sections.

[Figure 7.12: Pareto-optimal front for the WB problem.]
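The ε-constraint verification used for all three fronts can be sketched as follows, assuming callable objectives f1 and f2. SciPy's SLSQP stands in here for whichever local search is actually used; in practice, the problem's own constraints g_i would be appended to cons as well.

```python
import numpy as np
from scipy.optimize import minimize

def eps_constraint_check(start, f1, f2, bounds, eps):
    """Minimize f2 subject to f1(x) <= eps, starting from a point on the
    NSGA-II front. If the optimizer cannot improve on the start point,
    the front is locally non-dominated there."""
    cons = [{'type': 'ineq', 'fun': lambda x: eps - f1(x)}]
    res = minimize(f2, np.asarray(start, dtype=float), method='SLSQP',
                   bounds=bounds, constraints=cons)
    return res.x, res.fun
```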
The EMO parameters used for these engineering design problems are shown in Table 7.9. Next, we discuss the performance of the no-repair strategy NI and the variable repair strategy SN from Table 7.2 in solving these engineering design problems.

Table 7.9: EMO parameters used for solving the engineering problems discussed in Section 7.5.

  Parameter                          TBT      MC       WB
  Population size                    52       72       92
  Max. function evaluations          10,036   36,000   25,024
  Prob. of crossover                 0.9      0.9      0.9
  Prob. of mutation                  1/3      1/3      1/4
  Crossover index                    10       10       10
  Mutation index                     50       50       50
  Start of learning (% of max FEs)   5        5        5

7.6 Results on Engineering Problems

The repair strategy SN was shown at the end of Section 7.4 to be the best performer among the repair strategies of Table 7.2 on the test problems. Hence, this section compares the performance of an EMO algorithm with that of an EMO/I algorithm with repair strategy SN. As in Section 7.4, the algorithms are compared on the metrics of median GD and median IGD over 30 runs. All claims of one algorithm being better than another are backed by results of the WRS test of statistical significance. As before, H_{A≺B} represents the left-tailed hypothesis test whose alternative hypothesis states that the median of distribution A is lower than the median of distribution B at some significance level α. A significance level of α = 5% is used in all the statistical tests. For all the statistical tests, both h and p values are shown in the results. An h-value of No means that the alternative hypothesis cannot be accepted at the desired significance level; an h-value of Yes means otherwise. For the alternative hypothesis to be accepted, the corresponding p-value must be lower than the chosen α.

7.6.1 Two Bar Truss Problem Results

Figure 7.13 shows the median GD and IGD results for the TBT problem. The plots show that EMO/I with the SN repair strategy performs better than the NI strategy in terms of GD. This observation is supported by the WRS test results shown in the TBT column of Table 7.10. In terms of IGD, the NI strategy performs marginally better. This is expected, as EMO/I is designed to focus more on convergence than diversity. Figure 7.14 shows the PO front for the best GD case out of the thirty runs of EMO/I with the SN strategy on the TBT problem, overlaid with the PO front of the TBT problem taken from Figure 7.9.

[Figure 7.13: Median GD and IGD results for TBT problem over 30 runs. (a) GD results; (b) IGD results.]
Table 7.10: Results of Wilcoxon rank sum test for GD of TBT, MC and WB problems at their respective maximum function evaluations and 5% significance level.

  Hypothesis     TBT: H_{SN≺NI}   MC: H_{SN≺NI}   WB: H_{SN≺NI}
  GD    h        Yes              Yes             Yes
        p        < 10^{-6}        < 10^{-6}       1.2 × 10^{-4}

7.6.1.1 Rules Found in TBT Problem

The rules detected and used in repairing the PO solutions of the TBT problem towards the end of EMO/I (with SN) in the best GD run are:

    x_1^{−1.0082} · x_2 = 2.075   and   x_3 = 1.9449 ± 0.0674.    (7.26)

Figures 7.15 and 7.16 show the PO solutions (corresponding to the PO front shown in Figure 7.9) as black points, overlaid with the learned variable relations of Equation (7.26) shown with a gray line. The two overlaid plots are very close; thus a close-up of a part of each plot is shown in the inset for clarity. The two learned rules of Equation (7.26) are very close to the analytical relations that exist among the PO solutions of the TBT problem, i.e. x_2/x_1 = 2.0 and x_3 = 2.0 [10].

[Figure 7.14: Front coverage by the EMO/I method in the TBT problem for the best GD run.]
[Figure 7.15: Rule between variables x_1 and x_2 identified by EMO/I in the TBT problem at the end of the best GD run.]
[Figure 7.16: Rule for variable x_3 identified by EMO/I in the TBT problem at the end of the best GD run.]

7.6.2 Metal Cutting Problem Results

Figure 7.17 shows the median GD and IGD results for the MC problem. The plots show that EMO/I with the SN repair strategy performs better than the NI strategy in terms of GD. This observation is supported by the WRS test results shown in the MC column of Table 7.10. In terms of IGD, the NI strategy performs marginally better. This is confirmed in Figure 7.18, which shows the PO front for the best GD case out of the thirty runs of EMO/I (with the SN strategy) on the MC problem, overlaid with the PO front of the MC problem taken from Figure 7.11. This is expected, as EMO/I is designed to focus more on convergence than diversity.

[Figure 7.17: Median GD and IGD results for MC problem over 30 runs. (a) GD results; (b) IGD results.]
[Figure 7.18: Front coverage by the EMO/I method in the MC problem for the best GD run.]

7.6.2.1 Rules Found in MC Problem

The rules detected and used in repairing the PO solutions of the MC problem towards the end of EMO/I (with SN) in the best GD run are:

    f = 0.55 ± 5.4 × 10^{−6}   and   v · a^{1.5420} = 804.6058.    (7.27)

Figures 7.19 and 7.20 show the PO solutions (corresponding to the PO front shown in Figure 7.11) as black points, overlaid with the learned variable relations of Equation (7.27) shown with a gray line. The two overlaid plots are very close; thus a close-up of a part of each plot is shown in the inset for clarity.

[Figure 7.19: Rule for variable f identified by EMO/I in the MC problem at the end of the best GD run.]
[Figure 7.20: Rule among variables v and a identified by EMO/I in the MC problem at the end of the best GD run.]
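Power-law rules of the kind reported in Equations (7.26) and (7.27) can be estimated from the non-dominated data by linear regression in log space, in the spirit of Section 7.1.3.2. The following is an illustrative ordinary-least-squares sketch, not the exact implementation used in this work, together with the explicit repair that such a rule enables.

```python
import numpy as np

def fit_power_law(xi, xj):
    """Fit the rule x_i^b * x_j = c from non-dominated solutions by
    ordinary least squares in log space: log x_j = log c - b log x_i."""
    A = np.vstack([-np.log(xi), np.ones_like(xi)]).T
    (b, log_c), *_ = np.linalg.lstsq(A, np.log(xj), rcond=None)
    return b, np.exp(log_c)

def repair_xj(xi_val, b, c):
    """Explicit repair as in Section 7.1.5.2: given x_i, reset x_j so
    that the learned rule x_i^b * x_j = c holds exactly."""
    return c / xi_val**b
```

For the TBT data, for instance, such a fit would return values close to b = −1.0082 and c = 2.075, reproducing the first rule of Equation (7.26).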
7.6.3 Welded Beam Problem Results

Figure 7.21 shows the median GD and IGD results for the WB problem. The plots show that EMO/I with the SN repair strategy performs better than the NI strategy in terms of GD. This observation is supported by the WRS test results shown in the WB column of Table 7.10. In terms of IGD, the SN strategy appears to perform better (Figure 7.21b), but NI catches up towards the end; moreover, the apparently lower IGD of SN at intermediate function evaluations is not supported by the WRS test results. Figure 7.22 shows the PO front for the best GD case out of the thirty runs of EMO/I (with the SN strategy) on the WB problem, overlaid with the PO front of the WB problem taken from Figure 7.12.

[Figure 7.21: Median GD and IGD results for WB problem over 30 runs. (a) GD results; (b) IGD results.]
[Figure 7.22: Front coverage by the EMO/I method in the WB problem for the best GD run.]

7.6.3.1 Rules Found in WB Problem

The rules detected and used in repairing the PO solutions of the WB problem towards the end of EMO/I (with SN) in the best GD run are:

    h · l^{0.8842} = 0.9386,   h · b^{−0.5600} = 0.7239,   l · b^{−0.6323} = 1.3408,   t = 9.9939 ± 0.0087.    (7.28)

Figures 7.23, 7.24, 7.25 and 7.26 show the PO solutions (corresponding to the PO front shown in Figure 7.12) as black points, overlaid with the learned variable relations of Equation (7.28) shown with a gray line. The two overlaid plots are very close; thus a close-up of a part of each plot is shown in the inset for clarity.

[Figure 7.23: Rule between variables h and l identified by EMO/I in the WB problem at the end of the best GD run.]
[Figure 7.24: Rule between variables h and b identified by EMO/I in the WB problem at the end of the best GD run.]
[Figure 7.25: Rule between variables l and b identified by EMO/I in the WB problem at the end of the best GD run.]
[Figure 7.26: Rule for variable t identified by EMO/I in the WB problem at the end of the best GD run.]

7.7 Concluding Remarks

The results in Sections 7.4 and 7.6 show that the idea of online innovization can be useful in expediting the convergence of EMO algorithms; however, this may come at the expense of the diversity achieved in the final PO front. The PO fronts of all the problems considered so far do not have any transition points. Also, the rules on which we tested the EMO/I idea were composed solely of the MOO problem's variables; no rules contained any other possible basis functions, such as the MOO problem's objective values or constraints. In the coming chapter, we look at ideas that address these issues.
Chapter 8

Issues of Transition Points and Repairs Based on Complex Rules

8.1 Handling Transition Points

Until now, we have avoided dealing with transition points in PO fronts and assumed that whatever rules we find are applicable to all the solutions of a PO front. However, this is often not the case in practice, and the PO solutions of MOO problems usually contain one or more transition points [10]. A transition point is a point on a PO front across which the design rules adhered to by the solutions change significantly. Although such "points" can be lines or regions when the PO front is a surface in a three or more dimensional objective space, in bi-objective problems they are invariably points, and hence we will stick with the term transition point, or TP, from here on. In problems with transition points present in the PO front, the methods developed in Chapter 7 may not work, because there we assumed that the rules being learned apply to the entire front. In this section, we will see how we can learn rules and repair solutions as part of EMO/I in the presence of transition points.

Let us first try to understand the concept of a transition point with an example. Equation (8.1) shows another version of the two bar truss problem. Recall that this truss problem was also presented earlier in Section 7.5.1 as Equation (7.23). Note that the problem description given in Equation (8.1) has one less constraint, namely g_2 ≡ V ≤ 0.044, than the problem given in Equation (7.23).

    Minimize  V = x_1 \sqrt{16 + x_3^2} + x_2 \sqrt{1 + x_3^2},
              S = max(σ_{AC}, σ_{BC}),
    Subject to  g_1 ≡ max(σ_{AC}, σ_{BC}) ≤ 10^5,                             (8.1)
    where  σ_{AC} = \frac{20 \sqrt{16 + x_3^2}}{x_3 x_1},   σ_{BC} = \frac{80 \sqrt{1 + x_3^2}}{x_3 x_2},
           x_1, x_2 ∈ [0, 0.01],  x_3 ∈ [1.0, 3.0].

This change in the problem description introduces a transition point in the PO front of the problem. Figure 8.1 shows this transition point in the objective space and Figure 8.2 shows it in the variable space.

[Figure 8.1: Transition point encountered in the Truss problem, shown in objective space.]
[Figure 8.2: Transition point encountered in the Truss problem, shown in variable space.]

As is analytically and numerically shown in [10], the location of this transition point in the objective space is (V = 0.04472 m³, S = 8.94426 MPa) and in the variable space is (x_1 = 0.005 m, x_2 = 0.01 m, x_3 = 2.0 m). The reason for such a transition point to appear is as follows. Until variable x_2 reaches its upper bound of 0.01 m, to achieve a solution with a smaller maximum stress S (and a larger volume V) optimally, both cross sections x_1 and x_2 need to be increased linearly in the ratio x_1/x_2 = 0.5 while keeping x_3 = 2.0. Hence Equation (8.2) shows the rules that exist before the transition point is encountered. This region is shown in blue in Figures 8.1 and 8.2.

    x_1/x_2 = 0.5,   x_3 = 2.0 m.    (8.2)

Beyond this critical point (T), since x_2 cannot be increased any further, the only way to reduce the stresses is to increase the length x_3 in a manner that makes the stresses in both members equal. An increase of x_3 increases the length of the members, but decreases the component of the applied load on each member. Thus, a smaller cross-sectional area can be used to withstand the smaller load, causing a smaller developed stress.
Equation (8.3) shows the rules that the variables adhere to in order to reduce the maximum stress (while increasing volume) in an optimal manner. This region is shown in red in Figures 8.1 and 8.2.

    x_2 = 0.01 m,   x_1 = 0.0025 \sqrt{\frac{16 + x_3^2}{1 + x_3^2}} m.    (8.3)

Now, if we attempt to apply the method of rule learning and variable repair discussed in Chapter 7 directly to all non-dominated solutions of a MOO problem using EMO/I, we will learn the wrong rule parameters, because we will be trying to learn a rule using the data of all non-dominated solutions instead of over the individual partitions. As seen in the case of the two bar truss problem, the rules are very different across the transition point shown in Figures 8.1 and 8.2. Furthermore, we will also make poor repairs, waste function evaluations and worsen the algorithm's performance. To avoid this, the following is necessary:

- detect disjoint regions of PO solutions separated by transition points,
- learn rules separately for each of these regions, and
- repair solutions belonging to different regions using the learned rules of the corresponding regions. If no rules of the desired quality can be learned from a region, then do not attempt to repair the solutions of that region.

To apply the above line of thought, it becomes imperative to first detect such regions in the PO solutions that are separated by transition points. This is discussed in the next section.

8.1.1 Identifying Regions Separated by Transition Points (Active Set Partitioning)

Transition points usually appear in the PO fronts of MOO problems for one of two reasons: (a) a constraint becoming active or inactive, or (b) a sudden change in the nature of the objective function(s). The former is the more common cause of transition points and is the one addressed in this work. Recall from Section 8.1 that the transition point encountered in the truss problem arose from meeting the upper bound of variable x_2. The latter is not only less common in practice but also needs a more subtle mathematical treatment to first define what is meant by a "sudden" change in the objective function(s). Although a change in the nature of a function can be defined in terms of its higher order derivatives, this has not been studied in this work.

For every solution in a population of non-dominated solutions, we can find out which constraints are active and which ones are inactive. Then we can partition the solutions into groups based on proximity to different constraint boundaries. If all the transition points in a problem are a result of meeting some constraint bounds and not any other reason, then such a partitioning is all we need to be able to learn the right rules for the solutions of each partition, rather than trying to learn a single rule for all of the solutions.

Consider a MOO problem with n_g inequality constraints {g_1, g_2, ..., g_{n_g}} and n_x continuous variables {x_1, x_2, ..., x_{n_x}}. Without loss of generality, let all the inequality constraints be of the greater-than-or-equal-to type, g_i ≥ g_i^u, with corresponding bounds g_i^u. These inequality constraints can be represented as

    g_i^n = \frac{g_i}{g_i^u} − 1 ≥ 0,    (8.4)

where g_i^n is the normalized constraint value for constraint g_i. Furthermore, let x_i^l and x_i^u, where i ∈ {1, 2, ..., n_x}, represent the lower and upper bounds on the variables.
In a manner similar to Equation (8.4), we can obtain constraints g_i^l and g_i^u, which are given by

    g_i^l = \frac{x_i}{x_i^l} − 1 ≥ 0   and   g_i^u = 1 − \frac{x_i}{x_i^u} ≥ 0,    (8.5)

where g_i^l and g_i^u are the normalized box constraints for variable x_i.

In total, there are n_c = n_g + 2 n_x normalized constraints for this problem. Let c_{tol} be a tolerance value such that if a normalized constraint value lies between 0 and c_{tol}, we call that constraint active; otherwise we call it inactive. Let h_i, i ∈ {1, 2, ..., n_g}, be a boolean variable showing whether the normalized constraint g_i^n is active (1) or not (0). Similarly, let h_j^l and h_j^u, j ∈ {1, 2, ..., n_x}, be boolean variables showing whether the normalized constraints g_j^l and g_j^u are active or not.

With this information, let us look at an example to see how we can partition a set of solutions based on the active/inactive constraints of the solutions. Consider a MOO problem with one inequality constraint g_1 and two variables x_1 and x_2. Based on the aforementioned notation, the normalized constraints for this problem are g_1^n, g_1^l, g_1^u, g_2^l and g_2^u, and the corresponding boolean variables showing whether a constraint is active or not for a solution are h_1^n, h_1^l, h_1^u, h_2^l and h_2^u.

Table 8.1: An example of partitioning of the solution space based on constraint activity.

  Solution ID   h_1^n   h_1^l   h_1^u   h_2^l   h_2^u   Partition ID
  1             1       0       1       0       0       5
  2             1       0       1       0       0       5
  3             1       0       1       0       0       5
  4             1       0       0       1       0       9
  5             1       0       0       1       0       9
  6             0       0       0       1       0       8
  7             0       0       0       1       0       8
  8             0       0       0       1       0       8
  9             0       0       0       0       0       0
  10            0       0       0       0       0       0

Table 8.1 shows an example of ten non-dominated solutions for such a hypothetical MOO problem, together with the constraint activity of these ten solutions through the boolean variables h_1^n, h_1^l, h_1^u, h_2^l and h_2^u. For a solution, a boolean value of 1 means that the corresponding constraint is active; for example, h_1^n = 1 means that the constraint g_1^n is active, otherwise it is inactive. The boolean string of every solution can then be converted to its corresponding decimal value. For example, the boolean string for Solution ID 1 in the table is {1, 0, 1, 0, 0}, and the corresponding decimal value for this string is calculated as

    1 · 2^0 + 0 · 2^1 + 1 · 2^2 + 0 · 2^3 + 0 · 2^4 = 5.

We choose to call this decimal value the partition ID, because it is unique for a given set of active/inactive constraints. The solutions shown in Table 8.1 can thus be partitioned into four groups: three solutions belong to partition ID 5, two to partition ID 9, three to partition ID 8 and two to partition ID 0. The rule learning method discussed in Section 7.1.3 needs some minimum number of solutions to learn meaningful rules; hence, we can additionally require a minimum number of solutions to be present in a partition for it to be considered for rule learning. All this is encapsulated in Algorithm 8.1.
Algorithm 8.1 activeSetPartition()
input:  S (non-dominated set), ctol (constraint tolerance), minPartSize (minimum partition size)
output: P (partition IDs), Pv (partition validity)
 1: P, Pv ← ∅
 2: for k ← 1 to |S| do
 3:   s ← S_k
 4:   h ← getBool(s, ctol)          ▷ h = {h_1, ..., h_{n_g}, h_1^l, ..., h_{n_x}^l, h_1^u, ..., h_{n_x}^u}
 5:   p ← getPartitionId(h)
 6:   P ← P ⌢ p                     ▷ append to set P
 7: [P_u, P_c] ← getUniqueCount(P)
 8: for i ← 1 to |P| do
 9:   p ← P_i
10:   for j ← 1 to |P_u| do
11:     u ← P_u^j
12:     c ← P_c^j
13:     if (u = p) ∧ (c ≥ minPartSize) then
14:       v ← 1
15:     else if (u = p) then
16:       v ← 0
17:   Pv ← Pv ⌢ v

Let us look at a line-wise explanation of this algorithm.

- Input: The algorithm receives the non-dominated solution set S of a MOO problem, the normalized constraint tolerance ctol for a constraint to be considered active/inactive at a solution, and minPartSize, the minimum number of members a partition must have to be considered valid.
- Output: The output contains the partition IDs P and partition validities Pv, both of which have the same size as S. The set P carries the partition ID of each solution of S, and the set Pv carries the validity of the partition corresponding to each solution.
- Lines 1-6: The sets P and Pv are initialized to null sets. For each non-dominated solution s ∈ S, we obtain the boolean string h carrying the information on which constraints are active and which are inactive for s, based on the constraint values and the ctol parameter. This boolean string is then converted into a decimal value p, which we call the partition ID to which solution s belongs, and p is appended to the set P.
- Line 7: From the set of partition IDs obtained, the unique partition IDs along with their respective counts are stored in P_u and P_c, respectively.
- Lines 8-17: For each solution s ∈ S, the validity v of its partition is ascertained based on whether the corresponding partition ID p has at least minPartSize copies. The partition validity value of each solution is appended to the set Pv.

Once such active set partitioning is achieved, both rule learning and variable repair based on those rules are done individually for each partition. We can apply the rule learning and variable repair methods described in Section 7.1 to MOO problems having transition points if we learn rules (Section 7.1.3) and then repair variables (Section 7.1.5) for the solutions of each partition separately.
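A compact realization of Algorithm 8.1 is sketched below, assuming the normalized constraint values of Equations (8.4) and (8.5) are already assembled into a single array per solution; the function name active_set_partition is illustrative.

```python
import numpy as np
from collections import Counter

def active_set_partition(G, ctol=0.01, min_part_size=5):
    """Sketch of Algorithm 8.1. G is an (N, nc) array of normalized
    constraint values for N non-dominated solutions; a constraint is
    active when its normalized value lies in [0, ctol]."""
    ids = []
    for g in G:
        h = (g >= 0) & (g <= ctol)      # boolean activity string
        # decimal partition ID, matching the 2^k encoding of the text
        ids.append(sum(int(b) << k for k, b in enumerate(h)))
    counts = Counter(ids)
    valid = [int(counts[p] >= min_part_size) for p in ids]
    return ids, valid
```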
Let us now look at some results based on this idea.

8.1.2 Results on Problems with Transition Points

In this section, we present results for slightly modified versions of two of the engineering design problems of Section 7.5. In each problem, the modification amounts to the removal of the constraint that earlier allowed us to get PO solutions without any transition points. In each problem:

- EMO/I learns the rules for each active set partition (ASP) separately and then attempts to repair the solutions using the corresponding rules, and
- the repair strategy "Short Rules, No Preference on Variables", or SN from Table 7.2, is used.

In the text, we refer to this strategy of combining the SN repair strategy with active set partitioning as 'SNasp'. We compare SNasp against the repair strategies SN and NI (refer to Table 7.2) in the results. The EMO/I parameters used for the two problems are shown in Table 8.2.

Table 8.2: EMO/I parameters used for solving the modified engineering design problems that contain transition points in the PO front.

  Parameter                          TBT-2       MC-2
  Population size                    52          200
  Max. function evaluations          10,036      100,000
  Prob. of crossover                 0.9         0.9
  Prob. of mutation                  1/3         1/3
  Crossover index                    10          10
  Mutation index                     50          50
  Start of learning (% of max FEs)   20          20
  c_tol                              0.001       0.01
  minPartSize                        5% of pop   5% of pop

8.1.2.1 Modified Two Bar Truss Problem (TBT-2)

This problem is obtained from the problem description given in Equation (7.23) by removing the constraint g_2 ≡ V ≤ 0.044. This leads to the introduction of a transition point in the PO front, as shown in Figures 8.1 and 8.2.

Figure 8.3 shows that the SNasp repair strategy does better than both the SN and NI strategies in terms of GD, whereas, because of the presence of the transition point, the SN repair strategy is unable to do better than the NI strategy. The WRS results in Table 8.3 show that this difference in the performance of SNasp and SN is indeed significant. In the case of IGD, both SN and SNasp perform worse than NI. This is understandable, as the repair strategies are aimed at improving convergence to the PO front, not convergence and diversity at the same time.

[Figure 8.3: Median GD and IGD results for TBT-2 problem over 30 runs. (a) GD results; (b) IGD results.]

Table 8.3: Results of Wilcoxon rank sum test for GD of TBT-2 problem comparing EMO/I repair strategies NI, SN and SNasp at 10,036 function evaluations and 5% significance level.

  Hypothesis     H_{SNasp≺NI}     H_{SNasp≺SN}
  GD    h        Yes              Yes
        p        2.26 × 10^{-4}   0.047

8.1.2.2 Modified Metal Cutting Problem (MC-2)

This problem is obtained from the problem description given in Equation (7.24) by removing the constraint g_4 ≡ T_p(x) ≤ 0.9545. This leads to the introduction of a transition point in the PO front, as shown in Figures 8.4 and 8.5. The transition point divides the PO solutions into two regions, shown in red and blue; the solutions of the two regions adhere to different sets of rules, as shown in Figure 8.5.

[Figure 8.4: Transition point encountered in the Metal Cutting problem, shown in objective space.]
[Figure 8.5: Transition point encountered in the Metal Cutting problem, shown in variable space.]

Figure 8.6 shows that the SNasp repair strategy does better than both the SN and NI strategies in terms of GD, whereas, because of the presence of the transition point, the SN repair strategy is not able to do better than the NI strategy. The WRS results in Table 8.4 show that this difference in the performance of SNasp and SN is indeed significant. Although Figure 8.6a suggests that strategy SN performs better than NI in terms of GD, the WRS results in Table 8.4 do not back this observation, which means the difference is not statistically significant. In the case of IGD, although Figure 8.6b suggests relative differences between the SNasp, SN and NI repair strategies, the WRS results in Table 8.4 do not back this observation either, meaning that in terms of IGD the apparent difference between the three strategies is statistically insignificant.

[Figure 8.6: Median GD and IGD results for MC-2 problem over 30 runs. (a) GD results; (b) IGD results.]

Table 8.4: Results of Wilcoxon rank sum test for GD and IGD of MC-2 problem at 100,000 function evaluations and 5% significance level.

  Hypothesis     H_{SNasp≺NI}   H_{SNasp≺SN}   H_{SN≺NI}
  GD    h        Yes            Yes            No
        p        0.0082         0.0306         -
  IGD   h        No             No             No
        p        -              -              -

Now that we have studied one way of handling MOO problems with transition points, let us look at another idea: learning power laws that involve functions of the design variables as basis functions, and then repairing variables based on those rules.
8.2 Power Law Rules Involving Functions of Variables

In Section 7.1, we looked at ways of learning fixed form rules involving only variables from non-dominated solutions and then repairing the non-dominated solutions based on those learned rules. Whether we made a repair based on a constant rule (see Section 7.1.5.1) or based on a power law rule (see Section 7.1.5.2), making the repair was straightforward because the functions being learned were explicit in terms of all their constituent variables. If the learned rule is an implicit function of the variable that needs to be changed for solution repair, then this task may not be so straightforward; for example, if we learn a fixed form rule based on the problem's objectives or constraint functions, or if we use the CGP from Chapter 4 or Chapter 6 to find an algebraic expression involving problem variables that fits the data of the non-dominated solutions but is implicit in terms of the variable to be repaired.

Consider a bi-objective optimization problem with n_x variables, given in Equation (8.6):

    Minimize  f_1(x), f_2(x),
    Subject to  g_i(x) ≤ 0  ∀ i ∈ {1, 2, ..., n_g},
    where  x = {x_1, x_2, ..., x_{n_x}},                          (8.6)
           x_j ∈ [x_j^l, x_j^u]  ∀ j ∈ {1, 2, ..., n_x}, and
           x_j ∈ R  ∀ j.

If we use f_1 and f_2 as our basis functions for learning power laws, we may learn functions of the form

    f_1 · f_2^b = c    (8.7)

from the data of the non-dominated solutions. Estimating the parameters b and c is the same as explained in Section 7.1.3.2. However, for repairing variables, Equation (8.7) cannot be used in the same way a power law among variables was used in Section 7.1.5.2. This is because the left hand side of Equation (8.7) may no longer be assumed to be an explicit expression in terms of the problem variables. In this section, we discuss a way of using a rule that is implicitly defined in terms of the variables to make variable repairs.

Recall from Section 3.1 that in a MOO problem, the basis functions can be the design variables or functions thereof, including the objectives and the constraints. Let us re-write Equation (8.7) as

    f(x) = f(x_1, x_2, ..., x_{n_x}) = c,    (8.8)

where f(x) = f_1(x) · (f_2(x))^b. Since the data of the non-dominated solutions is used to arrive at this equation, it is in effect an implicit equation (in terms of the problem variables) that describes the dominance relation. Without loss of generality, let us say that we wish to repair variable x_r in an individual I of the non-dominated set. Let the individual I be represented as

    x_I = {a_1, a_2, ..., a_r, ..., a_{n_x}}    (8.9)

in the variable space. The candidate individual for repair, say I+, can then be represented as

    x_{I+} = {a_1, a_2, ..., x_r, ..., a_{n_x}},    (8.10)

because only variable x_r of individual I is under consideration for repair. The problem of repairing variable x_r can then be turned into a one-variable minimization problem:

    Minimize  | f(x_{I+}) − c |,
    Subject to  x_r ∈ [x_r^{l+}, x_r^{u+}].    (8.11)

Note that the bounds on x_r in Equation (8.11) are not the original bounds of variable x_r in Equation (8.6). The bounds [x_r^{l+}, x_r^{u+}] are different from [x_r^l, x_r^u]; they represent the updated bounds of variable x_r derived from the non-dominated set utilized in learning the rule of Equation (8.7). This helps in efficiently reducing the search space.

The single-variable bounded minimization problem of Equation (8.11) can be solved using an inexpensive solver such as the Golden Section method [111], with x_r = a_r available as a starting point for the algorithm. Such a method is a computationally inexpensive way of improving the performance of a non-dominated solution.
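A sketch of this repair is given below. SciPy's bounded scalar minimizer is used here in place of a hand-written golden section search, with the same iteration and tolerance limits as used later in Section 8.2.1; the rule argument is any callable implementing f(x) of Equation (8.8), and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def repair_implicit(x, r, rule, c, lo, hi):
    """Solve the one-variable problem of Equation (8.11): find x_r in
    the updated bounds [lo, hi] minimizing |f(x+) - c|, where `rule`
    evaluates the learned implicit rule f at a full variable vector."""
    def err(xr):
        xp = np.array(x, dtype=float)
        xp[r] = xr                     # only variable x_r is changed
        return abs(rule(xp) - c)
    res = minimize_scalar(err, bounds=(lo, hi), method='bounded',
                          options={'maxiter': 100, 'xatol': 1e-6})
    return res.x
```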
It is true that the repaired individual I+ may turn out to be infeasible, because the problem constraints are not considered in Equation (8.11), but we expect this to happen rarely, since the rule is learned from a set of non-dominated solutions in the first place. In the next section, we see how this method performs on a simple engineering design problem.

8.2.1 Results on Truss Problem

To test the method described in the previous section, we chose the two bar truss problem given by Equation (7.23). This is the problem without any transition points in the PO front. Furthermore, from [47] we know that the following relation exists between the two objectives in the PO solutions:

    V · S = 400.    (8.12)

To avoid wrongly attributing improvements in convergence to repairs based on rules involving variables only, which we know from Section 7.6.1 to work, we decided to use only the objectives as basis functions for these results. Again, the rule repair strategy SN (refer to Table 7.2) was used; only this time the learned rules involved objective functions only. Once an individual was chosen for repair, we used the Golden Section method to repair one of the chosen variables by solving the minimization problem given by Equation (8.11). Every golden section search was limited to a maximum of 100 iterations or a convergence tolerance of 10^{-6}, whichever was reached earlier. We call this repair strategy 'SNobj', i.e. learning and repairing based on rules involving objective functions only, combined with the SN repair strategy. Table 8.5 shows the other EMO/I parameters used in this work.

Table 8.5: EMO/I parameters used for solving the truss problem when only rules involving objective functions are learned.

  Parameter                          Truss problem
  Population size                    92
  Max. function evaluations          25,024
  Prob. of crossover                 0.9
  Prob. of mutation                  1/3
  Crossover index                    10
  Mutation index                     50
  Start of learning (% of max FEs)   5

Table 8.6: Results of Wilcoxon rank sum test for GD of the truss problem comparing EMO/I repair strategies NI and SNobj at 25,024 function evaluations and 5% significance level.

  Hypothesis     H_{SNobj≺NI}
  GD    h        Yes
        p        4.5617 × 10^{-13}

Figure 8.7a shows that this strategy of learning only rules involving objective functions and then making repairs based on the golden section method indeed helps the algorithm converge faster to the PO front. This is backed by the WRS test results shown in Table 8.6. In terms of IGD, there is not much difference between the repair strategies SNobj and NI.

[Figure 8.7: Median GD and IGD results for the Truss problem over 30 runs when only rules involving objective functions are learned. (a) GD results; (b) IGD results.]

8.3 Concluding Remarks

In this chapter, we looked at a way of handling transition points in the PO front using the active set partitioning method. Towards the end, we also discussed how we can learn, and make solution repairs based on, rules that are implicitly defined in terms of the repaired variable, using an inexpensive Golden Section method.
Chapter 9

Conclusion and Future Studies

This thesis identified and developed methods that can help with extracting human-interpretable knowledge from EMO algorithms while solving a MOO problem, and with using the same knowledge to repair the EMO solutions to expedite the algorithm.

The first half of this dissertation begins by showing a couple of fixed form rules that are prevalent in MOO engineering design problems and ways in which we can learn them on the fly during an EMO run more efficiently than the approach used in [11]. We then developed a customized GP that can be used for symbolically regressing rules from data as well as for identifying the decision boundary in binary classification problems, with the learned rules constrained to be in the form of mathematical algebraic expressions involving the regressands (features in the case of classification) to maintain easy interpretability of the rules. Here we use the following definition of the interpretability of rules/models: "how consistently can a human predict the results of the models" [112]. The use of a bi-objective optimization approach (minimization of classification error and minimization of rule complexity) has enabled us to find not one but multiple rule structures exhibiting a trade-off between the two objectives. This allows a user to analyze a number of alternative trade-off rule structures before choosing one or more of them for practice. In the case of the industry problem, we used basic mathematical operations such as integration, differentiation and the Fourier transform to arrive at a set of features from time series manufacturing data. This is very different from the standard painstaking method of handcrafting features, where each feature is created using an expert's knowledge [113]. Our dimensional awareness procedure has been able to convert some of the obtained trade-off rules into dimensionally meaningful rules by using relevant problem constants. This technology allows a user to have a much better understanding of, and insight into, the underlying process producing the data. The dimensional analysis, along with searching for a knee solution, provides the user with an effective decision support system when choosing from a PO set of regressors or classifiers obtained as a result of CGP. Both of these ideas can also be automated when rule learning is followed within an EMO by rule-based repair.

The second half of this dissertation developed methods of expediting convergence in EMO algorithms by using the aforementioned learned rules for making direct variable repairs during an EMO algorithm run. Unlike [71, 72], which learn rules in the form of decision trees and disjunctive normal form logical rules and add them as constraints to guide the algorithm, we learn rules in the form of algebraic expressions that can be used for direct variable repair to expedite the algorithm. A total of nine variable repair strategies for rule and variable selection were tested on test problems and engineering design problems. The results showed that, when faced with multiple qualifying rules to choose from, we should choose the shorter rules for making repairs. We did not see any effect of choosing the variable (for repair) based on its frequency in the qualifying rule set. This was followed by developing a method to learn rules and repair solutions in the presence of transition points in the PO front.
Furthermore, we also proposed a computationally inexpensive method of repairing variables when the learned rule is implicitly defined in terms of the variable to be repaired.

9.1 Contributions of this Thesis

The key contributions of this thesis, presented in their order of appearance, can be summarized as follows:

1. We developed a custom GP for the symbolic regression task that can learn rules from data in the form of algebraic expressions. This custom GP implements many interesting ideas from the literature, such as a bi-objective formulation to control bloat, multiple small expression trees instead of a single expression tree to make genetic operations on expression trees more efficient, and weight learning using a fast classical method such as OLSR to learn the constants in a rule much more efficiently than regular GPs. We also developed a custom diversity preserving mechanism in GP that penalizes duplicate solutions in the population while maintaining the relative non-domination rank order between penalized duplicates belonging to different non-domination ranks.

2. We developed a custom dimensional inconsistency penalty metric that can calculate the level of dimensional inconsistency in an algebraic expression. This metric is very helpful in identifying rules in the PO set learned by CGP that are not dimensionally meaningful.

3. We developed a custom GP for the classification task that can learn the decision boundary between two classes as an algebraic expression of the problem features. This custom GP performed well on binary class data from industry. The CGP could provide multiple PO rules with varying trade-offs between classification error and rule complexity. It could also select the six most important features out of more than fifty to build the classifiers. These important features were in line with the understanding of the process that our industry partners had obtained by conducting physical experiments and analytic studies of the same process in a controlled environment. Furthermore, the knowledge of the dimensional consistency of these classifiers was found to be beneficial for planning future experimental studies to understand the physics of the process.

4. We developed the EMO/I framework that combines the idea of innovization with that of an EMO algorithm. We evaluated various repair strategies for repairing solutions based on multiple rules in the form of algebraic expressions. Results on test problems and engineering design problems show that choosing the smaller of the qualifying rules first for repair is more beneficial for faster convergence of the EMO.

5. We developed a custom methodology, active set partitioning, to address the issue of learning rules and repairing solutions in the presence of transition points in the PO front. Furthermore, a computationally efficient method of repairing solutions when a rule is defined implicitly in terms of the variables is also suggested, and has been shown to work on an engineering design problem.

9.2 Future Studies

This study has opened doors to many exciting ideas and questions to be explored next. A few of them are mentioned below.

9.2.1 GP with In-Tandem Dimensional Consistency Check

In Chapters 4 to 6, we have shown that a dimensional consistency check on the final PO solutions of CGP is a powerful tool for shortlisting dimensionally meaningful rules. Instead of using this check at the end of CGP, i.e. serially, it may be beneficial to use it during the run of CGP.
A few ways to use this dimensional inconsistency penalty are:

- modifying the tree crossover operator in CGP to avoid crossover operations between trees whose dimensional inconsistency penalties are very different, and similarly, making sure not to mutate trees that have zero dimensional inconsistency penalty;
- using the dimensional inconsistency penalty value in the survivor selection operation of the CGP instead of the currently used crowded tournament selection operator [31].

9.2.2 Non-linear Decision Trees

Figure 9.1 shows the idea of combining CGP with a decision tree based classifier to capture complex and even disconnected decision boundaries in the feature space. The figure shows an arbitrary binary classification problem where the green patches represent the Go class data and the red patches represent the NoGo class data.

[Figure 9.1: The idea of combining CGP with a Decision Tree classification algorithm; the node functions f, g and h are functions of the features or decision variables, each obtained using one complete run of DAGP.]

If one uses only a decision tree based classifier on this problem, the decision tree obtained will be very deep, as a decision tree tries to partition the feature space parallel to the coordinate axes representing the features. However, when combined with CGP, such that each decision node of the tree is obtained using a CGP run minimizing node impurity, the final tree will be smaller, with each node of the tree being an algebraic expression of the original features. Such a method can be utilized to derive a classifier that is both effective and interpretable.

9.2.3 Experimenting with Different Definitions of Rule Complexity

The complexity aspect of a learned model can be defined in many ways [94]. In CGP, we defined the complexity of a CGP individual simply as the number of nodes in the corresponding expression tree. It can also be calculated differently, by assigning the various operators in the function set of CGP different complexities based on some a-priori hierarchy; for example, the set of operations {+, −, ÷, ×} can be assigned a lower complexity value than, say, the {√, exp(), log()} functions, as sketched below. These hierarchies can be based on the user's preference and knowledge, to aid the CGP in finding the appropriate rules from data.
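As an illustration, such an operator-weighted complexity measure is sketched below. The Node structure and the particular weights are hypothetical choices, not values from this work; with all weights equal to 1 the measure reduces to the node count used in CGP.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str                               # operator name, or a leaf symbol
    children: List['Node'] = field(default_factory=list)

# illustrative user-chosen hierarchy: transcendental operators cost more
WEIGHTS = {'+': 1, '-': 1, '*': 1, '/': 1, 'sqrt': 2, 'exp': 2, 'log': 2}

def complexity(node):
    """Sum of per-operator weights over the expression tree; leaves
    (variables or constants) count 1."""
    if not node.children:
        return 1
    w = WEIGHTS.get(node.op, 1)
    return w + sum(complexity(ch) for ch in node.children)
```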
Finally, we should apply the methods developed in this thesis to more real-world problems to make them more generally applicable. The ideas developed in this thesis can be very useful for practitioners in MOO as well as in machine learning. I hope that this thesis contributes, in however small a way, towards bridging the gap between human expertise in optimal design and its digital analog.

BIBLIOGRAPHY

[1] Siemens-PLM-HEEDS, “We help you discover better designs, faster,” https://www.plm.automation.siemens.com/global/en/products/simcenter/simcenter-heeds.html, 2004. [Online].
[2] Esteco-modeFrontier, “Process automation and optimization in the engineering design process,” https://www.esteco.com/modefrontier, 1998. [Online].
[3] Dassault-iSight, “Automate design exploration and optimization,” https://www.3ds.com/products-services/simulia/products/isight-simulia-execution-engine/, 1996. [Online].
[4] Noesis-Optimus, “The industry-leading PIDO software platform,” https://www.noesissolutions.com/our-products/optimus/, 2008. [Online].
[5] Dynardo-optiSLang, “Robust design optimization (RDO) in virtual product development,” https://www.dynardo.de/en/software/optislang.html, 2002. [Online].
[6] C. A. C. Coello, G. B. Lamont, D. A. Van Veldhuizen et al., Evolutionary Algorithms for Solving Multi-Objective Problems. Springer, 2007, vol. 5.
[7] H. J. Levesque, “Knowledge representation and reasoning,” Annual Review of Computer Science, vol. 1, no. 1, pp. 255–287, 1986.
[8] R. Davis, H. Shrobe, and P. Szolovits, “What is a knowledge representation?” AI Magazine, vol. 14, no. 1, pp. 17–33, 1993.
[9] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, 2003.
[10] K. Deb and A. Srinivasan, “Innovization: Innovating design principles through optimization,” in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 1629–1636.
[11] S. Bandaru and K. Deb, “Automated discovery of vital knowledge from Pareto-optimal solutions: First results from engineering design,” in 2010 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2010, pp. 1–8.
[12] S. Bandaru and K. Deb, “Higher and lower-level knowledge discovery from Pareto-optimal sets,” Journal of Global Optimization, vol. 57, no. 2, pp. 281–298, 2013.
[13] A. Ng, K. Deb, and C. Dudas, “Simulation-based innovization for production systems improvement: An industrial case study,” in Proceedings of the 3rd International Swedish Production Symposium (SPS’09), Göteborg, Sweden, 2–3 December 2009. The Swedish Production Academy, 2009, pp. 278–286.
[14] S. Bandaru, A. H. Ng, and K. Deb, “Data mining methods for knowledge discovery in multi-objective optimization: Part A: Survey,” Expert Systems with Applications, vol. 70, pp. 139–159, 2017.
[15] I. Nonaka and H. Takeuchi, The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press, 1995.
[16] A. Inselberg, “The plane with parallel coordinates,” The Visual Computer, vol. 1, no. 2, pp. 69–91, 1985.
[17] A. M. Geoffrion, J. S. Dyer, and A. Feinberg, “An interactive approach for multi-criterion optimization, with an application to the operation of an academic department,” Management Science, vol. 19, no. 4, part 1, pp. 357–368, 1972.
[18] K. J. Chichakly and M. J. Eppstein, “Discovering design principles from dominated solutions,” IEEE Access, vol. 1, pp. 275–289, 2013.
[19] S. Obayashi and D. Sasaki, “Visualization and data mining of Pareto solutions using self-organizing map,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2003, pp. 796–809.
[20] A. H. Ng, C. Dudas, J. Nießen, and K. Deb, “Simulation-based innovization using data mining for production systems analysis,” in Multi-objective Evolutionary Optimisation for Product Design and Manufacturing. Springer, 2011, pp. 401–429.
[21] T. W. Simpson, J. Poplinski, P. N. Koch, and J. K. Allen, “Metamodels for computer-based engineering design: Survey and recommendations,” Engineering with Computers, vol. 17, no. 2, pp. 129–150, 2001.
[22] B. Kim, R. Khanna, and O. O. Koyejo, “Examples are not enough, learn to criticize! Criticism for interpretability,” in Advances in Neural Information Processing Systems, 2016, pp. 2280–2288.
[23] M. E. Newman, “Power laws, Pareto distributions and Zipf’s law,” Contemporary Physics, vol. 46, no. 5, pp. 323–351, 2005.
[24] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural network architectures and their applications,” Neurocomputing, vol. 234, pp. 11–26, 2017.
[25] C. Molnar, Interpretable Machine Learning, 2019. Accessed: 2019-04-16.
[26] P. W. Bridgman, Dimensional Analysis. Yale University Press, 1922.
[27] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone, Genetic Programming: An Introduction. San Francisco: Morgan Kaufmann, 1998, vol. 1.
[28] C. Myburgh and K. Deb, “Derived heuristics-based consistent optimization of material flow in a gold processing plant,” Engineering Optimization, vol. 50, no. 1, pp. 1–18, 2018.
[29] K. Deb and R. Datta, “Hybrid evolutionary multi-objective optimization and analysis of machining operations,” Engineering Optimization, vol. 44, no. 6, pp. 685–706, 2012.
[30] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis. John Wiley & Sons, 2012, vol. 821.
[31] K. Deb, Multi-objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, 2001, vol. 16.
[32] J. Nocedal and S. Wright, Numerical Optimization. Springer Science & Business Media, 2006.
[33] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Boston, MA: Addison-Wesley Longman, 1989.
[34] H.-P. Schwefel, Evolution and Optimum Seeking: The Sixth Generation. John Wiley & Sons, 1993.
[35] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992, vol. 1.
[36] C. A. C. Coello, D. A. Van Veldhuizen, and G. Lamont, Evolutionary Algorithms for Solving Multi-Objective Problems. Boston, MA: Kluwer, 2002.
[37] S. M. Sait, H. Youssef, and J. A. Khan, “Fuzzy evolutionary algorithm for VLSI placement,” in Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation. Morgan Kaufmann, 2001, pp. 1056–1063.
[38] J. E. Rodríguez, A. L. Medaglia, and J. P. Casas, “Approximation to the optimum design of a motorcycle frame using finite element analysis and evolutionary algorithms,” in 2005 IEEE Design Symposium, Systems and Information Engineering. IEEE, 2005, pp. 277–285.
[39] Y.-J. Kim and J. Ghaboussi, “A new genetic algorithm based control method using state space reconstruction,” in Proceedings of the Second World Conference on Structural Control, 1998, pp. 2007–2014.
[40] K. Harada, K. Ikeda, and S. Kobayashi, “Hybridization of genetic algorithm and local search in multiobjective function optimization: Recommendation of GA then LS,” in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 667–674.
[41] S. P. Harris and E. C. Ifeachor, “Nonlinear FIR filter design by genetic algorithm,” in Proceedings of the First Online Workshop on Soft Computing (WSC1), 1996, pp. 216–221.
[42] F. Y. Cheng and D. Li, “Multiobjective optimization design with Pareto genetic algorithm,” Journal of Structural Engineering, vol. 123, no. 9, pp. 1252–1261, 1997.
[43] M. S. Bright, “Evolutionary strategies for the high-level synthesis of VLSI-based DSP systems for low power,” Ph.D. dissertation, University of Wales, Cardiff, 1998.
[44] A. Belegundu, E. Constans, R. Salagame, and D. Murthy, “Multi-objective optimization of laminated ceramic composites using genetic algorithms,” in 5th Symposium on Multidisciplinary Analysis and Optimization, 1994, p. 4363.
[45] K. Deb, S. Bandaru, D. Greiner, A. Gaspar-Cunha, and C. C. Tutum, “An integrated approach to automated innovization for discovering useful design principles: Case studies from engineering,” Applied Soft Computing, vol. 15, pp. 42–56, 2014.
[46] M. W. Mkaouer, M. Kessentini, S. Bechikh, K. Deb, and M. Ó Cinnéide, “Recommendation system for software refactoring using innovization and interactive dynamic optimization,” in Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. ACM, 2014, pp. 331–336.
[47] K. Deb and A. Srinivasan, “Innovization: Discovery of innovative design principles through multiobjective evolutionary optimization,” in Multiobjective Problem Solving from Nature. Springer, 2008, pp. 243–262.
[48] S. Bandaru, C. C. Tutum, K. Deb, and J. H. Hattel, “Higher-level innovization: A case study from friction stir welding process optimization,” in 2011 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2011, pp. 2782–2789.
[49] S. Bandaru, “Automated innovization: Knowledge discovery through multi-objective optimization,” Ph.D. dissertation, Dept. of Mechanical Engineering, Indian Institute of Technology Kanpur, India, 2012.
[50] K. Deb, S. Gupta, D. Daum, J. Branke, A. K. Mall, and D. Padmanabhan, “Reliability-based optimization using evolutionary algorithms,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 5, pp. 1054–1074, 2009.
[51] S. Bandaru and K. Deb, “Temporal innovization: Evolution of design principles using multi-objective optimization,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2015, pp. 79–93.
[52] A. H. Ng, S. Bandaru, and M. Frantzén, “Innovative design and analysis of production systems by multi-objective optimization and data mining,” Procedia CIRP, vol. 50, pp. 665–671, 2016.
[53] B. Meyer and K. Sugiyama, “The concept of knowledge in KM: A dimensional model,” Journal of Knowledge Management, vol. 11, no. 1, pp. 17–35, 2007.
[54] P. Hoffman, G. Grinstein, K. Marx, I. Grosse, and E. Stanley, “DNA visual and analytic data mining,” in Proceedings of Visualization ’97 (Cat. No. 97CB36155). IEEE, 1997, pp. 437–441.
[55] I. Hatzilygeroudis and J. Prentzas, “Integrating (rules, neural networks) and cases for knowledge representation and reasoning in expert systems,” Expert Systems with Applications, vol. 27, no. 1, pp. 63–75, 2004.
[56] J. Shawe-Taylor and N. Cristianini, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[57] M. H. Hassoun, Fundamentals of Artificial Neural Networks. Cambridge, MA: MIT Press, 1995.
[58] G. Hinton and T. J. Sejnowski, Unsupervised Learning: Foundations of Neural Computation, 1st ed. MIT Press, 1999.
[59] N. Hitomi, H. Bang, and D. Selva, “Extracting and applying knowledge with adaptive knowledge-driven optimization to architect an earth observing satellite system,” in AIAA Information Systems - AIAA Infotech @ Aerospace, 2017, p. 0794.
[60] B. Liu, W. Hsu, and Y. Ma, “Integrating classification and association rule mining,” in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998, pp. 24–25.
[61] S. Bandaru and K. Deb, “Towards automating the discovery of certain innovative design principles through a clustering-based optimization technique,” Engineering Optimization, vol. 43, no. 9, pp. 911–941, 2011.
[62] A. H. Ng, C. Dudas, J. Nießen, and K. Deb, “Simulation-based innovization using data mining for production systems analysis,” in Multi-objective Evolutionary Optimisation for Product Design and Manufacturing. Springer, 2011, pp. 401–429.
[63] N. R. Draper and H. Smith, Applied Regression Analysis. John Wiley & Sons, 2014, vol. 326.
[64] Y. Bernstein, X. Li, V. Ciesielski, and A. Song, “Multiobjective parsimony enforcement for superior generalisation performance,” in Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat. No. 04TH8753), vol. 1. IEEE, 2004, pp. 83–89.
[65] E. D. De Jong and J. B. Pollack, “Multi-objective methods for tree size control,” Genetic Programming and Evolvable Machines, vol. 4, no. 3, pp. 211–233, 2003.
[66] H. Iba, “Bagging, boosting, and bloating in genetic programming,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 2. Morgan Kaufmann, 1999, pp. 1053–1060.
[67] DataRobot LLC, “Eureqa: The A.I.-powered modeling engine by DataRobot,” 2018. [Online]. Available: https://www.nutonian.com/products/eureqa/
[68] C. W. Ahn and R. S. Ramakrishna, “On the scalability of real-coded Bayesian optimization algorithm,” IEEE Transactions on Evolutionary Computation, vol. 12, no. 3, pp. 307–322, 2008.
[69] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz, “BOA: The Bayesian optimization algorithm,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 1. Morgan Kaufmann, 1999, pp. 525–532.
[70] A. Zhou, Q. Zhang, and Y. Jin, “Approximating the set of Pareto-optimal solutions in both the decision and objective spaces by an estimation of distribution algorithm,” IEEE Transactions on Evolutionary Computation, vol. 13, no. 5, pp. 1167–1189, 2009.
[71] A. H. Ng, C. Dudas, H. Boström, and K. Deb, “Interleaving innovization with evolutionary multi-objective optimization in production system simulation for faster convergence,” in Learning and Intelligent Optimization. Springer, 2013, pp. 1–18.
[72] R. S. Michalski, G. Cervone, and K. Kaufman, “Speeding up evolution through learning: LEM,” in Intelligent Information Systems. Springer, 2000, pp. 243–256.
[73] M. Keijzer and V. Babovic, “Dimensionally aware genetic programming,” in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation - Volume 2. Morgan Kaufmann, 1999, pp. 1069–1076.
[74] A. Ratle and M. Sebag, “Genetic programming and domain knowledge: Beyond the limitations of grammar-guided machine discovery,” in Parallel Problem Solving from Nature PPSN VI. Springer, 2000, pp. 211–220.
[75] R. I. McKay, N. X. Hoai, P. A. Whigham, Y. Shan, and M. O’Neill, “Grammar-based genetic programming: A survey,” Genetic Programming and Evolvable Machines, vol. 11, no. 3–4, pp. 365–396, 2010.
[76] D. J. Montana, “Strongly typed genetic programming,” Evolutionary Computation, vol. 3, no. 2, pp. 199–230, 1995.
[77] M. Keijzer and V. Babovic, “Declarative and preferential bias in GP-based scientific discovery,” Genetic Programming and Evolvable Machines, vol. 3, no. 1, pp. 41–79, 2002.
[78] A. Ashour, L. Alvarez, and V. Toropov, “Empirical modelling of shear strength of RC deep beams by genetic programming,” Computers & Structures, vol. 81, no. 5, pp. 331–338, 2003.
[79] S. Bandaru and K. Deb, “A dimensionally-aware genetic programming architecture for automated innovization,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2013, pp. 513–527.
[80] N. Padhye and K. Deb, “Multi-objective optimisation and multi-criteria decision making in SLS using evolutionary approaches,” Rapid Prototyping Journal, vol. 17, no. 6, pp. 458–478, 2011.
[81] K. Deb and K. Sindhya, “Deciphering innovative principles for optimal electric brushless DC permanent magnet motor design,” in 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence). IEEE, 2008, pp. 2283–2290.
[82] G. A. Seber and A. J. Lee, Linear Regression Analysis. John Wiley & Sons, 2012, vol. 329.
[83] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping Multidimensional Data. Springer, 2006, pp. 25–71.
[84] T. S. Jaakkola and D. Haussler, “Probabilistic kernel regression models,” in AISTATS, 1999.
[85] J. R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA: MIT Press, 1994.
[86] M. F. Brameier and W. Banzhaf, Linear Genetic Programming. Springer Science & Business Media, 2007.
[87] S. Silva and J. Almeida, “GPLAB - a genetic programming toolbox for MATLAB,” in Proceedings of the Nordic MATLAB Conference, 2003, pp. 273–278.
[88] M. Schmidt and H. Lipson, “Distilling free-form natural laws from experimental data,” Science, vol. 324, no. 5923, pp. 81–85, 2009.
[89] P. J. Angeline, “Subtree crossover: Building block engine or macromutation,” Genetic Programming, vol. 97, pp. 9–17, 1997.
[90] Evolved Analytics, “DataModeler,” 2018. [Online]. Available: http://www.evolved-analytics.com/?q=about
[91] M. Kuhn and K. Johnson, Applied Predictive Modeling. Springer, 2013, vol. 26.
[92] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[93] S. Bleuler, J. Bader, and E. Zitzler, “Reducing bloat in GP with multiple objectives,” in Multiobjective Problem Solving from Nature. Springer, 2008, pp. 177–200.
[94] B.-T. Zhang and H. Mühlenbein, “Balancing accuracy and parsimony in genetic programming,” Evolutionary Computation, vol. 3, no. 1, pp. 17–38, 1995.
[95] D. Searson, M. Willis, and G. Montague, “Co-evolution of non-linear PLS model components,” Journal of Chemometrics, vol. 21, no. 12, pp. 592–603, 2007.
[96] J. Doe, “Height, depth and level of a tree,” 2014. [Online]. Available: http://typeocaml.com/2014/11/26/height-depth-and-level-of-a-tree/
[97] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman, Applied Linear Statistical Models. Chicago: Irwin, 1996, vol. 4.
[98] K. Deb and S. Gupta, “Understanding knee points in bicriteria problems and their implications as preferred solution principles,” Engineering Optimization, vol. 43, no. 11, pp. 1175–1204, 2011.
[99] B. R. Munson, T. H. Okiishi, A. P. Rothmayer, and W. W. Huebsch, Fundamentals of Fluid Mechanics. John Wiley & Sons, 2014.
[100] J. E. Shigley, C. R. Mischke, and R. G. Budynas, Mechanical Engineering Design, International ed. Singapore: McGraw-Hill, 1989.
[101] D. Chakraborty, S. Soni, J. Wei, N. Kovvali, A. Papandreou-Suppappola, D. Cochran, and A. Chattopadhyay, “Physics based modeling for time-frequency damage classification,” in Modeling, Signal Processing, and Control for Smart Structures 2008, vol. 6926. International Society for Optics and Photonics, 2008, p. 69260M.
[102] J. Branke, K. Deb, H. Dierolf, and M. Osswald, “Finding knees in multi-objective optimization,” in International Conference on Parallel Problem Solving from Nature. Springer, 2004, pp. 722–731.
[103] G. Strang, Introduction to Linear Algebra. Wellesley, MA: Wellesley-Cambridge Press, 1993, vol. 3.
[104] G. Casella and R. L. Berger, Statistical Inference. Pacific Grove, CA: Duxbury, 2002, vol. 2.
[105] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[106] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 2008, pp. 1322–1328.
[107] E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evolutionary Computation, vol. 8, no. 2, pp. 173–195, 2000.
[108] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[109] K. Miettinen, Nonlinear Multiobjective Optimization. Springer Science & Business Media, 2012, vol. 12.
[110] R. Q. Sardiñas, M. R. Santana, and E. A. Brindis, “Genetic algorithm-based multi-objective optimization of cutting parameters in turning processes,” Engineering Applications of Artificial Intelligence, vol. 19, no. 2, pp. 127–133, 2006.
[111] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, “Golden section search in one dimension,” in Numerical Recipes in C: The Art of Scientific Computing, 1992.
[112] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” arXiv preprint arXiv:1702.08608, 2017.
[113] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics. Springer, 2009.