INTERPRETABLE ARTIFICIAL INTELLIGENCE USING NONLINEAR DECISION TREES

By

Yashesh Deepakkumar Dhebar

A DISSERTATION

Michigan State University

in partial fulﬁllment of the requirements

Submitted to

for the degree of

Mechanical Engineering – Doctor of Philosophy

2020

ABSTRACT

INTERPRETABLE ARTIFICIAL INTELLIGENCE USING NONLINEAR DECISION TREES

By

Yashesh Deepakkumar Dhebar

The recent times have observed a massive application of artiﬁcial intelligence (AI) to automate
tasks across various domains. The back-end mechanism with which automation occurs is generally
black-box. Some of the popular black-box AI methods used to solve an automation task include
decision trees (DT), support vector machines (SVM), artiﬁcial neural networks (ANN), etc. In the
past several years, these black-box AI methods have shown promising performance and have been
widely applied and researched across industries and academia. While the black-box AI models have
been shown to achieve high performance, the inherent mechanism with which a decision is made
is hard to comprehend. This lack of interpretability and transparency of black-box AI methods
makes them less trustworthy. In addition to this, the black-box AI models lack in their ability to
provide valuable insights regarding the task at hand. Following these limitations of black-box AI
models, a natural research direction of developing interpretable and explainable AI models has
emerged and has gained an active attention in the machine learning and AI community in the past
three years. In this dissertation, we will be focusing on interpretable AI solutions which are being
currently developed at the Computational Optimization and Innovation Laboratory (COIN Lab)
at Michigan State University. We propose a nonlinear decision tree (NLDT) based framework to
produce transparent AI solutions for automation tasks related to classiﬁcation and control. The
recent advancement in non-linear optimization enables us to eﬃciently derive interpretable AI
solutions for various automation tasks. The interpretable and transparent AI models induced using
customized optimization techniques show similar or better performance as compared to complex
black-box AI models across most of the benchmarks. The results are promising and provide
directions to launch future studies in developing eﬃcient transparent AI models.

Copyright by
YASHESH DEEPAKKUMAR DHEBAR
2020

Dedicated to my parents and elder brother Paurav.

iv

ACKNOWLEDGEMENTS

The PhD journey for me has been really fulﬁlling and transformational and I would be like to
acknowledge those who have played instrumental role during this endeavour. I thought that writing
this Acknowledgements section of my dissertation would be easy and quick, however its not the
case. While I was typing this, I came to a realization that this expression of gratitude would be
very limited and always incomplete for certain people including my parents, my brother Paurav,
my friends with whom I spent most of my PhD duration with and Prof. Kalyanmoy Deb.

The role of my family has been extremely important and encouraging in bringing me to the
point I am at right now. Contributions and sacriﬁces made by my father Mr. Deepak Dhebar and
my mother Mrs. Deepika Dhebar are among the things which cannot be quantiﬁed and most of
which I believe are still outside my conscious realm. If it would not have been my brother Paurav,
who spotted my technical aptitude and envisioned me getting into the IIT, my trajectory of life
would be have been completely diﬀerent. The vision, the support and the encouragement provided
to me by my family has played a big part in turning this dream into reality.

I would like to thank my friends who I see as my family far from my hometown, the friends
with whom I stayed and created a permanent bond with during my PhD – thank you Tarang, Swati,
Mayank, Kokil, Ashish, Saptarshi, Sabhyasachi, Thrilok and Vikram Prajapati. I would like to also
thank Abhinav and Kamala for showing their concern and warmth and being like elder siblings in
the town. The stay has been made equally joyous because of the indoor board games amidst the
harsh weather of Michigan and I would like to thank Kanchan, Aritra, Rahul and Bakul for being
the part of the clan.

I would like to thank Pratap Bhanu Solanki for helping me during my transition from IIT Kanpur
to Michigan State University. I would like to thank Nilay Kant for his valuable inputs during my
recent job search and providing me with some fundamental insights across numerous aspects during
my PhD.

I was really fortunate to have great lab-mates as my colleagues at the Computational Optimiza-

v

tion and Innovation Laboratory. I got to learn a lot from them. Thank you Abhinav, Haitham,
Proteek, Rayan, Julian, Zhichao, Khaled, Airuddh, Ali, Abhiroop, Yash, Shuvei, Mohamed and
visiting scholars viz. Ankur, Flavio, Pablo, Sukrit, Sparsh and Hemant. I would like to thank Mrs.
Debjani Deb for organizing lab gatherings and enriching the social culture of the lab.

It was indeed my pleasure to interact with Prof. Ranjan Mukherjee and Mrs. Moushumi
Mukherjee and am thankful to them for introducing me and involving me in the Durga Pooja
festivals and other social gatherings. It was indeed great to experience the Indian traditional culture
on the foreign land.

I would like to thank Prof. Bhaskar Dasgupta and Prof. Bishakh Bhattacharya of IIT Kanpur
and Prof. G. K. Ananthasuresh of IISc Bangalore who wrote my recommendation letter for PhD.
I would like to thank Prof. Niraj Sinha, Prof. Anindya Chatterjee and Prof. Harish Karnick of
IIT Kanpur for giving valuable advice regarding how to choose a career after bachelors and what
to consider while opting for a PhD. I still remember this line from Prof. Chatterjee’s homepage
and it reads: The PhD advisor is more than merely a source of information and funding. Ideally,
your PhD advisor will profoundly inﬂuence not only how you view your subject, but how you view
I could experience this my case. Working under Prof. Kalyanmoy Deb shaped me
the world.
fundamentally and his guidance showed me the way to conduct the life: both professional and
personal.

Thank you Prof. Deb for everything and thank you Prof. Erik Goodman, Prof. Ronald Averill
and Dr. Vishnu Boddeti for being the part of my PhD committee and for providing valuable
feedback and suggestions in regards to my research.

Warm Regards,
Yashesh Dhebar

vi

TABLE OF CONTENTS

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

LIST OF TABLES .
LIST OF FIGURES .
LIST OF ALGORITHMS .
CHAPTER 1 THE NEED AND THE START . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.

x
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
1
2

1.1 Our Work .

. .

. .

.

.

.

.

.

.

.

.

.

.

.

.

.
.

.
.

. .

3.2.1
3.2.2

3.1 Past Studies .
.
3.2 Proposed Approach .

CHAPTER 2 A HIGH LEVEL VIEW OF OUR INTREPRETABLE AI MODEL . .
4
CHAPTER 3 CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
.
.
9
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. 12
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Classiﬁer Representation Using Nonlinear Decision Tree . . . . . . . . . . 12
Split-Rule Discovery Using Bilevel Optimization . . . . . . . . . . . . . . 14
3.3 Bilevel Approach for Split-Rule Identiﬁcation . . . . . . . . . . . . . . . . . . . . 17
3.3.1 The Hierarchical Objectives to derive split-rule . . . . . . . . . . . . . . . 17
3.3.1.1 Computation of  . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.1.2 Computation of  . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 Upper Level Optimization (ULGA) . . . . . . . . . . . . . . . . . . . . . . 20
. 21
3.3.2.1 Custom Initialization for Upper Level GA . . . . . . . . . . . .
3.3.2.2 Ranking of Upper Level Solutions . . . . . . . . . . . . . . . . . 21
3.3.2.3 Custom Crossover Operator for Upper Level GA . . . . . . . . . 22
3.3.2.4 Custom Mutation Operator for Upper Level GA . . . . . . . . .
. 24
3.3.2.5 Duplicate Update Operator . . . . . . . . . . . . . . . . . . . . . 27
. . . . . . . . . . . . . . . . . . . . . 27
Lower Level Optimization (LLGA)
3.3.3.1 Custom Initialization for Lower Level GA . . . . . . . . . . . .
. 28
Selection, Crossover, and Mutation for Lower Level GA . . . . . 30
3.3.3.2
. 31
Termination Criteria for Lower level GA . . . . . . . . . . . .
3.3.3.3
3.4 Ablation Studies and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Ablation Studies on Lower Level GA . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Ablation Studies on the Proposed Bilevel GA . . . . . . . . . . . . . . . . 33
3.5 Visualization of split-rule: X-Space and B-Space
. . . . . . . . . . . . . . . . . . 35
3.6 Overall Tree Induction and Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 Results . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7.1 Customized Datasets: DS1 to DS4 . . . . . . . . . . . . . . . . . . . . . . 39
3.7.2 Breast Cancer Wisconsin Dataset . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.3 Wisconsin Diagnostic Breast Cancer Dataset (WDBC)
. . . . . . . . . . . 41
3.7.4 Real World Auto-Industry Problem (RW-problem) . . . . . . . . . . . . . . 42
3.7.5 Results on Multi-Objective Optimization Problems . . . . . . . . . . . . . 43
Truss 2D and Welded Beam Problems . . . . . . . . . . . . . . . 44

3.7.5.1

3.3.3

. .

. .

.

.

.

.

.

vii

Support Vector Machines (SVMs)

3.7.5.2 Modiﬁed ZDT (m-ZDT) and DLTZ (m-DTLZ) Problems . . . .
3.7.5.3 m-ZDT and m-DTLZ Results: . . . . . . . . . . . . . . . . . .

. 47
. 50
3.8 Additional Comparisons and Results . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.8.1
. . . . . . . . . . . . . . . . . . . . . . 53
3.8.2 Generalized Additive Models (GAMs) . . . . . . . . . . . . . . . . . . . . 55
. 58
3.8.3 Genetic Programming (GP) . . . . . . . . . . . . . . . . . . . . . . . . .
3.8.4 Results
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. 62
3.9 Conclusions and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.10 Parameter Settings .

. .

. .

.

.

.

.

.

3.10.1 Termination Criteria and other Parameter Settings for Inducing a Non-

linear Decision Tree (NLDT) . . . . . . . . . . . . . . . . . . . . . . . .

. 65
3.10.2 Parameter Setting for NSGA-II for multi-objective data creation . . . . . . 65
3.10.3 Parameter Setting for Upper Level GA . . . . . . . . . . . . . . . . . . . . 66
3.10.4 Parameter Setting for Lower Level GA . . . . . . . . . . . . . . . . . . . . 66
3.10.5 Creation of Customized 2D Datasets: DS1- DS4 . . . . . . . . . . . . .
. 67

.

.

CHAPTER 4 CONTROL: INTERPRETABLE POLICY FOR DISCRETE AC-

.

.

.

.

.
.

Introduction .

4.6 Overall Approach .

TION SPACES .
.
.

. .

4.5 Nonlinear Decision Trees (NLDTs) as Policies . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Motivation for the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Related Past Studies .
4.4 Performance Measures
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.1 Open-Loop Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.2 Closed-loop Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
. 76
4.5.1 Binary-split NLDT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.2 Multi-split NLDT .
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6.1 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6.2 Open-loop Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6.2.1 Open-loop training for Binary-split NLDT . . . . . . . . . . . . 80
4.6.2.2 Open-loop training for Multi-split NLDT . . . . . . . . . . . . . 81
4.6.3 Closed-loop Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.7 Experiments: AIM and Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
. . . . . . . . . . . . . . . . . . . . 84
. . . . . . . . . . . . . . . . . . . 85
. 85
4.8.1 CartPole Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.8.1.1 NLDT for CartPole Problem . . . . . . . . . . . . . . . . . . . . 86
4.8.2 CarFollowing Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.8.2.1 NLDT for CarFollowing Problem . . . . . . . . . . . . . . . . . 90
. 92
4.9.1 MountainCar Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.9.2 NLDT for MountainCar Problem . . . . . . . . . . . . . . . . . . . . . . . 93

4.9 Experiments and Analysis on Multiple Discrete Action Space . . . . . . . . . . .

4.8 Experiments and Analysis on Control Tasks with Binary Action Spaces . . . . . .

4.7.1.1 Creation of Regular Dataset
4.7.1.2 Creation of Balanced Dataset

viii

4.9.3 LunarLander Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. 94
4.9.3.1 NLDT for LunarLander Problem . . . . . . . . . . . . . . . . . 95
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.10 Conclusions .

. .

.

.

.

.

CHAPTER 5 SCALE-UP STUDY AND IMPROVISATION . . . . . . . . . . . . . . . 101
5.1 Ablation Study for Open-loop Training . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Closed-loop Visualization .
. 105
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Reengineering NLDT* .
. 111
.
.
5.4 Conclusion .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. .

.

.

.

CHAPTER 6 EXTENSION TO REGRESSION AND CONTINUOUS CONTROL

.

.
.

.
.

. .

PROBLEMS .
.
.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Introduction .
. 113
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.1
6.2
Interpretable AI for Regression Problems using NLDT . . . . . . . . . . . . . . . 113
6.3 Dual Bilevel Algorithm (DBA) for Regression . . . . . . . . . . . . . . . . . . . . 115
6.3.1 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.2 Bilevel Regression Algorithm to Obtain (x)
. . . . . . . . . . . . . . . . 117
6.3.2.1
Lower Level Regression Optimization . . . . . . . . . . . . . . . 118
6.3.2.2 Upper Level Regression Optimization . . . . . . . . . . . . . . . 122
6.3.3 Bilevel Partition Algorithm (BPA) to Obtain (x) . . . . . . . . . . . . . . 123
Lower Level Partition Optimization . . . . . . . . . . . . . . . . 124
6.3.4 Results on a Customized Benchmark Problem . . . . . . . . . . . . . . . . 125
Interpretable AI for Control Problem with Continuous Action Space . . . . . . . . 128
6.4.1 NLDT Representation and Training . . . . . . . . . . . . . . . . . . . . . 131
6.4.2 Data generation .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.4.3 Results on Car-Following Problem . . . . . . . . . . . . . . . . . . . . . . 132

6.3.3.1

6.4

.

CHAPTER 7 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . 136
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
BIBLIOGRAPHY .

.

.

.

.

.

.

.

.

ix

LIST OF TABLES

Table 3.1: Results on DS1 to DS4 datasets (2 features). . . . . . . . . . . . . . . . . . . . . 39

Table 3.2: Results on breast cancer Wisconsin dataset (10 features).

. . . . . . . . . . . . . 40

Table 3.3: Results on WDBC dataset (30 features).

. . . . . . . . . . . . . . . . . . . . . . 41

Table 3.4: Results on the real-world auto-industry problem (36 features).

. . . . . . . . . . 43

Table 3.5: Results on Truss-2D with  = 6. . . . . . . . . . . . . . . . . . . . . . . . . 44

Table 3.6: 2D Truss problem with  = 6 and  = 9.

. . . . . . . . . . . . . . . . . 45

Table 3.7: Results on welded beam design with   = 1 and   = 10.  = 3 is

kept ﬁxed. .

.

.

. .

. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Table 3.8: Parameter setting to generate datasets for m-ZDT and m-DTLZ problems. We

generate 1000 datapoints for each class.

. . . . . . . . . . . . . . . . . . . . . . 49

Table 3.9: Results on multi-objective problems for classifying dominated and non-dominated

solutions.

.

.

.

. .

. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Table 3.10: SVM Result for diﬀerent values of penalty parameter . For each dataset,
the ﬁrst row represents the testing accuracy and the second row represents
complexity (number of support vectors).  = 1000 gives overall best performance. 56

Table 3.11: Details regarding parametric study for GAMs. . . . . . . . . . . . . . . . . . . . 58

Table 3.12: GP Result for diﬀerent values of parsimony coeﬃcient . For each dataset,
the ﬁrst row represents the testing accuracy and the second row represents
complexity (number of internal nodes).  = 0.001 produces better results.

. . . 60

Table 3.13: Summary of results obtained using various methods. For each dataset, the
ﬁrst row indicates testing accuracy and the second row indicates complexity.
Italicized entries are statistically insigniﬁcant (according to 95% conﬁdence in
Wilcoxon rank-sum test) compared to the best entry in the same row. . . . . . . . 63

Table 3.14: Parameter setting to create customized datasets D1-D4. For Class-1 data-

points (, ) = (0, 0.01). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

x

Table 4.1: Class Distribution of Regular Training Datasets for diﬀerent problems. For
each row, the -th number in second column represents the number of datapoints
belonging to -th action.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

Table 4.2: Eﬀect of training data size on performance of NLDT  on CartPole problem. . . 86

Table 4.3: Results on CarFollowing problem correspoing to open-loop training (NLDT ).

89

Table 4.4: Closed-loop performance analysis after re-optimizing NLDT for CarFollowing

problem (k = 103).

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Table 4.5: Mountain car results. The numbers indicate average scores. . . . . . . . . . . . . 93

Table 4.6: LunarLander open-loop training (NLDT ) results. The numbers indicate

average scores.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Table 4.7: Closed-loop performance on LunarLander problem with and without re-optimization

on 26-rule NLDT . Number of rules are speciﬁed in brackets for each NLDT
and total parameters for the DNN is marked. . . . . . . . . . . . . . . . . . . . . 97

Table 4.8: NLDT rules before and after the closed-loop training for LunarLander problem,
for which NLDT* is shown in Figure 4.16. Video showing the simulation
output of the performance of NLDTs with rule-sets mentioned in this table can
be found at https://youtu.be/DByYWTQ6X3E. Respective minimum and
maximum state variables are min = [-0.38, -0.08, -0.80, -0.88, -0.42, -0.85,
0.00, 0.00], max = [0.46, 1.52, 0.80, 0.50, 0.43, 0.95, 1.00, 1.00], respectively.

. 99

Table 5.1: Details regarding custom designed Planar Serial Manipulator environments. . . . 102

Table 5.2: Comparing performance of diﬀerent lower-level optimization algorithms. For

comparison, closed-loop performance of the original DNN policy is also reported.104

xi

LIST OF FIGURES

Figure 2.1: NLDT for classiﬁcation: At each conditional node (white colored), a nonlin-
ear condition (x) ≤ 0 is checked for a datapoint x. The datapoint x traverses
the NLDT by following conditions (x) at each conditional node and reaches
a leaf node (colored) where it is assigned to a class represented by the leaf node.

4

Figure 2.2: NLDT for Regression: Each terminal leaf node (green colored) here repre-
sents a regression equation (x) to predict the value of a dependent variable
. However, these regression rules ( = (x)) are valid only for certain part
of the feature space. These regions of the feature space are deﬁned using the
partition rules (x) at each condition node (white).
. . . . . . . . . . . . . .

.

5

Figure 3.1: An illustration of a Nonlinear Decision Tree (NLDT) for a classiﬁcation
problem. For a given conditional node, the split-rule function (x) is derived
using a dedicated bilevel-optimization procedure. At ﬁrst, the algorithm is
applied to the entire data on Node 1 to obtain split-rule 1(x) ≤ 0. If this
split-rule is not able to partition the data perfectly, then a similar bilevel
optimization is invoked at Node 2 and Node 3 to determine 2(x) and 3(x),
respectively, on the subset of data present in Node 1 (wherein the Node 2 will
have the data satisfying the split-rule 1(x) ≤ 0 and Node 3 will have data
of Node 1 which violates split-rule 1(x) ≤ 0). The process continues until
a certain termination criteria is met. Terminal leaf-nodes are assigned with a
class-label based on the distribution of data in that node. . . . . . . . . . . . . . 12

Figure 3.2:

In this illustration, a two-class data comprising of two features 1 and 2 is
provided. ,  and  are three diﬀerent split-rules. Split-rule  is able to
optimally partition the data, but is complex and may not be interpretable.
Rule  is simple, however it does not produce accurate classiﬁcation. Rule 
is simple as well as classiﬁes the data accurately.

. . . . . . . . . . . . . . . . 18

Figure 3.3:

Illustrations of meaningful and meaningless crossover operations.
In the
meaningful crossover operation, children (green dots) will carry information
of parents (red dots). Hence, they will be located with high probability at the
corners of the rectangle . On the other hand, the meaningless crossover
operation creates children at random locations, without taking leverage of parents. 23

Figure 3.4: Upper Level Crossover Operation . . . . . . . . . . . . . . . . . . . . . . . . . 24

Figure 3.5: Mutation for Upper Level GA . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

xii

Figure 3.6:

Illustration of Data in B-Space. The split function  (x, B, , w, ) is a
straight line in B-space. Dotted lines represent the convex hull engulﬁng the
data. Diﬀerent possible split functions are shown by straight lines , ,  and . 28

Figure 3.7: Mixed dipole (xA, xB) and a hyperplane (wh, Ò).
Figure 3.8: Customized datasets. DS1 is linear and balanced, DS2 is linear but unbal-
anced with minority class having 10x less points. DS3 is nonlinear and DS4
has a sandwiched distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

. . . . . . . . . . . . . . . 30

Figure 3.9: Results on DS1 dataset to benchmark LLGA. Numbers in ﬁrst parenthesis
indicate Type-1 and Type-2 error on training data and the numbers in second
parenthesis indicate Type-1 and Type-2 error on testing data. . . . . . . . . . . . 33

Figure 3.10: Results on DS2 dataset to benchmark LLGA. . . . . . . . . . . . . . . . . . . . 34

Figure 3.11: Bilevel GA results on DS3 and DS4.

. . . . . . . . . . . . . . . . . . . . . . . 35

Figure 3.12: Results on DS3 dataset to benchmark the overall bilevel algorithm.

. . . . . . . 36

Figure 3.13: Results on DS4 dataset to benchmark the overall bilevel algorithm.

. . . . . . . 37

Figure 3.14: Feature Transformation. A point in three-dimensional X-space is mapped to

a point in a two-dimensional B-space for which  = 

1
1 

2
2 

3
3 .

. . . . . . . 38

Figure 3.15: Breast Cancer Wisconsin NLDT. For each node, number of datapoints 
present in the node, impurity of the node (Gini) and class distribution (in
square parenthesis) is reported.

. . . . . . . . . . . . . . . . . . . . . . . . .

. 40

Figure 3.16: B-space plot for Wisconsin breast cancer dataset. . . . . . . . . . . . . . . . . . 41

Figure 3.17: Tree for WDBC dataset. For each node, number of datapoints  present
in the node, impurity of the node (Gini) and class distribution (in square
parenthesis) is reported.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

Figure 3.18: B-space plot for WDBC dataset.

. . . . . . . . . . . . . . . . . . . . . . . . . 42

Figure 3.19: NLDT for the auto-industry problem. The ﬁrst split-rule uses ﬁve variables
and the second one uses 12. For each node, number of datapoints  present
in the node, impurity of the node (Gini) and class distribution (in square
parenthesis) is reported.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

xiii

Figure 3.20: Truss design problem data visualization.   = 1 is kept ﬁxed. For a
ﬁxed value of   , larger value of  implies better separation between
datapoints belonging to two classes. . . . . . . . . . . . . . . . . . . . . . . . . 45

Figure 3.21: Comparison of bilevel and CART methods on truss problem with  = 6

dataset. .

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Figure 3.22: Welded Beam Design Problem data visualization.  = 3 is kept ﬁxed.

Problem at (b) is more diﬃcult to solve than at (a). . . . . . . . . . . . . . . .

. 47

Figure 3.23: Welded beam classiﬁer with one rule for   = 10.

. . . . . . . . . . . . . . . 48

Figure 3.24: m-ZDT Datasets.

.

Figure 3.25: m-DTLZ Datasets. .

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Figure 3.26: Original DS1 and its modiﬁed version.

. . . . . . . . . . . . . . . . . . . . . . 53

Figure 3.27: SVM on separable datasets with a hard margin. . . . . . . . . . . . . . . . . . . 54

Figure 3.28: SVM with non-separable datasets with a soft margin.

. . . . . . . . . . . . .

. 54

Figure 3.29: Eﬀective degree of freedom (EoDF) V/s Accuracy for Cancer-10 dataset.
The best (, ) parameter setting for this dataset is found to be ∗ =
[8, 3, 8, 13, 8, 8, 13, 3, 8, 21] and ∗ = [2, 2, 5, 5, 2, 3, 3, 2, 2, 2].

. . . . . . . . . 59

Figure 3.30: A sample genetic program (GP) tree. The above GP translates to this equation:

 (x) = (5 − 7) + 32.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Figure 3.31: Classiﬁers for Cancer data:  = 0.005:  (x) = 9+

−0.537

(0.1716)(0.17132) and

 = 0.01:  (x) = 2 + −0.502

(0.0776) . . . . . . . . . . . . . . . . . . . . . . . . . . 61
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Figure 4.1: Control Loop .

.

.

.

.

Figure 4.2: Mountain Car problem. It comprises of two state variables ,  and is con-
-1 for deceleration, 0 for nothing and +1 for

trolled using three actions:
acceleration.

. .

. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Figure 4.3: State-action combinations for MountainCar prob.

. . . . . . . . . . . . . . . . 71

Figure 4.4: Performance measures.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Figure 4.5: Binary-split NLDT for Discrete action control systems.

. . . . . . . . . . . . . 77

xiv

Figure 4.6: Multi-split NLDT conﬁguration. Numbers in square brackets indicate class

distribution of datapoints.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Figure 4.7: A schematic of the proposed overall approach.

. . . . . . . . . . . . . . . . .

. 79

Figure 4.8: Three control problems.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Figure 4.9: CartPole NLDT  induced using 10,000 training samples.

It is 91.45%
accurate on the testing dataset but has 100% closed loop performance. Nor-
malization constants are: xmin = [-0.91, -0.43, -0.05, -0.40], xmax = [1.37,
0.88, 0.10, 0.45]. . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Figure 4.10: Reward function for CarFollowing environment.

. . . . . . . . . . . . . . . . . 88

Figure 4.11: Relative distance plot for CarFollowing problem. . . . . . . . . . . . . . . . . . 89

Figure 4.12: NLDT  for the CarFollowing problem. Normalization constants are: min

= [0.25, -7.93, -1.00], max = [30.30, 0.70, 1.00]. . . . . . . . . . . . . . . . . . 90

Figure 4.13: NLDT  for MountainCar problem. Normalization constants: min = [-1.20,

-0.06], max = [0.50, 0.06].

. . . . . . . . . . . . . . . . . . . . . . . . . . .

. 93

Figure 4.14: NLDT-6 (with 26 rules) and other lower depth NLDTs for the LunarLander
problem. Lower depth NLDTs are extracted from the depth-6 NLDT. Each
node has an associated node-id (on top) and a node-class (mentioned in bottom
within parenthesis). Table 4.7 in provides results on closed-loop performance
obtained using these trees before and after applying re-optimization on rule-
sets using the closed-loop training procedure.

Figure 4.15: Final NLDT*-3 for LunarLander prob. (cid:98) is a normalized state variable (see

Section 4.6.1).

. .

. .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

. . . . . . . . . . . . . . . . . . 96

Figure 4.16: Topology of Depth-3 NLDT()

obtained from a diﬀerent run on the Lu-
narLander problem. The equations corresponding the conditional-nodes be-
fore and after re-optimization are provided in Table 4.8.

. . . . . . . . . . . . . 98

 
Figure 4.17: Closed-loop training plot for ﬁnetuning the rule-set corresponding to depth-3

(Table 4.8) to obtain NLDT* for LunarLander problem.

. . . . . .

. 100

NLDT()

 
Figure 5.1: Acrobot and a customized Planar Serial Manipulator benchmark problems. . . . 101

xv

Figure 5.2: Action Vs. Time plot for 5-Link manipulator problem. Figure 5.2b provides
the plot for NLDT* which is obtained from the NLDT  trained using SQP
algorithm in lower-level. Similarly, Figure 5.2c provides the plot for NLDT*
which is obtained from the NLDT  trained using RGA algorithm in lower-level.106

Figure 5.3: Action Vs. Time plot for 10-Link manipulator problem. Figure 5.3b provides
the plot for NLDT* which is obtained from the NLDT  trained using SQP
algorithm in lower-level. Similarly, Figure 5.3c provides the plot for NLDT*
which is obtained from the NLDT  trained using RGA algorithm in lower-level.108

Figure 5.4: NLDTs for 5-Link Manipulator problem.

. . . . . . . . . . . . . . . . . . . . . 110

Figure 5.5: Pruned version of NLDT* (Figure 5.4b) for 5-link manipulator problem.

. . . . 111

Figure 6.1: Piecewise linear regression tree with two predictors from [1]. At each leaf
node, features involved in the expression of two-regressor linear model is
shown. Splits use only one feature variable.

. . . . . . . . . . . . . . . . . . . 114

Figure 6.2: Conceptual layout of NLDT for regression task.

. . . . . . . . . . . . . . . . . 114

Figure 6.3: Three Island Regression Problem. . . . . . . . . . . . . . . . . . . . . . . . . . 118

Figure 6.4: Computation of Lower Level Objective function  based on Algorithm 9.

. . 120

Figure 6.5: Regression algorithm is applied on all datapoints. The obtained regression
rule 0(x) = 0.911 − 0.900 + 0.08 is able to ﬁt the subset of datapoints
which are represented by Y-predicted (in green circles). Partition rule 0(x)
will be now derived to identify the domain in -space (i.e. 0(x) ≤ 0)) where
this regression rule is applicable.

. . . . . . . . . . . . . . . . . . . . . . . . . 123

Figure 6.6: Result on the Pure Three Island Dataset.

(b)
Intertpretable AI representation using NLDT. Normalization constants are:
Xmean = [0.49, 0.49], Xstd = [0.25, 0.25], Ymean = 0.05 and Ystd = 0.28. . . . 126

(a) Visualization of result.

Figure 6.7: Result on the Noisy Three Island Dataset.

(b)
Intertpretable AI representation using NLDT. Normalization constants are:
Xmean = [0.52, 0.52], Xstd = [0.28, 0.27], Ymean = 0.03 and Ystd = 0.29. . . . 127

(a) Visualization of result.

Figure 6.8: A Car-following problem with continuous acceleration. The rear car is con-
trolled by an AI while the car in the front moves with a random acceleration
proﬁle. This problem is similar to the problem shown in Figure 4.8b with the
only diﬀerence being that the cars can now assume a value of acceleration in
range −2/2 to 2/2 (unlike in the previous case where the rear car could
only have an acceleration from pre-speciﬁed discrete values).

. . . . . . . . . . 128

xvi

Figure 6.9: ANN Output for the continuous car-following problem described in Fig-
ure 6.8. Figures (a), (b) and (c) are the plots of diﬀerent state variables w.r.t
to the time-step. The output of the ANN is shown in Figure (d). . . . . . . . . . 129

Figure 6.10: NLDT Agent representation for continuous control task involving one action. . . 131

Figure 6.11: Plots from diﬀerent simulation runs with diﬀerent initial relative velocity
() between cars. The NLDT’s acceleration output (red line) matches with
the ANN’s acceleration output (blue line). NLDT agent behaviour is almost
the same as that of ANN and can thus be used to explain the behaviour of ANN. 133

Figure 6.12: NLDT for car following problem with continuous action space.

. . . . . . . . . 134

xvii

LIST OF ALGORITHMS

Algorithm 1: Pseudo Code to Recursively Induce NLDT.

. . . . . . . . . . . . . . . . . . . 15

Algorithm 2: ObtainSplitRule subroutine. Implements a Bilevel optimization algorithm of
determine a split-rule. The UpperLevel searches the space of Block Matrix B and
modulus-ﬂag . Constraint violation value for an upper level individuals comes
by executing lower-level GA (LLGA) as shown in Algorithm 3.

. . . . . . . . . . . 16

Algorithm 3: EvaluateUpperLevelPop subroutine. A dedicated LowerLevel optimization

is executed for each upper level population member.

. . . . . . . . . . . . . . . . . 17

Algorithm 4: Crossover operation in upper level GA.

. . . . . . . . . . . . . . . . . . . . . 25

Algorithm 5: CartPole Rules. Normalization constants are:  = [-0.91, -0.43, -0.05,

-0.40],  = [1.37, 0.88, 0.10, 0.45]. . . . . . . . . . . . . . . . . . . . . . . . .

. 87

Algorithm 6: Ruleset corresponding to NLDT  (Figure 4.12) of the CarFollowing problem. 90

Algorithm 7: MountainCar NLDT . Normalization constants are:  = [-1.20, -0.06],

 = [0.50, 0.06].

.

.

.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Algorithm 8: Pseudo-code of dual bilevel algorithm (DBA) . . . . . . . . . . . . . . . . .

. 116

Algorithm 9: Algorithm to compute the lower level ﬁtness  . . . . . . . . . . . . . . . . . 119

xviii

CHAPTER 1

THE NEED AND THE START

The year was 2016 and our lab was oﬀered with a project from the Dow Chemical Company. The
goal of that project was to develop an automation system to automatically predict the harmonized
tariﬀ schedule code (HS-code) given the chemical recipe of a product which the Dow Chemical
Company imports or exports. Prof. Deb, Prof. Goodman and I operated as a team from MSU
and collaborated with concerned experts from the Dow who were involved in manually solving the
task of HS-code assignment. At the end of the project, we were able to successfully develop an
automation system to predict the HS-codes of chemical products with a human-level accuracy. This
work attracted attention from various corporate organizations and is currently under the process of
getting patented. During the process of developing a classiﬁer, the team at Dow used to casually
intervene into those slides where I used to show them how the AI doing HS-code prediction looks
like. They were intrigued by the fact that the task being handled manually at Dow was now getting
done by the AI, with the same performance accuracy as that of manual HS-code assignment. More
than that, they were interested to know how the AI is thinking in the background and is coming up
with a HS-code for a supplied chemical product. The AI developed however was fairly complicated
and no such human interpretable logic could be borrowed through the visual inspection of AI, but
we got the signal that a human mind is simply not satisﬁed with answers an AI is providing, and it
longs to understand “why" the thing worked.

The implicit signal to develop an interpretable AI from Dow was made explicit a few months
later. Our lab was approached by General Motors (GM), which wanted to develop a classiﬁer to
distinguish between good and bad designs. The task here was to develop a classiﬁer which could
be read by design engineers at GM. The readable AI would then serve as a tool to develop better
insights regarding the design process and could be used as a recipe by design engineers to develop
and innovate future car designs.

Later, a similar problem of developing an explainable AI was ﬂoated to us by Ford Motors,

1

where the task was to decipher the complicated logic learnt by a deep neural network (DNN)
which is controlling a car for autonomous driving. The research presented in this dissertation is a
byproduct of the process we went through in answering the questions laid out by two automotive
giants. It has triggered many possibilities and we believe that foundation set by our work would
lead to a further advancement in the ﬁeld of interpretable and explainable AI.

1.1 Our Work

The ﬁeld of interpretable and explainable AI has received active attention in the past four years.

Broadly speaking, the interpretable AI approaches are categorized into two main groups

• Intrinsic (or model based) and

• Post-hoc.

Within the framework of intrinsically interpretable AI, the AI model itself is transparent and
interpretable. The task of generating a less complex AI is generally realized using decision trees,
rule-sets [2, 3] or additive models [4, 5, 6]. The aim here is to express an AI (or a subpart of an
AI) in as simple format as possible.

The post-hoc interpretable methods are aimed towards analyzing the behavior of an already
trained AI, mostly using visualization techniques such as partial dependence plots, histogram plots,
heat-maps, attention maps, etc. A more comprehensive explanation to these is provided in [7, 8].
In this dissertation, we focus at developing an intrinsically interpretable AI solutions in form of
a nonlinear decision trees (NLDT). The main criteria we consider in our research is to reduce the
complexity of the transparent AI while ensuring it has a good performance for the machine learning
task under consideration. Additionally, the approach being developed assumes that the features
(or attributes) in a given automation task are interpretable. The optimization algorithm is then
developed to induce an AI which is visually simple, involves less number of terms or functionals
and can be expressed in a format which is humanly comprehensible.

2

The dissertation is organized as follows. First, a birds-eye view of the overall interpretable AI
framework is provided in Chapter 2. In Chapter 3, we formally introduce the algorithm to derive
NLDT for classiﬁcation problems. Next, the approach developed to induce NLDT in Chapter 3 is
extended to solve control problems involving discrete actions. In the same chapter, a reinforcement
learning algorithm in form of closed-loop training is proposed to enhance the control performance
of NLDT. The approach to arrive at interpretable controllers is improvised and tested for scalability
on custom designed benchmarks in Chapter 5. A methodology to extend the concept of NLDT to
solve regression and continuous control problems is discussed in Chapter 6. Concluding remarks
and possible directions for future work are provided in Chapter 7.

3

CHAPTER 2

A HIGH LEVEL VIEW OF OUR INTREPRETABLE AI MODEL

Before we dig into the details, I would like to provide you with a bird’s eye view of the overall
approach. The interpretable and explainable AI we are developing assumes the framework of a
Nonlinear Decision Tree (NLDT). A high-level perspective of NLDT for a classiﬁcation task and
regression task is provided in Figures 2.1 and 2.2 respectively.

Figure 2.1: NLDT for classiﬁcation: At each conditional node (white colored), a nonlinear condition
(x) ≤ 0 is checked for a datapoint x. The datapoint x traverses the NLDT by following conditions
(x) at each conditional node and reaches a leaf node (colored) where it is assigned to a class
represented by the leaf node.

Here (x) and (x) can be any linear or non-linear functions on the input feature vector x.
Conceptually, all AI models can be represented in the NLDT format. For instance, artiﬁcial
neural networks (ANNs) or support vector machines (SVMs) designed for binary classiﬁcation
tasks will have one single condition 0(x) ≤ 0 which will be used to predict if the supplied input
datapoint x belongs to Class 1 or Class 2. A natural extension to handle multi class problems
is possible either by increasing number of conditions (x) or by increasing number of splits per
condition by allowing more branches from a conditional node. We shall visit the topic of handling
multiple classes with our proposed NLDT approach later in the thesis. A complicated axis parallel

4

Figure 2.2: NLDT for Regression: Each terminal leaf node (green colored) here represents a
regression equation (x) to predict the value of a dependent variable . However, these regression
rules ( = (x)) are valid only for certain part of the feature space. These regions of the feature
space are deﬁned using the partition rules (x) at each condition node (white).

decision tree (decision tree involving only rules such as  ≤ , where  is the -th component of
the feature vector x) are by default in the NLDT format.

Similar to classiﬁcation problems, all the AI models developed to handle regression problems
can be represented with the NLDT framework. In case of ANNs, the NLDT representation will
involve only one node (which will be also a terminal node) with regression rule  = 0(x), where
0(x) will represent an ANN.

While ANNs, SVMs or traditional CART decision trees can be used to model classiﬁcation
decision boundaries or regression surfaces, they are expressed with either a very complicated
expression for functions (x) or (x) (like in ANNs or SVMs) but simpler topology (only upto 1
layer of NLDT is suﬃcient), OR are topologically very complex (like CART based decision trees)
but have simple expressions for (x) and (x). Hence, existing AI models lie in either of the two
extremes:

• Either they have a very complicated function representation for (x) and (x), OR

• They are topologically complicated and involves many nodes in the overall structure of the

NLDT.

5

The above mentioned two aspects make these traditional AI models non-comprehensible and

thus non-interpretable.

However, what if we allow a controlled non-linearity for (x) and (x)? Maybe allowing
some degree of non-linearity for (x) and (x) can help us induce a decision tree with fewer
nodes while also ensuring the simplicity of rule expression for (x) and (x) as compared to
black-box AI counterparts like ANN, DNN, SVM 0r CART based DT1.

In this dissertation, we develop algorithms which navigate through the search space of equations
to obtain visually simple linear or non-linear rules ( (x), (x)) as compared to black-box AI
counterparts and eventually induce decision tree with fewer nodes. A more in-depth discussion
will follow in subsequent chapters where we design dedicated algorithms to solve classiﬁcation
and regression problems and apply them to generate a relatively interpretable translator to policy
networks for reinforcement learning tasks. The interpretable AI (IAI) models developed are
relatively simple, easy to read and thus humanly more comprehensible thank black-box AI models.

1In this work, we will use black-box term for AI models represented as DNN, ANN, SVM and topologically

complex CART Based trees.

6

CHAPTER 3

CLASSIFICATION

In a classiﬁcation task, usually a set of labelled data involving a set of features and its belonging to
a speciﬁc class are provided. The task in a classiﬁcation problem solving is to design an algorithm
which works on the given dataset and arrives at one or more classiﬁcation rules (expressed as a
function of problem features) which are able to make predictions with maximum classiﬁcation
accuracy. A classiﬁer can be expressed in many diﬀerent ways suitable for the purpose of the
application.
It can be a procedure in which a feature vector is supplied to the procedure as an
input and the procedure determines the class in which the feature vector belongs. It can also be a
mathematical function of the feature vector, considering only a few features, instead of all features,
that when takes a value within a non-overlapping range in the real space, it belongs to a speciﬁc class.
It can also assume the form of a rule-set system with several “if-then-else" conditional statements.
Each rule in the rule-set system helps to generate an overall classiﬁcation logic. Decision Trees
(DT) fall under this category, wherein the rule-set is represented in an inverted tree based structure.
The non-terminal conditional nodes comprise of conditional statements and the terminal leaf-nodes
have a class label associated with them. A datapoint traverses through the DT by following the
rules at conditional nodes and lands at a particular leaf-node.

The classiﬁer under consideration involves several variables which are supposed to be learned
(or optimized) using an optimization algorithm to achieve an optimal classiﬁcation performance. In
case of mathematical single-rule based classiﬁers like artiﬁcial neural networks (ANNs) or support-
vector-machines (SVM), these variables are associated with connection-weights or coeﬃcients and
location of support-vectors respectively. If the classiﬁer is represented as a rule-set like in DT, each
of the conditional split-rules involve some variables which dictate the equation of the split-rule. The
optimization problem of determining these variables of classiﬁer involves one or more objectives
specifying the quality of classiﬁcation.

Due to their importance in many practical problems involving design, control, identiﬁcation,

7

and other machine learning related tasks, researchers have spent a lot of attention to develop eﬃcient
optimization-based classiﬁcation algorithms [9, 10, 11]. While most algorithms are developed for
classifying two-class data sets, the developed methods can be extended to multi-class classiﬁcation
problems as a hierarchical two-class classiﬁcation problem or by extending them to constitute a
simultaneous multi-class classiﬁcation algorithm.
In this chapter, we will consider the binary
classiﬁcation task. An extension to multi-class classiﬁcation will be discussed later in Chapter 4.
In most classiﬁcation problem solving tasks, maximizing classiﬁcation accuracy or minimizing
classiﬁcation error on the labelled dataset is usually used as the sole objective. However, besides
the classiﬁcation accuracy, in many applications, users are also interested in ﬁnding an easily
interpretable classiﬁer for various practical reasons: (i) it will help identify most important rules
which are responsible for the classiﬁcation task, (ii) it will help provide a more direct relationship
of features to have a better insight for the underlying classiﬁcation task for knowledge enhancement
and future developmental purposes. The deﬁnition of an easily interpretable classiﬁer will largely
depend on the context. In terms of mathematically expressed classiﬁer, this may mean a linear,
polynomial, or posynomial function involving only a few features.
In the case of a DT-based
classiﬁer, this may additionally mean a low-depth tree involving only a few branches of the tree.

In our work, we propose a number of novel ideas. First, instead of a DT, we propose to develop
a nonlinear decision tree (NLDT) as a classiﬁer, in which every non-terminal conditional node will
represent a nonlinear function of features (  (x)) to express a split-rule. Each of these split-rules
will split the data into two non-overlapping subsets. Successive hierarchical splits of these subsets
are carried out and the tree is allowed to grow until one of the termination criteria is met. We argue
that ﬂexibility of allowing nonlinear split-rule at conditional nodes (instead of a single-variable
based rule, which is found in tradition ID3 based DTs [12]) will result in a more compact DT (i.e.
it will have fewer nodes). Second, to derive the split-rule at a given conditional node, a dedicated
bilevel-optimization algorithm is applied. The upper level optimization focuses at determining the
structure of the split-rule, while the lower level optimization searches for the necessary coeﬃcients
(weights) and biases of the corresponding rule-structure as supplied by the upper level. Third,

8

our proposed methodology uses some generic classiﬁcation problem information to make the
overall bilevel optimization algorithm computationally eﬃcient. Fourth, we emphasize simplistic
rule structures in our bilevel optimization method so that obtained rules are also relatively more
interpretable than the black-box AI counterparts like SVM or ANN.

In the remainder of this chapter, we provide a brief survey of the existing approaches of inducing
DTs in Section 3.1. A detailed description about the problem of inducing interpretable DTs from
optimization perspective and a high-level view on proposed approach is provided in Section 3.2.
Next, we provide an in-depth discussion of the bilevel-optimization algorithm which is adopted
to derive interpretable split-rules at each conditional node of NLDT in Section 3.3. Section 3.6
provides a brief overview on the post-processing method which is used to prune the tree to simplify
its topology. Compilation of results on standard classiﬁcation and engineering problems is provided
in section 3.7. The chapter ends with concluding remarks and some highlights on future work in
Section 3.9.

3.1 Past Studies

There exist many studies involving machine learning and data analytics methods for discovering
rules from data. Here, we provide a brief mention of some studies which are close to our proposed
study.

Traditional induction of DTs is done in an axis-parallel fashion [12], wherein each split-rule
is of type  ≤  or  ≥ . Algorithms such as ID3, CART and C4.5 are among the popular
ones in this category. The work done in the past to generate non-axis-parallel trees can be found
in [13, 14, 15], where the researchers rely on randomized algorithms to search for multi-variate
hyperplanes. Work done in [16, 17] use evolutionary algorithms to induce oblique DTs. The idea
of OCI [15] is extended in [18] to generate nonlinear quadratic split-rules in a DT. Bennett Et al.
[19] uses SVM to generate linear/nonlinear split-rules.

However, these works do not address certain key practical considerations, such as the complexity
of the combined split-rules and handling of biased data. Some works which take the aspect of

9

complexity of rule into consideration are discussed next.

In [20], ellipsoidal and interval based rules are determined using the set of support-vectors and
prototype points/vertex points. The authors there primarily focus at coming up with compact set
of if-then-else rules which are comprehensible. Despite its intuitiveness, the approach proposed
in [20] doesn’t result into comprehensible set of rules on high-dimensional datasets. Another
approach suggested in [21] uses the decompositional based technique of deriving rules using the
output of a linear-SVM classiﬁer. The rules here are expressed as hypercubes. The method proposed
is computationally fast, but it lacks in its scope to address nonlinear decision boundaries and its
performance is limited by the performance of a regular linear-SVM. On the other hand, this approach
has a tendency to generate more rules if the data is distributed parallel to the decision boundary.
The study conducted in [22] uses a trained neural-network as an oracle to develop a DT of at least
m-of-n type rule-sets (similar to the one described in ID2-of-3 [23]). The strength of this approach
lies in its potential to scale up. Its pedagogical approach of inducing DT by referring to the oracle
empowers it to create as many synthetic datapoints as desired using the oracle neural-network.
However, its accuracy on unseen dataset usually falls by about 3% from the corresponding oracle
neural-network counterpart. Authors in [24] use a pedagogical technique to evolve comprehensible
rules using genetic-programming. The algorithm G-REX proposed in this work considers the
ﬁdelity (i.e. how closely can the evolved AI agent mimic the behaviour of the oracle NN) as the
primary objective and the compactness of the rule expression is penalized using a parameter to
evolve interpretable set of fuzzy rules for a classiﬁcation problem. The approach is good enough to
produce comprehensible rule-set, but it needs tweaking and ﬁne tuning of the penalty parameter.
A nice summary of the above mentioned methods is provided in [25].
in [26]
implemented a three-objective strategy to evolve fuzzy set of rules using a multi-objective genetic
algorithm. The objectives considered in that research were the classiﬁcation accuracy, number of
rules in the evolved rule-set and the total number of antecedent conditions. In [27], an artiﬁcial
neural network (ANN) is used as a ﬁnal classiﬁer and a multi-objective optimization approach is
employed to ﬁnd simple and accurate classiﬁer. The simplicity is expressed by the number of nodes.

Ishibuchi et al.

10

Hand calculations are then used to analytically express the neural network with a mathematical
function. This procedure however becomes intractable when the evolved neural network has a large
number of nodes.

Genetic programming (GP) based methods have been found eﬃcient in deriving relevant features
from the set of original features which are then used to generate a classiﬁer [28, 29]. In some studies,
the entire classiﬁer is encoded with a genetic-representation and the genome is then evolved using
GP. Some works conducted in this regard also simultaneously consider complexity of the classiﬁer
[30, 31, 32, 33], but those have been limited to axis-parallel or oblique splits. Application of GP
to induce nonlinear multivariate decision tree can be found in [34, 35]. Our approach of inducing
nonlinear decision tree is conceptually similar to the idea discussed in [34], where the DT is induced
in the top-down way and at each internal conditional node, the nonlinear split-rule is derived using a
GP. Here, the ﬁtness of a GP solution is expressed as a weighted-sum of misclassiﬁcation-rates and
generalizability term. However, the interpretability aspect of the split-rule does not get captured
anywhere in the ﬁtness assignment and authors do not report the complexity of the ﬁnal classiﬁer.
No further extension of this study is found in the literature.

In our proposed approach, we attempt to evolve nonlinear DTs which are robust to diﬀerent biases
in the dataset and simultaneously target in evolving nonlinear yet simpler polynomial split-rules at
each conditional node of NLDT with a dedicated bilevel genetic algorithm (shown pictorially in
Figure 3.1). In oppose to the method discussed in [34], where the ﬁtness of GP individual doesn’t
capture the notion of complexity of rule expression, the bilevel-optimization proposed in our work
deals with the aspects of interpretability and performance of split-rule in a logically hierarchical
manner (conceptually illustrated in Figure 3.2). Results indicate that the proposed bilevel algorithm
for evolving nonlinear split-rules eventually generates classiﬁers which are simpler than other
black-box AI and traditional machine learning (ML) based classiﬁers and have high or comparable
classiﬁcation accuracy on all the test problems used in this study.

11

3.2 Proposed Approach

3.2.1 Classiﬁer Representation Using Nonlinear Decision Tree

As mentioned before, the classiﬁer developed in this work is represented in the form of a nonlinear
decision tree (NLDT) as depicted in Figure 3.1.

Figure 3.1: An illustration of a Nonlinear Decision Tree (NLDT) for a classiﬁcation problem.
For a given conditional node, the split-rule function (x) is derived using a dedicated bilevel-
optimization procedure. At ﬁrst, the algorithm is applied to the entire data on Node 1 to obtain
split-rule 1(x) ≤ 0.
If this split-rule is not able to partition the data perfectly, then a similar
bilevel optimization is invoked at Node 2 and Node 3 to determine 2(x) and 3(x), respectively,
on the subset of data present in Node 1 (wherein the Node 2 will have the data satisfying the
split-rule 1(x) ≤ 0 and Node 3 will have data of Node 1 which violates split-rule 1(x) ≤ 0). The
process continues until a certain termination criteria is met. Terminal leaf-nodes are assigned with
a class-label based on the distribution of data in that node.

The decision tree (DT) comprises of conditional (or non-terminal) nodes and terminal leaf-
nodes. Each conditional-node of the DT has a rule associated to it. In NLDT, we allow this rule
to assume a nonlinear equation. A datapoint x traverses the DT based on conditions deﬁned by
split-rules at conditional nodes and eventually lands at a particular terminal leaf node. To make the
DT more interpretable, two aspects are considered:

12

1. Simplicity of split-rule (x) at each conditional nodes (see Figure 3.2) and

2. Simplicity of the topology of overall DT, which is computed by the counting total number of

conditional-nodes in the DT.

Under an ideal scenario, the simplest split-rule will involve just one attribute (or feature) and
can be expressed as  (x) :  −  ≤ 0. Here, the split occurs based on the Ò component of
the overall feature vector x. Since for most of the problems, just one such simple split-rule is not
suﬃcient to partition the data into two-classes, many such splits are used in hierarchical fashion to
partition the dataset, wherein the ﬁrst split is done on the entire training-dataset and the subsequent
splits are conducted on the subsets of the original training-dataset. This exercise usually resolves
into a topologically complicated DT. DTs induced using algorithms such as ID.3 and C4.5 fall
under this category. In the present work, we refer to these trees with CART.

On the other extreme, a topologically-simplest tree will correspond to the DT involving only one
conditional node, but the associated rule is complex to interpret. SVM based classiﬁers fall in this
category, wherein decision-boundary is expressed in form of a complicated mathematical equation.
Another way to represent a classiﬁer is through an artiﬁcial neural network (ANN), which when
attempted to express analytically, will resort into a complicated nonlinear function  (x) without
any easy interpretability.

In this work, we propose a novel and compromise approach to above two extreme cases so
that the resulting DT is not deep and associated split-rule function at each conditional node (x)
assume a nonlinear form with controlled complexity and is easily interpretable as compared to
the rule equation corresponding to SVM (or ANN). Allowing the ﬂexibility to have nonlinear
split-rules is believed to induce the tree which is topologically simpler than the CART counterpart,
while simultaneously ensuring simplicity of split-functions (x). If the split at the root-node is
not suﬃcient to partition the dataset into two classes, subsequent nonlinear splits are determined
in a hierarchical fashion, similar to popular ID3 and C4.5 approaches of inducing CART based

13

decision trees. This process continues until one of the termination criteria is met1. The challenging
task of obtaining nonlinear split-rule  (x) ≤ 0 for each conditional-node is carried out using a
dedicated bilevel-optimization, which we discuss in Section 3.2.2. A high-level perspective of this
is provided in Figure 3.1. During the training phase, NLDT is induced using a recursive algorithm
as shown in Algorithm 1. Brief description about subroutines used in Algorithm 1 is provided
below and relevant pseudo codes for ObtainSplitRule subroutine and an evaluator for upper-level
GA are provided in Algorithm 2 and 3 respectively.

• ObtainSplitRule(Data)

– Input: ( × ( + 1))-matrix representing the dataset comprising of  datapoints with

 features. The last column indicates the class-label.

– Output: Nonlinear split-rule  (x) ≤ 0.
– Method: Bilevel optimization is used to determine  (x). Details regarding this are

discussed in Sec 3.3.

• SplitNode(Data, SplitRule)

– Input: Data, split-rule  (x) ≤ 0.
– Output: LeftNode and RightNode which are node-data-structures, where LeftN-
ode.Data represents datapoints in the input Data satisfying  (x) ≤ 0 while RightN-
ode.Data represents datapoints in the input Data violating the split-rule.

3.2.2

Split-Rule Discovery Using Bilevel Optimization

In this chapter, we restrict the expression of split-rule at a conditional node of the decision tree
operating on a feature vector x to assume the following structure:

 (x, w, , B) ≤ 0,
1Details regarding termination criteria can be found in the Section 3.10

Rule :

(3.1)

14

Algorithm 1: Pseudo Code to Recursively Induce NLDT.
Input: Dataset
Function UpdatedNode = InduceNLDT(Node, depth):

Node.depth = depth;
if TerminationSatisﬁed(Node) then

Node.Type = ‘leaf’;

else

Node.Type = ‘conditional’;
Node.SplitRule = ObtainSplitRule(Node.Data);
[LeftNode, RightNode] = SplitNode(Node.Data, Node.SplitRule);
Node.LeftNode = InduceNLDT(LeftNode, depth + 1);
Node.RightNode = InduceNLDT(RightNode, depth + 1);

end
UpdatedNode = Node;

// NLDT Induction Algorithm
RootNode.Data = Dataset;
Tree = InduceNLDT(RootNode, 0)

where  (x, w, , B) can be expressed in two diﬀerent forms depending on whether a modulus
operator  is sought or not:

 (x, w, , B) =

1 + 11 + . . . +   ,
(cid:12)(cid:12)1 + 11 + . . . +   (cid:12)(cid:12) −(cid:12)(cid:12)2(cid:12)(cid:12) ,

if  = 0,

if  = 1.

(3.2)

Here, ’s are coeﬃcients or weights of several power-laws (’s), ’s are biases,  is the modulus-
ﬂag which indicates the presence or absence of the modulus operator,  is a user-speciﬁed parameter
to indicate the maximum number of power-laws () which can exist in the expression of  (x), and
 represents a power-law rule of type:

 = 

1 × 
1

2 × . . . × 
2


,

(3.3)

15

Algorithm 2: ObtainSplitRule subroutine. Implements a Bilevel optimization algorithm
of determine a split-rule. The UpperLevel searches the space of Block Matrix B and
modulus-ﬂag . Constraint violation value for an upper level individuals comes by
executing lower-level GA (LLGA) as shown in Algorithm 3.
Input: Data
Output : SplitRule // split-rule  (x)
Function SplitRule = ObtainSplitRule(Data):

Initialize:  // Upper Level population
// Execute LLGA (Algorithm 3)
 = EvaluateUpperLevelPop();
// Upper Level GA Loop
for gen = 1:MaxGen do


);

= Selection();
= Crossover(


Ò


);
 = Mutation(Ò
// Execute LLGA (Algorithm 3)
 = EvaluateUpperLevelPop();
// Elite preservation
 = SelectElite(
if TerminationConditionSatisﬁed then

, );


break;

end

end
// Extract best solution in 
SplitRule = ObtainBestInd();

B is a block-matrix of exponents  , given as follows:



B =

11 12 13

. . . 1

21 22 23
...
...

...

. . . 2
...
. . .

 1  2  3

. . .  



.

(3.4)

Exponents  ’s are allowed to assume values from a speciﬁed discrete set E. In this work, we set
 = 3 and E = {−3,−2,−1, 0, 1, 2, 3} to limit the maximum complexity of the rule, however value
of  and set E can be changed by the user. Parameters  and  are real-valued variables in [−1, 1].
The feature vector x is a datapoint in a -dimensional space. Another user-deﬁned parameter max

16

Algorithm 3: EvaluateUpperLevelPop subroutine. A dedicated LowerLevel optimization
is executed for each upper level population member.

Input:  // Upper Level population
(cid:48)
Output
 // Evaluated Population
:
Function (cid:48)
 = EvaluateUpperLevelPop():

for i = 1:PopSize do

// Execute LLGA (see Section 3.3.3)
[, w, ] = LLGA(Data, B, );
[[i].w, [i].] = [w, ];
// Constraint and Fitness Value
[i].CV =  − ;
[i]. = MaxActiveTerms(B);

end
(cid:48)
 = 

controls the maximum number of variables that can be present in each power-law . The default
is max =  (i.e. dimension of the feature space).

3.3 Bilevel Approach for Split-Rule Identiﬁcation

3.3.1 The Hierarchical Objectives to derive split-rule

Here, we illustrate the need of formulating the problem of split-rule identiﬁcation as a bilevel-
problem using Figure 3.2.

The geometry and shape of split-rules is deﬁned by exponent terms   appearing in its expres-
sion (Eq. 3.2 and 3.3) while the orientation and location of the split-rule in the feature-space is
dictated by the values of coeﬃcients  and biases  (Eq. 3.2). Thus, the above optimization task
of estimating split-rule  (x) involves two types of variables:

1. Discrete: -matrix representing exponents of -terms (i.e.   as shown in Eq. 3.4) and the
modulus ﬂag  indicating the presence or absence of a modulus operator in the expression
of  (x), and

2. Continuous: weights w and biases  in each rule function  (x).

17

Figure 3.2: In this illustration, a two-class data comprising of two features 1 and 2 is provided.
,  and  are three diﬀerent split-rules. Split-rule  is able to optimally partition the data, but
is complex and may not be interpretable. Rule  is simple, however it does not produce accurate
classiﬁcation. Rule  is simple as well as classiﬁes the data accurately.

Identiﬁcation of a good structure for  terms and value of  is a more diﬃcult task, compared to the
weight and bias identiﬁcation. We argue that if both types of variables are concatenated in a single
genome, a good (B, ) combination may be associated with a not-so-good (w, ) combination
(like split-rule  in Fig. 3.2), thereby making the whole solution vulnerable to deletion during
evolution. It may be better to separate the search of a good structure of (B, ) combination from
the weight-bias search at the same level, and search for the best weight-bias combination for every
(B, ) pair as a detailed task. This hierarchical structure of the variables motivates us to employ
a bilevel optimization approach [36] to handle above variables. The upper level optimization
algorithm searches the space of B and . Then, for each (B, ) pair, the lower level optimization
algorithm is invoked to determine the optimal values of w and . Referring to Fig. 3.2, the upper-
level will search for the structure of  (x) which might have a nonlinearity (like in rule ) or might
be linear (like rule ). Then, for each upper-level solution (for instance a solution corresponding
to a linear rule structure), the lower-level will adjust its weights and biases to determine its optimal
location .

18

The bilevel optimization problem can be then formulated as shown below:

Min. (B, , w∗, ∗),
s.t.

(w∗, ∗) ∈ argmin(cid:8)(w, )|(B,)(cid:12)(cid:12)
−1 ≤  ≤ 1, ∀,  ∈ [−1, 1]+1(cid:9) ,

(w, )|(B,) ≤  ,

(3.5)

 ∈ {0, 1},   ∈ {−3,−2,−1, 0, 1, 2, 3},

where the upper level objective  is quantiﬁes the simplicity of the split-rule, and, the lower-level
objective  quantiﬁes quality of split resulting due to split-rule  (x) ≤ 0. An upper-level solution
is considered feasible only if it is able to partition the data within some acceptable limit which is
set by a parameter . A more detailed explanation regarding the upper level objective  and the
lower-level objective  is provided in next sections. A pseudo code of the bilevel algorithm to
obtain split-rule is provided in Alogrithm 2 and the pseudo code provided in Algorithm 3 gives on
overview on the evaluation of population-pool in upper level GA. In next sections, we provide a
mathematical insight on the procedure to compute lower-level and upper level objective functions,
 and  respectively.

3.3.1.1 Computation of 

The impurity of a node in a decision tree is deﬁned by the distribution of data present in the node.
In this work, we use gini-score to gauge the impurity of a node. Thus, for a given parent node  in
a decision tree and two child nodes  and  resulting from it, the net-impurity of child nodes ()
can be computed as follows:

(cid:18) 


(w, )|(B,) =

(cid:19)

Gini() + 


Gini()

,

(w,,B,)

(3.6)

where  is the total number of datapoints present in the parent node , and  and  indicate
the total number of points present in left () and right () child nodes, respectively. Datapoints in
(x) ≤ 0, x ∈ ) (Eq 3.1) go to left node, while the
 which satisfy the split-rule at node  (i.e.
rest go to the right node. The objective  of minimizing the net-impurity of child nodes favors

19

the creation of purer child nodes (i.e. nodes with a low gini-score value). For an ideal split, the
left and right child nodes will have completely homogeneous data, i.e. all datapoints present in the
node will belong to one class (say left node will have all points from Class-1 while right node will
have all points from Class-2).

3.3.1.2 Computation of 

The objective  is subjective in its form since it targets at dealing with a subjective notion of
generating visually simple equations of split-rule. Usually, equations with more variables and terms
seem visually complicated. Taking this aspect into consideration,  is set as the total number of
non-zero exponent terms present in the overall equation of the split-rule (Eq. 3.1). Mathematically,
this can be represented with the following equation:

(B, , w∗, ∗) =

( ),

(3.7)


=1

=1



where

() =

1,  ࣔ 0,

0,  = 0.

Here, we use only B to deﬁne , but another more comprehensive simplicity term involving
presence or absence of modulus operators and relative optimal weight/bias values of the rule can
also be used.

3.3.2 Upper Level Optimization (ULGA)

A genetic algorithm (GA) is implemented to explore the search space of power-laws ’s and
modulus ﬂag  in the upper level. The genome is represented as a tuple (B, ) wherein B is a
matrix as shown in Equation 3.4 and  assumes a Boolean value of 0 or 1.

The upper level GA focuses at estimating a simple equation of the split-rule within a desired
value of net impurity () of child nodes. Thus, the optimization problem for upper level is

20

formulated as a single objective constrained optimization problem as shown in Eq. 3.5. The
constraint function (w, )|(B,) is evaluated at the lower level of our bilevel optimization
framework. The threshold value  indicates the desired value of net-impurity of resultant child
nodes (Eq. 3.6). In our experiments, we set  to be 0.05. As mentioned before, a solution satisfying
this constraint implies creation of purer child nodes. Minimization of objective function  should
result in a simplistic structure of the rule and the optimization will also reveal key variables needed
in the overall expression.

3.3.2.1 Custom Initialization for Upper Level GA

Minimization of objective  (given in Eq 3.7) requires to have less number of active (or non-zero)
exponents in the expression of split-rule (Eq 3.1). To facilitate this, the population is initialized
with a restriction of having only one active (i.e. non-zero) exponent in the expression of split-rule,
i.e., any-one of the  ’s in the block matrix B is set to a non-zero value from a user-speciﬁed set E
and the rest of the elements of matrix B are set to zero. Note here that only 2 number of unique
individuals ( individuals with  = 0 and  individuals with  = 1) can exist which satisfy the
above mentioned restriction. If the population size for upper level GA exceeds 2, then remaining
individuals are initialized with two non-zero active-terms (i.e. randomly, two  ’s are set to a
non-zero value, while rest of the elements in B are ﬁxed to zero) and the process continues until all
population members in the upper level are initialized uniquely. As the upper level GA progresses,
incremental enhancement in rule complexity is realized through crossover and mutation operations,
which are described next.

3.3.2.2 Ranking of Upper Level Solutions
The binary-tournament selection operation and (+) survival selection strategies are implemented
for the upper level GA. Selection operators use the following hierarchical ranking criteria to perform
selection:

21

Deﬁnition 3.3.1 For two individuals  and  in the upper level, rank() is better than rank( ), when
any of the following is true:

•  and  are both infeasible AND () < ( ),

•  is feasible (i.e. () ≤ ) AND  is infeasible (i.e. ( ) > ),

•  and  both are feasible AND () < ( ),

•  and  both are feasible AND () = ( )
AND () < ( ),

•  and  both are feasible AND () = ( )
AND () = ( ) AND () < ( ).

3.3.2.3 Custom Crossover Operator for Upper Level GA

The main challenging aspect of developing an evolutionary algorithm is to design meaningful
genetic operators. Crossover operation is one such stage in GAs where there is an information
exchange between two (or more) individual species in the parent population pool. The crossover
operation between two (or more) participating parents creates one or more children. For a real
coded GA involving only real continous variables, Simulated Binary Crossover (SBX) [37] is one
such meaningful genetic operator for exchanging information between two parents to create two
children. A 2D illustration of the concept of meaningful crossover is provided in Figure 3.3.

In our case, in the upper level, the crossover operator needs to be designed to meaningfully
crossover two parent equations. To do that, population members are ﬁrst clustered according to
their -value – all individuals with  = 0 belong to one cluster and all individuals with  = 1
belong to another cluster. The crossover operation is then restricted to individuals belonging to
the same cluster. The crossover operation selects two parents 1 and 2 from the same cluster
having block matrix BP1
. As
mentioned before, each row of a block matrix B represents a power-law  (Eq. 3.3). The crossover

to create two children with block matrices BC1

and BP2

and BC2

22

(a) Meaningful Crossover (SBX) example

(b) Meaningless Crossover example

Figure 3.3: Illustrations of meaningful and meaningless crossover operations. In the meaningful
crossover operation, children (green dots) will carry information of parents (red dots). Hence, they
will be located with high probability at the corners of the rectangle . On the other hand, the
meaningless crossover operation creates children at random locations, without taking leverage of
parents.

operation is executed separately on each row of block matrices of participating parents to generate
corresponding rows of child block matrices. First, rows of block matrices of participating parents
are rearranged in descending order of the magnitude of their corresponding coeﬃcient values (i.e.
weights  of Eq. 3.2). This way, the most inﬂuential power-law in the equation of parent individual
will be shuﬄed to the ﬁrst row of the rearranged block matrix B(cid:48). The second row of the rearranged
block matrix B(cid:48) will have exponents corresponding to the second most inﬂuential power-law and
and B(cid:48)
so on. Let the parent block matrices be represented as B(cid:48)
after this rearrangement. The
P2
P1
crossover operation is then conducted element wise on each row of B(cid:48)
. Doing so allows
P1
us to mate those rows of the participating parents which had similar importance in the respective
equations of the split-rule. For better understanding of the cross-over operation, a psuedo code is
provided in Algorithm 4 and a schematic is provided in Figure 3.4.

and B(cid:48)
P2

23

(a) Sorting of B-mat

(b) Crossover Operation

Figure 3.4: Upper Level Crossover Operation

3.3.2.4 Custom Mutation Operator for Upper Level GA

In genetic algorithms, the purpose of mutation operation is to conduct local perturbations to parent
solutions. A meaningful and controlled mutation generates the mutated individual in the vicinity
of the unmutated original value2. In real coded GAs, polynomial mutation [38] is popularly used to
execute this task. An illustrative plot of polynomial mutation operator is provided in Figure 3.5a.
Since the upper level of our GA focuses on the search space involving discrete variables, we
use the discretized version of mutation to mutate the values of exponents   (Eq. 3.3) as shown in
Figure 3.5. For a given upper level solution with block matrix B and modulus ﬂag , the probability
with which  ’s and  gets mutated is controlled by parameter 
. From the experiments, value
 = 1/ is found to work well. The mutation operation then changes the value of exponents
of 
2while most of the times mutation is desired to create local perturbations, under some scenario, a randomized

mutation is favoured to increase the diversity of the population pool.

24

Child block matrices BC1

input : Block matrices BP1
output
:
B(cid:48)
P1
B(cid:48)
P2
 = Size(BP1,1);
 = Size(BP1,2);
for  ← 1 to  do

= SortRows(BP1,wP1);
= SortRows(BP2, wP2);

for  ← 1 to  do

 = rand();
// random no.
if  ≤ 0.5 then

else

BC1(, ) = BP1(, );
BC2(, ) = BP2(, );
BC1(, ) = BP2(, );
BC2(, ) = BP1(, );

between 0 and 1.

Algorithm 4: Crossover operation in upper level GA.

and BP2

, and weight vectors wP1
and BC2

.

and wP2

.

end

end

end

  and the modulus ﬂag . Let the domain of   be given by E, which is a sorted ﬁnite set of
allowable exponents. Since E is sorted, its elements can be accessed using an integer-id , with
id-value of  = 1 representing the smallest exponent and  =  representing the largest exponent.
In our case, E = [−3,−2,−1, 0, 1, 2, 3] (making  = 7). The mechanism with which the mutation
operator mutates the value of   for any arbitrary sorted array E is illustrated in Fig. 3.5.

Here, the red tile indicates the index () of exponent   in array E. Red shaded vertical bars
indicate the probability distribution for obtaining mutated-values. The   can be mutated to either
of  − 2,  − 1,  + 1, or  + 2 id-values with a probability of , , , and  respectively. The
parameter  is preferred to be greater than one and is supplied by the user. The parameter  is then
computed using the following equation:

 =

1

2(1 + ) .

25

(3.8)

(a) Polynomial Mutation on the -th component of an individual x. The
probability distribution curve (, ) dictates the location where the -th
) will be after mutating the parent component
component of child (i.e. 

has a higher probability of getting
(i.e. 

. The steepness of the probability curve
created near the parent value 

(, ) can be adjusted by parameter , with higher values of  giving
steeper curve.

). Here, the mutated value 


(b) Discretized Mutation mimics the behaviour of real polynomial mutation
discussed in ﬁgure above. Red vertical bars indicate the probability distri-
bution of obtaining mutated values. The parent unmutated value is located
at -th index value. Here, we intentionally avoid creating mutated value on
the parent value and encourage creation of mutated values at immediate two
neighboring allowable values to facilitate the creation of unique solutions.

Figure 3.5: Mutation for Upper Level GA

The above formula to compute  is derived by equating the sum of probabilities to 1.
In our
experiments, we have set  = 3. The value of the modulus ﬂag  is mutated randomly to either 0
or 1 with 50% probability.

In order to bias the search to create simpler rules (i.e. split-rules with a small number of non-
zero  ’s), we introduce a parameter . The value of parameter  indicates the probability
with which a variable   participating in mutation is set to zero (i.e.   ← 0). In our case, we
use  = 0.75. Thus,   → 0 with a net probability of 

 × .

26

3.3.2.5 Duplicate Update Operator

After creating the oﬀspring population, we attempt to eliminate duplicate population members.
For each duplicate found in the child population, the block-matrix B of that individual is randomly
mutated using the following equation

B( , ) = E(),

(3.9)

where  ,  and  are randomly chosen integers within speciﬁed limits. This process is repeated
for all members of child population, until each member is unique within the child population. This
operation allows to maintain diversity and encourages the search to generate multiple novel and
optimized equations of the split-rule.

(+) survival selection operation is then applied on the combined child and parent population.

The selected elites then go to the next generation as parents and the process repeats.

The parameter setting for upper level GA can be found in Section 3.10.

3.3.3 Lower Level Optimization (LLGA)

For a given population member of the upper level GA (with block matrix B and modulus ﬂag  as
variables), the lower level optimization problem determines the coeﬃcients  (Eq. 3.2) and biases
 such that (w, )|(B,) (Eq. 3.6) is minimized. Thus, the lower level optimization problem
can be stated as below:

Minimize:

(w, )|(B,),
w ∈ [−1, 1] ,  ∈ [−1, 1]+1.

(3.10)

We describe the details of the real-parameter GA used for solving the lower level problem next.

27

3.3.3.1 Custom Initialization for Lower Level GA

One of the most crucial and eﬀective operator to determine optimal values of variables of lower
level optimization was the initialization operation. As stated before in Section 3.3.1.1, the lower
level objective function  provides the quantiﬁcation to the net impurity of child nodes (Eq. 3.6).
The number of points going to left child node  and right child node  depends on the sign of the
split function  (x) of the parent node, where  (x) ≤ 0 =⇒ x →  and  (x) > 0 =⇒ x → .
Geometrically the split function  (x, B, , w, ) is linear in the transformed B-space since

3. Hence, the lower level searches the optimal orientation (dictated by w)
and location (dictated by ) of the linear straight line in the mapped B-space as shown in Figure 3.6.

=1 

 
(cid:12)(cid:12)(cid:12)
=1   + 0
=1   + 0



(cid:12)(cid:12)(cid:12) − |1|

if  = 0

if  = 1

 (x, B, , w, ) =

where  =

Figure 3.6: Illustration of Data in B-Space. The split function  (x, B, , w, ) is a straight line
in B-space. Dotted lines represent the convex hull engulﬁng the data. Diﬀerent possible split
functions are shown by straight lines , ,  and .

Since no separation of data is evident if the straight line representing the split function in
3technically, with  = 1, there will be two linear lines which will represent the split fuction. However, each line

will be separating two classes.

28

B-space lies outside of the convex hull (dotted lines in Figure 3.6), the function landscape of
(w, )|B, is ﬂat in the region outside the convex hull (since either  = 0 or  = 0 in
Eq. 3.6). This constitutes a very signiﬁcant portion of the domain of . If the initial population
pool of straight lines lies in the region outside of the convex hull, then the optimization algorithm
gets no signal and motivation to move and reorient pool of initialized straight lines. Under certain
scenarios, crossover and mutation operators might generate solutions representing straight lines
passing through the dataset (shown by  in Figure 3.6), thereby partitioning it into two subsets
 (x) ≤ 0 and  (x) > 0. In practice, a lot of computation is wasted while arriving at a situation
where most of the individuals in the population pool represent straight lines in B-space which
crosses the convex hull and partition the data.

In a bilevel optimization, the lower level problem must be solved for every upper level variable
vector, thereby requiring a computationally fast algorithm for the lower level problem. Considering
these aspects, instead of creating every population member in lower level randomly, we use the
mixed dipole concept [39, 17, 40] to initialize the population pool of straight lines in B-space to
ensure each of the initialized member partitions atleast one point in the dataset from the rest as
shown in Figure 3.7. This smart initialization facilitates faster convergence towards optimal (or
near-optimal) values of w and .

The dipole based initialization is done in the following way:

Step 1: Randomly pick two datapoints xA and xB such that xA ∈ Class-1 and xB ∈ Class-2.
Step 2: Weights wh and bias Ò of the hyperplane  (where (wh, Ò) : wh · x + Ò = 0) separating

xA and xB can be computed with equation mentioned below

wh = xA − xB,
Ò = Ɗ × (−wh · xB) − (1 − Ɗ) × (wh · xA),

(3.11)

where Ɗ is a random number between 0 and 1. This is pictorially represented in Fig. 3.7.

29

Step 3: Set values of variables w and  of a population member of lower level GA as shown below

w = wh,

1 = Ò,
2 = min(Ɗ, 1 − Ɗ)

if  = 1.

(3.12)

(3.13)

(3.14)

Step 4: Repeat Steps 1-3 until all population members of lower level GA are initialized.

Figure 3.7: Mixed dipole (xA, xB) and a hyperplane (wh, Ò).

As mentioned before, this method of initialization is found to be more eﬃcient than randomly
initializing w and Ǝ since in the region beyond the convex-hull bounding the training dataset (which
constitutes major volume of the domain space), the landscape of  is ﬂat thereby making any
optimization algorithm to stagnate.

3.3.3.2 Selection, Crossover, and Mutation for Lower Level GA

The binary tournament selection [38], SBX crossover [37] and polynomial mutation [41] were used
to create the oﬀspring solutions. ( + ) survival selection strategy was then adopted to preserve
elites.

30

3.3.3.3 Termination Criteria for Lower level GA

The lower level GA is terminated after a maximum of 50 generations were reached or when the
change in the lower level objective value (i.e. ) is less than 0.01% in the past 10 consecutive
generations.

Other parameters setting for lower-level GA can be found in Section 3.10.

3.4 Ablation Studies and Comparison

In this section, we test the eﬃcacy of lower-level and upper-level optimization algorithm by
applying them on four customized problems (DS1-DS4) which are shown pictorially in Fig. 3.8.
Their performances are compared against two standard classiﬁers: CART and support-vector-
machines (SVM)4.

3.4.1 Ablation Studies on Lower Level GA

Since the lower level GA (LLGA) focuses at determining coeﬃcients  and bias  of linear
split-rule in B-space, DS1 and DS2 datasets are used to guage its eﬃcacy. The block matrix B is
ﬁxed to 2 × 2 identity matrix and the modulus-ﬂag  is set to 0. Hence, the split-rule (classiﬁer)
to be evolved is

 (x) = 1 + 11 + 22,

where 1, 2 and 1 are to be determined by the LLGA.

The classiﬁer generated by our LLGA is compared with that obtained using the standard SVM
algorithm (without any kernel-trick) on DS1 dataset. Since DS2 dataset is unbalanced, SMOTE
algorithm [42] is ﬁrst applied to over-sample datapoints of the minority class before generating
the classiﬁer using SVM. For the sake of completeness, classiﬁer generated by SVM on DS2
dataset, without any oversampling, is also compared against the ones obtained using LLGA and

4Here, the support vector machine is applied without any kernal-trick.

31

(a) DS1 Dataset.

(b) DS2 Dataset.

(c) DS3 Dataset.

(d) DS4 Dataset.

Figure 3.8: Customized datasets. DS1 is linear and balanced, DS2 is linear but unbalanced with
minority class having 10x less points. DS3 is nonlinear and DS4 has a sandwiched distribution.

SMOTE+SVM. The results are shown in Fig. 3.9 and Fig. 3.10, respectively, for DS1 and DS2
datasets. Type-1 and Type-2 errors are reported for each experiment5.

It is clearly evident from the results shown in Fig. 3.9 that the proposed customized LLGA
is more eﬃcient than SVM in arriving at the desired split-rule as a classiﬁer. The LLGA is able
to ﬁnd the decision boundary with 100% prediction accuracy on training and testing datasets.
However, the classiﬁer generated by SVM has an overlap with the cloud of datapoints belonging to
the scattered class and there-by resulting into the accuracy of 89% on the testing dataset.

On the DS2 dataset, SVM is required to rely on SMOTE to synthetically generate datapoints
5Type-1 error indicates the percentage of datapoints belonging to Class-1 getting classiﬁed as Class-2. Type-2

error indicates the percentage of points belonging to Class-2 getting classiﬁed as Class-1).

32

(a) LLGA: (0%, 0%); (0%, 1%)

(b) SVM: (0%, 17%); (0%, 10%)

Figure 3.9: Results on DS1 dataset to benchmark LLGA. Numbers in ﬁrst parenthesis indicate
Type-1 and Type-2 error on training data and the numbers in second parenthesis indicate Type-1
and Type-2 error on testing data.

in order to achieve respectable performance as can be seen from Fig. 3.10b (without SMOTE) and
3.10c (with SMOTE). However, LLGA performs better without SMOTE, as shown in Fig. 3.10a.
Ablation study conducted above on LLGA clearly indicates that the customized optimization
algorithm developed for estimating weights w and bias  in the expression of a split-rule is reliable
and eﬃcient and could easily outperform the classical SVM algorithm.

3.4.2 Ablation Studies on the Proposed Bilevel GA

As mentioned before, the upper level of our bilevel algorithm aims at determining optimal power-
law structures (i.e. s) and the value of modulus ﬂag . Experiments are performed on DS3
and DS4 datasets to guage the eﬃcacy of upper level (and thus the overall bilevel) GA. No prior
information about the optimal values of block matrix B and modulus ﬂag  is supplied. Thus, the
structure of the equation of the split-rule is unknown to the algorithm. Results of these experiments
are shown in Figs. 3.11a and 3.11b.

As can be seen from Fig. 3.11a, the proposed bilevel algorithm is able to ﬁnd quatratic split-rule
which is able to partition the DS3 dataset with no Type-I or Type-II error. Results on DS4 dataset
validated the importance of involving modulus-ﬂag  as a search variable for upper level GA in

33

(a) LLGA : (0%, 0%); (0%, 0%)

(b) SVM: (100%, 0%); (100%, 0%)

(c) SVM+SMOTE: (0%, 1.5%); (0%, 1.5%)

Figure 3.10: Results on DS2 dataset to benchmark LLGA.

addition to the block-matrix B. The algorithm is able to converge at the optimal tree topology
(Fig. 3.11b) with 100% classiﬁcation accuracy.

It is to note here that the algorithm had no prior knowledge about optimal block matrix B of
exponents  . This observation suggests that the upper level GA is successfully able to navigate
through the search space of block-matrices B and modulus-ﬂag  (which is a binary variable) to
arrive at the optimal power-laws and split-rule.

Fig. 3.12 and Fig. 3.13 shows plots of a decision boundary (split-rule) obtained using our
bilevel-algorithm on DS3 and DS4 datasets, their corresponding NLDTs and its comparison against
the decision-tree obtain using traditional CART algorithm on DS3 and DS4 datasets.

34

11.522.533.5-0.200.20.40.60.811.2(a) DS3: (0%, 0%); (0%, 0%)

(b) DS4: (0%, 0%); (0%, 0%)

Figure 3.11: Bilevel GA results on DS3 and DS4.

3.5 Visualization of split-rule: X-Space and B-Space

A comprehensible visualization of the decision boundary is possible if dimension of the feature
space is up to three. However, if the dimension  of the original feature space (or, X-space) is larger
than three, the decision-boundary can be visualized on the transformed-feature-space (or B-space).
In our experiments, the maximum number of allowable power-law-rules () is ﬁxed to three (i.e.
 = 3). Thus, any datapoint x in a -dimensional X-space can be mapped to a three-dimensional
B-space (Eq. 3.3). A conceptual illustration of this mapping is provided in Fig. 3.14, where, for the
sake of simplicity, a three-dimensional X-space (1 to 3) is mapped to a two-dimensional B-space
using two power-law-rules (1 and 2).

It is to note here that the power-law rules ’s are not known a priori. The proposed bilevel
optimization method determines them so that the data becomes linearly separable in B-space.
Results and discussions provided in Section 3.7 provide clarity on this novel aspect of our nonlinear
decision tree representation.

35

(a) Bilevel GA results: (0%, 0%) and (1%, 0%).

(b) Obtained NLDT for DS3 problem – One rule.

(c) CART results: (0%, 4%) and (4%, 3%) – 13 rules.

Figure 3.12: Results on DS3 dataset to benchmark the overall bilevel algorithm.

3.6 Overall Tree Induction and Pruning

Once the split-rule for a conditional node is determined using the above bilevel optimization

procedure, resulting child nodes are checked for the following termination criteria:

• Depth of the Node > Maximum allowable depth,

• Number of datapoints in node < , and

• Node Impurity ≤ .

36

x1≤1.01x0≤0.753x0≤0.34x1≤0.461x1≤0.84x0≤0.64x0≤0.40x1≤0.82x1≤0.6011x1≤0.58311x1≤0.89313x0≤1.013x1≤0.2731x0≤0.8613(a) Bilevel GA results: (0%, 0%) and (0%, 0%)
– No error.

(b) Obtained NLDT for DS4 problem – One rule.

(c) CART results: (4%, 3%) and (5%, 18%), 27 rules.

Figure 3.13: Results on DS4 dataset to benchmark the overall bilevel algorithm.

If any of the above criteria is satisﬁed, then that node is declared as a terminal leaf node. Otherwise,
the node is ﬂagged as an additional conditional node. It then undergoes another split (using a split-
rule which is derived by running the proposed bilevel-GA on the data present in the node) and the
process is repeated. This overall process is illustrated in Algorithm 1. This procedure eventually
produces the main nonlinear decision tree  . The fully grown decision tree then undergoes
pruning to produce the pruned decision tree  .

During the pruning phase, splits after the root-node are systematically removed until the training
accuracy of the pruned tree does not fall below a pre-speciﬁed threshold value of  = 3%

37

x0≤1.03x0≤0.743x1≤1.52x1≤1.503x0≤0.53x1≤1.93x1≤1.953x0≤0.43x1≤2.20x1≤2.20x1≤2.17x0≤−0.01313x0≤0.07x1≤2.84x1≤2.863113x1≤1.95331x0≤0.6331x1≤1.7313x1≤0.97x0≤0.803x0≤1.01x0≤0.903x1≤1.14x1≤1.243x0≤0.78x1≤1.38x1≤1.4331131313Figure 3.14: Feature Transformation. A point in three-dimensional X-space is mapped to a point
in a two-dimensional B-space for which  = 

3
3 .

1
1 

2
2 

(set here after trial-and-error runs). This makes the resultant tree topologically less complex
and provides a better generalizability. In subsequent sections, we provide results on ﬁnal NLDT
obtained after pruning (i.e.  ), unless otherwise speciﬁed.

3.7 Results

This section summarizes results obtained on four customized test problems, two real-world
classiﬁcation problems, three real-world optimization problems and eight custom designed multi-
objective problems. For each experiment, 50 runs are performed with random training and testing
sets. A dataset is split into training and testing sets with a ratio of 7:3, respectively. Mean and
standard deviation of training and testing accuracy-scores across all 50 runs are evaluated to
gauge the performance of the classiﬁer. Statistics about the number of conditional nodes (given by
#Rules) in NLDT, average number of active terms in split-rules of decision tree (i.e. /Rule) and
rule length (which gives total number of active-terms appearing in the entire NLDT) is provided to
quantify the simplicity and interpretability of classiﬁer (lesser this number, simpler is the classiﬁer).
The comparison is made against classical CART tree solution [12] and SVM [43] results. For SVM,
the Gaussian kernel is used and along with overall accuracy-scores; statistics about the number
of support vectors (which is also equal to the rule-length) is also reported. It is to note that the
decision boundary computed by SVM can be expressed with a single equation, however the length

38

of the equation usually increases with the number of support vectors [44]. We used MATLAB’s
SVM routine with default parameter settings.

For each set of experiments, best scores are highlighted in bold and Wilcoxon signed-rank test is
performed on overall testing accuracy scores to check the statistical similarity with other classiﬁers
in terms of classiﬁcation accuracy. Statistically similar classiﬁers (which will have their -value
greater than 0.05) are italicized.

3.7.1 Customized Datasets: DS1 to DS4

The compilation of results of 50 runs on datasets DS1-DS4 is presented in Table 3.1. The results
clearly indicate the superiority of the proposed nonlinear, bilevel based decision tree approach over
classical CART and SVM based classiﬁcation methods. Bilevel approach ﬁnds a single rule with

Table 3.1: Results on DS1 to DS4 datasets (2 features).

DS1

DS2

Training
Dataset Method
Accuracy
99.78 ± 0.51
Bilevel
97.99 ± 0.96
CART
SVM 98.67 ± 0.31
99.80 ± 0.40
Bilevel
98.80 ± 0.50
CART
SVM 96.95 ± 0.73
99.91 ± 0.35
Bilevel
99.42 ± 0.57
CART
SVM 98.96 ± 0.42
99.34 ± 1.21
Bilevel
96.98 ± 1.19
CART
SVM 98.19 ± 0.44

DS3

DS4

Testing
Accuracy
99.55 ± 1.08
90.32 ± 4.06
98.50 ± 1.09
99.44 ± 0.87
95.43 ± 1.50
95.24 ± 0.16
99.77 ± 0.67
95.00 ± 2.35
98.38 ± 1.15
98.88 ± 1.65
88.68 ± 3.60
97.28 ± 1.19

p-value

–

7.34e-10
1.29e-05

–

5.90e-10
2.43e-10

–

7.00e-10
7.16e-09

–

7.31e-10
1.31e-05

#Rules
1.0 ± 0.0
14.5 ± 1.7

1

1

1

1

1.0 ± 0.0
11.0 ± 1.4

1.0 ± 0.0
11.5 ± 1.3

1.2 ± 0.4
31.3 ± 4.2

1

1

FU/Rule
2.3 ± 0.6
71.5 ± 3.3
2.3 ± 0.7
44.7 ± 1.9
2.2 ± 0.5
62.1 ± 3.4
2.6 ± 0.6
89.8 ± 3.3

1

1

Rule
Length
2.3 ± 0.6
14.5 ± 1.7
71.5 ± 3.3
2.3 ± 0.7
11.0 ± 1.4
44.7 ± 1.9
2.2 ± 0.5
11.5 ± 1.3
62.1 ± 3.4
3.1 ± 1.4
31.3 ± 4.2
89.8 ± 3.3

two to three appearances of variables in the rule, whereas CART requires 11 to 31 rules involving
a single variable per rule, and SVM requires only one rule but involving 44 to 90 appearances of
variables in the rule. Moreover, the bilevel approach achieves this with the best accuracy. Thus,
classiﬁers obtained by bilevel approach are more simplistic and simultaneously more accurate.

39

3.7.2 Breast Cancer Wisconsin Dataset

This problem was originally proposed in 1991. It has two classes: benign and malignant, with
458 (or 65.5%) datapoints belonging to benign class and 241 (or 34.5%) belonging to malignant
class. Each datapoint is represented with 10 attributes. Results are tabulated in Table 3.2. The
bilevel method and SVM had similar performance (p-value> 0.05), but SVM requires about 90
variable appearances in its rule, compared to only about 6 variable appearances in the single rule
obtained by the bilevel approach. The proposed approach outperformed the classiﬁers generated
using techniques proposed in [20, 25, 24, 21, 22] in terms of both: accuracy and comprehensibil-
ity/compactness. The NLDT classiﬁer obtained by a speciﬁc run of the bilevel approach has ﬁve
variable appearances and is presented in Fig. 3.15.

Table 3.2: Results on breast cancer Wisconsin dataset (10 features).

Training
Method
Accuracy
98.07 ± 0.39
Bilevel
98.21 ± 0.49
CART
SVM 97.65 ± 0.39

Testing
Accuracy
96.50 ± 1.16
94.34 ± 1.92
96.64 ± 1.16

p-value
0.308
8.51e-09

–

#Rules
1.0 ± 0.0
11.6 ± 2.4

1

FU/Rule
6.4 ± 1.7
89.4 ± 14.8

1

Rule
Length
6.4 ± 1.7
11.6 ± 2.4
89.4 ± 14.8

Figure 3.15: Breast Cancer Wisconsin NLDT. For each node, number of datapoints  present in
the node, impurity of the node (Gini) and class distribution (in square parenthesis) is reported.

Figure 3.16 provides B-Space visualization of decision-boundary obtained by the bilevel ap-
proach, which is able to identify two nonlinear -terms involving variables to split the data linearly

40

to obtain high accuracy.

Figure 3.16: B-space plot for Wisconsin breast cancer dataset.

3.7.3 Wisconsin Diagnostic Breast Cancer Dataset (WDBC)

This dataset is an extension to the dataset of the previous section. It has 30 features with total 356
datapoints belonging to benign class and 212 to malign class. Results shown in Table 3.3 indicate
that the bilevel-based NLDT is able to outperform standard CART and SVM algorithms. The
NLDT generated by a run of the bilevel approach requires seven out of 30 variables (or features)
and is shown in Fig. 3.17. It is almost as accurate as that obtained by SVM and is more interpretable
(seven versus about 107 variable appearances).

Table 3.3: Results on WDBC dataset (30 features).

Training
Method
Accuracy
98.24 ± 0.64
Bilevel
98.76 ± 0.60
CART
SVM 98.65 ± 0.37

Testing
Accuracy
96.20 ± 1.49
92.11 ± 2.07
97.39 ± 1.37

p-value
4.65e-05
1.09e-09

–

#Rules
1.0 ± 0.0
10.8 ± 2.1

1

FU/Rule
9.2 ± 4.1
106.7 ± 6.6

1

Rule
Length
9.2 ± 4.1
10.8 ± 2.1
106.7 ± 6.6

The -space plot (Fig. 3.18) shows an eﬃcient discovery of -functions to linearly classify the

supplied data with a high accuracy.

41

0500100015002000250030003500400000.20.40.60.81Figure 3.17: Tree for WDBC dataset. For each node, number of datapoints  present in the node,
impurity of the node (Gini) and class distribution (in square parenthesis) is reported.

Figure 3.18: B-space plot for WDBC dataset.

3.7.4 Real World Auto-Industry Problem (RW-problem)

This real-world engineering design optimization problem has 36 variables, eight constraints, and
one objective function. The dataset is highly biased, with 188 points belonging to the good-class
and 996 belonging to the bad-class. Results obtained using the bilevel GA are shown in Table 3.4.
The proposed algorithm is able to achieve near 90% accuracy scores requiring only two split-rules.
The best performing NLDT has the testing-accuracy of 93.82% and is shown in Fig. 3.19.

SVM performed the best in terms of accuracy, but the resulting classiﬁer is complicated with

42

about 241 variable appearances in the rule. However, bilevel GA requires only about two rules, each
having about only 10 variable appearances per rule to achieve slightly less-accurate classiﬁcation.
CART requires about 30 rules with a deep DT, making the classiﬁer diﬃcult to interpret easily.

Table 3.4: Results on the real-world auto-industry problem (36 features).

Training
Method
Accuracy
94.36 ± 1.47
Bilevel
98.00 ± 0.55
CART
SVM 94.98 ± 0.73

Testing
Accuracy
89.93 ± 2.04
91.13 ± 1.32
93.24 ± 1.38

p-value
4.35e-09
9.80e-08

–

#Rules
1.9 ± 0.5
29.6 ± 3.9

1

FU/Rule
10.0 ± 2.9
240.8 ± 9.4

1

Rule
Length
18.2 ± 5.9
29.6 ± 3.9
240.8 ± 9.4

Figure 3.19: NLDT for the auto-industry problem. The ﬁrst split-rule uses ﬁve variables and the
second one uses 12. For each node, number of datapoints  present in the node, impurity of the
node (Gini) and class distribution (in square parenthesis) is reported.

3.7.5 Results on Multi-Objective Optimization Problems

After bench-marking the proposed bilevel GA algorithm on standard classical benchmarks and a
single-objective engineering problem, we now evaluate its performance on two real-world multi-
objective problems: welded-beam design and 2D truss design and eight modiﬁed ZDT and DTLZ
problems. We will brieﬂy discuss the procedure used to generate datasets corresponding to these
problems followed by results obtained on them using our approach.

43

3.7.5.1 Truss 2D and Welded Beam Problems

Data creation:

For each problem, NSGA-II [45] algorithm is ﬁrst applied to evolve the population of  individuals
for  generations. Population for each generation is stored. Naturally, population from later
generations are closer to Pareto-front than the population of initial generations. We artiﬁcially
separate the entire dataset into two classes – good and bad – using two parameters   (indicating
an intermediate generation for data collection) and  (indicating the minimum non-dominated
rank for deﬁning the bad cluster). First, population members from   and  generations are
merged. Next, non-dominated sorting is executed on the merged population to determine their non-
domination rank. Points belonging to same non-domination rank are further sorted based on their
crowding-distance (from highest to lowest) [45]. For the good-class, top  points from rank-1
front are chosen and for the bad-class, top  points from  front onward are chosen. Increasing
the  increases the separation between good and bad classes, while the   parameter has the
inverse eﬀect. Parameter setting for NSGA-II algorithm which is used to generate multi-objective
datasets can be found in Section 3.10.

Truss 2D Results:

Truss 2D problem [46] is a three-variable multi objective problem involving one constraint. Visu-
alization of the Truss dataset in the F-space and X-space is provided in Figure 3.20 for  = 6
and  = 9, with   = 1.

Compilation of results obtained using  = 6 is provided in Table 3.5.

Table 3.5: Results on Truss-2D with  = 6.

Training
Method
Accuracy
99.77 ± 0.72
Bilevel
99.34 ± 0.32
CART
SVM 99.66 ± 0.15

Testing
Accuracy
99.54 ± 0.75
98.33 ± 1.10
99.46 ± 0.50

p-value

–

6.04e-08
0.135

#Rules
1.2 ± 0.5
11.06 ± 3.15

1

FU/Rule
2.9 ± 0.5
62.5 ± 2.9

1

Rule
Length
3.3 ± 0.9
11.06 ± 3.15
62.5 ± 2.9

44

(a) F-space plot with  = 6.

(b) X-space plot with  = 6.

(c) F-space plot with  = 9.

(d) X-space plot with  = 9.

Figure 3.20: Truss design problem data visualization.   = 1 is kept ﬁxed. For a ﬁxed value of
  , larger value of  implies better separation between datapoints belonging to two classes.

Clearly, the bilevel NLDT has the best accuracy and fewer variable appearances (meaning easier

intrepretability) compared to CART and SVM generated classiﬁers.

Fig. 3.21a shows a 100% correct classiﬁer with a single rule obtained by our bilevel NLDT,
whereas Fig. 3.21b shows a typical CART classiﬁer with 19 rules, making it diﬃcult to interpret
easily.

A comparison of results obtained using NLDT approach on truss problem with  = 6 and

 = 9 is provided in Table 3.6.

Table 3.6: 2D Truss problem with  = 6 and  = 9.


6
9

Training
Accuracy
99.86 ± 0.36
100 ± 0

Testing
Accuracy
99.6 ± 0.58
99.92 ± 0.19

#Rules
1.20 ± 0.53
1.36 ± 0.48

FU/Rule
2.92 ± 0.52
2.26 ± 0.52

Clearly, since the data is more separated for  = 9, the results for  = 9 are more

accurate and relatively simpler.

45

00.010.020.030.040.050.060.07F10246810F2104ParetoNon-Pareto10.011.52x32.53x20.0050.01x10.00500ParetoNon-Pareto00.010.020.030.040.050.060.07F10246810F2104ParetoNon-Pareto10.011.52x32.53x20.0050.01x10.00500ParetoNon-Pareto(a) Bilevel classiﬁer requiring only one rule.

Figure 3.21: Comparison of bilevel and CART methods on truss problem with  = 6 dataset.

(b) CART classiﬁer with 11 rules.

Welded Beam Design Problem Results:

This bi-objective optimization problem has four variables and four constraints [46]. Here, two sets
of experiments are conducted for two diﬀerent values of   , keeping the value of  ﬁxed to
3. Visualization of these datasets in the objective space (or F-space) is provided in Figure 3.22.

A higher value of   results in good and bad datapoints too close to each other (Figure 3.22b),
thereby making it diﬃcult for the classiﬁcation algorithm to determine a suitable classiﬁer. This
can be validated from the results presented in Table 3.7. However, bilevel NLDTs have produced

46

x0≤0.01x2≤1.8233x2≤2.19x1≤0.01x1≤0.01x0≤0.00x0≤0.00x0≤0.003113x0≤0.001x1≤0.01313x0≤0.0031(a) F-space plot with   = 1.

(b) F-space plot with   = 10.

Figure 3.22: Welded Beam Design Problem data visualization.  = 3 is kept ﬁxed. Problem at
(b) is more diﬃcult to solve than at (a).

one rule with about 2 to 4 variable appearances compared to 39.5 to 126 (on average) for SVM
classiﬁers. CART does well in this problem with, on average, 2.94 to 8.42 rules. The NLDT
(having a single rule) obtained with   = 10 is shown in Fig. 3.23. Interestingly, the bilevel
NLDTs have similar accuracy to that of CART and SVM.

Table 3.7: Results on welded beam design with   = 1 and   = 10.  = 3 is kept ﬁxed.

Method

Training
Accuracy

Testing
Accuracy

99.98 ± 0.06
Bilevel
99.97 ± 0.07
CART
SVM 99.37 ± 0.17

99.39 ± 0.38
Bilevel
99.46 ± 0.27
CART
SVM 99.46 ± 0.19

99.38 ± 0.49
99.50 ± 0.59
99.40 ± 0.40

98.58 ± 1.13
97.72 ± 1.04
98.97 ± 0.54

p-value
  = 1
0.158

–

0.331
  = 10
3.42e-02
4.63e-08

–

#Rules

FU/Rule

Rule
Length

1.0 ± 0.0
2.94 ± 0.71

1

1.0 ± 0.0
8.42 ± 1.42

1

2.6 ± 0.7
39.5 ± 2.5

1

2.6 ± 0.7
2.94 ± 0.71
39.5 ± 2.5

3.9 ± 1.0
126.0 ± 6.8

1

3.9 ± 1.0
8.42 ± 1.42
126.0 ± 6.8

3.7.5.2 Modiﬁed ZDT (m-ZDT) and DLTZ (m-DTLZ) Problems

Problem Deﬁnition and Dataset Creation:

These are the modiﬁed versions of ZDT and DLTZ problems [45, 47]. For m-ZDT problems, the
g-function is modiﬁed to the following:

47

01020304050F100.511.522.533.54F210-3ParetoNon-Pareto01020304050F100.511.522.533.54F210-3ParetoNon-ParetoFigure 3.23: Welded beam classiﬁer with one rule for   = 10.

(2, . . . , ) =1 + 18
 − 1


 = 2 and even

( + +1 − 1)2.

(3.15)

Many two-variable relationships ( + +1 = 1) must be set to be on the Pareto set.

For m-DTLZ problems, the g-function for xm variables is

(xm) =100 ×


 ∈ xm and  is even

( + +1 − 1)2.

(3.16)

Pareto points for m-ZDT and m-DTLZ problems are generated by using the exact analytical
expression of Pareto-set (locations where  = 0). The non-Pareto set is generated by using two
parameters  and    . To compute the location of a point xnp belonging to non-Pareto
set from the location of the point xp on the Pareto set, following equation is used:

 + 1(    + 2)

()
 =()
where 1 ∈ −1, 1,

2 ∈ [0, 1].

(3.17)

(3.18)

Here, 1 and 2 are randomly generated for each ()

. For m-ZDT problems,  = 1, 2, . . . ,

48

Table 3.8: Parameter setting to generate datasets for m-ZDT and m-DTLZ problems. We generate
1000 datapoints for each class.

Prob.
m-ZDT1
m-ZDT1
m-ZDT2
m-ZDT2
m-DTLZ1
m-DTLZ1
m-DTLZ2
m-DTLZ2

nvars spead oﬀset
30
0.1
0.1
500
0.1
30
500
0.1
0
30
0
500
0
30
500
0

0.3
0.3
0.3
0.3
0.05
0.05
0.05
0.05

, ()

while for m-DTLZ problems ()
 ∈ xm. Parameter setting for generating m-ZDT and m-DTLZ
datasets is provided in Table 3.8. For 30 variable problems, all  ∈ xm are changed according to
Eq. 3.18 for generate non-pareto points. However, for 500 vars problems, only ﬁrst 28 of variables
in xm are modiﬁed according to Eq. 3.18 to generate non-pareto data-points.

Visualization of Datasets for m-ZDT and m-DTLZ problems in provided in Figure 3.24 and

Figure 3.25.

(a) m-ZDT1, 30 vars

(b) m-ZDT1, 500 vars

(c) m-ZDT2, 30 vars

(d) m-ZDT2, 500 vars

Figure 3.24: m-ZDT Datasets.

49

(a) m-DTLZ1, 30 vars

(b) m-DTLZ1, 500 vars

(c) m-DTLZ2, 30 vars

(d) m-DTLZ2, 500 vars

Figure 3.25: m-DTLZ Datasets.

3.7.5.3 m-ZDT and m-DTLZ Results:

Experiments conducted on datasets involving 500 features (see Table 3.9) and for two and three-
objective optimization problems conﬁrm the scalability aspect of the proposed approach. In all
these problems, traditional methods like CART and SVM ﬁnd it diﬃcult to conduct a proper
classiﬁcation task. The provision of allowing controlled non-linearity at each conditional-node
provides the proposed NLDT approach with necessary ﬂexibility to make an appropriate overall
classiﬁcation.

50

Table 3.9: Results on multi-objective problems for classifying dominated and non-dominated
solutions.

Method Training Acc.

Testing Acc.

NLDT
CART
SVM

NLDT
CART
SVM

NLDT
CART
SVM

NLDT
CART
SVM

NLDT
CART
SVM

NLDT
CART
SVM

NLDT
CART
SVM

NLDT
CART
SVM

99.18 ± 0.40
97.48 ± 0.37
84.14 ± 2.42

99.23 ± 0.38
97.41 ± 0.36
99.28 ± 0.18

97.21 ± 2.52
89.72 ± 0.88
52.44 ± 0.63

97.76 ± 1.88
85.64 ± 3.33
54.82 ± 0.98

99.20 ± 0.29
98.76 ± 0.27
100.00 ± 0.00

99.18 ± 0.38
98.61 ± 0.36
100.00 ± 0.00

94.36 ± 4.03
83.23 ± 3.52
51.56 ± 0.52

95.89 ± 3.83
89.09 ± 1.75
51.30 ± 0.55

98.96 ± 0.59
94.78 ± 1.02
82.84 ± 2.77

98.96 ± 0.57
94.72 ± 0.89
98.44 ± 0.57

96.65 ± 2.86
71.74 ± 2.49
45.86 ± 1.28

97.22 ± 2.25
63.41 ± 7.59
49.61 ± 1.62

98.93 ± 0.60
93.48 ± 0.94
100.00 ± 0.00

98.88 ± 0.71
93.95 ± 1.33
100.00 ± 0.00

93.77 ± 4.24
58.26 ± 7.37
47.13 ± 1.19

95.33 ± 4.46
70.04 ± 3.46
47.01 ± 1.20

# Rules
m-ZDT1-2-30

1.80 ± 0.60
26.42 ± 1.89
1.00 ± 0.00

m-ZDT2-2-30

1.92 ± 0.56
27.80 ± 2.19
1.00 ± 0.00
m-DTLZ1-3-30
3.12 ± 0.59
93.88 ± 3.23
1.00 ± 0.00
m-DTLZ2-3-30
3.02 ± 0.62
100.82 ± 6.11
1.00 ± 0.00
m-ZDT1-2-500
1.78 ± 0.54
20.58 ± 1.39
1.00 ± 0.00
m-ZDT2-2-500
1.80 ± 0.69
20.98 ± 1.98
1.00 ± 0.00
m-DTLZ1-3-500
3.00 ± 0.45
104.88 ± 5.49
1.00 ± 0.00
m-DTLZ2-3-500
3.04 ± 0.56
96.08 ± 3.93
1.00 ± 0.00

/Rule

Rule Length

4.47 ± 2.21
1.00 ± 0.00

1192.38 ± 11.34

7.64 ± 3.50
26.42 ± 1.89
1192.38 ± 11.34

4.37 ± 1.82
1.00 ± 0.00
315.54 ± 6.12

6.14 ± 2.18
1.00 ± 0.00
1381.56 ± 8.25

5.81 ± 1.95
1.00 ± 0.00
1367.42 ± 8.44

5.66 ± 3.23
1.00 ± 0.00
240.88 ± 4.62

5.30 ± 2.45
1.00 ± 0.00
248.06 ± 4.37

7.61 ± 2.36
1.00 ± 0.00
1382.38 ± 8.84

7.28 ± 3.16
1.00 ± 0.00
1382.62 ± 8.46

8.14 ± 3.36
27.80 ± 2.19
315.54 ± 6.12

19.10 ± 7.03
93.88 ± 3.23
1381.56 ± 8.25

17.50 ± 6.73
100.82 ± 6.11
1367.42 ± 8.44

9.36 ± 4.15
20.58 ± 1.39
240.88 ± 4.62

8.88 ± 3.98
20.98 ± 1.98
248.06 ± 4.37

22.76 ± 7.57
104.88 ± 5.49
1382.38 ± 8.84

22.14 ± 10.45
96.08 ± 3.93
1382.62 ± 8.46

3.8 Additional Comparisons and Results

In previous sections, we focused at fundamental procedure of inducing the NLDT and com-
pared its performance against two standard algorithms: SVM and CART. In this section, we extend
our experimental study to compare our approach against other methods which are strong poten-

51

tial candidates to generate an interpretable AI: generalized additive models (GAMs) and genetic
programming (GP). We also re-iterate experiments using the default parameter settings for NLDT
and CART, while we ﬁne tune the results corresponding to SVM using the best possible parameter
values.

New problems (m-DS1 to m-DS3) are introduced here to obtain further insight into the working
principles of above mentioned algorithms. The customized problems DS1 to DS4, m-DS1 to m-
DS3, m-ZDTs and m-DTLZs are designed to benchmark the performance of classiﬁers on various
properties of datasets such as:

• Data Distribution: For DS1-DS4 datasets, degree of scatter in data varies across classes.
For m-DS1, m-DS2 and m-DS3 the scattering of data for each class is more similar than
that in original DS datasets. A visualisation of feature spaces for DS1 and m-DS1 dataset is
provided in Figure 3.26a and 3.26b, respectively to demonstrate this.

• Geometry of Decision Boundary: Here, the eﬀect of the nature of the simplest possible de-
cision boundary is considered. Decision boundary corresponding to DS1-DS2 and modiﬁed
DS1-DS2 is linear, DS3 and m-DS3 have decision boundary involving nonlinearity of order
2 and DS4 have two disjoint linear decision boundaries.

• Data Bias: Here, eﬀect of bias in class representation is considered. All datasets except DS2
and m-DS2 are balanced. For DS2 and m-DS2, minority class has 5 times less number of
data points as the majority class.

In next few sections, we brieﬂy discuss SVM, GAM and GP to gain a conceptual insight towards
critical parameters which can inﬂuence the classiﬁcation performance and the complexity of the
classiﬁer. A more detailed discussion is provided in [48].

52

(a) DS1 Dataset.

(b) m-DS1 Dataset.

Figure 3.26: Original DS1 and its modiﬁed version.

3.8.1 Support Vector Machines (SVMs)

For a separable dataset, support vector machine (SVM) algorithm attempts to derive a decision
boundary in the form of a single mathematical equation as shown below:

(x) = w

(x) + ,

(3.19)

where (x) is a set of feature transformation functions which can be either linear or non-linear
functions of feature vector x, w is a weight vector and  is a bias term. A conceptual understanding
of SVM is provided in Figure 3.27. For a binary classiﬁcation task involving class labels  = −1
or  = 1, an optimal hyper-surface is derived by maximizing the margin between two classes, as
shown by  = 0 line in the ﬁgure. Points with  ≤ −1 belong to one class and points with  ≥ 1
belong to another class. The points which fall on  = 1 and  = −1 are called support vectors, as
they alone decide the classiﬁer. However, for non-separable datasets, such as the scenario shown
in Figure 3.28, a soft margin approach is used to allow some data points within || < 1 (margin)
while training the SVM. These points are also declared as support vectors in addition to the points
on the margin.

53

Figure 3.27: SVM on separable datasets with a hard margin.

Figure 3.28: SVM with non-separable datasets with a soft margin.

Minimize:

subject to :

=1

2||||2 + 
1
(w (x) + ) ≥ 1 − ,
 ≥ 0,
 = 1, 2, . . . , ,

To identify the classiﬁer and the support vectors, the underlying optimization problem is solved:


,

(3.20)

where  is the true class label (either 1 or -1) of the datapoint,  is the distance of -th data point from
its representative margin, thus  = max [0, 1 −  (x)] (where value of (x) is estimated from
Eq. 3.19).  is a penalty parameter which is used to enhance generalizability by compromising
It is also aimed to balance the complexity of the classiﬁer (described
with training accuracy.

54

with the number of non-zero terms of w) and soft support vectors within the margin and is an
important parameter. With lower values of , broader margin (with some misclassiﬁcation of
training datapoints) is achieved while for large values of , misclassiﬁcation of training datapoints
is heavily penalized and so narrower margin is achieved.

Using a kernel trick [49] (x, x) = (x) (x) Eq. 3.19 is transformed into the following:

(x) =

 (x, x) + ,

(3.21)


=1


=1

=1

=1
 = 0.

where  is a Lagrange multiplier which is obtained by converting the optimization problem of
maximizing the margin (Eq. 3.20) to a dual Lagrangian representation [49]:

Min: (a) =

 − 1
2

    (x, x ),

=1
s.t.  ∈ [0, ],

(3.22)

Classical gradient based algorithms can then be employed to ﬁnd . In Eq. 3.21, data points x
for which  = 0 do not contribute in the equation of the split rule (Eq. 3.21) and data points for
which  > 0 are called support vectors for the SVM classiﬁer and they dictate the overall length of
the classiﬁer’s equation (Eq. 3.21). The penalty parameter  has to be tuned to eﬃciently derive
the decision boundary. Lower value of  makes the classiﬁer more generalizable.  = ∞ (hard
margin) attempts to achieve near 100% training accuracy and hence is prone to overﬁtting. In our
case, we use scikit-learn’s [50] SVM module and set  = 1, 000. We use RBF (or Gaussian) kernel
function. Table 3.10 shows results for various settings of  on some datasets considered in our
study.

3.8.2 Generalized Additive Models (GAMs)

For a binary classiﬁcation task involving two classes: Class 1 ( = 0) and Class 2 ( = 1), the
GAM based classiﬁer [5, 6] estimates the probability of a data point belonging to class  = 1 (i.e.

55

Table 3.10: SVM Result for diﬀerent values of penalty parameter . For each dataset, the ﬁrst
row represents the testing accuracy and the second row represents complexity (number of support
vectors).  = 1000 gives overall best performance.
DS2

Pen. Param.

DS1

DS4

DS3

96.93 ± 1.87
64.68 ± 2.16
99.32 ± 0.70
26.68 ± 1.88
99.75 ± 0.58
10.60 ± 0.92
m-DS3

99.97 ± 0.23
36.54 ± 1.72
100.00 ± 0.00
12.60 ± 0.98
100.00 ± 0.00
8.82 ± 1.01

45.20 ± 4.24
138.76 ± 1.99
68.77 ± 4.00
262.60 ± 4.76
96.63 ± 1.35
56.70 ± 3.13
Cancer-10
97.15 ± 1.08
69.98 ± 6.51
95.98 ± 1.13
56.22 ± 6.43
95.23 ± 1.09
52.36 ± 4.91

 = 1

 = 10

 = 1, 000

Pen. Param.

 = 1

 = 10

 = 1, 000

94.75 ± 1.97
191.94 ± 4.38
98.42 ± 1.16
58.70 ± 2.90
99.88 ± 0.33
8.36 ± 0.87
m-DS1

99.77 ± 0.67
70.22 ± 2.23
100.00 ± 0.00
26.42 ± 1.46
99.93 ± 0.33
7.38 ± 0.75

95.24 ± 0.00
16.56 ± 0.80
95.24 ± 0.00
30.46 ± 0.64
99.70 ± 0.50
8.56 ± 0.67
m-DS2

95.24 ± 0.00
16.18 ± 0.59
98.89 ± 0.85
14.40 ± 0.89
99.97 ± 0.22
5.34 ± 0.55

Pen. Param.

 = 1

 = 10

 = 1, 000

Truss

77.31 ± 2.16
343.76 ± 9.51
81.29 ± 2.19
258.52 ± 10.85
88.54 ± 1.60
176.22 ± 7.87

Welded Beam
98.83 ± 0.70
47.26 ± 3.00
99.53 ± 0.42
17.86 ± 1.73
99.63 ± 0.38
7.88 ± 0.86

Cancer-30
90.83 ± 1.83
106.88 ± 4.44
91.94 ± 1.36
81.66 ± 4.54
95.08 ± 1.65
58.74 ± 5.18

( = 1|x))6 as ˆ(x) using the following equation

ˆ(x) =

1

1 + −(x) ,

(3.23)

where (x) is referred to as link function [51]. The link function (x) in GAM is expressed as a
sum of non-linear functions as shown below:

(x) = 1(x) + 2(x) + · · · + (x) + 0,

(3.24)

6probability of datapoint belonging to other class (i.e.  = 0) will be 1 − ˆ.

56

where 0 is a constant and (x) are scalar valued nonlinear functions. The functional form of
(x) and total number of such nonlinear functions is pre-speciﬁed by the user. Modelling of link
function (x) using Eq. 3.24 makes GAMs more generalizable than its precursor: generalized
linear models (GLMs) [4], which involves only linear terms.

In our experiments, we use penalized B-splines to model non-linearity of each feature separately

(i.e. referring to Eq. 3.24, (x) = ()). Thus, the -function in our case is given by

(x) = 1(1) + 2(2) . . . () + 0,
() =

()   = B(cid:48)

()

() .


=1

where:

(3.25)


Here, () denotes a spline function corresponding to -th feature, ()
() indicates the basis
function of order ,   are scalar coeﬃcients and  is the total number of basis functions used to
model the spline. The order of spline (i.e. ) and the number of basis-functions  is user-speciﬁed.
Once the structure of link function (x) is speciﬁed, an optimization algorithm is invoked to
learn parameters corresponding to basis functions ()
() and coeﬃcients   with an objective
to minimize the error between the estimated value of probability ( ˆ(x) Eq. 3.23) and the actual 
values across the dataset. To make the resulting model more generalize and simple, a second-order
smoothing is employed. Thus, using Eq. 3.23 and 3.25, the overall optimization problem translates
to minimizing the following function:


=1

Þ


=1

Min: (B(cid:48), ) =

( − ˆ(B(cid:48)))2 +

 
((cid:48)(cid:48)
 ( |B(cid:48)

  
))2  ,

(3.26)

where  is the actual class of the -th datapoint (which can have value of either 0 or 1) and ˆ is
the probability of -th point belonging to class  = 1 (i.e. ( = 1|xi)) as predicted by the GAM
classiﬁer using Eq. 3.23.   are the penalty parameters which are prespeciﬁed. In our case, we
use   = 0.6 for all features. The rule complexity of a GAM classiﬁer can be tuned using  ,
where higher values of   imposes heavy penalty on non-linearities with more than second order.
Additionally, the complexity can also be controlled by regulating the degree () and number of

57

basis-functions  (Eq. 3.25). In our experimental setup, we conduct series of experiments using
diﬀerent combinations of (, ) to model splines for each feature. Values of  and  are picked
from the one listed in Table 3.11.

Table 3.11: Details regarding parametric study for GAMs.

# Basis Functions () Degree ()

2, 3, 5, 8, 13, 21

2, 3, 5

Total number of terms arising from the expression of rule (x) (Eq. 3.25) is


=1

Total Terms =

(  + 1) ×   +   + 1.

(3.27)

which are not contributing in minimizing the error

However, due to second-order smoothening eﬀect (Eq. 3.26), non-linearities greater than 2nd order
=1( − ˆ(B(cid:48)))2 will get removed from
the rule and thus, the eﬀective degree of freedom (EoDF) will be far less than the total length of
the rule. Eﬀective degrees of freedom versus accuracy plot for GAM classiﬁers obtained using
various combinations of (, ) on Cancer-10 dataset is shown in Figure 3.29. It is clear that a
high training accuracy is achieved with a large EoDF, but makes an over-ﬁtting and produces less
testing accuracy. About 500 such experiments are performed and the best combinations of (, )
are used to generate results (Table 3.13) for a given dataset. Note here that generating classiﬁers
using GAM is computationally expensive for high-dimensional datasets and so, we do not run
experiments on datasets involving 500 features.

3.8.3 Genetic Programming (GP)

Genetic Programming has been extensively used to derive non-linear and interpretable classiﬁers
[52, 53, 54, 55, 56, 57]. A GP algorithm evolves programs (or equations of classiﬁer’s decision
boundary in our case) using genetic operators like crossover and mutation. Programs in GP are
usually represented with tree architecture as shown in Figure 3.30. Internal nodes of this tree can
involve mathematical operations, like +,×,−,÷, log, sin. Allowable set of mathematical operations

58

(a) Training Accuracy

(b) Testing Accuracy

Figure 3.29: Eﬀective degree of freedom (EoDF) V/s Accuracy for Cancer-10 dataset. The best
(, ) parameter setting for this dataset is found to be ∗ = [8, 3, 8, 13, 8, 8, 13, 3, 8, 21] and
∗ = [2, 2, 5, 5, 2, 3, 3, 2, 2, 2].

Figure 3.30: A sample genetic program (GP) tree. The above GP translates to this equation:
 (x) = (5 − 7) + 32.

are pre-speciﬁed by the user. In our case, we use {+,×,−,÷} only. Terminal leaf nodes of a GP
program either have one of the input feature  or a constant term . It is to note here that a GP tree
(T) represents one non-linear equation and is fundamentally diﬀerent from the decision tree which
involves assembly of split-rule equations which are organized in a hierarchical format. The optimal
structure of tree, operators used, features  involved and value of constants  are all unknown and
are determined through an evolutionary algorithm. The evolution is conducted with an objective to
minimize the cross-entropy loss. However, if unchecked, the size of GP trees grows as the evolution
progress and the GP algorithm suﬀers from bloating [58]. To counter this eﬀect of bloating and
encourage evolution of simpler trees (trees with less number of nodes), a parsimony coeﬃcient 
is used to penalize the ﬁtness of a GP tree (T) as shown below:

Min:

(T) =  +  × ,

(3.28)

59


where  = − 1
=1[(x) log( ˆ(x))+(1− (x)) log(1− ˆ(x))], and ˆ(x) = Sigmoid(  (x)).
In Eq. 3.28,  represents size of the tree and is computed by counting total number of nodes in
the tree.  (x) is the value the GP tree outputs for a given feature vector x (see Figure 3.30).


It is important to choose a suitable parsimony coeﬃcient  for a problem. Smaller value of
 will encourage bloating and will evolve complex equations while the higher value of  will
evolve simpler equations at an expense of reduced classiﬁcation accuracy. In our case, we perform
experiments using three values : 0.01, 0.005 and 0.001, and conduct 50 runs on each dataset
shown in Table 3.12 after randomly splitting the dataset into 70% training and 30% testing for each
run. Statistics regarding testing accuracy and complexity (measured as the total number of internal
nodes) is reported in the table. It is clear from the table that while a small  produces a better
Table 3.12: GP Result for diﬀerent values of parsimony coeﬃcient . For each dataset, the ﬁrst
row represents the testing accuracy and the second row represents complexity (number of internal
nodes).  = 0.001 produces better results.

Pars. coeﬀ.
 = 0.01

 = 0.005

 = 0.001

Pars. coeﬀ.
 = 0.01

 = 0.005

 = 0.001

DS1

61.07 ± 9.91
3.40 ± 3.70
77.3 ± 11.29
16.18 ± 9.99
91.70 ± 6.91
67.72 ± 26.72
m-DS1
89.53 ± 3.27
8.34 ± 1.98
93.37 ± 4.57
16.32 ± 9.55
98.83 ± 1.88
55.38 ± 22.39

DS2

95.24 ± 0.00
1.98 ± 0.14
95.24 ± 0.00
3.86 ± 1.23
95.37 ± 0.63
15.14 ± 13.55

m-DS2
95.65 ± 0.70
3.58 ± 1.07
95.65 ± 0.70
3.76 ± 1.22
96.67 ± 1.93
14.08 ± 9.11

DS3

65.37 ± 11.57
4.44 ± 2.37
86.27 ± 11.41
19.86 ± 11.45
96.50 ± 3.3
76.74 ± 33.36

m-DS3

96.33 ± 4.68
15.04 ± 6.36
98.4 ± 1.99
19.88 ± 9.94
99.27 ± 1.22
49.80 ± 21.69

DS4

49.93 ± 1.43
1.12 ± 3.40
50.37 ± 2.96
2.06 ± 3.88
58.00 ± 11.22
18.76 ± 23.94
Cancer-10
94.03 ± 4.59
5.56 ± 2.06
95.04 ± 1.76
7.88 ± 3.07
96.13 ± 1.29
15.80 ± 5.66

Pars. coeﬀ.
 = 0.01

 = 0.005

 = 0.001

Truss

82.78 ± 11.28
5.20 ± 3.30
90.03 ± 8.50
11.98 ± 7.12
97.36 ± 3.81
36.02 ± 16.99

WeldedBeam
84.88 ± 13.08
9.32 ± 5.15
92.35 ± 6.06
14.08 ± 5.35
96.46 ± 4.14
35.90 ± 18.28

Cancer-30
90.47 ± 4.54
4.78 ± 2.30
90.96 ± 6.29
5.74 ± 1.84
92.40 ± 4.98
14.58 ± 7.14

60

accuracy, a large  produces smaller sized GPs. To demonstrate, we present two GP classiﬁers for
 = 0.005 and 0.01 obtained for the breast cancer Wisconsin dataset (involving total 10 features)
in Figure 3.31. Training () and testing () accuracy are better for  = 0.005.

(a)  = 0.005,  = 96.44,  = 99.02,

Complexity = 6.

(b)  = 0.01,  = 95.60,  = 98.05,

Complexity = 3.

Figure 3.31: Classiﬁers for Cancer data:  = 0.005:
 = 0.01:  (x) = 2 + −0.502
(0.0776) .

 (x) = 9 +

−0.537

(0.1716)(0.17132) and

Table 3.12 indicates that GP does not perform well on certain problems even in small-sized
problems, such as DS1 and DS4. In a mathematical classiﬁer search, there are two hierarchical
aspects which must be learnt: (i) structure of the classiﬁer, and (ii) coeﬃcient of each term in the
structure. GP attempts to learn both aspects in a single optimization task. We argue that while a
“good" structure may have evolved at a generation, if its associated coeﬃcients are not proper, the
whole classiﬁer will be judged as “bad". We attempt to alleviate this aspect in the next procedure
by using a bilevel optimization framework.

61

3.8.4 Results

A comprehensive compilation of results obtained on all 19 datasets is shown in Table 3.13. The
results are segmented into four parts: Sr. 1-7 for synthetic DS datasets, Sr. 8-9 for traditional
cancer datasets, Sr. 10-11 for real-world multi-objective datasets, Sr. 12-19 for high-dimensional
modiﬁed multi-objective ZDT and DTLZ benchmarks. For each method, a parametric study is
performed on each problem and the setting which obtained the best testing accuracy is used to
generate the ﬁnal results. Statistics of 50 runs (with random data-split of 70% training and 30%
testing in each) on each dataset for two performance metrics is presented in Table 3.13.

For CART, the complexity metric is deﬁned as total number of nodes; for SVM, it is deﬁned
as total number of support vectors; for GAM, it is deﬁned as the eﬀective degrees of freedom
(EoDF); for GP, it is deﬁned as the total number of internal nodes; and for NLDT, it is deﬁned as
total number of variable occurrences in the entire tree. It is clear that a method with high testing
accuracy and low complexity is better.

The table clearly indicates that NLDT performs well in terms of both metrics. Also, the
performance of NLDT scales well with an increase in feature size. CART produces a good
compromise on accuracy and complexity, but performs worse than NLDT on both metrics. While
SVM achieves a high accuracy, in general, the complexity of its classiﬁers is large, thereby making
them not easy to interpret for any explainability purposes. The performance of GP is poor for
achieving a high accuracy. GAM is clearly not suitable for problems with a large number of
features and cannot be run due to impractical computational time requirement for some problems
(marked with a dash). GP cannot match both accuracy and complexity obtained by NLDT. GP’s
performance is also depended on the way the data is scattered which is evident from results on
DS1 and m-DS1 problem. This is mostly because of the fact that in GP, the rule-structure and
coeﬃcients are evolved simultaneously, thereby making it diﬃcult to eﬃciently navigate through
the search-space of equations. A future study could be launched to incorporate bilevel search
strategy with customized initialization to evolve rules within GP framework. In most problems,
NLDT classiﬁers require fewer conditional rules (albeit with restricted nonlinearities) and still

62

achieve near 100% correct testing accuracy.

Table 3.13: Summary of results obtained using various methods. For each dataset, the ﬁrst row
indicates testing accuracy and the second row indicates complexity. Italicized entries are statistically
insigniﬁcant (according to 95% conﬁdence in Wilcoxon rank-sum test) compared to the best entry
in the same row.

Sr.
1

2

3

4

5

6

7

8

9

Problem

DS1

DS2

DS3

DS4

m-DS1

m-DS2

m-DS3

Cancer-10

Cancer-30

10 Welded Beam

11

12

13

14

15

16

Truss

m-ZDT1-30

m-ZDT1-500

m-ZDT2-30

m-ZDT2-500

m-DTLZ1-30

17 m-DTLZ1-500

18

m-DTLZ2-30

19 m-DTLZ2-500

NLDT
99.55 ± 1.08
2.3 ± 0.6
99.44 ± 0.87
2.3 ± 0.7
99.77 ± 0.67
2.2 ± 0.5
98.88 ± 1.65
3.1 ± 1.4
99.10 ± 1.54
2.00 ± 0.00
99.46 ± 1.08
2.10 ± 0.30
99.20 ± 1.30
2.02 ± 0.14
96.50 ± 1.16
6.4 ± 1.7
96.20 ± 1.49
9.2 ± 4.1
98.58 ± 1.13
3.9 ± 1.0
99.54 ± 0.75
3.30 ± 0.90
98.97 ± 0.57
7.60 ± 3.50
98.93 ± 0.60
9.34 ± 4.15
98.96 ± 0.57
8.10 ± 3.35
98.87 ± 0.72
8.84 ± 3.95
98.77 ± 0.87
11.98 ± 5.85
93.76 ± 4.24
22.72 ± 7.50
97.22 ± 2.25
17.48 ± 6.75
95.32 ± 4.45
22.02 ± 10.62

CART
90.32 ± 4.06
14.5 ± 1.7
95.43 ± 1.50
11.0 ± 1.4
95.00 ± 2.35
11.5 ± 1.3
88.68 ± 3.60
31.3 ± 4.2
89.73 ± 4.53
7.90 ± 1.22
96.25 ± 1.92
5.96 ± 0.81
92.87 ± 4.35
5.78 ± 1.11
94.34 ± 1.92
11.6 ± 2.4
92.11 ± 2.07
10.8 ± 2.1
97.72 ± 1.04
8.42 ± 1.42
98.33 ± 1.10
11.06 ± 3.15
97.77 ± 0.58
30.26 ± 4.65
95.96 ± 0.80
21.02 ± 1.55
97.88 ± 0.70
28.22 ± 2.35
95.96 ± 0.80
21.02 ± 1.55
78.52 ± 7.94
128.40 ± 22.39
78.31 ± 7.21
126.94 ± 20.07
69.83 ± 6.16
156.00 ± 16.09
76.68 ± 5.44
133.22 ± 15.95

SVM

99.87 ± 0.45
8.16 ± 0.88
99.33 ± 1.10
7.64 ± 0.87
99.63 ± 0.69
10.22 ± 1.42
93.97 ± 2.35
43.70 ± 2.69
99.90 ± 0.40
7.50 ± 0.75
99.94 ± 0.44
5.44 ± 0.67
100.00 ± 0.00
8.82 ± 0.89
95.07 ± 1.23
51.26 ± 5.02
95.24 ± 1.29
58.88 ± 4.46
99.58 ± 0.45
7.86 ± 1.27
88.21 ± 1.62
174.28 ± 8.49
99.39 ± 0.35
82.08 ± 4.19
100.00 ± 0.00
140.58 ± 4.25
99.51 ± 0.33
80.98 ± 3.50
100.00 ± 0.00
140.56 ± 4.00
94.22 ± 0.95
615.54 ± 9.24
64.32 ± 1.76
1236.82 ± 13.26
94.25 ± 1.04
615.44 ± 10.67
64.22 ± 1.48
1245.92 ± 11.45

GAM

100.0 ± 0.00
2.89 ± 0.00
100.0 ± 0.00
2.89 ± 0.00
99.47 ± 1.03
4.98 ± 0.14
48.63 ± 6.50
3.80 ± 0.99
100.0 ± 0.00
2.90 ± 0.00
99.94 ± 0.31
2.90 ± 0.00
99.17 ± 1.48
3.24 ± 0.22
95.32 ± 1.49
22.14 ± 10.36
93.74 ± 5.83
32.47 ± 12.41
99.53 ± 0.48
11.06 ± 0.81
96.18 ± 1.20
19.19 ± 1.06
85.31 ± 1.35
220.20 ± 11.73

84.97 ± 1.18
233.69 ± 6.56

—
—

—
—

—
—

—
—

55.96 ± 3.19
33.89 ± 0.01

54.52 ± 2.96
35.84 ± 0.02

GP

91.70 ± 6.91
67.72 ± 26.72
95.37 ± 0.63
15.14 ± 13.55
96.50 ± 3.30
76.74 ± 33.36
59.63 ± 10.81
24.70 ± 26.40
98.83 ± 1.88
55.38 ± 22.39
96.67 ± 1.93
14.08 ± 9.11
99.27 ± 1.22
49.8 ± 21.69
96.13 ± 1.29
15.80 ± 5.66
92.40 ± 4.98
14.58 ± 7.14
96.46 ± 4.14
35.90 ± 18.28
97.36 ± 3.81
36.02 ± 16.99
93.58 ± 10.21
45.34 ± 26.09
83.21 ± 18.42
52.14 ± 24.47
91.57 ± 11.92
48.44 ± 23.69
85.06 ± 16.82
50.64 ± 21.24
81.59 ± 17.32
16.08 ± 21.89
80.49 ± 11.54
8.66 ± 16.19
79.81 ± 19.46
12.32 ± 14.79
78.46 ± 14.08
8.08 ± 15.21

3.9 Conclusions and Future work

In this chapter, we have addressed the problem of generating interpretable and accurate non-
linear decision trees for binary classiﬁcation problems. Split-rules at the conditional nodes of
a decision tree have been represented as a weighted sum of power-laws of feature variables. A
provision of integrating modulus operation in the split-rule has also been formulated, particularly

63

to handle sandwiched classes of datapoints. The proposed algorithm of deriving split-rule at a
given conditional-node in a decision tree has two levels of hierarchy, wherein the upper-level is
focused at evolving the power-law structures and operates on a discrete variable space, while the
lower-level is focused at deriving values of optimal weights and biases by searching a continuous
variable space to minimize the net impurity of child nodes after the split. Eﬃcacy of upper-level
and lower-level evolutionary algorithm have been tested on customized datasets. The lower-level
GA is able to robustly and eﬃciently determine weights and biases to minimize the net-impurity. A
test conducted on imbalanced dataset has also revealed the superiority of the proposed lower-level
GA to achieve a high classiﬁcation accuracy without relying on any synthetic data. Ablation studies
conducted on the upper-level GA also demonstrate the eﬃcacy of the algorithm to obtain simpler
split-rules without using any prior knowledge. Results obtained on standard classiﬁcation bench-
mark problems and a real-world single-objective problem has amply demonstrated the eﬃcacy and
usability of the proposed algorithm.

The classiﬁcation problem has been extended to discriminate Pareto-data from non-Pareto data
in a multi-objective problem. Experiments conducted on multi-objective engineering problems
have resulted in determining simple polynomial rules which can assist in determining if the given
solution is Pareto-optimal or not. These rules can then be used as design-principles for generating
future optimal designs. As further future studies, we plan to integrate non-polynomial and generic
terms in the expression of the split-rules. The proposed bilevel framework can also be tested for
regression and reinforcement-learning related tasks which we shall discuss in upcoming chapters.
The upper-level problem can be converted to a bi-objective problem for optimizing both  and 
simultaneously, so that a set of trade-oﬀ solutions can be obtained in a single run. Nevertheless, this
proof-of-principle study on the use of a customized bilevel optimization method for classiﬁcation
tasks is encouraging for us to launch such future studies.

64

3.10 Parameter Settings

3.10.1 Termination Criteria and other Parameter Settings for Inducing a Non-linear Deci-

sion Tree (NLDT)

• Number of power-laws per split-rule:  = 3,

• Allowable set of exponents: E = {−3,−2, . . . , 3},

• Impurity Metric: Gini Score.

• Minimum Impurity to conduct a split:  = 0.05

• Minimum number of points required in a node conduct a split:  = 10

• Maximum allowable Depth = 5

• Pruning threshold:  = 3%

3.10.2 Parameter Setting for NSGA-II for multi-objective data creation

We used the implementation of NSGA-II Algorithm as provided in pymoo [59]. pymoo is an oﬃcial
python package for multiobjective optimization algorithms and is developed under supervision of
Prof. Deb who is the original developer of NSGA-II and NSGA-III algorithms. Following
parameter setting was adopted to create Pareto and non-Pareto datasets for two objective Truss and
Welded beam problems:

• Population Size = 500

• Maximum Generations = 1000

• Cross Over = SBX

• SBX  = 15

• SBX probabililty = 0.9

65

• Mutation type = Polynomial Mutation

• Mutation  = 20

• Mutation probability  = 1/ (where  is total number of variables)

3.10.3 Parameter Setting for Upper Level GA

This section provides the general parameter setting used for the upper level of our bilevel GA.

• Population size = 10 × , where  is the dimension of the feature space of the original
problem.

• Selection = Binary tournament selection.

• Crossover probability (

) = 0.9

• Mutation parameters:

– Mutation probability (

) = (0.33, 1/).

– Distribution parameter  = 3 (any value > 1 will be good).

•  = 0.75

• Maximum number of generations = 100

3.10.4 Parameter Setting for Lower Level GA

Following parameter setting is used for lower level GA:

• Population size = 50,

• Maximum generations = 50,

• Variable range = [−1, 1],

• Crossover type = Simulated Binary Crossover (SBX)

66

• Mutation type = Polynomial Mutation

• Selection type = Binary Tournament Selection

• Crossover probability = 0.9,

• Mutation probability = 1/,
• Number of variables  =  + 1, if  = 0 or  + 2 if  = 1, where  is the total number of
power-laws allowed and  is the modulus ﬂag,

• SBX and polynomial mutation parameters: (, ) = (2, 15).

3.10.5 Creation of Customized 2D Datasets: DS1- DS4

As mentioned before, the customized datasets DS1-DS4 (Figure 3.8) were synthetically generated.
This section explains the procedure which was adopted to generate these customized datasets.
Datapoints in 2 space were created using a reference curve (x) and were assigned to either
the Class-1 or Class-2. Two additional parameters  and  controlled the location of a datapoint
relative to the reference curve (x). Datapoints (x(i)) for a given dataset were generated using
following steps:

• First  points falling on curve (x) = 0 were initialized for reference. Lets denote these
reference points on the curve with x(i)
r

(where  = 1, 2, . . . , ). Thus,

(x(i)

r ) = 0,

 = 1, 2, . . . , .

•  points (x(i)) for a dataset were then generated using the following equation

x(i) = x(i)

r +  + ,

 = 1, 2, . . . , ,

(3.29)

where  is a random number between 0 and 1,  represents the oﬀset and  indicates the
amount of spread.

67

Four datasets generated for conducting ablation studies are graphically represented in Fig. 3.8.
Parameter setting which was used to generate these datasets is provided Table 3.14.  and

 indicate the number of datapoints belonging to Class-1 and Class-2 respectively.
Table 3.14: Parameter setting to create customized datasets D1-D4. For Class-1 data-points
(, ) = (0, 0.01).

Dataset
DS1
DS2
DS3
DS4

(x)

21 + 2 − 3 = 0
21 + 2 − 3 = 0
1 + 2 − 1 = 0
2
21 + 2 − 3 = 0

Class-1

NA
100
10
100
100

(, )

Class-2
(0.025, 0.2)
(0.025, 0.2)
(0.025, 0.2)
(0.025, 0.2)
(−0.015,−0.2)

NB
100
200
100
50
50

68

CHAPTER 4

CONTROL: INTERPRETABLE POLICY FOR DISCRETE ACTION SPACES

4.1 Introduction

Control system problems are increasingly being solved by using modern reinforcement learning
(RL) and other machine learning (ML) methods to ﬁnd an autonomous agent (or controller) to
provide an optimal action  for every state variable combination  in a given environment at
every time-step . Execution of the output action  takes the object to the next state +1 in the
environment and the process is repeated until a termination criteria is met. This is conceptually
demonstrated in Figure 4.1

Figure 4.1: Control Loop

The mapping between input state  and output action  is usually captured through an artiﬁcial
intelligence (AI) method. In the RL literature, this mapping is referred to as policy (() : S → A),
where S is the state space and A is the action space. Suﬃcient literature exists in eﬃcient training
of these RL policies [60, 61, 62, 63]. While these methods are eﬃcient at training the AI policies
for a given control system task, the developed AI policies, captured through complicated networks,
are complex and non-interpretable.

Interpretability of AI policies is important to a human mind due to several reasons: (i) they help

69

provide a better insight and knowledge to the working principles of the derived policies, (ii) they
can be easily deployed with a low ﬁdelity hardware, (iii) they may also allow an easier way to extend
the control policies for more complex versions of the problem. While deﬁning interpretability is
a subjective matter, a number of past eﬀorts have attempted to ﬁnd interpretable AI policies with
limited success.

In the remainder of this chapter, we ﬁrst present the main motivation behind ﬁnding interpretable
policies in Section 4.2. A few past studies in arriving at interpretable AI policies is presented in
Section 4.3. In Section 4.5, we brieﬂy discuss an extension to our nonlinear decision tree (NLDT)
approach in the context of arriving at interpretable AI policies. The overall open-loop and closed-
loop NLDT policy generation methods are described in Section 4.6. Results on some control
system problems are presented in Section 4.7. Finally, conclusions and future studies are presented
in Section 4.10.

4.2 Motivation for the Study

As mentioned in the earlier chapters, various data analysis tasks, such as classiﬁcation, controller
design, regression, image processing, etc., are increasingly being solved using artiﬁcial intelligence
(AI) methods.These are done, not because they are new and interesting, but because they have been
demonstrated to solve complex data analysis tasks without much change in their usual frameworks.
With more such studies over the past few decades, they are faced with a huge challenge. Achieving
a high-accuracy solution does not necessarily satisfy a curious domain expert, particularly if the
solution is not interpretable or explainable. A technique (whether AI-based or otherwise) to handle
data well is no more enough, researchers now demand an explanation of why and how they work.
Consider the MountainCar control system problem (see Figure 4.2, which has been extensively
studied using various AI methods [64, 65, 66]. The problem has two state variables (position 
along -axis and velocity  along positive -axis) at every time instant  which would describe the
state of the car at . Based on the state vector  = (, ), a policy () must decide on one of the
three actions : decelerate ( = 0) along positive -axis with a pre-deﬁned value −, do nothing

70

Figure 4.2: Mountain Car problem. It comprises of two state variables ,  and is controlled using
three actions: -1 for deceleration, 0 for nothing and +1 for acceleration.

( = 1), or accelerate ( = 2) with  in positive -axis direction. The goal of the control policy
() is to take the under-powered car (it does not have enough fuel to directly climb the mountain
and reach the destination) over the right hump in a maximum of 200 time-steps starting anywhere
at the trough of the landscape. Physical laws of motion are applied and a policy () has been
trained to solve the problem. The RL produces a black-box policy () for which an action
 ∈ [0, 1, 2] will be produced for a given input  = (, ) ∈ R2. Figure 4.3a shows the state-

(a) Using .

(b) Using NLDT.

Figure 4.3: State-action combinations for MountainCar prob.

action combinations obtained from 92 independent successful trajectories (amounting to total of

71

10,000 time-steps) leading to achieving the goal using a pre-trained deterministic black-box policy
. The -location of the car and its velocity can be obtained from a point on the 2D plot.
The color of the point  = (, ) indicates the action  suggested by the oracle policy 
( = 0: blue,  = 1: orange, and  = 2: green). If a user is now interested in understanding
how the policy  chooses a correct  for a given , one way to achieve this would be to
represent and analyze the policy function () as shown below:



() =

 0,
 1,
 2,

if 0() is true,
if 1() is true,
if 2() is true,

(4.1)

where () : 2 → {0, 1} is a Boolean function which partitions the state space S into two
sub-domains based on its output value and for a given state , exactly one of () is true, thereby
making the policy  deterministic. If we re-look at Figure 4.3a we notice that the three actions
are quite mixed at the bottom part of the - plot (state space). Thus, the partitioning Boolean
functions  of Eq. 4.1 need to be quite complex in order to have 0() = true for all blue points,
1() = true for all orange points and 2() = true for all green points in a mutually exclusive
manner.

What we address in this study is an attempt to ﬁnd an approximated policy function ()
which may not explain all 100% time instance data corresponding to the oracle black-box policy
() (Figure 4.3a), but it is fairly interpretable to capture and explain most of the behavior
of . Consider the state-action plot in Figure 4.3b, which is generated with a simpler and
a relatively more interpretable policy () = {|() is true,  = 0, 1, 2} obtained by our
proposed procedure as shown below

0() = ¬1(),
1() = (1() ∧ ¬2()) ,
2() = (1() ∧ 2()) ,

72

(4.2)

where 1() = |0.96 − 0.63/(cid:98)
0.30(cid:98)

2 + 0.28/(cid:98) − 0.22(cid:98)(cid:98)| ≤ 0.36, and 2() = |1.39 − 0.28(cid:98)
2| ≤ 0.53. Here,(cid:98) and(cid:98) are normalized state variables (see Section 4.6.1). The action 

2 −

predicted using the above policy does not match the output of  at some states (about 8.1%),
but from our experiments we observe that it is still able to drive the mountain-car to the destination
goal located on the right hill in 99.8% episodes.

Importantly, the policies are simplistic and amenable to an easier understanding of the relation-
ships between  and  to make a near perfect control. Since the explanation process used the
data from  as the universal truth, the derived relationships can also provide an explanation
of the working of the black-box policy . A more gross approximation to Figure 4.3a by
more simpliﬁed relationships () may reduce the overall open-loop accuracy (see Section 4.4)
of matching the output of . Hence, a balance between a good interpretability and a high
open-loop accuracy in searching for Boolean functions () becomes an important matter for
such an interpretable AI-policy development study.

In this work, we focus on developing a search procedure for arriving at the -functions (see
Eq. 4.2) for discrete action systems. The structure of the policy () shown in Eq. 4.1 resembles
a decision tree (DT), but unlike a standard DT, it involves a nonlinear function at every non-leaf
node, requiring an eﬃcient nonlinear optimization method to arrive at reasonably succinct and
accurate functionals. The procedure we propose here is generic and is independent of the AI
method used to develop the black-box policy .

4.3 Related Past Studies

A few studies exist which are focused at generating interpretable policies.

In [67], an in-
terpretable orchestrator is developed to choose from two RL-policies:  (modelled for reward
maximization) and  for maximizing an ethical consideration. The orchestrator is dependent
on only one of the state-variables and despite it being interpretable, the policies:  and  are
still black-box and convoluted. [68] constructs a set of interpretable index based policies and uses
multi-arm bandit procedure to select a high performing index based policy. The search space of

73

interepretable policies is much smaller and the procedure suggested for ﬁnding an interpretable
policy is computationally heavy, taking about hours to several days of computational time on simple
control problems. In [69], genetic programming (GP) is used to obtain interpretable policies on
control tasks involving continuous actions space through model-based policy learning. However
the interpretability was not captured in the design of the ﬁtness function and a large archive was
created passively to store every policy for each complexity encountered during the evolutionary
search. A linear decision tree (DT) based model is used in [70] to approximate the Q-values of
trained neural network. In that work, the split in DT occurs based on only one feature, and at each
terminal node the Q-function is ﬁtted using a linear model on all features. [71] uses a program
sketch  to deﬁne the domain of interpretable policies . Interpretable policies are found using a
trained black-box oracle  as a reference by ﬁrst conducting a local search in the sketch space
 to mimic the behaviour of the oracle  and then ﬁne-tuning the policy parameters through
online Bayesian optimization. The bias towards generating interpretable programs is done through
controlled initialization and local search rather than explicitly capturing interpretability as one of
the ﬁtness measure. Particle swarm optimization [72] is used to generate interpretable fuzzy rule
set in [73] and is demonstrated on classic control problems involving continuous actions. Works
on DT [12] based policies through imitation learning has been carried out in [74]. [75] extends this
to utilize Q-values and eventually render DT policies involving < 1, 000 nodes on some toy games
and CartPole environment with an ultimate aim to have the induced policies veriﬁable. [76] used
axis-aligned DTs to develop interpretable models for black-box classiﬁers and RL-policies. They
ﬁrst derive a distribution function P by ﬁtting the training data through axis-aligned Gaussian dis-
tributions. P is then used to compute the loss function for splitting the data in the DT. [77] attempts
to generate interpretable DTs from an ensemble using a genetic algorithm.
In [78], regression
trees are derived using classical methods such as CART [12] and Kd-tree [79] to model Q-function
through supervised training on batch of experiences and comparative study is made with ensemble
techniques. In [80], a gradient based approach is developed to train the DT of pre-ﬁxed topology
involving linear split-rules. These rules are later simpliﬁed to allow only one feature per split node

74

(a) Open-loop

(b) Closed-loop

Figure 4.4: Performance measures.

and resulting DTs are pruned to generate simpliﬁed rule-set.

While the above methods attempt to generate an interpretable policy, the search process does
not use complexity of policy in the objective function, instead, they rely on initializing the search
with certain interpretable policies. In our approach described below, we use to concept of NLDT
induction discussed in Chapter 3 to build an eﬃcient search algorithm to directly ﬁnd relatively
simple and interpretable policies than black-box DNN or tabular-tile based policies using recent
advances in nonlinear optimization.

4.4 Performance Measures

Before we proceed into formally discussing the NLDT based policies, we will quickly discuss

two performance metrics we used to measure and compare performance of policies.

4.4.1 Open-Loop Accuracy

Open-loop accuracy quantiﬁes how accurately does a given policy  mimics the reference black-
box oracle policy . This is conceptually shown in Figure 4.4a. Here, under ideal scenario
(100% open-loop accuracy), for every state (), the output (cid:48)() of policy  will match the
output () of the black-box policy . In our case, since the action-space is discrete, the
open-loop accuracy is identical to the classiﬁcation accuracy on the labelled state-action dataset

75

generated using the black-box policy .

4.4.2 Closed-loop Performance

Under the close-loop setting, the AI is directly used as a controller to control the object by interacting
with the environment or dynamics of the control system as shown in Figure 4.4b. For each transition
from state () to state ( + 1) using action (), a reward () is collected. The loop is repeated
until a termination criteria is reached. The entire sequence of state-action-state transitions from
start to termination is referred to as episode. The cumulative reward value collected during entire
episode is given by

 
=0

 =

(),

(4.3)

where   is the time-step at which an episode terminated. The episode is called a success if the
desired goal is reached or a target is achieved upon the termination. Else, the episode is declared
as a failure. In the mountain-car example (Figure 4.2), an episode is considered as a success if by
200 time-steps, the AI is able to drive the car to the ﬂag-post located on the right up-hill, else it’s a
failure.

We measure closed-loop performance of an AI using two metrics

• Cumulative Reward which is measured using Eq. 4.3, and

• Task Completion Rate, which quantiﬁes number of successful episodes across 100 closed-

loop simulations.

4.5 Nonlinear Decision Trees (NLDTs) as Policies

Similar to classiﬁcation problems where NLDT is used as a classiﬁer to predict the class of a
given datapoint, in control systems involving discrete actions, NLDT can represent a policy wherein
for a given input state x, the corresponding action  can be determined by traversing the tree and
reaching the leaf node.

76

We implement two frameworks to represent non-linear decision trees: binary-split and multi-
split. The binary-split NLDT is identical to the one discussed in Chapter 3. The multi-split NLDT
can be obtained by allowing more than two splits for a given conditional node. We will brieﬂy
discuss binary-split and multi-split NLDT frameworks.

4.5.1 Binary-split NLDT

In binary-split NLDT, a conditional node undergoes exactly two splits. Unlike in Chapter 3 where
we mainly discussed problems involving two classes, in case of discrete-action control problems,
more than two actions can appear. This is conceptually illustrated in Figure 4.5 for control problem
involving three actions.

Figure 4.5: Binary-split NLDT for Discrete action control systems.

Each conditional node represents a non-linear control logic  (x) ≤ 0, where the non-linear
function  (x) assumes the similar form as the one discussed in Chapter 3 and is shown below for
quick reference



(cid:12)(cid:12)(cid:12)
=1   + 1,
=1   + 1
The  are power-laws of type  =

 (x) =

 
=1 

operator.

77

if  = 0,

(cid:12)(cid:12)(cid:12) − |2|,

if  = 1,

(4.4)

and  indicates presence or absence of absolute

4.5.2 Multi-split NLDT

Unlike the binary-split NLDT, in multi-split NLDT, a node is allowed to have more than two
partitions or splits. If the number of classes (or discrete action values in our case) in the dataset is
, then a node in the multi-split NLDT can have upto -splits as shown in Figure 4.6a.

(a) Schematic of Multi-split NLDT.

(b) Biases in (x) for Multi-split NLDT.

Figure 4.6: Multi-split NLDT conﬁguration. Numbers in square brackets indicate class distribution
of datapoints.

The number within square brackets in a node in multi-split NLDT of Figure 4.6a indicates class
distribution. The split function  (x) for a node in multi-split NLDT is given by the following
equation


=1

 (x) =

 .

(4.5)

Notice the diﬀerence between split-rule for binary-split NLDT (Eq. 4.4) and split-rule for multi-split
NLDT (Eq. 4.5). The former involves bias terms  and a modulus parameter  while the latter is
expressed as the weighted sum of power-laws  (Eq. 3.3) only. In multi-split NLDT, for a given
feature vector x, the value of its split function  (x) is compared against biases  to determine
child node where the datapoint will belong. This is schematically shown in Figure 4.6b for a node
undergoing 4 splits. Thus, for a given feature vector x,

If − ∞ <  (x) ≤ 1, Move to Split 1 Node,
If 1 <  (x) ≤ 2,
Move to Split 2 Node,
· · ·
· · ·

78

4.6 Overall Approach

The overall approach is illustrated in Figure 4.7. First, a dedicated black-box policy  is

Figure 4.7: A schematic of the proposed overall approach.

trained from the actual environment/physics of the problem. Training of black-box policy  is
not the focus of our current work and we use standard approaches from literature to obtain .
Next, the trained policy  (Block 1 in the ﬁgure) is used to generate labelled training and
testing datasets of state-action pairs from diﬀerent time-steps. We generate two types of training
datasets: Regular – as they are recorded from multiple episodes, and Balanced – selected from
multiple episodes to have almost equal number of states for each action, where an episode is a
complete simulation of controlling an object with a policy over multiple time-steps. Third, the
labelled training dataset (Block 2) is used to ﬁnd the NLDT (Block 3) using the recursive bilevel
evolutionary algorithm described in Section 4.6.2. We call this an open-loop NLDT (or, NLDT ),
since it is derived from a labelled state-action dataset generated from , without using any
overall reward or any ﬁnal goal objective in its search process, which is typically a case while doing
reinforcement learning. Use of labelled state-action data in supervised manner allows a faster
search of NLDT even with a large dataset as compared to constructing the NLDT from scratch
through reinforcement learning by interacting with the environment to maximize the cumulative
rewards [71]. Next, in an eﬀort to make the overall NLDT relatively more interpretable while
simultaneously ensuring better closed-loop performance, we prune the NLDT by taking only the

79

 
top part of NLDT  (we call NLDT()
in Block 4) and re-optimize all non-linear rules within it for
the weights and biases using an eﬃcient evolutionary optimization procedure to obtain ﬁnal NLDT*
(Block 5). The re-optimization is done here with closed-loop objectives, such as the cumulative
reward function or closed-loop completion rate. We brieﬂy discuss the open-loop training procedure
of inducing NLDT  and the closed-loop training procedure to generate NLDT* in next sections.

4.6.1 Data Normalization

First, we provide the exact normalization of state variables performed before the open-loop learning
task is executed. Before training and inducing the non-linear decision tree (NLDT), features in the
dataset are normalized using the following equation:

(cid:98) = 1 + ( − min


)/(max


− min

),


(4.6)

where  is the original value of the -th feature,(cid:98) is the normalized value of the -th feature, min

and max
are minimum and maximum value of -th feature as observed in the training dataset. This
normalization will make every feature  to lie within [1, 2]. This is done to ensure that  = 0 is
avoided to not cause a division by zero.


4.6.2 Open-loop Training

As a ﬁrst step for open-loop training, the pre-trained black-box oracle  is used to generate
labelled state-action dataset (step A in Figure 4.7). Once the labeled dataset is generated, the
problem of inducing NLDT to ﬁt the data translates to the supervised learning problem of developing
a classiﬁer.

4.6.2.1 Open-loop training for Binary-split NLDT

The open-loop training for binary-split is identical to the procedure discussed in Chapter 3 wherein
an evolutionary bilevel algorithm is employed to derive non-linear split-rule at each conditional

80

node. The NLDT is grown using recursive splitting of training data.

4.6.2.2 Open-loop training for Multi-split NLDT

Similar to binary-split NLDT, a dedicated bilevel optimization algorithm is applied to derive the
split-rule  (x) of Eq. 4.5 and its associated biases  at a given conditional node. The optimization
formulation to obtain the split-rule for a conditional node in multi-split NLDT and values of
corresponding  is as given below

Min. (B, w∗, ∗),
s.t.

(w∗, ∗) ∈ argmin(cid:8)(w, )|(B,)(cid:12)(cid:12)
 ∈ [−1, 1]+1(cid:9) ,

(w, −1 ≤  ≤ 1, ∀, )|(B,) ≤  ,

 <   ,  < , ∀, ,   ∈ .

(4.7)

Here, the upper level objective function  is identical to the one given in Eq. 3.7. It quantiﬁes
the complexity of the rule by counting the number of non-zero exponents in power-laws . The
lower level objective function  quantiﬁes the quality of split by computing weighted sum of
impurity scores of resulting child nodes as shown in the equation below

Ò

=1

 =

Gini(),


(4.8)

where Ò indicates total number of child nodes created from the split. The number of splits (and
hence the number of child nodes) depends on the distribution of datapoints in the given conditional
node. If the number of datapoints belonging to -th class in the given conditional node is  and
there are total  classes in the original dataset, then the number of splits  (or equivalently
Ò) a given conditional node undergoes is


=1

 =

max(0,  − )

 − 

,

81

(4.9)

where  is a user speciﬁed parameter. The optimization formulation for lower-level optimization
in multi-split NLDT is given as under

Min. (w, )(cid:12)(cid:12)(B,),

s.t.  <   ,  ∈ {1, . . . ,  − 2} and  =  + 1,

(4.10)

 ∈ [−1, 1] ,  ∈ [−1, 1]−1,

where  is the total number of power-laws  (Eq. 4.5).

4.6.3 Closed-loop Training

The intention behind the closed-loop training is to enhance the closed-loop performance of NLDT.
It will be discussed in Section 4.7 that while closed-loop performance of NLDT  is at par with
 on control tasks involving two to three discrete actions, like CartPole and MountainCar, the
NLDT  struggles to autonomously control the agent for control problems such as LunarLander
having more states and actions. In closed-loop training, we ﬁne-tune and re-optimize the weights
W and biases  of an entire NLDT  (or pruned NLDT , i.e. NLDT()
– block 4 in Figure 4.7)
to maximize its closed-loop ﬁtness ( ), which is expressed as the average of the cumulative
reward collected on  episodes:

 
(W, ),
Maximize  (W, ) =
Subject to W ∈ [−1, 1] ,  ∈ [−1, 1] ,

=1

1


(4.11)

where  and  are total number of weights and biases appearing in entire NLDT and  = 20 in
our case.

4.7 Experiments: AIM and Procedure

In this work, we conduct experiments to demonstrate the performance of NLDT on control
problems involving discrete actions. Eﬃcacy of the proposed approach of inducing NLDT is

82

shown on two types of control tasks as listed below:

• Binary Discrete Action Space (B-DAS): The agent in these control tasks is allowed to
have exactly two discrete actions. Environments considered for this task are CartPole and
CarFollowing as shown in Figures 4.8a and 4.8b, respectively.

• Multiple Discrete Action Space (M-DAS): Action space of agent in these control tasks in-
volves three or more discrete actions. Environments considered for this task are MountainCar
and LunarLander as shown in Figures 4.2 and 4.8c, respectively.

(b) CarFollowing environment.

(a) CartPole environment.

(c) LunarLander environment.

Figure 4.8: Three control problems.

Through our experiments, we try to empirically demonstrate eﬀects of training data size,
data distribution (regular or balanced) and NLDT framework (binary-split or multi-split) on the
performance of NLDT.

4.7.1 Experimental Setup

At ﬁrst, a dedicated black-box AI is trained for each control tasks listed above. The method used
to derive the black-box AI for each control task is provided in the corresponding sections. It is

83

to note here that training of the black-box AI is not the focus of our research. Our aim here is to
decipher complicated and incomprehensible (but eﬃciently functioning) black-box AI controller by
representing it in a relatively more interpretable format through Non-linear Decision Tree (NLDT).
Once the black-box AI is trained, it is used to generate labelled training and testing datasets of
state-action pairs (block 2 in Figure 4.7). The labelled training dataset is used to induce the NLDT
using the recursive bilevel evolutionary algorithm [81] (Section 4.6.2).

The training dataset is generated by storing state-action values from the sequential interaction
of black-box AI (such as DNN) with the corresponding environment. We generate two types of
training datasets: Regular and Balanced. Discussion regarding the procedure used to generate
regular and balanced dataset is explained next.

4.7.1.1 Creation of Regular Dataset

For creating a regular dataset of  datapoints, the sequential state-action data of diﬀerent
episodes is stored from the interaction of trained black-box AI with the corresponding environment.
Since the datapoint (state-action pair) is stored for each time-step in sequential manner (if the
episode terminates and the collected number of datapoints is less than , then a new episode
is invoked and the process of data collection repeats), there is no explicit control on the number of
datapoints which will fall into a particular action category. Hence, the resulting data distribution
might have an inherent bias. This bias in data distribution is more evident for environments
involving more than two actions as shown in Table 4.1.
Table 4.1: Class Distribution of Regular Training Datasets for diﬀerent problems. For each row,
the -th number in second column represents the number of datapoints belonging to -th action.

Problem
CartPole

CarFollowing
MountainCar
LunarLander

#Classes

2
2
3
4

Class Distribution

4997; 5003
5237; 4763

3452; 495; 6053

3661; 1650; 3779; 910

Since some of the actions have lesser representation in the training dataset, it might negatively
aﬀect the closed-loop performance of the NLDT. To investigate this eﬀect of bias, we create

84

balanced datasets with near-uniform distribution across all actions/classes. The procedure to
create a balanced (or almost balanced) dataset is discussed next.

4.7.1.2 Creation of Balanced Dataset

In order to create a uniformly (or near-uniformly) distributed data across all actions in the dataset
of  datapoints, data creation procedure similar to the one discussed above is implemented,
i.e. data is stored from sequential time-steps. However, if the number of datapoints for a particular
(cid:101) (where  is the total number of discrete actions
action reaches a threshold limit of (cid:100) 

and (cid:100)(cid:101) indicates the round-up function), no extra datapoints are collected for that action. It is
to note that data for other actions (for which number of collected datapoints is still less than the
prescribed threshold value) is still collected from sequential time-steps until total  datapoints
are collected.

The regular and balanced data creation method is used to create training datasets only. Testing
datasets are created with the regular dataset creation approach. Also, since for binary action spaces,
the distribution is fairly uniform (Table 4.1), we don’t create balanced training datasets for these
problems (i.e. CartPole and CarFollowing).

4.8 Experiments and Analysis on Control Tasks with Binary Action Spaces

In this section, we conduct experiments and discuss results obtained by using our approach for

control tasks involving two discrete actions, namely: 1) CartPole, and 2) CarFollowing.

4.8.1 CartPole Problem

As shown in Figure 4.8a, the CartPole problem comprises of four state variables: 1) -position
( → 0), velocity in +ve  direction ( → 1), angular position from vertical ( → 2) and angular
velocity ( → 3) and is controlled by applying force towards left (action 0) or right (action 1) to
the cart. The objective is to balance the inverted pendulum (i.e. −24 deg ≤  ≤ 24 deg) while also
ensuring that the cart doesn’t fall oﬀ from the platform (i.e. −4.8 ≤  ≤ 4.8). For every time-step,

85

a reward value of 1 is received while  is within ±24 deg. The maximum episode length is set to
200 time-steps.

A deep neural network (DNN) controller is trained on the CartPole environment using the PPO
algorithm [61]. Table 4.2 shows the performance of NLDT on the training data sets of diﬀerent
sizes. It is observed that NLDT trained with 5,000 and 10,000 data points shows a robust open-loop
performance and also produces 100% closed-loop performance. Keeping this in mind, we keep
the training data size of 10,000 ﬁxed across all control problems discussed in this dissertation.
It is observed that NLDT  trained with at least 5,000 data points shows a robust open-loop
performance. The obtained NLDT  has a about two rules with on an average three terms in the
derived policy function.

Table 4.2: Eﬀect of training data size on performance of NLDT  on CartPole problem.
Compl. Rate
Training
Data Size
(Closed-loop)

Training
Accuracy

Testing
Accuracy

Cumulative

100
500
1,000
5,000
10,000

97.00
95.5
91.90
92.07
91.86

82.79
79.66
90.59
92.02
92.05

#

Rules
1.50
1.90
1.80
1.70
1.30

Rule
Length
3.30
3.88
4.05
4.25
4.45

Reward
199.73
175.38
200.00
200.00
200.00

95.0
51.00
100
100
100

Interestingly, the same NLDT (without closed-loop training) also produces 100% closed-loop

performance by achieving the maximum cumulative reward value of 200.

4.8.1.1 NLDT for CartPole Problem

One of the NLDT  obtained for the CartPole environment is shown in Figure 4.9 in terms of

normalized state variable vector(cid:98)x.

The respective policy can be alternatively stated using the programmable if-then-else rule-

structure as shown in Algorithm 5:

A little manipulation will reveal that for a correct control startegy, Action 0 must be invoked if

following condition is true:

2.39 ≤

(cid:33)

(cid:32) (cid:98)0(cid:98)2
2 + 3.50(cid:98)3

2

86

≤ 5.06,

Figure 4.9: CartPole NLDT  induced using 10,000 training samples. It is 91.45% accurate on
the testing dataset but has 100% closed loop performance. Normalization constants are: xmin =
[-0.91, -0.43, -0.05, -0.40], xmax = [1.37, 0.88, 0.10, 0.45].

Algorithm 5: CartPole Rules. Normalization constants are:  = [-0.91, -0.43, -0.05,
-0.40],  = [1.37, 0.88, 0.10, 0.45].

(cid:12)(cid:12)(cid:12)−0.18(cid:98)0(cid:98)2−2 − 0.63(cid:98)3−2 + 0.67(cid:12)(cid:12)(cid:12) − 0.24 ≤ 0 then

if

Action = 0

Action = 1

else

end

otherwise, Action 1 must be invoked. First, notice that the above policy does not require the current

velocity ((cid:98)1) to determine the left or right action movement. Second, for small values of angular
position ((cid:98)2 ≈ 1) and angular velocity ((cid:98)3 ≈ 1), i.e.
of (cid:98)2 ≈ 2 and (cid:98)3 ≈ 2), the term in bracket will be smaller than 2.39 for all (cid:98)0 ∈ [1, 2], and the

the pole is falling towards left, the above
condition is always true. That is, the cart should be pushed towards left, thereby trying to stabilize
the pole to vertical position. On the other hand, if the pole is falling towards right (large values

above policy suggests that Action 1 (push the cart towards right) must be invoked. When the pole is
falling right, a push of the cart towards right helps to stabilize the pole towards its vertical position.
These extreme case analyses are intuitive and our policy can be explained for its proper working,
but what our NLDT approach is able to ﬁnd is a precise rule for all situations of the state variables
to control the Cart-Pole to a stable conﬁguration, mainly using the blackbox-AI data.

4.8.2 CarFollowing Problem

This problem simulates a one dimensional car chasing scenario. We have developed a discretized
version of the car following problem discussed in [82] (illustrated in Figure 4.8b), wherein the task

87

is to follow the car in the front which moves with a random acceleration proﬁle (between −1/2
and +1/2) and maintain a safe distance of    = 30 from it. The rear car is controlled
using two discrete acceleration values of −1/2 (Action 0) and +1/2 (Action 1). The car-chase
episode terminates when the relative distance  =    −  is either zero (i.e. collision case)
or is greater than 150 m. At the start of the simulation, both cars start with the initial velocity of
zero. A DNN policy for CarFollowing problem was obtained using a double Q-learning algorithm
[83]. The reward function for the CarFollowing problem is shown in Figure 4.10, indicating that a
relative distance close to 30 m produces the highest reward.

Figure 4.10: Reward function for CarFollowing environment.

It is to note here that unlike the CartPole control problem, where the dynamics of the system was
deterministic, the dynamics of the CarFollowing problem is not deterministic due to the random
acceleration proﬁle with which the car in the front moves. This randomness introduced by the
unpredictable behaviour of the front car makes this problem more challenging.

Results for the CarFollowing problem are shown in Table 4.3. An average open-loop accuracy

of 96.53% is achieved with at most three rules, each having 3.28 terms on an average.

For this problem, we apply the closed-loop re-optimization (Blocks 4 and 5 to produce Block 6
in Figure 4.7) on the entire NLDT . As shown Table 4.4, NLDT* is able to achieve better closed-

88

Table 4.3: Results on CarFollowing problem correspoing to open-loop training (NLDT ).
Rule Length Compl. Rate
Train. Acc.
96.41 ± 1.97
3.28 ± 0.65
100 ± 0.00

Test. Acc.
96.53 ± 1.90

# Rules
2.40 ± 0.66

Depth

1.90 ± 0.30

loop performances (100% completion rate and better average cumulative reward). Figure 4.11
shows that NLDT* adheres the 30 gap between the cars more closely than original DNN or
NLDT .
Table 4.4: Closed-loop performance analysis after re-optimizing NLDT for CarFollowing problem
(k = 103).

Cumulative Reward
AI
Avg ± Std
Best
174.16k 173.75k ±20.95
DNN
NLDT  174.15k 173.87k ±16.48
179.76k 179.71k ±0.95
NLDT*

Compl. Rate
100 ± 0.00
100 ± 0.00
100 ± 0.00

Figure 4.11: Relative distance plot for CarFollowing problem.

89

0100200300400500600700800Time Steps26.527.027.528.028.529.029.530.030.531.0Relative Distancedsafe=30DNNNLDTNLDT*4.8.2.1 NLDT for CarFollowing Problem

The NLDT  obtained for the CarFollowing problem is shown in Figure 4.12. The rule-set is

Figure 4.12: NLDT  for the CarFollowing problem. Normalization constants are: min = [0.25,
-7.93, -1.00], max = [30.30, 0.70, 1.00].

provided in its natural if-then-else form in Algorithm 6.

Algorithm 6: Ruleset corresponding to NLDT  (Figure 4.12) of the CarFollowing
problem.

if 0.63(cid:98)0 − 0.87(cid:98)1−2(cid:98)2 − 1.00 ≤ 0 then
if 0.96(cid:98)1−3 − 0.58(cid:98)0 + 1.00 ≤ 0 then

Action = 1

else

end

Action = 0

else

end

Action = 1

Recall that the physical meaning of state variables is: 0 →  (relative distance between
front car and rear car), 1 →  (relative velocity between front car and rear car) and 2 → 
(acceleration value (−1 or +1 m/s2) at the previous time step). Action = 1 stands for acceleration
and Action = 0 denotes deceleration of the rear car in the next time step.

From the ﬁrst rule (Node 0), it is clear that if the rear car is close to the front car ((cid:98)0 ≈ 1),
acceleration of the rear car (both(cid:98)1 and(cid:98)2 lying in [1,2]). Thus, Node 4 (Action = 1, indicating

the root function 0(x) is never going to be positive for any value of relative velocity or previous

90

acceleration of the rear car in the next time step) will never be invoked when the rear car is too

close to the front car. Thus for(cid:98)0 ≈ 1, the control always passes to Node 1. A little analysis will
also reveal that for(cid:98)0 ≈ 1, the rule 1(x) > 0 for any relative velocity(cid:98)1 ∈ [1, 2]. This means that

when the two cars are relatively close, only Node 3 gets ﬁred to decelerate (Action = 0) the rear
car. This policy is intuitively correct, as the only way to increase the gap between the cars is for the
controlled rear car to be decelerating.

However, when the rear car is far away for which (cid:98)0 ≈ 2, Action 1 (Node 4) gets ﬁred if
(cid:98)1 > 1.829(cid:112)(cid:98)2.
If the rear car was decelerating in the previous time step (meaning (cid:98)2 = 1),
the obtained NLDT recommends that the rear car should accelerate if (cid:98)1 ∈ [1.829, 2], or when
already accelerating in the previous time step ((cid:98)2 = 2), Node 4 does not ﬁre, as (cid:98)1 can never be

the magnitude of the relative velocity is small, or when 1 ∈ [−0.776, 0.700]/. This will
help maintain the requisite distance between the cars. On the other hand, if the rear car was

√
2 and the control goes to Node 1 for another check. Thus, the rule in Node 0
more than 1.829
makes a ﬁne balance of the rear car’s movement to keep it a safe distance away from the front
car, based on the relative velocity, position, and previous acceleration status. When the control

comes to Node 1, Action 1 (acceleration) is invoked if(cid:98)1 ≥ 0.96/(0.58(cid:98)0 − 1). For(cid:98)0 ≈ 2, this
happens when (cid:98)1 > 1.817 (meaning that when the magnitude of the relative velocity is small,

or 1 ∈ [−0.879, 0.700]/), the rear car should accelerate in the next time step. For all other
negative but large relative velocities 1 ∈ [−7.930, 0.879]/), meaning the rear car is rushing to
catch up the front car, the rear car should decelerate in the next time step. From the black-box AI
data, our proposed methodology is able to obtain a simple decision tree with two nonlinear rules to
make a precise balance of movement of the rear car and also allowing us to understand the behavior
of a balanced control strategy.

Results of NLDT’s performance on problems with two discrete actions (Tables 4.2, 4.3 and
4.4) indicate that despite having a noticeable mismatch with the open-loop output of the oracle
black-box policy , the closed-loop performance of NLDT is at par or at times better than
. This observation suggests that certain state-action pairs are not of crucial importance

91

when it comes to executing the closed-loop control and, therefore, errors made in predicting these
state-action events do not aﬀect or deteriorate the closed-loop performance.

4.9 Experiments and Analysis on Multiple Discrete Action Space

In this section, we investigate the performance of proposed NLDT method to approximate
the behavior of the parent ANN/black-box AI for control tasks involving more than two discrete
actions. Here, as mentioned before, we compare the open-loop and closed-loop performances
across diﬀerent representations of NLDT, i.e. binary-split NLDT and multi-split NLDT, and two
diﬀerent training dataset distributions, namely, regular and balanced.

4.9.1 MountainCar Problem

A schematic of this environment is shown in Figure 4.2. The car starts somewhere near the bottom
of the valley and the goal of the task is to reach the ﬂag-post located on the right up-hill with
non-negative velocity. The fuel is not enough to directly climb the hill and hence a control strategy
needs to be devised to move car back (left up-hill), leverage the potential energy and then accelerate
it to eventually reach the ﬂag-post within 200 time-steps. The car receives the reward value of -1 for
each time-step, until it reaches the ﬂag-post where the reward value is zero. The car is controlled
using three actions: accelerate left (action 0), do nothing (action 1) and accelerate right (action 2)
by observing its state which is given by two state-variables:  position → 0 and velocity  → 1.
We use the SARSA algorithm [84] with tile encoding to derive the black-box AI controller, which
is represented in form of a tensor and has total 151, 941 elements.

Compilation of results of the NLDT controller induced using diﬀerent NLDT-representations
and training dataset distributions is presented in Table 4.5. A state-action plot of the black-box AI
and the NLDT controller corresponding to the ﬁrst row of Table 4.5 is provided in Figures 4.3a
and 4.3b, respectively. It can be seen from the plots that about 8% mismatch in the open-loop
performance (i.e. testing accuracy in Table 4.5) comes from the lower region of state-action plot
(Figures 4.3a and 4.3b) due to highly non-linear output of the black-box AI controller. Again,

92

despite having this mismatch, the NDLT controller is able to achieve close to 100% closed-loop
control performance. The regular dataset is able to produce a multi-split NLDT with three control

Table 4.5: Mountain car results. The numbers indicate average scores.

Testing
Accuracy

91.92
88.70

Balanced

Training
Accuracy

No
Yes

No
Yes

90.61
86.11

91.66
88.53

Depth # Rules

Binary-split NLDT
2
4

2
3

Multi-split NLDT

2
2

3
4

91.50
88.44

Avg Rule
Length

Completion

Rate

3.00
3.00

1.67
2.50

99.8
100.0

100.0
100.0

rules with an average 1.67 terms in each rule to achieve 100% closed-loop performance. The
multi-split NLDT has a comparable performance to that of binary-split NLDT.

4.9.2 NLDT for MountainCar Problem

The NLDT  obtained for the MountainCar problem is shown below in Figure 4.13. The resulting
rule-set is also shown in if-then-else statements in Algorithm 7.

Figure 4.13: NLDT  for MountainCar problem. Normalization constants: min = [-1.20, -0.06],
max = [0.50, 0.06].

This rule-set corresponds to the plot shown in Figure 4.3b. A detail analysis of the two rules

can be made to have a deeper understanding of the control policy.

93

Algorithm 7: MountainCar NLDT . Normalization constants are:  = [-1.20, -0.06],
 = [0.50, 0.06].

(cid:12)(cid:12)(cid:12)−0.22(cid:98)0(cid:98)1 + 0.28(cid:98)1−1 − 0.63(cid:98)0−2 + 0.96(cid:12)(cid:12)(cid:12) − 0.36 ≤ 0 then
(cid:12)(cid:12)(cid:12)−0.30(cid:98)1

2 + 1.39(cid:12)(cid:12)(cid:12) − 0.53 ≤ 0 then

2 − 0.28(cid:98)0

if

if

else

end

Action = 2

Action = 1

else

end

Action = 0

4.9.3 LunarLander Problem

This problem is motivated from a classic problem of design of a rocket-controller. Here, the state
of the lunar-lander is expressed with eight state variables, of which six can assume continuous real
values, while the rest two are categorical, and can assume a Boolean value. The ﬁrst six state
variables indicate the (, ) position, and velocity and angular orientation and angular velocity of
the lunar-lander. The two Boolean state variables provide the indication regarding the left-leg and
right-leg contact of lunar-lander with the ground terrain. The lunar-lander is controlled using four
actions: action 0 → do nothing, action 1 → ﬁre left engine, action 2 → ﬁre main engine and action
3 → ﬁre right engine as shown schematically in Figure 4.8c. The black-box DNN based controller
for this problem is trained using the PPO algorithm [61] and involves two hidden layers of 64 nodes.

Table 4.6 provides the compilation of results obtained using open-loop training.
It is evident from the table that the binary-split NLDT  representation is superior to the multi-
split NLDT  representation in terms of both open-loop and closed-loop performances. In this
problem, a better closed-loop performance of NLDT  is observed when the open-loop training
is done on the balanced dataset. This indicates that under-represented actions in regular training
dataset (see Table 4.1) limits the exposure to some crucial states during the training phase of NLDT.
The crucial state-action pairs (or experience) necessary for closed-loop control gets captured while
creating the balanced dataset, and so, despite having the lower open-loop performance on the testing

94

Table 4.6: LunarLander open-loop training (NLDT ) results. The numbers indicate average
scores.

Testing
Accuracy

Depth # Rules

Binary-split NLDT

Balanced

Training
Accuracy

No
Yes
No
Yes

No
Yes
No
Yes

79.17
69.83
87.43
81.74

75.84
68.31
86.95
82.17

76.36
66.58
81.74
71.52

69.68
70.79
79.40
71.08

Multi-split NLDT

3
3
6
6

2
2
6
6

5.60
4.40
34.70
25.70

4
5
37
41

Avg Rule
Length

Completion

Rate

5.59
5.79
4.94
5.17

3.50
5.80
3.84
3.95

14.00
42.00
48.00
93.00

0.11
21.60
17.70
89.50

dataset (which is regular), the overall closed-loop performance improves. The best performance is
observed with a binary-split NLDT  trained on the balanced dataset.

4.9.3.1 NLDT for LunarLander Problem

The topology of one of the binary-split NLDT  obtained after open-loop training using balanced
dataset is shown in Figure 4.14.

This NLDT  successfully lands the lander in 93.4% episodes and has in total 26 rules each

having about 4.15 terms involving state variables.

It is understandable that a complex control task involving many state variables cannot be
simpliﬁed or made interpretable with just one or two control rules. A separate study focusing at
identifying crucial state-action pairs would help us understand this phenomenon better and we plan
to do this in our future work. Next, we use a part of the NLDT  from the root node to obtain
the pruned NLDT()
(step ‘B’ in Figure 4.7) and re-optimize all weights (W) and biases () using
the procedure discussed in Section 4.6.3 (shown by orange box in Figure 4.7) to ﬁnd closed-loop
NLDT*.

 
Table 4.7 shows that for the pruned NLDT-3 which comprises of the top three layers and involves
only four rules of original 26-rule NLDT  (i.e. NLDT-6), the closed-loop performance increases

95

Figure 4.14: NLDT-6 (with 26 rules) and other lower depth NLDTs for the LunarLander problem.
Lower depth NLDTs are extracted from the depth-6 NLDT. Each node has an associated node-id
(on top) and a node-class (mentioned in bottom within parenthesis). Table 4.7 in provides results
on closed-loop performance obtained using these trees before and after applying re-optimization
on rule-sets using the closed-loop training procedure.

from 51% to 96% (NLDT*-3 results in Table 4.7) after re-optimizing its weights and biases with
closed-loop training.

The resulting NLDT with its associated four rules are shown in Figure 4.15.

Figure 4.15: Final NLDT*-3 for LunarLander prob. (cid:98) is a normalized state variable (see Sec-

tion 4.6.1).

As shown in Table 4.7, the NLDT* with just two rules (NLDT-2) is too simplistic and does

96

not recover well after re-optimization. However, the NLDT*s with four and seven rules achieve a
near 100% closed-loop performance. Clearly, an NLDT* with more rules (NLDT-5 and NLDT-6)
are not worth considering since both closed-loop performances and the size of rule-sets are worse
than NLDT*-4. Note that DNN produces a better reward, but not enough completion rate, and the
policy is more complex with 4,996 parameters.

Table 4.7: Closed-loop performance on LunarLander problem with and without re-optimization on
26-rule NLDT . Number of rules are speciﬁed in brackets for each NLDT and total parameters
for the DNN is marked.

Re-Opt. NLDT-2

(2)

Before −1675.77
± 164.29
−133.95
± 2.51

After

Before

After

0.00
± 0.00
48.00
± 7.38

NLDT-3
(4)

42.96
± 13.83
231.42
± 17.95

51.00
± 3.26
96.00
± 2.77

56.16
± 23.50
182.87
± 21.92

NLDT-5
NLDT-4
(7)
(13)
Cumulative Reward
54.24
± 27.44
234.98
±22.25
Completion Rate
82.00
± 9.80
99.00
±1.71

79.00
± 7.66
93.00
± 7.59

DNN
(4,996)

247.27
± 3.90

94.00
±1.96

NLDT-6
(26)

169.43
±23.96
214.94
± 17.31

93.00
±3.30
94.00
± 4.45

 
To demonstrate the eﬃcacy and repeatability of our proposed approach, we perform another
run of the open-loop and closed-loop training and obtain a slightly diﬀerent NLDT*-3, which is
shown in Figure 4.16. This NLDT also has four rules, which are shown in Table 4.8. Four rules
corresponding to the pruned NLDT()
(Depth 3) are also shown in the table for comparison. It can
be noticed that the re-optimization of NLDT through closed-loop training (Section 4.6.3) modiﬁes
the values of coeﬃcients and biases, however the basic structure of all four rules remains intact.
Figure 4.17 shows the closed-loop training curve for generating NLDT* from Depth-3 NLDT()
.
 
The objective is to maximize the closed-loop ﬁtness (reward)   (Eq. 4.11) which is expressed as
the average of the cumulative reward  collected over  episodes. It is evident that the cumulative
reward for the best-population member climbs to the target reward of 200 at around 25-th generation
and the average cumulative reward of the population also catches up the best cumulative reward
value with generations.

97

Figure 4.16: Topology of Depth-3 NLDT()
obtained from a diﬀerent run on the LunarLander
problem. The equations corresponding the conditional-nodes before and after re-optimization are
provided in Table 4.8.

 
A visualization of the real-time closed-loop performance obtained using this new NLDT (Fig-
ure 4.16) for two diﬀerent rule-sets (i.e. before applying re-optimization and after applying the
re-optimization through closed-loop training) is shown in https://youtu.be/DByYWTQ6X3E. It
can be observed in the video that the closed-loop control executed using the Depth-3 NLDT()
 
comprising of rules obtained directly from the open-loop training (i.e. without any re-optimization)
is able to bring the LunarLander close to the target. However the LunarLander hovers above the
landing pad and the Depth-3 NLDT()
is unable to land it in most occasions. Episodes in these
cases are terminated after the ﬂight-time runs out. On the other hand, the Depth-3 NLDT* compris-
ing of rule-sets obtained after re-optimization through closed-loop training is able to successfully
land the LunarLander.

 
4.10 Conclusions

In this work, we have proposed a two-step strategy to arrive at hierarchical and relatively
interpretable rulesets using a nonlinear decision tree (NLDT) concept to facilitate an explanation
of the working principles of AI-based policies. The NLDT training phases use recent advances in

98

Rules before Re-optimization (Depth-3 NLDT()
 )

2 + 0.83(cid:12)(cid:12)(cid:12) − 0.85
(cid:12)(cid:12)(cid:12)−0.23(cid:98)0(cid:98)2−1(cid:98)6−1(cid:98)7−1 − 1.00(cid:98)1−1(cid:98)6 − 0.79(cid:98)0−1(cid:98)1−1(cid:98)6
0.17(cid:98)2−1 − 0.64(cid:98)3(cid:98)7−1 + 0.90(cid:98)1−2(cid:98)6−2(cid:98)7−3 + 0.29
0.82(cid:98)7−1 + 0.52(cid:98)0−1(cid:98)4(cid:98)6−1 − 0.59(cid:98)4−1 − 0.95
(cid:12)(cid:12)(cid:12)−0.16(cid:98)4−3(cid:98)6−3(cid:98)7 − 0.86(cid:98)0(cid:98)5−1(cid:98)6−3 + 1.00(cid:98)4(cid:98)6−1 − 0.70(cid:12)(cid:12)(cid:12) − 0.26
(cid:12)(cid:12)(cid:12)−0.39(cid:98)0(cid:98)2−1(cid:98)6−1(cid:98)7−1 − 0.96(cid:98)1−1(cid:98)6 − 0.12(cid:98)0−1(cid:98)1−1(cid:98)6
2 + 0.89(cid:12)(cid:12)(cid:12) − 0.80
0.17(cid:98)2−1 − 0.78(cid:98)3(cid:98)7−1 + 0.90(cid:98)1−2(cid:98)6−2(cid:98)7−3 + 0.35
0.82(cid:98)7−1 + 0.52(cid:98)0−1(cid:98)4(cid:98)6−1 − 0.59(cid:98)4−1 − 0.96
(cid:12)(cid:12)(cid:12)−(cid:16)1.3 × 10−3(cid:17)(cid:98)4−3(cid:98)6−3(cid:98)7 − 0.86(cid:98)0(cid:98)5−1(cid:98)6−3 + 0.65(cid:98)4(cid:98)6−1 − 0.42(cid:12)(cid:12)(cid:12) − 0.26

Node

0
1
2
6

Node

0
1
2
6

Table 4.8: NLDT rules before and after the closed-loop training for LunarLander problem, for
which NLDT* is shown in Figure 4.16. Video showing the simulation output of the performance of
NLDTs with rule-sets mentioned in this table can be found at https://youtu.be/DByYWTQ6X3E.
Respective minimum and maximum state variables are min = [-0.38, -0.08, -0.80, -0.88, -0.42,
-0.85, 0.00, 0.00], max = [0.46, 1.52, 0.80, 0.50, 0.43, 0.95, 1.00, 1.00], respectively.

Rules after Re-optimization (Depth-3 NLDT*)

nonlinear optimization to focus its search on rule structure and details describing weights and biases
of the rules by using a bilevel optimization algorithm. Starting with an open-loop training, which
is relatively fast (due computationally fast ﬁtness evaluation) but uses only time-instant state-action
data, we have proposed a ﬁnal closed-loop training phase in which the complete or a part of the
open-loop NLDT is re-optimized for weights and biases using complete episode data. Results on
popular discrete action problems have amply demonstrated the usefulness of the proposed overall
approach.

This proof-of-principle study encourages us to pursue a number of further studies. First, the
scalability of the NLDT approach to large-dimensional state-action space problems must now be
explored. A study conducted in Chapter 3 on binary classiﬁcation of dominated versus non-
dominated data in multi-objective problems was successfully extended to 500-variable problems.
While it is encouraging, the use of customization methods for initialization and genetic operators
using problem heuristics and/or recently proposed innovization methods [46] in the upper level
problem can be tried. Second, this study has used a computationally fast open-loop accuracy
measure as the ﬁtness for evolution of the NLDT . This is because, in general, an NLDT  with

99

Figure 4.17: Closed-loop training plot for ﬁnetuning the rule-set corresponding to depth-3 NLDT()
(Table 4.8) to obtain NLDT* for LunarLander problem.

 
a high open-loop accuracy is likely to achieve a high closed-loop performance. However, we have
observed here that a high closed-loop performance is achievable with a NLDT  having somewhat
degraded open-loop performance, but re-optimized using closed-loop performance metrics. Thus, a
method to identify the crucial (open-loop) states from the AI-based simulation dataset that improves
the closed-loop performance would be another interesting step for deriving NLDT . This may
eliminate the need for re-optimization through closed-loop training. Third, a more comprehensive
study using closed-loop performance and respective complexity as two conﬂicting objectives for a
bi-objective NLDT search would produce multiple trade-oﬀ control rule-sets. Such a study can, not
only make the whole search process faster due to the expected similarities among multiple policies,
they will also enable users to choose a single policy solution from a set of accuracy-complexity
trade-oﬀ solutions.

100

051015202530Generation5004003002001000100200RewardClosed-Loop TrainingBest RewardAvg. RewardCHAPTER 5

SCALE-UP STUDY AND IMPROVISATION

In this chapter, we focus at how the overall algorithm developed in previous chapter can be made
more eﬃcient in terms of – training time and scalability. To this purpose, we introduce a benchmark
problem of planar serial robotic manipulator. This problem is inspired from the classical Acrobot
control problem [64]. A schematic of the acrobot problem and serial manipulator problem is
provided in Figure 5.1a and 5.1b respectively.

(a) Acrobot

(b) Planar Serial Manipulator

Figure 5.1: Acrobot and a customized Planar Serial Manipulator benchmark problems.

The state-space for acrobot environment (Figure 5.1a) is six dimensional with following state
variables: (1), (1), (2), (2), 1 and 2, where 1 is the angle the ﬁrst link makes
with the vertical axis and 2 is the angle the longitudinal axis of the second link makes with that of
the ﬁrst one. The second joint between the ﬁrst and second link is actuated with a motor. In case
of planar manipulator (Figure 5.1b), the sate space comprises of angular position  and angular
velocity  of each joint. Thus, for a -link manipulator involving  revolute joints, the state space
would be 2 dimensional. The motor is located at the last joint of the manipulator and is actuated

101

using three torque values: −, 0 and . Each link is of 1 unit length and has its center-of-mass at
its geometric center. The motor is assumed to be massless for the sake of simplicity. The base of
the planar-manipulator is located at (0, 0, 0) and the motion of the planar manipulator is limited
to the XZ plane. There is a downward gravitational pull () of 10 units (i.e. -10 along vertical Z
axis). Torque is applied along the Y-axis. The task in this problem is to take the end-eﬀector (i.e.
tip of the last link) of the serial-manipulator to a desired height of  units by supplying torque to
the motor located at the last joint (joint between ( − 1)-th and -th link). The diﬃculty of this
benchmark problem can be adjusted by

1. changing the number of links,

2. changing the value of desired height level ,

3. changing the value of torque  and

4. placing extra motors at other joints.

We simulate the mechanics of the planar serial manipulator using PyBullet [85]: a Python based

physics engine.

In our work, we provide two case-scenarios by focusing at the ﬁrst three points of the above list.
As mentioned before, by changing the number of links, the dimension of the state-space changes.
The dimension of the action-space depends on the number of motors used. In the present work, we
keep the number of motors ﬁxed to one.

The details regarding two environments which are created and studied in this chapter are

summarized in Table 5.1.

Table 5.1: Details regarding custom designed Planar Serial Manipulator environments.

Env. Name

5-Link Manipulator
10-Link Manipulator

# Links

()
5
10

Motor

Torque ()

1,000
2,000

Desired
Height ()

+2
+2

# State
Vars.
10
20

The reward function (x, ) is given by the following equation

102

(cid:17)2

−1 −(cid:16) +1−

++1
100 if  ≥ H,



(x) =

if  < ,

(5.1)

where  is the vertical location of the end-eﬀector. The minimium value for  is − when the
entire manipulator is stretched to its full length and all joint angles (i.e. ) are 0.

At the beginning of an episode, joint angles  of the manipulator are randomly initialized
between −5 deg and +5 deg, and the angular velocities  are initialized to a value between
−0.5 rad/sec and +0.5 rad/sec.

In next sections we discuss results obtained on the above two custom designed environments
by using diﬀerent procedures of inducing NLDE . The black-box AI (DNN) is trained using the
PPO algorithm [61].

5.1 Ablation Study for Open-loop Training

In this section, we launch two separate studies related to open-loop training procedure (see
Figure 4.7). Following our results from Chapter 4, we restrict our discussion to binary-split
NLDTs.
It was seen in the previous chapter that the open-loop training is conducted using the
hierarchical bilevel-optimization algorithm, which is discussed at length in Chapter 3. A dedicated
bilevel-optimization algorithm is invoked to derive the split-rule  (x) at a given conditional node.
The upper-level search is executed using a discrete version of a genetic algorithm and the lower-
level search is realized through an eﬃcient real-coded genetic algorithm (RGA). Evolutionary
algorithms are in general considered robust and have a potential to conduct more global search.
However, being population driven, their search-speed is often less that of classical optimization
algorithms. In this section, we study the eﬀect of replacing the real-coded algorithm with a classical
sequential quadratic programming (SQP) optimization algorithm in the lower-level of the overall
bilevel algorithm for obtaining NLDT . Later, closed-loop training based on real-code genetic
algorithm is applied to NLDT  to obtain NLDT* by re-optimizing the real valued coeﬃcients of

103

NLDT  (section 4.6.3). We use the SciPy [86] implementation of SQP. The initial point required
for this algorithm is obtained using the dipole concept (Figure 3.7, Eq. 3.11).

For analysis, we induce the NLDT  of depth-3 on the balanced training dataset (section 4.7.1.2)
of 10,000 datapoints. The testing dataset is regular (see section 4.7.1.1) and comprises of 10,000
datapoints. Comparison of accuracy scores and average training time of inducing NLDT  by
using SQP and RGA algorithm at lower-level is provided in Table 5.2. For a given procedure
(SQP or RGA) the best NLDT  from 10 independent runs is chosen and is re-optimized using
closed-loop training (section 4.6.3). Statistics regarding closed-loop performance of NLDT* is
shown in the last two columns of Table 5.2. It is to note here that the closed-loop training is done
using the real-coded genetic algorithm (RGA) discussed in section 4.6.3.

Table 5.2: Comparing performance of diﬀerent lower-level optimization algorithms. For compari-
son, closed-loop performance of the original DNN policy is also reported.

Open-Loop NLDT 

Training
Time (s)
5-Link Manipulator
15.29 ± 4.95

Closed-Loop NLDT*

Cumulative

Reward

Algo.
Name

Training
Accuracy
62.46 ± 2.01
SQP
RGA 71.14 ± 1.77
DNN

NA

56.57 ± 1.00
SQP
RGA 65.84 ± 0.85
DNN

NA

Testing
Accuracy
69.34 ± 5.39
69.17 ± 4.39

NA

55.64 ± 3.58
62.60 ± 3.09

NA

−91.52 ± 14.42
1091.56 ± 319.18 −123.48 ± 25.11
−170.90 ± 42.57

NA

10-Link Manipulator
39.27 ± 11.63
2860.96 ± 789.49

NA

−318.82 ± 15.52
−281.88 ± 9.38
−325.86 ± 4.63

Completion

Rate

97.92 ± 1.93
97.00 ± 5.13
89.00 ± 5.19
96.00 ± 3.92
95.00 ± 4.31
85.88 ± 1.94

It can be observed from the results that open-loop training done with SQP in lower-level is about
70 times faster than the training done using RGA at lower level. The training done with RGA has a
better overall open-loop performance. Thus, if the task is to closely mimic the behavior of black-box
AI or if only a high classiﬁcation accuracy is desired (in case of classiﬁcation problems), then RGA
is the recommended algorithm for lower-level optimization to obtain NLDT . However, NLDT*
obtained after re-optimizing NLDT  corresponding to SQP and RGA have similar closed-loop
completion rate (last column of Table 5.2). This implies that despite low open-loop accuracy
scores, the open-loop training done using SQP in lower-level was able to successfully determine

104

the template of split-rules  (x) and the topology of NLDT, which upon re-optimization via closed-
loop training algorithm could fetch a decent performing NLDT*. During open-loop training,
the search on weights and coeﬃcient using SQP was possibly not as perfect as compared to the
one obtained through RGA, however, the re-optimization done through closed-loop training could
compensate this shortcoming of SQP algorithm and produce NLDT* with a respectable closed-loop
performance. Additionally, in either cases, the NLDT* obtained always had a better closed-loop
performance than the original black-box DNN policy. This observation suggests that it is preferable
to use SQP in lower-level during open-loop training to quickly arrive at a rough structure of NLDT 
and then use closed-loop training to derive a high performing NLDT*. Another important thing
to note here is that the open-loop training time for obtaining NLDT  for 10-link manipulator
problem is about 2.5 times less than the corresponding training time for obtaining NLDT  for
the 5-link manipulator problem. This is because, the population size in upper-level GA for 10-link
manipulator problem (20 state variables) is twice that of the one for the 5-link manipulator problem
(10 state variables). Also, the upper-level search in the high-dimensional space could possibly take
more generations than it will take for lower-dimensional search spaces to converge.

5.2 Closed-loop Visualization

In this section we provide a visual insight into the closed-loop performance of NLDT* and
DNN which we derived in the previous section. In our case, the frequency of the simulation is set
to 240Hz, meaning that the transition to the next state is calculated using the time-step of 1/240
seconds. Geometrically speaking, this implies that the Euclidean distance between states from
neighboring time-steps would be small. The AI (DNN or NLDT*) outputs the action value of 0 (−
torque), 1 (0 torque) or 2 (+ torque) for a given input state. Action Vs Time plots corresponding
to diﬀerent closed-loop simulation runs obtained by using DNN, NLDT* (SQP)1 and NLDT*
(RGA)2 as controllers is shown in Figure 5.2 and 5.3 for 5-link and 10-link manipulator problems
respectively.

1NLDT* (SQP) indicates that the corresponding NLDT  was derived using the SQP algorithm in the lower-level
2NLDT* (RGA) indicates that the corresponding NLDT  was derived using the RGA algorithm in the lower-level

105

(a) DNN

(b) NLDT* (SQP)

(c) NLDT* (RGA)

Figure 5.2: Action Vs. Time plot for 5-Link manipulator problem. Figure 5.2b provides the
plot for NLDT* which is obtained from the NLDT  trained using SQP algorithm in lower-level.
Similarly, Figure 5.2c provides the plot for NLDT* which is obtained from the NLDT  trained
using RGA algorithm in lower-level.

106

020406080100120Time Steps012Action020406080100120140Time Steps012Action020406080100120140Time Steps012ActionCertain key observations can be made by looking at the plots in Figure 5.2. The control output
for DNN is more erratic, with sudden jerks as compared to the control output of NLDT* (SQP)
and NLDT* (RGA). The performance of NLDT* in Figures 5.2b and 5.2c is smooth and regular.
This behaviour can be due to the involvement of a relatively simpler non-linear rules (as compared
to the complicated non-linear rule represented by DNN) which are captured inside NLDT*. This
is equivalent to the observation we made for the mountain car problem in Figures 4.3a and 4.3b,
wherein the black-box AI had a very erratic behavior for the region of state-space in the lower-half
of the state-action plot, while the output of NLDT was more smooth. Additionally, it was seen in
Table 5.2 that the NLDT* (irrespective of how its predecessor NLDT  was obtained, i.e. either
through SQP or RGA in lower-level) showed better closed-loop performance than the parent DNN
policy. This observation implies that simpler rules expressed in the form of a nonlinear decision
tree have better generalizability, thereby giving more robust performance for randomly initialized
control problems. A careful investigation to the plots in Figure 5.2b and 5.2c reveals that only two
out of three allowable actions are required to eﬃciently execute the given control task of lifting the
end-eﬀector of a 5-link serial manipulator. This concept will be used to re-engineer the NLDT*, a
discussion regarding which is provided in the next section.

A similar argument can be made from plots in Figure 5.3 for 10-link manipulator problem. This
is a slightly diﬃcult problem to solve than the 5-link version since it involves twice the number
of state variables. Interestingly, the search for NLDT* provides us with the simplest solution to
this problem to lift the end-eﬀector to the desired height of 2 units above the base. The simplest
solution here is to give a constant torque in one direction as shown in Figure 5.3b. However, for
this problem the best closed-loop performance in terms of cumulative reward (second last column
of Table 5.2) is obtained using the control strategy corresponding to NLDT* (RGA) (Figure 5.3c).
Here too, for most of the states, only one action is required, and occasionally other actions are
invoked.

107

(a) DNN

(b) NLDT* (SQP)

(c) NLDT* (RGA)

Figure 5.3: Action Vs. Time plot for 10-Link manipulator problem. Figure 5.3b provides the
plot for NLDT* which is obtained from the NLDT  trained using SQP algorithm in lower-level.
Similarly, Figure 5.3c provides the plot for NLDT* which is obtained from the NLDT  trained
using RGA algorithm in lower-level.

108

050100150200250Time Steps012Action050100150200250Time Steps012Action050100150200250300Time Steps012ActionAs mentioned before, in the next section we will discuss a post-processing approach to further

simplify the NLDT*.

5.3 Reengineering NLDT*

It was seen in action-time plots in Figure 5.2b, 5.2c, 5.3b and 5.3c that not all actions are required
to perform a given control task. Also, it might be possible that while performing a closed-loop
control using NLDT, not all branches and nodes of NLDT are visited. Thus, the portion of the
NLDT which is not being utilized or is getting utilized very rarely can be pruned and the overall
NLDT architecture can be made simpler. To illustrate this idea, we consider the NLDT which is
derived for the 5-link manipulator problem. The topology of the best performing NLDT  (SQP)
for the 5-link manipulator problem is shown Figure 5.4a.

As mentioned before, this NLDT  is trained on a balanced training dataset which is generated
by collecting state-action pairs using parent DNN controller.
In the ﬁgure, for each node, the
information regarding its node-id, class distribution (given in square parenthesis) and the most
dominating class is provided. Other than the root-node (Node 0), all nodes are colored to indicate
the dominating class, however, it is to note that only the class associated to leaf-nodes carry the
actual meaning while predicting the action for a given input state. This NLDT  comprises of
three split-rules in total. The class-distribution for each node is obtained by counting how many
datapoints from the balanced training dataset visited a given node. Thus, the root node comprises
of all datapoints (total 10,000), which are then scattered according to the split-rules present at each
conditional node.

Figure 5.4b provides topology of NLDT* which is obtained after re-optimizing NLDT  of
Figure 5.4a using closed-loop training. As discussed in section 4.6 and 4.6.3, the topology of the
tree and the structure of non-linear rules is identical for both: NLDT  and NDLT*. However,
the weights and biases of NLDT* are updated to enhance the closed-loop control performance.
Similar to NLDT  of Figure 5.4a, the information regarding node id and class-distribution is
provided for all the nodes of NLDT* in Figure 5.4b. However, the data distribution in NLDT*

109

(a) NLDT 

(b) NLDT*

Figure 5.4: NLDTs for 5-Link Manipulator problem.

is obtained by using the actual state-action data from closed-loop simulations, wherein NLDT* is
used as a controller. Total 10,000 datapoints are collected in form of sequential states-action pairs
from closed-loop simulation runs which are executed using NLDT*. As can be seen in the root
node of NLDT* (Figure 5.4b), out of 10,000 states visited during closed-loop control, action 0 (−
torque) was chosen by NLDT* for total 7736 states and action 2 (+ torque) was chosen for 2264
states. In none of the states visited during closed-loop control was action 1 (no torque) chosen.
This is consistent with what we have observed in the Action Vs Time plot in Figure 5.2b, wherein
most of the time action 0 was executed, while there was no event where action 1 was executed.
The ﬂow of these 10,000 state-action pairs through NLDT* and their corresponding distribution

110

in each node of NLDT* is provided in Figure 5.4b. It can be observed that Node 5, Node 4 and
Node 8 of NLDT* are never visited during closed-loop control. This implies that splits at Node 2,
Node 1 and Node 6 are redundant. Thus, the part of NLDT* shown in red-box in Figure 5.4b can
be pruned and the overall topology of the tree can be simpliﬁed. The pruned NLDT* will involve
only one split (occurring at Node 0) and two leaf nodes: Node 1 and Node 6. However, it is to
note here that we need to re-assign class-labels to the newly formed leaf nodes (i.e. Node 1 and
Node 6) based on the data-distribution from closed-loop simulations. The old class-labelling for
the Node 1 and Node 6 was done based on the open-loop data (Figure 5.4a). Using the new class
distribution corresponding to NLDT* (Figure 5.4b), Node 1 is re-labelled with Class-2 and Node 6
with Class-0. The pruned version of NLDT* of Figure 5.4b is provided in Figure 5.5 (here nodes
are re-numbered, with Node 6 of NLDT* in Figure 5.4b re-numbered to Node 2 in the pruned
NLDT* as shown in Figure 5.5). The split-rule corresponding to the root-node is also shown.
Interestingly, out of 10 total state-variables, only 2 are used to decide which action to execute for
closed-loop control. 1 corresponds to the angular position of the second link and 9 variable
corresponds to the angular velocity of the last joint (i.e. the joint between link-4 and link-5 where
the motor is connected).

Figure 5.5: Pruned version of NLDT* (Figure 5.4b) for 5-link manipulator problem.

5.4 Conclusion

In this chapter we introduced benchmark problems to conduct scalable study and investigate
and compare algorithms to eﬃciently conduct open-loop training. The lower level optimization

111

done using SQP during open-loop training reduces the overall open-loop training time by 70 times.
The NLDT  induced using RGA in lower level shows robust and relatively better open-loop
performance. However the closed-loop performance observed after re-optimizing NLDT  to
NLDT* is similar in either case. This observation suggests the use of SQP in control tasks to
quickly generate NLDT . An extension of this work can be made for action-spaces involving
more than 3 actions. This can be achieved by allowing more joints of the serial-manipulator to get
actuated using motors. Also, a hybrid approach combining SQP and RGA for open-loop training
can be derived to eﬃciently induce high performing NLDT . This aspect will be particularly
useful for classiﬁcation problems.

112

EXTENSION TO REGRESSION AND CONTINUOUS CONTROL PROBLEMS

CHAPTER 6

6.1 Introduction

In this chapter, we provide a conceptual path on how to extend the idea of nonlinear decision
tree to solve problems pertaining to regression and control systems involving continuous action
as output. Regression problems involves several independent features which are represented with
feature vector x, and one dependent variable . The task here is to learn the relationship between
dependent and independent variables ( =  (x)) Traditionally, linear regression models and neural
networks can be trained to predict the value of a dependent variable  from the input features x.
However, these regression models are inherently complex since they translate to long equations to
represent the mathematical relationship between output  and input features x. Regression trees
have been popularly used to ﬁnd piece-wise constant curves to ﬁt the regression surface. The idea
behind regression trees is to partition the feature space into sub-regions using one or more splits
and then have a constant term for each such sub-region to approximate the value of the dependent
variable . CART and M5 based trees fall in this category. An improvisation to this concept is
suggested in [1], where, instead of having the leaf node to represent a constant valued function as
in CART, leaf nodes represent a linear or quadratic functions comprising of  features (where  is
a user-speciﬁed parameter). Partitioning splits however occur on only one feature. An example of
interpretable tree generated in this fashion is shown in Figure 6.1.

6.2 Interpretable AI for Regression Problems using NLDT

The interpretable AI (IAI) developed for regression in our case is represented in the form of
a nonlinear decision tree (NLDT), with terminal leaf nodes representing regression equations and
the non-terminal conditional split-nodes representing conditional rules. The set of conditions at
split-nodes deﬁne the domain in the feature space where the regression rule at a follower terminal

113

Figure 6.1: Piecewise linear regression tree with two predictors from [1]. At each leaf node,
features involved in the expression of two-regressor linear model is shown. Splits use only one
feature variable.

node is applicable. A conceptual illustration of the NLDT framework for regression task is provided
in Figure 6.2.

Figure 6.2: Conceptual layout of NLDT for regression task.

The rule  at each terminal node is a function of feature vector x and provides the value of the
dependent variable  (i.e.  = (x)). Each conditional rule  partitions the feature space into

114

two parts: (x) ≤ 0 and (x) > 0. Thus mathematically, for each -th leaf node:

If:  (x) > 0, for all  < , and (x) ≤ 0,

Rule:  = (x).

(6.1)

For example, for Node 1 ( = 0) leaf node, the rule indicates: If 0 ≤ 0, then  = 0(x).
Node 3 rule ( = 2) indicates: If 0 > 0 ∧ 2 ≤ 0, then  = 2(x). The nonlinear decision tree is
induced by hierarchically applying a Dual bilevel algorithm at each conditional node. For an -th
conditional node, Stage 1 of the proposed dual bilevel algorithm focuses at deriving the regression
rule  = (x). Stage 2 of the algorithm is then invoked to derive the partition rule (x) to split the
data in conditional node  into two subsets:  
follows the regression rule
 = (x) and have a mean squared error (MSE) value within a user deﬁned threshold value .
Thus,  
is declared as a terminal leaf node. A new dedicated dual bilevel algorithm is reapplied

to the datapoints present in node  
 and process repeats. This procedure of hierarchically inducing
the decision tree continues until one of the termination criteria is met. We now discuss the dual
bilevel algorithm which is used to derive regression and partition rule at a given conditional node
.

 . Points in  

 and  


6.3 Dual Bilevel Algorithm (DBA) for Regression

The dual bilevel algorithm is used to derive two rules at a given conditional node : 1) regression
rule  = (x) and 2) partition rule (x). The objective of deriving regression rule (x) is to ﬁt
as many datapoints present in the conditional node  as possible. If the regression rule  = (x)
is able to ﬁt all datapoints present in node  within an acceptable limit on mean squared error,
the dual bilevel algorithm is terminated and the node  is declared as the leaf node. However,
if regression rule  = (x) is not able to ﬁt all datapoints in , the second stage of dual bilevel
algorithm is invoked to derive the partition rule (x). As mentioned before, the partition rule
(x) partitions the data in node  into two subsets
 ) (for which (x) ≤ 0), and

• Left child node ( 

115

• Right child node ( 

 ) (for which (x) > 0),

 have their MSE value w.r.t to regression curve  = (x) within
such that datapoints in subset  
a pre-speciﬁed threshold value . Thus, for a new test datapoint x, conditions listed in Eq. 6.1
are ﬁrst checked before applying regression rule  = (x) to estimate the value of the dependent
variable . A high level pseudo-code for the dual bilevel algorithm is provided in Algorithm 8.

Algorithm 8: Pseudo-code of dual bilevel algorithm (DBA)
Input

: Data in Conditional node :  datapoints with  independent variables
((1, 2, . . . , ) = x) and one dependent variable .
Regression Rule  = (x) and partition rule (x).

Output
:
// Bilevel Algorithm to derive (x)
Step 1 derive regression rule:  = (x);
if MSE(, (x)) ≤  then

terminate;
Derive Partition Rule (x);

else

end

We will next discuss the bilevel algorithm which is used to derive the regression rule  = (x).

6.3.1 Data Normalization

Before training and inducing the NLDT for regression, we normalize the data. Each feature  and
dependent variable  is normalized using the following equation:

 =

 − 
2

,

(6.2)

where  is the normalized value of the variable ,  and  are mean and standard deviation
of variable , respectively.

116

6.3.2 Bilevel Regression Algorithm to Obtain (x)

The regression rule to be evolved is expressed as a weighted sum of power-laws as given in equation
below:

where ’s are power-laws of type  =

(x) =

. Like in the standard classiﬁcation task discussed
in Chapter 3, the genome representing the template of the regression rule is represented with a
block matrix B as shown below

=1 

 
11 + 22 + · · · +    + 0

1

,

(6.3)



B =

11 12 13

. . . 1

21 22 23
...
...

...

. . . 2
...
. . .

 1  2  3

. . .  



.

(6.4)

The upper level of the bilevel algorithm conducts the search in the discrete space of exponents
 . The lower level of the bilevel algorithm searches the coeﬃcients  corresponding to power-
laws  and the bias terms 0 and 1. The upper level algorithm is similar to the upper level
algorithm developed for classiﬁcation problems (see section 3.3.2), however the objective and
constraint evaluation is diﬀerent as shown in Eq. 6.5. For the lower level, we use the sequential
quadratic program (SQP) to estimate the values of ’s and ’s. The optimization formulation
corresponding to the problem of estimating the regression function  = (x) is given below:

Minimize (B, w∗, ∗) = (w∗, ∗)|B,
Subject to (w∗, ∗) ∈ argmin {(w, )|B} ,

−1 ≤  ≤ 1, ∀, 0 ∈ [−1, 1], 1 ∈ (0, 1],
  ∈ ,

(6.5)

where for our study here we choose  = {−3,−2,−1, 0, 1, 2, 3}. The allowable powers indicated
by set  controls the maximum complexity achievable by our procedure. The ﬁtness of the upper

117

level individual corresponds to the ﬁtness of the optimal solution of the corresponding lower level
problem. We will next discuss how the lower level ﬁtness function is designed to facilitate the
regression ﬁt.

6.3.2.1 Lower Level Regression Optimization

Since we are targeting at evolving a simple rule, it is highly likely that one simple rule won’t ﬁt
the entire dataset. Figure 6.3 illustrates one such situation involving two independent variables
and one dependent variable. Here, datapoints are scattered across three disjoint islands. Most of

Figure 6.3: Three Island Regression Problem.

them belong to island 2 while island 1 has the least number of datapoints. There might exist one
complicated regression surface which passes through all the datapoints. However, since we are
interested in having simple rule, one of the simple solution is to have three simple linear rules, one
for each island. Furthermore, in our overall algorithm, we would be interested to determine the
rule corresponding to island 2 ﬁrst since it has majority number of points. The ﬁtness function 
for the lower-level optimization is formulated to capture these aspects. The algorithm to compute
 is provided in Algorithm 9.

We will discuss the motivation behind diﬀerent segments of Algorithm 9 here and explain the

idea with the help of Figure 6.4.

118

Algorithm 9: Algorithm to compute the lower level ﬁtness 
Input

: Dataset D in B-space of size  ×  and array  of length  comprising of values
of dependent variable , weight vector w, bias valus 0 and 1.
Lower level ﬁtness value 

Output
:
// predict the value of 
1 (cid:48) = PredictY(D, w, 0, 1);
// Compute the absolute difference between predicted and actual 
2  = | − (cid:48)|;
// Get a Boolean array of good points
3  =  ≤ ;
// Compute the total number of good points

values

 [];
4  = 
5 if  == 0 then

// No good point found
 = ComputeRMSE((cid:48),  ); // Root Mean Squared Error
if  <  − 2 then

6
7 else
8

(cid:48).

// Sorted array of error in ascending order
 = Sort(   );
/* Get the error value of the best point from set of bad points
*/
(cid:48) = [ + 1];
if (cid:48) < 0.5 then
+1
+1
=1

[]2;

 =

1

9

10
11

12

13
14
15
16
17
18
19
20 end

else

 = ComputeRMSE([], (cid:48)[]);

else

end
 = ComputeRMSE([], (cid:48)[]);

end
 =  − 

The value of  is computed based on which stage the regression surface is relative to the

training data.

Stage 1 Here, the regressed surface is far oﬀ from the location of training datapoints as shown in
Figure 6.4a. This implies that no points are within a  distance from the regressed

119

(a) Stage 1

(b) Stage 2

(c) Stage 3

Figure 6.4: Computation of Lower Level Objective function  based on Algorithm 9.

surface, i.e. ∀x ∈ , |(x) − (cid:48)(x)| > , where  and (cid:48) are actual and regressed values
of x respectively. The value of parameter  is user speciﬁed. Under this scenario,
the objective  is evaluated by computing mean squared error (MSE) between actual and
predicted values. Minimization of MSE in Stage 1 brings the regression surface closer to
the location of training dataset. This is shown pictorially in Fig. 6.4a. This segment is
represented by lines 5-6 in Algorithm 9. Eventually, the regressed surface/curve will come
in the vicinity of one (or more) datapoints (shown by red in Fig. 6.4a) and will result into
 ≥ 0, where  is the number of datapoints for which the diﬀerence between
predicted and actual  value is less than .

Stage 2 During Stage 2, we already have atleast one point with | − (cid:48)| ≤  (i.e.  ≥ 1).
In Stage 2, we design the objective function  such that its minimization will ensure
maximization of number of good points (i.e. points with | − (cid:48)| ≤ ). One such way to
enforce this is to set  = −. However, setting  this way resembles a discontinuous

120

many-to-one mapping. This will create local ﬂat regions (similar to the situation discussed
in Figure 3.6) thereby making it diﬃcult for an optimization solver to navigate the search
space of w and . We thus do the following modiﬁcation to avoid creation of ﬂat zones in
the ﬁtness landscape of :
Step 1 Determine the closest point x(cid:48) in dataset D to the regressed surface such that |(cid:48)(x(cid:48)) −
(x(cid:48))| >  and compute its error as (cid:48) = |(cid:48)(x(cid:48))− (x(cid:48))| (line 9-10 in Algorithm 9).
If (cid:48) < 0.5, go to Step 2, else go to Stage 3

Step 2 Since (cid:48) < 0.5, x(cid:48) is potentially a next good point. Thus, we append datapoint x(cid:48) to the
list of good points and compute the MSE on this modiﬁed set of good points. Let the
MSE value of this set (good points + potentially good point) be denoted with .  is
then computed as

 =  − 

as shown in line 12 and line 19 in Algorithm 9. Note that at ﬁrst, the term  will
remain constant and only  will get minimized while minimizing . This will result
in tilting of the regressed surface as shown in Figure 6.4b (where regressed surface 
is tilted to ). Eventually, minimization of  this way will successively cover more
points (shown by shaded red colors in Figure 6.4b) and will bring the regression surface
to location  (shown in Figure 6.4c).

Stage 3 Here, the next closest point with (cid:48) >  is far away from the regression surface

(marked with green in Figure 6.4c). Hence,  will be computed as

 =  − ,

where  is the mean squared error of good points (i.e. all those points for which
|(x) − (cid:48)(x)| ≤ ). In the example shown in Figure 6.4, values assumed by 
during Stage 3 will be certainly less than those in Stage 2 since  term of  in

121

Stage 3 will be more by atleast 1 unit than the  term in Stage 2. Also,  term
in Stage 3 will remain constant since the next possible good point (green colored in
Figure 6.4c) is located very far away. Thus, minimization of  will imply minimization
of . This will steer to re-position the line  (for which all red points of Figure 6.4c
were within  distance) to line .

Based on the above explanation and illustrations shown in Figure 6.4, in a particular lower level
optimization run using a point based classical optimization approach, the values of  as observed
in Stage 1 will be more than those observed in Stage 2 and values observed in Stage 2 will be more
than those observed in Stage 3. This would be true in general, but will also depend on several other
factors such as noise in the dataset, value of  and relative locations of datapoints in training
set.

Eventually, the regression curve will ﬁt certain segment of the training data (represented by 
in Figure 6.4c). In our experiments, we set  = 0.03. Computing  using the procedure
discussed above removes the bottleneck of having ﬂat ﬁtness landscape. This allows us to use
point based classical optimization algorithms to quickly determine optimal values of w and .
Having a fast and reliable lower level optimization is really important for the eﬃcient design of the
overall bilevel algorithm to estimate the regression function (x). Next we discuss the upper level
optimization which is used to estimate the structure and template of the regression function (x).

6.3.2.2 Upper Level Regression Optimization

Fitness of the optimal solution of lower level optimization becomes the ﬁtness of the upper level
optimization as shown in Eq. 6.5. Since genetic operators in the upper level are designed to
incrementally complexify the expression of the power-law rule (Eq. 6.3), starting from a very
basic rule involving only one variable appearance, we observe a natural preference and bias for
simpler rules over complex rules. Output from one run of bilevel regression algorithm is shown
in Figure 6.5. It can be observed that the algorithm is able to successfully fetch us with a linear

122

regression rule (x) = 0.911 − 0.900 + 0.08 to ﬁt the island comprising of maximum number
of datapoints (Figure 6.5).

Figure 6.5: Regression algorithm is applied on all datapoints. The obtained regression rule
0(x) = 0.911 − 0.900 + 0.08 is able to ﬁt the subset of datapoints which are represented by
Y-predicted (in green circles). Partition rule 0(x) will be now derived to identify the domain in
-space (i.e. 0(x) ≤ 0)) where this regression rule is applicable.

6.3.3 Bilevel Partition Algorithm (BPA) to Obtain (x)
The regression rule ((cid:48) = (x)) derived in the previous section ﬁts certain portion of the dataset
in node  (as an illustration, this is highlighted with green circles in Figure 6.5). The bilevel
partition algorithm is now employed to derive a partitioning boundary (x) = 0 which separates
(for which (x) > 0). The
the dataset in  into two subsets:  

 have their mean squared error w.r.t. the regression curve (x) within
fraction of datapoins in  
a user speciﬁed threshold value . Hence, for these datapoints, the depended variable  can be
predicted using the regression function (x). The partition rule (x) is derived to maximize the
(for which (x) ≤ 0) while keeping their mean squared error value
number of datapoints in  

(where the error is measured between the actual  value and the predicted (cid:48) value) within .
The optimization formulation to derive the partition rule (x) is organized in the bilevel

(for which (x) ≤ 0) and  


structure as shown below

123

Minimize (B, , w∗, ∗) = (w∗, ∗)|(B,),
Subject to (B, , w∗, ∗) = (w∗, ∗)|(B,) ≤ 0,

(w∗, ∗) ∈ argmin(cid:8)(w, )(cid:12)(cid:12)(w, ) ≤ 0(cid:9)(cid:12)(cid:12)(B,),

−1 ≤  ≤ 1, ∀,  ∈ [−1, 1]+1,
 ∈ {0, 1},   ∈ .

(6.6)

The overall bilevel algorithm is identical to the one developed for the classiﬁcation problem
to derive split-rules for a classiﬁcation task (as explained in Section 3.3), with the only change in
the objective and constraint functions. The partition rule (x) (where  is the node number (see
Figure 6.2)) being derived is of the following type

(x, w, , B) =

1 + 11 + . . . +   ,
(cid:12)(cid:12)1 + 11 + . . . +   (cid:12)(cid:12) −(cid:12)(cid:12)2(cid:12)(cid:12) ,

if  = 0,

if  = 1.

(6.7)


where all symbols carry similar meanings to those discussed in the equation of split-rule for
classiﬁcation problem (Eq. 3.2). x is the input feature vector,  is the power-law of type (x) =

 
, ’s are the coeﬃcients and ’s are the biases.

=1 
Since both objective and constraint value for upper level optimization are borrowed from the
optimal solution of the lower level optimization, we discuss the optimization formulation for lower
level optimization in the next section.

6.3.3.1 Lower Level Partition Optimization

The lower level optimization estimates optimal values of weights w and biases  which minimize
the following objective function

Min. (w, )(cid:12)(cid:12)(B,) = −  
s.t. (w, )(cid:12)(cid:12)(B,) =    −  ≤ 0

−1 ≤  ≤ 1, ∀,  ∈ [−1, 1]+1

124

(6.8)

after partition.   

Here,    indicates the number of points in node  for which (x)|(w,,B,) ≤ 0 (Eq. 6.7).
Thus, this objective function tends to maximize the number of points going to the left node
is computed by measuring the mean squared error of the ac-
 

 with the predicted (cid:48) values obtained using
tual  value of datapoints in the left node  
the regression function (cid:48) = (x) derived in the previous section. The constraint satisfaction

(w, )|(B,) =    −  ≤ 0(cid:17) ensures that  value of all datapoints coming to the left

(cid:16)

(i.e. (x) > 0). Hence, the right node which contains datapoints from subset  

 after the split (i.e. (x) ≤ 0) can be reliably predicted by using the regression function
node  
(cid:48) = (x). In our experiments, we set the value of  parameter to 2
 (where  = 0.03).
It is to note here that the regression rule  = (x) is not applicable to points belonging to subset
 becomes the
 

new non-terminal node and a dedicated dual bilevel algorithm is applied to it to derive its corre-
sponding regression and partition rules (2(x) and 2(x) in Figure 6.2). This process is repeated
until maximum depth of tree is reached or the regression rule is able to ﬁt all datapoints belonging
to a non-terminal node (in which case it is declared as a terminal leaf node).

6.3.4 Results on a Customized Benchmark Problem

In this section, we show results obtained on diﬀerent versions of customized benchmark three island
problem (Figure 6.3). In the ﬁrst experiment, we generate a pure dataset using the following set of
equations for islands 1, 2 and 3:

Island 1:
Island 2:
Island 3:
Concatenate:

1 = 0 + 1 + 0.2,
2 = −0 + 1 + 0.1,
3 = 0 − 1 − 0.3,
 = [1; 2; 3]

[0, 1] ∈ [0, 0.2]2
[0, 1] ∈ [0.2, 0.7]2
[0, 1] ∈ [0.7, 1]2

(6.9)

In the second experiment, we add noise to the pure dataset of Eq. 6.9 as shown below

 ()
 =  ()

 + ,

 = 1, 2, . . . ,

 ∈ [−0.1, 0.1]

(6.10)

125

Results obtained on pure and noisy three island datasets are provided in Figures 6.6 and 6.7
respectively. Very small MSE values for each of the three leaf nodes indicate the eﬃcacy of our
proposed Dual bilevel optimization algorithm to ﬁnd regression and partition rules.

(a)

(b)

Figure 6.6: Result on the Pure Three Island Dataset.
(b) Intert-
pretable AI representation using NLDT. Normalization constants are: Xmean = [0.49, 0.49],
Xstd = [0.25, 0.25], Ymean = 0.05 and Ystd = 0.28.

(a) Visualization of result.

As can be seen from results shown in Figure 6.6 and Figure 6.7, the proposed approach of deriv-
ing piece-wise non-linear regression equations fetches us with a relatively simple and interpretable
regressor, which otherwise would have assumed a complicated form with other regression models
like artiﬁcial neural networks or regression CART. The rules evolved are simplistic in structure and

126

(a)

(b)

Figure 6.7: Result on the Noisy Three Island Dataset.
(b) Intert-
pretable AI representation using NLDT. Normalization constants are: Xmean = [0.52, 0.52],
Xstd = [0.28, 0.27], Ymean = 0.03 and Ystd = 0.29.

(a) Visualization of result.

the induced NLDT is also topologically simple.

Currently, we are required to properly set the parameter  ( = 2

) to get optimal
performance. One of the possible way to eﬃciently handle this is to re-formulate the upper-level
problem as multi-objective and search for the knee in the bi-objective pareto front of MSE Vs 
(where  is the number of points in left node) and use that solution to conduct the partition.

In the next section, we will apply the proposed dual bilevel algorithm to arrive at a relatively

interpretable policy for a control problem involving continuous action.

127

6.4 Interpretable AI for Control Problem with Continuous Action Space

In this section, we implement the dual bilevel algorithm we developed in the previous section to
explain control logic of an ANN (artiﬁcial neural network) agent which is used to control an object
using an action value from a continuous real domain. The problem considered here is borrowed
from [87] and is pictorially shown in Figure 6.8.

Figure 6.8: A Car-following problem with continuous acceleration. The rear car is controlled by
an AI while the car in the front moves with a random acceleration proﬁle. This problem is similar
to the problem shown in Figure 4.8b with the only diﬀerence being that the cars can now assume
a value of acceleration in range −2/2 to 2/2 (unlike in the previous case where the rear car
could only have an acceleration from pre-speciﬁed discrete values).

Here, the car in the rear is autonomously controlled using an AI. The car in the front moves
with a random acceleration proﬁle. The state of the rear car at a given time  is determined by three
quantities (the state is denoted by a real-parameter vector x):

1. Its relative distance from the front car (0 =    − ),

2. its relative velocity with respect to the front car (1 =    − ) and

3. its acceleration at time  − 1 (2 = ( − 1)).

The control task requires to have a safe distance    between two cars, while simultaneously
ensuring a comfortable and smooth ride. The ANN agent is trained using the DDPG algorithm
[62]. For a given time-step , the ANN agent takes the state x of the rear car as an input and outputs
the acceleration value  (i.e. ()) in the continuous range from -2/2 to 2/2. Figure 6.9
shows a sample run where the trained ANN agent is used to control the rear car while the car in the
front operates with a random acceleration.

128

(a)

(b)

Figure 6.9: ANN Output for the continuous car-following problem described in Figure 6.8. Figures
(a), (b) and (c) are the plots of diﬀerent state variables w.r.t to the time-step. The output of the
ANN is shown in Figure (d).

129

Figure 6.9 (cont’d)

(c)

(d)

To explain the performance of the ANN, we ﬁrst collect the data comprising of series of states
and corresponding action value as determined by the ANN. The tabulated data is then used to
train and induce the NLDT. This problem of training and deriving NLDT from the ANN generated
dataset translates to a regression problem of predicting the value of a continuous dependent variable
given a set of continuous input features (or state variables in our case). The challenge here is to
obtain the NLDT regressor with simple rules so that it can act as a logic-translator to explain the
inner-logic of the ANN, RL or any more compelex CNN/DNN used earlier to solve the problem.

130

6.4.1 NLDT Representation and Training

As mentioned in the last section, the problem of developing interpretable AI for a continuous
control task involving one action translates to a regression rule ﬁnding problem. We thus adopt the
dual bilevel approach we discussed in section 6.3 to derive the NLDT for this regression problem.
For the control task, the regression rules represent diﬀerent control logic and the partition rules
represent diﬀerent conditions (or scenario) under which a given control logic is applicable. An
illustrative sketch of the same is provided in Figure 6.10.

Figure 6.10: NLDT Agent representation for continuous control task involving one action.

The conditional rules (x) and control logic (x) are expressed as weighed sum of power-laws
as mentioned in Eq. 6.3 and 6.7 respectively. The induction and training of the non-linear decision
tree (NLDT) of Figure 6.10 is done using the methodology discussed in the previous section in
Section 6.3.

6.4.2 Data generation

To derive the NLDT, the data is ﬁrst generated using the trained artiﬁcial neural network (ANN)
controller. For our task, the ANN controller takes state variables as input and outputs the value
In our experiments, we generate 5,000 datapoints to populate the training
of the acceleration.

131

data. Once the training data is generated, the dataset is normalized using the method discussed
in Section 6.3.1. The normalized data is then used to train the NLDT. As mentioned before, each
regression rule and partition rule is derived using the proposed dual bilevel algorithm (Section 6.3).

6.4.3 Results on Car-Following Problem

In this section we discuss some results obtained on the 1D car following problem involving
continuous action space (Figure 6.8). As mentioned before, the car in the front follows a random
acceleration proﬁle. The rear car is controlled using a trained ANN controller. The NLDT takes the
same state information as the ANN counterpart and predicts the value of acceleration. It is desired
to have the predicted acceleration value of NLDT to be as close as possible to the corresponding
ANN output for a given input state. Plots from diﬀerent simulation runs are shown in Figure 6.11
to provide the visualization of how closely the NLDT is able to mimic the parent ANN controller’s
output.

The performance seen in Figure 6.11 is obtained by the NLDT agent shown in Figure 6.12.
From the results shown in Figure 6.11, it can be concluded that our proposed algorithm is able
to induce a relatively interpretable NLDT (Figure 6.12) which is able to match the predictions of the
complicated ANN controller. It is important to highlight all components of the NLDT (Figure 6.12)
which is obtained by our procedure:

1. A two-depth decision tree is adequate to explain control strategies of the ANN with a small

MSE error. Our dual bilevel approach is able to evolve this simple tree.

2. The tree involves one non-linear partition rule (0(x)) at the root node, and two non-linear
regression rules ( = (x)) at left leaf node ( = 1, applicable for 0(x) ≤ 0) and right leaf
node ( = 2, applicable for 0(x) > 0).

3. The structures of the regression and partition rules are evolved by our dual bilevel optimizer

involving simple expressions.

132

(a) ( = 0) = 0

(b) ( = 0) = 5

Figure 6.11: Plots from diﬀerent simulation runs with diﬀerent initial relative velocity ()
between cars. The NLDT’s acceleration output (red line) matches with the ANN’s acceleration
output (blue line). NLDT agent behaviour is almost the same as that of ANN and can thus be used
to explain the behaviour of ANN.

133

Figure 6.11 (cont’d)

(c) ( = 0) = 10

Figure 6.12: NLDT for car following problem with continuous action space.

While the original ANN can control the rear car well, it certainly cannot explain the reasons
behind choosing the action for each combination of three input states. Above partition and regression
rules paves the way of arriving at a relatively humanly-interpretable explanation to the data derived
from the ANN controller. Notice that a regular decision tree (DT) is incapable of coming up
with such a nonlinear combinations of variables. This is true even for the generalized linear
models (GLMs) and generalized additive models (GAMs). Due to hurdles of optimizing non-linear
optimization problems with guaranteed optimality, nonlinear decision trees were not studied in the

134

past. With the advancements of eﬃcient evolutionary (and bilevel) optimization methods, we are
able to open up this avenue to come up with as few as possible and as interpretable as possible
nonlinear decision rules. We believe that this strategy can be extended to more complex control
problems.

135

CHAPTER 7

CONCLUSION AND FUTURE WORK

In this work, we primarily focused on inducing an intrinsically interpretable AI which is relatively
interpretable and simple than ANN, CART and SVM. The AI in our case is modelled using a
nonlinear decision tree (NLDT) wherein, unlike traditional decision trees, the conditional nodes
involve non-linear rules on input features. The non-linear rules at conditional nodes are expressed
as weighted sum of power-laws with exponents assuming values from a discrete set while the
coeﬃcients and biases take real values between -1 to 1. The search for the non-linear rules is
carried out using a custom-designed bilevel approach. The upper level spans the discrete search
space corresponding to exponents while the lower level conducts a search on continuous variables
to optimize the split criteria. The NLDT is grown through a recursive splitting procedure, and
for each new conditional node, the non-linear rule is eﬃciently derived using a bilevel algorithm.
This implies that there is no need to pre-supply the topology of the NLDT since it gets organically
determined. The NLDTs produced are simple in topology as compared to traditional CART trees and
have much simpler rules than ANN or SVM counterparts at each conditional node, thereby making
the overall AI relatively more interpretable. A comparison with classical classiﬁcation algorithms
on some standard and custom-designed benchmarks reveals the superiority of the proposed NLDT
algorithm in eﬃciently deriving simple yet high-performing classiﬁers.

This idea is extended to develop relatively interpretable controllers for control problems involv-
ing discrete actions. The open-loop training determines the structure of the NLDT and non-linear
rules, which are then re-optimized using an evolutionary-algorithm-based closed-loop training pro-
cedure to enhance the closed-loop control performance of the NLDT. The objective evaluation for
closed-loop training is computationally expensive; however, a speed-up to the overall evolutionary
closed-loop training algorithm is possible by distributing the evaluation of population individuals to
diﬀerent computer cores. The proposed algorithm is able to induce an NLDT* controller which is
relatively simple than DNN controller and has better closed-loop control than the parent black-box

136

AI controller. One of our key observations in this study was it was not required to have an NLDT 
with 100% open-loop accuracy to achieve a competent closed-loop performance. We extended our
study of inducing relatively interpretable controllers in Chapter 5 to conduct a scale-up study and
proposed a benchmark-problem. The open-loop training is made considerably fast by replacing the
lower-level RGA algorithm with a classical SQP algorithm. This compromises the open-loop accu-
racy, however the re-optimization through the follow-up closed-loop training successfully readjusts
weights and coeﬃcients of NLDT . The resulting NLDT* has a better closed-loop performance
than the original black-box AI. The NLDT* is then pruned and simpliﬁed by eliminating nodes
which are rarely visited during closed-loop control runs.

A conceptual layout on the possible extension of NLDT idea to solve regression and control
problems with continuous action space is demonstrated in Chapter 6. Here, a dual-bilevel algorithm
to proposed to systematically derive regression functions (x) and partition functions (x). The
proposed approach is ﬁrst tested on a customized three-island regression problem involving three
separate disjoint linear regression surfaces. Next, the dual-bilevel algorithm of inducing NLDT
is applied to derive an interpretable translator to the DNN for a CarFollowing problem involving
continuous action. The derived NLDT involves only one partition rule and two non-linear regression
control rules and is able to successfully mimic the behaviour of the parent DNN controller.

The results obtained are very encouraging and open up a number of possible research paths.
For classiﬁcation problems, the parameter  is required (Eq. 3.5). A parameter-free algorithm
could be developed to conduct splits. Currently, the rule structure is ﬁxed to be a weighted sum of
power-laws. A genetic-programming-based approach can be used with the bilevel idea to derive
rules with diﬀerent non-linearity. Another possible extension to this work would involve developing
an algorithm to handle categorical features in addition to ordinal or continuous features. It would
also be interesting to see how the dual-bilevel approach can be applied to any generic regression
or continuous control task. Currently, this approach is sensitive to the parameters  and
 (Algo. 9 and Eq. 6.8). Another parallel research can be launched focused on automating
the reasoning and explanation process of developed NLDTs. Currently, we could do it for simple

137

problems manually.

This work on producing a transparent AI system which is relatively interpretable used several
concepts from the non-linear optimization literature, such as bilevel-optimization and constraint-
handling. The training of inducing NLDT  and NLDT* can be made better and faster by using
concepts from surrogate optimization and metamodelling. A multi-objective-optimization-based
approach can also be developed to induce NLDT wherein the interpretability and performance are
modelled as separate objectives. As mentioned before, interpretability is subjective, so it calls for
a joint collaborative research to develop transparent AI systems by integrating domain expertise of
human experts and requirements of end-users. We hope that our work helps to advance the research
in the ﬁeld of interpretable artiﬁcial intelligence and beyond.

138

BIBLIOGRAPHY

139

BIBLIOGRAPHY

[1] H. Kim, W.-Y. Loh, Y.-S. Shih, and P. Chaudhuri, “Visualizable and interpretable regression

models with good prediction power,” IIE Transactions, vol. 39, no. 6, pp. 565–579, 2007.

[3]

[2] B. Letham, C. Rudin, T. H. McCormick, D. Madigan et al., “Interpretable classiﬁers using
rules and bayesian analysis: Building a better stroke prediction model,” The Annals of Applied
Statistics, vol. 9, no. 3, pp. 1350–1371, 2015.
J. H. Friedman, B. E. Popescu et al., “Predictive learning via rule ensembles,” The Annals of
Applied Statistics, vol. 2, no. 3, pp. 916–954, 2008.
J. A. Nelder and R. W. Wedderburn, “Generalized linear models,” Journal of the Royal
Statistical Society: Series A, vol. 135, no. 3, pp. 370–384, 1972.

[4]

[5] T. J. Hastie and R. J. Tibshirani, Generalized additive models. CRC press, 1990, vol. 43.
[6] S. N. Wood, Generalized additive models: An introduction with R. CRC press, 2017.
[7] W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, “Interpretable machine

learning: deﬁnitions, methods, and applications,” arXiv preprint arXiv:1901.04592, 2019.

[8] Z. C. Lipton, “The mythos of model interpretability,” Queue, vol. 16, no. 3, pp. 31–57, 2018.
[9] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: A review of clas-
siﬁcation techniques,” Emerging artiﬁcial intelligence applications in computer engineering,
vol. 160, pp. 3–24, 2007.

[10] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter opti-

mization,” in Advances in neural information processing systems, 2011, pp. 2546–2554.

[11] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-weka: Combined selection
and hyperparameter optimization of classiﬁcation algorithms,” in Proceedings of the 19th
ACM SIGKDD international conference on Knowledge discovery and data mining. ACM,
2013, pp. 847–855.

[12] L. Breiman, Classiﬁcation and regression trees. Routledge, 2017.
[13] D. Heath, S. Kasif, and S. Salzberg, “Induction of oblique decision trees,” in IJCAI, vol. 1993,

1993, pp. 1002–1007.

[14] S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,”

Journal of artiﬁcial intelligence research, vol. 2, pp. 1–32, 1994.

[15] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, “Oc1: A randomized algorithm for building

oblique decision trees,” in Proceedings of AAAI, vol. 93. Citeseer, 1993, pp. 322–327.

140

[16] E. Cantú-Paz and C. Kamath, “Using evolutionary algorithms to induce oblique decision
trees,” in Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computa-
tion. Morgan Kaufmann Publishers Inc., 2000, pp. 1053–1060.

[17] M. Kretowski, “An evolutionary algorithm for oblique decision tree induction,” in Inter-
Springer, 2004, pp.

national Conference on Artiﬁcial Intelligence and Soft Computing.
432–437.

[18] A. Ittner and M. Schlosser, “Non-linear decision trees-ndt,” in ICML. Citeseer, 1996, pp.

252–257.

[19] K. P. Bennett and J. A. Blue, “A support vector machine approach to decision trees,” in Neural
Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998
IEEE International Joint Conference on, vol. 3.

IEEE, 1998, pp. 2396–2401.

[20] H. Núñez, C. Angulo, and A. Català, “Rule extraction from support vector machines.” in

Esann, 2002, pp. 107–112.

[21] G. Fung, S. Sandilya, and R. B. Rao, “Rule extraction from linear support vector machines,” in
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery
in data mining. ACM, 2005, pp. 32–40.

[22] M. Craven and J. W. Shavlik, “Extracting tree-structured representations of trained networks,”

in Advances in neural information processing systems, 1996, pp. 24–30.

[23] P. M. Murphy and M. J. Pazzani, “Id2-of-3: Constructive induction of m-of-n concepts for
discriminators in decision trees,” in Machine Learning Proceedings 1991. Elsevier, 1991,
pp. 183–187.

[24] U. Johansson, R. König, and L. Niklasson, “The truth is in there-rule extraction from opaque
models using genetic programming.” in FLAIRS Conference. Miami Beach, FL, 2004, pp.
658–663.

[25] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen, “Comprehensible credit scoring
models using rule extraction from support vector machines,” European journal of operational
research, vol. 183, no. 3, pp. 1466–1476, 2007.

[26] H. Ishibuchi, T. Nakashima, and T. Murata, “Three-objective genetics-based machine learning

for linguistic rule extraction,” Information Sciences, vol. 136, no. 1-4, pp. 109–133, 2001.

[27] Y. Jin and B. Sendhoﬀ, “Pareto-based multiobjective machine learning: An overview and
case studies,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications
and Reviews), vol. 38, no. 3, pp. 397–415, 2008.

[28] H. Guo and A. K. Nandi, “Breast cancer diagnosis using genetic programming generated

feature,” Pattern Recognition, vol. 39, no. 5, pp. 980–987, 2006.

[29] M. Muharram and G. D. Smith, “Evolutionary constructive induction,” IEEE transactions on

knowledge and data engineering, vol. 17, no. 11, pp. 1518–1528, 2005.

141

[30] M. Shirasaka, Q. Zhao, O. Hammami, K. Kuroda, and K. Saito, “Automatic design of binary
decision trees based on genetic programming,” in Proc. The Second Asia-Paciﬁc Conference
on Simulated Evolution and Learning (SEAL’98. Citeseer, 1998.

[31] S. Haruyama and Q. Zhao, “Designing smaller decision trees using multiple objective opti-
mization based gps,” in IEEE International Conference on Systems, Man and Cybernetics,
vol. 6.

IEEE, 2002, pp. 5–pp.

[32] C.-S. Kuo, T.-P. Hong, and C.-L. Chen, “Applying genetic programming technique in classi-

ﬁcation trees,” Soft Computing, vol. 11, no. 12, pp. 1165–1172, 2007.

[33] M. C. Bot, “Improving induction of linear classiﬁcation trees with genetic programming,” in
Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation, 2000,
pp. 403–410.

[34] R. E. Marmelstein and G. B. Lamont, “Pattern classiﬁcation using a hybrid genetic program-

decision tree approach,” Genetic Programming, pp. 223–231, 1998.

[35] E. M. Mugambi, A. Hunter, G. Oatley, and L. Kennedy, “Polynomial-fuzzy decision tree
structures for classifying medical data,” in International Conference on Innovative Techniques
and Applications of Artiﬁcial Intelligence. Springer, 2003, pp. 155–167.

[36] A. Sinha, P. Malo, and K. Deb, “A review on bilevel optimization: From classical to evo-
lutionary approaches and applications,” IEEE Transactions on Evolutionary Computation,
vol. 22, no. 2, pp. 276–295, 2017.

[37] K. Deb and R. B. Agrawal, “Simulated binary crossover for continuous search space,” Complex

systems, vol. 9, no. 2, pp. 115–148, 1995.

[38] K. Deb, Multi-objective optimization using evolutionary algorithms. Wiley, 2005.
[39] L. Bobrowski and M. Kretowski, “Induction of multivariate decision trees by using dipolar
criteria,” in European Conference on Principles of Data Mining and Knowledge Discovery.
Springer, 2000, pp. 331–336.

[40] M. Kretowski and M. Grześ, “Global induction of oblique decision trees: an evolutionary
Springer, 2005, pp.

approach,” in Intelligent Information Processing and Web Mining.
309–318.

[41] K. Deb, Multi-objective optimization using evolutionary algorithms.

2001, vol. 16.

John Wiley & Sons,

[42] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority
over-sampling technique,” Journal of artiﬁcial intelligence research, vol. 16, pp. 321–357,
2002.

[43] V. Vapnik, The nature of statistical learning theory.

2013.

Springer science & business media,

142

[44] J.-P. Vert, K. Tsuda, and B. Schölkopf, “A primer on kernel methods,” Kernel methods in

computational biology, vol. 47, pp. 35–70, 2004.

[45] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic
algorithm: Nsga-ii,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–
197, 2002.

[46] K. Deb and A. Srinivasan, “Innovization: Innovating design principles through optimization,”
in Proceedings of the 8th annual conference on Genetic and evolutionary computation. ACM,
2006, pp. 1629–1636.

[47] K. Deb, L. Thiele, M. Laumanns, and E. Zitzler, “Scalable multi-objective optimization test
problems,” in Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat.
No. 02TH8600), vol. 1.

IEEE, 2002, pp. 825–830.

[48] Y. Dhebar, S. Gupta, and K. Deb, “Evaluating nonlinear decision trees for binary classiﬁcation

tasks with other existing methods,” ArXiv, vol. abs/2008.10753, 2020.

[49] C. M. Bishop, Pattern recognition and machine learning.
[50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.

springer, 2006.

[51] K. Larsen, “Gam: the predictive modeling silver bullet,” Multithreaded Stitch Fix, vol. 30,

pp. 1–27, 2015.

[52] K. Neshatian, M. Zhang, and P. Andreae, “A ﬁlter approach to multiple feature construction for
symbolic learning classiﬁers using genetic programming,” IEEE Transactions on Evolutionary
Computation, vol. 16, no. 5, pp. 645–661, 2012.

[53] A. Cano, A. Zafra, and S. Ventura, “An interpretable classiﬁcation rule mining algorithm,”

Information Sciences, vol. 240, pp. 1–20, 2013.

[54] I. De Falco, A. D. Cioppa, and E. Tarantino, “Discovering interesting classiﬁcation rules with

genetic programming,” Applied Soft Computing, vol. 1, no. 4, pp. 257–269, 2002.

[55] M. C. J. Bot and W. B. Langdon, “Application of genetic programming to induction of linear
classiﬁcation trees,” in European Conference on Genetic Programming. Springer, 2000, pp.
247–258.

[56] J. Eggermont, J. N. Kok, and W. A. Kosters, “Genetic programming for data classiﬁcation:
Partitioning the search space,” in Proceedings of the 2004 ACM symposium on Applied
computing, 2004, pp. 1001–1005.

[57] K. C. Tan, A. Tay, T. H. Lee, and C. M. Heng, “Mining multiple comprehensible classiﬁcation
rules using genetic programming,” in Proceedings of the 2002 Congress on Evolutionary
Computation. CEC’02, vol. 2.

IEEE, 2002, pp. 1302–1307.

143

[58] H. Iba, “Bagging, boosting, and bloating in genetic programming,” in Proceedings of the 1st

Annual Conference on Genetic and Evolutionary Computation, 1999, pp. 1053–1060.

[59] J. Blank and K. Deb, “pymoo - Multi-objective Optimization in Python,” https://pymoo.org.

[60] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimiza-

tion,” in International Conference on Machine Learning (ICML), 2015, pp. 1889–1897.

[61] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimiza-

tion algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[62] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra,
“Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971,
2015.

[63] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and
K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International
Conference on Machine Learning (ICML), 2016, pp. 1928–1937.

[64] R. S. Sutton, “Generalization in reinforcement learning: Successful examples using sparse
coarse coding,” in Advances in Neural Information Processing Systems 8, D. S. Touretzky,
M. C. Mozer, and M. E. Hasselmo, Eds. MIT Press, 1996, pp. 1038–1044.

[65] J. Peters, K. Mülling, and Y. Altun, “Relative entropy policy search,” in Proceedings of the
Twenty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI-10), vol. 10. Atlanta, 2010,
pp. 1607–1612.

[66] W. D. Smart and L. P. Kaelbling, “Practical reinforcement learning in continuous spaces,” in

International Conference on Machine Learning (ICML), 2000, pp. 903–910.

[67] R. Noothigattu, D. Bouneﬀouf, N. Mattei, R. Chandra, P. Madan, K. Varshney, M. Campbell,
M. Singh, and F. Rossi, “Interpretable multi-objective reinforcement learning through policy
orchestration,” arXiv preprint arXiv:1809.08343, 2018.

[68] F. Maes, R. Fonteneau, L. Wehenkel, and D. Ernst, “Policy search in a space of simple
closed-form formulas: towards interpretability of reinforcement learning,” in International
Conference on Discovery Science. Springer, 2012, pp. 37–51.

[69] D. Hein, S. Udluft, and T. A. Runkler, “Interpretable policies for reinforcement learning
by genetic programming,” Engineering Applications of Artiﬁcial Intelligence, vol. 76, pp.
158–169, 2018.

[70] G. Liu, O. Schulte, W. Zhu, and Q. Li, “Toward interpretable deep reinforcement learning with
linear model u-trees,” in Joint European Conference on Machine Learning and Knowledge
Discovery in Databases. Springer, 2018, pp. 414–429.

[71] A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri, “Programmatically interpretable

reinforcement learning,” arXiv preprint arXiv:1804.02477, 2018.

144

[72] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN’95-

International Conference on Neural Networks, vol. 4.

IEEE, 1995, pp. 1942–1948.

[73] D. Hein, A. Hentschel, T. Runkler, and S. Udluft, “Particle swarm optimization for generating
interpretable fuzzy reinforcement learning policies,” Engineering Applications of Artiﬁcial
Intelligence, vol. 65, pp. 87–98, 2017.

[74] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured predic-
tion to no-regret online learning,” in Proceedings of the Fourteenth International Conference
on Artiﬁcial Intelligence and Statistics, 2011, pp. 627–635.

[75] O. Bastani, Y. Pu, and A. Solar-Lezama, “Veriﬁable reinforcement learning via policy extrac-

tion,” in Advances in neural information processing systems (NIPS), 2018, pp. 2494–2504.

[76] O. Bastani, C. Kim, and H. Bastani, “Interpretability via model extraction,” arXiv preprint

arXiv:1706.09773, 2017.

[77] G. Vandewiele, O. Janssens, F. Ongenae, F. De Turck, and S. Van Hoecke, “Genesim: Genetic

extraction of a single, interpretable model,” arXiv preprint arXiv:1611.05722, 2016.

[78] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode reinforcement learning,”

Journal of Machine Learning Research, vol. 6, no. Apr, pp. 503–556, 2005.

[79] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commu-

nications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[80] A. Silva, M. Gombolay, T. Killian, I. Jimenez, and S.-H. Son, “Optimization methods for
interpretable diﬀerentiable decision trees applied to reinforcement learning,” in International
Conference on Artiﬁcial Intelligence and Statistics, 2020, pp. 1855–1865.

[81] Y. Dhebar and K. Deb, “Interpretable rule discovery through bilevel optimization of split-rules
of nonlinear decision trees for classiﬁcation problems,” arXiv preprint arXiv:2008.00410,
2020.

[82] S. Nageshrao, B. Costa, and D. Filev, “Interpretable approximation of a deep reinforcement
learning agent as a set of if-then rules,” in 18th IEEE International Conference On Machine
Learning And Applications (ICMLA).

IEEE, 2019, pp. 216–221.

[83] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning”

arXiv preprint arXiv:1509.06461, 2015.

[84] G. A. Rummery and M. Niranjan, “On-line q-learning using connectionist systems,” Uni-
versity of Cambridge, Department of Engineering Cambridge, UK, Tech. Rep. CUED/F-
INFENG/TR166, 1994.

[85] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics

and machine learning,” http://pybullet.org, 2016–2020.

145

[86] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau,
E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wil-
son, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey,
İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Hen-
riksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van
Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental Algorithms for Scientiﬁc
Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020.

[87] N. Subramanya, C. Bruno, and D. Filev, “Interpretable approximation of a deep reinforcement

learning agent as a set of if-then rules.”

146