TOPOLOGICAL DATA ANALYSIS AND MACHINE LEARNING FRAMEWORK FOR
             STUDYING TIME SERIES AND IMAGE DATA
                                      By
                              Melih Can Yesilli
                            A DISSERTATION
                                Submitted to
                        Michigan State University
                 in partial fulfillment of the requirements
                              for the degree of
              Mechanical Engineering – Doctor of Philosophy
                                     2022


                                         ABSTRACT
 TOPOLOGICAL DATA ANALYSIS AND MACHINE LEARNING FRAMEWORK FOR
                       STUDYING TIME SERIES AND IMAGE DATA
                                               By
                                        Melih Can Yesilli
     The recent advancements in signal acquisition and data mining have revealed the impor-
tance of data-driven tools for analyzing signals and images. The availability of large and
complex data has also highlighted the need for investigative tools that provide autonomy,
noise-robustness, and efficiently utilize data collected from different settings but pertaining
to the same phenomenon. State-of-the-art approaches include using tools such as Fourier
analysis, wavelets, and Empirical Mode Decomposition for extracting informative features
from the data. These features can then be combined with machine learning for clustering,
classification, and inference. However, these tools typically require human intervention for
feature extraction, and they are sensitive to the input parameters that the user chooses
during the laborious but often necessary manual data pre-processing. Therefore, this dis-
sertation was motivated by the need for automatic, adaptive, and noise-robust methods for
efficiently leveraging machine learning for studying images as well as time series of dynami-
cal systems. Specifically, this work investigates three application areas: chatter detection in
manufacturing processes, image analysis of manufactured surfaces, and tool wear detection
during titanium alloys machining. This work’s novel investigations are enabled by combining
machine learning with methods from Topological Data Analysis (TDA), a relatively recent
field of applied topology that encompasses a variety of mature tools for quantifying the shape
of data.
     First, this study experimentally shows for the first time that persistent homology (or
persistence) from TDA can be used for chatter classification with accuracies that rival exist-
ing detection methods. Further, the efficient use of chatter data sets from different sources
is formulated and studied as a transfer learning problem using experimental turning and


milling vibration signals. Classification results are shown using comparisons between the
TDA pipeline developed in this dissertation and prominent methods for chatter detection.
    Second, this work describes how to utilize TDA tools for extracting descriptive features
from simulated samples generated using different Hurst roughness exponents. The efficiency
of the feature extraction is tested by classifying the surfaces according to their roughness
level. The resulting accuracies show that TDA can outperform several traditional feature
extraction approaches in surface texture analysis. Further, as part of this work, adaptive
threshold selection algorithms are developed for Discrete Cosine Transform, and Discrete
Wavelet Transform to bypass the need for subjective operator input during surface roughness
analysis. Both experimental and synthetic data sets are used to test the effectiveness of these
two algorithms. This study also discusses a TDA-based framework that can potentially
provide a feasible approach for building an automatic surface finish monitoring system.
    Finally, this work shows that persistence can be used for tool condition monitoring during
titanium alloys machining. Since, in these processes, the cutting tools typically fracture
catastrophically before the gradual tool wear reaches the maximum tool life criteria, the
industry uses very conservative criteria for replacing the tools. An extensive experiment is
described for relating wear markers in various sensor signals to the tool condition at different
stages of the tool life. This work shows how, in this setting, TDA provides significant
advantages in terms of robustness to noise and alleviating the need for an expert user to
extract the informative features. The obtained TDA-based features are compared to existing
state-of-the-art featurization tools using feature-level data fusion. The temporal location of
the most representative tool condition features is also studied in the signals by considering
a variety of window lengths preceding tool wear milestones.


To my beloved family
        iv


                                ACKNOWLEDGEMENTS
    First, I owe my deepest gratitude to my parents, Günay and İsmail, for their endless
support throughout my entire life. My brother, Semih, has been one of my best friends. I
would not get through this journey without his support and encouragement.
    I would like to thank my advisor Dr. Firas Khasawneh for his support and mentorship
over the past four years. His guidance turned me into an independent researcher, and
I have learned a lot from him over the past four years. I enjoyed our weekly meetings,
where we brainstormed about a new problem and lost track of time. I thank my committee
members, Dr. Elizabeth Munch, Prof. Brian Feeny, and Dr. Zhaojian Li, for their guidance
and support over the last four years. I am also grateful for the patience and feedback of
my collaborators Dr. Andreas Otto, Prof. Brain Mann, Dr. Yang Guo, and Prof. Patrick
Kwon. I also would like to thank my labmates, Audun Myers and Max Chumley, for their
collaboration and contribution.
    Finally, this research would not be possible without the financial support of the National
Science Foundation (NSF) and Michigan State University.
                                               v


                              TABLE OF CONTENTS
LIST OF TABLES      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   x
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxi
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . .                1
CHAPTER 2 CHATTER DIAGNOSIS USING MACHINE LEARNING . . . . . .                              4
   2.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
       2.1.1 Transfer Learning Approaches for Machining . . . . . . . . . . . . . .        10
       2.1.2 Main Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . .     12
   2.2 Supervised Classification Algorithms . . . . . . . . . . . . . . . . . . . . . .    15
       2.2.1 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . .      15
       2.2.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . .   16
       2.2.3 Random Forest Classification . . . . . . . . . . . . . . . . . . . . . .      18
       2.2.4 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . .     20
   2.3 Signal Decomposition Based Approach . . . . . . . . . . . . . . . . . . . . .       20
       2.3.1 Wavelet Packet Transform . . . . . . . . . . . . . . . . . . . . . . . .      22
              2.3.1.1 Selection of Informative Wavelet Packets . . . . . . . . . . .       23
              2.3.1.2 Recursive Feature Elimination (RFE) . . . . . . . . . . . . .        28
       2.3.2 Ensemble Empirical Mode Decomposition . . . . . . . . . . . . . . .           28
              2.3.2.1 Selection of Informative Intrinsic Mode Function . . . . . .         31
              2.3.2.2 Feature Extraction Using EEMD . . . . . . . . . . . . . . .          32
       2.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
              2.3.3.1 Wavelet Packet Transform with RFE . . . . . . . . . . . . .          33
              2.3.3.2 Ensemble Empirical Mode Decomposition with RFE . . . .               36
   2.4 Traditional Time and Frequency Domain Features . . . . . . . . . . . . . . .        39
       2.4.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . .    39
       2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
   2.5 Similarity Measures of Time Series . . . . . . . . . . . . . . . . . . . . . . .    44
       2.5.1 Dynamic Time Warping (DTW) . . . . . . . . . . . . . . . . . . . . .          44
       2.5.2 K-Nearest Neighbor (KNN) . . . . . . . . . . . . . . . . . . . . . . .        46
       2.5.3 Similarity matrices and classification . . . . . . . . . . . . . . . . . .    47
       2.5.4 Approximate and Eliminate Search Algorithm (AESA) . . . . . . . .             48
       2.5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
              2.5.5.1 Classification Results for Dynamic Time Warping (DTW) .              53
              2.5.5.2 Parallel Computing . . . . . . . . . . . . . . . . . . . . . . .     58
              2.5.5.3 Approximate and Eliminate Search Algorithm Results . . .             61
   2.6 Topological Data Analysis (TDA) Based Approach . . . . . . . . . . . . . .          65
       2.6.1 Topological Data Analysis . . . . . . . . . . . . . . . . . . . . . . . .     65
                                             vi


             2.6.1.1 Simplicial complexes . . . . . . . . . . . . . . . . . . . . . .       67
             2.6.1.2 Persistent Homology . . . . . . . . . . . . . . . . . . . . . .        67
      2.6.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      69
      2.6.3 Persistence Diagram Computation . . . . . . . . . . . . . . . . . . . .         70
             2.6.3.1 Delay Reconstruction . . . . . . . . . . . . . . . . . . . . . .       70
             2.6.3.2 Persistence Diagram Computation with Bézier Curve Ap-
                       proximation . . . . . . . . . . . . . . . . . . . . . . . . . . .    72
      2.6.4 Feature Extraction Using Persistence Diagrams . . . . . . . . . . . .           76
             2.6.4.1 Persistence Landscapes . . . . . . . . . . . . . . . . . . . . .       77
             2.6.4.2 Persistence Images . . . . . . . . . . . . . . . . . . . . . . .       80
             2.6.4.3 Carlsson Coordinates . . . . . . . . . . . . . . . . . . . . . .       83
             2.6.4.4 Kernels for Persistence Diagrams . . . . . . . . . . . . . . .         84
             2.6.4.5 Persistence Paths’ Signatures . . . . . . . . . . . . . . . . .        85
             2.6.4.6 Template Functions . . . . . . . . . . . . . . . . . . . . . .         87
      2.6.5 Modeling of Milling Process . . . . . . . . . . . . . . . . . . . . . . .       87
      2.6.6 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .      89
      2.6.7 Experimental Data Results . . . . . . . . . . . . . . . . . . . . . . . .       95
             2.6.7.1 Runtime Comparison . . . . . . . . . . . . . . . . . . . . . .         95
             2.6.7.2 Classification Scores . . . . . . . . . . . . . . . . . . . . . .      99
  2.7 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  103
      2.7.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    105
             2.7.1.1 Wavelet Packet Transform (WPT) . . . . . . . . . . . . . .            107
             2.7.1.2 Ensemble Empirical Mode Decomposition (EEMD) . . . . .                108
             2.7.1.3 Fast Fourier Transform (FFT), Power Spectral Density
                       (PSD) and Auto-correlation Function (ACF) . . . . . . . . .         109
      2.7.2 Transfer Learning Results . . . . . . . . . . . . . . . . . . . . . . . .      111
             2.7.2.1 Results of Transfer Learning Applications Between the Over-
                       hang Distance Cases of Turning Data Set . . . . . . . . . .         112
             2.7.2.2 Results of Transfer Learning Applications Between Turn-
                       ing and Milling Data Sets . . . . . . . . . . . . . . . . . . .     116
             2.7.2.3 Transfer Learning Using Deep Learning . . . . . . . . . . .           119
  2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
CHAPTER 3 DATA DRIVEN PARAMETER IDENTIFICATION FOR A CHAOTIC
             PENDULUM WITH VARIABLE INTERACTION POTENTIAL . . .                            131
  3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
  3.2 Modeling and Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . .      134
  3.3 Sparse Identification of Nonlinear Dynamics (SINDy) . . . . . . . . . . . . .        136
  3.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   138
      3.4.1 Total Variation Regularization (TVR) . . . . . . . . . . . . . . . . .         138
             3.4.1.1 Noise-free Case . . . . . . . . . . . . . . . . . . . . . . . . .     139
             3.4.1.2 Regression Overfitting Check . . . . . . . . . . . . . . . . .        140
             3.4.1.3 Noise Effects . . . . . . . . . . . . . . . . . . . . . . . . . .     143
             3.4.1.4 Parameter Sensitivity in TVR via an Example . . . . . . . .           144
      3.4.2 Alternative Derivative Estimation Methods . . . . . . . . . . . . . . .        145
                                            vii


      3.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
  3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
CHAPTER 4 ANALYSIS OF ENGINEERING SURFACES USING TOPOLOGI-
             CAL DATA ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . .                151
  4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    151
  4.2 Data-driven and Automatic Surface Texture Analysis Using Persistent Ho-
      mology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    155
      4.2.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      155
      4.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         156
             4.2.2.1 Gaussian Filtering . . . . . . . . . . . . . . . . . . . . . . .         156
             4.2.2.2 Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . .           158
             4.2.2.3 Topological Data Analysis (TDA) . . . . . . . . . . . . . . .            161
      4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     163
  4.3 Automated Surface Texture Analysis via Discrete Cosine Transform and
      Discrete Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . .        166
      4.3.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . .        166
      4.3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         168
             4.3.2.1 Discrete Wavelet Transform . . . . . . . . . . . . . . . . . .           168
             4.3.2.2 Discrete Cosine Transform . . . . . . . . . . . . . . . . . . .          172
             4.3.2.3 Feature Extraction from Surfaces . . . . . . . . . . . . . . .           175
      4.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     176
  4.4 Surface Finish Monitoring Using Persistent Homology . . . . . . . . . . . . .           182
      4.4.1 Topological Saliency and Simplification . . . . . . . . . . . . . . . . .         182
      4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     185
  4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    186
CHAPTER 5 TOOL WEAR IDENTIFICATION USING MACHINE LEARNING                                   . 190
  5.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   190
  5.2 Experimental Procedure and Data Cleaning . . . . . . . . . . . . . . . . .          .   195
      5.2.1 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   197
      5.2.2 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   199
  5.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   200
      5.3.1 Discrete Wavelet Transform . . . . . . . . . . . . . . . . . . . . . .        .   200
      5.3.2 Ensemble Empirical Mode Decomposition . . . . . . . . . . . . . .             .   203
      5.3.3 Topological Data Analysis . . . . . . . . . . . . . . . . . . . . . . .       .   204
  5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   206
      5.4.1 Discrete Wavelet Transform Results . . . . . . . . . . . . . . . . . .        .   208
      5.4.2 Ensemble Empirical Mode Decomposition Results . . . . . . . . . .             .   211
      5.4.3 Topological Data Analysis Based Approach Results . . . . . . . . .            .   212
  5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  .   217
CHAPTER 6 CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . 221
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
                                            viii


   APPENDIX A    CUTTING EXPERIMENTS . . . . . . . . . . . . . . . . . . 225
   APPENDIX B    CLASSIFICATION RESULTS FOR CHATTER DETECTION 234
   APPENDIX C    EXPERIMENTAL PROCEDURE FOR ENGINEERING SUR-
                 FACE SCANS . . . . . . . . . . . . . . . . . . . . . . . . . . 254
   APPENDIX D    CLASSIFICATION RESULTS FOR TOOL WEAR ANALYSIS 256
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
                                        ix


                                     LIST OF TABLES
Table 2.1: The chatter frequency ranges, the informative wavelet packets, and the
            informative IMFs corresponding to each overhang distance of the cutting
            tool for the turning experiments. . . . . . . . . . . . . . . . . . . . . . . .      26
Table 2.2: Comparison between predicted and selected informative wavelet packet
            number for all overhang distances cases of the turning experiment. Pre-
            dicted wavelet packets are decided with by overlapping the chatter fre-
            quency with the wavelet packet frequency range obtained from the WPT
            tree (Figure 2.7) for level 4. . . . . . . . . . . . . . . . . . . . . . . . . . .   27
Table 2.3: Time domain features (a1 , . . . , a10 ) and frequency domain features (a11,...,14 ). 29
Table 2.4: Time domain features for the intrinsic mode functions ci (tk ). The pa-
            rameters tk and c¯i represent, respectively, the kth discrete time and the
            mean of ith IMF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       33
Table 2.5: Comparison of the classification results obtained with SVM classifier and
            run times for each chatter detection method. Given run times include
            feature computation and classification for EEMD and level 4 WPT. . . .               36
Table 2.6: Results obtained by using Level 1 WPT and EEMD feature extraction
            methods with four different classifiers. . . . . . . . . . . . . . . . . . . . .     36
Table 2.7: Results obtained by using Level 2 WPT and EEMD feature extraction
            methods with four different classifiers. . . . . . . . . . . . . . . . . . . . .     37
Table 2.8: Results obtained with the signal processing feature extraction method
            for four overhang lengths with four different classifiers. . . . . . . . . . . .     41
Table 2.9: Cross validated results obtained with signal processing feature extraction
            method for four overhang lengths with four different classifiers. . . . . . .        42
Table 2.10: Comparison of classification accuracy and the time required to compute
            the distance between two time series for cDTW and FastDTW packages. .                54
Table 2.11: Comparison of results for similarity-based methods with their counter-
            parts available in the literature. . . . . . . . . . . . . . . . . . . . . . . .     54
Table 2.12: The best accuracy results for three class classification with the DTW
            approach and the corresponding number of nearest neighbors used in the
            KNN algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       58
                                                 x


Table 2.13: Time (seconds) comparison between similarity measure method and its
            counterparts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 2.14: Classification time of one test sample for traditional way, parallel comput-
            ing, and AESA algorithm and corresponding average accuracy obtained
            from Table 2.11 and Figure 2.27. . . . . . . . . . . . . . . . . . . . . . . .    64
Table 2.15: Feature matrix for persistence landscapes λ1 and λ3 corresponding to
            persistence diagrams X1 through Xn . The entries in the cells are the
            values of each of the features. . . . . . . . . . . . . . . . . . . . . . . . . . 79
Table 2.16: Feature matrix for persistence images. . . . . . . . . . . . . . . . . . . . .    82
Table 2.17: Feature matrix for path signatures for n persistence diagrams and using
            the first λ1 and second λ2 persistence landscapes. . . . . . . . . . . . . . .    86
Table 2.18: 2 Class classification results for noisy (SNR:20,25,30 dB) and non-noisy
            data sets which belong to downmilling process with N = 4 (N: Teeth
            Number, CC: Carlsson Coordinates, TF: Template Functions, SVM: Sup-
            port Vector Machine, LR: Logistic Regression, H1 : 1D persistence, H2 :
            2D persistence). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  91
Table 2.19: 2 Class classification results for noisy (SNR:20,25,30 dB) and non-noisy
            data sets which belong to upmilling process with N = 4 (N: Teeth Num-
            ber, CC: Carlsson Coordinates, TF: Template Functions, SVM: Support
            Vector Machine, LR: Logistic Regression, H1 : 1D persistence, H2 : 2D
            persistence). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Table 2.20: Comparison of runtimes (seconds) for embedding parameters and persis-
            tence diagram computation of all overhang cases with parallel and serial
            computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  96
Table 2.21: Runtime (seconds) for embedding parameter computation and persis-
            tence diagram computation of a single time series with different methods.         97
Table 2.22: Runtime (seconds) for performing classification with a single time series
            for TDA-based methods and signal decomposition-based ones. . . . . . . .          98
Table 2.23: Comparison of results for each method where WPT is the Wavelet Packet
            Transform, and EEMD stand for Ensemble Empirical Mode Decomposi-
            tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Table 2.24: Features and classifiers used for three main category of approaches. SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
                                                xi


Table 2.25: The number of times when a selected method gives the highest accuracy
            out of 12 different applications between the cases of turning data set is
            denoted with BM. The number of times when a method is in the error
            band of the highest accuracy is denoted with MIEB. These two numbers
            are provided for accuracy and F1-score. . . . . . . . . . . . . . . . . . . . 115
Table 2.26: The number of times when a selected method gives the highest accuracy
            out of 8 different applications between the cases of turning data set and
            the milling data set is denoted with BM. The number of times when
            a method is in the error band of the highest accuracy is denoted with
            MIEB. These two numbers are provided for accuracy and F1-score. . . . . 118
Table 3.1: Part list for the experimental setup. . . . . . . . . . . . . . . . . . . . . . 135
Table 3.2: Coefficients found by sparse regression for the simple pendulum based
            on TVR derivative estimations. . . . . . . . . . . . . . . . . . . . . . . . . 140
Table 3.3: Coefficients used in the simulation versus those estimated by SINDy.
            Shaded cells highlight the matching coefficients. . . . . . . . . . . . . . . . 143
Table 4.1: The features used in the classification of surfaces and surface profiles.
            f (x) and f (x, y) represent surface profile and surface, respectively. . . . . 176
Table 5.1: The ranges for wear amounts used in labeling of time series data. . . . . . 200
Table A.1: Case numbers before and after splitting (in parenthesis) the time series
            and overall amount of tagged data for each stickout case. . . . . . . . . . 230
Table A.2: Cutting parameters for the turning and milling experiments.          . . . . . . . 230
Table B.1: Results obtained with Level 1 Wavelet Packet Transform feature extrac-
            tion for 2, 2.5, 3.5, and 4.5 inch stickout cases. . . . . . . . . . . . . . . . 235
Table B.2: Results obtained with Level 2 Wavelet Packet Transform feature extrac-
            tion for 2, 2.5, 3.5 and 4.5 inch stickout cases. . . . . . . . . . . . . . . . . 236
Table B.3: Results obtained with Level 3 Wavelet Packet Transform feature extrac-
            tion for 2, 2.5, 3.5 and 4.5 inch stickout cases. . . . . . . . . . . . . . . . . 237
Table B.4: Results obtained with Level 4 Wavelet Packet Transform feature extrac-
            tion for 2, 2.5, 3.5 and 4.5 inch stickout cases. . . . . . . . . . . . . . . . . 238
Table B.5: Results obtained with the EEMD feature extraction method for 2, 2.5,
            3.5, and 4.5 inch stickout cases. . . . . . . . . . . . . . . . . . . . . . . . . 239
                                                xii


Table B.6: Two class classification (chatter and stable) results obtained with the
            DTW similarity measure method for 2, 2.5, 3.5, and 4.5 inch overhang cases.239
Table B.7: Two class classification (chatter including intermediate chatter cases as
            chatter and stable) results obtained with the DTW similarity measure
            method for 2, 2.5, 3.5, and 4.5 inch overhang cases. . . . . . . . . . . . . . 240
Table B.8: Three-class classification is applied with the DTW approach. . . . . . . . 240
Table B.9: Persistence Landscape Results with Support Vector Machine (SVM) clas-
            sifier. The landscape number column represents which landscapes were
            used to extract features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Table B.10: Persistence Landscape Results with Logistic Regression (LR) classifier.
            The landscape number column represents which landscapes were used to
            extract features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Table B.11: Persistence Landscape Results with Random Forest (RF) classifier. The
            landscape number column represents which landscapes were used to ex-
            tract features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Table B.12: Persistence Images results with pixel size = 0.1. . . . . . . . . . . . . . . . 242
Table B.13: Persistence Images results with pixel size = 0.05. . . . . . . . . . . . . . . 243
Table B.14: Carlsson Coordinates results. . . . . . . . . . . . . . . . . . . . . . . . . . 243
Table B.15: Kernel Method results with LibSVM package. . . . . . . . . . . . . . . . . 243
Table B.16: Path signature method results obtained with SVM classifier. Landscape
            numbers represent which landscapes were used to compute signatures. . . 244
Table C.1: Selected surfaces under various machining conditions. . . . . . . . . . . . 254
                                              xiii


                                      LIST OF FIGURES
Figure 2.1: Categorization of feature extraction methods and classifiers used for
             chatter detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
Figure 2.2: Selected optimal hyperplane for linearly separable two class data set. . .        17
Figure 2.3: Linear (a) and logistic (b) regression onto data set whose output is binary. 17
Figure 2.4: Generation of decision tree. . . . . . . . . . . . . . . . . . . . . . . . . .    19
Figure 2.5: Random Forest Classification (vote0 :Final decision of the corresponding
             tree is class 0. vote1 :Final decision of the corresponding tree is class 1.). . 19
Figure 2.6: Overview of the Wavelet Packet Transform (WPT) method with Recur-
             sive Feature Elimination (WPT). . . . . . . . . . . . . . . . . . . . . . .      21
Figure 2.7: 3 Level Wavelet Packet Transform. . . . . . . . . . . . . . . . . . . . . .       23
Figure 2.8: Time domain and frequency domain of stable (a,b), intermediate (c,d),
             and chatter (e,f) regions for overhang distance of 5.08 cm (2 inch), 320
             rpm, 0.002 inch depth of cut case of turning experiments. . . . . . . . . .      24
Figure 2.9: Energy ratios of wavelet packets for two cases of turning experiment:
             (top) 5.08 cm (2 inch) stickout, 320 rpm and 0.0127 cm (0.005 inch)
             DOC, and (bottom) 5.08 cm (2 inch) stickout, 570 rpm and 0.00508 cm
             (0.002 inch) DOC. Note the differences in the scale of the vertical axis. .      25
Figure 2.10: The spectrum of the first four wavelet packets of level 4 WPT for in-
             termediate chatter (a),(c) and chatter (b),(d) in the case with 5.08 cm
             (2 inch) overhang distance. The spindle speed and depth of cut is 320
             rpm and 0.0127 cm (0.005 inch) in (a), (b) and 570 rpm and 0.00508
             cm (0.002 inch) in (c), (d), respectively. . . . . . . . . . . . . . . . . . . . 27
Figure 2.11: The original time series and the corresponding intrinsic mode functions
             (IMFs) for the case of 5.08 cm (2 inch) stickout, 320 rpm, and 0.0127
             cm (0.005 inch) DOC. . . . . . . . . . . . . . . . . . . . . . . . . . . . .     31
Figure 2.12: The spectrum of each intrinsic mode function (IMF) for the case of 5.08
             cm (2 inch) stickout, 320 rpm and 0.0127 cm (0.005 inch) DOC. . . . . .          32
Figure 2.13: Bar plot for feature ranking for 5.08 cm (2 inch) stickout case at level 4
             WPT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   34
                                                xiv


Figure 2.14: Level 4 Wavelet Packet Transform(WPT) feature extraction method
             results for all stickout cases. (a) 5.08 cm (2 inch), (b) 6.35 cm (2.5
             inch), (c) 8.89 cm (3.5 inch), and (d) 11.43 cm (4.5 inch). . . . . . . . . .        35
Figure 2.15: Bar plot including the error bars of the classification results for Level 1
             WPT, Level 2 WPT and EEMD with four different classifier. a) 5.08
             cm (2 inch) stickout size, b)6.35 cm (2.5 inch) stickout size, c)8.89 cm
             (3.5 inch) stickout size, d) 11.43 cm (4.5 inch) stickout size. . . . . . . . .      37
Figure 2.16: First five peaks in the Fast Fourier Transform (FFT) plot of the time se-
             ries of acceleration signal collected from cutting configuration with 5.08
             (2 in) cm overhang length, 320 rpm rotational speed of the spindle, and
             0.0127 cm (0.005 inc) depth of cut. (Minimum Peak Distance (MPD):
             500 and 2500). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       40
Figure 2.17: Comparison of classification results of Level 4 WPT, EEMD, and FFT/PS-
             D/ACF methods based on four classifiers with Recursive Feature Elimi-
             nation (RFE). a) 5.08 cm (2 inch) overhang length b) 6.35 cm (2.5 inch)
             overhang length c) 8.89 cm (3.5 inch) overhang length d) 11.43 cm (4.5
             inch) overhang length. . . . . . . . . . . . . . . . . . . . . . . . . . . . .       41
Figure 2.18: Bar plot for feature ranking obtained from SVM-RFE for 5.08 cm (2
             inch) overhang case with FFT/PSD/ACF based method. (f1 , . . . , f4 ),
             (f5 , . . . , f8 ) and (f9 , . . . , f12 ) belong to features obtained from peaks of
             FFT, PSD and ACF respectively. . . . . . . . . . . . . . . . . . . . . . .           43
Figure 2.19: DTW alignment (a) and warping path(b) for two different time series. . .             45
Figure 2.20: K-Nearest Neighbor classification example for two-class classification. . .          46
Figure 2.21: Illustration of elimination and approximation steps of AESA. Count
             refers to the number of distance computations made in the algorithm. . .             50
Figure 2.22: Illustration for elimination criteria. . . . . . . . . . . . . . . . . . . . . .     51
Figure 2.23: Effect of H on the elimination criteria and accuracy of the classification.          52
Figure 2.24: The heat map of averages DTW distances of time series belongs to three
             classes for the 5.08 cm (2 inch) case. . . . . . . . . . . . . . . . . . . . .       56
Figure 2.25: The heat map of averages DTW distances of time series belongs to three
             classes for the 6.35 cm (2.5 inch) case. . . . . . . . . . . . . . . . . . . .       57
Figure 2.26: Histograms for looseness constant of all combinations of three different
             time series for all overhang distance. . . . . . . . . . . . . . . . . . . . .       62
                                                           xv


Figure 2.27: Classification accuracy (%) (solid red line) and the average number of
             DTW computations (green dashed line) for varying looseness constant (H). 63
Figure 2.28: Formation of simplicial complexes from a point cloud. . . . . . . . . . . .    67
Figure 2.29: Generation of persistence diagrams using The Rips Complex. . . . . . . .       68
Figure 2.30: Pipeline for feature extraction using topological features of data. . . . . .  70
Figure 2.31: Persistence diagram computation steps. . . . . . . . . . . . . . . . . . . .   70
Figure 2.32: Illustration showing Bézier curve fit and generation of line segments. . .     74
Figure 2.33: Comparison of persistence diagrams computed with the approximation
             for varying r to true diagrams obtained with Ripser. All diagrams
             belong to time series number 51 of the 8.89 cm (3.5 inch) case, and they
             are computed in H1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Figure 2.34: The Bottleneck distance between the diagram with blue color in Fig-
             ure 2.33 and the approximated diagrams with green color in Figure 2.33.        76
Figure 2.35: A schematic showing the process of obtaining the landscape functions
             from a persistence diagram. . . . . . . . . . . . . . . . . . . . . . . . . .  77
Figure 2.36: Persistence landscape feature extraction. . . . . . . . . . . . . . . . . . .  79
Figure 2.37: Steps for persistence image computation. . . . . . . . . . . . . . . . . . .   81
Figure 2.38: Milling process illustrations. a) Upmilling b) Downmilling . . . . . . . .     88
Figure 2.39: Stability criteria used in this study based on the eigenvalues of the
             monodromy matrix U. . . . . . . . . . . . . . . . . . . . . . . . . . . . .    90
Figure 2.40: Success and failure of two class classifications performed with Template
             Function feature matrices and Gradient Boosting algorithm for test set
             of data set without noise and with SNR value of 25 dB. a) Classifica-
             tion with 1D persistence features for non-noisy data set, b) Classification
             with 2D persistence features for non-noisy data set, c)Classification with
             1D-2D persistence combined features for non-noisy data set, d) Classifi-
             cation with 1D persistence features for noisy data set with SNR:25 dB,
             e) Classification with 2D persistence features for noisy data set with
             SNR:25 dB, f) Classification with 1D-2D persistence combined features
             for noisy data set with SNR:25 dB. . . . . . . . . . . . . . . . . . . . . .   92
                                              xvi


Figure 2.41: Mean accuracies of downmilling process (a,b) and upmilling process
             (f,g) obtained for two class and three class classification performed with
             Carlsson Coordinates and Template Functions for non-noisy and noisy
             data sets where teeth number is 4. Two class(c) and three class (d) clas-
             sification results obtained with Gradient Boosting algorithm is shown
             on the stability diagram for downmilling simulation data set whose SNR
             is 25 dB. Two class(e) and three class (h) classification results obtained
             with Gradient Boosting algorithm are shown on the stability diagram
             for upmilling simulation data set whose SNR is 25 dB. . . . . . . . . . .          94
Figure 2.42: Sample persistence diagrams for each overhang distance. . . . . . . . . . 101
Figure 2.43: Classification performance of persistence diagrams obtained with differ-
             ent methods for all overhang cases. . . . . . . . . . . . . . . . . . . . . . 102
Figure 2.44: An example of transfer learning where training for chatter detection is
             performed using a turning process (the source) and the gained knowl-
             edge is imported via transfer learning to a milling operation (the target).       104
Figure 2.45: Categorization of transfer learning. . . . . . . . . . . . . . . . . . . . . . 105
Figure 2.46: Outline of the general procedure and the featurization methods used in
             this study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Figure 2.47: Transfer learning approach used in this chapter for feature extraction
             and similarity measure-based approaches. . . . . . . . . . . . . . . . . . . 106
Figure 2.48: The spectrum of three different time series from milling experiment:
             (left) 13227 rpm, 2.54 mm depth of cut (doc), unstable,(mid) 16861
             rpm, 1.905 mm doc, stable, (right) 27285 rpm, 1.905 doc, unstable. . . . 108
Figure 2.49: Energy ratio of the wavelet packets obtained from decomposition of the
             three time series whose spectrum is provided in Figure 2.48. (Blue bars)
             Milling - Unstable - RPM=13227 - DOC = 2.54mm. (Red bars) Milling
             - Stable - RPM=16861 - DOC = 0.38 mm. (Orange bars) Milling -
             Unstable - RPM=27285 - DOC = 3.556 mm. . . . . . . . . . . . . . . . . 108
Figure 2.50: The spectrum of reconstructed signals from the first four wavelet packets
             of three different time series whose spectrum is shown in Figure 2.48.
             (First row) Milling, 13300 rpm, 2.54 mm depth of cut (doc), unstable,
             (second row) milling, 17300 rpm, 0.3810 mm doc, stable, and (last row)
             milling, 28000 rpm, 3.5560 mm doc, unstable. . . . . . . . . . . . . . . . 109
Figure 2.51: Intrinsic mode functions and their spectrum for the time series with
             11210 rpm and 3.556 mm depth of cut from milling experiments. . . . . . 110
                                               xvii


Figure 2.52: Effect of different MPD values on selected peaks in FFT and ACF plots
             of time series with RPM=13300 and DOC=2.54 mm from milling exper-
             iments. (Top) FFT plot and selected peaks with MPD=2500 (top) and
             MPD=500 (middle). Auto-correlation function with MPD=500 (bot-
             tom). Orange lines represent the MPH. . . . . . . . . . . . . . . . . . . . 111
Figure 2.53: The highest accuracy out of four different classifiers (or out of selected
             numbers of nearest neighbor for DTW) for each approach used in trans-
             fer learning applications between overhang distance cases of turning
             experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Figure 2.54: The classification results are obtained from the selected methods when
             training and testing are performed between the overhang distance cases
             of the turning data set. The selected methods that give the highest
             accuracy are represented with the ’◦’ bar hatch, and the ones that are
             in the error band of the highest accuracy are shown with the ‘/’ bar
             hatch. One approach is selected from each category of the methods,
             and these are Wavelet Packet Transform (WPT), Carlsson Coordinates
             (TDA-CC), and Dynamic Time Warping (DTW). . . . . . . . . . . . . . 115
Figure 2.55: The highest F1-Score out of four different classifiers (or out of selected
             numbers of nearest neighbor for DTW) for each approach used in trans-
             fer learning applications between overhang distance cases of turning
             experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Figure 2.56: F1 scores obtained from the selected methods when training and test-
             ing are performed between the overhang distance cases of the turning
             data set. The selected methods that give the highest accuracy are rep-
             resented with the ’◦’ bar hatch, and the ones that are in the error band
             of the highest accuracy are shown with the ‘/’ bar hatch. One approach
             is selected from each category of the methods, and these are Wavelet
             Packet Transform (WPT), Carlsson Coordinates (TDA-CC), and Dy-
             namic Time Warping (DTW). . . . . . . . . . . . . . . . . . . . . . . . . 117
Figure 2.57: The highest accuracy out of four different classifiers (or out of selected
             numbers of nearest neighbor for DTW) for each approach used in trans-
             fer learning applications between overhang distance cases of turning and
             milling experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
                                              xviii


Figure 2.58: The classification results obtained from the selected methods when train-
             ing and testing are performed between the overhang distance cases of
             the turning data set and the milling data set. The selected methods that
             give the highest accuracy are represented with the ’◦’ bar hatch, and
             the ones that are in the error band of the highest accuracy are shown
             with the ‘/’ bar hatch. One approach is selected from each category of
             the methods, and these are FPA, TDA-CC, and DTW. . . . . . . . . . . 120
Figure 2.59: The classification accuracies obtained using Carlsson Coordinates and
             Persistence Images features with ANN algorithms for the transfer learn-
             ing between the cases of turning experiments. . . . . . . . . . . . . . . . 121
Figure 2.60: The classification accuracies obtained using Carlsson Coordinates and
             Persistence Images features with ANN algorithms for the transfer learn-
             ing between the milling and turning experiments. . . . . . . . . . . . . . 122
Figure 3.1: Experimental Setup. See Table3.1 for parts 1-9. A: Photo-interrupter
             blocker, B:Optional Magnet hole, C:Scotch Yoke. . . . . . . . . . . . . . 135
Figure 3.2: Simple pendulum simulation vs SINDy approximation based on TVR
             derivatives. (c = 0.5 Ns/m, m = 2 kg and L=1). . . . . . . . . . . . . . . 140
Figure 3.3: Simple pendulum training and test set predictions. . . . . . . . . . . . . 141
Figure 3.4: Estimation of the chaotic pendulum response with SINDy when TVR
             derivatives (top) and simulation derivatives (middle) are used. The
             bottom panel compares the derivatives of the simulation to their TVR
             counterpart. (TVR parameters: α = 2 × 10-5 and ϵ = 1012 ). . . . . . . . 142
Figure 3.5: Estimation of the simple pendulum response based on SINDy and sim-
             ulation data with SNR = 20 dB. . . . . . . . . . . . . . . . . . . . . . . . 143
Figure 3.6: Estimated response of the chaotic pendulum using TVR derivatives with
             α = 100, ϵ = 1012 , and SNR = 35 dB. . . . . . . . . . . . . . . . . . . . . 144
Figure 3.7: Number of nonzero coefficients estimated by SINDy for Lorenz system
             with parameters: σ = 10, β = 2.667, ρ = 28. Initial conditions: x0 =
             −8, y0 = 8 and z0 = 27. ϵ = 1012 (left), ϵ = 10-3 (right). The correct
             number of nonzero terms is N = 7. . . . . . . . . . . . . . . . . . . . . . 145
Figure 3.8: Estimated responses for the simulated chaotic pendulum. . . . . . . . . . 147
Figure 3.9: Estimated responses for the simulated chaotic pendulum with noise
             (SNR=35 dB). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
                                               xix


Figure 3.10: Estimated model responses for experimental data of chaotic pendulum
             (Top) and a zoomed-in version (bottom). . . . . . . . . . . . . . . . . . . 149
Figure 4.1: The roughest and smoothest surface in the synthetic data set obtained
             with H = 0 and H = 1, respectively. . . . . . . . . . . . . . . . . . . . . 155
Figure 4.2: Filtered surfaces obtained using kernel sizes 5, 11 and 21. . . . . . . . . . 158
Figure 4.3: (Left) The spectrum of the plots. Filtered profiles obtained from two
             cutoff values, 0.2 (middle) and 0.4 (right). . . . . . . . . . . . . . . . . . 159
Figure 4.4: Selected peaks for FFT and PSD plots with respect to MPH and chosen
             MPD values. Red horizontal lines represent the MPH. . . . . . . . . . . 160
Figure 4.5: APSD plots, radial and angular spectrum of roughest (first row, H = 0)
             and smoothest (second row, H = 1) surfaces. APSDs are obtained after
             applying the 2D FFT on roughness surfaces. . . . . . . . . . . . . . . . . 161
Figure 4.6: a,b) Sublevel sets of the image given in c. d) Persistence diagram of the
             image shown in c. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Figure 4.7: Surface profile test set results obtained with the 1D implementation of
             traditional signal processing tools. . . . . . . . . . . . . . . . . . . . . . . 164
Figure 4.8: Surface profile classification results obtained with OD (H0 ) sublevel set
             persistence. Carlsson Coordinates (CC), persistence images (PI), and
             template functions (TF) are used to extract features. The first plot
             in the second row represents the results after applying dimensionality
             reduction to features obtained from persistence images. . . . . . . . . . . 165
Figure 4.9: Results of surface classification obtained with the 2D implementation
             of signal processing tools. . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Figure 4.10: Results of surface classification obtained with 1D persistence H1 us-
             ing Carlsson coordinates (CC), persistence images (PI), and template
             functions (TF). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Figure 4.11: DWT tree for the first three level of the transform. . . . . . . . . . . . . 169
                                               xx


Figure 4.12: The plots on the first row belong to the roughest surface (H = 0)
             simulation, while the plots in the second row belong to the smoothest
             surface (H = 1) a) Energy ratios of detail coefficients (di : detail coeffi-
             cients of level i). b) Cumulative energy ratios including approximation
             coefficients (a: approximation coefficients of maximum level). c) The
             resulting three main components after applying the automatic threshold
             selection for profile x − 1. . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Figure 4.13: (left) Energy ratios of detail coefficients (di : detail coefficients of level i).
             (middle) Cumulative energy ratios including approximation coefficients
             (a: approximation coefficients of maximum level). (right) The resulting
             three main components after applying the automatic threshold selection
             for profile x − 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Figure 4.14: Matrix format of the DCT modes and example modes. . . . . . . . . . . 173
Figure 4.15: The entropy of the roughness and waviness+form surface for varying
             threshold index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Figure 4.16: Reconstructed surface profiles obtained using thresholds from the en-
             tropy analysis shown in Figure 4.15. . . . . . . . . . . . . . . . . . . . . . 175
Figure 4.17: Classification accuracies obtained with automatic and heuristic thresh-
             old selection for DCT and DWT using a synthetic data set. Three-class
             classification is performed in this case. . . . . . . . . . . . . . . . . . . . 177
Figure 4.18: Experimental data results for DWT (profile classification) and DCT
             (Surface classification). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Figure 4.19: The thresholds selected by proposed algorithm for the synthetic data
             set. The roughness parameter 0 represents the roughest surface, while
             the smoothest surface has a roughness parameter of 1. . . . . . . . . . . 179
Figure 4.20: Roughness surfaces and roughness profiles obtained from three surfaces
             in the experimental data set. (a-d) Milling, (e-h) Profiled. . . . . . . . . 179
Figure 4.21: Threshold values selected from the automatic threshold selection algo-
             rithm and the constant heuristic threshold. . . . . . . . . . . . . . . . . . 180
Figure 4.22: Gaussian filtering results obtained from profile and surface classification
             for synthetic data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Figure 4.23: Three and nine class classification results for surface profiles (1D) and
             the surfaces (2D) in experimental data using Gaussian Filter. . . . . . . 181
                                                xxi


Figure 4.24: Motivation example given in Reference [1]. . . . . . . . . . . . . . . . . . 182
Figure 4.25: Steps to perform topological saliency-based simplification . . . . . . . . . 183
Figure 4.26: Topological saliency plot for a simplified synhetic surface. . . . . . . . . . 184
Figure 4.27: Saliency based simplification on an example synthetic surface using ap-
             proximated geodesic distances between the critical points. . . . . . . . . 186
Figure 4.28: Saliency-based clustering of surfaces. The first row represents the clus-
             tering results of the original surface with a feature number of 90 for
             different cluster numbers. The second and third rows represent the re-
             sults for the simplified surface at the second iteration (feature number:
             30) and third iteration (feature number: 7) given in Figure 4.27. . . . . . 187
Figure 5.1: Experimental setup for titanium cutting experiments. (top) Titanium
             bar and the sensors used in data collection, (bottom) Data acquisition
             boxes, and signals conditioner used during the experiments. . . . . . . . 196
Figure 5.2: Filtered signals and raw measurements obtained from cutting configu-
             ration with RPM:407, DOC (depth of cut):1.2 mm, CT (cutting time):
             AC (All cut, where cutting time is T ), TT (tool type): chip breaker,
             TN (tool number): 2 and CN (corner number): 1. . . . . . . . . . . . . . 198
Figure 5.3: Surface scans for cutting inserts with low, medium and severe wear
             levels. All scans are obtained from cutting inserts without chip breaker
             used in different cutting tests. . . . . . . . . . . . . . . . . . . . . . . . . 199
Figure 5.4: FFT spectrum of the processed signals obtained from the first five cut-
             ting configurations. CS: Cutting speed, DOC: depth of cut, FR: feed
             rate, CT: cutting time, QC: quarter cut, HC: half-cut, 3QC: three-
             quarter cut, AC: all cut, TT: tool type, CB: chip breaker, WOCB:
             without chip breaker, TN: tool number, CN: corner number. . . . . . . . 201
Figure 5.5: Processed time series (blue) and the reconstructed signals using EEMD
             (orange) for five cutting configurations when the last 25 seconds of the
             time series used in the analysis. CS: Cutting speed, DOC: depth of cut,
             FR: feed rate, CT: cutting time, QC: quarter cut, HC: half-cut, 3QC:
             three-quarter cut, AC: all cut, TT: tool type, CB: chip breaker, WOCB:
             without chip breaker, TN: tool number, CN: corner number. . . . . . . . 204
                                              xxii


Figure 5.6: Processed time series (blue) and the reconstructed signals using EEMD
             (orange) for five cutting configurations when the last 5 seconds of the
             time series that is used in the analysis. CS: Cutting speed, DOC: depth
             of cut, FR: feed rate, CT: cutting time, QC: quarter cut, HC: half-cut,
             3QC: three-quarter cut, AC: all cut, TT: tool type, CB: chip breaker,
             WOCB: without chip breaker, TN: tool number, CN: corner number. . . 205
Figure 5.7: The number of components obtained after applying PCA for features of
             DWT (left) and EEMD (right) and computing the cumulative variance
             ratio. These variance plots are obtained when the last 25 seconds of the
             time series are used in feature extraction. . . . . . . . . . . . . . . . . . . 207
Figure 5.8: The test set classification scores obtained with the DWT approach. Re-
             sults are provided for four different classification algorithms: SVM: Sup-
             port Vector Machine, LR: Logistic Regression, RF: Random Forest, GB:
             Gradient Boosting. The labeling of the time series used in the analysis
             is done based on flank wear measurements. . . . . . . . . . . . . . . . . 209
Figure 5.9: The test set classification scores obtained with the DWT approach. Re-
             sults are provided for four different classification algorithms: SVM: Sup-
             port Vector Machine, LR: Logistic Regression, RF: Random Forest, GB:
             Gradient Boosting. The labeling of the time series used in the analysis
             is done based on crater wear measurements. . . . . . . . . . . . . . . . . 209
Figure 5.10: The test set classification scores obtained with the DWT approach by
             excluding features from force signals. Results are provided for four
             different classification algorithms: SVM: Support Vector Machine, LR:
             Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The
             labeling of the time series used in the analysis is done based on crater
             wear measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Figure 5.11: The test set classification scores obtained with the EEMD approach.
             Results are provided for four different classification algorithms: SVM:
             Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
             GB: Gradient Boosting. The labeling of the time series used in the
             analysis is done based on flank wear measurements. . . . . . . . . . . . 211
Figure 5.12: The test set classification scores obtained with the EEMD approach.
             Results are provided for four different classification algorithms: SVM:
             Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
             GB: Gradient Boosting. The labeling of the time series used in the
             analysis is done based on crater wear measurements. . . . . . . . . . . . 212
                                               xxiii


Figure 5.13: The test set classification scores obtained with the EEMD approach
             by excluding features from force signals. Results are provided for four
             different classification algorithms: SVM: Support Vector Machine, LR:
             Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The
             labeling of the time series used in the analysis is done based on crater
             wear measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Figure 5.14: The test set classification scores obtained with the Carlsson Coordinates.
             Results are provided for four different classification algorithms: SVM:
             Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
             GB: Gradient Boosting. The labeling of the time series used in the
             analysis is done based on flank wear measurements. . . . . . . . . . . . 214
Figure 5.15: The test set classification scores obtained with the Carlsson Coordinates.
             Results are provided for four different classification algorithms: SVM:
             Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
             GB: Gradient Boosting. The labeling of the time series used in the
             analysis is done based on crater wear measurements. . . . . . . . . . . . 214
Figure 5.16: The test set classification scores obtained with the Carlsson Coordinates
             by excluding features from force signals. Results are provided for four
             different classification algorithms: SVM: Support Vector Machine, LR:
             Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The
             labeling of the time series used in the analysis is done based on crater
             wear measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Figure 5.17: The test set classification scores obtained with the persistence images.
             Results are provided for four different classification algorithms: SVM:
             Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
             GB: Gradient Boosting. The labeling of the time series used in the
             analysis is done based on crater wear measurements. . . . . . . . . . . . 216
Figure 5.18: The test set classification scores obtained with the persistence images
             by excluding features from force signals. Results are provided for four
             different classification algorithms: SVM: Support Vector Machine, LR:
             Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The
             labeling of the time series used in the analysis is done based on crater
             wear measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Figure 5.19: The test set classification scores obtained with the template functions.
             Results are provided for four different classification algorithms: SVM:
             Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
             GB: Gradient Boosting. The labeling of the time series used in the
             analysis is done based on crater wear measurements. . . . . . . . . . . . 218
                                               xxiv


Figure 5.20: The test set classification scores obtained with the template functions
             by excluding features from force signals. Results are provided for four
             different classification algorithms: SVM: Support Vector Machine, LR:
             Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The
             labeling of the time series used in the analysis is done based on crater
             wear measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Figure A.1: The experimental setup showing the workpiece, the cutting tool, and the
             attached accelerometers (left). The visual representation of the stickout
             distance (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Figure A.2: The surface finish corresponding to (a) no-chatter case with 6.35 cm
             (2.5 inch) stickout length, 320 rpm, and 0.127 mm (0.005 inch) depth of
             cut, (b) Mild chatter case with 6.35 cm (2.5 inch) stickout length, 770
             rpm, and 0.127 mm (0.005 inch) depth of cut, and (c) chatter case with
             6.35 cm (2.5 inch) stickout length, 570 rpm, and 0.127 mm (0.005 inch)
             depth of cut. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Figure A.3: Tagging example in the (a) time domain and (b) the frequency domain
             for the case with 8.9 cm (3.5 in) stickout, 770 rpm, and 0.04 cm (0.015
             in) depth of cut. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Figure A.4: (a) Sample tagged time series and some samples of the resulting surface
             finish. The right panel shows a comparison of the frequency spectra be-
             tween a signal labeled as chatter versus (b) no chatter, (c) intermediate
             chatter, and (d) unknown. . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Figure A.5: Experimental setup of milling cutting experiments.(left) Illustration of
             the setup (right) picture of the cutting tool and the workpiece. . . . . . . 231
Figure A.6: The first column represents tool displacements for two teeth, and the
             second column provides Poincare sections for two time series obtained
             with three different rotational speeds and depth of cuts in milling ex-
             periments. The third column shows the PSD plots of the three-time
             series whose Poincare sections are shown in the first column. Red dots
             represent tooth passage frequency. The time series shown in the first
             row is stable with cutting conditions Ω = 11793 rpm and depth of cut
             of 2.54 mm, while the one in the second row is unstable with cutting
             conditions Ω = 17746 rpm and depth of cut of 1.524 mm. The third row
             represents an unstable milling signal with cutting conditions Ω = 27285
             rpm and depth of cut of 3.556 mm. . . . . . . . . . . . . . . . . . . . . . 232
Figure A.7: Illustration for the stability analysis using eigenvalues of the dynamic
             map obtained using spectral element approach [2]. . . . . . . . . . . . . . 233
                                               xxv


Figure A.8: The stability of time series obtained using the analytical model provided
            in Section 2.6.5 [3] with different depth of cut (b) and spindle speed (Ω)
            on 100 × 100 grid. The green color corresponds to the time series with
            Hopf bifurcation (unstable), while the blue color represents the stable
            time series. The red color shows the time series with flip bifurcation.
            Experimental data, whose stability is defined based on the Poincare
            section and PSD plots, is shown with diamond (unstable) and triangle
            (stable) symbols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Figure B.1: The heat map of averages DTW distances of time series belongs to three
            classes for the 8.89 cm (3.5 inch) case. . . . . . . . . . . . . . . . . . . . 234
Figure B.2: The heat map of averages DTW distances of time series belongs to three
            classes for the 11.43 cm (4.5 inch) case. . . . . . . . . . . . . . . . . . . . 234
Figure B.3: Classification accuracies obtained from transfer learning applications for
            turning experiment case with 5.08 cm overhang distance. (left) Training:
            5.08 cm Test: 6.35 cm (middle) Training: 5.08 cm Test: 8.89 cm, (right)
            Training: 5.08 cm Test: 11.43 cm. CC: Carlsson Coordinates, PI: Per-
            sistence Images, PL: Persistence Landscapes, TF: Template Functions,
            WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode
            Decomposition, FPA: FFT/PSD/ACF, SVM: Support Vector Machine,
            RF: Random Forest, GB: Gradient Boosting. The results for TDA-PL
            implementation with Gradient Boosting classifier (GB) are not avail-
            able due to the large amount of time required in training and testing.
            Therefore, it represents an empty box in the figure. . . . . . . . . . . . . 245
Figure B.4: Classification accuracies obtained from transfer learning applications for
            turning experiment case with 6.35 cm overhang distance. (left) Training:
            6.35 cm Test: 5.08 cm (middle) Training: 6.35 cm Test: 8.89 cm, (right)
            Training: 6.35 cm Test: 11.43 cm. CC: Carlsson Coordinates, PI: Per-
            sistence Images, PL: Persistence Landscapes, TF: Template Functions,
            WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode
            Decomposition, FPA: FFT/PSD/ACF, SVM: Support Vector Machine,
            RF: Random Forest, GB: Gradient Boosting. The results for TDA-PL
            implementation with Gradient Boosting classifier (GB) are not avail-
            able due to the large amount of time required in training and testing.
            Therefore, it represents an empty box in the figure. . . . . . . . . . . . . 246
                                              xxvi


Figure B.5: Classification accuracies obtained from transfer learning applications
            for turning experiment case with 8.89 cm overhang distance . (left)
            Training: 8.89 cm Test: 5.08 cm (middle) Training: 8.89 cm Test: 6.35
            cm, (right) Training: 8.89 cm Test: 11.43 cm. CC: Carlsson Coordi-
            nates, PI: Persistence Images, PL: Persistence Landscapes, TF: Tem-
            plate Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble
            Empirical Mode Decomposition, FPA: FFT/PSD/ACF, SVM: Support
            Vector Machine, RF: Random Forest, GB: Gradient Boosting. The
            results for TDA-PL implementation with Gradient Boosting classifier
            (GB) are unavailable due to a large amount of time required in training
            and testing. Therefore, it represents an empty box in the figure. . . . . . 247
Figure B.6: Classification accuracies obtained from transfer learning applications for
            turning experiment case with 11.43 cm overhang distance . (left) Train-
            ing: 11.43 cm Test: 5.08 cm (middle) Training: 11.43 cm Test: 6.35
            cm, (right) Training: 11.43 cm Test: 8.89 cm. CC: Carlsson Coordi-
            nates, PI: Persistence Images, PL: Persistence Landscapes, TF: Tem-
            plate Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble
            Empirical Mode Decomposition, FPA: FFT/PSD/ACF, SVM: Support
            Vector Machine, RF: Random Forest, GB: Gradient Boosting. The
            results for TDA-PL implementation with Gradient Boosting classifier
            (GB) are not available due to a large amount of time required in train-
            ing and testing. Therefore, it represents an empty box in the figure. . . . 248
Figure B.7: Classification accuracies obtained from transfer learning applications for
            turning data set using DTW approach with K = 1, 2, 3, 4, 5, where K
            represents the nearest neighbor number. Overhang distances used as
            training and testing data sets are shown on y-axis. . . . . . . . . . . . . 249
Figure B.8: Classification accuracies obtained from transfer learning applications
            when the milling data set is used as the training set and the turning
            data set is used as the test set. CC: Carlsson Coordinates, PI: Per-
            sistence Images, PL: Persistence Landscapes, TF: Template Functions,
            WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode
            Decomposition, FPA: FFT/PSD/ACF, SVM: Support Vector Machine,
            RF: Random Forest, GB: Gradient Boosting. The results for TDA-PL
            implementation with Gradient Boosting classifier (GB) are not avail-
            able due to the large amount of time required in training and testing.
            Therefore, it represents an empty box in the figure. . . . . . . . . . . . . 250
                                            xxvii


Figure B.9: Classification accuracies obtained from transfer learning applications
             when the milling data set is used as the test set, and the turning data
             set is used as the training set. CC: Carlsson Coordinates, PI: Per-
             sistence Images, PL: Persistence Landscapes, TF: Template Functions,
             WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode
             Decomposition, FPA: FFT/PSD/ACF, SVM: Support Vector Machine,
             RF: Random Forest, GB: Gradient Boosting. The results for TDA-PL
             implementation with Gradient Boosting classifier (GB) are not avail-
             able due to the large amount of time required in training and testing.
             Therefore, it represents an empty box in the figure. . . . . . . . . . . . . 251
Figure B.10: Classification accuracies obtained from transfer learning applications
             between turning and milling data sets using the DTW approach with
             K = 1, 2, 3, 4, 5, where K represents the nearest neighbor number. Over-
             hang distances (OD.) used as training or testing data sets are shown on
             y-axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Figure B.11: The highest F1-score out of four different classifiers (or out of selected
             numbers of nearest neighbor for DTW) for each approach used in trans-
             fer learning applications between overhang distance cases of the turning
             and milling experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Figure B.12: F1 scores obtained from the selected methods when we train and test
             between the overhang distance cases of the turning data set and the
             milling data set. The selected methods that give the highest accuracy
             are represented with ’◦’ bar hatch, and the ones that are in the error
             band of the highest accuracy are shown with ‘/’ bar hatch. One ap-
             proach is selected from each category of the methods, and these are
             FPA, TDA-CC, and DTW. . . . . . . . . . . . . . . . . . . . . . . . . . 253
Figure C.1: Scan sample of 125M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Figure C.2: The microscope used for experimental data collection and the sample
             surfaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Figure D.1: Training set classification scores obtained from the DWT approach. Re-
             sults are provided for four different classification algorithms: SVM: Sup-
             port Vector Machine, LR: Logistic Regression, RF: Random Forest, GB:
             Gradient Boosting. The labeling of the data was performed with respect
             to flank wear measurements. . . . . . . . . . . . . . . . . . . . . . . . . . 256
                                              xxviii


Figure D.2: Training set classification scores obtained from the DWT approach. Re-
            sults are provided for four different classification algorithms: SVM: Sup-
            port Vector Machine, LR: Logistic Regression, RF: Random Forest, GB:
            Gradient Boosting. The labeling of the data was performed with respect
            to crater wear measurements. . . . . . . . . . . . . . . . . . . . . . . . . 257
Figure D.3: Training set classification scores obtained from the EEMD approach.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling the of data was performed with
            respect to flank wear measurements. . . . . . . . . . . . . . . . . . . . . 257
Figure D.4: Training set classification scores obtained from the EEMD approach.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling of the data was performed with
            respect to crater wear measurements. . . . . . . . . . . . . . . . . . . . . 258
Figure D.5: Training set classification scores obtained from the Carlsson Coordi-
            nates approach. Results are provided for four different classification
            algorithms: SVM: Support Vector Machine, LR: Logistic Regression,
            RF: Random Forest, GB: Gradient Boosting. The labeling of the data
            was performed with respect to flank wear measurements. . . . . . . . . . 258
Figure D.6: Training set classification scores obtained from the Carlsson Coordi-
            nates approach. Results are provided for four different classification
            algorithms: SVM: Support Vector Machine, LR: Logistic Regression,
            RF: Random Forest, GB: Gradient Boosting. The labeling of the data
            was performed with respect to crater wear measurements. . . . . . . . . 259
Figure D.7: The test set classification scores obtained with the persistence images.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling of the time series used in the
            analysis is done based on flank wear measurements. . . . . . . . . . . . 259
Figure D.8: Training set classification scores obtained from the persistence images.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling of the data was performed with
            respect to flank wear measurements. . . . . . . . . . . . . . . . . . . . . 260
                                              xxix


Figure D.9: Training set classification scores obtained from the persistence images.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling of the data was performed with
            respect to crater wear measurements. . . . . . . . . . . . . . . . . . . . . 260
Figure D.10:The test set classification scores obtained with the template functions.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling of the time series used in the
            analysis is done based on flank wear measurements. . . . . . . . . . . . 261
Figure D.11:Training set classification scores obtained from the template functions.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling of the data was performed with
            respect to flank wear measurements. . . . . . . . . . . . . . . . . . . . . 261
Figure D.12:Training set classification scores obtained from the template functions.
            Results are provided for four different classification algorithms: SVM:
            Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
            GB: Gradient Boosting. The labeling of the data was performed with
            respect to crater wear measurements. . . . . . . . . . . . . . . . . . . . . 262
                                             xxx


                            LIST OF ALGORITHMS
Algorithm 2.1: AESA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Algorithm 5.1: Automatic and adaptive algorithm to find the sensitive frequency. . . 202
                                          xxxi


                                        CHAPTER 1
                                    INTRODUCTION
    Machine learning has become of the trending areas during the last decades. One area
where machine learning has been especially useful is dynamical systems. Data-driven analysis
of dynamical systems has grown increasingly popular due to enhancements in measuring
devices and data mining. A wide variety of signal processing tools is combined with machine
learning to study dynamical systems. However, there is still a need to explore new approaches
because the-state-of-the art has some drawbacks specific to applications. This study aims
to combine machine learning with Topological Data Analysis (TDA) tools to create new
investigative methods to study dynamical systems.
    One example of complex dynamical systems is machining processes, including nonlinear-
ities, time delays, and stochastic effects. One of the challenging problems is detecting the
occurrence of chatter characterized by a large amplitude of vibrations of the cutting tool.
Consequently, identification and mitigation of chatter have become prominent research top-
ics in recent decades. Some of the challenges associated with chatter identification are that
it depends on several factors, including the dynamic properties of the tool and the work-
piece. Therefore, as these properties vary during the cutting process, the results of predictive
models become invalid, thus necessitating a data-based approach for more reliable chatter
detection. Motivated by this goal, many studies in the literature have focused on extracting
chatter features from signals obtained using sensors mounted on the cutting center. Most of
these studies are based on analyzing the spectrum of force or acceleration signals, often in
combination with machine learning techniques [4, 5, 6, 7, 8]. The two most common methods
for analyzing cutting signals are the Wavelet Packet Transform (WPT) and the Empirical
Mode Decomposition (EMD). However, these methods have limitations that preclude them
from being adopted as general chatter detection tools. To elaborate, this study shows that
identifying appropriate feature vectors using these two methods is signal-dependent, and
                                               1


it requires skilled operators [9]. In addition to the contribution to traditional approaches,
Topological Data Analysis (TDA) based approach is developed and used for both synthetic
and experimental cutting signals to identify chatter in time series [3, 10]. Another novel
approach based on similarities between time series has also developed in this study. This
approach combines the Dynamic Time Warping (DTW) algorithm and K-Nearest Neighbor
algorithm to diagnose chatter [11].
    Another challenging problem in complex dynamical system analysis is data-driven model
identification. It provides a useful approach for comparing the performance of a device to
the simplified model used in the design phase. One of the modern and popular methods for
model identification is Sparse Identification of Nonlinear Dynamics (SINDy). Although this
approach has been widely investigated in the literature mainly using numerical models, its
applicability and performance with physical systems are still a topic of current research [12].
In Chapter 3, SINDy is extended to identify the mathematical model of a complicated
physical experiment of a chaotic pendulum with a varying potential interaction. It is also
tested using a simulated model of a nonlinear, simple pendulum. The input to the approach
is a time series and estimates of its derivatives. While the standard approach in SINDy
is to use the Total Variation Regularization (TVR) for derivative estimates, some caveats
for using this route are presented in this study. The performance of TVR is benchmarked
against other methods for derivative estimation.
    In addition to chatter diagnosis and parameter identification, surface texture analysis is
also a challenging problem. Surface roughness determines many important surface properties
such as adhesion, friction, wear, as well as both thermal and electrical contact conductance
[13, 14]. Currently, the most prevalent approach for describing manufactured surfaces uses
statistical point summaries that contain a small fraction of the surface information. These
point summary representations are not robust (they are too dependent on the direction of
measurement and on noise). They are also not amenable to mathematically linking the
surface texture to the physics of the generating process. Therefore, there is a need for
                                               2


alternative compact descriptions of the often complex and possibly fractal manufactured
surfaces [15, 16, 17, 18, 19, 20, 21]. The new objects used for describing these surfaces must
be easily visualized, robust to noise, have well-defined metrics, are capable of representing the
surface texture at multiple scales, and have the ability to leverage machine learning and other
computational and statistical tools. Therefore, Chapter 4 implements tools from TDA in
surface texture analysis to address the current limitations in surface texture representation;
however, the utility of these tools has never been explored in the context of surface texture
characterization and analysis.
    The last application area investigated in this study is tool wear analysis. Signal decom-
position approaches are also widely adopted for tool wear identification and prediction in the
literature. However, these methods have similar problems stated in chatter diagnosis. The fi-
nal decomposition of the signals requires manual preprocessing and input parameters which
are dependent on human error. Therefore, building an automatic and adaptive machine
learning framework is the current area of research. In Chapter 5, persistence homology from
TDA is utilized to build a parameter-free machine learning framework to classify experimen-
tal cutting signals based on the wear amount of corresponding cutting tools. In addition to
implementing TDA in tool wear analysis, Chapter 5 modifies existing approaches to reduce
the number of parameters required from users.
                                                 3


                                         CHAPTER 2
              CHATTER DIAGNOSIS USING MACHINE LEARNING
2.1     Literature Review
    Turning, boring, milling, and drilling operations constitute a major part of manufacturing
processes. One challenging problem that all these processes have in common is the occurrence
of large amplitude, and detrimental oscillations called chatter [22, 23, 24]. Since chatter leads
to increased tool wear, poor surface finish, and noise, it is extremely important to anticipate
and avoid its occurrence. Alternatively, several chatter mitigation techniques, including
increasing stiffness in machine tools and active and passive damping techniques, also exist
[25]. Efficient methods for the identification of the stability lobes that separate stable cutting
and chattering motion [26, 27] can help keep the machine away from chatter via selecting
parameters in the safe area below the stability lobes. However, these models often do not
account for the effect of the changing dynamics or for highly complex cutting operations.
This led to the emergence of in-situ methods for chatter detection based on instrumenting
the cutting center with sensors and analyzing the resulting signals [28, 29, 30, 31, 32, 33].
    The majority of available in-process methods for chatter identification rely on extracting
certain features from the acoustic, vibration, or force signals and comparing them against
some predefined markers of chatter [34, 35, 36, 37, 30, 38, 39, 40, 41, 42, 43, 44]. They can
be broadly categorized into two groups as shown in Figure 2.1. The most prevailing methods
are Wavelet Packet Transforms (WPT) and Empirical Mode Decomposition (EMD) or the
Ensemble Empirical Mode Decomposition (EEMD). Generally, such decomposition-based
methods for analyzing the cutting signal follow the same procedure. First, the signal is
decomposed into different parts using some transformation. Then, the decomposed portions
or packets of the signal, which include the relevant information about machine tool chatter,
are selected to reconstruct a new signal. These packets are chosen by applying the Fast
                                                4


Fourier Transform (FFT) to the different parts or packets and choosing the ones that overlap
with the known chatter frequencies of the system. Finally, various time and frequency domain
features are computed from these packets. In several papers, these features are ranked and
are utilized as the input for the machine learning classifiers. Support Vector Machine (SVM)
algorithm is the most common classifier used for chatter classification [32, 6, 45, 5, 8, 46, 47].
Other less common classifiers include quadratic discrimination analysis [4], Hidden Markov
Model (HMM) [48], generalized HMM [49], and logistic regression [50] (see Figure 2.1).
    Wavelet packet decomposition and wavelet transform are widely adopted in machining
state monitoring. Chen and Zheng [5] generated feature matrices for chatter classification
using wavelet packets whose frequency bands contain the chatter frequency. Yao et al. [32]
used the standard deviation and the energy of the decomposition obtained using the Discrete
Wavelet Transform and the WPT for chatter detection from acceleration signals in a boring
experiment. The energy of the wavelet packets was also utilized in turning experiments with
the comparison of different levels of WPT [51, 49]. Ding et al. [50] used wavelet packet
entropy as a feature for early chatter detection. In addition to WPT, EMD and EEMD are
also often utilized to featurize cutting signals. Ji et al. [6] proposed EMD to both eliminate
noise from milling vibration signals and to extract features from informative Intrinsic Mode
Functions (IMF). Chen et al. [45] used top-ranked features extracted from the IMFs obtained
from EEMD for machining state detection. Li et al. [52] used the energy spectrum of the
IMFs as features for chatter detection. The resulting features are ranked by using Fisher
Discriminant Ratio (FDR) [45] and, when the number of features is high, recursive feature
elimination (RFE) is used to reduce the number of features [5]. Although EMD/EEMD is
typically applied to vibration signals, Liu et al. [8] also used EMD to extract features from
the servo motor current time series.
    In addition to WPT and EMD-based approaches, there are other methods for feature
extraction from metal removal processes. For example, Thaler et al. [4] used Short-Time
Fourier Transform to extract the frequency domain features of the feed force, acceleration,
                                                5


and sound pressure signals in band sawing operation. Moreover, the Q-factor and the power
spectrum of the signal were used for chatter classification in milling [46]. Cao et al. [53]
applied the Hilbert Huang transform to signals reconstructed using only the informative
wavelet packets. Lamraoui et al. applied multi-band resonance filtering and envelope analy-
sis to milling vibration signals [54]. Yesilli and Khasawneh combined Fast Fourier Transform
(FFT), Power Spectral Density (PSD), and Auto-correlation Function (ACF) with super-
vised classification algorithms to detect chatter in turning signals. They used the coordinates
of the peaks of FFT, PSD, and ACF plots as features in classification algorithms [55]. Fourier
Transform is used in signal-based methods for chatter detection, and Liu et al. [56] combined
signal-based and model-based methods to build and hybrid method for chatter detection.
Variational Mode Decomposition (VMD) is another method for chatter detection. For exam-
ple, Liu et al. developed a method to automatically select the VMD parameters and extract
the corresponding features using signal energy entropy [57].
    Chatter detection strategies based on WPT or EEMD require deciding on which infor-
mative parts of the signal to use. However, since searching for the informative parts of the
decomposition is a multi-step process, these approaches become impractically laborious. Al-
though the time required to obtain the needed WPT and EEMD decompositions is relatively
low, choosing the informative decompositions in WPT and EEMD is often not straightfor-
ward. This is because the featurization process involves looking into the Fourier spectra and
the energy ratio plots for each signal in order to determine the most informative parts of the
decomposition. Consequently, only a few cases are often analyzed, and the chosen packets
or decompositions are fixed and used for feature extraction for all the subsequent data sets.
For example, in the WPT-based approach, the standard procedure is to pick the packets
with the highest energy ratio as the most informative part of the decomposition. However,
these packets do not necessarily contain the chatter frequency bands, and thus they may not
be the most suitable markers for chatter detection [9].
    There are also chatter classification methods that do not rely on signal decomposition.
                                                 6


Figure 2.1: Categorization of feature extraction methods and classifiers used for chatter
detection.
For example, Tarng et al. utilized unsupervised neural networks with adaptive resonance
theory [58]. Tangjitsitcharoen et al. proposed three different parameters which are based on
the variance of the cutting force signals to diagnose different cutting states [59]. Fu et al. used
a deep belief network with an automatic feature construction model based on unsupervised
greedy layer-wise pre-training and supervised fine-tuning to monitor the state of milling
processes [60]. Cherukuri et al. used Artificial Neural Network (ANN) on synthetic turning
data for chatter classification [61]. However, using ANN (or other black-box machine learning
methods) requires large training sets. That amount of data may not always be available,
especially in small-batch production processes, which constitute a large portion of discrete
manufacturing.
   Although prior studies on chatter detection have shown some success, these tools typically
share two main limitations: (1) training a classifier requires significant manual pre-processing
                                                 7


of the data, and (2) the trained classifier is sensitive to the differences between the training set
and the test set [9]. For example, in the WPT or EEMD method, the signal is decomposed
into wavelet packets or IMFs, respectively. The pre-processing requires the selection of
the informative wavelet packets or informative IMFs via choosing the packet or IMF that
falls within the range of the chatter frequency. These informative packets or informative
IMFs are used for extracting frequency and time features, which are often ranked with
the Recursive Feature Elimination (RFE) method. Then, an incoming data stream can be
classified based on these features and a classification algorithm such as SVM. This means that
there are fundamental limitations if the chatter frequencies change significantly, for example,
due to changing natural frequencies or changing process parameters. Specifically, Yesilli et
al. assessed the transfer learning performance of WPT and EEMD [9]. Further, these
methods require a level of skill for feature extraction and classifier training that precludes
their wide adoption in chatter detection settings. Therefore, there is a need for an accurate
machine learning algorithm for chatter diagnosis that can (1) be easily and automatically
applied, and (2) can be computed in a reasonable time. Therefore, this work proposes two
novel approaches for chatter detection. These are Topological Data Analysis (TDA) based
approach and the approach that utilizes the similarity between time series via Dynamic Time
Warping (DTW).
    Topological Data Analysis (TDA) [62, 63, 64, 65] is a promising tool for generating fea-
ture vectors for chatter detection comes from a new field with many mature computational
tools. TDA, and more specifically persistent homology, provides a quantifiable way for de-
scribing the topological features in a signal [66]. Specifically, by embedding the sensory
signal into a point cloud, it is then possible to use persistent homology to produce a mul-
tiscale summary of the topological features of the signal, thus enabling the analysis of the
underlying dynamical system. The homology classes that correspond to the embedded signal
are often reported using a planar diagram that shows how long each topological feature per-
sisted. The application of TDA tools to machining dynamics has only been recently explored
                                                  8


[67, 68, 69]. Specifically, Reference [68] and [70] show that maximum persistence—a single
number from the persistence diagram—can be used to ascertain the stability of simulated
data from a stochastic turning model. Khasawneh et al. incorporated more information from
the persistence diagram by extracting 5 features, including Carlsson coordinates ([71]) and
the maximum persistence [69], see Section 2.6.4 for more details on featurizing persistence
diagrams. In combination with SVM, the resulting feature vector was used to train a chatter
classifier, and it was applied to simulated deterministic and stochastic turning data with
success rates as high as 97% in the deterministic case. In addition, Yesilli et al. utilized
Carlsoon Coordinates and Template Functions [72] to diagnose chatter in milling simulations
and show that these two featurization methods are noise-robust [3].
    However, despite the active work in the literature on featurizing persistence diagrams,
all prior studies on chatter detection with TDA have utilized only a small fraction of the
persistence diagram for constructing a feature vector. Further, these publications only stud-
ied simulated signals, and no sensory signals from actual cutting tests have been tested.
Therefore, this work aims to collect and summarize state-of-the-art featurization tools for
persistence diagrams and apply them for the first time for chatter classification using actual
experimental signals obtained from an accelerometer mounted on the cutting tool during a
turning process. The methods that are investigated for featurizing the resulting persistence
diagrams and classifying chatter time series include persistence landscapes [73], Carlsson co-
ordinates [71], persistence images [74], an example kernel method [75], and path signatures
of persistence landscapes [76]. Moreover, the run time for each featurization method is pro-
vided, including the runtime for persistence diagram computation, which constitutes most
of the total computation time. To reduce the runtime for persistence diagram computation,
this study utilizes Bézier curve approximation method [77], greedy permutation [78] and
parallel computing.
    The second approach proposed in this study is based on combining the K-Nearest Neigh-
bor (KNN) classifier with time series similarity measure: Dynamic Time Warping (DTW)
                                               9


and Approximate and Eliminate Search Algorithm (AESA). DTW has been used in many
application domains including speech recognition ([79, 80, 81, 82, 83]), time series classifica-
tion ([84, 85, 86]), and signature verification ([87, 88, 89, 90]). In this study, DTW is comp
with AESA algorithms to detect chatter in signals obtained from turning experiments.
     In machining, natural frequencies of the system shift when cutting configuration param-
eters such as overhang distance is changed. The chatter frequency, where chatter takes place
in the frequency domain, also changes. Since training a classifier on a data set obtained from
each new configuration is cumbersome, this study is interested in how a trained classifier on
one cutting process can transfer knowledge to the different cutting processes. This general
idea is known as transfer learning in the literature. Within the context of machining, it
has the potential to provide a methodology for pooling data from different manufacturing
settings to more robustly detect chatter. In addition to traditional machine learning, this
study also tests the performance of transfer learning in each featurization approach. Clas-
sifiers were trained and tested from data gathered from milling and turning experiments.
Except for DTW, four different classification algorithms were used for all methods: Support
Vector Machine (SVM), Logistic Regression (LR), Random Forest classifier (RF), and Gra-
dient Boosting (GB). K-Nearest Neighbor was used for measuring the performance of the
similarity measure technique DTW.
2.1.1    Transfer Learning Approaches for Machining
     Several studies focus on chatter detection using deep learning and transfer learning.
Cherukuri et al. use synthetic data to train an artificial neural network (ANN) to predict
chatter [61]. Postel et al. used a pre-trained Deep Neural Network to predict stability in
milling operation [91]. A synthetic data set is used to train the network, and then fine-
tuning is performed using the small size of an experimental data set. Unver and Sener used
a numerical simulation of milling operation to train AleXNet structure for Convolutional
Neural Networks, and they tested the same network on experimental milling data to detect
                                               10


chatter [92]. In addition to chatter detection, the majority of prior works that apply transfer
learning focus on fault detection and tool/machine conditioning instead of chatter detec-
tion. Further, these works utilize deep learning algorithms that require a large number of
observations [93] and do not provide insight into the signals’ most informative features for
chatter detection. For instance, Wu et al. used 1D Convolutional Neural Networks (CNN)
for fault detection in bearings and gears [94]. They applied two different transfer learning
approaches: 1) training and testing a classifier on samples from different working condi-
tions and 2) training on simulation data and testing on experimental data. Li and Liang
developed a CNN-based approach to diagnosing severe tool wear, tool breakage, and spindle
failure during machining processes [93]. They used two different CNC machines to train and
test a classifier in an experiment that took six months to collect the data needed to train
the CNN. Kim et al. used Support Vector Regressor to predict the machining power, and
they transferred knowledge from machining power models of steel and aluminum to predict
the power model of titanium [95]. Mamledesai et al. utilized CNN and transfer learning to
monitor tool conditions to help the machinist decide whether to keep using the same tool
or replace it [96]. Marei et al. used Convolution Neural Network-based transfer learning to
predict tool wear of the carbide cutting tool flank [97]. Another study that includes trans-
fer learning and deep learning is focused on the estimation of force in the milling process
using simulation data and experimental data as a source and target domain, respectively
[98]. Wang et al. use the pre-trained network VGG19 to identify machining fault types in
rolling bearings. They modified the final fully connected layer to reduce the number of net-
work parameters and implement the transfer learning between non-manufacturing data and
manufacturing data [99]. Kim et al. proposed another approach that converts cutting force
signals into images using a multi-layer recurrence plot (MRP) to estimate the machining
quality in laser-assisted micro-milling operation [100]. They used a pre-trained ResNet-18
CNN structure and tested it on the images generated from cutting signals.
    Traditional machine learning approaches are also adopted in transfer learning approaches
                                               11


for machining applications. For instance, Gao et al. implemented extreme vector machines
and transfer learning to build a prediction model for remaining tool life [101]. Yesilli et al.
combined traditional signal decomposition tools and machine learning algorithms such as
support vector machines, random forest classifier, and gradient boosting to detect chatter
in experimental turning signals [9]. Fast Fourier Transform, Auto-correlation Function, and
Power Spectral Density are also combined with similar machine learning algorithms to iden-
tify unstable time series obtained from turning experiments [55]. Shen et al. combined the
TrAdaBoost transfer learning algorithm [102] and singular value decomposition-based fea-
ture extraction to identify different fault types in a bearing data set [103]. The TrAdaBoost
algorithm is also used in tool tip dynamics prediction [104].
2.1.2    Main Contribution
    For WPT and EEMD approaches, the resulting informative packets or decompositions
may not contain chatter information, especially if the system parameters shift during op-
eration, e.g., due to the movement of the machining center, which may involve changing
the overhang distance of the tool and thus the flexibility of the cutting tool. Therefore,
in these situations, the classifier is required to categorize signals that may carry different
characteristics and chatter features than the ones it was trained on. In other words, the
ability of the classifier to achieve transfer learning is tested in these situations. However,
there have not been any studies on the transfer learning capabilities of WPT and EEMD.
Therefore, this study investigates the transfer learning performance of these two approaches
to the novel approaches proposed in this study. Further, the common approach for picking
the informative packets in WPT is to choose the packets with the highest energy. However,
these packets do not necessarily contain the chatter frequency bands, and thus they may not
be the most suitable markers for chatter detection [9].
    State-of-the-art methods for chatter detection, WPT, and EEMD require intense manual
preprocessing and thus have low automation potential ([9, 55]). For example, the frequency
                                                12


spectrum of the signals obtained from wavelet packets and IMFs must be checked to choose
the informative wavelet packets or IMFs. In contrast, the proposed DTW approach eliminates
the feature extraction step since it does not depend on signal decomposition but rather relies
on computing pairwise distances between time series. All the steps in the DTW approach can
be performed automatically. It only needs two input parameters: the number of neighbors
required for the KNN classifier and the looseness constant (H) for the AESA algorithm.
     Although this study focuses on turning as a use case, the DTW approach is applicable to
other machining processes where the data stream is in the form of time series. A comparison
of the resulting chatter classification success rates between the DTW approach and other
widely used methods in the literature shows that the DTW approach has the highest average
accuracy or is within the error band of the highest accuracy in two out of the four cutting
configurations for turning experiments.
     This study also shows how to drastically reduce the computation time by combining the
DTW approach with the Approximate and Eliminate Search Algorithm (AESA) or parallel
computing. Although AESA has been widely used in word recognition, pattern recognition,
and handwritten character recognition ([105, 106, 107, 108, 109]), it is believed that this work
is the first to combine AESA and DTW for analyzing engineering systems. In addition, the
results obtained with AESA and parallel computing show that after training a classifier
offline, an incoming time series can be labeled in less than two seconds. Therefore, the DTW
approach is very conducive to online chatter detection applications.
     Previous studies on chatter detection with a TDA-based approach either utilize a small
subset of the available featurization methods of persistence diagrams, or they only consider
simulated data. Specifically, Reference [68] used maximum persistence lifetime on a simulated
stochastic turning model to visually show that persistence diagrams carry chatter informa-
tion. In Reference [69], the authors used Carlsson Coordinates (in addition to maximum
lifetime) as features and logistic regression classifier to distinguish chatter versus chatter-
free signals obtained from stochastic and deterministic turning simulations. Reference [70]
                                               13


used Persistent Homology to visually detect the changes in the behavior of a linearized turn-
ing model using a heat map of maximum persistence plotted on top of the spindle speed
and the depth of cut space. Reference [3] only used Carlsson Coordinates and Template
Functions, where the latter was introduced in [72], for chatter detection in simulated milling
signals. In contrast to previous studies on TDA, machine learning, and machining dynamics,
this study is the first to consider experimental data. Further, in contrast to the previous
studies that used one or two TDA featurization methods, this study compares for the first
time the most common featurization methods from TDA. Further, this study also focuses
on the classification results using four common classifiers: Support Vector Machine (SVM),
Logistic Regression (LR), Random Forest (RF), and Gradient Boosting (GB). It is believed
that this study is the first to apply a variety of persistence diagram featurization techniques
to experimental machining data sets. Another distinguishing feature of this study is a focus
on speeding up persistence computations by leveraging several computational tools such as
greedy permutation, Bézier curve approximation, and parallel computing. The described
speedups significantly reduce the computational time, thus enabling utilizing the approach
described in this work for effective chatter detection.
    Another main contribution of this work is to present the first study on using state-of-
the-art feature extraction tools to transfer chatter knowledge across turning and milling
operations using experimental data. The main goal is to automate chatter detection for
different cutting conditions and operations and to reduce the amount of data and time
needed to train a classifier [110]. Once a classifier is trained using a given data set, the gained
information can be utilized for different operations without needing large and completely new
training data sets from the target process.
    In contrast to prior works on transfer learning for chatter detection, this work focuses on
a large number of feature extraction methods, including WPT, EEMD, FFT, ACF, PSD,
as well as two other novel methods that have been proposed in this study in the context of
machining: Topological Data Analysis (TDA) methods and similarity-based methods using
                                                14


Dynamic Time Warping (DTW).
    This chapter is organized as follows. Section 2.2 gives the background information of four
different classification algorithms that have been widely adopted in this chapter. Section 2.3
explains the feature extraction from WPT and EEMD and presents the results obtained from
these approaches. Section 2.4 describes the traditional feature extraction approach using
FFT/PSD/ACF and provides the classification results. The first novel approach developed
in this study is explained in Section 2.5 and the resulting classification accuracies can be
found in the same section as well. The second novel approach developed by this study
and the results obtained using this approach are provided in Section 2.6. The third main
contribution of this study for chatter diagnosis is the application of transfer learning between
different machining operations using various featurization techniques. Section 2.7 provides
background information for transfer learning and the results obtained using experimental
turning and milling data sets. This chapter uses experimental data set obtained from turning
and milling experiments. One can refer to Section A.1 and A.2 for more details about data
collection and processing.
2.2     Supervised Classification Algorithms
    This section gives background information on the different classifiers used to test the
performance of considered feature extraction methods, namely SVM, logistic regression,
random forest classification, and gradient boosting.
2.2.1    Support Vector Machine
    A Support Vector Machine (SVM) is used to classify the time series by using feature
vectors. The Support Vector Machine algorithm is a supervised machine learning technique
for finding the optimal hyperplane separating two training data set classes. This hyperplane
can then be used to classify the test data. The two dimensional case of a linear SVM is
illustrated in Figure 2.2. The feature vectors corresponding to two different classes, e.g.,
                                               15


chatter (crosses) and no-chatter (circles), form two linearly separable data sets. The optimal
hyperplane is selected such that the perpendicular distances from the feature vectors, which
are closest to the hyperplane and also called support vectors, are equal. This means that
the optimal hyperplane has the largest margin [111]. In general, it can be described by the
set of points x satisfying
                                          w · x + c = 0,                                          (2.1)
and the dashed lines where the support vectors lie on are defined according to
                                    f±1 (x) = w · x + c = ±1.                                     (2.2)
Then, the margin of the optimal hyperplane can be denoted as 2/∥w∥. The two hyperplanes
with Equation 2.2, and therefore the optimal hyperplane from Equation (2.1), can be found
by maximizing the distance 2/∥w∥ or by minimizing ∥w∥2 with the constraints
                                     w · x + c ≥ +1,      and
                                                                                                  (2.3)
                                     w · x + c ≤ −1.
The classification for a feature vector xtest of the test set can be made by checking the sign of
the expression w · xtest + c, which defines the label for the two classes. For the theory behind
multi-class classification with SVM, one can refer to [112]. In some cases, the training data
are not separable by a linear hyperplane. In this case, the SVM can be extended to nonlinear
classification with the help of kernel functions [113].
2.2.2    Logistic Regression
    Logistic regression is a supervised learning classification algorithm that computes the
probability of two class labels for a given dependent variables [114]. It is quite similar to
linear regression, but its output is divided into two categories [115]. Figure 2.3 illustrates
linear and logistic regression on a binary data set. In this figure, X = {x1 , x2 , . . . , xn } is the
set of elements in the feature vector while Y ∈ {0, 1} is the dichotomous outcome variable.
                                                 16


      Figure 2.2: Selected optimal hyperplane for linearly separable two class data set.
   Figure 2.3: Linear (a) and logistic (b) regression onto data set whose output is binary.
For dichotomous output, linear regression can be applied, but the model will not fit well,
as shown in Figure 2.3a. There are two main reasons why the linear equation does not
explain the relationship between the variables X and Y [116]: (1) the relationship between
the variables does not have a linear trend, and (2) the errors are not constant, or they are
not normally distributed. However, this problem can be solved by introducing the logit
transformation.
    Let π(x) = E(Y |x) be the expected value of Y given the value of x. The regression
model g(x) and the logit transformation π(x), respectively, are defined according to [115]
in Equation 2.4, where π(x) is defined as the sigmoid function for logistic regression in
                                              17


Equation 2.5. The regression model is expressed as a linear function. However, it is converted
into a nonlinear probability function with logit transformation.
                                          π(x) 
                               g(x) = ln               = β0 + β1 x,                         (2.4)
                                            1 − π(x)
                                                 eβ0 +β1 x
                                      π(x) =                 ,                              (2.5)
                                               1 + eβ0 +β1 x
Although Equation 2.4 is defined for only one independent input variable x, the model can be
further extended to a multivariate version. To assign labels for a given input x, the decision
boundary must first be formed. In Figure 2.3b this boundary is the sigmoid function that
splits tags 0 and 1. The x values which satisfy β0 + β1 x = 0 form the decision boundary
[114], and the probability at the boundary, per Equation (2.5), is 0.5. The parameters β0
and β1 in the regression model can be identified using maximum likelihood estimators [114].
2.2.3    Random Forest Classification
    Ensemble learning uses multiple methods to get higher prediction rates for a given prob-
lem. For example, random forest is an ensemble learning method composed of decision trees
where the number of these trees is part of the user input to the algorithm [117]. Each decision
tree is composed of branch nodes with two branches emanating from each root node; hence,
they are called binary trees. The nodes that have no descendants are termed leaf nodes or
leafs. Assuming numeric inputs, each branch node corresponds to one variable and its split
point, while the leaf nodes correspond to output variables. Figure 2.4 illustrates decision
tree classification using two classes (0 and 1) and two input variables (x and y). The first
step is to partition the input space of the training set into rectangles (or hyper-rectangles in
higher dimensions), in this case, L1 through L5 . Selecting the partitions is based on making
each subset of the training set purer, i.e., with fewer mixed labels, than the training set itself
[118]. An impurity function defines the goodness of each partition, see [118] for a discussion
on optimum splits. After defining the partitions for the training data set (left graph in
Figure 2.4), a tree is formed (right graph in Figure 2.4).
                                                18


                             Figure 2.4: Generation of decision tree.
Figure 2.5: Random Forest Classification (vote0 :Final decision of the corresponding tree is
class 0. vote1 :Final decision of the corresponding tree is class 1.).
    The branch nodes of the tree correspond to conditions, either on x or on y, such that the
samples L1 through L5 can be placed in one of the leaf nodes. Each leaf node is then labeled
by following the plurality rule [118]: the most frequent labels in any node are assigned as
the label for that node. For example in Figure 2.5, leaf nodes L1 , L3 , and L4 are labeled as
class 0, while leafs L2 and L5 are labeled as class 1. Given a new input, a tag is generated by
traversing the tree starting at the tree’s root node. The new input’s label is then matched
to the leaf node it ends up in within the tree.
    In Random forest classification, there are N decisions trees, and each tree votes for a label
for a given test sample. The algorithm chooses a number of samples to generate the decision
trees, and this is iterated until the desired number of trees is obtained. The estimation for
the label of the sample is made with respect to the most frequent votes [119] as shown in
Figure 2.5.
                                               19


2.2.4    Gradient Boosting
    Gradient boosting algorithm was introduced by Schapire to answer the question of
whether the performance of a single strong learner is equal to the set of weak learner perfor-
mance [120]. Gradient boosting was proposed as an algorithm that provides more accurate
predictions for regression and classification problems by generating new base models, which
can be linear models, smooth models, and decision trees [121]. Gradient Boosting aims to
correct the previous models by adding new base models to minimize the loss function. When
the decision trees are used as new base models, a new decision tree is added after computing
the loss function. The new decision tree is generated by parametrizing it so that it can
decrease the loss of the existing model. Specifically, the gradient descent is used to minimize
the loss function value, and it is applied in functional space since each tree (base learner) can
be represented as a function. Gradient boosting algorithm fits the new base models to the
negative gradient of the loss function, where the choice of the loss function is user-dependent,
to increase the accuracy of the overall model [122].
2.3     Signal Decomposition Based Approach
    This section describes two signal decomposition approaches, Wavelet Packet Transform
(WPT) and Ensemble Empirical Mode Decomposition (EEMD). These approaches are widely
adopted in literature for chatter detection. Recursive Feature Elimination (RFE) is also
explained and used to rank features. The approach which utilizes WPT can be divided into
four steps, which are summarized in Figure 2.6. The first step is the decomposition of the
time series into wavelet packets. This technique from signal processing is especially useful for
a high-resolution time-frequency analysis. The motivation for an additional decomposition
of the signal is the increase of the signal-to-noise ratio and increasing sensitivity for chatter
features [5]. The output of the WPT is wavelet packets. The second step is the selection of
the informative packets based on the properties of the wavelet packets and the characteristics
of chatter in the considered process. The third step is the feature extraction and its automatic
                                                20


Figure 2.6: Overview of the Wavelet Packet Transform (WPT) method with Recursive Fea-
ture Elimination (WPT).
ranking with the RFE method, which is used to distinguish between chatter and chatter-free
motion. On the basis of the extracted features, the fourth step is the classification into
chatter/chatter-free cases via a Support Vector Machine (SVM), Logistic Regression (LR),
Random Forest (RF) classification, and Gradient Boosting (GB).
    The structure of the method that employs EEMD is similar to the WPT-based approach.
However, in contrast to the WPT method, the EEMD is used for the decomposition of the
original time series, and the output of the EEMD is intrinsic mode functions (IMF) instead
of wavelet packets. After the decomposition, the informative IMF is selected, and various
features for chatter detection are extracted. The features are automatically ranked via the
RFE method, and supervised machine learning algorithms are used to classify them into
chatter/chatter-free cases.
                                             21


2.3.1    Wavelet Packet Transform
     The methodology in Reference [5] was followed in this section, and WPT is applied to
the time series before feature extraction and classification. The WPT is an extension of the
discrete wavelet transform. One level of the discrete wavelet transform decomposes the signal
into a low and a high-frequency component by passing it simultaneously through a low and a
high pass filter. The properties of the two filters are related to each other and are determined
by the chosen wavelet basis. According to [5], the Daubechies orthogonal wavelet db10 is
used as the wavelet basis function. The outputs of the low and the high pass filter give
the approximation coefficients and detailed coefficients denoted by Ai and Di , respectively,
where the subscript i specifies the level of the decomposition. The resulting signal after the
decomposition is called a wavelet packet and can be reconstructed from the approximation
or detailed coefficients by using the filter properties [5]. In the discrete wavelet transform,
only the output Ai is passed again through both filters to generate two additional outputs
AAi+1 and ADi+1 in the next level.
     In contrast, in the WPT approach the output Ai of the low pass filter as well as as
the output Di of the high pass filter are both again low- and high-pass filtered to generate
the wavelet packets AAi+1 , ADi+1 , DAi+1 and DDi+1 in the next level. This means that
the WPT generates 2k wavelet packets at the kth level, see Figure 2.7 for a schematic of
level 3 WPT. In Figure 2.7, for example, DAA3 denotes the packet in the third level, where
in the first and the second level, the low pass filter and in the third level, the high pass
filter have been applied. Before passing through the filters in the next level, the signal is
downsampled by a factor of two, which increases the frequency resolution. Moreover, since
the two resulting wavelet packets contain only one-half of the frequencies of the input data
after each decomposition, this downsampling is possible without losing information. As a
consequence, the resulting wavelet packets in one level contain only a frequency band, which
is mainly distinct from the bands of the other packets. Even if the frequency bands become
narrower at each level, the packets contain rich information about the original signal due to
                                                22


the increase in the frequency resolution. The location of the frequency band is determined by
the chronological order of the applied filters, which are used to generate the wavelet packet
(cf. Figure 2.7). In the following, the wavelet packets are labeled according to the order of
their frequency band, beginning with 1 for the packet with the lowest frequencies (A . . . Ak )
resulting only from low pass filtering to 2k for the packet containing the highest frequencies
(D . . . Dk ) resulting from a successive application of the high pass filter.
                          Figure 2.7: 3 Level Wavelet Packet Transform.
2.3.1.1      Selection of Informative Wavelet Packets
    The next step is the selection of the informative wavelet packets, which are best suited
to distinguish between stable cutting and chattering motion. The criteria for selecting the
informative wavelet packets are high signal energy compared to other packets for a good
signal-to-noise ratio and significant overlap of the frequency band of the packet with possi-
ble chatter frequencies. In this section, the selection of the informative wavelet packets is
described for the turning experiment explained in Section A.1.
    The identification of the band of chatter frequencies is made by examining the FFT of the
signals tagged as stable, intermediate chatter, and chatter (see Section A.1 and A.1.1 for the
description of the experimental setup and the labeling of turning experiments). Figure 2.8
shows example time series and the corresponding Fourier spectra for three tagged signals for
                                                23


the case whose stickout length, rotational rpm, and depth of cut are 5.08 cm (2 inch), 320
rpm, and 0.127 mm (0.005 inch), respectively. The dominant frequencies are low for stable
cutting and correspond to the spindle rotation frequency. In addition, there is a significant
peak at 120 Hz, which can be found in all measurements and probably comes from an
external source. For intermediate chatter and chatter, a significant part of the energy in the
signal is contained at high frequencies near 1000 Hz, which is close to the eigenfrequency
of the lateral tool vibration. As a consequence, these chatter frequencies become larger for
increasing stickout length, and for each of the four different stickout lengths, a different range
of chatter frequencies has been identified.
Figure 2.8: Time domain and frequency domain of stable (a,b), intermediate (c,d), and
chatter (e,f) regions for overhang distance of 5.08 cm (2 inch), 320 rpm, 0.002 inch depth of
cut case of turning experiments.
    In order to analyze the properties of the wavelet packets, levels 1, 2, 3, and 4 WPT are
obtained from the experimental data. Figure 2.9 shows the resulting level 4 energy ratios
of the wavelet packets for two example cases. The energy ratios represent the fraction of
energy in each packet relative to the total energy in all the packets. It is obvious from the
                                               24


Figure 2.9: Energy ratios of wavelet packets for two cases of turning experiment: (top) 5.08
cm (2 inch) stickout, 320 rpm and 0.0127 cm (0.005 inch) DOC, and (bottom) 5.08 cm (2
inch) stickout, 570 rpm and 0.00508 cm (0.002 inch) DOC. Note the differences in the scale
of the vertical axis.
figure that most of the energy is concentrated in the first wavelet for stable cutting. In
contrast, the energy is concentrated mainly in the first, third, and fourth wavelet packets
for the intermediate chatter and the chatter regions. This is consistent with the behavior of
the frequency spectrum of the original data in Figure 2.8 since increasing the number of the
wavelet packets corresponds to a higher frequency band.
    Upon identifying the wavelet packets whose energy ratios are relatively high with respect
to the other packets, the third step is to identify the packets whose spectrum has significant
peaks that overlap with the chatter frequencies given in Table 2.1 [5]. Specifically, a time
domain signal for each wavelet packet is reconstructed, and the corresponding FFT is ob-
tained for each of the reconstructed signals. For the two examples with stickout length 5.08
cm (2 inch), the frequency spectrum of the reconstructed signals obtained from the first four
wavelet packets for the intermediate chatter and chatter regions are provided in Figure 2.10.
                                               25


Table 2.1: The chatter frequency ranges, the informative wavelet packets, and the informative
IMFs corresponding to each overhang distance of the cutting tool for the turning experiments.
  Stickout length Chatter frequency range               Informative wavelet                Informative
    (cm (inch))             (Hz)                               packets                         IMF
      5.08 (2)           900–1000          Level 1 :1, Level 2: 1, Level 3: 2, Level 4: 3       2
     6.35 (2.5)         1200–1300          Level 1 :1, Level 2: 1, Level 3: 3, Level 4: 4       2
     8.89 (3.5)         1600–1700          Level 1 :1, Level 2: 2, Level 3: 3, Level 4: 6       1
     11.43 (4.5)        2900–3000          Level 1 :2, Level 2: 3, Level 3: 5, Level 4: 10      1
It can be seen that the peaks in the spectrum of the 3rd and 4th wavelet packet overlap
with the band of the previously identified chatter frequency (900–1000 Hz, see Table 2.1
and Figure 2.8). Since, for the stickout length 5.08 cm (2 inch), the energy ratios and the
amplitudes in the corresponding FFT (see Figure 2.10) are slightly higher in the 3rd wavelet
packet than in the 4th wavelet packet, the 3rd packet was chosen as the informative wavelet
for chatter detection at level 4 WPT. An overview of the selected informative wavelet packet
for each level of the WPT can be found in Table 2.1. For higher stickout length, the dom-
inant chatter frequencies increase, and therefore, in general, a wavelet packet with a higher
frequency band is selected as the informative wavelet packet.
    Current literature attempted to automate the selection of the informative packets by
selecting the packets with the highest energy. However, this study showed that the informa-
tive wavelet packet is not necessarily the one with the highest energy because it is important
that the range of possible chatter frequencies are in the frequency band of the informative
wavelet packet [9]. In fact, often, the first packet has the highest energy ratio. However, its
frequency band does not overlap with the chatter frequencies, which are mainly contained
in packets with a higher index (cf. Table 2.1).
    Since the frequency band of the wavelet packets can be predicted from the WPT tree
in Figure 2.7, it is also possible to predict the informative wavelet packet that contains
information about chatter frequencies. For example, from the sampling rate of 10 kHz, it
follows that the first wavelet packet in level 3 corresponds to the frequency band 0–625 Hz.
The upper frequency limits for other packets in level 3 are equal to the corresponding wavelet
                                                26


Figure 2.10: The spectrum of the first four wavelet packets of level 4 WPT for intermediate
chatter (a),(c) and chatter (b),(d) in the case with 5.08 cm (2 inch) overhang distance. The
spindle speed and depth of cut is 320 rpm and 0.0127 cm (0.005 inch) in (a), (b) and 570
rpm and 0.00508 cm (0.002 inch) in (c), (d), respectively.
Table 2.2: Comparison between predicted and selected informative wavelet packet number
for all overhang distances cases of the turning experiment. Predicted wavelet packets are
decided with by overlapping the chatter frequency with the wavelet packet frequency range
obtained from the WPT tree (Figure 2.7) for level 4.
  Overhang (Stickout) Distance Chatter frequency range Informative wavelet Informative wavelet
           (cm (inch))                   (Hz)          packets (Predicted)  packets (Selected)
             5.08 (2)                 900–1000             Level 4: 3-4         Level 4: 3
            6.35 (2.5)                1200–1300            Level 4: 4-5         Level 4: 4
            8.89 (3.5)                1600–1700             Level 4: 6          Level 4: 6
           11.43 (4.5)                2900–3000            Level 4: 10         Level 4: 10
packet number times the upper frequency level of the first wavelet packet (cf. Figure 2.7).
Table 2.2 provides the predicted and the selected informative wavelet packets for the level
4 WPT. The selected informative wavelet packets are consistent with the predicted ones for
all cases.
                                               27


2.3.1.2    Recursive Feature Elimination (RFE)
    The reconstructed signal from the informative wavelet packet allows the extraction of
both frequency domain as well as time domain features for chatter identification. A collection
of frequency domain and time domain features, which are taken from Reference [5], are
provided in Table 2.3.
    Python is used to train a supervised classification algorithm combined with Recursive
Feature Elimination (RFE), where in this case maximum of 14 features are available at the
level 4 WPT. Recursive feature elimination is an iterative process that eliminates one of
the features in each iteration until all the features are removed for classification [5], which
means that the number of iterations for RFE equals the number of the considered features.
Elimination of features is based on their influence on the classification: the feature with
the smallest effect is eliminated in each iteration [123]. In the end, RFE returns a feature
ranking list corresponding to one specific training set.
    The ranked features are used to generate feature vectors where the first vector contains
only the first ranked feature, while each consecutive feature vector adds the subsequent
feature in the ranking until all the features are included in the 14th vector at the fourth
level of WPT. The classification accuracy is calculated for all 14 feature vectors. In other
words, in the first step, only the top-ranked feature is used, and in each further step, the
next highest-ranked feature is added to the feature matrix, and the classification accuracy
is computed again.
2.3.2    Ensemble Empirical Mode Decomposition
    EEMD is based on the Empirical Mode Decomposition (EMD), which is an elementary
step in the Hilbert-Huang transform [124]. Similar to WPT, EMD is useful for non-stationary
signals since the resulting IMFs contain the time and frequency information of the signal.
The main difference in contrast to WPT and other linear decomposition methods is that the
expansion bases of EMD are not fixed but are rather adaptive, and they are determined by
                                               28


   Table 2.3: Time domain features (a1 , . . . , a10 ) and frequency domain features (a11,...,14 ).
                                                  Features
            N
        1
                xm (Mean)                             a4
                                                                    (Clearance Factor)
           P
  a1 = N
                                     a8 =         N    √
                                             1
                                                          |xm |)2
                                                 P
           m=1                             (N
                                                m=1
  a2 = σ(xm ) (Standard Deviation)   a9 =        a3
                                                 N           (Shape Factor)
                                            1   P
                                           N
                                                     |xm |
       s                                       m=1
               N
           1
                   x2m (RMS)                       a4
                                                              (Impulse Factor)
              P
  a3 =     N
                                     a10 =        N
                                              1   P
              m=1                            N
                                                       |xm |
                                                 m=1
                                              M
                                                  fk2 |X(fi )|
                                              P
  a4 = max(|xm |) (Peak)             a11 =  k=1
                                               PM               (Mean Square Frequency)
                                                   |X(fk )|
                                              k=1
         N                                    M
            (xm −a1 )3
        P                                     P
                                                  cos(2πfk ∆t)|X(fk )|
  a5 = m=1
          (N −1)a33
                       (Skewness)    a12 =  k=1
                                                       M
                                                       P
                                                                         (One Step Auto Correlation Function)
                                                           |X(fk )|
                                                     k=1
         N                                    M
            (xm −a1 )4
        P                                     P
                                                  fk |X(fi )|
  a6 = m=1
          (N −1)a43
                       (Kurtosis)    a13 =  k=1
                                               PM              (Frequency Center)
                                                   |X(fk )|
                                              k=1
                                              M
                                                 (fk −a13 )2 |X(fi )|
                                              P
  a7 = a4
       a3
           (Crest Factor)            a14 =  k=1
                                                    M
                                                    P
                                                                       (Standard Frequency)
                                                         |X(fk )|
                                                   k=1
the data. On the one hand, this means that EMD is a nonlinear decomposition and, on the
other hand, it is suitable for analyzing nonlinear and non-stationary data [124].
    The algorithm for the decomposition of a given time series s(t) can be described as
follows. The first residue r0 (t) is equivalent to the original data, i.e. r0 (t) = s(t). Then the
IMFs ci (t) with i ≥ 1 are generated from the residues ri−1 (t) by repeated application of the
so-called sifting process described below. After extracting the ith IMF ci (t), the next residue
is calculated by
                                      ri (t) = ri−1 (t) − ci (t).                                         (2.6)
This procedure is repeated until the result of Equation (2.6), that is, the ith residue ri (t),
becomes a monotonic function, and no more IMFs can be extracted. As a result, the decom-
position of the original data can be given by
                                                   X N
                                       s(t) =              ci (t) + rN ,                                  (2.7)
                                                   i=1
                                                        29


    The sifting process for the generation of the ith IMF ci from the residue ri−1 is done
via the following iterative scheme. Lower and upper envelopes of the data are generated by
using cubic splines for interpolation between the local minima and maxima of the residue,
respectively. The mean m(t) of the lower and upper envelope is calculated. The first guess
for the IMF is obtained by the difference between the residue ri−1 and m(t). Then the first
guess for the IMF is treated as the new data, and the sifting process is repeated until a
given stoppage criterion is fulfilled. As a consequence of the iteration, the lower and the
upper envelopes of the final IMF ci (t) are nearly symmetric, and the mean of the latter is
approximately zero. Moreover, the number of extrema and the number of zero crossings are
equal or differ at most by one. IMFs with lower indices correspond to high-frequency bands,
while the ones with higher indices correspond to lower frequency bands. These properties of
the decomposition make it useful for further data analysis.
    However, one major problem with the original EMD is the occurrence of mode mixing,
which means that one IMF contains two signals whose frequency bands are totally different,
or a signal of a similar scale is observed inside different IMFs whose frequency bands are
different [125]. EEMD was developed to solve the mode mixing problem in EMD [126].
Accordingly, Wu and Huang [125] proposed the following steps for EEMD:
   1. Create an ensemble from the original data by adding white noise.
   2. Decompose each member of the ensemble into IMFs.
   3. Compute the ensemble means of the corresponding IMFs.
The added white noise amplitude must not exceed 20% of the standard deviation of the
original signal, while the ensemble size for the EEMD can be selected as 200 [45]. The Python
package PyEMD with the default stoppage criterion is used for the analysis [127, 128]. The
ensemble number and the noise width parameter are set to 200 and 0.2 (20%), respectively.
                                                30


  Amplitude (m/s2)
                                                                                               Original Time Series
                                  0.025
                                  0.000
                                 −0.025
                                              15.10                   15.12                   15.14                                                 15.16                   15.18                   15.20
                                                                                                                  Time (s)
                                                                                                                                        0.0025
   eIMF 4 eIMF 3 eIMF 2 eIMF 1                                                                           eIMF 8 eIMF 7 eIMF 6 eIMF 5
                                   0.025
                                   0.000                                                                                                0.0000
                                 −0.025                                                                                                −0.0025
                                   0.025
                                                                                                                                        0.0005
                                   0.000                                                                                                0.0000
                                 −0.025                                                                                                −0.0005
                                                                                                                                        0.0005
                                  0.0025
                                  0.0000                                                                                                0.0000
                                 −0.0025                                                                                               −0.0005
                                  0.0025
                                                                                                                                        0.0055
                                  0.0000
                                                                                                                                        0.0050
                                           15.10      15.12   15.14           15.16   15.18      15.20                                           15.10      15.12   15.14           15.16   15.18      15.20
                                                                 Time (s)                                                                                              Time (s)
Figure 2.11: The original time series and the corresponding intrinsic mode functions (IMFs)
for the case of 5.08 cm (2 inch) stickout, 320 rpm, and 0.0127 cm (0.005 inch) DOC.
2.3.2.1                                    Selection of Informative Intrinsic Mode Function
           In this section, the informative IMF selection is described using the experimental signals
from turning experiments explained in Section A.1. In order to obtain features for machine
learning from vibration signals using EEMD, the vibration signals are decomposed into IMfs,
see Figure 2.11 for an example. For long time series, the computation time is reduced for
this step by dividing the signal into shorter segments whose length is approximately 1000
points. The informative IMF selection process is very similar to its WPT counterparts (see
Section 2.3.1.1). Specifically, the power spectrum in Figure 2.12 shows that the first IMF
includes the high frequency vibrations while higher order IMFs include the low frequency
ones. For example, for the 5.08 cm (2 inch) stickout case, the FFT of the second IMF
matches the chatter frequency region (900–1000 Hz). Therefore, the second IMF is selected
as the informative IMF in this case. The informative IMFs for the other stickout cases are
summarized in Table 2.1.
                                                                                                      31


Figure 2.12: The spectrum of each intrinsic mode function (IMF) for the case of 5.08 cm (2
inch) stickout, 320 rpm and 0.0127 cm (0.005 inch) DOC.
2.3.2.2    Feature Extraction Using EEMD
    Similar to Chen et al. [45], seven time domain features are extracted from the informative
IMF. These features are listed in Table 2.4, and they include the energy ratio, peak to peak
value, standard deviation, root mean square, crest factor, as well as skewness, and kurtosis
of the signals. The features are computed and then ranked using the Recursive Feature
Elimination (RFE) method, which was introduced in Reference [129] and is described in
Section 2.3.1.2. The feature matrix for classification is formed starting with the top-ranked
feature by itself and then by concatenating, in descending order, the rest of the features one
at a time. This results in seven combinations of features, which are then used for classification
into chatter and chatter-free cases via four different classifiers similar to the WPT approach
(see Section 2.2).
2.3.3    Results
    This section shows the classification accuracy for the methods discussed in Section 2.3.1
and 2.3.2 for turning cutting experiments explained in Section A.1. Specifically, Sections
                                               32


Table 2.4: Time domain features for the intrinsic mode functions ci (tk ). The parameters tk
and c¯i represent, respectively, the kth discrete time and the mean of ith IMF.
                 Feature                        Equation
                                                         n
                                                              c2i (tk )
                                                         P
                                                        k=1
                 Energy ratio                   f1 =   I P  n
                                                                 c2i (tk )
                                                      P
                                                      i=1 k=1
                 Peak to Peak                   f2 = max(ci (tk )) − min(ci (tk ))
                 Standard Deviation             f3 = σ(ci (tk ))
                                                     s
                                                              n
                 Root Means Square (RMS)        f4 = n1            c2i (tk )
                                                            P
                                                            k=1
                 Crest Factor                   f5 =  max(ci (tk ))
                                                           f4
                                                       n
                                                         (ci (tk )−c¯i (tk ))3
                                                       P
                 Skewness                       f6 =  k=1
                                                            (n−1)f43
                                                       n
                                                         (ci (tk )−c¯i (tk ))4
                                                       P
                 Kurtosis                       f7 =  k=1
                                                            (n−1)f44
2.3.3.1 and 2.3.3.2 show the WPT-based and the EEMD-based results, respectively. The
results are obtained by randomly splitting the data from each stickout case into 67% train-
ing and 33% testing sets. As described in Sections 2.3.1 and 2.3.2, the features from the
informative wavelet packet or informative IMF are extracted, and four different classifica-
tion algorithms are used for training. Then, each classifier is tested using the corresponding
test set. This split-train-test process is repeated 10 times, and the averages and standard
deviations of the resulting classification accuracy are tabulated.
2.3.3.1    Wavelet Packet Transform with RFE
    In each realization of training data and test data, the feature ranking via RFE is repeated
as described in Section 2.3.1.2. Since the training and test sets are different in each real-
ization, ten different rankings of the features are obtained. Figure 2.13 shows the ranking
for the 10 iterations where each bar corresponds to a feature whose equation is provided in
                                               33


Table 4.1. The height of the bar in the figure shows the number of times each feature is
ranked for the corresponding rank number. For instance, feature a14 (standard frequency)
is the feature with the most influence on the classification in all realizations. On the other
hand, features a11 , a12 and a13 are ranked second, respectively, in three, four, and three out
of ten split-train-test realizations. In general, the features based on the frequency domain
are higher ranked than the time domain features.
Figure 2.13: Bar plot for feature ranking for 5.08 cm (2 inch) stickout case at level 4 WPT.
    The mean and the standard deviation of the accuracy of the classification for the 10
realizations of training and test sets based on the level 4 WPT method are presented in
Figure 2.14 for all stickout cases. In this figure, it is seen that when the number of the
features is 8 or 10, adding lower ranked features into the feature vector does not affect the
result. This shows that RFE ranked the features properly and that lower ranked features do
not have an influence on the results.
    One difference between the WPT-based approach that is described here and the one
described in [5] is that the accuracy of the classifier is investigated using informative wavelet
functions computed at each level of the WPT. On average, the level 1 and level 2 WPT leads
to better classification results in the test sets than the level 3 and level 4 WPT. This might be
                                                  34


Figure 2.14: Level 4 Wavelet Packet Transform(WPT) feature extraction method results for
all stickout cases. (a) 5.08 cm (2 inch), (b) 6.35 cm (2.5 inch), (c) 8.89 cm (3.5 inch), and
(d) 11.43 cm (4.5 inch).
attributed to the fact that the lower level WPT contain information in a broader frequency
range than the higher level WPT, and for chatter detection, only the detection of chatter
frequencies in the spectrum is relevant but not their frequency value or the exact shape of
the peaks. The full classification results for each level of the WPT up to level 4 are tabulated
in Tables B.1–B.4 of the Appendix. Since the feature ranking is different for each realization
of the splitting into training and test data, the ith ranked feature is only denoted by ri .
Below in Table 2.5, the WPT results are reported with the highest average accuracy out of
all the different combinations of WPT levels and feature vectors, and they are compared to
the results of the EEMD method. The performance of both methods, WPT and EEMD, are
also tested with the classifiers explained in Section 2.2. Tables 2.6– 2.7 provide the accuracies
obtained from Level 1 and Level 2 WPT and EEMD feature extraction methods with four
different classifiers and compare the methods to each other.
                                                35


Table 2.5: Comparison of the classification results obtained with SVM classifier and run
times for each chatter detection method. Given run times include feature computation and
classification for EEMD and level 4 WPT.
                                  Classification Results           Time Comparison (seconds)
  Stickout Length      WPT Level          WPT            EEMD       WPT          EEMD
  5.08 cm (2 inch)           1        93.9% ± 5.8% 84.2% ± 0.8% 115.99          14540.06
  6.35 cm (2.5 inch)         2       100.0% ± 0.0% 78.6% ± 1.2% 36.65            3371.58
  8.89 cm (3.5 inch)         1       84.0% ± 15.0% 90.7% ± 1.4% 4.51             1583.38
  11.43 cm (4.5 inch)        2       87.5% ± 11.2% 79.1% ± 1.2% 6.53             3096.07
Table 2.6: Results obtained by using Level 1 WPT and EEMD feature extraction methods
with four different classifiers.
                                WPT                                    EEMD
   Stickout            Logistic   Random      Gradient          Logistic   Random    Gradient
               SVM                                       SVM
    Length            Regression Forest       Boosting         Regression   Forest   Boosting
    5.08 cm
               93.9%    84.6%       93.1%      90.0%     84.2%   93.5%      94.8%      94.9%
    (2 inch)
    6.35 cm
               85.0%    71.7%       91.7%      91.7%     78.6%   79.4%      80.1%      82.2%
  (2.5 inch)
    8.89 cm
               84.0%    94.0%      100.0%      96.0%     90.7%   89.0%      93.5%      94.5%
  (3.5 inch)
   11.43 cm
               78.8%    81.3%       86.3%      87.5%     79.1%   78.7%      81.6%      81.4%
  (4.5 inch)
2.3.3.2      Ensemble Empirical Mode Decomposition with RFE
     Similar to Section 2.3.3.1, EEMD is combined with RFE and utilizes four different clas-
sifiers in each realization of the splitting into test and train data sets. The classification
accuracy is on average better than the results from the level 3 and level 4 WPT and com-
parable to the accuracy of the lower level WPT. The combination with the best accuracies
in each cutting case is reported when comparing the different methods in Table 2.5. In this
table, the results highlighted with dark blue represent the highest accuracy across a given
row, while those highlighted in light blue have an average accuracy which is encapsulated by
the error bars of the method with the highest average accuracy.
     Table 2.5 shows that features based on the WPT algorithm give the highest accuracy for
three stickout cases out of four cutting configurations. Specifically, feature extraction with
WPT and RFE is the most accurate for the 5.08 and 6.35 cm (2 2.5 and 4.5) stickout cases
                                                 36


Table 2.7: Results obtained by using Level 2 WPT and EEMD feature extraction methods
with four different classifiers.
                                 WPT                                   EEMD
   Stickout             Logistic   Random Gradient              Logistic   Random Gradient
               SVM                                    SVM
    Length            Regression    Forest Boosting            Regression   Forest Boosting
   5.08 cm
               91.5%     87.7%      93.8%    90.0%    84.2%      93.5%      94.8%     94.9%
   (2 inch)
   6.35 cm
              100.0%     80.0%      95.0%    96.7%    78.6%      79.4%      80.1%     82.2%
  (2.5 inch)
   8.89 cm
               78.0%     58.0%      94.0%    78.0%    90.7%      89.0%      93.5%     94.5%
  (3.5 inch)
   11.43 cm
               87.5%     78.8%      88.8%     80.0    79.1%      78.7%      81.6%     81.4%
  (4.5 inch)
Figure 2.15: Bar plot including the error bars of the classification results for Level 1 WPT,
Level 2 WPT and EEMD with four different classifier. a) 5.08 cm (2 inch) stickout size,
b)6.35 cm (2.5 inch) stickout size, c)8.89 cm (3.5 inch) stickout size, d) 11.43 cm (4.5 inch)
stickout size.
                                              37


scoring 93.9%, 100.0% and 87.5% respectively. While the results from EEMD give the highest
accuracies for 8.89 cm (3.5 inch) stickout cases, the WPT result for this case still lies within
the error bars of EEMD results. The results with different classifiers other than SVM are
also provided in Tables 2.6 and 2.7. In Table 2.6, the performance of Level 1 WPT is better
than EEMD since WPT has the highest accuracies in three cutting configuration cases and
EEMD results are in the error bars of WPT results. On the other hand, Table 2.7 indicates
that both methods have the highest accuracy for two cutting configurations. These two
tables also provide evidence that lower level (Level 1) WPT outperforms EEMD. Further,
100% accuracy is observed in Tables 2.6 and 2.7 for two different cutting configurations.
These cutting configurations have the lowest number of time series as experimental data.
Since time series are not split into smaller pieces for the WPT method, so the size of the
test set is quite small, and it is possible to get such high results.
    The standard deviation of the WPT results is quite high, as seen from Table 2.5 and
Figure 2.15 since the computation time for this method does not require splitting a long
time series into smaller pieces. Therefore, the total number of samples for identical stickout
cases is smaller in comparison to the EEMD method, where long time series were split
into shorter ones of approximately 1000 points, thus increasing the number of samples and
resulting in tighter error bars. Therefore, the amount of deviation can be reduced, especially
for the WPT-based approach, by increasing the size and the number of the training sets. In
addition, Table 2.5 compares the run time in seconds for each of the different featurization
methods for chatter detection. These comparisons were performed using a Dell Optiplex
7050 desktop with Intel Core i7-7700 CPU and 16.0 GB RAM. It can be seen that feature
extraction with WPT and RFE is the fastest across all of the stickout cases. This study
points out that the built-in WPT package that is used is highly optimized, whereas, in
comparison, the EEMD does not enjoy the same level of code optimization. Moreover, for
EEMD, the EMD is performed for an ensemble of time series with an ensemble size 200,
which needs much higher computation effort and can be reduced by varying the ensemble
                                                38


parameters of the EEMD.
2.4     Traditional Time and Frequency Domain Features
2.4.1    Feature Extraction
    This section describes feature extraction using three traditional signal processing ap-
proaches. These approaches are Fast Fourier Transform (FFT), Power Spectral Density
(PSD), and Autocorrelation Function(ACF). Features for FFT, PSD, and ACF are obtained
by using their peaks’ coordinates. The x and y components of the first five peaks in each of
these three functions are used as features. Although there are some built-in commands for
peak finding in the most common scientific software tools, reliable peak selection remains
a challenging task. This is due to the large number of returned ‘bumps’ that correspond
to local maxima that are artifacts of noise and are not true features of the signal. There-
fore, some constraints are imposed to sift out the redundant peaks and capture the most
useful ones. The FFT, PSD, and ACF peaks are selected by defining the minimum peak
height (MPH) and the minimum distance between two consecutive peaks. The minimum
peak distance (MPD) was kept constant for all sequences, while the minimum peak height
was computed as a fraction of the difference between the 5th and the 95th percentile of the
peaks according to
                        M P H = ymin + α(ymax − ymin ), where α ∈ [0, 1],                    (2.8)
and y min is the 5th percentile of the amplitude of the FFT/PSD/ACF, while y max is the
95th percentile. Figure 2.16 provides the original cutting signals along with its spectrum
including the first five peaks. In Figure 2.16, two sets of the first five peaks are provided, and
the peaks shown in the figure are selected with respect to the minimum peak distance (MPD)
parameter. The figure shows that M P D = 500 chooses more points near the maximum
amplitude in comparison to M P D = 2500. However, some of the spectra of the cutting
signals do not contain five peaks if M P D = 2500 is used. Therefore, the MPD parameter
                                                39


Figure 2.16: First five peaks in the Fast Fourier Transform (FFT) plot of the time series of
acceleration signal collected from cutting configuration with 5.08 (2 in) cm overhang length,
320 rpm rotational speed of the spindle, and 0.0127 cm (0.005 inc) depth of cut. (Minimum
Peak Distance (MPD): 500 and 2500).
for FFT is set to 500 to consistently extract five peaks from the FFT of all signals, but
M P D = 1000 is used with the auto-correlation function. Because the power spectral density
plots were smooth, the MPD parameter is not used as a constraint for them.
    The feature matrices were given as input to four different supervised machine learn-
ing techniques: Support Vector Machine (SVM), Logistic Regression (LR), Random Forest
Classification (RF), and Gradient Boosting (GB). Vibration signals for each overhang size
were split into %67 training set and %33 test set. Recursive Feature Elimination (RFE)
was then used to rank the features. Splitting the data and performing classification were
performed ten times. The resulting mean accuracies and standard deviations are provided
in Section 2.4.2.
                                              40


2.4.2       Results
Figure 2.17: Comparison of classification results of Level 4 WPT, EEMD, and FFT/PS-
D/ACF methods based on four classifiers with Recursive Feature Elimination (RFE). a)
5.08 cm (2 inch) overhang length b) 6.35 cm (2.5 inch) overhang length c) 8.89 cm (3.5 inch)
overhang length d) 11.43 cm (4.5 inch) overhang length.
Table 2.8: Results obtained with the signal processing feature extraction method for four
overhang lengths with four different classifiers.
                                      Without RFE                                                      With RFE
                    5.08 cm (2 inch)               6.35 cm (2.5 inch)              5.08 cm (2 inch)                6.35 cm (2.5 inch)
 Classifier     Test Set      Training Set     Test Set       Training Set     Test Set       Training Set      Test Set       Training Set
 SVM         46.2% ± 6.0%    100.0% ± 0.0% 38.3% ± 13.0% 100.0% ± 0.0%      61.5% ± 9.7%     69.6% ± 8.3%   71.7% ± 10.7%     86.0% ± 6.6%
 LR         78.5% ± 12.8%    100.0% ± 0.0% 76.7% ± 8.2% 100.0% ± 0.0%       83.8% ± 8.0%     82.3% ± 5.8%   81.7% ± 13.8%     89.0% ± 8.3%
 RF          93.8% ± 3.1%    100.0% ± 0.0% 86.7% ± 10% 100.0% ± 0.0%        94.6% ± 4.9%     98.5% ± 1.9%    95.0% ± 7.6%    100.0% ± 0.0%
 GB          84.6% ± 6.9%    100.0% ± 0.0% 76.7% ± 17.0% 100.0% ± 0.0%     96.2% ± 3.9%     100.0% ± 0.0%    93.3% ± 8.2%    100.0% ± 0.0%
                                      Without RFE                                                      With RFE
                   8.89 cm (3.5 inch)             11.43 cm  (4.5 inch)            8.89 cm (3.5 inch)               11.43 cm (4.5 inch)
 SVM        62.0% ± 24.4% 100.0% ± 0.0% 43.8% ± 17.0%        100.0% ± 0.0% 82.0% ± 10.8% 95.6% ± 5.4%       61.3% ± 13.1% 68.6% ± 9.1%
 LR         76.0% ± 12.0% 100.0% ± 0.0% 76.3% ± 10.4%        100.0% ± 0.0% 78.0% ± 14.0% 88.9% ± 8.6%       70.0% ± 16.0% 82.9% ± 6.5%
 RF          90.0% ± 9.4% 100.0% ± 0.0% 90.0% ± 9.4%         100.0% ± 0.0% 92.0% ± 9.8% 100.0% ± 0.0%        91.3% ± 9.8% 99.3% ± 2.1%
 GB          86% ± 23.7% 100.0% ± 0.0% 83.8% ± 11.3%         100.0% ± 0.0% 86.0% ± 9.2% 100.0% ± 0.0%       88.8% ± 11.8% 100.0% ± 0.0%
    This section provides the results obtained for turning cutting experiments explained in
Section A.1 and compares traditional feature extraction results to WPT and EEMD ap-
proaches. Four different classifiers explained in Section 2.2 were utilized to compare tra-
ditional signal processing based feature extraction methods to the WPT/EEMD results
                                                                      41


Table 2.9: Cross validated results obtained with signal processing feature extraction method
for four overhang lengths with four different classifiers.
                                      Without RFE                                                     With RFE
                    5.08 cm (2 inch)              6.35 cm (2.5 inch)              5.08 cm (2 inch)                6.35 cm (2.5 inch)
  Classifier    Test Set     Training Set      Test Set       Training Set    Test Set      Training Set       Test Set       Training Set
  SVM        74.2% ± 4.7% 94.8% ± 4.7% 58.3% ± 3.1%          98.5% ± 3.1%  90.0% ± 1.9%    89.8% ± 1.9%     73.3% ± 6.5%     75.2% ± 6.5%
  LR         74.2% ± 0.0% 100.0% ± 0.0% 71.7% ± 0.0%         100.0 ± 0.0%  79.2% ± 1.3%    89.5% ± 1.3%     73.3% ± 6.7%     82.5% ± 6.7%
  RF         89.2% ± 0.0% 100.0% ± 0.0% 93.3% ± 0.0%        100.0% ± 0.0%  91.7% ± 1.3%    95.2% ± 1.3%     93.3% ± 0.0%    100.0% ± 0.0%
  GB         91.7% ± 0.0% 100.0% ± 0.0% 93.3% ± 0.0%        100.0% ± 0.0%  97.5% ± 0.0%   100.0% ± 0.0%    100.0% ± 0.0%    100.0% ± 0.0%
                                      Without RFE                                                     With RFE
                   8.89 cm (3.5 inch)             11.43 cm (4.5 inch)            8.89 cm (3.5 inch)               11.43 cm (4.5 inch)
  SVM        86.7% ± 0.0%   100.0% ± 0.0% 41.7% ± 7.0% 88.4% ± 7.0% 80.0% ± 5.8% 91.1% ± 5.8%               83.3% ± 2.3%     81.8% ± 2.3%
  LR         93.3% ± 0.0%   100.0% ± 0.0% 46.7% ± 6.6% 94.0% ± 6.6% 86.7% ± 3.6% 92.9% ± 3.6%               78.3% ± 3.2%     78.3% ± 3.2%
  RF         93.3% ± 0.0%   100.0% ± 0.0% 91.7% ± 0.0% 100.0% ± 0.0% 93.3% ± 0.0% 100.0% ± 0.0%            96.7% ± 1.5%      95.5% ± 1.5%
  GB         73.3% ± 0.0%   100.0% ± 0.0% 85.0% ± 0.0% 100.0% ± 0.0% 80.0% ± 0.0% 100.0% ± 0.0%             91.7% ± 0.0%    100.0% ± 0.0%
obtained from Reference [9]. The results obtained using these classifiers for each cutting
configuration are provided in Table 2.8. The table shows that classifiers trained with the
traditional signal processing features have an overfitting problem (significantly higher ac-
curacy rates when training versus testing) when RFE is not utilized. Recursive feature
elimination was used to solve this problem (see Section 2.3.1.2). Table 2.8 only reports the
best results obtained from the classifiers for each overhang length. These results are also
compared to the ones obtained with WPT and EEMD in Figure 2.17. It is seen that the
FFT/PSD/ACF method has higher accuracies in Figure 2.17.
     The feature extraction is explained in Section 2.4.1. Classification for this method has
been performed in the same way it is performed for WPT/EEMD. The left hand side of
Table 2.8 shows the classification results without using feature ranking. Note how the test
accuracy is significantly lower than the training accuracy. This is a typical symptom of over-
fitting, i.e., using too many (unnecessary) features for training. In contrast, the right hand
side of the same table shows that using feature ranking mitigated the overfitting problem. In
addition, if the results in Table 2.8 are compared, it is seen that the FFT/PSD/ACF based
feature extraction methods have better accuracies than WPT and EEMD. Another approach
to combat overfitting is to utilize Cross Validation (CV). Table 2.9 provides the results ob-
tained with CV where 10-fold CV was used for the 5.08 cm (2 inch) and 11.43 cm (2.5 inch)
cases, while 5-fold CV was used for the remaining cutting configurations. The results show
                                                                     42


that while CV somewhat mitigates the overfitting problem, it does not completely solve
it. For example, there is overfitting for the classification accuracy of SVM for the results
obtained without using RFE in Table 2.8. Although CV decreased the difference between
mean accuracies of test set and training set, there is still at least 20% accuracy difference
between training and test sets for overhand lengths 5.08, 6.35, and 11.43 cm (2, 2.5, and 4.5
inch). In addition, a decrease in the standard deviation of the results is observed in Table 2.9
in comparison to the ones in Table 2.8. Figure 2.17 provides bar plots for the classification
accuracy for each cutting configuration along with the associating error bars. Figure 2.18
shows how many times each feature was ranked for the traditional signal processing based
methods. This indicates where the most highly ranked features come from. The figure shows
that the most highly ranked features correspond to the FFT peaks, followed by PSD, then
ACF features.
Figure 2.18: Bar plot for feature ranking obtained from SVM-RFE for 5.08 cm (2 inch)
overhang case with FFT/PSD/ACF based method. (f1 , . . . , f4 ), (f5 , . . . , f8 ) and (f9 , . . . , f12 )
belong to features obtained from peaks of FFT, PSD and ACF respectively.
                                              43


2.5       Similarity Measures of Time Series
2.5.1      Dynamic Time Warping (DTW)
    Dynamic Time Warping is an algorithm that is capable of measuring distance or similarity
between two time series even if they have dissimilar lengths. Let T S1 and T S2 be two time
series with elements xi and yj whose lengths are m and n as follows:
                                    T S1 = x1 , x2 , . . . , xi , . . . , xm ,            (2.9)
                                    T S2 = y2 , y2 , . . . , yj , . . . , yn .          (2.10)
Berndt and Clifford state that the warping path wk = (xi(k) , yj(k) ) between two time series
can be represented by mapping the corresponding elements of the time series on m × n
matrix (see Figure 2.19 for warping path example) [130]. The warping path is composed
of the points wk , which indicate alignment between the elements xi(k) and yj(k) of the time
series. The length L of the warping path fulfills the constraints m ≤ L ≤ n, where it is
assumed that n ≥ m. For instance, w3 in Figure 2.19b corresponds to the alignment of
x2 and y3 . In general, warping paths are not unique, and several warping paths can be
generated for the same two time series. For two different time series, the DTW algorithm
chooses the warping path that gives the minimum distance between the element pairs under
certain constraints. While there are several options for computing the distance between a
pair (xi , yj ) of elements of the time series, in this implementation, the Manhattan distance
d(xi , yj ) = ∥xi − yj ∥1 is used. The minimization of the distance between T S1 and T S2 in
the DTW algorithm can then be written according to ([130]) such that
                                                                    L
                                                                               !
                                                                  X
                               DTW (T S1 , T S2 ) = min                  d(wk ) .       (2.11)
                                                                  k=1
    There are several restrictions to define the optimum warping path. These are monotonic-
ity, continuity, adjustment window condition, slope constraint, and boundary conditions.
These restrictions are applied to the alignment window to reduce the possible number of
                                                    44


    Figure 2.19: DTW alignment (a) and warping path(b) for two different time series.
warping paths since there is an excessive number of possibilities for warping paths without
any constraint ([131]).
   • Monotonicity : The indices i and j should always either increase or stay the same such
      that i(k) ≥ i(k − 1) and j(k) ≥ j(k − 1).
   • Continuity: The indices i and j can only increase at most by one such that i(k) − i(k −
      1) ≤ 1 and j(k) − j(k − 1) ≤ 1.
   • Boundary condition: The warping paths should start where i and j are equal to 1 and
      should end where i = n and j = m.
   • Adjustment window condition: The warping path with minimum distance is searched
      on a restricted area on the alignment window to avoid significant timing difference
      between the two paths ([131]). The restricted area is given by i − r ≤ j ≤ i + r.
   • Slope constraint: This condition avoids significant movement in one direction ([130]).
      After a steps in horizontal or vertical direction, it cannot move in the same direction
      without having b steps in the diagonal direction ([131]). The effective intensity of the
      slope constraint can be defined as P = b/a. P is chosen as 1, which was reported as
      an optimum value in an experiment on speech recognition ([131]).
   In this work, the distances between the time series are computed using the cDTW pack-
age. There is another widely adopted algorithm named FastDTW [132]. The time complexity
                                              45


of FastDTW algorithm is given as N (8r + 14), while cDTW package has time complexity
of rN [132, 133]. N is the number of points in the time series, and r is a parameter for the
adjustment window condition explained above. Both packages have similar time complexity,
which is O(N ).
2.5.2    K-Nearest Neighbor (KNN)
    In this approach, K-Nearest Neighbor (KNN) algorithm is used to train a classifier. KNN
is a supervised machine learning algorithm based on classifying objects with respect to labels
of nearest neighbors ([134]). The ‘K’ corresponds to the number of neighbors chosen to decide
the label of newly introduced samples. Figure 2.20 shows an example that illustrates the
classification process with KNN.
      Figure 2.20: K-Nearest Neighbor classification example for two-class classification.
    Specifically, Figure 2.20 assumes that there are two different classes for a classification
problem denoted by pentagons and stars. Pentagons and stars belong to the training set,
and the red square belongs to a new sample from the test set. When a new sample is
encountered (the square in the figure), a tag is assigned based on the number of K nearest
neighbors to each class. For instance, for the 1-NN case, the closest neighbor is from the star
class; therefore, the test sample is tagged as a star class. On the other hand, for the 5-NN
case, the test sample has two neighbors from the star class and three neighbors from the
pentagon class. Consequently, the label for the test sample is set as pentagon since there are
                                               46


more neighbors from this class than there are from the star class. If the number of nearest
neighbors is the same for multiple classes, the class label is assigned randomly with equal
probability ([135]).
2.5.3    Similarity matrices and classification
    DTW provides a measure of how different/similar any pair of time series T S1 and T S2
is. By comparing N time series T S1 , . . . , T SN with each other, similarity matrices whose
entries are the distance between the two corresponding time series can be generated. Since
DTW is commutative, the resulting matrices are symmetric. Consequently, the resulting
similarity matrix for DTW requires N (N − 1)/2 computations. The similarity matrices can
then be combined with a K-Nearest Neighbor classifier to inform us whether the time series
corresponds to chatter or chatter-free cutting. When chatter occurs in metal cutting, the
dominant frequencies of the vibrations at the tool and workpiece change from harmonics of
the spindle rotation period to chatter frequencies, which are close to some eigenfrequecies of
the mechanical structure. Moreover, the amplitude of the vibrations increases significantly.
This characteristic behavior can be used to distinguish between stable cutting and chatter
by comparing current time-domain signals (e.g. acceleration) with existing labeled signal
segments from a training phase. During classification, the data set is split into training
and test sets. Indices of the training set and test set samples are found, and then the
distance matrix for the training set and test set is generated by using the square distance
matrix that was computed in advance for all the cases. After obtaining these distances, the
nearest training set samples to the test sample are found based on the selected number of
nearest neighbors. Labels of each nearest neighbor are counted as shown in the illustration
in Figure 2.20. The label with the highest count is assigned to the test sample as the
predicted label. In this application, either chatter or chatter-free labels are assigned to the
test sample. After repeating these processes for each of the test samples, the predicted labels
are compared with the ground truth and define the accuracy of the DTW method. Splitting
                                                47


data into training and test sets is repeated 10 times. The distances between new test sets and
training sets do not have to be computed since the pairwise distances between all samples
are already computed in the beginning. Finding the indices of the samples in each iteration
will be enough to generate new similarity matrices for training and test sets. The standard
deviation of the classification is also provided since the classification is repeated 10 times.
     When a new test sample is introduced to a classifier, distance computations between all
training samples and the test sample is required, which can be computationally expensive.
Therefore, the Approximate and Eliminate Search Algorithm (AESA) (see Section 2.5.4) is
implemented to reduce the number of DTW computations per new test sample.
2.5.4    Approximate and Eliminate Search Algorithm (AESA)
     Approximate and Eliminate Search Algorithm (AESA) is a method designed for reducing
the number of distance computations during the test phase of classification. Derivation of
the AESA starts with the question of whether DTW is a metric or not. There are four
requirements for a function to be a metric [136]:
    1. D(x,y) ≥ 0,
    2. D(x,y)=0 ⇐⇒ x=y,
    3. D(x,y) = D(y,x),
    4. D(x,z) ≤ D(x,y)+D(y,z).
Although DTW always satisfies the first two properties, the commutativity condition is only
satisfied when the DTW algorithm does not approximate the distance measure. Therefore,
if the exact DTW is computed, then the first three conditions will be satisfied. However, the
fourth condition may not be satisfied, and a combination of three time series that violate
the triangular inequality can be found in a data set. Consequently, DTW is not accepted
                                               48


as a metric. Depending on the data set, the fourth condition may or may not be satis-
fied; therefore, DTW can still be used by relaxing the strict triangular inequality condition.
Specifically, Ruiz et al. introduced the triangle inequality looseness in [105] as follows
                            H(x, y, z) = D(x, y) + D(y, z) − D(x, z),                      (2.12)
where x, y and z represents the time series in a data set. Loose triangular inequality is also
defined as
                               D(x, y) + D(y, z) ≥ D(x, z) + HL .                          (2.13)
                                      Algorithm 2.1: AESA
Input: P (set of training samples), Dtrain (pairwise distance matrix between training set
   samples), c ∈ P (pseudocenter of P ), x ∈ P (test sample), H ∈ R(looseness constant),
   DT W (cDTW function)
Output: Nearest sample (n) and assigned label count=0
   while E̸= P do
       if count=0 then
           s=c
           U = {c}
           n=c
           E = {c} ∪ {q | DT W (q,c)>2DT W (x,c)-H or DT W (q,c)< H, where q∈ {P-{c}}}
       else
           s = q, such that min(      |DT W (q, u) − DT W (x, u)|), where q ∈ {P − E}
                                  P
                                 ∀u∈U
           DT W (x,s) = cDTW(x,s)
           U = U ∪ {s}
           E = E ∪ {s}
           if DT W (x,s)< DT W (x,n) then
              Q=U
              n=s
           else
              Q={s}
           end if
           for every q ∈ Q do
              E = E ∪ {p| DT W (p,q)> DT W (x,q)+DT W (x,n)-H or DT W (p,q)< DT W (x,q)-
   DT W (x,n)+H, where p ∈ {P-E}}
           end for
           count=count+1
       end if
   end while
                                                49


     Ruiz et al. call DTW distance a loose metric space when a data set does not violate
the triangular inequality [105]. All training samples are considered as potential candidates
for the nearest sample to a new test set. The main purpose of the algorithm is to eliminate
these training samples and approximate the nearest one correctly. In addition, the algorithm
performs the classification part and assigns the label of the nearest training sample to the
test sample. In other words, it applies 1-NN classification, where the classification step can
be part of the AESA algorithm. Results based on 1-NN classification will be presented in
this implementation.
  a)               count = 0  b)         count = 0    c)         count = 0  d)         count = 1
                           p1                    p1                      p1                    p1
                                          x                       x                      x
        p3             p5         p3         p5           p3         p5         p3    r 0 p5
               p0                    p5                      s                     s
       p2                        p2                      p2                    p2
                 p4                    p4                      p4                    p4
  e)               count = 1  f)         count = 1    g)         count = 1  h)         count = 1
                           p1                    p1                      p1                    p1
                     x                    x                       x                      x
        p3 r 1    r 0 p5          p3         p5           p3         p5         p3         s
               s r2                  s                       s                     p0
       p2                        p2                      p2                    p2
                 p4                    p4                      p4                    p4
  i)               count = 2  j)         count = 2    k)         count = 2  l)         count = 2
                           p1                    p1                      p1                    p1
                     x                    x                       x                      x
        p3             s          p3         s            p3         s          p3         n
               p0                    p0                      p0                    p0
       p2                        p2                      p2                    p2
                 p4                    p4                      p4                    p4
Figure 2.21: Illustration of elimination and approximation steps of AESA. Count refers to
the number of distance computations made in the algorithm.
     The pseudo code for the algorithm is given in Algorithm 2.1 ([105]). Illustrations of
                                                   50


some steps are provided in Figure 2.21. Assume that training set is denoted as P and its
samples pi are shown in Figure 2.21a. A new test sample x is introduced to the classifier
(see Figure 2.21b). The aim of AESA is to classify x as accurately as possible with fewer
DTW distance computations. The first step of the algorithm is to select a first candidate
for the nearest training sample s to x. Reference [105] points out that using pseudo center
c of P as the first candidate improves the results; alternatively, s can be randomly selected.
The definition of the pseudo center is
                                     X                       X
                     P C(P ) = {c|       DT W (c, p) = min (      DT W (q, p))}.         (2.14)
                                                       ∀q∈P
                                    ∀p∈P                     ∀p∈P
After this choice of s in Figure 2.21c, the distance between s and x is computed. That
distance is shown in Figure 2.21d) with r0 , and the count number increases by one. The
nearest sample n to x is defined by comparing the distances made inside of the algorithm.
Since only one distance computation is performed so far, s is assigned as the nearest sample
n. The approximation step is completed in the first iteration, now the elimination should be
performed. Reference [105] defined the elimination criteria such that the training set samples
pi which do not satisfy DT W (x, pi ) < DT W (x, n) are eliminated. Applying Equation (2.13)
                       Figure 2.22: Illustration for elimination criteria.
yields two elimination criteria such that
                          DT W (pi , s) < DT W (x, s) + DT W (x, n) − H
                                                                                         (2.15)
                          DT W (pi , s) > DT W (x, s) − DT W (x, n) + H.
These two conditions correspond to the two circles shown in Figure 2.21e with r1 and r2 . In
Figure 2.21e, the region where we look for possible candidates for the nearest samples (n) is
                                                 51


defined based on the elimination criteria. The training samples outside of the shaded, green
region are eliminated as shown in gray in Figure 2.21f thus completing the first iteration.
The set that includes the eliminated samples is called E. Iterations continue until P = E.
Only two samples are eliminated (see Figure 2.21g) so far in this example. Therefore, we
continue searching for new s in the second iteration. The new s = q is selected for iterations
except the first one such that
                                     X
                               min(       |DT W (q, u) − DT W (x, u)|),                   (2.16)
                                     ∀u∈U
where q ∈ {P −E} [105]. This choice makes p5 the new s as shown in orange in Figure 2.21h.
Now, the distance between s and x is computed again, and the count is increased by one (see
Figure 2.21i). Then, circles that define the elimination criteria are drawn (see Figure 2.21j),
and it is seen that all remaining samples are outside of the defined region. Therefore, all
of them are eliminated (see Figure 2.21k). Since all training set samples are eliminated,
there will be no further iterations in the algorithm. In the last step, s is assigned as nearest
sample n to the test sample x if DT W (x, s) < DT W (x, n) (see Figure 2.21l). This completes
the algorithm, and the label of n can be assigned to x.
   Figure 2.23: Effect of H on the elimination criteria and accuracy of the classification.
    In traditional classification, it would be required to compute the distance between P and
x. This would be equal to six DTW computations for the example in Figure 2.21. However,
only two DTW computations were made with the AESA algorithm. This demonstrates how
AESA is capable of significantly reducing the number of DTW computations.
                                                  52


    The choice of the parameter H influences computation reduction and classification ac-
curacy. To show that, two illustrations that correspond to two H values are provided in
Figure 2.23. Figure 2.23 shows how the defined area is reduced when H is increased. The
decrease in the area increases the number of eliminated samples from the training set, and
P and E will be equal to each other in a small number of iterations. Thus, the algorithm
performs fewer DTW computations as H increases. In addition, increasing H can lead to
misclassification. Larger shaded area with low H can cover the nearest sample n in the re-
gion (see Figure 2.23 (left)), while smaller one can exclude n (see Figure 2.23 (right)). This
exclusion can lead to misclassification since the true n is eliminated. Therefore, a decrease
is expected in classification accuracy when H becomes too large. The process for choosing
H is described in more detail in Section 2.5.5.3.
2.5.5   Results
    This section presents the results for the classification accuracy using the approach pro-
posed in Section 2.5.1 and 2.5.4 as well as current state-of-the-art methods in the literature.
The results presented in this section are obtained with turning cutting signals explained
in Section A.1. Specifically, Section 2.5.5.1 compares the classification accuracy using the
same data set for the similarity-based method to the WPT/EEMD methods [9], and the
TDA-based results [10]. Section 2.5.5.2 describes how parallel computing is employed with
DTW approach. Further, Section 2.5.5.3 provides the results obtained using Approximate
and Eliminate Search (AESA) explained in Section 2.5.4.
2.5.5.1    Classification Results for Dynamic Time Warping (DTW)
    Table 2.10 provides classification accuracies and the time needed to compute the distance
between two time series. Although the same time series is used to find the time needed to
compute the distance between them, cDTW is faster compared to the FastDTW, as seen
from Table 2.10. As the larger r parameter is selected for FastDTW, its computational time
                                               53


increases. However, lower r values do not approximate the distance between two time series
well. In addition, the cDTW algorithm matched the accuracy obtained from FastDTW.
Therefore, cDTW algorithm is used to obtain results in this study.
Table 2.10: Comparison of classification accuracy and the time required to compute the
distance between two time series for cDTW and FastDTW packages.
                                     cDTW                               FastDTWr=21
 Overhang Length
                          Accuracy        Time (seconds)         Accuracy        Time (seconds)
      cm (inch)
         5.08
                      98.34% ± 1.08%            1.5          99.24% ± 0.73%           51.1
          (2)
    Table 2.11 compares the best classification scores obtained from WPT, EEMD, and the
TDA-based methods to the results from DTW. Results for K = 1, 2, . . . , 5 for the KNN
classifier are obtained, and the best accuracies are obtained for DTW in Table 2.11. The
cells highlighted in green are the ones with the highest overall classification score, while
those highlighted in blue represent results with error bands that overlap with the best overall
accuracy in the same row. A full list of the average classification scores and the corresponding
standard deviations can be found in Tables B.6 and B.7.
Table 2.11: Comparison of results for similarity-based methods with their counterparts avail-
able in the literature.
                  Similarity
                                    Topological Data Analysis (TDA)          Signal Processing
                   Measure
   Overhang
    Length                         Template      Carlsson      Persistence
                DTW* DTW                                                      WPT      EEMD
       cm                          Functions Coordinates         Images
     (inch)
      5.08
                94.5% 98.3%         91.5%         93.6%           96.4%       93.9%     84.2%
       (2)
      6.35
                86.9% 72.3%         89.3%         86.3%           85.8%      100.0%     78.6%
      (2.5)
      8.89
                92.9% 93.0%         83.9%         95.7%           93.0%       84.0%     90.7%
      (3.5)
     11.43
                70.9% 75.7%         65.1%         72.2%           72.5%      87.5%      79.1%
      (4.5)
  *Intermediate chatter cases are excluded.
                                               54


    Table 2.11 shows that features based on WPT, Carlsson Coordinates, which is a TDA-
based method, and DTW give the highest accuracy for the different overhang cases. For the
5.08 and 8.89 cm (2 and 3.5 inch) overhang cases, DTW and Carlsson Coordinates have the
highest classification accuracies of 98.3% and 95.7%, respectively. However, the DTW is in
the error band of the highest accuracy for the 8.89 cm (3.5 inch) case. On the other hand,
feature extraction with WPT and RFE is the most accurate for the 6.35 and 11.43 cm (2.5
and 4.5 inch) overhang cases scoring 100% and 87.5%. While the results from other methods
are not the highest for any of the considered cases, some of them still lie within the error
bars for the 11.43 cm (4.5 inch) overhang cases. Specifically, EEMD is within one standard
deviation of the best results for the 11.43 cm (4.5 inch) case.
    Note that the results provided in Table 2.11 are for two-class classification: chatter and
chatter-free. In the column named DTW*, the results excluding the intermediate chatter
cases are presented. In contrast, the DTW column shows the results when intermediate
chatter cases are assumed as chatter in classification. The reason why the intermediate
chatter cases are excluded in DTW is revealed in the heat maps of the similarity matrices.
The heatmaps of the average DTW distances between the three classes (chatter, intermediate
chatter, and no-chatter/stable) are given in Figures 2.24, 2.25, B.1, B.2. Each of the nine
regions in these figures shows the average DTW distance between all the cases marked
according to the row and the column labels in that region, e.g., the top right region reports
the average DTW distance between the time series tagged as no-chatter versus those tagged
as chatter. These heatmaps show us how the time series belonging to different classes are
similar to each other. Ideally, the average distance between time series with the same label
is expected to be small compared to the average distance between time series with different
labels. This can also be observed in Figure 2.24. The average distance between stable
cases is the lowest, while the one between stable and unstable (chatter) cases is the highest.
Therefore, the DTW algorithm can distinguish the signals with different labels. However,
this may not hold all the time. For instance, the heat map of the 6.35 cm (2.5 inch) case in
                                              55


Figure 2.25 indicates that the average distance between intermediate chatter and no-chatter
time series is almost identical to the average distance among intermediate chatter time series.
This explains the low accuracy when the intermediate chatter is included as a separate class
in Table 2.11. In this case, the classification algorithm can classify intermediate cases as
stable ones, although these cases are taken into account as chatter cases, thus reducing the
resulting accuracy. When the intermediate cases are excluded completely, the classification
score increases from 72.3% to 86.9% since there is a large difference between the average
distances of stable and chatter cases, as seen in Figure 2.25.
                             Stable      Intermediate Chatter      Chatter
                                                                                      7500
                                                                                      7000
              Stable        3502.65            5760.55             6609.15
                                                                                      6500
                                                                                      6000
                                                                                      5500
        Intermediate        5760.55            6025.13             6256.49
             Chatter                                                                  5000
                                                                                      4500
                                                                                      4000
             Chatter        6609.15            6256.49             5347.66
                                                                                      3500
                                                                                      3000
                             Stable      Intermediate Chatter      Chatter
Figure 2.24: The heat map of averages DTW distances of time series belongs to three classes
for the 5.08 cm (2 inch) case.
    In the 11.43 cm (4.5 inch) case, Figure B.2 indicates that there is no significant difference
in the average DTW distances. As a consequence, there might be errors in the classification of
the chatter cases, and this leads to low overall classification accuracy, as shown in Table 2.11.
When the intermediate chatter cases are not included, the classification score is 70.9%. Since
the difference between the average distance of intermediate-stable and intermediate-chatter
is small, the classification algorithm can still classify intermediate cases as chatter. This can
increase the classification accuracy when the intermediate cases are included, as shown in
Table 2.11. The score increases from 70.9% to 75.7%. For 5.08 and 8.89 cm (2 and 3.5 inch)
                                                 56


                            Stable        Intermediate Chatter   Chatter
                                                                                   7500
                                                                                   7000
              Stable       4594.57              5268.12          6503.78
                                                                                   6500
                                                                                   6000
                                                                                   5500
        Intermediate       5268.12              5270.43          6241.70
             Chatter                                                               5000
                                                                                   4500
                                                                                   4000
             Chatter       6503.78              6241.70          5362.20
                                                                                   3500
                                                                                   3000
                            Stable        Intermediate Chatter   Chatter
Figure 2.25: The heat map of averages DTW distances of time series belongs to three classes
for the 6.35 cm (2.5 inch) case.
cases, the heatmaps (see Figure 2.24 and B.1) show clear differences between the three cases.
This explains the high classification accuracies for both cases.
    However, there might be interest in identifying intermediate chatter as part of a predic-
tion algorithm that intervenes before the process develops into full chatter. Alternatively,
inducing or sustaining intermediate chatter might be desirable for surface texturing applica-
tions. Figures A.4b–d show that the power spectrum for chatter and intermediate chatter
is very similar, making the featurization in the frequency domain extremely challenging.
However, Figure A.4a shows a clear difference between the two chatter regimes in the time
domain. Therefore, it is more advantageous to extract features in the time domain for a
three-class classification (chatter, intermediate chatter, and no chatter).
    For 5.04 and 8.89 cm (2 and 3.5 inch) overhang cases, Figure 2.24 and B.1 show that DTW
can differentiate between chatter and intermediate chatter as evidenced by the high average
distance between time series tagged as chatter and intermediate chatter. The regions that
list the distances between intermediate chatter and chatter cases show that the distances
between chatter-chatter and chatter-intermediate chatter cases are quite different. This
confirms the ability of the proposed approach to distinguish the differences between these two
                                                 57


cases. However, the KNN algorithm may not differentiate these cases due to the similarities
between average distances in the case of 6.35 and 11.43 cm (2.5 and 4.5 inch) overhang
distances. As a concrete example for the three-class classification, the distance matrices are
computed for all overhang distances and fed to the KNN classification algorithm to obtain
the best 3-class classification accuracy. Table 2.12 provides the best results of corresponding
cases (the full classification results can be found in Table B.8). Table 2.12 shows that the
DTW approach successfully distinguishes the three different classes, and the success rates
are only slightly below the success rates of the two-class classification (cf. Table 2.11) for
5.04 and 8.89 cm (2 and 3.5 inch), while the low classification accuracy is observed for 6.35
and 11.43 cm (2.5 and 4.5 inch) cases as expected.
Table 2.12: The best accuracy results for three class classification with the DTW approach
and the corresponding number of nearest neighbors used in the KNN algorithm.
                      Overhang Length
                                                DTW         K-NN algorithm
                          cm (inch)
                           5.08 (2)        97.7% ± 1.1%           1-NN
                          6.35 (2.5)       71.4% ± 7.0%           4-NN
                          8.89 (3.5)       95.5% ± 5.4%           3-NN
                          11.43 (4.5)      73.9% ± 4.6%           4-NN
2.5.5.2     Parallel Computing
    In this section, parallel computing is utilized to expedite calculating the distance matrices.
In a parallelized way, multiple distances can be computed simultaneously, which reduces
the total run time. The High Performance Computing Center (HPCC) of Michigan State
University (MSU) is used, and all the distance matrices are computed to produce the results
shown in this section. HPCC is composed of several supercomputers, each with hundreds
of nodes. Each node can be thought of as a computer with a certain number of CPUs and
cores.
    In parallel computing, 175, 82, 22 and 154 jobs are submitted at the same time for 5.08,
6.35,8.89, and 11.43 cm (2, 2.5, 3.5 and 4.5 inch) cases, respectively. For the 5.08 cm (2
                                                 58


inch) case, the number of distance computations made in each job is 1000, while it is 100 for
other cases. One to five nodes, 5 CPU per job, and 2GB of RAM per CPU are requested.
Therefore, each job is run with 10 GB of RAM in total. It is also worth mentioning that
HPCC-MSU has a job submission policy, and this policy determines the queue time for a
user depending on the requested resources from HPCC-MSU. Each time a user requests a
large amount of resources, the queue time for the job submitted by that user gets higher. In
addition, the queue time depends on the current usage of HPCC-MSU since it is open to all
university members. Although all jobs are submitted simultaneously to HPCC-MSU, their
computation is not started at the same time due to queue time, which causes deviations
from the ideal time, which is equal to the runtime of a single job.
    Table 2.13 provides the times required to obtain classification results with DTW and its
counterparts. For DTW, two different run times are reported: traditional and parallel. In
traditional DTW, only one distance computed is performed at a time. Therefore, a significant
difference is observed between parallel and traditional computations in spite of the fact that
the times reported in Table 2.13 for DTW (Parallel) include the queue time. The times
reported for traditional computation are estimated based on the time required to complete
one distance computation on Dell Optilex 7050 desktop with Intel Core i7-7700 CPU and
16.0 GB RAM. In addition, the times reported for TDA-based feature extraction methods are
obtained by applying parallel computing only on persistence diagram computations, while
parallel computing is not involved in WPT and EEMD. Even though WPT and EEMD are
the fastest methods, parallel computing with DTW significantly reduces the total run time,
making the latter the third-fastest method.
    It is pointed out that the classification based on WPT is optimized, whereas the classi-
fication based on DTW is far from optimal. Therefore, one of the aims of this chapter is
the introduction of DTW for chatter detection and the presentation of its general perfor-
mance. In fact, this includes the necessity for an optimization of the algorithm, as seen from
Table 2.13 even if parallel computing is used. However, the values reported in Table 2.13
                                               59


Table 2.13: Time (seconds) comparison between similarity measure method and its counter-
parts.
                      Similarity
                                          Topological Data Analysis (TDA)      Signal Processing
                       Measure
  Overhang
   Length         DTW           DTW      Template     Carlsson     Persistence
                                                                               WPT       EEMD
      cm      (Traditional)   (Parallel) Functions Coordinates       Images
    (inch)
     5.08
                 227500*         7263      9495         9424          9454      116        309
      (2)
     6.35
                  10984*         3192      3523         3452          3482       37         83
     (2.5)
     8.89
                  2896*          1152      2148         2077          2107        5         47
     (3.5)
    11.43
                  20020*         1932      4894         4823          4853        7         68
     (4.5)
 *These run times are rough estimations.
for WPT and EEMD do not include the time required for choosing the informative wavelet
packets or IMFs using manual preprocessing since the speed of manual preprocessing depends
on the skill of the user and is more difficult to track. It is expected that the overall time will
be much higher for WPT and EEMD after including the manual processing time. First, the
computing time can be decreased by decreasing the length and/or the number of time series.
At the moment, it is not clear how such a reduction affects the performance of the method.
For example, the success rates for the 8.89 cm (3.5 inch) case and the 5.08 cm (2 inch)
case are comparable even though the available training data is much smaller for the former.
Second, the upper diagonal of the distance matrix, including pairwise distances between the
training data, is computed. Although Table 2.13 shows that DTW clocks the second-fastest
runtime, it is noted that this slowdown is mostly related to the training phase because of the
large number of necessary pairwise distance computations during the training/testing phase.
However, once the classifier is obtained, the necessary runtime for DTW will be significantly
reduced because any new data is classified upon computing its pairwise distance with the
training set, i.e., the only needed computation is equivalent to the evaluation of one row of
the training/testing similarity matrix (see Table2.14). Finally, it is possible that the code
                                                60


for calculating the DTW distance matrix can be further optimized. Many researchers have
published on DTW optimization, especially for data mining ([137]), including a speedup of
runtime for distance matrix computations between time series and its query. The resulting
algorithm allows performing fast queries on a single core machine in a very short time, thus
allowing using small consumer electronics to handle the data and possibly extract features
from it in real-time. However, the AESA algorithm explained in Section 2.5.4 and given Al-
gorithm 2.1 is applied to reduce the number of distance computations and the time required
for testing.
2.5.5.3    Approximate and Eliminate Search Algorithm Results
    The run times for DTW methods with parallel computing given in Table 2.13 are for
training a classifier. When a new test sample is introduced, distances between the test
sample and all the training set samples need to be computed to identify the nearest neighbors.
However, computing these distances the traditional way can be too lengthy, especially when
considering DTW for online chatter detection applications. Therefore, the AESA algorithm is
employed to reduce the number of DTW computations for a test sample during classification.
    The first step is to check if there is any violation of the loose triangular inequality given
in Equation (2.13). The looseness constants for all combinations of three different time series
are computed for turning cutting data set, and plots are provided in Fig. 2.26.
    This figure shows that all combinations of three time series comply with the triangular
inequality since the frequencies are accumulated on positive looseness values. Then, a range
of looseness constant(H) is chosen between 0 and 105 with an increment of 100. Therefore,
101 different H values are used as input to the AESA algorithm. The data set of each
overhang distance is split into training (67%), and test (33%) sets. For every H, the number
of DTW distance computations made per test sample, and the predicted labels of test samples
are obtained as output. Then, these predicted labels are compared with the true labels of the
time series to determine the accuracy level, and the average number of distance computations
                                                61


                                                                  0.0005
               0.0005       2 inch                                          2.5 inch
                                                                  0.0004
               0.0004
   Frequency
                                                                  0.0003
               0.0003
               0.0002                                             0.0002
               0.0001                                             0.0001
               0.0000                                             0.0000
               0.0008       3.5 inch                                        4.5 inch
                                                                  0.0005
               0.0006                                             0.0004
   Frequency
                                                                  0.0003
               0.0004
                                                                  0.0002
               0.0002
                                                                  0.0001
               0.0000                                            0.0000
                        0      2500    5000 7500 10000 12500 15000      0      2500    5000 7500 10000 12500 15000
                                         Looseness (H)                                   Looseness (H)
Figure 2.26: Histograms for looseness constant of all combinations of three different time
series for all overhang distance.
made for the samples in the test set is computed. A plot is provided to show how the average
number of distance computations per test sample and classification accuracy change with
varying H in Figure 2.27. All of the results presented in Figure 2.27 are obtained with
HPCC-MSU and using 1NN implementation in the AESA. However, AESA can be modified
to perform classification with a larger number of nearest neighbors.
    Figure 2.27 indicates that there is a decrease in the average number of distance compu-
tations and accuracy as the looseness constant increases. However, the average number of
distance computations decreases dramatically. Therefore, AESA algorithms can reduce the
number of DTW distance computations while keeping the accuracy as large as possible. One
can find a value of H with low distance computation and high accuracy value. Furthermore,
an increase in accuracy with increasing H is observed for 6.35, 8.89, and 11.43 cm (2.5, 3.5,
and 4.5 inch) overhang distances. However, 6.35 and 8.89 cm (2.5 and 3.5 inch) have fewer
samples in comparison to the other cases (see Table A.1). The fewer data samples in an
                                                                62


                            5.08 cm (2 inch)                                        6.35 cm (2.5 inch)
           100                                                100
                                                           400
                                                                                                                   80
            80                                                   80
                                                                                                                        Average number of
                                                           300
                                                                                                                   60
Accuracy
            60                                                   60
                                                           200
                                                                                                                        DTW Computation
                                                                                                                   40
            40                                                   40
                                                           100                                                     20
            20                                                   20
                     Training                                                Training
                     Set Length                                              Set Length
                                                            0                                                       0
             0                                                       0
                 0   2000      4000   6000       8000   10000            0   2000      4000   6000       8000   10000
                            8.89 cm (3.5 inch)                                  11.43 cm (4.5 inch)
           100                                                  100
                                                           40
            80                                                   80                                                100
                                                                                                                        Average number of
                                                           30                                                      75
Accuracy
            60                                                   60
                                                                                                                        DTW Computation
                                                           20                                                      50
            40                                                   40
            20                                             10 20                                                   25
                     Training                                                Training
                     Set Length                                              Set Length
                                                            0                                                       0
             0                                                       0
                 0   2000      4000   6000       8000   10000            0   2000      4000   6000       8000   10000
                              Looseness (H)                                           Looseness (H)
Figure 2.27: Classification accuracy (%) (solid red line) and the average number of DTW
computations (green dashed line) for varying looseness constant (H).
overhang distance case cause bigger jumps in accuracy, as shown in Figure 2.27. In addition,
the plots shown in Figure 2.27 are only for one train-test split. One may obtain a smoother
decrease in accuracy plots by applying train-test split several times and taking the average.
Then, the accuracy values for small H will converge to the values which are obtained with
10 times train-test split and provided in Table 2.11.
           Some H values are chosen, and their corresponding classification score and the average
number of distance computations are found for all overhang distance cases from Figure 2.27.
The time required to classify one test sample is estimated for the traditional and parallelized
way, and a comparison between traditional computing, parallel computing, and AESA al-
gorithm is provided in Table 2.14. Accuracy reported in Table 2.14 is the corresponding
accuracies shown in Figure 2.27 and Table 2.11.
                                                                63


Table 2.14: Classification time of one test sample for traditional way, parallel computing,
and AESA algorithm and corresponding average accuracy obtained from Table 2.11 and
Figure 2.27.
                       Traditional                  AESA                     Parallel
        Overhang
                                                      Avg.
         Distances    Acc.   Time(s)     H     Acc.          Time(s)     Acc.   Time(s)
                                                     count
        cm (inch)
          5.08 (2)    98.3    595.5*   6900   90.81   1.13     1.70      98.3    ≈ 1.5
        6.35 (2.5)    72.3    130.5*   3300   74.42   20.04    30.06     72.3    ≈ 1.5
        8.89 (3.5)    92.9     66*     6300   86.96   1.09     1.64      92.9    ≈ 1.5
        11.43 (4.5)   75.7    175.5*   5600   81.36   1.19     1.79      75.7    ≈ 1.5
       *These run times are rough estimates.
    The H values chosen in Table 2.14 is an example. One can choose different values for H as
well. There is a trade-off between accuracy and the average number of distance computations.
Higher H values can be selected to have a small number of distance computations, but
this comes with a lower classification score. In the traditional way, the reported times are
estimated based on the time required to complete one distance computation. On the other
hand, parallel computing could be the fastest method among them. Ideally, if all distance
computations between the test set sample and the training set are sent to HPCC-MSU in
separate jobs, each job will take nearly 1.5 seconds to complete. However, there might
be some queue time which leads to a delay in time to obtain results. One can also use
the available workstations to perform parallel computing without needing the extravagant
supercomputers that HPCC-MSU has. Using parallel computing is optional and a viable
option if a high-performance computing cluster is available. Alternatively, the user can speed
up the computations by implementing the AESA algorithm on a workstation. Moreover,
AESA algorithm provide promising classification times as seen from Table 2.14. Classification
can be completed in less than two seconds with the chosen H values for all overhang distances
except the 6.35 cm (2.5 inch). One can choose another value of H for the 6.35 cm (2.5 inch)
case to obtain results faster, but this corresponds to a lower classification score. Table 2.14
shows us that AESA and parallel computing can enable in-process chatter detection on the
                                               64


cutting centers with the similarity measure approach using a classifier that is trained first
offline and then loaded to a controller attached to the manufacturing center. Moreover, the
implementation of AESA in an online manufacturing application is cheaper than buying
supercomputers or workstations to perform parallel computing.
     The DTW-based approach operates directly on the time series, thus bypassing the pre-
processing step involved in the WPT and EEMD methods. Further, in contrast to deep
learning techniques, such as neural networks, using DTW does not necessitate a large num-
ber of datasets for training. The feasibility of implementing DTW in a real cutting center
is examined by investigating its transfer learning capabilities. Providing these results with
smaller deviation and eliminating the manual preprocessing are significant advantages for
the DTW approach that still enables it to achieve high classification accuracies even if the
system parameters (in this case, the eigenfrequencies) shift during the process.
2.6     Topological Data Analysis (TDA) Based Approach
2.6.1    Topological Data Analysis
     Topological Data Analysis (TDA) extracts information by investigating the shape of
the data. In this study, persistent homology, which is a powerful tool of Topological Data
Analysis (TDA), is proposed to extract features from the persistence diagrams and use them
in supervised machine learning algorithms. Experimental data is embedded using Takens
embedding theorem [138], and 1-D persistent homology is investigated for feature matrix
generation. In this section, persistent homology is briefly explained, and one can refer to
[62, 63, 64, 65, 139, 140] for detailed information about TDA and persistent homology.
     Persistent homology provides a compact tool for studying the topology of data embedded
in a Euclidean space which is often called a point cloud. The resulting shape information
is represented by a two dimensional plot called the persistence diagram. A persistence
diagram can be obtained for different shape characteristics of interest. For instance, if the
main interest is in the connectivity of the points in the point cloud, then that information
                                               65


can be represented in a 0-dimensional (0-D) persistence diagram. Alternatively, the 1-D
persistence diagram is computed to investigate the loops in the point cloud. For voids, 2-
D persistence diagrams are computed, and so on. This study only focuses on extracting
features from the 0-D and 1-D persistence, so in the following, the basic idea for obtaining
the 1-D persistence will be explained, i.e., for representing loops that emerge and disappear
as the point cloud is thickened. The process for obtaining the 0-D persistence is similar, only
instead of considering loops, the connectivity of the points would need to be tracked as the
point cloud is uniformly thickened.
     Consider the point cloud shown in Figure 2.29a. Then, the point cloud starts to thicken,
i.e., expand disks with radius ϵ around each data point. As ϵ is increased, disks can start
to intersect. The intersection of two disks forms an edge as shown by the two edges in
Figure 2.29b. Increasing ϵ further can lead to three disks intersecting, thus forming a triangle
which is filled in as shown by the nine triangles in Figure 2.29c. At some values of ϵ, some
disk intersections will lead to cycles. The time (here, the ϵ value) at which a cycle appears
is called the birth time of the cycle. Figure 2.29d shows three example cycles numbered 1,2
and 3 with birth time b1 = b2 = b3 .
     As the disks continue thickening in the point cloud, more disks will intersect, leading
to more triangles filling in, and at some point, some cycles may fill in. The time at which
a cycle disappears is called its death time. For example, Figs. 2.29e–g show the death
times of cycles 1–3, respectively. The information about the birth and death of cycles is
succinctly summarized in a persistence diagram. In this diagram, each point corresponds
to the paired birth and death times of a cycle. For example, Figure 2.29h shows the tuples
(b1 , d1 ), (b2 , d2 ), and (b3 , d3 ) corresponding to the birth and death times of the cycles 1–3
shown in Figure 2.29d. The cycle that persists the longest is characterized by the highest
point above the diagonal and is called maximum persistence. In this example, cycle 3 leads
to the maximum persistence in the persistence diagram.
                                                      66


2.6.1.1    Simplicial complexes
    Let {u0 , . . . , uk } ∈ Rd be a set of data points, and the vectors defined between these
data points (u1 -u0 ,u2 -u0 ,. . . ,uk -u0 ) are linearly independent. A geometric k-simplex, σ is
a set of all points in Rd such that             j=0 λj uj where   j=0 λj = 1 and λj ≥ 0 for all j.
                                              Pk                Pn
Figure 2.28 provides illustrations for 0, 1, and 2−simplex. Each data point on a point cloud
is represented as 0-dimensional simplex and they are called vertices. When two vertices are
connected, an edge is formed and it is 1-dimensional simplex. Connection of three vertices
will form 2−simplex which is a triangle. Simplicies spanned by any subset of u0 , . . . , uk are
the faces of σ. In general, n−simplex contains n + 1 vertices, and the set of these simplices,
are called geometric simplicial complexes, K, if the following two conditions are satisfied
[139]: 1) If σ ∈ K, then faces of σ are also in K, 2) If two n−simplex, σ1 and σ2 are in
K, then the intersection of them is either common face or empty. The dimension of the
simplicial complex, K, is equal to the largest dimension of its simplices.
              Figure 2.28: Formation of simplicial complexes from a point cloud.
2.6.1.2    Persistent Homology
    The simplicial complex K is used to compute homology Hn (K) in a different dimension
to identify the shape of the data. 0 dimensional homology, H0 (K) represents connected
components and one dimensional homology, H1 (K) represents loops, while two dimensional
homology H2 (K) represents voids. The simplicial complex K is not fixed in persistent
homology, and it varies over time.
                                                      67


          Figure 2.29: Generation of persistence diagrams using The Rips Complex.
    Disks centered at data points of the point cloud start to expand and let ϵ be the radius of
these disks. As ϵ increases, n-dimensional simplicies are formed, as shown in Figure 2.29. The
intersection of two disks forms an edge (1-simplex), while a triangle (2-simplex) is formed
when three disks intersect with each other (see Figure 2.28). Each ϵ will result in different
simplicial complexes, and they can be approximated using filtration functions. This study
uses a Python package that employs the Rips complex. The definition of Rips complex is
given as
                             Rϵ (K, d) = {σ ⊂ K| max d(x, y) < ϵ},                       (2.17)
                                                 (x,y)∈σ
where d is the distance between vertices of simplicial complex σ. Let {ϵ1 < ϵ2 < . . . < ϵm }
be the set of varying radius of the disk. These radii form Rips complexes such that
                                      R1 ⊆ R2 ⊆ . . . ⊆ Rm                               (2.18)
where Rj = Rϵj (K, d). Then, a specific dimension n can be chosen to identify the shape
of the data along the simplicial complexes Rj . For instance, if a loop is seen first in Ri ,
this is called birth time (b = ϵi ). When it disappears in Rj , this is denoted as death time
(d = ϵj , where d > b). That allows one to generate persistence diagrams, D for a given point
cloud, and selected persistent homology dimension. Figure 2.29h provides an example of a
persistence diagram. The horizontal axis represents the birth time, while the vertical axis
                                              68


is for the death time, and all the points in the diagram are above the diagonal. The points
with larger lifetime (d − b) are farther away from the diagonal, and these points often include
important information about the overall shape and structure of the data. While it is possible
to generate a persistence diagram for each homology dimension, one dimensional persistent
homology is mostly focused on feature extraction since it can capture circular structure in
the data, which is often a characteristic of chatter in turning.
2.6.2    Method
    The method proposed for chatter detection using topological features can be summarized
using Figure 2.30. The parts of the pipeline related to data collection, processing, and
labeling were described in Sections A.1–A.1.1, respectively. In this section, the rest of the
steps shown in Figure 2.30 is described.
    Recall that the cutting tests are composed of four different overhang configurations. Each
configuration includes a different number of time series that correspond to different labels,
rotational speeds, and depths of cut. Therefore, the time series are grouped with respect to
these three parameters, and they are normalized to have zero mean and unit variance. This
normalization reduces the effect of large feature values on smaller ones ([141]).
    The next step is to split long time series into smaller pieces to reduce the computation time
needed for finding the delay reconstruction parameters (see Section 2.6.3.1) and for obtaining
the persistence diagrams (see Section 2.6.1.2). Upon finding the appropriate embedding
parameters, the data is embedded using delay reconstruction, also known as Takens’ delay
embedding, see Section 2.6.3.1. The resulting point cloud is then used to compute the
corresponding persistence diagrams using two different approaches: 1)traditional way where
persistence diagrams are computed with Ripser package for Python, 2) approximating to
point cloud with Bezier curves and computing persistence diagrams using line segments
generated with Bézier curves ([77]). The second method is a different approach introduced
in [77], and it uses Bézier curves to approximate to reduce the time to compute persistence
                                                 69


diagrams. Section 2.6.3 explains both approaches for persistence diagram computation,
and sections 2.6.4.1–2.6.4.5 describe different methods in the literature for featurizing the
resulting diagrams to obtain a feature vector in Euclidean space that can be used with
existing machine learning tools such as SVM. 1-dimensional persistent homology H1 is mainly
utilized for feature extraction except when the template function approach is used. For
template functions, 0-dimensional persistence H0 is also used ([72]).
        Figure 2.30: Pipeline for feature extraction using topological features of data.
2.6.3    Persistence Diagram Computation
    In this section, the embedding of time series to a higher dimension and the persistence
diagram computation using the Bézier curve approximation method are explained briefly.
The steps for persistence diagram computation are summarized as shown in Figure 2.31.
                     Figure 2.31: Persistence diagram computation steps.
2.6.3.1    Delay Reconstruction
    Takens’ theorem lays a theoretical framework for studying deterministic dynamical sys-
tems ([138]). It states that, in general, embedding of the attractor of a deterministic dynam-
                                               70


ical system can be obtained from a one-dimensional recording of a corresponding trajectory.
This embedding is a smooth map Ψ : M → N between the manifolds M and N that
diffeomorphically maps M to N .
    Specifically, assume an observation function β(x) : M → R, where for any time t ∈ R
the point x lies on an m-dimensional manifold M ⊆ Rd . While in practice the flow of the
system is not available for a time t ∈ R given by ϕt (x) : M × R → M , the observation
function implicitly captures the time evolution information according to β(ϕt (x)), typically
in the form of the one-dimensional, discrete and equi-spaced time series {βn }n∈N .
    Takens’ theorem states that by choosing an embedding dimension d ≥ 2m+1, where m is
the dimension of a compact manifold M , and a time lag τ > 0, then the map Φϕ,β : M → Rd
given by
                        Φϕ,β = (β(x), β(ϕ(x)), . . . , β(ϕd−1 (x)))
                                                                                              (2.19)
                             = (β(xt ), β(xt+τ , β(xt+2τ , . . . , β(xt+(d−1)τ ))),
is an embedding of M , where ϕd−1 is the composition of ϕ d − 1 times and xt is the value of
x at time t.
    For noise-free data of infinite precision, any time lag τ can be used; however, in practice,
the choice of τ can influence the resulting embedding. In this study, τ was found by using the
method of Least Median of Squares (LMS) ([142]) combined with the magnitude of the Fast
Fourier Transform (FFT) of the signal. Specifically, the FFT spectrum is obtained, and the
maximum significant frequency is identified in the signal using LMS [143]. Then Nyquist’s
sampling criterion is used to choose the delay value according to the inequality described in
[144]. This approach yielded reasonable delay values in comparison to the standard mutual
information function approach ([145]), where the mutual information function is plotted for
several values of τ , and the first dip in the plot indicates the τ value to use. This is because
(1) the mutual information function is not guaranteed to have a minimum, thus leading to
a failed selection of τ and, (2) the identification of the first true dip, if it exists, is not easy
to automate especially in non-smooth plots.
                                                  71


    The embedding dimension d ∈ N is computed by using the False Nearest Neighbor
(FNN) approach ([146, 147]). In the FNN approach, the delay reconstruction is applied to
the time series using increasing dimensions. The distances between neighboring points in one
dimension are re-computed when the points are embedded into the next higher dimension.
Keeping track of the percent of points that appear to be neighbors in a low dimension but
are farther apart in a higher dimension (termed false neighbors) makes it possible to identify
a threshold that indicates that the attractor has been sufficiently unfolded. Applying FNN
to all of the time series yielded the values in the range d ∈ {1, 2, . . . , 10}, depending on
the time series being reconstructed. Upon identifying τ and d for each time series, delay
reconstruction was used to embed the signal into a point cloud P ⊆ Rd . The shape of
the resulting point cloud was then quantified using persistence, as described in Section 2.6.1.
Five different methods are studied to extract features from the resulting persistence diagrams
as shown in Sections 2.6.4.1–2.6.4.5.
2.6.3.2    Persistence Diagram Computation with Bézier Curve Approximation
    Bézier curves were introduced by Pierre Bézier, and they have been widely used in com-
puter aided design software and path planning applications for robots ([148, 149, 150, 151]).
Recently, [77] utilized Bézier curves to speed up the persistence diagram computation. In
this approach, the first step is to divide the point cloud into groups. The user defines the
number of samples per group (spg), then a Bézier curve is individually fitted to each group.
Line segments are then generated using these curves according to the number of line seg-
ments per curve r selected by the user. Pairwise distance matrix between the line segments
is computed, and it is given to Ripser as input, and it will provide the persistence diagrams
as output based on the selected maximum number of homology dimensions. This section
explains two of three main steps, which are fitting a Bézier curve and computing the dis-
tance matrix between the line segments, and the effect of spg and r on the approximation
of persistence diagrams.
                                              72


     Fitting a Bezier Curve: In this implementation, the cubic Bézier curve is used, and
its expression is defined as
                                          X  3
                                p(t) =         (1 − t)3−i ti pi   0 < t < 1,                      (2.20)
                                           i=0
where t is the parametrization variable and pi is called the control points of the curve. For
cubic Bézier curves, there are four different control points. The first one and the last one
should match with the first and last point of the group of samples where the Bézier curve
is fit, respectively. Solution for these control points is obtained by using the least squares
error method, and the expression is given as
                                                         X  l
                                L(p0 , p1 , p2 , p3 ) =       ||p(ti ) − xi ||2 ,                 (2.21)
                                                          i=1
where xi is the data points of the point cloud. Solution of ∇L = 0 provides the control
points. The expression for ∇L = 0 can be rewritten such that
                         l      3  
                                                                     !            !
                       X       X       3                                 ∂p(tk )
                            2                         3−i i
                                            (1 − tk ) tk pi − xk                    =0            (2.22)
                        k=1    i=0
                                        i                                   ∂pi
                ∂p(tk )
                         = (1 − tk )3 ê1 + 3(1 − tk )2 tk ê2 + 3(1 − tk )t2k ê3 + t3k ê4 = 0, (2.23)
                  ∂pi
where tk represents varying parametrization variable along a Bézier curve. The above ex-
pression can also be written in matrix form as
                                     l
                                              !        l
                                   X                 X
                                          Ak p =           bk   or Ap = b.                        (2.24)
                                   k=1                k=1
The equations for Ak , p and bk are defined as
                                                       73


                  Point Cloud                Divide into groups                Bezier Curves                    Line Segments
      1.00                           1.00                          1.00                                                              Line Segment 1
                                                                                                         1.0                         Line Segment 2
      0.75                           0.75                          0.75
                                                                                                                                     Line Segment 3
      0.50                           0.50                          0.50                                                              pi for Group 1
                                                                                                         0.5
                                                                                                                                     pi for Group 2
      0.25                           0.25                          0.25
                                                                                                                                     pi for Group 3
x2    0.00                           0.00                          0.00                                  0.0
     −0.25                          −0.25                         −0.25
                                                                                                        −0.5
     −0.50                          −0.50                         −0.50
                                                      Group 1                            Curve 1
     −0.75                          −0.75             Group 2 −0.75                      Curve 2        −1.0
                                                      Group 3                            Curve 3
     −1.00                          −1.00                         −1.00
             −1        0        1           −1        0       1           −1         0             1           −1     0      1
                      x1                             x1                             x1                                x1
        Figure 2.32: Illustration showing Bézier curve fit and generation of line segments.
                                                                                                                                
                                             (1 − tk )6      3(1 − tk )5 tk 3(1 − tk )4 t2k                     (1 − tk )3 t3k
                                                                                         
                                           5             4 2           3 3           2  4
                                3(1 − tk ) tk 9(1 − tk ) tk 9(1 − tk ) tk 3(1 − tk ) tk 
                                                                                         
                           Ak = 
                                                                                                                                          (2.25)
                                           4 2           3 3           2 4             5
                                                                                          
                                3(1 − tk ) tk 9(1 − tk ) tk 9(1 − tk ) tk 3(1 − tk )tk 
                                                                                         
                                            (1 − tk )3 t3k   3(1 − tk )2 t4k         3(1 − tk )t5k                   t6k
                                                                                               T
                                     1
                                  p0             p20 . . . pd0                  (1 −  tk )3       1
                                                                                                 xk 
                                                                                              
                                  p1             p21 . . . pd1                3(1 − tk )2 tk  x2 
                                   1                                                             k
                                p=                             ,         bk =                  ,                                      (2.26)
                                                                               
                                  p1             p22 . . . pd2                 3(1 − t )t2   ... 
                                   2                                                    k k  
                                                                                              
                                   p13            p23 . . . pd3                       t3k          xdk
where d is the dimension of the data set. To illustrate the Bézier curve fitting, a sinusoidal
signal is embedded to dimension two. Then, the samples are divided into groups, and bezier
curves are fit. Figure 2.32 represents the control points and fitted lines on the groups.
      Computing pairwise distance matrix: Bézier curves are split into intervals, and the
number of intervals (r) is selected by the user. The endpoints of the intervals are connected
to each other, and line segments are generated. The next step is to compute the distance
                                                   −→
matrix between the line segments. Lets assume that l0 l1 and −
                                                             m−0−
                                                                m→ represent two lines. The
                                                                  1
distance between these two lines is defined as follows
                                                   −→ −−→
                                                 d(l0 l1 , −
                                                           m0 m1 ) =         −−→
                                                                                  min                  d(l, m),                             (2.27)
                                                                           l∈l0 l1 ,m∈−
                                                                                      −−m
                                                                                      m  −→
                                                                                        0 1
                                                                           74


                  r=1, spg = 100             r=3, spg = 100         r=5, spg = 100     r=7, spg = 100
      Ripser 5                           5                      5                  5
      Bezier
             4                           4                      4                  4
     TS:51   3                           3                      3                  3
             2                           2                      2                  2
             1                           1                      1                  1
             0                           0                      0                  0
                0      2      4            0      2      4        0      2      4    0      2      4
Figure 2.33: Comparison of persistence diagrams computed with the approximation for
varying r to true diagrams obtained with Ripser. All diagrams belong to time series number
51 of the 8.89 cm (3.5 inch) case, and they are computed in H1 .
where l0 , l1 , m0 and m1 are the end points of the two lines. The distance is computed by
minimizing the function given as
                                 f (l, m) = d(l(s), m(t))2 = ||l(s) − m(t)||2 ,                  (2.28)
where s and t are parametrization variables. [77] solved this problem using a gradient
descent algorithm. However, simplicial homology global optimization (SHGO) is employed
in this study ([152]). After finding the parameters making function f minimum, their values
are used to compute the distance between two lines. This is repeated for all combinations
between line segments. Only the upper diagonal of the distance matrix is computed since it is
symmetric. Then, it is given to the Ripser package as input to obtain persistence diagrams.
    Effect of parameter selection: The parameters sample per group (spg) and the num-
ber of line segments (r) define how well persistence diagrams of a time series are approxi-
mated. A time series belonging to the 8.89 cm (3.5 inch) overhang distance is chosen, and
persistence diagrams of that time series are computed with r = 1, 3, 5, 7 and spg = 100.
In Figure 2.33, both diagrams computed with Ripser in blue and approximated diagrams
in green color are provided. All diagrams represent the first-dimensional persistence (H1 ),
and they belong 51th time series in the 8.89 cm (3.5 inch) case. Figure 2.33 shows that
approximated diagrams converge to the diagram obtained from Ripser as the number of
line segments (r) increases. Each line segments have only two points, so this means that
                                                         75


                                              1.1
                                                                                          TS:51
                                              1.0
                        Bottleneck distance
                                              0.9
                                              0.8
                                              0.7
                                              0.6
                                                    1      2      3     4       5      6     7
                                                        Number of segments in each group (r)
Figure 2.34: The Bottleneck distance between the diagram with blue color in Figure 2.33
and the approximated diagrams with green color in Figure 2.33.
the Bézier curve approximation uses only 2r points instead of using all 100 points for the
examples provided in Figure 2.33. To show the effect of selection r, Bottleneck distances are
computed. The definition of Bottleneck Distance is given as ([153]),
                        W∞ (X1 , X2 ) =                          inf        sup ||x1 − η(x1 )||∞ ,   (2.29)
                                                               η:X1 →X2 x1 ∈X1
where X1 and X2 represent two different diagrams and η is the bijections between the points
of the diagrams. Figure 2.34 shows the Bottleneck distances between the diagrams, and it
is seen that increasing r results in smaller distances for the 8.89 cm (3.5 inch) case, which
means that larger values of r approximate to blue one better. In addition, this type of
behavior can be observed when the spg decreases since it will lead to a larger number of
groups and line segments.
2.6.4   Feature Extraction Using Persistence Diagrams
   In this section, feature extraction from persistence diagrams is explained. Five different
featurization techniques are described in this section, and these are persistence landscapes,
persistence images, Carlsson Coordinates, the kernel for persistence diagrams, and the signa-
ture path of persistence path’s signatures. The source code for these featurization techniques
are available in Teaspoon package of Python [154].
                                                                       76


Figure 2.35: A schematic showing the process of obtaining the landscape functions from a
persistence diagram.
2.6.4.1        Persistence Landscapes
      Persistence landscapes are functional summaries of persistence diagrams [155]. They
are obtained by rotating persistence diagrams by 45° clockwise and drawing isosceles right
triangles for each point in the rotated diagram [156], see Figure 2.35 where the landscape
functions are represented by λk . Given a persistence diagram, the piecewise linear functions
are defined as [155]                          
                                                      if x ̸∈ (b, d)
                                              
                                              
                                              
                                              
                                               0
                                              
                                              
                                 g(b,d) (x) =   x−b   if x ∈ (b, b+d  ]                  (2.30)
                                                                  2
                                              
                                              
                                              
                                              −x + d if x ∈ ( b+d , d)
                                              
                                              
                                                                2
where b and d correspond to birth and death times, respectively. Figure 2.35 shows that there
are several landscape functions λk (x) indexed by the subscript k ∈ N. For example, the first
landscape function λ1 (x) is obtained by connecting the topmost values of all the functions
g(bi ,di ) (x) [155]. If the second topmost components of g(b,d) (x) are connected, the second
landscape function λ2 is obtained. The other landscape functions are obtained similarly.
Note that the landscape functions are also piecewise linear functions.
      Featurization of persistence landscapes: The persistence landscapes—λk (x) where
x corresponds to the birth time—were computed using the persistence diagrams obtained
                                                  77


from each of the embedded acceleration signals. Although these persistence landscapes can
be utilized to featurize the persistence diagram, there is no one way to define these features.
In this work, a feature vector from the persistence landscapes is extracted by defining (a) a
set of length |K| of the landscapes {λk }k∈K where K ⊂ N to work with, and (b) for the kth
landscape, a mesh of non-empty, distinct birth times bk = {xi ∈ R} where the corresponding
values of the death times dk = {λk (xi ) | xi ∈ bk } constitute the entries of the feature vector
for the kth landscape. Then features from all |K| landscapes are combined to obtain the full
feature vector d = {dk }k∈K that can be used with the machine learning algorithms.
     Although the choice of K, the set of landscapes to use, can be optimized using cross
validation, for example, in this study, K = {1, 2, . . . , 5} is used since it gives good results for
the turning data. Similarly, the mesh may also be optimized in a similar way; however, this
is a more difficult task due to the infinite domain of b, so the mesh can be defined as follows
and as shown in Figure 2.36.
     Let λi,j be the ith landscape corresponding to the jth persistence diagram from a train-
ing set in a supervised learning setting. Fix i and overlay the chosen landscape functions
corresponding to all of the persistence diagrams in the training set. Figure 2.36 provides an
example of this process, and it selects second landscapes to extract features. Now project all
the points that define the linear pieces of each of the landscape functions onto the birth axis.
The red dots in Figure 2.36 represents these projected points. Sort the projected points in
ascending order and remove duplicates. The resulting set of points is a mesh bi with length
|bi | for the ith landscape. The same process can be repeated to get the feature vector for
all the |K| landscapes and construct the overall feature vector b. It is emphasized that a
separate mesh is computed for each selected landscape number and that the number of fea-
tures will generally vary for each landscape function. Now to pull the features out of a given
landscape function, it is evaluated at the mesh points. Computationally, this is efficiently
accomplished using piecewise linear interpolation functions.
     Upon extracting the features from the persistence landscapes, a feature matrix is con-
                                               78


structed, and it represents all the tagged feature vectors. For instance, Table 2.15 shows an
example feature matrix obtained from the first and second landscapes corresponding to each
of the n persistence diagrams in the training set. This table shows data with two labels: 0
for no chatter and 1 for chatter. It also denotes each feature with yi,j      bi
                                                                                  where i ∈ {1, 3} is the
landscape number, j ∈ {1, 2, . . . , n} is the corresponding persistence diagram number, while
the superscript bi ∈ {1, 2, . . . , |bi |} is the feature number corresponding to the ith landscape.
These feature matrices can then be used with supervised machine learning algorithms, for
example, to train a classifier.
                      Figure 2.36: Persistence landscape feature extraction.
Table 2.15: Feature matrix for persistence landscapes λ1 and λ3 corresponding to persistence
diagrams X1 through Xn . The entries in the cells are the values of each of the features.
        Persistence Diagrams          Label                 λ1                        λ3
                                                                    |b |                      |b |
                  X1                     1        1
                                                y1,1    2
                                                       y1,1    ... y1,11  1
                                                                         y3,1     2
                                                                                 y3,1    ... y3,13
                                                                    |b |                      |b |
                  X2                     0        1
                                                y1,2    2
                                                       y1,2    ... y1,21  1
                                                                         y3,2     2
                                                                                 y3,2    ... y3,23
                   ..                     ..      ..    ..           ..   ..      ..           ..
                    .                      .       .     .            .    .       .            .
                                                                    |b |                      |b |
                  Xn                     1       1
                                                y1,n    2
                                                      y1,n     ... y1,n1  1
                                                                         y3,n     2
                                                                                 y3,n    ... y3,n3
                                                      79


     For turning cutting data (see Section A.1), the persistence diagrams and the correspond-
ing first five persistence landscapes for each overhang case are computed. Then the resulting
landscapes are split into a training set (67%) and a test set (33%), and feature matrices
are created for each of the first five landscapes separately. SVM with ’rbf’ kernel, Logistic
Regression and Random Forest and Gradient Boosting algorithms are used as a classifier.
The split-train-test process is repeated 10 times, and every time new meshes were computed
from the training sets, and these same meshes were used with the corresponding test sets.
The mean accuracy and the standard deviation of the classification computed from 10 iter-
ations individually using each of the first 5 landscapes can be found in Tables B.9–B.11 in
the appendix. In the results section, the results with the highest accuracy for each of the
overhang cases are utilized from Tables B.9–B.11 when comparing the different TDA-based
featurization methods.
2.6.4.2     Persistence Images
     Persistence images are another functional summary of persistence diagrams [74, 156].
The first step in converting a persistence diagram X = {(bi , di ) | i ∈ {1, 2, . . . , |X|}} to
persistence images is to define the linear transformation
                                 T (bi , di ) = (bi , di − bi ) = (bi , pi ),             (2.31)
     which transforms the persistence diagram from the birth-death coordinates to birth-
lifetime coordinates (see Figure 2.37a-b). Let Dk (x, y) : R2 → R be the normalized symmet-
ric Gaussian centered at (bk , pk ) with standard deviation σ according to (see Figure 2.37c)
                                                1 −[(x−bk )2 +(y−pk )2 ]/2σ2
                              Dk (x, y) =           e                         .           (2.32)
                                              2πσ 2
     It was shown in [157] that the persistence images method is not very sensitive to σ,
which is set to 0.1 in this study. A weighting function is also defined for the points in the
                                                      80


                   Figure 2.37: Steps for persistence image computation.
persistence diagram W (k) = W (bk , pk ) : (bk , pk ) ∈ T (X) → R according to
                                                  
                                                           if pk ≤ 0;
                                                  
                                                  
                                                  
                                                  
                                                     0
                                                  
                                                  
                          W (k) = W (bk , pk ) = pk if 0 < pk < b;                       (2.33)
                                                      b
                                                  
                                                  
                                                  
                                                           if pk ≥ b.
                                                  
                                                  1
                                                  
Note that this is not the only possible weighting function, but it satisfies the requirements
needed to guarantee the stability of persistence images [74]: it vanishes along the horizontal
axis, is continuous, and is piecewise differentiable. Now define the integrable persistence
surface
                                            X
                               S(x, y) =            W (k) Dk (x, y).                     (2.34)
                                           k∈T (X)
The surface S can be reduced to a finite dimensional vector by defining a grid over its domain
and then assigning to each box (or pixel) in this grid the integral of the surface over that
pixel. For example, the value over the i, j pixel in the grid is given by
                                                 ZZ
                                    Ii,j (S) =         S dxdy,                           (2.35)
where the integral is performed over that entire pixel. The persistence image corresponding
to the underlying persistence diagram X is the collection of all of the resulting pixels (see
Figure 2.37d).
                                                 81


     Featurization of persistence images: Persistence images can be used for support
vector machine classification [158]. The corresponding feature vector is obtained from per-
sistence images by concatenating the pixel values, typically either row-wise or column-wise.
The dimension of the resulting vector depends on the choice of the pixel size, i.e., the res-
olution of the persistence image. For example, let Ii,j be a pixel in the persistence image,
then a persistence image of size 100 × 100 pixels is represented by the matrix
                                                                  
                                   I1,1 I1,2 . . . I1,100 
                                   .          ..
                                   ..                                                        (2.36)
                                                                   
                                                 .                ,
                                                                   
                                                                  
                                     I100,1               I100,100
and a typical feature vector is obtained by concatenating the entries of this matrix row-wise
as shown by the rows in Table 2.16. The table shows a feature matrix where each persistence
diagram is labeled either 0 or 1, and the corresponding feature vector is shown using entries
of the form Ii,j
               k
                 , where k ∈ {1, 2, . . . , n} is the persistence diagram index while i, j are row
and column numbers, respectively, in the image.
                      Table 2.16: Feature matrix for persistence images.
  Persistence Diagrams Label                                  Persistence Image
           X1              1       1
                                 I1,1      ...      1
                                                 I1,100   1
                                                         I2,1   ...    1
                                                                      I2,100 ...  1
                                                                                 I100,1 ...  1
                                                                                            I100,100
           X2              0       2
                                 I1,1      ...      2
                                                 I1,100   2
                                                         I2,1   ...    2
                                                                      I2,100 ...  2
                                                                                 I100,1 ...  2
                                                                                            I100,100
            ..              ..    ...
             .               .
           Xn              1       p
                                 I1,1      ...      p
                                                 I1,100   p
                                                         I2,1   ...    p
                                                                      I2,100 ...  p
                                                                                 I100,1 ...  p
                                                                                            I100,100
     Python’s PersistenceImages package is used to featurize the cutting signals, and then the
resulting images are randomly split into 67%-33% train-test sets. The persistence images
have boundaries depending on lifetime and birth time ranges. Therefore, the maximum
lifetime and maximum birth time are found by checking all diagrams of a data set. These
maximum values can correspond to a point with a significant lifetime, and significant features
can be lost if the boundaries of the image are set to those values exactly. Accordingly, each
value is summed with 1 to be able to capture all important features nearby them. A classifier
                                                      82


is trained using SVM and the ‘rbf’ kernel, Logistic Regression, Random Forest, and Gradient
Boosting classifiers for two different pixel sizes: 0.05 and 0.1. The training and testing results
for each different overhang case are available in Tables B.12–B.13 of the appendix. When the
classification accuracy is compared for persistence images to the other featurization methods,
the best results are chosen from these tables for each cutting configuration.
2.6.4.3     Carlsson Coordinates
    Another method for featurizing persistence diagrams is Carlsson’s four Coordinates [71]
with the addition of the maximum persistence [69], i.e., the highest off-diagonal point in the
persistence diagram. The basic idea of Carlsson’s coordinates is to utilize polynomials that
(1) respect the inherent structure of the persistence diagram and (2) that are defined on the
persistence diagrams’ off-diagonal points. Specifically, these polynomials must be able to
accommodate persistence diagrams with different numbers of off-diagonal points since the
persistence diagrams can vary in size even if the original datasets are of equal size. Further,
the output of the coordinates must not depend on the order in which the off-diagonal points
of a persistence diagram were stored. The resulting features can be computed directly from
a persistence diagram X according to
                                          P
                               f1 (X) =       bi (di − bi ),
                                          P
                               f2 (X) =       (dmax − di )(di − bi ),
                                              b2i (di − bi )4 ,                             (2.37)
                                          P
                               f3 (X) =
                                              (dmax − di )2 (di − bi )4 ,
                                          P
                               f4 (X) =
                               f5 (X) = max{(di − bi )}
where dmax is the maximum death time, bi and di are, respectively, the ith birth and death
times, and the summations and maximum are each taken over all the points in X.
    In order to utilize Carlsson coordinates, the persistence diagrams are computed from the
embedded accelerometer signals and randomly split the data into training (67%), and testing
(33%) sets. Then all five coordinates are calculated for each diagram, and SVM, Logistic
                                                   83


Regression, Random Forest, and Gradient Boosting are utilized to train a classifier. The
                                                    5
feature vectors tested in this study were all             5
                                                              combinations of these coordinates, where
                                                  P         
                                                          i
                                                  i=1
the term inside the summation is 5 choose i. This revealed which combination of features
yielded the highest accuracy in each iteration. The classification results for all of the different
feature vectors are reported in Table B.14 in the appendix. However, in the results section,
the feature vectors that yielded the highest accuracy are utilized when the classification
results of Carlsson coordinates are compared to the other featurization methods.
2.6.4.4    Kernels for Persistence Diagrams
    In addition to featurization methods, many kernel methods have also been developed for
machine learning on persistence diagrams [75, 159, 160, 161, 162, 163, 164]. As an example,
the kernel introduced by [75] is chosen, and it is defined for two persistence diagrams X and
Y according to
                                              ||z1 − z2 ||2                ||z1 − ẑ2 ||2
                                                                                       
                         1     X
          κσ (X, Y ) =                  exp −                    − exp −                    ,    (2.38)
                       8πσ z1 ϵX,z2 ϵY
                                                    8σ                          8σ
where if z = (x, y), then ẑ = (y, x), and σ is a scale parameter for the kernel that can be
used to tune the approach. For this study, two values are investigated for this parameter:
σ = 0.2 and σ = 0.25.
    Given either a training or a testing set {Xi }N     i=1 of labeled persistence diagrams, and using
Equation (2.38), the kernel matrix can be defined as
                                                                                
                            κσ (X1 , X1 ) κσ (X1 , X2 ) . . . κσ (X1 , XN ) 
                                     ..            ..                                            (2.39)
                                                                                
                      κσ = 
                                     .               .                          .
                                                                                 
                                                                                
                            κσ (XN , X1 )                           κσ (XN , XN )
Note that given two persistence diagrams X and Y whose number of points is |X| and
|Y |, respectively, then the corresponding kernel κσ (X, Y ) can be computed in O(|X| · |Y |)
time [75]. Therefore, the computation time for kernel methods is generally high, and this can
complicate optimizing the tuning parameter σ. To emphasize the effect of the computational
                                                   84


complexity, in this study, the long runtime for the 5.08 cm (2 inch) overhang case caused
by its large number of samples has led to reporting the corresponding classification results
for a smaller number of iterations than the other overhang cases and the other featurization
approaches.
    For turning data (see Section A.1), a 67%/33% train/test split is performed for the
labeled persistence diagrams. For each of the training and testing sets, the corresponding
kernel matrices are precomputed, and Python’s LibSVM [165] is used for classification. For
almost all but the 5.08 cm (2 inch) overhang case where only 1 iteration was used, the
split-train-test process is repeated 10 times. The average and the standard deviation of
the resulting accuracies are recorded. The resulting classification accuracies are reported
in Table 2.23. Note that [75] describes another approach for training a classifier based on
measuring the distances between two kernels in combination with a k-Nearest Neighbor (k-
NN) algorithm. However, this alternative method is not explored in this work, and only the
computations using the kernel matrix and the LibSVM library are performed.
2.6.4.5    Persistence Paths’ Signatures
    Persistence paths’ signatures are a recent addition to featurization tools for persistence
diagrams [76]. Let γ : [a, b] → Rd be the piecewise differentiable path given by
                                          γ(t) = γt = [γt1 , γt2 , . . . , γtd ],                             (2.40)
where each γti = γ i (t) is a continuous function with t ∈ [a, b]. The first, second, and third
signatures, respectively, can be defined according to the iterated integrals [166]
                                           Z t
                                     i
                             S(γ)a,t =         dγsi = γti − γai ,           (a < s < t);                     (2.41a)
                                             a
                              Z   t                   Z tZ    s
                   S(γ)i,j
                        a,t =       S(γ)ia,s dγsk =             dγri dγsj ,        (a < r < s < t);         (2.41b)
                                a                       a   a
                     Z  t                         Z  t     Z   t2
     S(γ)i,j,...,k
          a,t      =      S(γ)i,j,...,k−1
                              a,s         dγsk =       ...        dγti1 . . . dγtkk , (a < t1 < t2 < . . . < t).
                      a                            a        a
                                                                                                             (2.41c)
                                                          85


Other signatures are defined similarly, although the computational cost significantly increases
beyond the third level of signatures. The resulting path signatures can be used in classifi-
cation algorithms as features. Looking back at persistence landscape functions in Section
2.6.4.1, it is seen that the kth landscape function λk (t) can be written as a two-dimensional
path
                                                     γt (λk (t)) = [t, λk (t)].                             (2.42)
Therefore, signatures can be obtained from persistence landscapes and used as features
in machine learning algorithms [76]. In this study, path signatures are used up to the
second level. Specifically, let λr,i be the rth persistence landscape corresponding to the ith
persistence diagram. Then the signatures used from the rth landscape function are given by
                                   1,1     1,2      2,1
                     1
S(γt (λr (t))) = [Sr,i     2
                       , Sr,i   , Sr,i , Sr,i  , Sr,i   , Sr2,2 ].
Table 2.17: Feature matrix for path signatures for n persistence diagrams and using the first
λ1 and second λ2 persistence landscapes.
  Diagrams Label                                      λ1                                       λ2
     X1           1       S1,11
                                     S1,12       1,1
                                               S1,1         1,2
                                                          S1,1      2,1
                                                                   S1,1   2,2
                                                                        S1,1      1
                                                                                S2,1   2
                                                                                     S2,1  1,1
                                                                                          S2,1     1,2
                                                                                                  S2,1  2,1
                                                                                                       S2,1    2,2
                                                                                                             S2,1
     X2           0       S1,21
                                     S1,22       1,1
                                               S1,2         1,2
                                                          S1,2      2,1
                                                                   S1,2   2,2
                                                                        S1,2      1
                                                                                S2,2   2
                                                                                     S2,2  1,1
                                                                                          S2,2     1,2
                                                                                                  S2,2  2,1
                                                                                                       S2,2    2,2
                                                                                                             S2,2
      ..          ..         ..         ..       ..         ..      ..    ..     ..   ..   ..      ..   ..     ..
       .           .          .          .        .          .       .     .      .    .    .       .    .      .
     Xn           1         1
                          S1,n          2
                                     S1,n        1,1
                                               S1,n         1,2
                                                          S1,n      2,1
                                                                   S1,n   2,2
                                                                        S1,n     1
                                                                                S2,n  2
                                                                                     S2,n  1,1
                                                                                          S2,n     1,2
                                                                                                  S2,n  2,1
                                                                                                       S2,n    2,2
                                                                                                             S2,n
    By incorporating higher order signatures or signatures from more landscape functions, a
longer feature vector can be constructed for classification. For example, Table 2.17 shows
the second level feature vectors computed using the first and second landscape functions for
n persistence diagrams.
    In the experiment, a classifier is trained using 75% of the data and tested using the
remaining 25%. A feature vector is constructed for each of the first five landscape functions.
Table B.16 shows the classification accuracies for each configuration and for each landscape
function. The best results in this table were used to compare the path signatures method to
the other featurization procedures in Table 2.23.
                                                                   86


2.6.4.6     Template Functions
    Template functions are introduced in Reference. [72]. Given a persistence diagram D, its
coordinate system is converted into birth-lifetime diagram.
    A template function for a persistence diagram is defined as
                                                      X
                                         vf (D) =           f (b, p),                           (2.43)
                                                    (b,p)∈D
where b and p represent the birth time and lifetime, respectively. The set of template func-
tions forms a template system T . For more details about template functions and template
systems, one can refer to Reference [72]. Here a template system of Chebyshev polynomials
is defined using f (x, y) = β(x, y) · |liA (x)liB (y)|, where liA and liB are the Lagrangian functions
[72] computed on mesh A and B which are defined to include all points in the persistence
diagram.
2.6.5    Modeling of Milling Process
    This section explains how to generate time series using an analytical model of the milling
process. A milling operation is considered with straight edge cutters as shown in Figure 2.38.
A single degree of freedom model in the x direction for the tool oscillations is used as shown
in Figure 2.38a, and both upmilling and downmilling processes are considered in the analysis.
The equation of motion that describes the tool oscillations is
                                                                1
                                     ẍ + 2ζωn ẋ + ωn2 x =       F (t),                        (2.44)
                                                               m
where m, wn , ζ and F (t) represent the modal mass, natural frequency, damping ratio and
the cutting force in the x direction, respectively. τ is the time delay given by τ = 2π/N Ω
where ω is the spindle’s rotational speed in rad/s, while N is the number of cutting edges or
teeth. The expression for the cutting force is given by [167, 168]
         Xz
                                                                                                (2.45)
                                                                                         
    F =        − bKt gn (t)(cos θn (t) + tan γ sin θn (t)) sin θn (t) (f + x(t) − x(t − τ )) ,
         n=1
                                                     87


           Figure 2.38: Milling process illustrations. a) Upmilling b) Downmilling
where θn is the angle between the vertical line and the leading tooth of the cutting tool as
shown in Figure 2.38. The constant Kt is the linearized cutting coefficient in the tangential
direction and tan γ = Kn /Kt where Kn (t) is the cutting coefficient in the normal direction.
The screening function gn (t) is either 0 or 1 depending on whether the nth tooth is engaged
in the cut or not, respectively, and f represents the feed per tooth of the cutting tool. The
expression for angular position of the nth tooth θn (t) is given by [168]
                                θn (t) = (2πΩ/60)t + 2π(n − 1)/z,                          (2.46)
where z is the total number of cutting teeth while Ω is the rotational speed given in revolu-
tions per minute (rpm).
    One of the important cutting parameters is the radial immersion ratio (RI) which is de-
fined as the ratio of the radial depth of cut to the diameter of the cutting tool. Smaller radial
immersions indicate shallower cuts and thus more intermittent contact between the tool and
the workpiece, while higher radial immersions indicate deeper cuts with more continuous
contact. In the simulations for both downmilling and upmilling, RI is set to 0.25.
    Inserting Equation (2.45) into Equation (2.44) results in
                                                      bh(t)                      bf0 (t)
               ẍ(t) + 2ζωn (t)ẋ(t) + wn2 x(t) = −         [x(t) − x(t − τ )] −         , (2.47)
                                                       m                            m
where b is the nominal depth of cut and h(t) is the τ -periodic function
                            Xz
                                                                                           (2.48)
                                                                       
                     h(t) =      Kt gn (t) cos θn (t) + tan γ sin θn (t) sin θn (t),
                            n=1
                                                   88


and f0 (t) = h(t) f . The term f0 (t) does not affect the stability analysis, so it is dropped in
the subsequent equations; however, it is kept in the simulation.
    After dropping f0 (t), the equations of motion can be written in state space form according
to
                                 dξ(t)
                                       = A(t)ξ(t) + B(t)ξ(t − τ ),                         (2.49)
                                  dt
where A and B are T -periodic with T = τ . Then, using the spectral element method [2],
the state space is discretized. Then a dynamic map is obtained such that
                                          ξn+1 = U ξn ,                                    (2.50)
where U is the finite dimensional monodromy operator. The eigenvalues of U approximate
the eigenvalues of the infinite dimensional monodromy operator of the equation of motion. If
the modulus of the largest eigenvalue is smaller than 1, then the corresponding spindle speed
and depth of cut pair lead to a chatter-free process; otherwise, chatter occurs. Therefore,
the stability of the milling model and the bifurcation associated with the loss of stability
(chatter) can be obtained by examining these eigenvalues, see Figure 2.39.
    In this study, 10000 time series were generated corresponding to a 100 × 100 grid in the
plane of the spindle speeds and depths of cut. Each time series is tagged using the largest
eigenvalue of the monodromy matrix corresponding to the same grid point.
2.6.6    Simulation Results
    In this section, classification accuracies for each featurization method are provided for
noisy and non-noisy time series of up milling and down milling processes with 4 teeth (N = 4).
The details of the simulation can be found in Section 2.6.5. Ranges of rotational speed and
depth of cut parameters for the simulations are chosen with respect to the stability diagrams
given for both processes in [168]. The 1- and 2-dimensional persistence diagrams were used
with the methods described in Section 2.6.4. Feature matrices were computed for 1D and
2D persistence diagrams individually, and the features were concatenated when using both
                                                89


Figure 2.39: Stability criteria used in this study based on the eigenvalues of the monodromy
matrix U.
dimensions. 0D persistence has been omitted in this study due to its poor performance on
noisy data sets, as evidenced by a reduction in the classification accuracy by 10% in some
cases. Data sets are randomly split, using 67% for training and 33% for testing. The split-
train-test is performed 10 times, and the mean accuracies with the corresponding standard
deviations are reported in this section.
    This data can be used for both a two-class and three-class classification problem. The
first is classifying either chatter-free or chatter, while the second further divides chatter
into two types: Hopf-unstable and period2-unstable. Classification for both two and three
class problems is done using four different algorithms: support vector machines, logistic
regression, random forests, and gradient boosting. Default parameters have been used for
all classification algorithms except random forest classification (n_estimator = 100 and
max_depth = 2). These two types of chatter are based on Hopf and period doubling
bifurcation behaviors as described in Fig. 2.39.
    Two class classification results for downmilling and upmilling simulations with N = 4
are provided in Tables 2.18 and 2.19, respectively. For each data set, the highest accuracy
is highlighted in blue. For instance, 95.5% accuracy is obtained as the best classification
accuracy for non-noisy data sets when gradient boosting classifiers are trained with combined
1D, and 2D persistence features based on Template Functions method in Table 2.18. In most
                                                90


Table 2.18: 2 Class classification results for noisy (SNR:20,25,30 dB) and non-noisy data sets
which belong to downmilling process with N = 4 (N: Teeth Number, CC: Carlsson Coor-
dinates, TF: Template Functions, SVM: Support Vector Machine, LR: Logistic Regression,
H1 : 1D persistence, H2 : 2D persistence).
  Down Milling               Without Noise                            SNR: 20 dB
    N =4               CC                   TF                  CC                  TF
   Classifier    H1    H2    H1 -H2    H1   H2    H1 -H2  H1    H2   H1 -H2    H1   H2   H1 -H2
  SVM          94.3%  85.1% 94.6%    92.9% 94.4% 93.7%   94.2% 92.5% 94.6%   94.3% 94.8% 94.7%
  LR           92.4%  84.3% 92.8%    93.9% 91.6% 94.5%   78.5% 78.3% 91.1%   93.8% 93.5% 94.5%
  RF           93.6%  90.9% 93.8%    95.0% 94.1% 95.7%   92.4% 92.3% 93.0%   94.3% 94.4% 94.6%
  GB           95.0%  93.5% 95.2%    94.7% 94.2% 95.5%   94.2% 93.7% 94.9%   94.7% 94.4% 94.7%
  Down Milling                SNR:  25 dB                             SNR:  30 dB
    N =4               CC                   TF                  CC                  TF
   Classifier    H1    H2    H1 -H2    H1   H2    H1 -H2  H1    H2   H1 -H2    H1   H2   H1 -H2
  SVM          80.8%  77.2% 83.1%    89.4% 83.2% 90.8%   81.0% 77.4% 82.8%   89.2% 83.4% 90.9%
  LR           76.0%  72.2% 75.4%    83.7% 77.5% 85.7%   76.2% 72.3% 75.1%   83.8% 77.2% 85.6%
  RF           76.9%  75.0% 77.5%    88.9% 82.4% 89.6%   76.6% 75.1% 77.1%   88.7% 82.2% 89.9%
  GB           88.3%  79.1% 89.5%    89.0% 82.7% 90.2%   88.5% 79.0% 89.5%   88.7% 83.0% 90.5%
of the cases for downmilling and upmilling, it is seen that the highest accuracies are obtained
when 1D and 2D persistence diagrams features are combined.
    Some of the time series embeddings, especially in the chatter-free regime, do not have
any 2 dimensional topological features, thus giving an empty H2 diagram. If a specific
cutting configuration has a lot of time series with empty H2 , feature matrices for these time
series have a lot of zeros when either featurization method is used. Because of the lack of 2
dimensional information for many of the time series, classifications using only H2 have lower
accuracies than only using H1 as is shown in Tables 2.18 and 2.19.
    When comparing persistence diagram featurizations, the template function method has
the best results for all data sets, with the exception of two: the noisy data set with an
SNR value of 20 dB for downmilling and the one without noise for upmilling. However, for
those two data sets, template functions’ results are very close to those provided by Carlsson
coordinates. When comparing classification algorithms, SVM yields the highest accuracy
for five of the eight data sets, while gradient boosting yields the highest accuracy for the
remaining three data sets.
    To compare the results of different levels of noise and different dimensions of persistence
                                                91


Figure 2.40: Success and failure of two class classifications performed with Template Function
feature matrices and Gradient Boosting algorithm for test set of data set without noise and
with SNR value of 25 dB. a) Classification with 1D persistence features for non-noisy data set,
b) Classification with 2D persistence features for non-noisy data set, c)Classification with 1D-
2D persistence combined features for non-noisy data set, d) Classification with 1D persistence
features for noisy data set with SNR:25 dB, e) Classification with 2D persistence features for
noisy data set with SNR:25 dB, f) Classification with 1D-2D persistence combined features
for noisy data set with SNR:25 dB.
diagrams, classification results are plotted on the 100 × 100 grid of the stability diagram for
the milling process. Figure 2.40 presents the stability diagrams belonging to teeth number
N = 4 of down milling process for noisy data with SNR value of 25 dB and non-noisy data
sets. Figures on the first and second columns belong to the classifications performed with
only H1 and H2 features, respectively, while the ones in the third column represent the results
of combinations of H1 and H2 features. Red crosses on the stability diagrams denote the case
that the prediction of the classifier does not match with the true label of the corresponding
time series while blue dots show matching between predictions and true labels. From the
figures, it is clear that the number of misclassifications increases slightly when the noise is
introduced into the simulation data. This is also reflected in Table 2.18 in the decrease in
accuracies for different levels of noise, especially the noisy data sets with SNR values of 25
                                                92


Table 2.19: 2 Class classification results for noisy (SNR:20,25,30 dB) and non-noisy data sets
which belong to upmilling process with N = 4 (N: Teeth Number, CC: Carlsson Coordinates,
TF: Template Functions, SVM: Support Vector Machine, LR: Logistic Regression, H1 : 1D
persistence, H2 : 2D persistence).
  UpMilling                 Without Noise                             SNR: 20 dB
    N =4               CC                  TF                  CC                   TF
   Classifier   H1     H2   H1 -H2    H1   H2   H1 -H2  H1     H2    H1 -H2    H1   H2   H1 -H2
  SVM         86.0%  78.5% 85.8%    86.1% 80.4% 86.0%  76.8% 80.7%   82.1%   84.6% 84.0% 85.1%
  LR          85.3%  77.8% 85.3%    86.2% 81.3% 85.8%  69.4% 80.4%   80.9%   82.7% 81.3% 84.1%
  RF          84.9%  80.3% 84.9%    85.9% 81.2% 85.7%  75.5% 80.7%   81.1%   82.8% 81.8% 83.2%
  GB          86.1%  80.9% 86.0%    85.6% 81.3% 86.0%  80.6% 82.2%   82.5%   84.1% 83.4% 84.6%
   Upmilling                 SNR:  25 dB                              SNR:  30 dB
    N =4               CC                  TF                  CC                   TF
   Classifier   H1     H2   H1 -H2    H1   H2   H1 -H2  H1     H2    H1 -H2    H1   H2   H1 -H2
  SVM         85.3% 83.4% 84.8%     85.5% 84.4% 85.5%  83.2% 71.6%   83.1%   85.9% 75.0% 86.2%
  LR          79.2% 84.0% 84.1%     84.2% 84.5% 84.5%  79.4% 72.3%   79.5%   84.1% 75.3% 85.0%
  RF          83.8% 84.1% 84.5%     83.5% 82.6% 83.0%  84.3% 75.0%   84.4%   83.9% 75.3% 83.8%
  GB          85.2% 84.3% 84.8%     85.1% 84.5% 84.9%  85.1% 74.6%   85.1%   85.7% 75.7% 85.2%
and 30 db.
     In addition, there is a small accuracy difference which is at most 5% between noisy
(SNR:25, 30 dB) and non-noisy data set for downmilling cases, while this difference is less
for upmilling results presented in Table 2.19 This suggests that the featurization methods
used yield promising results even with noisy data. Persistent homology is known to be
very robust against noise, as noise only adds points close to the diagonal, which have short
lifetimes. Thus, these points do not contribute significantly to the Carlsson coordinate or
template function methods, making both featurizations robust against noise as well.
     Figure 2.41 shows a comparison of the results obtained for up and downmilling with
respect to different noise levels. Since the deviations of accuracies for both featurization
methods are relatively low, the classification accuracies can be considered reliable. This
trend is noticeable for both up and downmilling and for all levels of noise. However, the
classification results for upmilling are noticeably lower than those for downmilling. Addition-
ally, it is clear that H2 features do not perform as well due to the lack of higher dimensional
topological structure, as was explained earlier. Figure 2.41 also presents the classification
results on the stability diagrams for upmilling and downmilling for noisy data sets. It is
                                                93


Figure 2.41: Mean accuracies of downmilling process (a,b) and upmilling process (f,g) ob-
tained for two class and three class classification performed with Carlsson Coordinates and
Template Functions for non-noisy and noisy data sets where teeth number is 4. Two class(c)
and three class (d) classification results obtained with Gradient Boosting algorithm is shown
on the stability diagram for downmilling simulation data set whose SNR is 25 dB. Two
class(e) and three class (h) classification results obtained with Gradient Boosting algorithm
are shown on the stability diagram for upmilling simulation data set whose SNR is 25 dB.
seen that many misclassifications occur nearby the boundary of the stability diagram, espe-
cially for the upmilling process. This boundary separates the unstable (above the boundary
curve) and stable (under the boundary curve) cases, so misclassifications on this boundary
are likely. It is also clear that when increasing to the three-class problem, there are an in-
creased number of misclassification. However, the overall accuracy difference between both
implementations, two and three-class classification, does not exceed 5% at their maximum
accuracies.
    The findings of this study indicate that topological features of data are appropriate
descriptors for chatter recognition in milling. One advantage of the described approach is
its ability to provide promising results without the need for manual preprocessing not only
                                                94


for non-noisy data sets but also for time series with noise.
2.6.7    Experimental Data Results
2.6.7.1    Runtime Comparison
    Runtime is a criterion for comparing the different feature extraction methods. For the
TDA-based methods, the total runtime required for classification is split among three main
computations: (1) obtaining the persistence diagrams, (2) obtaining features or computing
kernels, and (3) training and testing the corresponding classifier. However, obtaining re-
sults with serial computing takes significantly larger runtime. Therefore, parallel computing
is implemented to improve the runtime. High Performance Computing Center (HPCC) of
Michigan State University is utilized for parallel computing. It includes several supercom-
puters, which are composed of hundreds of nodes. Each node represents a computer with a
certain number of processors and RAM capacity. Users are allowed to submit multiple jobs
at the same time to HPCC, and they can define the number of CPUs per job and memory
per CPU. 10 CPUs per job and 2GB of memory per CPU are requested to compute the
embedding parameters and persistence diagrams in parallel. The number of jobs submitted
to HPCC changes depending on the number of time series of the overhang distance.
    Embedded time series are subsampled such that every 10th point is taken into account to
compute persistence diagrams. The times to complete persistence diagram computation of
all overhang cases are recorded, and they are reported in Table 2.20. It also includes times for
serial computing, where one persistence diagram is computed at a time. It is seen that parallel
computing reduces the computation time significantly, although most part of the runtimes for
parallel computing is the queue time. Parallel computing can also be performed with some
workstations available in the market without having the need for expensive supercomputers
that HPCC has. Entry-level workstations with a CPU having 64 cores and 512 GB of RAM
can be afforded by small workshops.
                                              95


Table 2.20: Comparison of runtimes (seconds) for embedding parameters and persistence
diagram computation of all overhang cases with parallel and serial computing.
                        5.08 cm             6.35 cm           8.89 cm            11.43 cm
                        (2 inch)          (2.5 inch )        (3.5 inch)         (4.5 inch)
                    Parallel  Serial   Parallel   Serial  Parallel   Serial  Parallel   Serial
     Persistence
                     9420     84346      3448     23570     2073    11319     4819      37617
      Diagram
    Despite the long computation time for persistence-based methods, it is noted that af-
ter obtaining the persistence diagrams, they can be saved and used in multiple TDA-based
classification methods. In addition, it was observed that delay and embedding dimension
parameters do not change significantly for changing time series of the same cutting config-
uration. Parameters for embedding can be computed in the training phase of a classifier,
and they will be used in the test phase. Therefore, once these diagrams are computed, the
time required for featurization and classification would be a fraction of the ones reported in
Table 2.20. It is worth mentioning that the most computationally expensive step is that of
training a classifier. Once a classifier is trained, which can be done offline, the effort in clas-
sifying incoming streams of data is much smaller because a much smaller set of persistence
diagrams and features are needed. Therefore, the runtimes needed for a single time series are
compared with different methods. Table 2.21 provide runtimes for embedding parameters
computations and persistence diagram computation with different methods.
    First column in Table 2.21 represent the runtimes of computing embedding dimension
and delay parameter. When these parameters are computed, they can be saved and used in
embedding time series In the second column of Table 2.21, the runtime of persistence diagram
computation of subsampled point cloud is given. In this way, nearly 1000 points are used from
the embedded time series to compute persistence diagrams. However, the computation time
for persistence diagrams of a point cloud with that size is still high, as seen from Table 2.21.
Therefore, greedy permutation subsampling and the Bézier curve approximation technique
are also employed. The greedy permutation option of the Ripser package is utilized, and it
                                                 96


Table 2.21: Runtime (seconds) for embedding parameter computation and persistence dia-
gram computation of a single time series with different methods.
             Embedding
                                                        Persistence Diagram
             Parameters
                         Number of
                                            Number of points≈ 100                Number of points≈ 300
                        points≈ 1000
                                                    Bézier       Bézier                  Bézier      Bézier
                                       Greedy                               Greedy
  Overhang   Delay and  Subsampled                   r=1         r=1                      r=3        r=3
                                     Permutation                          Permutation
  Distances  Dimension  Point Cloud               spg = 100 spg = 100                  spg = 100 spg = 100
                                     nperm = 100                          nperm = 300
                                                   (Serial)    (Parallel)               (Serial)   (Parallel)
   5.08 cm
               242.63      266.95         0.2       106.00        0.93        3.62       569.78       6.21
   (2 inch)
   6.35 cm
               191.10      208.95        0.29       106.15        0.85        3.79       538.98       6.19
  (2.5 inch)
   8.89 cm
               166.79      296.21        0.18        96.00        0.72        3.87       541.62       6.48
  (3.5 inch)
   11.43 cm
               200.51      276.38        0.17       113.86        0.77        3.53       600.29       6.99
  (4.5 inch)
is a method that subsamples the point cloud and computes the persistence diagrams with
less number of points. nperm is a parameter that defines the number of points selected by the
greedy permutation algorithm. 100 and 300 points are chosen for this option, and runtimes
are reported for the resulting persistence diagrams. In Table 2.21, the runtimes are grouped
with respect to the number of points used in the corresponding method. It is seen that the
Bézier curve approximation with r = 1 and r = 3 uses approximately 100 and 300 points
as in the case of greedy permutation. Runtime for both serial and parallel computing are
provided in Table 2.21. Parallel computing can only be applied to Bézier curve approximation
among the methods of persistence diagram computation given in Table 2.21. The reason is
that persistence diagrams are obtained directly from the Ripser package for other methods.
However, the steps of the Bézier curve approximation method, computation of coefficients
for the line segments, and the distance matrix between these line segments can be performed
in parallel. In these two steps, a job can only compute coefficients of the lines in a single
group or a distance between two lines. The number of the jobs will be equal to the group
number and number of combinations between lines for coefficient computation and distance
matrix computation, respectively. Ideally, all jobs for a step can be computed simultaneously
if there is no queue time. Therefore, runtimes are recorded for computation of coefficients
                                                   97


of the line segments in a single group, computation of a distance between two line segments,
and persistence diagram from a distance matrix individually. Then, they are summed up and
reported in Table 2.21. Combining parallel computing with the Bézier curve approximation
reduces the runtime significantly. Moreover, it is seen that the fastest method is the greedy
permutation with nperm = 100 and Bézier curve approximation computed in parallel places
second. Both methods are able to complete the diagram computation in less than a second,
while runtime gets larger with increasing nperm and r parameters.
    Table 2.22 provides the times required to complete classification of a single time series.
To be fair in comparison between the runtime of WPT/EEMD and TDA-based methods,
it is assumed that the classifier is already trained and required parameters for all methods
are selected. It is seen that WPT is the fastest method, and EEMD places second. The
runtime for the TDA-based method is comparable to the ones for EEMD. Further, the WPT
and EEMD methods use codes that have been highly optimized, whereas the TDA-based
methods are still under active research with huge future potential for improved optimization.
It is believed that the runtimes for the TDA-based method can be further decreased with
optimization.
Table 2.22: Runtime (seconds) for performing classification with a single time series for
TDA-based methods and signal decomposition-based ones.
                                 Topological Data Analysis             Signal Decomposition
            Overhang Persistence  Template     Carlsson    Persistence
                                                                       WPT        EEMD
            Distances Landscapes  Functions Coordinates      Images
             5.08 cm
                         1.01       0.97         0.97         0.97      0.03       0.52
             (2 inch)
             6.35 cm
                         0.92       0.90         0.90         0.87      0.08       0.65
            (2.5 inch)
             8.89 cm
                         0.81       0.81         0.76         0.76      0.09       0.70
            (3.5 inch)
            11.43 cm
                         0.87       0.81         0.81         0.81      0.06       0.52
            (4.5 inch)
                                                 98


2.6.7.2     Classification Scores
     This section presents the classification accuracies for all the methods that are introduced
in Section 4.2.2 and compares them to the results in Reference [9], which uses the Wavelet
Packet Transform (WPT) and the Ensemble Empirical Mode Decomposition (EEMD). The
latter two methods are used for comparison since they are some of the currently most promi-
nent methods for chatter identification using supervised learning. For persistence images,
Template Functions, and Carlsson Coordinates, four different classifiers are applied, and
these are Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF),
and Gradient Boosting (GB) algorithms. All of these classifiers except Gradient Boosting are
used in the Persistence Landscape method, while the Kernel method and Persistence Paths
results are obtained with LibSVM and SVM classifiers, respectively. The classification re-
sults are summarized in Table 2.23 where for each cutting configuration, the best results
of the classification algorithms for each method are included. Table 2.23 also includes the
classification results obtained using a new TDA approach, which is not included in Section
4.2.2, based on template functions [72]. In this table, the best accuracy for each dataset is
highlighted in green. Further, methods whose accuracy is within one standard deviation of
the best result in the same category are highlighted in blue.
Table 2.23: Comparison of results for each method where WPT is the Wavelet Packet Transform, and EEMD
stand for Ensemble Empirical Mode Decomposition.
  Overhang
   Length    Persistence  Persistence Template       Carlsson    Kernel   Persistence
                                                                                        WPT     EEMD
     cm      Landscapes     Images       Functions Coordinates  Method      Paths
   (inch)
    5.08
               96.8%         96.4%         91.5%      93.6%      74.5%*     83.0%       93.9%   84.2%
     (2)
    6.35
               88.6%         85.8%         89.3%      86.3%       58.9%     84.2%      100.0%   78.6%
    (2.5)
    8.89
               92.2%         93.0%         83.9%      95.7%       87.0%     85.9%       84.0%   90.7%
    (3.5)
    11.43
               68.6%         72.5%         65.1%      72.2%       59.3%     70.0%       87.5%   79.1%
    (4.5)
 *This result belongs to only the first iteration for the 5.08 cm (2 inch) overhang case.
     Table 2.23 shows that the WPT approach yields the highest classification accuracy for
                                                   99


the 6.35 cm (2.5 inch) and the 11.43 cm (4.5 inch) overhang cases. However, it is also seen
that for the 5.08 cm (2 inch) and the 8.89 cm (3.5 inch) case, persistence landscape, and
Carlsson Coordinates yield the highest accuracy, respectively. For the 6.35 cm (2.5 inch)
overhang case, it is worth noting that the number of time series is small. Specifically, for
this case, less than 10 time series were divided into small pieces and used as the test set,
see Table A.1. Therefore, the 100% classification accuracy using WPT for this case does not
represent a robust result. Nevertheless, for the same case Table 2.23 shows that the TDA
methods based on persistence landscapes, persistence images, template functions, Carlsson
coordinates, and persistence paths yield better results than EEMD—a leading approach for
chatter detection. For the 8.89 cm (3.5 inch) case, Carlsson coordinates method yields the
highest mean accuracy of 95.7%, placing ahead of both WPT and EEMD. Further, the other
TDA-based method for this cutting configuration score classification accuracies of at least
83.9%. For the last case, the TDA-based approaches underperform in comparison to WPT.
To investigate the reason of this result, persistence diagram plots belonging to some of the
time series from each overhang distance are provided in Figure 2.42. For the first three
cases (5.08, 6.35, and 8.89 cm), it is seen that persistence diagrams of stable time series
show a single significant feature with a high persistence value, which indicates the existence
of a loop i.e., periodic behavior, in reconstructed state space. However, this single high
persistence point is not observed in the persistence diagrams of unstable (chatter) time series.
This significant difference between persistence diagrams enables classification algorithms to
distinguish chatter and chatter-free time series. For the last case (11.43 cm), it is seen that
all persistence diagrams look similar, thus making blurring the signature of chatter in a time
series. Persistence Images and Carlsson Coordinates depend on the persistence value of the
features obtained from the persistence diagrams. Since all persistence diagrams have features
with similar persistence values in the 11.43 cm (4.5 inch) case, classifiers can not distinguish
the two classes. This leads to reduced accuracy, as shown in Table 2.23. Nevertheless, WPT
and EEMD methods for chatter detection necessitate manual preprocessing by well-trained
                                               100


expert users, thus requiring significant extra time and advanced expertise [9]. Therefore, the
automation of these processes is not straightforward with WPT and EEMD, while all the
steps in TDA-based feature extraction can be fully automatized.
           Figure 2.42: Sample persistence diagrams for each overhang distance.
    The performance of different persistence diagram computation methods is compared,
and their runtime is compared in Section2.6.7.1. Figure 2.43 shows you the mean classifi-
cation accuracies and error for the persistence diagrams obtained with the ways shown in
                                              101


Table 2.21. Subsampling the point cloud at every 10th point is the first method to compute
persistence diagrams. Table 2.21 shows that the Bézier curve approximation has a slightly
larger computation time compared to the greedy permutation subsampling method when it
is computed in parallel. However, it is seen that for all overhang cases Bézier curve approxi-
mation method results in higher accuracy compared to greedy permutation. Also, its results
are the closest results to the ones obtained from subsampled point cloud.
                                                                  Persistence Images
                                Carlsson Coordinates              (ps = 0.1, var = 0.1)                    Template Functions (d = 15)
              95                                                                             90
   Accuracy                                                                       Accuracy
              90
                                                                                             80
              85
                                                                                             70
              80
                                                        5.08 cm (2 inch)                                                             6.35 cm (2.5 inch)
            Subsampled Greedy Perm. Greedy Perm.       B.C.A.        B.C.A.                  Subsampled Greedy Perm. Greedy Perm.    B.C.A.        B.C.A.
               P.C.     nperm = 100 nperm = 300        r=1           r=3                        P.C.     nperm = 100 nperm = 300     r=1           r=3
           100                                                                               80
                                                                                             75
              90
Accuracy                                                                          Accuracy
                                                                                             70
              80                                                                             65
                                                       8.89 cm (3.5 inch)                    60                                     11.43 cm (4.5 inch)
              Subsampled Greedy Perm. Greedy Perm.     B.C.A.        B.C.A.                  Subsampled Greedy Perm. Greedy Perm.    B.C.A.        B.C.A.
                 P.C.     nperm = 100 nperm = 300      r=1           r=3                        P.C.     nperm = 100 nperm = 300     r=1           r=3
Figure 2.43: Classification performance of persistence diagrams obtained with different meth-
ods for all overhang cases.
              Increasing the number of points in greedy permutation or increasing the number of line
segments (r) generated for a group does not always yield higher accuracy, as seen from
Figure 2.43. The reason could be that increasing the number of points or the number of line
segments can cause more topological noise on persistence diagrams. For example, Figure 2.33
shows that the new point appears closed to a significant feature, which is the point with the
highest lifetime, as r increases from 5 to 7. Therefore, this could cause small drops in
accuracy for some overhang cases, as seen from Figure 2.43.
                                                                            102


    Greedy permutation and Bézier curve approximation method can provide persistence
diagrams in less than 1 second without optimization. For the greedy permutation method,
optimization can not be applied. However, coefficient computation for the line segments in
the Bézier curve approximation method can be optimized. This could further decrease the
runtimes and opens the possibility for exploring in-situ chatter detection using TDA-based
methods, especially with properly optimized algorithms.
2.7     Transfer Learning
    In traditional machine learning, a classifier is trained and tested on a data set originating
from the same source. However, real-life applications, such as chatter or fault detection
in machining, can experience a shift in the parameters between the time the classifier was
trained and the time the system is put into operation. This means that the data collected
from these applications may no longer have the same feature space as the training set. There-
fore, traditional machine learning can require data collection for each parameter combination,
thus leading to increased cost and low automation potential. As another motivating exam-
ple, some experiments are expensive to set up and perform. This includes chatter studies
which result in long downtime for production machines and personnel during the data col-
lection phase. Besides the cost, some sensor data may be collected during machining one-off
products and, therefore, may be considered of limited use in traditional machine learning
settings. Therefore, it is extremely beneficial to leverage extracted features related to similar
phenomena across different settings and operations. In this case, Transfer Learning presents
a useful machine learning framework that allows training and testing on data sets from dif-
ferent sources. As an example, Figure 2.44 shows a transfer learning application where a
chatter classifier was trained using a turning process, and the gained information is then
transferred for detecting chatter in a milling operation.
    Transfer learning is categorized according to the similarity between tasks and the domain
of each source and target. The source is the system used to train a classifier, while the
                                               103


Figure 2.44: An example of transfer learning where training for chatter detection is performed
using a turning process (the source) and the gained knowledge is imported via transfer
learning to a milling operation (the target).
target is the system where the classifier is tested. There are two main terms in the definition
of transfer learning, and these are domain and task. A domain can be described as the
combination of a feature space F and the marginal probability of the feature space P (F ),
while the task contains a label space L and the conditional probability (P (l|f )) [169]. F
represents the space of feature vectors, xi , and F is the an instance set such that F =
{f | fi ∈ F, i = 1, . . . , n} [170]. For a given domain, D = (F, P (F )), a task is defined as
T = (L, P (l|f )). P (l|f ) is also considered as a predictive function f which estimates the
label for a given feature space.
    Based on the differences between domains and tasks of the source and the target, several
transfer learning settings can be obtained (see Figure 2.45). The interested reader is referred
to [169, 171, 172] for more details on transfer learning. In this study, the machine learning
framework is included under inductive transfer learning category because the same sets of
features are used for the source and the target. The main purpose of inductive transfer
learning is to improve the performance of the target prediction function fT using the in-
formation in the domain and task of the source DS and TS , respectively [169]. There are
several approaches to transfer learning. These include instance-transfer, feature represen-
tation transfer, parameter-transfer, and relational knowledge transfer [170]. In this study,
the knowledge of parameters is transferred by using the same trained classifier in the testing
phase. The same set of features is used for training and testing. However, the distribution
                                                104


of the features is different in each domain since the source, and the target is represented by
two different machining processes: turning and milling. More details about the application
of inductive transfer learning are available in Section 2.7.2.
                        Figure 2.45: Categorization of transfer learning.
2.7.1    Methods
    This section provides a brief description of feature extraction for the milling data set
explained in Section A.2. The preprocessing for commonly adopted approaches, WPT and
EEMD, and the traditional feature extraction approaches are explained in this section. In
addition, Figure 2.46 provides a block diagram that explains the procedure followed in this
study. Specifically, the leftmost block shows the experimental setup and the data collection
process. This is followed by the middle block, which lists the featurization methods used and
the similarity-based approach using DTW. The rightmost block shows the pairwise distance
matrices and feature matrices obtained from the similarity-based approach and the feature
extraction approaches, respectively. Figure 2.47 provides a cartoon of the transfer learning
framework whereby classifiers trained on the turning data are used to detect chatter in
milling and vice versa.
                                              105


Figure 2.46: Outline of the general procedure and the featurization methods used in this
study.
Figure 2.47: Transfer learning approach used in this chapter for feature extraction and
similarity measure-based approaches.
                                          106


2.7.1.1    Wavelet Packet Transform (WPT)
    This section describes the salient details of the Wavelet Packet Transform (WPT) method.
One can refer to Section 2.3.1 for more information. The WPT method decomposes a signal
into approximation and detail coefficients at each level of the transform. Figure 2.7 provides
the decomposition of a time series into three levels of WPT and shows the corresponding
frequency content for each wavelet packet. Detail and approximation coefficients are obtained
after applying the high-pass and low-pass filters, respectively. They are denoted as Di and
Ai as shown in Figure 2.7. At each level of the transform, additional letters A or D are added
to the left side of the previous notation, and the indices change with respect to the level of
the transform. For example, in the second level of transform, the approximation coefficient
A1 passes through the high pass filter and becomes DA2 (see Figure 2.7). In addition, the
number of wavelet packets at level k of the transform is 2k .
    Milling data set: The procedure followed in this study is the same as the one described
in Reference [9]. Level 4 WPT is applied to downsampled time series for both the milling
and turning data sets. The first step was to define the chatter frequency by checking the
spectrum of the downsampled data. Figure 2.48 provides FFT plots of three different time
series from the milling data set. It can be seen that the chatter frequency is around 850 Hz
which is close to the resonant frequency of 728.7 Hz; this leads us to look for wavelet packets
that also have a frequency content near this frequency. Time series were decomposed into
wavelet packets, and the energy ratio of each wavelet packet was computed. The energy
ratio plots and the Fast Fourier Transform (FFT) of the signals, reconstructed from the
packets, have been provided in Figure 2.49 and 2.50. Figure 2.49 shows that most of the
energy belongs to the third wavelet packet for the unstable time series. It is also seen that
the spectrum of the third wavelet packet, the unstable time series, has a frequency content
of around 1000 Hz. Thus the third wavelet packet can be selected as the informative packet
for feature extraction.
                                                107


Figure 2.48: The spectrum of three different time series from milling experiment: (left) 13227
rpm, 2.54 mm depth of cut (doc), unstable,(mid) 16861 rpm, 1.905 mm doc, stable, (right)
27285 rpm, 1.905 doc, unstable.
Figure 2.49: Energy ratio of the wavelet packets obtained from decomposition of the three
time series whose spectrum is provided in Figure 2.48. (Blue bars) Milling - Unstable -
RPM=13227 - DOC = 2.54mm. (Red bars) Milling - Stable - RPM=16861 - DOC = 0.38
mm. (Orange bars) Milling - Unstable - RPM=27285 - DOC = 3.556 mm.
2.7.1.2   Ensemble Empirical Mode Decomposition (EEMD)
   This section provides the preprocessing of experimental milling signals using the EEMD
approach explained in Section 2.3.2. The experimental signal was decomposed into IMFs,
and the informative IMFs were selected to generate a feature matrix. The spectrum from
the original signal and the IMFs was then compared to determine the overlap between them.
The IMF with the largest overlap was selected as the informative IMF. Figure 2.51 provides
an example for the selection of the informative IMF. The original time series has frequency
content of around 1000 Hz, and the first two IMF are the candidates to be informative IMF.
Since the spectrum of the first IMF overlaps with the original signal’s spectrum, it is selected
                                             108


Figure 2.50: The spectrum of reconstructed signals from the first four wavelet packets of
three different time series whose spectrum is shown in Figure 2.48. (First row) Milling,
13300 rpm, 2.54 mm depth of cut (doc), unstable, (second row) milling, 17300 rpm, 0.3810
mm doc, stable, and (last row) milling, 28000 rpm, 3.5560 mm doc, unstable.
as an informative IMF. Ideally, the spectrum of all signals and their decomposition should
be checked to determine the informative IMF. However, this is a manually intensive and
time-consuming process. Therefore, this process is repeated only for a couple of time series,
and chose an informative IMF to use for all time series. Then, the selected informative IMF
was used to compute the features given in Table. 2.4, and a feature matrix was generated as
an input for a supervised classification algorithm.
2.7.1.3     Fast Fourier Transform (FFT), Power Spectral Density (PSD) and Auto-
            correlation Function (ACF)
    This method computes the Fast Fourier Transform (FFT), Power Spectral Density (PSD),
and Auto-correlation (ACF) for each downsampled data set. The next step was to find
the significant peaks in these plots and use their x and y coordinates as a feature in the
classification algorithm. Since built-in functions in computing software for peak finding can
                                              109


Figure 2.51: Intrinsic mode functions and their spectrum for the time series with 11210 rpm
and 3.556 mm depth of cut from milling experiments.
result in incorrect peaks, two restriction parameters are used for peak selection that enabled
us to find the true peaks. These parameters are minimum peak height (MPH) and minimum
peak distance (MPD). The definition for minimum peak height is provided in Reference [55]
as
                               M P H = ymin + α(ymax − ymin ),                           (2.51)
where α ∈ [0,1], ymin and ymax correspond to 5th and 95th percentile of the amplitude
of FFT/PSD/ACF plots. The α parameter is defined with respect to the peak amplitudes.
Since auto-correlation function has negative amplitudes, the choice for α is chosen separately,
while the same α value is used for FFT and PSD plots. In this implementation, respectively,
α was 0.1 and 0.5 for FFT/PSD and ACF plots.
    The second parameter, MPD, was defined by visual inspection on FFT/PSD/ACF plots
of several time series. An example is provided in Figure 2.52. This figure shows the effect of
the chosen MPD value on detecting the peaks in the FFT and ACF plots. The first two plots
provide the spectrum of a time series and the peaks found by a peak detection algorithm
                                              110


                                 x: 8.6e+02
                                    y: 0.995
                           1.0
                                                                    MPD=2500
                    | X ( f )|
                           0.5
                           0.0
                                                Frequency (Hz)
                           1.0
                                                                    MPD=500
                    | X ( f )|
                           0.5
                           0.0
                                                Frequency (Hz)
                     50000                                          MPD=500
                                 0
                   −50000
                                               time lag (seconds)
Figure 2.52: Effect of different MPD values on selected peaks in FFT and ACF plots of time
series with RPM=13300 and DOC=2.54 mm from milling experiments. (Top) FFT plot and
selected peaks with MPD=2500 (top) and MPD=500 (middle). Auto-correlation function
with MPD=500 (bottom). Orange lines represent the MPH.
with two MPD values. It is seen that a smaller MPD value brings the selected peaks closer
to each other and results in the detection of the true peaks. Therefore, MPD was chosen
for FFT and PSD plots as 500. The same value was also used in the ACF function. After
defining the two constraints for peak detection, MPD and MPH, the number of peaks to use
to generate feature matrices is selected. In this implementation, coordinates of the first two
peaks are used as features, and they were given to supervised classification algorithms.
2.7.2   Transfer Learning Results
   This section describes the classification approach and the transfer learning details. As
mentioned in Section A.1 and A.2, the turning data set contains four different cases, and
the milling data does not have categorization. Therefore, the total number of combinations
                                                 111


between the cases of turning data and the milling data is 20. Classification is performed for
all 20 combinations. Section 2.7.2.1 provides the results for the combination between cases of
the turning data set, while Section 2.7.2.2 discusses the results for the combinations between
turning and milling data set. In addition, the mild chatter cases are taken into account in
turning data set as unstable while performing the classification. This is performed since the
turning data is labeled in three classes (see Section A.1). All results provided in this section
belong to two-class classification for both milling and turning data sets.
2.7.2.1     Results of Transfer Learning Applications Between the Overhang Dis-
            tance Cases of Turning Data Set
    The classification was performed with 10 realizations of training and test sets for each
method. 67% of the training set and 70% of the test set were used to train and test the
classifier, respectively. To be fair in comparing methods, the same training and test sets
were used; they were generated with a set of predefined random state parameters. Support
vector machine (SVM) with rbf kernel, logistic regression, random forest classifier with 100
estimators and a maximum depth of two for the trees, and gradient boosting algorithm were
used to train and test a classifier. In addition to these classifiers, similarity measure based
approach, DTW can only be used with the K-Nearest Neighbor (KNN) classifier since it uses
a pairwise distance matrix between the time series. KNN is only used with the similarity-
based approach. Predicted labels were used to compute the average and standard deviation
of the accuracy and F1-score for training and test set separately. In addition to this, the
methods are categorized into three groups: 1) Time-Frequency based approaches (WPT,
EEMD, and FFT/PSD/ACF (FPA)), 2) TDA-based approach, and 3) Similarity measure
(DTW). The results of these groups are compared. Features and classifiers used for each
approach are given in Table 2.24. For more details about the features, one can refer to
Refs. [3, 11, 9].
    For each combination of the cases between turning data sets, figures which show the
                                              112


Table 2.24: Features and classifiers used for three main category of approaches. SVM: Sup-
port Vector Machine, LR: Logistic Regression, RF: Random Forest, GB: Gradient Boosting.
           Category                                     Features                                 Classifiers
                           WPT: Mean, Standard Deviation, Root Mean Square (RMS), Peak,
                           Skewness, Kurtosis, Crest Factor, Clearance Factor, Shape Factor,
                             Impulse Factor, Mean Square Frequency, Standard Frequency,
    Time-Frequency-based         One Step Auto-Correlation Function, Frequency Center        SVM, LR, RF, GB
                               EEMD: Energy Ratio, Peak to Peak, Standard Deviation,
                                        RMS, Crest Factor, Skewness, Kurtosis
                                 FF/PSD/ACF (FPA): The coordinates of the peaks
                            Carlsson Coordinates, Persistence Landscapes, Persistence Images
          TDA-Based                                                                          SVM, LR, RF, GB
                                                  Template Functions
  Similarity measure (DTW)                     Pairwise Distance Matrix                      K-Nearest Neighbor
accuracy of each feature extraction method have been provided for the classifiers mentioned
above. These plots are provided in Figs. B.3-B.7 in the Appendix. However, these plots can
only compare the methods within the same application of transfer learning. Therefore, a
summary plot is provided in Figure 2.53. It provides the highest accuracies obtained out of
four classifiers for all methods except DTW, while it shows the highest score out of all KNN,
where K=(1,. . .,5), for DTW. It can be seen that the time-frequency-based methods, such
as WPT, EEMD, and FPA, are the methods that give the highest score when training and
testing are performed between the overhang cases of the turning data set. However, WPT
outperforms other approaches in most of the applications in the group of time-frequency-
based approaches. On the other hand, the TDA-based approach and DTW have the highest
accuracy in a few applications. For TDA-based approaches, Carlsson Coordinates (TDA-
CC) performs better than other featurization techniques within the group. It is not easy to
distinguish each result in Figure 2.53. Therefore, WPT, TDA-CC, and DTW are selected to
summarize their results for different applications of transfer learning between turning data
set cases and presented the results in Figure 2.54.
    Figure 2.54 contains only the highest scores of the selected approaches and the ones that
are in the error band of the highest score. Each color represents a method; the best results
and the ones in the error band are represented with two different hatches on the bar plots.
The bars with the ‘◦’ hatch are the methods with the highest accuracy, and the ‘/’ hatch
shows the methods that are in the error band. Using Figure 2.54, we can define how many
                                                      113


Figure 2.53: The highest accuracy out of four different classifiers (or out of selected num-
bers of nearest neighbor for DTW) for each approach used in transfer learning applications
between overhang distance cases of turning experiments.
times a group of methods is the best method (BM) or the method in the error band (MIEB),
and these numbers are given in Table 2.25. It is seen that the frequency-based approach
(WPT) has the highest score in 7 out of 12 transfer learning applications between the cases
of turning data set, and it is not in the error band when the TDA-based approach and DTW
provide the highest score. On the other hand, the TDA based-approach and DTW provide
the highest score two times and three times, respectively. The TDA-based method is in
the error band of the highest accuracy in 4 out 12 applications, while this number is three
for DTW. It is also worth noting that WPT results provided in Figure2.54 have a larger
deviation compared to DTW and TDA-based approaches, even though WPT provides the
highest accuracies in most of the applications.
    Another criterion to compare the performance of the methods is to check the F1 score.
The F1-score is computed for all transfer learning applications and each method. Then,
the highest F1 scores out of all classifiers were chosen, and the summary plots are given in
Figure 2.55 and 2.56. In addition, the counting is performed for the number of best methods
                                                114


Figure 2.54: The classification results are obtained from the selected methods when training
and testing are performed between the overhang distance cases of the turning data set. The
selected methods that give the highest accuracy are represented with the ’◦’ bar hatch, and
the ones that are in the error band of the highest accuracy are shown with the ‘/’ bar hatch.
One approach is selected from each category of the methods, and these are Wavelet Packet
Transform (WPT), Carlsson Coordinates (TDA-CC), and Dynamic Time Warping (DTW).
Table 2.25: The number of times when a selected method gives the highest accuracy out
of 12 different applications between the cases of turning data set is denoted with BM. The
number of times when a method is in the error band of the highest accuracy is denoted with
MIEB. These two numbers are provided for accuracy and F1-score.
                                                     Accuracy       F1-Score
                             Method                BM MIEB         BM MIEB
                Time - Frequency-based (WPT)         7      0       9     1
                     TDA-based (TDA-CC)              2      4       3     0
                  Similarity Measure (DTW)           3      3       0     0
                                              115


Figure 2.55: The highest F1-Score out of four different classifiers (or out of selected num-
bers of nearest neighbor for DTW) for each approach used in transfer learning applications
between overhang distance cases of turning experiments.
and the methods that are in the error band as in the case of accuracy, and these numbers
are reported in Table 2.25. It is seen that the performance of the time-frequency-based
approach is better since it has the highest F1 score in 9 out of 12 cases of the applications.
The TDA-based approach provides the best F1 score three times. WPT again provides the
best results with larger deviations compared to the TDA-based approach (see Figure 2.56).
2.7.2.2    Results of Transfer Learning Applications Between Turning and Milling
           Data Sets
    There are eight different permutations when transfer learning is applied between the
turning and milling operations. The classification scores for four different classifiers are pro-
vided in Figure B.8- B.10. However, these plots include detailed results within the same
application of transfer learning. Therefore, summary plots are provided for these permu-
tations in Figure 2.57. It is seen that FFT/PSD/ACF (FPA) gives the highest accuracies
between time-frequency-based approaches in most transfer learning applications. The results
                                             116


Figure 2.56: F1 scores obtained from the selected methods when training and testing are
performed between the overhang distance cases of the turning data set. The selected methods
that give the highest accuracy are represented with the ’◦’ bar hatch, and the ones that are
in the error band of the highest accuracy are shown with the ‘/’ bar hatch. One approach is
selected from each category of the methods, and these are Wavelet Packet Transform (WPT),
Carlsson Coordinates (TDA-CC), and Dynamic Time Warping (DTW).
obtained with TDA-based approaches are similar to each other, especially for Carlsson Co-
ordinates (TDA-CC) and Template Functions (TDA-TF). Since the TDA-CC provides the
highest accuracy when training is performed on a 6.35 cm overhang distance of the turning
data set and testing is performed on the milling data set, TDA-CC is chosen to represent
the TDA-based approaches. Accordingly, FPA, TDA-CC, and DTW are compared with
respect to classification scores in Figure 2.58. Similar figures for F1-Score can also be found
in Figure B.11 and B.12.
    In Reference [9, 55], the authors mentioned several drawbacks of the time-frequency-
based approaches. One of the main drawbacks of these methods is that they require checking
the frequency spectrum of each time series manually to decide informative decomposition
                                               117


for WPT and EEMD or the restriction parameters for FPA. In this study, this manual
preprocessing is performed only for a couple of time series. This process gets cumbersome
as the size of the data set increases. In a real-time application, when a new time series
is introduced to a classifier, the frequency spectrum of it and its reconstructed time series
obtained from the wavelet packets need to be investigated to find the decomposition whose
spectrum has the largest overlap with the signal’s spectrum. On the other hand, the processes
for the TDA-based approach and the DTW do not require any parameter selection, and all
steps can be completed autonomously.
    Based on Figure 2.58 and B.12, Table 2.26 is generated and it shows the number of times
when a selected method gives the highest accuracy (BM) or it is in the error band of the
highest accuracy (MIEB). If the accuracy is considered as the main criterion, the DTW
method provides the highest accuracy in three out of eight applications, and it is in the error
band of the highest accuracy in two applications. In addition, the results of the TDA-based
approach are in the error band in two out of eight applications. Considering the drawbacks of
the frequency-based approach and deviations of the results of the frequency-based approach,
DTW and TDA-based approaches can be preferred when transfer learning is applied between
different machining operations.
Table 2.26: The number of times when a selected method gives the highest accuracy out
of 8 different applications between the cases of turning data set and the milling data set is
denoted with BM. The number of times when a method is in the error band of the highest
accuracy is denoted with MIEB. These two numbers are provided for accuracy and F1-score.
                                                     Accuracy      F1-Score
                             Method                BM MIEB        BM MIEB
                 Time - Frequency-based (FPA)        4      1      6      0
                     TDA-based (TDA-CC)              1      2      0      4
                   Similarity Measure (DTW)          3      2      2      1
                                              118


Figure 2.57: The highest accuracy out of four different classifiers (or out of selected num-
bers of nearest neighbor for DTW) for each approach used in transfer learning applications
between overhang distance cases of turning and milling experiments.
2.7.2.3   Transfer Learning Using Deep Learning
    In addition to traditional machine learning algorithms, Artificial Neural Networks(ANNs)
are also utilized to test the performance of several approaches. Deep learning frameworks
can learn from raw data sets without the need for feature extraction. However, Zhoa et. al
state that inadequate size of data set, noisy raw signal, and complex machining operations
make it necessary to preprocess data before feeding it into deep learning algoritms [160].
Therefore, some of the features extracted from TDA-based approaches are used to apply
deep learning in transfer learning. Some of the studies in the literature (see Refs. [92, 91])
trained the deep learning algorithms using the simulation data set to eliminate the need for
an extensive amount of experimental data set to train the classifier. However, in this work,
only the existing experimental data and the features extracted from them are only used
to train deep learning algorithms to compare the results to traditional machine learning
algorithms. One should be aware of the fact that more observation is needed to have a fair
comparison between deep learning-based transfer learning and traditional machine learning-
                                              119


Figure 2.58: The classification results obtained from the selected methods when training and
testing are performed between the overhang distance cases of the turning data set and the
milling data set. The selected methods that give the highest accuracy are represented with
the ’◦’ bar hatch, and the ones that are in the error band of the highest accuracy are shown
with the ‘/’ bar hatch. One approach is selected from each category of the methods, and
these are FPA, TDA-CC, and DTW.
based transfer learning. Since the raw experimental signals are not split into small pieces
for Time-Frequency based approaches, there are fewer observations for these approaches.
Hence, the features extracted using Time-Frequency based approaches are not utilized for
training deep learning algorithms.
    The ANN structure used in this work has one input, three hidden, and one output layer.
The number of inputs fed into the input layer is based on the number of features extracted
from TDA-based approaches. For instance, Carlsson Coordinates can provide five features
for each persistence diagram, so the number of inputs will be five for this approach. The
first and last hidden layers have 25 neurons, while the second hidden layer has 12 neurons.
                                              120


Figure 2.59: The classification accuracies obtained using Carlsson Coordinates and Persis-
tence Images features with ANN algorithms for the transfer learning between the cases of
turning experiments.
The hyperbolic tangent function is used as an activation function in all layers except the
output layer. Since the classification output is binary, the sigmoid function is chosen as
the activation function in the output layer. Adam optimization algorithm and binary cross-
entropy loss functions are used to compile the ANN. Epoch number and the batch size to
update the weights of the fully connected layers are selected as 100 and 5, respectively.
There are 12 permutations between the overhang distance cases of the turning data set and
eight permutations between the turning and milling data set. 67% of the training data
set is used as the training set, and 70% of the test set data set is used to test the ANNs.
Train-test split is repeated for 10 different pre-defined random state numbers, and the mean
accuracy and standard deviation are computed out of these 10 realizations. The results for
transfer learning applications between the overhang distance cases of the turning data set
are provided in Figure 2.59, while Figure 2.60 provides the accuracies with error bands for
the transfer learning between the milling and turning data set.
    From Figure 2.59 and 2.53, it is seen that traditional machine learning algorithms pro-
vide better accuracies compared to deep learning in 11 out of 12 different transfer learning
                                               121


Figure 2.60: The classification accuracies obtained using Carlsson Coordinates and Persis-
tence Images features with ANN algorithms for the transfer learning between the milling
and turning experiments.
applications for Carlsson Coordinates, while deep learning is outperformed in 9 out of 12
applications of transfer learning between the Looking into Figure 2.60 and 2.57, traditional
machine learning algorithms outperform deep Learning in all applications of transfer learning
for Carlsson Coordinates. However, deep learning outperforms traditional machine learning
in 4 out of 8 applications of transfer learning between the milling and turning data set using
persistence images. Overall, it is obvious that the amount of experimental data set fed to
deep learning is insufficient, and this leads to poor performance against the traditional ma-
chine learning algorithms. In addition, hyperparameter tuning is not performed for ANNs
in this study. This could also be another reason for the poor performance of deep learning.
2.8     Conclusion
    This chapter studied two advanced chatter detection methods, i.e., the Wavelet Packet
Transform (WPT) and the Ensemble Empirical Mode Decomposition (EEMD) with Recur-
sive Feature Elimination (RFE) and traditional feature extraction using FFT/PSA/ACF,
                                               122


and presented the DTW and TDA-based approaches for chatter diagnosis. These approaches
have been used for the classification of recorded acceleration signals from a turning process
into chatter-free cutting or chattering motion. They are not used only to classify measured
test data with the same cutting conditions as used in the training phase but also for transfer
learning, which means that the test data originates from a cutting process with different
cutting conditions. In particular, the chatter frequencies between the training data and the
test data differ significantly.
    For WPT and EEMD approaches, the results in Table 2.5 show that WPT has the high-
est accuracies for three cutting configurations when the classifier is trained and tested on
the same data set, while EEMD provides the best results for one cutting configuration. In
addition, training different classifiers other than SVM leads to increase in mean accuracies
for both method as shown in Table 2.6 and 2.7. These two tables are evidence that WPT
performance decreases as the level of transform increases. In addition to accuracy compar-
isons, the overall runtime of each method is also recorded. Table 2.5 shows that WPT has
the fastest runtime, while the EEMD method clocks the longest runtime. This slowdown is
mostly related to the computation of the ensemble of IMFs and can be reduced by changing
the ensemble parameters and optimizing the code.
    There are two main drawbacks of these two methods. 1) the WPT featurization process
is cumbersome since it requires taking the WPT of the signal, investigating the packets that
contain the chatter frequencies, and then choosing the packet that has a considerably high
energy ratio, and that includes chatter frequency. Once these packets are found, they are
fixed for the investigated process and are used for chatter classification. However, inherent
to this process is the a priori identification of chatter frequency and the assumption that
the chosen packets (referred to as the informative packets) will always contain it. This is
a limitation since (a) it requires highly skilled users for analyzing the signal and extracting
the informative packets, and (b) the chatter frequency band can move during the cutting
process, which will yield the informative packets ineffective for chatter classification. Further,
                                               123


Section 2.3.1 points out that the informative packets are not necessarily the ones with the
higher energy ratio. This makes automating the feature selection process more difficult in
the WPT approach. The EEMD also suffers some of these drawbacks since the process for
choosing the informative IMFs and the informative packets in WPT is quite similar. The
second drawback is that 2) it is not always possible to differentiate between intermediate and
full chatter. Specifically, although the intermediate chatter time series (Figure 2.8c) and the
chatter time series (Figure 2.8e) are visually very different in the time domain, their energy
content shown in the top graph of Figure 2.9 can be too close to distinguish between the two
cases.
    In this study, traditional signal processing tools (FFT, PSD, and ACF) are also applied
for feature extraction from cutting signals obtained from a turning experiment. The results
are compared to the ones obtained from two widely used chatter detection methods: WPT
and EEMD. The results show that the classification results for FFT/PSD/ACF with feature
ranking can provide higher test set accuracies for some of the classifiers, namely Random
Forest Classification and Gradient Boosting. If the overall best score for each overhang
length distance of the turning data set is compared, it is seen that the FFT/PSD/ACF based
methods provide the best results. Despite their shortcomings, traditional signal processing
methods can yield highly-tuned classifiers with superior accuracy. This makes these methods
suitable for manufacturing processes whose parameters are not expected to drift too much
in comparison to the training data. In contrast, if the parameters of the underlying process
are expected to shift significantly, then features based on FFT/PSD/ACF should be used
with caution.
    The first novel method presented in this chapter for chatter detection combines similar-
ity measures of time series via Dynamic Time Warping (DTW) with machine learning. In
this approach, the similarity of different time series is measured using their DTW distance,
and any incoming data stream is then classified using the KNN algorithm. The classifica-
tion accuracy of the DTW approach is tested using a set of turning experiments with four
                                               124


different tool overhang lengths, and the resulting accuracy is compared to two widely used
methods: the Wavelet Packet Transform (WPT) and the Empirical Mode Decomposition,
as well as to the second novel method based on Topological Data Analysis (TDA). The re-
sults in Table 2.11 show that the DTW’s classification accuracy matches or exceeds those
of existing methods for two out of four different overhang cases. This indicates that tem-
poral features extracted using DTW are effective markers for detecting chatter in cutting
processes. Topological Data Analysis (TDA) methods results are also close to the ones for
similarity measures; however, one advantage of the DTW approach in comparison to TDA-
based tools is that it does not require embedding the data into a point cloud, hence avoiding
the complications associated with choosing appropriate embedding parameters.
    The DTW approach can distinguish stable and unstable cases from each as evidenced
by the heat map of the average distances between time series with different labels (see
Figure 2.24), 2.25 and B.1. The DTW approach also successfully distinguishes between
chatter and intermediate chatter, as shown in Table 2.12. These comparisons are difficult or
impossible using only frequency domain features because the frequency content in these two
cases is too similar while the time domain signals are different, as evidenced by Figure A.4
and the heat maps shown in Figure 2.24 and B.1.
    In addition, the combination of the DTW approach with the AESA algorithm provides
a significant decrease in the number of DTW computations, thus reducing the classification
time of a single test sample to less than two seconds for three out of four overhang distances.
Depending on the user choice of H, one can obtain the results even faster with the cost
of loss in classification accuracy as seen from Figure2.27. Therefore, there is a trade-off
between reduction in the number of distance computation and classification performance.
In contrast to the AESA algorithm, parallel computing has the capability of completing the
classification of single time series in about 1.5 seconds if ideal conditions such as zero queue
time and enough resources–the number of processors/cores and RAM capacity–are obtained
to be able to run all jobs simultaneously. Therefore, it is observed that both AESA and
                                               125


parallel computing can be combined with similarity measures for real-time online chatter
detection applications. It is also noted that Table 2.13 does not include the time required for
the manual preprocessing in the WPT and EEMD methods for choosing informative packets
or decompositions. The actual time for these two methods is larger than the ones provided
in the table, depending on the number of the investigated time series and the skill of the
person performing the preprocessing. This is because WPT and EEMD require a process for
checking the frequency spectrum of the times series and examining the energy ratio of the
wavelet packets of the time series. Furthermore, whereas the WPT algorithms are highly
optimized, the Python scripts that are used in this study for computing the DTW have little
to no optimization. This study hypothesizes that further optimization using, for example,
the ideas in [137] and combining the DTW approach with AESA and parallel computing will
speed up the runtime for the similarity measures making them a viable option for on-machine
chatter detection.
    In this study, another novel approach is presented for chatter identification and classifica-
tion based on featurizing the time series of the cutting process using its topological features.
In contrast to the WPT and EEMD methods, this study uses persistent homology—the
most prominent analysis tool from TDA—to obtain a summary of the persistent topological
features of the data. These are based on the global structure of the point cloud embedding
of the acceleration signals in a turning experiment; therefore, upon obtaining a persistence
diagram, there is no manual work involved in selecting the features from the persistence dia-
gram. Since working directly with the resulting persistence diagrams is difficult, the leading
tools are investigated for feature extraction from persistence diagrams. The featurization
methods are based on persistence landscapes, persistence images, Carlsson Coordinates, a
kernel method, template functions, and persistence paths’ signatures. The resulting features
are then combined with several classification algorithms for training a classifier. The classi-
fication results are then computed from multiple split-train-test sets, and the resulting mean
accuracies as well as the corresponding standard deviation are recorded for each featurization
                                              126


method and for each cutting configuration of the turning data set.
    Tables 2.23 summarizes the classification accuracy of all the TDA-based tools as well as
their WPT and EEMD counterparts. In terms of the classification accuracy across the differ-
ent cutting configurations, the Carlsson coordinates and persistence landscapes approaches
yield the best accuracies for two cases, and the former has the smallest runtime in compar-
ison to the other TDA tools. On the other hand, in the remaining cases, WPT yielded the
best accuracies. Table 2.23 shows that WPT yields an accuracy of 100% (with a standard
deviation of zero) for the 6.35 cm (2.5 inch) overhang case; however, as pointed out in Sec-
tion 2.6.7.2 and Table A.1, the size of the test set for this cutting configuration is too small
which casts some doubts about the robustness of this result. Nevertheless, for this case, both
template functions and Carlsson coordinates still yield at least 86% classification accuracy.
Specifically, Table 2.23 shows that for the 6.35 cm (2.5 inch) case, persistence landscapes
and Carlsson coordinates yield accuracies that are off by 13.7%, and 10.7%, respectively,
from the WPT result. Similarly, for the 11.43 cm (4.5 inch) case, Carlsson Coordinates
and persistence images are within 13.3% and 13% of the EEMD result. As mentioned in
Section 2.6.7.2, persistence diagrams of stable and unstable time series of 11.43 cm (4.5 inch)
case are similar in contrast to other overhang distance cases, and this is why there is a large
accuracy difference between the results of TDA-based approaches and the WPT approach.
For the 5.08 cm (2 inch) and 8.89 cm (3.5 inch) overhang case, persistence landscapes and
Carlsson Coordinates yield the highest accuracy scoring 96.8% and 95.7%, respectively, with
tight error bounds that do not enclose the WPT accuracies.
    Runtime comparisons in Table 2.20 show a dramatic decrease in persistence diagrams
computation by at least 82% when they are computed in parallel. Table 2.22 shows that WPT
is the fastest followed by EEMD; however, the reported time for the TDA-based approach
is comparable to the ones of EEMD. The runtime can be reduced for computing persistence
diagrams for a single time series to less than 1 seconds using Greedy permutation and the
Bézier curve approximation method in combination with parallel computing. Figure2.43 also
                                              127


shows that the Bézier curve approximation method results in larger accuracies compared
to Greedy permutation. Therefore, the combination of parallel computing and the Bézier
curve approximation makes the TDA-based approaches a feasible option for online chatter
detection.
    This study shows that persistence features are appropriate for chatter detection in cutting
processes. These features have the potential to lower the barrier to entry when tagging
cutting signals as chatter or chatter-free because no manual preprocessing is needed before
extracting and using the features in the persistence diagram. It is also noted that after
obtaining a classifier, the time required for the classification of new incoming data will be
greatly reduced, thus opening the door for future implementation of TDA methods in-situ
for chatter detection and mitigation.
    Another main contribution of this chapter is to analyze the performance of the traditional
and novel chatter diagnosis approach between two different cutting operations. The features
extracted from turning and milling cutting signals are used to transfer knowledge from
turning experiments to milling experiments or vice versa. The highest scores obtained from
the transfer learning applications, between the cases of the turning data, were between 80%
and 100%, while that drops to 60% when the classifier is trained on turning data and tested
on the milling data. A period-doubling bifurcation was observed in 19 out of 318 time series
of milling data, and a Hopf bifurcation was observed in the rest of the unstable cases of
milling data. On the other hand, the turning data only contains the Hopf bifurcation. When
the classifier is trained on the turning data set, the model is not trained to recognize the
descriptors of period-doubling bifurcation, so it performs poorly when it is tested on milling
data. On the other hand, a classifier is trained with both features of Hopf and period-
doubling when the milling data is used as the training set. This explains why training on
milling data and testing on turning data sets perform better. In addition, the mathematical
model for the milling process has time-varying coefficients, while the turning process has
an autonomous system. Since the coefficients are constant in turning processes, this can
                                              128


lead to misclassification when the classifier is tested on milling processes with time-varying
coefficients.
    This study compared the performance of feature extraction methods from established
methods alongside those recently proposed in the literature. Turning and milling data sets
were used to evaluate the performance of each method. The size of the training sets and the
test sets were kept the same for each method. Since the training set data and test data are
different from each other, 67% and 70% of the training set and test data are used to train
and test a classifier, respectively. Ten random state numbers were used to generate training
and test splits, and these were used to train and test a classifier for each method. The
average and standard deviation of the 10 realizations were computed, and the final results
were reported. This has been repeated for all 20 combinations between the milling data and
overhang cases of turning data.
    Two types of figures are provided for each comparison criterion to compare the results.
Figure 2.53, 2.54, 2.57, and 2.58 were obtained when the criterion was accuracy, while Fig-
ure 2.55, 2.56, B.11, and B.12 were given for the F1-score. It can be seen that the time-
frequency-based approaches give the highest accuracy in most of the applications of trans-
fer learning with larger deviations in comparison to the TDA-based approach and DTW.
When only the transfer learning between the milling and turning data sets is considered,
it is seen that the accuracies obtained from DTW can be as high as 96% while the time-
frequency-based approaches can be up to 86% (see Figure 2.57). For the same cases of
transfer learning, the highest score obtained from TDA is 73% (see Figure 2.57). For the
transfer learning applications where the classifier is trained and tested between the cases of
turning, the time-frequency-based approach has the highest accuracy of 93%; the best score
for the TDA approach and DTW are 97% and 96%, respectively (see Figure 2.53). The
results of traditional machine learning algorithms are also compared to the ones obtained
from ANNs. It is seen that insufficient experimental data set leads to poor results against
traditional machine learning approaches. The small size of the experimental data set also
                                              129


prevents comparing different techniques to each other using deep learning algorithms. In
this work, only several TDA-based approaches are compared to each other. Using synthetic
data sets generated using the analytical model of milling and turning operations can allow
one to further extend the comparison of more approaches using deep learning frameworks in
the future.
    In summary, the TDA-based and DTW approaches can provide accuracies and F1 scores
as high as the time-frequency-based methods. DTW outperforms all other methods when
training on the milling data set and testing on the turning data set. In addition, the TDA-
based approach and DTW can be applied without needing manual preprocessing. All of
the steps in their pipeline can be completed automatically. Therefore, these approaches
may be preferred over the time-frequency-based approaches in either real-time or in fully
automated chatter detection schemes. It is worth noting that this study does not perform
any optimization on hyperparameters of the traditional machine learning and deep learning
algorithms. Thus future studies should also consider the effect of hyperparameter tuning.
                                             130


                                        CHAPTER 3
     DATA DRIVEN PARAMETER IDENTIFICATION FOR A CHAOTIC
         PENDULUM WITH VARIABLE INTERACTION POTENTIAL
3.1     Introduction
    Machine learning has led to many advances in the estimation of governing equations of
physical systems from sensor signals. This is useful in the engineering processes where in the
design phase, an abstraction of the physical system is used to write the governing differential
equations. However, the validity of the assumptions used in these models is not tested
until the part is manufactured and utilized in applications. This necessitates a data-driven
approach for studying the true underlying model of the system versus the idealized model
utilized in the design phase.
    There are many studies based on data-driven model identification including eigensystem
realization [173], equation-free modeling [174], empirical dynamic modeling [175], modeling
emergent behavior [176], automated inference of dynamics [177, 178, 179], nonlinear Lapla-
cian spectral analysis [180], symbolic regression [181, 182, 183], Kalman Filter [184], neural
networks [185, 186, 187] and time series [188] for model identification. These methods have
some drawbacks. For example, the symbolic regression is computationally expensive, and
it suffers from overfitting problems as the system complexity increases [189]. Furthermore,
neural network-based methods act as a black box and require large amount of data. It is
also difficult to assign a physical meaning to these black boxes.
    Another widely used method is sparse regression. The most commonly used sparse regres-
sion algorithms are LASSO [190, 191] and Ridge regression [191]. Schaffer et al. employed
sparse regression to approximate coefficients of nonlinear differential equations [192]. Tran
and Ward used sparse regression, splitting optimization, and compressed sensing to identify
governing equations [193]. Other recent studies on the sparsity of nonlinear systems can be
                                              131


found in [194, 195, 196]. In addition, Rey et al. implemented time-delay embedding for model
identification when measurements for some state variables are not available [197]. Wang et
al. provided an overview of the methods for data-driven identification of complex nonlinear
systems [198].
    Brunton et al. introduced SINDy for the identification of model parameters of nonlinear
systems [189]. SINDy is composed of three parts: 1) a feature library that includes a complete
basis of possible terms in the system equation. Since the feature library is composed of all
combinations of the functions, its size can become significantly large even if the number of the
chosen candidate functions is small. 2) An estimate of the derivatives from the experimental
measurements, and 3) application of sparse regression to determine the coefficients of the
governing equations. To reduce the size of the feature library, Rudy et al. [199] and Schmidt
et al. [182] used Pareto Front analysis [200] to reduce the number of candidate models.
    Other extensions of SINDy include using it for feedback control [201], and Model Predic-
tive Control (MPC) [202]. The latter work showed that SINDy-MPC outperforms neural-
network-based methods in terms of robustness and performance. Recently, SINDy was fur-
ther extended to stochastic dynamical systems [203].
    Most of the nonlinear models to which SINDy applies are composed of polynomial,
trigonometric, and rational expressions. Generally, a combination of various polynomial
functions and sinusoidal functions is used in the feature library. However, one of the limi-
tations of SINDy is that it performs poorly when using polynomial and sinusoidal terms to
approximate rational expressions in the governing equations [189]. Despite the large number
of publications on SINDy, its applicability to experimental data from nonlinear mechanical
systems has not been widely studied (see Refs. [204, 205] for some examples). Therefore,
more testing is needed before this method can be reliably used as part of the engineering
design cycle. This study takes a step in that direction and highlights some admonitions
that need to be considered when using SINDy on actual systems with unknown models. For
example, this study reveals the sensitivity of SINDy to the derivative approximation and
                                              132


shows that there is a limit on the time horizon over which the model yields a good fit. It
is also shown that the fitted model matches for the whole time range if the exact derivative
is known, further illuminating the important role of an accurate derivative estimation in
SINDy.
    There are many methods for derivative estimation from noisy signals [206], and SINDy
utilizes the Total Variation Regularization (TVR) method [207, 208]. However, TVR is
highly dependent on the selection of two parameters, α and ϵ, whose values are positive
real numbers, thus making TVR difficult to tune. This chapter shows that the resulting
coefficients obtained with SINDy for two example nonlinear systems are quite sensitive to
the selection of these two parameters. TVR is replaced by several methods which are com-
paratively easier to tune, such as cubic spline approximation, Savitzky-Golay smoothing,
Gaussian moving average approximation, and convolution smoothing. Other sparse regres-
sion algorithms such as LASSO and ridge regression are also used. SINDy is applied to
two simulated systems: the simple nonlinear pendulum and a model of a more complicated
chaotic pendulum with varying interaction potential. The latter model is based on an ex-
periment that Mork et al. [209] designed and built. In addition to numerical simulation, the
rotational angle from the chaotic physical pendulum is used as input to SINDy to see if the
identified model is similar to the theoretical one used during the design stage.
    The results show that the estimated nonzero coefficients grow rapidly as the TVR param-
eter α varies while keeping its other parameter ϵ constant. In fact, due to the large support
of α and ϵ, it is very difficult to find a parameter combination that yields the most accurate
derivative estimate. Therefore, this chapter examines and suggests several other derivative
estimation methods whose parameters can be more easily tuned. This study shows that
some of these methods can outperform TVR in terms of coefficient estimation and fit the
response data. However, even if the estimated response matches the actual response, the
results show that the estimated coefficients can significantly differ from the true model. The
robustness of all methods to noise is also compared, and an open repository for replicating
                                                133


the experiment [209] is provided.
     This chapter is organized as follows. Section 3.2 explains the modeling and the exper-
imental setup for the chaotic pendulum. Section 3.3 explains the SINDy algorithm with
detailed information on sparse regression. The results are provided in Section 3.4, while the
concluding remarks are included in Section 3.5.
3.2     Modeling and Experimental Setup
     This section describes the model and the experimental setup for the chaotic pendulum.
This experiment is for a complicated nonlinear pendulum whose interaction potential and
initial conditions could be accurately controlled. While the idea for the experimental ap-
paratus was inspired by the work in [210], the details in that work were not sufficient to
reproduce the original design. Therefore, the apparatus is re-designed and built, keeping
the main outline of the original design—see [209] for the design documentation and CAD
files. Specifically, the setup is composed of two towers: right and left. The right tower is
placed on a mini labjack (Table 3.1-No 1). A DC motor (Table 3.1-No 3) is connected on
top of a non-magnetic linear ball slider (Table 3.1-No 2). A disk with a radius of 4.445 cm
(1.75 in.) is connected to the DC motor via a coupler and a shaft. Magnets (Table 3.1-No 4)
with alternating polarity are attached to the circumference of the disk. A stepper motor (Ta-
ble 3.1-No 5) drives the DC motor and the disk with magnets on the slider with a scotch yoke
mechanism. The DC motor rotates at high speed, thus causing the rotating and translating
magnets to apply an eddy force to the aluminum disk on the left tower. The aluminum disk
is connected to an aluminum shaft of 6.096 cm (2.4 in) diameter. This shaft is fixed to the
left tower with two bearings (Table 3.1-No 7) and is connected to a simple pendulum with
an end magnet (Table 3.1-No 4). A photo-interrupter (Table 3.1-No 6) checks the speed of
the stepper motor. Another magnet (3.1-No 4) can be placed under the single pendulum as
an option. The rotational speed of the aluminum disk is measured with an optical encoder
(Table 3.1-No 8) and a rotary disk (Table 3.1-No 9). Figure 3.1 shows each part with labels
                                             134


that match Table 3.1.
                       Table 3.1: Part list for the experimental setup.
             Item Number                            Part Name
                               THORLABS L200 - 10.16 cm x 7.62 cm (4 in. x 3 in.)
                   1
                                       Lab jack, 2.62 cm (1.03 in.) vertical
                   2              Deltron Non-Magnetic Linear Ball Slides- S2-3
                   3                TSINY TRS-775W 12000 rpm DC motor
                                    0.635 cm (0.25 in.) dia. x 0.635 cm thick
                   4
                                             K& J cylinder magnets
                   5        SparkFun ROB-09238 Bipolar stepper motor with 200 steps
                   6                   Panasonic PM-L45 Photointerrupter
                   7                  Bone Swiss Bearings 8mm (0.315 in.)
                   8            US Digital EM2-2-10000-I Optical encoder module
                   9            HUBDISK-2-10000-315-IE Transmissive rotary disk
                  10                    NI USB-6356 Data acquisition box
Figure 3.1: Experimental Setup. See Table3.1 for parts 1-9. A: Photo-interrupter blocker,
B:Optional Magnet hole, C:Scotch Yoke.
    A safety polycarbonate cage was built with a thickness of 0.64 cm (0.25 in.) to protect
against the possibility of dislodged magnets. The speed of the DC motor and the stepper
motor is kept constant during the experiments, and they are controlled with an L298N motor
driver. An NI USB 6356 data acquisition box was used to collect data. In this study, both
the simulation from the experiment’s model as well as the resulting experimental data are
                                               135


used. For the simulation, the equation of motion is
                         mgrcm sinθ bv θ̇ bc sgn(θ̇) τF cos(ϕ) τdipole
                  θ̈ = −            −       −           +           +       .            (3.1)
                             I           I        I           I         I
where mg is the weight of the pendulum and rcm is the distance between rotation axis and
the center of mass of the pendulum. bv is the magnetic damping and viscous drag coeffi-
cient. bc is the Coulomb damping coefficient, I is the moment of inertia for the pendulum,
τF is the driving torque provided by the rotating magnets at high speed, and τdipole is in-
cluded in the equation of motion when the optional magnet (Figure 3.1-B) is used in the
experimental setup. The derivation of the expression for the dipole moment can be found
in Reference [210]. In addition, there is a phase difference ϕ between the disk with magnets
and the aluminum disk given by ϕ = wF t + δ, where wF = 2πfF , fF is the driving frequency
of the stepper motor, and δ is the initial position of the disk with magnets.
3.3     Sparse Identification of Nonlinear Dynamics (SINDy)
    There are two main components in SINDy: sparse representation and sparse regression.
Each of these components is described below.
Sparse representation: A nonlinear system can be represented according to ẋ(t) =
A(x)β, where xn×m contains the state variables of the system and ẋn×m is the time deriva-
tive of the state variables while n and m are the numbers of samples and state variables,
respectively. State space representation of the nonlinear system can be composed of a com-
bination of many nonlinear functions of the state variables. Possible combinations of these
nonlinear functions are stacked into a matrix A(x)n×p called a feature library where each col-
umn represents one possible nonlinear function. There is a coefficient corresponding to each
candidate function in the βp×m matrix where p is the number of these nonlinear functions.
                                              136


    ẋ(t) = A(x)β can be expanded into the matrix form
                                       
       ẋ1 (t1 ) . . . . . . ẋm (t1 ) 
       .         ..              .. 
       ..            .            . 
                                                                                               
                                          = 1 x x    P
                                                          ... x  P
                                                                       sin(x) cos(x) . . . β,           (3.2)
                                                     2
       .                          .. 
                                                                   k
       ..               ..
                           .       .  
                                       
        ẋ1 (tn ) . . . . . . ẋm (tn )
where xPk represents the kth order nonlinearities in the system such that
                                                                                       
                                    k      k−1                   k              k
                               x1 (t1 ) x1 (t1 )x2 (t1 ) . . . x2 (t1 ) . . . xm (t1 ) 
                               .               ..           ..     ..      ..     .. 
                               ..               .            .      .       .      . 
                        xPk =  .                                                         .             (3.3)
                                                                                       
                               ..             ..           ..     ..      ..     .. 
                                               .            .      .       .      .   
                                                                                       
                                    k      k−1                   k              k
                                x1 (tn ) x1 (tn )x2 (tn ) . . . x2 (tn ) . . . xm (tn )
The sin(x) and cos(x) terms represent the sinusoidal nonlinear candidate functions, and this
provides the flexibility to choose the type of nonlinear functions to be used in this feature
library A(x) depending on the system whose parameters are investigated [189]. While gen-
erating the feature matrix, the different combinations of state variables and their kth order
polynomial versions are used. The number of the functions p can increase dramatically when
the maximum order k for the polynomial functions increases slightly. Since the equation of
motion of the system contains several terms among these candidate functions, the coefficients
of most of them will be zero. Therefore, β is a sparse matrix. The nonzero coefficients in
β can be found using sparse regression, and they describe the governing equations of the
system.
Sparse regression: Let us assume there is a model yi = β0 +                           βj Xij , where p repre-
                                                                               Pp
                                                                                 j=1
sents the number of predictors for the model, n is the number of samples with 1 ≤ i ≤ n,
and β contains the coefficients for each predictor. Generally, this model can be solved using
the linear regression method when the number of samples is larger than the number of pre-
dictors (n > p). The intercept term β0 can be removed by normalizing the columns of the
Xij matrix to zero mean and unit variance. Therefore, the optimization problem becomes
                                                    137


                  βj Xij ||2 . However, for the case where p ≫ n, overfitting can occur [211, 212].
           Pp
minp ||y −    j=1
β∈R
Therefore, the coefficients for each predictor are more accurately obtained by training data
samples. Nevertheless, when these coefficients are validated on the test set, the test set and
the model obtained with the coefficients found in the training set will not match. It has
been discovered that the algorithm finds non-zero coefficients for some predictors which ac-
tually do not exist in the actual system. Therefore, sparse regression is used to decrease the
number of predictors with non-zero coefficients during regression. In sparse regression, most
of the predictors will have zero coefficients, and the weight matrix β will be sparse. There
are several ways to decrease the number of predictors, and these are called best-subset selec-
tion, forward, and backward-step wise selection, forward-stage wise regression, and shrinkage
methods such as ridge classifier and LASSO [190, 191].
    An alternative method for sparse regression is used in Reference [189]: an initial guess
for the coefficients in the β matrix is found by linear regression. Then, a threshold value
for all the coefficients is selected. The coefficients under this threshold value are set to zero,
and the indices of non-zero coefficients are defined. A new feature library(A(x)) is generated
by eliminating the columns with zero coefficients. Finally, least squares is again applied to
this feature library, and the coefficients with values below the threshold are eliminated. This
procedure is iterated 10 times resulting in a sparse coefficients’ matrix β.
3.4     Results and Discussion
    In this section, the results for simulated and experimental data are explained. This study
also shows the effect of the regularization parameter of TVR on the estimated coefficients
and compares TVR to other derivative estimation methods.
3.4.1    Total Variation Regularization (TVR)
    SINDy requires the derivatives of the state variables as input, and when these derivatives
are not available, an estimate is obtained using TVR. TVR has two different parameters
                                                  138


that affect the approximation quality: the regularization parameter α and a constant ϵ.
Despite the large influence of these parameters on the derivative estimates, they are difficult
to optimize. Some guidance on choosing α was provided in Reference [208] when the noise
amount in the measurement is known. However, this can be challenging in practice due to
unknown levels of noise. Therefore, trial and error approach is used for selecting the TVR
parameters. For simulated data, the ground truth for the derivatives is known, so TVR’s
accuracy can be checked. However, for experimental data, it is extremely difficult to judge
how well TVR approximates the true derivatives.
3.4.1.1    Noise-free Case
    One of the nonlinear systems to which SINDy is applied is the simple pendulum governed
by the equation of motion
                                          −g           c
                                     θ̈ =    sin(θ) − θ̇,                                 (3.4)
                                          L           m
where m is the mass of the pendulum, L is the length of the rod of the pendulum and c
is the damping coefficient. Figure 3.2 shows the computed TVR derivatives using α = 10
and ϵ = 1012 . Polynomial order up to 4 and the usesine option of the code provided
in Reference [189] are used for sparse regression. This code applies sparse regression by
iteratively thresholding the least square coefficients. For all applications presented in this
section, the threshold value λ = 0.1 was chosen for sparse regression. Figure 3.2 shows that
the estimated and the simulated derivatives match well for both state variables. However,
the aim of SINDy is to predict the underlying model, so the coefficients found by sparse
regression need to be checked. The number of nonzero coefficients in the coefficient matrix
β is higher than the number of terms in the equation of motion (see Table 3.2). Further,
most of the nonlinear terms whose coefficients are not supposed to appear in the matrix
have large coefficients. To pinpoint the reason for this mismatch, the simulation derivatives
were injected directly into the sparse regression algorithm, and an exact model match was
                                              139


Figure 3.2: Simple pendulum simulation vs SINDy approximation based on TVR derivatives.
(c = 0.5 Ns/m, m = 2 kg and L=1).
obtained. This highlights the importance of the derivative estimate and cautions that even
though the estimated derivative may look good, the model identification can still fail.
Table 3.2: Coefficients found by sparse regression for the simple pendulum based on TVR
derivative estimations.
                      Term         1            x1            x2       ...
                       x1          0            0             1        ...
                       x2     8.2693×106  1.8066×106       -0.2506     ...
                      Term      sin(x1 )     cos(x1 )      sin(x2 )    ...
                       x1          0            0             0        ...
                       x2    -2.1477×106        0       -8.6466×106    ...
3.4.1.2    Regression Overfitting Check
    In addition, since the time series of the estimated model matches that of the exact
model—despite estimating significant superfluous coefficients, this suggests the possibility
of overfitting in the regression. In order to check this possibility, the simulation data is
split into a training set (70%) and a test set (30%). Specifically, the training set is used to
predict the model of the system over the first part of the signal, which was 7 seconds of the
pendulum simulation. The last part of the simulation—which is 3 seconds for the pendulum
                                              140


example—was assigned as the test set, and the predicted model was solved for the test set
time span. Figure 3.3 shows that the estimated model matches the simulation; therefore,
although the possibility of overfitting is eliminated for this data, a similar check is needed
for other data sets.
               Figure 3.3: Simple pendulum training and test set predictions.
    To further test the performance of SINDy, it is applied for the estimation of the model of
a chaotic pendulum with varying interaction potential, see Section 3.2. This study forgoes
the optional stationary magnet that repels the rotating pendulum magnet, see part B in
Figure 3.1. Using this magnet in the experimental setup results in a dipole torque term in
the equation of motion. Since this nonlinear torque term includes rational expressions and
SINDy does not perform well in the presence of rational terms [189], the optional magnet
was omitted. The system was simulated with zero initial conditions, and its derivatives
were predicted with TVR. The coefficients for each term in Eq. (3.1) are obtained from
Reference [210], and the state space form of the nonlinear system reads
                                              ẋ1 = θ̇
                ẋ2 = −73.698 sin(θ) − 0.3734θ̇ − 0.423sgn(θ̇) + 46.596 cos(ϕ),           (3.5)
                                          ẋ3 = 6.4717
where x1 = θ, x2 = θ̇, x3 = ϕ, and α = 2 × 10−5 and ϵ = 1012 was set. Note that since the
                                                 141


Figure 3.4: Estimation of the chaotic pendulum response with SINDy when TVR deriva-
tives (top) and simulation derivatives (middle) are used. The bottom panel compares the
derivatives of the simulation to their TVR counterpart. (TVR parameters: α = 2 × 10-5 and
ϵ = 1012 ).
signum function appears in Eq. (3.1), it is added to the feature library. For the polynomial
function candidates, a maximum degree k = 3 was selected, and trigonometric function terms
were also included in the feature library. Figure 3.4 shows the resulting estimated model
and derivatives. The top panel shows that SINDy was able to estimate nearly the first four
seconds of the simulation even though the bottom panel shows that the TVR-estimated and
the simulated derivatives match for the full 10-second time horizon. The middle panel shows
that when using the simulated derivatives in SINDy, the estimated system response matches
the simulation throughout the time horizon.
    Table 3.3 compares the estimated model coefficients and the ones used for the simula-
tion. The left part shows that using the ‘exact’ derivatives correctly identifies the model
coefficients. On the other hand, the right part shows that estimating the derivatives using
TVR leads to correctly estimating four different coefficients; however, some of the candidate
functions which should vanish have nonzero coefficients leading to a mismatch between the
estimated and the simulated models.
                                              142


Table 3.3: Coefficients used in the simulation versus those estimated by SINDy. Shaded cells
highlight the matching coefficients.
                       Simulation Coefficients                                  Estimated Coefficients
  Term     1   x1    x2    sin(x1 ) cos(x3 ) sgn(x2 ) sgn(x3 )    1    x1    x2     sin(x1 ) cos(x3 ) sgn(x2 ) sgn(x3 )
   ẋ1     0   0     1        0         0       0        0        0    0     1         0         0       0         0
   ẋ2     0   0  -0.3734 -73.698 46.596 -0.423          0     -19.386 0  -0.3706 -73.8088 46.5961 -0.4216     19.3841
   ẋ3  6.4717 0     0        0         0       0        0     3.6638  0     0         0         0       0      2.8073
Figure 3.5: Estimation of the simple pendulum response based on SINDy and simulation
data with SNR = 20 dB.
3.4.1.3      Noise Effects
     To investigate the effect of noise on model estimation, SINDy was applied to the simple
pendulum simulation data with additive Gaussian noise with SNR = 20 dB. Figure 3.5
shows the predicted model response and the derivatives estimated with TVR. It is seen
that SINDy is able to match the simulation response despite a large amount of noise in the
simple pendulum simulation. However, Table 3.2 shows that there is a mismatch between
the coefficients of the estimated and simulated models. Similar to the noise-free case, most of
the nonlinear terms in the feature library of the noisy system have large coefficients. While
SINDy correctly estimated the system response for this numerical example, this may not be
the case for other systems.
     In addition, Gaussian white noise with SNR = 35 dB has been added to the chaotic
pendulum simulation (an SNR of 20 dB did not produce meaningful results). α = 100 is
used for TVR to have a smooth derivative. Figure 3.6 provides the estimated model response
                                                         143


where it is seen that SINDy can estimate only the first three seconds. This further shows
that α and ϵ need to be carefully tuned; however, these two parameters can be any positive
real number which makes them hard to optimize. The number of the estimated nonzero
coefficients is larger than the ones in Eq. (3.5). Some of the terms in the feature library
which are supposed to have zero coefficients have nonzero coefficients due to estimating the
derivative. Therefore, the sensitivity of the model estimation to TVR parameters is once
again confirmed.
Figure 3.6: Estimated response of the chaotic pendulum using TVR derivatives with α = 100,
ϵ = 1012 , and SNR = 35 dB.
3.4.1.4     Parameter Sensitivity in TVR via an Example
    Next this study focuses on the search for suitable α and ϵ parameters using a Lorenz
system with the parameters listed in the caption of Figure 3.7, whose model contains N = 7
nonzero terms. Lorenz system example given in the original SINDy paper [189] was used
since it was an example where the actual model was correctly predicted. The Lorenz system
simulation contained zero mean Gaussian noise with a variance of 0.01. Figure 3.7 shows
the influence of varying α and ϵ on the number of significant nonzero terms. Two different
ϵ values are used, and for each value of ϵ, α is varied, and the number of nonzero terms is
plotted.
    Figure 3.7 shows the sensitivity of the model prediction to the TVR parameter values.
Specifically, focusing on the left plot in Figure 3.7 for ϵ = 1012 , large jumps are observed
                                              144


Figure 3.7: Number of nonzero coefficients estimated by SINDy for Lorenz system with
parameters: σ = 10, β = 2.667, ρ = 28. Initial conditions: x0 = −8, y0 = 8 and z0 = 27.
ϵ = 1012 (left), ϵ = 10-3 (right). The correct number of nonzero terms is N = 7.
in the number of nonzero coefficients for small variations in the α value with the number
of coefficients ranging in 7–13 for ϵ = 1012 . The right plot shows even more radical values
for N when ϵ = 10−3 : varying α, in this case, results in large and scattered values of N .
Although the correct total number of nonzero coefficients N = 7 can be achieved when the
regularization parameter is too small—(as seen in Figure 3.7(right) for ϵ = 10−3 )—N jumps
to 80 when α is slightly increased. These two plots show that SINDy is unstable with regard
to the TVR parameters α and ϵ.
3.4.2    Alternative Derivative Estimation Methods
    The limitations of using TVR led us to seek other methods for estimating derivatives of
noisy signals [213]. Among the methods in Reference [213], Savitzky-Golay approximation
was investigated, cubic splines, Gaussian moving average, and convolution smoothing to
estimate the derivative. The following paragraphs briefly describe the tuning parameters for
each method.
Savitzky-Golay: Savitzky-Golay filter is used to smooth noisy data and based on linear
least squares [214]. It has two parameters defined on the natural numbers: the window
length and the polynomial order. There are certain limitations on the window length: it
                                               145


must be an odd natural number and cannot be less than the chosen polynomial order or
greater than the signal length. The Savitzky-Golay Python built-in function has a derivative
option.
Cubic smoothing splines: These splines depend only on the smoothing parameter vary-
ing between 0 and 1.
Gaussian weighted moving averages: This method has only one positive integer pa-
rameter: the window length.
Convolution smoothing: This method smooths noisy signals, and the derivative can
then be computed using centered difference or the function derivative option of MATLAB.
Its parameters are the window length and the window type, where for the latter, a Hanning
window was chosen. The window length must be an odd, natural number similar to Savitzky-
Golay. Arguably, these methods can be more easily tuned than the TVR approach.
    This study uses all four, in addition to TVR, for the chaotic pendulum simulation and
compares the resulting performance of SINDy in terms of the response match and the esti-
mated coefficients. Figure 3.8 compares the ground-truth response to the estimated model
responses corresponding to the different derivative estimates as well as the simulated deriva-
tive. It is seen that the estimated response matches the simulation when the ‘true’ derivatives
are used in SINDy. In this case, the estimated coefficients also match the true model. More-
over, it is seen that Savitzky-Golay performs the best for both state variables. The deviations
of the different methods from the true response can be explained by either their extraneous
estimated terms or by errors in the nonzero coefficients. With the exception of the Gaussian
moving average, all the methods resulted in more nonzero coefficients than the true model.
The deviation of the Gaussian moving average method can be attributed to the error in the
estimated coefficients.
    The effect of noise on the performance of the derivative estimation methods was tested
by adding white noise with SNR= 35 dB to the chaotic pendulum simulation. Figure 3.9
shows that with noise at best the methods can estimate roughly the first 3 seconds of θ(t),
                                              146


            Figure 3.8: Estimated responses for the simulated chaotic pendulum.
but they all perform poorly for θ̇(t). Responses obtained with Savitzky-Golay, convolution
smoothing, and TVR-based derivatives deviate from the simulation around t = 3 for θ(t).
However, in contrast to TVR, Savitzky-Golay and convolution smoothing are easier to tune
than TVR, which gives them an advantage. Further, the estimated coefficients for Savitzky-
Golay and the convolution smoothing are closer than TVR to the true model. Nevertheless,
all derivative estimation methods resulted in spurious nonzero terms in comparison to the
true model.
3.4.3    Experimental Results
    Experimental data is collected from the setup explained in Section 3.2. Similar to the nu-
merical simulation, the optional magnet B was not included. Position data of the pendulum
was collected using an incremental encoder. The data contains A and B quadrature signals,
and the raw data is collected on analog channels of a data acquisition box with a sampling
frequency of 106 Hz per channel. The analog signals were digitized by hard-thresholding at
4V, and by tracking the A and B signals at each time step, the pendulum’s angular position
                                             147


Figure 3.9: Estimated responses for the simulated chaotic pendulum with noise (SNR=35
dB).
in radians is obtained.
    In contrast to numerical data where both θ(t) and its derivative are known, only θ(t)
can be observed in the experimental setting. In this case, [189] suggests using delay recon-
struction of the time series combined with Singular Value decomposition (SVD) to obtain
the remaining state variables. Therefore, this study used Taken’s embedding with embed-
ding dimension 10 and delay parameter 1 and the first three columns of the right singular
matrix V to compute the derivatives. However, this approach led to nearly zero response,
so we proceeded by estimating the derivative of θ(t) using TVR as well as the methods of
Section 3.4.2.
    The results for estimating the response using the experimental data are plotted in Fig-
ure 3.10. The figure shows that none of the derivative estimation methods fits the exper-
imental data, and while they all have large errors, the error in TVR, in particular, grows
rapidly beyond t = 20. In contrast, Savitzky-Golay derivatives result in the closest fit to the
experimental data.
                                            148


Figure 3.10: Estimated model responses for experimental data of chaotic pendulum (Top)
and a zoomed-in version (bottom).
3.5     Conclusion
     This study investigated the performance of SINDy, a data-driven tool for model iden-
tification from numerical and experimental data. Specifically, SINDy was applied to two
nonlinear systems: a simple pendulum and a chaotic pendulum with variable interaction
potential. The time series for the former was obtained only via simulation, while the time
series for the latter originated from both simulation and experiments. The performance
was assessed based on the estimated model’s fit to the data and the estimated coefficients’
correctness– when the true model is known. It is found that when the derivatives of the
states of the system are precisely known, SINDy yielded the correct response and the correct
model. However, sparse representation often requires estimating derivatives from the time
series in practice. So TVR, the standard tool in SINDy, was compared to other available
tools for derivative estimation with noisy and noise-free data.
     The results show that TVR is hard to tune due to the unbounded support of its param-
eters and that even if it is well-tuned, it can result in a good fit for a limited time duration
that seems to depend on the system complexity. Further, even when TVR fits the data, the
                                               149


resulting coefficients can be incorrect. Among the other derivative estimation methods, the
Savitzky-Golay filter showed promising results for both simulation and experimental data,
despite choosing its parameters using trial and error. Convolution smoothing is another
method that results in a good fit for simulation data.
    One reason for adopting TVR in SINDy is its noise robustness. However, it is shown that
for the examples shown in this study, Savitzky-Golay provided nearly the same performance
as TVR, see Figure 3.6. It may be possible to obtain better results for both nonlinear
systems by judiciously tuning the parameters. Nevertheless, even if the fit with any given
data is perfect, it is still possible that the identified model will be incorrect. This is a
crucial drawback when dealing with experimental data where the ground truth is unknown.
Therefore, more work is needed to generate error bounds on the model identified by SINDy,
and guide the parameter choices in derivative estimation.
                                              150


                                         CHAPTER 4
ANALYSIS OF ENGINEERING SURFACES USING TOPOLOGICAL DATA
                                          ANALYSIS
4.1     Introduction
    Surface texture analysis is a prominent field of research with many applications, including
tribology [215], metrology, remote sensing [216], medical imaging [217], and the marine in-
dustry [218]. One specific active area of research is the fast and automatic feature extraction
from image data that reduces the need for the input of expert users. In addition to the need
for reliable, automatic feature extraction, other challenges in surface texture analysis include
the size of the data, which significantly increases with increasing the resolution. Therefore,
there is a need for adaptive and automatic tools for feature extraction from surface images.
    The majority of the tools proposed for roughness analysis of engineering surfaces are
based on decomposing the image data using a set of basis functions that can be grouped
into three main components: form, waviness, and roughness. Form contains the lowest
frequencies, while waviness is composed of sinusoidal waves in the middle frequency range.
Larger frequencies are included in the roughness component. Generally, most surface analysis
tools focus on finding the reference surface or profile. Depending on the feature extraction
tool used, the reference surface (profile) is composed of form, or it is the combination of both
form and waviness. The surface roughness can then be obtained by subtracting the form
and the waviness from the original surface.
    For surface profile analysis, the Gaussian filter is one of the most commonly used filters
in the literature [219, 220, 221, 222, 223]. It is used as a low-pass filter to obtain a smoother
surface, and the roughness profile is then obtained by subtracting the filtered profile from the
original one. Raja et al. used a Gaussian filter to obtain an approximation to surface profiles,
and they compared this approximation with the ones obtained from the 2RC filter, one of
                                                151


the earliest filters used for surface metrology [220]. Hendarto et al. focus on the roughness
analysis of wood surface using Gaussian filter [222]. However, the main drawback of the
Gaussian filtering approach is the boundary distortion, where the mean of the end parts of
a surface profile cannot be used [220]. Raja et al. suggested that the end parts of the mean
line should be ignored for evaluation [220], while this is not feasible for profiles with shorter
lengths. Therefore, Janecki proposed a solution that extrapolates both ends of profiles with
polynomial functions to eliminate the edge effect [223]. Fast Fourier Transform (FFT) is
also another widely adopted filtering approach for feature extraction from 1D signals [55]
including surface profiles. For example, Raja and Radhakrishnan used FFT to denoise 1D
surface data and obtain the corresponding roughness profiles [224]. For surface areal data,
two dimensional implementations of FFT and Gaussian filter can be utilized [225, 226].
Dong et al. provide an extensive understanding of two-dimensional FFT (2D-FFT) analysis
on engineering surfaces [227]. Peng and Kirk applied 2D-FFT to surface images of three
different wear particles and used spectral intensity values in angular and radial spectra to
identify the type of wear particle [226]. Empirical Mode Decomposition (EMD), one of the
most commonly adopted signal decomposition tools, is another approach used for the analysis
of engineering surfaces. Several versions of EMD are proposed to analyze surfaces such as
Bidimensional EMD (BEMD) [228], Image EMD (IEMD)[229], Bidimensional Multivariate
EMD (BMEMD) [230]. However, the computation of EMD in 2D is slow compared to other
approaches.
    Discrete Cosine Transform (DCT) is another widely used approach for decomposing a sur-
face scan into its form, waviness, and roughness components [231, 232, 233, 234]. Lecompte
et al. developed an approach to identify the form and the contribution of classical defects
such as positioning error and tool deflection [232]. They used only a certain percentage of
the DCT coefficients to obtain a filtered surface. However, when there are a large number of
images, each image may require the usage of a different percentage of the DCT coefficients
to generate the form. In general, DCT requires selecting two threshold values for delineating
                                              152


the three different components of the surface.
    Discrete Wavelet Transform (DWT) is another approach used extensively for surface
texture analysis [235, 236, 220, 234, 237, 238, 239, 240, 241]. Chen et al. introduced DWT
for surface profiles [235]. Liu et al. obtained a threshold that isolates the form of the surface
by computing all possible approximations that can be obtained using the coefficients at
each level. Another example of this approach is seen in Reference [220, 237] where the
separation of the three components of a profile is performed using multi-resolution analysis
approximations. The common procedure is to apply the DWT at a certain level and obtain
the approximation and detail coefficients, and then use the approximation coefficients for
the reconstruction of the form component [242]. The detail coefficients are then used to
reconstruct waviness and roughness. Nevertheless, there is a need for a guideline on how to
automatically choose the threshold that separates the mid-frequency content from the higher
ones in the DWT approach. In addition, the selection of the mother wavelet function can
also affect the resulting components. Stkepien et al. used autocorrelation, cross-correlation,
and entropy-based test to evaluate the performance of different wavelet functions used in
surface texture analysis [243].
    It is believed that there is no approach for automatically separating the form, waviness,
and roughness components for DCT and DWT, and the current practice is to manually select
them using the user’s experience and judgment call [244]. Therefore, the first contribution
of this study is to propose an automatic, data-driven approach for identifying the needed
thresholds for DCT and DWT. For DWT, this study utilizes the energy of the reconstructed
signals to separate the waviness and roughness from each other, while for DCT, this study
leverage the surface entropy to define the form and waviness components. Roughness is then
found by subtracting the filtered surface from the original one.
    In addition to this study’s contributions to the automatic threshold selection in DCT and
DWT, the machining processes through which surface samples are generated are identified.
Most studies in the literature are focused on small patch processes with few samples where
                                                153


human interpretation is heavily used to identify and compute profile or surface roughness.
In contrast, the proposed approach is validated for a large data set obtained from simulated
surfaces and experimental surface scans. The roughness components of surfaces and profiles
are obtained using proposed automatic threshold algorithms, and the 1D and 2D features
introduced in the ISO standards [245, 246] are extracted. Machine learning is then utilized
to assess the accuracy of the automatic thresholds. Specifically, the obtained features are
used in supervised classification algorithms to classify surfaces labeled with respect to the
generating surface parameter for the simulated surfaces and the generating machining process
for the experimental data.
    The second contribution of this study is to use persistent homology, which is a tool
from TDA for quantifying the roughness of surfaces. Specifically, 0D and 1D sublevel set
persistence are used on surface profiles and surface images to compute the sublevel set
persistence diagrams [247]. Then, Carlsson Coordinates [71, 69], persistence images [74],
and template functions [72] are utilized to extract features from these diagrams. The TDA-
based approach is used on synthetic data sets to identify the level of roughness, and its
performance is compared to features extracted using traditional image analysis.
    The final contribution of this study is to develop an approach for real-time surface tex-
ture analysis. Investigation of surface scans can be performed using the traditional signal
processing approaches mentioned above. A combination of automatic threshold selection
algorithms and High Performance Computing tools can make these approaches viable for
real-time surface texture analysis. Since TDA-based tools show promising results for sur-
face roughness analysis, this study proposes an alternative approach for real-time surface
texture analysis using topological simplification tools from TDA. Specifically, the proposed
framework takes the physical surface images as input and utilizes topological saliency [1]
from TDA. The proposed pipeline provides clusters of the surface images to identify the
regions where additional machining is required. It is hypothesized that this framework can
significantly reduce material waste and time for obtaining a smooth surface finish.
                                              154


Figure 4.1: The roughest and smoothest surface in the synthetic data set obtained with
H = 0 and H = 1, respectively.
    This chapter is organized as follows. Section 4.2 explains how the synthetic data set was
obtained and analyzed using traditional tools and the TDA-based approach. It also compares
traditional feature extraction approaches to the TDA-based method. In Section 4.3, prepro-
cessing of experimental data and the automatic threshold selection algorithms are described.
In addition, the results of heuristic threshold selection and proposed automatic threshold se-
lection algorithms are compared. Section 4.4 explains the topological simplification approach
based on topological saliency and provides the resulting surface clustering.
4.2     Data-driven and Automatic Surface Texture Analysis Using
        Persistent Homology
4.2.1    Simulation
    Synthetic surfaces are used to test the proposed approach, and they are generated using
the model provided in Reference [248]. The surface roughness of the resulting surfaces is
controlled by Hurst roughness parameter H ∈ [0, 1]. As the value of H varies from 0 to 1,
the generated surface gets smoother and smoother, see the example surfaces in Figure 4.1.
    The [0, 1] range is divided into 200 intervals, and 201 roughness parameters are obtained.
Each roughness parameter was then used to generate synthetic surfaces. Then, the resulting
surfaces are categorized according to their roughness parameter value into three classes. The
                                               155


first and last 67 surfaces are categorized as rough and smooth surfaces, respectively. The
surfaces in between these two cases were tagged as somewhat rough.
     In addition to the generated surface data, surface profiles are also investigated in this
study. Six surface profiles are extracted in two perpendicular directions of surfaces. There-
fore, there are totally 1206 surface profiles whose labels match the underlying generated
surfaces.
4.2.2     Methodology
     This section briefly explains the feature extraction methods from surfaces and surface
profiles. The methods used in this study are categorized into two groups: 1) traditional
image/signal processing methods and 2) TDA-based approach. For the first one, the general
idea is to find a reference surface or a profile and subtract it from the original measurement
to obtain the roughness surface or profile. Then, height parameters, spatial parameters, and
hybrid parameters provided in Sections 4.1-4.3 of Reference [246] are computed for roughness
profiles. While working with roughness surfaces, height and hybrid parameters are used as
features, and they are provided in Secs. 4.2 and 4.4 of Reference [245]. For the 1D peak
selection method of FFT, the coordinates of the peaks of FFT and PSD plots are used. The
angular spectral densities are used as features in the case of the two-dimensional FFT.
     For the TDA based approach, three featurization techniques are used to generate feature
vectors for persistence diagrams: Carlsson Coordinates [71, 69], persistence images[74], and
template functions[72].
4.2.2.1     Gaussian Filtering
1D-Implementation Gaussian filtering is one of the most commonly used tools for profile
filtering [220]. Gaussian filtering is implemented in 1D and 2D to analyze surface profiles
                                                156


and areas, respectively. The 1D kernel definition is given as [219],
                                                                 !
                                          1              x 2
                                 G(x) =      exp − π                ,                      (4.1)
                                         αλc              αλc
where α =       ln2/π, and λc is the roughness long wavelet cutoff [220]. Cutoff selection is
             p
performed with respect to the iterative procedure provided in Reference [249]. First, the
surface roughness parameter, Ra is estimated for surface profiles using the expression [246],
                                            1
                                               Z
                                      Ra =        | z(x) | dx,                             (4.2)
                                            L   L
where L represents the measurement length of the profile. A cutoff value is chosen from
Table 3-3.20.2-1 provided in Reference [249]. Then, Ra is measured for the roughness profile
after applying the filter using the chosen cutoff value. If the new Ra is outside of the range
of the old Ra , a new cutoff is selected with respect to new Ra . However, if it is larger than
the measurement length, the algorithm automatically picks the first chosen value as cutoff.
This procedure is for nonperiodic profiles, and one can refer to [249] for more details.
     After setting the cuttoff value and applying the Gaussian filter, a filtered profile is ob-
tained. This profile is also called the roughness mean line. The roughness profile is obtained
by subtracting the mean line from the original surface profile. Then, the profile features
provided in Reference [246] are computed to generate the feature matrix for supervised clas-
sification.
2D-Implementation The 2D Gaussian kernel expression is given as
                                                                   !
                                             1          −x2 − y 2
                               G(x, y) = √        exp                 ,                    (4.3)
                                            2πσ 2           2σ 2
where σ is the standard deviation. After the kernel in 2D is computed, the surface mea-
surement is convoluted with the kernel to obtain the filtered surface. The convolution is
performed using
                                      XW    W
                                            X
                           I[i, j] =             G[u, v]f [i − u, j − v],                  (4.4)
                                     u=−W v=−W
                                                157


            Figure 4.2: Filtered surfaces obtained using kernel sizes 5, 11 and 21.
where 2 × W + 1 equals the kernel size K, f is the surface measurement, and I is the filtered
surface. The standard deviation σ is defined using the expression, σ = K/6.
    Gaussian filtering is applied in 2D to the roughest surface in the synthetic data set with
three different kernel sizes. The resulting surfaces are provided in Figure 4.2. It is seen that
larger kernel sizes provide smoother filtered surfaces. The roughness surface is obtained by
subtracting the filtered surface from the original surface. Smoother filtered surfaces allow
having higher frequency components in the roughness surface. Therefore, a kernel size of
21 is selected and kept constant in all filtering operations. Then, areal parameters obtained
from Reference [245] are computed on roughness surfaces obtained after filtering. These
parameters constitute the features for supervised classification.
4.2.2.2    Fast Fourier Transform (FFT)
1D - Denoising Fast Fourier Transform is one of the most adopted signal and image
processing tools. It was employed to analyze surface profiles in Reference [224]. The main
idea is to manipulate the spectrum and then apply inverse FFT to obtain a filtered profile.
FFT was applied on the surface profiles, and their normalized spectra were obtained. A
cutoff value is selected between zero and one. The amplitudes below that cutoff are set to
zero, thus eliminating the corresponding frequencies from the data. Inverse FFT is then
applied to the modified spectrum to yield a mean line profile. Subtracting the filtered profile
from the original one gives the roughness profile.
                                               158


                       1.0
                                                   0.75
                  | X(f)|
                       0.5
                                                   0.50
                                                   0.25
                       0.0
                                 100    200      300      0.0       0.2    0.4
                               Frequency(1/µm)                  Distance
Figure 4.3: (Left) The spectrum of the plots. Filtered profiles obtained from two cutoff
values, 0.2 (middle) and 0.4 (right).
     An example of a filtered profile is provided in Figure 4.3. The figure shows that larger
cutoff values provide smoother profiles. Therefore, a cutoff value of 0.4 is chosen to eliminate
high frequencies in the filtered profile. Profile parameters are computed for each roughness
profile, and a feature matrix is generated.
1D - Peak Selection The peaks’ coordinates in Fast Fourier Transform, Power Spectral
Density (PSD), and Autocorrelation (ACF) plots can be used as features in 1D signals [55],
and that is the approach implemented here for identifying the level of roughness in the
synthetic data set. However, ACF plots were excluded since no peaks were detected in the
ACF plots (see Figure 4.4).
     First, the FFT and the PSD spectra are computed from the surface profiles. Then,
peak selection is performed with respect to two restriction parameters to locate the true
peaks of the spectrum. These parameters are minimum peak height (MPH) and minimum
peak distance (MPD). MPD is the minimum sample number between two consecutive peaks.
MPD is selected as 7 and 10 for PSD and FFT plots, respectively. The expression for MPH
is
                                  M P H = ymin + α(ymax − ymin ),                          (4.5)
where ymin and ymax are 40th and 50th percentile of the amplitudes in the spectrum, respec-
tively, and α is set to 0.5.
     MPD and the parameters in the MPH expression can be adjusted depending on the
                                                 159


                                         ×10−3
                1
                                       1                      50
                0                      0                        0
                        100 200 300 0                200         0.00        0.25
                  Frequency (1/µm) Frequency (1/µm)                        lag
Figure 4.4: Selected peaks for FFT and PSD plots with respect to MPH and chosen MPD
values. Red horizontal lines represent the MPH.
data set by visually inspecting the selected peaks. This parameter tuning was performed
for three different surface profiles obtained from the roughest surface in the data set. After
several adjustments, some meaningful peaks are obtained for both spectra, as shown with
an example in Figure 4.4. Since manually inspecting all the spectra is time-consuming, this
parameter tuning for MPD and MPH is performed for only three profiles, and the tuned
parameters are fixed for all the other profiles. After selecting the peaks, their coordinates
are used as features for classification. The user can control the size of the feature matrix by
specifying the number of peaks.
2D - Implementation FFT can also be applied to images. Two-dimensional FFT is
applied to gray scale synthetic surfaces. Areal power spectral density is computed with
respect to the formula [226]
                                                1
                   G(n/N Tx , m/M Ty ) =                | H(n/N Tx , m/M Ty ) |2 ,               (4.6)
                                           M N Tx Ty
where n = 0, 1, . . . , N − 1 and m = 0, 1, . . . , M − 1. M and N are the size of the image,
while Tx and Ty are the sampling intervals in x and y directions. H(n/N Tx , m/M Ty ) is the
2D Discrete Fourier Transform obtained by using
                                         M
                                         X −1 N
                                              X −1
                 H(n/N Tx , m/M Ty ) =              h(pTx , qTy )e−j2πnp/N e−j2πmq/M             (4.7)
                                         q=0 p=0
where h(pTx , qTy ) represents the surface measurement, p = 0, 1, . . . , N −1 and q = 0, 1, . . . , M −
1.
                                                160


    Areal power spectral density (APSD) plots are analyzed using polar coordinates. In this
study, Polar FFT [250] is computed to obtain angular and radial spectrums, similar to [226].
In addition, Dong and Stout applied 2D FFT directly to the roughness surface obtained
after subtracting the reference surface from the original measurement. In this study, this
approach is also employed, and the Gaussian filtering explained in Section 4.2.2.1 is combined
with it. APSD plots, as well as radial and angular spectra for two surfaces, are provided
in Figure 4.5. Since there are fewer peaks in the radial spectra, only the angular spectra
Figure 4.5: APSD plots, radial and angular spectrum of roughest (first row, H = 0) and
smoothest (second row, H = 1) surfaces. APSDs are obtained after applying the 2D FFT
on roughness surfaces.
is taken into account. Density values of the five peaks in the angular spectrum are used as
features in addition to ζmax
                          c
                              and ζmax
                                   d
                                       , given in Reference [226].
4.2.2.3    Topological Data Analysis (TDA)
    In addition to standard signal processing tools used in Secs. 4.2.2.1 and 4.2.2.2, this study
uses persistent homology from TDA to extract features from synthetic surfaces. Persistent
homology is the flagship tool from TDA, and it analyzes the shape of the data. This section
briefly explains persistent homology, and one can refer to Refs. [140, 139, 62] for more details.
                                               161


Background The sublevel sets of the images (see Figure 4.6a and 4.6b) and of surface
profiles are used. Let f be a function that represents data set such that f : X → R. The
domain of surface profiles or surfaces is denoted as X . Then, the sublevel sets of f are
defined as
                            Lλ = {x : f (x) ≤ λ} = f −1 ([−∞, λ)),                         (4.8)
where λ is a threshold [251]. The sorted set of threshold values, λ1 < λ2 < . . . < λl forms an
ordered collection of subsets such that
                                   Lλ1 ⊆ Lλ2 ⊆ . . . ⊆ Lλl .                               (4.9)
The collection of these ordered sets L = ∪λ Lλ is called filtration with respect to f .
    Persistent homology tracks the changes in a given filtration. For instance, persistent
homology in 0D is concerned with connected components, while the homology in 1D tracks
loops. In this study, both 0D and 1D persistent homology are utilized. The threshold value
where a topological feature is observed for the first time is called the birth time of that
Figure 4.6: a,b) Sublevel sets of the image given in c. d) Persistence diagram of the image
shown in c.
                                             162


feature. When the feature disappears, the corresponding threshold is denoted as the death
time of the feature. For instance, a loop can first appear (is born) at threshold λi , and it
can fill in (die) at λj . Pairs of birth and death times for each topological feature are plotted
in a persistence diagram (see Figure 4.6d).
    Working directly with persistence diagrams is not easy due to their complex structure,
and algebraic operations can not be performed simply with persistence diagrams since the
number of topological features can be different for each diagram [251]. Therefore, features
are extracted from persistence diagrams using their functional summaries. Three methods
are employed to featurize the diagrams, namely Carlsson coordinates [71, 69], persistence
images [74] and template functions [72]. One can refer to Sections 2.6.4.2,2.6.4.3 and 2.6.4.6
for more details about these approaches.
4.2.3     Results
    This section compares the results obtained from feature extraction methods explained in
Section 4.2.2. 10-fold cross validation is applied while training and testing the performance of
four supervised classification algorithms: support vector machine (SVM), logistic regression
(LR), random forest (RF), and gradient boosting (GB). In order to be consistent, the same
random state number for cross validation is used to generate the same sets for training and
testing for all feature extraction methods.
    The performance of each feature extraction method is compared with respect to the
accuracy of classification using default parameters for all classification algorithms. The
resulting accuracies of each classifier are represented with box plots. Figure 4.7 shows surface
profile classification results obtained from traditional signal processing approaches. It is seen
that Gaussian Smoothing and peak selection from FFT and PSD plots outperform the FFT
method, where the signal is denoised by setting a threshold in its spectrum. There is no
significant difference between the results of the four classifiers, as seen in the figure.
    The second approach used to extract features from surface profiles is the TDA-based ap-
                                                163


Figure 4.7: Surface profile test set results obtained with the 1D implementation of traditional
signal processing tools.
proach. 0D sublevel set persistence is utilized to compute persistence diagrams, and feature
extraction is performed using the three methods explained in Section 4.2.2.3. Figure 4.8
provides the resulting test set accuracies for four classifiers. Principal Component Analy-
sis (PCA) is also applied to the features obtained from persistence images. The first 10
components with the highest variance ratios are used to project the feature space onto a
10 dimensional space. The resulting feature matrix is used for classification, and the corre-
sponding accuracies are provided in Figure 4.8. It is seen that all three feature extraction
methods give mean accuracies greater than or around 90%. One can notice that this dimen-
sionality reduction does not increase the accuracy of the classifiers for persistence images.
The chosen 10 components may not correspond to regions where a Gaussian is placed, so
an important descriptor may be eliminated accidentally while reducing the dimension of the
feature space. For surface profile data, the feature space dimension is reduced from 320 to 10.
This provides an advantage in terms of the time required for classification, and the resulting
mean accuracies are still around 90%. Thus, it is worthwhile to apply PCA on persistence
image features.
    Surface classification results for traditional signal processing tools are provided in Fig-
ure 4.9. It is seen that Gaussian smoothing combined with FFT provides the highest scores.
However, directly applying FFT in 2D on surface measurements yields poor results for all
                                                164


Figure 4.8: Surface profile classification results obtained with OD (H0 ) sublevel set persis-
tence. Carlsson Coordinates (CC), persistence images (PI), and template functions (TF)
are used to extract features. The first plot in the second row represents the results after
applying dimensionality reduction to features obtained from persistence images.
classifiers. This shows the importance of obtaining the roughness component of a given sur-
face. In addition, the results obtained with TDA based approach are provided in Figure 4.10.
All feature extraction methods from persistence diagrams yield mean accuracies above 90%.
PCA is applied to the feature space of persistence images. Again, the application of PCA
does not increase the classification accuracy, but it still provides mean accuracies around
95%. 0D (H0 ) and 1D (H1 ) persistence provide similar results, so only 1D persistence (H1 )
results are provided in Figure 4.10. The highest mean accuracies are obtained by using the
template function methods for both of them.
    Results of traditional and TDA-based approaches for profile/surface classification are
comparable. However, the TDA-based approach provides three main advantages: 1) it
                                              165


Figure 4.9: Results of surface classification obtained with the 2D implementation of signal
processing tools.
requires no parameter selection, 2) it provides an automatic and systematic way for feature
extraction, and 3) it allows adaptive feature extraction. Traditional methods do not share
all these advantages. For instance, two restriction parameters (MPD and MPH) need to be
selected, and a kernel size is needed for FFT/PSD and Gaussian 2D implementations. In
addition, the selection of the MPD and MPH or kernel size for Gaussian smoothing requires
visually inspecting the spectra. Thus they have low automation potential. These parameters
and thresholds are typically selected using only a small portion of the data set. The selected
parameters may not be suitable for every surface profile or surface, so traditional signal
processing approaches are non-adaptive.
4.3     Automated Surface Texture Analysis via Discrete Cosine Trans-
        form and Discrete Wavelet Transform
4.3.1    Data Preprocessing
    This study uses both synthetic surfaces (Section 4.2.1), and digital scans of machined
surfaces (see Section C). This section only provides information on data preprocessing. For
the data collection procedure of the experimental data set and the detailed information of
simulation, one can refer to Sections C and 4.2.1.
    The raw surface scans include different numbers of pixels with lower gray-level intensity
                                              166


Figure 4.10: Results of surface classification obtained with 1D persistence H1 using Carlsson
coordinates (CC), persistence images (PI), and template functions (TF).
values on the edges of the image, as shown in Figure C.1. These pixels are tedious to isolate
manually, so the images are cropped and adaptively removed these pixels using the following
algorithm. First, the remainder of image pixel values is found in each direction when they
are divided by 1000. The halves of the remainders are used as the number of pixels to
remove from each edge. This procedure successfully reduced the number of pixels with lower
gray-level intensity values at the boundaries, and the resulting images had similar sizes.
    The other challenge was the large dimension of the microscope surface scans, which
can exceed 10000 pixels in each direction of the image, thus elevating computational ex-
penses. Consequently, each surface scan is split into 25 sub-images, each with a dimension
of 2400×2400 pixels.
                                               167


Image Subsampling The resulting subimages still presented computational challenges
for some signal processing tools such as DCT, where the maximum number of modes is
equal to the total number of pixels. Therefore, the images are subsampled to further reduce
the number of samples in the subimages when using 2D signal processing tools for surface
classification. Several approaches are available for image scaling/resampling in the literature.
These also include some signal processing approaches to upscale or downscale an image.
One of the simple and widely used subsampling methods is to replace a block of pixels with
their average values, and that is the approach used in this study. After testing different
sampling factors such as 0.1, 0.2, and 0.5, this study adopted a sampling factor of 0.1 for
the experimental data set. This means the size of each block is 10 × 10 pixels.
4.3.2    Methodology
4.3.2.1     Discrete Wavelet Transform
    Discrete Wavelet Transform (DWT) is one of the widely adopted signal processing tools
[252, 9, 253, 254, 255]. While a signal’s frequency spectrum can only be represented over the
entire time domain with Fourier Transform, Wavelet Transform can decompose the signal into
components with different time and frequency resolutions [256]. In DWT, the time series is
passed through low pass and high pass filters to obtain approximation and detail coefficients.
These filters are obtained from a filter bank derived from scaling and translating a wavelet
function [252]. The diagram for Discrete Wavelet Transform is provided in Figure 4.11
where the level of the transform is denoted as k. As the level increases, only approximation
coefficients are passed through the filters to obtain new approximation and detail coefficients.
For example, the approximation coefficient of the first level transform A1 is passed through
the filters, and new approximation and detail coefficients are obtained as AA2 and DA2 . As
the level of the transform increases, the frequency resolution of decomposition increases as
well. As seen from Figure 4.11, each component corresponds to a distinct frequency range.
                                                168


    In this implementation, DWT is utilized to analyze simulated and experimental sur-
face profiles. 1D DWT is applied to surface profiles with biorthogonal wavelet functions
(BIOR4.4) recommended by the ISO standards [257]. The specified wavelet functions are
symmetric, and surface profiles can be reconstructed without any loss [257]. The level of
the transform is chosen based on the maximum allowable limit defined by the number of
samples in the profile and the type of the wavelet function. Approximation coefficients at
the maximum level of the transform are used to obtain the form of the profile [242]. The
waviness and roughness profiles of the surface are reconstructed using the detail coefficients
of the transform. For instance, DAA3 , DA2 and D1 can be used to reconstruct profiles for
waviness and roughness when 3 levels of DWT are employed (see Figure 4.11). Profile re-
constructed from AAA3 represents the form of the profile. The separation between waviness
and roughness is done heuristically in the literature, and it is believed that there is not any
guide on how to select a threshold for detail coefficients. Therefore, this study describes an
approach that leverages signal energy to automatically select that threshold.
               Figure 4.11: DWT tree for the first three level of the transform.
                                                        P∞
    The energy of a discrete signal is defined as E =     n=−∞    | x[n] |2 . After applying DWT,
profiles are reconstructed using the detail coefficients at all levels and computed their signal
energy. Figure 4.12 provides the energy ratio of each detail coefficient and the cumulative
energy ratios in addition to the decomposition obtained with automatic threshold selection.
The first row of the plots is obtained with the roughest surface, and the ones in the second
row belong to the smoothest surface. Both surfaces have a size of 4096 × 4096 pixels. Three
                                              169


cross-sections are taken in each direction of the surfaces represented with Profile-x (or y) i
in the figure.
Figure 4.12: The plots on the first row belong to the roughest surface (H = 0) simulation,
while the plots in the second row belong to the smoothest surface (H = 1) a) Energy
ratios of detail coefficients (di : detail coefficients of level i). b) Cumulative energy ratios
including approximation coefficients (a: approximation coefficients of maximum level). c)
The resulting three main components after applying the automatic threshold selection for
profile x − 1.
    The cumulative energy ratio plot is given in Figs. 4.12b and 4.12e for the smoothest and
the roughest surface of the simulation. It is seen that most of the energy is accumulated in
approximation coefficients. Since the main aim is to separate waviness and roughness, this
study only takes into account the energy ratios of detail coefficients. In Figure 4.12a, it is
hard to notice the significant increase in signal energy as the level of the transform increases.
Therefore, the differences between consecutive energy ratios are used. For example, the
biggest difference between consecutive energy ratios for Profile-y 1 in Figure 4.12a is between
                                                 170


levels 4 and 5. The proposed algorithm uses detail coefficients of the first four levels to
reconstruct roughness. The rest of the detail coefficients are reconstructed and summed up
to obtain a waviness profile. The resulting surface components are shown in Figure 4.12c.
When the roughness of the surface decreases, the threshold is easier to capture since a
dramatic increase is seen in signal energy between levels 7 and 8 (see Figure 4.12d) for
most of the profiles. In this case, roughness is equal to the summation of reconstructed
signals from detail coefficients of the first seven levels, and the 8th level coefficients are used
to generate the waviness profile. In Figure 4.13, energy plots of the reconstructed signals
are provided. Cumulative energy plots show that the signals obtained using approximation
coefficients have the highest energy. It is evident that the form profile, which is obtained from
approximation coefficients, has the largest amplitudes in the decomposition. The significant
increase in energy occurs between levels 7 and 8. Therefore, for this example, the threshold
is selected at level 7. This approach is applied to all profiles extracted from the sub-images
in the experimental data, and the features are computed accordingly.
Figure 4.13: (left) Energy ratios of detail coefficients (di : detail coefficients of level i).
(middle) Cumulative energy ratios including approximation coefficients (a: approximation
coefficients of maximum level). (right) The resulting three main components after applying
the automatic threshold selection for profile x − 1.
                                                171


4.3.2.2    Discrete Cosine Transform
    Discrete Cosine Transform (DCT) decomposes a signal into cosine functions with different
frequencies. It is similar to Fourier Transform, except that it uses only real coefficients. There
are several types of DCT, and in this implementation, Type II DCT is used. The definition
of 2D transform is given as [258]
                                 −1 N −1
                              M
                                                              !                  !
                              X     X             π(2m + 1)p          π(2n + 1)q
                 Bpq = αp αq             Imn cos                cos                 ,        (4.10)
                              m=0   n=0
                                                      2M                  2N
where                                                        
                     √1                                       √1
                                                             
                    
                        M
                             p=0                              
                                                                 N
                                                                       q=0
             αp = q                               and αq = q                          .
                     2                                        2
                                                             
                    
                         M
                             1≤p≤M −1                         
                                                                   N
                                                                       1≤q ≤N −1
M and N are the numbers of elements in each direction of the image, and Imn represents
the image. The inverse transform is provided as
                           −1 N −1
                         M
                                                             !                  !
                          X   X                   π(2i + 1)p         π(2j + 1)q
                   Iij =           αp αq Bpq cos              cos                .           (4.11)
                          p=0 q=0
                                                     2M                 2N
If Eq. (4.10) is inserted into Eq. (4.11), the resulting expression is
           −1 N −1                −1 N −1
         M                      M
                                                                !                   !!
         X    X                 X    X              π(2m + 1)p          π(2n + 1)q
   Aij =           αp αq αp αq            Amn cos                 cos
          p=0 q=0               m=0   n=0
                                                        2M                 2N
                                                                       !                 !
                                                            π(2i + 1)p        π(2j + 1)q
                                                      cos                cos                 (4.12)
                                                               2M                  2N
    Basis functions (modes of a given image) are defined with indices p and q. Summing all
the modes recovers the original image as shown in Eq. (4.12). In Figure 4.14a, the modes
are represented by Xi,j . For instance, X0,0 represents the first mode of a given surface. The
first four modes of a synthetic surface are provided in Figure 4.14b. Form, waviness, and
roughness components are obtained by summing the modes inside the boxes shown with
grey, blue, and red colors, respectively. Figure 4.14a shows that two thresholds t1 and t2
need to be selected to separate the three components. Since the goal is to find the surface
roughness component, t1 is neglected, and form and waviness are treated as one combined
                                                  172


        (a) DCT modes in matrix format            (b) The first four modes of a synthetic surface.
             Figure 4.14: Matrix format of the DCT modes and example modes.
component. This study then introduces a new approach based on image entropy to generate
an automatic threshold selection for DCT, as described in the next section.
Threshold Selection Using Image Entropy: Information entropy is known as Shan-
non’s entropy [259] and it is defined as
                                             X n
                                    H(X) =        pi log(1/pi ),                            (4.13)
                                              i=1
where X is a discrete random variable with probability distribution p. An analogous expres-
sion is obtained for computing the image entropy [260] according to
                                      X
                               H(I) =     (hI (i)/N )log(N/hI (i)),                         (4.14)
                                        i
where hI (i) is the counts in histogram of the grayscale image I, and N is the number of bins
the histogram.
    As mentioned earlier, two components are computed for a surface: the roughness com-
ponent and the surface that includes waviness and form components. Initially, t2 is set to 0.
The first mode X0,0 combines the form and waviness components of the surface, while the
rest of the modes represent the roughness component. Combining form and waviness also
gives the advantage of avoiding additional mode computation for the surface. The roughness
component can be obtained by simply subtracting the first component (form+waviness) from
                                               173


the original image. After obtaining both components, the next step is to compute their image
entropies. Then, the threshold index t2 is increased by one. Now, the first four modes shown
in Figure 4.14b represent the first component (form+waviness). The roughness component
is obtained by subtracting the first component from the original image, and image entropy
for both components is computed. This procedure is iterated by increasing the threshold
index t2 . This process is repeated for all threshold indices from 0 to 256 for the roughest
surface in the data set, and entropy curves are obtained for both components as shown in
Figure 4.15.
    Figure 4.15 shows that the entropy of the first component (form+waviness) is increasing
dramatically for a small change of t2 . After a certain value of the threshold, its rate of increase
slows down, as seen from its derivative curve. Therefore, a threshold value is selected for
the slope of the entropy curve of the first form+waviness component. Two threshold values
Figure 4.15: The entropy of the roughness and waviness+form surface for varying threshold
index.
are tested, and these are 0.1 and 0.005 (see Figure 4.15, grey vertical lines). When the
slope of the curve is under these slope thresholds, their corresponding t2 values are taken
as thresholds to separate the roughness component. For these two slope thresholds, 0.1 and
                                                174


0.005, t2 is found as 5 and 30, respectively. Then, two components of the surface are obtained
using these threshold values as shown in Figure 4.16. The figure shows only surface profiles
to clearly illustrate the difference between the components. It is seen that threshold 30 is
more reasonable to use since the form+waviness component obtained with t2 = 5 does not
provide a good approximation to the surface. Therefore, it is decided to use a slope threshold
of 0.005 for all surfaces in the data set.
Figure 4.16: Reconstructed surface profiles obtained using thresholds from the entropy anal-
ysis shown in Figure 4.15.
    An automatic threshold selection algorithm is defined such that it can terminate the loop
when the slope of the entropy curve of the form+waviness component is below 0.005. For
instance, only the first 302 modes of the surface are computed instead of computing all 2562
modes, and profiles of the reconstructed surfaces are given in Figure 4.16. Therefore, the
proposed algorithm avoids unnecessary and expensive mode computations. This algorithm
is applied to both synthetic and experimental data sets to obtain roughness components that
can be used to compute the 2D features needed for applying machine learning.
4.3.2.3     Feature Extraction from Surfaces
    This section describes the feature extraction procedure from surfaces. The roughness
component of each surface or profile is used to extract features. Depending on the type of
data used in the analysis, the profile or areal parameters given in Reference [246, 245] are
used. While working with surface area measurements, the height and hybrid parameters
provided in Sections 4.2 and 4.4 of Reference [245] are computed for roughness components
                                               175


of the surfaces. For surface profiles, height, spatial and hybrid parameters given in Secs. 4.1-
4.2 of [246] are computed. The list of features and their definitions are provided in Table 4.1.
In addition, definitions for all the features are given in the continuous form. The discrete
form of these equations can be found in References [245] and [246]. Then, these feature
matrices are fed to supervised classification algorithms, namely, Support Vector Machine
(SVM), Logistic Regression (LR), Random Forest (RF), and Gradient Boosting (GB). This
study uses the default parameters of the classifiers.
Table 4.1: The features used in the classification of surfaces and surface profiles. f (x) and
f (x, y) represent surface profile and surface, respectively.
                                1D Features                                                     q2D RR  Features
             q
      Rq =       1
                         f (x)2dx, Rsk =             1 1
                                                                                          Sq = A Ã f 2 (x, y)dxdy
                                                                                                    1
                    R                                       R        3
                lm   lm                              Rq3 lm  lm
                                                                f (x) dx
                                                                                                q
  Rku =   1 1
                      f (x) dx, Rt = max(f (x)) + |min(f (x))|                           Ssk = AS1 3 Ã f 3 (x, y)dxdy
                R            4
                                                                                                         RR
         Rq4 lm   lm                                                                                   q
                         Ra = lm  1
                                                                                          Sku = AS1 4 Ã f 4 (x, y)dxdy
                                     R                                                                  RR
                                        lm
                                           |f (x)|dx
                                                                                                     q
        Ral = min tx , where R = {tx : ACF(tx ) < s}                                           Sp = max(f (x, y))
                 tx ∈R
           Rsw =               2π
                       arg max |F (p)|
                                       ,   Rdt = max | dz(x)   dx
                                                                   |                          Sv = | min(f (x, y))|
                              p                        x∈R
                                v                    !2
                                u
                                u R
                    Rdq = t lm    1
                                        lm
                                              df (x)
                                                dx
                                                         dx                                      Sz = Sp + Sv
                        Rda =      1
                                       R df (x)
                                                                                            Sa = A1 Ã |f (x, y)|dxdy
                                                                                                    RR
                                  lm lm
                                            | dx |dx
                                  v                  !2 !                              v        "            !2             !2 #
                                  u                                                    u
                                  u                                                    u RR
                Rdl =             t1 + df (x)                                             1         ∂f (x,y)       ∂f (x,y)
                          R
                             lm                  dx
                                                            dx                  Sdq = A Ã
                                                                                       t
                                                                                                       ∂x
                                                                                                                +     ∂y
                                                                                                                                 dxdy
                                  v                  !2       !                          v"                 !2             !2 #     !
                                  u                                                      u
                                  u                                                 RR u
           Rdr =       1                      df (x)
                                                                            Sdr = 1                ∂f (x,y)       ∂f (x,y)
                          R
                      lm    lm
                                  t1 +
                                                dx
                                                          − 1 dx                  A   Ã
                                                                                         t 1+
                                                                                                      ∂x
                                                                                                               +    ∂y
                                                                                                                                − 1 dxdy
4.3.3     Results
     This section presents the results obtained with automatic threshold selection algorithms
introduced in Section 4.3.2, and compares the results to the ones obtained from the Gaussian
filter. Both algorithms are applied to the synthetic and experimental data sets described
in Section 4.3.1, and the surfaces and their profiles are classified. There are three labels
for synthetic surfaces representing the surface’s roughness level. For experimental data,
samples come from three main machining operations, each of which corresponds to three
different surface roughness ranges. Therefore, there are nine labels for the experimental
                                                                         176


Figure 4.17: Classification accuracies obtained with automatic and heuristic threshold selec-
tion for DCT and DWT using a synthetic data set. Three-class classification is performed
in this case.
data set. Three-class classification is also performed by only taking into account the type of
the machining operation. In addition to using the automatic threshold selection algorithms,
surfaces are decomposed into form, waviness, and roughness heuristically. Specifically, the
decomposition of a few surfaces or profiles is inspected, and a decision is made on the
threshold values mentioned in Section 4.3.2.1 and 4.3.2.1. These thresholds are then fixed
and used for the whole data set. The resulting roughness surfaces or profiles are used to
extract features as explained in Section 4.3.2.3, and supervised classification is performed.
    Figure 4.17 provides the figures for the classification results obtained from DCT and
DWT using automatic and heuristic threshold selection. It is seen that there is no significant
difference between the classification accuracies of automatic threshold selection and heuristic
threshold selection. It is seen that the automatic threshold algorithm helps to decrease the
deviation in the results obtained with LR and GB classifiers for DCT. One should note
that heuristic threshold selection requires manual inspection of several surfaces to decide the
threshold value. Depending on the number of surfaces inspected, the manual process can
add a significant amount of time for identifying the roughness level. In contrast, automatic
threshold selection algorithms for both DWT and DCT do not require manual inspection
since only a slope threshold needs to be selected, as explained in Section 4.3.2.2.
    The first row of the figure provides plots for 3-class classification where the labels are
milling (M), profiling (P), and shaped or turned (ST). The second row contains the results
obtained with the nine-class classification, where each machining operation has three dif-
                                               177


ferent roughness values. It is seen that DCT automatic threshold selection provides similar
accuracies compared to the heuristic approach. Nine-class classification is also performed by
assigning distinct labels to the main surfaces shown in Figure C.2b. Figure 4.18 shows that
nine-class classification provides better classification scores. DCT results obtained using the
automatic threshold selection approach are slightly lower when they are compared to the
heuristic approach. However, nine-class classification with automatic threshold selection for
DWT outperforms the heuristic threshold selection (see Figure 4.18). Nevertheless, three-
class classification with the heuristic approach of DWT provides better accuracies. This can
be explained by the selected threshold being suitable for a large portion of the data set.
Since the threshold selection is highly dependent on the person who performs the manual
inspection, the heuristic approach results may not be consistent.
Figure 4.18: Experimental data results for DWT (profile classification) and DCT (Surface
classification).
    Manual threshold selection for DWT and DCT is performed using visual inspection. One
may choose a large value for the threshold depending on the size of the given surfaces. In
this case, the thresholds of DCT are selected as 50 for both synthetic and experimental data
sets, while thresholds of DWT were chosen as 2 and 4 for synthetic and experimental data
sets, respectively. Figure 4.20 provides the roughness components of three surfaces obtained
from the heuristic approach and the proposed automatic threshold selection algorithms. It
is seen that roughness profiles obtained from the heuristic approach have smaller amplitudes
                                               178


Figure 4.19: The thresholds selected by proposed algorithm for the synthetic data set. The
roughness parameter 0 represents the roughest surface, while the smoothest surface has a
roughness parameter of 1.
compared to the ones obtained from the automatic threshold selection approach. This is
because of the fact higher thresholds remove more modes from the main surface, and this
leads to less number of modes for the roughness component. In addition, the selected thresh-
olds are less than the heuristic thresholds value, which is kept constant for all surfaces in the
experimental data set. Figure 4.21 provides the selected threshold values from the automatic
selection algorithm and the constant heuristic threshold.
Figure 4.20: Roughness surfaces and roughness profiles obtained from three surfaces in the
experimental data set. (a-d) Milling, (e-h) Profiled.
    Proposed threshold selection algorithms remove the manual inspection and make the
                                              179


Figure 4.21: Threshold values selected from the automatic threshold selection algorithm and
the constant heuristic threshold.
decomposition fully automatic. For DCT, the proposed approach provides a significant
reduction in computational time. Since the threshold is selected as 50 for the heuristic
DCT approach, 502 surface modes need to be computed for each surface in the data set.
However, the automatic threshold selection algorithm picks different values of the threshold
depending on the surface. Figure 4.19 provides the selected threshold values for the synthetic
surfaces for varying roughness parameters. It is seen that the maximum selected threshold
is nearly 40. That means that the algorithm computes 402 modes at maximum for the
corresponding surface. Compared to the number of mode computations performed with the
heuristic threshold, the proposed algorithm saves the time needed to compute 900 modes.
The time reduction is even more significant with smoother surfaces with a higher value of
roughness parameter H. In this case, 102 modes are computed instead of 502 . Thus this shows
that automatic threshold selection avoids redundant mode computations and dramatically
decreases the computational time in comparison to the heuristic threshold selection.
    The experimental results for DWT and DCT are compared to those obtained from Gaus-
sian Filtering, another tool widely adopted in digital image and signal processing. Figure 4.22
provides classification scores for the Gaussian Filter applied to surface profiles (1D) and sur-
faces (2D) from the synthetic data set. Figure 4.22 shows that the highest classification
                                              180


Figure 4.22: Gaussian filtering results obtained from profile and surface classification for
synthetic data set.
Figure 4.23: Three and nine class classification results for surface profiles (1D) and the
surfaces (2D) in experimental data using Gaussian Filter.
accuracy is obtained with Gaussian Filtering. However, the results for experimental data
do not show the same trend. Figure 4.23 shows that three-class classification does not per-
form well. In addition, the nine-class classification for profiles does not perform well either.
This can be explained by the fact that the selected number of profiles from the surfaces is
not enough to extract texture information. Subsampling is not applied for profiles, and 8
profiles are extracted from the images whose sizes are 2400 × 2400. Therefore, while more
profiles may increase the score, exploring that direction is out of the scope of this work. Pro-
posed approaches for DWT provide better accuracies for both profile classification in three
and nine-class classification than the ones obtained from Gaussian Filtering (see Figure 4.18
and Figure 4.23). Nine-class classification for surface classification provides mean accuracies
                                              181


around 70%, which is higher than the results of DCT shown in Figure 4.18. However, the
Gaussian filter, when applied in 2D, is parameter-dependent, and the same parameter (ker-
nel size) is used for the whole data set in this study. The same wavelet function is used for
all surface scans, so the algorithm for DWT is parameter independent, while the algorithm
for DCT only takes the slope threshold from the user. This threshold value can be set to a
value near zero and can be used for the whole data set. However, the threshold selection is
still data-driven, and it depends on the entropy curve of the given surface (see Figure 4.15).
4.4     Surface Finish Monitoring Using Persistent Homology
4.4.1    Topological Saliency and Simplification
                    Figure 4.24: Motivation example given in Reference [1].
     The presence of noise can make it challenging to distinguish intrinsic surface features
from noise. Therefore, an elimination needs to be performed on the given surface to delineate
between surface features and noise. Topological approaches have been successfully utilized to
extract and identify the importance of features in images, as shown in Section 4.2.3. One of
these tools is topological saliency [1] which was introduced to capture the relative importance
of a topological feature relative to other features within its neighborhood. The original
applications of topological saliency included key feature extraction, scalar field simplification,
and feature clustering. However, this study leverages, for the first time, topological saliency
to propose a novel digital-twin for surface treatment.
     Topological saliency is proposed to capture the features which are eliminated due to
                                               182


           Figure 4.25: Steps to perform topological saliency-based simplification
their low persistence [1]. Doraiswamy et al. provided an example to explain this in more
detail (see Figure 4.24). In this example, it is seen that there are seven significant features
for the given scalar data, and each feature represents a peak. If topological persistence-
based simplification is used to rank the features, the peak with the label F will have the
lowest rank, or it will be eliminated since its persistence is the smallest. However, there
are no significant peaks around feature F , and it dominates its neighborhood. Therefore,
topological saliency has been suggested to evaluate the significance of the features instead
of using their topological persistence [1]. Figure 4.25 provides block diagrams that show
the steps for topological saliency-based simplification. The first step is to define the critical
points of the surface, which are local minima and maxima. Users can either select local
minimums if the main focus is on the valleys or local maximums if the user wants to work
on peaks. Let assume that C = c1 , c2 , . . . , ct is the set of local minimas of the given surface
f : M → R, where M is the d − manif old. Sublevel set persistence is utilized to compute
persistence diagrams of the given function, then the difference between birth and death times
of the features located at critical points is called the persistence P (i) of that feature. The
definition of topological saliency is given as [1].
                                                     wii P (i)
                                     Tr (i) = P             i
                                                                                             (4.15)
                                                   cj∈C wj P (j)
where r is the radius of neighborhood chosen around each feature ci , and wji represents the
weight of the feature j with respect to feature i. There are two main weighting functions: 1)
                                                  183


           Figure 4.26: Topological saliency plot for a simplified synhetic surface.
Uniform weighting, 2) Gaussian weighting, and their definitions are given as
                                         
                                         1, if cj ∈ Nr (i)
                                         
                                         
                                    i
                                  wj =                                                   (4.16)
                                         0, otherwise
                                         
                                         
                                                            2 /r 2
                                       wji = e−dg (ci ,cj )        ,                     (4.17)
where dg (p, q) represents the geodesic distance between two critical points. Doraiswamy et
al. suggested using the Gaussian weighting function [1]. For a given surface f , the saliency
of a feature is computed for varying neighborhood radius r. Then, topological saliency plots
are obtained, as shown in Figure 4.26. It is seen that the topological saliency of features
decreases as the neighborhood size increases. This is because of the fact more features appear
as the radius increases.
    Elimination based on persistence, which is the difference between birth and death times
of the features, can eliminate the points with larger saliency, as shown in the motivation
example in Figure 4.24. To circumvent this issue, Doraiswamy et al. proposed a saliency-
based simplification. They fixed the neighborhood size, r, and then a saliency threshold was
selected to eliminate the features under the threshold. This threshold is generally fixed to
a value which is found by multiplying the maximum saliency for the corresponding r value
by a ratio. After the elimination of features (peaks/valleys), the saliency of the remaining
features is recalculated since simplification provides a new surface. This process is repeated
until the desired number of features remains.
                                               184


4.4.2     Results
    This study introduces a pipeline that can simplify a given surface based on topological
saliency. Users can define the number of features (peaks/valleys) they want after the simpli-
fication. Figure 4.27 presents the simplified surfaces and their corresponding saliency plot.
The plots in the first column represent the results obtained from the original synthetic sur-
face. The number of features decreases during the simplification process. The neighborhood
size is fixed in simplification. Then, the algorithm automatically finds the saliency threshold
for the corresponding neighborhood size r. The saliency threshold is calculated by multi-
plying the maximum saliency by 0.15. The algorithm automatically removes the features
with saliency value under that threshold, and a simple surface is obtained. Compared to
traditional signal decomposition approaches such as DCT and DWT, the user does not select
any thresholds that have a significant effect on the resulting decomposition. They only need
to identify the number of features of the simplified surface. In addition, topological simpli-
fication for the example provided in Figure 4.27 is computed in ≈ 0.5 seconds. The image
in Figure 4.27 has a dimension of 64 × 64. It is expected to have longer runtime if the pixel
size increases. However, it is believed that all simplification processes can still be computed
in less than a second by implementing parallel computing into the proposed algorithm.
    Surface patches can be clustered based on topological saliency. In this case, the feature
similarity between peaks/valleys is used. The similarity of the two features is defined with
respect to the area between their saliency plots. The smaller area between the plots rep-
resents the likelihood of similar features. The areas between saliency plots of each feature
are computed. Then, these areas are used as distances to generate a similarity matrix. Pre-
computed similarity matrices are given to the AgglomerativeClustering algorithm to cluster
the features. This clustering of critical points/features needs to be generalized to all surface
points to obtain surface patches. In this step, the critical points are treated as cluster cen-
ters, and the pairwise geodesic distances between all critical points and the surface points
are computed. Each surface point is assigned a cluster number based on its closest critical
                                               185


Figure 4.27: Saliency based simplification on an example synthetic surface using approxi-
mated geodesic distances between the critical points.
points. This step can be considered as KNN classification where K = 1. Labeling of the
surface points provides surface patches. This clustering is applied on the example surface
and its simplified versions provided in Figure 4.27. Figure 4.28 provides clustered surfaces
based on the saliency of their critical points. It is seen that large peaks can be clustered with
different surface patches for varying heights. In this case, users can also select the number
of clusters so that they can identify what part of the surface will be machined.
    Topological saliency provides automatic surface simplification, and it eliminates the sur-
face decomposition as in the case of traditional approaches. In addition, the clustering with
topological saliency can provide regions where more machining is required to obtain desired
surface finish.
4.5     Conclusion
    This chapter introduces three main approaches for surface texture analysis: 1) feature
extraction from surfaces using TDA, 2) proposing automatic threshold selection algorithms
                                               186


Figure 4.28: Saliency-based clustering of surfaces. The first row represents the clustering
results of the original surface with a feature number of 90 for different cluster numbers. The
second and third rows represent the results for the simplified surface at the second iteration
(feature number: 30) and third iteration (feature number: 7) given in Figure 4.27.
for DCT and DWT, and 3) clustering surface patches using topological saliency and simpli-
fication.
    In Section 4.2, this study proposed an automatic and adaptive feature extraction approach
to determine the level of roughness in profile and areal measurements of surfaces. The
proposed approach yields similar classification accuracies for surface and profile classification,
and it eliminates the manual preprocessing which is required by traditional signal processing
tools. All of the results in this study are obtained using the default parameters of the
classification algorithms. Further parameter tuning for each classifier can result in better
                                               187


classification accuracies with smaller deviations. In addition, TDA-based approaches can be
computationally expensive as the number of pixels in images or the number of samples in
surface profiles increases. Therefore, one direction of the future work of this study can include
parameter tuning and optimizing TDA-based approaches to speed up the computation, as
well as applying the proposed approach to experimental data.
     Section 4.3 proposed two automatic threshold selection algorithms for two widely used
methods, namely DCT and DWT, to identify the appropriate roughness components in
synthetic and experimental data sets. In contrast to the traditional way of decomposing
engineering surfaces, proposed algorithms are data-driven, and they automatically adapt to
a given surface to provide the needed threshold values. The algorithm for DWT does not
require any input parameter selection from the user, while the one for DCT only requires
the user to give a slope threshold for the derivative of the image entropy. However, this
threshold value is generally close to zero and is easy to set, thus not requiring a high level
of expertise.
     The classification accuracies obtained from both automatic selection algorithms for DCT
and DWT either exceed or match the accuracies obtained from the heuristic threshold se-
lection. This shows that proposed algorithms are capable of autonomously decomposing the
surfaces to extract the appropriate descriptor of the surface roughness. This study also elim-
inates human error, which may happen during the manual inspection process in heuristic
threshold selection. When the results are compared to the ones obtained from Gaussian
smoothing, it is seen that proposed algorithms provide better classification scores in pro-
file classification, and they are comparable to their Gaussian smoothing counterparts. In
addition, the proposed DCT algorithm is capable of eliminating the redundant mode com-
putations for a given surface. This considerably reduces the computational time, as evidenced
by an order of magnitude improvement in DCT mode computations for a single surface. The
mode computation for DCT can be parallelized using High-Performance Computing (HPC)
tools, and the computational time needed for DCT can further be significantly reduced. This
                                               188


study claims that the combination of the proposed algorithms with HPC tools will enable
this approach to be used in real-time surface characterization applications.
    Section 4.4 proposed to use topological saliency and simplification for surface finish mon-
itoring. The proposed approach is applied to synthetic surfaces explained in Section 4.2.1.
It is seen that peaks and valleys can be clustered separately, and this can provide guidance
on the regions where additional machining is required to obtain the desired surface finish.
This section only introduces a pipeline to cluster given surface scans. The future direction
of this study can involve the application of this pipeline into experimental data set and the
integration of the resulting framework into a machining center to test the viability of the
approach in real-time cutting experiments.
                                             189


                                        CHAPTER 5
       TOOL WEAR IDENTIFICATION USING MACHINE LEARNING
5.1     Literature Review
    Machining accuracy and final surface finish during cutting operations are dependent on
several factors. One of them is the tool wear during the cutting. Tool wear can affect the
tool life, and excessive tool wear can cause breakage as well as some damage to the cutting
lathe. Therefore, its detection and prediction have been prominent research fields during the
last decades. Tool wear analysis is essential for reducing material waste and prolonging the
cutting tool life [261].
    Tool wear analysis can be divided into two main kinds, and these are indirect and direct
tool wear analysis [262]. Indirect approaches focus on the analysis of experimental data
such as force, vibration, and acoustic emission sensor signals and finding the correlation
between experimental signals and the tool wear [262]. Some studies in the literature focused
on model-based approaches to detect tool wear. They compared the simulation results
to the experimental ones using acquired signals from cutting experiments. For instance,
Huang et al. designed an observer based on the derived linear model and estimated the
cutting forces during the operation [263]. The estimated cutting forces are compared with
the measured one, and threshold values for tool wear detection are proposed based on the
model. Stavropoulos et al. extracted tool wear curves using numerical simulations based on
Reference [264] and compared them to the experimental curves [265].
    A wide range of studies focused on data-driven indirect approaches in the literature.
Experimental data such as force, vibration, and acoustic emission signals are collected and
analyzed using digital signal processing and time series analysis methods. Then, resulting
signals are used to extract features, and they are combined with machine learning algorithms
to identify the tool wear. One of the commonly used approaches is Singular Spectrum Anal-
                                              190


ysis (SSA)[266, 267, 268]. For instance, Alonso et al. used SSA to decompose acceleration
signals and grouped them into three clusters [267]. Statistical features extracted from these
three main clusters are fed into neural networks to estimate the wear. Kundu et al. combined
SSA and band-pass filtering to remove the noise from vibration signals and used time and
frequency domain features in four different classification algorithm [268]. Wavelet Trans-
form (WT) and Wavelet Packet Transform (WPT) are also commonly used approaches for
indirect tool wear analysis [269, 270, 271, 272, 273, 274, 275]. Li et al. provide a review for
analysis of experimental data obtained from acoustic emission sensors using different signal
processing tools, including WT [269]. Pechin et al. studied Continuous Wavelet Transform
(CWT) to identify the informative component in acoustic emission signals and proposed a
tool wear identification system [270]. Leng et al. study the correlation of the energy of the
wavelet packets and the AE signals with tool wear [271].
    Statistical features such as kurtosis, root mean squared, and skewness are the most com-
monly adopted features for indirect methods of tool wear analysis in the literature[276, 277,
278, 279, 280]. For instance, Hassan et al. focused on signal segmentation and normalization
process to reduce the noise in current signals obtained from the spindle of milling experiments
and extracted statistical features to perform classification [276]. Simon et al. used several
subsets of statistical features extracted from vibration signals in drilling experiments to de-
tect tool wear status [277]. Another indicator used for determining the transition between
tool wear status is mean power analysis. Rmili et al. employed mean power analysis and
developed an algorithm that finds the transition between the tool wear status of vibration
signals in real time [281].
    Deep learning algorithms have been widely adopted in tool wear analysis during the
last decades with the enhancement in data collection.         Convolutional Neural Networks
(CNNs) [282, 283, 284], artificial neural netweorks (ANNs) [267, 285, 286, 287, 280, 288]
Deep learning algorithms may require a large number of cutting experiments to have enough
data set to train the algorithms. For instance, Gouarir et al. used existing data set obtained
                                               191


from the 2010 PHM Data Challenge to test the performance of a CNN model for tool condi-
tion monitoring [282]. However, the data used in this study has 315 cutting tests, and this
number can not be easily obtained due to the required cost of conducting experiments. An-
other study was also focused on the analysis of images using CNNs and Fully Convolutional
Networks (FCNs) to identify the tool wear status [283]. Their image data set includes 400
images obtained from cutting inserts, milling tool, and the drill at different magnification
levels with a microscope.
    On the other hand, direct approaches are based on the analysis of images taken from
the cutting to define the amount of wear [289]. For example, Hou et al. developed an
online tool wear inspection system based on machine vision [290]. The developed system
is able to measure the bottom and flank wear. They designed a triple prism to overcome
reflection on the bottom and flank faces. In another study, Wu et al. developed a framework
called Toolwearnet that can identify the tool wear type and measure the tool wear using
Convolutional Neural Networks (CNN) [291]. Real-time object detection algorithms are also
employed in tool wear analysis. For instance, Lin et al. combined the You Only Look Once
(YOLO) [292] algorithm with segmentation models such as U-net [293] and Segnet [294] to
identify the tool wear areas in cutting tool images [295]. García-Ordás et al. developed
an online approach that can determine the tool wear status based on computer vision and
machine learning [296]. They first obtained the cutting edge images of the inserts, then
identified the wear patches. These patches are classified as worn and serviceable, and finally,
the wear status of the insert is determined based on the wear status in the wear patches.
    This chapter only focuses on the indirect approaches used by the sensor measurements.
During data collection, force sensors, acoustic emission sensors, accelerometers, and mi-
crophones are among the commonly used sensors. Since multiple sensors are used during
experiments, it is essential to apply data fusion to combine the information gained from each
sensor. According to Reference [261, 297], there are three different data fusion techniques.
These are the data level, feature level, and decision level. The studies reviewed in this study
                                               192


are categorized into these three data fusion techniques. Out of 44 different studies, it was
seen that only 10 of them applied sensor fusion to improve the tool wear identification/pre-
diction performance. In addition, it is seen that mostly feature level and decision level data
fusion are used in these studies. Decision level study is only used once among the reviewed
studies, while the rest uses the feature level fusion. Feature level fusion generally extracts
features from the signals, and then the relationship between them is found by using different
techniques. One of the most commonly used feature-level-based data fusion is the fuzzy
inference system [298, 299, 300]. For instance, Yao et al. used Fuzzy Neural Network to
identify tool wear states in turning operation [298]. Signals obtained from the acoustic emis-
sion sensor, feed motor current, and spindle motor current are used to extract features. The
relationship between the extracted features and the tool wear status is found using fuzzy neu-
ral networks. Wu et al. utilized the Ensemble Empirical Mode Decomposition to eliminate
the noise effects in vibration, acoustic emission, and motor current signals [300]. Then they
extracted statistical features from these signals in both the time and frequency domains.
Optimum feature selection is performed using correlation analysis, monotonicity analysis,
and residual analysis. Finally, they used an adaptive network-based fuzzy inference system
(ANFIS) to find the relationship between tool wear level and the extracted features [300].
    Another technique used for feature level data fusion in tool wear analysis is Mahalanobis-
Taguchi System (MTS) [301]. It is a pattern recognition algorithm based on the combination
of Mahalanobis Distance [302] and Taguchi methods. Rizal et al. employed MTS to perform
tool wear classification [303]. Time and frequency domain features are extracted from the six-
channel data set. Then, Mahalanobis Space is identified and validated. The Taguchi method
is used to eliminate redundant features, and Mahalanobis Distance (MD) is used to identify
tool wear status. Another study applied the correlation-based sensor fusion technique to find
the optimal set of features obtained from various sensors, and the optimal set is combined
with ensemble machine learning to classify tool wear, breakage, and chipping [304]. It utilized
both feature and decision-level data fusion techniques. Chen et al. studied the effect of five
                                              193


different data fusion techniques in feature level for tool condition monitoring in milling,
and neural networks are used to detect tool wear condition [242]. Wang et al. proposed
an in-process tool condition monitoring approach based on Self-Organizing Map (SOM) to
estimate the flank tool wear [305]. Flank wear is estimated using captured images, and the
features extracted from force signals are used to train the SOM network. In addition to
these techniques, Liu et al. combined all the features extracted from the accelerometer and
dynamometer using Discrete Wavelet Transform (DWT) [261]. Then the condition of the
machine is classified using the Support Vector Machine classifier.
    Some studies apply data and decision level data fusion. For instance, Xu et al. intro-
duced a deep learning-based sensor fusion at the data level [284]. Data obtained from an
accelerometer, acoustic emission sensor, and dynamometer are fed into parallel convolutional
networks to obtain a feature map that defines the relationship between the features extracted
from each sensor with the tool wear. Kannatey-Asibu et al. utilized class-weighted voting
to decide on tool condition [306]. They have combined the decision made by Hidden Markov
Model (HMM), Gaussian Mixture Model (GMM), Bayesian rule, and K-means model using
majority voting.
    This study focuses on indirect tool wear analysis using feature level data fusion technique.
Experimental signals are collected using an acoustic emission sensor, a microphone, and
a force dynamometer. Then, tool wear for each cutting insert is measured using digital
microscopy. Crater and flank wear are used as a benchmark to label each cutting signal.
The signals are labeled low, medium, and severe wear. Two of the most commonly adopted
signal decomposition techniques, Discrete Wavelet Transform (DWT) and the Ensemble
Empirical Mode Decomposition, are implemented in this study. Feature extraction using
DWT and EEMD is based on the selection of several parameters, such as the level of the
transform for DWT and the informative coefficients or Intrinsic Mode Functions (IMFs). This
study presents an automatic and adaptive pipeline for these two approaches and tests it on
experimental data. The experimental signals are cleaned using the low pass and bandpass
                                              194


filters, and then downsampling is applied to reduce the sample size. For DWT and EEMD
approaches, statistical features are extracted from the sensors and combined into a single
vector in the feature matrix. Then, Principal Component Analysis is used to reduce the
dimension of the feature matrix. The resulting feature matrix and the labeling obtained
from wear measurements are fed into supervised classification algorithms. In addition to
traditional approaches for feature extraction, Topological Data Analysis (TDA) based feature
extraction is also utilized in this study. Feature level data fusion is also applied to the TDA-
based approach. This study compares the DWT and EEMD results to the ones obtained
from the TDA-based approach.
     This chapter is organized as follows. Section 5.2 describes the experimental procedure for
titanium cutting experiments and data cleaning. In Section 5.3, the feature extraction pro-
cedure is described for TDA based approach in addition to two widely adopted approaches,
DWT and EEMD. Section 5.4 compares the results of the TDA-based approach to the DWT
and EEMD. In Section 5.5 provides the concluding remarks.
5.2      Experimental Procedure and Data Cleaning
     This section explains data collection and cleaning for titanium cutting experiments. Cut-
ting experiments are conducted in dry conditions on a Haas TL1 CNC lathe shown in the top
image of Figure 5.1. A titanium bar Ti64-STA with an initial diameter of 12.7 cm is used
during the experiments. Three sensors are utilized to collect data from the experimental
setup. Force dynamometer is placed into CNC lathe as shown in the Figure 5.1. It measures
the forces in three orthogonal directions shown in the right bottom corner of the top image
in Figure 5.1. A Kistler 8152C acoustic emission sensor with a measuring range of 100-900
kHz is used to measure the elastic waves in the tool holder. Procedure in Reference [307] is
followed to check the consistency of acoustic emission sensor output. The sensor is mounted
on the backside of the tool holder. Audio signals are also collected during the experiments.
A PCB 130F20 with a measuring frequency range of 10 Hz-20 kHz is placed approximately
                                               195


Figure 5.1: Experimental setup for titanium cutting experiments. (top) Titanium bar and
the sensors used in data collection, (bottom) Data acquisition boxes, and signals conditioner
used during the experiments.
a meter away from the cutting lathe to reduce the effect of reflected sound waves.
    The bottom image in Figure 5.1 shows the data acquisition boxes and the signal condi-
tioner used during cutting experiments. Since 900 kHz is the maximum frequency measured
by the acoustic emission sensor, 1.8 MHz is the sampling rate needed with respect to the
Nyquist sampling theorem. The acoustic emission sensor is connected to the data acquisi-
tion box NI6361 since its sampling rate can reach up to 2 MHz. However, when multiple
channels are used for data acquisition with the same box, 2 MHz is divided into the number
of channels used. Therefore, the second data acquisition box NI6356 is used to collect data
                                             196


from the microphone and force dynamometer, and it can reach up to 1.25 MHz per chan-
nel. Since the measuring range for the microphone and the dynamometer is low compared
to the acoustic emission sensor, the sampling rate for these two sensors is selected as 180
kHz. The microphone is first connected to a signal conditioner PCB 482C15 and then to
the data acquisition box. MATLAB’s data acquisition toolbox is utilized to collect data and
synchronization of two data acquisition boxes.
    Sandvik carbide inserts with and without chip breaker are used in the cutting experi-
ments, and their model numbers are SCMW 12 04 08 H13A and SCMT 12 04 08-KR H13A.
During the cutting operations, the cutting speed is kept constant at 122 m/min, while the
rotational speed of the bar changes as the radius of the titanium bar decreases. Rotational
speed varies between 407 and 654 rpm during the experiments. Depth of cut and feed rate
are also kept constant at 1.2 mm and 0.127, respectively. The corners of each cutting tool are
numbered from 1 to 4. For each cutting experiment, a cutting length required to break the
tool is estimated based on the cutting conditions and prior experience. The time required to
complete this length of cutting is called TF . TF is divided into four different time intervals,
whose length is ∆t = TF /4. For the first corner of the cutting tool, a cutting operation is
performed for ∆t = TF /4 seconds, and the resulting data set is saved. Then, the second
corner of the cutting tool is used to cut titanium for 2∆t = TF /2 seconds. Every time when
a new corner of the cutting tool is used, the cutting time is increased by ∆t. This results in
having a tool breakage in the fourth corner where cutting time is equal to TF .
5.2.1    Data Processing
    High sampling rates increase the number of samples dramatically, even if the sampling
time is not large. Therefore, for each cutting operation, the data acquisition is started in
the last 30 seconds of the corresponding cutting time. During experiments, six cutting tools
without chip breaker and ten tools with chip breaker are used. After data collection, the next
step is to clean the data set. Previously, it has been noted that the acoustic emission sensor
                                               197


Figure 5.2: Filtered signals and raw measurements obtained from cutting configuration with
RPM:407, DOC (depth of cut):1.2 mm, CT (cutting time): AC (All cut, where cutting time
is T ), TT (tool type): chip breaker, TN (tool number): 2 and CN (corner number): 1.
also captures the AC frequency (60 Hz) during data collection. This component needs to be
removed from the signals as the first step. This study uses a baseline correction approach
based on Stationary Wavelet Transform to remove the AC frequency component. One can
refer to Reference [308] for more details about the approach.
     Then, acoustic emission signals are filtered using a bandpass filter since the typical range
for acoustic events in machining is between 100-300 kHz [309]. The maximum voltage output
is 10V for force signals. During experiments, the acquired signals in several axes may reach
up to 10V due to a dramatic increase in the force along the corresponding axis. Therefore,
force signals are thresholded before applying filters to them. When the amplitude of signals
reaches 9.95 V, the algorithm automatically neglects the rest of the time series. For audio
and force signals, least square method with a filter order of 100 is used to design a FIR
filter to reduce the effect of noise in raw measurements. The Force dynamometer used in
these experiments can measure frequencies up to 45 kHz, while the frequency range for
the microphone is between 10 Hz and 20 kHz. Therefore, force signals and audio signals
are passed through low pass filters designed using the desired frequency of 45 and 20 kHz,
respectively. Figure 5.2 shows the raw measurements and the filtered signals obtained from
a cutting configuration.
     After completing the filtering of all signals, the next step is to downsample the signals
so that we can reduce the number of samples. Downsampling is performed by choosing
                                                198


Figure 5.3: Surface scans for cutting inserts with low, medium and severe wear levels. All
scans are obtained from cutting inserts without chip breaker used in different cutting tests.
a downsampling factor for each sensor data and choosing every nth sample in the signals
based on these factors. These factors are chosen such that the resulting sampling frequency
is two times the desired frequency. For acoustic emission signals, the downsampling factor
of 3 is chosen, while 4 and 10 are chosen for microphone and force signals, respectively.
Figure 5.2 shows that force signals experience significant jumps and baseline shifts during
cutting. Therefore, it is decided to crop these signals such that only the initial part of the
signals before the change points are used in the analysis. This procedure is completed by
manually selecting the two endpoints for the corresponding time series since the data set
has only 55 different cutting configurations. This also removes the high amplitude peaks at
lower frequencies in the frequency spectrum of the force signals.
5.2.2    Data Labeling
    Preprocessed time series are labeled based on the wear amount measured for each cutting
experiment. Crater and flank wear for each corner of the cutting inserts are measured. Then,
                                             199


time series are labeled into three categories: 1) low wear, 2) medium wear, and 3)severe wear.
The surface scans of the cutting inserts are also taken under a microscope, and Figure 5.3
presents the surface scans of three different wear levels for both flank and crater wear.
Table 5.1 also provides the range of wear amounts used in the labeling. After labeling the
time series, the next step is to extract features from them.
        Table 5.1: The ranges for wear amounts used in labeling of time series data.
                              Inserts without                        Inserts with
                         chip breakers (WOCB)                    chip breakers (CB)
     Wear Level    Crater (d(µm)) Flank (vb(µm)) Crater (d(µm))             Flank (vb(µm))
        Low             d < 25            vb < 110             d < 26           vb < 127
      Medium         25 ≤ d < 72      110 ≤ vb < 127.5      26 ≤ d < 62      127 ≤ vb < 187
       Severe           d ≥ 72           vb ≥ 127.5            d ≥ 62           vb ≥ 187
5.3     Methodology
    In this section, feature extraction approaches from experimental signals are explained.
This study utilizes two widely adopted signal decomposition approaches, DWT and EEMD,
in addition to novel TDA-based approach. The background information for these approaches
can be found in Sections 4.3.2.1, 2.3.2 and 2.6.1. Therefore, this section only describes the
feature extraction procedure.
5.3.1    Discrete Wavelet Transform
    Discrete Wavelet Transform based feature extraction approach for tool wear analysis is
introduced in Reference [57]. The first step is to standardize the processed signals such that
they have zero mean and standard deviation of one. In Reference [57], frequency spectrum
of the selected signals are investigated, and the frequencies whose amplitudes are significant
in the spectrum are defined. These frequencies are used to identify the level of DWT. In this
study, the approach is modified such that sensitive frequency is selected in an automated
and adaptive way. Algorithm 5.1 provides the pseudo code for the adaptive algorithm.
                                               200


                                              AE                       MIC                            Fx                       Fy                       Fz
CS: 122 m/min                  1.0
RPM: 407
DOC: 1.2 mm
FR: 0.127 mm/rev               0.5
CT: AC TT: CB
TN: 2 CN: 1
                               0.0
CS: 122 m/min                  1.0
RPM: 407
DOC: 1.2 mm
FR: 0.127 mm/rev               0.5
CT: QC TT: CB
TN: 2 CN: 2
                               0.0
CS: 122 m/min                  1.0
                   Amplitude
RPM: 428
DOC: 1.2 mm
FR: 0.127 mm/rev               0.5
CT: HC TT: CB
TN: 3 CN: 2
                               0.0
CS: 122 m/min                  1.0
RPM: 428
DOC: 1.2 mm
FR: 0.127 mm/rev               0.5
CT: 3QC TT: CB
TN: 3 CN: 3
                               0.0
CS: 122 m/min                  1.0
RPM: 428
DOC: 1.2 mm
FR: 0.127 mm/rev               0.5
CT: AC TT: WOCB
TN: 5 CN: 2
                               0.0
                                     0   100000   200000   3000000    10000      20000   0     2500   5000   7500   0   2500   5000   7500   0   2500   5000   7500
                                           Frequency                 Frequency                   Frequency                Frequency                Frequency
Figure 5.4: FFT spectrum of the processed signals obtained from the first five cutting
configurations. CS: Cutting speed, DOC: depth of cut, FR: feed rate, CT: cutting time,
QC: quarter cut, HC: half-cut, 3QC: three-quarter cut, AC: all cut, TT: tool type, CB: chip
breaker, WOCB: without chip breaker, TN: tool number, CN: corner number.
      FFT is first applied to processed signals, and then the peaks in the spectrum are found
using the built-in peak finding algorithm in Python. The frequency spectrum is normalized
by dividing all amplitudes by the maximum amplitude. The number of peaks found by this
algorithm is large and contains redundant peaks as well. Therefore, a restriction parameter
minimum peak height is set. Figure 5.4 presents the spectrum of the signals of five cutting
configurations. It is seen that the spectrums of force and microphones signals have a large
amplitude in lower frequencies. To eliminate these frequencies, the proposed algorithm
neglects the first 200 samples in the frequency domain. Then, the threshold for the minimum
peak height is decided by multiplying the maximum amplitude of the remaining samples
in the spectrums by 0.2. However, this threshold value can still lead to capturing low
frequencies in the spectrum in some cases. Therefore, a minimum desired frequency for each
sensor is decided. 100 kHz, and 5 kHz are chosen as the minimum desired frequency for
acoustic emission sensor and microphone, respectively. For force signals, 1000 Hz is chosen.
                                                                                         201


     Algorithm 5.1: Automatic and adaptive algorithm to find the sensitive frequency.
Input: desMinFreq=100000, p=0.2, freq, amp, thrshldFreq=5000
Output: senFreq
   peaks = []
   while max(peakFreq)<desMinFreq do
      peaks = findpeaks(amp, height=max(amp[200:]*p))
      PeakAmp = amp[peaks]
      PeakFreq = freq[peaks]
      p -= 0.03
   end while
   peakFreq=peakFreq[peakFreq>thrshldFreq]
   counts, freqHist = histogram(peakFreq)
   Set SecondMaxCount = sorted(counts)[-1]
   Find the index of SecondMaxCount in counts,
   Set SecFreqHist = freqHist[index]
   Find the closest frequency to SecFreqHist in freq and set it to senFreq
Algorithm 5.1 shows that iterations are performed in a while until the amplitude of the
desired minimum frequency is in the selected peaks. In addition to setting desired minimum
frequency, the peaks found for microphone and force signals are thresholded by predefined
frequencies. The frequencies under 5 kHz and 1 kHz are neglected for microphone and force
signals, respectively. These frequencies generally either coincide with structural modes of
the system or do not contain useful information in the spectrum. Then, the histogram of
the resulting peaks is found with 100 bins. This means that the frequency range in the
spectrum is divided into 100 equal intervals, and the number of frequencies that fall into
each interval is counted. The highest count generally corresponds to the lower frequency
region, so the proposed algorithm selects the second highest count in the histogram and
finds its corresponding frequency provided by the histogram. However, this frequency can
not be directly used as a sensitive frequency since the histogram only provides 100 frequency
values. Therefore, the proposed algorithm finds the closest frequency to it, and this is the
sensitive frequency chosen by the algorithm.
    The proposed adaptive algorithm is tested on the time series data of the first five cutting
configurations, and the resulting sensitive frequencies are marked in the spectrum with red
                                              202


dashed lines as shown in Figure 5.4. It is seen that selected frequencies are meaningful and
can vary from case to case. This shows the effectiveness of the adaptive approach. These
resulting sensitive frequencies are used to determine the level of the wavelet transform.
Figure 4.11 shows 3 level of DWT. In the first level, approximation and detail coefficients
correspond to [0, f /4] and [f /4, f ] ranges, respectively. An iterative approach is utilized
to find the level of DWT L, and it is set to 1 initially. Then, if the sensitive frequency is
smaller than f /(2L), the algorithm applies the DWT in the first level L = 1. Otherwise, L
is increased by one, and f /(2L) is computed again. This can be iterated until L reaches the
maximum level of the transform that is found with respect to the selected mother wavelet
function and the length of the corresponding signal. After determining the level of DWT,
approximation and detail coefficients are obtained, and two signals are reconstructed based
on these coefficients. For each reconstructed signals, 10 statistical features given in Table 2.3
are computed. Since features from all signals are combined for a cutting configuration,
feature level data fusion is applied in this approach. The resulting feature matrix is given as
input to supervised classification algorithms.
5.3.2    Ensemble Empirical Mode Decomposition
     This study follows the feature extraction approach used in Reference [300]. IMFs are
obtained using the PyEMD package of Python for each time series. Then, autocorrelation
functions for the original time series and its IMFs are computed. The Pearson correlation
coefficients between the autocorrelation functions of the original time series and the IMFs are
computed. An initial threshold value of 0.5 and 0.15 are selected for the acoustic emission
sensor and the rest of the sensors, respectively. If the correlation coefficient between the
autocorrelation functions is greater than this threshold value, corresponding IMF is used
in the reconstruction of the signal. If the algorithm can not find IMFs whose correlation
coefficient is greater than the threshold value, the threshold is lowered by 0.05. This process
is iterated until the algorithm finds at least an IMF. The resulting IMFs are summed up to
                                               203


                                                AE                                 MIC                                Fx                             Fy                                   Fz
                                  5
CS: 122 m/min                                                                                          2                                2
RPM: 407                                                           2.5                                                                                                2.5
DOC: 1.2 mm                       0
                                                                                                                                        0
FR: 0.127 mm/rev                                                                                       0
                                                                   0.0                                                                                                0.0
CT: AC TT: CB                   −5
                                                                                                                                      −2
TN: 2 CN: 1                                                    −2.5                                   −2
                                                                                                                                                                     −2.5
                               −10
                                                                                                                                      −4
CS: 122 m/min
RPM: 407                        2.5                                2.5                                2.5                             2.5                             2.5
DOC: 1.2 mm
FR: 0.127 mm/rev                0.0                                0.0                                0.0                             0.0                             0.0
CT: QC TT: CB
TN: 2 CN: 2                    −2.5                            −2.5                                                                  −2.5
                                                                                                  −2.5                                                               −2.5
                               −5.0
CS: 122 m/min
                   Amplitude
                                2.5                                2.5                                2.5                             2.5                             2.5
RPM: 428
DOC: 1.2 mm
FR: 0.127 mm/rev                0.0                                0.0                                0.0                             0.0                             0.0
CT: HC TT: CB
TN: 3 CN: 2                    −2.5                            −2.5                               −2.5                               −2.5                            −2.5
CS: 122 m/min                                                       5
RPM: 428                        2.5                                                                                                   2.5                             2.5
                                                                                                      2.5
DOC: 1.2 mm                                                         0
FR: 0.127 mm/rev                0.0                                                                                                   0.0
                                                                                                      0.0                                                             0.0
CT: 3QC TT: CB
                               −2.5                                −5
TN: 3 CN: 3                                                                                       −2.5                               −2.5                            −2.5
                               −5.0
CS: 122 m/min
                                2.5                                2.5                                2.5
RPM: 428                                                                                                                              2.5                             2.5
DOC: 1.2 mm
FR: 0.127 mm/rev                0.0                                0.0                                0.0                             0.0                             0.0
CT: AC TT: WOCB
TN: 5 CN: 2                    −2.5                            −2.5                               −2.5                               −2.5
                                                                                                                                                                     −2.5
                                      5   10   15   20   25   30         5   10   15   20   25   30         0     1   2    3     4          0   1    2     3     4          5   10   15        20   25   30
                                          Time (seconds)                     Time (seconds)                     Time (seconds)                  Time (seconds)                  Time (seconds)
Figure 5.5: Processed time series (blue) and the reconstructed signals using EEMD (orange)
for five cutting configurations when the last 25 seconds of the time series used in the analysis.
CS: Cutting speed, DOC: depth of cut, FR: feed rate, CT: cutting time, QC: quarter cut,
HC: half-cut, 3QC: three-quarter cut, AC: all cut, TT: tool type, CB: chip breaker, WOCB:
without chip breaker, TN: tool number, CN: corner number.
reconstruct the signal. Figure 5.5 and 5.6 show the reconstructed signals and the processed
signals when the last 25 and 5 seconds of the time series that is used in the analysis. It is
seen that only IMFs which correspond to lower frequency content are selected to reconstruct
the signals for some of the force signals (see reconstructed signals of Fx and Fz of the second
cutting configuration in Figure 5.5). However, better reconstructions are obtained when
only the last five seconds of the time series that is used in the analysis. These reconstructed
signals are used to extract statistical features as listed in Table 2.3.
5.3.3          Topological Data Analysis
      The first step in the TDA-based approach is to embed the time series to higher dimen-
sional space so that a point cloud representation of the data can be obtained. There are two
parameters needed for this step: embedding dimension and delay parameters. This study
                                                                                                            204


                                                AE                                  MIC                                Fx                             Fy                                    Fz
CS: 122 m/min                  2.5                                                                      2                                2
                                                                                                                                                                       2.5
                                                                   2.5
RPM: 407
DOC: 1.2 mm                                                                                             0                                0
FR: 0.127 mm/rev               0.0                                 0.0                                                                                                 0.0
CT: AC TT: CB                                                                                                                          −2
TN: 2 CN: 1              −2.5                                                                          −2                                                             −2.5
                                                               −2.5
                                                                                                                                       −4
                                5                                                                       4                               4
CS: 122 m/min
RPM: 407                                                                                                2                                2                             2.5
DOC: 1.2 mm
                                0                                   0
FR: 0.127 mm/rev                                                                                        0                                0                             0.0
CT: QC TT: CB
TN: 2 CN: 2                                                                                            −2                              −2                             −2.5
                               −5
                                                                   −5
CS: 122 m/min
                   Amplitude
                                                                                                       2.5                             2.5                             2.5
RPM: 428                                                            2
DOC: 1.2 mm
FR: 0.127 mm/rev                0                                   0                                  0.0                             0.0                             0.0
CT: HC TT: CB
TN: 3 CN: 2                                                        −2                              −2.5                               −2.5                            −2.5
                               −5
                                                                                                        4                                                                4
CS: 122 m/min                                                      2.5
RPM: 428                       2.5                                                                                                       2
                                                                                                        2                                                                2
DOC: 1.2 mm
                                                                   0.0                                                                   0
FR: 0.127 mm/rev               0.0                                                                      0                                                                0
CT: 3QC TT: CB
TN: 3 CN: 3                                                    −2.5                                                                    −2
                                                                                                       −2                                                              −2
                         −2.5
                                                                                                                                       −4
                                                                                                                                                                         4
CS: 122 m/min
RPM: 428                        2                                   2                                  2.5                             2.5                               2
DOC: 1.2 mm
FR: 0.127 mm/rev                0                                                                      0.0                             0.0
                                                                    0                                                                                                    0
CT: AC TT: WOCB
TN: 5 CN: 2                    −2                                                                  −2.5                                                                −2
                                                                   −2                                                                 −2.5
                               −4
                                     25   26   27   28   29   30         25   26   27   28   29   30         0     1   2    3     4          0   1    2     3     4          25   26   27        28   29   30
                                          Time (seconds)                      Time (seconds)                     Time (seconds)                  Time (seconds)                   Time (seconds)
Figure 5.6: Processed time series (blue) and the reconstructed signals using EEMD (orange)
for five cutting configurations when the last 5 seconds of the time series that is used in
the analysis. CS: Cutting speed, DOC: depth of cut, FR: feed rate, CT: cutting time, QC:
quarter cut, HC: half-cut, 3QC: three-quarter cut, AC: all cut, TT: tool type, CB: chip
breaker, WOCB: without chip breaker, TN: tool number, CN: corner number.
utilizes the Least Means Square (LMS) based approach introduced in Reference [143]. This
approach uses LMS to identify the noise floor in the Fourier spectrum of the given signal.
Then, the noise floor is used to identify the maximum significant frequency in the spec-
trum. The delay parameter is defined as τ = fs /(2fmax ) where fs and fmax represents the
sampling frequency and the maximum significant frequency, respectively [143]. Then, the
embedding dimension is computed using the delay parameter and the Multi-scale permuta-
tion entropy [310, 143]. After finding these two parameters, the time series are embedded
using Taken’s embedding (see Section 2.6.3.1).
      The embedded time series represents the point clouds that are used to compute the per-
sistence diagrams. In this chapter, only considered the first-dimensional persistent homology
is considered (H1 ), where the main interest is loops in the point cloud. The Ripser package in
Python is utilized to compute the persistence diagrams. Since the time series contains a large
                                                                                                             205


number of samples, embedding results in giant point clouds. Prior experience shows that
having more than 1000 points in the point cloud can increase the computation time of the
diagrams. Therefore, previously, this study used the greedy permutation [78] option of the
Ripser and Bezier curve approximation approach to reduce the runtime for computation of
persistence diagrams (see Section 2.6.3.2 for more details.) This chapter uses only the greedy
permutation option in the Ripser to obtain persistence diagrams. The algorithm subsamples
point clouds to the final size of 1000 samples, and the resulting persistence diagrams are
used to extract features. In Section 2.6.4, five different featurization option for persistence
diagrams are explained. This section utilizes only Carlsson Coordinates, persistence images,
and template functions. The resulting features obtained from persistence diagrams are fed
into supervised classification algorithms to determine the amount of wear in cutting signals.
5.4    Results
    Features obtained from cutting signals using the approaches explained in Section 5.3
are scaled such that each column of the feature matrix has zero mean and unit standard
deviation. Then the next step is to apply Principal Component Analysis (PCA) [311] to
the feature matrix to reduce the dimension of the feature matrix. The maximum number of
components that can be selected is equal to min(r,c), where r and c represent the row and
column number of the feature matrix, respectively. For instance, there are 100 features for
each time series in the DWT approach, but the number of samples is 55. Therefore, 55 is
the maximum number of components to select in PCA. Then the variance contributions of
each feature are calculated and sorted. When the cumulative sum of variance contributions
exceeds 90%, the algorithm automatically selects the corresponding number of components
and applies PCA again to obtain the final feature matrix. For DWT and EEMD approaches,
the variance plots are provided in Figure 5.7.
    After applying PCA, the next step is to generate the training and test set. This study
uses stratified 5-fold cross-validation to generate the training and test sets. For each feature
                                               206


Figure 5.7: The number of components obtained after applying PCA for features of DWT
(left) and EEMD (right) and computing the cumulative variance ratio. These variance plots
are obtained when the last 25 seconds of the time series are used in feature extraction.
extraction approach, the same random state number is used for the k-fold cross validator to
generate the same sets of training and test sets for a fair comparison. In this study, four
supervised classification algorithms explained in Section 2.2 are used to test the performance
of each feature extraction approach. Parameter tuning for each classifier is performed using
a grid search-based approach. A range of values or a list of options is predefined for the
parameter to be tuned. Then, the grid search algorithm finds the set of parameters that
results in the best estimator, providing the highest accuracy. The classifier is trained and
tested based on these parameters, and several metrics such as accuracy and F1 scores are
saved. Time series are labeled based on three wear levels (low, medium, and high) measured
after running the experiments. Therefore, in this study, a three-class classification is applied.
    This study also investigates how the time duration of the signals used in the analysis
affects the classification results. Therefore, the last ∆T seconds of the time series are only
taken into account during feature extraction where ∆T = 5, 10, 15, 20 and 25. The machine
learning framework is run for each ∆T value, and the results are obtained. This allows
us to see how much data is needed to identify different types of wear with the explained
approaches.
                                               207


5.4.1    Discrete Wavelet Transform Results
    After obtaining the feature matrix using the DWT approach explained in Section 5.3.1,
classification results are obtained using 5-fold cross-validation. Two types of labeling are
performed based on the wear type. Figures 5.8 and 5.9 provide the results obtained with the
labeling based on flank and crater wear, respectively. Figure 5.8 shows that DWT can detect
the flank wear level with 50% accuracy at most when the last 5 seconds of the time series are
used, and logistic regression is chosen as a classifier. However, this result has a large deviation
amount which is 15.9%. The same accuracy with a smaller deviation can only be obtained
when the last 20 seconds of the time series are used. This will also increase the number of
samples analyzed in the framework, potentially increasing the runtime. On the other hand,
Figure 5.9 provides better accuracies when the labeling of time series is done with respect
to crater wear measurements. The highest accuracy reaches 58% when ∆T = 20 seconds
with the SVM classifier. If only the last 5 seconds of the time series are used, the resulting
accuracy is 54% with a similar deviation compared to the one for the highest accuracy. This
shows that DWT can detect the wear level without needing many samples. However, it is
worth noting that several assumptions are required to identify sensitive frequencies in the
Fourier spectrum.
    In addition to test set results, training set results for the same approach are provided in
Figures D.1 and D.2 for flank and crater wear based labeling, respectively. It is seen that
there is a large difference between the test set and training set accuracies. This could be the
indication of overfitting in the classifier. However, it is worth noting that the total number
of samples (55) and the 5-fold cross-validation results in samples 11 samples in the test set.
Since this is a three-class classification, approximately every test set has four samples from
each group. This could also lead to lower classification results in the test set and higher
deviations in the accuracies.
    Most of the forces extracted from time series belong to the force sensor since three time
series are obtained in each cutting experiment. To see the effect of features from the force
                                                208


                                      The last ∆T seconds used in analysis
                ∆T = 5          ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                0.345            0.273              0.345               0.509              0.382
         SVM   ± 0.134          ± 0.129            ± 0.106             ± 0.073            ± 0.068
                0.509            0.255              0.273               0.291              0.418
         LR    ± 0.159          ± 0.156            ± 0.100             ± 0.089            ± 0.073
                0.273            0.364              0.364               0.436              0.345
         RF    ± 0.115          ± 0.081            ± 0.115             ± 0.134            ± 0.106
                0.327            0.273              0.345               0.364              0.291
         GB    ± 0.073          ± 0.057            ± 0.106             ± 0.163            ± 0.089
                         0.30               0.35              0.40                 0.45             0.50
Figure 5.8: The test set classification scores obtained with the DWT approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on flank wear measurements.
                                      The last ∆T seconds used in analysis
                ∆T = 5          ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                0.545            0.527              0.564               0.582              0.527
         SVM   ± 0.057          ± 0.036            ± 0.036             ± 0.045            ± 0.089
                0.527            0.364              0.309               0.436              0.418
         LR    ± 0.134          ± 0.100            ± 0.093             ± 0.106            ± 0.123
                0.491            0.527              0.545               0.545              0.582
         RF    ± 0.045          ± 0.036            ± 0.000             ± 0.057            ± 0.045
                0.509            0.509              0.455               0.364              0.582
         GB    ± 0.109          ± 0.073            ± 0.081             ± 0.081            ± 0.109
                     0.35            0.40              0.45              0.50             0.55
Figure 5.9: The test set classification scores obtained with the DWT approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on crater wear measurements.
                                                   209


                                              The last ∆T seconds used in analysis
                       ∆T = 5           ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                        0.473            0.564               0.600              0.545              0.545
          SVM          ± 0.194          ± 0.036             ± 0.073            ± 0.057            ± 0.000
          LR
                        0.491            0.509               0.491              0.564              0.509
                       ± 0.148          ± 0.093             ± 0.123            ± 0.089            ± 0.045
                        0.491            0.527               0.455              0.527              0.545
          RF
                       ± 0.073          ± 0.068             ± 0.057            ± 0.036            ± 0.000
                        0.382            0.345               0.455              0.473              0.509
          GB
                       ± 0.167          ± 0.121             ± 0.081            ± 0.145            ± 0.093
                0.35             0.40                0.45               0.50               0.55             0.60
Figure 5.10: The test set classification scores obtained with the DWT approach by excluding
features from force signals. Results are provided for four different classification algorithms:
SVM: Support Vector Machine, LR: Logistic Regression, RF: Random Forest, GB: Gradient
Boosting. The labeling of the time series used in the analysis is done based on crater wear
measurements.
sensor, classification is performed by excluding them. The resulting classification results
are provided in Figure 5.10. This figure only contains the results obtained with labeling
based on crater wear since flank wear results provide poor accuracies. Figure 5.10 shows
that classification scores are relatively lower compared to the ones provided in Figure 5.9,
especially for small ∆T . When features from force signals are included in the feature matrix,
DWT can detect the wear level with 54.5% accuracy when ∆T = 5. However, a similar
classification performance can be obtained with ∆T = 10 when we exclude the features from
force signals (see Figure 5.10). In addition, the user needs to increase ∆T even further to
obtain the highest accuracy, which is 60%. This indicates that the DWT approach requires
users to include force signal features for better performance if the user wants to identify the
wear level in a signal in a short time, in this case, smaller ∆T . Selected sensitive frequency
in the Fourier spectrum of shorter signals can provide better descriptors for identifying wear
levels in the time series.
                                                            210


                                     The last ∆T seconds used in analysis
                 ∆T = 5        ∆T = 10            ∆T = 15               ∆T = 20     ∆T = 25
                 0.345          0.345              0.382               0.382         0.418
          SVM   ± 0.068        ± 0.068            ± 0.068             ± 0.121       ± 0.178
                 0.455          0.400              0.327               0.309         0.400
          LR    ± 0.100        ± 0.093            ± 0.169             ± 0.123       ± 0.093
                 0.418          0.309              0.491               0.309         0.364
          RF    ± 0.073        ± 0.045            ± 0.093             ± 0.093       ± 0.100
                 0.382          0.291              0.436               0.218         0.382
          GB    ± 0.089        ± 0.121            ± 0.145             ± 0.093       ± 0.089
                   0.25          0.30              0.35              0.40         0.45
Figure 5.11: The test set classification scores obtained with the EEMD approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on flank wear measurements.
5.4.2   Ensemble Empirical Mode Decomposition Results
   The same time domain feature set presented in Table 2.3 is used to extract features from
reconstructed signals with the EEMD approach. The resulting classification scores of the
test set for four different classifiers and varying ∆T parameters are presented in Figures 5.11
and Figures 5.12, while Figures D.1 and D.2 It is seen that labeling based on crater wear
measurements performs better than labeling with flank wear measurements. The highest
accuracies are obtained when ∆T = 15 for flank and crater wear measurements. When ∆T
is chosen as 5 seconds, crater wear-based labeling can provide 54.5% accuracy with a 17.2%
deviation. However, this deviation is quite larger than the one for the highest accuracy (see
Figure 5.12). In addition, the DWT approach can provide the same accuracy as EEMD with
lower deviation when the last five seconds of the time series are used in feature extraction.
   The results without including the features from force signals are obtained for EEMD.
Figure 5.13 shows that excluding features of force signals can lead to an increase in accuracies
when ∆T = 10. However, accuracies decreases for ∆T = 5 if Figures 5.12 and 5.13 are
                                                  211


                                              The last ∆T seconds used in analysis
                       ∆T = 5           ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                        0.509            0.491              0.455               0.564              0.545
          SVM          ± 0.148          ± 0.093            ± 0.115             ± 0.036            ± 0.057
                        0.491            0.473              0.564               0.527              0.455
          LR           ± 0.159          ± 0.185            ± 0.068             ± 0.106            ± 0.129
                        0.509            0.509              0.564               0.509              0.564
          RF           ± 0.136          ± 0.093            ± 0.036             ± 0.093            ± 0.068
                        0.545            0.436              0.564               0.436              0.473
          GB           ± 0.172          ± 0.089            ± 0.068             ± 0.068            ± 0.106
                0.44             0.46         0.48            0.50           0.52          0.54             0.56
Figure 5.12: The test set classification scores obtained with the EEMD approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on crater wear measurements.
compared. This shows us the effect of selected IMFs in feature extraction. Reconstructed
signals obtained from force sensor using EEMD may not have all the information needed to
identify the wear level since the reconstructed signals in some cutting configurations seems
to be an approximation to the original signal (see Figure 5.6). If most of the reconstructed
signals look like a curve fit to the original signal, the features obtained from them can not be
a good descriptor for identifying the wear level. Removing these features can lead to better
accuracies, as seen from the case when ∆T = 10.
5.4.3   Topological Data Analysis Based Approach Results
   This study also proposes a TDA-based approach for tool wear analysis. Three types of
featurization techniques are utilized in this study to summarize the persistence diagrams
obtained from time series data. These are Carlsson Coordinates, persistence images, and
template functions. Figures 5.14 and 5.15 provides the classification results obtained with
Carlsoon Coordinates using flank and crater wear based labeling, respectively. As seen previ-
                                                           212


                                     The last ∆T seconds used in analysis
                 ∆T = 5        ∆T = 10            ∆T = 15               ∆T = 20           ∆T = 25
                 0.418          0.564              0.491               0.418               0.527
          SVM   ± 0.169        ± 0.036            ± 0.093             ± 0.227             ± 0.089
          LR
                 0.509          0.509              0.473               0.527               0.527
                ± 0.045        ± 0.073            ± 0.106             ± 0.036             ± 0.089
                 0.455          0.564              0.473               0.527               0.491
          RF
                ± 0.115        ± 0.036            ± 0.068             ± 0.068             ± 0.109
                 0.418          0.564              0.473               0.473               0.382
          GB
                ± 0.159        ± 0.121            ± 0.089             ± 0.068             ± 0.068
                  0.400    0.425         0.450       0.475         0.500          0.525      0.550
Figure 5.13: The test set classification scores obtained with the EEMD approach by excluding
features from force signals. Results are provided for four different classification algorithms:
SVM: Support Vector Machine, LR: Logistic Regression, RF: Random Forest, GB: Gradient
Boosting. The labeling of the time series used in the analysis is done based on crater wear
measurements.
ously, crater wear-based labeling performs better than flank wear-based labeling. Therefore,
this section only explains the results obtained with crater wear-based labeling. Carlsson Co-
ordinates can detect the tool wear with 56.4% accuracy when ∆T = 5 seconds. For the same
time window, Carlsson Coordinates provides slightly higher accuracy compared to EEMD
and DWT-based approaches (see Figures 5.9 5.12 and 5.15). However, DWT requires select-
ing thresholds for the Fourier spectrum to avoid selecting small sensitive frequencies, and
EEMD based approach is computationally more expensive than the Carlsson Coordinates.
   The classification scores obtained without including the features from force signals are
presented in Figure 5.16. It is seen that Carlsson Coordinates can detect wear levels with
56.4% and 58.2% when ∆T = 5 and ∆T = 10, respectively. This indicates that features
from the acoustic emission sensor and the microphone can only be used in data acquisition.
Carlsson coordinates eliminate the need for features from the force dynamometer, whose
setup is cumbersome and time-consuming. For the same time period (∆T = 5), Carlsson
                                                  213


                                              The last ∆T seconds used in analysis
                      ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20           ∆T = 25
                       0.418             0.327              0.345                 0.455             0.418
         SVM          ± 0.109           ± 0.148            ± 0.089               ± 0.182           ± 0.093
                       0.455             0.455              0.436                 0.400             0.327
         LR           ± 0.057           ± 0.115            ± 0.068               ± 0.093           ± 0.045
                       0.400             0.491              0.364                 0.364             0.309
         RF           ± 0.109           ± 0.169            ± 0.057               ± 0.000           ± 0.136
                       0.364             0.491              0.382                 0.364             0.327
         GB           ± 0.081           ± 0.169            ± 0.145               ± 0.129           ± 0.109
                      0.325     0.350          0.375         0.400           0.425         0.450     0.475
Figure 5.14: The test set classification scores obtained with the Carlsson Coordinates. Re-
sults are provided for four different classification algorithms: SVM: Support Vector Machine,
LR: Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the
time series used in the analysis is done based on flank wear measurements.
                                              The last ∆T seconds used in analysis
                      ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20           ∆T = 25
                       0.564             0.473              0.545                 0.527             0.564
         SVM          ± 0.121           ± 0.134            ± 0.100               ± 0.068           ± 0.036
                       0.509             0.455              0.545                 0.491             0.527
         LR           ± 0.123           ± 0.115            ± 0.100               ± 0.073           ± 0.068
                       0.564             0.545              0.545                 0.455             0.564
         RF           ± 0.089           ± 0.081            ± 0.129               ± 0.000           ± 0.036
                       0.455             0.527              0.564                 0.291             0.455
         GB           ± 0.152           ± 0.156            ± 0.106               ± 0.068           ± 0.081
               0.30             0.35                0.40              0.45                 0.50          0.55
Figure 5.15: The test set classification scores obtained with the Carlsson Coordinates. Re-
sults are provided for four different classification algorithms: SVM: Support Vector Machine,
LR: Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the
time series used in the analysis is done based on crater wear measurements.
                                                           214


                                     The last ∆T seconds used in analysis
                 ∆T = 5        ∆T = 10            ∆T = 15               ∆T = 20   ∆T = 25
                 0.564          0.473              0.545               0.527       0.564
          SVM   ± 0.121        ± 0.134            ± 0.100             ± 0.068     ± 0.036
          LR
                 0.509          0.455              0.545               0.491       0.527
                ± 0.123        ± 0.115            ± 0.100             ± 0.073     ± 0.068
                 0.473          0.582              0.509               0.455       0.582
          RF
                ± 0.068        ± 0.045            ± 0.093             ± 0.100     ± 0.045
                 0.436          0.491              0.582               0.309       0.455
          GB
                ± 0.145        ± 0.123            ± 0.073             ± 0.073     ± 0.081
                      0.35          0.40              0.45              0.50      0.55
Figure 5.16: The test set classification scores obtained with the Carlsson Coordinates by
excluding features from force signals. Results are provided for four different classification
algorithms: SVM: Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
GB: Gradient Boosting. The labeling of the time series used in the analysis is done based
on crater wear measurements.
coordinates provides the highest score compared to EEMD and DWT (see Figures 5.10, 5.13,
and 5.16).
   The second featurization technique utilized for persistence diagrams is the persistence
images. In this approach, persistence diagrams are converted to images, and their pixel
values are used as features in the classification. Figures 5.17 and D.7 provide the results
obtained with two types of labeling. Crater wear-based labeling provides higher accuracies
than flank wear-based labeling. Similar to Carlsson Coordinates, persistence images capture
the information related to tool wear in a five-second time window with 56.4% accuracy and
lower deviation compared to persistence images. When the features of the force signals are
excluded, the resulting classification scores in Figure 5.18 decrease 2%. Highest classification
scores are obtained when ∆T = 10 and ∆ = 15 seconds.
   The last featurization approach is the template functions. Since crater wear-based label-
ing performs better than flank wear-based labeling, the results for flank wear can be found
                                                  215


                                        The last ∆T seconds used in analysis
                ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20                  ∆T = 25
                 0.436             0.564              0.545                 0.527                   0.491
         SVM    ± 0.134           ± 0.068            ± 0.057               ± 0.036                 ± 0.045
         LR
                 0.436             0.545              0.473                 0.436                   0.473
                ± 0.176           ± 0.141            ± 0.156               ± 0.121                 ± 0.089
                 0.564             0.491              0.564                 0.564                   0.473
         RF
                ± 0.036           ± 0.109            ± 0.036               ± 0.036                 ± 0.156
                 0.382             0.509              0.473                 0.418                   0.473
         GB
                ± 0.145           ± 0.093            ± 0.106               ± 0.073                 ± 0.106
                 0.400      0.425           0.450       0.475           0.500           0.525           0.550
Figure 5.17: The test set classification scores obtained with the persistence images. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on crater wear measurements.
                                        The last ∆T seconds used in analysis
                ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20                  ∆T = 25
                 0.382             0.582              0.564                 0.509                   0.491
         SVM    ± 0.106           ± 0.045            ± 0.068               ± 0.045                 ± 0.045
         LR
                 0.418             0.527              0.527                 0.436                   0.473
                ± 0.169           ± 0.145            ± 0.156               ± 0.121                 ± 0.089
                 0.545             0.509              0.582                 0.527                   0.491
         RF
                ± 0.000           ± 0.093            ± 0.045               ± 0.068                 ± 0.093
                 0.400             0.491              0.527                 0.436                   0.436
         GB
                ± 0.136           ± 0.093            ± 0.106               ± 0.089                 ± 0.106
                0.400     0.425        0.450        0.475       0.500           0.525           0.550       0.575
Figure 5.18: The test set classification scores obtained with the persistence images by ex-
cluding features from force signals. Results are provided for four different classification
algorithms: SVM: Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
GB: Gradient Boosting. The labeling of the time series used in the analysis is done based
on crater wear measurements.
                                                     216


in Appendix Figure D.10. Figures 5.19 and 5.20 show that ignoring features of force sig-
nals increased the highest accuracy obtained out of combinations of four different classifiers
and five different ∆T parameters. Template functions, including the features from force
signals, are able to identify the wear level with 56.4%, while the accuracy increases up to
60% without including force signals. 60% accuracy is the highest accuracy obtained out of
five approaches. While persistence images can provide this accuracy with five seconds time
window, DWT needs at least 15 seconds of the time series to reach the same accuracy (see
Figures 5.10 and 5.16).
    These results show that TDA-based approaches can capture the information related to
tool wear from the time series in a shorter time window. As a result, the user does not need
to collect a large number of samples to identify the wear level. This also brings an advantage
in terms of the runtime of the algorithms. Providing larger classification scores for smaller
time series and eliminating the need to use a force dynamometer in data acquisition can
make the integration of TDA-based approaches into real-time tool wear analysis easier than
DWT and EEMD based approaches.
5.5     Conclusion
    This study contributes to the DWT-based approach widely adopted in the literature
by introducing an automatic and adaptive algorithm to find the sensitive frequency in the
Fourier spectrum. EEMD based approach is also used as a benchmark case to compare
results. In addition, it is believed that this is the first study to implement persistence
homology from TDA for tool wear analysis in machining.
    It has been shown that DWT and EEMD based approaches are dependent on the fea-
tures from force signals when a shorter time is chosen for analysis. However, TDA-based
approaches can still provide similar accuracies when the features of force signals are excluded.
Setting up a force dynamometer is time-consuming, and it requires additional fixtures to be
designed for the cutting lathe. TDA-based approaches can eliminate the usage of force dy-
                                              217


                                                    The last ∆T seconds used in analysis
                      ∆T = 5                  ∆T = 10            ∆T = 15               ∆T = 20             ∆T = 25
                       0.509                   0.491                0.491               0.527              0.527
         SVM          ± 0.045                 ± 0.093              ± 0.109             ± 0.089            ± 0.036
                       0.527                   0.509                0.491               0.545              0.545
         LR           ± 0.134                 ± 0.073              ± 0.136             ± 0.100            ± 0.129
                       0.564                   0.545                0.491               0.455              0.545
         RF           ± 0.068                 ± 0.057              ± 0.045             ± 0.141            ± 0.057
                       0.545                   0.455                0.455               0.455              0.509
         GB           ± 0.081                 ± 0.057              ± 0.057             ± 0.152            ± 0.093
                0.46                   0.48                 0.50               0.52              0.54                0.56
Figure 5.19: The test set classification scores obtained with the template functions. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on crater wear measurements.
                                                    The last ∆T seconds used in analysis
                      ∆T = 5                  ∆T = 10            ∆T = 15               ∆T = 20           ∆T = 25
                   0.509                   0.491                    0.491              0.527              0.527
         SVM      ± 0.045                 ± 0.093                  ± 0.109            ± 0.089            ± 0.036
         LR
                   0.527                   0.509                    0.491              0.545              0.545
                  ± 0.134                 ± 0.073                  ± 0.136            ± 0.100            ± 0.129
                   0.600                   0.509                    0.509              0.491              0.545
         RF
                  ± 0.045                 ± 0.093                  ± 0.045            ± 0.123            ± 0.057
                   0.527                   0.455                    0.473              0.473              0.545
         GB
                  ± 0.089                 ± 0.057                  ± 0.089            ± 0.121            ± 0.081
               0.46             0.48             0.50          0.52          0.54         0.56          0.58          0.60
Figure 5.20: The test set classification scores obtained with the template functions by ex-
cluding features from force signals. Results are provided for four different classification
algorithms: SVM: Support Vector Machine, LR: Logistic Regression, RF: Random Forest,
GB: Gradient Boosting. The labeling of the time series used in the analysis is done based
on crater wear measurements.
                                                                   218


namometer in data acquisition. Therefore, they can provide more efficient data acquisition
in terms of time and money.
    DWT can provide similar accuracies with TDA-based approaches when smaller ∆T is
chosen for the analysis and features from force signals are included. Although the proposed
algorithm for DWT to identify the sensitive frequencies is automatic and adaptive, it requires
users to enter the minimum desired frequency. This parameter needs to be chosen to avoid
the selection of small frequencies as sensitive frequencies. The user needs to know the
structural modes of the experimental setup or review the spectrum to identify frequencies
whose amplitudes are not significant. Therefore, this challenges DWT to be implemented in
a real-time tool wear analysis pipeline.
    The EEMD approach can provide similar accuracies with large deviation compared to the
TDA-based approach when features of the force signals are included, and a small ∆T is used.
To obtain similar scores with smaller deviations, the user needs to include the last 15 seconds
of the time series in the analysis. Including more data points will also increase the time
required to compute the IMFs, while the computation of IMFs is computationally expensive
in Python, even for smaller ∆T . Therefore, implementing EEMD into real-time tool wear
analysis requires additional optimization to decrease the runtime for IMF computations.
    TDA-based approaches provide promising results with or without features from force
signals. Their performance either exceeds or matches the results obtained from EEMD
based approaches. The machine learning framework with TDA-based approaches does not
require the user to make assumptions. The pipeline is automatic and adaptive. While the
greedy permutation option of Ripser package is utilized to compute persistence diagrams
in a reasonable time in this chapter, this work previously showed that the combination of
parallel computing and the Bézier curve approximation technique can significantly reduce
the runtime for persistence diagram computation. Therefore, users have multiple choices to
utilize for a faster pipeline. It is worth noting that point clouds obtained after embedding are
subsampled to 1000 points with the greedy permutation option of Ripser. Although this is
                                                 219


a significantly smaller number compared to the original size of the point clouds, TDA-based
approaches can provide similar classification scores to DWT and EEMD based approaches.
This study hypothesizes that approximated persistence diagrams obtained from the Bézier
curve approach can lead to higher classification scores. Therefore, future work of this study
can include obtaining the results with persistence diagrams obtained from the combination of
Bézier curve approximation and parallel computing. Another direction of future studies can
focus on feature ranking to identify the pixels that provide the most informative features in
persistence images. This will show us whether focusing on the areas that provide the most
informative features can lead to better accuracy. In addition to improving classification
accuracy, real-time tool wear analysis can be built using the TDA-based approaches, which
show promising results for offline tool wear analysis in this study.
                                              220


                                        CHAPTER 6
                               CONCLUDING REMARKS
    This research mainly focuses on machine learning applications in dynamical systems.
Specifically, manufacturing applications and the parameter identification of complex systems
are the two main areas investigated in this study. It has been observed that the signal
decomposition approaches are widely utilized in chatter diagnosis, tool wear analysis, and
surface texture analysis. While these approaches can provide high classification accuracy in
some cases, this study shows that they are user-dependent and require manual preprocessing
of the data set. The main goal of this study is to improve existing approaches to reduce
human intervention and introduce novel approaches based on Topological Data Analysis.
    For chatter detection, a novel approach based on similarities between the time series
is developed in this study for chatter diagnosis. It implements DTW and combines this
algorithm with KNN to classify cutting signals. In addition, persistence homology is utilized
in chatter diagnosis with six different featurization techniques of persistence diagrams. It is
shown that the results of DTW and TDA-based approaches can either match or exceed the
classification scores obtained with existing approaches. This study also improves the runtime
of the proposed approaches by implementing High Performance Computing tools. For small
batch operations, this study proposes the combination of DTW with AESA to reduce the
number of distance computations in the test phase. For the TDA-based approach, persistence
diagrams are approximated using Bézier curves. These improvements in the pipeline can
allow users to complete the classification of a single time series in less than two seconds.
Therefore it is shown that TDA is viable for real-time chatter diagnosis approaches.
    For parameter identification of dynamical systems, this study investigates one of the
most popular approaches called SINDy. This approach is generally used in literature with
data obtained from analytical models, while the experimental studies are limited. Therefore,
this study tests the performance of SINDy with a complex single pendulum apparatus for
                                              221


the first time, and it reports the caveats of SINDy while making modifications in derivative
estimation steps of the approach.
    Surface texture is another application that is taken into account in this study. Two
widely adopted approaches, DCT and DWT, are investigated in this study. Two automatic
and adaptive threshold selection algorithms are proposed for these approaches. It is seen that
the proposed algorithms can match the results obtained from heuristic threshold selection
while removing manual preprocessing of the surface scans and surface profiles. In addition
to these two approaches, sublevel set persistence from TDA is utilized to summarize the
information in given surfaces. The resulting classification accuracies are either higher than
the results of traditional feature extraction approaches or match them. Therefore, this study
also provides another TDA-based approach called topological saliency to cluster surface
patches to identify the regions where additional machining is required. It is believed that
further research is needed to implement topological saliency into real-time cutting operations
to reduce material waste and provide an efficient surface finish monitoring system.
    Tool wear analysis is studied in the last chapter of this study. DWT and EEMD based
approaches are applied to experimental titanium cutting signals. A new automatic and
adaptive algorithm is proposed to select sensitive frequencies in the Fourier spectrum for
the DWT approach. In addition, featurization techniques from TDA tools are utilized to
classify time series based on their wear level. TDA-based approaches show that they can
detect the wear level in an experiment with a similar amount of samples needed by the
DWT approach, while the EEMD approach needs a larger number of samples compared to
the other two. However, DWT requires the user to select minimum desired frequencies to
locate sensitive frequencies correctly. Moreover, the TDA-based approach can still show the
same classification performance when force signal data is excluded from the experiment. On
the other hand, DWT and EEMD show a decrease in accuracy or need more data samples
to obtain the same performance with the analysis, including force sensor data. Therefore,
the TDA-based approach does not need an expensive force sensor, and it can save the
                                               222


time needed to integrate a force dynamometer into an experimental setup. The TDA-based
approach provides a fully automated pipeline without needing user input in addition to
several options to speed up the computation, such as Bézier curve approximation, parallel
computing, and greedy permutation. Therefore, TDA is shown to be a viable option for
real-time tool wear analysis. One future direction of this study can include the prediction of
tool wear in real-time using standard and novel approaches.
                                             223


APPENDICES
    224


                                      APPENDIX A
                               CUTTING EXPERIMENTS
A.1      Aluminum Cutting Experiment
    Figure A.1 shows the turning experiment that was used to collect the measurement data
for training and testing the chatter detection algorithms. It consists of a 6061 aluminum
cylindrical workpiece mounted into the chuck of the spindle of a Clausing-Gamet 33 cm (13
inch) engine lathe. An S10R-SCLCR3A boring bar from the Grizzly T10439 carbide insert
boring bar set with an attached 0.04 cm (0.015 inch) radius Titanium nitride coated cutting
insert is secured to the tool holder.
    The rod’s stiffness, and therefore, the eigenfrequencies of the tool vibration, are varied
by changing the overhang or stickout length of the rod. Four stickout lengths are used in
the experiment: 5.08 cm (2 inch), 6.35 cm (2.5 inch), 8.89 cm (3.5 inch), and 11.43 cm (4.5
inch). In order to obtain more accurate measurements, the stickout length is measured as
the distance between the flat back surface of the tool holder and the heel of the boring rod.
The visual representation of stickout distance is given on the right hand side of FigureA.1.
Increasing the stickout length leads to a stiffer cutting tool and higher eigenfrequencies for
lateral vibrations. Since the lateral direction is the most flexible and chatter frequencies
appear in the neighborhood of dominant eigenfrequencies of the structure, the dominant
chatter frequencies increase with increasing stickout length.
    The boring rod is instrumented with two PCB 352B10 miniature, lightweight, uni-axial
ceramic shear accelerometers that are ninety degrees apart to measure lateral vibrations of
the rod. The two accelerometers are superglued onto the rod at about 3.81 cm (1.5 inch) away
from the cutting tool to protect them from moving parts and cutting debris. A PCB 356B11
triaxial, miniature ceramic shear accelerometer is also attached to the bottom clamp of the
tool holder, as shown in Figure A.1. The data from all the accelerometers are collected on
                                              225


Figure A.1: The experimental setup showing the workpiece, the cutting tool, and the at-
tached accelerometers (left). The visual representation of the stickout distance (right).
the analog channels of an NI USB-6366 data acquisition box using Matlab. No in-line analog
filter is used; however, the signals are oversampled at 160kHz. Digital filtering is used before
subsampling, thus eliminating noise while avoiding the undesirable effects of antialiasing. In
particular, a Butterworth low-pass filter with order 100 and a cutoff frequency of 10 kHz is
used. The data is then downsampled to 10 kHz without the risk of causing aliasing effects.
The resulting conditioned data is what is considered in Section A.1.1. In addition, both the
raw and filtered data are provided in a Mendeley repository [312]. Also, one can find the
codes for WPT and EEMD approaches in a GitHub repository.
A.1.1      Data Labeling
     Before tagging the signals, the time series of the two uni-axial accelerometers on the
boring rod are analyzed as well as the signals from the tri-axial accelerometer on the tool post,
see Figure A.1. It is found that although the data of the accelerometers is mostly redundant,
the x-vibration at the tool post, which is measured by the x-axis signal of the tri-axial
accelerometer, had the best signal-to-noise ratio. Therefore, the data tagging is performed
exclusively using the data from this channel. Another sanity check was the comparison of
the tagged signals with a few photographs of the resulting machined surface taken during
the experiment, as shown in Figure A.2.
     Each time series from every cutting test was examined, and the different parts of the
signals were labeled as either no chatter, mild/intermediate chatter, chatter, or unknown.
                                               226


           (a) No chatter.               (b) Mild chatter.           (c) Chatter.
Figure A.2: The surface finish corresponding to (a) no-chatter case with 6.35 cm (2.5 inch)
stickout length, 320 rpm, and 0.127 mm (0.005 inch) depth of cut, (b) Mild chatter case
with 6.35 cm (2.5 inch) stickout length, 770 rpm, and 0.127 mm (0.005 inch) depth of cut,
and (c) chatter case with 6.35 cm (2.5 inch) stickout length, 570 rpm, and 0.127 mm (0.005
inch) depth of cut.
Figure A.3a shows an example of how one time series is labeled using these categories. The
separation into different parts has been done based on the characteristics of the amplitude in
the time domain. In particular, parts with a low amplitude were separated from parts with
a large amplitude. In addition, parts with an impact-like structure with an abrupt, very
strong increase of the accelerations and a relatively fast decay were also separated. Then
the frequency domain characteristics were studied for final classification of the signal. In
the frequency domain, only the frequency components lower than 5 kHz were considered.
Specifically, the criteria used for classifying the signals are:
   1. No chatter (stable):
        a) Low amplitude in the time domain
        b) Low amplitude in the frequency domain (highest peaks at spindle rotation fre-
            quencies [313])
   2. Mild or intermediate chatter:
        a) Low amplitude in the time domain
        b) Large amplitude in the frequency domain (highest peaks at chatter frequencies)
                                                227


    3. Chatter:
          a) Large amplitude in time domain
          b) Large amplitude in the frequency domain (very high peaks at chatter frequencies
             which are not equal to the spindle rotation frequencies)
    4. Unknown:
          a) All other cases
The unknown data are parts of the time series with large amplitude in the time domain but
no large peaks in the frequency domain at chatter frequencies (lower than 5 kHz). Typically,
this corresponds to the parts with impact-like structure, which might occur due to chip
breakage or other inhomogeneities during the process. Also, there can be another eigenmode
at 10kHz, which is vibrating (chattering) for the unknown portion of the time series in
FigureA.3. Figure A.4 also provides time and frequency domain of the cutting configurations
provided in Figure A.2. However, typical chatter frequencies in this process lie between 0-
150 Hz for structural modes, 200-800 Hz for workpiece vibrations, and 1000-3000 Hz for tool
vibrations. Therefore, it is not clear if this is chatter or something else, and therefore it was
excluded from the analysis. The time domain and the frequency domain characteristics for an
example time series that includes all four classes are shown in Figure A.3a and Figure A.3b,
respectively. The first part of the time series was not classified because, in this case, the tool
is still not engaged in the workpiece.
     From Figure A.3, it also becomes clear that a process with fixed cutting conditions
is not necessarily clearly stable or unstable, which is another reason why chatter detection
algorithms might be helpful for practical applications. For example, the process in Figure A.3
is stable at the beginning between 4s and 6s. Then, a strong perturbation drives the system
away from the stable state to some chattering motion, which is a reasonable scenario because
there can be bistability between stable cutting and chatter [314, 315, 316]. Table A.1 shows
the breakdown and the total number of the tagged time series for each stickout length.
                                                228


                 (a) time series                         (b) spectrum comparison
Figure A.3: Tagging example in the (a) time domain and (b) the frequency domain for the
case with 8.9 cm (3.5 in) stickout, 770 rpm, and 0.04 cm (0.015 in) depth of cut.
Figure A.4: (a) Sample tagged time series and some samples of the resulting surface finish.
The right panel shows a comparison of the frequency spectra between a signal labeled as
chatter versus (b) no chatter, (c) intermediate chatter, and (d) unknown.
Processed time series are split into smaller pieces whose lengths are around 10000 to reduce
the computational time in several feature extraction methods. The numbers provided in
parenthesis in Table A.1 represent the number of time series after splitting.
                                              229


Table A.1: Case numbers before and after splitting (in parenthesis) the time series and
overall amount of tagged data for each stickout case.
                Stickout length
                                    Stable   Mild chatter     Chatter   Total
                  (cm (inch))
                    5.08 (2)       17 (443)     8 (31)       11 (118) 36 (592)
                   6.35 (2.5)       7 (82)      4 (33)         3 (13) 14 (128)
                   8.89 (3.5)       7 (55)       2 (5)         2 (6)   11 (66)
                   11.43 (4.5)     13 (114)     4 (14)         5 (48) 22 (176)
A.2     Milling Experiment
    This section describes the experimental setups for milling. The experimental data is
received from Reference [317]. An illustration showing the experimental milling system is
shown in Fig. A.5. An Ingersol machining center with a Fischer 40000 rpm and 40kW
spindle was used to conduct experiments on an aluminum workpiece (7050-T7451). The
type of milling conducted in these experiments is down milling. The depth of cut is 2.03
mm, and radial immersion is kept constant at 5%. Lion precision capacitive probes were
used to collect the tool displacements along the x, and y axes [317]. The data were sampled
at 25 kHz. As in the turning experiments, a low pass filter was used, and the data was
downsampled to 12.5 kHz. In addition, a laser tachometer was used to independently verify
the spindle rotational speed from the machine setting. The cutting tool was a 19.05 mm end
mill with two teeth and a 106 mm overhang distance. Data tagging was performed using
power spectral density (PSD) plots and Poincaré sections. Tool displacement plots in the x
direction, along with the corresponding Poincaré sections, are shown in Figure A.6.
          Table A.2: Cutting parameters for the turning and milling experiments.
   Cutting operation    Rotational Speeds (rpm)    Depth of Cut (mm)         Feed Rate
        Milling                11206-32161             0.381 - 3.556    0.191 mm/tooth/rev
    The first milling example shown in Figure A.6 is a stable cut (Ω = 19488 rpm, 1.524 mm
cutting depth), whereas the second example shows a Hopf bifurcation example (Ω = 27285 rpm,
3.556 mm cutting depth). The first column of Figure A.6 presents the tool displacements
                                             230


           Figure A.5: Experimental setup of milling cutting experiments.(left) Illus-
           tration of the setup (right) picture of the cutting tool and the workpiece.
for two teeth and the second column provides the Poincare sections of these time series.
xn and xn2 represent the time-delayed coordinates. In this study, the constant time delay
parameter is used, and it is chosen as 6. The third column of Figure A.6 shows the power
spectrum or a PSD plot that helped us better see the frequency content of the time series
[317]. If the spectral peaks were synchronous with the tooth passage frequency, that meant
the corresponding time series was stable. However, if the spectral peaks were not aligned
with the tooth passage frequency, the cutting test was unstable, as shown in the second row
and third column of Figure A.6.
    Stability predictions were made for the milling system using the measured modal pa-
rameters and the spectral element approach [2]; the resulting stability diagram is shown in
Figure A.8. The spectral element method uses eigenvalues of a dynamic map to determine
the stability of the process [317, 3]. If the magnitude of the eigenvalue is smaller than 1, the
process is stable. If the eigenvalue has only a positive real part that is larger than 1, then
the process is unstable. Eigenvalues with only negative real part less than -1 represent the
flip bifurcations, and Hopf bifurcation occurs when an eigenvalue has an amplitude larger
than 1. The illustration for the stability analysis using the real and imaginary parts of the
                                               231


Figure A.6: The first column represents tool displacements for two teeth, and the second
column provides Poincare sections for two time series obtained with three different rotational
speeds and depth of cuts in milling experiments. The third column shows the PSD plots
of the three-time series whose Poincare sections are shown in the first column. Red dots
represent tooth passage frequency. The time series shown in the first row is stable with
cutting conditions Ω = 11793 rpm and depth of cut of 2.54 mm, while the one in the second
row is unstable with cutting conditions Ω = 17746 rpm and depth of cut of 1.524 mm. The
third row represents an unstable milling signal with cutting conditions Ω = 27285 rpm and
depth of cut of 3.556 mm.
eigenvalues is also provided in Figure A.7. Based on this stability criteria, 10000 time series
for a 100 × 100 grid of points in the rpm vs. cutting depth parameter space were used to
produce the stability diagram shown in Figure A.8. The stability of the experimental data
set is decided based on the Poincare section and the PSD plots. Then experimental data
                                             232


set is included in the stability diagram shown with black diamonds (unstable) and triangle
(stable) markers in Figure A.8. It is seen that the labels obtained using the eigenvalues and
the ones obtained using frequency domain analysis and the Poincare section may not match.
The labels obtained from the analysis shown in Figure A.6 are used to perform classification.
                                       Figure A.8: The stability of time series obtained using
Figure A.7: Illustration for the sta-the analytical model provided in Section 2.6.5 [3] with
bility analysis using eigenvalues ofdifferent depth of cut (b) and spindle speed (Ω) on 100×
the dynamic map obtained using100 grid. The green color corresponds to the time series
spectral element approach [2].         with Hopf bifurcation (unstable), while the blue color
                                       represents the stable time series. The red color shows
                                       the time series with flip bifurcation. Experimental data,
                                       whose stability is defined based on the Poincare section
                                       and PSD plots, is shown with diamond (unstable) and
                                       triangle (stable) symbols.
                                               233


                                   APPENDIX B
          CLASSIFICATION RESULTS FOR CHATTER DETECTION
                           Stable    Intermediate Chatter Chatter
                                                                            7500
                                                                            7000
                Stable    3301.97          5714.23        7313.45
                                                                            6500
                                                                            6000
                                                                            5500
          Intermediate    5714.23          5171.58        7073.39
               Chatter                                                      5000
                                                                            4500
                                                                            4000
               Chatter    7313.45          7073.39        4647.07
                                                                            3500
                                                                            3000
                           Stable    Intermediate Chatter Chatter
Figure B.1: The heat map of averages DTW distances of time series belongs to three classes
for the 8.89 cm (3.5 inch) case.
                           Stable    Intermediate Chatter Chatter
                                                                            7500
                                                                            7000
                Stable    4787.94          5655.41        5549.76
                                                                            6500
                                                                            6000
                                                                            5500
          Intermediate    5655.41          5300.29        5850.78
               Chatter                                                      5000
                                                                            4500
                                                                            4000
               Chatter    5549.76          5850.78        5687.71
                                                                            3500
                                                                            3000
                           Stable    Intermediate Chatter Chatter
Figure B.2: The heat map of averages DTW distances of time series belongs to three classes
for the 11.43 cm (4.5 inch) case.
                                           234


Table B.1: Results obtained with Level 1 Wavelet Packet Transform feature extraction for
2, 2.5, 3.5, and 4.5 inch stickout cases.
             Classifier: SVM         5.08 cm (2 inch)              6.35 cm (2.5 inch)
                   Features      Test Set       Training Set    Test Set        Training Set
             r1                93.1%6.4%       85.0% ± 3.6%  73.3% ± 21.3%    83.0% ± 13.5%
             r1 ,r2           93.9% ± 5.8% 85.8% ± 3.0%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,r2 ,r3      92.3% ± 11.4% 89.6% ± 1.8%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r4    90.8% ± 13.7% 90.0% ± 3.1%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r5    89.2% ± 13.9% 89.2% ± 3.4%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r6    81.5% ± 15.1% 85.0% ± 4.7%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r7    76.9% ± 14.6% 85.0% ± 4.7%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r8    78.5% ± 13.7% 85.8% ± 6.0%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r9    78.5% ± 13.7% 86.2% ± 5.5%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r10   78.5% ± 13.7% 86.2% ± 5.5%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r11   78.5% ± 13.7% 86.2% ± 5.5%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r12   78.5% ± 13.7% 86.2% ± 5.5%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r13   78.5% ± 13.7% 86.2% ± 5.5%      85.0% ± 22.9%    100.0% ± 0.0%
             r1 ,. . .,r14   78.5% ± 13.7% 86.2% ± 5.5%      85.0% ± 22.9%    100.0% ± 0.0%
             Classifier: SVM        8.89 cm (3.5 inch)             11.43 cm (4.5 inch)
             r1              56.0% ± 17.4%    83.3% ± 11.4%  78.8% ± 14.8%     83.6% ± 8.5%
             r1 ,r2          84.0% ± 15.0%    100.0% ± 0.0%  68.8% ± 17.0%     82.9% ± 9.7%
             r1 ,r2 ,r3      84.0% ± 15.0%    100.0% ± 0.0%  67.5% ± 13.9%    80.7% ± 12.4%
             r1 ,. . .,r4    84.0% ± 15.0%    100.0% ± 0.0%  68.8% ± 15.1%     82.9% ± 9.7%
             r1 ,. . .,r5    84.0% ± 15.0%    100.0% ± 0.0%  68.8% ± 15.1%     82.9% ± 9.7%
             r1 ,. . .,r6    84.0% ± 15.0%    100.0% ± 0.0%  73.8% ± 10.4%    83.6% ± 10.1%
             r1 ,. . .,r7    84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
             r1 ,. . .,r8    84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
             r1 ,. . .,r9    84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
             r1 ,. . .,r10   84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
             r1 ,. . .,r11   84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
             r1 ,. . .,r12   84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
             r1 ,. . .,r13   84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
             r1 ,. . .,r14   84.0% ± 15.0%    100.0% ± 0.0%  76.3% ± 10.4%     84.3% ± 7.7%
                                                   235


Table B.2: Results obtained with Level 2 Wavelet Packet Transform feature extraction for
2, 2.5, 3.5 and 4.5 inch stickout cases.
            Classifier: SVM         5.08 cm (2 inch)               6.35 cm (2.5 inch)
                  Features      Test Set       Training Set     Test Set       Training Set
            r1               92.3% ± 6.0% 93.8% ± 1.9%      100.0% ± 0.0%    100.0% ± 0.0%
            r1 ,r2           90.8% ± 8.3% 93.5% ± 1.8%       95.0% ± 7.6%    100.0% ± 0.0%
            r1 ,r2 ,r3       90.0% ± 6.9% 91.5% ± 4.1%       95.0% ± 7.6%    100.0% ± 0.0%
            r1 ,. . .,r4     90.0% ± 6.9% 91.5% ± 4.1%       95.0% ± 7.6%    100.0% ± 0.0%
            r1 ,. . .,r5     91.5% ± 7.3% 92.7% ± 3.6%       95.0% ± 7.6%    100.0% ± 0.0%
            r1 ,. . .,r6    76.2% ± 11.1% 78.5% ± 8.8%      95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r7    83.8% ± 7.3% 78.5% ± 7.5%       95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r8    84.6% ± 8.4% 77.3% ± 6.8%       95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r9    81.5% ± 7.8% 76.5% ± 6.1%       95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r10   81.5% ± 7.8% 76.5% ± 6.1%       95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r11   81.5% ± 7.8% 76.5% ± 6.1%       95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r12   81.5% ± 7.8% 76.5% ± 6.1%       95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r13   81.5% ± 7.8% 76.5% ± 6.1%       95.0% ± 7.6%     100.0% ± 0.0%
            r1 ,. . .,r14   81.5% ± 7.8% 76.5% ± 6.1%       95.0% ± 7.6%     100.0% ± 0.0%
            Classifier: SVM        8.89 cm (3.5 inch)              11.43 cm (4.5 inch)
            r1              70.0% ± 22.4%    78.9% ± 11.6%  66.3% ± 16.8%    69.3% ± 15.7%
            r1 ,r2          72.0% ± 16.0%    92.2% ± 10.0%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,r2 ,r3      72.0% ± 16.0%     93.3% ± 8.9%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r4    68.0% ± 25.6%    88.9% ± 14.1%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r5    74.0% ± 12.8%    87.8% ± 15.3%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r6    78.0% ± 14.0%    90.0% ± 10.5%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r7    76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r8    76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r9    76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r10   76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r11   76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r12   76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r13   76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
            r1 ,. . .,r14   76.0% ± 12.0%    86.7% ± 15.6%  87.5% ± 11.2%     82.1% ± 8.6%
                                                  236


Table B.3: Results obtained with Level 3 Wavelet Packet Transform feature extraction for
2, 2.5, 3.5 and 4.5 inch stickout cases.
            Classifier: SVM         5.08 cm (2 inch)              6.35 cm (2.5 inch)
                  Features      Test Set      Training Set     Test Set       Training Set
            r1              70.8% ± 13.2% 71.5% ± 4.3%     63.3% ± 20.8%    73.0% ± 19.0%
            r1 ,r2           77.7% ± 8.7% 85.4% ± 3.4%      78.3% ± 7.6%    94.0% ± 12.0%
            r1 ,r2 ,r3       78.5% ± 6.7% 85.8% ± 2.5%      76.7% ± 8.2%    95.0% ± 10.2%
            r1 ,. . .,r4     77.7% ± 7.3% 84.6% ± 3.8%     78.3% ± 10.7%     97.0% ± 6.4%
            r1 ,. . .,r5     76.9% ± 4.9% 84.6% ± 6.9%     78.3% ± 10.7%     97.0% ± 6.4%
            r1 ,. . .,r6    73.8% ± 13.0% 78.1% ± 7.5%     78.3% ± 10.7%     97.0% ± 6.4%
            r1 ,. . .,r7    62.3% ± 11.6% 74.2% ± 8.9%     78.3% ± 10.7%     97.0% ± 6.4%
            r1 ,. . .,r8    60.8% ± 11.1% 72.3% ± 7.7%     81.7% ± 11.7%     98.0% ± 4.0%
            r1 ,. . .,r9    60.8% ± 11.1% 72.3% ± 7.7%     81.7% ± 11.7%     98.0% ± 4.0%
            r1 ,. . .,r10   60.8% ± 11.1% 72.3% ± 7.7%     81.7% ± 11.7%     98.0% ± 4.0%
            r1 ,. . .,r11   60.8% ± 11.1% 72.3% ± 7.7%     81.7% ± 11.7%     98.0% ± 4.0%
            r1 ,. . .,r12   60.8% ± 11.1% 72.3% ± 7.7%     81.7% ± 11.7%     98.0% ± 4.0%
            r1 ,. . .,r13   60.8% ± 11.1% 72.3% ± 7.7%     81.7% ± 11.7%     98.0% ± 4.0%
            r1 ,. . .,r14   60.8% ± 11.1% 72.3% ± 7.7%     81.7% ± 11.7%     98.0% ± 4.0%
            Classifier: SVM        8.89 cm (3.5 inch)             11.43 cm (4.5 inch)
            r1              62.0% ± 14.0%    73.3% ± 16.6%  85.0% ± 9.4%    83.6% ± 10.6%
            r1 ,r2          60.0% ± 15.5%    80.0% ± 18.5% 72.5% ± 16.6%    81.4% ± 14.7%
            r1 ,r2 ,r3       66.0% ± 9.2%    83.3% ± 17.4% 72.5% ± 16.6%    81.4% ± 14.7%
            r1 ,. . .,r4     66.0% ± 9.2%    83.3% ± 17.4% 75.0% ± 18.5%    82.1% ± 14.7%
            r1 ,. . .,r5    64.0% ± 12.0%    82.2% ± 18.7% 75.0% ± 18.5%    82.1% ± 14.7%
            r1 ,. . .,r6     58.0% ± 6.0%    81.1% ± 20.5% 75.0% ± 18.5%    82.1% ± 14.7%
            r1 ,. . .,r7    68.0% ± 13.3%    83.3% ± 15.9% 77.5% ± 14.6%    84.3% ± 13.5%
            r1 ,. . .,r8    64.0% ± 15.0%    78.9% ± 19.5% 76.3% ± 15.3%     87.1% ± 7.7%
            r1 ,. . .,r9    64.0% ± 15.0%    78.9% ± 19.5% 76.3% ± 15.3%     87.1% ± 7.7%
            r1 ,. . .,r10   64.0% ± 15.0%    78.9% ± 19.5% 76.3% ± 15.3%     87.1% ± 7.7%
            r1 ,. . .,r11   64.0% ± 15.0%    78.9% ± 19.5% 76.3% ± 15.3%     87.1% ± 7.7%
            r1 ,. . .,r12   64.0% ± 15.0%    78.9% ± 19.5% 76.3% ± 15.3%     87.1% ± 7.7%
            r1 ,. . .,r13   64.0% ± 15.0%    78.9% ± 19.5% 76.3% ± 15.3%     87.1% ± 7.7%
            r1 ,. . .,r14   64.0% ± 15.0%    78.9% ± 19.5% 76.3% ± 15.3%     87.1% ± 7.7%
                                                  237


Table B.4: Results obtained with Level 4 Wavelet Packet Transform feature extraction for
2, 2.5, 3.5 and 4.5 inch stickout cases.
            Classifier: SVM        5.08 cm (2 inch)             6.35 cm (2.5 inch)
                  Features     Test Set      Training Set    Test Set       Training Set
            r1              43.1% ± 9.9% 60.0% ± 3.9%     50.0% ± 12.9%    64.0% ± 10.2%
            r1 ,r2          69.2% ± 9.7% 84.6% ± 9.1%     75.0% ± 13.4%    84.0% ± 18.5%
            r1 ,r2 ,r3      69.2% ± 9.7% 84.6% ± 9.1%     65.0% ± 15.7%    84.0% ± 18.5%
            r1 ,. . .,r4    72.3% ± 8.6% 85.7% ± 8.6%     58.3% ± 20.1%    80.0% ± 20.0%
            r1 ,. . .,r5    88.5% ± 5.1% 94.6% ± 3.1%     61.7% ± 16.8%    76.0% ± 20.1%
            r1 ,. . .,r6    80.0% ± 3.7% 88.8% ± 9.0%     61.7% ± 21.1%    72.0% ± 25.6%
            r1 ,. . .,r7    84.6% ± 9.1% 89.2% ± 6.2%     61.7% ± 21.1%    68.0% ± 24.0%
            r1 ,. . .,r8    84.6% ± 9.1% 89.2% ± 6.2%     51.7% ± 21.7%    74.0% ± 21.1%
            r1 ,. . .,r9    84.6% ± 9.1% 89.2% ± 6.2%     51.7% ± 21.7%    74.0% ± 21.1%
            r1 ,. . .,r10   84.6% ± 9.1% 89.2% ± 6.2%     51.7% ± 21.7%    74.0% ± 21.1%
            r1 ,. . .,r11   84.6% ± 9.1% 89.2% ± 6.2%     51.7% ± 21.7%    74.0% ± 21.1%
            r1 ,. . .,r12   84.6% ± 9.1% 89.2% ± 6.2%     51.7% ± 21.7%    74.0% ± 21.1%
            r1 ,. . .,r13   84.6% ± 9.1% 89.2% ± 6.2%     51.7% ± 21.7%    74.0% ± 21.1%
            r1 ,. . .,r14   84.6% ± 9.1% 89.2% ± 6.2%     51.7% ± 21.7%    74.0% ± 21.1%
            Classifier: SVM       8.89 cm (3.5 inch)            11.43 cm (4.5 inch)
            r1              54.0% ± 25.4%   60.0% ± 20.0% 72.5% ± 15.6%    75.0% ± 13.3%
            r1 ,r2          66.0% ± 9.2%    77.8% ± 14.1% 73.8% ± 14.2%    72.1% ± 10.3%
            r1 ,r2 ,r3      66.0% ± 9.2%    77.8% ± 14.1% 65.0% ± 17.5%    69.3% ± 10.1%
            r1 ,. . .,r4    66.0% ± 9.2%    77.8% ± 14.1% 63.8% ± 15.3%    70.7% ± 14.8%
            r1 ,. . .,r5    66.0% ± 9.2%    77.8% ± 14.1% 61.3% ± 15.3%    69.3% ± 13.9%
            r1 ,. . .,r6    68.0% ± 13.3%   77.8% ± 14.1% 61.3% ± 15.3%    68.6% ± 14.0%
            r1 ,. . .,r7    68.0% ± 13.3%   77.8% ± 14.1% 60.0% ± 14.6%    70.7% ± 13.3%
            r1 ,. . .,r8    56.0% ± 15.0%   77.8% ± 14.1% 52.5% ± 10.9%    71.4% ± 10.6%
            r1 ,. . .,r9    58.0% ± 16.6%   77.8% ± 14.1% 52.5% ± 10.9%    71.4% ± 10.6%
            r1 ,. . .,r10   60.0% ± 17.9%   76.7% ± 14.4% 52.5% ± 10.9%    71.4% ± 10.6%
            r1 ,. . .,r11   60.0% ± 17.9%   76.7% ± 14.4% 52.5% ± 10.9%    71.4% ± 10.6%
            r1 ,. . .,r12   60.0% ± 17.9%   76.7% ± 14.4% 52.5% ± 10.9%    71.4% ± 10.6%
            r1 ,. . .,r13   60.0% ± 17.9%   77.8% ± 14.1% 52.5% ± 10.9%    71.4% ± 10.6%
            r1 ,. . .,r14   60.0% ± 17.9%   77.8% ± 14.1% 52.5% ± 10.9%    71.4% ± 10.6%
                                                 238


Table B.5: Results obtained with the EEMD feature extraction method for 2, 2.5, 3.5, and
4.5 inch stickout cases.
      Classifier: SVM               5.08 cm (2 inch)                6.35 cm (2.5 inch)
            Features            Test Set       Training Set      Test Set      Training Set
      r1                     74.4% ± 0.9% 74.4% ± 0.5% 78.3% ± 1.2%           78.6% ± 0.5%
      r1 ,r2                 84.2% ± 0.8% 84.2% ± 0.4% 78.3% ± 1.2%           78.6% ± 0.5%
      r1 ,r2 ,r3             84.2% ± 0.8% 84.2% ± 0.4% 78.3% ± 1.4%           78.6% ± 0.5%
      r1 ,. . .,r4           84.2% ± 0.8% 84.2% ± 0.4% 78.5% ± 1.4%           78.7% ± 0.4%
      r1 ,. . .,r5           84.2% ± 0.8% 84.2% ± 0.4% 78.5% ± 1.4%           78.7% ± 0.4%
      r1 ,. . .,r6           84.2% ± 0.8% 84.2% ± 0.4% 78.5% ± 1.2%           78.7% ± 0.5%
      r1 ,. . .,r7           84.2% ± 0.8% 84.2% ± 0.4% 78.6% ± 1.2%           78.8% ± 0.5%
      Classifier: SVM              8.89 cm (3.5 inch)              11.43 cm  (4.5 inch)
      r1                     90.7% ± 1.4%     91.1% ± 0.6%    77.1% ± 1.2%    77.1% ± 0.5%
      r1 ,r2                 90.7% ± 1.4%     91.1% ± 0.7%    77.3% ± 1.1%    77.4% ± 0.6%
      r1 ,r2 ,r3             90.6% ± 1.4%     91.0% ± 0.6%    77.4% ± 1.0%    77.5% ± 0.6%
      r1 ,. . .,r4           90.6% ± 1.4%     91.0% ± 0.6%    78.9% ± 1.1%    78.9% ± 0.7%
      r1 ,. . .,r5           90.5% ± 1.4%     90.9% ± 0.6%    78.9% ± 1.2%    79.0% ± 0.7%
      r1 ,. . .,r6           90.5% ± 1.4%     91.0% ± 0.6%    78.9% ± 1.2%    79.0% ± 0.7%
      r1 ,. . .,r7           90.5% ± 1.4%     90.8% ± 0.6%    79.1% ± 1.2%    79.0% ± 0.6%
Table B.6: Two class classification (chatter and stable) results obtained with the DTW
similarity measure method for 2, 2.5, 3.5, and 4.5 inch overhang cases.
                            5.08 cm (2 inch)                     6.35 cm (2.5 inch)
      K-NN             Test Set         Training Set         Test Set         Training Set
      1-NN         94.52% ± 1.70% 93.92% ± 0.88% 55.00% ± 10.93% 54.13% ± 6.04%
      2-NN         79.52% ± 2.24% 78.69% ± 1.11% 85.94% ± 3.76% 86.51% ± 1.91%
      3-NN         78.98% ± 2.69% 78.96% ± 1.34% 84.69% ± 6.47% 87.14% ± 3.29%
      4-NN         79.09% ± 1.88% 78.91% ± 0.93% 85.63% ± 5.96% 86.67% ± 3.03%
      5-NN         79.46% ± 4.11% 78.72% ± 2.04% 86.88% ± 3.90% 86.03% ± 1.98%
                           8.89 cm (3.5 inch)                   11.43 cm (4.5 inch)
      1-NN         89.52% ± 2.86%     89.75% ± 2.36%     67.78% ± 6.48%     68.89% ± 4.37%
      2-NN         92.86% ± 4.39%     88.75% ± 2.30%     70.93% ± 5.11%     70.09% ± 2.55%
      3-NN         87.62% ± 6.10%     91.50% ± 3.20%     65.00% ± 6.28%     73.06% ± 3.14%
      4-NN         90.95% ± 4.49%     89.75% ± 2.36%     70.56% ± 2.55%     70.28% ± 1.27%
      5-NN         89.52% ± 4.15%     90.50% ± 2.18%     70.93% ± 5.68%     70.09% ± 2.84%
                                                  239


Table B.7: Two class classification (chatter including intermediate chatter cases as chatter
and stable) results obtained with the DTW similarity measure method for 2, 2.5, 3.5, and
4.5 inch overhang cases.
                        5.08 cm (2 inch)                       6.35 cm (2.5 inch)
      K-NN        Test Set          Training Set          Test Set         Training Set
      1-NN    98.04% ± 0.83% 97.98% ± 0.39% 68.84% ± 4.79% 66.32% ± 3.41%
      2-NN    98.34% ± 1.08% 100.00% ± 0.00% 59.30% ± 6.76% 66.90% ± 3.28%
      3-NN    97.27% ± 0.69% 98.38% ± 0.27% 72.33% ± 5.14% 72.99% ± 3.82%
      4-NN    97.09% ± 1.13% 98.64% ± 0.27% 60.70% ± 5.74% 64.14% ± 3.87%
      5-NN    97.42% ± 0.97% 97.37% ± 0.60% 67.67% ± 6.28% 70.11% ± 5.04%
                       8.89 cm (3.5 inch)                     11.43 cm (4.5 inch)
      1-NN    91.96% ± 5.37%      91.25% ± 2.81% 75.34% ± 4.84%          75.30% ± 2.63%
      2-NN    91.52% ± 6.22%     100.00% ± 0.00% 74.49% ± 3.88%          98.16% ± 0.95%
      3-NN    93.04% ± 4.84%      92.27% ± 1.96% 74.92% ± 4.69%          77.39% ± 2.17%
      4-NN    92.17% ± 4.68%      93.18% ± 1.76% 74.75% ± 4.61%          78.16% ± 2.07%
      5-NN    93.04% ± 4.22%      91.14% ± 2.26% 75.68% ± 4.30%          75.94% ± 2.05%
          Table B.8: Three-class classification is applied with the DTW approach.
                        5.08 cm (2 inch)                      6.35 cm (2.5 inch)
       K-NN        Test Set         Training Set         Test Set         Training Set
       1-NN    97.65% ± 1.12% 97.07% ± 0.47%         60.47% ± 6.90%     62.47% ± 3.81%
       2-NN    97.19% ± 0.89% 97.58% ± 0.54%         65.81% ± 6.24%     97.76% ± 1.23%
       3-NN    97.14% ± 1.12% 97.85% ± 0.26%         59.30% ± 7.22%     70.59% ± 4.04%
       4-NN    95.77% ± 0.89% 97.45% ± 0.56%         71.40% ± 6.98%     78.47% ± 7.21%
       5-NN    95.20% ± 1.29% 97.35% ± 0.38%         61.16% ± 8.45%     66.00% ± 3.95%
                       8.89 cm (3.5 inch)                    11.43 cm  (4.5 inch)
       1-NN    93.64% ± 3.64% 95.00% ± 1.9% 69.66% ± 4.64%              69.57% ± 3.86%
       2-NN    93.64% ± 2.23% 94.55% ± 1.51% 75.93% ± 4.41%             83.59% ± 2.38%
       3-NN    95.45% ± 5.38% 94.55% ± 1.51% 72.20% ± 5.26%             77.69% ± 1.60%
       4-NN    89.55% ± 7.62% 93.18% ± 2.27% 73.90% ± 4.62%             76.58% ± 2.99%
       5-NN    89.55% ± 7.06% 93.64% ± 1.98% 71.36% ± 4.70%             73.42% ± 3.51%
                                              240


Table B.9: Persistence Landscape Results with Support Vector Machine (SVM) classifier.
The landscape number column represents which landscapes were used to extract features.
                                2 inch                          2.5 inch
        Landscape
                       Test Set        Training Set     Test Set       Training Set
         Number
            1       96.8% ± 1.5%      96.7% ± 0.5%   87.0% ± 4.8% 86.9% ± 2.4%
            2       86.6% ± 1.7%      91.5% ± 0.8%   87.0% ± 3.8% 86.0% ± 2.0%
            3       89.1% ± 2.7%      93.4% ± 0.5%   84.7% ± 4.7% 86.6% ± 1.9%
            4       90.5% ± 1.9%      92.9% ± 0.4%   85.8% ± 2.6% 86.1% ± 1.4%
            5       90.5% ± 2.3%      92.2% ± 0.7%   88.4% ± 1.5% 85.2% ± 0.6%
                               3.5  inch                        4.5 inch
        Landscape
                       Test Set        Training Set     Test Set       Training Set
         Number
            1       92.2% ± 5.4% 93.2% ± 1.8%        66.8% ± 5.6%     78.5% ± 3.0%
            2       78.3% ± 4.3% 85.2% ± 2.1%        65.3% ± 4.3%     69.3% ± 2.3%
            3       82.6% ± 4.3% 83.6% ± 2.0%        68.6% ± 4.8%     72.6% ± 1.5%
            4       84.3% ± 6.8% 84.3% ± 2.8%        67.6% ± 5.2%     71.8% ± 2.6%
            5       85.7% ± 4.8% 84.3% ± 3.1%        66.8% ± 5.5%     73.0% ± 3.4%
Table B.10: Persistence Landscape Results with Logistic Regression (LR) classifier. The
landscape number column represents which landscapes were used to extract features.
                                 2 inch                         2.5 inch
        Landscape
                       Test Set         Training Set    Test Set       Training Set
         Number
            1       95.7% ± 1.6%       98.3% ± 0.4%  82.6% ± 4.2%     91.3% ± 2.5%
            2       85.5% ± 1.9%       93.7% ± 1.0%  86.3% ± 4.7%     91.5% ± 2.1%
            3       88.5% ± 1.1%       93.8% ± 0.6%  84.7% ± 4.3%     92.0% ± 2.6%
            4       89.4% ± 1.6%       94.8% ± 0.8%  86.5% ± 5.0%     90.8% ± 2.4%
            5       88.3% ± 1.7%       94.7% ± 0.6%  86.0% ± 3.4%     91.1% ± 2.7%
                                3.5  inch                       4.5 inch
        Landscape
                       Test Set         Training Set    Test Set       Training Set
         Number
            1       91.3% ± 3.4%       96.4% ± 1.8%  63.1% ± 3.5%     82.0% ± 3.4%
            2       82.2% ± 6.6%       90.7% ± 2.6%  63.1% ± 3.5%     82.0% ± 3.4%
            3       87.8% ± 6.4%       91.4% ± 2.8%  64.2% ± 4.5%     84.1% ± 1.1%
            4       90.4% ± 4.3%       91.8% ± 1.5%  63.1% ± 5.5%     82.9% ± 2.4%
            5       85.7% ± 4.8%       93.4% ± 1.9%  65.9% ± 4.6%     83.0% ± 2.2%
                                              241


Table B.11: Persistence Landscape Results with Random Forest (RF) classifier. The land-
scape number column represents which landscapes were used to extract features.
                                 2 inch                           2.5 inch
       Landscape
                       Test Set         Training Set     Test Set         Training Set
        Number
            1       96.1% ± 1.7%      100.0% ± 0.0%   86.7% ± 5.6%      100.0% ± 0.0%
            2       87.7% ± 2.2%      100.0% ± 0.0%   88.6% ± 4.0%      100.0% ± 0.0%
            3       88.0% ± 1.9%      100.0% ± 0.0%   87.4% ± 2.4%      100.0% ± 0.0%
            4       88.8% ± 1.2%      100.0% ± 0.0%   86.7% ± 5.8%      100.0% ± 0.0%
            5       89.6% ± 2.2%      100.0% ± 0.0%   86.7% ± 5.8%      100.0% ± 0.0%
                                3.5  inch                         4.5  inch
       Landscape
                       Test Set         Training Set     Test Set         Training Set
        Number
            1       91.3% ± 3.9%      100.0% ± 0.0%   65.1% ± 7.7%      100.0% ± 0.0%
            2       80.9% ± 7.1%      100.0% ± 0.0%   66.9% ± 2.9%      100.0% ± 0.0%
            3       83.0% ± 6.3%      100.0% ± 0.0%   66.8% ± 4.9%      100.0% ± 0.0%
            4       80.4% ± 5.6%      100.0% ± 0.0%   63.6% ± 4.2%      100.0% ± 0.0%
            5       85.2% ± 5.9%      100.0% ± 0.0%   64.7% ± 7.1%      100.0% ± 0.0%
                Table B.12: Persistence Images results with pixel size = 0.1.
                                2 inch                           2.5 inch
        Classifier    Test Set         Training Set     Test Set         Training Set
          SVM      82.5% ± 1.4%       82.7% ± 0.6%   79.3% ± 4.2%       81.4% ± 2.9%
           LR      80.2% ± 2.0%       82.4% ± 0.9%   77.0% ± 7.0%       82.4% ± 4.5%
           RF      96.4% ± 1.1%      100.0% ± 0.0%   85.8% ± 5.3%      100.0% ± 0.0%
           GB      95.9% ± 1.5%      100.0% ± 0.0%   84.4% ± 4.4%      100.0% ± 0.0%
                               3.5  inch                         4.5  inch
        Classifier    Test Set         Training Set     Test Set         Training Set
          SVM      82.6% ± 5.5%       83.2% ± 2.5%   66.9% ± 5.8%       63.7% ± 2.9%
           LR      82.6% ± 5.1%       85.7% ± 2.0%   62.7% ± 5.9%       65.5% ± 3.3%
           RF      93.0% ± 2.9%      100.0% ± 0.0%   72.5% ± 6.4%      100.0% ± 0.0%
           GB      91.3% ± 4.3%      100.0% ± 0.0%   68.5% ± 2.8%      100.0% ± 0.0%
                                              242


            Table B.13: Persistence Images results with pixel size = 0.05.
                                 2 inch                             2.5 inch
    Classifier      Test Set           Training Set        Test Set        Training Set
       SVM       82.0% ± 2.5%         83.4% ± 1.2%      81.2% ± 5.1%      80.2% ± 2.8%
        LR       80.3% ± 2.3%         79.5% ± 1.0%      68.6% ± 7.3%      75.2% ± 3.4%
        RF       96.0% ± 1.4%        100.0% ± 0.0%      85.8% ± 4.1%     100.0% ± 0.0%
        GB       95.5% ± 1.6%        100.0% ± 0.0%      85.6% ± 3.4%     100.0% ± 0.0%
                               3.5  inch                            4.5 inch
    Classifier      Test Set           Training Set        Test Set        Training Set
       SVM       85.2% ± 4.0%         81.8% ± 2.5%      63.4% ± 5.2%      65.6% ± 2.4%
        LR       80.9% ± 5.9%         85.9% ± 3.0%      63.6% ± 4.5%      64.8% ± 2.1%
        RF       90.9% ± 3.0%        100.0% ± 0.0%      70.7% ± 3.1%     100.0% ± 0.0%
        GB       90.4% ± 4.7%        100.0% ± 0.0%      70.0% ± 5.4%     100.0% ± 0.0%
                       Table B.14: Carlsson Coordinates results.
                                 2 inch                             2.5 inch
    Classifier      Test Set           Training Set        Test Set        Training Set
       SVM       87.8% ± 2.0%         87.1% ± 1.2%      72.1% ± 7.1%      79.7% ± 3.9%
        LR       93.1% ± 1.8%         92.9% ± 0.9%      86.3% ± 6.2%     100.0% ± 0.0%
        RF       93.0% ± 2.0%        100.0% ± 0.0%      69.5% ± 4.8%      70.9% ± 2.6%
        GB       93.6% ± 1.7%        100.0% ± 0.0%      84.7% ± 5.0%     100.0% ± 0.0%
                               3.5  inch                            4.5 inch
    Classifier      Test Set           Training Set        Test Set        Training Set
       SVM       95.2% ± 4.5%         95.2% ± 1.9%      68.1% ± 6.5%      68.6% ± 2.4%
        LR       90.9% ± 9.2%         93.6% ± 1.4%      70.8% ± 4.0%      72.6% ± 3.0%
        RF       95.7% ± 4.3%        100.0% ± 0.0%      72.2% ± 4.4%     100.0% ± 0.0%
        GB       92.2% ± 4.3%        100.0% ± 0.0%      71.9% ± 3.3%     100.0% ± 0.0%
              Table B.15: Kernel Method results with LibSVM package.
 Kernel Scale (σ)                          Test Set Score and Deviation
                       2 inch             2.5 inch           3.5 inch            4.5 inch
        0.2               *                   *          30.4% ± 6.2%±               *
       0.25          74.5%±** 58.9% ± 28.5%±             87.0% ± 3.6%±       59.3% ± 9.6%±
*These results are not available due to high computational time.
**This result belongs to first iteration for 2 inch overhang case.
                                               243


Table B.16: Path signature method results obtained with SVM classifier. Landscape numbers
represent which landscapes were used to compute signatures.
                                2 inch                            2.5 inch
    Landscape
                      Test Set         Training Set      Test Set        Training Set
     Number
         1        83.0% ± 2.9%± 84.8% ± 0.9%±        82.7% ± 5.3%±      82.9% ± 2.0%±
         2        79.2% ± 3.1%± 80.2% ± 1.0%±        84.2% ± 3.5%±      80.4% ± 1.4%±
         3        78.6% ± 1.8%± 79.4% ± 0.7%±        79.1% ± 2.9%±      80.8% ± 1.1%±
         4        79.3% ± 2.5%± 79.5% ± 0.8%±        80.6% ± 5.6%±      78.6% ± 2.3%±
         5         0.0% ± 0.0%±       0.0% ± 0.0%±   80.9% ± 5.1%±      78.8% ± 1.2%±
                               3.5 inch                           4.5 inch
    Landscape
                      Test Set         Training Set      Test Set        Training Set
     Number
         1        81.2% ± 6.3%±      82.6% ± 2.2%±   70.0% ± 6.4%±      71.4% ± 2.2%±
         2        82.9% ± 7.2%±      81.8% ± 2.4%±   64.1% ± 5.4%±      67.0% ± 2.4%±
         3       81.2% ± 10.5%±      82.2% ± 3.8%±   67.7% ± 6.6%±      65.8% ± 2.3%±
         4        85.9% ± 4.7%±      80.8% ± 1.6%±   64.5% ± 4.2%±      65.4% ± 1.2%±
         5        82.4% ± 9.1%±      82.0% ± 3.1%±   64.8% ± 6.9%±      65.4% ± 2.1%±
                                             244


               Training: Turning - 5.08 cm                      Training: Turning - 5.08 cm                      Training: Turning - 5.08 cm
                 Test: Turning - 6.35 cm                          Test: Turning - 8.89 cm                         Test: Turning - 11.43 cm
          SVM         LR         RF          GB           SVM          LR          RF         GB           SVM          LR         RF          GB
 TDA-CC  0.805      0.814      0.815       0.810 TDA-CC  0.964       0.966      0.970       0.966 TDA-CC  0.703       0.669      0.673       0.661
  TDA-PI 0.460      0.362      0.568       0.593  TDA-PI 0.815       0.815      0.864       0.894  TDA-PI 0.646       0.640      0.643       0.610
 TDA-PL  0.476      0.464      0.758             TDA-PL  0.940       0.936      0.945             TDA-PL  0.643       0.644      0.611
 TDA-TF  0.446      0.344      0.693       0.710 TDA-TF  0.819       0.894      0.834       0.889 TDA-TF  0.655       0.665      0.649       0.640
    WPT  0.567      0.558      0.892       0.892    WPT  0.690       0.530      0.860       0.800    WPT  0.631       0.581      0.863       0.844
   EEMD  0.588      0.582      0.612       0.616   EEMD  0.760       0.758      0.763       0.762   EEMD  0.693       0.682      0.718       0.708
     FPA 0.742      0.767      0.850       0.892     FPA 0.750       0.590      0.920       0.920     FPA 0.431       0.456      0.769       0.738
         0.4     0.5      0.6      0.7     0.8              0.6        0.7        0.8       0.9              0.5         0.6       0.7        0.8
Figure B.3: Classification accuracies obtained from transfer learning applications for turn-
ing experiment case with 5.08 cm overhang distance. (left) Training: 5.08 cm Test: 6.35 cm
(middle) Training: 5.08 cm Test: 8.89 cm, (right) Training: 5.08 cm Test: 11.43 cm. CC:
Carlsson Coordinates, PI: Persistence Images, PL: Persistence Landscapes, TF: Template
Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode Decom-
position, FPA: FFT/PSD/ACF, SVM: Support Vector Machine, RF: Random Forest, GB:
Gradient Boosting. The results for TDA-PL implementation with Gradient Boosting classi-
fier (GB) are not available due to the large amount of time required in training and testing.
Therefore, it represents an empty box in the figure.
                                                                      245


               Training: Turning - 6.35 cm                      Training: Turning - 6.35 cm                     Training: Turning - 6.35 cm
                 Test: Turning - 5.08 cm                          Test: Turning - 8.89 cm                        Test: Turning - 11.43 cm
          SVM         LR         RF          GB            SVM         LR         RF         GB            SVM         LR         RF         GB
 TDA-CC  0.749      0.828      0.763       0.577 TDA-CC   0.843      0.913      0.826       0.645 TDA-CC  0.644      0.639      0.543       0.465
  TDA-PI 0.266      0.295      0.324       0.226  TDA-PI  0.153      0.200      0.315       0.179  TDA-PI 0.384      0.444      0.423       0.363
 TDA-PL  0.253      0.419      0.366             TDA-PL   0.187      0.366      0.315             TDA-PL  0.356      0.453      0.477
 TDA-TF  0.255      0.258      0.349       0.305 TDA-TF   0.183      0.187      0.279       0.206 TDA-TF  0.365      0.393      0.437       0.410
    WPT  0.614      0.518      0.904       0.896    WPT   0.790      0.660      0.880       0.870    WPT  0.706      0.575      0.894       0.887
   EEMD  0.914      0.892      0.531       0.587   EEMD   0.770      0.779      0.584       0.630   EEMD  0.697      0.542      0.571       0.526
     FPA 0.768      0.657      0.889       0.861     FPA  0.820      0.720      0.870       0.810     FPA 0.569      0.537      0.787       0.762
                0.4          0.6           0.8           0.2        0.4         0.6         0.8           0.4    0.5      0.6      0.7     0.8
Figure B.4: Classification accuracies obtained from transfer learning applications for turn-
ing experiment case with 6.35 cm overhang distance. (left) Training: 6.35 cm Test: 5.08 cm
(middle) Training: 6.35 cm Test: 8.89 cm, (right) Training: 6.35 cm Test: 11.43 cm. CC:
Carlsson Coordinates, PI: Persistence Images, PL: Persistence Landscapes, TF: Template
Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode Decom-
position, FPA: FFT/PSD/ACF, SVM: Support Vector Machine, RF: Random Forest, GB:
Gradient Boosting. The results for TDA-PL implementation with Gradient Boosting classi-
fier (GB) are not available due to the large amount of time required in training and testing.
Therefore, it represents an empty box in the figure.
                                                                      246


               Training: Turning - 8.89 cm                       Training: Turning - 8.89 cm                          Training: Turning - 8.89 cm
                 Test: Turning - 5.08 cm                           Test: Turning - 6.35 cm                             Test: Turning - 11.43 cm
          SVM         LR         RF         GB              SVM         LR         RF         GB                 SVM         LR         RF          GB
 TDA-CC  0.871      0.837      0.879       0.848   TDA-CC  0.805      0.793       0.811      0.808   TDA-CC     0.656      0.669      0.652       0.635
  TDA-PI 0.746      0.749      0.808       0.787    TDA-PI 0.630      0.630       0.480      0.671     TDA-PI   0.631      0.635      0.599       0.561
 TDA-PL  0.873      0.850      0.856               TDA-PL  0.476      0.359       0.359              TDA-PL     0.548      0.456      0.446
 TDA-TF  0.749      0.762      0.750       0.735   TDA-TF  0.630      0.623       0.723      0.780   TDA-TF     0.642      0.568      0.620       0.587
    WPT  0.521      0.525      0.739       0.832      WPT  0.533      0.567       0.758      0.767       WPT    0.700      0.631      0.875       0.894
   EEMD  0.899      0.903      0.908       0.867     EEMD  0.569      0.579       0.653      0.653      EEMD    0.656      0.671      0.608       0.592
     FPA 0.679      0.675      0.771       0.736       FPA 0.742      0.725       0.692      0.767        FPA   0.431      0.375      0.475       0.613
             0.6          0.7         0.8        0.9       0.4      0.5        0.6       0.7       0.8        0.4     0.5      0.6      0.7      0.8
Figure B.5: Classification accuracies obtained from transfer learning applications for turning
experiment case with 8.89 cm overhang distance . (left) Training: 8.89 cm Test: 5.08 cm
(middle) Training: 8.89 cm Test: 6.35 cm, (right) Training: 8.89 cm Test: 11.43 cm. CC:
Carlsson Coordinates, PI: Persistence Images, PL: Persistence Landscapes, TF: Template
Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode Decom-
position, FPA: FFT/PSD/ACF, SVM: Support Vector Machine, RF: Random Forest, GB:
Gradient Boosting. The results for TDA-PL implementation with Gradient Boosting clas-
sifier (GB) are unavailable due to a large amount of time required in training and testing.
Therefore, it represents an empty box in the figure.
                                                                       247


              Training: Turning - 11.43 cm                      Training: Turning - 11.43 cm                       Training: Turning - 11.43 cm
                 Test: Turning - 5.08 cm                          Test: Turning - 6.35 cm                            Test: Turning - 8.89 cm
          SVM         LR         RF        GB               SVM         LR         RF        GB                SVM         LR         RF        GB
 TDA-CC  0.757      0.770      0.767      0.787   TDA-CC   0.643      0.660      0.647      0.735  TDA-CC     0.823      0.838      0.840      0.868
  TDA-PI 0.749      0.749      0.793      0.716    TDA-PI  0.630      0.630      0.341      0.279   TDA-PI    0.815      0.815      0.804      0.660
 TDA-PL  0.830      0.523      0.807              TDA-PL   0.632      0.368      0.633             TDA-PL     0.862      0.485      0.855
 TDA-TF  0.749      0.605      0.761      0.761   TDA-TF   0.630      0.159      0.632      0.531  TDA-TF     0.815      0.719      0.815      0.821
    WPT  0.604      0.504      0.818      0.775      WPT   0.575      0.583      0.850      0.767     WPT     0.700      0.640      0.930      0.900
   EEMD  0.912      0.912      0.918      0.813     EEMD   0.611      0.608      0.619      0.598    EEMD     0.761      0.761      0.765      0.686
     FPA 0.568      0.600      0.832      0.657       FPA  0.592      0.775      0.808      0.800      FPA    0.640      0.750      0.830      0.770
               0.6        0.7        0.8      0.9         0.2         0.4           0.6        0.8         0.5      0.6        0.7        0.8      0.9
Figure B.6: Classification accuracies obtained from transfer learning applications for turning
experiment case with 11.43 cm overhang distance . (left) Training: 11.43 cm Test: 5.08 cm
(middle) Training: 11.43 cm Test: 6.35 cm, (right) Training: 11.43 cm Test: 8.89 cm. CC:
Carlsson Coordinates, PI: Persistence Images, PL: Persistence Landscapes, TF: Template
Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode Decom-
position, FPA: FFT/PSD/ACF, SVM: Support Vector Machine, RF: Random Forest, GB:
Gradient Boosting. The results for TDA-PL implementation with Gradient Boosting classi-
fier (GB) are not available due to a large amount of time required in training and testing.
Therefore, it represents an empty box in the figure.
                                                                       248


                         K=1         K=2            K=3               K=4           K=5
  Training: 5.08 cm     0.743       0.765          0.731            0.738          0.723
      Test: 6.35 cm
  Training: 5.08 cm     0.893       0.900          0.900            0.904          0.904
      Test: 8.89 cm
  Training: 5.08 cm     0.703       0.709          0.693            0.694          0.686
     Test: 11.43 cm
  Training: 6.35 cm     0.956       0.248          0.248            0.248          0.248
      Test: 5.08 cm
  Training: 6.35 cm     0.931       0.187          0.187            0.187          0.187
      Test: 8.89 cm
  Training: 6.35 cm     0.681       0.358          0.358            0.358          0.358
     Test: 11.43 cm
  Training: 8.89 cm     0.882       0.893          0.880            0.889          0.872
      Test: 5.08 cm
  Training: 8.89 cm     0.706       0.699          0.708            0.709          0.708
      Test: 6.35 cm
  Training: 8.89 cm     0.696       0.722          0.701            0.719          0.697
     Test: 11.43 cm
 Training: 11.43 cm     0.807       0.814          0.803            0.808          0.801
      Test: 5.08 cm
 Training: 11.43 cm     0.653       0.666          0.664            0.665          0.659
      Test: 6.35 cm
 Training: 11.43 cm     0.816       0.829          0.824            0.836          0.829
      Test: 8.89 cm
                    0.2       0.3  0.4       0.5        0.6       0.7       0.8        0.9
Figure B.7: Classification accuracies obtained from transfer learning applications for turning
data set using DTW approach with K = 1, 2, 3, 4, 5, where K represents the nearest neighbor
number. Overhang distances used as training and testing data sets are shown on y-axis.
                                             249


                                  Training: Milling                                                         Training: Milling
                               Test: Turning - 5.08 cm                                                   Test: Turning - 6.35 cm
                SVM             LR                 RF          GB                         SVM              LR               RF           GB
 TDA-CC         0.344          0.262             0.455        0.460         TDA-CC        0.625          0.498             0.731        0.636
  TDA-PI        0.222          0.249             0.225        0.223          TDA-PI       0.181          0.335             0.636        0.420
 TDA-PL         0.354          0.273             0.382                      TDA-PL        0.370          0.524             0.366
 TDA-TF         0.231          0.295             0.235        0.488         TDA-TF        0.218          0.599             0.280        0.505
    WPT         0.509          0.549             0.514        0.538            WPT        0.545          0.542             0.506        0.547
   EEMD         0.458          0.588             0.291        0.499           EEMD        0.504          0.523             0.519        0.528
     FPA        0.482          0.596             0.607        0.661             FPA       0.775          0.842             0.675        0.542
            0.25    0.30 0.35     0.40     0.45    0.50 0.55  0.60     0.65           0.2       0.3    0.4        0.5        0.6    0.7       0.8
                                  Training: Milling                                                         Training: Milling
                               Test: Turning - 8.89 cm                                                  Test: Turning - 11.43 cm
                SVM             LR                 RF          GB                         SVM              LR               RF           GB
 TDA-CC         0.209          0.200             0.347        0.470         TDA-CC        0.358          0.360             0.372        0.465
  TDA-PI        0.185          0.185             0.185        0.185          TDA-PI       0.356          0.356             0.356        0.356
 TDA-PL         0.206          0.196             0.306                      TDA-PL        0.356          0.350             0.356
 TDA-TF         0.185          0.262             0.185        0.485         TDA-TF        0.356          0.374             0.356        0.496
    WPT         0.584          0.549             0.451        0.664            WPT        0.542          0.530             0.470        0.639
   EEMD         0.511          0.589             0.441        0.629           EEMD        0.515          0.523             0.459        0.542
     FPA        0.840          0.860             0.680        0.510             FPA       0.656          0.681             0.650        0.600
         0.2        0.3     0.4        0.5        0.6     0.7      0.8             0.35      0.40   0.45       0.50       0.55   0.60      0.65
Figure B.8: Classification accuracies obtained from transfer learning applications when the
milling data set is used as the training set and the turning data set is used as the test
set. CC: Carlsson Coordinates, PI: Persistence Images, PL: Persistence Landscapes, TF:
Template Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode
Decomposition, FPA: FFT/PSD/ACF, SVM: Support Vector Machine, RF: Random Forest,
GB: Gradient Boosting. The results for TDA-PL implementation with Gradient Boosting
classifier (GB) are not available due to the large amount of time required in training and
testing. Therefore, it represents an empty box in the figure.
                                                                         250


                          Training: Turning - 5.08 cm                                                       Training: Turning - 6.35 cm
                                  Test: Milling                                                                     Test: Milling
             SVM              LR                RF                GB                         SVM               LR                 RF              GB
 TDA-CC      0.523           0.512             0.535             0.541       TDA-CC          0.554            0.586              0.574           0.581
  TDA-PI     0.401           0.403             0.482             0.487        TDA-PI         0.536            0.519              0.498           0.555
 TDA-PL      0.495           0.540             0.488                          TDA-PL         0.525            0.509              0.474
 TDA-TF      0.552           0.538             0.538             0.557        TDA-TF         0.543            0.535              0.510           0.498
    WPT      0.523           0.544             0.521             0.508          WPT          0.538            0.510              0.531           0.531
   EEMD      0.598           0.599             0.552             0.552         EEMD          0.609            0.596              0.561           0.557
     FPA     0.621           0.551             0.492             0.608           FPA         0.664            0.538              0.567           0.569
            0.425   0.450 0.475    0.500  0.525    0.550   0.575   0.600            0.475     0.500   0.525     0.550     0.575     0.600  0.625     0.650
                          Training: Turning - 8.89 cm                                                      Training: Turning - 11.43 cm
                                  Test: Milling                                                                     Test: Milling
             SVM              LR                RF                GB                         SVM               LR                 RF              GB
 TDA-CC      0.490           0.482             0.512             0.529       TDA-CC          0.460            0.456              0.468           0.504
  TDA-PI     0.448           0.449             0.505             0.499        TDA-PI         0.448            0.448              0.435           0.445
 TDA-PL      0.496           0.520             0.534                          TDA-PL         0.448            0.470              0.444
 TDA-TF      0.448           0.478             0.501             0.510        TDA-TF         0.448            0.430              0.475           0.457
    WPT      0.531           0.541             0.503             0.528          WPT          0.562            0.487              0.531           0.536
   EEMD      0.605           0.596             0.552             0.552         EEMD          0.598            0.598              0.552           0.554
     FPA     0.518           0.497             0.505             0.503           FPA         0.469            0.495              0.505           0.497
          0.46     0.48   0.50      0.52     0.54     0.56     0.58     0.60            0.44     0.46    0.48     0.50     0.52     0.54  0.56     0.58
Figure B.9: Classification accuracies obtained from transfer learning applications when the
milling data set is used as the test set, and the turning data set is used as the training
set. CC: Carlsson Coordinates, PI: Persistence Images, PL: Persistence Landscapes, TF:
Template Functions, WPT: Wavelet Packet Transform, EEMD: Ensemble Empirical Mode
Decomposition, FPA: FFT/PSD/ACF, SVM: Support Vector Machine, RF: Random Forest,
GB: Gradient Boosting. The results for TDA-PL implementation with Gradient Boosting
classifier (GB) are not available due to the large amount of time required in training and
testing. Therefore, it represents an empty box in the figure.
                                                                           251


                          K=1      K=2             K=3              K=4             K=5
          Train: Milling 0.958    0.863           0.866            0.816           0.949
   Test: OD. - 5.08 cm
          Train: Milling 0.814    0.822           0.808            0.809           0.803
   Test: OD. - 6.35 cm
          Train: Milling 0.931    0.789           0.804            0.731           0.909
   Test: OD. - 8.89 cm
          Train: Milling 0.734    0.715           0.724            0.653           0.703
  Test: OD. - 11.43 cm
  Train: OD. - 5.08 cm   0.568    0.611           0.564            0.599           0.557
           Test: Milling
  Train: OD. - 6.35 cm   0.559    0.551           0.551            0.551           0.551
           Test: Milling
  Train: OD. - 8.89 cm   0.449    0.450           0.449            0.449           0.449
           Test: Milling
 Train: OD. - 11.43 cm   0.449    0.449           0.449            0.449           0.449
           Test: Milling
                           0.5     0.6             0.7             0.8             0.9
Figure B.10: Classification accuracies obtained from transfer learning applications between
turning and milling data sets using the DTW approach with K = 1, 2, 3, 4, 5, where K
represents the nearest neighbor number. Overhang distances (OD.) used as training or
testing data sets are shown on y-axis.
Figure B.11: The highest F1-score out of four different classifiers (or out of selected num-
bers of nearest neighbor for DTW) for each approach used in transfer learning applications
between overhang distance cases of the turning and milling experiments.
                                            252


Figure B.12: F1 scores obtained from the selected methods when we train and test between
the overhang distance cases of the turning data set and the milling data set. The selected
methods that give the highest accuracy are represented with ’◦’ bar hatch, and the ones that
are in the error band of the highest accuracy are shown with ‘/’ bar hatch. One approach is
selected from each category of the methods, and these are FPA, TDA-CC, and DTW.
                                             253


                                       APPENDIX C
 EXPERIMENTAL PROCEDURE FOR ENGINEERING SURFACE SCANS
    Experimental surfaces scans are collected from the standard S-22 Microfinish Compara-
tor. The roughness heights and the machining type of 9 selected surfaces are provided in
Table C.1.
              Table C.1: Selected surfaces under various machining conditions.
              Machining Type         Roughness height (micrometers(microinches))
                 M - milling         3.175 (125) 6.35(250)           12.7(500)
                 P - profiled        3.175 (125) 6.35(250)           12.7(500)
           ST - shaped or turned 3.175 (125) 6.35(250)               12.7(500)
    For each selected sample, the upper left corner area was selected for consistent scanning.
The selected area is 5 mm × 5 m. The surface texture was measured using Keyence digital
microscope (VHX6000) after placing the comparator on a free-angle XYZ motorized obser-
vation system (VHX-S650E). The setup for scanning is also provided in Figure C.2a. x500
magnification is selected to obtain enough spatial resolution (0.42 µm), and surface texture is
obtained using the zoom lens Keyence VH-Z500R, RZ x500-x5000. The stitching technique
was used to scan the designated area, and the procedure was completed by taking 11 × 11
                              Figure C.1: Scan sample of 125M.
                                              254


(a) Scanned portions of the sample and the digital
microscope.                                            (b) The scanned surface textures.
Figure C.2: The microscope used for experimental data collection and the sample surfaces.
scans in horizontal and vertical directions. The spatial sampling rate for the scans is around
2.4 samples per µm. The resulting surface scans are shown in Figure C.2b.
                                                255


                                               APPENDIX D
         CLASSIFICATION RESULTS FOR TOOL WEAR ANALYSIS
                                           The last ∆T seconds used in analysis
                      ∆T = 5         ∆T = 10            ∆T = 15               ∆T = 20         ∆T = 25
                      0.495           0.582                0.577              0.868            0.473
              SVM    ± 0.174         ± 0.261              ± 0.283            ± 0.242          ± 0.150
                      0.723           0.668                0.691              0.700            0.709
              LR     ± 0.072         ± 0.027              ± 0.023            ± 0.033          ± 0.053
                      0.932           0.873                0.973              1.000            0.955
              RF     ± 0.086         ± 0.068              ± 0.044            ± 0.000          ± 0.043
                      1.000           1.000                1.000              1.000            1.000
              GB     ± 0.000         ± 0.000              ± 0.000            ± 0.000          ± 0.000
                    0.5        0.6                  0.7                0.8              0.9             1.0
Figure D.1: Training set classification scores obtained from the DWT approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the data
was performed with respect to flank wear measurements.
                                                          256


                                                   The last ∆T seconds used in analysis
                       ∆T = 5                ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                        0.741                 0.632              0.550               0.650              0.659
              SVM      ± 0.204               ± 0.127            ± 0.027             ± 0.182            ± 0.121
                        0.841                 0.718              0.700               0.750              0.805
              LR       ± 0.025               ± 0.031            ± 0.017             ± 0.048            ± 0.040
                        0.950                 0.727              0.705               0.773              0.900
              RF       ± 0.100               ± 0.139            ± 0.032             ± 0.118            ± 0.126
                        1.000                 1.000              1.000               1.000              1.000
              GB       ± 0.000               ± 0.000            ± 0.000             ± 0.000            ± 0.000
                0.55     0.60         0.65        0.70       0.75       0.80       0.85         0.90    0.95     1.00
Figure D.2: Training set classification scores obtained from the DWT approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the data
was performed with respect to crater wear measurements.
                                                   The last ∆T seconds used in analysis
                       ∆T = 5                ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                        0.691                 0.591              0.650               0.845              1.000
              SVM      ± 0.130               ± 0.215            ± 0.287             ± 0.229            ± 0.000
                        0.645                 0.673              0.618               0.577              0.532
              LR       ± 0.027               ± 0.055            ± 0.071             ± 0.031            ± 0.031
                        0.936                 0.909              1.000               0.868              0.936
              RF       ± 0.079               ± 0.077            ± 0.000             ± 0.071            ± 0.060
                        1.000                 1.000              1.000               1.000              1.000
              GB       ± 0.000               ± 0.000            ± 0.000             ± 0.000            ± 0.000
                                0.6                    0.7                 0.8                   0.9              1.0
Figure D.3: Training set classification scores obtained from the EEMD approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling the of data
was performed with respect to flank wear measurements.
                                                                 257


                                                   The last ∆T seconds used in analysis
                          ∆T = 5             ∆T = 10            ∆T = 15               ∆T = 20                ∆T = 25
                           0.782              0.577                  0.695           0.823                0.609
              SVM         ± 0.219            ± 0.159                ± 0.199         ± 0.208              ± 0.094
                           0.691              0.755                  0.764           0.691                0.705
              LR          ± 0.027            ± 0.066                ± 0.018         ± 0.037              ± 0.048
                           0.905              0.941                  0.891           0.782                0.768
              RF          ± 0.119            ± 0.086                ± 0.092         ± 0.126              ± 0.121
                           1.000              1.000                  1.000           1.000                1.000
              GB          ± 0.000            ± 0.000                ± 0.000         ± 0.000              ± 0.000
                     0.60           0.65      0.70           0.75       0.80     0.85           0.90         0.95      1.00
Figure D.4: Training set classification scores obtained from the EEMD approach. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the data
was performed with respect to crater wear measurements.
                                                   The last ∆T seconds used in analysis
                          ∆T = 5             ∆T = 10            ∆T = 15               ∆T = 20                ∆T = 25
                           0.605              0.545                  0.586           0.377                    0.455
              SVM         ± 0.098            ± 0.221                ± 0.344         ± 0.112                  ± 0.137
                           0.632              0.523                  0.532           0.559                    0.545
              LR          ± 0.044            ± 0.032                ± 0.053         ± 0.034                  ± 0.045
                           0.950              0.927                  0.814           0.786                    0.955
              RF          ± 0.100            ± 0.092                ± 0.136         ± 0.108                  ± 0.091
                           1.000              1.000                  1.000           1.000                    1.000
              GB          ± 0.000            ± 0.000                ± 0.000         ± 0.000                  ± 0.000
                    0.4                0.5             0.6              0.7             0.8            0.9              1.0
Figure D.5: Training set classification scores obtained from the Carlsson Coordinates ap-
proach. Results are provided for four different classification algorithms: SVM: Support
Vector Machine, LR: Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The
labeling of the data was performed with respect to flank wear measurements.
                                                                    258


                                                The last ∆T seconds used in analysis
                        ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20            ∆T = 25
                        0.895              0.505                 0.791               0.505             0.564
              SVM      ± 0.209            ± 0.060               ± 0.215             ± 0.046           ± 0.009
                        0.636              0.623                 0.655               0.645             0.605
              LR       ± 0.038            ± 0.045               ± 0.030             ± 0.051           ± 0.027
                        0.759              0.814                 0.886               0.750             0.686
              RF       ± 0.122            ± 0.153               ± 0.124             ± 0.127           ± 0.101
                        1.000              1.000                 1.000               1.000             1.000
              GB       ± 0.000            ± 0.000               ± 0.000             ± 0.000           ± 0.000
                                    0.6               0.7                     0.8             0.9                 1.0
Figure D.6: Training set classification scores obtained from the Carlsson Coordinates ap-
proach. Results are provided for four different classification algorithms: SVM: Support
Vector Machine, LR: Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The
labeling of the data was performed with respect to crater wear measurements.
                                                The last ∆T seconds used in analysis
                        ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20             ∆T = 25
                        0.455              0.345                 0.418               0.382             0.309
              SVM      ± 0.057            ± 0.134               ± 0.136             ± 0.156           ± 0.093
              LR
                        0.455              0.400                 0.491               0.345             0.364
                       ± 0.057            ± 0.045               ± 0.148             ± 0.167           ± 0.100
                        0.382              0.345                 0.400               0.364             0.291
              RF
                       ± 0.068            ± 0.089               ± 0.123             ± 0.129           ± 0.089
                        0.327              0.455                 0.436               0.364             0.382
              GB
                       ± 0.123            ± 0.115               ± 0.106             ± 0.057           ± 0.068
                    0.300        0.325     0.350        0.375         0.400         0.425     0.450       0.475
Figure D.7: The test set classification scores obtained with the persistence images. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on flank wear measurements.
                                                                259


                                          The last ∆T seconds used in analysis
                    ∆T = 5          ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                     0.618           0.827              0.518               0.577              0.759
              SVM   ± 0.172         ± 0.135            ± 0.253             ± 0.124            ± 0.295
              LR
                     0.814           0.718              0.736               0.750              0.705
                    ± 0.030         ± 0.034            ± 0.065             ± 0.032            ± 0.059
                     0.982           0.950              1.000               0.932              0.927
              RF
                    ± 0.036         ± 0.055            ± 0.000             ± 0.086            ± 0.063
                     1.000           1.000              1.000               1.000              1.000
              GB
                    ± 0.000         ± 0.000            ± 0.000             ± 0.000            ± 0.000
                              0.6               0.7                0.8                  0.9             1.0
Figure D.8: Training set classification scores obtained from the persistence images. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the data
was performed with respect to flank wear measurements.
                                          The last ∆T seconds used in analysis
                    ∆T = 5          ∆T = 10            ∆T = 15               ∆T = 20          ∆T = 25
                     0.623           0.659              0.668               0.545              0.782
              SVM   ± 0.192         ± 0.144            ± 0.194             ± 0.029            ± 0.188
              LR
                     0.827           0.800              0.836               0.750              0.764
                    ± 0.037         ± 0.027            ± 0.039             ± 0.038            ± 0.034
                     0.709           0.827              0.886               0.732              0.791
              RF
                    ± 0.027         ± 0.136            ± 0.093             ± 0.027            ± 0.110
                     1.000           1.000              1.000               1.000              1.000
              GB
                    ± 0.000         ± 0.000            ± 0.000             ± 0.000            ± 0.000
                       0.6                0.7                    0.8                   0.9              1.0
Figure D.9: Training set classification scores obtained from the persistence images. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the data
was performed with respect to crater wear measurements.
                                                        260


                                               The last ∆T seconds used in analysis
                       ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20               ∆T = 25
                        0.345             0.364               0.273                   0.273             0.382
              SVM      ± 0.134           ± 0.115             ± 0.100                 ± 0.081           ± 0.106
                        0.364             0.364               0.309                   0.418             0.273
              LR       ± 0.129           ± 0.057             ± 0.148                 ± 0.123           ± 0.100
                        0.327             0.327               0.255                   0.364             0.273
              RF       ± 0.093           ± 0.045             ± 0.134                 ± 0.100           ± 0.129
                        0.382             0.273               0.236                   0.382             0.200
              GB       ± 0.068           ± 0.057             ± 0.136                 ± 0.068           ± 0.068
               0.200     0.225   0.250           0.275     0.300       0.325         0.350     0.375      0.400
Figure D.10: The test set classification scores obtained with the template functions. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the time
series used in the analysis is done based on flank wear measurements.
                                               The last ∆T seconds used in analysis
                       ∆T = 5            ∆T = 10            ∆T = 15               ∆T = 20              ∆T = 25
                        0.432         0.677                  0.595                0.677                 0.759
              SVM      ± 0.107       ± 0.164                ± 0.220              ± 0.335               ± 0.295
                        0.586         0.500                  0.482                0.491                 0.518
              LR       ± 0.053       ± 0.038                ± 0.044              ± 0.047               ± 0.042
                        0.818         0.773                  0.936                0.936                 0.882
              RF       ± 0.105       ± 0.115                ± 0.106              ± 0.079               ± 0.099
                        1.000         1.000                  1.000                1.000                 1.000
              GB       ± 0.000       ± 0.000                ± 0.000              ± 0.000               ± 0.000
                          0.5              0.6              0.7                0.8               0.9              1.0
Figure D.11: Training set classification scores obtained from the template functions. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the data
was performed with respect to flank wear measurements.
                                                             261


                                      The last ∆T seconds used in analysis
                    ∆T = 5      ∆T = 10            ∆T = 15               ∆T = 20         ∆T = 25
                     0.659       0.532              0.632               0.686             0.668
              SVM   ± 0.157     ± 0.023            ± 0.187             ± 0.101           ± 0.123
                     0.627       0.614              0.600               0.627             0.645
              LR    ± 0.047     ± 0.020            ± 0.018             ± 0.018           ± 0.037
                     0.855       0.732              0.718               0.873             0.745
              RF    ± 0.110     ± 0.135            ± 0.109             ± 0.147           ± 0.131
                     1.000       1.000              1.000               1.000             1.000
              GB    ± 0.000     ± 0.000            ± 0.000             ± 0.000           ± 0.000
                          0.6             0.7                0.8                   0.9             1.0
Figure D.12: Training set classification scores obtained from the template functions. Results
are provided for four different classification algorithms: SVM: Support Vector Machine, LR:
Logistic Regression, RF: Random Forest, GB: Gradient Boosting. The labeling of the data
was performed with respect to crater wear measurements.
                                                    262


BIBLIOGRAPHY
      263


                                   BIBLIOGRAPHY
[1]  H. Doraiswamy, N. Shivashankar, V. Natarajan, and Y. Wang, “Topological saliency,”
     Computers & Graphics, vol. 37, no. 7, pp. 787–799, 2013.
[2]  F. A. Khasawneh and B. P. Mann, “A spectral element approach for the stability of
     delay systems,” International Journal for Numerical Methods in Engineering, vol. 87,
     no. 6, pp. 566–592, 2011.
[3]  M. C. Yesilli, S. Tymochko, F. A. Khasawneh, and E. Munch, “Chatter diagnosis in
     milling using supervised learning and topological features vector,” in 2019 18th IEEE
     International Conference On Machine Learning And Applications (ICMLA), pp. 1211–
     1218, IEEE, 2019.
[4]  T. Thaler, P. Potočnik, I. Bric, and E. Govekar, “Chatter detection in band sawing
     based on discriminant analysis of sound features,” Applied Acoustics, vol. 77, pp. 114–
     121, 2014.
[5]  G. S. Chen and Q. Z. Zheng, “Online chatter detection of the end milling based on
     wavelet packet transform and support vector machine recursive feature elimination,”
     The International Journal of Advanced Manufacturing Technology, vol. 95, no. 1-4,
     pp. 775–784, 2017.
[6]  Y. Ji, X. Wang, Z. Liu, H. Wang, L. Jiao, D. Wang, and S. Leng, “Early milling
     chatter identification by improved empirical mode decomposition and multi-indicator
     synthetic evaluation,” Journal of Sound and Vibration, vol. 433, pp. 138–159, 2018.
[7]  X. Li, G. Ouyang, and Z. Liang, “Complexity measure of motor current signals for tool
     flute breakage detection in end milling,” International Journal of Machine Tools and
     Manufacture, vol. 48, no. 3-4, pp. 371–379, 2008.
[8]  H. Liu, Q. Chen, B. Li, X. Mao, K. Mao, and F. Peng, “On-line chatter detection using
     servo motor current signal in turning,” Science China Technological Sciences, vol. 54,
     no. 12, pp. 3119–3129, 2011.
[9]  M. C. Yesilli, F. A. Khasawneh, and A. Otto, “On transfer learning for chatter detection
     in turning using wavelet packet transform and ensemble empirical mode decomposi-
     tion,” CIRP Journal of Manufacturing Science and Technology, vol. 28, pp. 118–135,
     2020.
[10] M. C. Yesilli, F. A. Khasawneh, and A. Otto, “Topological feature vectors for chatter
     detection in turning processes,” The International Journal of Advanced Manufacturing
     Technology, vol. 119, no. 9-10, pp. 5687–5713, 2022.
[11] M. C. Yesilli, F. A. Khasawneh, and A. Otto, “Chatter detection in turning using
     machine learning and similarity measures of time series via dynamic time warping,”
     Journal of Manufacturing Processes, vol. 77, pp. 190–206, 2022.
                                            264


[12] M. C. Yesilli and F. A. Khasawneh, “Data driven model identification for a chaotic
     pendulum with variable interaction potential,” in Volume 2: 16th International Con-
     ference on Multibody Systems, Nonlinear Dynamics, and Control, American Society of
     Mechanical Engineers, aug 2020.
[13] A. Bruzzone, H. Costa, P. Lonardo, and D. Lucca, “Advances in engineered surfaces
     for functional performance,” CIRP Annals, vol. 57, no. 2, pp. 750–769, 2008.
[14] T. D. B. Jacobs, T. Junge, and L. Pastewka, “Quantitative characterization of surface
     topography using spectral analysis,” Surface Topography: Metrology and Properties,
     vol. 5, no. 1, p. 013001, 2017.
[15] A. Majumdar and C. Tien, “Fractal characterization and simulation of rough surfaces,”
     Wear, vol. 136, no. 2, pp. 313–327, 1990.
[16] M. Hasegawa, J. Liu, K. Okuda, and M. Nunobiki, “Calculation of the fractal dimen-
     sions of machined surface profiles,” Wear, vol. 192, no. 1-2, pp. 40–45, 1996.
[17] A. Wang, C. Yang, and X. Yuan, “Evaluation of the wavelet transform method for ma-
     chined surface topography i: methodology validation,” Tribology International, vol. 36,
     no. 7, pp. 517–526, 2003.
[18] A. Wang, C. Yang, and X. Yuan, “Evaluation of the wavelet transform method for ma-
     chined surface topography 2: fractal characteristic analysis,” Tribology International,
     vol. 36, no. 7, pp. 527–535, 2003.
[19] G. Petropoulos, W. Bouzid, C. Pandazaras, and D. Dramalist, “Fractal geometry of
     metal surfaces obtained by turning,” Materials Technology, vol. 21, no. 3, pp. 163–169,
     2006.
[20] X. Zuo, H. Zhu, Y. Zhou, and J. Yang, “Estimation of fractal dimension and surface
     roughness based on material characteristics and cutting conditions in the end milling of
     carbon steels,” Proceedings of the Institution of Mechanical Engineers, Part B: Journal
     of Engineering Manufacture, vol. 231, no. 8, pp. 1423–1437, 2015.
[21] Z. Chen, Y. Liu, and P. Zhou, “A comparative study of fractal dimension calculation
     methods for rough surface profiles,” Chaos, Solitons & Fractals, vol. 112, pp. 24–30,
     2018.
[22] F. W. Taylor, “On the art of cutting metals,” Transactions of ASME, vol. 43, pp. 31–
     350, 1907.
[23] Y. Altintas and M. Weck, “Chatter stability of metal cutting and grinding,” CIRP
     Annals, vol. 53, pp. 619 – 642, 2004.
[24] G. Quintana and J. Ciurana, “Chatter in machining processes: A review,” International
     Journal of Machine Tools and Manufacture, vol. 51, no. 5, pp. 363–376, 2011.
                                             265


[25] J. Munoa, X. Beudaert, Z. Dombovari, Y. Altintas, E. Budak, C. Brecher, and
     G. Stepan, “Chatter suppression techniques in metal cutting,” CIRP Annals, vol. 65,
     no. 2, pp. 785–808, 2016.
[26] Y. Altintas, Manufacturing Automation: Metal Cutting Mechanics, Machine Tool Vi-
     brations, and CNC Design. New York: Cambridge University Press, 2012.
[27] A. Otto, S. Rauh, M. Kolouch, and G. Radons, “Extension of tlusty’s law for the
     identification of chatter stability lobes in multi-dimensional cutting processes,” Int. J.
     Mach. Tools Manuf., vol. 82–83, pp. 50 – 58, 2014.
[28] S. Smith and J. Tlusty, “Stabilizing chatter by automatic spindle speed regulation,”
     {CIRP} Annals - Manufacturing Technology, vol. 41, no. 1, pp. 433 – 436, 1992.
[29] Y. Altintas and P. K. Chan, “In-process detection and suppression of chatter in milling,”
     International Journal of Machine Tools and Manufacture, vol. 32, no. 3, pp. 329 – 347,
     1992.
[30] T. Choi and Y. C. Shin, “On-line chatter detection using wavelet-based parameter
     estimation,” Journal of Manufacturing Science and Engineering, vol. 125, no. 1, pp. 21–
     28, 2003.
[31] E. Kuljanic, G. Totis, and M. Sortino, “Development of an intelligent multisensor
     chatter detection system in milling,” Mechanical Systems and Signal Processing, vol. 23,
     no. 5, pp. 1704 – 1718, 2009.
[32] Z. Yao, D. Mei, and Z. Chen, “On-line chatter detection and identification based on
     wavelet and support vector machine,” Journal of Materials Processing Technology,
     vol. 210, no. 5, pp. 713 – 719, 2010.
[33] J. Elias and V. Narayanan Namboothiri, “Cross-recurrence plot quantification anal-
     ysis of input and output signals for the detection of chatter in turning,” Nonlinear
     Dynamics, vol. 76, no. 1, pp. 255–261, 2014.
[34] J. Tlusty and G. Andrews, “A critical review of sensors for unmanned machining,”
     {CIRP} Annals - Manufacturing Technology, vol. 32, no. 2, pp. 563 – 572, 1983.
[35] T. Delio, J. Tlusty, and S. Smith, “Use of audio signals for chatter detection and
     control,” Journal of Manufacturing Science and Engineering, vol. 114, no. 2, pp. 146–
     157, 1992.
[36] J. Gradisek, E. Govekar, and I. Grabec, “Using coarse-grained entropy rate to detect
     chatter in cutting,” Journal of Sound and Vibration, vol. 214, no. 5, pp. 941–952, 1998.
[37] T. L. Schmitz, K. Medicus, and B. Dutterer, “Exploring once-per-revolution audio
     signal variance as a chatter indicator,” Machining Science and Technology, vol. 6,
     no. 2, pp. 215–233, 2002.
                                              266


[38] I. Bediaga, J. Muñoa, J. Hernández, and L. L. de Lacalle, “An automatic spindle speed
     selection strategy to obtain stability in high-speed milling,” Int. J. Mach. Tools Manuf.,
     vol. 49, no. 5, pp. 384 – 394, 2009.
[39] N. D. Sims, “Dynamics diagnostics: Methods, equipment and analysis tools,” in Ma-
     chining Dynamics (K. Cheng, ed.), Springer Series in Advanced Manufacturing, pp. 85–
     115, Springer London, 2009.
[40] U. Nair, B. Krishna, V. Namboothiri, and V. Nampoori, “Permutation entropy based
     real-time chatter detection using audio signal in turning process,” The International
     Journal of Advanced Manufacturing Technology, vol. 46, no. 1-4, pp. 61–68, 2010.
[41] N. J. M. van Dijk, E. J. J. Doppenberg, R. P. H. Faassen, N. van de Wouw, J. A. J. Oost-
     erling, and H. Nijmeijer, “Automatic in-process chatter avoidance in the high-speed
     milling process,” Journal of Dynamic Systems, Measurement, and Control, vol. 132,
     no. 3, p. 031006, 2010.
[42] N.-C. Tsai, D.-C. Chen, and R.-M. Lee, “Chatter prevention for milling process by
     acoustic signal feedback,” The International Journal of Advanced Manufacturing Tech-
     nology, vol. 47, no. 9-12, pp. 1013–1021, 2010.
[43] Y. Kakinuma, Y. Sudo, and T. Aoyama, “Detection of chatter vibration in end milling
     applying disturbance observer,” {CIRP} Annals - Manufacturing Technology, vol. 60,
     no. 1, pp. 109 – 112, 2011.
[44] L. Ma, S. N. Melkote, and J. B. Castle, “A model-based computationally efficient
     method for on-line detection of chatter in milling,” Journal of Manufacturing Science
     and Engineering, vol. 135, no. 3, p. 031007, 2013.
[45] Y. Chen, H. Li, L. Hou, J. Wang, and X. Bu, “An intelligent chatter detection method
     based on EEMD and feature selection with multi-channel vibration signals,” Measure-
     ment, vol. 127, pp. 356–365, 2018.
[46] Y. Wang, Q. Bo, H. Liu, L. Hu, and H. Zhang, “Mirror milling chatter identifica-
     tion using q-factor and SVM,” The International Journal of Advanced Manufacturing
     Technology, vol. 98, no. 5-8, pp. 1163–1177, 2018.
[47] S. Saravanamurugan, S. Thiyagu, N. Sakthivel, and B. B. Nair, “Chatter prediction in
     boring process using machine learning technique,” International Journal of Manufac-
     turing Research, vol. 12, no. 4, p. 405, 2017.
[48] Z. Han, H. Jin, D. Han, and H. Fu, “ESPRIT- and HMM-based real-time monitoring
     and suppression of machining chatter in smart CNC milling system,” The International
     Journal of Advanced Manufacturing Technology, vol. 89, no. 9-12, pp. 2731–2746, 2016.
[49] F.-Y. Xie, Y.-M. Hu, B. Wu, and Y. Wang, “A generalized hidden markov model and
     its applications in recognition of cutting states,” International Journal of Precision
     Engineering and Manufacturing, vol. 17, no. 11, pp. 1471–1482, 2016.
                                               267


[50] L. Ding, Y. Sun, and Z. Xiong, “Early chatter detection based on logistic regression
     with time and frequency domain features,” in 2017 IEEE International Conference on
     Advanced Intelligent Mechatronics (AIM), pp. 1052–1057, IEEE, 2017.
[51] S. Qian, Y. Sun, and Z. Xiong, “Intelligent chatter detection based on wavelet packet
     node energy and LSSVM-RFE,” in 2015 IEEE International Conference on Advanced
     Intelligent Mechatronics (AIM), IEEE, 2015.
[52] X. Li, Z. H. Yao, and Z. C. Chen, “An effective EMD-based feature extraction method
     for boring chatter recognition,” Applied Mechanics and Materials, vol. 34-35, pp. 1058–
     1063, 2010.
[53] H. Cao, Y. Lei, and Z. He, “Chatter identification in end milling process using wavelet
     packets and Hilbert – Huang transform,” International Journal of Machine Tools and
     Manufacture, vol. 69, pp. 11–19, 2013.
[54] M. Lamraoui, M. Barakat, M. Thomas, and M. E. Badaoui, “Chatter detection in
     milling machines by neural network classification and feature selection,” Journal of
     Vibration and Control, vol. 21, no. 7, pp. 1251–1266, 2013.
[55] M. C. Yesilli and F. A. Khasawneh, “On transfer learning of traditional frequency and
     time domain features in turning,” in Volume 2: Manufacturing Processes; Manufac-
     turing Systems; Nano/Micro/Meso Manufacturing; Quality and Reliability, American
     Society of Mechanical Engineers, 2020.
[56] M.-K. Liu, M.-Q. Tran, C. Chung, and Y.-W. Qui, “Hybrid model- and signal-based
     chatter detection in the milling process,” Journal of Mechanical Science and Technol-
     ogy, vol. 34, no. 1, pp. 1–10, 2020.
[57] C. Liu, L. Zhu, and C. Ni, “Chatter detection in milling process based on VMD and
     energy entropy,” Mechanical Systems and Signal Processing, vol. 105, pp. 169–182,
     2018.
[58] Y. S. Tarng and M. C. Chen, “An intelligent sensor for detection of milling chatter,”
     Journal of Intelligent Manufacturing, vol. 5, no. 3, pp. 193–200, 1994.
[59] S. Tangjitsitcharoen, “Advance in detection system to improve the stability and capa-
     bility of CNC turning process,” Journal of Intelligent Manufacturing, vol. 22, no. 6,
     pp. 843–852, 2009.
[60] Y. Fu, Y. Zhang, H. Gao, T. Mao, H. Zhou, R. Sun, and D. Li, “Automatic feature con-
     structing from vibration signals for machining state monitoring,” Journal of Intelligent
     Manufacturing, vol. 30, no. 3, pp. 995–1008, 2017.
[61] Cherukuri, Perez-Bernabeu, Selles, and Schmitz, “Machining chatter prediction using
     a data learning model,” Journal of Manufacturing and Materials Processing, vol. 3,
     no. 2, p. 45, 2019.
                                            268


[62] R. Ghrist, “Barcodes: The persistent topology of data,” Builletin of the American
     Mathematical Society, vol. 45, pp. 61–75, 2008. Survey.
[63] G. Carlsson, “Topology and data,” Bulletin of the American Mathematical Society,
     vol. 46, no. 2, pp. 255–308, 2009. Survey.
[64] H. Edelsbrunner and J. L. Harer, Computational topology: an introduction. American
     Mathematical Society, 2009.
[65] S. Y. Oudot, Persistence theory: from quiver representations to data analysis, vol. 209
     of AMS Mathematical Surveys and Monographs. American Mathematical Society, 2015.
[66] M. Robinson, Topological Signal Processing. Springer-Verlag Berlin Heidelberg, 1 ed.,
     2014.
[67] F. A. Khasawneh and E. Munch, “Stability of a stochastic turning model using persis-
     tent homology.” In submission, 2014.
[68] F. A. Khasawneh and E. Munch, “Chatter detection in turning using persistent ho-
     mology,” Mechanical Systems and Signal Processing, vol. 70-71, pp. 527–541, 2016.
[69] F. A. Khasawneh, E. Munch, and J. A. Perea, “Chatter classification in turning using
     machine learning and topological data analysis,” IFAC-PapersOnLine, vol. 51, no. 14,
     pp. 195–200, 2018.
[70] F. A. Khasawneh and E. Munch, “Stability determination in turning using persis-
     tent homology and time series analysis,” in Proceedings of the ASME 2014 Interna-
     tional Mechanical Engineering Congress & Exposition, November 14-20, 2014, Mon-
     treal, Canada, ASME, 2014. Paper no. IMECE2014-40221.
[71] A. Adcock, E. Carlsson, and G. Carlsson, “The ring of algebraic functions on persistence
     bar codes,” Homology, Homotopy and Applications, vol. 18, no. 1, pp. 381–402, 2016.
[72] J. A. Perea, E. Munch, and F. A. Khasawneh, “Approximating Continuous Functions
     on Persistence Diagrams Using Template Functions,” arXiv preprint: 1902.07190,
     2019.
[73] P. Bubenik, “Statistical topological data analysis using persistence landscapes,” Journal
     of Machine Learning Research, vol. 16, pp. 77–102, 2015.
[74] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepush-
     tanova, E. Hanson, F. Motta, and L. Ziegelmeier, “Persistence images: A stable vector
     representation of persistent homology,” Journal of Machine Learning Research, vol. 18,
     2017.
[75] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt, “A stable multi-scale kernel for topo-
     logical machine learning,” in The IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), 2015.
                                             269


[76] I. Chevyrev, V. Nanda, and H. Oberhauser, “Persistence paths and signature features
     in topological data analysis,” IEEE Transactions on Pattern Analysis and Machine
     Intelligence, pp. 1–1, 2018.
[77] S. Tsuji and K. Aihara, “A fast method of computing persistent homology of time series
     data,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech
     and Signal Processing (ICASSP), IEEE, 2019.
[78] N. J. Cavanna, M. Jahanseir, and D. R. Sheehy, “A geometric perspective on sparse
     filtrations,” 2015.
[79] C. Myers, L. Rabiner, and A. Rosenberg, “Performance tradeoffs in dynamic time
     warping algorithms for isolated word recognition,” IEEE Transactions on Acoustics,
     Speech, and Signal Processing, vol. 28, no. 6, pp. 623–635, 1980.
[80] C. S. Myers and L. R. Rabiner, “A comparative study of several dynamic time-warping
     algorithms for connected-word recognition,” Bell System Technical Journal, vol. 60,
     no. 7, pp. 1389–1409, 1981.
[81] H. Sakoe, S. Chiba, A. Waibel, and K. Lee, “Dynamic programming algorithm opti-
     mization for spoken word recognition,” Readings in speech recognition, vol. 159, p. 224,
     1990.
[82] B.-H. Juang, “On the hidden markov model and dynamic time warping for speech
     recognition-a unified view,” AT&T Bell Laboratories Technical Journal, vol. 63, no. 7,
     pp. 1213–1243, 1984.
[83] F. Itakura, “Minimum prediction residual principle applied to speech recognition,”
     IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, no. 1, pp. 67–
     72, 1975.
[84] V. Niennattrakul and C. A. Ratanamahatana, “On clustering multimedia time series
     data using k-means and dynamic time warping,” in 2007 International Conference on
     Multimedia and Ubiquitous Engineering (MUE'07), IEEE, 2007.
[85] F. Yu, K. Dong, F. Chen, Y. Jiang, and W. Zeng, “Clustering time series with granular
     dynamic time warping method,” in 2007 IEEE International Conference on Granular
     Computing (GRC 2007), pp. 393–398, IEEE, 2007.
[86] F. Petitjean, G. Forestier, G. I. Webb, A. E. Nicholson, Y. Chen, and E. Keogh,
     “Dynamic time warping averaging of time series allows faster and more accurate classi-
     fication,” in 2014 IEEE International Conference on Data Mining, pp. 470–479, IEEE,
     2014.
[87] A. P. Shanker and A. Rajagopalan, “Off-line signature verification using DTW,” Pat-
     tern Recognition Letters, vol. 28, no. 12, pp. 1407–1414, 2007.
[88] M. Munich and P. Perona, “Continuous dynamic time warping for translation-invariant
     curve alignment with applications to signature verification,” in Proceedings of the Sev-
     enth IEEE International Conference on Computer Vision, IEEE, 1999.
                                             270


[89] M. Parizeau and R. Plamondon, “A comparative analysis of regional correlation, dy-
      namic time warping, and skeletal tree matching for signature verification,” IEEE Trans-
      actions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 710–717, 1990.
[90] R. Martens and L. Claesen, “On-line signature verification by dynamic time-warping,”
      in Proceedings of 13th International Conference on Pattern Recognition, pp. 38–42,
      IEEE, 1996.
[91] M. Postel, B. Bugdayci, and K. Wegener, “Ensemble transfer learning for refining
      stability predictions in milling using experimental stability states,” The International
      Journal of Advanced Manufacturing Technology, vol. 107, no. 9, pp. 4123–4139, 2020.
[92] H. O. Unver and B. Sener, “A novel transfer learning framework for chatter detection
      using convolutional neural networks,” Journal of Intelligent Manufacturing, pp. 1–20,
      2021.
[93] W. Li and Y. Liang, “Deep transfer learning based diagnosis for machining process
      lifecycle,” Procedia CIRP, vol. 90, pp. 642–647, 2020.
[94] J. Wu, Z. Zhao, C. Sun, R. Yan, and X. Chen, “Few-shot transfer learning for intelligent
      fault diagnosis of machine,” Measurement, vol. 166, p. 108202, 2020.
[95] Y.-M. Kim, S.-J. Shin, and H.-W. Cho, “Predictive modeling for machining power
      based on multi-source transfer learning in metal cutting,” International Journal of
      Precision Engineering and Manufacturing-Green Technology, 2021.
[96] H. Mamledesai, M. A. Soriano, and R. Ahmad, “A qualitative tool condition monitoring
      framework using convolution neural network and transfer learning,” Applied Sciences,
      vol. 10, no. 20, p. 7298, 2020.
[97] M. Marei, S. El Zaatari, and W. Li, “Transfer learning enabled convolutional neural net-
      works for estimating health state of cutting tools,” Robotics and Computer-Integrated
      Manufacturing, vol. 71, p. 102145, 2021.
[98] J. Wang, B. Zou, M. Liu, Y. Li, H. Ding, and K. Xue, “Milling force prediction model
      based on transfer learning and neural network,” Journal of Intelligent Manufacturing,
      vol. 32, pp. 947–956, 2021.
[99] P. Wang and R. X. Gao, “Transfer learning for enhanced machine fault diagnosis in
      manufacturing,” CIRP Annals, vol. 69, no. 1, pp. 413–416, 2020.
[100] Y. Kim, T. Kim, B. D. Youn, and S.-H. Ahn, “Machining quality monitoring (mqm) in
      laser-assisted micro-milling of glass using cutting force signals: An image-based deep
      transfer learning,” Journal of Intelligent Manufacturing, pp. 1–16, 2021.
[101] Z. Gao, Q. Hu, and X. Xu, “Condition monitoring and life prediction of the turning
      tool based on extreme learning machine and transfer learning,” Neural Computing and
      Applications, pp. 1–12, 2021.
                                              271


[102] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for transfer learning,” in Proceedings
      of the 24th International Conference on Machine Learning, ICML ’07, (New York, NY,
      USA), p. 193–200, Association for Computing Machinery, 2007.
[103] F. Shen, C. Chen, R. Yan, and R. X. Gao, “Bearing fault diagnosis based on svd feature
      extraction and transfer learning classification,” in 2015 Prognostics and System Health
      Management Conference (PHM), pp. 1–6, IEEE, 2015.
[104] G. Chen, Y. Li, and X. Liu, “Pose-dependent tool tip dynamics prediction using transfer
      learning,” International Journal of Machine Tools and Manufacture, vol. 137, pp. 30–
      41, 2019.
[105] E. V. Ruiz, F. C. Nolla, and H. R. Segovia, “Is the DTW “distance” really a metric?
      an algorithm reducing the number of DTW comparisons in isolated word recognition,”
      Speech Communication, vol. 4, no. 4, pp. 333–344, 1985.
[106] E. Vidal and M.-J. Lloret, “Fast speaker-independent DTW recognition of isolated
      words using a metric-space search algorithm (AESA),” Speech Communication, vol. 7,
      no. 4, pp. 417–422, 1988.
[107] E. V. Ruiz, “An algorithm for finding nearest neighbours in (approximately) constant
      average time,” Pattern Recognition Letters, vol. 4, no. 3, pp. 145–157, 1986.
[108] M. L. Micó, J. Oncina, and E. Vidal, “A new version of the nearest-neighbour approx-
      imating and eliminating search algorithm (AESA) with linear preprocessing time and
      memory requirements,” Pattern Recognition Letters, vol. 15, no. 1, pp. 9–17, 1994.
[109] J. R. Rico-Juan and L. Micó, “Comparison of AESA and LAESA search algorithms
      using string and tree-edit-distances,” Pattern Recognition Letters, vol. 24, no. 9-10,
      pp. 1417–1426, 2003.
[110] M. C. Yesilli, F. A. Khasawneh, and B. Mann, “Transfer learning for autonomous
      chatter detection in machining,” arXiv preprint: 2204.05400, Apr. 2022.
[111] C. J. Burges, “A tutorial on support vector machines for pattern recognition,” Data
      mining and knowledge discovery, vol. 2, no. 2, pp. 121–167, 1998.
[112] J. Weston and C. Watkins, “Multi-class support vector machines,” tech. rep., 1998.
[113] S. Maji, A. C. Berg, and J. Malik, “Classification using intersection kernel support vec-
      tor machines is efficient,” in 2008 IEEE Conference on Computer Vision and Pattern
      Recognition, IEEE, 2008.
[114] S. Dreiseitl and L. Ohno-Machado, “Logistic regression and artificial neural network
      classification models: a methodology review,” Journal of Biomedical Informatics,
      vol. 35, no. 5-6, pp. 352–359, 2002.
[115] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant, Applied logistic regression,
      vol. 398. John Wiley & Sons, 2013.
                                              272


[116] C.-Y. J. Peng, K. L. Lee, and G. M. Ingersoll, “An introduction to logistic regression
      analysis and reporting,” The Journal of Educational Research, vol. 96, no. 1, pp. 3–14,
      2002.
[117] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[118] L. Breiman, Classification and regression trees. Routledge, 2017.
[119] M. Belgiu and L. Drăguţ, “Random forest in remote sensing: A review of applica-
      tions and future directions,” ISPRS Journal of Photogrammetry and Remote Sensing,
      vol. 114, pp. 24–31, 2016.
[120] R. E. Schapire, “The strength of weak learnability,” Machine Learning, vol. 5, no. 2,
      pp. 197–227, 1990.
[121] A. Natekin and A. Knoll, “Gradient boosting machines, a tutorial,” Frontiers in Neu-
      rorobotics, vol. 7, 2013.
[122] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Anal-
      ysis, vol. 38, no. 4, pp. 367–378, 2002.
[123] K. Yan and D. Zhang, “Feature selection and analysis on correlated gas sensor data with
      recursive feature elimination,” Sensors and Actuators B: Chemical, vol. 212, pp. 353–
      363, 2015.
[124] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, C. C.
      Tung, and H. H. Liu, “The empirical mode decomposition and the hilbert spectrum for
      nonlinear and non-stationary time series analysis,” Proceedings of the Royal Society of
      London. Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1971,
      pp. 903–995, 1998.
[125] Z. WU and N. E. HUANG, “Ensemble Empirical Mode Decomposition: A Noise As-
      sisted Data Analysis Method,” Advances in Adaptive Data Analysis, vol. 01, no. 01,
      pp. 1–41, 2009.
[126] T. Wang, M. Zhang, Q. Yu, and H. Zhang, “Comparing the applications of EMD and
      EEMD on time–frequency analysis of seismic signal,” Journal of Applied Geophysics,
      vol. 83, pp. 29–34, 2012.
[127] O. Pele and M. Werman, “A linear time histogram metric for improved sift matching,”
      in Computer Vision–ECCV 2008, pp. 495–508, Springer, 2008.
[128] O. Pele and M. Werman, “Fast and robust earth mover’s distances,” in 2009 IEEE 12th
      International Conference on Computer Vision, pp. 460–467, IEEE, 2009.
[129] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classifica-
      tion using support vector machines,” Machine Learning, vol. 46, no. 1/3, pp. 389–422,
      2002.
                                               273


[130] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time
      series.,” in KDD workshop, vol. 10, pp. 359–370, Seattle, WA, 1994.
[131] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken
      word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing,
      vol. 26, no. 1, pp. 43–49, 1978.
[132] S. Salvador and P. Chan, “Toward accurate dynamic time warping in linear time and
      space,” Intelligent Data Analysis, vol. 11, no. 5, pp. 561–580, 2007.
[133] R. Han, Y. Li, X. Gao, and S. Wang, “An accurate and rapid continuous wavelet
      dynamic time warping algorithm for end-to-end mapping in ultra-long nanopore se-
      quencing,” Bioinformatics, vol. 34, no. 17, pp. i722–i731, 2018.
[134] S. A. Dudani, “The distance-weighted k-nearest-neighbor rule,” IEEE Transactions on
      Systems, Man, and Cybernetics, no. 4, pp. 325–327, 1976.
[135] T. Horváth, S. Wrobel, and U. Bohnebeck, “Relational instance-based learning with
      lists and terms,” Machine Learning, vol. 43, no. 1, pp. 53–80, 2001.
[136] B. Choudhary, The elements of complex analysis. New York: J. Wiley, 1993.
[137] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Za-
      karia, and E. Keogh, “Searching and mining trillions of time series subsequences under
      dynamic time warping,” in Proceedings of the 18th ACM SIGKDD international con-
      ference on Knowledge discovery and data mining, pp. 262–270, ACM, 2012.
[138] F. Takens, “Detecting strange attractors in turbulence,” in Dynamical Systems and
      Turbulence, Warwick 1980 (D. Rand and L.-S. Young, eds.), vol. 898 of Lecture Notes
      in Mathematics, pp. 366–381, Springer Berlin Heidelberg, 1981.
[139] J. R. Munkres, Elements of Algebraic Topology. CRC Press, 2018.
[140] E. Munch, “A user’s guide to topological data analysis,” Journal of Learning Analytics,
      vol. 4, pp. 47–61, jul 2017.
[141] S. Theodoridis and K. Koutroumbas, “Feature selection,” in Pattern Recognition,
      pp. 261–322, Elsevier, 2009.
[142] P. J. Rousseeuw, “Least median of squares regression,” Journal of the American Sta-
      tistical Association, vol. 79, no. 388, pp. 871–880, 1984.
[143] A. Myers and F. A. Khasawneh, “On the automatic parameter selection for permutation
      entropy,” Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 30, no. 3,
      p. 033130, 2020.
[144] M. Melosik and W. Marszalek, “On the 0/1 test for chaos in continuous systems,”
      Bulletin of the Polish Academy of Sciences Technical Sciences, vol. 64, no. 3, pp. 521–
      528, 2016.
                                               274


[145] A. M. Fraser and H. L. Swinney, “Independent coordinates for strange attractors from
      mutual information,” Phys. Rev. A, vol. 33, pp. 1134–1140, 1986.
[146] M. B. Kennel, R. Brown, and H. D. I. Abarbanel, “Determining embedding dimension
      for phase-space reconstruction using a geometrical construction,” Phys. Rev. A, vol. 45,
      pp. 3403–3411, Mar 1992.
[147] H. D. I. Abarbanel, T. A. Carroll, L. M. Pecora, J. J. Sidorowich, and L. S. Tsimring,
      “Predicting physical variables in time-delay embedding,” Physical Review E, vol. 49,
      no. 3, pp. 1840–1853, 1994.
[148] A. Tharwat, M. Elhoseny, A. E. Hassanien, T. Gabel, and A. Kumar, “Intelligent
      bézier curve-based path planning model using chaotic particle swarm optimization
      algorithm,” Cluster Computing, vol. 22, no. S2, pp. 4745–4766, 2018.
[149] M. Elhoseny, A. Tharwat, and A. E. Hassanien, “Bézier curve based path planning in
      a dynamic field using modified genetic algorithm,” Journal of Computational Science,
      vol. 25, pp. 339–350, 2018.
[150] J. wung Choi, R. Curry, and G. Elkaim, “Path planning based on bézier curve for
      autonomous ground vehicles,” in Advances in Electrical and Electronics Engineering -
      IAENG Special Edition of the World Congress on Engineering and Computer Science
      2008, IEEE, 2008.
[151] J.-H. Hwang, R. Arkin, and D.-S. Kwon, “Mobile robots at your fingertip: Bézier
      curve on-line trajectory generation for supervisory control,” in Proceedings 2003
      IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003)
      (Cat. No.03CH37453), IEEE, 2003.
[152] S. C. Endres, C. Sandrock, and W. W. Focke, “A simplicial homology algorithm for
      lipschitz optimisation,” Journal of Global Optimization, vol. 72, no. 2, pp. 181–217,
      2018.
[153] M. Kerber, D. Morozov, and A. Nigmetov, “Geometry helps to compare persistence
      diagrams,” arXiv preprint, arXiv:1606.03357v1, vol. cs.CG, pp. 1–20, 2016.
[154] A. D. Myers, M. Yesilli, S. Tymochko, F. Khasawneh, and E. Munch, “Teaspoon:
      A comprehensive python package for topological signal processing,” in NeurIPS 2020
      Workshop on Topological Data Analysis and Beyond, 2020.
[155] P. Bubenik and P. Dłotko, “A persistence landscapes toolbox for topological statistics,”
      Journal of Symbolic Computation, vol. 78, pp. 91–114, 2017.
[156] E. Berry, Y.-C. Chen, J. Cisewski-Kehe, and B. T. Fasy, “Functional summaries of
      persistence diagrams,” Journal of Applied and Computational Topology, vol. 4, pp. 211–
      262, 2018.
[157] M. Zeppelzauer, B. Zieliński, M. Juda, and M. Seidl, “A study on topological descrip-
      tors for the analysis of 3d surface texture,” Computer Vision and Image Understanding,
      vol. 167, pp. 74–88, 2018.
                                               275


[158] S. Chepushtanova, T. Emerson, E. Hanson, M. Kirby, F. Motta, R. Neville, C. Peter-
      son, P. Shipman, and L. Ziegelmeier, “Persistence images: An alternative persistent
      homology representation,” arXiv preprint arXiv:1507.06217, 2015.
[159] R. Kwitt, S. Huber, M. Niethammer, W. Lin, and U. Bauer, “Statistical topological
      data analysis - a kernel perspective,” in Advances in Neural Information Processing
      Systems 28 (C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett, and R. Garnett,
      eds.), pp. 3052–3060, Curran Associates, Inc., 2015.
[160] Q. Zhao and Y. Wang, “Learning metrics for persistence-based summaries and ap-
      plications for graph classification,” arXiv preprint, arXiv:1904.12189v1, vol. cs.CG,
      pp. 1–21, 2019.
[161] G. Kusano, Y. Hiraoka, and K. Fukumizu, “Persistence weighted gaussian kernel for
      topological data analysis,” in International Conference on Machine Learning, pp. 2004–
      2013, 2016.
[162] G. Kusano, K. Fukumizu, and Y. Hiraoka, “Kernel method for persistence diagrams
      via kernel embedding and weight factor,” The Journal of Machine Learning Research,
      vol. 18, no. 1, pp. 6947–6987, 2017.
[163] M. Carrière, M. Cuturi, and S. Oudot, “Sliced wasserstein kernel for persistence dia-
      grams,” arXiv preprint, arXiv:1706.03358, vol. cs.CG, 2017.
[164] G. Kusano, “Persistence weighted gaussian kernel for probability distributions on the
      space of persistence diagrams,” arXiv preprint, arXiv:1803.08269v1, vol. math.AT,
      2018.
[165] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM
      Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Soft-
      ware available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[166] I. Chevyrev and A. Kormilitzin, “A primer on the signature method in machine learn-
      ing,” arXiv preprint, arXiv:1603.03788v1, vol. stat.ML, pp. 1–45, 2016.
[167] Y. Altintas, Manufacturing Automation. Cambridge University Press, 2009.
[168] O. Bobrenkov, F. Khasawneh, E. Butcher, and B. Mann, “Analysis of milling dynamics
      for simultaneously engaged cutting teeth,” Journal of Sound and Vibration, vol. 329,
      no. 5, pp. 585–606, 2010.
[169] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowl-
      edge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[170] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehen-
      sive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76,
      2020.
[171] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal
      of Big Data, vol. 3, no. 1, 2016.
                                              276


[172] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang, “Transfer learning using
      computational intelligence: A survey,” Knowledge-Based Systems, vol. 80, pp. 14–23,
      2015.
[173] J.-N. Juang and R. S. Pappa, “An eigensystem realization algorithm for modal param-
      eter identification and model reduction,” Journal of Guidance, Control, and Dynamics,
      vol. 8, no. 5, pp. 620–627, 1985.
[174] H. Ye, R. J. Beamish, S. M. Glaser, S. C. H. Grant, C.-H. Hsieh, L. J. Richards, J. T.
      Schnute, and G. Sugihara, “Equation-free mechanistic ecosystem forecasting using em-
      pirical dynamic modeling,” Proceedings of the National Academy of Sciences, vol. 112,
      no. 13, pp. E1569–E1576, 2015.
[175] I. G. Kevrekidis, C. W. Gear, J. M. Hyman, P. G. Kevrekidid, O. Runborg, C. Theodor-
      opoulos, et al., “Equation-free, coarse-grained multiscale computation: Enabling
      mocroscopic simulators to perform system-level analysis,” Communications in Mathe-
      matical Sciences, vol. 1, no. 4, pp. 715–762, 2003.
[176] A. J. Roberts, Model Emergent Dynamics in Complex Systems. CAMBRIDGE, 2015.
[177] M. D. Schmidt, R. R. Vallabhajosyula, J. W. Jenkins, J. E. Hood, A. S. Soni, J. P.
      Wikswo, and H. Lipson, “Automated refinement and inference of analytical models for
      metabolic networks,” Physical Biology, vol. 8, no. 5, p. 055011, 2011.
[178] B. C. Daniels, W. S. Ryu, and I. Nemenman, “Automated, predictive, and inter-
      pretable inference of caenorhabditis elegans escape dynamics,” Proceedings of the Na-
      tional Academy of Sciences, vol. 116, no. 15, pp. 7226–7231, 2019.
[179] B. C. Daniels and I. Nemenman, “Automated adaptive inference of phenomenological
      dynamical models,” Nature Communications, vol. 6, no. 1, 2015.
[180] D. Giannakis and A. J. Majda, “Nonlinear laplacian spectral analysis for time series
      with intermittency and low-frequency variability,” Proceedings of the National Academy
      of Sciences, vol. 109, no. 7, pp. 2222–2227, 2012.
[181] J. Bongard and H. Lipson, “Automated reverse engineering of nonlinear dynamical
      systems,” Proceedings of the National Academy of Sciences, vol. 104, no. 24, pp. 9943–
      9948, 2007.
[182] M. Schmidt and H. Lipson, “Distilling free-form natural laws from experimental data,”
      Science, vol. 324, no. 5923, pp. 81–85, 2009.
[183] M. Quade, M. Abel, K. Shafi, R. K. Niven, and B. R. Noack, “Prediction of dynamical
      systems by symbolic regression,” Physical Review E, vol. 94, no. 1, 2016.
[184] X. Sun, L. Jin, and M. Xiong, “Extended Kalman filter for estimation of parameters
      in nonlinear state-space models of biochemical networks,” PLoS ONE, vol. 3, no. 11,
      p. e3758, 2008.
                                              277


[185] T. Qin, K. Wu, and D. Xiu, “Data driven governing equations approximation using
      deep neural networks,” Journal of Computational Physics, vol. 395, pp. 620–635, 2019.
[186] M. Raissi, P. Perdikaris, and G. Karniadakis, “Physics-informed neural networks: A
      deep learning framework for solving forward and inverse problems involving nonlinear
      partial differential equations,” Journal of Computational Physics, vol. 378, pp. 686–
      707, 2019.
[187] R. Iten, T. Metger, H. Wilming, L. del Rio, and R. Renner, “Discovering physical
      concepts with neural networks,” Physical Review Letters, vol. 124, no. 1, 2020.
[188] J. P. Crutchfield and B. S. McNamara, “Equations of motion from a data series,”
      Complex systems, vol. 1, no. 417-452, p. 121, 1987.
[189] S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Discovering governing equations from
      data by sparse identification of nonlinear dynamical systems,” Proceedings of the Na-
      tional Academy of Sciences, vol. 113, no. 15, pp. 3932–3937, 2016.
[190] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal
      Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[191] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning.
      Springer New York, 2009.
[192] H. Schaeffer, R. Caflisch, C. D. Hauck, and S. Osher, “Sparse dynamics for partial dif-
      ferential equations,” Proceedings of the National Academy of Sciences, vol. 110, no. 17,
      pp. 6634–6639, 2013.
[193] G. Tran and R. Ward, “Exact recovery of chaotic systems from highly corrupted data,”
      Multiscale Modeling & Simulation, vol. 15, no. 3, pp. 1108–1129, 2017.
[194] H. Schaeffer, “Learning partial differential equations via data discovery and sparse opti-
      mization,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering
      Sciences, vol. 473, no. 2197, p. 20160446, 2017.
[195] N. M. Mangan, S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Inferring biological net-
      works by sparse identification of nonlinear dynamics,” IEEE Transactions on Molecu-
      lar, Biological and Multi-Scale Communications, vol. 2, no. 1, pp. 52–63, 2016.
[196] J. L. Proctor, S. L. Brunton, B. W. Brunton, and J. N. Kutz, “Exploiting sparsity
      and equation-free architectures in complex systems,” The European Physical Journal
      Special Topics, vol. 223, no. 13, pp. 2665–2684, 2014.
[197] D. Rey, M. Eldridge, M. Kostuk, H. D. Abarbanel, J. Schumann-Bischoff, and U. Par-
      litz, “Accurate state and parameter estimation in nonlinear systems with sparse obser-
      vations,” Physics Letters A, vol. 378, no. 11-12, pp. 869–873, 2014.
[198] W.-X. Wang, Y.-C. Lai, and C. Grebogi, “Data based identification and prediction of
      nonlinear and complex dynamical systems,” Physics Reports, vol. 644, pp. 1–76, 2016.
                                              278


[199] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Data-driven discovery of
      partial differential equations,” Science Advances, vol. 3, no. 4, p. e1602614, 2017.
[200] G. F. Smits and M. Kotanchek, Pareto-Front Exploitation in Symbolic Regression,
      pp. 283–299. Boston, MA: Springer US, 2005.
[201] S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Sparse identification of nonlinear dy-
      namics with control (SINDYc),” IFAC-PapersOnLine, vol. 49, no. 18, pp. 710–715,
      2016.
[202] E. Kaiser, J. N. Kutz, and S. L. Brunton, “Sparse identification of nonlinear dynamics
      for model predictive control in the low-data limit,” Proceedings of the Royal Society
      A: Mathematical, Physical and Engineering Sciences, vol. 474, no. 2219, p. 20180335,
      2018.
[203] L. Boninsegna, F. Nüske, and C. Clementi, “Sparse learning of stochastic dynamical
      equations,” The Journal of Chemical Physics, vol. 148, no. 24, p. 241723, 2018.
[204] S. K. Goharoodi, K. Dekemele, L. Dupre, M. Loccufier, and G. Crevecoeur,
      “Sparse identification of nonlinear duffing oscillator from measurement data,” IFAC-
      PapersOnLine, vol. 51, no. 33, pp. 162–167, 2018.
[205] S. Li, E. Kaiser, S. Laima, H. Li, S. L. Brunton, and J. N. Kutz, “Discovering time-
      varying aerodynamics of a prototype bridge by sparse identification of nonlinear dy-
      namical systems,” Physical Review E, vol. 100, no. 2, 2019.
[206] I. Knowles and R. Wallace, “A variational method for numerical differentiation,” Nu-
      merische Mathematik, vol. 70, no. 1, pp. 91–110, 1995.
[207] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal
      algorithms,” Physica D: nonlinear phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
[208] R. Chartrand, “Numerical differentiation of noisy, nonsmooth data,” ISRN Applied
      Mathematics, vol. 2011, pp. 1–11, 2011.
[209] N. Mork, M. C. Yesilli, and F. A. Khasawneh, “Design of chaotic pendulum with a
      variable interaction potential, Zenodo, DOI: 10.5281/zenodo.3784897,” 2020.
[210] V. Tran, E. Brost, M. Johnston, and J. Jalkio, “Predicting the behavior of a chaotic
      pendulum with a variable interaction potential,” Chaos: An Interdisciplinary Journal
      of Nonlinear Science, vol. 23, no. 3, p. 033103, 2013.
[211] C. M. Hurvich and C.-L. Tsai, “Regression and time series model selection in small
      samples,” Biometrika, vol. 76, no. 2, pp. 297–307, 1989.
[212] M. A. Babyak, “What you see may not be what you get: A brief, nontechnical in-
      troduction to overfitting in regression-type models,” Psychosomatic Medicine, vol. 66,
      no. 3, pp. 411–421, 2004.
                                              279


[213] I. Knowles and R. J. Renka, “Methods for numerical differentiation of noisy data,”
      Electron. J. Differ. Equ, vol. 21, pp. 235–246, 2014.
[214] R. Schafer, “What is a Savitzky-Golay Filter? [lecture notes],” IEEE Signal Processing
      Magazine, vol. 28, no. 4, pp. 111–117, 2011.
[215] A. Y. Suh, A. A. Polycarpou, and T. F. Conry, “Detailed surface roughness charac-
      terization of engineering surfaces undergoing tribological testing leading to scuffing,”
      Wear, vol. 255, no. 1-6, pp. 556–568, 2003.
[216] N. Makarenko, M. Kalimoldayev, I. Pak, and A. Yessenaliyeva, “Texture recognition
      by the methods of topological data analysis,” Open Engineering, vol. 6, no. 1, 2016.
[217] A. A. Taha and A. Hanbury, “Metrics for evaluating 3d medical image segmentation:
      analysis, selection, and tool,” BMC Medical Imaging, vol. 15, no. 1, 2015.
[218] S. Wu, B. Zhang, Y. Liu, X. Suo, and H. Li, “Influence of surface topography on
      bacterial adhesion: A review (review),” Biointerphases, vol. 13, no. 6, p. 060801, 2018.
[219] ISO, “ISO 16610-21 Geometrical product specifications (GPS) — Filtration — Part
      21: Linear profile filters: Gaussian filters,” 2011.
[220] J. Raja, B. Muralikrishnan, and S. Fu, “Recent advances in separation of roughness,
      waviness and form,” Precision Engineering, vol. 26, no. 2, pp. 222–235, 2002.
[221] L. Gurau, H. Mansfield-Williams, and M. Irle, “Processing roughness of sanded wood
      surfaces,” Holz als Roh- und Werkstoff, vol. 63, no. 1, pp. 43–52, 2004.
[222] B. Hendarto, E. Shayan, B. Ozarska, and R. Carr, “Analysis of roughness of a sanded
      wood surface,” The International Journal of Advanced Manufacturing Technology,
      vol. 28, no. 7, pp. 775–780, 2006.
[223] D. Janecki, “Gaussian filters with profile extrapolation,” Precision Engineering, vol. 35,
      no. 4, pp. 602–606, 2011.
[224] J. Raja and V. Radhakrishnan, “Filtering of surface profiles using fast fourier trans-
      form,” International Journal of Machine Tool Design and Research, vol. 19, no. 3,
      pp. 133–141, 1979.
[225] U. G. Indahl and T. Naes, “Evaluation of alternative spectral feature extraction meth-
      ods of textural images for multivariate modelling,” Journal of Chemometrics, vol. 12,
      no. 4, pp. 261–278, 1998.
[226] Z. Peng and T. Kirk, “Two-dimensional fast fourier transform and power spectrum for
      wear particle analysis,” Tribology International, vol. 30, no. 8, pp. 583–590, 1997.
[227] W. P. Dong and K. J. Stout, “Two-dimensional fast fourier transform and power
      spectrum for surface roughness in three dimensions,” Proceedings of the Institution
      of Mechanical Engineers, Part B: Journal of Engineering Manufacture, vol. 209, no. 5,
      pp. 381–391, 1995.
                                               280


[228] J. Nunes, Y. Bouaoune, E. Delechelle, O. Niang, and P. Bunel, “Image analysis by
      bidimensional empirical mode decomposition,” Image and Vision Computing, vol. 21,
      no. 12, pp. 1019–1026, 2003.
[229] A. LINDERHED, “IMAGE EMPIRICAL MODE DECOMPOSITION: A NEW TOOL
      FOR IMAGE PROCESSING,” Advances in Adaptive Data Analysis, vol. 01, no. 02,
      pp. 265–294, 2009.
[230] Y. Xia, B. Zhang, W. Pei, and D. P. Mandic, “Bidimensional multivariate empiri-
      cal mode decomposition with applications in multi-scale image fusion,” IEEE Access,
      vol. 7, pp. 114261–114270, 2019.
[231] H.-D. Lin, “Tiny surface defect inspection of electronic passive components using dis-
      crete cosine transform decomposition and cumulative sum techniques,” Image and Vi-
      sion Computing, vol. 26, no. 5, pp. 603–621, 2008.
[232] J. Lecompte, O. Legoff, and J.-Y. Hascoet, “Technological form defects identification
      using discrete cosine transform method,” The International Journal of Advanced Man-
      ufacturing Technology, vol. 51, no. 9-12, pp. 1033 –1044, 2010.
[233] P. H. Chandankhede, P. V. Puranik, and P. R. Bajaj, “Soft computing tool approach
      for texture classification using discrete cosine transform,” in 2011 3rd International
      Conference on Electronics Computer Technology, IEEE, 2011.
[234] G. L. Goïc, M. Bigerelle, S. Samper, H. Favrelière, and M. Pillet, “Multiscale roughness
      analysis of engineering surfaces: A comparison of methods for the investigation of
      functional correlations,” Mechanical Systems and Signal Processing, vol. 66-67, pp. 437–
      457, 2016.
[235] X. Chen, J. Raja, and S. Simanapalli, “Multi-scale analysis of engineering surfaces,”
      International Journal of Machine Tools and Manufacture, vol. 35, no. 2, pp. 231–238,
      1995.
[236] X. Liu and J. Raja, “Analyzing engineering surface texture using wavelet filter,” in
      Wavelet Applications in Signal and Image Processing IV (M. A. Unser, A. Aldroubi,
      and A. F. Laine, eds.), SPIE, 1996.
[237] S. Fu, B. Muralikrishnan, and J. Raja, “Engineering surface analysis with different
      wavelet bases,” Journal of Manufacturing Science and Engineering, vol. 125, no. 4,
      pp. 844–852, 2003.
[238] B. Josso, D. R. Burton, and M. J. Lalor, “Frequency normalised wavelet transform for
      surface roughness analysis and characterisation,” Wear, vol. 252, no. 5-6, pp. 491–500,
      2002.
[239] P. Morala-Argüello, J. Barreiro, and E. Alegre, “A evaluation of surface roughness
      classes by computer vision using wavelet transform in the frequency domain,” The
      International Journal of Advanced Manufacturing Technology, vol. 59, no. 1, pp. 213–
      220, 2012.
                                              281


[240] S. I. Chang and J. S. Ravathur, “Computer vision based non-contact surface rough-
      ness assessment using wavelet transform and response surface methodology,” Quality
      Engineering, vol. 17, no. 3, pp. 435–451, 2005.
[241] X. Wang, T. Shi, G. Liao, Y. Zhang, Y. Hong, and K. Chen, “Using wavelet packet
      transform for surface roughness evaluation and texture extraction,” Sensors, vol. 17,
      no. 4, p. 933, 2017.
[242] Q. Chen, S. Yang, and Z. Li, “Surface roughness evaluation by using wavelets analysis,”
      Precision Engineering, vol. 23, no. 3, pp. 209–212, 1999.
[243] K. Stępień, W. Makiela, A. Stoić, and I. Samardžić, “Defining the criteria to select the
      wavelet type for the assessment of surface quality,” Tehnički vjesnik–Technical Gazette,
      vol. 22, no. 3, pp. 781–784, 2015.
[244] M. C. Yesilli, J. Chen, F. A. Khasawneh, and Y. Guo, “Automated surface texture
      analysis via discrete cosine transform and discrete wavelet transform,” arXiv preprint:
      2204.05968, Apr. 2022.
[245] ISO, “ISO 25178 Geometrical product specifications (GPS) — Surface texture: Areal.”
[246] ISO, “ISO 21920 Geometrical product specifications (GPS) —Surface texture: Profile.”
[247] M. C. Yesilli and F. A. Khasawneh, “Data-driven and automatic surface texture anal-
      ysis using persistent homology,” in 2021 20th IEEE International Conference on Ma-
      chine Learning and Applications (ICMLA), IEEE, dec 2021.
[248] M. H. Müser, W. B. Dapp, R. Bugnicourt, P. Sainsot, N. Lesaffre, T. A. Lubrecht,
      B. N. J. Persson, K. Harris, A. Bennett, K. Schulze, S. Rohde, P. Ifju, W. G. Sawyer,
      T. Angelini, H. A. Esfahani, M. Kadkhodaei, S. Akbarzadeh, J.-J. Wu, G. Vorlaufer,
      A. Vernes, S. Solhjoo, A. I. Vakis, R. L. Jackson, Y. Xu, J. Streator, A. Rostami,
      D. Dini, S. Medina, G. Carbone, F. Bottiglione, L. Afferrante, J. Monti, L. Pastewka,
      M. O. Robbins, and J. A. Greenwood, “Meeting the contact-mechanics challenge,”
      Tribology Letters, vol. 65, no. 4, 2017.
[249] ASME, “ASME B46.1 Surface Texture (Surface Roughness, Waviness and Lay).”
[250] A. Averbuch, R. Coifman, D. Donoho, M. Elad, and M. Israeli, “Fast and accurate
      polar fourier transform,” Applied and Computational Harmonic Analysis, vol. 21, no. 2,
      pp. 145–167, 2006.
[251] E. Berry, Y.-C. Chen, J. Cisewski-Kehe, and B. T. Fasy, “Functional summaries of
      persistence diagrams,” Journal of Applied and Computational Topology, vol. 4, no. 2,
      pp. 211–262, 2020.
[252] R. X. Gao and R. Yan, Wavelets: Theory and applications for manufacturing. Springer
      Science & Business Media, 2010.
[253] H. Olkkonen, Discrete Wavelet Transforms: Biomedical Applications. BoD–Books on
      Demand, 2011.
                                               282


[254] S. Prabhakar, A. Mohanty, and A. Sekhar, “Application of discrete wavelet trans-
      form for detection of ball bearing race faults,” Tribology International, vol. 35, no. 12,
      pp. 793–800, 2002.
[255] K. Gurley and A. Kareem, “Applications of wavelet transforms in earthquake, wind
      and ocean engineering,” Engineering structures, vol. 21, no. 2, pp. 149–167, 1999.
[256] D. E. Newland, An Introduction to Random Vibrations, Spectral & Wavelet Analysis.
      Guilford Publications, 2012.
[257] ISO, “ISO16610-29 Geometrical product specifications (GPS) - Filtration - Part 29:
      Linear profile filters: wavelets,” 2020.
[258] G. Strang, “The discrete cosine transform,” SIAM review, vol. 41, no. 1, pp. 135–147,
      1999.
[259] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical
      Journal, vol. 27, no. 3, pp. 379–423, 1948.
[260] S. M.R., “Entropy-based image registration,” PhD Thesis, Princeton University, 2006.
[261] C. Liu, Y. Li, G. Zhou, and W. Shen, “A sensor fusion and support vector machine
      based approach for recognition of complex machining conditions,” Journal of Intelligent
      Manufacturing, vol. 29, no. 8, pp. 1739–1752, 2016.
[262] R. Peng, J. Liu, X. Fu, C. Liu, and L. Zhao, “Application of machine vision method
      in tool wear monitoring,” The International Journal of Advanced Manufacturing Tech-
      nology, vol. 116, no. 3-4, pp. 1357–1372, 2021.
[263] S. Huang, K. Tan, Y. Wong, C. de Silva, H. Goh, and W. Tan, “Tool wear detection and
      fault diagnosis based on cutting force monitoring,” International Journal of Machine
      Tools and Manufacture, vol. 47, no. 3-4, pp. 444–451, 2007.
[264] J. Wassdahl, “Modeling of wear mechanisms in mechanical cutting,” 2008.
[265] P. Stavropoulos, A. Papacharalampopoulos, and T. Souflas, “Indirect online tool wear
      monitoring and model-based identification of process-related signal,” Advances in Me-
      chanical Engineering, vol. 12, no. 5, p. 168781402091920, 2020.
[266] D. Salgado and F. Alonso, “Tool wear detection in turning operations using singu-
      lar spectrum analysis,” Journal of Materials Processing Technology, vol. 171, no. 3,
      pp. 451–458, 2006.
[267] F. Alonso and D. Salgado, “Analysis of the structure of vibration signals for tool wear
      detection,” Mechanical Systems and Signal Processing, vol. 22, no. 3, pp. 735–748,
      2008.
[268] B. Kilundu, P. Dehombreux, and X. Chiementin, “Tool wear monitoring by machine
      learning techniques and singular spectrum analysis,” Mechanical Systems and Signal
      Processing, vol. 25, no. 1, pp. 400–415, 2011.
                                               283


[269] X. Li, “A brief review: acoustic emission method for tool wear monitoring during
      turning,” International Journal of Machine Tools and Manufacture, vol. 42, no. 2,
      pp. 157–165, 2002.
[270] V. A. Pechenin, A. I. Khaimovich, A. I. Kondratiev, and M. A. Bolotov, “Method of
      controlling cutting tool wear based on signal analysis of acoustic emission for milling,”
      Procedia Engineering, vol. 176, pp. 246–252, 2017.
[271] S. Leng, Z. Wang, T. Min, Z. Dai, and G. Chen, “Detection of tool wear in drilling
      CFRP/TC4 stacks by acoustic emission,” Journal of Vibration Engineering & Tech-
      nologies, vol. 8, no. 3, pp. 463–470, 2019.
[272] W. Li, W. Gong, T. Obikawa, and T. Shirakashi, “A method of recognizing tool-wear
      states based on a fast algorithm of wavelet transform,” Journal of Materials Processing
      Technology, vol. 170, no. 1-2, pp. 374–380, 2005.
[273] T. Benkedjouh, N. Zerhouni, and S. Rechak, “Tool wear condition monitoring based on
      continuous wavelet transform and blind source separation,” The International Journal
      of Advanced Manufacturing Technology, vol. 97, no. 9-12, pp. 3311–3323, 2018.
[274] Y. Choi, R. Narayanaswami, and A. Chandra, “Tool wear monitoring in ramp cuts
      in end milling using the wavelet transform,” The International Journal of Advanced
      Manufacturing Technology, vol. 23, no. 5-6, pp. 419–428, 2004.
[275] L. Xiaoli and Y. Zhejun, “Tool wear monitoring with wavelet packet transform—fuzzy
      clustering method,” Wear, vol. 219, no. 2, pp. 145–154, 1998.
[276] M. Hassan, A. Damir, H. Attia, and V. Thomson, “Benchmarking of pattern recognition
      techniques for online tool wear detection,” Procedia CIRP, vol. 72, pp. 1451–1456, 2018.
[277] G. D. Simon and R. Deivanathan, “Early detection of drilling tool wear by vibration
      data acquisition and classification,” Manufacturing Letters, vol. 21, pp. 60–65, 2019.
[278] A. Rivero, L. L. de Lacalle, and M. L. Penalva, “Tool wear detection in dry high-speed
      milling based upon the analysis of machine internal signals,” Mechatronics, vol. 18,
      no. 10, pp. 627–633, 2008.
[279] M. Kious, A. Ouahabi, M. Boudraa, R. Serra, and A. Cheknane, “Detection process
      approach of tool wear in high speed milling,” Measurement, vol. 43, no. 10, pp. 1439–
      1446, 2010.
[280] D. Wu, C. Jennings, J. Terpenny, R. X. Gao, and S. Kumara, “A comparative study
      on machine learning algorithms for smart manufacturing: Tool wear prediction using
      random forests,” Journal of Manufacturing Science and Engineering, vol. 139, no. 7,
      2017.
[281] W. Rmili, A. Ouahabi, R. Serra, and R. Leroy, “An automatic system based on vibra-
      tory analysis for cutting tool wear monitoring,” Measurement, vol. 77, pp. 117–123,
      2016.
                                              284


[282] A. Gouarir, G. Martínez-Arellano, G. Terrazas, P. Benardos, and S. Ratchev, “In-
      process tool wear prediction system based on machine learning techniques and force
      analysis,” Procedia CIRP, vol. 77, pp. 501–504, 2018.
[283] T. Bergs, C. Holst, P. Gupta, and T. Augspurger, “Digital image processing with deep
      learning for automated cutting tool wear detection,” Procedia Manufacturing, vol. 48,
      pp. 947–958, 2020.
[284] X. Xu, J. Wang, B. Zhong, W. Ming, and M. Chen, “Deep learning-based tool wear
      prediction and its application for machining process using multi-scale feature fusion
      and channel attention mechanism,” Measurement, vol. 177, p. 109254, 2021.
[285] S.-L. Chen and Y. Jen, “Data fusion neural network for tool condition monitoring in
      CNC milling machining,” International Journal of Machine Tools and Manufacture,
      vol. 40, no. 3, pp. 381–400, 2000.
[286] Y. Shaban, S. Yacout, M. Balazinski, and K. Jemielniak, “Cutting tool wear detection
      using multiclass logical analysis of data,” Machining Science and Technology, vol. 21,
      no. 4, pp. 526–541, 2017.
[287] D. M. D’Addona, A. M. M. S. Ullah, and D. Matarazzo, “Tool-wear prediction and
      pattern-recognition using artificial neural network and DNA-based computing,” Jour-
      nal of Intelligent Manufacturing, vol. 28, no. 6, pp. 1285–1301, 2015.
[288] F. J. Alonso and D. R. Salgado, “Application of singular spectrum analysis to tool wear
      detection using sound signals,” Proceedings of the Institution of Mechanical Engineers,
      Part B: Journal of Engineering Manufacture, vol. 219, no. 9, pp. 703–710, 2005.
[289] R. Peng, H. Pang, H. Jiang, and Y. Hu, “Study of tool wear monitoring using machine
      vision,” Automatic Control and Computer Sciences, vol. 54, no. 3, pp. 259–270, 2020.
[290] Q. Hou, J. Sun, Z. Lv, P. Huang, G. Song, and C. Sun, “An online tool wear detection
      system in dry milling based on machine vision,” The International Journal of Advanced
      Manufacturing Technology, vol. 105, no. 1-4, pp. 1801–1810, 2019.
[291] X. Wu, Y. Liu, X. Zhou, and A. Mou, “Automatic identification of tool wear based on
      convolutional neural network in face milling process,” Sensors, vol. 19, no. 18, p. 3817,
      2019.
[292] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-
      time object detection,” in Proceedings of the IEEE Conference on Computer Vision
      and Pattern Recognition (CVPR), 2016.
[293] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedi-
      cal image segmentation,” in Lecture Notes in Computer Science, pp. 234–241, Springer
      International Publishing, 2015.
[294] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional
      encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern
      Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
                                              285


[295] W.-J. Lin, J.-W. Chen, J.-P. Jhuang, M.-S. Tsai, C.-L. Hung, K.-M. Li, and H.-T.
      Young, “Integrating object detection and image segmentation for detecting the tool
      wear area on stitched image,” Scientific Reports, vol. 11, no. 1, 2021.
[296] M. T. García-Ordás, E. Alegre-Gutiérrez, R. Alaiz-Rodríguez, and V. González-Castro,
      “Tool wear monitoring using an online, automatic and low cost system based on local
      texture,” Mechanical Systems and Signal Processing, vol. 112, pp. 98–112, 2018.
[297] M. Schmitt and X. X. Zhu, “Data fusion and remote sensing: An ever-growing rela-
      tionship,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 4, pp. 6–23,
      2016.
[298] Y. Yao, X. Li, and Z. Yuan, “Tool wear detection with fuzzy classification and wavelet
      fuzzy neural network,” International Journal of Machine Tools and Manufacture,
      vol. 39, no. 10, pp. 1525–1538, 1999.
[299] C. Aliustaoglu, H. M. Ertunc, and H. Ocak, “Tool wear condition monitoring using a
      sensor fusion model based on fuzzy inference system,” Mechanical Systems and Signal
      Processing, vol. 23, no. 2, pp. 539–546, 2009.
[300] J. Wu, Y. Su, Y. Cheng, X. Shao, C. Deng, and C. Liu, “Multi-sensor information fusion
      for remaining useful life prediction of machining tools by adaptive network based fuzzy
      inference system,” Applied Soft Computing, vol. 68, pp. 13–23, 2018.
[301] W. H. Woodall, R. Koudelik, K.-L. Tsui, S. B. Kim, Z. G. Stoumbos, and C. P. Car-
      vounis, “A review and analysis of the mahalanobis—taguchi system,” Technometrics,
      vol. 45, no. 1, pp. 1–15, 2003.
[302] R. D. Maesschalck, D. Jouan-Rimbaud, and D. Massart, “The mahalanobis distance,”
      Chemometrics and Intelligent Laboratory Systems, vol. 50, no. 1, pp. 1–18, 2000.
[303] M. Rizal, J. Ghani, M. Nuawi, and C. Haron, “Cutting tool wear classification and
      detection using multi-sensor signals and mahalanobis-taguchi system,” Wear, vol. 376-
      377, pp. 1759–1765, 2017.
[304] S. Binsaeid, S. Asfour, S. Cho, and A. Onar, “Machine ensemble approach for simulta-
      neous detection of transient and gradual abnormalities in end milling using multisensor
      fusion,” Journal of Materials Processing Technology, vol. 209, no. 10, pp. 4728–4738,
      2009.
[305] W. H. Wang, G. S. Hong, Y. S. Wong, and K. P. Zhu, “Sensor fusion for online tool
      condition monitoring in milling,” International Journal of Production Research, vol. 45,
      no. 21, pp. 5095–5116, 2007.
[306] E. Kannatey-Asibu, J. Yum, and T. Kim, “Monitoring tool wear using classifier fusion,”
      Mechanical Systems and Signal Processing, vol. 85, pp. 651–661, 2017.
[307] A. E2075/E2075M-15, “Standard practice for verifying the consistency of ae-sensor
      response using an acrylic rod,” American Society for Testing and Materials West Con-
      shohocken, PA, 2010.
                                               286


[308] B. Arvinti-Costache, M. Costache, C. Nafornita, A. Isar, R. Stolz, and H. Toepfer, “A
      wavelet based baseline drift correction method for fetal magnetocardiograms,” in 2011
      IEEE 9th International New Circuits and systems conference, IEEE, 2011.
[309] T. Childs, K. Maekawa, T. Obikawa, and Y. Yamane, Metal Machining. Elsevier, 2000.
[310] M. Riedl, A. Müller, and N. Wessel, “Practical considerations of permutation entropy,”
      The European Physical Journal Special Topics, vol. 222, no. 2, pp. 249–262, 2013.
[311] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary
      Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
[312] F. Khasawneh, A. Otto, and M. Yesilli, “Turning dataset for chatter diagnosis using ma-
      chine learning.” Mendeley Data, v1. http://dx.doi.org/10.17632/hvm4wh3jzx.1, 2019.
[313] T. Insperger, B. P. Mann, T. Surmann, and G. Stepan, “On the chatter frequencies of
      milling processes with runout,” International Journal of Machine Tools and Manufac-
      ture, vol. 48, no. 10, pp. 1081 – 1089, 2008.
[314] R. E. W. Dombovari, Zoltan and G. Stepan, “Estimates of the bistable region in metal
      cutting,” Proceddings of the Royal Society A, vol. 464, no. 2100, pp. 3255–3271, 2008.
[315] Z. Dombovari and G. Stepan, “On the bistable zone of milling processes,” Philosophical
      Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences,
      vol. 373, no. 2051, p. 20140409, 2015.
[316] Y. Yan, J. Xu, and M. Wiercigroch, “Basins of attraction of the bistable region of
      time-delayed cutting dynamics,” Phys. Rev. E, vol. 96, p. 032205, 2017.
[317] B. P. Mann, K. A. Young, T. L. Schmitz, and D. N. Dilley, “Simultaneous stability
      and surface location error predictions in milling,” Journal of Manufacturing Science
      and Engineering, vol. 127, pp. 446–453, 2005.
                                              287