NONLINEAR EXTENSIONS TO NEW CAUSALITY AND A NARMAX MODEL SELECTION ALGORITHM FOR CAUSALITY ANALYSIS By Pedro da Cunha Nariyoshi A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering – Doctor of Philosophy 2021 ABSTRACT NONLINEAR EXTENSIONS TO NEW CAUSALITY AND A NARMAX MODEL SELECTION ALGORITHM FOR CAUSALITY ANALYSIS By Pedro da Cunha Nariyoshi Although the concept of causality is intuitive, a universally accepted objective measure to quantify causal relationships does not exist. In complex systems where the internal mechanism is not well understood, it is helpful to estimate how different parts of the system are related. In the context of time-series data, Granger Causality (GC) has long been used as a way to quantify such relationships, having been successfully applied in fields as diverse as econometrics and neurology. Multiple Granger-like measures and extensions to GC have also been proposed. A recent measure developed to address limitations of GC, New Causality (NC), offers several advantages over GC, such as normalization and better proportionality with respect to internal mechanisms. However, NC is limited in scope by its seminal definition being based on parametric linear models. In this work, a critical analysis of NC is presented, NC is extended to a wide range of nonlinear models, and, finally, enhancements to a method of estimating nonlinear models for use with NC are reported. A critical analysis is conducted to study the relationship between NC values and model estimation errors. It is shown that NC is much more sensitive to overfitting in comparison to GC. Although the variance of NC estimates is reduced by applying regularization techniques, NC estimates are also prone to bias. In this work, diverse case studies are presented showing the behavior of NC estimation in the presence of regularization. A mathematical study of the sources of bias in the estimates is given. For systems that cannot be modeled well by linear models, the seminal definition of NC performs poorly. This work gives examples in which nonlinear observation models cause NC values obtained with the seminal definition to behave contrary to intuitive expectations. A nonlinear extension of NC to all linear-in-parameters models is then developed and shown to address these limitations. The extension reduces to the seminal definition of NC for linear models and offers a flexible weighting mechanism to distribute contributions among nonlinear terms. The nonlinear extension is applied to a range of synthetic data and real EEG data with promising results. The sensitivity of NC to parameter estimation errors demands that special care be taken when using NC with nonlinear models. As a complement to nonlinear NC, enhancements to an algorithm for nonlinear parametric model estimation are presented. The algorithm combines a genetic search element for regressor selection with a set-theoretic optimal bounded ellipsoid algorithm for parameter estimation. The enhancements to the genetic search make use of sparsity and information theoretic measures to reduce the computational cost of the algorithm. Significant reductions are shown and directions for further improvement of the algorithm are given.
The main contributions of this work are providing a method for estimating causal relationships between signals using nonlinear estimated models, and a framework for estimating the relationships using an enhanced algorithm for model structure search and parameter estimation. This thesis is dedicated to God, the Creator, Redeemer and Giver of Life. Praise God from whom all blessings flow Praise Him all creatures here below Praise Him above ye heavenly hosts Praise Father, Son, and Holy Ghost. – Thomas Ken, “Morning Hymn” ACKNOWLEDGEMENTS Throughout my graduate experience, I have come to fully appreciate that the completion of this thesis was a joint effort and would not have been possible without the help of others. I would like to express my heartfelt gratitude here. I would like to thank my doctoral advisor, Dr. John Deller. He has been a great advisor, mentor and friend during the duration of my program. I thank him for his patience, good humor, and unwavering support for my work through all the challenges I have faced. He has set an example of excellence in teaching, research and mentorship for me. I would like to thank my wife, Shihua Liu. She deserves at least half of the credit for the completion of this dissertation. A most excellent mother, wife and friend. Words cannot express how grateful I am for your companionship and support. I would also like to thank my children, Alice, Samuel and Natalia, who fill my life with glee every day and with whom there is never a dull moment. It has been an immense pleasure and privilege to have you in my life. The members of my committee have been exceedingly patient and gracious with me and the numerous detours I have taken until I was able to gain some traction in my research. I have known Dr. Goodman the longest, and I have never seen him without a smile and a positive attitude. I thank him for all the meetings shared, papers reviewed, the constant optimism, advice given and kindness received. I also thank Dr. Aviyente and Dr. Punch, whom I first knew as excellent and passionate instructors and who later did me the honor of joining my doctoral committee. My family has been a constant source of encouragement and unconditional love. I would like to especially acknowledge my mother, Maria Alice da Cunha Nariyoshi, who unfortunately did not live to see her grandchildren and my graduation. She has always encouraged me to strive forward and persevere. To my brother, João Fernando da Cunha Nariyoshi, and my father, Fernando Massanori Nariyoshi, who have given me much love and encouragement over the years, I convey here my utter gratefulness and love. I also must thank my parents-in-law, Yongheng Liu and Sufan Long, who have welcomed me into their family with open arms and have given me the honor of marrying their precious daughter. I also thank my extended family for so much love received throughout the years. Especially, I thank my grandparents, Ignácio Adonias da Cunha and Alice Eliza da Cunha, and my aunts Nanci and Alice Nariyoshi. Many friends, new and old, who have accompanied and assisted me, also deserve my thanks. I would like to thank Fan Bin, Yiqun Yang and Xiaofeng Zhao, with whom I shared countless lunches and lively conversation, and Blair Fleet and Jinyao Yan, who were great lab mates. I thank my good friends Danilo Luvizotto, Adelle Araújo, and Nicole Torelli, who remained close even though we lived so far apart.
I would be remiss if I didn’t also mention Shichen Zhang, Qianwei Jiang, Li Jie, He Qiong, Sichao Wang, Rebeca Gutierrez, Xiaoxing Han, Yu Cheng, Ifwat Ghazali, Abhinav Gaur, Thássyo Pinto and Anselmo Pontes. I am deeply thankful for each one of you. I thank Dr. McGough for supporting me and serving as my advisor for the first two years of my program. I would also like to thank Dr. Radha, Dr. Wierzba, Dr. Mason and Dr. Chakrapani, who have mentored and supervised me during my teaching assignments. Through these experiences, I have continually grown to love teaching. I am also especially grateful to Dr. Katy Colbry for her kindness and encouragement to all engineering graduate students. And I cannot neglect also thanking Dr. Hogan, Dr. Balasubramanian, Dr. Rothwell, Dr. L. Udpa, Dr. S. Udpa, Dr. Papapolymerou and all members of the ECE office and support staff. From the University of São Paulo, many professors also served as an inspiration to pursue a PhD. Specifically, I would like to thank Dr. Denise Consonni, who sparked my love for Electrical Engineering and teaching. I would also like to thank Dr. Vítor H. Nascimento, who advised me during my undergraduate research work. I also thank Dr. Magno T. M. Silva, Dr. Cristiano Panázio and Dr. Marco Alayo, who were great instructors and mentors. I would also like to thank all my friends in the classes of 2009, 2010 and 2011. I would like to thank Nathanael Fawcett, my pastor during my youth, whom I dearly love and to whom I am grateful. I thank my brothers and sisters in Christ from Intervarsity, Aliança Bíblica Universitária, and the global Church of Christ. Especially, I would like to thank the Bauers, Bielers, Cogans, Fosters, Jeffries, and Starks for being mentors and friends and for walking together with us. TABLE OF CONTENTS LIST OF TABLES . . . . . x LIST OF FIGURES . . . . . xi LIST OF ALGORITHMS . . . . . xiv KEY TO ABBREVIATIONS . . . . . xv CHAPTER 1 INTRODUCTION . . . . . 1 1.1 General statement . . . . . 1 1.2 Research objectives . . . . . 7 1.3 Critical analysis of the study . . . . . 8 1.4 Structure of the dissertation . . . . . 10 1.5 Summary and contributions . . . . . 11 CHAPTER 2 BACKGROUND METHODS . . . . . 12 2.1 Overview . . . . . 12 2.2 Modeling . . . . . 12 2.2.1 Generalized observation model . . . . . 14 2.2.2 Estimation model . . . . . 15 2.2.3 Least squares estimation . . . . . 19 2.2.4 ARMAX models . . . . . 20 2.2.5 ARX models with non-white error sequences . . . . . 20 2.2.6 NARMAX and modified NARX models . . . . .
21 2.2.7 LASSO regression . . . . . 22 2.3 Set-membership optimum bounded ellipsoid algorithms . . . . . 23 2.4 NARMAX model estimation and the EvolOBE method . . . . . 26 2.4.1 Genetic encoding and algorithm overview . . . . . 28 2.5 Causality analysis . . . . . 30 2.5.1 Humean concept of causality . . . . . 31 2.5.2 Granger causality . . . . . 34 2.5.3 Spectral Granger causality . . . . . 36 2.5.4 Conditional Granger causality . . . . . 38 2.5.5 New causality . . . . . 39 2.5.6 Spectral new causality . . . . . 40 CHAPTER 3 A CRITICAL ANALYSIS OF NEW CAUSALITY . . . . . 42 3.1 Overview . . . . . 42 3.2 Problematic aspects of models in NC literature . . . . . 43 3.2.1 Model 1 . . . . . 43 3.2.2 Model 2 . . . . . 44 3.2.3 Model 3 . . . . . 45 3.2.4 Model 4 . . . . . 45 3.2.5 Model 5 . . . . . 46 3.2.6 Model 6 . . . . . 48 3.2.7 Model 7 . . . . . 50 3.2.8 Discussion of models 1-7 . . . . . 51 3.3 Analysis of NC robustness to parameter errors through case studies . . . . . 52 3.3.1 Discussion . . . . . 62 3.3.2 NC and GC fluctuation . . . . . 62 3.3.3 Regression conditioning and over-fitting . . . . . 64 3.3.4 Comparing NC and GC . . . . . 66 3.4 Bias in NC estimates . . . . . 67 3.4.1 Case 1: Δ𝒃ᵀ𝒁 ≈ 𝟎 3.4.2 Case 2: Δ𝒂ᵀ𝑿 ≈ 𝟎 . . . . . 70 3.4.3 Case 3: Δ𝒂ᵀ𝑿 + Δ𝒃ᵀ𝒁 ≈ 𝟎 . . . . . 71 3.4.4 Case 4: Regularization . . . . . 72 3.4.4.1 Extending case 3 . . . . . 73 3.4.4.2 Discussion of bias . . . . . 75 3.5 Conclusion . . . . . 76 CHAPTER 4 A NONLINEAR EXTENSION TO NEW CAUSALITY . . . . . 77 4.1 Overview . . . . . 77 4.2 Motivation . . . . . 77 4.3 Choice of NARMAX models . . . . .
79 4.4 A nonlinear extension to NC for a restricted set of models . . . . . 80 4.5 A comprehensive NNC definition . . . . . 81 4.5.1 Form 1: 𝜆1 - create a new category for nonlinear cross-terms . . . . . 83 4.5.2 Form 2: 𝜆2 - weight regressor functions equally across regressor signals . . . . . 84 4.5.3 Form 3: 𝜆3 - weight regressor functions across regressor signals according to an application (model) dependent criterion . . . . . 86 4.5.4 Spectral nonlinear new causality . . . . . 87 4.6 Discussion and analysis through example models . . . . . 87 4.7 Application: EEG data . . . . . 96 4.8 Discussion of 𝜆 functions and preprocessing . . . . . 102 4.9 Conclusions . . . . . 104 CHAPTER 5 IMPROVEMENTS TO THE EvolOBE METHOD FOR NONLINEAR CAUSALITY ANALYSIS . . . . . 106 5.1 Overview . . . . . 106 5.2 Model form . . . . . 107 5.3 Identification strategy . . . . . 108 5.3.1 NSGA-II . . . . . 109 5.3.2 Asymmetric mutation operator . . . . . 110 5.3.3 Reduced surrogate crossover . . . . . 112 5.3.4 Linkage tree crossover . . . . . 112 5.4 Results of AM and RSX . . . . . 114 5.5 Results of LTX . . . . . 117 5.6 Application to NNC analysis . . . . . 123 5.7 Discussion and conclusions . . . . . 125 CHAPTER 6 CONCLUSION . . . . . 128 6.1 Overview . . . . . 128 6.2 Contributions . . . . . 130 6.3 Future Work . . . . . 130 APPENDICES . . . . . 132 APPENDIX A DERIVATION OF CLOSED-FORM EXPRESSIONS FOR GC AND NC FOR FIRST-ORDER BIJOINTLY REGRESSIVE OBSERVATION MODELS . . . . . 133 APPENDIX B LISTINGS FOR ALGORITHMS . . . . . 141 BIBLIOGRAPHY . . . . . 146 LIST OF TABLES Table 3.1: Theoretically evaluated GC and NC measures for the observation model in Eq. (3.19) . . . . . 54 Table 3.2: Theoretically evaluated NC measures for the observation model in Eq. (3.20) . . . . . 60 Table 4.1: Linear NC𝑥𝑗 →𝑥𝑘 values for the model of Eq. (4.1) . . . . . 78 Table 4.2: Nonlinear NC𝑥𝑗 →𝑥𝑘 values for the model of Eq. (4.1) . . . . .
81 Table 4.3: NC𝑥𝑗 →𝑥𝑘 values for the model in Eq. (4.17) . . . . . . . . . . . . . . . . . . . . . 89 Table 4.4: NC𝑦𝑗 →𝑦𝑘 values for the model of Eq. (4.18) . . . . . . . . . . . . . . . . . . . . . 90 Table 4.5: NC𝑥𝑗 →𝑥𝑘 values for the model of Eq. (4.20) . . . . . . . . . . . . . . . . . . . . . 92 Table 4.6: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.21) with 10dB SNR . . . . . . . . . . . . . 94 Table 4.7: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.21) with 50dB SNR . . . . . . . . . . . . . 94 Table 4.8: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.23) with 10dB SNR . . . . . . . . . . . . . 95 Table 4.9: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.23) with 50dB SNR . . . . . . . . . . . . . 95 Table 4.10: GC, NC, NNC, and SNNC results on whether to accept 𝑥Fp1 causes 𝑥O2 . . . . . 99 Table 4.11: GC, NC, NNC, and SNNC results on whether to reject 𝑥O2 causes 𝑥Fp1 . . . . . . 99 Table 4.12: GC, NC, NNC, and SNNC results on whether to accept 𝑥Fp1 causes 𝑥P4 . . . . . . 101 Table 4.13: GC, NC, NNC, and SNNC results on whether to reject 𝑥P4 causes 𝑥Fp1 . . . . . . 101 Table 5.1: Fitted parameters for different methods . . . . . . . . . . . . . . . . . . . . . . . 117 x LIST OF FIGURES Figure 2.1: Geometric illustration of OBE algorithms . . . . . . . . . . . . . . . . . . . . . 25 Figure 2.2: NSGA-II algorithm summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Figure 2.3: Different explanations for large GC𝑥→𝑧 . . . . . . . . . . . . . . . . . . . . . . 38 Figure 3.1: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 1. . . . . . . . . 55 Figure 3.2: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 1. . . . . . . . . 55 Figure 3.3: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 2. . . . . . . . . 56 Figure 3.4: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 2. . . . . . . . . 56 Figure 3.5: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 5. . . . . . . . . 57 Figure 3.6: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 6. . . . . . . . . 58 Figure 3.7: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 5. . . . . . . . . 58 Figure 3.8: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 6. . . . . . . . . 59 Figure 3.9: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 5 and 𝑁 = 1024. 59 Figure 3.10: Distribution of the NC1→1 estimates as a function of 𝜆 for the model shown in Eq. (3.20) and 𝑀 = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Figure 3.11: Distribution of the NC1→1 estimates as a function of 𝜆 for 𝑀 = 6 . . . . . . . . 61 Figure 3.12: Distribution of the NC1→1 estimates as a function of 𝜆 for 𝑀 = 1 for the model shown in Eq. (3.20) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Figure 3.13: NC1→1 vs NC0,1→1 histogram plots as a function of 𝜆 . . . . . . . . . . . . . . . 63 Figure 3.14: NC1→1 vs NC0,1→1 histogram plots as a function of 𝜆 . . . . . . . . . . . . . . . 66 Figure 3.15: Distribution of the GC2→1 estimates as a function of 𝜆 for the model shown in Eq. (3.20) and 𝑀 = 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Figure 3.16: Distribution of the GC1→2 estimates as a function of 𝜆 for the model shown in Eq. (3.20) and 𝑀 = 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 xi Figure 3.17: NC estimates for different values of NC0 and 𝛽 . . . . . . . . . . . . . . . . . . 
71 Figure 3.18: Estimated probability density function of NC using exact and approximate expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Figure 3.19: Estimated probability density function of NC split into two cases . . . . . . . . 75 Figure 4.1: NC values for the model of Eq. (4.1) . . . . . . . . . . . . . . . . . . . . . . . . 79 Figure 4.2: NC and NNC values for the model of Eq. (4.1) . . . . . . . . . . . . . . . . . . 82 Figure 4.3: 10-20 International System Electrode Location Diagram . . . . . . . . . . . . . 96 Figure 4.4: Spectrum of the Fp1 channel of the EEG recording . . . . . . . . . . . . . . . . 97 Figure 4.5: Average of SNNCFp1→O2 values of subject 1 . . . . . . . . . . . . . . . . . . . . 98 Figure 4.6: Receiver operating characteristic curves for the unfiltered tests . . . . . . . . . 100 Figure 4.7: Receiver operating characteristic curves for 13.5Hz . . . . . . . . . . . . . . . . 101 Figure 4.8: Receiver operating characteristic curves for the unfiltered tests . . . . . . . . . 102 Figure 4.9: Receiver operating characteristic curves for 13.5Hz . . . . . . . . . . . . . . . . 102 Figure 5.1: NSGA-II algorithm summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Figure 5.2: Linkage tree example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Figure 5.3: Estimated pareto front . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Figure 5.4: Histogram vs. Fitted distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Figure 5.5: Generations to arrive at the desired model . . . . . . . . . . . . . . . . . . . . 116 Figure 5.6: Estimated regressor functions present in best models . . . . . . . . . . . . . . 119 Figure 5.7: Estimated pareto-front for 15dB SNR . . . . . . . . . . . . . . . . . . . . . . . . 119 Figure 5.8: Histogram of required evaluations for RSX and LTX . . . . . . . . . . . . . . . 120 Figure 5.9: Fitted PDF for the required evaluations . . . . . . . . . . . . . . . . . . . . . . 121 Figure 5.10: CDF for the required number of evaluations to find the desired solution . . . . 121 xii Figure 5.11: Close-up for fewer than 15000 evaluations . . . . . . . . . . . . . . . . . . . . 122 Figure 5.12: CDF for the required number of evaluations for 15dB SNR . . . . . . . . . . . . 123 Figure 5.13: Comparison between estimated pareto fronts for different SNR values . . . . . 123 Figure 5.14: NNC values for the final candidate model set for 10dB SNR . . . . . . . . . . . 124 Figure 5.15: NNC values for the final candidate model set for 50dB SNR . . . . . . . . . . . 125 xiii LIST OF ALGORITHMS Algorithm B.1: Unified Optimum Bounded Ellipsoid Algorithm . . . . . . . . . . . . . . . 142 Algorithm B.2: UOBE Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 Algorithm B.3: Automatic Bounds Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 144 Algorithm B.4: Generate Linkage Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 Algorithm B.5: Linkage Tree Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Algorithm B.6: Linkage Tree Crossover for multi-objective problems . . . . . . . . . . . . 
145 KEY TO ABBREVIATIONS AIC Akaike Information Criterion AR Autoregressive ARMAX Autoregressive Moving Average with Exogenous input ARX Autoregressive with exogenous input BIC Bayesian Information Criterion EvolOBE Evolved Optimum Bounded Ellipsoid GC Granger Causality LSE Least Squares Estimation LTGA Linkage Tree Genetic Algorithm LTI Linear and Time Invariant LTIiP Linear and Time Invariant in Parameters LTX Linkage Tree Crossover MSE Mean Square Error NSGA Non-dominated Sorting Genetic Algorithm OBE Optimum Bounding Ellipsoid RMSE Root Mean Square Error RSX Reduced Surrogate Crossover WRLS Weighted Recursive Least Squares CHAPTER 1 INTRODUCTION 1.1 General statement The concept of causation and consequence is at the foundation of the scientific method. Although causality is an intuitively simple concept (action 𝐴 causes event 𝐵 to occur), a universally accepted definition of causality has long eluded scientists and philosophers. Understanding causal relationships is an essential step in the analysis of complex systems. Despite significant theoretical and heuristic advances on the topic, quantifying and tracking causality strength and assessing the causal link between two dependent quantities or events is still an active field of research. The scientific approach to establishing these relationships is to create falsifiable hypotheses (e.g., “𝐴 causes 𝐵” or “𝐴 does not cause 𝐵”) and subsequently test which hypothesis provides the most satisfactory answer. The analysis often starts by taking measurements or observations of quantities that are relevant (or at least possibly relevant) to the question. In the context of signal processing, these measurements are referred to as signals. Signals are frequently classified as inputs and outputs, which are somewhat analogous to causes and effects. The system is the underlying entity that processes the quantities from which input signals are measured into the quantities from which the output signals are measured. When signals are measured sequentially at constant time intervals, the resulting sequence is called a time series. The mathematical representation of how the inputs and outputs are related is called a model. The models are constructed given the available data, the particular hypothesis being considered and any a priori knowledge available about the system being studied. Many causality analysis methods involve the creation of models and the measurement of intrinsic characteristics of one or more models and statistical properties of the data. At the heart of the scientific method, Occam’s razor has been used as a heuristic tool to evaluate different explanations of observed phenomena. Also known as lex parsimoniæ (law of parsimony), it states that “Entities are not to be multiplied without necessity,” or, in other words, for different explanations of a phenomenon, the simplest (satisfactorily accurate) explanation is to be preferred. With regard to model construction, this has been re-expressed (somewhat amusingly) by Box [31]: Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. . . . [T]here is no need to ask the question "Is the model true?". If "truth" is to be the "whole truth" the answer must be "No". The only question of interest is "Is the model illuminating and useful?".
For causality analysis, it is often expedient to disregard many fundamental aspects of a system in order to produce a model that provides better intuition of the relationships between potential inputs and outputs (or causes and effects) [104]. For instance, one need not know the line frequency or voltage to assert that a light switch controls a lamp, even though these are fundamental design parameters for the internal function of the circuit. For more complex systems, determining what aspects to consider or ignore in constructing a model is not straightforward [138]. While the problem of model structure selection and validation cannot be universally solved, it is possible to employ general principles to find useful models. Models with higher complexity may potentially better represent the system being observed, but may also be prohibitively expensive or require large amounts of data to be accurately computed. Ljung summarizes the problem with [128, pg. 494]: The compromise between parsimony and flexibility is at the heart of the identification problem. How shall we obtain a good fit to data with few parameters? The answer usually is to use a priori knowledge about the system, intuition, and ingenuity. These facts stress that identification can hardly be brought into a fully automated procedure. . . . A general advice is to “try simple things first.” Besides challenges of properly modeling systems, quantifying causal relationships represents an additional non-trivial problem. Causality can only be inferred (but not determined) from time-series records (and only under certain conditions [154]). The most widely used method for assessing causality in the context of signals and systems – the context of this work – is known as Granger Causality (GC) [76, 77]. Borrowing from Hume’s study of causality [103], GC focuses on evaluating how well past information about a signal or event 𝐴 can predict the current state of a second signal or event 𝐵. The method has received several extensions, such as conditional GC (CGC) [72] and spectral GC (SGC) [71], as well as similar spectral methods such as partial directed coherence (PDC) [12, 169, 173], the relative power contribution (RPC, also referred to as Akaike Causality) [3, 208] and the directed transfer function (DTF) [57, 108, 176]. In addition to transfer function and model based approaches, alternative methods abound for inferring connectivity between time-series records [80, 156]. Phase analysis methods have shown promise in inferring connectivity, such as the phase-locking value (PLV) [91, 120], the phase slope index (PSI) [86, 150] and phase-synchrony [9, 10, 119]. More recently, phase-amplitude coupling methods have been applied with promising results [140]. Information theory based methods, such as the directionality index (DI) [126, 170], Mutual Information (MI) and Transfer Entropy (TE) [196], have also been employed, but in general require more data for estimating probability distributions [114] and cannot capture quickly time-varying characteristics, such as functional connectivity microstates in the brain [61]. The present work focuses on model based approaches rather than phase analysis and information theoretic approaches. A more recent causality analysis method, New Causality (NC) [95], uses a different approach, relying on the internal structure and states of a multivariate autoregressive model (MVAR) to estimate causality strength.
The use of the internal structure presupposes that the models appropriately represent the mechanisms being studied. This assumption is not necessarily correct in complex models; nonetheless, NC possesses several desirable characteristics, such as the production of a normalized value for which the sum of all the NC values contributing to a particular “effect” signal adds to unity. Relative to GC, NC allows easier comparisons among different systems, because the measured causality strength increases with increasing NC values, whereas GC might produce “small values” even when signals have a strong causal link [94] or may not depend on relevant model parameters [100]. Moreover, in the tests with real and surrogate data in [94, 95, 98–100], NC is superior to GC in the indication of causality strength. However, as shown in [148], NC is more sensitive to model parameter overfitting than GC, requiring more accurate model parameter estimation to produce meaningful results. Further, the seminal formulation of NC is restricted to linear MVAR models. One of the central contributions of the present work is the extension of NC to the far more general nonlinear autoregressive moving average with exogenous input (NARMAX) models [22] while retaining all the advantageous properties of NC. One principle often used to obtain “useful” models is to find the simplest models that provide good explanatory power. Since model simplicity and high explanatory power are often conflicting objectives, system identification algorithms seek a solution that provides the “best” balance/trade-off between the two objectives. Evaluating the complexity1 of a model is not simple, especially when distinct classes of models must be compared, such as ones generated by artificial neural networks (ANN) or genetic programming and linear models. Within the same class of models, however, there are often methods of quantifying complexity. Particularly, for linear models, many approaches to compare model complexity exist, such as the 𝑙0 norm of the parameter space [193] and the model orders for autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) models. 1 The term “complexity” is used here in a customary (non-technical) sense. For parametric models, accuracy is usually optimized using a mean square error (MSE) criterion, although in some cases other measures, such as total least squares [133] or the 𝑙∞ norm (also known as max-norm) [56, 84, 185], might be preferable. Parametric models frequently employ a form of regularization of the parameter space to balance model complexity and prediction error. This is achieved by adding a regularization term to the cost function, which assigns a penalty to solutions with higher order or larger norm of the parameter space. Regularization may be regarded as a Bayesian approach to model estimation, in which prior information (i.e., assumptions) is used in the formulation of the models [111]. If the model is assumed to be sparse, the 𝑙0 norm of the parameters can be used [174]. Total variation has been applied for image denoising, assuming the noise-free image is smooth [29, 171]. As long as the assumptions about the formulation of the models are close enough to reality, regularization can greatly aid parameter estimation [166]. For linear and time-invariant (LTI) models, a vast variety of methods and literature are available which are based on well-established theories, such as Fourier transforms [164].
Linear and time-invariant models possess a number of properties that make them amenable to analysis [152] and can be completely characterized (within the constraints of short-term processing) by the impulse response of the model or, equivalently, the system function. The advantages of LTI modeling are such that it is sometimes desirable to linearize nonlinear models so that LTI techniques may be applied [151]. However, LTI models are increasingly deemed insufficient for system analysis and design in the 21st century [25, 39]. For linear time-varying systems, adaptive methods exist, such as Least Mean Squares (LMS) [205], Recursive Least Squares (RLS) [158] and derivatives, such as the Normalized Least Mean Squares (NLMS) [172] and Set-membership Weighted Recursive Least Squares [52]. However, addressing nonlinearity in models is an ongoing problem, for which the development of a concise universal methodology is unlikely. While ad-hoc techniques, such as nonlinear state-space models, have been successful in applications like neural connectivity analysis [68] and stock market volatility [190], they require relatively intimate understanding of the systems being modeled and do not generalize well outside their application domains. Artificial neural network models are also very powerful and have fostered advances in prediction [47], classification [123] and even complex gameplay [183, 197]. The universal approximation theorem states that ANNs can potentially represent any continuous function on compact subsets [48, 49, 129], although the learnability of the parameters is not addressed by the theorem. In spite of this performance, the black-box nature of ANNs remains one of the largest criticisms [58] and a barrier to interpretability. Recent developments seek to address some of these criticisms by providing methods of interpreting ANN models [130, 181]. Finally, linear and time-invariant in parameters (LTIiP) models, which are most concisely expressed using NARMAX models [22] (of which the Volterra series [198] and the Hammerstein models are special cases [145]), have been shown to be very powerful in many applications, from epidemiology [165] and microbial growth [210] to human physiology [116] and aerospace engineering [34]. A significant advantage of LTIiP methods is that they allow the use of the wealth of powerful and well understood LTI methods of system identification in identifying nonlinear models. Additionally, NARMAX models enable sparse, interpretable and transparent modeling [202], all of which are characteristics desirable in causality analysis. While NARMAX models provide a concise but flexible parsimonious model paradigm [117], NARMAX models also introduce a new set of challenges. Unlike ARMAX models, which only allow time-shift operators to be applied to the regressor signals, NARMAX models allow the application of other operators, generally called regressor functions. Depending on the model order and class of regressor functions, the number of such functions may be very large. Also, it is often the case that regressors are highly correlated, leading to slow and inaccurate convergence [11]. Although many methods exist, the selection of the subset of the regressor functions is an unresolved issue in system identification for over-parameterized models [2, 22, 25, 27, 81, 117, 118, 201, 203, 214, 219].
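To make the scale of the structure-selection problem concrete, the following Python sketch (with illustrative parameters; it is not part of the EvolOBE implementation and the helper names are ad hoc) enumerates a polynomial NARX candidate dictionary, showing how quickly the number of candidate regressor functions grows with the maximum lag and polynomial degree.

    import itertools
    import numpy as np

    def candidate_regressors(max_lag, degree, n_signals=2):
        """Enumerate polynomial NARX regressor functions up to a given degree.

        Each candidate is a tuple of (signal, lag) factors; the regressor value is
        the product of the corresponding lagged samples.
        """
        lagged_terms = [(q, m) for q in range(n_signals) for m in range(1, max_lag + 1)]
        candidates = []
        for d in range(1, degree + 1):
            candidates.extend(itertools.combinations_with_replacement(lagged_terms, d))
        return candidates

    def evaluate_regressor(term, x, n):
        """Evaluate one product-of-lags regressor at time n (x has shape [n_signals, N])."""
        return np.prod([x[q, n - m] for (q, m) in term])

    # For 2 signals, maximum lag 4 and degree 3 there are already 164 candidates
    # (8 linear + 36 quadratic + 120 cubic), most of which are mutually correlated,
    # hence the need for an explicit structure search.
    print(len(candidate_regressors(max_lag=4, degree=3)))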
A recent method for NARMAX model identification [214], henceforth called evolved OBE (EvolOBE),2 has shown promise in developing accurate, sparse and interpretable results. This method searches for a family of NARMAX model structures that maximize accuracy while minimizing the number of regressors. The method involves a hybrid approach which uses a genetic algorithm to select regressor functions, while employing a set-membership based optimum bounding ellipsoid algorithm [52] to estimate the parameter values. A significant advantage of such a method is that it does not require any assumptions about stationarity or distributional characteristics of the model disturbances. The capability of identifying simple nonlinear models with good accuracy with unbiased parameters under complex noise conditions makes this algorithm compelling for use in complex nonlinear systems, avoiding overfitting and maintaining good interpretability of model structure. 2 In [209], this algorithm is called OBE with evolved regressor signals (OBE-ERS). 1.2 Research objectives Causality analysis is often employed to gain insight about systems whose internal properties are unknown. Granger causality possesses an intuitive interpretation: if the inclusion of past values of a signal 𝑥 improves the prediction of the current value of a second signal 𝑦 compared to predicting 𝑦 using only past values of 𝑦 itself, then this improvement can be used as evidence that 𝑥 causes 𝑦. (A brief numerical sketch of this comparison is given at the end of this section.) While conceptually simple, it can be difficult to map GC values to information about the systems which relate 𝑥 and 𝑦 [95], as GC is designed to measure effect, not mechanism [19].3 On the other hand, NC draws directly from the mechanism (of the model) and thus is complementary to GC, providing new insight into the models. 3 The authors of [95] dispute this claim in [99]. However, the literature on NC is limited in comparison to the wealth of methods for estimating and applying GC. Additionally, most of the studies of NC have assumed that the observational and the estimated models are equivalent, with little discussion on the validity of that assumption and the consequences to the analysis results. A deeper characterization of the robustness of NC to model order and parameter uncertainty is required to increase understanding and confidence in the use of NC [148]. Although GC was only defined for MVAR models in its seminal form [76, 77], nonlinear extensions exist [7, 13, 66, 132]. The seminal definition of NC is also restricted to MVAR models, so the extension of NC to NARMAX models developed in this work will allow NC to be useful in a much wider range of applications. To improve upon NC and address some of its drawbacks, this work takes a two-pronged approach: first, an extension of NC to a more comprehensive set of linear and nonlinear models is developed [146]; second, the framework for nonlinear system identification found in [214] is explored and improved in the search for “useful” models. The present work also includes the implementation and discussion of state-of-the-art methods for improved search speed and accuracy [149, 192, 217]. Thus, the research objectives of this study are to: 1. Characterize the behavior of NC under model order and parameter uncertainty. 2. Extend the formulation of NC to enable application to LTIiP nonlinear models. 3. Improve model structure search for LTIiP nonlinear models through use of enhanced genetic algorithms. 4. Apply the model structure search algorithms to causality analysis using sets of simulated and real data.
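To make the prediction-improvement reading of GC described above concrete, the following Python sketch (with an assumed illustrative bivariate model whose coefficients are chosen only for the example; this is not the formal GC estimator reviewed in Chapter 2) compares a restricted predictor of 𝑦 built from its own past with a full predictor that also uses the past of 𝑥, and reports the log ratio of the residual variances. A positive value indicates that the past of 𝑥 improves the prediction of 𝑦.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 5000

    # Illustrative bivariate observation model: x drives y with a one-sample delay.
    x = np.zeros(N); y = np.zeros(N)
    for n in range(1, N):
        x[n] = 0.5 * x[n - 1] + rng.standard_normal()
        y[n] = 0.8 * y[n - 1] + 0.4 * x[n - 1] + rng.standard_normal()

    def residual_variance(target, regressors):
        """LSE fit of target on the given regressor columns; returns the residual variance."""
        coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
        return np.var(target - regressors @ coef)

    # Restricted model: predict y[n] from y[n-1] only.
    # Full model: predict y[n] from y[n-1] and x[n-1].
    restricted = residual_variance(y[1:], np.column_stack([y[:-1]]))
    full = residual_variance(y[1:], np.column_stack([y[:-1], x[:-1]]))

    gc_x_to_y = np.log(restricted / full)   # positive when the past of x helps predict y
    print(gc_x_to_y)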
1.3 Critical analysis of the study In the same vein as Box’s remark, it would not be expected that causality would be discriminable from time-series records alone. While all techniques discussed in this work could potentially be applied to any multivariate time-series data, a priori information should be used to first evaluate if the hypothesis of causality is plausible and whether all relevant factors have been considered.4 A machine cannot correct operator mistakes because “it cannot think for itself” [137]. Thus, causality measures must represent only a part of causality analysis, because such measures are unable to differentiate between alleged causality and deficient experimental design. New causality is under the same restrictions and is prone to produce misleading results if incorrect or incomplete data are used. 4 Cliff stated this fact as “these programs are not magic. They cannot tell the user about what is not there.” [46] Cartwright argues that one cannot get knowledge of causes from equations and associations alone [36], but instead, old causal knowledge must be used to extract new causal knowledge. Holland and Durbin [92] have also argued that only one cause can be observed at a time, what they referred to as the fundamental problem of causal inference. That is, suppose it is desired to know if intervention 𝐴 (e.g., medication) will cause 𝐵 (e.g., reduction of a particular symptom) on a particular patient 𝐶. If it is chosen to do 𝐴, one can measure the outcome of 𝐴 given 𝐶 (e.g., giving the medicine to 𝐶), but not the outcome of not doing 𝐴 on 𝐶 (e.g., not giving the medicine to 𝐶), and vice-versa. Therefore, one must either take a statistical approach of testing different interventions over a large population (e.g., giving the medicine to people similar to 𝐶 reduced the symptom in 80% of them, whereas, when given a placebo, the symptom was reduced in 40% of them) or an approach they call scientific, which requires the assumptions of homogeneity (e.g., the outcome of an intervention in the past would be the same in the present) so that different outcomes can be compared (e.g., the sentence “symptoms are reduced every time 𝐶 takes the medicine” assumes that the effect of 𝐴 on 𝐶 is time-invariant even if 𝐶 might change over time). Additionally, they assert that causes can only be interventions that are imposed (not voluntary) and are not attributes (e.g., one cannot state that a car is fast because it is a Ferrari, since it would be impossible to measure the speed of the same car if it were made by Ford, because it would not be the same car after all. Instead, one could only say that cars made by Ferrari are usually faster than cars made by Ford, without establishing a causal relationship). Their conclusions were summarized in the motto: “no causation without manipulation.” However, Pearl argues in [154] that, while manipulation is simply one way to test the workings of mechanisms, it is by no means necessary for causal determination. Humans can confidently say that the moon causes tides (even if we cannot observe the effects of the lack of a moon) or that the genetic code of a raven causes it to be black (even without manipulating its DNA). As will be discussed in Sec. 2.5, Hume believes humans to be unable to assert causation. Thus he devises a framework through which causation can be inferred. Granger causality builds upon Hume’s work, creating a formal measure for causal inference.
Granger causality is closely linked with the concept of TE, which measures transferred information rather than how two signals are interconnected. In fact, GC and TE are equivalent for normally distributed signals [14]. The differences between transferred information (and therefore GC) and causal effects are sometimes subtle but not negligible [127]. Similarly to the seminal definition of NC, the nonlinear extension of NC [146] fundamentally relies on the quality5 of the estimated models being used. As shown in [148], even when the data are generated by a parametric model of the same class as the estimated models, the NC measure values depend heavily on the accuracy of the parameter estimates, whereas GC was shown to be much more robust to parameter estimation errors. The use of robust parametric model estimation methods mitigates this uncertainty somewhat, but careful selection and examination of the estimated models remains essential in evaluating causality using NC. 5 Quality in this context refers to the ability of a model to sufficiently represent the internal dynamics of a system. This is in contrast to many predictive models, whose design is based on the ability to predict the output of a system given a set of inputs, often without regard to the actual internal dynamics of the system. Causality analysis studies generally focus on systems with complex behaviors and/or unknown internal mechanisms. The goal is often to gain some insight into the functioning of a system, without necessarily fully comprehending the internal interactions. This poses a problem for the evaluation of novel causality analysis tools, as most real datasets do not possess a “ground truth” for validation. Synthetic datasets offer several advantages, the foremost for causality analysis being the presence of ground truth. The knowledge of internal parameters also allows decoupling the quality of the causality measure from the model estimation aspect of the measure. On the other hand, while the ability to tune models to exhibit different behaviors is often desirable, as one can test the measure under different scenarios, the use of synthetic datasets can also (accidentally or intentionally) produce misleading results [147]. This work utilizes a set of real and synthetic datasets to show performance on a variety of problems, showing interesting results in a number of applications, but makes no claim of supremacy, rather presenting the nonlinear extension of NC as an additional and useful tool for the signal processing practitioner. As with any powerful analysis tool, care must be taken in its application and the interpretation of the results. Again, a machine cannot correct operator mistakes regardless of how powerful the machine and how smart the operator may be. 1.4 Structure of the dissertation This dissertation begins with the background methods chapter, in which an overview of modeling and modeling philosophy is given. This is foundational for the discussion which follows. The background material is followed by a short review of existing causality analysis tools. Finally, the model identification framework is laid out, with discussion of the particular techniques implemented. The model development is followed by a series of studies. First, a critical analysis of NC is given, which discusses models used in the literature, the robustness of NC under model uncertainty, and derivations of sources of bias in NC estimation.
Second, nonlinear extensions to NC are developed, with application examples using synthetic and real data. Third, enhancements to the EvolOBE method are presented; the method is tested against simulated data, the results are evaluated against the observational models, and the GC and NC values obtained using the evolutionary algorithm are compared to the values obtained using the observational models. The studies are followed by the conclusion chapter, where a summary and a discussion of the results are given. 1.5 Summary and contributions The concept of causality is integral to the scientific method. However, concisely defining and quantifying causal relationships is an elusive task. Many methods of evaluating causality have been created, with GC being the most prominent. However, since GC is designed to measure effect, not mechanism, NC can be used in conjunction to obtain more insight into the systems being studied. This work expands on NC by extending it to a wide range of nonlinear models and, thus, its applicability to a wider set of problems, and by doing a deeper critical analysis of NC, as portrayed in existing literature, and its behavior under model structure and parameter uncertainty. Additionally, this work also includes improvements to the EvolOBE method, which are applied to the nonlinear extension to NC. These results will drive the field forward to a more comprehensive set of causality analysis tools that include nonlinear NC. CHAPTER 2 BACKGROUND METHODS 2.1 Overview This chapter includes an overview of some of the methods used in this work. A large portion of Sec. 2.2 is quoted directly from [147–149] with a few modifications for improved flow and clarity. 2.2 Modeling Before delving into the topic of causality analysis, it is important to make a distinction between systems and models. The time-series literature tends to be somewhat cavalier in the formulation of parametric time-series models. Widespread understanding of the fundamental modeling concepts allows a certain lack of precision in model notation. In particular, it is not uncommon to use the same modeling notation for the putative observation model and the estimation model. The observation model, ordinarily one of the standard time-series models [32] with white-noise or more strongly independent disturbances, is assumed to generate the observed sequence. Accordingly, its parameters are unknown, but the model is posed for theoretical analysis. The estimation model (or estimated model, following model identification) is the parametric model resulting from the model identification process. Although the observation model and the estimated model are naturally similar in form, the two models may have quite different parameter values and accompanying disturbances. Since this distinction is important in the causality analysis approaches studied in this work, this section is dedicated to a clear explanation of the intricacies of models, including the establishment of a clear convention for model nomenclature and notation. To simplify this task, the discussion will be restricted to a class of models that are linear, time-invariant and causal (over the interval of observation). This restriction simplifies the task of modeling signals – the model therefore representing a discrete-time system of which only the output is observable.
Moreover, the intention to use conventional least-square-error (LSE) estimation of model parameters (in keeping with existing literature to which this work refers) prescribes that the natural choice of signal observation model is – at least in the case of a model involving a single signal – the standard time-series model known as the autoregressive (AR) model, often denoted AR(ℳ) to indicate that the model has ℳ parameters. The AR(ℳ) observation model for a signal sequence 𝑥 is given by

x[n] = \sum_{m=1}^{\mathcal{M}} a_m^{*}\, x[n-m] + \eta[n] \;\doteq\; (\boldsymbol{a}^{*})^{T}\boldsymbol{x}[n] + \eta[n], \quad n \in \mathbb{Z}, \qquad (2.1)

in which, by convention, 𝜂 is a discrete-time white noise process, and in which we have defined the Cartesian ℳ-vectors,

\boldsymbol{a}^{*} \doteq \big[\, a_1^{*} \;\; a_2^{*} \;\; \cdots \;\; a_{\mathcal{M}}^{*} \,\big]^{T}, \qquad \boldsymbol{x}[n] \doteq \big[\, x[n-1] \;\; x[n-2] \;\; \cdots \;\; x[n-\mathcal{M}] \,\big]^{T}. \qquad (2.2)

The parameter values, 𝑎𝑚∗, include the superscript symbol “∗” to indicate the “true” parameters – that is, the parameters associated with the observation model. The estimation of these parameters is discussed in a more general context below. Let us digress momentarily to comment on a terminology issue. Some authors might choose to refer to the model of form (2.1) as a “generative model” (or “synthesis model”), referring to its assumed role in “generating” or “synthesizing” the sequence 𝑥. For the reporting of future research extending the present developments, the authors prefer to reserve the term generative model to refer to an unconstrained (and generally unknowable) operator, say H, across normed vector spaces that is “used by nature” to exactly (without error at any level of precision) produce the signal 𝑥 from the input 𝜂, say 𝑥 = H𝜂. We will therefore deliberately use the term “observation model” when referring to Eq. (2.1) and related extensions. It remains to specify the models used in estimation (following some further consideration of the observation model). The issue we are addressing by taking extra care in defining what each model refers to is necessitated by the following matter: it is not unusual for an author (across many fields) to, for example, implicitly use model (2.1) – with parameters 𝑎𝑖, rather than 𝑎𝑚∗ – then to refer to the estimated parameters with the same notation 𝑎1, … , 𝑎ℳ, thus creating ambiguity in the meaning of the parameter symbol notation. Less frequently, but all too commonly, the sequence name 𝜂 may also be used to indicate the error sequence in the estimated model (in the AR case, the residual in the linear prediction of 𝑥[𝑛] using 𝑥[𝑛 − 1], … , 𝑥[𝑛 − ℳ]), thereby creating further ambiguity. Whereas such practices are generally accepted and lead to no adverse issues for the experienced practitioner, it is critical to clearly distinguish the various models used in the present discussion.
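For concreteness, the following short Python fragment simulates an AR observation model of the form of Eq. (2.1) with an illustrative second-order parameter vector (the coefficient values are assumptions chosen only for this example and are not used elsewhere in this work).

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative "true" parameters a* of an AR(2) observation model (Eq. 2.1).
    a_star = np.array([0.6, -0.2])
    N = 1000
    eta = rng.standard_normal(N)        # white-noise disturbance eta[n]
    x = np.zeros(N)
    for n in range(2, N):
        # x[n] = a1* x[n-1] + a2* x[n-2] + eta[n]
        x[n] = a_star[0] * x[n - 1] + a_star[1] * x[n - 2] + eta[n]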
2.2.1 Generalized observation model Before addressing the estimation models, we need to enhance the AR model of Eq. (2.1) for the present purposes. One can approach the required modification in several ways. Equation (2.1) represents a model for a single signal generated by passing uncorrelated noise through a linear filter. Causality analysis is generally concerned with multiple signals, say 𝑥1, 𝑥2, … , 𝑥𝑁𝑠, where 𝑁𝑠 ≥ 2 denotes the number of such signals, and the possibility that any of the signals 𝑥𝑗, 𝑗 = 1, … , 𝑁𝑠, may contribute to (may “cause”) the generation of 𝑥𝑝 for a given 1 ≤ 𝑝 ≤ 𝑁𝑠. The inclusion of linear combinations of samples from further signals on the right side of Eq. (2.1) makes it improper to refer to the model as “autoregressive.” The augmented model (in the “careful” notation suggested above), assuming, for convenience, that, for every 𝑝, 𝑥𝑝 has a linear dependency on ℳ past values of each of the signals including itself, takes the form

x_p[n] = \sum_{m=1}^{\mathcal{M}} a_{pp}^{m*}\, x_p[n-m] + \Bigg(\sum_{\substack{q=1 \\ q \neq p}}^{N_s} \sum_{m=1}^{\mathcal{M}} a_{pq}^{m*}\, x_q[n-m]\Bigg) + \eta_p[n] \;\doteq\; (\boldsymbol{a}_p^{*})^{T}\boldsymbol{x}[n] + \eta_p[n], \qquad (2.3)

where 𝜂𝑝 continues to denote a scalar white-noise excitation for 𝑝 and the vectors 𝒂𝑝∗ and 𝒙[𝑛] are extended in the natural way relative to Eq. (2.1):

\boldsymbol{a}_p^{*} = \big[\, a_{p1}^{1*} \; a_{p1}^{2*} \; \cdots \; a_{p1}^{\mathcal{M}*} \;\; a_{p2}^{1*} \; \cdots \; a_{p2}^{\mathcal{M}*} \; \cdots \; a_{pN_s}^{1*} \; \cdots \; a_{pN_s}^{\mathcal{M}*} \,\big]^{T} \quad \text{and} \quad \boldsymbol{x}[n] = \big[\, x_1[n-1] \; \cdots \; x_1[n-\mathcal{M}] \; \cdots \; x_{N_s}[n-1] \; \cdots \; x_{N_s}[n-\mathcal{M}] \,\big]^{T}, \qquad (2.4)

with 𝒂𝑝∗ and 𝒙[𝑛] both vectors in R^{𝑀𝒂∗}, where 𝑀𝒂∗ is the number of parameters used in modeling signal 𝑥𝑝,

M_{\boldsymbol{a}^{*}} \doteq N_s\,\mathcal{M} = \dim\{\boldsymbol{a}_p^{*}\}. \qquad (2.5)

Although this is not customary in the current literature on causality modeling, the most conventional way to refer to such a model (for each 𝑝) would be as an autoregressive model with exogenous inputs (ARX). One can also view this model as representing a multiple-input, single-output (MISO), discrete-time system (if the disturbance 𝜂𝑝 is viewed as an excitation), but with the caution that it is only recursive in the signal 𝑥𝑝, with 𝑥𝑗, ∀𝑗 ≠ 𝑝, serving as exogenous inputs for each 𝑝. Models accounting for multiple outputs are sometimes referred to as jointly regressive models [100, 109] or multivariate autoregressive (MVAR) models [33, 108, 204]. An important special case of the observation model of Eq. (2.3) occurs for 𝑁𝑠 = 2, which appears in problems in which the causality effects between two signals are analyzed. In this case, the observation model can be written as two explicit equations,

x_1[n] = \sum_{m=1}^{\mathcal{M}} a_{11}^{m*}\, x_1[n-m] + \sum_{m=1}^{\mathcal{M}} a_{12}^{m*}\, x_2[n-m] + \eta_1[n] \;\doteq\; \boldsymbol{a}_{11}^{*T}\boldsymbol{x}_1[n] + \boldsymbol{a}_{12}^{*T}\boldsymbol{x}_2[n] + \eta_1[n],
x_2[n] = \sum_{m=1}^{\mathcal{M}} a_{22}^{m*}\, x_2[n-m] + \sum_{m=1}^{\mathcal{M}} a_{21}^{m*}\, x_1[n-m] + \eta_2[n] \;\doteq\; \boldsymbol{a}_{22}^{*T}\boldsymbol{x}_2[n] + \boldsymbol{a}_{21}^{*T}\boldsymbol{x}_1[n] + \eta_2[n]. \qquad (2.6)

These equations can be formulated as the more general model of Eq. (2.3). For example, for 𝑁𝑠 = 2,

\boldsymbol{a}_p^{*} = \big[\, \boldsymbol{a}_{p1}^{*T} \;\; \boldsymbol{a}_{p2}^{*T} \,\big]^{T} \quad \text{and} \quad \boldsymbol{x}[n] = \big[\, \boldsymbol{x}_1^{T}[n] \;\; \boldsymbol{x}_2^{T}[n] \,\big]^{T}. \qquad (2.7)

2.2.2 Estimation model Turning to the estimation model, it is customary in the linear modeling case – and consistent with minimum-mean-squared-error (MMSE) estimation theory – to take the form of the noise-free observation model as the basis of the estimation model. For the general observation model of Eq. (2.3), the estimation model for signal 𝑥𝑝 becomes

\hat{x}_p[n] = \sum_{m=1}^{M} a_{pp}^{m}\, x_p[n-m] + \Bigg(\sum_{\substack{q=1 \\ q \neq p}}^{N_s} \sum_{m=1}^{M} a_{pq}^{m}\, x_q[n-m]\Bigg) \;\doteq\; \boldsymbol{a}_p^{T}\boldsymbol{x}[n], \qquad (2.8)

where 𝑀 is the model order. It is to be observed that the “∗” superscripts do not appear on the notation for the parameter estimates. This is a deliberate effort to distinguish a “true” coefficient in the observation model, say 𝑎𝑝𝑞𝑚∗, from the symbolic representation of the corresponding parameter to be determined in the estimation model. It will be our custom to refer to the estimation model of (2.8) as the estimated model when we wish to stress that the parameters have taken values determined by an optimization procedure over observed data [161]. Note that 𝑀 itself is a parameter of the model, which must also be predetermined. While theoretically any model with 𝑀 ≥ ℳ could perfectly represent the observation model, the parameter estimators become less accurate as 𝑀 increases. An example of the distributional characteristics of the parameter estimates will be given in Sec. 2.2.3 for jointly normally distributed signals.
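The stacked notation of Eqs. (2.4) and (2.8) maps directly onto code. The sketch below (Python; the helper names are ad hoc and the data are random, purely for illustration) builds the regressor vector 𝒙[𝑛] by stacking 𝑀 lags of every signal, signal by signal, and evaluates the estimation model output 𝒂𝑝ᵀ𝒙[𝑛]; note that no disturbance term appears.

    import numpy as np

    def stacked_regressor(signals, n, M):
        """Regressor vector x[n] of Eq. (2.4): M past values of each signal, stacked signal by signal."""
        return np.concatenate([[sig[n - m] for m in range(1, M + 1)] for sig in signals])

    def estimation_model_output(a_p, signals, n, M):
        """Estimation model of Eq. (2.8): x_hat_p[n] = a_p^T x[n]."""
        return a_p @ stacked_regressor(signals, n, M)

    # Example with N_s = 2 signals and model order M = 3 (random data for illustration).
    rng = np.random.default_rng(0)
    x1, x2 = rng.standard_normal(100), rng.standard_normal(100)
    a_p = rng.standard_normal(2 * 3)                 # dim(a_p) = N_s * M
    print(estimation_model_output(a_p, [x1, x2], n=50, M=3))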
Many methods of comparing models with different M values exist, such as the Akaike Information Criterion (AIC) [5], Final Prediction Error (FPE) [4], Minimum Description Length (MDL) [168], the Bayesian Information Criterion (BIC) [174], and other hybrid methods [62].

It is further noteworthy that, whereas the observation model is AR or ARX in the signal x_p – that is, it is recursive in the signal x_p – the estimation model is purely “feedforward” in producing an output as a linear combination of past values of x_p, and of some subset of the remaining N_s − 1 signals, at time n. Such a model does not correspond to any conventional (Box–Jenkins-type) time-series model, but, in the parlance of signal processing, corresponds to a MISO discrete-time system [32]. Note also the absence of any noise process in the estimated model.

Associated with an estimated model for signal x_p is an error sequence, say \epsilon_p, with value at time n given by

\epsilon_p[n] \doteq x_p[n] - \hat{x}_p[n].        (2.9)

By subtracting Eq. (2.8) from Eq. (2.3), we see that this error contains components due to inaccuracies in the estimated coefficients, as well as the disturbance sequence \eta_p,

\epsilon_p[n] = (\boldsymbol{a}_p^* - \boldsymbol{a}_p)^T \boldsymbol{x}[n] + \eta_p[n].        (2.10)

A slight abuse of notation is used here, where \boldsymbol{a}_p^* and \boldsymbol{a}_p are zero-padded to account for the missing elements (when M ≠ L) and \boldsymbol{x}[n] is similarly adjusted to account for any missing elements. For example, suppose M > L; then \boldsymbol{a}_p^* is padded with M − L zeros in the locations that correspond to x_q[n − L − 1], …, x_q[n − M] for all q ∈ {1, …, N_s}.

When the parameters are correctly identified in the estimation model, so that \boldsymbol{a}_p = \boldsymbol{a}_p^*, the estimation error is equivalent to the white-noise disturbance of the observation model at each n, \epsilon_p[n] = \eta_p[n]. This is known to be the case for the MMSE estimate of the parameters of such a linear model [153], assuming that the model order of the estimated model is greater than or equal to that of the observation model. The LSE solution asymptotically approaches the MMSE solution as the number of observations increases.

In practice, of course, the parameter estimates \boldsymbol{a}_p must be determined from finite data records of the signals {x_j}_{j=1}^{N_s}. Without loss of generality, we may assume that each of the signals is observed on the time indices n = 1, 2, …, N − 1, that observation x_p[N] is additionally available, and that the parameters are sought with which to model the signal x_p on the interval n = 1, …, N. Let \boldsymbol{a}_p[N] denote the vector of parameter estimates obtained on this interval, and let {\epsilon_p(n | N)}_{n=1}^{N} be the corresponding error sequence associated with the estimated model with parameters \boldsymbol{a}_p[N]. The assumption of “small errors” (i.e., \sigma_{\eta_p}^2 ≪ \sigma_{\hat{x}_p}^2) is often used to justify the use of LSE estimation of the parameters on the finite interval. In fact, in the present context, the lack of correlation in the sequence \eta_p[n] leads to an unbiased LSE estimate, \boldsymbol{a}_p[N], for finite N, and asymptotic convergence in mean square to \boldsymbol{a}_p^*.
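The decomposition of the error sequence in Eq. (2.10) can be checked numerically. The sketch below regenerates the illustrative two-signal record from the previous sketch (same assumed coefficients); with the true parameters the error sequence equals the disturbance exactly, while with an LSE fit on a finite record it also carries the parameter-misadjustment term, so its sample variance is close to, but not identical with, \sigma_{\eta}^2.

import numpy as np

rng = np.random.default_rng(2)

# Regenerate the illustrative two-signal observation model (assumed coefficients).
N = 2000
a_true = np.array([0.5, 0.4])            # [a11, a12] in the x1 equation, L = 1
x1 = np.zeros(N)
x2 = np.zeros(N)
for n in range(1, N):
    x1[n] = 0.5 * x1[n-1] + 0.4 * x2[n-1] + rng.normal()
    x2[n] = 0.7 * x2[n-1] + rng.normal()

# Regressor vector x[n] = (x1[n-1], x2[n-1]) and target x1[n].
X = np.column_stack([x1[:-1], x2[:-1]])
y = x1[1:]

# Error sequence of Eq. (2.9) using the *true* parameters: equals eta exactly.
eps_true = y - X @ a_true

# Error sequence using LSE-estimated parameters on the finite record (Eq. 2.10):
# it mixes the disturbance with the parameter-misadjustment term.
a_lse, *_ = np.linalg.lstsq(X, y, rcond=None)
eps_lse = y - X @ a_lse

print("LSE estimate:", a_lse)                          # close to [0.5, 0.4] for large N
print("var(eps), true vs LSE:", eps_true.var(), eps_lse.var())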
The observations on the given time range comprise a set of N equations in M_{\boldsymbol{a}} \doteq \dim\{\boldsymbol{a}_p\} unknown parameters (maximally M_{\boldsymbol{a}} = M N_s), which may be written in vector-matrix form as

\hat{\boldsymbol{x}}_p[N] = \boldsymbol{X}[N]\, \boldsymbol{a}_p[N],        (2.11)

where \hat{\boldsymbol{x}}_p[N] \doteq [\, \hat{x}_p[1] \;\; \hat{x}_p[2] \;\; \cdots \;\; \hat{x}_p[N] \,]^T \in R^N, \boldsymbol{a}_p[N] \doteq [\, a_{p1}^{1}[N] \; \cdots \; a_{p1}^{M}[N] \;\; \cdots \;\; a_{pN_s}^{1}[N] \; \cdots \; a_{pN_s}^{M}[N] \,]^T \in R^{M_{\boldsymbol{a}}}, and \boldsymbol{X}[N] \in R^{N \times M_{\boldsymbol{a}}} is the regressor matrix whose n-th row is \boldsymbol{x}^T[n], that is, the row [\, x_1[n-1] \; \cdots \; x_1[n-M] \;\; \cdots \;\; x_{N_s}[n-1] \; \cdots \; x_{N_s}[n-M] \,]. In these terms, the LSE estimate is the solution to the normal equations

\boldsymbol{X}^T[N]\, \boldsymbol{X}[N]\, \boldsymbol{a}_p[N] = \boldsymbol{X}^T[N]\, \boldsymbol{x}_p[N],        (2.13)

where \boldsymbol{x}_p[N] \doteq [\, x_p[1] \;\; \cdots \;\; x_p[N] \,]^T is the corresponding vector of observed samples of x_p.

The error sequence may be added to the estimated model for signal x_p to create a model that exactly produces the original signal:

x_p[n] = \boldsymbol{a}_p^T \boldsymbol{x}[n] + \epsilon_p[n],        (2.14)

or, if we wish to emphasize the short-term temporal nature of the estimated parameters in the model,

x_p[n] = \boldsymbol{a}_p^T[N]\, \boldsymbol{x}[n] + \epsilon_p(n \mid N).        (2.15)

Although this model theoretically produces the exact signal x_p over the interval n = 1, …, N, it is generally very different from the observation model of Eq. (2.3). We refer to Eq. (2.15) as the error-augmented estimated model. As noted near Eq. (2.10), the estimation error sequence \epsilon_p is dependent upon the misadjustment in the parameter values relative to the presumed true values of the observation model, \boldsymbol{a}_p^* − \boldsymbol{a}_p[N], as well as the disturbance sequence in the observation model, \eta_p. Not discussed above is the fact that the error sequence is also dependent upon the short-term estimation of the parameters (i.e., the duration N). The error sequence is therefore a key indicator of the quality of the model, and we will see this sequence play an important role in causality analysis.

2.2.3 Least squares estimation

The error sequence accounts for both the disturbance sequence and errors in parameter estimation [Eq. (2.10)]. Under the assumption of “small errors,” minimizing the error sequence therefore approximately minimizes the parameter estimation error. The well-known solution to the normal equations, Eq. (2.13), is given by [74]

\boldsymbol{a}_p = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{x}_p,        (2.16)

in which (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T is the pseudoinverse of \boldsymbol{X}. Assuming that the disturbance is an i.i.d. zero-mean Gaussian random process with variance \sigma_{\eta}^2, that the regressors are bounded, that the covariance matrix of the regressors \boldsymbol{\Sigma}_{\boldsymbol{X}} exists and is nonsingular, and that the observation model is BIBO¹ stable, the solution is distributed as

\boldsymbol{a}_p \sim \mathcal{N}_{M_{\boldsymbol{a}}}\!\left(\boldsymbol{a}_p^*,\; \sigma_{\eta}^2\, \boldsymbol{\Sigma}_{\boldsymbol{X}}^{-1} / (N - M_{\boldsymbol{a}})\right),        (2.17)

where \mathcal{N}_{M_{\boldsymbol{a}}}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) is a multivariate normal distribution of dimension M_{\boldsymbol{a}} with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}, \boldsymbol{\Sigma}_{\boldsymbol{X}}^{-1} is the inverse of the covariance matrix of the regressors, and N is the number of time samples.

¹ Bounded-input–bounded-output (BIBO) stability is a form of system stability linking the output of a system to its inputs. A discrete-time signal x[n] is called bounded if there exists B > 0 ∈ R such that |x[n]| < B for every n ∈ Z. A system is called BIBO stable if and only if, given any bounded input, the output is also guaranteed to be bounded [151].

For sets of regressors with ill-conditioned covariance matrices, the variance of \boldsymbol{a}_p can be very large. As NC depends directly on the accuracy of the model parameters, it is prone to misleading results for small N (compared to the largest element of the matrix \sigma_{\eta}^2 \boldsymbol{\Sigma}_{\boldsymbol{X}}^{-1}). If the i.i.d. Gaussian assumption is not satisfied, ill-conditioned regressor matrices will still cause the parameter estimates to have potentially large variance, although the parameters may not be normally distributed.
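The finite-record LSE solution and the role of the regressor covariance matrix can be illustrated directly. The sketch below builds the regressor matrix for the same assumed two-signal record used above, solves the normal equations of Eq. (2.13), and reports the condition number of the sample covariance of the regressors, since a large condition number is the warning sign for the ill-posedness discussed here.

import numpy as np

rng = np.random.default_rng(3)

# Two-signal record with the same assumed coefficients as the earlier sketches.
N = 500
x1 = np.zeros(N)
x2 = np.zeros(N)
for n in range(1, N):
    x1[n] = 0.5 * x1[n-1] + 0.4 * x2[n-1] + rng.normal()
    x2[n] = 0.7 * x2[n-1] + rng.normal()

# Regressor matrix X[N] (one row per time index) and observed vector x_p[N],
# here for p = 1 and model order M = 1.
X = np.column_stack([x1[:-1], x2[:-1]])
xp = x1[1:]

# LSE estimate via the normal equations, Eq. (2.13)/(2.16).
a_p = np.linalg.solve(X.T @ X, X.T @ xp)

# The conditioning of the regressor covariance matrix governs the variance of
# the estimates (Eq. 2.17); a large condition number flags an ill-posed problem.
Sigma_X = np.cov(X, rowvar=False)
print("estimated parameters:", a_p)
print("condition number of Sigma_X:", np.linalg.cond(Sigma_X))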
Special attention must be taken in the case of NC to assure that the covariance matrix is well conditioned, or regularization must be applied to reduce errors in the estimation of the NC measure.

2.2.4 ARMAX models

The most comprehensive way to represent LTI models is the ARMAX representation. This representation encompasses AR and MA models, models with exogenous inputs, and any combination thereof. Starting with the error-augmented model of Eq. (2.14), expanded to highlight the AR, MA, and exogenous-input terms,

x_p[n] = \underbrace{\sum_{m=1}^{M_{\mathrm{AR}}} a_{pp}^{m}\, x_p[n-m]}_{\text{autoregressive}} + \underbrace{\sum_{\substack{q=1 \\ q \neq p}}^{N_s} \sum_{m=1}^{M_{\mathrm{X}}} a_{pq}^{m}\, x_q[n-m]}_{\text{exogenous input}} + \underbrace{\sum_{m=1}^{M_{\mathrm{MA}}} a_{p\epsilon}^{m}\, \epsilon_p[n-m]}_{\text{moving average}} + \epsilon_p[n] = \boldsymbol{a}_p^T \boldsymbol{x}[n] + \epsilon_p[n],        (2.18)

where the first three sums together constitute \hat{x}_p[n], \epsilon_p is the error sequence, M_AR is the model order for the AR term, M_MA is the model order for the MA term, and M_X is the model order for the exogenous-input term. Note that the model orders are generally unknown (unless predicated on a priori knowledge of the system being modeled) and must be estimated prior to the parameter estimation; additionally, \epsilon_p must itself be estimated. The Box–Jenkins method [32] is the standard approach to iteratively identify ARMAX model structures.

2.2.5 ARX models with non-white error sequences

Digressing momentarily into ARMAX modeling, note that the ARMAX models of Eq. (2.18) can be expressed as the sum of an ARX model and a colored noise term,

x_p[n] = \underbrace{\sum_{m=1}^{M_{\mathrm{AR}}} a_{pp}^{m}\, x_p[n-m] + \sum_{\substack{q=1 \\ q \neq p}}^{N_s} \sum_{m=1}^{M_{\mathrm{X}}} a_{pq}^{m}\, x_q[n-m]}_{\text{ARX model}} + \underbrace{\epsilon_p'[n]}_{\text{colored error sequence}},        (2.19)

where

\epsilon_p'[n] = \sum_{m=1}^{M_{\mathrm{MA}}} a_{p\epsilon}^{m}\, \epsilon_p[n-m] + \epsilon_p[n],        (2.20)

such that

\max_{n \in [1,N]} |\epsilon_p'[n]| \le \left(1 + \sum_{m=1}^{M_{\mathrm{MA}}} |a_{p\epsilon}^{m}|\right) \max_{n \in [1,N]} |\epsilon_p[n]|,        (2.21)

which shows that if \epsilon_p is bounded, \epsilon_p' will also be bounded. These characteristics will be exploited in Sec. 2.4.

2.2.6 NARMAX and modified NARX models

The LTIiP class of models extends traditional LTI models by allowing nonlinear transformations of the model inputs and past outputs, while permitting the use of many of the classical modeling, prediction, and estimation techniques with well-understood and well-tested convergence characteristics. LTIiP models have been shown to be a viable alternative to highly nonlinear-in-parameters models [41], with excellent results in many applications, from epidemiology [165] and microbial growth [210] to human physiology [116].
The most comprehensive representation of LTIiP models is the nonlinear ARMAX (NARMAX) [39, 122], which are expressed as 𝐾 𝑥𝑝 [𝑛] = ∑ 𝑎𝑝𝑘 𝜑𝑝𝑘 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 , … , 𝑥𝑁𝑠 ||𝑛−𝑀 , 𝜖𝑝 ||𝑛−𝑀 ) + 𝜖𝑝 [𝑛] 𝑛−1 𝑛−1 𝑛−1 𝑛−1 𝑘=1 ≐ 𝒂𝑝𝑇 𝝋𝑝 [𝑛] + 𝜖𝑝 [𝑛] (2.22) where 𝐾 is the number of regressor functions, 𝜑𝑞𝑝 is the 𝑞 th regressor function, 𝑥𝑟 ||𝑛−𝑀 represents 𝑛−1 the set of all available samples of signal 𝑥𝑟 from time 𝑛 − 𝑀 until time 𝑛 − 1 and 𝑎𝑝𝑘 is the parameter weight associated with 𝜑𝑝𝑘 . Here, the argument of the 𝜑𝑞𝑝 is included to reinforce the fact that the regressor functions may depend on any combination of the regressor signals (including the error). Common regressor function families include radial basis functions [40], wavelets [26], and polynomials [6, 8, 24, 82]. In [202], Wei uses the linear in parameters nonlinear in variables (LIP-NIV) terminology to describe NARMAX models. However, this implies that the models are inherently nonlinear 21 in variables, which would exclude ARMAX models from the category. Instead, this work will maintain the usage of the LTIiP terminology to highlight that traditional LTI models are a subset of NARMAX models. The modeling power of NARMAX models comes at the cost increased complexity in estimating parameters. Due to the large number of highly correlated regressors, slow convergence, overfit- ting and inaccurate parameter estimates are common challenges faced when estimating model parameters [11]. The estimation of parameters that depend on past values of the error sequence in linear ARMAX models (MA portion) is considerably more complex than for the parameters associated autoregressive and exogenous inputs portions of the model. While there are methods for estimating MA parameters [63, 204], and iterative approaches exist for NARMAX models, many approaches focus on NARX models [26, 40]. Additionally, the interpretability of terms that depend on the error sequence have lower interpretability and are often not included in final predictive model [200], as these noise terms are not useful for model prediction but are only used to reduce bias in model estimation [200, 202]. A small modification to NARMAX models simplify parameter estimation is 𝐾 𝐾𝜖 𝑥𝑝 [𝑛] = ∑ 𝑎𝑝𝑘 𝜑𝑝𝑘 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 , … , 𝑥𝑁𝑠 ||𝑛−𝑀 ) + ∑ 𝑏𝑝𝑘 𝜙𝑝𝑘 (𝜖𝑝 ||𝑛−𝑀 ) + 𝜖𝑝 [𝑛] 𝑛−1 𝑛−1 𝑛−1 𝑛−1 𝑘=1 𝑘=1 ≐ 𝒂𝑝𝑇 𝝋𝑝 [𝑛] + 𝜖𝑝′ [𝑛] (2.23) where 𝐾𝜖 𝜖𝑝′ [𝑛] = ∑ 𝑏𝑝𝑘 𝜙𝑝𝑘 (𝜖𝑝 ||𝑛−𝑀 ) + 𝜖𝑝 [𝑛], 𝑛−1 (2.24) 𝑘=1 so that regressor functions may depend on either the regressor signals or past values of the error sequence. This restriction to NARMAX models is equivalent to NARX models with colored noise. 2.2.7 LASSO regression The least absolute shrinkage and selection operator (LASSO) [193] is an extension to traditional least squares estimation, in which an 𝑙1 norm regularization is employed to encourage sparsity in 22 the parameters. LASSO regression is equivalent to finding a parameter vector 𝒂 that satisfies { } arg min ‖𝑥 − 𝑥(𝒂)‖ ̂ 2 2 + 𝜆‖𝒂‖ 1 , (2.25) 𝒂∈R𝑀𝑎 in which 𝑥 the signal being modeled and 𝑥(𝒂) ̂ is the prediction of 𝑥 based on the parameter vector 𝒂, and 𝜆 a the regularization factor. Unlike the 𝑙0 norm, the 𝑙1 norm allows the use of efficient gradient-based optimization tech- niques [70], while being more effective at encouraging sparsity in the parameter space than, for example, Tikhonov (𝑙2 norm) regularization. 2.3 Set-membership optimum bounded ellipsoid algorithms All parameter estimation strategies share a similar goal: finding the optimum parameter estimates given a limited amount of data. 
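Before turning to the set-membership approach, the following sketch ties together the two preceding subsections: a small polynomial regressor dictionary (a stand-in for a NARX regressor set) is fit with the LASSO criterion of Eq. (2.25), using scikit-learn as one convenient implementation. The toy nonlinear observation model and all coefficient values are assumptions of this example.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)

# Toy nonlinear observation model (assumed for illustration):
# x1 depends on its own lag and on the square of a lag of x2.
N = 1000
x1 = np.zeros(N)
x2 = rng.normal(size=N)
for n in range(1, N):
    x1[n] = 0.5 * x1[n-1] + 0.3 * x2[n-1]**2 + 0.1 * rng.normal()

# Candidate regressor dictionary: all polynomial terms up to degree 3 in the
# lagged signals.
lags = np.column_stack([x1[:-1], x2[:-1]])
Phi = PolynomialFeatures(degree=3, include_bias=False).fit_transform(lags)
y = x1[1:]

# LASSO (Eq. 2.25): the l1 penalty drives most of the candidate weights to zero.
model = Lasso(alpha=0.01, max_iter=50_000).fit(Phi, y)
print("number of candidate regressors:", Phi.shape[1])
print("number of nonzero parameters:  ", np.count_nonzero(model.coef_))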
The optimality criterion differs between algorithms, for example, the smallest prediction error for LSE or a compromise between prediction error and sparsity of parameters [Eq. (2.25)] for LASSO. Set-membership estimation approaches aim at providing the set of parameters that are consistent with the observed data and the model. Starting with a putative NARMAX observation model of the form 𝐾 𝑥[𝑛] = ∑ 𝑎𝑝𝑘 𝜑𝑝𝑘 |𝑛−1 |𝑛−1 |𝑛−1 (𝑥1 |𝑛−𝑀 , 𝑥2 |𝑛−𝑀 , … , 𝑥𝑁𝑠 |𝑛−𝑀 , 𝜂𝑝 𝑛−𝑀 ) + 𝜖𝑝 [𝑛] ∗ ∗ 𝑛−1 𝑘=1 (2.26) (𝒂𝑝∗ )𝑇 𝝋𝑝∗ + 𝜂[𝑛], where 𝐾 is the number of regressor functions and 𝑀 is the model order for which there exists a sequence of positive numbers 𝛾 [𝑛], such that |𝜂[𝑛]|2 < 𝛾 [𝑛]. (2.27) For a estimation model of the form ̂ 𝑥[𝑛] = 𝒂𝑝𝑇 𝝋𝑝 [𝑛] + 𝜖[𝑛], (2.28) the sequence 𝛾 [𝑛] imposes the constraint at each time 𝑛, | | |𝑥[𝑛] − 𝒂𝑝𝑇 𝝋𝑝 [𝑛]| < 𝛾 [𝑛], | | (2.29) 23 or, equivalently, 𝒂𝑝𝑇 𝝋𝑝 [𝑛] < 𝑥[𝑛] + 𝛾 [𝑛] 𝒂𝑝𝑇 𝝋𝑝 [𝑛] > 𝑥[𝑛] − 𝛾 [𝑛] (2.30) which define a hyperstrip (region between the two parallel hyperplanes) in which the set of valid parameters - known as feasibility set - must lie. The intersection of any set of 𝐾 or more hyperstrips defined by linearly independent observations forms a convex polytope of dimension 𝐾 . If at time 𝑛, the polytope defined by the intersection of all previous hyperstrips is not fully contained within the hyperstrip defined by Eq. (2.29), the feasibility set is refined. This is akin to faceting a gem, where each new refinement potentially adds up to two flat facets to the polytope.2 Although the polytope defined by the feasibility set has finite dimension, there is no limit to the number of facets. The evaluation of the intersection of hyperstrips becomes increasingly complex as the number of considered time samples increases. Optimum bounded ellipsoid algorithms provide a computationally efficient approximation to the polytope by evaluating a hyperellipsoid that bounds the polytope [52]. Compared with the polytope, the unfaceted nature of the hyper- ellipsoid is more akin to a cabochon (a polished unfaceted gem). A geometric illustration for a bidimensional parameter space is shown in Fig. 2.1. In Fig. 2.1, the 𝑥-axis represents the value of 𝜃1 , the 𝑦-axis represents the value of 𝜃2 . 𝜔2 is strip defined by 𝝋𝑝 [2] and 𝑥[2], likewise, 𝜔3 is strip defined by 𝝋𝑝 [3] and 𝑥[3]. Ω3 is the intersection between 𝜔2 and 𝜔3 and Θ3 is an ellipsoid that bounds Ω3 . Note that Θ3 and Ω3 are both completely contained within the strip 𝜔4 , so no refinement occurs at time 𝑛 = 4. The ability to reject samples that do not reduce the feasibility set is a significant advantage of OBE algorithms. While the recursion for OBE algorithms is very similar to a weighted recursive least squares (WRLS), thus  (𝐾 2 ) complexity per time sample processed, typically only a small fraction of samples provides refinement to the ellipsoid [54]. Despite the increased computational efficiency in comparison to WRLS, OBE algorithms produced guaranteed bounds for the feasibility 2 Any facets completely located outside the hyperstrip are removed from the polytope, so the number of facets does not necessarily monotonically increase, even though the volume of the polytope monotonically decreases with every refinement. 24 ω3 ω2 θ2 Θ3 θ(3) ω4 · θ1 Ω3 = ω2 ∩ ω3 Figure 2.1: Geometric illustration of OBE algorithms set. The ellipsoid can be succinctly defined by the centroid and a matrix containing the principal axes of the ellipsoid. 
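The key geometric test behind the sample-rejection property described above can be sketched in a few lines. The function below checks whether the current bounding ellipsoid lies entirely inside the new hyperstrip of Eq. (2.29); if it does, the observation carries no new information about the feasibility set and can be skipped. This is only a conceptual illustration of the innovation check, not the full OBE recursion (which is summarized in Alg. B.1 of the appendix), and all numeric values are assumptions of this example.

import numpy as np

def strip_contains_ellipsoid(theta_c, P, phi, x_n, gamma_n):
    # Ellipsoid: {theta : (theta - theta_c)^T P^{-1} (theta - theta_c) <= 1}.
    # Its half-width along the direction phi is sqrt(phi^T P phi), so the
    # ellipsoid lies inside the strip |x[n] - theta^T phi| < gamma[n] iff the
    # centroid residual plus that half-width stays below gamma[n].
    half_width = np.sqrt(phi @ P @ phi)
    return abs(x_n - theta_c @ phi) + half_width < gamma_n

# Illustrative two-parameter example.
theta_c = np.array([0.4, 0.1])           # current ellipsoid centroid
P = np.diag([0.05, 0.02])                # current ellipsoid shape matrix
phi = np.array([1.2, -0.7])              # regressor vector at time n
gamma_n = 1.0                            # noise bound at time n

for x_n in (0.5, 2.0):
    if strip_contains_ellipsoid(theta_c, P, phi, x_n, gamma_n):
        print(f"x[n]={x_n}: no refinement; the sample can be skipped")
    else:
        print(f"x[n]={x_n}: ellipsoid not contained; an OBE update would refine the set")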
In the case of OBE algortithms, the feasibility set at time 𝑛 is defined as { } 𝑪 Θ≐ 𝜽 ∈R 𝐾 (𝜽 − 𝜽𝒄 ) (𝜽 − 𝜽𝒄 ) < 1 , 𝑇 𝜅 : (2.31) where 𝜽𝒄 is the centroid of the ellipsoid, 𝑪 is the sample covariance matrix of 𝝋𝑝 (thus a positive 𝑪 semidefinite matrix), and 𝜅 is a positive scalar, such that define the principal axes of the ellipsoid. 𝜅 Another advantage of SM estimation is that it requires fewer assumptions about the distri- butional characteristics of the noise term. The only requirement for the employment of SM algorithms is that the noise be bounded over the observed sequence. Methods for estimating the bounds have been developed with proven conditions under which convergence is guaranteed [125]. The recursion steps for OBE algorithms can be found in Alg. B.1 of the appendix with a short overview on the difference between variants and enhancements to the algorithm. 25 2.4 NARMAX model estimation and the EvolOBE method While NARMAX models are often able to represent many complex interactions with few terms, the parameters associated with such terms must still be estimated. As the number of regressor functions increases, parameter estimation is very likely to become an ill-conditioned problem. Thus, traditional regression methods such as LSE do not generally produce good results when coupled with nonlinear models with a large number of regressor functions. The number of candidate regressor functions often is very large. For example, for polynomial regressor functions, the number of such functions grows factorially with the polynomial order. An exhaustive search of all possible subsets is computationally prohibitive for most practical applications. Thus, finding optimal subsets of the regressor functions becomes fundamental to properly estimate models. This section contains an overview of a family of NARMAX model estimation algorithms that is particularly suited for causality analysis. The poor conditioning of a large set of regressor functions is in large part due to the fact that many regressor functions will be highly correlated with one another (e.g., 𝑥 and 𝑥 3 have a correlation coefficient of 0.77 for 𝑥 normally distributed with zero mean and unity variance). Additionally, many commonly used sets of regressor functions form overcomplete systems, which creates null spaces in the regressor space. Many techniques have been developed specifically for nonlinear model selection and parameter estimation [22, 25, 27, 81, 118, 201, 203, 214]. These tend to fall within three categories: stepwise search algorithms, bridge regression and evolutionary search. Stepwise search algorithms iteratively add or remove candidate regressor functions from the model until a criterion is reached. Since the number of possible “paths” grows factorially with the number of regressor functions, most employ greedy approaches, where the regressor which most reduces the prediction error of the NARMAX model is chosen and/or the prune the regressor functions which least increase the prediction error when removed. Matching pursuit [131], Forward-Regression Orthogonal Least Squares (FROLS) [25] and Least Angle Regression (LARS) [65] are prominent examples. Stepwise approaches suffer from shortcomings in practice. 26 Particularly, autoregressive terms are typically included first in the search, especially for systems with dynamics well below the sampling frequencies [23]. This is true regardless of how important those terms are in the final model. 
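To make the greedy step concrete, the following is a minimal sketch of stepwise forward selection: at each iteration, the candidate regressor that most reduces the residual sum of squares of a least-squares refit is added. This is in the spirit of FROLS but omits the orthogonalization and error-reduction-ratio bookkeeping of the full algorithm; the toy data are assumptions of this example.

import numpy as np

def greedy_forward_selection(Phi, y, n_terms):
    # Greedily add the candidate column that most reduces the residual sum of
    # squares when the model is refit by ordinary least squares.
    selected = []
    remaining = list(range(Phi.shape[1]))
    for _ in range(n_terms):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(Phi[:, cols], y, rcond=None)
            rss = np.sum((y - Phi[:, cols] @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy data: y depends on columns 0 and 3 of a random candidate dictionary.
rng = np.random.default_rng(5)
Phi = rng.normal(size=(400, 8))
y = 1.0 * Phi[:, 0] - 0.5 * Phi[:, 3] + 0.1 * rng.normal(size=400)
print("selected regressors:", greedy_forward_selection(Phi, y, n_terms=2))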
Once the initial autoregressive terms are selected, the remaining prediction error is often small enough that the choice of regressors is sensitive to noise in the data [157]. Bridge regression methods [67] add a penalty to the cost function proportional to the 𝓁𝜌 -norm of the parameters3 . Bridge methods can be used independently or combined with stepwise methods. Ridge regression [90] (also known as Tikhonov regularization [194]) uses 𝓁2 -norm and possesses closed-form solution and can improve conditioning in ill-posed problems, but do not generate sparse solutions. LASSO regression [193] (also know as basis pursuit [42]) use the 𝓁1 -norm and are effective ways of finding sparse solutions. However, existing model structure and parameter estimation methods suffer (to differing degrees) from slow or inaccurate convergence of the parameters [11], high computational cost [138] and often produce inaccurate model structures [201]. Evolutionary search is well suited for regressor selection with many examples in the literature [115, 121, 163, 187]. While more computationally expensive than bridge or step-wise methods, it is able to find global optima within the search space at an acceptable computational cost (sometimes even comparable to gradient-based approaches [139]). The EvolOBE method [210–217] differs from these approaches by combining the evolutionary search with set-theoretic OBE algorithms. The OBE class of parameter estimation algorithms possesses several desirable characteristics that make it particularly suited for the problem of estimating parameters for models of the form given by Eq. (2.23), for example, no necessity to make assumptions about the stationarity and distributional characteristics of the noise, and efficient computation of parameters. Earlier variants of the algorithm used more traditional methods of evaluating model fitness, such as AIC and FPE, but later variants use a bi-objective evolutionary search [149, 217] that produces a set of models with the best compromise between predictive power and complexity. 3𝜌 is most often set such that 0 ≤ 𝜌 ≤ 2 [70] 27 This obviates the choice of hyperparameters or assumptions to regulate the trade-off between the two objectives and allows a wider search and greater population diversity [195] as solutions that have high fitness for different objectives can more easily coexist and coevolve. 2.4.1 Genetic encoding and algorithm overview In the EvolOBE method, models are treated as chromosomes. The LTIiP model is the phenotype of a chromosome, a binary sequence in which each bit indicates the presence or absence of a particular gene. Each gene codes for a particular regressor function in the model. The algorithm starts with a random population of chromosomes. The parameter sets result from the set-membership processing of the data and the genetic makeup of each chromosome. Unlike other estimation methods, the set-membership algorithms provide sets of feasible parameter vectors rather than a single point estimate. Measurable set properties are then used to assign fitness values to each chromosome, and the fitness value is used in the genetic algorithm selection process to evolve the population toward better solutions (e.g. [167]). This framework simultaneously addresses selection of the model structure and the parameter estimation. 
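The genetic encoding just described can be illustrated with a short sketch: a binary chromosome is decoded into a regressor subset and scored on the two objectives of the bi-objective search, prediction error and model size. For brevity the parameters here are fit by ordinary least squares; in EvolOBE they come from the set-membership (OBE) processing instead, and the candidate dictionary and population below are assumptions of this example.

import numpy as np

def evaluate_chromosome(genome, Phi, y):
    # Genes set to 1 select the corresponding regressor functions.
    active = np.flatnonzero(genome)
    if active.size == 0:
        return np.sum(y ** 2), 0              # empty model: error of predicting zero
    coef, *_ = np.linalg.lstsq(Phi[:, active], y, rcond=None)
    mse = np.mean((y - Phi[:, active] @ coef) ** 2)
    return mse, active.size                   # (objective 1: error, objective 2: size)

# Small illustration with a random candidate dictionary and a random population.
rng = np.random.default_rng(6)
Phi = rng.normal(size=(300, 10))
y = 0.8 * Phi[:, 2] + 0.4 * Phi[:, 7] + 0.05 * rng.normal(size=300)
population = rng.integers(0, 2, size=(6, Phi.shape[1]))   # 6 random chromosomes
for genome in population:
    print(genome, evaluate_chromosome(genome, Phi, y))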
To reduce the computational complexity of this process, the search space of regressor models must be controlled, and the candidate and final models must use the fewest regressors that are consistent with an objective of prediction-error minimization, Since these objectives are conflicting, a multi-objective optimization approach is desired. For this work, the Non-dominated Sorting Genetic Algorithm - II (NSGA-II) [51] approach is adopted, since it generates set solutions (ideally the Pareto-front), providing the best solution for a given number of regressors and allowing the model with the best trade-off to be chosen. NSGA-II is a standard algorithm for solving multiobjective optimization problems. It requires a small number of parameters and is able to obtain solution sets with good spread. The basic NSGA-II algorithm is shown in Fig. 2.2. An initial random population of size 𝑁 is generated and evaluated according to the two objectives: prediction accuracy and number of regressors. The population is then sorted, the best half is selected as parents, which go through selection, 28 INITIALIZE EVALUATE FITNESS POPULATION YES TERMINATION STOP CRITERION? NO SORT POPULATION SELECT PARENTS CROSSOVER & MUTATION Figure 2.2: NSGA-II algorithm summary mutation and crossover to generate a new population of children. The parents and children of this generation become the parents of the following generation. The cycle is repeated until the termination criterion is reached. In the seminal EvolOBE paper [214], Yan et al. used binary tournament, bit-wise mutation and single-point crossover.Later variants of the EvolOBE algorithm use different mutation and cross-over algorithms tailored for discovery of sparse models which provide faster convergence [149]. The sorting of the population occurs at two tiers. First, the population is sorted by fronts, each front is formed by a set of solutions has higher optimality than all other members of the set (this is called being non-dominated). The population is sorted such that the members of the first front is placed higher in the set of solutions, followed by the subsequent fronts sequentially until the entire population has been sorted. Within a front, the population is sorted by the sum of the edge lengths of the cuboid formed by the two surrounding solutions within the front (in the bi-objective case), this is known as the crowding distance. The elements with larger crowding distance are placed higher within their respective fronts. As a consequence of how the crowding distance is computed, the edge solutions are always ranked higher, as they do only have a single solution surrounding them (infinite crowding distance). 29 2.5 Causality analysis Philosophers and scientists have vigorously debated the meaning of causality and no universally accepted definition exists. In [103], philosopher David Hume argued that the human mind is not able to fully assert true causality, only to observe events occurring in succession. Nevertheless, Hume proposes conditions for a relationship to be called causal. While, not universally accepted, this work will use Hume’s definition of causality, because it is testable and quantifiable. Like Box [31], our intent is not to find "true" models, but rather gain insight and understanding of the systems being studied. Nevertheless, a “true” model may be posed in some cases for theoretical analysis. The most widely known method of assessing causality strength is GC. 
It was first postulated by Norbert Wiener that if the inclusion of a regressor could improve the prediction of a regressand, then the relationship between the regressor and regressand could be assumed "causal" [206]. Granger used this idea to give a formal definition of causality and feedback in the context of AR models [76]. Granger Causality relies on Hume’s work [103], which focused on epistemological causality (focusing on what can can be learned and known), rather than ontological (how things are). Hume posed certain conditions under which causality can be ascertained. These conditions are discussed in Sec. 2.5.2 and connected with the definition of GC. While Granger himself distinguished GC from “true causality” [77], GC performs well in a number of applications, from econometrics [60, 87] to neurology [175]. While causality analysis often involves the use of predictive models, there is no guarantee that the predictive models internally represent the systems that they are modeling. This is closely related to the distinction between correlation and causation (known as the cum hoc ergo propter hoc fallacy). Although precedence (when coupled high correlation) may seem like a good indicator of causality, it also cannot be equated with causation (known as the post hoc fallacy). For example, many people brush their teeth before going to sleep; however, brushing teeth does not cause sleep. Granger himself has highlighted the distinction between “true” causality and GC [77]. Of particular interest is NC, which was developed to address limitations of GC in measuring 30 causal mechanisms and which has shown useful results in a number of applications [95, 96, 98– 100, 105, 112, 220]. It has been pointed out that GC measures causal effect rather than mechanism [19] and NC measures a fundamentally different (although related) quantity.4 New Causality is better suited as a complement for GC (and other causality measure tools) rather than a replacement. 2.5.1 Humean concept of causality Hume claims that the relationship between cause and effect cannot be established simply by reasoning, but instead requires an assumption of “uniformity of nature,” i.e., that certain natural laws and processes do not change overtime [102]. Although unprovable by means of observation alone, “uniformity of nature” serves as a first principle through which causation can be judged. While Hume believes that “nothing is more evident than that the human mind cannot form such an idea of two objects as to conceive any connection between them” [103, Sec. XIV], he studies causality within the context of what can be understood through experience. In [103, Sec. XV], Hume postulates the following set of rules by which to judge causes and effects (quoted verbatim here, other than use of modern spelling): 1. The cause and effect must be contiguous in space and time. 2. The cause must be prior to the effect. 3. There must be a constant union between the cause and effect. It is chiefly this quality that constitutes the relation. 4. The same cause always produces the same effect, and the same effect never arises but from the same cause. This principle we derive from experience, and is the source of most of our philosophical reasonings. For when by any clear experiment we have discovered the causes or effects of any phenomenon, we immediately extend our observation to every phenomenon of the same kind, 4 The claim is disputed by the authors of [99]. Nonetheless, the author tends to agree with [19]. 
31 without waiting for that constant repetition, from which the first idea of this relation is derived. 5. There is another principle, which hangs upon this, namely that where several different objects produce the same effect, it must be by means of some quality, which we discover to be common among them. For as like effects imply like causes, we must always ascribe the causation to the circumstance, wherein we discover the resemblance. 6. The following principle is founded on the same reason. The difference in the effects of two resembling objects must proceed from that particular, in which they differ. For as like causes always produce like effects, when in any instance we find our expectation to be disappointed, we must conclude that this irregularity proceeds from some difference in the causes. 7. When any object increases or diminishes with the increase or diminution of its cause, it is to be regarded as a compounded effect, derived from the union of the several different effects, which arise from the several different parts of the cause. The absence or presence of one part of the cause is here supposed to be always attended with the absence or presence of a proportionable part of the effect. This constant conjunction sufficiently proves, that the one part is the cause of the other. We must, however, beware not to draw such a conclusion from a few experiments. A certain degree of heat gives pleasure; if you diminish that heat, the pleasure diminishes; but it does not follow, that if you augment it beyond a certain degree, the pleasure will likewise augment, for we find that it degenerates into pain. 8. The eighth and last rule I shall take notice of is, that an object, which exists for any time in its full perfection without any effect, is not the sole cause of that effect, but requires to be assisted by some other principle, which may forward its influence and operation. For as like effects necessarily follow from like causes, 32 and in a contiguous time and place, their separation for a moment shows, that these causes are not complete ones. A discussion of the philosophical implications of “uniformity of nature” assumption lies outside the scope of this work. Here, systems will be assumed to vary slowly enough that a time-invariant model adequately represents the system dynamics over “short” periods of time in which analysis takes place. Similarly, item 7 implies some proportionality in the causal relationship, where an increase in the cause will proportionally affect the effect. Hume, however, does not exclude the possibility of nonlinearity in the relationship. One must not indiscriminately assume an affine relationship between cause and effect exists even if the observations (under a limited range) closely follow an affine relationship. Therefore, as discussed in Sec. 2.2.2, it is important to remember the distinction between models and the systems they represent. Additionally, Hume’s items 1 and 3 cannot be derived from samples of signals alone, but must be evaluated separately. Note that Hume’s concepts of contiguity and union in time and space are loosely defined. Even if internally to the systems, causes and effects might be contiguous, often these states are unobtainable. Additionally, discrete time data collected from a finite number of sensors implies these requirements will never be fully satisfied without additional assumptions (e.g., limited bandwidth). For the purposes of this work, it will be assumed that signals satisfy these requirements. 
Time-series data are unable to provide information regarding items 1 and 3, which must be evaluated using a priori information. Hume’s item 6 states that if two outcomes are different, then the causes must also be different. When some causes cannot be measured or estimated, the outcomes will also not be estimable. The error augmented and observation models [Eq. (2.1) and Eq. (2.14), respectively] account for this by including an unknown disturbance sequence. That is, even if the parameters of the observation model were to be known, discrepancies (however small) are still expected in the prediction. Nevertheless, a disturbance sequence with small variance suggests (but does not guarantee) that most of the “causes” are being accounted for. What remains for analysis are Hume’s items 2, 4 and 5. Item 4 states that if A causes B, then A 33 must co-occur with B. In the domain of continuous random variables, this is roughly equivalent to the concept of dependence (or correlation for linear models). Item 5 states that if A causes B, and C also causes B, there must be a common element between A and C. Uncovering such mechanisms is helpful when analyzing systems, but the existence of a common factor between A and C does not aid in the decision on whether A and/or C cause B. Finally item 2 requires event A to precede B in order to establish causality of B by A. Although this requirement is intuitive, careful examination is required to ascertain whether A truly precedes B. This is particularly evident in systems that exhibit predictable, periodic, or quasiperiodic behavior. Apparent “noncausal” behavior can be attributed to predictive learning. For instance, rooster crows do not cause the sun to rise, instead, roosters possess the ability to predict sunrise times due to its quasiperiodicity using an internal circadian clock [180] (and also using other cues such as light and even social rank [179]). Nonetheless, few would object to the statement that “the rooster crows just before the break of dawn.” 2.5.2 Granger causality By combining item 4 (correlation) and item 2 (precedence), GC assesses the causality strength using the relative increase in predictive power gained by including a second signal into an estimation model. This is done by comparing an estimated ARX model (joint model) over an estimated AR model (disjoint model) where the exogenous input is formed of past samples of the causing signal being studied.5 The increase in predictive power is used as evidence of causality.6 Suppose that stochastic signals 𝑥1 and 𝑥2 are sampled. It is then possible to create predictive models for 𝑥1 varying the presence or absence of 𝑥2 . A model that only uses past values of 𝑥1 can 5 When the current sample of this signal is used, the increase in predictive power is called instantaneous GC. Instantaneous GC violates precedence and therefore weakens the case for calling it “causality.” 6 To highlight the distinction between “true causality” and GC, some authors choose to use the Granger-cause (A Granger-causes B) jargon, however, keeping with Hume’s notion of “obtainable causality” and for brevity’s sake, this work will refrain from using the term, while acknowledging the distinction between epistemological and ontological causality. 34 𝑥1 [𝑛] = 𝜑1 (𝑥1 ||𝑛−𝑀 ) + 𝜖[𝑛], be written as 𝑛−1 = 𝜑1 (𝑥1 [𝑛 − 1], 𝑥1 [𝑛 − 2], 𝑥1 [𝑛 − 3], ⋯ , 𝑥1 [𝑛 − 𝑀]) + 𝜖[𝑛], (2.32) where 𝜑1 (𝑥1 ||𝑛−𝑀 ) is a function of past values of 𝑥1 from time 𝑛 − 𝑀 to time 𝑛 − 1 inclusive and 𝜖 is 𝑛−1 the error sequence. 
If 𝜑1 is a linear function, this predictive model reduces to an AR observation model. A second predictive model using past values of both 𝑥1 and 𝑥2 can be written as 𝑥1 [𝑛] = 𝜑2 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 ) + 𝜖 ′ [𝑛], 𝑛−1 𝑛−1 (2.33) where 𝜑2 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 ) is a function of past values of 𝑥1 and 𝑥2 from time 𝑛 − 𝑀 to time inclusive 𝑛−1 𝑛−1 𝑛 − 1, and 𝜖 ′ is the error sequence. If 𝜑2 is a linear function, this predictive model reduces to an ARX observation model, where 𝑥2 is the exogenous input. Note that both predictive models must have their topology and parameters estimated (in the case of parametric models). Although in many applications the signals being analyzed are of the same nature (e.g., two EEG channels, two stocks, etc) and minimally processed (e.g., filtering applied for removing volume conduction, line noise, EMG interference, etc), GC can analyzed distinct quantities like the effect of phase from one channel into amplitude of a second channel [141]. The GC value in the contrast represented by [Eqs. (2.32) and (2.33)] is defined as GC2→1 = ln(𝜎𝜖2 /𝜎𝜖2′ ), (2.34) where 𝜎𝜖2 is the sample variance of the error sequence of the estimated model where 𝑥2 is absent and 𝜎𝜖2′ is the sample variance of the error sequence of the model with 𝑥2 as exogenous input. Since, in general, one of the rational objectives of model estimation is minimizing the residual error, the inclusion of 𝑥2 in Eq. (2.33) assures that 𝜎𝜖2′ ≤ 𝜎𝜖2 and thus GC≥ 0. In order to evaluate the hypothesis of whether 𝑥2 causes 𝑥1 , a statistical significance test, such as an F-test [184], is conducted on the GC statistic. It is noteworthy that, in general, 𝜑1 (𝑥1 ||𝑛−𝑀 ) ≠ 𝜑2 (𝑥1 ||𝑛−𝑀 , 𝟎) unless 𝑥2 ||𝑛−𝑀 ≐ 𝟎; that is, the model 𝑛−1 𝑛−1 𝑛−1 estimation method employed for obtaining 𝜑1 and 𝜑2 will attempt to fit the data, so 𝜑1 will adapt 35 to the absence of 𝑥2 . If 𝑥2 ||𝑛−𝑀 can be predicted well by 𝑥1 ||𝑛−𝑀 , then 𝜎𝜖2 might not be significantly 𝑛−1 𝑛−1 larger than 𝜎𝜖2′ , even if the contribution of 𝑥2 to 𝜑2 is large [23, 157]. The simplicity of GC allows it to be easily applied to a wide range of problems with good results, e.g. [33, 69, 177]. However, since it is designed to measure causal effect, GC value does not fully consider the internal states of the underlying observation model, only the outputs of the model. Further, it has been claimed that GC values are difficult to compare across observation models, as GC values are not normalized and obtaining a threshold for statistical significance is not straightforward [95]. The use of two independently estimated models is vulnerable to resulting bias and larger variance [16, 43]. More recent methods have been developed to derive GC values from a single full regression using factorization of the spectral density matrix [16, 17, 59]. Nevertheless, conceptually, these methods still stem from the comparison of the predictive power of two models. Although authors have pointed out apparent limitations of GC, [79, 94, 95, 97, 100, 135, 188], GC is a well established methodology for analyzing causal relationships [33]. Additionally, for normally distributed signals, GC has been shown to be equivalent to TE (save by a scaling factor) [14], but can be evaluated reliably with fewer samples. Barrett and Barnett acknowledge in [19] that “GC is not a perfect measure for all stochastic time series: if the true process is not a straightforward multivariate autoregressive process with white-noise residuals, then it becomes only an approximate measure of causal influence. 
In each real-world scenario, discretion is required in deciding if confounds such as non-linearity and correlations in the noise are mild enough for the measure to remain applicable.” While TE is applicable to other models, other authors have also pointed out that causal effects and transferred information are distinct quantities [127].

2.5.3 Spectral Granger causality

Spectral GC is the frequency-domain decomposition of GC introduced by Geweke [71]. Spectral GC uses the power spectral density (PSD) function to assess GC at particular frequencies. Suppose there is a pair of signals x_1 and x_2 that can be modeled by

x_1[n] = \boldsymbol{a}_{12}^T \boldsymbol{x}_2[n] + \boldsymbol{a}_{11}^T \boldsymbol{x}_1[n] + \epsilon_1[n],
x_2[n] = \boldsymbol{a}_{22}^T \boldsymbol{x}_2[n] + \boldsymbol{a}_{21}^T \boldsymbol{x}_1[n] + \epsilon_2[n],        (2.35)

where \epsilon_1 and \epsilon_2 are assumed to be sampled from white and mutually uncorrelated random processes. Applying the discrete-time Fourier transform (DTFT) yields

X_1(f) = A_{12}(f) X_2(f) + A_{11}(f) X_1(f) + E_1(f),
X_2(f) = A_{22}(f) X_2(f) + A_{21}(f) X_1(f) + E_2(f),        (2.36)

where A_{12}(f), A_{11}(f), A_{22}(f), and A_{21}(f) are the DTFTs of \boldsymbol{a}_{12}, \boldsymbol{a}_{11}, \boldsymbol{a}_{22}, and \boldsymbol{a}_{21}, respectively, X_1(f) and X_2(f) are the DTFTs of x_1 and x_2, respectively, and E_1(f) and E_2(f) are the DTFTs of the samples of \epsilon_1 and \epsilon_2, respectively. Through manipulation, Eq. (2.36) can be rewritten as

[ E_1(f) ; E_2(f) ] = [ B_{11}(f)  B_{12}(f) ; B_{21}(f)  B_{22}(f) ] [ X_1(f) ; X_2(f) ].        (2.37)

As long as B_{11}(f) B_{22}(f) ≠ B_{12}(f) B_{21}(f) for all f ∈ [−0.5, 0.5], Eq. (2.37) can be inverted, yielding

[ X_1(f) ; X_2(f) ] = [ C_{11}(f)  C_{12}(f) ; C_{21}(f)  C_{22}(f) ] [ E_1(f) ; E_2(f) ].        (2.38)

Under these circumstances, the spectral density of x_1 can be written as

|X_1(f)|^2 = |C_{11}(f) E_1(f)|^2 + |C_{12}(f) E_2(f)|^2.        (2.39)

Using Eq. (2.39), SGC is defined as

SGC_{x_2 \to x_1}(f) = \ln\!\left( \frac{|X_1(f)|^2}{|C_{11}(f) E_1(f)|^2} \right),        (2.40)

or, equivalently,

SGC_{x_2 \to x_1}(f) = \ln\!\left( 1 + \frac{|C_{12}(f) E_2(f)|^2}{|C_{11}(f) E_1(f)|^2} \right).        (2.41)

This means that SGC_{x_2 \to x_1} is a monotonic function of the ratio between the “contribution” of E_2(f) (originating from x_2) and that of E_1(f) (originating from x_1). As the contribution of E_2(f) to x_1 increases, so does SGC_{x_2 \to x_1}.

It is important to note that, due to the matrix inversion in Eq. (2.38), the relationship between the parameters in the vectors \boldsymbol{a}_{11}, \boldsymbol{a}_{12}, \boldsymbol{a}_{21}, and \boldsymbol{a}_{22} [from Eq. (2.35)] and the functions C_{11}(f) and C_{12}(f) is not straightforward and is model-order dependent. An example of the nontrivial relationship between the parameters and GC is shown in Appendix A. Spectral GC is particularly helpful when the frequency bands of interest are well known or concentrated into relatively narrowband peaks [35]. Another noteworthy characteristic of SGC is that it is (at least theoretically) filtering invariant; that is, the SGC values do not change when the signals are filtered by an invertible filter [15]. In fact, prefiltering the data has been recommended against unless the noise can be very well characterized (e.g., 50 Hz/60 Hz mains hum) [16].

2.5.4 Conditional Granger causality

Geweke also developed an extension to GC for MVAR models [72]. When analyzing more than two signals, traditional GC is unable to differentiate chains of causal relationships. For example, suppose x, y, and z are signals that can be represented by an MVAR model. If both GC_{x→z} and GC_{y→z} are large, GC cannot distinguish model A in Fig. 2.3a from model B in Fig. 2.3b. Conditional GC resolves the ambiguity by evaluating the improvement in the prediction conditioned on another signal or set of signals.
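The distinction between the two models of Fig. 2.3 can be reproduced numerically. The sketch below simulates a toy instance of model B (x drives y, y drives z, with no direct x → z link; all coefficients are assumptions of this example), estimates pairwise GC per Eq. (2.34), and then conditions on y per Eq. (2.42): the pairwise value GC_{x→z} is large, while the conditional value collapses toward zero.

import numpy as np

def gc(y_target, predictors_restricted, predictors_full):
    # GC as the log ratio of residual variances of two LSE fits (Eqs. 2.34, 2.42).
    def resid_var(X):
        coef, *_ = np.linalg.lstsq(X, y_target, rcond=None)
        return np.var(y_target - X @ coef)
    return np.log(resid_var(predictors_restricted) / resid_var(predictors_full))

# Simulate "model B" of Fig. 2.3: x drives y, y drives z (no direct x -> z link).
rng = np.random.default_rng(7)
N = 5000
x = rng.normal(size=N)
y = np.zeros(N)
z = np.zeros(N)
for n in range(1, N):
    y[n] = 0.8 * x[n-1] + 0.3 * rng.normal()
    z[n] = 0.8 * y[n-1] + 0.3 * rng.normal()

# Lagged regressors (model order 2 so the x -> z chain is visible to the fit).
lag1, lag2 = slice(1, N-1), slice(0, N-2)
target = z[2:]
Z = np.column_stack([z[lag1], z[lag2]])
X = np.column_stack([x[lag1], x[lag2]])
Y = np.column_stack([y[lag1], y[lag2]])

print("GC x->z         :", gc(target, Z, np.column_stack([Z, X])))
print("GC x->z given y :", gc(target, np.column_stack([Z, Y]),
                              np.column_stack([Z, Y, X])))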
[Figure 2.3: Different explanations for large GC_{x→z}. (a) Model A: x causes z directly. (b) Model B: x causes z indirectly through y.]

In other words, conditional GC compares the variance of the error sequence associated with the model for z predicted using past values of y and z, \sigma_{\epsilon_{z|y}}^2, to the variance of the prediction error of signal z given past values of x, y, and z, \sigma_{\epsilon_{z|x,y}}^2. Similarly to Eq. (2.34), conditional GC is defined as

GC_{x \to z | y} = \ln\!\left( \sigma_{\epsilon_{z|y}}^2 / \sigma_{\epsilon_{z|x,y}}^2 \right).        (2.42)

In model A, GC_{x→z|y} remains large, while in model B, GC_{x→z|y} will be small (ideally zero). Thus, a large GC_{x→z|y} would indicate that model A is more likely than model B.

2.5.5 New causality

Instead of focusing on predictive power as a measure of causality, the NC measure relies on the internal structure of a parametric model, evaluating the proportion of the energy of each contribution [formally defined in Eq. (2.44)] to infer causation. By making use of the model structure, NC is able to represent the strength of the internal mechanisms of the observation model more proportionately. Also, unlike GC, which requires the careful selection of conditioning sets beforehand (otherwise potentially leading to false conclusions [184]), NC foregoes the use of two models and derives its value from a single MVAR model.

Suppose an estimated model is generated for time-series data using the error-augmented model of Eq. (2.14), which can be expanded and grouped by regressor signal as

x_p[n] = \sum_{h=1}^{N_s} \sum_{m=1}^{M} a_{ph}^{m} x_h[n-m] + \epsilon_p[n].        (2.43)

Under this model, we define the contribution from x_q to x_p as

c_{pq}[n] = \sum_{m=1}^{M} a_{pq}^{m} x_q[n-m],        (2.44)

such that the NC measure is defined as

NC_{x_q \to x_p} = \frac{\sum_{n=M}^{N} \left(c_{pq}[n]\right)^2}{\sum_{h=1}^{N_s} \sum_{n=M}^{N} \left(c_{ph}[n]\right)^2 + \sum_{n=M}^{N} \epsilon_p^2[n]},        (2.45)

or, equivalently,

NC_{x_q \to x_p} = \frac{\sum_{n=M}^{N} \left(\sum_{m=1}^{M} a_{pq}^{m} x_q[n-m]\right)^2}{\sum_{h=1}^{N_s} \sum_{n=M}^{N} \left(\sum_{m=1}^{M} a_{ph}^{m} x_h[n-m]\right)^2 + \sum_{n=M}^{N} \epsilon_p^2[n]},        (2.46)

where NC_{x_q \to x_p} is the NC value of x_q into x_p, N is the number of observed time samples of x_p and x_h, M is the model order, and N_s is the number of signals compared. When comparing two signals, the equation reduces to

NC_{x_2 \to x_1} = \frac{\sum_{n=M}^{N} \left(\sum_{m=1}^{M} a_{12}^{m} x_2[n-m]\right)^2}{\sum_{n=M}^{N} \left[\left(\sum_{m=1}^{M} a_{12}^{m} x_2[n-m]\right)^2 + \left(\sum_{m=1}^{M} a_{11}^{m} x_1[n-m]\right)^2 + \epsilon_1^2[n]\right]}.        (2.47)

2.5.6 Spectral new causality

One characteristic shared by many causality analysis tools is the ability to spectrally decompose the measure to analyze particular frequency bands. The spectral extension of new causality, henceforth referred to as Spectral New Causality (SNC),⁷ proceeds rather intuitively from the seminal definition. First, the contributions are defined in the frequency domain,

C_{pq}(f) = \mathcal{F}\{c_{pq}[n]\} = \sum_{n=M}^{N} c_{pq}[n]\, e^{-j 2\pi f n},        (2.48)

where \mathcal{F} is the DTFT operator, which is shown on the right-hand side. The SNC is then defined as

SNC_{x_q \to x_p}(f) = \frac{|C_{pq}(f)|^2}{\sum_{h=1}^{N_s} \int_{-0.5}^{0.5} |C_{ph}(f)|^2\, df + (N - M)\, \sigma_{\epsilon_p}^2},        (2.49)

where \sigma_{\epsilon_p}^2 is the sample variance of \epsilon_p. Note also that the denominator has been modified for consistency, but it can be shown using Parseval’s theorem that its value is equivalent to the denominator in Eq. (2.46). In [95], SNC is defined using the power spectra of the regressor signals, but the definition using contributions is equivalent and greatly simplifies the derivations in Ch. 4. Also note that in [95], the integrals in the denominator are erroneously omitted. One characteristic shared between SGC and SNC is that the integral of SNC_{x_q \to x_p}(f) over one period of the DTFT (e.g.,
from -0.5 to 0.5) yield the GC and NC values respectively. The expression for SNC is conceptually similar to RPC [3]. The difference lies in that RPC uses the power contribution of the innovation sequence of a signal (𝜖𝑞 ) instead of the signals (𝑥𝑞 ). One 7 The spectral extension is called “new spectral causality” in [95], which is confusing, as it is the spectral extension to NC, rather than a new definition of spectral causality (which does not exist). 40 advantage of RPC is that the denominator is model invariant, whereas in NC the squared sum of the elements in the denominator depend on the model parameter estimates. This occurs because 𝜖𝑝 and 𝜖𝑞 are assumed to be mutually uncorrelated for all 𝑝 ≠ 𝑞, whereas 𝑥𝑝 and 𝑥𝑞 are (in general) correlated. This can lead to the presence of bias in the NC estimates (further explored in Sec. 3.4). 41 CHAPTER 3 A CRITICAL ANALYSIS OF NEW CAUSALITY 3.1 Overview In the causality analysis literature, the distinctions among systems, observation models, and estimated models is often blurred. Models are often taken at face value without further discussion on the validity of the model, order and parameter estimates. In this chapter, some of the observation models used in NC literature [94, 95, 100] are discussed. Then, two case studies are done in order to evaluate the robustness of NC and GC to model order and parameter estimation errors. Finally, four scenarios for bias in NC estimation are explored. From the perspective of the equivalence of GC to TE (measuring transferred information), GC will not measure causal contributions from signals that follow predictable patterns (e.g., slow changing signals, periodic or quasiperiodic signals). While it is true that the GC values estimated using data from some of these observation models may defy intuition on causal strength, signals with high temporal correlation will also require a large number of epochs to produce accurate parameter estimates. Sec. 3.2 discusses the challenge some of the models in NC literature pose to parameter estimation and also the plausibility of some of models. Some of the observation models used in the literature to showcase the advantages of NC over GC are severely ill-posed. Although it has been shown that NC can more proportionally represent the causal mechanisms than GC [95, 100], the NC values can only improve upon the inference from GC values if the the estimated models correctly mimic the internal dynamics of the observation models. In Sec. 3.3, particular examples are shown of how NC estimates are susceptible to errors in the parameter estimation. In summary, the NC value is as good (or useful) as the model used. On the other hand, GC is generally more robust to parameter estimation errors. During the investigation of the robustness of NC estimates to model estimation errors, bias was observed in the estimates. This led to the study reported in Sec. 3.4, in which, a mathematical 42 approach is used to predict likely the sources of bias in NC estimates. A significant portion of this chapter is quoted directly from the author’s work in [147, 148] with a few modifications for improved flow and clarity. 3.2 Problematic aspects of models in NC literature With the assessment of causality strength in mind, several observation models previously used in comparisons between GC and NC will be re-examined. 3.2.1 Model 1 A principal example observation model studied by Hu et al. [95, Eq. (14)] is re-examined. The observation model is compared to a second observation model [95, Eq. 
(15)], to argue that GC does not reflect the “real strength of causality,” the observation models share the same GC value, in spite of their differences. This model is ill-posed in a way that produces relatively small GC estimates. However, the ill-posedness also presents a challenge for NC, as NC depends on the parameter estimates (further discussion on the effect of parameter estimate errors on NC is given in Sec. 3.3). The observation model from [95, Eq. (14)] is expressed as 𝑥1 [𝑛] = 0.8𝑥1 [𝑛 − 1]− 0.8𝑥2 [𝑛 − 1] + 𝜂1 [𝑛], 𝑥2 [𝑛] = + 0.8𝑥2 [𝑛 − 1] + 𝜂2 [𝑛], (3.1) in which 𝜂1 and 𝜂2 are white noise processes of variances 0.005 and unity, respectively. It is noteworthy that 𝜎𝜂22 = 200𝜎𝜂21 . In [95], it is claimed that the GC value does not reflect the apparent real causal interaction between 𝑥1 and 𝑥2 . Although the low variance of 𝜂1 of [Eq. (3.1)] aids in the estimation of the parameters associated with 𝑥1 [𝑛], it also can cause the covariance matrix of the regressors to be ill-conditioned. Because of the small 𝜎𝜂21 , for any 𝓁 ∈ Z, one can write 𝑥1 [𝑛 − 𝓁 ] ≈ 0.8𝑥1 [𝑛 − 𝓁 − 1] − 0.8𝑥2 [𝑛 − 𝓁 − 1], (3.2) so regressors 𝑥1 [𝑛 − 𝓁 ], 𝑥1 [𝑛 − 𝓁 − 1] and 𝑥2 [𝑛 − 𝓁 − 1] are approximately linearly dependent. The linear dependance can also be characterized as a null space in the regressor matrix, in which 43 variations in the parameters have little effect on the residual error. When combined with the relatively large variance found in 𝑥2 , one can write 𝑥2 [𝑛] ≈ 0.8𝑥2 [𝑛 − 1] + 𝜂2 [𝑛] + ⋯ + 𝑀−1 ∑ 𝛽𝓁 𝑥1 [𝑛 − 𝓁 ] − 0.8𝑥1 [𝑛 − 𝓁 − 1] + 0.8𝑥2 [𝑛 − 𝓁 − 1] , (3.3) 𝓁 =1 [ ] in which the 𝛽𝓁 are scalars that represent errors in the estimated parameters in the direction given by the parameters in the brackets. A large variance on the parameter estimates is expected in light of Eq. (2.17), because the covariance matrix is ill-conditioned. Because Eq. (3.2) contains both 𝑥1 and 𝑥2 terms and the way NC is computed, the estimates of NC1→2 and NC2→2 will be biased towards 0.5, which is particularly problematic for NC1→2 , since ideally NC1→2 = 0. A full treatment for the presence of bias in ill-posed problems is given Sec. 3.4.3. 3.2.2 Model 2 The observation model studied by Hu et al. in [95, Eq. (15)] is used in conjunction with the observation model in Eq. (3.1) to compare GC and NC. This model is ill-posed as well and has an unrealistic structure which also produces small GC values. The observation model is given by 𝑥1 [𝑛] = − 0.8𝑥2 [𝑛 − 1] + 𝜂1 [𝑛], 𝑥2 [𝑛] = + 0.8𝑥2 [𝑛 − 1] + 𝜂2 [𝑛], (3.4) where 𝜂1 and 𝜂2 are white noise processes of variances 0.01 and unity, respectively. Note that 𝑥1 [𝑛] does not depend on previous samples of itself. Due to the small variance of 𝜂1 relative to 𝜂2 , this is also an ill-posed problem, as one can deduce from Eq. (3.4) that 𝑥1 [𝑛 − 𝓁 ] + 0.8𝑥2 [𝑛 − 𝓁 − 1] ≈ 0, (3.5) for any 𝓁 ∈ Z. Following an argument similar to Eq. (3.3), as 𝜎𝜂21 is much smaller than 𝜎𝜂22 , the estimated model is likely to contain contributions from 𝑥1 into 𝑥2 , in the form of 𝑥1 [𝑛 −𝓁 ]+0.8𝑥2 [𝑛 − 𝓁 − 1], which are absent in the observation model. 44 3.2.3 Model 3 Hu et al. [95, Eq. 24] use the following observation model to argue that GC underrepresents causality strength 𝑥1 [𝑛] = − 0.99𝑥2 [𝑛 − 1] + 𝜂1 [𝑛], 𝑥2 [𝑛] = 0.99𝑥1 [𝑛 − 1] + 0.1𝑥2 [𝑛 − 1] + 𝜂2 [𝑛], (3.6) where 𝜂1 and 𝜂2 are white noise processes of variances unity and 0.1, respectively. In this observation model, the asymptotic value for GC2→1 is 0.093 (the derivation of the expression is given in Sec. 
A.2.2 of Appendix A), meaning that the power of the residual error of the prediction of 𝑥1 [𝑛] is reduced by less than 10% by including previous samples of 𝑥2 relative to using only past samples of 𝑥1 . It is claimed in [95] that the GC cannot identify the causal relationship between the two signals, as the theoretical value for NC2→1 is 0.96, which indicates that current value of 𝑥1 can be almost fully explained by first delayed value of 𝑥2 . The small GC value is a result of the particular conditions in this observation model. The relatively large 𝜎𝜂21 means that a larger portion of the signal cannot be explained by previous values of either 𝑥1 or 𝑥2 . Therefore, the theoretical minimum variance of the residual of 𝑥1 is relatively large. Additionally, since 𝜎𝜂22 is relatively small, the previous values of 𝑥2 can be well predicted by previous values of 𝑥1 , so the reduction of residual error by considering 𝑥2 is small. If 𝜎𝜂21 is made equal to 𝜎𝜂22 , GC1→2 = 0.67, and GC2→1 = 0.70, meaning that the contributions from 𝑥1 to 𝑥2 is similar, but smaller, than the contribution of 𝑥2 to 𝑥1 . The power of residual error is reduced by about half in both cases. The NC values also indicate that the strengths of the contributions are similar to each other with NC1→2 = 0.96 and NC2→1 = 0.98. 3.2.4 Model 4 In Hu et al. in [95, Eq. (25)], the following observation model is used to further argue that GC does not represent causality strength. The observation model is given by 𝑥1 [𝑛] = − 0.99𝑥2 [𝑛 − 1] + 𝜂1 [𝑛], 𝑥2 [𝑛] = 0.1𝑥2 [𝑛 − 1] + 𝜂2 [𝑛], (3.7) 45 where 𝜂1 and 𝜂2 are white noise processes of variances unity and 0.1, respectively. The GC value from 𝑥2 into 𝑥1 is 0.092, which is claimed to be too small, given that 𝑥1 [𝑛] is clearly caused by 𝑥2 [𝑛 − 1]. However, upon closer inspection, it becomes clear that the contribution of 𝑥2 into 𝑥1 is indeed small. The variances of 𝑥1 [𝑛] and 𝑥2 [𝑛] are 1.099 and 0.101 respectively. So the contribution of 𝜂1 to 𝑥1 is about 10 times larger than that of 𝑥2 . Even under perfect estimation conditions, the residual can only be reduced by approximately 9%, so, although 𝑥2 represents the only measurable contribution to 𝑥1 , the contribution is significantly smaller than that of 𝜂1 , as GC correctly indicates. 3.2.5 Model 5 Model 5 is presented in the paper by Hu et. al. [100, Eq. (25)] to support a claim that GC possesses a “fatal drawback” that makes it unsuitable in some scenarios. This, of course, is only true if a similar observation model is plausible in any practical application, otherwise the analysis should have limited bearing on judging GC. The model is given by 𝑥1 [𝑛] = 𝑥2 [𝑛 − 1] + 𝜂1 [𝑛], 𝑥2 [𝑛] = − 0.9𝑥1 [𝑛 − 1] + 𝜂2 [𝑛], (3.8) where 𝜂1 and 𝜂2 are white noise processes of unity variance. In this model, at every timestep, 𝑥1 and 𝑥2 exchange values with one another. The current value of 𝑥1 depends solely on the delayed sample of 𝑥2 and a white noise process. Similarly, 𝑥2 depends solely on the first delayed sample of 𝑥1 and a white noise process. The variances of 𝑥1 and 𝑥2 are 10.53 and 9.53 respectively, which are significantly larger than that of 𝜂1 and 𝜂2 . Therefore, the contribution of 𝑥1 into 𝑥2 and that of 𝑥2 into 𝑥1 are indeed relatively large. The asymptotic values for GC are GC1→2 = 0.26 and GC2→1 = 0.30, whereas the theoretically-evaluated NC values are NC1→2 = 0.90 and NC2→1 = 0.89. 
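These NC values can be checked numerically. The sketch below simulates Eq. (3.8), fits an order-1 bivariate model by LSE, and evaluates Eq. (2.47); with a long record, both directional NC estimates come out near 0.9, consistent with the theoretical values quoted above (the record length and seed are assumptions of this example).

import numpy as np

rng = np.random.default_rng(8)

# Simulate Model 5, Eq. (3.8): x1[n] = x2[n-1] + eta1[n], x2[n] = -0.9 x1[n-1] + eta2[n].
N = 100_000
x1 = np.zeros(N)
x2 = np.zeros(N)
for n in range(1, N):
    x1[n] = x2[n-1] + rng.normal()
    x2[n] = -0.9 * x1[n-1] + rng.normal()

def nc(target, own, other):
    # NC of `other` into `target` for an order-1 bivariate model, Eq. (2.47):
    # energy of the cross contribution over the total energy of both
    # contributions plus the residual energy.
    X = np.column_stack([own, other])
    (a_own, a_other), *_ = np.linalg.lstsq(X, target, rcond=None)
    eps = target - X @ np.array([a_own, a_other])
    c_own, c_other = a_own * own, a_other * other
    return np.sum(c_other**2) / (np.sum(c_own**2) + np.sum(c_other**2) + np.sum(eps**2))

print("NC(x2 -> x1):", nc(x1[1:], x1[:-1], x2[:-1]))   # both come out near 0.9,
print("NC(x1 -> x2):", nc(x2[1:], x2[:-1], x1[:-1]))   # in line with the text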
The NC values clearly indicate how strongly 𝑥1 and 𝑥2 are coupled, whereas the GC values are relatively low, representing a potential of reduction of the variance of the error of only 23% and 26% for GC1→2 and GC2→1 respectively. 46 However, a bigger question is: “Under what conditions would a similar observation model occur in nature?” The propagation delay between the two signals is exactly one time sample, which can only be achieved if the sampling rate is designed this way or by faulty delay embedding. This means the observation model would not be so cleanly representable if the sampling rate were even slightly different. Additionally, (according to the model equations) the signals do not depend directly on previous samples of themselves, however, assuming that the continuous-time counterparts of 𝑥1 is differentiable, we have 𝑑𝑥1 (𝑡) 𝑥1 (𝑡 + Δ𝑡) ≈ 𝑥1 (𝑡) + Δ𝑡 ⋅ , 𝑑𝑡 (3.9) for small enough Δ𝑡, where 𝑥1 (𝑡) is the continuous time signal from which 𝑥1 [𝑛] is sampled (i.e., 𝑥1 (𝑛𝑇𝑠 ) = 𝑥1 [𝑛], where 𝑇𝑠 is the sampling period and the sampling rate 𝑓𝑠 = 1/𝑇𝑠 ). Consequently, for small enough 𝑇𝑠 , 𝑑𝑋1 (𝑡) || 𝑥1 [𝑛 + 1] ≈ 𝑥[𝑛] + 𝑇𝑠 ⋅ , 𝑑𝑡 ||𝑡=𝑛𝑇𝑠 (3.10) where 𝑑/𝑑𝑡 denotes the derivative in time. For any sufficiently high sampling rate, the signals should be at least correlated to previous samples of themselves. A similar analysis is done in [79], but in the context of GC. Another example of insufficient sampling rate is given in Sec. 3.2.7. The observation model in Eq. (3.8) can be written as 𝑥1 [𝑛] = − 0.9𝑥1 [𝑛 − 2] + 𝜂3 [𝑛], 𝑥2 [𝑛] = − 0.9𝑥2 [𝑛 − 2] + 𝜂4 [𝑛], (3.11) where the residual errors 𝜂3 [𝑛] = 𝜂1 [𝑛] + 𝜂2 [𝑛 − 1], and 𝜂4 [𝑛] = 0.9𝜂1 [𝑛 − 1] + 𝜂2 [𝑛] are independent white Gaussian processes of variances 2 and 1.81 respectively. The small difference in variance of the residuals explains the low GC values. The model given in Eq. (3.11) might have slightly larger power in the residual error, but it also does not require inter-channel contributions. So the model given in Eq. (3.11) is arguably simpler than Eq. (3.8), with a minimal increase of predictive error. Additionally, notice that 𝑥1 [𝑛] and 𝑥2 [𝑛] are uncorrelated, further strengthening the case for the model given in Eq. (3.11). 47 The claim that GC incorrectly represents the causal relationship requires knowledge that the model of Eq. (3.8) correctly models signals 𝑥1 and 𝑥2 . While this can be argued in a computational simulation, such a strong claim cannot be made about a complex problem where the underlying mechanism cannot be easily explained. Thus, while an interesting mental exercise concerning GC, a case cannot be made that the occurrence of such cases is significant enough to warrant the term “fatal flaw.” The choice of sampling frequency is also important for modeling and causality inference in real applications. A discussion of the relationship of regression and sampling rates is given in [23] in the context of nonlinear models. Existing literature of the effects of insufficient sampling in GC estimation in the context of econometrics is found in [136] and in the context of neurophysiological processes in [18]. 3.2.6 Model 6 In [94], Hu et al. provide two example observation models in which GC values are zero, even though there are clearly causal relationships between the two signals. These example highlight that GC measures transferred information, rather than “causal” influence. In this case, 𝑥2 is a periodic signal, and therefore no new entropy (information) is added to 𝑥2 beyond the first period. 
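The rewriting of Eq. (3.8) as Eq. (3.11) is easy to check numerically. The short sketch below (Python/NumPy; the record length and seed are arbitrary choices made here) simulates Eq. (3.8) and verifies that the signal variances are near 10.53 and 9.53, and that regressing each signal on its own second lag alone leaves residual variances near 2 and 1.81, as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000
x1 = np.zeros(N); x2 = np.zeros(N)
e1 = rng.normal(size=N); e2 = rng.normal(size=N)   # unit-variance white noise
for n in range(1, N):                              # observation model of Eq. (3.8)
    x1[n] = x2[n - 1] + e1[n]
    x2[n] = -0.9 * x1[n - 1] + e2[n]

print(np.var(x1), np.var(x2))          # expected near 10.53 and 9.53

def own_lag2_resid_var(x):
    """Residual variance of x[n] regressed on x[n-2] alone, as in Eq. (3.11)."""
    y, z = x[2:], x[:-2]
    a = np.dot(z, y) / np.dot(z, z)    # least-squares coefficient
    return np.var(y - a * z)

print(own_lag2_resid_var(x1), own_lag2_resid_var(x2))   # expected near 2 and 1.81
```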
The first example model is found in [94, Eq. (10)], 𝑥1 [𝑛] = −0.99𝑥2 [𝑛 − 1] + 𝜂1 [𝑛], 𝑥2 [𝑛] = −𝑥2 [𝑛 − 2], (3.12) where 𝜂1 is a white noise process of unity variance. Note that this model does not contain a 𝜂2 term, effectively making 𝜎𝜂2 = 0. It can be shown that 𝑥2 can be expressed as a periodic signal with a period of exactly four samples, repeating 𝑥2 [0], 𝑥2 [1], −𝑥2 [0], and −𝑥2 [1] indefinitely. The values of these samples depend on the initial conditions of 𝑥2 . Similarly to Model 5, the observation model seems to be sampled in a way that synchronizes with 𝑥2 , such that the period of 𝑥2 is exactly four samples. Due to the 48 lack of external driving forces, 𝑥2 is stable, however, it is noteworthy that the observation model contains a pole on the unit circle. Since 𝑥2 is not stochastic, it can be estimated using past values of 𝑥1 as 𝑁𝓁 𝑁𝑘 ∑ 𝑥1 [𝑛 − 3 − 4𝓁 ] − ∑ 𝑥1 [𝑛 − 1 − 4𝑘] 𝑥̂2 [𝑛] ≈ − , 𝓁 =0 𝑘=0 0.99(𝑁𝓁 + 𝑁𝑘 ) (3.13) √ 2 2 for any 𝑁𝓁 > 0 and 𝑁𝑘 > 0. For large enough 𝑁𝓁 + 𝑁𝑘 (e.g. 𝑁𝓁 + 𝑁𝑘 ≫ 𝜎𝑥2 /𝜎𝜂1 ), past values of 𝑥1 can predict the value 𝑥2 , so that the first line of Eq. (3.12) can be rewritten as 𝑁𝓁 𝑁𝑘 ∑ 𝑥1 [𝑛 − 4 − 4𝓁 ] − ∑ 𝑥1 [𝑛 − 2 − 4𝑘] 𝑥1 [𝑛] = lim + 𝜂1 [𝑛]. 𝓁 =0 𝑘=0 𝑁𝓁 →∞ 𝑁𝓁 + 𝑁𝑘 (3.14) 𝑁𝑘 →∞ In this case, the GC value tends to zero, as Eq. (3.14) does not contain any 𝑥2 terms and yet has the same residual as Eq. (3.13). However, since AR models must have finite order, the GC is never zero. Often a maximum order is imposed in the regression algorithm to avoid overfitting, which would bound GC away from zero. However, because 𝑥2 is deterministic, it cannot be discerned whether 𝑥2 [𝑛 − 1] truly causes 𝑥1 [𝑛]. Since Eq. (3.12) can be rewritten as 𝑥1 [𝑛] = −0.99𝑥2 [𝑛 − 1 − 4𝓁1 ] + 𝜂1 [𝑛], 𝑥2 [𝑛] = −𝑥2 [𝑛 − 2 − 4𝓁2 ], (3.15) for any 𝓁1 , 𝓁2 ∈ Z+ . Thus, it is impossible to discern 𝑥2 [𝑛 − 1] from any 𝑥2 [𝑛 − 1 − 4𝓁 ] where1 𝓁 ∈ Z+ . In terms of parameter estimation, this ambiguity causes the regressor matrix to become singular. In this case, additional assumptions (e.g. sparsity in parameters, bias towards smaller weights or maximum allowable model order) are necessary to estimate the model parameters correctly. This is especially true if it is desired to also concurrently estimate the propagation delay between 𝑥2 and its effect on 𝑥1 . 1 in fact any convex combination of 𝑥2 [𝑛 − 1 − 4𝓁 ] terms would be indistinguishable. 49 3.2.7 Model 7 The second example given by Hu et al. in [94, Eq. (13)] shows another instance in which GC is allegedly zero: 𝑥1 [𝑛] = −0.99𝑥2 [𝑛 − 1] + 𝜂1 [𝑛] 𝑥2 [𝑛] = 𝜂1 [𝑛], (3.16) where 𝜂1 is a white noise process of unity variance. Note that the equations for both 𝑥1 and 𝑥2 have 𝜂1 [𝑛] instead of separate 𝜂1 [𝑛] and 𝜂2 [𝑛]. Although 𝑥2 [𝑛] is not linearly predictable by any strictly causal model, 𝑥2 [𝑛 − 1] is predictable given past samples of 𝑥1 . Hu et al. state that for any realization of Eq. (3.16), it is possible to rewrite the equation for 𝑥1 [𝑛] as 𝑀 𝑥1 [𝑛] = lim ∑ 𝑎𝑗 𝑥1 [𝑛 − 𝑗] + 𝜂1 [𝑛]. 𝑀→∞ (3.17) 𝑗=1 for some {𝑎𝑗 }∞𝑗=1 .However, this is only true as 𝑀 → ∞. For 𝑀 = 1, the GC value from 𝑥2 into 𝑥1 is 0.4, decreasing monotonically as 𝑀 increases. Although this value is arguably low given the mechanism in Eq. (3.16), the NC value of 𝑥2 into 𝑥1 is = 0.5, which also underrepresents the causal relationship. The authors suggest in [94] that there is a instantaneous causality relationship from 𝑥2 into 𝑥1 . To illustrate this, the first line of Eq. (3.16) can be rewritten as 𝑥1 [𝑛] = 𝑥2 [𝑛] − 0.99𝑥2 [𝑛 − 1]. 
(3.18) In this case, the NC2→1 = 1, which implies that 𝑥1 can be fully explained by 𝑥2 . In this case, GC2→1 → ∞, which also implies that 𝑥1 can be fully explained by 𝑥2 . While 𝜂1 and 𝜂2 are assumed white in AR models, 𝑥1 and 𝑥2 are not. The whiteness of the residuals implies that each new samples of 𝜂1 and 𝜂2 provide innovation to the observation model that is independent of any previous sample. If that were not true, previous samples of 𝜂1 and 𝜂2 could be used to predict the current values of 𝜂1 and 𝜂2 , and, in turn, the current values of 𝑥1 and 𝑥2 . Since most regression techniques aim at minimizing the residual error, 𝜂 is usually to be assumed white. However, in this model, 𝑥2 is white. The implication is that 𝑥2 changes 50 unpredictably and that no previous values of 𝑥2 can be used to predict its current value. This seems to indicate that the system is being sampled insufficiently and that we cannot determine whether 𝑥2 is being aliased. 3.2.8 Discussion of models 1-7 The basis of AR modeling is that previous samples of a signal provide information about the expected current sample. When the signal changes slowly, the previous sample often provides a good estimate of the current sample. By considering two samples, one can estimate the derivative of the signal and use it to improve the estimate. Assuming no overfitting occurs, the inclusion of more regressors will further improve the estimate. The same argument can be expressed in the spectral domain. The estimation filter attempts to match the spectrum of the signal, where filters with larger orders can better match the desired spectrum. In Models 2 through 7 (Sec. 3.2.1 through Sec. 3.2.7), at least one of the regressands does not depend on previous samples of itself. In a physical system, that would imply that either the quantity being estimated has no inertia (not continuous) or that the system is being sampled too slowly. However, if the signals are assumed to be continuous in time and sampled above the Nyquist frequency, it would be expected that, the difference between the current sample of signal should be constrained by the previous samples. While such systems exist, one can argue that they are degenerate cases of more practical observation models. In particular, neural systems have been shown to be strongly dependent on internal states [68]. While some models shown in [94, 95, 100] highlight alleged drawbacks of GC, they represent only a small restrictive class observation models, which are not representative of the performance of GC in most problems, or may even pose difficulty for NC, as parameter estimation could be adversely affected by ill-posed problems. In comparison, the models used as examples in the MVGC toolbox [16] contain models designed to mimic particular realistic scenarios (e.g. 5-node networks, 9-node networks, non-stationary linear models, etc). Although not a comprehensive list, these models provide better means of analyzing and comparing causality tools. 51 In many of the analyses of experimental data presented in [95, 98, 99], NC outperforms GC in showing the causality mechanism strength, however, some simulations in [94, 95, 100] show extreme cases in which GC will predictably underrepresent the causality mechanism and might be considered degenerate observation models. 
Although it is important to highlight instances where GC does not perform well, these are far from being “fatal drawbacks.” In [19], Barrett and Barnett assert that the claim of GC not capturing “how strongly one time-series influences another” could be considered “radical.” However, it was conceded that “GC is not a perfect measure for all stochastic time series: if the true process is not a straightforward multivariate AR process with white-noise residuals, then it becomes only an approximate measure of causal influence.” For different applications, other methods are available such as conditional GC [14, 44], spectral GC [71] and other methods such as partial directed coherence (PDC) [12], relative power contribution (RPC) [3], directed transfer function (DTF) [93], and phase slope index (PSI) [150]. NC is a new addition to that list, which has shown promising results and its strengths and weaknesses will likely be explored in the following years. When comparing two techniques, it is important that the observation models chosen represent the strengths as well as the limitations of both techniques. A possible remedy to this challenge is to create a set of benchmark problems or datasets that represent a variety realistic scenarios. For instance, the multivariate GC toolbox (MVGC) [16] provides a small set of example models representing realistic scenarios. This set could be expanded to account more scenarios and serve as a benchmark set, which would allow a fairer and comprehensive comparison between causality analysis tools. 3.3 Analysis of NC robustness to parameter errors through case studies To empirically evaluate the robustness of NC measures to model parameter estimation error and overfitting under model uncertainty, one of the primary example observation model used in [95, 52 Eq. 14] is re-examined. The observation model is expressed as 𝑥1 [𝑛] = 0.8𝑥1 [𝑛 − 1] − 0.8𝑥2 [𝑛 − 1] + 𝜂1 [𝑛] 𝑥2 [𝑛] = + 0.8𝑥2 [𝑛 − 1] + 𝜂2 [𝑛] (3.19) where 𝜂1 and 𝜂2 are white noise processes of variances 0.005 and unity, respectively. Three scenarios are observed in the present study, each using a different values for 𝑀 of the regressors. To study the statistical properties of the NC and GC estimates under the effects of overfitting and regularization, 65536 simulations were run in each scenario. In each simulation, the number of time samples 𝑁 = 256, and model parameters were estimated using LASSO regression. A wide range of regularization parameters was used (𝜆 ∈ [10−7 , 101 ]). In the following figures, the value of 𝜆 is shown in the 𝑥-axis and the NC and GC values are shown in the 𝑦-axis. For each value of 𝜆, the probability distribution of NC and GC values were estimated from the histogram taken at that 𝜆 value and are shown in a color plot, where yellow represents higher probability density and blue represents lower probability density. Although 𝜆 must also be estimated, several techniques exist to find appropriate values. The most common method is using a resampling method, such as bootstrapping and cross-validation [21, 110]. 𝑘-fold cross-validation splits the dataset into 𝑘 subsets, then for each subset, evaluating the prediction error of that subset using the estimated model obtained by using the union of the other 𝑘 − 1 subsets to estimate the parameters. The 𝜆 value that produces the smallest average of variance of prediction errors is chosen. 
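As an illustration of the cross-validation procedure just described, the sketch below (Python, using scikit-learn's LassoCV) selects a regularization level for the LASSO fit of 𝑥1[𝑛] in the model of Eq. (3.19). The simulated record, the 𝜆 grid and the fold count are choices made here for illustration, and scikit-learn's internal scaling of its regularization parameter need not match the 𝜆 axis used in the figures that follow. Note also that plain 𝑘-fold splitting ignores the temporal ordering of the samples; blocked or rolling splits are often preferred for time-series records.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
N, M = 256, 2
x1 = np.zeros(N); x2 = np.zeros(N)
for n in range(1, N):                                  # simulate Eq. (3.19)
    x1[n] = 0.8 * x1[n - 1] - 0.8 * x2[n - 1] + rng.normal(0, np.sqrt(0.005))
    x2[n] = 0.8 * x2[n - 1] + rng.normal(0, 1.0)

# Design matrix: lags 1..M of x1 followed by lags 1..M of x2.
X = np.column_stack([x1[M - i:N - i] for i in range(1, M + 1)] +
                    [x2[M - i:N - i] for i in range(1, M + 1)])
y = x1[M:]

cv_fit = LassoCV(alphas=np.logspace(-7, 1, 40), cv=5, fit_intercept=False).fit(X, y)
print("selected regularization:", cv_fit.alpha_)
print("estimated parameters:", cv_fit.coef_)
```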
Alternatively, thresholding can be used to evaluate parameter significance, with the number of significant parameters used in conjunction with a method such as AIC or BIC to compare models [124]. In this work, the wide range of 𝜆 values was chosen to showcase how NC and GC estimation react to differing levels of regularization.

For the observation model in Eq. (3.19), the theoretically evaluated NC values [see Eq. (A.22) in Appendix A for the exact expression] are shown in Table 3.1. These NC measures indicate that 𝑥1 strongly dictates its own behavior, that 𝑥1 and 𝑥2 together can very accurately predict 𝑥1 (in other words, NC𝜂1→𝑥1 is small), that 𝑥1 does not contribute to 𝑥2 [as Eq. (3.19) indicates], and that 𝑥2 contributes strongly to its own behavior, although a significant portion of 𝑥2 cannot be explained by either 𝑥1 or 𝑥2 (in other words, NC𝜂2→𝑥2 is relatively large). Although there are other factors at play, the relatively smaller value of NC𝜂1→𝑥1 compared to NC𝜂2→𝑥2 is expected, as the variance of 𝜂1 is much smaller than the variance of 𝜂2. The theoretical values for the GC measures are also shown in Table 3.1, where GC2→1 spans a range because GC is model-order dependent. These values indicate that the presence of 𝑥1 does not improve the prediction of 𝑥2 (𝑥1 does not cause 𝑥2), but the presence of 𝑥2 reduces the residual prediction error energy by a factor of 14 to 140 (𝑥2 does cause 𝑥1).

Table 3.1: Theoretically evaluated GC and NC measures for the observation model in Eq. (3.19)

    NC1→1 = 0.89    NC2→1 = 0.11    NC1→2 = 0    NC2→2 = 0.64
    GC2→1 ∈ [4.86, 5.38]    GC1→2 = 0

Of particular interest in these tests is NC1→2, which should indicate that 𝑥2 is not caused by 𝑥1. Similarly, GC1→2 is expected to be small, indicating that 𝑥1 does not cause 𝑥2. Ideally, NC1→2 = 0 and GC1→2 = 0, but in practice the values will be greater than zero as a consequence of the statistical variance of the estimator in conjunction with the data properties.² Nevertheless, significance thresholds 𝑇NC and 𝑇GC may be set such that, if NC1→2 < 𝑇NC or GC1→2 < 𝑇GC, 𝑥1 is assumed not to cause 𝑥2. The significance thresholds can be obtained through a number of techniques, such as block resampling [159], stationary bootstrap [160] or trial shuffling [37, 196].

²If the estimated 𝑎ᵢ²¹ = 0 for 𝑖 = 1, 2, …, 𝑀, then NC1→2 = 0 and GC1→2 = 0, but, at least in terms of LSE, the probability of exact equality is nil.

In the studies with 𝑀 = 1 (the exact model order), NC and GC using LASSO regression produce good results, despite a small value of 𝑁 (𝑁 = 256). In Fig. 3.1, the distribution of estimated NC1→2 values is shown for different regularization parameters. Even for 𝜆 as low as 10⁻⁷, most simulations yield NC measures close to the theoretical NC values. Fig. 3.2 shows the distribution of GC1→2 values under the same circumstances. Similarly to NC1→2, little variation is seen over the range of 𝜆 values.

Figure 3.1: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 1.

Figure 3.2: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 1.

To evaluate the effects of order overestimation, simulations for the observation model of Eq. (3.19) were run under the same conditions except for a larger 𝑀. With 𝑀 = 2, until enough regularization is applied (𝜆 ≥ 10⁻³), the simulation does not yield satisfactory results for NC, as shown in Fig. 3.4, where there is a large spread of values for NC1→2.
The GC1→2 estimate for 𝑀 = 2 (shown in Fig. 3.3) has higher variance than the estimate for 𝑀 = 1, but it is robust to the overfitting and regularization, holding a consistent value for the lowest values of 𝜆 tested and only changing when excessive regularization is applied (𝜆 > 10⁻²), in other words, when the parameter estimates are substantially biased towards zero.

Figure 3.3: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 2.

Figure 3.4: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 2.

Because of the large correlations between 𝑥1, 𝑥2 and their delayed samples, the covariance matrix of the regressors is ill-conditioned. When 𝑀 is increased to five, the estimate of NC1→2 not only has large variance, but also tends to bifurcate and cluster around two values (approximately 0.35 and 0.55) when 𝜆 < 10⁻³. Neither of these is the theoretically correct value, as shown in Fig. 3.5. This tendency is strengthened as the mismatch between model order and regressor order increases, as shown in Fig. 3.6, where 𝑀 = 6. Meanwhile, the GC estimates remain close to what was observed for 𝑀 = 2, as shown in Fig. 3.7 for 𝑀 = 5 and Fig. 3.8 for 𝑀 = 6. While the extra regressors cause the GC estimates not to have any probability mass at GC = 0 for 𝜆 < 10⁻³, the estimates remain close to zero, even for small amounts of regularization. This suggests that, in this test, GC is more robust to overfitting and model order overestimation, even when the parameters cannot be estimated accurately.

Figure 3.5: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 5.

Figure 3.6: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 6.

Figure 3.7: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 5.

Figure 3.8: Distribution of the GC1→2 estimates as a function of 𝜆 for 𝑀 = 6.

Figure 3.9: Distribution of the NC1→2 estimates as a function of 𝜆 for 𝑀 = 5 and 𝑁 = 1024.

Although it is possible to mitigate the need for regularization with larger sample sizes, as shown in Fig. 3.9, in which 𝑁 = 1024 (instead of 𝑁 = 256 in the previous figures), it is sometimes necessary to infer the change in causality strength over short time intervals, so that the model parameters must be estimated over data blocks spanning the same short time intervals. Blindly increasing the sampling rate is often not advisable, as it would adversely interfere with the models and the conditioning of the regressor matrix [23]. Therefore, special care must be taken when estimating the model order and its parameters to avoid misleading NC values.

In order to study the NC performance in models with longer propagation delays between channels, we propose a similar model here, with

    𝑥1[𝑛] = 0.6𝑥1[𝑛 − 1] − 0.3𝑥2[𝑛 − 4] + 𝜂1[𝑛],
    𝑥2[𝑛] = −0.5𝑥1[𝑛 − 1] + 0.6𝑥2[𝑛 − 1] + 𝜂2[𝑛],    (3.20)

where 𝜂1 and 𝜂2 are both white Gaussian noise processes of zero mean and variances unity and 0.005, respectively. Note that the contribution of 𝑥2 into 𝑥1 has a delay of four samples. In this case, the order of the joint ARX model is four, even though the contribution of 𝑥2 into 𝑥1 can be represented with a single regressor (𝑥2[𝑛 − 4]).
When assuming that the system can be modeled as a joint AR model and varying only the order of the model, either a large number of regressors is made available (when 𝑀 ≥ 4) or an insufficient number of terms is made available (𝑀 < 4). Therefore, problems with larger propagation delays present a challenge for parameter estimation.

The theoretical NC values for Eq. (3.20) are shown in Table 3.2, which can be obtained by evaluating Eq. (A.23) with the expected values for the squared terms. This indicates that, for 𝑥1, the contribution of previous values of 𝑥1 is roughly twice that of the contributions of 𝑥2. Also, the past values of 𝑥1 and 𝑥2 can be used to almost fully predict the current value of 𝑥1. Additionally, the power of the contribution of previous values of 𝑥2 is about three times that of the power of 𝑥1 to 𝑥2, but a portion of the current value of 𝑥2 cannot be well predicted by past values of 𝑥1 and 𝑥2 (since NC1→2 + NC2→2 = 0.77).

Table 3.2: Theoretically evaluated NC measures for the observation model in Eq. (3.20)

    NC1→1 = 0.57    NC2→1 = 0.20    NC1→2 = 0.33    NC2→2 = 0.67

When the exact model order (𝑀 = 4) is used, the results show that regularization is necessary when 𝑁 = 256. Fig. 3.10 shows the NC1→1 values obtained using LASSO regression for different values of the regularization factor (𝜆). The behavior observed in Fig. 3.5 is present, despite the use of the exact model order. When enough regularization is applied, the NC values approach the theoretical values. The results are biased towards zero, partially due to the tendency of the regularization to bias the values of the parameters towards zero, while also increasing the residual error. Additionally, a small bias is observed due to the nonlinear dependence on the parameters in the definition of NC. Further analysis of biases in NC estimates is found in Sec. 3.4.

If the model order is overestimated at 6, the NC values exhibit more variance, even when enough regularization is applied. Fig. 3.11 shows the probability density function of the NC values. This indicates that LASSO-assisted regression is not able to accurately estimate the model parameters, regardless of the choice of regularization factor.

When underestimating the model order as unity, in other words, attempting to predict 𝑥1[𝑛] and 𝑥2[𝑛] with only 𝑥1[𝑛−1] and 𝑥2[𝑛−1], the results were unexpectedly good, as shown in Fig. 3.12. When compared to Fig. 3.11, the results do not bifurcate and have lower variance. Although the average NC estimate is lower than the theoretical value, the results are comparable to those obtained with the exact model order, while accepting a wider range of regularization factors. Part of the improvement comes from the fact that the autocorrelation of the signals is high and that 𝑀 = 1 improves the posedness of the problem.

Figure 3.10: Distribution of the NC1→1 estimates as a function of 𝜆 for the model shown in Eq. (3.20) and 𝑀 = 4.

Figure 3.11: Distribution of the NC1→1 estimates as a function of 𝜆 for 𝑀 = 6.

Figure 3.12: Distribution of the NC1→1 estimates as a function of 𝜆 for 𝑀 = 1 for the model shown in Eq.
(3.20) 3.3.1 Discussion 3.3.2 NC and GC fluctuation For synthetic models, the NC and GC values can be calculated theoretically using the variances of 𝜂1 and 𝜂2 and the observation model parameters and the MMSE model parameter values for the disjoint model [see Eq. (A.15) in Sec. A.2.2 for the derivation of the disjoin model parameters]. However, even when the model parameters are known, NC and GC values estimated using their respective definitions also depend on the particular samples (practically, only sample variances can be obtained, which are used as an estimate of the variances) that are used to estimate the model parameters. To decouple the variation in the NC values due to parameter estimation errors and the variation due to sample variance, in this subsection, the NC values are calculated twice, once with the parameters obtained from LASSO regression and once with the true model parameters. The NC values obtained with the true model parameters will henceforth be called NC0 . The variation found in NC0 values indicates how much variation is inherent in the short-term record, while the correlation between the NC0 values and the NC values serve to assess the effect of parameter estimation on the variation of NC values. 62 The NC and NC0 values are expected to be strongly correlated, as this would indicate that the NC value estimates are close to the theoretical values, and that the variation in the NC values originates mostly from sample variation, rather than noise or inaccurate assessment of causality strength. Fig. 3.13 shows the bidimensional histogram data for NC2→2 and NC0,2→2 for the model in Eq. (3.20) and 𝑀 = 4 for differing regularization factors. On the x-axis are the bins for the NC0,2→2 values and in the y-axis are the bins for NC2→2 values. The dashed line region of the histogram that corresponds to 𝑁 𝐶 = 𝑁 𝐶0 and the circle is drawn centered at the theoretically calculated NC value. The NC values in Fig. 3.13a were obtained with 𝜆 = 10−2 , which shows a strong correlation between the NC and NC0 , although not perfect alignment with the dashed line. (a) 𝜆 = 10−2 (b) 𝜆 = 10−4 (c) 𝜆 = 10−1 Figure 3.13: NC1→1 vs NC0,1→1 histogram plots as a function of 𝜆 If the regularization factor is not sufficiently large, however, NC estimates deviate from NC0 . For small regularization factors, the correlation is much lower between NC and NC0 . Fig. 3.13b shows the NC2→1 for 𝜆 = 10−4 , where some of the estimates are correctly located around the dashed line and circle. However, many of the estimates do not correlate well with NC0 , but rather are even weakly negatively correlated with it (the bottom peak). 63 For large regularization factors, the NC estimates tend to be biased towards zero, even if the correlation is high. A more formal treatment of the bias is given in Sec. 3.4.4. Note that this is expected as regularization introduces bias to the parameter estimator in exchange for lower variance of the estimates. Fig. 3.13c shows the histogram data for 𝜆 = 10−1 , which is only one order of magnitude larger than that of Fig. 3.13a, but the results are no longer centered on the dashed line. 3.3.3 Regression conditioning and over-fitting The model of Eq. (3.19) is introduced by Hu et al. in [95], where it is claimed that the GC value does not reflect the real causal influence between 𝑥1 and 𝑥2 . In the present work, it is shown that this system also poses problems for NC estimation. Although the low variance of 𝜂1 of Eq. 
(3.19) aids in the estimation of the parameters associated with 𝑥1 [𝑛], it also can cause the regressors to be highly colinear, as one can write 𝑥1 [𝑛 − 𝑙] ≈ 0.8𝑥1 [𝑛 − 𝑙 − 1] − 0.8𝑥2 [𝑛 − 𝑙 − 1] (3.21) so regressors 𝑥1 [𝑛 − 𝑙], 𝑥1 [𝑛 − 𝑙 − 1] and 𝑥2 [𝑛 − 𝑙 − 1] are nearly linearly dependent. This can also be characterized as a null space in the regressor matrix [𝑿 [𝑁 ] in Eq. (2.11)], in which variations in the parameters have very little effect on the residual error, creating large variances in the parameter estimation [see Eq. (2.17) for the distribution of the parameters under LSE]. When combined with the relatively large variance found in 𝑥2 , one can write 𝑥2 [𝑛] ≈ 0.8𝑥2 [𝑛 − 1]+ 𝑃−1 ∑ 𝛽𝑙 𝑥1 [𝑛 − 𝑙] − 0.8𝑥1 [𝑛 − 𝑙 − 1] + 0.8𝑥2 [𝑛 − 𝑙 − 1] (3.22) 𝑙=1 [ ] in which the 𝛽𝑙 are scalars that govern the deviation of the predicted parameters in the direction given by the parameters in the brackets. The increased model order also adversely affects the predictive error estimation, thus, GC analysis. Since the model order is also used to estimate the variance of the residual error, GC 64 values will change as the model order increases. The parameter estimation algorithm attempts to match the spectrum of the regressand, so an increase in model order yields improvements in the prediction using the disjoint model, even when using the joint observation model. The reduction in the residual error of the estimated disjoint model causes the GC value for large model orders to be lower than desired. This was analyzed for a particular model by Zhuo et al. in [220], where a series of backward recursive operations was used to expand the the AR model. A similar approach was taken by Grassmann in [79]. A general expansion and discussion is given in Sec. A.2.2. One important discussion presented in [95] is that for overestimated model orders, GC can be invariant to the model parameters,therefore model order estimation is an important aspect when estimating causality strength using GC as well. While methods of estimating model orders exist, such as using the Akaike Information Criterion (AIC) [5] or Bayesian Information Criterion (BIC) [174], these criteria can only compare the quality of different models, but a different method must be used to generate the models, as exhaustive search of all possible combination of regressors is ordinarily prohibitive. Additionally, there are instances where AIC and BIC perform sub-optimally and misestimate the model order. The topic is still an active area of research, with new criteria still being developed [62]. observation models that contain terms with large delays, such as the observation model described in Eq. (3.20) (which has one element of order four), further complicate the proper regression model order selection. As shown in Fig. 3.11, there are instances in which LASSO regression is unable to accurately estimate the model parameters, regardless of the regularization factor. Fig. 3.14 shows the histogram plots for three different values of model order (𝑀), 1, 2 and 6. As expected, for 𝑀 = 6, the variance of NC increases relative to 𝑀 = 4. However, 𝑀 = 1 performs comparably to 𝑀 = 4, which is surprising, since 𝑀 = 1 cannot fully model the observation model given in Eq. (3.20). In this particular case, it is preferable to underestimate the model order rather than overestimating it. 
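For reference, the sketch below shows one way the NC and GC estimates reported in these experiments can be computed from a single fitted bivariate AR model. It assumes the seminal (linear) NC definition, i.e., the energy of each channel's summed contribution divided by the total energy including the residual, and uses scikit-learn's Lasso for the parameter estimates; the function name and the use of LASSO for the restricted GC fit are simplifications made here, not necessarily the exact pipeline used to generate the figures.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nc_gc_for_x1(x1, x2, M, lam):
    """LASSO fit of x1[n] on lags 1..M of (x1, x2); returns NC1->1, NC2->1, GC2->1."""
    N = len(x1)
    L1 = np.column_stack([x1[M - i:N - i] for i in range(1, M + 1)])   # x1 lags
    L2 = np.column_stack([x2[M - i:N - i] for i in range(1, M + 1)])   # x2 lags
    y = x1[M:]
    joint = Lasso(alpha=lam, fit_intercept=False).fit(np.hstack([L1, L2]), y)
    c1, c2 = L1 @ joint.coef_[:M], L2 @ joint.coef_[M:]   # per-channel contributions
    eps = y - c1 - c2                                     # joint-model residual
    total = np.sum(c1**2) + np.sum(c2**2) + np.sum(eps**2)
    nc_11, nc_21 = np.sum(c1**2) / total, np.sum(c2**2) / total
    disjoint = Lasso(alpha=lam, fit_intercept=False).fit(L1, y)        # x1 past only
    gc_21 = np.log(np.var(y - L1 @ disjoint.coef_) / np.var(eps))
    return nc_11, nc_21, gc_21

# e.g. nc_gc_for_x1(x1, x2, M=4, lam=1e-2) for a record simulated from Eq. (3.20)
```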
(a) 𝑀 = 1 (b) 𝑀 = 4 (c) 𝑀 = 6
Figure 3.14: NC1→1 vs NC0,1→1 histogram plots as a function of 𝜆

3.3.4 Comparing NC and GC

For the two models studied, depending on the estimated model order and regularization, the NC measure does not produce the expected results. However, this is not an issue with the measure itself, but rather with the estimation of the model parameters. To compare the sensitivity of NC to model parameter uncertainty with that of GC, the GC values were measured for the models described in Eqs. (3.19) and (3.20).

For the model in Eq. (3.19), the theoretical values for the GC measures are GC1→2 = 0 and GC2→1 ∈ [4.86, 5.38] (depending on the model order). These values indicate that the presence of 𝑥1 does not improve the prediction of 𝑥2, implying that 𝑥1 does not cause 𝑥2, but the presence of 𝑥2 reduces the residual prediction error energy by a factor of 14 (for large 𝑀) to 140 (for 𝑀 = 1), implying that 𝑥2 does cause 𝑥1. In contrast to NC, the GC measure is not as sensitive to errors in the model parameter estimates. GC2→1 remained relatively flat and close to the theoretical value for a wide variety of regularization factors. More importantly, GC1→2 is close to zero, even when very little regularization is applied, showing that there is no causal relationship from 𝑥1 into 𝑥2.

For the observation model of Eq. (3.20), the GC measures are also robust to model uncertainty. Figs. 3.15 and 3.16 show the GC measures evaluated for 𝑀 = 6. If excessive regularization (𝜆 ≫ 10⁻²) is applied, the GC measure tends to 0, but for a wide range of values, the GC measures closely approximate the theoretical values. However, GC1→2 is very small, even though the contribution of 𝑥1 to 𝑥2 is significant. This is due to the large autocorrelation of 𝑥2 and large variance of 𝜂2, which leads 𝑥1 not to reduce the variance of the residual significantly.

Figure 3.15: Distribution of the GC2→1 estimates as a function of 𝜆 for the model shown in Eq. (3.20) and 𝑀 = 6.

Figure 3.16: Distribution of the GC1→2 estimates as a function of 𝜆 for the model shown in Eq. (3.20) and 𝑀 = 6.

3.4 Bias in NC estimates

Although this work focuses on examining the variance of NC estimates observed in two example models, a small bias was also observed in the tested models. This section contains a further investigation of the bias in the NC estimates. The analysis will be constrained to the bivariate case, in which a signal 𝑦 can be expressed as the weighted sum of two signals, 𝑥 and 𝑧, and a white noise process 𝜂.

In order to increase clarity, this section uses notation different from that in the rest of this work. Instead of the ∗ superscript, the observation model parameters carry the subscript 0, to avoid confusion between the superscripts and the transpose operator. Instead of 𝑥1 and 𝑥2, the signals 𝑥 and 𝑧 are used, so that no subscripts are needed and so that it is clear that the 0 subscript refers to the observation model, rather than being an index. The separation into 𝑥 and 𝑧 also serves to more clearly denote the difference between the parameter vectors associated with 𝑥 and 𝑧, which are called 𝒂0 and 𝒃0 respectively. The observation model, therefore, is written as

    𝑦[𝑛] = 𝒂0ᵀ𝒙[𝑛] + 𝒃0ᵀ𝒛[𝑛] + 𝜂[𝑛]    (3.23)

where 𝑦[𝑛] is the signal being modeled at time 𝑛, 𝒙[𝑛] is the 𝑀𝑎 × 1 vector of regressors associated with signal 𝑥 at time 𝑛 (i.e.
𝒙[𝑛] = {𝑥[𝑛], 𝑥[𝑛 − 1] … 𝑥[𝑛 − 𝑀𝑎 + 1]}𝑇 ) and parameterized by the 𝑀𝑎 × 1 vector 𝒂0 , 𝒛[𝑛] is the 𝑀𝑏 × 1 vector of regressors associated with signal 𝑧 at time 𝑛 (i.e. 𝒛[𝑛] = {𝑧[𝑛], 𝑧[𝑛 − 1] … 𝑧[𝑛 − 𝑀𝑏 + 1]}𝑇 ) and parameterized by the 𝑀𝑏 × 1 vector 𝒃0 and 𝜂 is a white noise process. For further simplification of the calculations, Eq. (3.23) can be further condensed into a matrix form that contains all 𝑁 time samples as 𝒀 = 𝒂0𝑇 𝑿 + 𝒃0𝑇 𝒁 + 𝜼 (3.24) where 𝒀 = [𝑦[𝑛]𝑦[𝑛 − 1] ⋯ 𝑦[𝑛 − 𝑁 + 1]]𝑇 is a 1 × 𝑁 time-series vector of the regressand signal, 𝑿 is the 𝑀𝑎 × 𝑁 regressor matrix where column 𝑗 is 𝒙[𝑛 − 𝑗], 𝒁 is the 𝑀𝑏 × 𝑁 regressor matrix 68 where each column 𝑗 is 𝒛[𝑛 − 𝑗] and 𝜼 = [𝜂[𝑛]𝜂[𝑛 − 1] ⋯ 𝜂[𝑛 − 𝑁 + 1]]𝑇 . Vectors 𝒂0 and 𝒃0 remain unchanged. Having defined these variables, NC0,𝑥→𝑦 can be calculated ||𝒂0𝑇 𝑿 ||2 NC0,𝑥→𝑦 = ||𝒂0𝑇 𝑿 ||2 + ||𝒃0𝑇 𝒁 ||2 + ||𝜼||2 (3.25) This value is the desired NC value calculated assuming perfect parameter estimates. In these examples, 𝜼 are assumed to be non-zero for at least one element, to avoid degenerate cases where the denominator is zero. For greater clarity, the 𝑥 → 𝑦 subscript will be dropped from the following expressions, but for all subsequent analyses, NC should be understood to be NC𝑥→𝑦 . Using an estimated model defined by 𝒀𝑝 = 𝒂 𝑇 𝑿 + 𝒃 𝑇 𝒁 (3.26) allows the calculation of the residual 𝝐 = 𝒀 − 𝒀𝑝 = Δ𝒂 𝑇 𝑿 + Δ𝒃 𝑇 𝒁 + 𝜼 (3.27) where Δ𝒂 = 𝒂0 − 𝒂 and Δ𝒃 = 𝒃0 − 𝒃. 𝒂0 and 𝒂 are assumed to be of the same size. Whenever the estimated order is not 𝑀𝑎 , vectors 𝒂0 or 𝒂 must be zero-padded such that their sizes match. The same applies to 𝒃0 and 𝒃. This is similar to the approach in taken in Eq. (2.10) and is taken without loss of generality. After obtaining the estimated model parameters, the NC estimate is evaluated ||𝒂 𝑇 𝑿 ||2 NC = ||𝒂 𝑇 𝑿 ||2 + ||𝒃 𝑇 𝒁 ||2 + ||Δ𝒂 𝑇 𝑿 + Δ𝒃 𝑇 𝒁 + 𝜼||2 ||𝒂0𝑇 𝑿 ||2 − 2𝒂0𝑇 𝑿 𝑿 𝑇 Δ𝒂 + ||Δ𝒂 𝑇 𝑿 ||2 = 𝑇 2 ||𝒂0 𝑿 || − 2𝒂0𝑇 𝑿 𝑿 𝑇 Δ𝒂 + 2||Δ𝒂 𝑇 𝑿 ||2 + ||𝒃0𝑇 𝒁 ||2 − 2𝒃0𝑇 𝒁 𝒁 𝑇 Δ𝒃 (3.28) + 2||Δ𝒃 𝑇 𝒁 ||2 + 2Δ𝒂 𝑇 𝑿 𝒁 𝑇 Δ𝒃 + 2(Δ𝒂 𝑇 𝑿 + Δ𝒃 𝑇 𝒁 )𝜼𝑇 + ||𝜼||2 . Without further assumptions, this is as far as the expression can be simplified. In the interest of gaining further insight, a few additional assumptions are made. As 𝜼 is assumed white, the term (Δ𝒂 𝑇 𝑿 + Δ𝒃 𝑇 𝒁 )𝑇 𝜼 asymptotically approaches 0, so it will be assumed to be small enough to be disregarded in further analyses unless otherwise specified. In the next subsections, four distinct special cases will be analyzed that emulate some typical model estimation conditions. 69 3.4.1 Case 1: Δ𝒃 𝑇 𝒁 ≈ 𝟎 The first case being analyzed is where Δ𝒃 𝑇 𝒁 ≈ 𝟎. This encompasses both the case in which Δ𝒃 ≈ 𝟎 (where the estimate of 𝒃 ≈ 𝒃0 ) and the case in which Δ𝒃 ⟂ 𝒁 (e.g. when 𝒙 represents a FIR filter with zeros that coincide with the spectrum of 𝒁 or, in more algebraic terms, the vector Δ𝒃 is inside the null-space of matrix 𝒁 , due to its containing highly colinear terms). First, two auxiliary variables are defined 𝒂0𝑇 𝑿 𝑿 𝑇 Δ𝒂 ||Δ𝒂 𝑇 𝑿 ||2 𝛼= 𝛽= ||𝒂0𝑇 𝑿 ||2 ||𝒂0𝑇 𝑿 ||2 (3.29) where 𝛼 represents the level of colinearity of Δ𝒂 and 𝒂0𝑇 in the inner product defined by the matrix 𝑿 𝑿 𝑇 , which converges asymptotically to (𝑁 − 1)𝚺𝑿 , where 𝚺𝑿 is the covariance matrix of 𝑋 . Geometrically, in the subspace defined by 𝑿 , 𝛼𝒂0 𝑋 is the projection of Δ𝒂 into 𝒂0 and 𝛽 is the ratio between the square of the norm of Δ𝒂 and the norm of 𝒂0 . It has been assumed here that ||𝒂0𝑇 𝑿 ||2 > 0, whereas the case where ||𝒂0𝑇 𝑿 ||2 will be evaluated separately later in this subsection. 
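In this matrix form, the quantities just defined are straightforward to evaluate numerically. The minimal helper below (Python/NumPy; the function name is introduced here for illustration and assumes the row-vector conventions of Eq. (3.24)) returns the NC0 value of Eq. (3.25) when given the observation-model parameters and the true innovation, and the estimated NC of Eq. (3.28) when given estimated parameters and the corresponding fitted residual.

```python
import numpy as np

def nc_x_to_y(a, b, X, Z, last_term):
    """NC of x -> y: energy of the x-contribution over the total energy.

    a, b      : parameter vectors (length Ma and Mb).
    X, Z      : Ma x N and Mb x N regressor matrices.
    last_term : the true innovation eta (for NC0, Eq. (3.25)) or the fitted
                residual Y - a @ X - b @ Z (for the estimate, Eq. (3.28)).
    """
    cx, cz = a @ X, b @ Z                      # length-N contributions of x and z
    num = np.sum(cx**2)
    return num / (num + np.sum(cz**2) + np.sum(last_term**2))

# nc0 = nc_x_to_y(a0, b0, X, Z, eta)
# nc  = nc_x_to_y(a_hat, b_hat, X, Z, Y - a_hat @ X - b_hat @ Z)
```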
It can be shown that 𝛽 ≥ 0 and 𝛼 2 < 𝛽. For Δ𝒃 𝑇 𝒁 = 0, substituting 𝛼 and 𝛽 into Eq. (3.28) allows it to be rewritten as (1 − 2𝛼 + 𝛽)||𝒂0𝑇 𝑿 ||2 NC = (1 − 2𝛼 + 2𝛽)||𝒂0𝑇 𝑿 ||2 + ||𝒃0𝑇 𝒁 ||2 + ||𝜼||2 (3.30) which can be manipulated using Eq. (3.25) into (1 − 2𝛼 + 𝛽)NC0 NC = 1 + 2NC0 (𝛽 − 𝛼) (3.31) Under these conditions, further assumptions are needed for further analysis. Under least squares estimation of parameters, the error in the estimated variables is expected to be uncorrelated with the variables being estimated (as long as 𝒂0𝑇 𝑿 is uncorrelated with 𝜼). Therefore, the case for 𝛼 = 0 will be explored first. In this case, (1 + 𝛽)NC0 NC = 1 + 2NC0 𝛽 (3.32) Notice how the numerator contains a 1 + 𝛽 factor, while the denominator contains a 1 + 2NC0 𝛽. This means that NC will be equal to NC0 if and only if 𝛽 = 0; otherwise, it will be slightly biased towards 0.5. Fig. 3.17 shows contour curves for different values of NC0 and 𝛽. 70 1 0.9 0.8 0.7 NC estimate 0.6 0.5 0.4 0.3 0.2 0.1 0 0 1 2 3 4 5 Figure 3.17: NC estimates for different values of NC0 and 𝛽 For ||𝒂0𝑇 𝑿 ||2 = 0, the expression needs a small modification, but yields a similar expression ||Δ𝒂 𝑇 𝑿 ||2 NC = 2||Δ𝒂 𝑇 𝑿 ||2 + ||𝒃0𝑇 𝒁 ||2 + ||𝜼||2 (3.33) which also biases NC towards 0.5. Therefore, for any Δ𝒂 such that ||Δ𝒂 𝑇 𝑿 ||2 > 0, NC is expected to be biased towards 0.5. Evidently, as long as the error in parameters is small (i.e. ||Δ𝒂 𝑇 𝑿 ||2 ≪ ||𝒂0𝑇 𝑿 ||2 + ||𝒃0𝑇 𝒁 ||2 + ||𝜼||2 ), this bias is also small. 3.4.2 Case 2: Δ𝒂 𝑇 𝑿 ≈ 𝟎 Similarly to Case 1, this analysis focuses varying only a single parameter vector (Δ𝒃), while all terms related to the other parameter vectors (Δ𝒂) are disregarded. Also similarly to Case 1, assuming ||𝒂0𝑇 𝑿 ||2 > 0, two new auxiliary variables are defined. 𝒃0𝑇 𝒁 𝒁 𝑇 Δ𝒃 ||Δ𝒃 𝑇 𝒁 ||2 𝛾= 𝛿= ||𝒂0𝑇 𝑿 ||2 ||𝒂0𝑇 𝑿 ||2 (3.34) Here, similarly to Eq. (3.29), 𝛿 > 0 and 𝛾 2 < 𝛿. By substituting Eq. (3.34) into Eq. (3.28), NC can be expressed as NC = NC0 1 + 2NC0 (𝛿 − 𝛾 ) (3.35) 71 As in Case 1, under least square assumptions (and assuming 𝒃0𝑇 𝒁 is uncorrelated with 𝜼), 𝛾 can be assumed small. In this case, NC is biased towards 0. For ||𝒂0𝑇 𝑿 ||2 = 0, NC0 =0 regardless of 𝜂 and 𝒃. Therefore, there is no bias in NC if ||𝒂0𝑇 𝑿 ||2 = 0. 3.4.3 Case 3: Δ𝒂 𝑇 𝑿 + Δ𝒃 𝑇 𝒁 ≈ 𝟎 The third case extends both previous cases. In the previous cases, it was assumed that the parame- ters associated with one of the regressors were equal to the observation model parameters, or that the error in the parameters was located in the null space of the regressor matrix (representing ill-posed problems). However, the parameter error due to the regressor matrix conditioning was constrained to a single signal. Case 3 deals with the case in which the parameter errors are spread across multiple signals. This is represented by Δ𝒂 𝑇 𝑿 + Δ𝒃 𝑇 𝒁 ≈ 0 (3.36) Substituting Eq. (3.36) and variables 𝛼, 𝛽 and 𝛾 into Eq. (3.28) yields (1 − 2𝛼 + 𝛽)NC0 NC = 1 + 2NC0 (𝛽 − 𝛼 − 𝛾 ) (3.37) Note that under the assumption of small |𝛾 |, this reduces to Eq. (3.32), being equivalent to Case 1 and, therefore, subject to the same biasing towards 0.5. Likewise, the expression for ||𝒂0𝑇 𝑿 ||2 = 0 is ||Δ𝒂 𝑇 𝑿 ||2 NC = 2||Δ𝒂 𝑇 𝑿 ||2 + ||𝒃0𝑇 𝒁 ||2 − 2𝒃0𝑇 𝒁 𝒁 𝑇 Δ𝒃 + ||𝜼||2 (3.38) 3.4.4 Case 4: Regularization In previous cases, 𝛼 and 𝛾 were assumed to be small. If parameter estimates are obtained using LSE, this assumption is appropriate. However, if regularization methods are applied, the parameter estimates are expected to be biased towards zero. 
The bias towards zero can be modeled as

    Δ𝒂ᵀ𝑿 = 𝜇𝒂0ᵀ𝑿,    Δ𝒃ᵀ𝒁 = 𝜇𝒃0ᵀ𝒁,    (3.39)

for 0 ≤ 𝜇 < 1. Effectively, this introduces a bias such that ‖𝒂ᵀ𝑿‖² < ‖𝒂0ᵀ𝑿‖² and ‖𝒃ᵀ𝒁‖² < ‖𝒃0ᵀ𝒁‖², and, therefore, ‖𝝐‖² > ‖𝜼‖². Under these conditions, for any 0 < 𝜇 < 1 and NC0 > 0, the estimated NC value is expected to be lower than NC0. By combining Eq. (3.39) and Eq. (3.28), the NC estimate becomes

    NC = NC0 / { 1 + NC0 [ (𝜇²/(1 − 𝜇)²)(‖𝒚‖²/‖𝒂0ᵀ𝑿‖²) + (2𝜇/(1 − 𝜇))(𝒚ᵀ𝜼/‖𝒂0ᵀ𝑿‖²) ] }.    (3.40)

As 𝜼 is assumed to be white, it is uncorrelated with 𝑿 and 𝒁, so it can be further assumed that 𝒚ᵀ𝜼 ≈ ‖𝜼‖². Thus, Eq. (3.40) can be further manipulated into

    NC = NC0 / { 1 + NC0 [ (𝜇²/(1 − 𝜇)²)(‖𝒚‖²/‖𝒂0ᵀ𝑿‖²) + (2𝜇/(1 − 𝜇))(‖𝜼‖²/‖𝒂0ᵀ𝑿‖²) ] }.    (3.41)

Note that both terms in 𝜇 in the denominator are strictly positive for all 0 < 𝜇 < 1, confirming that any regularization is bound to reduce the estimate of NC, particularly for data with a low signal-to-noise ratio (i.e., when ‖𝒂0ᵀ𝑿‖² + ‖𝒃0ᵀ𝒁‖² is not much greater than ‖𝜼‖²). Low signal-to-noise ratios already imply low NC0 values, but Eq. (3.41) demonstrates that estimates of NC obtained using regularization are expected to be even lower than NC0.

3.4.4.1 Extending case 3

The behavior observed in the simulations for the observation model described by Eq. (3.19) for 𝑀 ≥ 2, particularly the bifurcation observed for 𝑀 = 5 and 𝑀 = 6, requires further analysis. The bifurcation behavior cannot be explained fully by Eq. (3.37) alone; in particular, some of the bifurcation points occur for NC > 0.5, which violates the assumption that |𝛼| ≪ 1 and |𝛾| ≪ 1. In order to model this behavior, the simplifying assumptions must be reconsidered.

In order to observe the bifurcation behavior in the solutions, it is necessary to assume some distributional characteristics of Δ𝒂ᵀ𝑿 and Δ𝒃ᵀ𝒁. In this analysis, Δ𝒂ᵀ𝑿 and 𝒂0ᵀ𝑿 will be assumed to be samples from a bivariate normal distribution. Note that these terms appear only as inner products in Eq. (3.28), so the distributional characteristics need only apply to the sum over all the time samples of each term; therefore, for sufficiently large 𝑁, the Gaussian assumption can be made under the central limit theorem, regardless of the distribution of the regressors and parameters.³

Obtaining the covariance matrix for Δ𝒂ᵀ𝑿 and 𝒂0ᵀ𝑿 is not straightforward, as the distributional characteristics of 𝑿 and 𝒁 depend on the observation model parameters (i.e., 𝒂0, 𝒃0 and the distributional characteristics of 𝜼), which have interactions that are strongly coupled through a feedback loop (a solution for a bivariate second-order regressive model can be found in Appendix A and [147]). In a simulation study, however, these can be obtained empirically. For the observation model of Eq. (3.19), a simulation was run using LSE to estimate the model parameters, under the same conditions as in Sec. 3.3. NC was estimated via Eq. (2.46), using the approximate form of Case 3 [Eq. (3.38)], and by calculating the means and variances of the terms in the equation and estimating the probability density function using Monte Carlo simulation. The probability density functions were estimated using histograms, and the results can be seen in Fig. 3.18. Note the excellent agreement between the exact and approximate expressions.

Figure 3.18: Estimated probability density function of NC using exact and approximate expressions.
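A condensed version of the Monte Carlo experiment just described is sketched below (Python/NumPy). It simulates Eq. (3.19), fits an order-5 model by ordinary least squares, and accumulates the NC1→2 estimates into a histogram. The number of runs, the seed and the burn-in are choices made here and are far smaller than in the study above, so the resulting curve will be noisier than the one shown in Fig. 3.18.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(N=256, burn=100):
    """One record of the observation model in Eq. (3.19)."""
    x1 = np.zeros(N + burn); x2 = np.zeros(N + burn)
    e1 = rng.normal(0, np.sqrt(0.005), N + burn)
    e2 = rng.normal(0, 1.0, N + burn)
    for n in range(1, N + burn):
        x1[n] = 0.8 * x1[n - 1] - 0.8 * x2[n - 1] + e1[n]
        x2[n] = 0.8 * x2[n - 1] + e2[n]
    return x1[burn:], x2[burn:]

def nc_1to2(x1, x2, M=5):
    """NC of x1 -> x2 from an LSE fit of order M (no regularization)."""
    N = len(x1)
    L1 = np.column_stack([x1[M - i:N - i] for i in range(1, M + 1)])
    L2 = np.column_stack([x2[M - i:N - i] for i in range(1, M + 1)])
    y = x2[M:]
    theta, *_ = np.linalg.lstsq(np.hstack([L1, L2]), y, rcond=None)
    c1, c2 = L1 @ theta[:M], L2 @ theta[M:]
    eps = y - c1 - c2
    return np.sum(c1**2) / (np.sum(c1**2) + np.sum(c2**2) + np.sum(eps**2))

vals = np.array([nc_1to2(*simulate()) for _ in range(2000)])
density, edges = np.histogram(vals, bins=50, range=(0, 1), density=True)
# `density` is the empirical probability density to be compared with Eq. (3.38)
```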
The peak seen around 0.6 occurs due to the large values of 𝒂0𝑇 𝑿 𝑿 𝑇 Δ𝒂 + 𝒃0𝑇 𝒁 𝒁 𝑇 Δ𝒃. Fig. 3.19 was obtained by computing cases where ||𝒂0𝑇 𝑿 ||2 + ||𝒃0𝑇 𝒁 ||2 + ||𝜼||2 is greater or smaller than 3 While the classical central limit theorem requires i.i.d. samples, later developments prove convergence to Gaussian distributions under non-i.i.d. conditions [28, Theorem 27.5]. 74 8 Greater Smaller Probability density 6 4 2 0 0 0.2 0.4 0.6 0.8 1 NC value Figure 3.19: Estimated probability density function of NC split into two cases 2𝒂0𝑇 𝑿 𝑿 𝑇 Δ𝒂 + 2𝒃0𝑇 𝒁 𝒁 𝑇 Δ𝒃 separately, and estimating their probability function. The blue line (unimodal function with mode close to 0.6) represent when the first term is greater than the second term, with the orange line representing the opposite case. This behavior is exacerbated for larger values of 𝑀 as the variances of Δ𝒂 and Δ𝒃 tend to increase with the increase of model order. Obviously, in most cases, the errors in the model parameters are expected to be small such that the case shown with the orange line never occurs and the NC estimate is close to NC0 . 3.4.4.2 Discussion of bias These 4 cases demonstrate that due to the nonlinear nature of the NC calculation, NC estimates will often be biased towards a particular value. Although the studied cases represent particular conditions, combinations of one or more of these cases should represent a wide range of problems. This is not to say that NC is inherently flawed, but instead that special care must be taken when estimating NC. Particularly when the parameter estimates are close enough to the parameters of the observation model, the bias observed is small. Additionally, even a biased NC estimate can still provide helpful information for causality analysis. 75 3.5 Conclusion The NC measure is an important development in causality analysis, addressing some limitations of GC. Particularly, it is designed to measure the causality mechanism, unlike GC, which measures the causal effect [19]. This ties to the relationship between GC and TE [14], since transferred information is indicator of causality, but whose differences must not be neglected [127]. As with many powerful analysis tools, proper care must be taken in order to avoid incorrect results. In this chapter, two examples are given where NC is shown to be more susceptible to model parameter estimation errors and overfitting than GC, particularly for ill-conditioned problems. Although GC seems to be more robust to model parameter estimation, it still possesses many of the limitations described in [95, 99, 220]. Another advantage of NC is that is allows (pairwise) causality analysis for the entire model, while GC requires causality analysis to be done by considering one additional regressor signal at a time. When estimating NC, a proper regression method must be applied to prevent overfitting. In this work, LASSO regression was used as an ad-hoc method of imposing sparsity in the model. For more complex systems, more sophisticated methods of obtaining model structure might be necessary. One recent method to obtain model structure is [149, 211, 217], which produces a family of models with differing levels of complexity and residual error, allowing easy trade-off selection. Future work will explore the performance of such methods for better model structure selection and causality measure estimation. 
Although NC requires accurate parameter and model estimation, when these conditions are met, NC provides reliable results that in some cases have more powerful explanatory power than GC, more closely representing causality strength. 76 CHAPTER 4 A NONLINEAR EXTENSION TO NEW CAUSALITY 4.1 Overview The seminal version of NC is defined for (linear) AR models [95]. While suitable in many ap- plications, modern applications increasingly find linear and time-invariant (LTI) models to be insufficient [25, 39]. At the same time, as most generalizations, it is important to extend the applicability of a technique without losing its identifying characteristics. In this chapter, the definition of NC is extended to NARMAX models. The new definition, henceforth called nonlinear NC (NNC), not only maintains the same intuitive meaning, but identically reduces to the seminal definition when applied to ARMAX models. The chapter starts with a motivating problem and the reasoning for the choice of NARMAX models. These are followed by the definition of the extension and examples of possible implemen- tations. These are followed by application of this nonlinear extension into a series of progressively more complex synthetic models and discussions of the results. The technique is then applied to a EEG dataset and the results compared to GC and the seminal definition of NC. Finally, the results are summarized and discussed. A significant portion of this chapter is quoted directly from [147] and [146] with a few modifications for improved flow and clarity. 4.2 Motivation The ARMAX models used in the seminal formulation of NC of Eq. (2.46), contain only linear combinations of the regressors 𝑥1 , 𝑥2 , … , 𝑥𝑁𝑠 (and their time-delayed counterparts). observation models containing significant nonlinear terms, when modeled using ARMAX models, will less accurately predict the outputs of the model, and, more importantly, inadequately represent the underlying nature of the model. The following example illustrates one simple case where linear 77 NC is unable to represent causal strength. Example A: Simple quadratic model Consider the nonlinear model in Eq. (4.1) which contains a quadratic term, where 𝜂1 and 𝜂2 are samples from i.i.d. normally distributed processes with zero means and unity variances: 𝑥1 [𝑛] = 0.53𝑥1 [𝑛 − 1] + 0.5𝑥2 [𝑛 − 1] + 𝛼𝑥22 [𝑛 − 1] + 𝜂1 [𝑛], 𝑥2 [𝑛] = 0.5𝑥2 [𝑛 − 1] + 𝜂2 [𝑛]. (4.1) 𝛼 is a coupling parameter that regulates the strength of the contribution of the quadratic term to 𝑥1 . Although the model is relatively simple, the seminal definition of NC has no mechanism to account for the quadratic term. A linear estimation model like that of Eq. (4.2) can be fit to predict the 𝑥1 , although at a reduced level of accuracy. As the effect of the quadratic term increases, an estimated ARX model of cannot represent the internal mechanism of the observation model and will produce an increasing prediction error variance. Consider the estimation model 𝑀 𝑀 𝑥1 [𝑛] = ∑ 𝑎11𝑖 𝑥1 [𝑛 − 𝑖] + ∑ 𝑎12 𝑖 𝑥2 [𝑛 − 𝑖] + 𝜖1 [𝑛]. (4.2) 𝑖=1 𝑖=1 For 𝛼 = 0.5, the variance of 𝑥1 is 𝜎𝑥21 = 3.80, the variance of 𝑥2 is 𝜎𝑥22 = 1.33 and the variance of 𝑥22 is 𝜎𝑥22 = 3.55. Using the model of Eq. (4.2), the optimum variance of the prediction error is 𝜎𝜖21 = 1.95 2 (opposed to the variance of 𝜂1 which is 𝜎𝜂21 = 1). The evaluated NC value estimates are found in table 4.1. Table 4.1: Linear NC𝑥𝑗 →𝑥𝑘 values for the model of Eq. 
(4.1)

    𝑥𝑘 = 𝑥1:   NC𝑥1→𝑥1 = 0.49,   NC𝑥2→𝑥1 = 0.058
    𝑥𝑘 = 𝑥2:   NC𝑥1→𝑥2 = 0,      NC𝑥2→𝑥2 = 0.25

Considering the variances of 𝑥1 and 𝑥2 (and 𝑥2²), notice that NC𝑥2→𝑥1 is small in comparison to NC𝑥1→𝑥1. Also note that the NC values for 𝑥1 and 𝑥2 add up to only about 0.54, while the value which represents the contribution of the prediction error of the model is relatively large (NC𝜂1→𝑥1 = 0.46).

For NC values over a range of 𝛼 values, another undesirable result is observed. In Fig. 4.1, the NC values for this model are plotted for 𝛼 ∈ [0, 1]. Observe how, as 𝛼 increases, NC𝑥2→𝑥1 decreases and NC𝑥1→𝑥1 increases. This is contrary to the intuition that the influence of 𝑥2 over 𝑥1 is increasing as 𝛼 increases. This behavior stems from the fact that the model is using past values of 𝑥1 to estimate the value of 𝑥2²[𝑛 − 1], since (for 𝛼 > 0) 𝑥1[𝑛 − 1] is correlated with 𝑥2²[𝑛 − 1], whereas 𝑥2[𝑛 − 1] is not correlated with 𝑥2²[𝑛 − 1]. As 𝛼 increases, the correlation between 𝑥1[𝑛] and 𝑥2²[𝑛 − 1] increases, and so does NC𝑥1→𝑥1. Thus, it follows that the influence of 𝑥2 on 𝑥1 is not only underestimated when using the seminal definition of NC, but can also be negatively correlated with the coupling strength 𝛼. Moreover, it shows that, for observation models with significant nonlinear components, the seminal definition of NC is unable to properly assess the causal relationships between signals.

Figure 4.1: NC values for the model of Eq. (4.1)

△ End of Example A

4.3 Choice of NARMAX models

ARMAX models are the most general representation of scalar linear systems. As shown in the previous section, the original definition of NC in terms of the parameters of an ARMAX model limits its use to systems that can be well-modeled by linear models. Since no canonical representation for all nonlinear models exists, a general nonlinear extension for NC is not possible. Particularly,
A tentative expression for 𝑛−1 the nonlinear extension is as follows ‖ 𝐾𝑞 ‖2 ‖ | 𝑛−1 ‖ ‖ ∑ 𝑎𝑝𝑘𝑞 𝜑𝑘𝑞 (𝑥𝑞 |𝑛−𝑀 )‖ 𝑞 ‖𝑘𝑞 =1 ‖ = ‖ ‖2 , ‖ ‖ ‖ ‖2 NC𝑥𝑞 →𝑥𝑝 ‖ 𝑛−1 ‖ ‖ 𝑛−1 ‖ 2 𝐾 𝑥 | ‖ + ‖𝜂𝑝 [𝑛] + ∑ 𝑎𝑝𝑘𝑞 𝜑𝑘𝑞 (𝜂𝑝 ||𝑛−𝑀 )‖ 𝐾ℎ (4.3) 𝑁𝑠 ∑ ‖ ∑ 𝑎𝑝𝑘ℎ 𝜑𝑝𝑘 𝜂 ℎ ( ℎ |𝑛−𝑀 )‖ 𝑝 ℎ ‖ ‖ ‖ ℎ=1 ‖𝑘ℎ =1 ‖2 ‖ 𝑘𝜂𝑝 =1 ‖2 where 𝐾ℎ is the number of regressor functions that depend exclusively on 𝑥ℎ ||𝑛−𝑀 , 𝜑𝑘ℎℎ (𝑥ℎ ||𝑛−𝑀 ) is 𝑛−1 𝑛−1 the 𝑘ℎth regressor function found in 𝝋𝑝 that depends exclusively on 𝑥ℎ ||𝑛−𝑀 and 𝑎𝑝𝑘ℎ is the respective 𝑛−1 parameter associated with 𝜑𝑘ℎℎ (𝑥ℎ ||𝑛−𝑀 ). Note that this definition reduces identically to the seminal 𝑛−1 definition of NC for linear models. 80 This expression allows us to revisit the observation model of Eq. (4.1) and recompute the NC values with this definition. Example B: Simple quadratic model revisited The NC values shown in table 4.2 are computed using Eq. (4.3) for the model described by Eq. (4.1) for 𝛼 = 0.5. When using a quadratic NARMAX model, the NC values are more intuitive than the values found in Example A. NC𝑥2 →𝑥1 is comparable to NC𝑥1 →𝑥1 , just as the contribution of 𝑥2 to the current value of 𝑥1 is comparable to the contributions of past values of 𝑥1 to the current value of 𝑥1 . Table 4.2: Nonlinear NC𝑥𝑗 →𝑥𝑘 values for the model of Eq. (4.1) 𝑥𝑗 𝑥1 𝑥2 𝑥1 𝑥𝑘 𝑥2 0.32 0.37 0 0.25 The NC and NNC values for the model of Eq. (4.1) are shown in Fig. 4.2 for varying values of 𝛼. At 𝛼 = 0, the NC and NNC values are equivalent, but as 𝛼 increases, the values diverge significantly. As previously mentioned, the NC values follow a counterintuitive trend, with NC𝑥2 →𝑥1 decreasing as 𝛼 increases, whereas the NNC values follow a more intuitive trend. Notice that NNC𝑥1 →𝑥1 remains almost constant over the range of 𝛼, where the small increase originates from the increased SNR, as the variance of 𝑥1 increases with 𝛼 whereas the variance of 𝜂1 does not. △ End of Example B 4.5 A comprehensive NNC definition Although the definition of Eq. (4.3) is intuitive, it cannot be used with general NARMAX models due to regressor functions that depend on multiple regressors (e.g., 𝜑𝑘 [𝑛] = 𝑥1 [𝑛 − 1]𝑥2 [𝑛 − 2]). The presence of regressor functions that depend on more than one signal poses an additional challenge: how to best split the contribution across different signals? This question also appears in other causality related work [189]. Before answering this question, it is helpful to modify Eq. (4.3) to account for these terms. By including a weighting function, 𝜆, to the contribution of each 81 0.6 0.4 NC/NNC NCx x 1 1 NCx x 2 1 0.2 NNCx x 1 1 NNCx x 2 1 0 0 0.2 0.4 0.6 0.8 1 Figure 4.2: NC and NNC values for the model of Eq. (4.1) regressor function, the NNC expression becomes ‖𝐾 ‖2 ‖ ∑ 𝑎𝑝𝑘 𝜑𝑘 𝜆𝑝𝑞 (𝜑𝑘 )‖ ‖ ‖ ‖𝑘=1 ‖2 = 𝑁𝑠 ‖ 𝐾 ‖2 ‖ ‖2 NC𝑥𝑞 →𝑥𝑝 ∑ ‖ ∑ 𝑎𝑝𝑘 𝜑𝑘 𝜆𝑝ℎ (𝜑𝑘 )‖ + ‖𝜂𝑝 [𝑛] + ∑ 𝑎𝑝𝑘 𝜑𝑘 𝜆𝑝𝜂 (𝜑𝑘 )‖‖ ‖ ‖ ‖ 𝐾 (4.4) ℎ=1 ‖𝑘=1 ‖2 ‖ 𝑘=1 ‖2 where 𝜑𝑘 = 𝜑𝑘 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 , … , 𝑥𝑁𝑠 ||𝑛−𝑀 , 𝜂𝑝 𝑛−1 and 𝜆𝑝𝑞 (𝜑𝑘 ) is a function of 𝜑𝑘 associated with 𝑛−1 𝑛−1 𝑛−1 𝑛−𝑀 ) 1 𝑥𝑞 → 𝑥𝑝 with the following properties 0 ≤ 𝜆𝑝𝑞 (𝜑𝑘 ) ≤ 1 (4.5a) 𝑁𝜆 𝜆𝑝𝜂 (𝜑𝑘 ) + ∑ 𝜆𝑝𝑞 (𝜑𝑘 ) = 1 (4.5b) 𝑞=1 Further, the following constraints are required so that the definition of NC for linear models remains as a special case: ⎧ ⎪ ⎪ ⎪ ⎪1 if 𝜑𝑘 is a function of only 𝑥𝑞 ||𝑛−𝑀 , 𝑛−1 𝜆𝑝𝑞 (𝜑𝑘 ) = ⎨ ⎪ ⎪ ⎪0 if 𝜑𝑘 does not depend on 𝑥𝑞 ||𝑛−𝑀 . (4.6) 𝑛−1 ⎪ ⎩ 1 The arguments have been omitted for clarity, but the definition of the regressor functions remains the same as in Eq. 
(2.22), i.e., can potentially depend on any set of the previous inputs and outputs. 82 Similarly to general nonlinear models, where no single concise representation is able to account for all cases, a single definition for the weighting function 𝜆𝑝𝑞 (𝜑𝑘 ) is impossible. Even when considering only LTIiP models, which can be concisely specified by the NARMAX representation, there is no canonical choice for set of regressor functions 𝝋𝑘 . Instead, the NARMAX representation requires the choice of the proper function set for the particular problem. A similar challenge is present in this work. It is not possible to identify a unique definition of 𝜆 that would be appropriate for all applications of the method. 4.5.1 Form 1: 𝜆1 - create a new category for nonlinear cross-terms The first form for 𝜆 discriminates terms that only depend on a single regressor signal from signals that depend on multiple regressors and assign them to separate categories. This way, the regressor functions 𝜑𝑗 [𝑛] and 𝜑𝑘 [𝑛] are joined if there is a 𝑞 ∈ 1, … , 𝑁𝑠 such that 𝜑𝑗 [𝑛] and 𝜑𝑘 [𝑛] can be expressed solely as functions of 𝑥𝑞 ||𝑛−𝑀 . This principle is used in the original definition of NC 𝑛−1 for the linear regressors, where past values of a signal as weighted and summed (i.e., filtered) before the variance is estimated. The main distinction is that, in the linear NC definition, only time-shifting is used (as it is a linear transformation) and scaling the regressors would be absorbed in the parameter estimation. In other words, 𝜆1 can be defined as ⎧ ⎪ ⎪ ⎪ ⎪ 1 if 𝜑𝑘 is a function of only 𝑥𝑞 ||𝑛−𝑀 𝑛−1 𝜆𝑝𝑞 1 (𝜑𝑘 ) = ⎨ ⎪ ⎪ ⎪ 0 if 𝜑𝑘 does not depend on 𝑥𝑞 ||𝑛−𝑀 or depends on more than one regressor (4.7) 𝑛−1 ⎪ ⎩ In order to satisfy Eq. (4.5b), a slight modification of the set of regressor signals must be made. Instead of 𝑥1 , 𝑥2 , … , 𝑥𝑁𝑠 , the set of regressor signals must be augmented by the set of all combinations of two or more signals (e.g., 𝑥1 ∪ 𝑥2 , 𝑥1 ∪ 𝑥3 , 𝑥1 ∪ 𝑥2 ∪ 𝑥3 , etc.) For example, for a bivariate observation model of the form 𝑥1 [𝑛] = 𝑎1 𝑥1 [𝑛 − 1] + 𝑎2 𝑥1 [𝑛 − 2] + 𝑎3 𝑥2 [𝑛 − 1] + 𝑎4 𝑥2 [𝑛 − 2]+ 𝑎5 𝑥1 [𝑛 − 1]𝑥2 [𝑛 − 1] + 𝑎6 𝑥1 [𝑛 − 2]𝑥2 [𝑛 − 2] + 𝜂1 [𝑛] (4.8) 83 the set of regressors is 𝑥1 , 𝑥2 and 𝑥1 ∪ 𝑥2 , as all the regressor functions can be expressed as a function of 𝑥1 , 𝑥2 or 𝑥1 ∪ 𝑥2 . Note that, regardless of time and polynomial order, this set covers all possibilities for bivariate nonlinear autoregressive with exogenous input (NARX) models,2 therefore the inclusion of terms such as 𝑥1 [𝑛 − 1]𝑥2 [𝑛 − 2] or 𝑥1 [𝑛 − 2]𝑥2 [𝑛 − 1] does not alter the set. To simplify notation, let us create a virtual regressor 𝑥3 [𝑛] = 𝑥1 [𝑛]𝑥2 [𝑛], so the regressor set is 𝑥1 , 𝑥2 and 𝑥3 . Assuming the values for 𝑎𝑘 , for 𝑘 ∈ 1, 2, … , 6, have been accurately estimated, the expression for NC𝑥3 →𝑥1 is therefore ‖𝑎5 𝑥3 [𝑛 − 1] + 𝑎6 𝑥3 [𝑛 − 2]‖22 = . ‖𝑎5 𝑥3 [𝑛 − 1] + 𝑎6 𝑥3 [𝑛 − 2]‖22 + ‖𝑎1 𝑥1 [𝑛 − 1] + 𝑎2 𝑥1 [𝑛 − 2]‖22 + NC𝑥3 →𝑥1 (4.9) ‖𝑎3 𝑥2 [𝑛 − 1] + 𝑎4 𝑥2 [𝑛 − 2]‖22 + ‖‖𝜂𝑝 [𝑛]‖‖2 2 It is not difficult to verify that, with the exception of defining the new set of regressors (i.e., 𝑥3 [𝑛] = 𝑥1 [𝑛]𝑥2 [𝑛]), this definition of NC is equivalent to the seminal definition of NC. Moreover, for linear estimated models, this definition of NC reduces to the seminal definition. The 𝜆1 formulation is advantageous as it does not require defining weights for individual regressor functions. This allows it to be used with any set of candidate regressor functions. 
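For illustration, the 𝜆1 computation for the model form of Eq. (4.8) can be sketched as follows. The signals and coefficients below are arbitrary placeholders, and no structure or parameter estimation is performed; the point is only the construction of the virtual regressor 𝑥3 = 𝑥1𝑥2 and the grouped-contribution ratio of Eq. (4.9).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)
eta1 = 0.1 * rng.standard_normal(N)
x3 = x1 * x2                       # virtual regressor standing in for the cross-terms

# Illustrative values standing in for a1..a6 of Eq. (4.8); nothing is estimated here.
a1, a2, a3, a4, a5, a6 = 0.4, 0.2, 0.3, 0.1, 0.25, 0.15

def lag(x, k):                     # x[n-k], trimmed to a common length
    return x[2 - k : N - k]

c_x1  = a1 * lag(x1, 1) + a2 * lag(x1, 2)     # contribution grouped under x1
c_x2  = a3 * lag(x2, 1) + a4 * lag(x2, 2)     # contribution grouped under x2
c_x3  = a5 * lag(x3, 1) + a6 * lag(x3, 2)     # contribution grouped under x3 = x1*x2
c_eta = eta1[2:]

sq = lambda v: float(np.sum(v ** 2))
den = sq(c_x1) + sq(c_x2) + sq(c_x3) + sq(c_eta)
print("NC_x3->x1 =", sq(c_x3) / den)          # grouped-contribution ratio of Eq. (4.9)
```

Under 𝜆1, the entire cross-term contribution is reported against the virtual signal 𝑥1 ∪ 𝑥2 rather than against 𝑥1 or 𝑥2 individually.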
However, it creates additional regressors, which reduces the interpretability of the NC values. 4.5.2 Form 2: 𝜆2 - weight regressor functions equally across regressor signals In order to avoid the creation of virtual regressors, the nonlinear contributions must be divided across the different regressor signals. Harnessing the knowledge of the arguments of each regressor functions, 𝜆2 splits the contributions equally between the regressors. Thus, 𝜆2 is defined as ⎧ ⎪ ⎪ 1 ⎪ ⎪ if 𝜑𝑘 is a function of 𝑅 regressor signals, including 𝑥𝑞 ||𝑛−𝑀 , 𝑛−1 𝜆𝑝𝑞 2 (𝜑𝑘 ) = ⎨ 𝑅 ⎪ ⎪ ⎪ 0 if 𝜑𝑘 does not depend on 𝑥𝑞 ||𝑛−𝑀 . (4.10) 𝑛−1 ⎪ ⎩ Since this approach does not require the creation of new virtual regressors, the final NC values can be easily mapped into the original signal set. For the observation model in Eq. (4.8), the NC 2 To cover all NARMAX possibilities, permutations including 𝜂1 [𝑛] must also be included in the set 84 ‖ 5 𝑥3 [𝑛−2] ‖ value can be calculated (assuming perfect model structure and parameter estimation) as ‖𝑎0 𝑥1 [𝑛 − 1] + 𝑎1 𝑥1 [𝑛 − 2] + 𝑎4 𝑥3 [𝑛−1]+𝑎 ‖ 2 NC𝑥3 →𝑥1 = ‖ 2 ‖2 . ‖ 𝑎4 𝑥3 [𝑛−1]+𝑎5 𝑥3 [𝑛−2] ‖ ‖𝑎0 𝑥1 [𝑛 − 1] + 𝑎1 𝑥1 [𝑛 − 2] + ‖ + 2 (4.11) ‖ 2 ‖2 ‖ 𝑎4 𝑥3 [𝑛−1]+𝑎5 𝑥3 [𝑛−2] ‖ ‖𝑎2 𝑥2 [𝑛 − 1] + 𝑎3 𝑥2 [𝑛 − 2] + ‖ + ‖𝜂 [𝑛]‖ 2 ‖2 ‖ 𝑝 ‖2 2 ‖ 2 As long as the predictive model can mimic the dynamics of the observation model using a combination of regressor functions, both 𝜆1 and 𝜆2 will produce results similar to the observation model. That is, suppose an observation model can be decomposed as 𝑥1 [𝑛] = 𝐹1∗ (𝑥1 ||𝑛−𝑀 ) + 𝐹12 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 ) + 𝐹2∗ (𝑥2 ||𝑛−𝑀 ) + 𝜂∗1 [𝑛], 𝑛−1 ∗ 𝑛−1 𝑛−1 𝑛−1 (4.12) and the predictive model takes the form 𝐾1 −1 𝑥1𝑝 [𝑛] = ∑ 𝛼1,𝑘1 𝜑1,𝑘1 (𝑥1 ||𝑛−𝑀 ) 𝑛−1 𝑘1 =0 𝐾12 −1 + ∑ 𝛼12,𝑘12 𝜑12,𝑘12 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 ) 𝑛−1 𝑛−1 (4.13) 𝑘12 =0 𝐾2 −1 + ∑ 𝛼2,𝑘2 𝜑2,𝑘2 (𝑥2 ||𝑛−𝑀 ), 𝑛−1 𝑘2 =0 then, as long as 𝐾1 −1 𝐹1∗ (𝑥1 ||𝑛−𝑀 ) ≈ ∑ 𝛼1,𝑘1 𝜑1,𝑘1 (𝑥1 ||𝑛−𝑀 ) 𝑛−1 𝑛−1 𝑘1 =0 𝐾12 −1 𝐹12 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 ) ≈ ∑ 𝛼12,𝑘12 𝜑12,𝑘12 (𝑥1 ||𝑛−𝑀 , 𝑥2 ||𝑛−𝑀 ) ∗ 𝑛−1 𝑛−1 𝑛−1 𝑛−1 𝑘12 =0 (4.14) 𝐾2 −1 𝐹2∗ (𝑥2 ||𝑛−𝑀 ) ≈ ∑ 𝛼2,𝑘2 𝜑2,𝑘2 (𝑥2 ||𝑛−𝑀 ) 𝑛−1 𝑛−1 𝑘2 =0 𝜂∗1 [𝑛] ≈ 𝑥1 [𝑛] − 𝑥1𝑝 [𝑛], the NC value for the estimated model will approximate the NC value for the observation model for both 𝜆1 and 𝜆2 forms. The “equal splits” used in 𝜆2 simplify the analysis by avoiding the creation of new regressor signals. However, 𝜆2 does not take the characteristics of the regressor functions and distributional characteristics of the regressors into account. Particularly, for regressor functions that depend much more strongly on one regressor rather than another, it might be beneficial to implement a different weighting function. 85 4.5.3 Form 3: 𝜆3 - weight regressor functions across regressor signals according to an application (model) dependent criterion Since no canonical form of distributing the contributions in nonlinear regressor functions exists, 𝜆1 and 𝜆2 are practical heuristic methods of approximately estimating causality strength (just as "true causality" is difficulty to define and measure [103]). Therefore, there is no need to limit NNC to pre-defined 𝜆s. The only requirement is that the function 𝜆 fit the definition of a probability mass function over the set of regressors and that it satisfy Eq. (4.6). Certain regressor functions that depend on multiple regressors may not be equally affected by each of the regressors. The inhomogeneity of the influence may be due to the regressor function or the distributional characteristics of the regressors. 
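Before turning to such application-dependent weightings, the equal-split form can be sketched in the same style as the 𝜆1 example above. The signals and coefficients are again arbitrary placeholders, and the split follows Eq. (4.10) with 𝑅 = 2 for the cross-terms.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)
eta1 = 0.1 * rng.standard_normal(N)

def lag(x, k):
    return x[2 - k : N - k]

# Illustrative coefficients for the model form of Eq. (4.8); nothing is estimated here.
lin_x1 = 0.4 * lag(x1, 1) + 0.2 * lag(x1, 2)               # terms depending only on x1
lin_x2 = 0.3 * lag(x2, 1) + 0.1 * lag(x2, 2)               # terms depending only on x2
cross  = 0.25 * lag(x1 * x2, 1) + 0.15 * lag(x1 * x2, 2)   # bivariate cross-terms

# lambda_2 assigns 1/R = 1/2 of each cross-term to x1 and to x2 before the norms are taken.
c_x1, c_x2, c_eta = lin_x1 + 0.5 * cross, lin_x2 + 0.5 * cross, eta1[2:]

sq = lambda v: float(np.sum(v ** 2))
den = sq(c_x1) + sq(c_x2) + sq(c_eta)
print("NNC_x1->x1 =", sq(c_x1) / den)
print("NNC_x2->x1 =", sq(c_x2) / den)
```

An equal split of this kind ignores how strongly each argument actually drives the regressor function, which is exactly the situation the following examples illustrate.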
For example, suppose that 𝑥1 and 𝑥2 are independent discrete random variables taken from Bernoulli distributions with parameters 𝑝1 and 𝑝2 respectively. Suppose also that 𝜑𝑘 (𝑥1 , 𝑥2 ) = AND(𝑥1 , 𝑥2 ), where AND() is the binary “and” operator. Note that this regressor function is perfectly symmetrical, as AND(𝑥1 , 𝑥2 ) = AND(𝑥2 , 𝑥1 ), however, for 𝑝1 ≫ 𝑝2 , the value of 𝑥2 contains more information on the output of 𝜑𝑘 than 𝑥1 . As another example, suppose that 𝑥1 and 𝑥2 are independent uniformly distributed random variables with support 𝑥1 , 𝑥2 ∈ [0, 1]. Suppose also that 𝜑𝑘 (𝑥1 , 𝑥2 ) = 𝑥1 sin(2𝜋𝑥2 ). In this case, the variables are similarly distributed, but their effects upon the regressor function are not. In such cases, it is desirable to split the contribution of the regressor function unequally across the regressors. One possible approach would be to weight the contributions by the predictive power of 𝑥1 𝑛−1𝑛−𝑀 and 𝑥1 𝑛−𝑀 to 𝜑𝑘 . One such method would be to use GC or TE to weight the 𝑛−1 ⎧ ⎪ ⎪ contributions ⎪1, if 𝜑𝑘 depends only on 𝑥𝑞 ⎪ 𝜆𝑝𝑞 (𝜑𝑘 ) = ⎨ GC ⎪ ⎪ , otherwise, ⎪ GC (4.15) 𝑥𝑞 →𝜑𝑘 ⎪ ⎩ ℎ=1 𝑁𝑠 ∑ GC𝑥ℎ →𝜑𝑘 where the first case is necessary as GC tends to infinity for deterministic expressions. Note that the second case tends to unity when GC𝑥𝑞 →𝜑𝑘 approaches infinity.3 Note also that Eq. (4.15) satisfies it is possible for the second case not to converge to unity, this is only true if for some ℎ ≠ 𝑞 there is at least one GC𝑥ℎ →𝜑𝑘 that also tends to infinity. This would only happen if 𝜑𝑘 is deterministic for more than one 3 Although regressor signal, a degenerate case. 86 Eq. (4.6). However, there is an increased onus of estimating the GC values. This example shows one way that contributions from regressor functions could be weighted across different regressors. Besides this example, there is an infinite set other possible variations that satisfy Eq. (4.6), but that might produce vastly different NC values. The choice and design of new 𝜆 weighting functions requires careful consideration and problem specific knowledge. 4.5.4 Spectral nonlinear new causality A spectral expansion to NNC follows the same logic shown in Sec. 2.5.6, where the DTFT of the numerator is taken before the norm calculation, which yields ‖ {𝐾 } ‖2 ‖ ‖ ‖ ∑ 𝑎𝑝𝑘 𝜑𝑘 𝜆𝑝𝑞 (𝜑𝑘 ) (𝑓 )‖ ‖ ‖ SNC𝑥𝑞 →𝑥𝑝 (𝑓 ) = ‖ 𝑘=1 ‖2 . 𝑁𝜆 ‖ 𝐾 ‖ ‖ 2 ‖2 ‖ ‖ ‖ ‖ 𝐾 ∑ ‖ ∑ 𝑎𝑝𝑘 𝜑𝑘 𝜆𝑝ℎ (𝜑𝑘 )‖ + ‖𝜂𝑝 [𝑛] + ∑ 𝑎𝑝𝑘 𝜑𝑘 𝜆𝑝𝜂 (𝜑𝑘 )‖ (4.16) ℎ=1 ‖𝑘=1 ‖2 ‖ 𝑘=1 ‖2 Similarly to Eq. (2.49), this equation decomposes the contributions into their spectral com- ponents. However, an important distinction between Eq. (2.49) and Eq. (4.16) is that nonlinear models allow for cross-frequency couplings to be shown. These cross-frequency effects have been observed between planetary waves and tides [107] and EEG signals under various conditions [75, 83, 144, 178]. 4.6 Discussion and analysis through example models Although the choice of weighting function 𝜆 adds a additional complexity and uncertainty to the estimation of NC values, it is important to point out that the choice of function 𝜆 belongs more closely to the process of model selection and data pre-processing than causality estimation per se. That is, causality analysis tools are used to estimate characteristics or gain insight about systems whose internal properties are unknown. For example, evoked potentials (EPs) are measured electrical potentials from the scalp im- mediately following a particular stimulus (e.g., visual, auditory, tactile, etc.). 
EPs can be used in noninvasive tests of sensory pathway abstandardizties, language and speech disorders, among 87 other uses. However, due to anatomy and tissue impedance, electric potential measurements contain a significant amount of interchannel crosstalk, which may obscure the anatomical and tem- poral properties of the recorded EPs [191]. Since their characteristics are of the utmost importance to causality analysis, EP signals are commonly preprocessed using Current Source Density (CSD) or other spatio-temporal sharpening methods. However, the spatial component of these methods alters the recorded EPs, which in turn alter the NC values (arguably in a way that enhances the analysis). Just as the seminal definition of NC is not transformation invariant (with the notable exception of uniform scaling and time-shifts), the nonlinear extension is not invariant to changes in the set of regressor functions. Similarly, the choice of 𝜆 weighting functions or regressor standardization falls within which assumptions better fit the current analysis. Thus, it is important to employ a priori knowledge about the systems being studied to obtain the most useful NC values possible. The following example shows how under typical conditions, linear transformations done as data preprocessing may affect NC values. Example C: Effects of transformations on NC values In many engineering applications, the desired signals cannot be directly obtained (e.g., mixture ratios inside rocket engines due to the extremely high temperatures [142], or brain electric activity due to health risks and costs associated with intrusive implants [1, 207]), but instead, the signals are measured indirectly and estimated using different modeling techniques, such as Kalman filters [20, 142]. For EEG signals, the choice of reference to the unipolar measurements has also shown to affect the outcomes of the analysis [38]. Indirect measurement not only reduces the signal to noise ratio, but also limits the spatio-temporal resolution available. This example aims at demonstrating that modeling and a priori knowledge is critical to NC estimation. Here, a simple spatial transformation is applied to a simple three-signal model and the effect of the transformation to NC measurements will be shown. The second-order jointly regressive model in Eq. (4.17) possesses simple relationships among its signals, i.e., 𝑥1 and 𝑥2 “cause” 𝑥1 , but 𝑥3 does not “cause” 𝑥1 . Similarly, 𝑥2 and 𝑥3 “cause” 𝑥2 , 88 whereas 𝑥1 does not. Finally 𝑥3 and 𝑥1 “cause” 𝑥3 , but 𝑥2 does not: 𝑥1 [𝑛] = 0.8𝑥1 [𝑛 − 1] + 0.15𝑥2 [𝑛 − 2] + 𝜂1 [𝑛], 𝑥2 [𝑛] = 0.8𝑥2 [𝑛 − 1] + 0.1𝑥3 [𝑛 − 2] + 𝜂2 [𝑛], (4.17) 𝑥3 [𝑛] = 0.5𝑥3 [𝑛 − 1] + 0.40𝑥1 [𝑛 − 2] + 𝜂3 [𝑛]. Assuming that 𝜂𝑘 , 𝑘 ∈ 1, 2, 3 are samples taken from independent i.i.d. normally distributed processes with zero means and unity variances, the NC𝑥𝑗 →𝑥𝑘 values (rounded to two significant figures) are computed and displayed in table 4.3. Although extremely important to NC value estimation, we are not concerned with model topology or model parameter estimation in this example; instead, the observation model topology and parameters will be assumed to be known (or "perfectly" estimated). In general, model estimation adds an additional challenge to NC value estimation and models with highly correlated signals lead to higher variance in the parameter estimates and, in turn, higher variances in the NC value estimates [148]. Table 4.3: NC𝑥𝑗 →𝑥𝑘 values for the model in Eq. 
(4.17) 𝑥𝑗 𝑥1 𝑥2 𝑥3 𝑥1 𝑥𝑘 𝑥2 0.70 0.021 0 𝑥3 0 0.68 0.011 0.26 0 0.35 The NC values in table 4.3 provide intuition about the model (e.g., the current value of 𝑥1 depends on past values of 𝑥1 and 𝑥2 , with 𝑥1 having a greater influence and 𝑥3 having no direct influence on 𝑥1 ). Now suppose that the same signals cannot be measured directly, but must be estimated using surface sensors, such that the signals contain interference from surrounding sources. In this model involving three signals, the interference is assumed to be uniform and controlled by the parameter 𝛿, as in Eq. (4.18). ⎡ ⎤ ⎡ ⎤ ⎡ ⎤⎡ ⎤ ⎢𝑦1 [𝑛]⎥ ⎢𝑥1 [𝑛] + 𝛿 (𝑥2 [𝑛] + 𝑥3 [𝑛])⎥ ⎢ 1 𝛿 𝛿 ⎥ ⎢𝑥1 [𝑛]⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ ⎢𝑦2 [𝑛]⎥ = ⎢𝑥2 [𝑛] + 𝛿 (𝑥1 [𝑛] + 𝑥3 [𝑛])⎥ = ⎢𝛿 1 𝛿 ⎥ ⎢𝑥2 [𝑛]⎥ . ⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎥ (4.18) ⎢𝑦3 [𝑛]⎥ ⎢𝑥3 [𝑛] + 𝛿 (𝑥1 [𝑛] + 𝑥2 [𝑛])⎥ ⎢𝛿 𝛿 1 ⎥ ⎢𝑥3 [𝑛]⎥ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦⎣ ⎦ 89 For 𝛿 = 0.15, the NC𝑦𝑗 →𝑦𝑘 values are shown in table 4.4. Note how the NC𝑦𝑗 →𝑦3 row differs from NC𝑥𝑗 →𝑥3 of table 4.3. In particular, this analysis implies that past values of 𝑦1 have greater influence on the current value of 𝑦3 than past values of 𝑦3 , a property which is not shared with 𝑥1 and 𝑥3 . Table 4.4: NC𝑦𝑗 →𝑦𝑘 values for the model of Eq. (4.18) 𝑦𝑗 𝑦1 𝑦2 𝑦3 𝑦1 𝑦𝑘 𝑦2 0.72 0.020 0.05 𝑦3 0.0033 0.68 0.0044 0.32 0.00068 0.29 This example shows that, even for simple linear models, careful consideration is necessary not only for model topology and model parameter estimation, but also for assumptions used when preprocessing data. The preprocessing of data using a priori knowledge about the studied system is necessary for more "useful" NC estimates. This observation will be helpful when discussing the increased complexity that the class of nonlinear models adds to this work. △ End of Example C In the same way as data preprocessing can be used to enhance linear NC values, proper preprocessing is essential to nonlinear NC value estimation. One occasion where this is particularly apparent when regressors do not possess zero mean. For example, let 𝜑𝑘 [𝑛 − 1] = 𝑥1 [𝑛 − 1]𝑥2 [𝑛 − 1]. Intuitively, this regressor function depends equally on 𝑥1 [𝑛 − 1] and 𝑥2 [𝑛 − 1]. However, suppose that 𝑥1 [𝑛 − 1] and 𝑥2 [𝑛 − 1] are distributed as multivariate normal random variables with means 𝝁 = [ 𝜇01 ] and covariance matrix 𝚺 = [ 𝜌𝜎𝜎11𝜎2 𝜌𝜎1 𝜎2 ]. If 2 𝜎22 ‖𝜇1 ‖ ≫ 𝜎1 𝜎2 ≫ 𝜎1 (4.19) then the value of 𝑥1 [𝑛−1] is likely to be close to 𝜇1 . Therefore most of the variation seen in 𝜑𝑘 comes from variations in 𝑥2 [𝑛 −1], not 𝑥1 [𝑛 −1]. In other words, regressor functions 𝜑𝑘1 = 𝑥1 [𝑛 −1]𝑥2 [𝑛 −1] and 𝜑𝑘2 = 𝑥2 [𝑛 − 1] would produce very different results than 𝜑𝑘3 = (𝑥1 [𝑛 − 1] − 𝜇1 )𝑥2 [𝑛 − 1] and 𝜑𝑘4 = 𝑥2 [𝑛 − 1]. For 𝜑𝑘1 and 𝜑𝑘2 , the NC value for 𝑥1 would be larger than the NC values computed 90 using 𝜑𝑘3 and 𝜑𝑘4 and, likewise, the NC value for 𝑥2 would be smaller than the NC values computed using 𝜑𝑘3 and 𝜑𝑘4 . Standardization is a common technique for data preprocessing. Standardization involves removing the means and dividing by standard deviation. While scaling of the regressors does not affect NC values, many regression methods benefit from standardization in the form of faster convergence or improved numerical stability. For nonlinear models, standardization can have a drastic effect on NC values. The choice of removing the means of regressor signals prior to computing the regressor functions or not standardizing depends mainly on assumptions on the models and the causality information desired. Is the information contained within the signals an absolute or relative measure? 
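The dominance effect described above can be checked numerically. In the sketch below the distribution parameters are arbitrary illustrations of the condition in Eq. (4.19): 𝑥1 has a large mean and small spread, so the raw product 𝑥1𝑥2 fluctuates essentially like a rescaled copy of 𝑥2, while centering 𝑥1 first removes that effect.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
x1 = 10.0 + 0.1 * rng.standard_normal(N)   # large mean, small spread
x2 = rng.standard_normal(N)                # zero mean, unit spread

phi_raw      = x1 * x2                     # phi_k1 = x1[n-1] * x2[n-1]
phi_centered = (x1 - x1.mean()) * x2       # phi_k3 = (x1[n-1] - mu1) * x2[n-1]

print("var(x2)          ", np.var(x2).round(3))
print("var(x1*x2)       ", np.var(phi_raw).round(3))       # ~ mu1^2 * var(x2)
print("var((x1-mu1)*x2) ", np.var(phi_centered).round(3))  # much smaller once x1 is centered
```

With these parameters the raw product carries a variance of roughly 𝜇1² var(𝑥2), so splitting that term between 𝑥1 and 𝑥2 inflates the apparent contribution of 𝑥1; whether centering is appropriate depends on how the question above is answered for the system at hand.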
Many phenomena depend linearly on absolute quantities (e.g., the average sound speed on a fluid depends on the mean absolute pressure, final volume in an isobaric process depends on the absolute temperature, etc.) On the other hand, sound is a measure of relative pressure fluctuations measured at a microphone (or hydrophone for underwater measurements). The sound pressure fluctuations are several orders of magnitude smaller than the mean absolute pressure, therefore standardization is desirable. Nonetheless, in some cases, even choosing a reference value can be challenging for processes that are not wide-sense stationary and for measurements that do not have a clear reference point [218] (such as ERP and EEG signals). When 𝜑𝑘 is not an odd function, even a linear regressor signal symmetrically distributed with zero mean might produce an output with nonzero mean. For example, suppose that two independent signals, 𝑥1 and 𝑥2 were uniformly distributed with support [−1, 1]. Then |𝑥1 [𝑛]| has mean 0.5, but the regressor function 𝜑𝑘 [𝑛] = ||𝑥1 [𝑛]|| ⋅ 𝑥2 [𝑛] has zero mean. While, 𝜑𝑘 has zero mean, it is important to consider whether a combination of 𝜑𝑝𝑙 [𝑛] = 𝑥2 [𝑛] and 𝜑𝑝𝑚 [𝑛] = (|𝑥1 [𝑛]| − 0.5)𝑥2 [𝑛] (both also having zero mean) better represent the dynamics of interest in the system being studied. Ultimately, the differences observed between NC computed with regressors with means removed or not is a modeling issue more than a limitation of the method. Time series data must be analyzed prior to model specification [78] in order to remove undesired artifacts. Any type 91 of preprocessing will modify the outcomes of the analysis, but whether it will be beneficial to a particular analysis depends on the particular characteristics of the system. One must evaluate the assumptions when choosing preprocessing data as to produce “useful” models. As shown in [147], the reliability of NC value estimation is closely related to the models used, so a careful selection off preprocessing and model estimation is doubly important for NC analysis. To demonstrate the nonlinear extension of NC, two models used in [81] are tested to demon- strate the performance of the nonlinear extension of NC. The first example model given in [81] is noise-free as shown in Eq. (4.20): Example D: First model from [81] 𝑥1 [𝑛] = 0.5𝑥1 [𝑛 − 1] + 0.8𝑥2 [𝑛 − 2] + 𝑥22 [𝑛 − 1] − 0.05𝑥12 [𝑛 − 2] + 0.5, (4.20) where 𝑥2 is assumed to be sampled from an i.i.d. uniform distribution process bounded by [−1, 1]. 𝑥1 has 1.42 mean and 0.4 variance, whereas 𝑥2 has zero mean and variance 1/3. Since the equation for 𝑥1 is noise-free, the sum of all NC𝑥𝑗 →𝑥1 values is expected to be unity, which is confirmed by table 4.5, whereas the sum of all NC𝑥𝑗 →𝑥2 , with 𝑥2 being i.i.d., is zero. Note that, in this instance, standardizing the regressors and regressand yield no difference, as there are no nonlinear cross- terms. The absence of nonlinear cross-terms also means that any weighting function 𝜆 following Eq. (4.6) produces identical results. An example where nonlinear cross-terms are present and the standardization affects the NC estimates and further elaboration on this effect are given in the next example. Table 4.5: NC𝑥𝑗 →𝑥𝑘 values for the model of Eq. (4.20) 𝑥𝑗 𝑥1 𝑥2 𝑥1 𝑥𝑘 𝑥2 0.25 0.75 0 0 △ End of Example D 92 Example E: Second model from [81] The second model example used in [81] is shown in Eq. (4.21) below. 
In [81], the model is used to evaluate how the robust model structure selection (RMSS) method proposed in [81] behaves when the nonlinear regressor function in the observation model is not included the candidate nonlinear regressor functions, but instead a Volterra expansion with two time lags and up to order 3 is applied to 𝑥1 and 𝑥2 , √ 𝑥1 [𝑛] = −𝑥2 [𝑛 − 1] |𝑥1 [𝑛 − 1]| + 0.4𝑥22 [𝑛 − 1] + 0.8𝑥2 [𝑛 − 1]𝑥2 [𝑛 − 2] + 𝜂1 [𝑛], (4.21) where 𝑥2 is assumed to be uniformly distributed on [−1, 1] and 𝜂1 [𝑛] is white noise with zero mean and finite variation. The variance of 𝜂1 is adjusted to produce different SNR values (i.e., 0dB, √ 10dB, 15dB, 50dB and noise-free in the paper). Eq. (4.21) poses a particular problem for NC value estimation using Volterra expansions as the term 𝑥2 [𝑛 − 1] |𝑥1 [𝑛 − 1]| cannot be easily expanded √ using polynomials since |𝑥| is not differentiable at 𝑥 = 0. Further complicating NC estimation is √ that a polynomial expansion of 𝑥2 [𝑛 − 1] |𝑥1 [𝑛 − 1]|, takes the form √ 𝑥2 [𝑛 − 1] |𝑥1 [𝑛 − 1]| ≈ 𝑥2 [𝑛 − 1] (𝛼0 + 𝛼1 𝑥1 [𝑛 − 1] + 𝛼2 𝑥12 [𝑛 − 1] + ⋯) . (4.22) Note how most the terms in the right-hand side would have the same 𝜆1 and 𝜆2 value (i.e., 0.5 for both 𝑥1 and 𝑥2 in the case of 𝜆1 and a separate category that depends on 𝑥1 and 𝑥2 for 𝜆1 ), but the term 𝑥2 [𝑛 − 1]𝛼0 only depends on 𝑥2 and therefore would be counted entirely towards NC𝑥2 →𝑥1 , rather than sharing the contributions. √ To observe the effect of standardization, tests were conducted at 10dB and 50dB SNR using the original and standardized regressors. The tests included NC values for |𝑥1 [𝑛 − 1]| as one of the candidate functions and Volterra expansions of third and fifth order. The results for 10dB and √ 50dB SNR are shown in tables 4.6 and 4.7, respectively. |𝑥| candidate regressor function √ Although the value for NC using the non-standardized differs significantly from the others values, the NC values computed with the standardized |𝑥| √ √ are very similar to those computed with fifth-order polynomials. The discrepancy between NC values computed with the non-standardized |𝑥| and the standardized |𝑥| is a consequence of the 93 Table 4.6: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.21) with 10dB SNR √ Volterra With |𝑥| Poly. 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 Not standardized Standardized Not standardized Standardized Order 3 0.028 0.83 0.023 0.84 0.17 0.70 0.028 0.87 5 0.044 0.83 0.035 0.84 Table 4.7: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.21) with 50dB SNR √ Volterra With |𝑥| Poly. 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 Not standardized Standardized Not standardized Standardized Order 3 0.038 0.92 0.029 0.93 0.18 0.82 0.040 0.96 5 0.058 0.92 0.040 0.94 𝛼0 𝑥2 [𝑛 − 1] term from Eq. (4.22), which is assigned to solely to NC𝑥2 →𝑥1 , whereas, being a function √ of both 𝑥1 and 𝑥2 , the contributions of 𝑥2 [𝑛 − 1] |𝑥1 [𝑛 − 1]| depends on both 𝑥1 and 𝑥2 . If 𝜆 were set to split the contribution of 𝑥2 [𝑛 − 1] equally across 𝑥1 and 𝑥2 , all the NC values would be in √ close agreement. Because |𝑥| cannot be well modeled with polynomials, modeling Eq. (4.21) with Volterra √ filters limits the accuracy of the predictive model. To observe how a similarly complex, but differentiable model behaves, the |𝑥| is be replaced with a tanh(𝑥) term, a sigmoid function. Functions that exhibit saturation, like sigmoids, are poorly approximated with polynomials at √ the extremes, but can produce reasonable approximations if the polynomial order is high enough and/or the input has small variance. 
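This approximation argument can be checked directly with a least-squares polynomial fit; the orders and input range below are arbitrary choices for illustration. The residual for the smooth, saturating tanh(𝑥) shrinks quickly with order, whereas the residual for √|𝑥| is dominated by the non-differentiable point at the origin and improves only slowly.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 2001)
targets = {"sqrt(|x|)": np.sqrt(np.abs(x)), "tanh(x)": np.tanh(x)}

for name, y in targets.items():
    for order in (3, 5, 9):
        coeffs = np.polyfit(x, y, order)                         # least-squares polynomial fit
        rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
        print(f"{name:10s} order {order}: RMSE = {rmse:.4f}")
```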
The resulting difference equation of replacing |𝑥| with tanh(𝑥1 [𝑛 − 1]) in Eq. (4.21) is shown in Eq. (4.23), 𝑥1 [𝑛] = −2𝑥2 [𝑛 − 1] tanh(𝑥1 [𝑛 − 1]) + 0.5𝑥22 [𝑛 − 1] + 0.5𝑥2 [𝑛 − 1]𝑥2 [𝑛 − 2] + 𝜂1 [𝑛]. (4.23) For this modified observation model, the same tests were conducted for 10dB and 50dB SNR. Again, the Volterra expansion was applied with two time lags and polynomial orders of three and five, and a prediction model was created with tanh(𝑥) as one of the candidate regressor functions. The results are found in tables tables 4.8 and 4.9. Note how in this case, the results between the standardized and non-standardized cases are in closer agreement as the mean of 𝑥1 is closer to 94 zero [since the 𝑥2 [𝑛 − 1] tanh(𝑥1 [𝑛 − 1]) term does not introduce bias, only the 𝑥2 [𝑛 − 1]2 term does]. The Volterra results are limited by the term containing tanh(𝑥) function being approximated only by finite order polynomials. Nevertheless, in both cases, the Volterra and tanh(𝑥) results show good agreement. Table 4.8: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.23) with 10dB SNR Volterra With tanh(𝑥) Poly. 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 Not standardized Standardized Not standardized Standardized Order 3 0.26 0.55 0.24 0.58 0.29 0.55 0.26 0.58 5 0.28 0.55 0.26 0.59 Table 4.9: NC𝑥𝑗 →𝑥1 values for the model of Eq. (4.23) with 50dB SNR Volterra With tanh(𝑥) Poly. 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 𝑥1 𝑥2 Not standardized Standardized Not standardized Standardized Order 3 0.28 0.69 0.25 0.72 0.31 0.69 0.26 0.73 5 0.31 0.69 0.26 0.74 Due to the difficulty in properly estimating parameter and topology for nonlinear models, it is not advisable to blindly increase the order of the polynomial expansions [11]. In addition to overfitting, NC value quality requires the estimated model structure and parameters to represent the observation model. Nonlinearity can often create complex relationships among regressors, such that high order regressor models might have good fitness and even generalize well, but might misrepresent the underlying model structure. Due to the complex interaction among regressors and noise, instead of representing tanh(𝑥) as a Taylor series, the regression algorithm will likely find a more compact set of regressor functions which produce lower prediction error. This compact set does not necessarily preserve the same relationship between 𝑥1 and 𝑥2 , so indiscriminately increasing the model order leads to results tending towards 1/𝑁𝑠 . This is similar to the behavior shown in Sec. 3.4, where the several scenarios are discussed where NC estimates exhibit bias under least squares estimation. △ End of Example E 95 4.7 Application: EEG data The EEG dataset used in [150] is used to compare NNC to the performance of GC and NC. Although most of the power of EEG signals can be predicted well using simple MVAR models, EEG signals contain nonlinear components that contain important information [140, 162, 182, 186]. Since linear predictive estimation models are able to reasonably represent the gross features of EEG signals, the improvement in NNC application is expected to be modest. Experiments using digital filters are used to highlight the nonlinear components which will be compared to the unfiltered results. The data were made publicly available by Nolte et al. [150], but obtained from Tom Brismar of the Karolinska Institute in Stockholm. The dataset contains EEG measurements for 10 subjects, sampled at 256Hz using the International 10–20 system, with 19 channels available using linked mastoid reference for the unipolar measurements. 
The measurements were made while subjects kept their eyes closed. The subjects were asked to open their eyes for 5 seconds every minute. The records contain about 200 segments of 4 seconds, which were recorded while subjects had their eyes closed. The location of the electrodes are shown in Fig. 4.3a, with the channel indices used in the dataset shown in Fig. 4.3b. Fp1 Fp2 1 2 F7 F3 F4 F8 11 3 4 12 Fz 17 T3 C3 Cz C4 T4 13 5 18 6 14 P3 Pz P4 7 19 8 T5 T6 15 16 O1 O2 9 10 (a) Electrode labels (b) Electrode index in dataset Figure 4.3: 10-20 International System Electrode Location Diagram 96 2 Magnitude spectrum (log scale) 10 10 0 10 -2 0 10 20 30 40 50 Frequency (Hz) Figure 4.4: Spectrum of the Fp1 channel of the EEG recording The signals contain a 𝛼 rhythm component (8-13 Hz band) at approximately 10Hz. All apparent artifacts have been removed from the data by Nolte et al. prior to the publication of the data. The 10 recordings were selected out of a pool of 88 recordings based on estimated signal to noise ratio. The database contains no subject identifiable information. While no ground truth is possible for these data, it is well established in literature that information flow for 𝛼 and 𝛽 waves follow a posterior-to-anterior (front to back) pattern [89, 150] during resting states. For these experiment, the flow between the left pre-frontal cortex (Fp1) channel and the right occipital (O2) and right parietal (P4) channels were considered. The 𝜃 waves flow in an anterior-to-posterior pattern [89] under similar conditions. The time-series were further processed using a notch filter to remove 50Hz line noise and a high-pass Butterworth filter of order 10 with cutoff frequency at 7.5Hz to remove low frequency signal drifts and 𝜃 waves. The recordings were split into 202 segments of 4 seconds each. The spectrum of the entire signal and for the first segment for the Fp1 channel are shown in Fig. 4.4. The models used to evaluate GC and NC were 3rd order AR/ARX and ARX models respectively. The models used to evaluate NNC and SNNC were 3rd order polynomial expansions of the regres- sors used to evaluate NC. The model parameters were evaluated using LASSO using four-fold 97 10 -1 SNNC (log scale) 10 -2 10 -3 10 -4 0 10 20 30 40 50 Frequency (Hz) Figure 4.5: Average of SNNCFp1→O2 values of subject 1 cross-validation. The average of the SNNC values for the 𝑥Fp1 into 𝑥O2 test is shown in Fig. 4.5. The significance numbers were obtained using trial-shuffling [37, 196]. For each 𝑗 th segment output time-series, the GC, NC and NNC values were calculated using the input time-series of all 𝑘 th segments, the GC, NC and NNC values evaluated for 𝑗 ≠ 𝑘 were used to estimate the distribution of GC, NC and NNC under the non-causal assumption. The distributions were evaluated using kernel estimation technique [88]. Since the pre-frontal cortex is reasonably distant from parietal and occipital regions, no spatial sharpening procedure is applied. Additionally, since no activity is being executed by subjects and (particularly) the 4 second segments are not related to the (non) activity of the subjects, no data alignment procedure is done and trials are assumed independent. The GC, NC and NNC evaluated for 𝑗 = 𝑘 were evaluated against that distribution to evaluate the 𝑝-value of that trial. The trials were considered significant using a Neyman-Pearson test with maximum of 1% false positives. 
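A schematic of the trial-shuffling procedure just described is sketched below. The causality_measure function is a hypothetical stand-in (here a simple lagged squared correlation) for the GC, NC, and NNC estimates actually used, and a plain empirical p-value replaces the kernel-density estimate of the null distribution; the mismatched (𝑗 ≠ 𝑘) segment pairings supply the non-causal reference distribution.

```python
import numpy as np

def causality_measure(x_seg, y_seg):
    """Placeholder for a GC/NC/NNC estimate between an input and an output segment."""
    # Hypothetical stand-in: squared correlation between x[n-1] and y[n].
    return float(np.corrcoef(x_seg[:-1], y_seg[1:])[0, 1] ** 2)

def trial_shuffle_pvalue(x_segs, y_segs, j):
    """Empirical p-value for segment j against the shuffled (j != k) null distribution."""
    null = [causality_measure(x_segs[k], y_segs[j])
            for k in range(len(x_segs)) if k != j]
    observed = causality_measure(x_segs[j], y_segs[j])
    return (1 + sum(v >= observed for v in null)) / (1 + len(null))

rng = np.random.default_rng(4)
segs_x = [rng.standard_normal(1024) for _ in range(50)]
segs_y = [0.6 * np.roll(x, 1) + 0.4 * rng.standard_normal(1024) for x in segs_x]

pvals = [trial_shuffle_pvalue(segs_x, segs_y, j) for j in range(len(segs_x))]
print("fraction significant at 1%:", np.mean(np.array(pvals) < 0.01))
```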
To highlight the nonlinear relationships in the EEG signal, the tests were repeated three times: first as described above, second by filtering the 𝛼 rhythm frequencies and lower and third by filtering the 𝛽 rhythm (13-35Hz) frequencies and lower. The signals were filtered using Chebyshev type II high-pass filters of order 10 at cut off frequencies 13.5Hz and 35Hz respectively. In the 98 models used to evaluate NNC, the filters are applied after the polynomial expansions, to preserve the contribution of the 𝛼 waves into 𝛽 and higher bands due to the harmonic distortion. For the tests using the 13.5Hz high-pass filters, the SNNC was also evaluated between 18Hz and 28Hz, which roughly correspond to twice the frequency of the 𝛼 waves. During the first test, all of the measures identified a strong relationship between the Fp1 and O2, but were unable to differentiate direction of flow between Fp1 and O2, having both high levels of significance in both directions, with only SNNC having significantly higher rejection in the O2 to Fp1 direction. Applying the filter with a cutoff frequency of 13.5Hz reveals the directivity and also more differences between the measures. When filtering both 𝛼 and 𝛽 bands, the measures fail to indicate the strong connectivity between Fp1 and O2, partially due to lower SNR and electromyographic interference [143]. The results are shown in table 4.10 and table 4.11, where the best two results4 are in bold. Table 4.10: GC, NC, NNC, and SNNC results on whether to accept 𝑥Fp1 causes 𝑥O2 Unfiltered Filtered at 13.5Hz Filtered at 35Hz GC 0.851 0.535 0.228 NC 0.851 0.614 0.267 NNC 0.772 0.525 0.168 SNNC 0.812 0.674 0.891 Table 4.11: GC, NC, NNC, and SNNC results on whether to reject 𝑥O2 causes 𝑥Fp1 Unfiltered Filtered at 13.5Hz Filtered at 35Hz GC 0.139 0.604 0.861 NC 0.139 0.545 0.861 NNC 0.158 0.723 0.861 SNNC 0.386 0.743 0.99 In the tests with the high-pass filter with cut-off frequency at 13.5Hz, SNNC performed the best at both accepting 𝑥Fp1 causing 𝑥O2 and rejecting 𝑥O2 causing 𝑥Fp1 . The NNC result was also able to reject 𝑥O2 causing 𝑥Fp1 at a comparable rate to NNC and were about 20% higher relative to GC. The NC results seem to indicate that bias towards significance as it consistently assigned 4 When multiple measures perform equally, more than two entries may boldened. 99 highest significance to tests out of all measures. The GC results show no similar bias, but show lower selectivity than SNNC. The receiver operating characteristic curves for the unfiltered tests and the tests filtered at 13.5Hz regarding Fp1 and O2 are shown in Figs. 4.6 and 4.7, where Fp1 causing O2 is assumed true positives and O2 causing Fp1 is assumed as a false positive. In Fig. 4.6, the improvement of SNNC over the other measures can be seen more clearly, where only the higher rejection of O2 causing Fp1 is seen in table 4.11. In Fig. 4.7, both NNC and SNNC perform better than the other measures, but quite similarly to each other. 1 0.8 True positive rate 0.6 GC 0.4 NC SNC NNC 0.2 SNNC 1-to-1 0 0 0.2 0.4 0.6 0.8 1 False positive rate Figure 4.6: Receiver operating characteristic curves for the unfiltered tests The tests were repeated computing the causality measures between the Fp1 and P4 channels. The results are shown in table 4.12 and table 4.13. For the unfiltered signals, all tested methods were better able to show the directionality of information flow than the tests with Fp1 and O2. Nevertheless, the rate of significant results for 𝑥Fp1 causing 𝑥P4 are also smaller. 
The rate of significant results for the signals filtered at 13.5Hz are higher than the unfiltered ones for 𝑥Fp1 causing 𝑥P4 and are comparable to the unfiltered results found in table 4.10. The rejection rates for 𝑥O2 causing 𝑥P4 for signals filtered at 13.5Hz are similar to table 4.11, where NNC and SNNC are both significantly superior to GC and NC (here by 27% and 42% respectively). 100 1 0.8 True positive rate GC NC 0.6 SNC NNC 0.4 SNNC 1-to-1 0.2 0 0 0.2 0.4 0.6 0.8 1 False positive rate Figure 4.7: Receiver operating characteristic curves for 13.5Hz Table 4.12: GC, NC, NNC, and SNNC results on whether to accept 𝑥Fp1 causes 𝑥P4 Unfiltered Filtered at 13.5Hz Filtered at 35Hz GC 0.653 0.891 0.851 NC 0.634 0.891 0.851 NNC 0.593 0.842 0.743 SNNC 0.624 0.772 0.168 Table 4.13: GC, NC, NNC, and SNNC results on whether to reject 𝑥P4 causes 𝑥Fp1 Unfiltered Filtered at 13.5Hz Filtered at 35Hz GC 0.545 0.535 0.465 NC 0.634 0.416 0.347 NNC 0.564 0.683 0.594 SNNC 0.574 0.762 0.881 The receiver operating characteristic curves for the unfiltered tests and the tests filtered at 13.5Hz regarding Fp1 and P4 are shown in Figs. 4.6 and 4.7, where Fp1 causing P4 is assumed true positives and P4 causing Fp1 is assumed as a false positive. In Fig. 4.6, the NNC and SNNC results are worse than NC, although NNC achieves similar results in the small false positive rate region and SNNC achieves similar results to NC for large false positive rates. In Fig. 4.7, the advantage of NNC and SNNC over NC is visible in the small false positive rate region, with the advantage diminishing as the false positive rate increases. 101 1 0.8 True positive rate GC NC 0.6 SNC NNC 0.4 SNNC 1-to-1 0.2 0 0 0.2 0.4 0.6 0.8 1 False positive rate Figure 4.8: Receiver operating characteristic curves for the unfiltered tests 1 0.8 True positive rate GC NC 0.6 SNC NNC 0.4 SNNC 1-to-1 0.2 0 0 0.2 0.4 0.6 0.8 1 False positive rate Figure 4.9: Receiver operating characteristic curves for 13.5Hz 4.8 Discussion of 𝜆 functions and preprocessing The properties of the weighting function 𝜆 qualifies it as a probability mass function. In fact, the weighting function 𝜆 operates similarly to a probability mass function in Eq. (4.4). Since 𝜆𝑝𝑞 (𝜑𝑘 ) defines how much of the contribution of 𝑎𝑝𝑘 𝜑𝑘 should be attributed to 𝑥𝑝 , this would be 102 equivalent of evaluating the expected value of the contribution attributed to assuming it has probability 𝜆𝑝𝑞 (𝜑𝑘 ) of being 𝑎𝑝𝑘 𝜑𝑘 and (1 − 𝜆𝑝𝑞 (𝜑𝑘 )) probability of being 0. Under the same rationale, 𝜆2 defines the indicator function of greatest entropy, which makes no a priori assumptions about the regressor functions. One of the remaining challenges for the development of a unified nonlinear extension of NC is the choice of the “correct” function 𝜆. This begs the question of what “true causality” and the purpose of causality analysis are. As both GC and NC are based on causality as defined by Hume [103], it is helpful to point out that Hume was concerned mostly with the epistemological aspect of causality, rather than an ontological one. Similarly, it would be naive to assert that signal 𝑥𝑞 “causes” 𝑥𝑝 as a matter of fact, without careful consideration of a priori knowledge. Similarly, the appropriate choice of 𝜆 relies on understanding what is the most useful manner to assign contributions given a particular set of regressor functions and the system being observed. The simulations concerning the model from Eq. (4.21) show how standardizing the regressors changes the NC estimates. 
Additionally, due to the characteristics of the nonlinear model from Eq. (4.21), the Volterra filter had limited success at estimating the contribution of past values of 𝑥1 to the current value of 𝑥1 , as 𝑥1 did not have zero mean and, therefore, some of the contribution of 𝑥1 was misattributed to 𝑥2 . Analogously to the choice of candidate regressor functions, the choice of 𝜆 function relies on careful consideration of the system being modeled. Additionally, since the NC value is derived from models, it is important to distinguish the systems from which the data are gathered from the models used to represent them. For exam- ple, one could develop very accurate models to predict sunrise and sunset times without ever considering whether the sun still exists. For such models, GC and NC would suggest that the existence of the sun has no impact on sunrise and sunset times, an absurd conclusion. Instead, an epistemological interpretation of causality analysis yields more useful interpretations, the knowledge of the effects of the sun’s inexistence does not increase the knowledge of sunset and sunrise times. This argument is similar to Box’s commentary on the wrongness of all models [31]. Ultimately, the goal of causality analysis is to gain knowledge on systems given limited 103 information available about them. Therefore, the concern should not lie on which the choice of function 𝜆 is “right” or “wrong,” but rather which ones lead to most “useful” conclusions about causal relationships. 4.9 Conclusions New Causality is a promising method for assessing causality links between two or more signals. In the seminal definition [95] NC is defined only for LTI models. This limits the use of NC to systems that can be modeled well with LTI models. In this work, a novel extension of NC to NARMAX models is presented. Three methods for choosing the 𝜆 weighting function are shown, where the first two are formally defined and a suggestion is made for the implementation of a third, while allowing for alternate implementations. All three methods produce identical results to seminal definition of NC for ARMAX models. Results show that this extension is suitable for systems that can be modeled well by NARMAX models, producing good results in the tested models. Particularly 𝜆2 has shown to produce adequate results even the nonlinear functions of the observation model are not part of the set of candidate regressor functions. In tests with EEG signals, SNNC was shown to outperform NC and GC in showing the linkage between 𝛼 waves in Fp1 to 𝛽 waves in O2. Just as the seminal definition of NC, the nonlinear extension depends heavily on the estimated model. Thus, it is important to highlight that careful selection of model topology and model parameter estimation is essential to obtain useful NNC estimates. The function 𝜆 has been shown to be a probability mass function. For each suggested 𝜆, the weights are governed by different assumptions about the distribution of “causal strength.” Although 𝜆2 has shown promise in this work, models with non-antisymmetrical properties or regressors with non-zero means can induce shifts in the NC values. However, the seminal definition of NC is also sensitive to data preprocessing, as it pertains to modeling more than causality analysis. This extension of NC to NARMAX models adds flexibility to NC to assess causality strength to any signals that can be well modeled with LTIiP models. 
The extension inherits the strengths of 104 NC, while also having the same requirement of accurate model topology and parameter estimation in order to produce “useful” NC values. The choice of 𝜆 function requires careful consideration, but is not unlike the choice of candidate regressor functions, in which a priori information about the system being modeled is used to guide the choice. 105 CHAPTER 5 IMPROVEMENTS TO THE EvolOBE METHOD FOR NONLINEAR CAUSALITY ANALYSIS 5.1 Overview With the need for accurate modeling for NC analysis made clear,the focus of this chapter now shifts to a method of estimating nonlinear model structures and parameters. The current work is centered on a biologically-motivated method for both the selection of the effective regressors and the estimation of the parameters of modified NARMAX models. The approach integrates set-based parameter estimation and genetic algorithms for optimization over fitness measures derived from a set of solutions [213]. A brief sketch of the overall approach appears in Sec. 2.4. This chapter is focused on innovations in the evolutionary process by which the model regressor set is selected. As in any nonlinear identification solution, the evolutionary–set-theoretic framework described above is computationally-intensive, as the number of regressors increases factorially with the order of the nonlinear expansion. In a general sense, this chapter addresses the need to find more efficient data-processing algorithms for brain modeling. A more efficient solution is based in the expected sparsity of the connectivity models in terms of the relatively low number of regressors that would be necessary to effectively characterize nonlinear relationships in time-series records. This assumption has significant implications for the evolutionary search over the space of regressor combinations. In particular, modified crossover and mutation operators are incorporated in the NSGA-II [51] framework to expedite feature (regressor) selection. By adjusting the mutation and crossover operators to account for sparsity and pairwise relationships in the population, the number of generations needed to arrive at the solution is greatly reduced. Further technical details of the operation of the model are found in previous papers [213, 214, 106 216]. Some portions of this chapter are quoted directly from [149] with a few modifications for improved flow and clarity. 5.2 Model form The goal of the identification strategy in this work is to obtain a model whose internal mechanism mimics the system being studied. Note that unless a priori information is available, the similarity between the internal mechanism and the system cannot be measured, but instead, predictive power is often used as a surrogate measure of similarity. { } The internal processing of the system is based on a subset of a candidate set of nonlinear regressor functions, Ξ𝜑 = 𝜑𝑞 , of size || Ξ𝜑 ||. Each regressor is a mapping 𝜑𝑞 : R𝑁𝑠 → R. The identification strategy starts by positing that, given the appropriate candidate set, there exists a LTIiP observation model, O𝒂∗ ,𝝋 ∗ , of the form in Eq. (2.23) for 𝑛 ∈ Z, given by 𝐾∗ 𝐾𝜖 O𝒂∗ ,𝝋 ∗ : 𝑥𝑝 [𝑛] = ∑ 𝑎𝑝𝑘∗ 𝜑𝑝𝑘∗ (𝑥1||𝑛−1 ,𝑥 | 𝑛− 2 |𝑛− 𝑛−1 , … , 𝑥𝑁 ||𝑛− ) + ∑ 𝑏𝑝𝑘 𝑠 𝑛−1 ∗ 𝜙𝑝𝑘 ∗ ∗ |𝑛−1 (𝜖𝑝 |𝑛− ) + 𝜖𝑝 [𝑛] ∗ 𝑘=1 𝑘=1 ≐ 𝒂𝑝∗𝑇 𝝋𝑝∗ [𝑛] + 𝜖𝑝∗∗ [𝑛] (5.1) where 𝐾𝜖 ∗ |𝑛−1 𝜖𝑝∗∗ [𝑛] = ∑ 𝑏𝑝𝑘 ∗ 𝜙𝑝𝑘 ∗ (𝜖𝑝 |𝑛− ) + 𝜖𝑝 [𝑛] ∗ (5.2) 𝑘=1 with 𝒂 ∗ ∈ R𝐾 , and 𝜖 ∗∗ an error sequence representing uncertainties in the model. 
The “∗” ∗ subscript indicates a “true,” but unknown, quantity associated with the observation model. 1 The arguments, 𝑥−∞ 𝑛 and 𝑦−∞ 𝑛−1 , of the regressor signals 𝜑𝑞 (or vector 𝝋) indicate that a fi- nite number of elements is selected from the subsequences {𝑥1 [𝑛 − 1], 𝑥1 [𝑛 − 2], … , 𝑥1 [𝑛 − ], 𝑥2 [𝑛 − 1], … , 𝑥2 [𝑛 − ], … , 𝑥𝑁𝑠 [𝑛 − ]} by each 𝜑𝑞 for processing at time 𝑛. For conservation of space, we define the vectors of 𝑁𝑠 signal samples used at time 𝑛 by 𝒖𝑞∗ [𝑛], and the matrix 𝑼∗ [𝑛] = [ 𝒖1∗ [𝑛] 𝒖2∗ [𝑛] ⋯ 𝒖𝐾 ∗ [𝑛] ]. Given observations of 𝑥 and 𝑦 sufficient to compute out- puts on time interval 𝑛 = 1, 2, … , 𝑁 , we pose an estimation model as a function of the parameters avoid cumbersome notation, it is to be understood that 𝜑𝑞∗ is the 𝑞 th element selected from Ξ𝜑 , rather than element 𝑞 of Ξ𝜑 . 1 To 107 and regressor signals, 𝐾 M𝒂𝑝 ,𝝋 : 𝑥̂𝑝 (𝑛, 𝒂𝑝 , 𝝋 ) = ∑ 𝑎𝑝𝑘 𝜑𝑝𝑘 (𝒖𝑞 [𝑛]) ≐ 𝒂𝑝𝑇 𝝋𝑝 (𝑼 [𝑛]), (5.3) 𝑘=1 in which each 𝜑𝑞 is drawn from the set Ξ𝜑 (see footnote 1), 𝒂 ∈ R𝐾 , and the 𝒖𝑞 [𝑛] and 𝑼 [𝑛] are defined similarly to 𝒖𝑞∗ [𝑛] and 𝑼∗ [𝑛]. The circumflex in 𝑥̂ connotes “prediction” , as this estimation model corresponds to the classical prediction-error method (e.g., [128]). This is true even though the regressor functions can be highly-nonlinear functions of the observations, because (when assumed fixed in the model) they appear in a model that is linear-time-invariant-in-parameters (LTIiP). Thus, the identification of the parameters using least square errors or (theoretically) mean-squared-error techniques is a well-known problem. Our approach, however, involves a distinctly different identification method which produces parameter solution sets rather than point estimates (e.g., [54, 55]). It is the properties of these sets that couple the model creation and parameter identification problems. 5.3 Identification strategy The EvolOBE method combines the strengths of evolutionary computing and more traditional set-theoretic parameter estimation methods to robustly obtain a family of models with different tradeoffs between accuracy and model complexity. The evolutionary algorithm is responsible for finding the subsets of regressor functions 𝝋𝑝 out of Ξ𝜑 , whereas the set-theoretic parameter estimation method uses of the selected 𝝋𝑝 to obtain 𝒂𝑝 . This framework simultaneously addresses selection of the model structure and the parameter estimation. Moreover, a very significant advantage of the algorithm is the lack of need for assumptions about stationarity or distributional characteristics of the noise. The specifics are outlined in the following paragraphs. Candidate models are encoded as binary chromosomes, where each possible phenotype repre- sents a model with different regressor functions. The chromosome is a binary sequence in which the 𝑞 th gene represents the presence or absence of the 𝑞 th regressor function. The information encoded in the chromosomes is used to generate the regressor functions which are fed to the OBE algorithm, which obtains a feasibility set according to data. The set properties are then used to 108 assign fitness values to each chromosome, and the fitness value is used in the genetic algorithm selection process to evolve the population toward better solutions (e.g. [167]). 
This fitness measure can be in the form of a single objective function that provides a summary of the quality of the model, such as FPE and AIC, or in the form of multiple objective functions, covering predictive accuracy, model complexity, and other information about the candidate model (such as the volume or sum of the semi-axes of the ellipsoid). The assigned fitness measure regulates the chance of survival of each particular model in a generation. The algorithm starts with a random population of chromosomes. At each step, the population is evaluated, then a subset of the population is selected to generate children through mutation and crossover operations. Mutation operators work by randomly selecting genes and altering them, whereas crossover operators combine portions of the chromosomes of two or more parents to produce a new offspring, which are added into the population. The population is sorted and individuals with lower fitness are discarded. The specific mechanisms for mutation and crossover operations, as well as the selection of parents, sorting of the population, and survival criterion are often tailored for a particular application. To reduce the computational complexity of this process, the search space of regressor models must be controlled, and the candidate and final models must use the fewest regressors that are con- sistent with an objective of prediction-error minimization. Since minimizing the prediction error and minimizing the number of regressors are conflicting objectives, a multi-objective optimization approach is desired. For this work, the NSGA-II [51] approach is adopted, since it generates a set of solutions (ideally the Pareto-front), providing the best solution for a given number of regressors and allowing the model with the best trade-off to be chosen. 5.3.1 NSGA-II NSGA-II is a standard algorithm for solving multiobjective optimization problems. It requires a small number of parameters and is able to obtain solution sets spread along the pareto-front. It is especially appropriate for problems with only two objectives. The basic NSGA-II algorithm is 109 INITIALIZE EVALUATE FITNESS POPULATION YES TERMINATION STOP CRITERION? NO SORT POPULATION SELECT PARENTS CROSSOVER & MUTATION Figure 5.1: NSGA-II algorithm summary shown in Fig. 5.1. In the original NSGA-II paper [51], Deb et al. use binary tournament selection, bit-wise mutation and single-point crossover with probability of 𝑝𝑐 = 0.9, and mutation probability 𝜇 = 1/𝓁 (where 𝓁 is the length of the chromosome). In this work, these operators and parameters are used as a baseline for comparison, with the exception of the single-point crossover operator, which is replaced by a two-point crossover operator. 5.3.2 Asymmetric mutation operator For sparse solutions, the mutation operator can be tuned to guide the population toward sparsity. Although judicious selection alone can effect sparse solutions, a properly tuned mutation operator can increase the convergence rate significantly. Here, an asymmetric mutation (AM) operator is developed. Classic mutation operators use a fixed probability to flip each chromosome regardless of its previous value. This is effective for blind exploration, but imposes pressure toward solutions with 50% active genes. For a given 𝜇 probability of mutation, the expected number of active (𝑁1 ) and inactive genes 110 (𝑁0 ) at step 𝑛 + 1 is given by ⎡ 𝑛+1 ⎤ ⎡ ⎤⎡ ⎤ ⎢𝑁1 ⎥ ⎢(1 − 𝜇) 𝜇 ⎥ ⎢𝑁1𝑛 ⎥ ⎢ ⎥=⎢ ⎥⎢ ⎥ ⎢𝑁0𝑛+1 ⎥ ⎢ 𝜇 (1 − 𝜇)⎥ ⎢𝑁0𝑛 ⎥ (5.4) ⎣ ⎦ ⎣ ⎦⎣ ⎦ This matrix has eigenvalues 1 and (1-2𝜇). 
The eigenvector for 1 is [ 1 1 ]𝑇 , which means that, in the absence of selection operators, the number of active and inactive genes tends to equality at a rate depending on 𝜇 . An asymmetric mutation operator can be used to achieve any desired rate of activation. Two distinct mutation operators are introduced to implement this effect: 𝜇10 the probability of deactivating an active gene, and 𝜇01 the probability of activating an inactive gene. The matrix system (5.4) becomes ⎡ 𝑛+1 ⎤ ⎡ ⎤⎡ ⎤ ⎢𝑁1 ⎥ ⎢(1 − 𝜇10 ) 𝜇01 ⎥ ⎢𝑁1𝑛 ⎥ ⎢ ⎥=⎢ ⎥⎢ ⎥ ⎢𝑁0𝑛+1 ⎥ ⎢ 𝜇10 (1 − 𝜇01 )⎥ ⎢𝑁0𝑛 ⎥ (5.5) ⎣ ⎦ ⎣ ⎦⎣ ⎦ The eigenvalues of this system are 1 and 1 − 𝜇10 − 𝜇01 with corresponding eigenvectors [ 𝜇01 𝜇10 ]𝑇 and [ 1 −1 ]𝑇 . The desired ratio of active to inactive genes is given by 𝜇01 𝑟𝑑 = 𝜇10 + 𝜇01 (5.6) For this scheme, the mutation rate is defined as 𝜇 = 𝑟𝑐 𝜇10 + (1 − 𝑟𝑐 )𝜇01 (5.7) where 𝑟𝑐 is the ratio of active to total genes (i.e. 𝑁1 /(𝑁1 + 𝑁0 )). By combining Eqs. (5.6) and (5.7), the following expressions for the mutation probabilities are obtained 𝜇 𝑟𝑑 𝜇01 = 𝑟𝑐 + 𝑟𝑑 − 2𝑟𝑐 𝑟𝑑 𝜇(1 − 𝑟𝑑 ) 𝜇10 = (5.8) 𝑟𝑐 + 𝑟𝑑 − 2𝑟𝑐 𝑟𝑑 This extended solution reduces to that for the traditional mutation operator when 𝑟𝑑 = 0.5. Decoupling the mutation probabilities yields a more flexible mutation operator with which pressure can be applied toward a desired sparsity level. 111 5.3.3 Reduced surrogate crossover Evolution is improved by a crossover operator that generates novel individuals. A method to achieve novelty is to use reduced surrogate crossover (RSX) [30]. With RSX, only non-matching alleles are crossed between individuals. This is especially important as the genetic diversity decreases with evolution. Thus the likelihood of generating a novel individual from two similar parents becomes smaller in traditional two-point crossover operations. A varying-minimum Hamming distance between chromosomes is suggested in [134]. In the present work, a fixed unity Hamming distance yielded small improvements in convergence speed. The fixed distance avoids the shortcomings of the minimum Hamming approach, but results in less efficient sampling of the search space. 5.3.4 Linkage tree crossover One of the tenets for the convergence of genetic algorithms is that the population will shift from the initial randomly generated solutions into a population that increasingly has characteristics found in the pareto-optimal solution set. Under this assumption, the statistical characteristics of the population at a generation can be used to estimate what operations are more likely to produce helpful results.2 Linkage tree crossover (LTX), introduced by Thierens in [192], crosses solutions over at positions that are more likely to generate fit offspring. First, LTX collects information of the statistical characteristics of the population and clusters the genes into a binary tree that summarizes how clusters are linked together. Each cluster is initialized with a single gene, and clusters are then progressively linked together until all genes are included in a single cluster. The clustering uses a distance metric based on mutual information and entropy [114]. For clusters 𝐶1 and 𝐶2 , the mutual information is computed as 𝑝𝐶1 ,𝐶2 (𝑐1 , 𝑐2 ) 𝐼 (𝐶1 ; 𝐶2 ) ≐ ∑ ∑ 𝑝𝐶1 ,𝐶2 (𝑐1 , 𝑐2 ) log , ( 𝑝𝐶1 (𝑐1 )𝑝𝐶2 (𝑐2 ) ) (5.9) 𝑐1 ∈C1 𝑐2 ∈C2 2 However, care must be taken not to heavy-handedly influence the evolution, as a stronger emphasis on exploita- tion is likely to diminish the ability of the GA for exploration. 
112 where C1 and C2 are the sets of all possible values for 𝐶1 and 𝐶2 , 𝑝𝐶1 ,𝐶2 (𝑐1 , 𝑐2 ) is the joint probability of 𝑐1 and 𝑐2 , 𝑝𝐶1 (𝑐1 ) is the probability of 𝑐1 and 𝑝𝐶2 (𝑐2 ) is the probability of 𝑐2 . Alternatively, the mutual information may be computed using the entropies. The entropy for a cluster 𝐶 ∈ C is defined as 𝐻 (𝐶) ≐ − ∑ 𝑝𝐶 (𝑐) log (𝑝𝐶 (𝑐)) , (5.10) 𝑐∈C and using the following identity 𝐼 (𝐶1 ; 𝐶2 ) ≐ 𝐻 (𝐶1 ) + 𝐻 (𝐶2 ) − 𝐻 (𝐶1 ; 𝐶2 ). (5.11) The distance metric is then defined as 𝐻 (𝐶1 ) + 𝐻 (𝐶2 ) 𝐻 (𝐶1 , 𝐶2 ) − 𝐼 (𝐶1 , 𝐶2 ) 𝐷(𝐶1 , 𝐶2 ) ≐ 2 − = . 𝐻 (𝐶1 , 𝐶2 ) 𝐻 (𝐶1 , 𝐶2 ) (5.12) The general procedure for generating the linkage tree is found in Alg. B.4 in Appendix B. An example of a linkage tree is shown in Fig. 5.2. In the example, the order of the crossover operations would be combining 𝜑1 , 𝜑2 , and 𝜑4 from one parent and 𝜑3 and 𝜑4 from the other parent, then 𝜑1 and 𝜑4 with 𝜑2 , then 𝜑3 , and 𝜑5 , and finally combining 𝜑1 with 𝜑4 . 𝜑1 𝜑2 𝜑3 𝜑4 𝜑5 𝜑1 𝜑2 𝜑4 𝜑3 𝜑5 𝜑1 𝜑4 𝜑1 𝜑4 𝜑2 𝜑3 𝜑5 Figure 5.2: Linkage tree example Once the linkage tree is generated, the algorithm traverses the tree executing crossovers exchanging the clustered genes. In the seminal algorithm, if at least one offspring is superior to 113 both parents, the parents are replaced by the children. When the tree is fully traversed, the best individuals are copied into the next generation. The detailed LTX procedure is shown in Alg. B.5 in Appendix B. In this work, a special consideration is necessary, as LTX was not envisioned for multi-objective problems. For single-objective problems, a solution may either be superior, inferior or equivalent to a second solution, whereas for multi-objective problems, solutions may also be neither superior (dominate) nor inferior (dominated), but simply offer a different trade-off (i.e. superior in at least one objective function, but inferior in at least one solution). When the offspring and parents neither dominate nor are dominated by each other, there is the choice to keep or replace the parents or a stochastic combination of both. Preliminary tests showed no clear advantage of either choice, but further investigation on this topic is planned as future work. The detailed procedure for the use of LTX in multi-objective problems is shown in Alg. B.6 in Appendix B. One possible downside of the use of LTX is the substantial computational cost of evaluating the large number of entropy calculations needed to construct the linkage tree [155]. For problems with fitness functions that are computationally costly, LTX is more advantageous. The overhead of computing the linkage tree is becomes more significant as population sizes increase, but small population sizes can produce poor estimates of entropy [199]. 5.4 Results of AM and RSX A randomly generated NARMAX model with five regressors was used to evaluate the modified operators. Three delayed outputs and two delayed inputs to the system were extracted as linear regressors and expanded to a 3rd order Volterra series, obtaining a total of 55 nonlinear regressors. The estimated Pareto front is shown in Fig. 5.3. The ordinate shows the RMSE of the prediction error (dB scale) and the abscissa shows the number of regressors in each model. As expected, there is a knee located at five regressors, corresponding to the number in the generative model. There is some improvement in the RMSE for models with more regressors, but only due to overfitting. 
To assess the improvement relative to the unmodified NSGA-II method, simulations were used to estimate the number of generations required for each genetic algorithm (GA) to converge to the generative model. The number of generations follows a probability distribution whose parameters depend on the GA and its internal parametrization. Clearwater et al. [45] have shown that the number of generations required by a GA to find a solution asymptotically approaches a log-normal distribution. Because of the long-tailed nature of this distribution, both its mean and its variance are significant. A lower-variance estimator can provide a more meaningful measure of the number of generations to convergence, even at the expense of mild estimator bias.

To examine how well the number of generations required to find the solution fits a log-normal distribution, 16384 runs of our algorithm were evaluated using NSGA-II with the same parameters for each run. The number of generations required by each run was recorded and a log-normal distribution was fitted to the data. Fig. 5.4 shows the histogram with "×" markers and the fitted distribution with a solid line. The fitted distribution tracks the histogram remarkably well, especially in the long tail of the distribution. For clarity, the histogram is omitted from further figures.

Figure 5.4: Histogram vs. fitted log-normal distribution

The NSGA-II algorithm was implemented using the symmetric bit-wise mutation operator (equivalent to r_d = 0.5) and two-point crossover. The asymmetric binary mutation with r_d = 0.1 was applied to all remaining simulations. The modified domination criterion (unique sorting) was added from the third simulation onwards. The fourth simulation incorporated the RSX operator, and the fifth added a minimum Hamming distance (HD) of 1 to the mutation operator. The results can be seen in Fig. 5.5. The parameters of the fitted models are compiled in Table 5.1. Since the log-normal parameters μ and σ do not correspond to the mean and standard deviation, these values are also calculated and shown in separate columns.

Figure 5.5: Generations to arrive at the desired model (curves: original operators; asymmetric mutation; AM + unique sort; AM + US + RSX; AM + US + RSX + HD)

Table 5.1: Fitted parameters for different methods

Method                   μ      σ      Mean   Std. Dev.
Original operators       4.13   0.28   64.3   18.1
Asymmetric Mutation      3.34   0.421  30.9   13.6
AM + Unique Sort         3.25   0.341  27.5   9.64
AM + US + RSX            3.23   0.335  26.7   9.19
AM + US + RSX + HD       3.22   0.329  26.4   8.95

The asymmetric mutation operator causes a drastic change in the simulation results, reducing both the mean and the standard deviation significantly. It reduces the number of generations needed to reach 99% confidence from 118 to 76 (a reduction of 35%) and to reach 99.9% confidence from 145 to 104 (a reduction of 28%). While the modified domination criterion causes a smaller reduction in μ, it reduces σ more significantly. As seen in the graph, the modes are largely unchanged (ranging from 22 to 23 generations), but the reduction in σ greatly reduces the number of generations required to reach high confidence.
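The confidence figures quoted here and below can be read off a fitted log-normal distribution. The sketch below uses synthetic generation counts as a stand-in for the recorded data (the parameters are chosen to resemble the asymmetric-mutation row of Table 5.1) and evaluates the 99% and 99.9% points.

```python
import math
import random
import statistics

random.seed(1)
# Synthetic stand-in for the recorded generations-to-convergence of 16384 runs.
gens = [math.exp(random.gauss(3.34, 0.42)) for _ in range(16384)]

logs = [math.log(g) for g in gens]
mu = statistics.fmean(logs)      # log-normal location parameter
sigma = statistics.stdev(logs)   # log-normal scale parameter

# Mean and standard deviation of the fitted log-normal (cf. Table 5.1 columns).
mean = math.exp(mu + sigma**2 / 2)
std = mean * math.sqrt(math.exp(sigma**2) - 1)

# Generations needed to reach 99% and 99.9% confidence of convergence.
z = statistics.NormalDist()
g99 = math.exp(mu + sigma * z.inv_cdf(0.99))
g999 = math.exp(mu + sigma * z.inv_cdf(0.999))
print(round(mean, 1), round(std, 1), round(g99), round(g999))
```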
The old sorting algorithm passes 99% confidence at 76 generations and 99.9% confidence at 104 generations, while the new sorting algorithm passes 99% confidence at 58 generations (a further reduction of 23%) and 99.9% confidence at 75 generations (a further reduction of 27%). The remaining modifications improve convergence on a smaller scale: the reduced surrogate without a minimum Hamming distance and the reduced surrogate with a minimum distance of 1 need 55 and 54 generations, respectively, to reach 99% confidence, and 71 and 70 generations to reach 99.9% confidence.

5.5 Results of LTX

To test the improvement given by LTX, a test with the nonlinear observation model of Eq. (4.23) was run under various conditions. The advantage of this model over the previous one is that the tanh(⋅) term in the difference equation means that polynomial estimation models provide better prediction with increasing model order, but no finite set of polynomial regressor functions can represent the tanh(⋅) term perfectly. The comparison tests were run against the results of the previous section under similar conditions.

One challenge noted in the literature is that regression algorithms adapt to the absence of certain regressor functions by using other, correlated regressor functions, regardless of whether those functions appear in the observation model [23, 157]. The choice is further complicated by the presence of noise in the measurements. When the variance of a term's contribution is comparable to the variance of the noise, discerning the optimum parameter values, or even the optimum regressor sets, for large numbers of regressors is not always possible. Fig. 5.6 shows the estimated set of best regressors for models with seven or fewer regressor functions, where dark squares indicate the presence of a particular regressor function in the model. The RMSE of the prediction error is shown in Fig. 5.7. The regressor sets were obtained using the EvolOBE method and a realization of Eq. (4.23) of 1024 consecutive epochs at 15 dB SNR, with two delay taps (the exact value); the linear regressors were expanded to a polynomial order of ten, which results in a total of 1000 candidate regressor functions.

Note how u⁶[n−1] is present in the model with two regressors, even though it is not present in the observation model. In fact, in this realization, replacing u⁶[n−1] with either u²[n−1] or u[n−1]u[n−2] yields slightly worse RMSE: the model containing u⁶[n−1] reaches −11.89 dB, while the ones containing u²[n−1] or u[n−1]u[n−2] reach −11.88 dB and −11.38 dB, respectively. This can be interpreted as u⁶[n−1] being better able to fit the missing terms than either u²[n−1] or u[n−1]u[n−2] individually, given the distributional characteristics of u²[n−1] and u[n−1]u[n−2] and the particular realization being used. However, u²[n−1] and u[n−1]u[n−2] are synergistic in the sense that together they provide a better fit to the data than the combination of u⁶[n−1] with any other regressor function, as shown by the presence of both regressor functions in all subsequent models.

Figure 5.6: Estimated regressor functions present in the best models (rows: y[n−1]u[n−1], u²[n−1], u[n−1]u[n−2], y³[n−1]u[n−1], y⁵[n−1]u[n−1], u⁶[n−1], y⁷[n−1]u[n−1], y²[n−1]y³[n−2]u³[n−1]u²[n−2], y[n−1]y⁸[n−2]u[n−2]; columns: number of regressors)

Figure 5.7: Estimated Pareto front for 15 dB SNR
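The mechanics of this kind of comparison, fitting competing regressor subsets by least squares and ranking them by prediction RMSE, can be sketched in a few lines. The model below is a simple stand-in with a tanh(⋅) term, not the actual Eq. (4.23), and the candidate sets, coefficients, and noise level are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1024
u = rng.uniform(-1.0, 1.0, N)
y = np.zeros(N)
for n in range(2, N):                        # stand-in nonlinear observation model
    y[n] = 0.5 * y[n - 1] + np.tanh(y[n - 1]) * u[n - 1] + 0.1 * u[n - 2]
y = y + 0.05 * rng.standard_normal(N)        # additive measurement noise

# Lagged signals aligned with the current sample index n.
y1 = np.concatenate(([0.0], y[:-1]))         # y[n-1]
u1 = np.concatenate(([0.0], u[:-1]))         # u[n-1]
u2 = np.concatenate(([0.0, 0.0], u[:-2]))    # u[n-2]

def rmse_db(columns):
    """Least-squares fit of a linear-in-parameters model; prediction RMSE in dB."""
    X = np.column_stack(columns)[2:]          # drop start-up samples
    t = y[2:]
    theta, *_ = np.linalg.lstsq(X, t, rcond=None)
    return 10 * np.log10(np.mean((t - X @ theta) ** 2))

print(rmse_db([y1, y1 * u1, u2]))            # subset containing the generative terms
print(rmse_db([y1, y1 * u1, u1 ** 2]))       # subset with a correlated substitute term
```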
Also in Fig. 5.6, note that other terms of the Maclaurin series of tanh(y[n−1])u[n−1] are present, such as y³[n−1]u[n−1] and y⁵[n−1]u[n−1], but further terms provide so little improvement in the prediction that, even when they do appear in a later model (e.g., y⁷[n−1]u[n−1]), they appear in conjunction with spurious regressor functions (e.g., y[n−1]y⁸[n−2]u[n−2]). The same is true for y⁹[n−1]u[n−1], which appears in the estimated best model with eight regressors (not shown in Fig. 5.6 for clarity) alongside the spurious term y²[n−1]y²[n−2]u³[n−1]u[n−2].

While the challenge of choosing appropriate regressor functions for nonlinear estimation models is unique to modeling nonlinear observation models, a similar challenge exists whenever input signals are missing in any regression problem, linear or not. An intuitive example is given in Sec. 4.2, where increasing a temporally correlated quadratic term of the past value of x2 decreased the linear NC measure from x2 into x1, because the increase made x1 more temporally correlated with itself while the quadratic term in x2 is uncorrelated with x2.

The first test used a noise-free realization of Eq. (4.23) of 1024 consecutive epochs, with two delay taps (the exact value); the linear regressors were expanded to a polynomial order of eight, which results in a total of 494 candidate regressor functions. The histogram of the number of evaluations needed to find the best models with eight or fewer regressor functions is shown in Fig. 5.8, and the fitted log-normal probability density is shown in Fig. 5.9. The number of evaluations needed was reduced by 73% at the 99% confidence level and by 79% at the 99.9% confidence level.

Figure 5.8: Histogram of required evaluations for RSX and LTX

Figure 5.9: Fitted PDF of the required evaluations for RSX and LTX

It is important to note that there are cases in which LTX performs worse than RSX in terms of the required number of evaluations. Figs. 5.10 and 5.11 show the estimated CDFs of the number of required evaluations. Fig. 5.10 demonstrates that LTX clearly outperforms RSX under most circumstances. However, looking closely at the region of fewer than 15,000 evaluations (shown in Fig. 5.11), RSX outperforms LTX in about 3% of cases. This behavior can be traced back to the assumption of LTX that the current population contains characteristics of the desired solutions. In the first few evaluations this assumption is less valid, which leads to some runs requiring more evaluations to find the desired solutions. There are mitigation measures to avoid this increase in evaluations and to hasten the search overall, such as waiting a few generations before switching to LTX, or using a greedy or suboptimal algorithm to find the initial candidate solution set fed into the LTX operator, such as the bitwise hill-climber algorithm employed in the seminal LTGA paper [192].
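For reference, the kind of bitwise hill climber used to seed LTGA can be sketched as follows. The fitness function here is a placeholder; in the EvolOBE setting it would be replaced by the model-fitness evaluation.

```python
import random

def hill_climb(bits, fitness, rng):
    """First-improvement bitwise hill climber: flip single genes while fitness improves."""
    best = fitness(bits)
    improved = True
    while improved:
        improved = False
        for i in rng.sample(range(len(bits)), len(bits)):  # visit genes in random order
            bits[i] ^= 1                                    # flip one gene
            f = fitness(bits)
            if f > best:
                best = f
                improved = True
            else:
                bits[i] ^= 1                                # revert a non-improving flip
    return bits, best

rng = random.Random(0)
# Placeholder fitness: reward active genes in the first five positions, penalize the rest.
fitness = lambda b: sum(b[:5]) - sum(b[5:])
chromosome = [rng.randint(0, 1) for _ in range(20)]
print(hill_climb(chromosome, fitness, rng))  # converges to ones in front, zeros elsewhere
```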
A review of LTGA variants is given by Goldman and Tauritz in [73].

Figure 5.10: CDF of the required number of evaluations to find the desired solution

Figure 5.11: Close-up for fewer than 15,000 evaluations

For simulations with lower SNR, the improvements are less pronounced. Fig. 5.12 shows the estimated CDFs of the required number of evaluations for RSX and LTX. The CDFs intersect at 94.7%, before which RSX requires fewer evaluations. At the 99% and 99.9% confidence levels, LTX still outperforms RSX, with reductions of 7.5% and 15%, respectively, in the required numbers of evaluations. Nevertheless, this small reduction in the number of required evaluations is negated by the additional computational cost of building the linkage tree. The overhead added by LTX is not negligible and increases with the population size and the number of regressor functions. This drawback is less evident when the cost of evaluating the fitness function is large in comparison with the computation of the linkage tree. With lower SNR, nonlinear terms with smaller contributions are obfuscated by the noise, and thus the final set of candidate solutions cannot include these terms with certainty. Finding this smaller set of solutions requires less exploration and gives LTX less opportunity to improve the search.

Figure 5.12: CDF of the required number of evaluations for 15 dB SNR

Fig. 5.13 shows the comparison between the estimated Pareto fronts for the dataset expanded to polynomial order 10. Note how the noise-free case continues to improve significantly until seven regressors are included, while at 15 dB SNR the improvements are greatly diminished beyond four regressors. The prediction error is not reduced beyond seven regressors because the search space was limited to tenth-order polynomials; the next term (y¹¹[n−1]u[n−1]) would require an expansion to order 12, increasing the chromosome size to 1819 genes and the search space to over 10^547 possible solutions (since 2^1819 ≈ 10^547).

Figure 5.13: Comparison between estimated Pareto fronts for different SNR values

5.6 Application to NNC analysis

Once the final set of models is obtained, these models can be used for NNC analysis. One of the advantages of the biobjective optimization approach is that, at the end of the optimization process, the algorithm provides the set of best models for different levels of tradeoff between complexity and predictive power. In Fig. 5.14, the NNC values are given for the observation model of Eq. (4.23) at 10 dB SNR. The NNC values computed with the observation model (found in Table 4.8) are 0.29 and 0.55. Note that the estimated values converge quickly: the model with four regressors is already very close to the expected values, with negligible changes for five or more regressors.

Figure 5.14: NNC values for the final candidate model set for 10 dB SNR

In Fig. 5.15, the NNC values are given for the same model at 50 dB SNR. The NNC values using the observation model (found in Table 4.9) are 0.31 and 0.69.
Again, the values converge quickly and do not diverge in the observed range, as the contributions of the higher-order terms are small compared with those of the first four chosen regressors. Note that in situations where the observation model has a large SNR, the GC value would keep increasing as the polynomial expansion order increases, even when the contribution of the new terms is small, finally converging when the variance of the residual becomes comparable to that of the noise. A small residual also causes the sum of the NNC values to approach unity, but Figs. 5.14 and 5.15 show that the NNC values do not vary much as the residuals become smaller.

Figure 5.15: NNC values for the final candidate model set for 50 dB SNR

In the studied cases, NNC performs well with simpler estimation models, provided the estimated parameters represent the internal mechanism of the observation models well. This echoes Ljung's advice to "try simple things first" [128], even though those models are "wrong."³

³ But "useful" [31].

5.7 Discussion and conclusions

In this chapter, modified crossover and mutation operators were presented for use in NARMAX model estimation as part of ongoing improvements to EvolOBE. These modifications yield significant performance improvements over the pure NSGA-II algorithm for this application. The operators take advantage of posited characteristics of the population and of the final solution sets, such as sparsity and pairwise relationships between genes. When a modeling problem provides little guidance on the selection of an effective model form, a GA must search a wide space of candidate features yet determine a reliable and consistent solution in a limited number of generations. The 99% and 99.9% confidence metrics resulting from the modified search methods provide a stronger measure of performance than the estimated mean and variance, even though the confidence information is theoretically inherent in those two statistics.

The asymmetric mutation operator guides the mutation toward arbitrarily sparse solutions for any desired mutation rate. Tests have shown the asymmetric mutation operator to be an effective way to reduce the number of required evaluations. The change in the mutation operator also does not prevent exploration: it simply increases the probability that mutation produces offspring near the target sparsity, without preventing offspring with other sparsity levels.

The modified crossover operators increase the search speed by restricting crossover to productive locations (RSX) or by finding crossover masks that are more likely to produce fit offspring (LTX). The LTX operator estimates pairwise proximity between genes in the current population to define crossover masks. In the simulations presented here, LTX required fewer evaluations to find the desired set of solutions at high confidence levels, but a small percentage of simulations completed faster when using RSX exclusively. Since LTX relies on the population to provide useful information about linkage between genes, a minimum number of evaluations is required before the population can provide such information. More complex crossover operators that use information beyond pairwise linkage, such as covariance matrix adaptation [85], are very powerful, but require even more evaluations before the crossover operator can perform appropriately.
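To make the RSX idea concrete, a minimal sketch is given below: the two crossover points are drawn from the positions where the parents differ, so each offspring is guaranteed to receive at least one non-matching allele. The representation and names are illustrative only.

```python
import random

def reduced_surrogate_crossover(p1, p2, rng):
    """Two-point crossover with cut points drawn from positions where the parents differ."""
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    if not diff:                        # identical parents: no novel offspring possible
        return p1[:], p2[:]
    a, b = rng.choice(diff), rng.choice(diff)
    lo, hi = min(a, b), max(a, b) + 1   # exchanged segment holds at least one differing allele
    c1 = p1[:lo] + p2[lo:hi] + p1[hi:]
    c2 = p2[:lo] + p1[lo:hi] + p2[hi:]
    return c1, c2

rng = random.Random(3)
parent1 = [0, 1, 1, 0, 1, 0, 0, 1]
parent2 = [0, 1, 0, 0, 1, 1, 0, 1]
print(reduced_surrogate_crossover(parent1, parent2, rng))
```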
At lower SNR values, the improvements given by LTX are less pronounced. Additionally, a larger set of regressor functions and the noise also adversely affect the parameter estimation, which further indicates the need for parsimony in the final candidate model set.

The resulting models were used to compute NNC values and to compare them with the NNC values obtained from the observation model parameters. Since NC and NNC are susceptible to errors in the model estimation, it is important to consider carefully the estimation models used to compute these measures. In the tests, the EvolOBE method produced NNC estimates in very good agreement with the theoretical values. Because the EvolOBE method produces the set of most accurate models for any number of regressors, NNC can be estimated for the entire set of fittest models for comparison and analysis.

At this point, some characteristics of the algorithm have not yet been explored. For example, the sparsity parameter for the asymmetric mutation operator is currently fixed at the beginning of the run, but it could potentially be set dynamically by observing the population and/or the evolution. The algorithm also does not account for the relationships among regressors (e.g., y[n−1] and y³[n−1]), which could provide useful information and would likely result in further improvement, especially when modeling non-polynomial regressor functions with polynomial expansions.

CHAPTER 6

CONCLUSION

6.1 Overview

Causality analysis is a very important area of study, ranging from philosophy and econometrics to physics, neurology, and engineering. The topic is highly debated and somewhat controversial. Indeed, a concise universal definition of causality or of causality measures has not been reached. This work focuses on statistical methods for evaluating evidence of causality rather than on the philosophy of causality. This work has two synergistic goals: the characterization and development of a causality measure for nonlinear parametric models, and the investigation of an evolutionary search algorithm for finding sets of the best nonlinear parametric models at different levels of tradeoff between complexity and predictive power. NC is shown to be sensitive to parameter estimation error and prone to bias, which is compounded when extending NC to nonlinear models, so a method of finding and comparing models complements NNC by supplying the optimum set of models for NNC estimation.

NC is a recent method for assessing causality between signals in parametric models. In this work, a thorough critical study of NC and a nonlinear extension of NC are presented. In summary, NC does have advantages over GC and similar causality measures: it is more proportional to internal model parameters, it is normalized, and, unlike CGC [95], it does not require a choice of the order of the conditioning signals. In Ch. 4, the seminal definition of NC is extended to cover all LTIiP models with a flexible weighting method that reduces to the seminal definition for LTI models.

This work also explores aspects of NC that have been overlooked in the seminal papers. In much of the literature surrounding causality, the distinctions among systems, observation models, and estimation models are often not clearly stated. Although very powerful methods for parameter estimation exist, estimated models are not the systems they represent and should not be taken as anything greater (or lesser) than what they are: a representation.
At the risk of repeating a truism, "all models are wrong, but some are useful." As shown in Ch. 3, the usefulness of NC estimates is strongly tied to the quality of the estimated models. This is arguably even more consequential for nonlinear model estimation, as nonlinear models entail increased difficulty in accurately estimating the parameters and selecting regressor sets.

Another aspect that is often overlooked is the validity of some models found in the literature. Models that are not representative of practical applications should not be used to compare causality analysis tools unless their use is justified. Sec. 3.2 contains a list of example models and a discussion of their validity. These two overlooked aspects of the NC literature are unfortunate, especially because they undercut the argument for the unique characteristics of NC. The models shown in Sec. 3.2 could mislead a reader into thinking that NC is superior only in such impractical scenarios, which, without overlooking the merits of alternative methods, is not true in general. NC is unique in comparison to other methods in that NC values depend much more on internal model parameters and, provided the models represent the internal dynamics of the system well, NC can better measure causal relationships for systems with quasi-periodic and slow dynamics.

In its seminal form, NC was fully defined only for ARX models. In Ch. 4, a nonlinear extension of NC was presented. For models with strong nonlinearities, the seminal form can behave counterintuitively, as shown in Fig. 4.1. The extension presented in this work, NNC, produces results that are in line with intuition (shown in Fig. 4.2) and shares all the strengths of NC while allowing application to a much wider set of models. As is the case with NC (and GC), NNC can also be spectrally expanded into a frequency-dependent measure. The definition of NNC also offers a flexible approach to partitioning the contribution of nonlinear regressor functions that depend on more than a single regressor signal. Tests were conducted on synthetic and real data with promising results.

With the need for a robust nonlinear model estimation framework having been demonstrated, improvements to the EvolOBE method are reported in Ch. 5. The EvolOBE method combines a genetic search algorithm for regressor selection with a set-theoretic approach for parameter estimation. In this work, enhanced mutation and crossover operators are described and introduced into the EvolOBE method. The introduction of these new operators is shown to increase convergence speed, decrease the number of evaluations needed for convergence, and reduce the variance of the number of evaluations needed to reach high confidence levels.

6.2 Contributions

The major contributions of this work are the following:

1. Shown that NC is susceptible to two sources of variation: natural variation in the specific realization (e.g., differences between the sample variances and the observation-model variances) and parameter estimation errors. In the same tests, GC was shown to be significantly more robust to errors in the parameter estimation;

2. Shown that NC is prone to bias in the estimates, and that this bias increases with parameter estimation errors;

3. Analytically explored four cases of the sources of bias in NC estimates, including regularization;

4. Provided an extension of NC to the set of all LTIiP models, which are considered interpretable and transparent [200]. This enables the use of NC in a much wider range of applications.
The extension is equivalent to the seminal definition for linear models and can be spectrally expanded in the same way as the seminal definition. The extension is applied to real data (EEG signals) with encouraging results;

5. Introduced new operators into the EvolOBE method that significantly reduce the computation time and the number of evaluations required to reach convergence.

6.3 Future Work

The large improvements seen in the EvolOBE method are encouraging and indicate that further improvements are possible. In particular, further enhancements to the mutation and crossover operators are likely to yield significant benefits to the genetic search.

The most current variant of the EvolOBE method is blind to the particular relationships between regressor functions. When Volterra expansions of the regressor signals are used on signals whose observation model has non-polynomial nonlinear terms, the optimal regressor functions are often related through the regressor signals they share. Implementing a method to account for these relationships is likely to further improve convergence.

In its current form, the multi-objective adaptation of LTX treats all situations where the offspring neither dominate nor are dominated by the parents by randomly selecting whether to keep the offspring or the parents. Using a different heuristic to guide this choice might help speed up the genetic search. Also, candidate models with larger numbers of active genes require longer computations than models with fewer active genes. Currently, the algorithm waits until all candidate models are evaluated before proceeding, reducing the computational efficiency in parallel computing environments. Enhancements to the computational efficiency are possible and have not been explored. Additionally, the set-membership parameter estimation provides other indicators of set quality, such as bounds for each parameter or the size and shape of the final ellipsoid. These indicators have not yet been studied as a complement to or substitute for the currently used fitness functions.

Nonlinear NC is a new technique and its application has not yet been fully explored. The ability to describe the effect of a single regressor on the regressand in a complex function could have application areas outside of causality analysis, such as multi-criterion decision making. The normalized nature of NNC and its sensitivity to changes in the model parameters make it particularly suitable because the results are more easily interpretable. While the susceptibility of NC to bias in the estimates has been demonstrated, a detailed statistical characterization of NC has not yet been developed. Such a characterization could lead to enhanced significance tests for NC and a better understanding of how it relates to other causality analysis tools.

APPENDICES

APPENDIX A

DERIVATION OF CLOSED-FORM EXPRESSIONS FOR GC AND NC FOR FIRST-ORDER BIJOINTLY REGRESSIVE OBSERVATION MODELS

A.1 Overview

Closed-form solutions for the GC and NC measures are useful in evaluating the relative performance of the techniques. In [95], closed-form expressions for GC and NC are derived for certain first-order ARX observation models. However, no general formula is given for NC or GC, and GC is only asymptotically evaluated for large M. Closed-form expressions for GC depend on M, and the process of obtaining them is laborious, but understanding the intricacies of GC and NC provides insight into what each technique measures.
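Before turning to the derivations, it is worth noting that the quantities derived below can also be checked numerically. The sketch below simulates a first-order bivariate model with illustrative parameter values, fits the joint (ARX) and disjoint (AR) models by ordinary least squares, and evaluates GC for M = 1 together with the large-M lower bound of Eq. (A.18); it is an illustration only, not part of the derivations.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
a11, a12, a21, a22 = 0.5, 0.4, 0.3, 0.2   # illustrative observation-model parameters
s1, s2 = 1.0, 1.0                         # noise standard deviations

x1 = np.zeros(N)
x2 = np.zeros(N)
for n in range(1, N):
    x1[n] = a11 * x1[n-1] + a12 * x2[n-1] + s1 * rng.standard_normal()
    x2[n] = a21 * x1[n-1] + a22 * x2[n-1] + s2 * rng.standard_normal()

def residual_variance(target, columns):
    """Least-squares fit of target on the given lagged regressors; residual variance."""
    X = np.column_stack(columns)
    theta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.var(target - X @ theta)

# Joint (ARX) model: x1[n] on x1[n-1] and x2[n-1]; disjoint (AR) model: x1[n] on x1[n-1].
var_joint = residual_variance(x1[1:], [x1[:-1], x2[:-1]])
var_disjoint = residual_variance(x1[1:], [x1[:-1]])
print("GC(2->1), M = 1       :", np.log(var_disjoint / var_joint))
print("Lower bound, Eq.(A.18):", np.log(1 + a12**2 * s2**2 / s1**2))
```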
In [95], it is argued that GC does not depend on the feedback loop formed by the product of a21 and a12, which reflects a coupling between x1 and x2. The argument is supported by a closed-form expression given for GC in [95, Eq. 13] for a particular form of first-order ARX model, an expression that does not include any term depending on the product. However, this expression is only true if GC is allowed to compare models of unlimited order. When the model orders are finite, the expression for GC does depend on a21a12.

A large portion of this appendix is quoted directly from [147], with a few modifications for improved flow and clarity. Long equations are placed at the end of the appendix.

A.2 Derivations

To increase clarity, the time-delay superscript i is omitted: for all first-order models, a_pq^i = 0 for i > 1, so a_pq is used to mean a_pq^1. In this appendix, superscripts denote exponents instead. The GC and NC measures can be evaluated in both directions (i.e., x1 → x2 and x2 → x1). For simplicity, only the x2 → x1 direction is treated; the derivation for x1 → x2 follows the same basic steps.

First, the observation model is defined as

x_1[n] = a_{11}^* x_1[n-1] + a_{12}^* x_2[n-1] + \eta_1^*[n],
x_2[n] = a_{21}^* x_1[n-1] + a_{22}^* x_2[n-1] + \eta_2^*[n],   (A.1)

where η1* and η2* are discrete-time white noise processes with zero mean. The two estimated models compared for GC estimation follow an ARX model [Eq. (2.35)] under the joint-case assumption and an AR model under the disjoint-case assumption. The ARX estimated model for the joint case is of the form

x_1[n] = a_{11} x_1[n-1] + a_{12} x_2[n-1] + \eta_1[n],
x_2[n] = a_{21} x_1[n-1] + a_{22} x_2[n-1] + \eta_2[n],   (A.2)

where η1 and η2 are discrete-time white noise processes with zero mean and the a_pq are the estimated model parameters. The AR estimated model for the disjoint case is of the form

x_1[n] = \sum_{m=1}^{M} \alpha_m x_1[n-m] + \epsilon_1[n],   (A.3)

where ε1 is a discrete-time white noise process with zero mean and the α_m are the autoregressive model parameters. These estimated models are used in the following derivations. The first derivation is a generalization of the closed-form expression given in [95, Eq. (12)], in which a11 and a22 are equal to zero and the estimated model order is unconstrained. For simplicity, the analysis assumes that enough epochs are available that the sample variances and the variances can be taken as equal, and that the estimated models are the MMSE estimators, so that a_pq ≈ a_pq^* (p, q ∈ {1, 2}) for the joint model. These assumptions are not reasonable in many circumstances, but they still provide insight into the "ideal" GC and NC estimates. Nevertheless, it is important to reinforce the point made in Sec. 2.2 that observation models and estimation models must not be confused, even when the parameter estimation is assumed to be "perfect."

A.2.1 Derivation of the GC value for M = 1 and M = 2

Obtaining the GC value for different values of M is tedious but not complicated. First, the expected values of the variances of x1 and x2 and of the covariance between x1 and x2 are calculated. Since η1 and η2 are white and zero mean,

E\{x_1[n] \cdot x_1[n]\} = \sigma_1^2 = \frac{a_{12}^2 \sigma_2^2 + 2 a_{11} a_{12} \sigma_{12}^2 + \sigma_{\eta_1}^2}{1 - a_{11}^2},
E\{x_2[n] \cdot x_2[n]\} = \sigma_2^2 = \frac{a_{21}^2 \sigma_1^2 + 2 a_{21} a_{22} \sigma_{12}^2 + \sigma_{\eta_2}^2}{1 - a_{22}^2},
E\{x_1[n] \cdot x_2[n]\} = \sigma_{12}^2 = \frac{a_{11} a_{21} \sigma_1^2 + a_{12} a_{22} \sigma_2^2}{1 - a_{11} a_{22} - a_{12} a_{21}},   (A.4)

where E{·} denotes the expectation operator and σ12² denotes the covariance between x1 and x2. Solving this system yields Eq. (A.5).
The covariance between x1[n−1] and x1[n] can be expressed succinctly in terms of σ1² and σ12² as

E\{x_1[n-1] \cdot x_1[n]\} = a_{11}\sigma_1^2 + a_{12}\sigma_{12}^2,   (A.6)

so that, by evaluating the conditional distribution of x1[n] given only x1[n−1] [64], the variance of ε1 becomes

\sigma_{\epsilon_1}^2 = (1 - a_{11}^2)\sigma_1^2 - 2 a_{11} a_{12} \sigma_{12}^2 - a_{12}^2 \frac{\sigma_{12}^4}{\sigma_1^2}.   (A.7)

In this case, the GC value obtained when fitting first-order disjoint and bijointly regressive models becomes

GC_{2\to 1} = \ln\!\left[ \frac{(1 - a_{11}^2)\sigma_1^2 - 2 a_{11} a_{12} \sigma_{12}^2 - a_{12}^2 \sigma_{12}^4 / \sigma_1^2}{\sigma_{\eta_1}^2} \right],   (A.8)

which can be expanded into Eq. (A.9). The expression demonstrates clearly that GC does take the a12a21 feedback loop into consideration. Although analyzing the contribution of these terms using Eq. (A.9) is not straightforward, the terms in Eq. (A.8) that depend on a12σ12² can be shown to contain a12a21 by using Eq. (A.5).

A similar approach can be taken to evaluate GC using higher-order models. This is done by evaluating E{x1[n−Δn]·x1[n]} for Δn ∈ [1, …, M] and using the conditional distributions to obtain the expected σ_{ε1}². One helpful identity for evaluating these covariances is

E\left\{ \begin{bmatrix} x_1[n] \\ x_2[n] \end{bmatrix} x_1[n-\Delta n] \right\} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{\Delta n} \begin{bmatrix} \sigma_1^2 \\ \sigma_{12}^2 \end{bmatrix},   (A.10)

where [A]^{Δn} = A·A⋯A (Δn times). For the sake of brevity, a detailed derivation for higher orders is omitted, but the expression for M = 2 is found in Eq. (A.11).

Increasing M causes GC to decrease monotonically, which can be explained intuitively by recalling that GC compares two models, an AR model and an ARX model. For the ARX model, there should be no improvement of higher-order estimated ARX models over the first-order estimated ARX model, since the observation model is itself a first-order ARX model.¹ Meanwhile, AR models cannot perfectly mimic the dynamics of ARX models. While a second-order AR model is also unable to represent an ARX model perfectly, it predicts x1[n] better than a first-order AR model; similarly, the residual variance of a third-order AR model is smaller than (or equal to) that of the second-order model. The residual variance of the ARX model equals ση1² for any M ≥ 1, but the residual variance of the AR model decreases monotonically with M. Thus, the GC value for M = 2 is lower than that for M = 1.

¹ Nevertheless, a larger number of model parameters leads to a larger variance in the parameter estimation.

A.2.2 Derivation of a lower bound on the GC measure for large M

The fact that the GC value decreases monotonically with the model order is well known [95] and can be argued qualitatively, as is done in Sec. A.2.1. However, it is helpful to define a lower bound for the GC value, so that the range of possible GC values for any M is known; that is, GC_{M→∞} ≤ GC_{M∈Z+} ≤ GC_{2→1}|_{M=1}. Since GC compares the sample variances of the error sequences of a joint and a disjoint model, the two variances must be obtained. For the joint model, the expected variance of the error sequence is simply the variance of η1 (following the assumption that a_pq ≈ a_pq^* for p, q ∈ {1, 2}). To find the disjoint model of the form of Eq. (A.3), one can start by expanding Eq. (A.1) into
x_1[n] = a_{11} x_1[n-1] + a_{12}\,\underbrace{\left(a_{21} x_1[n-2] + a_{22} x_2[n-2] + \eta_2[n-1]\right)}_{x_2[n-1]} + \eta_1[n],   (A.12)

which, for a22 ≠ 0, can be further expanded by recursively replacing the x2 terms, yielding

x_1[n] = a_{11} x_1[n-1] + a_{12} a_{21} \sum_{m=2}^{M} (a_{22})^{m-2} x_1[n-m] + a_{12} (a_{22})^{M-1} x_2[n-M] + a_{12} \sum_{m=1}^{M-1} (a_{22})^{m-1} \eta_2[n-m] + \eta_1[n].   (A.13)

For a22 = 0, the expansion reduces to

x_1[n] = a_{11} x_1[n-1] + a_{12} a_{21} x_1[n-2] + a_{12} \eta_2[n-1] + \eta_1[n],   (A.14)

and the asymptotic MMSE parameter values for the disjoint model are

\alpha_m \approx \begin{cases} a_{11} & \text{for } m = 1, \\ a_{12} a_{21} & \text{for } m = 2, \\ 0 & \text{for } m > 2, \end{cases}   (A.15)

so that the prediction error of the MMSE disjoint model is

\epsilon_1[n] = a_{12}\eta_2[n-1] + \eta_1[n].   (A.16)

Using the fact that η1 and η2 are white and uncorrelated, the variance of ε1 is

\sigma_{\epsilon_1}^2 = (a_{12})^2 \sigma_{\eta_2}^2 + \sigma_{\eta_1}^2,   (A.17)

and the GC value can be expressed as

GC_{2\to 1} = \ln\!\left( 1 + \frac{(a_{12})^2 \sigma_{\eta_2}^2}{\sigma_{\eta_1}^2} \right),   (A.18)

which does not depend on a21 or a11. This expression was used by Hu et al. in [95] to argue that GC overlooks important parameters of the model. It is important to remember, however, that this expression is only valid for a22 = 0 and large M; nevertheless, as shown below, the expression is still useful for establishing a lower bound.

Returning briefly to Eq. (A.12), note that for any M > 1 the x1[n−1] and x1[n−2] terms are available to the AR model and therefore add no additional residual error. Note also that the η1[n] and η2[n−1] terms cannot be predicted in any way by the AR model. This is a consequence of η1[n] being uncorrelated with any past value of η1, and of η2[n−1] being correlated only with x1[n], not with x1[n−1] or any other past value of x1. Thus, the only remaining question is how well the x2[n−2] term can be predicted from past values of x1.

Supposing that x2[n−2] could be predicted perfectly from past values of x1 produces a prediction error equivalent to Eq. (A.16). Although this is exactly true only for a22 = 0, it follows that for any M ≥ 1,

\sigma_{\epsilon_1}^2 \geq (a_{12})^2 \sigma_{\eta_2}^2 + \sigma_{\eta_1}^2,   (A.19)

which shows that Eq. (A.18) is indeed a lower bound for all GC values with M ≥ 1 and is the asymptotic value for large M when a22 = 0.

A.2.3 Derivation of the NC value

In [95, Eq. (21)], a partial expansion of NC was given for models in which a11 = a22 = 0. The motivation for eliminating these parameters is to simplify the expressions, but the simplification arguably reduces the representativeness of the model [147]. Here, the general expression is shown. Expanding the difference equations as in Eq. (A.12),

NC_{2\to 1} = \frac{\sum_{n=3}^{N} \left( a_{12} a_{21} x_1[n-2] + a_{12} a_{22} x_2[n-2] + a_{12} \eta_2[n-1] \right)^2}{\sum_{n=3}^{N} \left( a_{12} a_{21} x_1[n-2] + a_{12} a_{22} x_2[n-2] + a_{12} \eta_2[n-1] \right)^2 + \sum_{n=3}^{N} \eta_1^2[n]},   (A.20)

which shows the clear dependence of NC on the a12a21 term. This expression is only valid for observation models with a11 = 0. A general expression for first-order bijointly variate models is

NC_{2\to 1} = \frac{\sum_{n=2}^{N} \left( a_{12} x_2[n-1] \right)^2}{\sum_{n=2}^{N} \left( a_{12} x_2[n-1] \right)^2 + \sum_{n=2}^{N} \left( a_{11} x_1[n-1] \right)^2 + \sum_{n=2}^{N} \eta_1^2[n]},   (A.21)

which, as N → ∞, converges to

NC_{2\to 1} = \frac{a_{12}^2 \sigma_2^2}{a_{12}^2 \sigma_2^2 + a_{11}^2 \sigma_1^2 + \sigma_{\eta_1}^2}.   (A.22)

This expression shows the dependence of NC on the product a12a21.
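The convergence of Eq. (A.21) to Eq. (A.22) is easy to check numerically. In the sketch below, the observation model is simulated with illustrative parameter values, Eq. (A.21) is evaluated with the true parameters and noise sequence, and sample variances stand in for σ1² and σ2² in Eq. (A.22).

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000
a11, a12, a21, a22 = 0.5, 0.4, 0.3, 0.2   # illustrative parameters
eta1 = rng.standard_normal(N)             # unit-variance white noise
eta2 = rng.standard_normal(N)

x1 = np.zeros(N)
x2 = np.zeros(N)
for n in range(1, N):
    x1[n] = a11 * x1[n-1] + a12 * x2[n-1] + eta1[n]
    x2[n] = a21 * x1[n-1] + a22 * x2[n-1] + eta2[n]

# Finite-sample NC per Eq. (A.21), using the true parameters and noise sequence.
num = np.sum((a12 * x2[:-1]) ** 2)
den = num + np.sum((a11 * x1[:-1]) ** 2) + np.sum(eta1[1:] ** 2)
nc_sample = num / den

# Asymptotic NC per Eq. (A.22), with sample variances in place of the analytic ones.
nc_limit = (a12**2 * np.var(x2)) / (a12**2 * np.var(x2) + a11**2 * np.var(x1) + 1.0)
print(nc_sample, nc_limit)   # the two values agree closely for large N
```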
It is important to note that the variances σ1² and σ2² themselves depend on the model parameters, as Eq. (A.5) shows. A change in any of the model parameters therefore also changes σ1² and σ2², so the interaction between the model parameters and the NC values is likewise not straightforward. Combining Eqs. (A.5) and (A.22) yields Eq. (A.23).

A.3 Discussion

In this appendix, closed-form expressions for NC and GC are derived. The GC expressions are given for M = 1 and M = 2, together with an asymptotic expression for large M. The technique can be extended to any order, although the complexity of the closed-form expressions grows with M. Although the process of obtaining these expressions is laborious, they can be evaluated numerically quickly and accurately. The expressions show that, for finite M, GC does indeed depend on the product of a12 and a21. In fact, the closed forms reveal many intricate relationships between the model parameters and the GC values. Because the interaction between the model parameters and the GC values is not straightforward, expressions in terms of η1 and η2 and in terms of x1 and x2 are both provided. When performing theoretical analysis of GC and NC estimation, it is helpful to be able to evaluate the analytical values for comparison; these closed-form expressions are used in Ch. 3 to compare the effects of estimation errors and sample variances on GC and NC estimates.

[Eq. (A.5): closed-form solutions for σ1², σ2², and σ12², expressed as ratios of polynomials in the model parameters and the noise variances ση1² and ση2², with the common denominator (1 + a12a21 − a11a22)(1 − a11 − a12a21 − a22 + a11a22)(1 + a11 + a22 + a11a22 − a12a21).]

[Eq. (A.9): closed-form expression for GC2→1 with M = 1, obtained by substituting Eq. (A.5) into Eq. (A.8).]

[Eq. (A.11): closed-form expression for GC2→1 with M = 2.]

[Eq. (A.23): closed-form expression for NC2→1, obtained by combining Eqs. (A.5) and (A.22).]

APPENDIX B

LISTINGS FOR ALGORITHMS

B.1 Overview

This appendix contains the listings for key algorithms used in this work. Deeper discussion and more thorough descriptions of the algorithms are found in the references.

B.2 OBE-related algorithms

The unified OBE framework is described and discussed more thoroughly in [54]; a summary is given here for reference.
The general algorithm follows a recursion similar to weighted recursive least squares (WRLS) [53, 54, 101], but with dynamically evaluated optimal forgetting-factor calculations. The algorithm shown here assumes a MISO model, since this is the focus of this work, but UOBE is defined for general MIMO models in [54]. Given a sequence of error bounds γ[n] [as in Eq. (2.27)], an output signal x_p[n], and a vector of regressors (or regressor functions) φ_p[n], the UOBE framework is given in Alg. B.1 and its recursion in Alg. B.2.

The optimum weights are selected according to different optimization criteria. With the exception of the Dasgupta-Huang OBE [50], which optimizes κ[n], the algorithms under the UOBE umbrella choose the weights that minimize either the determinant of κ[n]P[n] (proportional to the square of the volume of the ellipsoid) or the trace of κ[n]P[n] (proportional to the sum of the squares of the semi-axes of the ellipsoid). Defining q[n] = β[n]/α[n], the weights that minimize the volume, if they exist, are obtained by finding the unique positive root of

F_v(s) = a_2 s^2 + a_1 s + a_0,   (B.1)

where

a_2 = (K - 1)\,\gamma[n]\,G^2[n],   (B.2)
a_1 = \left[(2K - 1)\,\gamma[n] + \|\epsilon[n]\|^2 - \kappa[n-1]\,G[n]\right] G[n],   (B.3)
a_0 = K\left[\gamma[n] - \|\epsilon[n]\|^2\right] - \kappa[n-1]\,G[n],   (B.4)

such that F_v(q[n]) = 0. When no such root exists, none of the ellipsoids that contain the intersection between the hyperstrip and the previous ellipsoid has a smaller volume than the current ellipsoid. Equivalently, the positive root identifies the value of q[n] that defines the ellipsoid of smallest volume among all ellipsoids that fully contain the intersection between the previous ellipsoid and the hyperstrip. Note that setting α[n] to unity and β[n] to zero is equivalent to ignoring or discarding the current x_p[n] and φ_p[n] and making no change to the ellipsoid.

To minimize the sum of the squares of the semi-axes, the optimum weights, if they exist, are obtained by finding the unique positive root of

F_t(s) = b_3 s^3 + b_2 s^2 + b_1 s + b_0,   (B.5)

where

b_3 = \gamma[n]\,G^2[n]\left[G[n] - I[n-1]H[n]\right],   (B.6)
b_2 = 3\,\gamma[n]\,G[n]\left[G[n] - I[n-1]H[n]\right],   (B.7)
b_1 = H[n]G[n]I[n-1]\kappa[n-1] - 2H[n]I[n-1]\left[\gamma[n] - \|\epsilon[n]\|^2\right]   (B.8)
      - G[n]\|\epsilon[n]\|^2 + 3\,\gamma[n]\,G[n],   (B.9)
b_0 = \gamma[n] - \|\epsilon[n]\|^2 - H[n]I[n-1]\kappa[n-1],   (B.10)

where H[n] ≐ φ_pᵀ[n]P²[n]φ_p[n] and I[n] ≐ tr{P⁻¹[n]}, with tr{·} the trace operator.

Algorithm B.1: Unified Optimum Bounded Ellipsoid Algorithm
1: procedure UOBE
2:   θ[1] = 0                                ⊳ Set initial ellipsoid as a very large hyper-sphere centered at the origin
3:   κ[1] = 1
4:   P[1] = (1/μ) I
5:   for n = 2 to N do
6:     ε[n] = x_p[n] − θᵀ[n−1] φ_p[n]        ⊳ Calculate prediction error
7:     G[n] = φ_pᵀ[n] P[n−1] φ_p[n]          ⊳ Obs.: G[n] is a scalar
8:     Evaluate whether optimum α[n] and β[n] exist   ⊳ α[n] and β[n] are described above
9:     if optimum α[n] and β[n] exist then
10:      do UOBE-Recursion (Alg. B.2)
11:      count = 0
12:    else                                  ⊳ Ellipsoid does not change
13:      P[n] = P[n−1]
14:      θ[n] = θ[n−1]
15:      κ[n] = κ[n−1]
16:      count = count + 1
17:      if count > N_ABE then               ⊳ If using ABE
18:        do EstimateBounds (Alg. B.3)
19:        count = 0
20:      end if
21:    end if
22:  end for
23: end procedure

Algorithm B.2: UOBE Recursion
1: procedure UOBE-Recursion
2:   P[n] = (1/α[n]) [ P[n−1] − β[n] P[n−1] φ_p[n] φ_pᵀ[n] P[n−1] / (α[n] + β[n] G[n]) ]   ⊳ Update direction and shape of ellipsoid
3:   θ[n] = θ[n−1] + β[n] P[n] φ_p[n] ε[n]                                                ⊳ Update centroid
4:   κ[n] = α[n] κ[n−1] + β[n] γ²[n] − α[n] β[n] ε²[n] / (α[n] + β[n] G[n])               ⊳ Update size of ellipsoid
5: end procedure
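For readers who prefer executable code to pseudocode, the following minimal Python sketch implements the update equations of Alg. B.2. The weight-selection step is deliberately simplified to a fixed pair (α, β) together with a crude innovation check, standing in for the optimal root-finding of Eqs. (B.1)-(B.10); it illustrates the recursion only and is not the full UOBE algorithm.

```python
import numpy as np

def uobe_recursion_step(theta, P, kappa, phi, x, gamma, alpha=0.95, beta=0.05):
    """One ellipsoid update of Alg. B.2 with fixed weights (alpha, beta).

    theta : centroid (parameter estimate), shape (K,)
    P     : ellipsoid direction/shape matrix, shape (K, K)
    kappa : ellipsoid size scalar
    phi   : regressor vector at time n
    x     : observed output at time n
    gamma : error-magnitude bound at time n
    """
    eps = x - theta @ phi                          # prediction error
    G = phi @ P @ phi                              # scalar G[n]
    if eps**2 <= gamma**2:                         # simplified check in place of the
        return theta, P, kappa                     # optimal-weight existence test
    denom = alpha + beta * G
    P_new = (P - beta * np.outer(P @ phi, phi @ P) / denom) / alpha
    theta_new = theta + beta * (P_new @ phi) * eps
    kappa_new = alpha * kappa + beta * gamma**2 - alpha * beta * eps**2 / denom
    return theta_new, P_new, kappa_new

# Illustrative use: identify a two-parameter linear-in-parameters model with bounded noise.
rng = np.random.default_rng(6)
true_theta = np.array([0.7, -0.3])
theta, P, kappa = np.zeros(2), 1e3 * np.eye(2), 1.0
for _ in range(2000):
    phi = rng.uniform(-1.0, 1.0, size=2)
    x = true_theta @ phi + rng.uniform(-0.1, 0.1)  # noise magnitude bounded by gamma
    theta, P, kappa = uobe_recursion_step(theta, P, kappa, phi, x, gamma=0.1)
print(theta)   # approaches true_theta
```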
A stochastic method to estimate the error bounds is developed by Joachim et al. in [106]. The algorithm starts with an overestimated bound. If no update to the ellipsoid is made for N_ABE samples, it finds the largest error in the last N_ABE samples and reduces the bound accordingly. This is repeated until the error-bound estimate is sufficiently close to the true bound. The general algorithm is shown in Alg. B.3.

Algorithm B.3: Automatic Bounds Estimation
1: procedure EstimateBounds
2:   N_max = arg max_{m ∈ [n−N_ABE+1, n]} ε²[m]               ⊳ Find largest prediction error in the last N_ABE samples
3:   Δγ = κ[N_ABE − 1] G[N_ABE]/K − ε(2√γ[N_ABE − 1] − ε)      ⊳ Find appropriate reduction in bound for n = N_ABE
4:   if Δγ > 0 then
5:     γ[n] = γ[n−1] − Δγ                                      ⊳ If a bound reduction is possible, reduce it
6:   else
7:     γ[n] = γ[n−1]                                           ⊳ If γ cannot be reduced, keep old bounds
8:   end if
9: end procedure

B.3 Linkage tree crossover

In [192], Thierens introduces the Linkage Tree Genetic Algorithm (LTGA). The algorithm initializes the population randomly, but applies a steepest-ascent hill climber to each member of the population to increase its fitness. The resulting population undergoes crossover until the termination criterion is reached (without further mutation). The initial hill climbing is desirable so that the population can provide useful statistical pairwise linkage information to LTX. While it is possible to achieve convergence without this step, convergence is slower and the linkage points are less likely to fall at helpful locations.

The first step of LTX is generating the linkage tree; the general steps are given in Alg. B.4. The distance metric used by LTX, introduced by Kraskov et al. in [113], is a normalized mutual-information distance. Following the generation of the linkage tree, the crossover occurs; the general steps for LTX are given in Alg. B.5.

Algorithm B.4: Generate Linkage Tree
1: procedure GenerateLinkageTree
2:   Initialize each gene as one cluster
3:   repeat
4:     Compute the distance between clusters
5:     Merge the closest clusters together
6:   until only one cluster remains
7:   Organize the clustering information into a tree
8: end procedure

Algorithm B.5: Linkage Tree Crossover
1: procedure LinkageTreeCrossover
2:   Select parents
3:   Start at the largest cluster
4:   while the tree is not fully traversed do
5:     Crossover parents using the current cluster
6:     if one or more offspring are superior to both parents then
7:       Replace parents with offspring
8:     end if
9:     Move down the linkage tree and repeat
10:  end while
11: end procedure

In its seminal form, LTX is defined for single-objective optimization problems, where two solutions can be superior, inferior, or equivalent to one another. In multi-objective problems, comparing two solutions is less straightforward. Although the categories of superior (dominating),
Algorithm B.6: Linkage Tree Crossover for multi-objective problems 1: procedure LinkageTreeCrossover2 2: Select parents 3: Start at the largest cluster 4: while Tree is not fully traversed do 5: Crossover parents using the current cluster 6: if one or more offspring dominate both parents then 7: Replace parents with offspring 8: else if neither offspring dominate both parents or dominated by both then 9: Randomly decide which to keep1 10: else if one or more offspring are dominated by both parents then 11: Do not replace parents 12: end if 13: Move down the linkage tree 14: end while 15: end procedure 1 By setting the probability of each option to 1 or 0, a deterministic behavior can be set 145 BIBLIOGRAPHY 146 BIBLIOGRAPHY [1] Abdulkader, S.N., Atia, A., & Mostafa, M.S.M. (2015). Brain computer interfacing: Applications and challenges. Egyptian Informatics Journal, 16(2), 213–230. [2] Afsharnia, F., Madadi, A., & Menhaj, M.B. (2019). Iterative learning identification and control for dynamic systems described by NARMAX model. AUT Journal of Modeling and Simulation. [3] Akaike, H. (1968). On the use of a linear model for the identification of feedback systems. Annals of the Institute of Statistical Mathematics, 20(1), 425–439. [4] Akaike, H. (1969). Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21(1), 243–247. [5] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. [6] Amisigo, B.A., Van de Giesen, N., Rogers, C., Andah, W.E.I., & Friesen, J. (2008). Monthly streamflow prediction in the Volta Basin of West Africa: A SISO NARMAX polynomial modelling. Physics and Chemistry of the Earth, Parts A/B/C, 33(1-2), 141–150. [7] Ancona, N., Marinazzo, D., & Stramaglia, S. (2004). Radial basis function approach to nonlinear Granger causality of time series. Physical Review E, 70(5), 056221. [8] Anderson, S.R., Lepora, N.F., Porrill, J., & Dean, P. (2010). Nonlinear dynamic modeling of isometric force production in primate eye muscle. IEEE Transactions on Biomedical Engineering, 57(7), 1554–1567. [9] Aviyente, S., Bernat, E.M., Evans, W.S., & Sponheim, S.R. (2011). A phase synchrony measure for quantifying dynamic functional integration in the brain. Technical report, Wiley Online Library. [10] Aviyente, S., Evans, W.S., Bernat, E.M., & Sponheim, S. (2007). A time-varying phase coherence measure for quantifying functional integration in the brain. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, volume 4 (pp. IV–1169–IV–1172). [11] Aysal, T.C. & Barner, K.E. (2006). Hybrid polynomial filters for Gaussian and non-Gaussian noise environments. IEEE Transactions on Signal Processing, 54(12), 4644–4661. [12] Baccalá, L.A. & Sameshima, K. (2001). Partial directed coherence: A new concept in neural structure determination. Biological Cybernetics, 84(6), 463–474. 147 [13] Baek, E. & Brock, W. (1992). A general test for nonlinear Granger causality: Bivariate model. Technical report, Iowa State University and University of Wisconsin at Madison. [14] Barnett, L., Barrett, A.B., & Seth, A.K. (2009). Granger causality and transfer entropy are equivalent for Gaussian variables. Physical Review Letters, 103(23), 238701. [15] Barnett, L. & Seth, A.K. (2011). Behaviour of Granger causality under filtering: Theoretical invariance and practical application. Journal of Neuroscience Methods, 201(2), 404–419. [16] Barnett, L. & Seth, A.K. (2014). 
The MVGC multivariate Granger causality toolbox: A new approach to Granger-causal inference. Journal of Meuroscience Methods, 223, 50–68. [17] Barnett, L. & Seth, A.K. (2015). Granger causality for state-space models. Physical Review E, 91(4), 040101. [18] Barnett, L. & Seth, A.K. (2017). Detectability of Granger causality for subsampled continuous-time neurophysiological processes. Journal of Neuroscience Methods, 275, 93–121. [19] Barrett, A.B. & Barnett, L. (2013). Granger causality is designed to measure effect, not mechanism. Frontiers in Neuroinformatics, 7, 6. [20] Barton, M.J., Robinson, P.A., Kumar, S., Galka, A., Durrant-Whyte, H.F., Guivant, J., & Ozaki, T. (2009). Evaluating the performance of Kalman-filter-based EEG source localization. IEEE Transactions on Biomedical Engineering, 56(1), 122–136. [21] Beleites, C., Baumgartner, R., Bowman, C., Somorjai, R., Steiner, G., Salzer, R., & Sowa, M.G. (2005). Variance reduction in estimating classification error using sparse datasets. Chemometrics and Intelligent Laboratory Systems, 79(1), 91–100. [22] Billings, S.A. (2013). Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains. John Wiley & Sons. [23] Billings, S.A. & Aguirre, L.A. (1995). Effects of the sampling time on the dynamics and identification of nonlinear models. International Journal of Bifurcation and Chaos, 5(06), 1541–1556. [24] Billings, S.A. & Chen, S. (1989). Extended model set, global data and threshold model identification of severely non-linear systems. International Journal of Control, 50(5), 1897–1923. [25] Billings, S.A., Korenberg, M.J., & Chen, S. (1988). Identification of non-linear output-affine systems using an orthogonal least-squares algorithm. International Journal of Systems Science, 19(8), 1559–1568. [26] Billings, S.A. & Wei, H.L. (2005). The wavelet-NARMAX representation: A hybrid model structure combining polynomial models with multiresolution wavelet decompositions. 148 International Journal of Systems Science, 36(3), 137–152. [27] Billings, S.A. & Zhu, Q. (1994). A structure detection algorithm for nonlinear dynamic rational models. International Journal of Control, 59(6), 1439–1463. [28] Billingsley, P. (1995). Measure and Probability. John Wiley & Sons. [29] Blomgren, P. & Chan, T.F. (1998). Color TV: Total variation methods for restoration of vector-valued images. IEEE Transactions on Image Processing, 7(3), 304–309. [30] Booker, L. (1987). Improving search in genetic algorithms. Genetic Algorithms and Simulated Annealing, (pp. 61–73). [31] Box, G.E.P. (1979). Robustness in the strategy of scientific model building. In Robustness in statistics (pp. 201–236). Elsevier. [32] Box, G.E.P., Jenkins, G.M., Reinsel, G.C., & Ljung, G.M. (1970). Time Series Analysis: Forecasting and Control. John Wiley & Sons. [33] Bressler, S.L. & Seth, A.K. (2011). Wiener–Granger causality: A well established methodology. Neuroimage, 58(2), 323–329. [34] Brito, A.G., Leite Filho, W.C., & Hemerly, E.M. (2013). Identification of a Hammerstein model for an aerospace electrohydraulic servovalve. IFAC Proceedings Volumes, 46(19), 459–463. [35] Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., & Bressler, S.L. (2004). Beta oscillations in a large-scale sensorimotor cortical network: Directional influences revealed by Granger causality. Proceedings of the National Academy of Sciences, 101(26), 9849–9854. [36] Cartwright, N. (1989). Nature’s Capacities and Their Measurement. Oxford University Press. 
[37] Chávez, M., Martinerie, J., & Le Van Quyen, M. (2003). Statistical assessment of nonlinear causality: application to epileptic EEG signals. Journal of Neuroscience Methods, 124(2), 113–128. [38] Chella, F., D’Andrea, A., Basti, A., Pizzella, V., & Marzetti, L. (2017). Non-linear analysis of scalp EEG by using bispectra: The effect of the reference choice. Frontiers in Neuroscience, 11, 262. [39] Chen, S. & Billings, S.A. (1989). Representations of non-linear systems: The NARMAX model. International Journal of Control, 49(3), 1013–1032. [40] Chen, S., Billings, S.A., Cowan, C.F.N., & Grant, P.M. (1990). Practical identification of NARMAX models using radial basis functions. International Journal of Control, 52(6), 1327–1350. 149 [41] Chen, S., Cowan, C., & Grant, P. (1991). OLS learning algorithm for RBF networks. IEEE Transactions on Neural Networks, 2(2), 302–309. [42] Chen, S. & Donoho, D. (1994). Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1 (pp. 41–44).: IEEE. [43] Chen, Y., Bressler, S.L., & Ding, M. (2006a). Frequency decomposition of conditional Granger causality and application to multivariate neural field potential data. Journal of Neuroscience Methods, 150(2), 228–237. [44] Chen, Y., Bressler, S.L., & Ding, M. (2006b). Frequency decomposition of conditional Granger causality and application to multivariate neural field potential data. Journal of Neuroscience Methods, 150(2), 228 – 237. [45] Clearwater, S.H., Hogg, T., & Huberman, B.A. (1992). Cooperative problem solving. In Computation: The Micro and the Macro View (pp. 33–70).: World Scientific. [46] Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate behavioral research, 18(1), 115–126. [47] Connor, J.T., Martin, R.D., & Atlas, L.E. (1994). Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, 5(2), 240–254. [48] Csáji, B.C. (2001). Approximation with artificial neural networks. Faculty of Sciences, Etvs Lornd University, Hungary, 24(48), 7. [49] Cybenko, G. (1989). Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems, 2(4), 303–314. [50] Dasgupta, S. & Huang, Y.F. (1987). Asymptotically convergent modified recursive least-squares with data-dependent updating and forgetting factor for systems with bounded noise. IEEE Transactions on Information Theory, 33(3), 383–392. [51] Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197. [52] Deller Jr., J.R. (1989). Set membership identification in digital signal processing. IEEE ASSP Magazine, 6(4), 4–20. [53] Deller Jr., J.R. & Huang, Y.F. (2002). Set-membership identification and filtering for signal processing applications. Circuits, Systems, and Signal Proc., 21, 69–82. [54] Deller Jr., J.R., Nayeri, M., & Liu, M. (1994). Unifying the landmark developments in OBE identification. International Journal of Adaptive Control and Signal Processing, 8, 43–60. 150 [55] Deller Jr., J.R., Nayeri, M., & Odeh, S.F. (1993). Least-square identification with error bounds for real-time signal processing and control. Proceedings of the IEEE, 81(6), 815–849. [56] Deller Jr., J.R. & Odeh, S.F. (1992). SM-WRLS algorithms with an efficient test for innovation. IFAC Proceedings Volumes, 25(15), 267 – 272. 
9th IFAC/IFORS Symposium on Identification and System Parameter Estimation 1991, Budapest, Hungary, 8–12 July 1991. [57] Deshpande, G., LaConte, S., Peltier, S., & Hu, X. (2006). Directed transfer function analysis of fMRI data to investigate network dynamics. In 2006 International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 671–674). [58] Dewdney, A.K. (1997). Nonlinear system identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains. John Wiley & Sons. [59] Dhamala, M., Rangarajan, G., & Ding, M. (2008). Estimating Granger causality from Fourier and wavelet transforms of time series data. Physical Review Letters, 100(1), 018701. [60] Diks, C. & Panchenko, V. (2006). A new statistic and practical guidelines for nonparametric Granger causality testing. Journal of Economic Dynamics and Control, 30(9-10), 1647–1669. [61] Dimitriadis, S., Laskaris, N., & Tzelepi, A. (2013). On the quantization of time-varying phase synchrony patterns into distinct functional connectivity microstates (FCµstates) in a multi-trial visual ERP paradigm. Brain Topography, 26(3), 397–409. [62] Ding, J., Tarokh, V., & Yang, Y. (2017). Bridging AIC and BIC: A new criterion for autoregression. IEEE Transactions on Information Theory. [63] Durbin, J. (1960). The fitting of time-series models. Revue de l’Institut International de Statistique, (pp. 233–244). [64] Eaton, M.L. (1983). Multivariate Statistics: A Vector Space Approach. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons. [65] Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499. [66] Faes, L., Nollo, G., & Porta, A. (2011). Information-based detection of nonlinear Granger causality in multivariate processes via a nonuniform embedding technique. Physical Review E, 83(5), 051112. [67] Frank, I.E. & Friedman, J.H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135. [68] Friston, K.J., Harrison, L., & Penny, W. (2003). Dynamic causal modelling. NeuroImage, 19(4), 1273–1302. [69] Friston, K.J., Moran, R., & Seth, A.K. (2013). Analysing connectivity with Granger causality and dynamic causal modelling. Current Opinion in Neurobiology, 23(2), 172–178. [70] Fu, W. & Knight, K. (2000). Asymptotics for LASSO-type estimators. The Annals of Statistics, 28(5), 1356–1378. [71] Geweke, J.F. (1982). Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association, 77(378), 304–313. [72] Geweke, J.F. (1984). Measures of conditional linear dependence and feedback between time series. Journal of the American Statistical Association, 79(388), 907–915. [73] Goldman, B.W. & Tauritz, D.R. (2012). Linkage tree genetic algorithms: Variants and analysis. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (pp. 625–632). [74] Golub, G.H. & Van Loan, C.F. (2012). Matrix Computations, volume 3. Johns Hopkins University Press. [75] Goshvarpour, A., Goshvarpour, A., Rahati, S., & Saadatian, V. (2012). Bispectrum estimation of electroencephalogram signals during meditation. Iranian Journal of Psychiatry and Behavioral Sciences, 6(2), 48. [76] Granger, C.W.J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, (pp. 424–438). [77] Granger, C.W.J. (1980).
Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control, 2, 329–352. [78] Granger, C.W.J. (1981). Some properties of time series data and their use in econometric model specification. Journal of Econometrics, 16(1), 121–130. [79] Grassmann, G. (2020). New considerations on the validity of the Wiener-Granger causality test. Heliyon, 6(10), e05208. [80] Greenblatt, R.E., Pflieger, M.E., & Ossadtchi, A.E. (2012). Connectivity measures applied to human brain electrophysiological data. Journal of Neuroscience Methods, 207(1), 1–16. [81] Gu, Y. & Wei, H.L. (2018). A robust model structure selection method for small sample size and multiple datasets problems. Information Sciences, 451, 195–209. [82] Guo, Y., Guo, L.Z., Billings, S.A., & Wei, H.L. (2015). Identification of nonlinear systems with non-persistent excitation using an iterative forward orthogonal least squares regression algorithm. International Journal of Modelling, Identification and Control, 23(1), 1–7. [83] Hagihira, S., Takashina, M., Mori, T., & Mashimo, T. (2004). Bispectral analysis gives us more information than power spectral-based analysis. British Journal of Anaesthesia, 92(5), 772–773. [84] Hand, M.L. (1978). Aspects of linear regression estimation under the criterion of minimizing the maximum absolute residual. PhD thesis, Iowa State University. [85] Hansen, N. & Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2), 159–195. [86] Haufe, S., Nikulin, V.V., & Nolte, G. (2011). Identifying brain effective connectivity patterns from EEG: Performance of Granger causality, DTF, PDC and PSI on simulated data. BMC Neuroscience, 12(S1), P141. [87] Hiemstra, C. & Jones, J.D. (1994). Testing for linear and nonlinear Granger causality in the stock price-volume relation. The Journal of Finance, 49(5), 1639–1664. [88] Hill, P.D. (1985). Kernel estimation of a distribution function. Communications in Statistics-Theory and Methods, 14(3), 605–620. [89] Hillebrand, A., Tewarie, P., Van Dellen, E., Yu, M., Carbo, E.W.S., Douw, L., Gouw, A.A., Van Straaten, E.C.W., & Stam, C.J. (2016). Direction of information flow in large-scale resting-state networks is frequency-dependent. Proceedings of the National Academy of Sciences, 113(14), 3867–3872. [90] Hoerl, A.E. & Kennard, R.W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. [91] Hoke, M., Lehnertz, K., Pantev, C., & Lütkenhöner, B. (1989). Spatiotemporal aspects of synergetic processes in the auditory cortex as revealed by the magnetoencephalogram. In E. Başar & T. H. Bullock (Eds.), Brain Dynamics (pp. 84–105). Berlin, Heidelberg: Springer Berlin Heidelberg. [92] Holland, P.W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960. [93] Hosoya, Y. (1991). The decomposition and measurement of the interdependency between second-order stationary processes. Probability Theory and Related Fields, 88(4), 429–444. [94] Hu, S., Cao, Y., Zhang, J., Kong, W., Yang, K., Li, X., & Zhang, Y. (2011a). Evidence for existence of real causality in the case of zero Granger causality. In International Conference on Information Science and Technology (pp. 1385–1389). [95] Hu, S., Dai, G., Worrell, G.A., Dai, Q., & Liang, H. (2011b). Causality analysis of neural connectivity: Critical examination of existing methods and advances of new methods. IEEE Transactions on Neural Networks, 22(6), 829–844.
[96] Hu, S., Jia, X., Zhang, J., Kong, W., & Cao, Y. (2016a). Shortcomings/limitations of blockwise Granger causality and advances of blockwise New causality. IEEE Transactions on Neural Networks and Learning Systems, 27(12), 2588–2601. [97] Hu, S. & Liang, H. (2012). Causality analysis of neural connectivity: New tool and limitations of spectral Granger causality. Neurocomputing, 76(1), 44–47. [98] Hu, S., Wang, H., Zhang, J., Kong, W., & Cao, Y. (2014). Causality from Cz to C3/C4 or between C3 and C4 revealed by Granger causality and New causality during motor imagery. In 2014 International Joint Conference on Neural Networks (IJCNN) (pp. 3178–3185). [99] Hu, S., Wang, H., Zhang, J., Kong, W., Cao, Y., & Kozma, R. (2016b). Comparison analysis: Granger causality and New causality and their applications to motor imagery. IEEE Transactions on Neural Networks and Learning Systems, 27(7), 1429–1444. [100] Hu, X., Hu, S., Zhang, J., Kong, W., & Cao, Y. (2016c). A fatal drawback of the widely used Granger causality in neuroscience. In 2016 Sixth International Conference on Information Science and Technology (ICIST) (pp. 61–65). [101] Huang, Y.F. (1986). A recursive estimation algorithm using selective updating for spectral analysis and adaptive signal processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(5), 1331–1334. [102] Hume, D. (1904). Enquiry Concerning Human Understanding. Clarendon Press. [103] Hume, D. (1978). A Treatise of Human Nature [1739]. British Moralists, (pp. 1650–1800). [104] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning, volume 112. Springer. [105] Jia, X., Hu, S., Zhang, J., & Kong, W. (2015). Blockwise Granger causality and blockwise new causality. In 2015 Seventh International Conference on Advanced Computational Intelligence (ICACI) (pp. 421–425). [106] Joachim, D., Deller Jr., J.R., & Nayeri, M. (1997). Practical considerations in the use of a new OBE algorithm that blindly estimates error bounds. In Proceedings of the 40th Midwest Symposium on Circuits and Systems, 1997, volume 2 (pp. 762–765). [107] Kamalabadi, F., Forbes, J., Makarov, N., & Portnyagin, Y.I. (1997). Evidence for nonlinear coupling of planetary waves and tides in the Antarctic mesopause. Journal of Geophysical Research: Atmospheres, 102(D4), 4437–4446. [108] Kamiński, M., Ding, M., Truccolo, W.A., & Bressler, S.L. (2001). Evaluating causal relations in neural systems: Granger causality, directed transfer function and statistical assessment of significance. Biological Cybernetics, 85(2), 145–157. [109] Kharchenko, V.S. (2019). Internet of things for industry and human application. In Modelling and Development, volume 1 (pp. 547). Ministry of Education and Science of Ukraine, National Aerospace University - Kharkiv Aviation Institute. [110] Kim, J.H. (2009). Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics & Data Analysis, 53(11), 3735–3745. [111] Kimeldorf, G.S. & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2), 495–502. [112] Kozma, R., Hu, S., Sokolov, Y., Wanger, T., Schulz, A.L., Woldeit, M.L., Gonçalves, A.I., Ruszinkó, M., & Ohl, F.W. (2021). State transitions during discrimination learning in the gerbil auditory cortex analyzed by network causality metrics. Frontiers in Systems Neuroscience, 15.
[113] Kraskov, A., Stögbauer, H., Andrzejak, R.G., & Grassberger, P. (2005). Hierarchical clustering using mutual information. Europhysics Letters (EPL), 70(2), 278–284. [114] Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138. [115] Kubinyi, H. (1994). Variable selection in QSAR studies. I. An evolutionary algorithm. Quantitative Structure-Activity Relationships, 13(3), 285–294. [116] Kukreja, S.L., Galiana, H.L., & Kearney, R.E. (2003). NARMAX representation and identification of ankle dynamics. IEEE Transactions on Biomedical Engineering, 50(1), 70–81. [117] Kukreja, S.L., Galiana, H.L., & Kearney, R.E. (2004). A bootstrap method for structure detection of NARMAX models. International Journal of Control, 77(2), 132–143. [118] Kukreja, S.L., Löfberg, J., & Brenner, M.J. (2006). A least absolute shrinkage and selection operator (LASSO) for nonlinear system identification. IFAC Proceedings Volumes, 39(1), 814–819. [119] Lachaux, J.P., Lutz, A., Rudrauf, D., Cosmelli, D., Le Van Quyen, M., Martinerie, J., & Varela, F. (2002). Estimating the time-course of coherence between single-trial brain signals: An introduction to wavelet coherence. Neurophysiologie Clinique/Clinical Neurophysiology, 32(3), 157–174. [120] Lachaux, J.P., Rodriguez, E., Martinerie, J., & Varela, F.J. (1999). Measuring phase synchrony in brain signals. Human Brain Mapping, 8(4), 194–208. [121] Leardi, R., Boggia, R., & Terrile, M. (1992). Genetic algorithms as a strategy for feature selection. Journal of Chemometrics, 6(5), 267–281. [122] Leontaritis, I.J. & Billings, S.A. (1985). Input-output parametric models for non-linear systems Part I: Deterministic non-linear systems. International Journal of Control, 41(2), 303–328. [123] Liao, S.H. & Wen, C.H. (2007). Artificial neural networks classification and clustering of methodologies and applications–literature analysis from 1995 to 2005. Expert Systems with Applications, 32(1), 1–11. [124] Libal, U. (2011). Feature selection for pattern recognition by LASSO and thresholding methods - a comparison. In 2011 16th International Conference on Methods and Models in Automation and Robotics (pp. 168–173). [125] Lin, T., Nayeri, M., & Deller Jr., J.R. (1998). A consistently convergent OBE algorithm with automatic estimation of error bounds. International Journal of Adaptive Control and Signal Processing, 12(4), 305–324. [126] Liu, Y. & Aviyente, S. (2009). Directed information measure for quantifying the information flow in the brain. In 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 2188–2191). [127] Lizier, J.T. & Prokopenko, M. (2010). Differentiating information transfer and causal effect. The European Physical Journal B, 73(4), 605–615. [128] Ljung, L. (1987). System Identification: Theory for the User. Prentice-Hall, Inc. [129] Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems (pp. 6231–6239). [130] Lundberg, S. & Lee, S. (2017). A unified approach to interpreting model predictions. Computing Research Repository, abs/1705.07874. [131] Mallat, S.G. & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12), 3397–3415. [132] Marinazzo, D., Pellicoro, M., & Stramaglia, S. (2008). Kernel method for nonlinear Granger causality. Physical Review Letters, 100(14), 144103. [133] Markovsky, I.
& Van Huffel, S. (2007). Overview of total least-squares methods. Signal Processing, 87(10), 2283–2302. [134] Mauldin, M.L. (1984). Maintaining diversity in genetic search. In Proceedings of the Fourth AAAI Conference on Artificial Intelligence, AAAI’84 (pp. 247–250).: AAAI Press. [135] Maziarz, M. (2015). A review of the Granger-causality fallacy. The Journal of Philosophical Economics: Reflections on Economic and Social Issues, 8(2), 86–105. [136] McCrorie, J.R. & Chambers, M.J. (2006). Granger causality and the sampling of economic processes. Journal of Econometrics, 132(2), 311–336. [137] Mellin, W.D. (1957). Work with new electronic ‘brains’ opens field for army math experts. The Hammond Times, 10, 66. [138] Miller, A. (2002). Subset Selection in Regression. Chapman and Hall/CRC. [139] Morse, G. & Stanley, K.O. (2016). Simple evolutionary optimization can rival stochastic gradient descent in neural networks. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (pp. 477–484). [140] Munia, T.T.K. & Aviyente, S. (2019). Time-frequency based phase-amplitude coupling measure for neuronal oscillations. Scientific Reports, 9(1), 1–15. [141] Munia, T.T.K. & Aviyente, S. (2021). Granger causality based directional phase-amplitude coupling measure. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1070–1074). [142] Musgrave, J.L. (1992). Linear quadratic servo control of a reusable rocket engine. Journal of Guidance, Control, and Dynamics, 15(5), 1149–1154. [143] Muthukumaraswamy, S. (2013). High-frequency brain activity and muscle artifacts in MEG/EEG: A review and recommendations. Frontiers in Human Neuroscience, 7, 138. [144] Myles, P.S., Leslie, K., McNeil, J., Forbes, A., & Chan, M.T.V. (2004). Bispectral index monitoring to prevent awareness during anaesthesia: The B-Aware randomised controlled trial. Lancet, 363(9423), 1757–1763. [145] Narendra, K. & Gallman, P. (1966). An iterative method for the identification of nonlinear systems using a Hammerstein model. IEEE Transactions on Automatic Control, 11(3), 546–550. [146] Nariyoshi, P. & Deller Jr., J.R. (2021). Nonlinear extensions of new causality. Neuroscience Informatics. To appear. [147] Nariyoshi, P., Deller Jr., J.R., & Goodman, E.D. (2021a). On models for assessing causality strength. Heliyon. To appear. [148] Nariyoshi, P., Deller Jr., J.R., & Goodman, E.D. (2021b). On the robustness of Granger and new causality measures to model order and parameter uncertainty in multivariate regressive models. In review. [149] Nariyoshi, P., Deller Jr., J.R., & Yan, J. (2017). Modified genetic crossover and mutation operators for sparse regressor selection in NARMAX brain connectivity modeling. In 2017 8th International IEEE/EMBS Conference on Neural Engineering (NER) (pp. 660–663). [150] Nolte, G., Ziehe, A., Nikulin, V.V., Schlögl, A., Krämer, N., Brismar, T., & Müller, K.R. (2008). Robustly estimating the flow direction of information in complex physical systems. Physical Review Letters, 100(23), 234101. [151] Ogata, K. (2001). Modern Control Engineering. USA: Prentice-Hall, Inc., 4th edition. [152] Oppenheim, A.V., Willsky, A.S., & Nawab, S.H. (1996). Signals and Systems (2nd Ed.). USA: Prentice-Hall, Inc. [153] Papoulis, A. & Pillai, S.U. (2002). Probability, Random Variables, and Stochastic Processes. Tata McGraw-Hill Education. [154] Pearl, J. (2009). Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition.
[155] Pelikan, M., Hauschild, M.W., & Thierens, D. (2011). Pairwise and problem-specific distance metrics in the linkage tree genetic algorithm. Genetic & Evolutionary Computation Conference, (pp. 1005–1012). [156] Pereda, E., Quiroga, R.Q., & Bhattacharya, J. (2005). Nonlinear multivariate analysis of neurophysiological signals. Progress in Neurobiology, 77(1-2), 1–37. [157] Piroddi, L. & Spinelli, W. (2003). A pruning method for the identification of polynomial NARMAX models. IFAC Proceedings Volumes, 36(16), 1071–1076. [158] Plackett, R.L. (1950). Some theorems in least squares. Biometrika, 37(1/2), 149–157. [159] Politis, D.N. & Romano, J.P. (1992). A circular block-resampling procedure for stationary data. Technical report, Stanford University. [160] Politis, D.N. & Romano, J.P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303–1313. [161] Poor, H.V. (1994). An Introduction to Signal Detection and Estimation (2nd Ed.). Berlin, Heidelberg: Springer-Verlag. [162] Pradhan, C., Jena, S.K., Nadar, S.R., & Pradhan, N. (2012). Higher-order spectrum in understanding nonlinearity in EEG rhythms. Computational and Mathematical Methods in Medicine, 2012. [163] Punch III, W.F., Goodman, E.D., Pei, M., Chia-Shun, L., Hovland, P.D., & Enbody, R.J. (1993). Further research on feature selection and classification using genetic algorithms. In ICGA (pp. 557–564). [164] Rabiner, L.R. & Gold, B. (1975). Theory and Application of Digital Signal Processing. Prentice-Hall, Inc. [165] Rahim, H.A., Ibrahim, F., & Taib, M.N. (2007). A novel prediction system in dengue fever using NARMAX model. In 2007 International Conference on Control, Automation and Systems (pp. 305–309). [166] Rasmussen, C.E. (2004). Gaussian processes in machine learning. In Advanced Lectures on Machine Learning (pp. 63–71). Springer. [167] Reeves, C.R. & Rowe, J.E. (2003). Genetic Algorithms: Principles and Perspectives: A Guide to GA Theory, volume 20. Springer. [168] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14(5), 465–471. [169] Rodrigues, P.L.C. & Baccalá, L.A. (2016). Statistically significant time-varying neural connectivity estimation using generalized partial directed coherence. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 5493–5496). [170] Rosenblum, M.G. & Pikovsky, A.S. (2001). Detecting direction of coupling in interacting oscillators. Physical Review E, 64(4), 045202. [171] Rudin, L.I., Osher, S., & Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4), 259–268. [172] Sayed, A.H. (2003). Fundamentals of Adaptive Filtering. John Wiley & Sons. [173] Schelter, B., Winterhalder, M., Eichler, M., Peifer, M., Hellwig, B., Guschlbauer, B., Lücking, C.H., Dahlhaus, R., & Timmer, J. (2006). Testing for directed influences among neural signals using partial directed coherence. Journal of Neuroscience Methods, 152(1-2), 210–219. [174] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. [175] Seth, A.K., Barrett, A.B., & Barnett, L. (2015). Granger causality analysis in neuroscience and neuroimaging. Journal of Neuroscience, 35(8), 3293–3297. [176] Shahabi, H., Moghimi, S., & Moghimi, A. (2013). Investigating the effective brain networks related to working memory using a modified directed transfer function.
In 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER) (pp. 1398–1401). [177] Sheikhattar, A., Miran, S., Liu, J., Fritz, J.B., Shamma, S.A., Kanold, P.O., & Babadi, B. (2018). Extracting neuronal functional network dynamics via adaptive Granger causality analysis. Proceedings of the National Academy of Sciences, 115(17), E3869–E3878. [178] Shen, M., Sun, L., & Beadle, P.J. (2000). Parametric bispectral estimation of EEG signals in different functional states of brain. In 2000 First International Conference Advances in Medical Signal and Information Processing (IEE Conf. Publ. No. 476) (pp. 66–72). [179] Shimmura, T., Ohashi, S., & Yoshimura, T. (2015). The highest-ranking rooster has priority to announce the break of dawn. Scientific Reports, 5, 11683. [180] Shimmura, T. & Yoshimura, T. (2013). Circadian clock determines the timing of rooster crowing. Current Biology, 23(6), R231–R233. [181] Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. Computing Research Repository, abs/1704.02685. [182] Sigl, J.C. & Chamoun, N.G. (1994). An introduction to bispectral analysis for the electroencephalogram. Journal of Clinical Monitoring, 10(6), 392–404. [183] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. [184] Sims, C.A. (1972). Money, income, and causality. The American Economic Review, (pp. 540–552). [185] Sposito, V.A. (1976). Minimizing the maximum absolute deviation. SIGMAP Bulletin, 1(20), 51–53. [186] Stam, C.J. (2005). Nonlinear dynamical analysis of EEG and MEG: Review of an emerging field. Clinical Neurophysiology, 116(10), 2266–2301. [187] Stoean, R., Stoean, C., & Sandita, A. (2017). Evolutionary regressor selection in ARIMA model for stock price time series forecasting. In International Conference on Intelligent Decision Technologies (pp. 117–126).: Springer. [188] Stokes, P.A. & Purdon, P.L. (2017). A study of problems encountered in Granger causality analysis from a neuroscience perspective. Proceedings of the National Academy of Sciences, 114(34), E7063–E7072. [189] Sun, R. (1994). A neural network model of causality. IEEE Transactions on Neural Networks, 5(4), 604–611. [190] Taylor, S.J. (1994). Modeling stochastic volatility: A review and comparative study. Mathematical Finance, 4(2), 183–204. [191] Tenke, C.E. & Kayser, J. (2012). Generator localization by current source density (CSD): Implications of volume conduction and field closure at intracranial and scalp resolutions. Clinical Neurophysiology, 123(12), 2328–2345. [192] Thierens, D. (2010). The linkage tree genetic algorithm. In International Conference on Parallel Problem Solving from Nature (pp. 264–273).: Springer. [193] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society. Series B (Methodological), (pp. 267–288). [194] Tikhonov, A.N. (1943). On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39 (pp. 195–198). [195] Ulrich, T. & Thiele, L. (2011). Maximizing population diversity in single-objective optimization.
In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (pp. 641–648). [196] Vicente, R., Wibral, M., Lindner, M., & Pipa, G. (2011). Transfer entropy—a model-free measure of effective connectivity for the neurosciences. Journal of Computational Neuroscience, 30(1), 45–67. [197] Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354. [198] Volterra, V. (1887). Sopra le Funzioni che Dipendono da Altre Funzioni (About Functions That Depend on Other Functions). Tip. della R. Accademia dei Lincei. [199] Wassenaar, G. (2020). Empirical performance evaluation of the linkage tree genetic algorithm. [200] Wei, H.L. (2019). Sparse, interpretable and transparent predictive model identification for healthcare data analysis. In International Work-Conference on Artificial Neural Networks (pp. 103–114).: Springer. [201] Wei, H.L. & Billings, S.A. (2008). Model structure selection using an integrated forward orthogonal search algorithm assisted by squared correlation and mutual information. International Journal of Modelling, Identification and Control, 3(4), 341–356. [202] Wei, H.L. & Billings, S.A. (2019). NARMAX model as a sparse, interpretable and transparent machine learning approach for big medical and healthcare data analysis. In Proceedings of the 5th IEEE International Conference on Data Science and Systems: IEEE. [203] Westwick, D. & Kearney, R. (2003). Identification of Nonlinear Physiological Systems, volume 7. John Wiley & Sons. [204] Whittle, P. (1963). On the fitting of multivariate autoregressions, and the approximate canonical factorization of a spectral density matrix. Biometrika, 50(1-2), 129–134. [205] Widrow, B. & Hoff, M.E. (1960). Adaptive switching circuits. Technical report, Stanford Electronics Laboratories, Stanford University. [206] Wiener, N. (1956). The theory of prediction. In E. F. Beckenbach & R. Weller (Eds.), Modern Mathematics for the Engineer, chapter 8 (pp. 165–190). New York: McGraw-Hill. [207] Wolpaw, J. & Wolpaw, E.W. (2012). Brain-Computer Interfaces: Principles and Practice. Oxford University Press. [208] Yamashita, O., Sadato, N., Okada, T., & Ozaki, T. (2005). Evaluating frequency-wise directed connectivity of BOLD signals applying relative power contribution with the linear multivariate time-series models. NeuroImage, 25(2), 478–490. [209] Yan, J. (2018). Two Studies in Nonlinear Biological System Modeling and Identification. PhD thesis, Michigan State University. [210] Yan, J. & Deller Jr., J.R. (2014). Biologically-motivated system identification: Application to microbial growth modeling. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 322–325). [211] Yan, J. & Deller Jr., J.R. (2015a). Set-theoretic measures as evolutionary fitness criteria in nonlinear system identification. In Proceedings of the 17th IFAC Symposium on System Identification (SYSID), Beijing. Published on CD-ROM and at IFAC-PapersOnLine.net. [212] Yan, J. & Deller Jr., J.R. (2015b). Set-theoretic measures as evolutionary fitness criteria in nonlinear system identification. 17th IFAC Symposium on System Identification SYSID, 48(28), 178–183. [213] Yan, J. & Deller Jr., J.R. (2016). NARMAX model identification using a set-theoretic evolutionary approach. Signal Processing, 123(5), 30–41.
[214] Yan, J., Deller Jr., J.R., Fleet, B., Goodman, E.D., & Yao, M. (2013). Evolutionary identification of nonlinear parametric models with a set-theoretic fitness criterion. In 2013 IEEE China Summit and International Conference on Signal and Information Processing (pp. 44–48). [215] Yan, J., Deller Jr., J.R., Yao, M., & Goodman, E.D. (2014a). Biologically-motivated system identification: Application to microbial growth modeling. In Proc. 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). [216] Yan, J., Deller Jr., J.R., Yao, M., & Goodman, E.D. (2014b). Evolutionary model selection for identification of nonlinear parametric systems. In Proceedings of the 2014 IEEE China Summit and International Conference on Signal and Information Processing (pp. 693–697). [217] Yan, J., Nariyoshi, P., & Deller Jr., J.R. (2021). Sparse nonlinear model structure selection and parameter estimation using bi-objective optimization. In review. [218] Yao, D., Qin, Y., Hu, S., Dong, L., Bringas Vega, M.L., & Sosa, P.A.V. (2019). Which reference should we use for EEG and ERP practice? Brain Topography, 32(4), 530–549. [219] Yassin, I.M., Zabidi, A., Amin Megat Ali, M.S., Md Tahir, N., Zainol Abidin, H., & Rizman, Z.I. (2016). Binary particle swarm optimization structure selection of nonlinear autoregressive moving average with exogenous inputs (NARMAX) model of a flexible robot arm. International Journal on Advanced Science, Engineering and Information Technology, 6(5), 630–637. [220] Zhuo, H., Hu, S., Myers, M.H., Zhang, J., Kong, W., Cao, Y., & Kozma, R. (2016). Causality analysis during shared intentionality. In 2016 12th World Congress on Intelligent Control and Automation (WCICA) (pp. 2215–2219).