NOVEL METHODS FOR FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS TO NEUROIMAGING STUDIES

By

Pratim Guha Niyogi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2022

ABSTRACT

NOVEL METHODS FOR FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS TO NEUROIMAGING STUDIES

By Pratim Guha Niyogi

In recent years, there has been explosive growth in different neuroimaging studies such as functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI). The data generated from such studies are often complex in structure, collected for different individuals, at various time-points, and across various modalities, thus paving the way for interesting problems in statistical methodology for the analysis of such data. In this dissertation, some efficient methodologies with considerable theoretical development are proposed; they have desirable statistical properties and can be useful not only in neuroimaging but also in other scientific domains. A brief overview of the dissertation is provided in Chapter 1; in particular, the different kinds of data structures that are commonly used in subsequent chapters are described. Some useful mathematical results frequently used in the theoretical derivations of various chapters are also provided. Moreover, we raise some fundamental questions that arise due to specific data structures with applications in neuroimaging and answer these questions in subsequent chapters. In Chapter 2, we consider the problem of estimation of coefficients in constant linear effect models for semi-parametric functional regression with functional response, where each response curve is decomposed into the overall mean function indexed by a covariate function with constant regression parameters and a random error process. We provide an alternative semi-parametric solution that estimates the parameters using a quadratic inference approach in which the basis functions are estimated non-parametrically. Therefore, the proposed method can be easily implemented without assuming any working correlation structure. Moreover, we achieve a parametric √n-convergence rate of the proposed estimator under a proper choice of bandwidth and establish its asymptotic normality. A multi-step estimation procedure to simultaneously estimate the varying-coefficient functions using a local linear generalized method of moments (GMM) based on continuous moment conditions is developed in Chapter 3 under heteroskedasticity of unknown form. To incorporate spatial dependence, the continuous moment conditions are first projected onto eigen-functions and then combined by weighted eigen-values. This approach solves the challenges of using an inverse covariance operator directly. We propose an optimal instrumental variable that minimizes the asymptotic variance function among the class of all local linear GMM estimators, and it is found to outperform the initial estimates that do not incorporate spatial dependence. Neuroimaging data are increasingly being combined with other non-imaging modalities, such as behavioral and genetic data. The data structure of many of these modalities can be expressed as time-varying multidimensional arrays (tensors), collected at different time-points on multiple subjects. In Chapter 4, we consider a new approach to study neural correlates in the presence of tensor-valued brain images and tensor-valued predictors, where both data types are collected over the same set of time-points.
We propose a time-varying tensor regression model with an inherent structural composition of responses and covariates. This development is a non-trivial extension of function-on-function concurrent linear models for complex and large structural data where the inherent structures are preserved. Through extensive simulation studies and real-data analyses, we demonstrate the opportunities and advantages of the proposed methods. Copyright by PRATIM GUHA NIYOGI 2022 To my mother Maya for her sacrifices. To my father Kajal for his support and protection. To my wife Debolina for her unconditional love and support. v ACKNOWLEDGEMENTS My journey till the completion of this dissertation has been a long one which has taught me many things. Now that I look back into all those moments, all I can feel is gratitude towards time, place, events and some very important people. With humbleness, I thank my academic advisors Dr. Ping- Shou Zhong and Dr. Tapabrata Maiti for providing invaluable technical support during my doctoral journey without which my journey would not have reached its goal. They trained me to pursue research independently, think out of the box, and deal with multiple problems simultaneously. Their inspiration boosted my confidence and persuaded me to contribute to statistical sciences. I am especially grateful to Dr. Zhong for continuing to mentor me even after moving to University of Illinois at Chicago. This long-distance mentoring was difficult, but he carried on with it effortlessly and never left me struggling. I thank Dr. Lyudmila Sakhanenko and Dr. Alla Sikorskii for serving in my dissertation committee and for providing helpful constructive suggestions to improve the quality of this dissertation. Reflecting upon the start of my journey, I would like to thank the Graduate Students’ Selection Committee of 2016 for offering me the opportunity to pursue the doctoral program here at Michigan State University (MSU) and for providing assistantship during my education. Dr. Tapabrata Maiti, the then graduate director, considered me worthy of the opportunity and expressed his belief in me when I was just another fresh Masters from India venturing into this foreign land for a new endeavour. His encouraging words have helped me believe in myself and seek excellence. I thank Dr. Alla Sikorskii for supporting me through research assistantship for more than three years from her grants and helping me grow in collaborative research. This opportunity gifted me the beneficial experience of interdisciplinary research and exposure to statistics in health sciences. I will never forget her immense help and support. I would like to thank Drs. Shlomo Levental, Ping-Shou Zhong and Lyudmila Sakhanenko for teaching four very important courses: probability theory, linear models, and asymptotics and non-parametric curve estimation, respectively, with care and depth. The lessons learned from those courses have helped a lot in my research. I am thankful to my vi seniors especially Rejaul Karim, who was my roommate for three years and I spent some of the best days at East Lansing in his company, and Atrayee Majumder for her valuable suggestions. I would like to make a special mention of Andy Hufford for being approachable and helping out with just about any technology-related issues. Also, I thank Tami Hankey for her assistance in completing numerous paper-works all along the way. 
A huge role on this dissertation was played by the high- performance computing technology of MSU without which running codes with voluminous data would not be possible. I am therefore thankful to computational resources and services provided by the Institute for Cyber-Enabled Research at MSU. There are also people outside the University who have been of immense help. The person to whom I am extremely grateful for providing me with the opportunity to spend a wonderful summer at the Biostatistics department of Johns Hopkins University is Dr. Martin Lindquist. I learnt many things from him and am honoured to have been able to collaborate with him in one of my projects. I would also like to thank Dr. Xiaohong Joe Zhou of University of Illinois at Chicago for providing me with valuable data that helped me validate my research findings. Going back to my roots, I would like to thank my high-school statistics teacher Ashoke Panda, faculty of Hare School, Kolkata who for the first time introduced me to this colorful subject and motivated me to pursue it. I thank Arindam Bhattacharyya, who has been a valuable guide all the way upto my Masters. I am thankful to my alma maters Presidency University and Indian Statistical Institute for their intellectual environment that shaped my critical thinking ability and propelled me to pursue research, and I will forever remain grateful to my roots. I am thankful to all professors who helped me, guided me, and motivated me during those times. I am especially thankful to Drs. Saurabh Ghosh and Ayanendranath Basu of Indian Statistical Institute and Dr. Subhra Sankar Dhar of Indian Institute of Technology, Kanpur who engaged me in research at various phases of my educational levels; their training and treatment were very beneficial to me. Nevertheless, I am indebted to Dr. Partha Sarathi Chakraborty for his excellent teaching, which created a strong foundation during my undergraduate days. Also, my sincere gratitude to the Ramakrishna Mission Institute of Culture for awarding me the Debesh-Kamal Scholarship for Higher Studies in 2016 vii which provided me the necessary funds to travel to USA. My family has been a pillar of strength throughout my life. I thank my parents Maya and Kajal Guha Niyogi for not stopping me from trying things out and letting me fail so I could learn the right way. I am grateful to them for never comparing me to anyone else, for always believing in me, and for teaching me humility and respect. I am thankful to my wife Debolina Chatterjee for her unconditional love and support and for being the first reader of all my writings. She is my “better half” in uncountable ways and I thank her for bringing out the best in me. Last but not least, I thank the Almighty for this life and for making things fall into place. Pratim Guha Niyogi June 29, 2022 East Lansing, MI viii TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii KEY TO ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii CHAPTER 1 PROLOGUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Big data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Functional data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
1 1.1.2 Tensor data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.1 (Functional) magnetic resonance imaging . . . . . . . . . . . . . . . . . . 6 1.2.2 Diffusion tensor imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3 Mathematical preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.1 Notations of tensor/matrix object . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.2 Different kinds of products . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.3 Tensor decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3.4 Some useful results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.4 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 CHAPTER 2 IMPROVING QUADRATIC INFERENCE APPROACH FOR FUNC- TIONAL RESPONSES . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Functional response model and estimation procedure . . . . . . . . . . . . . . . . 22 2.2.1 Basic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2.2 Incorporating eigen-functions in QIF . . . . . . . . . . . . . . . . . . . . . 24 2.2.3 Estimation of eigen-functions . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.1 Simulation set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.2 Comparison and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.5 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.5.1 Beijing’s PM pollution study . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.5.2 DTI study for sleep apnea patients . . . . . . . . . . . . . . . . . . . . . . 48 2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 2.7 Technical details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.7.1 Some preliminary definitions and concepts of operators . . . . . . . . . . . 51 2.7.2 Some useful lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.7.3 Proof of Theorem 2.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 ix 2.7.4 Proof of Theorem 2.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 CHAPTER 3 ESTIMATION FOR VARYING-COEFFICIENT MODEL IN FUNC- TIONAL DATA ANALYSIS UNDER UNKNOWN HETEROSKEDAS- TICITY: A GMM-BASED APPROACH . . . . . . . . . . . . . . . . . . 67 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.2 Varying-coefficient functional model and moment conditions . . . . . . . . . . . . 73 3.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.2.2 Local-linear mean-zero function . . . . . . . . . . . . . . . . . . . . . . . 74 3.3 Multi-step estimation procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.3.1 Step-I: Initial least squares estimates . . . . . . . . . . . . . . . . . . . . . 76 3.3.2 Step-II: Intermediate steps . . . . . . . . . . . . . . . . . . . 
. . . . . . . 77 3.3.3 Step-III: Final estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.4 Asymptotic results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.5 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.6 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.8 Technical details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.8.1 Some useful lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.8.2 Proof of Theorem 3.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3.8.3 Proof of Theorem 3.4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 CHAPTER 4 TENSOR BASED SPATIO-TEMPORAL MODELS FOR ANALY- SIS OF FUNCTIONAL NEUROIMAGING DATA . . . . . . . . . . . . 100 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.2 Tensor-on-tensor functional regression . . . . . . . . . . . . . . . . . . . . . . . . 105 4.3 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 4.3.1 Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 4.3.2 Convergence rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.4 Algorithm and implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.5 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4.6 Application to ForrestGump-data . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.6.1 Details about the data-set . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.6.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.8 Technical details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.8.1 Technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.8.2 Proof of Theorem 4.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.8.3 Proof of Theorem 4.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 CHAPTER 5 EPILOGUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 x BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 xi LIST OF TABLES Table 2.1 Performance of the estimation procedure where the residuals are generated from Brownian motion (a). Mean of the estimated coefficients, standard de- viation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 34 Table 2.2 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 
35 Table 2.3 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 36 Table 2.4 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 37 Table 2.5 Performance of the estimation procedure where the residuals are generated from Ornstein-Uhlenbeck process (c) with 𝜇0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . 38 Table 2.6 Performance of the estimation procedure where the residuals are generated from Ornstein-Uhlenbeck process (c) with 𝜇0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . 39 Table 2.7 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 40 Table 2.8 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 41 xii Table 2.9 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 42 Table 2.10 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 43 Table 2.11 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 44 Table 2.12 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. 
Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 45 Table 2.13 Apnea-data results: Estimated values and associated standard errors for the regression coefficients are provided upto four decimal places based on the existing and proposed methods. First line corresponding to each ROI shows results based on initial estimates and the second line corresponds to that of proposed estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Table 3.1 Performance of the estimation procedure with SNR𝜃 = 0.5 . . . . . . . . . . . . 86 Table 3.2 Performance of the estimation procedure with SNR𝜃 = 1 . . . . . . . . . . . . . 87 Table 4.1 Results of simulation situations (Situation1) where each modes are assumed to be independent for X (𝑡) and E (𝑡) for fixed time-points. Here we assume each of { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 and {𝜂 𝑞1 ,𝑞2 } 𝑞(𝑘) 1 ,𝑞 2 are independent for ( 𝑝 1 , 𝑝 2 ) and (𝑞 1 , 𝑞 2 ) respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Table 4.2 Results of simulation situations (Situation2)a where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with exponential covariance function. . . . . . . . . . . . . . . . . . . . . . . . 119 xiii Table 4.3 Results of simulation situations (Situation2)b where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with Matérn covariance function. . . . . . . . . . . . . . . . . . . . . . . . . . 120 Table 4.4 ForrestGump-data: Summary statistics across participants. . . . . . . . . . . . . 125 xiv LIST OF FIGURES Figure 1.1 A schematic diagram of an MRI scanner. . . . . . . . . . . . . . . . . . . . . . 8 Figure 1.2 A typical example of MRI scan of healthy human brain. Source: Long et al. (2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 2.1 Beijing2017-data: Reading of hourly PM2.5 measures for twelve different locations over 608 hourly time-points during January 2017. . . . . . . . . . . . 20 Figure 2.2 Beijing2017-data results: Scree plots of fraction of variance explained (FVE). . 48 Figure 3.1 Apnea-data: Smoothed average path length (APL) from 29 patients over different thresholds. Black solid line indicates the mean of APL over thresholds. 72 Figure 3.2 Apnea-data analysis: Plots of estimated coefficient functions of age (top panel) and number of lapses (bottom panel) for average path length associated with Fractional Anisotropy (FA) in DTI analysis. . . . . . . . . . . . . . . . . . . . 89 Figure 4.1 Multi-modal-data: An example of multi-modal data analysis which seeks to explore the relationship between EEG and fMRI data. . . . . . . . . . . . . . . 100 Figure 4.2 ForrestGump-data: (Top panel) BOLD fMRI for an example subject during their first run (see Section 4.6 for details). 35 axial slices (thickness 3.0 mm) represents the third mode of the tensor with 80 × 80 voxels (3.0 × 3.0 mm) in- plate resolution measured at every repetition time (TR) of 2 seconds. (Bottom panel) fMRI data-set consists of a time series of 3D images (tensors) at each TR (source: Wager and Lindquist (2015)). . . . . . . . . . . . . . . . . . 
. . . 102 Figure 4.3 ForrestGump-data: Summary statistics for the parameters estimates of head motion correction across TRs and participants. (Left panel) Magnitude of three rotational parameters (in radians) and (Right panel) Magnitude of three translation parameters (in millimeters) for each individual on each of the 451 TRs. In each plot, solid black line indicates mean over the individuals through TRs and black dotted lines indicate mean±2sd over the individuals through TRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Figure 4.4 ForrestGump-data: Covariates of interest. Cartesian and polar coordinates and pupil area are shown across TRs and participants. . . . . . . . . . . . . . . 124 Figure 4.5 ForrestGump-data results: Estimate of the coefficient 𝜷1 (𝑡) corresponding to visual feature distance of eye-gaze for different location. Legends for different parcellation as mentioned in Shen et al. (2013) are also provided. . . . . . . . . 127 xv Figure 4.6 ForrestGump-data results: Estimate of the coefficient 𝜷2 (𝑡) corresponding to visual feature angle of eye-gaze for different location. Legends for different parcellation as mentioned in Shen et al. (2013) are also provided. . . . . . . . . 128 Figure 4.7 ForrestGump-data results: Estimate of the coefficient 𝜷3 (𝑡) corresponding to visual feature pupil area for different location. Legends for different parcel- lation as mentioned in Shen et al. (2013) are also provided. . . . . . . . . . . . 129 xvi LIST OF ALGORITHMS Algorithm 2.1 Estimation of 𝜷 using the Quasi-Newton method with halving. . . . . . . 26 Algorithm 3.1 Estimation of 𝜷(𝑠) : 𝑠 ∈ S for the proposed local-GMM based estima- tion procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Algorithm 4.1 Estimation of 𝜷(𝑡) : 𝑡 ∈ [0, 𝑇] for tensor based function-on-function regression method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 xvii KEY TO ABBREVIATIONS ADC Apparent Diffusion Coefficient BOLD Blood Oxygen Level Dependent DTI Diffusion Tensor Imaging FA Fractional Anisotropy (f)MRI (Functional) Magnetic Resonance Imaging FDA Functional Data Analysis FPCA Functional Principal Component Analysis FVE Fraction of variance explained GEE Generalized Estimating Equations GLM Generalized Linear Model (G)MM (Generalized) Method of Moments i.i.d. Independent and Identically Distributed LDA Longitudinal Data Analysis MD Mean Diffusivity OSA Obstructive Sleep Apnea PCA Principal Component Analysis PM Particulate Matter QIF Quadratic Inference Functions ROI Region of Interest TR Repetition Time VCM Varying Coefficient Model xviii CHAPTER 1 PROLOGUE In this chapter, we provide a brief overview of the relevant topics and state the problems that we are going to present in the following chapters. 1.1 Big data analysis The advances in scientific research and technological developments have led to the collection and storage of huge amounts of data which are not only voluminous but also complex in structure. These are commonly called “Big data”. Big data give rise to statistical problems in natural science, engineering, social sciences, and humanities. The analysis of such data having massive volumes and complex structures for decision-making and scientific discovery is a challenge faced by statisticians and computer scientists, which requires innovative statistical and computational methods, sophisticated statistical modelling, and theoretical results. 
Collectively, this is known as “Data science” which has nowadays become a multi-disciplinary field involving knowledge from various disciplines for developing new methodologies for various kinds of data: low or high dimensional; structured, unstructured, or semi-structured. In recent decades, big data has become a significant part of scientific interest, where images, videos, texts, and other objects can be considered as a form of massive data. Therefore, the statistician plays an important role in proposing new methodologies for discovering information from available big data. In the following subsections, we discuss the different data structures that lead to different methodologies. In Sub-section 1.1.1 we discuss functional data analysis, and in Sub-section 1.1.2 we discuss analysis of a complex structured multidimensional array (also known as tensor data analysis). 1.1.1 Functional data analysis This subsection is dedicated to a discussion on functional data analysis (FDA) and its relevance in the dissertation. Some relevant review articles include Morris (2015); Müller (2016); Wang 1 et al. (2016). Owing to the types of data generated in various scientific research in the fields of biology, audiology, environmental sciences, earth sciences, and economics (to name a few), there was a need for a statistical methodology that could analyze data which are observed as functions varying over time, space, or other continuum domains. This led to the development of functional data analysis. Although the term “functional data analysis” was coined by Jim Ramsey in the famous 1982 paper (Ramsay, 1982), its origin dates back to the late 1940s in the Ph.D. theses of Kari Karhunen (Karhunen, 1946) and Ulf Grenander (Grenander, 1950). In their seminal works, Karhunen and Grenander respectively, discussed the decomposition of square integrable continuous-time stochastic processes into series expansions to obtain representations in a Hilbert space. The idea to expand random curves appeared in Rao (1958) and Tucker (1958) around the same time. In the last three decades, FDA gained considerable momentum in statistics literature, of which some significant works are Ramsay and Silverman (2005, 2007); Ferraty and Vieu (2006); Horváth and Kokoszka (2012); Zhang (2013); Hsing and Eubank (2015) and some notable survey articles are Morris (2015); Wang et al. (2016); Greven and Scheipl (2017); Li et al. (2022). The main feature that makes functional data distinct from other types of data, especially those having a large 𝑝 (number of parameters) and small 𝑛 (sample size) framework, is that the functional data are infinite-dimensional in nature, since the underlying statistical quantity of the measurement is a curve or a surface over a continuum domain. Thus, the commonly used classical multivariate statistical methods (Anderson et al., 1958) do not suffice for these types of analyses. Moreover, in asymptotic analysis, the space between the function arguments is assumed to approach zero, hence making the number of arguments tend to infinity. This is essentially the large 𝑝 (rather 𝑝 𝑛 , where the number of arguments of the function is 𝑝 𝑛 for the sample size 𝑛) problem in high-dimensional statistics. In fact, this dimensionality issue is a blessing in disguise because we end up with more data, with an extra cost being paid by the smoothness assumption on some standard spaces. 
The smoothness assumption tells us that the information from measurements at neighboring arguments can be pooled, thereby overcoming the curse of dimensionality. Some significant research maneuvering the use of FDA include 2 • Functional generalized linear models (Müller and Stadtmüller, 2005) • Functional sliced inverse regression (Ferré and Yao, 2005) • Multi-level functional data analysis (Crainiceanu et al., 2009; Huang et al., 2014; Xu et al., 2018) • Functional time series (Hörmann and Kokoszka, 2010; Aue et al., 2015; Kowal et al., 2017; van Delft and Eichler, 2018) • Spatially dependent functional data (Zhu et al., 2014; Kuenzer et al., 2021) • Spatio-temporal point process (Li and Guan, 2014; Goldsmith et al., 2015) • Longitudinal functional data analysis (Goldsmith et al., 2012; Chen and Müller, 2012; Park and Staicu, 2015; Staicu et al., 2020) In FDA, continuous functional data are available at every time-point or can at least be evaluated for some time-points. In practice, however, data are observed in discrete domains such as time- points, with or without measurement errors. For the demonstration of the theoretical results, without loss of generality, it suffices to assume that the functional data are observed continuously without measurement error. Note that the theory behind the estimation can be different for different measurement schedules/ sampling plans such as densely or sparsely observing data over time-points. In most cases, the analyses of sparse and dense functional data are different, although sparse and dense data are asymptotic concepts and are difficult to use in practice. While for dense functional data, one can smooth each of the curves separately and then proceed with further estimation and inference procedures based on pre-smoothed curves (Castro et al., 1986; Rice and Silverman, 1991; Zhang and Chen, 2007), for sparse functional data, the pre-smoothing step is not required (Yao et al., 2005). Various smoothing techniques are available in non-parametric literature to deal with functional data. The different types of non-parametric smoothing techniques commonly used in the literature are 3 • Spline smoothing (Rice and Silverman, 1991; Cai and Hall, 2006) • B-spline (Cardot et al., 1999; James et al., 2000; Rice and Wu, 2001) • Penalized splines (Ruppert et al., 2003; Yao and Lee, 2006) • Local polynomial smoothing (Fan and Gijbels, 1996; Zhang and Chen, 2007; Yao and Li, 2013) Principal component analysis (PCA) in FDA is a generalization of the classical high-dimensional statistics for finite-dimensional matrix-valued observations to the case of infinite-dimensional con- tinuum domain, and it is termed as functional principal component analysis (FPCA). The main objective of FPCA is to express the underlying stochastic processes as a truncated sum of a count- able sequence of uncorrelated random variables, thereby reducing the problem from infinite into that of finite dimension, so that the tools of multivariate data analysis can be applied to the resulting random vector of scores. FPCA based on spline smoothing was studied in James et al. (2000); Zhou et al. (2008), whereas, FPCA based on kernel were discussed in Hall et al. (2006); Müller and Yao (2010); Li and Hsing (2010). Asymptotic theories based on kernel smoothing (a.k.a. local polynomial smoothing) are more profound in the literature. 
For fully observed dense data, Hall and Hosseini-Nasab (2009) derived a stochastic expansion of estimators of eigen-values and eigen-functions based on the principles of operator theory, the statistical implementations of which were provided in Hall and Hosseini-Nasab (2006). For sparse functional data, FPCA approach was studied in Yao et al. (2005); Liu and Müller (2009). Hall et al. (2006) discussed the theoretical properties of FPCA based on local linear smoother. In one of the seminal works in Li and Hsing (2010), an estimation procedure was discussed for all types of sampling strategies. It was found that in some specific ranges for the rate of the number of functional points, for dense sampling strategy, pre-smoothing was found to be asymptotically negligible, and other important commonly used statistics such as mean, covariance, and eigen-components could be estimated using the parametric rate. On the other hand, for sparse functional data, those statistics could only be estimated with a non-parametric convergence rate. It was shown that the estimation of the eigen-values was not 4 as sensitive to the sampling design as the estimation of the eigen-functions. This was the first time where a phase transition was observed. Zhang and Wang (2016) investigated local linear estimation of mean and covariance functions with general weighting schemes, where equal weight per observation and equal weight per subject were two special cases. All works mentioned till now were based on univariate functional data. Multivariate FPCA was discussed in Viviani et al. (2005); Wang (2008); Berrendero et al. (2011); Chiou et al. (2014); Happ and Greven (2018) among many others. Regression analysis for FDA is one of the most active research domains for the analysis of functional data wherein the modelling of the data depends on the type of variables. For example, • when the response is functional, but the covariates are vectors, the approach is called function- on-scalar regression (Zhu et al., 2014; Chen et al., 2019). • when the response is vector valued, while the covariate is functional, this approach is called scalar-on-function regression (Cardot et al., 1999, 2003; Müller and Stadtmüller, 2005; Cai and Hall, 2006; Hall et al., 2007; Li and Hsing, 2007; Goldsmith et al., 2011; Kato, 2012). In this approach, the covariates and the varying coefficient are expressed as the same set of orthogonal functional bases. • when the response and covariates are both functional, the approach is called function-on- function regression. It was introduced by (Ramsay and Dalzell, 1991). In this regression set-up, a varying coefficient model (Hastie and Tibshirani, 1993) was implemented. These regression models are often referred to as concurrent linear models. Recent literature for functional concurrent linear models include Faraway (1997); Zhang and Chen (2007); Zhang et al. (2010); Wang et al. (2016); Fang et al. (2020). Other techniques to estimate the regression function can be found in Hoover et al. (1998); He et al. (2003); Yao et al. (2005); He et al. (2018). In this dissertation, in Chapter 3, we consider the first case where the response is functional but the covariates are vectors and we consider the third case where the covariate and response both are 5 functional in Chapters 2 and 4. 1.1.2 Tensor data analysis In many scientific researches, for instance, in areas of imaging studies, network sciences, economics, computer technologies, genetics, recommendation systems, etc. data appear structured. 
Such high- dimensional as well as multi-dimensional structures have raised various challenges to their analysis. Thus, multidimensional arrays, popularly known as “tensors”, came as a savior for understanding the structure of these complex data. The tensor as a generalization of matrices appeared for the first time in the literature during 1928 (Hitchcock, 1928) and was used to represent and store data efficiently. Ever since then, its use has seen a boom in the scientific community. Some significant research surveys can be found in Ji et al. (2019); Bi et al. (2021). Sub-section 1.3.1 discusses some basic notation and properties of the tensor. We will consider such kind of data in chapter 4 in more detail. 1.2 Some applications In December 2, 1956, eminent statistician Professor P. C. Mahalanobis emphasized that Statistics is the universal tool of inductive inference, research in natural and social sciences, and technological applications. Statistics, therefore, must have a clearly defined purpose, either in the pursuit of knowledge or in the promotion of human welfare. In this dissertation, some advanced methodologies driven by their applications are proposed for two types of neuroimaging studies. 1.2.1 (Functional) magnetic resonance imaging A remarkable research area developed in magnetic resonance imaging (MRI) for studying the structure and functioning of the human brain in the years following 1977 after the first MRI scanner 6 was developed. In these studies, the differences in magnetic properties of certain molecules (especially water molecules) in the brain are measured by using the fact that their density differs in different media like air, white matter, gray matter, blood vessels, and tumors. Functional MRI, also known as fMRI, has recently gained popularity as a pre-surgical procedure to map the functional architecture of a subject’s brain without exciting the tissues associated with some critical skills like vision, hearing, etc. Occurrence of neural activity in a certain portion of the brain results in increased metabolic activity, causing a rush of oxygen-carrying hemoglobin in that particular area, whereas immediately following the end of neural activity, the oxygen level drops. These changes in the oxygen levels give rise to a measure called the blood oxygen level dependent (BOLD) signal, which is the ratio between oxygenated and de-oxygenated hemoglobin in blood. The objective of fMRI studies is to observe the neural activity of the brain in instantaneous time with high spatial resolution by detecting changes in the BOLD signal. Typically, the BOLD signal happens to rise well above the baseline with a peak at around 6 seconds following a neural activity, and decays back to baseline over a period of 20 seconds. Due to the observable nature of neural activities, we can use fMRI data to make various inferences if we can assess the relationship between neural activity and the BOLD response. In fMRI data, images are collected over time; therefore, in order to maintain high temporal resolution, spatial resolution is sacrificed. High-resolution structural images are used to get back the spatial resolution from the fMRI data, and the spatial coordinates are used to identify the activation regions during the fMRI scans by examining the aligned structural coordinates. The times between two successive scans are called repetition time (TR). 
Subjects are aligned in the scanner, which is assumed to be a three-dimensional coordinate system with coordinates (X, Y, Z) referenced to the bore of the magnet, where the Z direction runs along the bore (from feet to head) and the X and Y directions refer to the plane perpendicular to the Z axis. A schematic diagram of the MRI scanner is provided in Figure 1.1. The brain is naturally a continuous medium due to the existence of neurons at almost all coordinates, but it can be made discrete by dividing the brain into a set of cubes. These cubes are commonly called voxels. A typical MRI scan of a healthy human brain is provided in Figure 1.2.

Figure 1.1 A schematic diagram of an MRI scanner (superconducting primary electromagnet coil, scanner bore, and scanner table).

fMRI data exhibit spatial and temporal correlations; therefore, we need sophisticated tools to analyze them. For example, brain tissue in neighboring voxels is supplied by the same kind of vasculature; as a result, a large response in one voxel in a neighborhood increases the probability that the neighboring voxels also show a large response (spatial dependence), and, under the same set of stimuli over time, the brain activation is expected to be similar. Moreover, fMRI data can often be corrupted with noise arising due to the thermal motion of electrons inside the bore of the magnet, the brain itself, and other physiological reasons. In order to reduce such inherent unaccounted and uncontrolled errors due to head motion and scanner drift, a series of preprocessing steps is performed (see Appendix A for more details). The main objective of fMRI data analysis is to identify regions of the brain that show task-related activity. For more information on fMRI data analysis, please refer to Huettel et al. (2004); Lindquist (2008); Ashby (2011); Wager and Lindquist (2015).

Figure 1.2 A typical example of an MRI scan of a healthy human brain. Source: Long et al. (2012)

1.2.2 Diffusion tensor imaging

Diffusion tensor imaging, popularly known as DTI, measures the restricted diffusion of water molecules in brain tissue in order to produce neural tract images. When water molecules are located in fiber tracts, their movement is restricted and they are more likely to be anisotropic, whereas those molecules in the rest of the brain are less restricted in their movement and are therefore isotropic. Diffusion causes water molecules to diverge from a central point and gradually reach the surface of an ellipsoid when the medium is anisotropic. In an isotropic medium, water molecules move out at the same rate in all directions. Thus, using the laws of physics (such as attenuation), the signal of an MRI voxel can be converted into numerical measures of diffusion, which are then interpreted by physicians. Thus, each brain voxel has one or more pairs of parameters, such as the rate of diffusion, the direction of diffusion, etc. The properties of each voxel of a single DTI image are usually represented by a vector or a higher-order multidimensional array. Consider an ellipsoidal tensor in a three-dimensional Cartesian grid, where there exist three projections of the given ellipsoid onto the three coordinate axes. These projections provide the apparent diffusion coefficients (ADC), denoted as ADC_x, ADC_y and ADC_z corresponding to the X, Y and Z axes, respectively. Therefore, the average diffusivity in a given voxel is defined as ADC = (ADC_x + ADC_y + ADC_z)/3. Note that the ellipsoid has three axes, one principal axis (the longest) and two minor axes passing through the center, where the directions and lengths of these axes are the eigen-vectors and eigen-values, respectively, in the context of tensor algebra. The diffusion along the principal axis is termed the axial diffusivity (denoted L_1), and the average diffusivity along the two other minor axes is termed the radial diffusivity (denoted L_23, where L_23 = (L_2 + L_3)/2 and L_2, L_3 are the eigen-values corresponding to the minor axes). The mean diffusivity is defined as MD = (L_1 + L_2 + L_3)/3. The degree of anisotropy of a diffusion process is termed the fractional anisotropy (FA), which is a scaled measure that belongs to the interval [0, 1]. The quantity FA takes the value zero when diffusion is isotropic (i.e., unrestricted in all directions) and takes the value one when diffusion occurs along only one axis and is fully restricted in the other directions. FA can be calculated using the following formula:

FA = \sqrt{\frac{3}{2}} \left[ \frac{(L_1 - MD)^2 + (L_2 - MD)^2 + (L_3 - MD)^2}{L_1^2 + L_2^2 + L_3^2} \right]^{1/2}   (1.1)

For more information on DTI, please refer to O'Donnell and Westin (2011); Van Hecke et al. (2016).
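The scalar measures above are simple functions of the three eigen-values of a voxel's diffusion tensor. As a minimal illustrative sketch (written in Python/NumPy purely for exposition; the 3 x 3 tensor below is hypothetical and not taken from any data set used in this dissertation), MD, axial and radial diffusivity, and the FA in (1.1) could be computed as follows.

```python
import numpy as np

def dti_scalar_measures(D):
    """MD, axial/radial diffusivity, and FA from a 3x3 symmetric diffusion tensor D."""
    # Eigen-values sorted so that L1 >= L2 >= L3 (principal axis first).
    L = np.sort(np.linalg.eigvalsh(D))[::-1]
    L1, L2, L3 = L
    MD = L.mean()                 # mean diffusivity, (L1 + L2 + L3) / 3
    AD = L1                       # axial diffusivity along the principal axis
    RD = (L2 + L3) / 2            # radial diffusivity, average over the minor axes
    # Fractional anisotropy, equation (1.1): 0 for isotropic diffusion,
    # close to 1 when diffusion is restricted to a single direction.
    FA = np.sqrt(1.5) * np.sqrt(((L - MD) ** 2).sum() / (L ** 2).sum())
    return MD, AD, RD, FA

# Hypothetical diffusion tensor for a single voxel (in units of 10^-3 mm^2/s).
D = np.array([[1.7, 0.1, 0.0],
              [0.1, 0.4, 0.0],
              [0.0, 0.0, 0.3]])
print(dti_scalar_measures(D))
```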
1.3 Mathematical preliminaries

In this section, we introduce some notation and basic mathematical principles that will be used frequently throughout the dissertation.

1.3.1 Notations of tensor/matrix object

A tensor is a multidimensional array indexed by D indices. A first-order tensor is a vector (D = 1), a second-order tensor is a matrix (D = 2), and for D > 2 we call such objects higher-order tensors. In the following paragraphs, we provide a brief summary of tensors and define important notation. Interested readers can refer to the survey article by Kolda and Bader (2009) for more details. A D-dimensional tensor is denoted by sans-serif upper-case letters, A ∈ R^{I_1 × ··· × I_D}, where I_d is the size along each mode (dimension) d, for d = 1, ..., D. Therefore, the number of elements in the tensor A is I = \prod_{d=1}^{D} I_d, and the order of the tensor is the number of dimensions. Here and henceforth, matrices are denoted by bold-face capital letters (examples: A, B, ...), vectors are written as bold-face lower-case letters (examples: a, b, ...), and scalars are presented as Latin letters (a, b, ...). The entry in the i-th row and j-th column of a matrix A is denoted by (A)_{i,j} = a_{ij}, and the (i_1, ..., i_D)-th entry of a D-dimensional tensor is denoted by (A)_{i_1,...,i_D} = a_{i_1,...,i_D}. Slices are two-dimensional sections of the tensor defined by fixing all but two indices, and thus form an I_d × I_{d'} matrix. For a D-way tensor A ∈ R^{I_1 × ··· × I_D} with element a_{i_1,...,i_D} at position i_d along mode d, d = 1, ..., D, the vectorization operator vec(·) produces a vector of length \prod_{d=1}^{D} I_d, formed by stacking the elements of A into a single column vector, i.e.,

\mathrm{vec}(A)\Big[\, i_1 + \sum_{d=2}^{D} (i_d - 1) \prod_{k=1}^{d-1} I_k \,\Big] = a_{i_1, \dots, i_D}   (1.2)

In particular, for a matrix A of order I × J, vec(A) = (a_{1,1}, ..., a_{I,1}, ..., a_{1,J}, ..., a_{I,J})^T. Similarly to the vectorization operator, one can unfold a D-way array A via d-mode matricization (unfolding) to form a matrix A_{(d)} with I_d rows and \prod_{d' \neq d} I_{d'} columns, where the element a_{i_1,...,i_D} is placed at row i_d and column

1 + \sum_{\substack{d_1 = 1 \\ d_1 \neq d}}^{D} (i_{d_1} - 1) \prod_{\substack{d_2 = 1 \\ d_2 \neq d}}^{d_1 - 1} I_{d_2}.
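The index bookkeeping in (1.2) and in the d-mode unfolding can be checked directly on a small array. The sketch below (Python/NumPy, used here only for illustration; the helper names vec and unfold are not from the dissertation) builds vec(A) and A_(d) from the definitions; the column-stacking order in (1.2) corresponds to Fortran-style ("F") ordering.

```python
import numpy as np

def vec(A):
    """vec(A) as in (1.2): stack entries with the first index varying fastest."""
    return A.reshape(-1, order="F")

def unfold(A, d):
    """d-mode matricization A_(d): I_d rows, product of the remaining sizes as columns."""
    return np.moveaxis(A, d, 0).reshape(A.shape[d], -1, order="F")

# Small 3-way example tensor of size I1 x I2 x I3 = 2 x 3 x 4.
A = np.arange(24).reshape(2, 3, 4, order="F")

# Sanity check of (1.2): entry (i1, i2, i3) lands at position
# i1 + (i2 - 1) * I1 + (i3 - 1) * I1 * I2 (written with 1-based indices).
i1, i2, i3 = 2, 3, 4
pos = (i1 - 1) + (i2 - 1) * 2 + (i3 - 1) * 2 * 3
assert vec(A)[pos] == A[i1 - 1, i2 - 1, i3 - 1]

# With this ordering, vec(A) coincides with vec(A_(1)).
assert np.array_equal(vec(A), vec(unfold(A, 0)))
print(unfold(A, 1).shape)   # (3, 8)
```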
1.3.2 Different kinds of products

Analogous to the Frobenius norm of a matrix, the norm of a tensor A is the square root of the sum of squares of its entries,

\|A\|_F = \Big( \sum_{i_1=1}^{I_1} \cdots \sum_{i_D=1}^{I_D} a_{i_1,\dots,i_D}^2 \Big)^{1/2}   (1.3)

The scalar product ⟨A, B⟩ of two D-dimensional tensors of the same size is defined as

⟨A, B⟩ = \sum_{i_1,\dots,i_D} a_{i_1,\dots,i_D}\, b_{i_1,\dots,i_D}   (1.4)

Thus, immediately, the Frobenius norm of the tensor A can be expressed as \|A\|_F = \sqrt{⟨A, A⟩}. Two tensors A and B are said to be orthogonal if ⟨A, B⟩ = 0. Furthermore, consider the contracted tensor product between two tensors with different mode dimensions. For two tensors A ∈ R^{I_1×···×I_K×P_1×···×P_L} and B ∈ R^{P_1×···×P_L×Q_1×···×Q_M}, the contracted tensor product (Lock, 2018; Raskutti et al., 2019) is defined as ⟨A, B⟩_L, whose (i_1, ..., i_K, q_1, ..., q_M)-th element is \sum_{p_1,\dots,p_L} a_{i_1,\dots,i_K,p_1,\dots,p_L}\, b_{p_1,\dots,p_L,q_1,\dots,q_M}. As a special case, ⟨A, B⟩_1 = AB, where A and B are I × P and P × Q matrices. A D-way tensor A has rank one when it can be written as the outer product of D vectors u^{(1)}, ..., u^{(D)} of lengths I_1, ..., I_D respectively, i.e.,

A = u^{(1)} ◦ ··· ◦ u^{(D)}   (1.5)

where the (i_1, ..., i_D)-th element of A is \prod_{d=1}^{D} u^{(d)}_{i_d}. The Kronecker product of the matrices A ∈ R^{I×J} and B ∈ R^{K×L}, denoted by A ⊗ B, is the (IK) × (JL) matrix defined by

A ⊗ B = (a_{ij} B)_{i,j} = [a_1 ⊗ b_1, a_1 ⊗ b_2, ···, a_J ⊗ b_{L-1}, a_J ⊗ b_L]   (1.6)

The Khatri-Rao product of matrices A ∈ R^{I×K} and B ∈ R^{J×K}, denoted by A ⊙ B, is defined by

A ⊙ B = [a_1 ⊗ b_1, ···, a_K ⊗ b_K]   (1.7)

which is an (IJ) × K matrix. The Hadamard product is the element-wise matrix product, denoted by A ∗ B, where A and B are both I × J.

1.3.3 Tensor decomposition

Now the question is how to represent a tensor as a sum of a finite number of rank-one tensors. The answer came from psychometrics in the form of the canonical decomposition or CANDECOMP (Carroll and Chang, 1970) and the parallel factors or PARAFAC (Harshman et al., 1970) decomposition; in the tensor decomposition literature it is now known as the CANDECOMP/PARAFAC (CP) decomposition, which is an extension of the matrix singular value decomposition (Tucker, 1966; Kiers, 2000). The CP decomposition factorizes a tensor into a sum of rank-one tensors, i.e., mathematically,

A = \sum_{r=1}^{R} u_r^{(1)} ◦ ··· ◦ u_r^{(D)}   (1.8)

where u_r^{(d)} ∈ R^{I_d}, d = 1, ..., D, r = 1, ..., R, are column vectors and A cannot be written as a sum of fewer than R outer products; the positive integer R is the rank of the tensor. Equation (1.8) is sometimes denoted as A = [[U_1, ..., U_D]], where U_1, ..., U_D have linearly independent columns, U_d = [u_1^{(d)}, ..., u_R^{(d)}] ∈ R^{I_d × R} for each d = 1, ..., D.
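The products defined above and the CP representation (1.8) can be verified numerically on small random factor matrices. The following sketch (Python/NumPy, shown only for illustration; the helper functions and dimensions are hypothetical) reconstructs a rank-R tensor from its factors and checks two identities that reappear in the next sub-section, namely vec(A) = (U_D ⊙ ··· ⊙ U_1) 1_R and (A ⊙ B)^T (A ⊙ B) = (A^T A) ∗ (B^T B).

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, R = 4, 3, 5, 2
U1 = rng.standard_normal((I1, R))
U2 = rng.standard_normal((I2, R))
U3 = rng.standard_normal((I3, R))

def khatri_rao(A, B):
    """Column-wise Kronecker product (1.7): an (I*J) x K matrix for A (I x K), B (J x K)."""
    return np.vstack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])]).T

def cp_reconstruct(factors):
    """Sum of rank-one tensors as in (1.8): sum_r u_r^(1) o ... o u_r^(D)."""
    shape = tuple(U.shape[0] for U in factors)
    A = np.zeros(shape)
    for r in range(factors[0].shape[1]):
        outer = factors[0][:, r]
        for d in range(1, len(factors)):
            outer = np.multiply.outer(outer, factors[d][:, r])
        A += outer
    return A

A = cp_reconstruct([U1, U2, U3])

# vec(A) = (U3 ⊙ U2 ⊙ U1) 1_R, with vec(.) taken in column-stacking ("F") order.
lhs = A.reshape(-1, order="F")
rhs = khatri_rao(khatri_rao(U3, U2), U1) @ np.ones(R)
assert np.allclose(lhs, rhs)

# Khatri-Rao / Hadamard identity: (A ⊙ B)^T (A ⊙ B) = (A^T A) * (B^T B).
KR = khatri_rao(U1, U2)
assert np.allclose(KR.T @ KR, (U1.T @ U1) * (U2.T @ U2))
```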
1.3.4 Some useful results

In this sub-section, we present some useful well-known results without proofs. We define 1_R as an R-dimensional vector with all elements equal to 1.

1. vec(u^{(1)} ◦ u^{(2)} ◦ ··· ◦ u^{(D)}) = u^{(D)} ⊗ ··· ⊗ u^{(1)}.

2. For two vectors a and b, a ⊗ b = a ⊙ b and a ◦ b = ab^T.

3. For any matrices A, B, C and D such that the required matrix multiplications are possible, we have the following:

a) (A ⊗ B)(C ⊗ D) = AC ⊗ BD.
b) (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}.
c) A ⊙ B ⊙ C = (A ⊙ B) ⊙ C = A ⊙ (B ⊙ C).
d) (A ⊙ B)^T (A ⊙ B) = (A^T A) ∗ (B^T B).
e) (A ⊙ B)^+ = {(A^T A) ∗ (B^T B)}^{-1} (A ⊙ B)^T, where ^+ denotes the Moore-Penrose pseudo-inverse.
f) vec(A ⊙ B) = ((I ⊙ A) ⊗ I) vec(B).
g) vec(B ⊙ A) = {I ⊙ (A(I ⊗ 1^T))} vec(B).
h) trace(AB) = trace(BA) = vec(A^T)^T vec(B).
i) vec(ABC) = (C^T ⊗ A) vec(B).
j) rank(A ⊙ B) ≤ rank(A ⊗ B) ≤ rank(A) rank(B).
k) If A is a matrix of order m_1 × m_2 with (i, j)-th element a_{ij}, then the Frobenius norm of A is defined as \|A\|_F = \big( \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} |a_{ij}|^2 \big)^{1/2} = \{trace(A^T A)\}^{1/2} = \big( \sum_{i=1}^{\min(m_1, m_2)} \sigma_i^2(A) \big)^{1/2}, where \sigma_i(A) is the i-th singular value of A and trace(A) is the trace of a square matrix A.

4. If the tensor A admits a rank-R decomposition (1.8), then A_{(d)} = U_d (U_D ⊙ ··· ⊙ U_{d+1} ⊙ U_{d-1} ⊙ ··· ⊙ U_1)^T and vec(A) = (U_D ⊙ ··· ⊙ U_1) 1_R.

5. \|A\|_F = \|A_{(d)}\|_F = \|vec(A_{(d)})\|_2 for d = 1, ..., D.

6. vec(A) = P^{(d)}_{I_1,...,I_D} vec(A_{(d)}), where P^{(d)}_{I_1,...,I_D} are permutation matrices such that (P^{(d)}_{I_1,...,I_D})^{-1} = (P^{(d)}_{I_1,...,I_D})^T.

1.4 Dissertation outline

The main objective of this dissertation is to answer some fundamental questions that appear in different domains of statistics due to real-life situations. Let us dive into these questions one by one and briefly introduce them.

Question 1. How should we handle dense functional responses in the quadratic inference method?

We consider the problem of estimation for constant linear effect models in semi-parametric functional regression with functional response, where each response curve is decomposed into the overall mean function indexed by a covariate function with constant regression parameters and a random error. In Chapter 2, we provide an alternative solution using a popular method for the analysis of correlated data, viz., the quadratic inference approach for such models. Here, we use basis functions that are estimated non-parametrically. Therefore, the proposed method can be easily implemented without assuming any working correlation structure. Moreover, we achieve a parametric √n-convergence rate under a proper choice of bandwidth when the number of repeated measurements per trajectory is larger than n^{a_0}, where n is the number of trajectories, and establish the asymptotic normality of the resulting estimator. The performance of the proposed method is compared with that of existing methods through extensive simulation studies. Real data analysis is also carried out to demonstrate the proposed method.

Question 2. How should heteroskedastic functional data be analyzed?

Motivated by recent work on diffusion tensor imaging, we propose a novel varying-coefficient model in Chapter 3. We develop a multi-step estimation procedure to simultaneously estimate the varying-coefficient functions using a local linear generalized method of moments (GMM) based on continuous moment conditions. To incorporate spatial dependence, the continuous moment conditions are first projected onto eigen-functions and then combined by weighted eigen-values. This approach solves the challenges of using an inverse covariance operator directly. We propose an optimal instrumental variable that minimizes the asymptotic variance function among the class of all local linear GMM estimators, and it outperforms the initial estimates which do not incorporate the spatial dependence. It is shown that with our proposed method, the accuracy of the estimation is significantly improved under heteroskedasticity.
We investigate the asymptotic properties of the initial and proposed estimators. Extensive simulation studies illustrate the finite-sample performance, and the analysis of real data confirms the efficacy of the proposed method.

Question 3. How should functional regression be performed for complex structured data such as tensors?

All neuroimaging modalities have their own strengths and limitations. A current trend is towards interdisciplinary approaches that use multiple imaging methods to overcome the limitations of each method in isolation. At the same time, neuroimaging data are increasingly being combined with other non-imaging modalities, such as behavioral and genetic data. The data structure of many of these modalities can be expressed as time-varying multidimensional arrays (tensors), collected at different time-points on multiple subjects. In Chapter 4, we consider a new approach for the study of neural correlates in the presence of tensor-valued brain images and tensor-valued predictors, where both data types are collected over the same set of time-points. We propose a time-varying tensor regression model with an inherent structural composition of responses and covariates. Regression coefficients are expressed using the B-spline technique, and basis function coefficients are estimated using CP-decomposition by minimizing a penalized loss function. We develop a varying-coefficient model for the tensor-valued regression model, where both predictors and responses are modeled as tensors. This development is a non-trivial extension of function-on-function concurrent linear models for complex and large structural data where the inherent structures are preserved. In addition to the methodological and theoretical development, the usefulness of the proposed method based on both simulated and real data analysis (e.g., the combination of eye-tracking data and functional magnetic resonance imaging (fMRI) data) is also discussed.

Putting it all together, in this chapter we have introduced the concepts of functional data, its computational framework, the required mathematical notation and definitions, and real-life applications, thereby establishing a foundation for the upcoming chapters of this dissertation.

CHAPTER 2
IMPROVING QUADRATIC INFERENCE APPROACH FOR FUNCTIONAL RESPONSES

2.1 Introduction

The key characteristic of longitudinal data analysis (LDA) is the collection of repeated measurements on the same set of individuals over multiple time-points, thus allowing the study of changes in responses over time and the identification of factors that influence those changes. Unlike cross-sectional studies, where one can estimate only "between-individual" responses since individuals are measured at a single time-point, in LDA it is possible to capture "within-individual" changes because repeated measurements on each individual are available. Moreover, longitudinal data are always observed as clusters, where each cluster pertains to the repeated measurements obtained from one individual. Although longitudinal studies are performed for data that are observed sparsely over irregular time-points, such studies do not suffice when voluminous data are observed over a continuum domain. As technologies advance, this type of data is being observed more often, so sophisticated methods are needed to handle it. Since functional data are natural generalizations of multivariate data from finite to infinite dimension, functional data analysis (FDA) has turned out to be an important methodological tool. In the following two paragraphs, we will present a brief review of some significant research in the past decades that led to the current research.
In the following two paragraphs, we present a brief review of some significant research in the past decades that led to the current work. In LDA, the data are generally observed with noise for measurements at each time-point (Taris, 2000; Diggle et al., 2002; Hedeker and Gibbons, 2006; Hand and Crowder, 2017). Moreover, only a few repeated measurements are required in LDA, and the data are observed sparsely with noise. On the other hand, in FDA, data are densely observed as a continuous-time stochastic process without noise (Zhang and Wang, 2016). Often, the sampling plan can have an effect on the performance of the estimation procedures and inference (Hall and Hosseini-Nasab, 2006). In some situations, data are typically functions by nature and are observed densely over time. Chiou et al. (2003) proposed a class of semi-parametric functional regression models to describe the influence of vector-valued covariates on a sample of response curves. When data collection leads to experimental error, smoothing is performed at closely spaced time-points in order to reduce the effect of noise. The current developments of functional regression techniques have been rigorously studied in Fan et al. (1999); Hall et al. (2007); Chen et al. (2019). The applicability of FDA spans various scientific domains such as medical imaging, speech recognition, growth curves, climatology, price index analysis, and many more. Some recent literature on applications of FDA includes Ramsay and Silverman (2005); Ferraty and Vieu (2006); Ramsay and Silverman (2007); Zhang (2013); Hsing and Eubank (2015); Morris (2015); Wang et al. (2016); Kokoszka and Reimherr (2017).

Methodologically, in LDA in the past few years, the generalized estimating equation (GEE) technique proposed by Liang and Zeger (1986) has been extensively used for the estimation of parameters. Although it is an efficient technique, the GEE is unable to estimate the parameters of interest efficiently when the working correlation matrix is not specified correctly. Hence, without requiring the estimation of the correlation parameters, the quadratic inference function (QIF) approach proposed by Qu et al. (2000) is useful for parameter estimation in longitudinal studies (Diggle et al., 2002) and cluster randomized trials (Turner et al., 2017). By representing the inverse of the working correlation matrix in terms of linear combinations of basis matrices and involving multiple sets of score functions, the QIF approach has improved efficiency over GEE when the working correlation matrix is not specified correctly. Although it maintains the same efficiency as in the situation where the working correlation matrix is specified correctly, the QIF method is not independent of the choice of the working correlation matrix. A QIF-based approach to varying-coefficient models for longitudinal data was proposed by Qu and Li (2006). The related work of Bai et al. (2008) is an extension of QIF for the partial linear model. An alternative method was presented in Yu et al. (2020), where each set of score equations was solved separately and their solutions were combined afterwards, thereby providing results on inference for an optimally weighted estimator and extending those insights to the general setting with over-identified estimating equations.
Zhao et al. (2020, 2021) proposed a variable selection method for the varying-coefficient model when some of the covariates are contaminated with additive errors, based on bias-corrected penalized QIFs that are defined by combining the basis function approximation to the coefficient functions and a bias-corrected QIF with shrinkage estimation. Zhou and Qu (2012) proposed a QIF-based strategy which minimizes the norm of the difference between two estimating functions based on empirical correlation information. Tian et al. (2014) focused on the selection of variables for the semi-parametric varying-coefficient model based on the combination of the approximated basis functions and the QIFs. A longitudinal principal component analysis was proposed in Kinson et al. (2020) based on the eigen-decomposition of random effects, while the correlation information of multivariate observations over time was decomposed by nonparametric splines. Zheng et al. (2018) proposed a method based on a time-varying linear representation of the inverse of the correlation matrix which is projected onto the span of basis matrices.

The fundamental limitations that all the above-mentioned powerful techniques suffer from are: (1) all the above methodologies require prior information on the working correlation structure; and (2) the performance of the classical QIF approach is unknown for dense functional data. Our study is motivated by problems from multiple real-data applications that involve dense functional data when information on the working correlation structure is lacking. Let us discuss two motivating examples that we will use to illustrate the proposed method in this chapter (see Section 2.5 for more details).

• Beijing2017-data example - At different locations in China, particulate matter (PM) with diameter less than 2.5 micrometers is collected over different time-points. Scientists are interested in knowing the linear dependence of the pollution factor PM2.5 on other atmospheric chemicals (Liang et al., 2015). Figure 2.1 pictorially demonstrates the readings of PM2.5 for the given locations over several hourly time-points; therefore, dense functional data analysis can be implemented.

Figure 2.1 Beijing2017-data: Reading of hourly PM2.5 measures for twelve different locations over 608 hourly time-points during January 2017.

• Apnea-data example - In neuroimaging data analysis, scientists are interested in modelling the change of responses among voxels in each region of interest (ROI) of the human brain. Therefore, we can fit a linear regression model and compare the estimated coefficients across ROIs. Needless to say, there exist a large number of voxels and the responses change smoothly across the voxels in each ROI; therefore, the data are functional and dense in nature. In recent literature, Xiong et al. (2017) investigated white matter structural alterations using diffusion tensor imaging (DTI) in obstructive sleep apnea (OSA) patients. Here, the change of DTI parameters such as fractional anisotropy (FA) with the interaction of the count of lapses obtained from the Psychomotor Vigilance Task and voxel locations is investigated and compared across ROIs.

We propose a data-driven way to select the working covariance matrix and express the inverse of the covariance function in terms of the empirical eigen-functions of the covariance operator.
The covariance operator can be estimated as in Hsing and Eubank (2015) and by other related methods based on functional principal component analysis (FPCA), as found in Dauxois et al. (1982); Yao et al. (2005); Hall and Hosseini-Nasab (2006); Hall et al. (2007); Li and Hsing (2010). Note that the estimation of the eigen-functions introduces some error into the proposed estimation method. In this chapter, we try to answer the following question: when we estimate the eigen-functions from the data, is the estimation of the coefficient vector in a semi-parametric problem $\sqrt{n}$-consistent in dense functional data, and can we achieve asymptotic normality?

The advantages of our proposed method are the following. First, our method preserves the good properties of the QIF method and is easier to implement, since the eigen-functions can be estimated using existing packages in statistical software such as R. Second, under some mild conditions, our proposed estimator attains the optimal convergence rate and is asymptotically normally distributed with smaller variance compared to the classical QIF methods. Third, the asymptotic results show the estimation accuracy of the coefficient in the semi-parametric functional model, therefore making the influence of the dimension-reduction step using FPCA asymptotically negligible. The error in the estimation of the eigen-functions contributes to the error in the estimation of the parameters. Under some mild bandwidth conditions, the above-mentioned error contribution is of the same order of magnitude as the error in parameter estimation when the eigen-functions are known in advance.

The rest of the chapter is organized as follows. In Section 2.2 we introduce the basic concept of QIF along with our proposed method. The asymptotic results for the proposed estimator are presented in Section 2.3. In Section 2.4, we demonstrate the finite-sample performance. We also apply the proposed method to real data-sets in Section 2.5. We conclude with some remarks in Section 2.6. All technical proofs are given in Section 2.7.

2.2 Functional response model and estimation procedure

2.2.1 Basic model

To analyze longitudinal data, a straightforward application of a generalized linear model (GLM) (McCullagh and Nelder, 1989) for single response variables is not applicable due to the lack of independence between repeated measures. To account for the high correlation in the longitudinal data, some special techniques are required. A seminal work by Liang and Zeger (1986) proposed the use of GLM for the analysis of longitudinal data. The model we consider in this chapter is commonly observed in spatial modeling, where associations among variables do not change over the functional domain (see Zhang and Banerjee (2021) and references therein); this is termed a constant linear effects model. In this chapter, the variable "time" is used as the functional domain variable.

Let $y(t)$ be the response variable at time-point $t$ and $\mathbf{x}(t)$ be $p$-dimensional covariates observed at time $t \in \mathcal{T}$, where $\mathcal{T} = [a, \bar{a}]$ with $-\infty < a < \bar{a} < \infty$ is the spectrum of the time-points. Without loss of generality, assume that $a = 0$ and $\bar{a} = 1$ in the rest of this chapter. The stochastic process $y(t)$ is square-integrable with marginal mean $\mathrm{E}\{y(t)|\mathbf{x}(t)\}$ and finite covariance function; the regression parameter $\boldsymbol{\beta}$ is unknown and is to be efficiently estimated. Thus, linear models with longitudinal data have the following expression.
$$y(t) = \mathbf{x}(t)^{T}\boldsymbol{\beta} + e(t) \qquad (2.1)$$

where the stochastic process $y(t)$ is decomposed into two parts: one is the mean function $\mu(t) = \mathbf{x}(t)^{T}\boldsymbol{\beta}$, which depends on the time-varying covariates and the coefficient vector $\boldsymbol{\beta}$, and the other is the random error part $e(t)$, where $\mathrm{E}\{e\} = 0$ and $e$ has finite second-order covariance. Let the $y_i$ be i.i.d. copies of the stochastic process, and for each individual let the measurements be taken at $m_i$ discrete time-points $T_{ij}$ for $j = 1, \cdots, m_i$; $i = 1, \cdots, n$. Therefore, at the times $T_{ij}$ we observe an $m_i \times 1$ response vector with entries $y_i(T_{ij})$ and corresponding covariates $\mathbf{x}_i(T_{ij})$ for the $i$-th subject. We assume that the $m_i$'s are all of the same order as $m = n^{a}$ for some $a \ge 0$, so that $m_i/m$ is bounded below and above by constants. Functional data are considered to be sparse or dense depending on the choice of $a$ (Hall and Hosseini-Nasab, 2006). Data with bounded $m$, or $a = 0$, are called sparse functional data, and data with $a \ge a_0$, where $a_0$ is a transition point, are called dense functional data. Moreover, the region $(0, a_0)$ is sometimes referred to as moderately dense. Furthermore, we write $y_{ij}$ and $\mathbf{x}_{ij}$ for $y_i(T_{ij})$ and $\mathbf{x}_i(T_{ij})$, respectively. The $m_i$-component vectors $(y_{i1}, \cdots, y_{im_i})^{T}$ and $(\mu_{i1}, \cdots, \mu_{im_i})^{T}$ are denoted as $\mathbf{y}_i$ and $\boldsymbol{\mu}_i$, respectively. The derivative of $\boldsymbol{\mu}$, denoted as $\dot{\boldsymbol{\mu}}$, is an $m_i \times p$ matrix. In the classical GEE problem, we estimate $\boldsymbol{\beta}$ by solving the quasi-likelihood equations (Liang and Zeger, 1986):

$$\sum_{i=1}^{n} \dot{\boldsymbol{\mu}}_i^{T} \mathbf{V}_i^{-1}(\mathbf{y}_i - \boldsymbol{\mu}_i) = \mathbf{0} \qquad (2.2)$$

We denote $\mathbf{V}_i = \nu \mathbf{A}_i^{1/2}\mathbf{R}_i(\rho)\mathbf{A}_i^{1/2}$, where $\mathbf{R}_i(\rho)$ is the working correlation matrix, $\nu$ is an over-dispersion parameter and $\mathbf{A}_i$ is a diagonal matrix whose entries are the marginal variances $\mathrm{Var}(y_{i1}), \cdots, \mathrm{Var}(y_{im_i})$. In this chapter, we simply set $\nu = 1$; the extension to a general $\nu$ is straightforward. The GEE approach is robust in the sense that it does not require true knowledge of the likelihood function. Note that, in practice, prior knowledge of the working correlation matrix is not available, and the estimation of the coefficient is influenced by its choice. Therefore, Qu et al. (2000) suggested an expansion of the inverse of the working correlation matrix as $\mathbf{R}(\rho)^{-1} = \sum_{k=1}^{\kappa_0} a_k(\rho)\mathbf{M}_k$, where the $\mathbf{M}_k$ are some basis matrices. Zhou and Qu (2012) modified this linear representation by grouping the basis matrices into an identity matrix and some symmetric basis matrices. For example, if the working correlation matrix is exchangeable (compound symmetric), $\mathbf{R}(\rho)^{-1} = c_1\mathbf{I}_m + c_2\mathbf{J}_m$, where $\mathbf{I}_m$ is the $m \times m$ identity matrix and $\mathbf{J}_m$ is the $m \times m$ matrix with 0 on the diagonal and 1 in the off-diagonal positions. On the other hand, for a first-order auto-regressive correlation matrix, $\mathbf{R}(\rho)^{-1} = c_1\mathbf{I}_m + c_2\mathbf{J}_m^{(1)} + c_3\mathbf{J}_m^{(2)}$, where $\mathbf{J}_m^{(1)}$ is the matrix with 1 on the two main off-diagonals and 0 otherwise, and $\mathbf{J}_m^{(2)}$ is the matrix with 1 in the corner positions, viz. $(1,1)$ and $(m,m)$, and 0 elsewhere. Here the $c_k$'s are real constants that depend on the nuisance parameter $\rho$. Therefore, Equation (2.2) reduces to a linear combination of the score vectors:

$$\bar{\mathbf{g}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{g}_i(\boldsymbol{\beta}) = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\mathbf{A}_i^{-1/2}\mathbf{M}_1\mathbf{A}_i^{-1/2}(\mathbf{y}_i - \boldsymbol{\mu}_i) \\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\mathbf{A}_i^{-1/2}\mathbf{M}_{\kappa_0}\mathbf{A}_i^{-1/2}(\mathbf{y}_i - \boldsymbol{\mu}_i) \end{pmatrix} \qquad (2.3)$$
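As a concrete illustration of these basis matrices, the following base-R sketch constructs $\mathbf{I}_m$ and $\mathbf{J}_m$ for the exchangeable case and $\mathbf{I}_m$, $\mathbf{J}_m^{(1)}$, $\mathbf{J}_m^{(2)}$ for the AR(1) case; the constants $c_k$ are absorbed into the quadratic inference step and need not be specified. The function names are illustrative only.

```r
# Basis matrices for the expansion of the inverse working correlation matrix.
basis_exchangeable <- function(m) {
  I_m <- diag(m)
  J_m <- matrix(1, m, m) - diag(m)        # 0 on the diagonal, 1 off the diagonal
  list(I = I_m, J = J_m)
}
basis_ar1 <- function(m) {
  I_m <- diag(m)
  J1  <- matrix(0, m, m)
  J1[abs(row(J1) - col(J1)) == 1] <- 1    # 1 on the two main off-diagonals
  J2  <- matrix(0, m, m)
  J2[1, 1] <- 1; J2[m, m] <- 1            # 1 in the (1,1) and (m,m) corners
  list(I = I_m, J1 = J1, J2 = J2)
}
```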
Due to the higher dimension of $\mathbf{g}$, Qu et al. (2000) used the generalized method of moments (GMM) (Hansen, 1982), for which the estimation boils down to minimization of the quadratic inference function $Q(\boldsymbol{\beta}) = n\,\bar{\mathbf{g}}(\boldsymbol{\beta})^{T}\widehat{\mathbf{C}}(\boldsymbol{\beta})^{-1}\bar{\mathbf{g}}(\boldsymbol{\beta})$, where $\widehat{\mathbf{C}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{g}_i(\boldsymbol{\beta})\mathbf{g}_i(\boldsymbol{\beta})^{T}$ is the sample covariance matrix of the scores in Equation (2.3). To obtain the solution for $\boldsymbol{\beta}$, the Newton-Raphson method is used, which iteratively updates the value of $\boldsymbol{\beta}$.

2.2.2 Incorporating eigen-functions in QIF

Now, by the standard Karhunen-Loève expansion of $e_i(t) = y_i(t) - \mu_i(t)$ (Karhunen, 1946; Loève, 1946),

$$e_i(t) = \sum_{r=1}^{\infty} \xi_{ir}\phi_r(t) \qquad (2.4)$$

where the independently distributed random variables $\xi_{ir} \sim (0, \lambda_r)$ for ordered eigen-values $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$, and the $\phi_r$'s are orthonormal eigen-functions such that $\int \phi_r(t)\phi_l(t)\,dt = \mathbb{1}(r = l)$. We extract the main directions of variation of the response variables using FPCA. In this situation, we take the first $\kappa_0$ terms, which provide a good approximation of the infinite sum in Equation (2.4), by considering that the majority of the variation in the data is contained in the subspace spanned by a few eigen-functions (Chen et al., 2019). For finite $\kappa_0 \ge 1$, we therefore consider the rank-$\kappa_0$ FPCA model,

$$\mathrm{E}\{y(t)|\mathbf{x}(t)\} = \mu(t) + \sum_{r=1}^{\kappa_0} \mathrm{E}\{\xi_r|\mathbf{x}(t)\}\phi_r(t) \qquad (2.5)$$

An analogue of the truncated empirical version of Equation (2.22), defined in Section 2.7, and of Equation (2.4) can be provided easily, and we discuss the proposed method based on this truncated version. Moreover, we discuss how to choose $\kappa_0$ in our situation in Section 2.3 in detail.

In this chapter, we propose a data-driven way to compute the basis matrices to obtain the approximate inverse of $\mathbf{V}$ as discussed earlier. In this approach, it is enough to find the eigen-functions to construct a GEE. Let us define

$$\bar{\mathbf{g}}(\boldsymbol{\beta}) = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\widehat{\boldsymbol{\Phi}}_{i1}(\mathbf{y}_i - \boldsymbol{\mu}_i) \\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\widehat{\boldsymbol{\Phi}}_{i\kappa_0}(\mathbf{y}_i - \boldsymbol{\mu}_i) \end{pmatrix} \qquad (2.6)$$

where, for $k = 1, \cdots, \kappa_0$, we define $\widehat{\boldsymbol{\Phi}}_{ik} = m_i^{-2}\big(\widehat{\phi}_k(T_{ij})\widehat{\phi}_k(T_{ij'})\big)_{j,j'=1,\cdots,m_i}$. Since the dimension of $\mathbf{g}$ in Equation (2.6) is greater than the number of parameters to estimate, instead of setting $\mathbf{g}$ to zero we minimize the following quadratic function:

$$\widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} Q(\boldsymbol{\beta}), \quad \text{where } Q(\boldsymbol{\beta}) = n\,\bar{\mathbf{g}}(\boldsymbol{\beta})^{T}\widehat{\mathbf{C}}(\boldsymbol{\beta})^{-1}\bar{\mathbf{g}}(\boldsymbol{\beta}) \qquad (2.7)$$

and $\widehat{\mathbf{C}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{g}_i(\boldsymbol{\beta})\mathbf{g}_i(\boldsymbol{\beta})^{T}$. For the existence of $\widehat{\mathbf{C}}^{-1}$ we need the additional restriction $n \ge \dim(\mathbf{g}_i) = p \times \kappa_0$, where $\kappa_0$ is the number of eigen-functions. Under the given set-up, by Equation (8) in Qu et al. (2000) the estimating equation for $\boldsymbol{\beta}$ is

$$\dot{Q}(\boldsymbol{\beta}) \approx 2\,\dot{\bar{\mathbf{g}}}(\boldsymbol{\beta})^{T}\widehat{\mathbf{C}}(\boldsymbol{\beta})^{-1}\bar{\mathbf{g}}(\boldsymbol{\beta}) \qquad (2.8)$$

To obtain the solution of the above equation, we use a Newton-like method. In practice, the standard Newton method does not always lead to a decrease in the objective function; that is, at each step of the iteration there is no guarantee that $Q(\boldsymbol{\beta}_{s+1}) < Q(\boldsymbol{\beta}_s)$. Therefore, we use the following algorithm to estimate $\boldsymbol{\beta}$ via a quasi-Newton method with halving (Givens and Hoeting, 2012).

Algorithm 2.1 Estimation of $\boldsymbol{\beta}$ using the quasi-Newton method with halving.
Data: $\widehat{\boldsymbol{\beta}}_0$ (initial estimate), with $Q(\widehat{\boldsymbol{\beta}}_0)$, $\dot{Q}(\widehat{\boldsymbol{\beta}}_0)$ and $\ddot{Q}(\widehat{\boldsymbol{\beta}}_0)$ calculated; $\epsilon_0$ (threshold, a small number); max.count (maximum number of iterations).
Result: Estimate of $\boldsymbol{\beta}$ using the proposed method.
1: Calculate $\widehat{\boldsymbol{\beta}}_1 \leftarrow \widehat{\boldsymbol{\beta}}_0 - \ddot{Q}(\widehat{\boldsymbol{\beta}}_0)^{-1}\dot{Q}(\widehat{\boldsymbol{\beta}}_0)$.
2: while Error $> \epsilon_0$ do
3:   Calculate $\dot{Q}(\widehat{\boldsymbol{\beta}}_1)$ and $\ddot{Q}(\widehat{\boldsymbol{\beta}}_1)$ based on $\widehat{\boldsymbol{\beta}}_1$.
4:   Initialise $r_0 = 1$.
5:   $\widehat{\boldsymbol{\beta}}_2 \leftarrow \widehat{\boldsymbol{\beta}}_1 - r_0\,\ddot{Q}(\widehat{\boldsymbol{\beta}}_1)^{-1}\dot{Q}(\widehat{\boldsymbol{\beta}}_1)$.
6:   Calculate $Q(\widehat{\boldsymbol{\beta}}_1)$ and $Q(\widehat{\boldsymbol{\beta}}_2)$ based on $\widehat{\boldsymbol{\beta}}_1$ and $\widehat{\boldsymbol{\beta}}_2$, respectively, using the proposed method.
7:   while $Q(\widehat{\boldsymbol{\beta}}_2) > Q(\widehat{\boldsymbol{\beta}}_1)$ do
8:     $r_0 \leftarrow r_0/2$.
9:     $\widehat{\boldsymbol{\beta}}_2 \leftarrow \widehat{\boldsymbol{\beta}}_1 - r_0\,\ddot{Q}(\widehat{\boldsymbol{\beta}}_1)^{-1}\dot{Q}(\widehat{\boldsymbol{\beta}}_1)$.
10:    Calculate $Q(\widehat{\boldsymbol{\beta}}_1)$ and $Q(\widehat{\boldsymbol{\beta}}_2)$ based on $\widehat{\boldsymbol{\beta}}_1$ and $\widehat{\boldsymbol{\beta}}_2$, respectively, using the proposed method.
     end
11:  Calculate Error $= \|\widehat{\boldsymbol{\beta}}_2 - \widehat{\boldsymbol{\beta}}_1\|_2$.
12:  $\widehat{\boldsymbol{\beta}}_0 \leftarrow \widehat{\boldsymbol{\beta}}_1$.
13:  $\widehat{\boldsymbol{\beta}}_1 \leftarrow \widehat{\boldsymbol{\beta}}_2$.
end
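As a companion to Algorithm 2.1, a minimal base-R sketch of the quasi-Newton iteration with step halving is given below. It assumes that Q, Qdot and Qddot are user-supplied functions returning the quadratic inference function of Equation (2.7), its gradient and its Hessian at a given coefficient vector; the function and argument names are illustrative and this is not the exact implementation used in the numerical work.

```r
# Minimal sketch of Algorithm 2.1: quasi-Newton iteration with step halving.
# Q, Qdot and Qddot are assumed to be user-supplied functions of beta.
qif_quasi_newton <- function(beta0, Q, Qdot, Qddot, eps0 = 1e-10, max.count = 500) {
  beta1 <- beta0 - solve(Qddot(beta0), Qdot(beta0))      # initial Newton step
  for (iter in seq_len(max.count)) {
    r0 <- 1
    beta2 <- beta1 - r0 * solve(Qddot(beta1), Qdot(beta1))
    # halve the step size until the objective decreases
    while (Q(beta2) > Q(beta1) && r0 > .Machine$double.eps) {
      r0 <- r0 / 2
      beta2 <- beta1 - r0 * solve(Qddot(beta1), Qdot(beta1))
    }
    if (sum((beta2 - beta1)^2) < eps0) return(beta2)     # squared-difference stopping rule
    beta1 <- beta2
  }
  beta1
}
```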
2.2.3 Estimation of eigen-functions

Estimation of the eigen-functions is an important step in our proposed quadratic inference technique. In general, FPCA plays an important role as a dimension reduction technique in functional data analysis. Some important theories on FPCA have been developed in recent years. In particular, Hall and Hosseini-Nasab (2006) proved various asymptotic expansions for FPCA for densely observed functional data. Later, Hall and Hosseini-Nasab (2009) gave more general theoretical arguments, including the effect of the gap between eigen-values (a.k.a. spacings) on the properties of the eigen-value estimators. In Li and Hsing (2010), uniform rates of convergence of the mean and covariance functions are given, which cover all possible choices/scenarios of the $m_i$'s. In this section, we adopt the estimation of covariance functions mostly from Li and Hsing (2010).

Note that the error process $e(t)$ has mean zero, is defined on the compact set $\mathcal{T} = [0,1]$ and satisfies $\int_{\mathcal{T}} \mathrm{E}\{e^2\} < \infty$. The functional principal components can be constructed via the covariance function $R(s,t)$ defined as

$$R(s,t) = \mathrm{E}\{e(s)e(t)\} \qquad (2.9)$$

which is assumed to be square-integrable. This function $R$ induces the kernel operator $\mathcal{F}$ as defined in Sub-section 2.7.1. An empirical analogue of the spectral decomposition of $R$ can be obtained as

$$\widehat{R}(s,t) = \sum_{r=1}^{\infty} \widehat{\lambda}_r \widehat{\phi}_r(s)\widehat{\phi}_r(t) \qquad (2.10)$$

where the random variables $\widehat{\lambda}_1 \ge \widehat{\lambda}_2 \ge \cdots \ge 0$ are the eigen-values of the estimated operator $\widehat{\mathcal{F}}$ and the corresponding sequence of eigen-functions is $\widehat{\phi}_1, \widehat{\phi}_2, \cdots$. Further, assume that $\int_{\mathcal{T}} \phi_r\widehat{\phi}_r \ge 0$ to avoid the issue regarding the change of sign (Hall and Hosseini-Nasab, 2006) in practical comparisons of eigen-functions; otherwise there is no impact on the convergence rate of the eigen-functions and hence of the proposed estimators. Our proposed method can be generalized to finitely many ties among the true eigen-values $\lambda_r$, but to avoid further technicalities we assume that the eigen-values are distinct.

Suppose that the $T_{ij}$ are observational points with a positive density function $f_T(\cdot)$. Assume $m_i \ge 2$ and define $N = \sum_{i=1}^{n} N_i$, where $N_i = m_i(m_i - 1)$. This approach is based on the local linear smoother, which is popular in functional data analysis, including Fan and Gijbels (1996); Li and Hsing (2010) among many others. Let $K(\cdot)$ be a symmetric probability density function on $[-1,1]$, which is used as the kernel, and let $h > 0$ be the bandwidth; the re-scaled kernel function is defined as $K_h(\cdot) = h^{-1}K(\cdot/h)$. Therefore, for given $s, t \in \mathcal{T}$, choose $(\widehat{a}_0, \widehat{b}_1, \widehat{b}_2)$ to be the minimizer of

$$\frac{1}{n}\sum_{i=1}^{n}\frac{1}{N_i}\sum_{j_1=1}^{m_i}\sum_{\substack{j_2=1 \\ j_2 \ne j_1}}^{m_i}\{e_i(T_{ij_1})e_i(T_{ij_2}) - a_0 - b_1(T_{ij_1}-s) - b_2(T_{ij_2}-t)\}^2 K_h(T_{ij_1}-s)K_h(T_{ij_2}-t) \qquad (2.11)$$

Thus, we estimate $R(s,t) = \mathrm{E}\{e(s)e(t)\}$ by the quantity $\widehat{a}_0$, viz., $\widehat{R}(s,t) = \widehat{a}_0$. The operator $\widehat{\mathcal{F}}$ is in general positive semi-definite and the estimated eigen-values $\widehat{\lambda}_r$ are non-negative; indeed, $\widehat{R}$ is symmetric. Along the lines of the existing literature, we define the following.
• $S_{a,b}(s,t) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{N_i}\sum_{j_1=1}^{m_i}\sum_{\substack{j_2=1 \\ j_2\ne j_1}}^{m_i}\left(\frac{T_{ij_1}-s}{h}\right)^{a}\left(\frac{T_{ij_2}-t}{h}\right)^{b}K_h(T_{ij_1}-s)K_h(T_{ij_2}-t)$

• $R_{a,b}(s,t) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{N_i}\sum_{j_1=1}^{m_i}\sum_{\substack{j_2=1 \\ j_2\ne j_1}}^{m_i}\left(\frac{T_{ij_1}-s}{h}\right)^{a}\left(\frac{T_{ij_2}-t}{h}\right)^{b}e_i(T_{ij_1})e_i(T_{ij_2})K_h(T_{ij_1}-s)K_h(T_{ij_2}-t)$

• $A_1 = S_{20}S_{02} - S_{11}^2$, $A_2 = S_{10}S_{02} - S_{01}S_{11}$, and $A_3 = S_{01}S_{20} - S_{10}S_{11}$

• $B = A_1 S_{00} - A_2 S_{10} - A_3 S_{01}$

Therefore, $\widehat{R}(s,t) = (A_1 R_{00} - A_2 R_{10} - A_3 R_{01})B^{-1}$.

2.3 Asymptotic properties

In this section, we study the asymptotic properties of the proposed estimator. Let us introduce some notation. Assume that the $m_i$'s are all of the same order, i.e., $m \equiv m(n) = n^{a}$ for some $a \ge 0$. Define $d_{n1}(h) = h^2 + h\bar{m}/m$ and $d_{n2}(h) = h^4 + h^3\bar{m}/m + h^2\tilde{m}/m^2$, where $\bar{m} = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} m/m_i$ and $\tilde{m} = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}(m/m_i)^2$. Denote $\delta_{n1}(h) = \{d_{n1}(h)\log n/(nh^2)\}^{1/2}$ and $\delta_{n2}(h) = \{d_{n2}(h)\log n/(nh^4)\}^{1/2}$. Further, $\nu_{a,b} = \int t^{a}K^{b}(t)\,dt$. Define $\mathbf{W} = (\boldsymbol{\phi}(t_1)^{T}, \cdots, \boldsymbol{\phi}(t_m)^{T})^{T}$, a matrix of order $m \times \kappa_0$ obtained by stacking all the $\boldsymbol{\phi}_k$'s, and the random components $\boldsymbol{\xi}_i = (\xi_{i1}, \cdots, \xi_{i\kappa_0})^{T}$. Further, $\boldsymbol{\xi}$ has mean zero and variance $\boldsymbol{\Lambda}$, which is a diagonal matrix with components $\lambda_1, \cdots, \lambda_{\kappa_0}$. The sign "$\lesssim$" indicates that the left-hand side of the inequality is bounded by the right-hand side up to a multiplicative positive constant; i.e., for two positive variables $f_1$ and $f_2$ we define $f_1 \lesssim f_2$ as $f_1 \le C f_2$, where $C$ is a positive constant not involving $n$.

The following conditions are needed for further discussion of the asymptotic properties.

(C1) The kernel function $K(\cdot)$ is a symmetric density function defined on the bounded support $[-1,1]$.

(C2) The density function $f_T$ of $T$ is bounded above and away from infinity, and bounded below away from zero. Moreover, $f_T$ is differentiable and the derivative is continuous.

(C3) $R(s,t)$ is twice differentiable and all second-order partial derivatives are bounded on $[0,1]^2$.

(C4) $\mathrm{E}\{\sup_{t\in[0,1]}|e(t)|^{\gamma}\} < \infty$ and $\mathrm{E}\{\sup_{t\in[0,1]}|\mathbf{x}_i(t)|^{2\gamma}\} < \infty$ for some $\gamma \in (4,\infty)$.

(C5) $h \to 0$ as $n \to \infty$ such that $d_{n1}^{-1}(\log n/n)^{1-2/\gamma} \to 0$ and $d_{n2}^{-1}(\log n/n)^{1-4/\gamma} \to 0$ for $\gamma \in (4,\infty)$.

(C6) Conditions on the eigen-components.
a) For each $1 \le k < r < \infty$ and a non-zero finite generic constant $C_0$,
$$\frac{\max\{\lambda_k, \lambda_r\}}{|\lambda_k - \lambda_r|} \le C_0\,\frac{\max\{k, r\}}{|k - r|} \qquad (2.12)$$
b) For some $\alpha > 0$, $V_r\lambda_r^{-2}r^{1+\alpha} \to 0$ as $r \to \infty$, where $V_r = \mathrm{E}\{\int \dot{\mu}(t)\phi_r(t)\,dt\}^2$. The above two conditions hold if $\lambda_r = r^{-\tau_1}\Lambda(r)$ and $V_r = r^{-\tau_2}\Gamma(r)$ for slowly varying functions $\Lambda$ and $\Gamma$ with $\tau_2 > 1 + 2\tau_1 > 3$.
c) $\int \phi_k^4(t)\,dt$ and $\int \dot{\mu}^2(t)\phi_k^2(t)\,dt$ are finite for all $k \ge 1$.

(C7) $\widehat{\mathbf{C}}(\boldsymbol{\beta})$ converges almost surely to an invertible matrix $\mathbf{C}_0 = \mathrm{E}\{\mathbf{g}(\boldsymbol{\beta}_0)\mathbf{g}(\boldsymbol{\beta}_0)^{T}\}$.

(C8) Conditions for $h$ and $\kappa_0$. For $\tau = \alpha + \tau_1$,
a) if $a > 1/4$, $\kappa_0 = O(n^{1/(3-\tau)})$ and $n^{-1/4} \lesssim h \lesssim n^{-(a+1)/5}$;
b) if $a \le 1/4$, $\kappa_0 = O(n^{4(1+a)/\{5(3-\tau)\}})$ and $h \lesssim n^{-1/4}$.

Remark 2.3.1. Condition (C1) is commonly used in non-parametric regression. The boundedness condition on the density of the time-points in Condition (C2) is standard for a random design. Similar results can be obtained for a fixed design where the grid-points are pre-fixed according to the design density, $\int_0^{T_j} f(t)\,dt = j/m$ for $j = 1, \cdots, m$, for $m \ge 1$. Furthermore, it is important to note that this approach does not require sample-path differentiability when we invoke the estimation of the eigen-functions from Li and Hsing (2010). Therefore, the method is applicable to Brownian motion, which has continuous but non-differentiable sample paths.
Condition (C3) is required for Taylor series expansions and is also common in non-parametric regression. Condition (C4) is required for a uniform bound on certain higher-order expectations to show uniform convergence; a similar condition is adopted in Li and Hsing (2010). The smoothness conditions in (C5) and (C8) are common in kernel smoothing and functional data analysis to control bias and variance. The first condition for tuning the parameters mentioned in (C5) is similar to Li and Hsing (2010). The required spacing assumptions on the eigen-values in Conditions (C6)a and (C6)b are similar to those in Hall and Hosseini-Nasab (2009). Condition (C6)c is a mild assumption that frequently arises in the functional data analysis literature; in most situations, this condition automatically holds. By the weak law of large numbers, Condition (C7) holds for large $n$. Similar kinds of conditions can be invoked, such as the convexity assumption, i.e., $\lambda_r - \lambda_{r+1} \le \lambda_{r-1} - \lambda_r$ for all $r \ge 2$. Condition (C8) controls the rate of the number of repeated measurements.

Now, the following theorem provides the asymptotic expansion and consistency of the proposed estimator $\widehat{\boldsymbol{\beta}}$.

Theorem 2.3.1. Let $\boldsymbol{\beta}_0$ be the true value of $\boldsymbol{\beta}$. Under Conditions (C1), (C2), (C3), (C4), (C5), (C6)a, (C6)b and (C6)c, for $k = 1, \cdots, \kappa_0$, the asymptotic mean square error of $\mathbf{g}_k(\boldsymbol{\beta}_0)$ satisfies

$$\mathrm{AMSE}\{\mathbf{g}_k(\boldsymbol{\beta}_0)\} = O\left(n^{-1} + n^{-1}\kappa_0^{3-\tau}R_n(h)\right) \quad \text{almost surely} \qquad (2.13)$$

where $R_n(h) = h^4 + \frac{1}{n} + \frac{1}{nmh} + \frac{1}{n^2m^2h^2} + \frac{1}{n^2m^4h^4} + \frac{1}{n^2mh} + \frac{1}{n^2m^3h^3}$. Moreover, under Condition (C8), $\mathrm{AMSE}\{\widehat{\mathbf{g}}_k(\boldsymbol{\beta}_0)\} = O(n^{-1})$. Therefore, if in addition Condition (C7) holds, as $n \to \infty$, $\|\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\| = O(n^{-1/2})$ in probability.

The following theorem states the asymptotic normality result.

Theorem 2.3.2. Define $\mathbf{C}_i = \sum_{k_1=1}^{\kappa_0}\sum_{k_2=1}^{\kappa_0}\boldsymbol{\Phi}_{k_1}\mathbf{X}_i\mathbf{C}^{-1}_{k_1,k_2}\mathbf{X}_i^{T}\boldsymbol{\Phi}_{k_2}$, where $\mathbf{C}^{-1}_{k_1,k_2}$ is the $(k_1,k_2)$ block of $\mathbf{C}_0^{-1}$ with $\mathbf{C}_0 = \mathrm{E}\{\mathbf{g}_i(\boldsymbol{\beta}_0)\mathbf{g}_i^{T}(\boldsymbol{\beta}_0)\}$. Assume that the conditions of Theorem 2.3.1 hold. Then $\sqrt{n}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma} = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}^{-1}$. The quantities $\mathbf{A}$ and $\mathbf{B}$ are, respectively, the limits of $\widehat{\mathbf{A}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i^{T}\mathbf{C}_i\widehat{\mathbf{e}}_i\widehat{\mathbf{e}}_i^{T}\mathbf{C}_i\mathbf{X}_i$ and $\widehat{\mathbf{B}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i^{T}\mathbf{C}_i\mathbf{X}_i$, and "$\xrightarrow{d}$" denotes convergence in distribution.

Remark 2.3.2. Here, the selection of the bandwidth only affects the second-order term of the MSE of $\widehat{\boldsymbol{\beta}}$ and has no effect on the asymptotic normality result, as long as $h$ satisfies Conditions (C5) and (C8) along with some restrictions on $\kappa_0$. All proofs with the relevant technical details are given in Section 2.7.

2.4 Simulation studies

We conduct numerical studies to compare the finite sample performance with that of the corresponding longitudinal quadratic inference approach proposed in Qu et al. (2000) under different correlation structures.

2.4.1 Simulation set-up

Consider the normal response model

$$y_i(T_{ij}) = \mathbf{x}_i(T_{ij})^{T}\boldsymbol{\beta} + e_i(T_{ij}) \qquad (2.14)$$

For $p = 2$, we set the coefficient vector $\boldsymbol{\beta} = (\beta_1, \beta_2)^{T}$ with $\beta_1 = 1$ and $\beta_2 = 0.5$. The covariates are generated in the following way:

$$x_{ik}(t) = \chi_{i1}^{(k)} + \chi_{i2}^{(k)}\sqrt{2}\sin(\pi t) + \chi_{i3}^{(k)}\sqrt{2}\cos(\pi t) \qquad (2.15)$$

The coefficients $\chi_{i1}^{(k)} \sim N(0, (2^{-0.5(k-1)})^2)$, $\chi_{i2}^{(k)} \sim N(0, (0.85 \times 2^{-0.5(k-1)})^2)$ and $\chi_{i3}^{(k)} \sim N(0, (0.7 \times 2^{-0.5(k-1)})^2)$, and the $\chi_{ij}$'s are mutually independent across trajectories $i$ and indices $j$. Consider the following simulation design.

• Observational time-points. In a fixed-design situation, the associated observational times are fixed.
Sample trajectories are observed at $m = 100$ equidistant time-points $\{t_1, \cdots, t_m\}$ on $[0,1]$.

• Choice of residuals. The residual process $e_i(t)$ is a smooth function with mean zero and unknown covariance function, where each $e_i$ is distributed as $e_i = \sum_{k \ge 1}\xi_{ik}\phi_k$ and the $\xi_{ik}$'s are independent normal random variables with mean zero and respective variances $\lambda_k$. For numerical computation, we truncate the series at $k = 3$ in the Karhunen-Loève expansion for Situations (a), (b) and (c) described below. In Situations (d) and (e), the error process is generated from the given covariance functions. A data-generation sketch for Situation (a) is given after this list.

(a) Brownian motion. The covariance function of Brownian motion is $\min(s,t)$, with $\lambda_k = \frac{4}{\pi^2(2k-1)^2}$ and $\phi_k(t) = \sqrt{2}\sin(t/\sqrt{\lambda_k})$.

(b) Linear process. Consider the eigen-values $\lambda_k = k^{-2l_0}$ and $\phi_k(t) = \sqrt{2}\cos(k\pi t)$. We fix $l_0 \in \{1, 2, 3\}$.

(c) Ornstein-Uhlenbeck (OU) process. For positive constants $\mu_0$ and $\rho_0$, the stochastic differential equation for $e(t)$ is $\partial e(t) = -\mu_0 e(t)\partial t + \rho_0\partial w(t)$ for a Brownian motion $w(t)$. It can be shown that $\mathrm{cov}\{e(t), e(s)\} = c\exp\{-\mu_0|t-s|\}$, where $c = \rho_0^2/2\mu_0$. Here we assume $c = 1$. Thus, by solving the integral equation we have $\phi_k(t) = A_k\cos(\omega_k t) + B_k\sin(\omega_k t)$ and $\lambda_k = \frac{2\mu_0}{\omega_k^2 + \mu_0^2}$, where the $\omega_k$ are solutions of $\cot(\omega) = \frac{\omega^2 - \mu_0^2}{2\mu_0\omega}$. The constants $A_k$ and $B_k$ are defined as $B_k = \mu_0 A_k/\omega_k$, where $A_k = \left(\frac{2\omega_k^2}{2\mu_0 + \mu_0^2 + \omega_k^2}\right)^{1/2}$. Here $\mu_0$ is chosen to be 1 or 3.

(d) Power exponential. $R(s,t) = \exp\{-(|s-t|/a_0)^{b_0}\}$, where the scale parameter is $a_0 = 1$ and the shape parameter is $b_0 \in \{1, 2, 5\}$.

(e) Rational quadratic. $R(s,t) = \left\{1 + \left(\frac{s-t}{a_0}\right)^2\right\}^{-b_0}$, where the scale parameter is $a_0 = 1$ and the shape parameter is $b_0 \in \{1, 2, 5\}$.

• Sample size parameter. Number of individuals, $n \in \{100, 300, 500\}$.

2.4.2 Comparison and evaluation

For each of the situations, we perform 500 simulation replicates. To execute Qu et al. (2000)'s approach, we construct the scores using the basis matrices described in Example 1 (approximation of the compound symmetric correlation structure, denoted as ldaCS in the tables) and Example 2 (the first-order autoregressive correlation structure, denoted as ldaAR in the tables) of their paper. The ordinary least squares estimate (denoted as init in the tables) is taken as the initial estimate of $\boldsymbol{\beta}$ both for our method (denoted as fda-k for specific $k$ in the tables) and for Qu et al. (2000)'s approach. The iterative estimation procedure is declared to have converged when the squared difference between the estimated values at two consecutive steps is bounded by the small number $10^{-10}$, or when the maximum number of steps exceeds 500, whichever happens earlier. To make the theoretical results and numerical examples consistent, we use the "FPCA" function in R, which is available in the fdapace package (Gajardo et al., 2021), or the MATLAB (MATLAB, 2014) package PACE available at http://www.stat.ucdavis.edu/PACE/, to estimate the eigen-functions. The key references for the PACE approach and associated works include Yao et al. (2003, 2005); Müller and Yao (2010); Li and Hsing (2010). Bandwidths are selected using generalized cross-validation, and the Epanechnikov kernel $K(x) = 0.75(1-x^2)_{+}$ is used for estimation, where $(a)_{+} = \max(a, 0)$. The means and standard deviations (SD) of the regression coefficients based on 500 simulations are given as summary measures. The standard deviation reported in the tables is calculated from the 500 estimates over the 500 replications and can be viewed as the true standard error.
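To make the set-up concrete, the following base-R sketch generates a single replicate under Situation (a): covariates follow Equation (2.15) and the residuals are Brownian-motion errors truncated at the first three Karhunen-Loève terms. It covers data generation only, and all object names are illustrative.

```r
# One simulated data set under Situation (a); base R only.
set.seed(1)
n <- 100; m <- 100; p <- 2
tt <- seq(0, 1, length.out = m)                      # equidistant time-points
beta <- c(1, 0.5)
gen_x <- function(k) {                               # covariate x_{ik}(t), Equation (2.15)
  s <- 2^(-0.5 * (k - 1))
  rnorm(1, 0, s) + rnorm(1, 0, 0.85 * s) * sqrt(2) * sin(pi * tt) +
    rnorm(1, 0, 0.7 * s) * sqrt(2) * cos(pi * tt)
}
lambda <- 4 / (pi^2 * (2 * (1:3) - 1)^2)             # Brownian-motion eigen-values
phi <- sapply(1:3, function(k) sqrt(2) * sin(tt / sqrt(lambda[k])))
Y <- matrix(0, n, m); X <- array(0, c(n, m, p))
for (i in 1:n) {
  X[i, , ] <- sapply(1:p, gen_x)
  e_i <- phi %*% rnorm(3, 0, sqrt(lambda))           # truncated KL expansion of e_i(t)
  Y[i, ] <- X[i, , ] %*% beta + e_i
}
```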
Moreover, we also compute the following statistics to compare the performance of estimation, where for 𝑏-th 𝜷 𝑏 be the estimated value for 𝜷, replication b • Absolute bias, AB = 1 Í500 b 500 𝑏=1 | 𝜷 𝑏 − 𝜷| • Mean square error, MSE = 1 Í500 b 2 500 𝑏=1 ( 𝜷 𝑏 − 𝜷) MSEs are reported in the order of 10−2 . The number of selected eigen-functions plays a critical role in our proposed method. We choose 𝜅0 based on a scree plot where the elbow of the graph is found and the components to the left are considered as significant. 33 Table 2.1 Performance of the estimation procedure where the residuals are generated from Brownian motion (a). Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9999 0.0373 0.0297 0.1391 0.4995 0.0486 0.0384 0.2354 ldaAR 0.9991 0.0331 0.0265 0.1095 0.5004 0.0445 0.0353 0.1972 ldaCS 0.9987 0.0316 0.0253 0.1000 0.4997 0.0411 0.0322 0.1685 fda-1 0.9998 0.0564 0.0447 0.3180 0.5006 0.0743 0.0587 0.5516 86.2672 fda-2 1.0001 0.0269 0.0213 0.0723 0.4971 0.0362 0.0290 0.1314 96.3746 fda-3 0.9998 0.0231 0.0181 0.0532 0.4978 0.0317 0.0251 0.1010 99.9220 fda-4 1.0004 0.0052 0.0014 0.0028 0.4994 0.0092 0.0022 0.0085 99.9979 fda-5 0.9999 0.0021 0.0008 0.0004 0.4999 0.0051 0.0012 0.0026 100.0000 fda-6 0.9999 0.0021 0.0008 0.0004 0.4999 0.0051 0.0012 0.0026 100.0000 fda-7 0.9999 0.0021 0.0008 0.0004 0.4999 0.0051 0.0012 0.0026 100.0000 𝑛 = 300 init 1.0002 0.0200 0.0162 0.0401 0.5003 0.0288 0.0226 0.0825 ldaAR 1.0003 0.0184 0.0147 0.0336 0.5000 0.0259 0.0203 0.0670 ldaCS 1.0007 0.0170 0.0134 0.0288 0.4995 0.0242 0.0190 0.0583 fda-1 1.0002 0.0309 0.0251 0.0955 0.5008 0.0443 0.0350 0.1962 86.7578 fda-2 1.0002 0.0142 0.0114 0.0202 0.4998 0.0213 0.0169 0.0451 96.4747 fda-3 1.0002 0.0122 0.0098 0.0150 0.4992 0.0179 0.0144 0.0321 99.9745 fda-4 1.0002 0.0021 0.0003 0.0004 0.4998 0.0032 0.0005 0.0010 99.9993 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 𝑛 = 500 init 1.0002 0.0148 0.0117 0.0219 0.5006 0.0223 0.0177 0.0497 ldaAR 1.0005 0.0138 0.0111 0.0189 0.5000 0.0206 0.0162 0.0422 ldaCS 1.0000 0.0128 0.0102 0.0164 0.4992 0.0184 0.0146 0.0340 fda-1 1.0007 0.0234 0.0185 0.0545 0.5012 0.0348 0.0277 0.1213 86.7520 fda-2 0.9996 0.0105 0.0083 0.0110 0.5002 0.0157 0.0126 0.0247 96.5174 fda-3 0.9991 0.0091 0.0074 0.0084 0.4999 0.0133 0.0107 0.0177 99.9851 fda-4 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 99.9996 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 34 Table 2.2 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0019 0.0340 0.0267 0.1155 0.5001 0.0456 0.0356 0.2078 ldaAR 1.0007 0.0230 0.0187 0.0528 0.5010 0.0361 0.0285 0.1303 ldaCS 1.0000 0.0006 0.0005 0.0000 0.4999 0.0008 0.0006 0.0001 fda-1 1.0093 0.1436 0.1150 2.0675 0.5020 0.2010 0.1570 4.0311 73.0607 fda-2 1.0036 0.1055 0.0828 1.1123 0.5052 0.1337 0.1070 1.7869 91.6726 fda-3 1.0038 0.1024 0.0804 1.0473 0.5056 0.1303 0.1020 1.6969 99.7657 fda-4 1.0000 0.0092 0.0021 0.0084 0.5006 0.0096 0.0024 0.0093 99.9993 fda-5 1.0000 0.0011 0.0007 0.0001 0.5000 0.0017 0.0010 0.0003 100.0000 fda-6 1.0000 0.0011 0.0007 0.0001 0.5000 0.0017 0.0010 0.0003 100.0000 fda-7 1.0000 0.0011 0.0007 0.0001 0.5000 0.0017 0.0010 0.0003 100.0000 𝑛 = 300 init 0.9991 0.0181 0.0144 0.0326 0.5006 0.0268 0.0212 0.0715 ldaAR 0.9995 0.0133 0.0105 0.0178 0.5000 0.0212 0.0169 0.0447 ldaCS 1.0000 0.0003 0.0002 0.0000 0.5000 0.0005 0.0004 0.0000 fda-1 0.9958 0.0767 0.0616 0.5888 0.5037 0.1163 0.0918 1.3523 73.4907 fda-2 0.9974 0.0567 0.0458 0.3220 0.5011 0.0804 0.0648 0.6460 91.7757 fda-3 0.9974 0.0564 0.0455 0.3182 0.5007 0.0800 0.0643 0.6391 99.9225 fda-4 1.0000 0.0005 0.0001 0.0000 0.5000 0.0009 0.0002 0.0001 99.9998 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 𝑛 = 500 init 0.9999 0.0152 0.0121 0.0230 0.5027 0.0207 0.0163 0.0436 ldaAR 1.0008 0.0100 0.0079 0.0100 0.5019 0.0161 0.0129 0.0263 ldaCS 1.0000 0.0003 0.0002 0.0000 0.5000 0.0004 0.0003 0.0000 fda-1 0.9990 0.0657 0.0523 0.4303 0.5113 0.0883 0.0698 0.7913 73.4098 fda-2 1.0013 0.0468 0.0371 0.2185 0.5078 0.0651 0.0520 0.4292 91.8501 fda-3 1.0014 0.0459 0.0364 0.2107 0.5074 0.0650 0.0516 0.4274 99.9490 fda-4 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 99.9999 fda-5 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-6 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-7 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 35 Table 2.3 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0020 0.0331 0.0261 0.1094 0.4999 0.0451 0.0349 0.2028 ldaAR 1.0002 0.0167 0.0133 0.0278 0.5014 0.0232 0.0183 0.0538 ldaCS 1.0000 0.0003 0.0002 0.0000 0.5000 0.0004 0.0003 0.0000 fda-1 1.0096 0.1431 0.1144 2.0520 0.5022 0.2003 0.1561 4.0029 92.5933 fda-2 1.0014 0.0648 0.0508 0.4188 0.5037 0.0842 0.0667 0.7094 98.5532 fda-3 1.0009 0.0535 0.0406 0.2860 0.5018 0.0744 0.0570 0.5524 99.7251 fda-4 1.0000 0.0074 0.0017 0.0055 0.4999 0.0103 0.0024 0.0106 99.9991 fda-5 1.0000 0.0045 0.0009 0.0021 0.5000 0.0021 0.0009 0.0004 100.0000 fda-6 1.0000 0.0045 0.0009 0.0021 0.5000 0.0021 0.0009 0.0004 100.0000 fda-7 1.0000 0.0045 0.0009 0.0021 0.5000 0.0021 0.0009 0.0004 100.0000 𝑛 = 300 init 0.9991 0.0175 0.0140 0.0308 0.5006 0.0263 0.0208 0.0689 ldaAR 0.9999 0.0091 0.0071 0.0083 0.4997 0.0127 0.0102 0.0161 ldaCS 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0002 0.0000 fda-1 0.9957 0.0767 0.0616 0.5893 0.5038 0.1164 0.0919 1.3541 92.9525 fda-2 0.9989 0.0365 0.0295 0.1331 0.4998 0.0511 0.0410 0.2608 98.7539 fda-3 0.9992 0.0349 0.0282 0.1218 0.4991 0.0483 0.0385 0.2332 99.9061 fda-4 0.9999 0.0017 0.0002 0.0003 0.4999 0.0033 0.0004 0.0011 99.9997 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 𝑛 = 500 init 0.9998 0.0148 0.0118 0.0219 0.5026 0.0201 0.0158 0.0409 ldaAR 1.0006 0.0073 0.0059 0.0054 0.5008 0.0098 0.0079 0.0097 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 fda-1 0.9990 0.0656 0.0523 0.4295 0.5114 0.0882 0.0698 0.7897 92.9459 fda-2 1.0013 0.0296 0.0233 0.0874 0.5039 0.0415 0.0329 0.1735 98.7957 fda-3 1.0012 0.0285 0.0224 0.0812 0.5036 0.0407 0.0322 0.1670 99.9389 fda-4 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 99.9998 fda-5 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-6 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-7 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 36 Table 2.4 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0020 0.0328 0.0260 0.1077 0.4998 0.0451 0.0349 0.2027 ldaAR 1.0000 0.0087 0.0069 0.0076 0.5009 0.0122 0.0096 0.0149 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0002 0.0002 0.0000 fda-1 1.0096 0.1430 0.1143 2.0496 0.5023 0.2001 0.1559 3.9978 97.9586 fda-2 1.0000 0.0326 0.0250 0.1058 0.5023 0.0440 0.0344 0.1941 99.5699 fda-3 0.9998 0.0144 0.0072 0.0208 0.4986 0.0209 0.0103 0.0436 99.8941 fda-4 1.0002 0.0053 0.0013 0.0028 0.5002 0.0067 0.0018 0.0044 99.9991 fda-5 1.0003 0.0037 0.0008 0.0014 0.4998 0.0042 0.0011 0.0018 100.0000 fda-6 1.0003 0.0037 0.0008 0.0014 0.4998 0.0042 0.0011 0.0018 100.0000 fda-7 1.0003 0.0037 0.0008 0.0014 0.4998 0.0042 0.0011 0.0018 100.0000 𝑛 = 300 init 0.9991 0.0174 0.0139 0.0303 0.5006 0.0262 0.0207 0.0684 ldaAR 1.0000 0.0047 0.0037 0.0022 0.4997 0.0066 0.0052 0.0044 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 fda-1 0.9957 0.0767 0.0616 0.5896 0.5038 0.1164 0.0919 1.3543 98.2262 fda-2 0.9996 0.0197 0.0158 0.0386 0.4996 0.0274 0.0219 0.0750 99.7663 fda-3 1.0002 0.0151 0.0101 0.0227 0.4990 0.0186 0.0125 0.0347 99.9255 fda-4 1.0001 0.0022 0.0003 0.0005 0.5001 0.0017 0.0003 0.0003 99.9997 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 𝑛 = 500 init 0.9998 0.0147 0.0117 0.0216 0.5025 0.0199 0.0157 0.0400 ldaAR 1.0003 0.0038 0.0031 0.0015 0.5003 0.0052 0.0041 0.0027 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 fda-1 0.9990 0.0656 0.0523 0.4293 0.5114 0.0882 0.0698 0.7895 98.2518 fda-2 1.0008 0.0159 0.0126 0.0253 0.5015 0.0222 0.0175 0.0495 99.8021 fda-3 1.0001 0.0126 0.0091 0.0158 0.5002 0.0179 0.0130 0.0320 99.9449 fda-4 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 99.9998 fda-5 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-6 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-7 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 37 Table 2.5 Performance of the estimation procedure where the residuals are generated from Ornstein- Uhlenbeck process (c) with 𝜇0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0003 0.0541 0.0434 0.2922 0.4994 0.0711 0.0563 0.5048 ldaAR 1.0001 0.0476 0.0383 0.2261 0.4984 0.0650 0.0513 0.4214 ldaCS 0.9994 0.0398 0.0316 0.1581 0.4978 0.0534 0.0421 0.2849 fda-1 1.0009 0.0705 0.0563 0.4964 0.5006 0.0947 0.0747 0.8954 79.4156 fda-2 1.0001 0.0453 0.0358 0.2044 0.4982 0.0608 0.0482 0.3690 94.9669 fda-3 0.9993 0.0386 0.0307 0.1491 0.4978 0.0511 0.0405 0.2613 99.9949 fda-4 1.0003 0.0127 0.0081 0.0162 0.4991 0.0242 0.0146 0.0583 99.9992 fda-5 0.9997 0.0084 0.0071 0.0071 0.4991 0.0159 0.0124 0.0254 100.0000 fda-6 0.9997 0.0084 0.0071 0.0071 0.4991 0.0159 0.0124 0.0254 100.0000 fda-7 0.9997 0.0084 0.0071 0.0071 0.4991 0.0159 0.0124 0.0254 100.0000 𝑛 = 300 init 1.0000 0.0288 0.0233 0.0829 0.5003 0.0416 0.0329 0.1728 ldaAR 1.0000 0.0258 0.0206 0.0662 0.4996 0.0368 0.0294 0.1350 ldaCS 1.0002 0.0212 0.0170 0.0449 0.4987 0.0314 0.0254 0.0983 fda-1 0.9997 0.0388 0.0316 0.1503 0.5014 0.0560 0.0440 0.3130 79.9233 fda-2 1.0005 0.0240 0.0193 0.0576 0.4983 0.0352 0.0284 0.1242 95.0283 fda-3 0.9999 0.0202 0.0161 0.0409 0.4994 0.0298 0.0240 0.0884 99.9984 fda-4 0.9999 0.0072 0.0064 0.0051 0.4997 0.0129 0.0111 0.0166 99.9998 fda-5 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-6 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-7 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 𝑛 = 500 init 1.0000 0.0288 0.0233 0.0829 0.5003 0.0416 0.0329 0.1728 ldaAR 1.0000 0.0258 0.0206 0.0662 0.4996 0.0368 0.0294 0.1350 ldaCS 1.0002 0.0212 0.0170 0.0449 0.4987 0.0314 0.0254 0.0983 fda-1 0.9997 0.0388 0.0316 0.1503 0.5014 0.0560 0.0440 0.3130 79.9233 fda-2 1.0005 0.0240 0.0193 0.0576 0.4983 0.0352 0.0284 0.1242 95.0283 fda-3 0.9999 0.0202 0.0161 0.0409 0.4994 0.0298 0.0240 0.0884 99.9984 fda-4 0.9999 0.0072 0.0064 0.0051 0.4997 0.0129 0.0111 0.0166 99.9998 fda-5 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-6 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-7 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 38 Table 2.6 Performance of the estimation procedure where the residuals are generated from Ornstein- Uhlenbeck process (c) with 𝜇0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0001 0.0454 0.0363 0.2056 0.4990 0.0591 0.0469 0.3487 ldaAR 0.9996 0.0390 0.0312 0.1521 0.4979 0.0521 0.0414 0.2717 ldaCS 0.9999 0.0429 0.0341 0.1841 0.4981 0.0564 0.0448 0.3176 fda-1 1.0005 0.0557 0.0445 0.3100 0.5003 0.0743 0.0588 0.5511 59.1692 fda-2 1.0005 0.0459 0.0362 0.2107 0.4981 0.0604 0.0480 0.3639 87.0120 fda-3 1.0000 0.0436 0.0348 0.1894 0.4975 0.0568 0.0455 0.3221 99.9908 fda-4 1.0004 0.0136 0.0058 0.0185 0.4989 0.0237 0.0106 0.0562 99.9960 fda-5 1.0001 0.0049 0.0033 0.0024 0.4997 0.0099 0.0061 0.0098 100.0000 fda-6 1.0001 0.0049 0.0033 0.0024 0.4997 0.0099 0.0061 0.0098 100.0000 fda-7 1.0001 0.0049 0.0033 0.0024 0.4997 0.0099 0.0061 0.0098 100.0000 𝑛 = 300 init 1.0002 0.0239 0.0191 0.0568 0.4998 0.0345 0.0274 0.1190 ldaAR 1.0003 0.0209 0.0167 0.0435 0.4990 0.0303 0.0244 0.0916 ldaCS 1.0003 0.0224 0.0178 0.0500 0.4992 0.0328 0.0265 0.1075 fda-1 0.9998 0.0305 0.0247 0.0928 0.5011 0.0443 0.0349 0.1957 59.4418 fda-2 1.0004 0.0240 0.0192 0.0574 0.4991 0.0347 0.0279 0.1201 86.9290 fda-3 1.0003 0.0222 0.0177 0.0492 0.4992 0.0327 0.0264 0.1070 99.9972 fda-4 1.0002 0.0043 0.0039 0.0018 0.4998 0.0075 0.0065 0.0057 99.9987 fda-5 1.0001 0.0038 0.0035 0.0014 0.4998 0.0068 0.0059 0.0046 100.0000 fda-6 1.0001 0.0038 0.0035 0.0014 0.4998 0.0068 0.0059 0.0046 100.0000 fda-7 1.0001 0.0038 0.0035 0.0014 0.4998 0.0068 0.0059 0.0046 100.0000 𝑛 = 500 init 1.0003 0.0180 0.0141 0.0323 0.5017 0.0268 0.0213 0.0720 ldaAR 1.0001 0.0155 0.0123 0.0241 0.5016 0.0232 0.0186 0.0541 ldaCS 1.0002 0.0171 0.0134 0.0291 0.5013 0.0251 0.0202 0.0629 fda-1 1.0005 0.0229 0.0180 0.0523 0.5021 0.0346 0.0276 0.1201 59.3104 fda-2 1.0004 0.0177 0.0138 0.0314 0.5022 0.0268 0.0217 0.0724 87.0183 fda-3 1.0001 0.0173 0.0136 0.0300 0.5010 0.0249 0.0197 0.0620 99.9983 fda-4 1.0000 0.0042 0.0038 0.0017 0.5005 0.0075 0.0065 0.0056 99.9992 fda-5 1.0000 0.0038 0.0035 0.0015 0.5004 0.0066 0.0058 0.0044 100.0000 fda-6 1.0000 0.0038 0.0035 0.0015 0.5004 0.0066 0.0058 0.0044 100.0000 fda-7 1.0000 0.0038 0.0035 0.0015 0.5004 0.0066 0.0058 0.0044 100.0000 39 Table 2.7 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9968 0.0525 0.0423 0.2758 0.4961 0.0755 0.0603 0.5705 ldaAR 0.9985 0.0486 0.0387 0.2361 0.4962 0.0702 0.0562 0.4938 ldaCS 0.9978 0.0389 0.0309 0.1514 0.4986 0.0549 0.0438 0.3013 fda-1 0.9960 0.0708 0.0564 0.5018 0.4951 0.1010 0.0813 1.0195 73.1399 fda-2 0.9975 0.0439 0.0347 0.1929 0.4980 0.0629 0.0496 0.3948 87.5389 fda-3 0.9975 0.0381 0.0305 0.1453 0.4987 0.0531 0.0425 0.2811 92.2253 fda-4 0.9978 0.0392 0.0311 0.1540 0.4977 0.0540 0.0434 0.2914 94.4211 fda-5 0.9978 0.0388 0.0302 0.1508 0.4975 0.0527 0.0420 0.2773 95.6881 fda-6 0.9979 0.0393 0.0306 0.1546 0.4977 0.0534 0.0424 0.2852 96.5184 fda-7 0.9990 0.0396 0.0308 0.1570 0.4986 0.0535 0.0423 0.2857 97.0882 fda-8 0.9985 0.0405 0.0317 0.1637 0.4981 0.0540 0.0429 0.2915 97.5117 𝑛 = 300 init 0.9975 0.0297 0.0236 0.0885 0.4989 0.0428 0.0352 0.1832 ldaAR 0.9972 0.0281 0.0224 0.0793 0.4992 0.0389 0.0316 0.1509 ldaCS 0.9988 0.0226 0.0180 0.0513 0.4995 0.0300 0.0238 0.0899 fda-1 0.9967 0.0391 0.0310 0.1539 0.4986 0.0588 0.0482 0.3456 73.4971 fda-2 0.9986 0.0254 0.0203 0.0648 0.4990 0.0335 0.0266 0.1122 87.5190 fda-3 0.9987 0.0221 0.0173 0.0490 0.4996 0.0293 0.0233 0.0859 92.1425 fda-4 0.9986 0.0219 0.0171 0.0479 0.4997 0.0294 0.0233 0.0862 94.3152 fda-5 0.9988 0.0213 0.0167 0.0456 0.5000 0.0287 0.0232 0.0823 95.5616 fda-6 0.9989 0.0213 0.0166 0.0454 0.5001 0.0289 0.0232 0.0832 96.3760 fda-7 0.9988 0.0212 0.0166 0.0449 0.5003 0.0291 0.0234 0.0843 96.9441 fda-8 0.9988 0.0212 0.0166 0.0449 0.5001 0.0292 0.0236 0.0852 97.3641 𝑛 = 500 init 0.9996 0.0212 0.0169 0.0450 0.4989 0.0316 0.0252 0.0999 ldaAR 0.9993 0.0189 0.0151 0.0356 0.4983 0.0294 0.0237 0.0868 ldaCS 0.9996 0.0170 0.0135 0.0288 0.4998 0.0235 0.0193 0.0553 fda-1 0.9996 0.0278 0.0222 0.0773 0.4984 0.0422 0.0333 0.1777 73.6172 fda-2 0.9992 0.0186 0.0150 0.0347 0.4995 0.0264 0.0216 0.0694 87.5609 fda-3 1.0002 0.0164 0.0130 0.0268 0.5001 0.0225 0.0183 0.0505 92.1497 fda-4 1.0001 0.0163 0.0129 0.0264 0.4999 0.0226 0.0184 0.0508 94.3084 fda-5 1.0000 0.0160 0.0127 0.0255 0.4996 0.0223 0.0182 0.0496 95.5492 fda-6 1.0000 0.0161 0.0128 0.0260 0.4995 0.0222 0.0182 0.0493 96.3576 fda-7 0.9999 0.0161 0.0128 0.0258 0.4994 0.0219 0.0179 0.0481 96.9249 fda-8 0.9999 0.0161 0.0128 0.0258 0.4994 0.0219 0.0177 0.0478 97.3423 40 Table 2.8 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9967 0.0563 0.0454 0.3170 0.4957 0.0811 0.0649 0.6591 ldaAR 0.9980 0.0506 0.0399 0.2562 0.4970 0.0724 0.0578 0.5234 ldaCS 0.9982 0.0373 0.0294 0.1389 0.4983 0.0531 0.0425 0.2814 fda-1 0.9957 0.0766 0.0611 0.5872 0.4947 0.1094 0.0881 1.1964 85.6976 fda-2 0.9979 0.0440 0.0348 0.1934 0.4975 0.0633 0.0503 0.4010 98.9964 fda-3 0.9993 0.0239 0.0187 0.0569 0.4988 0.0343 0.0273 0.1177 99.9585 fda-4 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 99.9987 fda-5 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 fda-6 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 fda-7 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 fda-8 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 𝑛 = 300 init 0.9973 0.0318 0.0253 0.1014 0.4987 0.0461 0.0378 0.2118 ldaAR 0.9972 0.0297 0.0239 0.0888 0.4992 0.0400 0.0323 0.1597 ldaCS 0.9990 0.0216 0.0172 0.0468 0.4995 0.0289 0.0229 0.0835 fda-1 0.9964 0.0424 0.0336 0.1804 0.4984 0.0637 0.0521 0.4047 86.0991 fda-2 0.9987 0.0253 0.0201 0.0642 0.4990 0.0336 0.0266 0.1129 99.0443 fda-3 0.9992 0.0146 0.0114 0.0213 0.5003 0.0197 0.0159 0.0388 99.9593 fda-4 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 99.9987 fda-5 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 100.0000 fda-6 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 100.0000 fda-7 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 100.0000 𝑛 = 500 init 0.9994 0.0226 0.0180 0.0512 0.4987 0.0340 0.0270 0.1153 ldaAR 0.9993 0.0201 0.0161 0.0403 0.4983 0.0309 0.0250 0.0954 ldaCS 0.9993 0.0163 0.0130 0.0266 0.4995 0.0227 0.0186 0.0517 fda-1 0.9995 0.0301 0.0240 0.0906 0.4983 0.0456 0.0361 0.2081 86.1915 fda-2 0.9990 0.0186 0.0150 0.0346 0.4992 0.0265 0.0217 0.0699 99.0585 fda-3 1.0004 0.0111 0.0087 0.0122 0.5002 0.0148 0.0121 0.0219 99.9596 fda-4 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 99.9987 fda-5 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 fda-6 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 fda-7 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 fda-8 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 41 Table 2.9 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9970 0.0582 0.0468 0.3390 0.4952 0.0841 0.0676 0.7089 ldaAR 0.9977 0.0528 0.0415 0.2789 0.4961 0.0772 0.0621 0.5956 ldaCS 0.9996 0.0274 0.0218 0.0747 0.4979 0.0396 0.0321 0.1572 fda-1 0.9956 0.0810 0.0646 0.6564 0.4943 0.1157 0.0933 1.3399 92.2949 fda-2 0.9995 0.0375 0.0296 0.1400 0.4966 0.0548 0.0441 0.3013 99.7252 fda-3 0.9993 0.0283 0.0203 0.0801 0.5004 0.0383 0.0274 0.1462 99.8787 fda-4 0.9991 0.0121 0.0053 0.0147 0.4995 0.0165 0.0073 0.0273 99.9461 fda-5 1.0001 0.0116 0.0022 0.0134 0.5005 0.0146 0.0030 0.0212 99.9842 fda-6 1.0000 0.0133 0.0029 0.0177 0.5004 0.0184 0.0041 0.0336 99.9979 fda-7 1.0000 0.0133 0.0028 0.0176 0.5004 0.0183 0.0040 0.0334 99.9990 fda-8 1.0000 0.0133 0.0028 0.0176 0.5004 0.0183 0.0040 0.0334 99.9998 𝑛 = 300 init 0.9972 0.0328 0.0260 0.1082 0.4986 0.0482 0.0395 0.2317 ldaAR 0.9971 0.0310 0.0251 0.0968 0.4994 0.0426 0.0344 0.1810 ldaCS 0.9995 0.0156 0.0123 0.0244 0.4996 0.0216 0.0171 0.0467 fda-1 0.9962 0.0449 0.0356 0.2027 0.4982 0.0673 0.0552 0.4524 92.6741 fda-2 0.9990 0.0213 0.0168 0.0454 0.4994 0.0291 0.0229 0.0846 99.8076 fda-3 0.9995 0.0196 0.0153 0.0384 0.4992 0.0271 0.0213 0.0736 99.9391 fda-4 1.0005 0.0075 0.0044 0.0056 0.4994 0.0105 0.0064 0.0111 99.9716 fda-5 0.9999 0.0023 0.0006 0.0005 0.5000 0.0037 0.0010 0.0014 99.9889 fda-6 1.0000 0.0025 0.0006 0.0006 0.4999 0.0049 0.0011 0.0024 99.9979 fda-7 1.0000 0.0012 0.0003 0.0001 0.5002 0.0039 0.0006 0.0015 99.9989 fda-8 1.0000 0.0012 0.0003 0.0001 0.5002 0.0039 0.0006 0.0015 99.9997 𝑛 = 500 init 0.9994 0.0232 0.0185 0.0539 0.4985 0.0352 0.0280 0.1242 ldaAR 0.9989 0.0209 0.0169 0.0437 0.4981 0.0327 0.0264 0.1073 ldaCS 0.9991 0.0119 0.0095 0.0142 0.4992 0.0168 0.0134 0.0281 fda-1 0.9995 0.0318 0.0254 0.1011 0.4981 0.0483 0.0382 0.2328 92.7586 fda-2 0.9988 0.0158 0.0127 0.0252 0.4988 0.0227 0.0183 0.0517 99.8272 fda-3 0.9989 0.0153 0.0123 0.0235 0.4989 0.0218 0.0176 0.0478 99.9574 fda-4 0.9997 0.0070 0.0050 0.0049 0.5002 0.0105 0.0072 0.0110 99.9811 fda-5 1.0000 0.0024 0.0007 0.0006 0.5001 0.0035 0.0011 0.0012 99.9923 fda-6 1.0000 0.0026 0.0006 0.0007 0.5003 0.0045 0.0011 0.0021 99.9979 fda-7 1.0001 0.0017 0.0004 0.0003 0.5000 0.0038 0.0007 0.0014 99.9989 fda-8 1.0001 0.0015 0.0003 0.0002 0.5001 0.0035 0.0007 0.0012 99.9997 42 Table 2.10 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9967 0.0565 0.0455 0.3193 0.4957 0.0814 0.0651 0.6624 ldaAR 0.9982 0.0507 0.0399 0.2567 0.4968 0.0725 0.0580 0.5257 ldaCS 0.9983 0.0349 0.0275 0.1215 0.4986 0.0495 0.0396 0.2444 fda-1 0.9957 0.0774 0.0617 0.5994 0.4946 0.1105 0.0890 1.2213 87.2296 fda-2 0.9980 0.0413 0.0326 0.1706 0.4979 0.0594 0.0471 0.3521 98.4366 fda-3 0.9989 0.0265 0.0210 0.0704 0.4988 0.0379 0.0302 0.1434 99.8285 fda-4 0.9985 0.0271 0.0212 0.0733 0.4991 0.0372 0.0296 0.1383 99.9800 fda-5 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 99.9977 fda-6 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 99.9997 fda-7 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 100.0000 fda-8 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 100.0000 𝑛 = 300 init 0.9973 0.0319 0.0253 0.1021 0.4987 0.0463 0.0381 0.2144 ldaAR 0.9971 0.0297 0.0239 0.0889 0.4992 0.0404 0.0327 0.1630 ldaCS 0.9991 0.0203 0.0161 0.0412 0.4995 0.0271 0.0215 0.0736 fda-1 0.9964 0.0429 0.0339 0.1846 0.4984 0.0643 0.0527 0.4132 87.6273 fda-2 0.9988 0.0238 0.0190 0.0568 0.4991 0.0317 0.0251 0.1002 98.4851 fda-3 0.9991 0.0159 0.0125 0.0255 0.5002 0.0215 0.0173 0.0462 99.8287 fda-4 0.9991 0.0157 0.0121 0.0248 0.5001 0.0214 0.0171 0.0457 99.9800 fda-5 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 99.9977 fda-6 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 99.9997 fda-7 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 100.0000 fda-8 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 100.0000 𝑛 = 500 init 0.9995 0.0227 0.0181 0.0514 0.4987 0.0341 0.0271 0.1160 ldaAR 0.9993 0.0200 0.0160 0.0399 0.4982 0.0310 0.0250 0.0962 ldaCS 0.9994 0.0153 0.0122 0.0235 0.4996 0.0213 0.0174 0.0454 fda-1 0.9995 0.0304 0.0243 0.0924 0.4982 0.0461 0.0365 0.2125 87.7225 fda-2 0.9991 0.0176 0.0142 0.0310 0.4993 0.0249 0.0204 0.0620 98.5022 fda-3 1.0003 0.0121 0.0096 0.0146 0.5002 0.0163 0.0133 0.0265 99.8293 fda-4 1.0005 0.0120 0.0096 0.0145 0.5004 0.0160 0.0131 0.0256 99.9801 fda-5 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 99.9977 fda-6 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 99.9997 fda-7 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 100.0000 fda-8 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 100.0000 43 Table 2.11 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9967 0.0547 0.0441 0.2993 0.4960 0.0788 0.0628 0.6206 ldaAR 0.9983 0.0498 0.0393 0.2473 0.4968 0.0714 0.0571 0.5091 ldaCS 0.9977 0.0425 0.0337 0.1805 0.4981 0.0602 0.0481 0.3618 fda-1 0.9957 0.0730 0.0583 0.5343 0.4951 0.1042 0.0839 1.0868 78.3365 fda-2 0.9972 0.0479 0.0381 0.2297 0.4975 0.0687 0.0544 0.4721 96.4030 fda-3 0.9981 0.0368 0.0293 0.1358 0.4982 0.0520 0.0418 0.2699 99.4758 fda-4 0.9982 0.0377 0.0299 0.1424 0.4979 0.0528 0.0421 0.2791 99.9264 fda-5 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 99.9902 fda-6 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 99.9987 fda-7 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 99.9998 fda-8 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 100.0000 𝑛 = 300 init 0.9974 0.0309 0.0246 0.0959 0.4988 0.0445 0.0365 0.1977 ldaAR 0.9972 0.0290 0.0233 0.0848 0.4991 0.0393 0.0317 0.1542 ldaCS 0.9987 0.0246 0.0195 0.0606 0.4994 0.0327 0.0259 0.1066 fda-1 0.9966 0.0403 0.0320 0.1633 0.4986 0.0607 0.0497 0.3682 78.7132 fda-2 0.9985 0.0276 0.0220 0.0764 0.4989 0.0364 0.0290 0.1327 96.4278 fda-3 0.9985 0.0218 0.0171 0.0477 0.4999 0.0290 0.0232 0.0842 99.4738 fda-4 0.9985 0.0218 0.0171 0.0478 0.4998 0.0293 0.0235 0.0858 99.9259 fda-5 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 99.9900 fda-6 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 99.9987 fda-7 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 99.9998 fda-8 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 100.0000 𝑛 = 500 init 0.9995 0.0221 0.0176 0.0490 0.4989 0.0330 0.0263 0.1086 ldaAR 0.9993 0.0197 0.0158 0.0389 0.4983 0.0302 0.0245 0.0910 ldaCS 0.9994 0.0184 0.0146 0.0339 0.4996 0.0257 0.0211 0.0658 fda-1 0.9996 0.0288 0.0229 0.0826 0.4984 0.0435 0.0344 0.1892 78.8101 fda-2 0.9991 0.0201 0.0162 0.0405 0.4993 0.0286 0.0235 0.0819 96.4486 fda-3 1.0003 0.0162 0.0128 0.0262 0.5000 0.0222 0.0181 0.0491 99.4752 fda-4 1.0003 0.0162 0.0129 0.0263 0.4999 0.0224 0.0182 0.0501 99.9261 fda-5 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 99.9900 fda-6 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 99.9987 fda-7 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 99.9998 fda-8 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 100.0000 44 Table 2.12 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method   𝛽1 Mean   SD       AB       MSE      𝛽2 Mean   SD       AB       MSE      FVE %
n = 100
init     0.9967    0.0504   0.0407   0.2545   0.4965    0.0725   0.0577   0.5251
ldaAR    0.9986    0.0466   0.0369   0.2173   0.4965    0.0671   0.0539   0.4508
ldaCS    0.9970    0.0473   0.0377   0.2238   0.4978    0.0668   0.0531   0.4458
fda-1    0.9959    0.0648   0.0517   0.4205   0.4960    0.0922   0.0742   0.8505   62.1253
fda-2    0.9964    0.0505   0.0405   0.2561   0.4976    0.0721   0.0572   0.5199   89.2976
fda-3    0.9970    0.0462   0.0368   0.2136   0.4974    0.0644   0.0513   0.4142   97.4903
fda-4    0.9981    0.0464   0.0363   0.2149   0.4957    0.0647   0.0520   0.4201   99.4795
fda-5    0.9986    0.0441   0.0345   0.1941   0.4952    0.0621   0.0497   0.3872   99.9012
fda-6    0.9987    0.0446   0.0350   0.1986   0.4958    0.0622   0.0500   0.3885   99.9827
fda-7    0.9986    0.0440   0.0343   0.1934   0.4959    0.0625   0.0504   0.3921   99.9971
fda-8    0.9986    0.0440   0.0343   0.1934   0.4959    0.0625   0.0504   0.3921   99.9995
n = 300
init     0.9977    0.0286   0.0229   0.0820   0.4991    0.0406   0.0332   0.1646
ldaAR    0.9975    0.0271   0.0215   0.0737   0.4991    0.0367   0.0296   0.1346
ldaCS    0.9983    0.0272   0.0216   0.0739   0.4994    0.0362   0.0288   0.1309
fda-1    0.9972    0.0355   0.0283   0.1264   0.4991    0.0539   0.0441   0.2896   62.2661
fda-2    0.9983    0.0289   0.0229   0.0835   0.4990    0.0384   0.0308   0.1473   89.2161
fda-3    0.9981    0.0266   0.0209   0.0712   0.4992    0.0355   0.0282   0.1257   97.4664
fda-4    0.9981    0.0260   0.0205   0.0677   0.4992    0.0351   0.0278   0.1227   99.4727
fda-5    0.9984    0.0244   0.0192   0.0594   0.4996    0.0332   0.0267   0.1097   99.8990
fda-6    0.9985    0.0243   0.0191   0.0589   0.4997    0.0334   0.0269   0.1115   99.9820
fda-7    0.9985    0.0237   0.0189   0.0562   0.4998    0.0334   0.0270   0.1112   99.9969
fda-8    0.9985    0.0237   0.0189   0.0562   0.4998    0.0334   0.0270   0.1112   99.9995
n = 500
init     0.9996    0.0207   0.0165   0.0430   0.4992    0.0304   0.0245   0.0921
ldaAR    0.9993    0.0187   0.0151   0.0351   0.4985    0.0281   0.0229   0.0792
ldaCS    0.9996    0.0201   0.0160   0.0404   0.4997    0.0282   0.0230   0.0792
fda-1    0.9997    0.0256   0.0204   0.0654   0.4988    0.0385   0.0304   0.1484   62.3377
fda-2    0.9992    0.0210   0.0168   0.0440   0.4995    0.0299   0.0244   0.0892   89.2460
fda-3    1.0001    0.0193   0.0153   0.0372   0.4999    0.0270   0.0219   0.0728   97.4696
fda-4    0.9997    0.0188   0.0150   0.0355   0.4993    0.0269   0.0219   0.0724   99.4726
fda-5    0.9994    0.0180   0.0142   0.0322   0.4989    0.0260   0.0211   0.0674   99.8988
fda-6    0.9993    0.0180   0.0142   0.0323   0.4988    0.0258   0.0210   0.0668   99.9818
fda-7    0.9992    0.0176   0.0140   0.0309   0.4988    0.0258   0.0209   0.0663   99.9969
fda-8    0.9992    0.0176   0.0140   0.0309   0.4988    0.0258   0.0209   0.0663   99.9995

2.4.3 Simulation results
Simulation results for the Brownian-motion case are shown in Table 2.1; in this setting, our approach produces better results in terms of the dispersion measures. Tables 2.2, 2.3 and 2.4 show results for linear processes; here the proposed method performs better when the working correlation structure is AR and is comparable for an exchangeable structure with 𝑙0 = 1, 2, 3. Moreover, for the proposed method, all dispersion measures, such as the MSE, decrease as 𝑙 increases. Results based on the OU process are reported in Tables 2.5 and 2.6; our method outperforms the existing methods in both situations and, as 𝜇0 increases, the MSE decreases. For three different parameter choices of the power exponential and rational quadratic covariance structures, numerical results are presented in Tables 2.7, 2.8, 2.9 and 2.10, 2.11, 2.12, respectively. As before, the proposed method outperforms the existing ones in all sub-cases; interestingly, as 𝑏0 increases, the MSE decreases for the power exponential structure, whereas it increases for the rational quadratic structure, as expected from the form of the covariance function.
Overall, for all the above situations, the dispersion measures such as SD and MSE decrease as the sample size increases, indicating that the parameter estimates approach the true parameters as the sample size grows. Empirically, we also observed that if the estimated number of principal components 𝜅̂ exceeds 𝜅0, then Q(𝜷̂) may not be continuous at 𝜷̂. In each of the above situations, the SDs of the proposed estimates decrease as 𝜅0 increases and stabilize beyond the value of 𝜅0 at which the fraction of variance explained (FVE) is approximately 100%.

2.5 Real data analysis
In this section, we apply our proposed method to motivating examples in two different data-sets.

2.5.1 Beijing's PM2.5 pollution study
In the atmosphere, suspended microscopic particles of solid and liquid matter are commonly known as particulates or particulate matter (PM). Such particulates often have a strongly noxious impact on human health, climate, and visibility. One common fine type of atmospheric particle is PM2.5, with a diameter of less than 2.5 micrometers. Many developed and developing cities across the world experience chronic air pollution with PM2.5 as the major pollutant; Beijing and a substantial part of China are among such places. Some studies show that there are many non-ignorable sources of variability in the distribution and transmission pattern of PM2.5, which are confounded with secondary chemical generation. The atmospheric PM2.5 data used in our analysis were collected from the UCI machine learning repository https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data (Liang et al., 2015). The data-set includes measurements of PM2.5 and associated covariates at twelve different locations in China, viz., Aotizhongxin, Changping, Dingling, Dongsi, Guanyuan, Gucheng, Huairou, Nongzhanguan, Shunyi, Tiantan, Wanliu, and Wanshouxigong, during January 2017. After excluding missing data, there were 608 hourly data points in Beijing2017-data. We assume that the measurements from the different locations are independent since the monitoring sites are located far apart. The objective of our analysis is to describe the trend of the functional response PM2.5 (as shown in Figure 2.1) and to evaluate the effect over time of covariates including the chemical compounds sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO) and ozone (O3). We smoothed the covariates and responses to reduce variability and centered them. Subsequently, we consider the following model:

Y_i(t) = 𝛽0 + SO2(t) 𝛽1 + NO2(t) 𝛽2 + CO(t) 𝛽3 + O3(t) 𝛽4 + e_i(t)    (2.16)

We use Algorithm 2.1 to estimate the coefficients of the regression model above. Consistent with the simulation results, we observe that the standard deviations of the coefficient estimates decrease as 𝜅0 increases. For a small FVE such as 50%, the corresponding 𝜅0 = 1 and the estimation procedure performs poorly, whereas for large FVE percentages the estimation procedure improves considerably in terms of standard error. The estimated values of 𝛽0, 𝛽1, 𝛽2, 𝛽3, and 𝛽4 are similar across different choices of 𝜅0. Based on the estimated standard errors and the scree plot, we conclude that a suitable choice of 𝜅0 is approximately 10. The estimated scaled eigen-values are provided in Figure 2.2, which clearly shows their decay rate. The estimated coefficients with standard errors are 0.0009 (1.1644), 0.0829 (0.2584), 0.9503 (0.1586), 0.0196 (0.0037) and 1.1523 (0.1198), respectively.
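Since the choice of 𝜅0 above is driven by the fraction of variance explained, the following is a minimal R sketch of that selection rule only; it is not Algorithm 2.1 itself, and the vector `lambda_hat` of estimated eigen-values and the 99.9% target are illustrative assumptions.

```r
# Minimal sketch (not Algorithm 2.1): choose kappa0 from estimated FPCA
# eigen-values via the fraction of variance explained (FVE).
choose_kappa0 <- function(lambda_hat, fve_target = 0.999) {
  fve <- cumsum(lambda_hat) / sum(lambda_hat)      # cumulative FVE
  list(kappa0 = which(fve >= fve_target)[1], fve = fve)
}

lambda_hat <- (1:25)^(-2)                          # made-up decaying spectrum
sel <- choose_kappa0(lambda_hat)
plot(sel$fve, type = "b",
     xlab = "number of eigen-functions", ylab = "FVE")  # scree-type plot of FVE
abline(v = sel$kappa0, lty = 2)                    # chosen kappa0
```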
Figure 2.2 Beijing2017-data results: Scree plots of fraction of variance explained (FVE).

2.5.2 DTI study for sleep apnea patients
MRI is a powerful technique for investigating structural and functional changes in the brain during pathological and neuro-psychological processes. Due to advances in diffusion tensor imaging (DTI), several studies on white matter alterations associated with clinical variables can be found in the recent literature. For our analysis, we use Apnea-data obtained from one such study on obstructive sleep apnea (OSA) patients (Xiong et al., 2017). The data consist of 29 male patients between the ages of 30 and 55 years who underwent a study for the diagnosis of continuous positive airway pressure (CPAP) therapy. Patients with sleep disorders other than OSA, night-shift workers, and patients with psychiatric disorders, hypertension, diabetes, or other neurological disorders were excluded. In this study, the psychomotor vigilance task (PVT) was performed, in which a light was randomly switched on on a screen for several seconds within a certain interval of time and subjects were asked to press a button as soon as they saw the light appear on the screen; such an experiment provides a numerical measure of sleepiness by counting the number of "lapses" for each individual. DTI was performed on a 3T MRI scanner using a commercial 32-channel head coil, followed by analysis using tract-based spatial statistics to investigate differences in fractional anisotropy (FA) and other DTI parameters. Image acquisition was as follows. An axial T1-weighted image of the brain (3D-BRAVO) was collected with repetition time (TR) = 12 ms, echo time (TE) = 5.2 ms, flip angle = 13°, inversion time = 450 ms, matrix = 384 × 256, voxel size = 1.2 × 0.57 × 0.69 mm, and scan time = 2 min 54 s. DTI data were obtained in the axial plane using a spin-echo echo planar imaging sequence with TR = 4500 ms, TE = 89.4 ms, field of view = 20 × 20 cm², matrix size = 160 × 132, slice thickness = 3 mm, slice spacing = 1 mm, and b-values = 0, 1000 s/mm².
Our objective is to investigate the structural alteration of white matter using DTI in patients with OSA over each voxel in various regions of the brain (ROIs). Thus, our response variable is one of the DTI parameters, fractional anisotropy (FA), and we are interested in studying the effect on FA, over the continuous domain of voxels, of the interaction between lapses and voxel location in each ROI. We consider the following model for each ROI:

FA_i(s) = 𝛽0 + 𝛽1 lapses_i × s + e_i(s)    (2.17)

where s ∈ S, a set of voxels in the considered ROI. Using Algorithm 2.1, we estimate the coefficients 𝛽0 and 𝛽1 in model (2.17), and the results are presented in Table 2.13. We find that the coefficient estimates are close to their initial estimates and that the estimated standard errors are smaller for the coefficients based on the proposed method. Here 𝜅0 (i.e., the number of eigen-functions) is determined by FVE for simplicity.

2.6 Discussion
In this chapter, we propose an estimation procedure for the constant linear effects model, which is commonly used in statistics (Zhang and Banerjee, 2021), especially in spatial modelling.

Table 2.13 Apnea-data results: Estimated values and associated standard errors for the regression coefficients are provided up to four decimal places based on the existing and proposed methods. The first line for each ROI shows results based on the initial estimates and the second line corresponds to the proposed estimates.
ROI      # functional points   𝛽0 Estimate   Std. Error (×100)   𝛽1 Estimate (×100)   Std. Error (×100)
ROI.6    659                   0.4512        0.1343              -0.0606              0.0130
                               0.4512        0.0983              -0.0605              0.0028
ROI.7    1362                  0.5048        0.0628               0.0309              0.0061
                               0.5050        0.0681               0.0342              0.0007
ROI.8    1370                  0.5256        0.0586              -0.0667              0.0057
                               0.5271        0.0346              -0.0733              0.0006
ROI.9    690                   0.4951        0.0910               0.2904              0.0088
                               0.5443        0.0874               0.1660              0.0014
ROI.10   699                   0.4951        0.0892               0.3314              0.0086
                               0.5262        0.1398               0.4231              0.0014
ROI.11   968                   0.4372        0.0979               0.1323              0.0095
                               0.4380        0.0637               0.1311              0.0009
ROI.12   968                   0.4529        0.0948               0.0965              0.0092
                               0.4664        0.0750               0.0504              0.0013
ROI.13   992                   0.5448        0.1060               0.3453              0.0103
                               0.5449        0.0856               0.3559              0.0011
ROI.14   992                   0.5435        0.1068               0.3432              0.0104
                               0.5436        0.0754               0.3437              0.0003
ROI.37   1236                  0.3695        0.0779              -0.1126              0.0076
                               0.3713        0.0669              -0.1175              0.0017
ROI.38   1155                  0.3564        0.0819              -0.1356              0.0079
                               0.3578        0.0420              -0.1356              0.0009
ROI.39   1124                  0.4618        0.0760               0.1972              0.0074
                               0.4621        0.0615               0.1996              0.0007
ROI.40   1125                  0.4786        0.0658               0.0953              0.0064
                               0.4780        0.0369               0.1016              0.0005
ROI.45   380                   0.4189        0.1071               0.1647              0.0104
                               0.4190        0.0175               0.1648              0.0001
ROI.46   376                   0.4074        0.1033               0.1988              0.0100
                               0.4074        0.0159               0.1994              0.0002
ROI.47   596                   0.4596        0.0932               0.1304              0.0090
                               0.4594        0.0191               0.1349              0.0001
ROI.48   600                   0.4045        0.0868               0.1100              0.0084
                               0.4036        0.0644               0.1067              0.0006

One of the key features of this estimation procedure is that it is based on the quadratic inference methodology, which has played a major role in the analysis of correlated data since it was introduced by Qu et al. (2000). In contrast to the existing method, our approach allows the number of repeated measurements to grow with the sample size; therefore, the trajectories of individuals can be observed on a dense grid of a continuum domain. Instead of assuming a working correlation structure, we propose a data-driven approach in which the eigen-functions are estimated by functional principal component analysis. We achieve √n-consistency of the parametric estimates in the regression model, even though the eigen-functions are estimated non-parametrically. Additionally, our method is easy to implement in a wide range of applications. The applicability of the proposed method is illustrated by extensive simulation studies. Moreover, two real-data applications in different scientific domains confirm the efficacy of the proposed method.

2.7 Technical details
2.7.1 Some preliminary definitions and concepts of operators
In this section, we review some basic concepts of operators and some of their useful properties. These can be found in Dunford and Schwartz (1988) and Riesz and Sz.-Nagy (1990), along with many other textbooks on functional analysis. Since FDA deals with continuous-time stochastic processes, we need tools for handling random functions, and hence an overview of function spaces is required. Perturbation theory of compact operators is required to discuss functional principal component analysis; it is presented here together with some results that are useful in our context. In the next subsection, we briefly discuss functional principal component analysis and show how the eigen-values and eigen-functions are estimated from the data at hand; this plays a fundamental role in our proposed method. Consider the standard L²[0, 1] space, the set of square-integrable real-valued functions defined on the closed interval [0, 1].
The space L²[0, 1] is equipped with the inner product

⟨f, g⟩ = ∫₀¹ f(t) g(t) dt    (2.18)

for f and g in that space, and forms a Hilbert space. Moreover, we denote by ∥·∥₂ the norm in L²[0, 1], defined as ∥f∥₂ = {∫ f²(u) du}^{1/2}. Let F be an operator that assigns to an element f in L²[0, 1] a new element F f in L²[0, 1]. The operator is linear if F(αf + βg) = αF f + βF g for any scalars α and β. It is said to be bounded if there exists a positive constant M such that ∥F f∥ ≤ M∥f∥ for all f ∈ L²[0, 1]; the smallest such bound is called the norm of the operator F, denoted by ∥F∥, and it satisfies ∥F∥ = sup_{∥f∥≤1} ∥F f∥. An operator is bounded if and only if it is continuous. F is said to be self-adjoint if ⟨F f, g⟩ = ⟨f, F g⟩ and non-negative definite if ⟨F f, f⟩ ≥ 0 for all f ∈ L²[0, 1].

Consider the linear mapping F f(·) = ∫ R(·, u) f(u) du for any function f ∈ L²[0, 1] and some integrable function R(·, ·) on [0, 1] × [0, 1]. Such a mapping is known as an integral operator, and the bivariate function R is known as its kernel in the statistics and functional analysis literature. This linear mapping is bounded since, by the Cauchy–Schwarz inequality,

|F f(t)|² ≤ ∫ R²(t, u) du ∫ f²(u) du = ∥f∥₂² ∫ R²(t, u) du    (2.19)

Furthermore, ∥F f∥₂² ≤ ∥f∥₂² ∫∫ R²(s, t) ds dt under the assumption that ∫∫ R²(u, v) du dv < ∞. It is easy to see that F f(·) is uniformly continuous, and that F is compact for a non-negative definite symmetric kernel R. If, for some λ, the Fredholm integral equation Fφ = λφ has a non-zero solution φ, then λ is called an eigen-value of F and the solution of the eigen-equation is called an eigen-function; together, the pair (λ, φ) is called the eigen-elements. Let f₁ and f₂ be elements of a Hilbert space H; the tensor product operator (f₁ ⊗ f₂) : H → H is defined by (f₁ ⊗ f₂)(g) = ⟨f₁, g⟩ f₂ for g ∈ H. For a compact self-adjoint operator on L²[0, 1], let {(λ_k, φ_k) : k ≥ 1} be the set of eigen-components such that the φ_k are orthogonal. Then any function f ∈ L²[0, 1] can be represented as

f = f₀ + Σ_{r=1}^∞ P_r f    (2.20)

where P_r is the projection operator onto the eigen-space of λ_r; in our particular situation, P_r = φ_r ⊗ φ_r. Moreover, for f₀ such that F f₀ = 0, it can be shown that

F = Σ_{r=1}^∞ λ_r P_r    (2.21)

where each λ_r is repeated according to its multiplicity. Due to the non-negative definiteness of F, the eigen-values are ordered as λ₁ ≥ λ₂ ≥ ⋯ ≥ 0. We now state Mercer's theorem (Mercer, 1909) for a symmetric continuous non-negative definite kernel function R. It states that, for the set of eigen-components {(λ_r, φ_r)}_{r≥1}, the kernel R has the representation

R(s, t) = Σ_{r=1}^∞ λ_r φ_r(s) φ_r(t)    (2.22)

where the sum converges absolutely and uniformly. In this chapter, we assume that the eigen-values are distinct for mathematical simplicity. We conclude this subsection with the perturbation theory of compact operators (compact in the sense that, for every bounded sequence {f_n}, the sequence {F f_n} has a convergent subsequence). In the statistical literature, this material can be found in Hall and Hosseini-Nasab (2006, 2009), and it is useful for bounding eigen-components in different applications. For self-adjoint compact operators F and G on a Hilbert space H, define the perturbation operator Δ = G − F, so that G = F + Δ, where G is an approximation to F and Δ is the resulting error.
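Mercer's expansion (2.22) can be checked numerically by discretizing a kernel on a grid and eigen-decomposing the resulting matrix. The following R sketch is only an illustration under an assumed Brownian-motion kernel R(s, t) = min(s, t); it is not part of the proposed estimation procedure.

```r
# Numerical illustration of Mercer's expansion: discretize R(s,t) on a grid,
# eigen-decompose the kernel matrix (scaled by the grid spacing), and rebuild
# R from a few leading components.
tt <- seq(0, 1, length.out = 101)
dt <- diff(tt)[1]
R  <- outer(tt, tt, function(s, t) pmin(s, t))    # Brownian-motion kernel (illustrative)
eig <- eigen(R * dt, symmetric = TRUE)            # approximate lambda_r and phi_r
phi <- eig$vectors / sqrt(dt)                     # rescale so that int phi_r^2 ds = 1
K   <- 10
R_trunc <- phi[, 1:K] %*% diag(eig$values[1:K]) %*% t(phi[, 1:K])
max(abs(R - R_trunc))                             # truncation error shrinks as K grows
```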
Let F and G have kernels 𝐹 and 𝐺 respectively with eigen-elements (𝜃 𝑟 , 𝜓𝑟 ) and (𝜆𝑟 , 𝜙𝑟 ). For simplicity, we assume that the eigen-values are distinct. Then the following Lemma provides perturbation of the eigen-functions. Lemma 2.7.1 (Theorem 5.1.8 in Hsing and Eubank (2015)). Let (𝜆, 𝜙) be the eigen-components of F and (𝜃, 𝜓) be that of G with multiplicity of all eigen-values are restricted to be 1. Define 𝜂 𝑘 = min𝑟≠𝑘 |𝜆𝑟 − 𝜆 𝑘 |. Assume ⟨𝜙𝑟 , 𝜓𝑟 ⟩ ≥ 0 and 𝜂 𝑘 > 0. Then ∞ ∑︁ 𝜓𝑘 − 𝜙𝑘 = (𝜃 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ𝜓 𝑘 + P𝑘 (𝜓 𝑘 − 𝜙 𝑘 ) (2.23) 𝑟=1 𝑟≠𝑘 53 The above equation follows ∞ ∑︁ 𝜓𝑘 − 𝜙𝑘 = (𝜃 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ𝜓 𝑘 + 𝑂 (∥Δ∥ 2 ) (2.24) 𝑟=1 𝑟≠𝑘 Remark 2.7.1. Equation (2.24) plays an important role in finding the bound of the proposed estimator introduced in Section 2.2.2. Note that sup𝑟 ≥1 |𝜃 𝑟 − 𝜆𝑟 | ≤ ∥Δ∥ ≤ inf 𝑟≠𝑘 |𝜆 𝑘 − 𝜆𝑟 | (see Theorem 4.2.8 in Hsing and Eubank (2015) for proof). Thus, it is easy to see, |𝜃 𝑟 − 𝜆𝑟 | ≤ |𝜆 𝑘 − 𝜆𝑟 | which implies from Equation (2.23) ∞ ∞  𝑠 ∑︁ −1 ∑︁ 𝜆 𝑘 − 𝜃𝑟 𝜓𝑘 − 𝜙𝑘 = (𝜆 𝑘 − 𝜆𝑟 ) P𝑟 Δ{𝜙 𝑘 + (𝜓 𝑘 − 𝜙 𝑘 )} + P𝑘 (𝜓 𝑘 − 𝜙 𝑘 ) 𝑟=1 𝑠=0 𝜆 𝑘 − 𝜆 𝑟 𝑟≠𝑘 ∞ ∑︁ ∞ ∑︁ = (𝜆 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ𝜙 𝑘 + (𝜆 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ(𝜓 𝑘 − 𝜙 𝑘 ) 𝑟=1 𝑟=1 𝑟≠𝑘 𝑟≠𝑘 ∞ ∑︁ ∞ ∑︁ (𝜆 𝑘 − 𝜆 𝑠 ) 𝑠 + 𝑠+1 𝑟 P Δ𝜓 𝑘 + P𝑘 (𝜓 𝑘 − 𝜙 𝑘 ) (2.25) 𝑟=1 𝑠=1 (𝜆 𝑘 − 𝜆 𝑟 ) 𝑟≠𝑘 Moreover, using Bessel’s inequality 1 , we can bound last three terms in the above equation by ∥Δ∥ 2 . Another useful tool in functional data analysis is Karhunen-Loève expansions Karhunen (1946); Loève (1946) for random function 𝑒(𝑡) which is mean zero, second order process with kernel 𝑅 is defined in Sub-section 2.2.3. It states that, with probability 1, the random function can be expressed as ∞ ∑︁ 𝑒𝑖 = 𝜉𝑖𝑟 𝜙𝑟 , where 𝜉𝑖𝑟 := ⟨𝑒𝑖 , 𝜙𝑟 ⟩ (2.26) 𝑟=1 The random variables 𝜉𝑖𝑟 are uncorrelated with mean zero and variance 𝜆𝑟 . This provides a sufficient and necessary condition for the decomposition of a random process. 2.7.2 Some useful lemmas In this section, we represent some useful lemmas. For convenience, let us recall the notation. Assume that 𝑚𝑖 s are all of the same order, viz, 𝑚 ≡ 𝑚(𝑛). Define, 𝑑𝑛1 (ℎ) = ℎ2 +ℎ𝑚/𝑚 and 𝑑𝑛2 (ℎ) = 1 Bessel’s Í∞ inequality: for any 𝑓 ∈ H, 𝑘=1 | ⟨ 𝑓 , 𝜙 𝑘 ⟩ |2 ≤ ∥ 𝑓 ∥2 54 ℎ4 + ℎ3 𝑚/𝑚 + ℎ2 𝑚/𝑚 2 where 𝑚 = lim𝑛→∞ 𝑛−1 𝑖=1 𝑚/𝑚𝑖 and 𝑚 = lim𝑛→∞ 𝑛−1 𝑖=1 (𝑚/𝑚𝑖 ) 2 . Í𝑛 Í𝑛 1/2 1/2 Denote 𝛿𝑛1 (ℎ) = 𝑑𝑛1 (ℎ) log 𝑛/(𝑛ℎ2 ) , 𝛿𝑛2 (ℎ) = 𝑑𝑛2 (ℎ) log 𝑛/(𝑛ℎ4 ) and 𝛿 𝑛 (ℎ) = ℎ2 +   ∫ 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ). Further, 𝜈 𝑡 𝑎 𝐾 𝑏 (𝑡)𝑑𝑡. Define, W = (𝝓(𝑡1 ) T , · · · , 𝝓(𝑡 𝑚 ) T ) T be matrix 𝑎,𝑏 = of order 𝑚 × 𝜅 0 obtained after stacking all 𝝓 𝑘 s and random components 𝝃 𝑖 = (𝜉𝑖1 , · · · , 𝜉𝑖𝜅0 ) T . Furthermore, 𝝃 has mean zero and variance 𝚲 which is a diagonal matrix with components 𝜆1 , · · · , 𝜆 𝑘 0 . Define the indicator function 1(𝑎 = 𝑏) = 1 if 𝑎 = 𝑏 and zero otherwise. Lemma 2.7.2. Consider 𝑍1 , · · · , 𝑍𝑛 be independent and identically distributed random variables with mean zero and finite variance. Suppose that there exists an 𝑀 such that 𝑃(|𝑍𝑖 | ≤ 𝑀) = 1 for all 𝑖 = 1, · · · , 𝑛. Let 𝑇𝑛 = 𝑛1 𝑖=1 𝑍𝑖 . then, 𝑛1 𝑖=1 𝑍𝑖 = 𝑂 ((log 𝑛/𝑛) 1/2 almost surely. If Í𝑛 Í𝑛 √︁ 𝑉 𝑎𝑟 (𝑇𝑛 ) = 𝑂 ((log 𝑛/𝑛) 1/2 ) then 𝑇𝑛 = 𝑂 (log 𝑛/𝑛) almost surely. Proof. Bernstein’s inequality states that if 𝑍1 , · · · , 𝑍𝑛 be centered independent bounded random variables with probability 1. Let 𝑇𝑛 = 𝑛1 𝑖=1 𝑍𝑖 , then let Var{𝑇𝑛 } = 𝜎𝑛2 . Then for any positive Í𝑛 𝑛𝑢 2 real number 𝑢, we have 𝑃(|𝑇𝑛 | ≥ 𝑢) ≤ exp{− 2𝜎2 +2𝑀𝑢/3 } where 𝑀 is such that 𝑃(|𝑍𝑖 | ≤ 𝑀) = 1. 
𝑛 Moreover, if 𝑇𝑛 converges to its limit in probability fast enough, then it converge almost surely in the limit, i.e., if for any 𝑢 > 0, ∞ Í 𝑛=1 𝑃(|𝑇𝑛 | ≥ 𝑢) < ∞ them 𝑇𝑛 converges to zero almost √︃ 4𝜎𝑛2 log 𝑛 4𝑀 log 𝑛 + 3𝑛 . Thus, ∞ Í Í∞ 2 surely. Now, choose 𝑢 = 𝑛 𝑛=1 𝑃(|𝑇𝑛 | ≥ 𝑢) < 𝑛=1 1/𝑛 which is finite. √︁ Therefore, 𝑇𝑛 = 𝑂 (𝑢) almost surely. Now let 𝜎𝑛 ≤ 4𝑀 2 log 𝑛/9𝑛, we have, 𝑇𝑛 = 𝑂 (log 𝑛/𝑛) and if 𝜎𝑛 = 𝑂 (1) then 𝑇𝑛 = 𝑂 ((log 𝑛/𝑛) 1/2 ) almost surely. Lemma 2.7.3. Suppose 𝑇𝑖 𝑗 are i.i.d. with density 𝑓𝑇 . Then for fixed 𝑖 = 1, · · · , 𝑛, any 𝑘 and 𝑙 ≥ 1, under assumptions (C2) and (C6)c, the following holds. 𝑚 1 ∑︁𝑖 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) = 1(𝑘 = 𝑙) + 𝑂 ((log 𝑚𝑖 /𝑚𝑖 ) 1/2 ) almost surely (2.27) 𝑚𝑖 𝑗=1 Proof. Note that,  1 ∑︁   𝑚𝑖  ∫   E 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) = 𝜙 𝑘 (𝑡)𝜙𝑙 (𝑡)𝑑𝑡 = 1(𝑘 = 𝑙) (2.28)  𝑚𝑖   𝑗=1  55 and, 𝑚𝑖 𝑚 2  1 ∑︁ 1 𝑖       ∑︁      Var 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) = E 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) − 1(𝑘 = 1)  𝑚𝑖   𝑚𝑖   𝑗=1   𝑗=1  𝑚𝑖 𝑚 𝑖 ∑︁𝑚𝑖 1 ∑︁ 1 ∑︁ = 2 E{𝜙2𝑘 (𝑇𝑖 𝑗 )𝜙2𝑙 (𝑇𝑖 𝑗 )} + 2 E{𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗1 )𝜙𝑙 (𝑇𝑖 𝑗2 )} − 1(𝑘 = 𝑙) 𝑚𝑖 𝑗=1 𝑚𝑖 𝑗1 =1 𝑗2 =1 𝑗1 ≠ 𝑗2 𝑚𝑖 𝑚 ∑︁𝑖 ∑︁ 𝑚𝑖 1 ∑︁ 1 = E{𝜙2𝑘 (𝑇𝑖 𝑗 )𝜙2𝑙 (𝑇𝑖 𝑗 )} + E{𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑙 (𝑇𝑖 𝑗1 )}E{𝜙 𝑘 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )} − 1(𝑘 = 𝑙) 𝑚𝑖2 𝑗=1 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑗1 ≠ 𝑗2  𝑚 −1 ∫ ∫  𝑚1𝑖 𝜙4𝑘 (𝑡)𝑑𝑡 + 𝑚𝑖 𝑖 ( 𝜙2𝑘 (𝑡)𝑑𝑡) 2 − 1  if 𝑘 = 𝑙    = 𝑚 −1 1 ∫ ∫ ∫  𝑚𝑖 𝜙2𝑘 (𝑡)𝜙2𝑙 (𝑡)𝑑𝑡 + 𝑚𝑖 𝑖 ( 𝜙 𝑘 (𝑡)𝜙𝑙 (𝑡)𝜙 𝑘 (𝑡 ′)𝜙𝑙 (𝑡 ′)𝑑𝑡𝑑𝑡 ′)    if 𝑘 ≠ 𝑙  = 𝑂 (1/𝑚𝑖 ) (2.29) Therefore, by applying Lemma 2.7.2, we get Equation (2.27). Lemma 2.7.4. Suppose 𝑇𝑖 𝑗 are i.i.d with density 𝑓𝑇 . Then for fixed 𝑖 = 1, · · · , 𝑛, for any 𝑘 ≥ 1, under assumptions (C2), (C6)c, the following holds. 𝑚 ∫ 1 ∑︁𝑖   𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) = 𝜇¤𝑖 (𝑡)𝜙 𝑘 (𝑡)𝑑𝑡 + 𝑂 (log 𝑚𝑖 /𝑚𝑖 ) 1/2 almost surely (2.30) 𝑚𝑖 𝑗=1 Proof. Note that,  1 ∑︁   𝑚𝑖  ∫   E 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) = 𝜇¤𝑖 (𝑡)𝜙 𝑘 (𝑡)𝑑𝑡 (2.31)  𝑚𝑖   𝑗=1  and, 𝑚𝑖 𝑚𝑖 2  1 ∑︁       1 ∑︁      Var 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) ≤ E 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 )  𝑚𝑖   𝑚𝑖   𝑗=1   𝑗=1  𝑚𝑖 𝑚 𝑖 ∑︁ 𝑚𝑖 1 ∑︁ 1 ∑︁ E 𝜇¤𝑖2 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) 2 + 2  = 2 E{ 𝜇¤ 𝑖 (𝑇𝑖 𝑗1 ) 𝜇¤𝑖 (𝑇𝑖 𝑗2 )𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )} 𝑚𝑖 𝑗=1 𝑚𝑖 𝑗1 =1 𝑗2 =1 𝑗1 ≠ 𝑗2 ∫ = 𝑂 (1/𝑚𝑖 ) since 𝜇¤ 2 (𝑡)𝜙2𝑘 (𝑡)𝑑𝑡 < ∞ (2.32) Therefore, applying the Lemma 2.7.2, we get Equation (2.30). 56 ∫ ∫ 𝜇¤𝑖 (𝑡)𝜙𝑟 (𝑡)𝑑𝑡 and 𝑉𝑟 = E{ 𝜇(𝑡)𝜙 ¤ 2 Lemma 2.7.5. Define, M𝑖𝑟 = 𝑟 (𝑡)𝑑𝑡} for 𝑟 ≥ 2. Then under Conditions (C6)a, (C6)b, for some 𝛼 > 0 such that 𝑉𝑟 𝜆𝑟−2𝑟 1+𝛼 → 0 as 𝑟 → ∞ (due to Condition (C6)b) ∞ 𝑛 ∑︁ 1 ∑︁   (𝜆 𝑘 − 𝜆𝑟 ) −1 M𝑖𝑟 𝜉𝑖𝑘 = 𝑂 (log 𝑛/𝑛) 1/2𝜆1/2 𝑘 𝑘 (1−𝛼)/2 almost surely (2.33) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 Proof. It is easy to see that,      ∞ 𝑛  1 ∑︁   ∑︁     E (𝜆 𝑘 − 𝜆𝑟 ) −1 M𝑖𝑟 𝜉𝑖𝑘 = 0 (2.34)   𝑟=1 𝑛 𝑖=1     𝑟≠𝑘     Using the spacing condition among the eigen-values in (C6)a, for each 1 ≤ 𝑘 < 𝑟 < ∞ and for non-zero finite generic constant 𝐶0 , max {𝜆 𝑘 , 𝜆𝑟 } max {𝑘, 𝑟} ≤ 𝐶0 (2.35) |𝜆 𝑘 − 𝜆𝑟 | |𝑘 − 𝑟 | Similar kind of conditions can be invoked such as the convexity assumption, i.e. 𝜆𝑟 −𝜆𝑟+1 ≤ 𝜆𝑟−1 −𝜆𝑟 for all 𝑟 ≥ 2. Thus, using Inequality (2.35), for some 𝛼 > 0, with condition 𝑉𝑟 𝜆𝑟−2𝑟 1+𝛼 → 0 as 𝑟 → ∞ we can write ∞ ∞  2 ∑︁ −2 ∑︁ max(𝑘, 𝑟) 𝑉𝑟 (𝜆 𝑘 − 𝜆𝑟 ) ≲ 𝑉𝑟 𝑟=1 𝑟=1 |𝑘 − 𝑟 | max(𝜆 𝑘 , 𝜆𝑟 ) 𝑟≠𝑘 𝑟≠𝑘 ∑︁ 𝑘2 ∑︁ 𝑟2 ∑︁ 𝑘2 ∑︁ 𝑟2 = 𝑉𝑟 𝜆𝑟−2 + 𝑉 𝑟 𝜆 −2 + 𝑉𝑟 𝜆 −2 𝑟 + 𝑉𝑟 𝜆 −2 𝑟 ≤𝑘/2 (𝑘 − 𝑟) 2 𝑟>2𝑘 𝑘 (𝑘 − 𝑟) 2 𝑘/2<𝑟<𝑘 (𝑘 − 𝑟) 2 𝑘 <𝑟<2𝑘 𝑘 (𝑘 − 𝑟) 2 ∑︁ ∑︁ ≲ 𝑉𝑟 𝜆𝑟−2 + 𝑘 2 𝑉𝑟 𝜆𝑟−2 (𝑘 − 𝑟) −2 𝑟 ≤𝑘/2,𝑟>2𝑘 𝑘/2<𝑟<2𝑘 ∑︁ ≲ 1 + 𝑘 1−𝛼 (𝑘 − 𝑟) −2 ≲ 𝑘 1−𝛼 . 
(2.36) 𝑘/2<𝑟<2𝑘 This follows the line of proofs in Hall and Hosseini-Nasab (2009) in different contexts. Thus, using the Inequality (2.36), it follows that      ∞ 𝑛   1 ∑︁ ∞ 1   ∑︁  ∑︁     −1 Var (𝜆 𝑘 − 𝜆𝑟 ) M𝑖𝑟 𝜉𝑖𝑘 = 𝜆 𝑘 𝑉𝑟 (𝜆 𝑘 − 𝜆𝑟 ) −2 = 𝑂 𝑛−1𝜆 𝑘 𝑘 (1−𝛼) (2.37)   𝑟=1 𝑛 𝑖=1   𝑛 𝑟=1   𝑟≠𝑘     𝑟≠𝑘 Therefore, applying Lemma 2.7.2, we get Equation (2.33). 57 ∫ Lemma 2.7.6. For, M𝑖𝑟 = ¤ 𝜇(𝑡)𝜙 𝑟 (𝑡)𝑑𝑡 and 𝜂 𝑘 = min𝑟≠𝑘 |𝜆 𝑘 − 𝜆𝑟 | > 0, under conditions (C6)a and (C6)b, we almost surely have the following. ∞ ∑︁ 𝜅0 𝑛 (𝜅 ) 1/2 0 ∑︁ 1 ∑︁ ∑︁ (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 M𝑖𝑟1 𝜉𝑖𝑟2 = 𝑂 ­ (log 𝑛/𝑛) 1/2 𝜅0(3−𝛼)/2𝜆−1 ® (2.38) © ª 𝜅0 𝜆𝑟 𝑟 1 ≠1 𝑟 2 ≠1 𝑛 𝑖=1 𝑟=1 « ¬ 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 Proof. It is easy to see that,      ∞ ∑︁ 𝜅0 𝑛  1 ∑︁   ∑︁     E (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 M𝑖𝑟1 𝜉𝑖𝑟2 = 0 (2.39)  𝑛   𝑟1 =1 𝑟2 =1 𝑖=1      𝑟1 ≠𝑘 𝑟2 ≠𝑘  Moreover, using the spacing condition mentioned in (C6)a, one can derive the upper bound of 𝜂−1 𝑘 by   −1 𝜂−1 𝑘 = min |𝜆 𝑘 − 𝜆𝑟 | = max |𝜆 𝑘 − 𝜆𝑟 | −1 𝑟≠𝑘 𝑟≠𝑘   max(𝑘, 𝑟) ≲ max ≤ 𝜆−1 𝑘 𝑘 (2.40) 𝑟≠𝑘 |𝑘 − 𝑟 | max(𝜆 𝑘 , 𝜆𝑟 ) Due to the monotonic decreasing property of eigen-values, for fixed 𝑘 = 1, · · · , 𝜅0 , we have 𝜅0 ∑︁ 𝜅0 ∑︁ 𝜅0 ∑︁ 𝜅0 ∑︁ −2 −2 −2 2 −2 2 𝜆𝑟 (𝜆 𝑘 − 𝜆𝑟 ) ≲ 𝜂 𝑘 𝜆𝑟 ≲ 𝜆 𝑘 𝑘 𝜆𝑟 ≲ 𝜆 𝜅0 𝜅 0 𝜆𝑟 (2.41) 𝑟=1 𝑟=1 𝑟=1 𝑟=1 𝑟≠𝑘 Therefore, the following holds under similar conditions to obtain the Inequality (2.36),   !2  ∞ ∑︁ 𝜅0 𝑛  1 ∑︁   ∑︁     E (𝜆 𝑘 − 𝜆𝑟1 ) −2 (𝜆 𝑘 − 𝜆𝑟2 ) −2 M𝑖𝑟1 𝜉𝑖𝑟2  𝑛   𝑟1 =1 𝑟2 =1 𝑖=1      𝑟1 ≠𝑘 𝑟2 ≠𝑘  ∞ 𝜅0 1 ∑︁ ∑︁ = (𝜆 𝑘 − 𝜆𝑟1 ) −2 (𝜆 𝑘 − 𝜆𝑟2 ) −2𝑉𝑟1 𝜆𝑟2 𝑛 𝑟 =1 𝑟 =1 1 1 𝑟 1 ≠𝑘 𝑟 1 ≠𝑘 𝜅0 𝜅0 1 1−𝛼 ∑︁ −2 1 3−𝛼 −2 ∑︁ ≲ 𝑘 𝜆𝑟 (𝜆 𝑘 − 𝜆𝑟 ) ≲ 𝜅 0 𝜆 𝜅0 𝜆𝑟 (2.42) 𝑛 𝑟=1 𝑛 𝑟=1 𝑟≠𝑘 Therefore, applying the Lemma 2.7.2, we get Equation (2.38). 58 2.7.3 Proof of Theorem 2.3.1 For the 𝑘−th element of g𝑛 ( 𝜷0 ) (1 ≤ 𝑘 ≤ 𝜅0 ), 𝑛 1 ∑︁ 1 T b g𝑛,𝑘 ( 𝜷0 ) = 𝝁¤ 𝚽 𝑘 (y𝑖 − 𝝁𝑖 ) 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 1 ∑︁ 1 T b = 𝝁¤ 𝚽 𝑘 W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑛 1 ∑︁ 1 T 1 ∑︁ 1 T b = 𝝁¤ 𝚽 𝑘 W𝝃 𝑖 + 𝝁¤ ( 𝚽 𝑘 − 𝚽 𝑘 )W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 := 𝐽 𝑘𝑛1 + 𝐽 𝑘𝑛2 (2.43) Now, using Lemmas 2.7.3 and 2.7.4, the first part of the expression of g𝑛,𝑘 ( 𝜷0 ) becomes 𝑛 1 ∑︁ 1 T 𝐽 𝑘𝑛1 = 𝝁¤ 𝚽 𝑘 W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑚 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 (𝜅 ) 0 1 ∑︁ 1 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 ) [1(𝑘 = 𝑙) + 𝑂 ((log 𝑚/𝑚) 1/2 )] 𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑙=1 1 𝑛 ∫ 1 ∑︁ n 1/2 on 1/2 o ≲ M𝑖𝑘 + 𝑂 ((log 𝑚/𝑚) ) 1 + 𝑂 ((log 𝑚/𝑚) ) 𝜉𝑖𝑘 where M𝑖𝑘 = 𝜇¤𝑖 (𝑡)𝜙 𝑘 (𝑡)𝑑𝑡 𝑛 𝑖=1 𝑛 1 ∑︁ n  o = M𝑖𝑘 𝜉𝑖𝑘 1 + 𝑂 (log 𝑚/𝑚) 1/2 𝑛 𝑖=1  n o = 𝑂 (log 𝑛/𝑛) 1/2 1 + (log 𝑚/𝑚) 1/2 almost surely (2.44) On the other hand, the last part of g𝑛,𝑘 ( 𝜷) can be expressed as 𝑛 1 ∑︁ 1 T b 𝐽 𝑘𝑛2 = 𝝁¤ ( 𝚽 𝑘 − 𝚽 𝑘 )W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑚 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 (2.45) 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 where 𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) := 𝜙b𝑘 (𝑇𝑖 𝑗1 ) 𝜙b𝑘 (𝑇𝑖 𝑗2 ) − 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ). Now replace G, 𝜃 and 𝜓 by F, b 𝜆 b and 𝜙b respectively since F b be the approximation of F and Δ is the corresponding perturbation operator 59 in Equation (2.23). Therefore, Lemma 2.7.1 immediately implies the following expansion, which is the key fact to represent the objective function in QIF. ∑︁∞ 𝜙b𝑘 − 𝜙 𝑘 = (𝜆 𝑘 − 𝜆𝑟 ) −1 ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 + 𝑂 (∥Δ∥ 2 ) almost surely (2.46) 𝑟=1 𝑟≠𝑘 where Δ be the integral operator with kernel 𝑅 b − 𝑅. 
Therefore, almost surely, we have, 𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) := 𝜙b𝑘 (𝑇𝑖 𝑗1 ) 𝜙b𝑘 (𝑇𝑖 𝑗2 ) − 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )         ∞ ∑︁     = 𝜙 𝑘 (𝑇𝑖 𝑗1 ) + (𝜆 𝑘 − 𝜆𝑟 ) −1 ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 (𝑇𝑖 𝑗1 ) + 𝑂 (∥Δ∥ 2 )      𝑟=1     𝑟≠𝑘          ∑︁∞     −1 2 × 𝜙 𝑘 (𝑇𝑖 𝑗2 ) + (𝜆 𝑘 − 𝜆𝑟 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 (𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ ) − 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )      𝑟=1     𝑟≠𝑘  ∑︁∞ n o = (𝜆 𝑘 − 𝜆𝑟 ) −1 ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ) + 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗2 ) 𝑟=1 𝑟≠𝑘 ∑︁ ∑︁ + (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝜙𝑟1 (𝑇𝑖 𝑗1 )𝜙𝑟2 (𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 ) 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 𝑛1 := 𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝐼𝑖𝑘𝑛2 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 ) almost surely (2.47) Therefore, by placing the expression of 𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) in Equation (2.45), we have the following. 𝑛 1 ∑︁ 1 T b 𝐽 𝑘𝑛2 = 𝝁¤ ( 𝚽 𝑘 − 𝚽 𝑘 )W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑚 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 𝑖 ∑︁ 𝑚 𝑖 ∑︁𝜅0 1 ∑︁ 1 ∑︁ n o = 𝑛1 𝜇¤𝑖 (𝑇𝑖 𝑗1 ) 𝐼𝑖𝑘 𝑛2 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) 𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 + 𝑂 (∥Δ∥ 2 ) 𝑛 𝑖=1 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑙=1 := 𝐽 𝑘1 𝑛2 𝑛2 + 𝐽 𝑘2 + 𝑂 (∥Δ∥ 2 ) almost surely (2.48) Under assumptions (C1), (C2), (C3), (C4) and (C5), by using Theorem 3.3 in Li and Hsing (2010), we have the following. ∥Δ∥ 2 = 𝑂 (ℎ4 + 𝛿𝑛2 2 (ℎ)) almost surely (2.49) 60 Now observe that 𝑛 𝑚 𝑚 𝜅0 𝑛2 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ 𝑛1 𝐽 𝑘1 = 2 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝐼𝑖𝑘 (𝑇𝑖 𝑗1 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 𝑖 ∑︁ 𝜅 0 ∑︁ 𝑚 𝑖 ∑︁ ∞ 1 ∑︁ 1 ∑︁ −1 n o = (𝜆 𝑘 − 𝜆𝑟 ) 𝜇¤𝑖 (𝑇𝑖 𝑗1 ) 𝜙𝑟 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ) + 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗2 ) 𝜙𝑙 (𝑇𝑖 𝑗2 ) 𝑛 𝑖=1 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑙=1 𝑟=1 𝑟≠𝑘 × ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑙 𝑛 𝑚 ∞ 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ ∑︁ n o ≲ (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗1 ) 1(𝑙 = 𝑘) + 𝑂 ((log 𝑚/𝑚) 1/2 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 𝑙=1 1 𝑟≠𝑘 𝑛 𝑚 ∞ 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ ∑︁ n o + (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 ) 1(𝑟 = 𝑙) + 𝑂 ((log 𝑚/𝑚) 1/2 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 𝑙=1 1 𝑟≠𝑘 𝑛 𝑚 ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁ n o ≲ (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗1 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑘 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 1 𝑟≠𝑘 𝑛 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ n o + (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑟 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 1 𝑟≠𝑘 n o 𝑛 𝑛 1/2 := (𝑈 𝑘1 + 𝑈 𝑘2 ) 1 + 𝑂 ((log 𝑚/𝑚) ) almost surely (2.50) Then applying the triangle inequality, we have the following. 𝑛 𝑚 ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁ 𝑈 𝑘1𝑛 = (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙𝑟 (𝑇𝑖 𝑗 ) ⟨Δ𝜙 𝑘 , 𝜙𝑟 ⟩ 𝜉𝑖𝑘 𝑛 𝑖=1 𝑚𝑖 𝑗=1 𝑟=1 𝑟≠𝑘 ∞ 𝑛 n −1 1 ∑︁ ∑︁ o ≲ ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟 ) M𝑖𝑟 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝜉𝑖𝑘 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 ∞ 𝑛 −1 1 ∑︁ ∑︁ n o = ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟 ) M𝑖𝑟 𝜉𝑖𝑘 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) (2.51) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 ∫ where M𝑖𝑟 = 𝜇¤𝑖 (𝑡)𝜙𝑟 (𝑡)𝑑𝑡. By Lemma 6 of Li and Hsing (2010), under conditions (C1), (C2), (C3), (C4), (C5), for any measurable bounded function 𝑒 on [0, 1], one can show the following. ∥Δ𝜙 𝑘 ∥ = 𝑂 (ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ)) ≡ 𝑂 (𝛿 𝑛 (ℎ)) almost surely (2.52) 61 where 𝛿 𝑛 (ℎ) = ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ). Thus, combining the Inequalities (2.37) in Lemma 2.7.5 and   Equation (2.52), we obtain 𝑈 𝑘1 𝑛 = 𝑂 𝛿 (ℎ)(log 𝑛/𝑛) 1/2 𝜆 1/2 𝑘 (1−𝛼)/2 {1 + (log 𝑚/𝑚) 1/2 } almost 𝑛 𝑘 surely. Next, under the spacing condition mentioned earlier and in assumption (C6)a, using the Inequality (2.40) recall 𝜂−1 −1 𝑘 ≲ 𝜆 𝑘 𝑘. 
Thus, observe that 𝑛 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ 𝑛 𝑈 𝑘2 = (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) ⟨Δ𝜙 𝑘 , 𝜙𝑟 ⟩ 𝜉𝑖𝑟 𝑛 𝑖=1 𝑚𝑖 𝑗=1 𝑟=1 𝑟≠𝑘 𝜅0 ( 𝑛 ) ∑︁ 1 ∑︁ n o ≲ ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟 ) −1 M𝑖𝑘 𝜉𝑖𝑟 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 𝜅0 ( 𝑛 ) −1 ∑︁ 1 ∑︁ n 1/2 o ≲ ∥Δ𝜙 𝑘 ∥𝜂 𝑘 M𝑖𝑘 𝜉𝑖𝑟 1 + 𝑂 ((log 𝑚/𝑚) ) (2.53) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 Using Condition (C6)b we also have 𝑉𝑘1/2 𝜂−1 1/2 −1 𝑘 ≲ 𝑉𝑘 𝜆 𝑘 𝑘 = 𝑂 (𝑘 (1−𝛼)/2 ). Finally, combining with the bounds for 𝑈 𝑘1 𝑛 , 𝑈 𝑛 , we have, almost surely, 𝑘2     ©   𝜅0   ª 𝑛2 1/2   1/2 (1−𝛼)/2 −1 1/2 ∑︁ 1/2   1/2 ® = 𝑂 ­ (log 𝑛/𝑛) 𝛿 𝑛 (ℎ) 𝜆 𝑘 𝑘 + 𝜂 𝑘 𝑉𝑘 1 + (log 𝑚/𝑚) ­ 𝐽 𝑘1 𝜆𝑟 ® ­   ®    𝑟=1    «  𝑟≠𝑘  ! ¬ 𝜅0 ∑︁ n o = 𝑂 (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝑘 (1−𝛼)/2 𝜆𝑟1/2 1 + (log 𝑚/𝑚) 1/2 𝑘=1 := 𝑂 (𝜔 𝑘1 (𝑛, ℎ)) (2.54) Í 0 1/2  where 𝜔 𝑘1 (𝑛, ℎ) = (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝑘 (1−𝛼)/2 𝜅𝑘=1 𝜆𝑟 1 + (log 𝑚/𝑚) 1/2 . It is easy to see that 𝜏 Í𝜅0 1/2 − 21 +1 1/2 𝛿 (ℎ)𝜅 (3−𝜏)/2 1 + (log 𝑚/𝑚) 1/2 where  𝑟=1 𝜆 𝑟 ∼ 𝜅 0 . Therefore, 𝜔 𝑘1 (𝑛, ℎ) ∼ (log 𝑛/𝑛) 𝑛 0 𝜏 = 𝛼 + 𝜏1 . Similarly to the derivation of the bound for 𝐽 𝑘1 𝑛2 , we can write 𝑛 𝑚 𝑚 𝜅0 𝑛2 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ 𝑛2 𝐽 𝑘2 = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 𝑚 𝜅 0 ∑︁ ∞ ∑︁ ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 𝑟 =1 𝑟 =1 1 2 1 1 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 × 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟1 (𝑇𝑖 𝑗1 )𝜙𝑟2 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 62 𝑛 𝑚 𝜅 0 ∑︁ ∞ ∑︁ ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁ ≲ (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑙=1 𝑟 =1 𝑟 =1 1 1 2 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 n o × 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟1 (𝑇𝑖 𝑗1 ) 1(𝑟 2 = 𝑙) + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝜉𝑖𝑙 𝑛 𝑚 ∞ 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ ∑︁ = (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟 =1 𝑟 =1 1 1 2 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 n o × 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟1 (𝑇𝑖 𝑗1 )𝜉𝑖𝑟2 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) ∞ ∑︁ 𝜅0 𝑛 ! ∑︁ 1 ∑︁ n o ≲ ∥Δ𝜙 𝑘 ∥ 2 (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 M𝑖𝑟1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝜉𝑖𝑟2 𝑟 1 =1 𝑟 2 =1 𝑛 𝑖=1 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 n o × 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) ∞ ∑︁ 𝜅0 ( 𝑛 ) 2 ∑︁ −1 −1 1 ∑︁ n 1/2 o = ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟1 ) (𝜆 𝑘 − 𝜆𝑟2 ) × M𝑖𝑟1 𝜉𝑖𝑟2 1 + 𝑂 ((log 𝑚/𝑚) ) 𝑟 1 =1 𝑟 2 =1 𝑛 𝑖=1 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 (2.55) Therefore, Inequality (2.55) immediately follows using Lemma 2.7.6, 𝜅0 ! 2 ∑︁ n o 𝑛2 𝐽 𝑘2 =𝑂 (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝜅0(3−𝛼)/2𝜆−1 𝜅0 { 𝜆𝑟 } 1/2 1 + (log 𝑚/𝑚) 1/2 := 𝑂 (𝜔 𝑘2 (𝑛, ℎ)) (2.56) 𝑟=1 2 almost surely where 𝜔 𝑘2 (𝑛, ℎ) = (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝜅0(3−𝛼)/2𝜆−1 1/2 . It is Í𝜅0  𝜅0 𝑟=1 𝜆𝑟 1 + (log 𝑚/𝑚) 2 𝜆𝑟 ∼ 𝜅0−𝜏1 +1 under assumption (C6)b. Thus, 𝜔 𝑘2 (𝑛, ℎ) ∼ (log 𝑛/𝑛)𝛿 (ℎ)𝜅0(4−𝜏)/2 {1+ Í𝜅0 easy to see 𝑟=1 (log 𝑚/𝑚) 1/2 }. 2 Since 𝜔𝑛2 = 𝑂 (𝜔𝑛1 ) and 𝛿 𝑛 (ℎ) = 𝑂 (𝜔𝑛1 ), in summary, for each 𝑘 = 1, · · · , 𝜅0 , the following holds almost surely.  n o  g 𝑘 ( 𝜷) = 𝑂 (log 𝑛/𝑛) 1/2 1 + (log 𝑚/𝑚) 1/2 + 𝜔 𝑘1 (𝑛, ℎ) (2.57) Therefore, AMSE of g 𝑘 ( 𝜷) is the following. AMSE{g 𝑘 ( 𝜷0 )}       1 1 3−𝜏 4 1 1 1 1 1 1 =𝑂 1+ 1 + 𝜅0 ℎ + + + + + + 𝑛 𝑚 𝑛 𝑛𝑚ℎ 𝑛2 𝑚 2 ℎ2 𝑛2 𝑚 4 ℎ4 𝑛2 𝑚ℎ 𝑛2 𝑚 3 ℎ3 63     1 1  3−𝜏 =𝑂 1+ 1 + 𝜅0 𝑅𝑛 (ℎ) 𝑛 𝑚   = 𝑂 𝑛−1 + 𝑛−1 𝜅03−𝜏 𝑅𝑛 (ℎ) since (1/𝑛𝑚) = 𝑂 (1/𝑛) (2.58) n o 1 1 1 1 1 1 where 𝑅𝑛 (ℎ) = ℎ4 + 𝑛 + 𝑛𝑚ℎ + 𝑛2 𝑚 2 ℎ 2 + 𝑛2 𝑚 4 ℎ 4 + 𝑛2 𝑚ℎ + 𝑛2 𝑚 3 ℎ 3 . Combining the above condi- tions, we find that if 𝑎 > 1/4, 𝜅0 = 𝑂 (𝑛1/(3−𝜏) ) and 𝑛−1/4 ≲ ℎ ≲ 𝑛−(𝑎+1)/5 then AMSE{g 𝑘 ( 𝜷0 )} = 𝑂 (1/𝑛). On the other hand, if 𝑎 ≤ 1/4, 𝜅 0 = 𝑂 (𝑛4(1+𝑎)/5(3−𝜏) ) and ℎ ≲ 𝑛−1/4 AMSE{g 𝑘 ( 𝜷0 )} = 𝑂 (1/𝑛). Note that for a three-dimensional array (𝜕𝐶/𝜕 𝛽1 , · · · , 𝜕𝐶/𝜕 𝛽 𝑝 ) such that the following is a 𝑝 × 1 vector. 
g( 𝜷0 ) T C−1 ( 𝜷0 ) C( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) (2.59) Therefore, 𝑛−1Q(¤ 𝜷0 ) = 2g(¤ 𝜷 ) T C−1 ( 𝜷 )g( 𝜷 ) − g( 𝜷 ) T C−1 ( 𝜷 ) C( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) (2.60) 0 0 0 0 0 And 𝑛−1Q(¥ 𝜷0 ) = 2g( ¤ 𝜷 ) T C−1 ( 𝜷 ) g( ¤ 𝜷 ) + 𝑟 𝑛1 + 𝑟 𝑛2 + 𝑟 𝑛3 + 𝑟 𝑛4 (2.61) 0 0 0 where 𝑟 𝑛1 = 2g(¥ 𝜷 )C−1 ( 𝜷 )g( 𝜷 ) 0 0 0 𝑟 𝑛2 = −4g( ¤ 𝜷 ) T C−1 ( 𝜷 ) C( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) 0 0 (2.62) 𝑟 𝑛3 ¤ 𝜷 )C−1 ( 𝜷 )C−1 ( 𝜷 ) C( = 2g( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) 0 0 0 𝑟 𝑛4 = −g( 𝜷0 ) T C−1 ( 𝜷0 ) C( ¥ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) Since g( 𝜷0 ) = 𝑂 𝑃 (𝑛−1/2 ) and the weight matrix converges almost surely to an invertible matrix, ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) = 𝑜(𝑛−1 ) almost surely. Furthermore, 𝑟 𝑛1 = 𝑂 (𝑛−1/2 ), 𝑟 𝑛2 = g( 𝜷0 ) T C−1 ( 𝜷0 ) C( 𝑜(𝑛−1/2 ), 𝑟 𝑛3 = 𝑂 (𝑛−1/2 ), and 𝑟 𝑛4 = 𝑂 (𝑛−1 ) almost surely. Combining these bounds, we have 𝑟 𝑛 = 𝑜(1) almost surely. Therefore, ∥𝑛−1Q¤ − 2g( ¤ 𝜷 ) T C−1 ( 𝜷 )g( 𝜷 )∥ = 𝑜 𝑃 (𝑛−1 ) and ∥𝑛−1Q¥ − 0 0 0 ¤ 𝜷 ) T C−1 ( 𝜷 ) g( 2g( ¤ 𝜷 )∥ = 𝑜 𝑃 (1). 0 0 0 The following lines are based on common steps in the GEE literature that includes McCullagh and Nelder (1989); Balan et al. (2005); Tian et al. (2014) among many others. Let 𝜷𝑛 = 𝜷0 + 𝛿d 64 where set 𝛿 = 𝑛−1/2 . We have to show that for any 𝜖 > 0 there exists a large constant 𝑐 such that 𝑃{ inf Q( 𝜷𝑛 ) ≥ Q( 𝜷0 )} > 1 − 𝜖 (2.63) ∥𝑑 ∥=𝑐 Note that the above statement is always true if 𝜖 ≥ 1. Thus, we assume that 𝜖 ∈ (0, 1). Due to Taylor series expansion, Q( 𝜷𝑛 ) = Q( 𝜷0 + 𝛿d) = Q( 𝜷0 ) + 𝛿dTQ( ¤ 𝜷0 ) + 1 𝛿dTQ( ¥ 𝜷0 )d + ∥d∥ 2 𝑜 𝑃 (1) (2.64) 2 Now, observe that, using Equation (2.60), √ 𝛿dTQ( ¤ 𝜷0 ) = ∥d∥𝑂 𝑃 ( 𝑛𝛿) + ∥d∥𝑂 𝑃 (𝛿) (2.65) and 1 2 T¥ ∗ ¤ 𝜷 )C−1 ( 𝜷 ) g¤ ( 𝜷 )d + 𝑛𝛿2 ∥d∥ 2 𝑜 𝑃 (1) 𝛿 d Q( 𝜷 )d = 𝑛𝛿2 dT g( 0 0 0 (2.66) 2 Therefore, for given 𝜖 > 0, there exists a large enough 𝑐 such that the above equation (2.63) holds. This implies that there exists a 𝜷ˆ that satisfies ∥ b 𝜷 − 𝜷0 ∥ = 𝑂 𝑃 (𝛿). Thus, for large 𝑛, with probability 1, Q( 𝜷) attains the minimal value at 𝜷ˆ and therefore, Q¤ = 0. 2.7.4 Proof of Theorem 2.3.2 Í𝜅0 Í𝜅0 −1 T where C−1 −1 Recall, C𝑖 = 𝑘 1 =1 𝑘 2 =1 𝚽 𝑘 1 X𝑖 C 𝑘 1 ,𝑘 2 X𝑖 𝚽 𝑘 2 , 𝑘 1 ,𝑘 2 is the (𝑘 1 , 𝑘 2 ) block of C0 . Sim- b𝑖 = Í𝜅0 Í𝜅0 𝚽 −1 Tb ilarly, we can define C 𝑘 1 =1 𝑘 2 =1 𝑘 1 X𝑖 C 𝑘 1 ,𝑘 2 X𝑖 𝚽 𝑘 2 . It is easy to observe that C𝑖 = b b C𝑖 + 𝜅𝑘01 =1 𝜅𝑘02 =1 ( 𝚽 Í Í b 𝑘 1 − 𝚽 𝑘 1 )X𝑖 C−1 XT ( 𝚽 b 𝑘 2 − 𝚽 𝑘 2 ) + 2 Í𝜅0 Í𝜅0 𝚽 𝑘 1 X𝑖 C−1 XT ( 𝚽 b 𝑘 2 − 𝚽 𝑘 2 ). 𝑘 1 ,𝑘 2 𝑖 𝑘 1 =1 𝑘 2 =1 𝑘 1 ,𝑘 2 𝑖 Therefore, 𝑛1 𝑖=1 Í𝑛 b𝑖 − C)X𝑖 = 1 Í𝑛 Í𝜅0 Í𝜅0 P𝑖𝑘 1 C−1 P𝑖𝑘 2 and 1 Í𝑛 𝝁¤ T ( C 𝝁¤ 𝑖T ( C b𝑖 − C)y𝑖 = 𝑛 𝑖=1 𝑘 1 =1 𝑘 2 =1 𝑘 1 ,𝑘 2 𝑛 𝑖=1 𝑖 1 Í𝑛 Í 𝜅 0 Í 𝜅 0 −1 𝑛 𝑖=1 𝑘 1 =1 𝑘 2 =1 P𝑖𝑘 1 C 𝑘 1 ,𝑘 2 Q𝑖𝑘 2 where P𝑖,𝑘 = 𝝁 ¤ 𝑖T D𝑖𝑘 X𝑖 and Q𝑖𝑘 = 𝝁¤ 𝑖T D𝑖𝑘 y𝑖 with D𝑖𝑘 be the dif- ference matrix with ( 𝑗1 , 𝑗2 )-th element is 𝑑𝑖 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ). Thus, note that, almsot surely we have the following relation, 𝑚 𝑚 1 ∑︁𝑖 ∑︁𝑖 P𝑖𝑘 = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝑑𝑖 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝑥𝑖 (𝑇𝑖 𝑗2 ) 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑚 𝑚 1 ∑︁𝑖 ∑︁𝑖 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 ) 𝑥𝑖 (𝑇𝑖 𝑗2 )  𝑛1 𝑛2 = 2 𝜇¤𝑖 (𝑇𝑖 𝑗1 ) 𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝐼𝑖𝑘 𝑚𝑖 𝑗1 =1 𝑗2 =1 65 𝑚 𝑚 ∞ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ −1 𝜙𝑟 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ) + 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 )  ≲ 2 ¤ 𝜇 𝑖 (𝑇𝑖 𝑗 1 ) (𝜆 𝑘 − 𝜆 𝑟 ) ⟨𝜙 𝑟 , Δ𝜙 𝑘 ⟩ 𝑚𝑖 𝑗1 =1 𝑗2 =1 𝑟=1 𝑟≠𝑘 ∑︁∞ ≲ (𝜆 𝑘 − 𝜆𝑟 ) −1 ∥Δ𝜙 𝑘 ∥ + 𝑂 (∥Δ∥ 2 ) 𝑟=1 𝑟≠𝑘 𝑚 𝑚 1 ∑︁𝑖 1 ∑︁𝑖 since ¤ 𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) and 𝜇(𝑇 𝑥𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) are finite 𝑚𝑖 𝑗=1 𝑚𝑖 𝑗=1 = 𝑂 (𝜛) (2.67) Í∞ −1 𝛿 2 +𝛿 2 (ℎ). where 𝜛 = 𝑟=1 (𝜆 𝑘 −𝜆𝑟 ) 𝑛 (ℎ) +ℎ 𝑛2 A similar result can be obtained for Q𝑖𝑘 . 
Combining 𝑟≠𝑘 2 Í𝑛 such results, in summary, we have − 𝑛2 𝑖=1 X𝑖T C b𝑖 (y𝑖 − XT b T Tb Í𝑛 𝑖 𝜷) = − 𝑛 𝑖=1 X𝑖 C𝑖 (y𝑖 − X𝑖 𝜷) + 𝑂 (𝜛𝑛 ). Since, for 𝑛 → ∞, Q( 𝜷) attains a minimal value at 𝜷 = b 𝜷, we therefore have Q( ¤ b𝜷) = 0. Thus, 𝑛 ¤ b 2 ∑︁ Q( 𝜷) = − X𝑖T C b𝑖 (y𝑖 − XT b 𝑖 𝜷) 𝑛 𝑖=1 𝑛 𝑛 2 ∑︁ b𝑖 − C𝑖 )(y𝑖 − XT b 2 ∑︁ T =− X𝑖T ( C 𝑖 𝜷) − X C𝑖 (y𝑖 − X𝑖T b 𝜷) = 0 (2.68) 𝑛 𝑖=1 𝑛 𝑖=1 𝑖 Therefore, almost surely, we have, 𝑛 2 ∑︁ T − X C𝑖 (y𝑖 − X𝑖T b 𝜷) + 𝑂 (𝜛𝑛 ) = 0 𝑛 𝑖=1 𝑖 𝑛 2 ∑︁ T − X C𝑖 (X𝑖T 𝜷0 + e𝑖 − X𝑖T b 𝜷) + 𝑂 (𝜛𝑛 ) = 0 𝑛 𝑖=1 𝑖 ( 𝑛 ) 𝑛 √ 1 ∑︁ T 1 ∑︁ T 𝑛( 𝜷 − 𝜷0 ) b X C𝑖 X𝑖 = √ X C𝑖 e𝑖 (2.69) 𝑛 𝑖=1 𝑖 𝑛 𝑖=1 𝑖 Now, using the central limit theorem, we can obtain the following. 𝑛 1 ∑︁ T 𝑑 √ X𝑖 C𝑖 e𝑖 → − 𝑁 (0, A) (2.70) 𝑛 𝑖=1 1 Í𝑛 T In addition, by the law of large numbers 𝑛 𝑖=1 X𝑖 C𝑖 X𝑖 → B in probability. Therefore, using the Slutsky theorem, we complete the proof of Theorem 2.3.2. 66 CHAPTER 3 ESTIMATION FOR VARYING-COEFFICIENT MODEL IN FUNCTIONAL DATA ANALYSIS UNDER UNKNOWN HETEROSKEDASTICITY: A GMM-BASED APPROACH 3.1 Introduction Due to modern advancements of technology, varying-coefficient models in functional data have become popular to analyze data coming from several imaging technologies such as magnetic reso- nance imaging (MRI), diffusion tensor imaging (DTI) etc. We consider the problem of estimating non-parametric coefficient function 𝜷(𝑠) which is defined on functional domain (for example, space, time) S to understand the relationship between the functional response 𝑌 (𝑠) and the real-valued covariates denoted by X = (𝑋1 , · · · , 𝑋 𝑝 ) T , which takes the following form. 𝑌 (𝑠) = XT 𝜷(𝑠) + 𝑈 (𝑠) (3.1) where 𝜷(𝑠) = (𝛽1 (𝑠), · · · , 𝛽 𝑝 (𝑠)) T is a 𝑝-dimensional vector of unknown smooth functions, and it is assumed that 𝜷(·) is twice-differentiable with continuous second-order derivatives. The random error {𝑈𝑖 (𝑠) : 𝑠 ∈ S} is assumed to be a stochastic process indexed by 𝑠 ∈ S and it characterizes the within curve dependence with mean zero and an unknown covariance function Σ(𝑠, 𝑠′) = cov{𝑈 (𝑠), 𝑈 (𝑠′)|X}. The varying-coefficient model (VCM) in Equation (3.1) allows its regression coefficient to vary over some predictors of interest. It was introduced in the literature by Hastie and Tibshirani (1993) and systematically studied in Hoover et al. (1998); Fan et al. (1999); Wu and Chiang (2000); Huang et al. (2002); Fan et al. (2003); Huang et al. (2004); Chiou et al. (2004); Ramsay and Silverman (2005); Zhang and Chen (2007); Cardot and Sarda (2008); Fan and Zhang (2008); Wang et al. (2008); Zhu et al. (2014); Kokoszka and Reimherr (2017). The notion of density is not well defined for functional responses (in general for any random function) (Delaigle and Hall, 2010), as a result of which it is difficult to take advantage of likelihood- based inference; therefore, we need to rely on the moment conditions. Typically, we assume that 67 the error term 𝑈 (𝑠) satisfies the conditional mean-zero assumption, such as E{𝑈 (𝑠)|X} = 0. By the iterated law of expectation, it is easy to see that, for a given point 𝑠 ∈ S, we can define least-square estimates as solution of the sample version of E{X[𝑌 (𝑠) − XT 𝜷(𝑠)]} = 0. Equivalently, we can obtain these estimates by a minimizer of the sample version of E{[𝑌 (𝑠) − XT 𝜷(𝑠)] 2 } which is termed as non-parametric local linear estimates (Fan and Gijbels, 1996). Since the above estimates rely only on the conditional mean-zero assumption, they become inefficient in the presence of heteroskedasticity. 
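As a point of reference for the discussion above, the following R sketch computes the pointwise least-squares estimate of 𝜷(s), i.e., the sample version of E{X[Y(s) − Xᵀ𝜷(s)]} = 0 solved separately at each grid point. The simulated design, coefficient functions, and homoskedastic errors are illustrative assumptions only; this is not the estimator developed in this chapter.

```r
# Minimal sketch (simulated data): pointwise OLS estimates of beta(s),
# obtained by solving the sample moment condition separately at each s_j.
set.seed(1)
n <- 100; r <- 50
s  <- seq(0, 1, length.out = r)
X  <- cbind(1, rnorm(n))                              # intercept + one covariate
beta_true <- rbind(sin(2 * pi * s), cos(2 * pi * s))  # p x r matrix of beta(s_j)
U  <- matrix(rnorm(n * r, sd = 0.5), n, r)            # homoskedastic errors (illustrative)
Y  <- X %*% beta_true + U                             # n x r response curves
beta_hat <- sapply(seq_len(r), function(j) coef(lm(Y[, j] ~ X - 1)))
matplot(s, t(beta_hat), type = "l", lty = 1,
        xlab = "s", ylab = expression(hat(beta)(s)))  # noisy pointwise estimates
```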
To analyze such data, there is a need for a robust estimation procedure which does not require distributional assumptions and can accommodate heteroskedasticity of unknown form. We therefore introduce a functional generalized method of moments (GMM) estimation procedure for such VCMs.

In classical statistics, the method of moments (MM) estimator solves the sample moment conditions corresponding to the population moment conditions to obtain solutions for the unknown parameters. For example, if the data Y_i are independently and identically distributed with mean 𝜇, then E{Y_i − 𝜇} = 0, and the MM estimate of 𝜇 is simply 𝜇̂ = Ȳ, the sample mean. Now consider the linear regression model Y_i = X_i^T 𝜷 + U_i, where Y_i is the response, X_i and 𝜷 are, respectively, the 𝑝-dimensional covariate vector and the unknown regression coefficient, with the simple moment restriction E{U_i | X_i} = 0. By applying the law of iterated expectation, we obtain the unconditional moment condition E{X_i U_i} = 0 for the random error U_i. It is easy to see that the MM estimates coincide with the ordinary least squares (OLS) estimates for the simple linear regression model. For a non-linear regression model with additive error, Y_i = 𝔐(X_i, 𝜷) + U_i, the moment condition is similar to that of the classical linear model, namely E{g(X_i) U_i} = 0 for any function g(·) of X. An obvious choice is g(X) = 𝜕𝔐(x, 𝜷)/𝜕𝜷, the first-order condition of the non-linear least squares estimator. In simple linear regression, suppose the covariates are decomposed into 𝑝1 and 𝑝2 components such that X_i = (X_i1, X_i2)^T with 𝑝1 + 𝑝2 = 𝑝. It is well known that, without any further assumptions, an asymptotically efficient estimator of 𝜷 is the OLS estimator. Assume that 𝜷 = (𝜷1, 𝜷2)^T, where 𝜷2 is known to be 0. Then we can rewrite the above model as Y_i = X_i1^T 𝜷1 + U_i with the analogous restriction E{X_i1 U_i} = 0. For estimating 𝜷1 based on the data {Y_i, X_i1 : i = 1, ⋯, n}, the MM estimates become inefficient since the number of moment conditions (𝑝) is larger than the number of parameters to estimate (𝑝1). This is called an "over-identified" situation, whereas when 𝑝 = 𝑝1 it is referred to as a "just-identified" situation. In general, for the over-identified situation, let g(Y, X; 𝜷) be a 𝑑-dimensional function of 𝜷 ∈ ℝ^𝑝 with 𝑑 ≥ 𝑝 such that

E g(Y_i, X_i; 𝜷0) = 0    (3.2)

where 𝜷0 is the vector of true parameters. The set of restrictions in Equation (3.2) is often called the "estimating equations". In a seminal paper, Hansen (1982) proposed an extended version of the MM approach for over-identified models. Let W_1, ⋯, W_n be a set of random variables indexed by a 𝑝-dimensional parameter vector 𝜷 with moment conditions E{g(W, 𝜷0)} = 0, where g(W, 𝜷) is a 𝑑-dimensional vector-valued function of W. The estimator is formed by choosing 𝜷 such that the sample average of the g_i is close to zero. In the over-identified problem it is not possible to satisfy all the moment conditions simultaneously; the GMM approach therefore defines an estimator that brings the sample mean of g, viz. g(W, 𝜷), close to zero. GMM estimates minimize the following objective function:

J(𝜷) = g(W; 𝜷)^T W(𝜷) g(W; 𝜷)    (3.3)

Note that Equation (3.3) is a well-defined norm as long as the weight matrix W(𝜷) is symmetric positive definite with dimension 𝑑 × 𝑑.
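A minimal R sketch of GMM estimation based on an objective of the form (3.3) is given below for a linear model with instrument vector Z_i and moment function g_i(𝜷) = Z_i(Y_i − X_i^T 𝜷), using the common two-step weight-matrix update. The data-generating design and the extra instrument are illustrative assumptions, not the functional estimator developed in this chapter.

```r
# Two-step GMM for a linear model with moments g_i(beta) = Z_i (Y_i - X_i' beta),
# Z_i a d-dimensional instrument with d >= p.
# Step 1 uses W = I; step 2 uses W = Omega_hat^{-1} built from step-1 residuals.
gmm_two_step <- function(Y, X, Z) {
  n   <- length(Y)
  Szx <- crossprod(Z, X) / n                 # d x p
  Szy <- crossprod(Z, Y) / n                 # d x 1
  b1  <- solve(t(Szx) %*% Szx, t(Szx) %*% Szy)      # step 1: identity weight
  u   <- drop(Y - X %*% b1)
  Omega <- crossprod(Z * u) / n              # (1/n) sum_i u_i^2 Z_i Z_i'
  W   <- solve(Omega)
  b2  <- solve(t(Szx) %*% W %*% Szx, t(Szx) %*% W %*% Szy)  # step 2: optimal weight
  drop(b2)
}

# Illustration with heteroskedastic errors and an extra instrument beyond the regressors
set.seed(2)
n <- 500; x <- cbind(1, rnorm(n))
u <- rnorm(n, sd = exp(x[, 2]))              # heteroskedasticity of "unknown" form
y <- x %*% c(1, 0.5) + u
z <- cbind(x, x[, 2]^2)                      # Cragg-type additional moment condition
gmm_two_step(y, x, z)                        # should be close to (1, 0.5)
```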
When the likelihood is unspecified or misspecified, GMM is an alternative to likelihood-based estimation and has become quite popular in statistics and econometrics over the last few decades due to its intuitive idea and broad applicability; its properties are also well understood. In the original version described in Hansen (1982), a two-step GMM proceeds as follows: first, compute 𝜷̆ ∈ arg min_𝜷 g(𝜷)^T g(𝜷); then estimate the weight matrix W based on the first-step estimates, W(𝜷̆). The two-step GMM estimates are then 𝜷̂ ∈ arg min_𝜷 g(𝜷)^T W(𝜷̆) g(𝜷). Under standard regularity conditions, it is well known that the GMM estimator is consistent and asymptotically normally distributed with asymptotic covariance matrix n⁻¹ {G(𝜷0)^T W G(𝜷0)}⁻¹ G(𝜷0)^T W 𝛀 W G(𝜷0) {G(𝜷0)^T W G(𝜷0)}⁻¹, where G(𝜷0) = E{∇_𝜷 g(W; 𝜷0)} and 𝛀 = E{g(W; 𝜷0) g(W; 𝜷0)^T} for the true 𝜷0 (Newey and McFadden, 1994). Note that the variance of the GMM estimator depends on the weight matrix and is optimal if W = 𝛀⁻¹, in which case the asymptotic variance reduces to n⁻¹ {G(𝜷0)^T 𝛀⁻¹ G(𝜷0)}⁻¹.

A crucial issue in the above estimation procedure is choosing the number of moment conditions. Increasing the number of moment conditions may lead to finite-sample bias; depending on the situation, however, additional moment conditions may be beneficial. Adding moment conditions leads to a decrease (or at least no increase) in the asymptotic variance of the estimator, since the optimal weight matrix for a subset of moment conditions is not optimal for the full set. In the presence of heteroskedasticity, the set of moment conditions g(W; 𝜷) = (XU, 𝔐(X)U)^T produces an estimator more efficient than the least squares estimates for the simple linear regression model when 𝔐 is chosen appropriately. It is of interest to determine how many such functions should be used to obtain a better estimator. Moreover, some choices of g are better than others depending on additional assumptions. For example, in a linear regression model, choosing g = XU yields the OLS estimator, whereas choosing g = XU/Var{U|X} leads to a generalized least squares estimator under heteroskedasticity. Var{U|X} can be modelled parametrically and substituted into the above mean-zero function. If the complete likelihood is known, one can choose g = ∇_𝜷 ln{P(U|X)}, where P(·) is the density of U. When explanatory variables are endogenous, additional moment conditions may create bias and, as a result, increase the small-sample variance. However, this issue does not arise when the explanatory variables are exogenous, and the presence of heteroskedasticity does not cause OLS inconsistency in classical linear regression. Lu and Wooldridge (2020) obtained an asymptotically efficient estimator building on Cragg (1983), who showed the existence of GMM estimators that are more efficient than OLS in the presence of heteroskedasticity of unknown form. It is well known that constant- or varying-coefficient models may often be misspecified, which can lead to inconsistent estimation. In non-parametric varying-coefficient models, mostly the exogenous regression situation has been considered so far (Hastie and Tibshirani, 1993; Fan et al., 1999).
In the last two decades, some semi-parametric models have been considered with endogenous variables, using non-parametric or semi-parametric GMM approaches with instrumental variables. For example, Cai and Li (2008) proposed a one-step local linear GMM estimator that corresponds to the local linear GMM discussed in Su et al. (2013) with an identity weight matrix. Tran and Tsionas (2009) provided a local constant two-step GMM estimator with the weight matrix specified by minimizing the asymptotic variance. Su et al. (2013) developed a local linear GMM estimation procedure for functional coefficient instrumental variable models with a general weight matrix under exogenous conditions. Cai et al. (2006) proposed a two-step local linear estimation procedure for the functional coefficients, which first estimates a high-dimensional non-parametric model and then estimates the functional coefficients using the first-step non-parametric estimates as generated regressors. In contrast to classical GMM, for the non-parametric local linear GMM estimator the integrated mean squared error increases as the number of instrumental variables increases for an arbitrary choice of instruments (Bravo, 2021).

The current work is motivated by a problem encountered in diffusion tensor imaging (DTI), where multiple diffusion properties are measured along common major white matter fiber tracts across multiple individuals to characterize the structure and orientation of white matter in the human brain. Recently, a study was performed to understand white matter structural alteration using DTI in obstructive sleep apnea patients (Xiong et al., 2017). As an illustration, we present smoothed functional data to analyze the efficiency properties of the network generated by the diffusion properties of water molecules. We plot the graphical characteristics of one of the diffusion properties, fractional anisotropy (FA), over different significance levels to obtain the graphical connectivity from 29 patients in Figure 3.1. Scientists are often interested in knowing the individual association of the average path length (APL) of the network generated from FA with a set of covariates of interest, such as age and lapses score (see Section 2.5). Moreover, in these data there is clear evidence of heteroskedasticity in the covariates. Details about the data-set and associated variables are given in Section 3.6. We therefore need an estimation procedure which (1) does not need knowledge of the distribution, (2) can handle heteroskedasticity in the covariates, (3) can estimate non-parametric coefficient functions from varying-coefficient models, and (4) has a systematic technique for computing an efficient estimator.

Figure 3.1 Apnea-data: Smoothed average path length (APL) from 29 patients over different thresholds. The black solid line indicates the mean of APL over thresholds.

In this chapter, we develop a local linear GMM estimation procedure for the varying-coefficient model. For given instrumental variables, we propose an optimal local linear GMM estimator motivated by Lu and Wooldridge (2020). However, the key difference of our approach from the latter is that we model the variance of the integrated squared error as a non-parametric function of the covariates, whereas they assume a parametric form in the classical regression setting.
Therefore, we can ensure that the proposed estimator is at least as efficient as the local linear estimates (the initial estimator), and more efficient in the presence of heteroskedasticity.

This chapter is organized as follows. In Section 3.2, we introduce our varying-coefficient model and propose a local linear GMM estimator. In Section 3.3, we present a multi-step estimation procedure. We establish asymptotic results in Section 3.4. We perform a set of simulation studies to understand the finite-sample performance of the proposed estimator and present them in Section 3.5. In Section 3.6, we apply the proposed method to a real imaging data-set on obstructive sleep apnea (OSA). In Section 3.7, we conclude this chapter with some discussion. All technical details are provided in Section 3.8.

3.2 Varying-coefficient functional model and moment conditions
In this section, we first introduce heteroskedastic conditions for the SVC model and thereafter propose a heuristic method to construct a mean-zero function.

3.2.1 Model
Let {Y_i(s), X_i} for i = 1, ⋯, n be independent copies of {Y(s), X}. Instead of observing the entire functional trajectory, one observes Y(s) only on the discrete spatial grid {s_1, ⋯, s_r} on the functional domain S. The data can be Gaussian or non-Gaussian and homoskedastic or heteroskedastic, depending on the application. Therefore, the observed data for the i-th individual are {s_j, Y_i(s_j), X_i : j = 1, ⋯, r}. To simplify notation, define Y_ij = Y_i(s_j) and U_ij = U_i(s_j). Considering the functional principal component analysis (FPCA) model for U_i(s), we assume that U_i(s) is square-integrable and admits the Karhunen-Loève expansion (Karhunen, 1946; Loève, 1946). Let 𝜔1(X) ≥ 𝜔2(X) ≥ ⋯ ≥ 0 be the ordered eigen-values of the linear operator determined by Σ, with Σ_{k=1}^∞ 𝜔_k(X) finite and the 𝜓_k(s) being the corresponding orthonormal eigen-functions or principal components. Thus, the spectral decomposition (Mercer, 1909) is given by

Σ(s, s′) = Σ_{k=1}^∞ 𝜔_k(X) 𝜓_k(s) 𝜓_k(s′)    (3.4)

Therefore, U_i(s) admits the Karhunen-Loève expansion

U_i(s) = Σ_{k=1}^∞ 𝜉_k(X_i) 𝜓_k(s)    (3.5)

where 𝜉_k(X_i) = ∫_S U_i(s) 𝜓_k(s) ds is termed the k-th functional principal component score for the i-th individual. The 𝜉_k(X_i) are uncorrelated over k with E{𝜉_k(X_i) | X_i} = 0 and Var{𝜉_k(X_i) | X_i} = 𝜔_k(X_i), k ≥ 1. Furthermore, we assume that the eigen-values vary with X_i such that 𝜔_k(X_i) = 𝜃_k 𝜎²(X_i) for some unknown function 𝜎(X) ≥ 0 and 𝜃1 ≥ 𝜃2 ≥ ⋯ ≥ 0. For identifiability, we need some restriction on the 𝜃_k, such as a known or fixed 𝜃1. This assumption on the eigen-values of the spectral decomposition allows us to incorporate heteroskedasticity into the model. To the best of our knowledge, this is the first attempt to model the SVC model with unknown heteroskedasticity.
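Before constructing the moment conditions, the following R sketch simulates error curves from the heteroskedastic model of this subsection, U_i(s) = Σ_k 𝜉_k(X_i) 𝜓_k(s) with Var{𝜉_k | X_i} = 𝜃_k 𝜎²(X_i). The particular 𝜎(·), 𝜃_k and 𝜓_k used below are illustrative assumptions only.

```r
# Simulation sketch of the heteroskedastic FPCA error model of Section 3.2.1.
set.seed(3)
n <- 50; r <- 100; K <- 4
s     <- seq(0, 1, length.out = r)
X     <- runif(n, -1, 1)                        # one scalar covariate, for illustration
sig2  <- function(x) exp(x)                     # assumed variance function sigma^2(X)
theta <- 0.5^(1:K)                              # decaying eigen-value weights theta_k
psi   <- sqrt(2) * sapply(1:K, function(k) sin(k * pi * s))  # orthonormal psi_k on [0,1]
xi    <- sapply(1:K, function(k) rnorm(n, sd = sqrt(theta[k] * sig2(X))))  # scores xi_k(X_i)
U     <- xi %*% t(psi)                          # n x r matrix of error curves U_i(s_j)
matplot(s, t(U), type = "l", lty = 1, xlab = "s", ylab = "U(s)")
```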
3.2.2 Local-linear mean-zero function
Let us reiterate our main objective: we want to efficiently estimate the varying-coefficient functions based on GMM in the case of continuum moment conditions together with infinite-dimensional parameters. Therefore, we need to construct a mean-zero function, which is described in this sub-section.

Since 𝜷(·) in model (3.1) is twice continuously differentiable, we can apply a Taylor series expansion to 𝜷(s_j) around an interior point s_0 and get

𝜷(s_j) = 𝜷(s_0) + 𝜷̇(s_0)(s_j − s_0) + 𝜷̈(s*)(s_j − s_0)²/2    (3.6)

where s* lies between s_j and s_0 for all j = 1, ⋯, r, and 𝜷̇ and 𝜷̈ denote the gradients of 𝜷 and 𝜷̇ with respect to s. Thus, 𝜷(s_j) can be approximated componentwise as 𝛽_k(s_j) ≈ 𝛽_k(s_0) + {𝜕𝛽_k(s_0)/𝜕s}(s_j − s_0). In matrix notation, the first-order Taylor series expansion of the coefficient functions becomes

𝜷(s_j) ≈ A(s_0) z_h(s_j − s_0)    (3.7)

where z_h(s_j − s_0) = (1, (s_j − s_0)/h)^T and A(s_0) = [𝜷(s_0), h𝜷̇(s_0)], a p × 2 matrix. Hence, applying the approximation in Equation (3.7), we can rewrite model (3.1) as

Y_ij ≈ X_i^T A(s_0) z_h(s_j − s_0) + U_ij = [z_h(s_j − s_0) ⊗ X_i]^T vec{A(s_0)} + U_ij = W_ij(s_0)^T 𝜸(s_0) + U_ij    (3.8)

for s_j sufficiently close to s_0, where W_ij(s_0) = z_h(s_j − s_0) ⊗ X_i and 𝜸(s_0) = (𝜷(s_0)^T, h𝜷̇(s_0)^T)^T, both of which are vectors of length 2p.

Let K(·) be a symmetric probability density function used as a kernel and h > 0 be the bandwidth; the re-scaled kernel function is K_h(·) = K(·/h)/h. For a given location s_0 ∈ S, we can construct a least squares estimator of 𝜸(s_0) defined in Equation (3.8) by minimizing the sample version of the mean squared error E{[Y_ij − W_ij(s_0)^T 𝜸(s_0)]² | X_i}. Let 𝔐(X) be a q-dimensional instrumental variable with q ≥ p; the moment condition can be written as E{(1/r) Σ_{j=1}^r K_h(s_j − s_0) 𝚫_ij(s_0)} = 0, where 𝚫_ij(s_0) = 𝔐(X_i){Y_ij − W_ij(s_0)^T 𝜸(s_0)} is a zero-mean stochastic process. Two popular approaches to construct optimal instrumental variables were proposed by Newey (1990) and Ai and Chen (2003); due to the functional dependence and the presence of heteroskedasticity of unknown form, these approaches cannot be applied to model (3.8). Since the Taylor series expansion provides only a local approximation of the function, we need to incorporate this into the instrumental variables when constructing the mean-zero function. Motivated by the idea of the local linear estimator (Fan and Gijbels, 1996), consider the local linear instrumental variables Q_ij(s_0) = (𝔐(X_i), 𝔐(X_i)(s_j − s_0)/h)^T. We therefore consider the following non-parametrically localized augmented orthogonal moment conditions for estimating 𝜷(s):

g_i{𝜸(s_0)} = (1/r) Σ_{j=1}^r K_h(s_j − s_0) Q_ij(s_0) {Y_ij − W_ij(s_0)^T 𝜸(s_0)}
           = ( (1/r) Σ_{j=1}^r K_h(s_j − s_0) 𝚫_ij(s_0) ,  (1/r) Σ_{j=1}^r K_h(s_j − s_0) {(s_j − s_0)/h} 𝚫_ij(s_0) )^T
           = (1/r) Σ_{j=1}^r K_h(s_j − s_0) z_h(s_j − s_0) ⊗ 𝚫_ij(s_0)    (3.9)

Note that {g_i(𝜸(s)) : i = 1, ⋯, n} are independent and E{g_i(𝜸(s))} = 0_{2q×1} for s ∈ S. Most varying-coefficient models in the literature assume homoskedasticity in the covariates and are limited to weakly dependent non-parametric models (Su et al., 2013; Sun, 2016), which differs significantly from our model assumptions. In contrast, we assume a varying-coefficient model under heteroskedasticity of unknown form.
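To make the construction concrete, the following R sketch evaluates the localized moment condition (3.9) for one subject at a fixed s_0, with a Gaussian kernel and the simplest instrument 𝔐(X) = X. All inputs are illustrative assumptions.

```r
# Minimal sketch of the localized moment condition (3.9) at a fixed s0:
# g_i(gamma(s0)) = (1/r) sum_j K_h(s_j - s0) z_h(s_j - s0) %x% M(X_i){Y_ij - W_ij(s0)' gamma(s0)}.
local_moment <- function(y_i, X_i, M_i, s, s0, h, gamma0) {
  r   <- length(s)
  Kh  <- dnorm((s - s0) / h) / h                   # Gaussian kernel K_h(s_j - s0)
  out <- 0
  for (j in seq_len(r)) {
    zh    <- c(1, (s[j] - s0) / h)                 # z_h(s_j - s0)
    Wij   <- kronecker(zh, X_i)                    # W_ij(s0) = z_h %x% X_i (length 2p)
    resid <- y_i[j] - sum(Wij * gamma0)            # Y_ij - W_ij(s0)' gamma(s0)
    out   <- out + Kh[j] * kronecker(zh, M_i * resid)   # z_h %x% Delta_ij(s0)
  }
  out / r                                          # a (2q)-vector
}

# Tiny usage example with p = 2, q = 2 and gamma evaluated at the truth
s   <- seq(0, 1, length.out = 100); s0 <- 0.5; h <- 0.1
X_i <- c(1, 0.7); M_i <- X_i                       # simplest instrument: M(X) = X
beta_s <- function(t) c(sin(2 * pi * t), cos(2 * pi * t))
y_i <- sapply(s, function(t) sum(X_i * beta_s(t))) + rnorm(length(s), sd = 0.1)
gamma0 <- c(beta_s(s0), h * c(2 * pi * cos(2 * pi * s0), -2 * pi * sin(2 * pi * s0)))
local_moment(y_i, X_i, M_i, s, s0, h, gamma0)      # close to zero up to noise and bias
```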
The key ideas of each step are described below. Step-I. Calculate the least squares estimates of 𝜷(𝑠) as initial estimates, denoted by 𝜷(𝑠) ˘ across all 𝑠 ∈ S. Step-II. Estimate the conditional variance of integrated square residuals non-parametrically, subse- quently estimate the covariance of mean-zero function. Estimate the eigen-components using multivariate FPCA. Step-III. Project the continuous moment conditions onto eigen-functions and then combine them by weighted eigenvalues to incorporate spatial dependence and thus obtain the updated estimate of 𝜷(𝑠), denoted by b 𝜷(𝑠) across all 𝑠 ∈ S. 3.3.1 Step-I: Initial least squares estimates We consider a local linear smoother (Fan and Gijbels, 1996) to obtain an initial estimator of 𝜷(·) ignoring functional dependencies. In this case, the non-linear least squares function of the model 3.1 can be defined as an objective function given by J𝑖𝑛𝑖𝑡 {𝜷(·)} = 𝑛𝑟1 𝑖=1 T 2 Í𝑛 Í𝑟 𝑗=1 {𝑌𝑖 𝑗 − X𝑖 𝜷(𝑠 𝑗 )} . By the local linear smoothing method we estimate 𝜸 at functional point 𝑠0 , by minimizing 𝑛 𝑟 1 ∑︁ ∑︁ 2 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) 𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠0 ) T 𝜸(𝑠0 )  J𝑖𝑛𝑖𝑡 {𝜸(𝑠0 )} = (3.10) 𝑛𝑟 𝑖=1 𝑗=1 The solution of the above least-squares problem can be expressed as 𝑛 ∑︁ 𝑟 −1  𝑛 ∑︁ 𝑟  1 ∑︁   T     1 ∑︁     𝜸(𝑠 ˘ 0) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )𝑌𝑖 𝑗  𝑛𝑟   𝑛𝑟   𝑖=1 𝑗=1   𝑖=1 𝑗=1  (3.11) 76 Consequently, the estimator of the coefficient function vector 𝜷(𝑠) at 𝑠0 is ˘ 0 ) = [(1, 0) ⊗ I 𝑝 ] 𝜸(𝑠 𝜷(𝑠 ˘ 0) (3.12) We determine the tuning parameter ℎ by using some data-driven techniques such as cross-validation and generalized cross-validation (Hastie and Tibshirani, 2017). 3.3.2 Step-II: Intermediate steps Step-II consists of two important steps in determining the class of GMM estimator. First in Step-II.A, we propose a method to obtain optimal instrument variables and therefore estimate the eigen-components which are used in local linear GMM objective function in Step-III. To estimate eigen-components, we essentially need to use a multivariate version of FPCA which is quite uncommon in the literature. We borrow the method proposed by Wang (2008). Step-II.A: Choice of instrument variables The choice of instrument variables is critical and the required identification condition is 𝑞 ≥ 𝑝, which ensures that the dimension of Q𝑖 𝑗 (𝑠0 ) is at least equal to the dimension of 𝜸(𝑠0 ). In our model, as discussed in Section 3.2, the error term has a potential heteroskedasticity of unknown form. We define a set of independent and identically distributed random variables 𝑅1 , 𝑅2 , · · · , 𝑅𝑛 ∫ for 𝑛 individuals, where 𝑅𝑖 = 𝑈𝑖2 (𝑠)𝑑𝑠 for each 𝑖, termed as integrated square of residuals, and E{𝑅𝑖 |X𝑖 } = 𝜎 2 (X𝑖 ) ∞ Í 𝑘=1 𝜃 𝑘 . Therefore, consider the following non-parametric regression problem. log 𝑅𝑖 = log 𝜎 2 (X𝑖 ) + 𝜖𝑖 (3.13) where 𝜖𝑖 is the mean zero random variable with constant variance. The above model in Equation (3.13) boils down to the problem of estimation of log 𝜎 2 (X𝑖 ) by regressing the logarithmic value of the integrated squared residual variables on the covariates X𝑖 . This approach is inspired by Yu and Jones (2004); Wasserman (2006) although used in a different context. Since 𝑈𝑖 s are not observable, ˘ we replace 𝑈𝑖 by an efficient estimate that is obtained from Step-I, viz., 𝑈˘ 𝑖 (𝑠) = 𝑌𝑖 𝑗 − X𝑖T 𝜷(𝑠) for 77 all 𝑠 ∈ S. 
This step can easily be implemented using “gam" function available in mgcv package in R to get an estimate of the non-parametric mean function, denoted by 𝜇 b(X) and therefore 𝜎 2 (X) = exp{b b 𝜇 (X)}. Given the estimate of 𝜎(·), we can, therefore, choose instrument variables T as 𝔐(X𝑖 ) = X𝑖 , X𝑖 /b 𝜎 2 (X𝑖 ) . Step-II.B: Estimation of eigen-components Without loss of generality, assume that the spectrum of functional domain S = [0, 1] and the dimension of mean-zero function g(𝑠) = (𝑔1 (𝑠), · · · , 𝑔𝑑 (𝑠)) T is 𝑑 (in our problem, it equals 2𝑞) for Í𝑑 ∫ simplicity. Note that g(𝑠) is defined on an interval [0, 1] such that 𝑙=1 E{𝑔𝑙2 (𝑠)}𝑑𝑠 is finite and the covariance function C(𝑠, 𝑠′) = E g{𝜸(𝑠)}g{𝜸(𝑠)}T . Under the condition (C6) mentioned in  Section 3.4, using the lining-up method in Wang (2008), define a new stochastic process 𝑒(𝑠) on the interval [0, 𝑑] with eigen-function 𝜙 𝑒 such that,     𝑔1 (𝑠) 0≤𝑠<1 𝜙1 (𝑠) 0≤𝑠<1                 𝑔2 (𝑠 − 1) 1≤𝑠<2 𝜙2 (𝑠 − 1) 1≤𝑠<2                 · · · · · ·       𝑒(𝑠) = 𝜙 𝑒 (𝑠) =       𝑔𝑙 (𝑠 − (𝑙 − 1)) 𝑙−1 ≤ 𝑠 < 𝑙     𝜙𝑙 (𝑠 − (𝑙 − 1)) 𝑙−1 ≤ 𝑠 < 𝑙         · · ·   · · ·              𝑔𝑑 (𝑠 − 𝑑 + 1) 𝑑−1 ≤ 𝑠 < 𝑑  𝜙 𝑑 (𝑠 − 𝑑 + 1) 𝑑−1 ≤ 𝑠 < 𝑑       where we define the eigen-function for each 𝑔𝑙 as 𝜙𝑙 for 𝑙 = 1, · · · , 𝑑. Therefore, the covariance function between 𝑔𝑙 (𝑠) and 𝑔𝑙 ′ (𝑠′) can be expressed as 𝐶𝑙,𝑙 ′ (𝑠, 𝑠′) = cov{𝑔𝑙 (𝑠 − (𝑙 − 1)), 𝑔𝑙 ′ (𝑠′ − (𝑙 ′ − 1))} for 𝑙 − 1 ≤ 𝑠 < 𝑙 and 𝑙 ′ − 1 ≤ 𝑠′ < 𝑙 ′; 𝑙, 𝑙 ′ = 1, · · · , 𝑑. Note that for 𝑑-dimensional processes, the Fredholm integral equation (Porter et al., 1990) is equivalent to 𝑑-simultaneous integral equations where each of them corresponds to a specific functional interval of 𝑒(𝑠). For 𝑙 − 1 ≤ 𝑠 < 𝑙; 𝑙 = 1, · · · , 𝑑, the Fredholm integral equation yields ∫ 𝑑 cov{𝑒(𝑠), 𝑒(𝑠′)}𝜙 𝑒 (𝑠)𝑑𝑠 = 𝜆𝜙 𝑒 (𝑠) (3.14) 0 78 Now observe that for (𝑙 − 1) ≤ 𝑠 < 𝑙; 𝑙 = 1, · · · , 𝑑, the above relation is equivalent to the following. ∑︁ 𝑑 ∫ 𝑙′ cov{𝑔𝑙 (𝑠 − (𝑙 − 1)), 𝑔𝑙 ′ (𝑠′ − (𝑙 ′ − 1))𝜙𝑙 ′ (𝑠′)𝑑𝑠′ = 𝜆𝜙𝑙 (𝑠 − (𝑙 − 1)) 𝑙 ′ =1 𝑙 ′ −1 ∑︁ 𝑑 ∫ 1 cov{𝑔𝑙 (𝑠 − (𝑙 − 1)), 𝑔𝑙 ′ (𝑠′)}𝜙𝑙 ′ (𝑠′)𝑑𝑠′ = 𝜆𝜙𝑙 (𝑠 − (𝑙 − 1)) 𝑙 ′ =1 0 ∑︁ 𝑑 ∫ 1 cov{𝑔𝑙 (𝑠), 𝑔𝑙 ′ (𝑠′)}𝜙𝑙 ′ (𝑠′)𝑑𝑠′ = 𝜆𝜙𝑙 (𝑠) (3.15) 𝑙 ′ =1 0 In a multivariate setting, the orthogonality condition is ∫ 𝑑 ∑︁𝑑 ∫ 𝑘 ′ 𝜙 𝑒,𝑙 (𝑠)𝜙 𝑒,𝑙 ′ (𝑠)𝑑𝑠 = 1(𝑙 = 𝑙 ) = 𝜙 𝑘,𝑙 (𝑠 − (𝑘 − 1))𝜙 𝑘,𝑙 ′ (𝑠 − (𝑘 − 1))𝑑𝑠 0 𝑘=1 𝑘−1 ∑︁𝑑 ∫ 1 = 𝜙 𝑘,𝑙 (𝑠)𝜙 𝑘,𝑙 ′ (𝑠)𝑑𝑠 (3.16) 𝑘=1 0 Using the generalized Mercer’s theorem (J Mercer, 1909), the results for the covariance function can be briefly shown using the lining-up method. Assume that the covariance function is continuous after the lining-up processes, so for (𝑙 1 − 1) ≤ 𝑠 < 𝑙1 and (𝑙2 − 1) ≤ 𝑠 < 𝑙2 ; 𝑙1 , 𝑙2 = 1, · · · , 𝑑, the covariance function between 𝑔𝑙1 (𝑠) and 𝑔𝑙2 (𝑠′) can be expressed as ∞ ∑︁ ′ 𝐶𝑙,𝑙 ′ (𝑠, 𝑠 ) = cov{𝑔𝑙 (𝑠), 𝑔𝑙 ′ (𝑠)} = 𝜆 𝑘 𝜙 𝑘,𝑙 (𝑠 − (𝑙 − 1))𝜙 𝑘,𝑙 ′ (𝑠 − (𝑙 ′ − 1)) 𝑘=1 ∞ ∑︁ = 𝜆 𝑘 𝜙 𝑘,𝑙 (𝑠)𝜙 𝑘,𝑙 ′ (𝑠′) (3.17) 𝑘=1 Therefore, using the above argument, we can define the multivariate spectral decomposition ∞ ∑︁ C(𝑠, 𝑠′) = 𝜆 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠′) T (3.18) 𝑘=1 with the orthogonality condition 3.16. Since the lining-up data are univariate, we can adopt the existing techniques of estimating functional eigen-values and eigen-function in the literature (Yao et al., 2003, 2005; Müller and Yao, 2010; Li and Hsing, 2010) to estimate 𝜆 and 𝜙 𝑒 (𝑠), and hence can estimate 𝝓(𝑠) by stacking all components for aliened eigen-functions 𝜙 𝑒 (𝑠). 
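For concreteness, the two intermediate steps can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the exact implementation: the object names (U_hat for the matrix of Step-I residual curves, X for a scalar covariate, g_mat for the lined-up mean-zero curves) and the placeholder values are purely illustrative.

```r
# Placeholders standing in for Step-I output (illustrative only).
n <- 100; r <- 200; d <- 4
s_grid <- (seq_len(r) - 0.5) / r
X      <- rnorm(n, mean = 1, sd = 1)                   # scalar covariate
U_hat  <- matrix(rnorm(n * r), n, r)                   # Step-I residual curves U_i(s_j), n x r

# Step-II.A: variance function and instrument variables.
library(mgcv)
R_int      <- rowMeans(U_hat^2) * diff(range(s_grid))  # integrated squared residuals R_i
fit        <- gam(log(R_int) ~ s(X))                   # non-parametric regression of log R_i on X
sigma2_hat <- exp(predict(fit))                        # sigma^2-hat(X_i) = exp{mu-hat(X_i)}
M_X        <- cbind(X, X / sigma2_hat)                 # instruments M(X_i) = (X_i, X_i / sigma^2-hat_i)

# Step-II.B: line up the d components of the mean-zero function on [0, d] and
# run univariate FPCA on the lined-up curves (g_mat: n x (d * r), the d component
# curves concatenated subject by subject; placeholder values here).
library(fdapace)
g_mat   <- matrix(rnorm(n * d * r), n, d * r)
s_lined <- rep(0:(d - 1), each = r) + rep(s_grid, d)   # lined-up grid on [0, d]
fpca    <- FPCA(Ly = lapply(seq_len(n), function(i) g_mat[i, ]),
                Lt = rep(list(s_lined), n))
```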
79 3.3.3 Step-III: Final estimates Finally, we demonstrate our proposed estimator based on local-linear GMM where the proposed mean-zero function can be projected onto eigen-function and then combined by the weighted eigen-values. Then, for any positive 𝛼, the objective function is given by ∞ n o2 ∑︁ 𝜆 b𝑘 J{𝜸(𝑠0 )} = g(𝜸(𝑠0 )) Tb 𝝓 𝑘 (𝑠0 ) 𝑘=1 b2 + 𝛼 𝜆 𝑘 ∞ 𝑛 ∑︁ 𝑟 2 ∑︁ 𝜆 b𝑘  1 ∑︁   T  T   = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) 𝝓 𝑘 (𝑠0 ) Q𝑖 𝑗 (𝑠0 ) 𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠0 ) 𝜸(𝑠0 ) b 𝑘=1 b2 + 𝛼  𝑛𝑟 𝑖=1 𝑗=1 𝜆  𝑘   (3.19) By minimizing the above objective function, we obtain the following. ∞  1 ∑︁  𝑛 ∑︁ 𝑟  ∑︁ 𝜆 b𝑘  T   𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) 𝝓 𝑘 (𝑠0 ) Q𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) b b2 𝑘=1 𝜆 𝑘 + 𝛼   𝑛𝑟 𝑖=1 𝑗=1   1 𝑛 𝑟   ∑︁ ∑︁    𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 ) 𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠0 ) T 𝜸(𝑠0 )  × 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) b  𝑛𝑟   𝑖=1 𝑗=1  ∞ ∑︁ 𝜆 b𝑘 X𝑘 (𝑠0 ) y𝑘 (𝑠0 ) − X𝑘 (𝑠0 ) T 𝜸(𝑠0 )  := (3.20) b2 𝑘=1 𝜆 + 𝛼 𝑘 1 1 𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) and y𝑘 (𝑠0 ) = Í𝑛 Í𝑟 Í𝑛 Í𝑟 where X𝑘 (𝑠0 ) = 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 −𝑠0 ) b 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )𝑌𝑖 𝑗 . Therefore, the final estimate of the coherent function is b 𝑠0 ) b 𝜷(𝑠0 ) = [(1, 0) ⊗ 𝜸 (𝑠0 ) where I 𝑝 ]b ( ∞ ) −1 ( ∞ ) ∑︁ 𝜆 b𝑘 ∑︁ 𝜆 b𝑘 𝜸 (𝑠0 ) = b X𝑘 (𝑠0 )X𝑘 (𝑠0 ) T X𝑘 (𝑠0 )y𝑘 (𝑠0 ) (3.21) b2 𝑘=1 𝜆 𝑘 +𝛼 𝑘=1 b2 + 𝛼 𝜆 𝑘 The algorithm 3.1 summarizes the proposed method. For demonstration purposes, we choose the tuning parameters using cross-validation as discussed in the algorithm. In the proposed algorithm, 𝛼 controls the number of eigen-values, and can be chosen so that the condition (C8) defined in Section 3.4 is valid. Moreover, a continuity condition for lining-up is required for theoretical justification, by empirical studies, in the present of discontinuity in 𝜙 𝑒 , the end results are still adequate to use in practice. 80 Algorithm 3.1 Estimation of 𝜷(𝑠) : 𝑠 ∈ S for the proposed local-GMM based estimation procedure. Data: {𝑌𝑖 (𝑠 𝑗 ), 𝑋𝑖 , 𝑠 𝑗 }, for 𝑗 = 1, · · · , 𝑟; 𝑖 = 1, · · · , 𝑛 Result: Estimate 𝜷(𝑠) using proposed method Í Í𝑟 n −𝑖 o2 1: Calculate optimal ℎ: b ℎ𝑖𝑛𝑖𝑡 ← arg minℎ∈H 1 𝑛 𝑛𝑟 𝑖=1 𝑟=1 𝑌𝑖 𝑗 − XT 𝜷˘ (𝑠𝑟 ; ℎ) 𝑖 2: Calculate 𝜸(𝑠; ˘ b ℎ𝑖𝑛𝑖𝑡 ) 1 Í𝑟 3: g𝑖 {𝜸(𝑠)} = 𝑟 𝑗=1 𝐾bℎ𝑖𝑛𝑖𝑡 (𝑠 𝑗 − 𝑠)Q𝑖 𝑗 (𝑠; b ℎ𝑖𝑛𝑖𝑡 ){𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠; bℎ𝑖𝑛𝑖𝑡 ) T 𝜸(𝑠; ˘ b ℎ𝑖𝑛𝑖𝑡 )} 4: Determine the instrument variables 𝔐(X) 5: Compute eigen-components 𝜆 𝝓 𝑘 (𝑠) and get the value of 𝛼 using the condition (C8). b𝑘 , b n −𝑖 o2 1 Í𝑛 Í𝑟 T 6: Calculate optimal ℎ: ℎ 𝑜 𝑝𝑡 ← arg minℎ∈H 𝑛𝑟 𝑖=1 𝑟=1 𝑌𝑖 𝑗 − X𝑖 𝜷 (𝑠𝑟 ; ℎ) b b 7: Calculate b 𝜷(𝑠; ℎ 𝑜 𝑝𝑡 ) 3.4 Asymptotic results In this section, we provide some assumptions and then study the asymptotic properties of the local linear GMM estimator. Here, we allow the sample size 𝑛 and the number of functional domains 𝑟 to grow to infinity. Detailed technical proofs are provided in Section 3.8. Let 𝜷0 (𝑠0 ) be the true value of 𝜷(𝑠0 ) at the location 𝑠0 . For simplicity, define 𝛿𝑛1 (ℎ) = 1/2 1/2 ∫ (1 + (ℎ𝑟) −1 ) log 𝑛/𝑛 and 𝛿𝑛2 (ℎ) = (1 + (ℎ𝑟) −1 + (ℎ𝑟) −2 ) log 𝑛/𝑛   . 𝜈𝑎,𝑏 = 𝑡 𝑎 𝐾 𝑏 (𝑡)𝑑𝑡. Consider the following conditions that will be useful in asymptotic results. (C1) Kernel function 𝐾 (·) is a symmetric density function defined on the bounded support [−1, 1] and is Lipschitz continuous. (C2) Density function 𝑓𝑇 of 𝑇 is bounded above and away from infinity, and also below and away from zero. Moreover, 𝑓 is differentiable, and the derivative is continuous. (C3) E{∥ 𝑋 ∥ 𝑎 } < ∞ and E{sup𝑠∈S |𝑈 (𝑠)| 𝑎 } < ∞ for some positive 𝑎 > 1. Define E{𝔐(X)X} = 𝛀 with rank 𝑝. 
(C4) The true coefficient function 𝜷0 (𝑠) is three-times continuously differentiable and Σ(𝑠, 𝑠′) are twice continuously differentiable. 81 (C5) {𝑈 (𝑠) : 𝑠 ∈ [0, 1]} and {𝔐(X)𝑈 (𝑠) : 𝑠 ∈ [0, 1]} are Donsker class, where X ⊂ 𝔐(X) (C6) a) lim𝑠↘1 E{|𝑔𝑙 (𝑠 − 1) − 𝑔𝑙 (0)| 2 } = 0 for 𝑙 = 1, · · · , 𝑑 b) lim𝑠↗1 E{|𝑔𝑙−1 (𝑠) − 𝑔𝑙 (0)| 2 } = 0 for 𝑙 = 2, · · · , 𝑑 (C7) All second order partial derivatives of C(𝑠, 𝑠′) exist and are bounded on the support of the functional domain. Í  𝛼−1 𝜅0 −1 Í∞ (C8) For some 𝜅 0 ≥ 1 and =𝑜 𝑘=1 𝜆 𝑘 / 𝑘=𝜅 0 +1 𝜆 𝑘 (C9) The numbers of individuals and functional grid-points are growing to infinity such that ℎ → 0 and 𝑟 ℎ → ∞. For some positive number 𝑎 ∈ (2, 4), | log ℎ| 1−2/𝑎 /ℎ ≤ 𝑟 1−2/𝑎 . For 𝑎 > 2, (ℎ4 + ℎ3 /𝑟 + ℎ2 /𝑟 2 ) −1 (log 𝑛/𝑛) 1−2/𝑎 → 0 as 𝑛 → ∞. Remark 3.4.1. Conditions (C1) and (C2) are commonly used in the literature of non-parametric regression. The bound condition for the density function in (C2) of the functional points is standard for random design. Similar results can be obtained for fixed design where the grid-points are ∫𝑠 pre-fixed according to the design density 0 𝑗 𝑓 (𝑠)𝑑𝑡 = 𝑗/𝑟 for 𝑗 = 1, · · · , 𝑟, for 𝑟 ≥ 1. Condition (C3) is similar to that in Li and Hsing (2010) which requires the bound on the higher order moment of X. Moreover, the rank condition of 𝛀 is required for the identification of the functional coefficient and its first-order derivatives (Su et al., 2013). Condition (C4) is also common in functional data analysis literature (Wang et al., 2016). This condition allows us to perform the Taylor series expansion. Condition in (C5) avoids the smoothness condition of the sample path (Zhu et al., 2012, 2014) which is commonly assumed in Hall and Hosseini-Nasab (2006); Zhang and Chen (2007); Cardot et al. (2013). Conditions (C6) are required to check the continuity in the mean-zero function, which is equivalent to checking the mean square continuity of the process after lining-up (Hadinejad-Mahram et al., 2002). Here, the first condition shows the limits from right and remain always right; therefore, it involves only one process. A similar but opposite phenomenon occurs in the second condition. Moreover, if the vector process g(𝑠) is mean-square continuous, then both approaches are equivalent, as a result, the covariance function is continuous after lining-up the 82 process. To obtain the asymptotic expression of b 𝜷(𝑠), observed for a fixed sample size, there exists 𝜅0 such that 𝑘 ≤ 𝜅0 , 𝜆2𝑘 is much larger than 𝛼, thus the ratio 𝜆 𝑘 /(𝜆2𝑘 + 𝛼) ≈ 𝜆−1 𝑘 . On the other hand, if 𝑘 > 𝜅0 , 𝜆2𝑘 << 𝛼, as a result, the fraction 𝜆 𝑘 /(𝜆2𝑘 + 𝛼) can be approximately written as 𝜆 𝑘 /𝛼. Therefore, by the assumption that we make in (C8), we can write for 𝑠 ∈ S, 𝜅0 ∑︁ ∞ ∑︁ 𝜅0 ∑︁ 𝜆−1 ′ T 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 ) + 𝛼 −1 𝜆 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠′) T = 𝜆−1 ′ T 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 ) {1 + 𝑜(1)} (3.22) 𝑘=1 𝑘=𝜅 0 +1 𝑘=1 Condition (C9) provide the range of bandwidth. Under the fixed sampling design, this condition can be weakened; see Zhu et al. (2012). The following result provide the asymptotic properties of the initial estimates mentioned in Step-I. Theorem 3.4.1. Under conditions (C1), (C2), (C3), (C4), (C5), and (C9) n√   o ˘ 2 ¥ 𝑛 𝜷(𝑠0 ) − 𝜷0 (𝑠0 ) − 0.5ℎ 𝜈21 𝜷0 (𝑠0 ) × (1 + 𝑜 𝑎.𝑠. (1)) : 𝑠0 ∈ S weakly converges to a mean zero Gaussian process with a covariance matrix Σ(𝑠0 , 𝑠0 )𝛀−1 x where 𝛀x = E{XX }. T Next, we study the convergence rates of the estimated eigen-components based on the proposed lining-up method. 
The following lemma is the output of the asymptotic expansion of eigen- components of an estimated covariance function developed by Li and Hsing (2010). Lemma 3.4.2. Under assumptions (C1), (C2), (C3), (C6), (C7), (C8), and (C9) the following convergence holds almost surely for any finite-dimensional mean-zero function g(𝑠). 1. 𝜆 b𝑘 − 𝜆 𝑘 = 𝑂 (ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 (ℎ)) 2. sup𝑠0 ∈S b 𝝓 𝑘 (𝑠0 ) − 𝝓 𝑘 (𝑠0 ) = 𝑂 (ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ)) for all 𝑘 = 1, · · · , 𝜅0 . We skip the proof of the above lemma, as it is well developed in the literature of functional data analysis including Hall (2004); Hall and Hosseini-Nasab (2006); Li and Hsing (2010). Next, we show the asymptotic results of the proposed estimation. 83 Theorem 3.4.3. Suppose the conditions (C1), (C2), (C3), (C4), (C5), (C6), (C7), (C8), and (C9) hold, then for the proposed local linear GMM estimator b 𝜷(𝑠), have the following results hold. n√   o 𝑛 b𝜷(𝑠) − 𝜷0 (𝑠) − 0.5ℎ2 𝜈21 𝜷(𝑠))¥ (1 + 𝑜 𝑎.𝑠. (1)) : 𝑠 ∈ S weakly converges to a mean zero Gaussian process with a covariance matrix   −1   −1 A(𝑠0 , 𝑠0 ) = 𝛀C−1 (𝑠 , 𝜅 0 ,11 0 0 𝑠 )𝛀T 𝛀C −1 (𝑠 , 𝜅 0 ,11 0 0 𝑠 )𝚺(𝑠 , 0 0𝑠 )C −1 (𝑠 , 𝜅 0 ,11 0 0 𝑠 )𝛀 𝛀C (𝑠 , 𝜅 0 ,11 0 0 𝑠 ) −1 T 𝛀 . The proofs of Theorems 3.4.1 and 3.4.3 are provided in Section 3.8. 3.5 Simulation studies We conduct numerical studies to compare finite sample performance under different correlation structures and heterogeneity conditions. Data are generated from the following model. 𝑌𝑖 (𝑠) = 𝑋𝑖 𝜷(𝑠) + 𝑈𝑖 (𝑠) (3.23) where we generate trajectories observed at 𝑟 spatial locations for 𝑖-th curve, 𝑖 = 1, · · · , 𝑛. Assume that the functional fixed effect be 𝛽(𝑠) = cos(2𝜋𝑠) and corresponding fixed effect covariate is generated from the normal distribution with unit mean and variance. The error process is generated as 𝑈𝑖 (𝑠) = 𝜉1 (𝑋𝑖 )𝜓1 (𝑠) + 𝜉2 (𝑋𝑖 )𝜓2 (𝑠) (3.24) where 𝜉1 (𝑋𝑖 ) and 𝜉2 (𝑋𝑖 ) are independent central normal random variables with variance 3𝜎 2 (𝑋𝑖 )𝜃 02 and 1.5𝜎 2 (𝑋𝑖 )𝜃 02 where 𝜃 0 is determined by the relative importance of random error signal-to-noise ratio, denoted as SNR𝜃 which is interpreted as the ratio of standard deviation of the additive prediction without noise divided by the standard error of the random noise function. For example, SNR𝜃 = 2 means that the contribution of each functional random noise to the variability in 𝑌 (𝑠) is about double that of the fixed effect (Scheipl et al., 2015). Here, we use scaled orthonormal functions 𝜓1 (𝑠) ∝ (1.5 − sin(2𝜋𝑠) − cos(2𝜋𝑠)) and 𝜓2 (𝑠) ∝ sin(4𝜋𝑠); due to orthonormality, the proportionality constant can be easily determined. Contributions to the conditional variances in 𝜉 𝑘 (𝑋) are specified below. 84 S.0 𝜎 2 (𝑥) = 1 (homoskedastic) S.1 𝜎 2 (𝑥) = (1 + 𝑥 2 /2) 2 S.2 𝜎 2 (𝑥) = exp(1 + 𝑥 2 /2) S.3 𝜎 2 (𝑥) = exp(1 + |𝑥| + 𝑥 2 ) S.4 𝜎 2 (𝑥) = (1 + |𝑥|/2) 2 The following parameters are considered for each of the above scenarios. 1. Observational spatial points. We sample the trajectories at 𝑟 equidistant spatial points {𝑠1 , · · · , 𝑠𝑟 } on [0, 1]. Let 𝑠𝑖 = ( 𝑗 − 0.5)/𝑟 for 𝑗 = 1, · · · , 𝑟 for the 𝑖-th curve. The number of spatial points is assumed to be 200 for each case. 2. Sample size. Number of trajectories 𝑛 ∈ {50, 100, 200, 500}. 3. Signal to noise ratio. The controlling parameter 𝜃 0 is determined using signal-to-noise ratio, SNR𝜃 which is assumed to be either 0.5 or 1. Here, for each of the above situations, we perform 500 simulation replicates. 
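To make the data-generating mechanism concrete, the following R sketch simulates one replicate under scenario S.1. It is a minimal sketch: the constant theta0 is supplied directly here (in the study it is calibrated from SNR𝜃), and the orthonormalization of 𝜓1 and 𝜓2 is carried out empirically on the observation grid.

```r
# One simulated replicate under scenario S.1 (sketch).
simulate_svc <- function(n, r = 200, theta0 = 1) {
  s_grid <- (seq_len(r) - 0.5) / r                       # equispaced design points on [0, 1]
  beta   <- cos(2 * pi * s_grid)                         # true coefficient function
  psi1   <- 1.5 - sin(2 * pi * s_grid) - cos(2 * pi * s_grid)
  psi2   <- sin(4 * pi * s_grid)
  psi1   <- psi1 / sqrt(mean(psi1^2))                    # empirical normalisation so the
  psi2   <- psi2 / sqrt(mean(psi2^2))                    # basis functions have unit L2 norm
  X      <- rnorm(n, mean = 1, sd = 1)                   # scalar fixed-effect covariate
  sigma2 <- (1 + X^2 / 2)^2                              # scenario S.1 variance function
  xi1    <- rnorm(n, sd = sqrt(3.0 * sigma2) * theta0)   # heteroskedastic FPC scores
  xi2    <- rnorm(n, sd = sqrt(1.5 * sigma2) * theta0)
  U      <- xi1 %o% psi1 + xi2 %o% psi2                  # error process on the grid
  Y      <- X %o% beta + U                               # n x r matrix of response curves
  list(Y = Y, X = X, s = s_grid)
}
dat <- simulate_svc(n = 100)
```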
To make the implementation consistent with the theoretical results and numerical examples, we use the "FPCA" function in R, available in the fdapace package (Gajardo et al., 2021), or the MATLAB (MATLAB, 2014) package PACE available at http://www.stat.ucdavis.edu/PACE/ to estimate the eigen-functions. Bandwidths are selected using five-fold generalized cross-validation in all situations, and for estimation the Epanechnikov kernel 𝐾(𝑥) = 0.75(1 − 𝑥²)₊ is used, where (𝑎)₊ = max(𝑎, 0). Accuracy of the parameter estimation is assessed using the integrated mean squared error (IMSE) and the integrated mean absolute error (IMAE), which for the 𝑏-th replication are defined as
IMSE_𝑏 = Σ_{𝑗=1}^{𝑟} { 𝜷̂_𝑏(𝑠_𝑗) − 𝜷(𝑠_𝑗) }² Δ(𝑠_𝑗)   (3.25)
and
IMAE_𝑏 = Σ_{𝑗=1}^{𝑟} | 𝜷̂_𝑏(𝑠_𝑗) − 𝜷(𝑠_𝑗) | Δ(𝑠_𝑗)   (3.26)
with Δ(𝑠_𝑗) = 𝑠_𝑗 − 𝑠_{𝑗−1}, 𝑠_0 = 0, and 𝑠_1 < · · · < 𝑠_𝑟 the observed points over the support of the observational domain. We have noticed that the results can be improved by multiplying ℎ* by a constant in a certain range, where ℎ* is the optimal bandwidth obtained from cross-validation; we use the estimate 𝜷̂ corresponding to the bandwidth 0.75ℎ* in our numerical studies. We present Tables 3.1 and 3.2, where IMSEs and IMAEs are averaged over 500 replications for each situation. We denote by LLE, LLGMM and LLGMM-opt the local linear smoothing estimator described in Step-I, the local linear GMM estimator without the weight matrix, and the local linear GMM estimator with the weight matrix proposed in Step-III of Section 3.3, respectively. As expected, for all situations, IMSE and IMAE are substantially reduced as the sample size and/or SNR𝜃 increases. For the homoskedastic case, the error rates of LLE and LLGMM are similar, but in the presence of heteroskedasticity of unknown form, the proposed estimators clearly outperform LLE.
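For reference, the error measures in Equations (3.25) and (3.26) can be computed for a single replicate by the short R function below; beta_hat and beta_true are illustrative names for the estimated and true coefficient functions evaluated on the observed grid.

```r
# IMSE and IMAE for one replicate, following Equations (3.25)-(3.26).
imse_imae <- function(beta_hat, beta_true, s_grid) {
  delta <- diff(c(0, s_grid))                 # Delta(s_j) = s_j - s_{j-1}, with s_0 = 0
  c(IMSE = sum((beta_hat - beta_true)^2 * delta),
    IMAE = sum(abs(beta_hat - beta_true) * delta))
}
```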
Table 3.1 Performance of the estimation procedure with SNR𝜃 = 0.5 n = 50 n = 100 n = 200 n = 500 Case Method IMSE IMAE IMSE IMAE IMSE IMAE IMSE IMAE S.0 LLE 0.0372 0.1429 0.0189 0.1016 0.0099 0.0737 0.0041 0.0472 LLGMM 0.0375 0.1435 0.0191 0.1022 0.0099 0.0737 0.0041 0.0471 LLGMM-opt 0.0388 0.1460 0.0198 0.1038 0.0100 0.0740 0.0042 0.0474 S.1 LLE 0.0939 0.2271 0.0516 0.1679 0.0261 0.1189 0.0106 0.0766 LLGMM 0.0816 0.2123 0.0443 0.1556 0.0227 0.1109 0.0091 0.0708 LLGMM-opt 0.0585 0.1820 0.0292 0.1262 0.0135 0.0867 0.0050 0.0517 S.2 LLE 0.1381 0.2810 0.0812 0.2169 0.0468 0.1632 0.0209 0.1094 LLGMM 0.0804 0.2105 0.0372 0.1409 0.0164 0.0902 0.0048 0.0486 LLGMM-opt 0.0557 0.1462 0.0134 0.0817 0.0045 0.0471 0.0015 0.0262 S.3 LLE 0.1762 0.3330 0.1018 0.2538 0.0581 0.1913 0.0265 0.1291 LLGMM 0.0328 0.1069 0.0126 0.0619 0.0055 0.0376 0.0023 0.0243 LLGMM-opt 0.1067 0.0600 0.0021 0.0201 0.0004 0.0093 0.0003 0.0064 S.4 LLE 0.0588 0.1798 0.0309 0.1298 0.0158 0.0928 0.0065 0.0596 LLGMM 0.0577 0.1782 0.0303 0.1285 0.0155 0.0920 0.0063 0.0589 LLGMM-opt 0.0576 0.1792 0.0303 0.1287 0.0153 0.0914 0.0063 0.0585 86 Table 3.2 Performance of the estimation procedure with SNR𝜃 = 1 n = 50 n = 100 n = 200 n = 500 Case Method IMSE IMAE IMSE IMAE IMSE IMAE IMSE IMAE S.0 LLE 0.0097 0.0729 0.0048 0.0515 0.0025 0.0372 0.0010 0.0237 LLGMM 0.0099 0.0738 0.0051 0.0526 0.0025 0.0373 0.0010 0.0238 LLGMM-opt 0.0101 0.0741 0.0052 0.0532 0.0025 0.0374 0.0010 0.0238 S.1 LLE 0.0248 0.1169 0.0135 0.0860 0.0068 0.0608 0.0027 0.0387 LLGMM 0.0142 0.0887 0.0070 0.0617 0.0034 0.0430 0.0013 0.0269 LLGMM-opt 0.0126 0.0836 0.0062 0.0573 0.0029 0.0403 0.0012 0.0253 S.2 LLE 0.0363 0.1440 0.0215 0.1117 0.0124 0.0842 0.0055 0.0560 LLGMM 0.0103 0.0734 0.0046 0.0480 0.0019 0.0304 0.0006 0.0172 LLGMM-opt 0.0069 0.0589 0.0029 0.0376 0.0012 0.0239 0.0004 0.0142 S.3 LLE 0.0466 0.1705 0.0274 0.1314 0.0157 0.0994 0.0071 0.0669 LLGMM 0.0050 0.0398 0.0025 0.0273 0.0011 0.0168 0.0004 0.0101 LLGMM-opt 0.0006 0.0133 0.0003 0.0078 0.0002 0.0069 0.0001 0.0052 S.4 LLE 0.0155 0.0924 0.0080 0.0662 0.0041 0.0472 0.0016 0.0300 LLGMM 0.0141 0.0881 0.0073 0.0631 0.0036 0.0447 0.0015 0.0285 LLGMM-opt 0.0142 0.0883 0.0074 0.0634 0.0037 0.0449 0.0015 0.0285 3.6 Real data analysis For illustrating the application of our proposed method and the estimation procedure therein, we use Apnea-data to understand white matter structural alterations using diffusion tensor imaging (DTI) in obstructive sleep apnea (OSA) patients (Xiong et al., 2017). The details of this data-set have already been discussed in Chapter 2, Subsection 2.5.2. FA varies systematically along the trajectory of each white matter fascicle. Several pre- and post-processing steps were performed by the FSL software. The brain was extracted using brain segmentation tools. After generating FA maps using the FMRIB diffusion toolbox, images from all individuals were aligned to an FA standard template through non-linear co-registration. The Johns Hopkins University (JHU) white matter tractography atlas was used as a standard template for white matter parcellation with 50 regions of interest (ROIs). All imaging parameters were calculated by averaging the voxel values in each ROI. For each subject, we calculate the similarity matrix C with dimension 50 × 50. The (𝑘, 𝑙)-th 87 element of the matrix C is defined as 𝑐 𝑘𝑙 = |𝑦 𝑘 − 𝑦 𝑙 | where 𝑦 𝑘 is the measure of FA associated with 𝑘-th ROI. For simplicity, we scale the similarity matrix such that the range of elements of the matrix is [0, 1]. 
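The subject-level network summary used here can be sketched in R as follows; y is an illustrative placeholder for the length-50 vector of ROI-averaged FA values of one subject, and the thresholding and average-path-length steps are the ones described in the next paragraph.

```r
# Similarity matrix, thresholded networks, and average path length (sketch).
library(igraph)
y <- runif(50)                                   # placeholder for ROI-averaged FA values
C <- abs(outer(y, y, "-"))                       # c_kl = |y_k - y_l|
C <- (C - min(C)) / (max(C) - min(C))            # rescale the similarities to [0, 1]
thresholds <- seq(0.01, 0.99, by = 0.01)
apl <- sapply(thresholds, function(thr) {
  G <- (C >= thr) * 1; diag(G) <- 0              # adjacency matrix at this threshold
  g <- graph_from_adjacency_matrix(G, mode = "undirected")
  mean_distance(g)                               # average path length of the network
})
```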
To create the network, we threshold each similarity matrix to build an adjacency matrix G with elements {1, 0}, depending on whether the similarity values exceed the threshold or not. Since this threshold controls the topology of the data, we construct the adjacency matrix over a set of threshold parameters ranging from 0.01 to 0.99; this set is denoted by S and has cardinality 99. A popular measure of connectivity is the average path length (APL), defined as the average number of steps along the shortest path over all possible pairs of network nodes. It therefore measures the efficiency of information transfer on a network (Albert and Barabási, 2002). For the series of threshold parameters (𝑠), we observe the APL for FA as shown in Figure 3.1. Scientists are often interested in the association between the APL of the network generated from FA and a set of covariates of interest such as age and the lapse score. We fit the model 3.1 to the APLs collected over the continuous spatial domain (viz., the thresholds) from all individuals, where X𝑖 includes clinical variables such as the number of lapses and age. We discard subjects with missing clinical variables from the analysis, which leaves a sample size of 𝑛 = 27. Here we used three-fold cross-validation to obtain the tuning parameters, and the fraction of variance explained (FVE) is set to 0.99. In Figure 3.2, we present the estimated coefficient functions of age and number of lapses associated with APL, where it can be observed that the coefficient of the network property is negative for age but positive for the lapse counts. Moreover, one estimated effect on the APL is increasing when the threshold is small to moderate and decreasing at moderate to large thresholds, whereas the other is more-or-less stable up to the larger threshold values and decreases markedly thereafter. Here, small values of the threshold produce sparsely connected graphs in which true connections might be eliminated by the thresholding; on the other hand, for large threshold values, the generated graphs are densely connected.
Figure 3.2 Apnea-data analysis: Plots of estimated coefficient functions of age (top panel) and number of lapses (bottom panel) for average path length associated with Fractional Anisotropy (FA) in DTI analysis.
3.7 Discussion
In this chapter, we propose an efficient estimation procedure for the varying-coefficient model, which is commonly used in neuroimaging and econometrics. This procedure offers an efficient way to incorporate heteroskedasticity into the analysis of functional data; to the best of our knowledge, this is the first attempt to incorporate such a condition into the model. The model therefore accommodates a more complex relationship between the functional response and real-valued covariates. Additionally, our method is easy to implement in a wide range of applications owing to the multi-stage structure of the algorithm. The applicability of the proposed method is illustrated by simulation studies and a real data analysis. We leave the testing of hypotheses for linear constraints on 𝜷0(·) for future studies.
3.8 Technical details
In this section, we provide technical details for the theorems proposed in Section 3.4. We prove Theorems 3.4.1 and 3.4.3 via the following lemmas.
3.8.1 Some useful lemmas
Lemma 3.8.1.
Under the conditions (C5) 𝑛 𝑖=1 𝑈𝑖 (𝑠0 )𝔐(X𝑖 ) is tight. Proof. Consider the class of function C = {𝑈 (𝑠0 )𝔐(X𝑖 ) : 𝑠0 ∈ [0, 1]}. Therefore, due the assumption (C5), C is a P-Donsker class. Therefore, √1𝑛 𝑖=1 Í𝑛 𝑈𝑖 (𝑠0 )𝔐(X𝑖 ) is tight. Lemma 3.8.2. Under the assumptions (C1), (C2) and (C9), the following holds for any power 𝑐 ≥ 0. ∫ sup 𝐾 ℎ (𝑡 − 𝑠) {(𝑡 − 𝑠)/ℎ} 𝑐 𝑑Π(𝑡) − Π(𝑡) = 𝑂 (1/(𝑟 ℎ) −1/2 ) (3.27) 𝑠∈[0,1] The above bound can be replaced by 𝑂 (1/𝑟 ℎ) for fixed design case. Proof. This can be proved by using the empirical process techniques by observing that the class {𝐾 ((· − 𝑠/ℎ))((· − 𝑠/ℎ)) 𝑐 : 𝑠 ∈ [0, 1]} is a P-Donsker class (Zhu et al., 2012). For the balanced case, the results can be shown using Tayler’s series expansion. 90 1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 ) T . Under the conditions Í𝑛 Í𝑟 Lemma 3.8.3. Define I(𝑠0 ) = 𝑛𝑟 𝑖=1 𝑗=1 (C1), (C2), (C3) and (C9) I(𝑠0 ) = 𝑓 (𝑠0 )diag(1, 𝜈21 ) ⊗ 𝛀 + 𝑂 (ℎ + 𝛿𝑛1 (ℎ)) almost surely, where 𝛀 = E{𝔐(X)XT }. Proof. Observe the following. 𝑛 𝑟 1 ∑︁ ∑︁ I(𝑠0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 ) T 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁   T = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ X𝑖 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁ n 2 o = 𝐾 ℎ (𝑠 𝑗 − 𝑠) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ ⊗ 𝔐(X𝑖 )X𝑖T 𝑛𝑅 𝑖=1 𝑗=1 𝑛 1 ∑︁ ∑︁ 𝑟 © 𝔐(X𝑖 )X𝑖T 𝔐(X𝑖 )X𝑖T (𝑠 𝑗 − 𝑠0 )/ℎ ª = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) ­­ ® 𝑛𝑟 𝑖=1 𝑗=1 𝔐(X𝑖 )X𝑖T (𝑠 𝑗 − 𝑠0 )/ℎ 𝔐(X𝑖 )X𝑖T ((𝑠 𝑗 − 𝑠0 )/ℎ) 2 ® « ¬ ©I11 (𝑠0 ) I12 (𝑠0 ) ª := ­­ ® ® (3.28) I (𝑠 ) I22 (𝑠0 ) « 21 0 ¬ Let us define I𝑎,𝑏 = 𝑛𝑟1 𝑖=1 𝑎+𝑏 𝔐(X )XT . Assume that 𝜈 is finite and Í𝑛 Í𝑟 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )(𝑠 𝑗 − 𝑠0 ) 𝑖 𝑖 41 due to condition (C2),for general index 𝑐, we can derive the uniform bound of for all 𝑠0 ∈ S.  1 ∑︁   𝑛 ∑︁ 𝑟    𝑐 T E{I𝑎,𝑏 (𝑠0 )} = E 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )((𝑠 𝑗 − 𝑠0 )/ℎ) 𝔐(X𝑖 )X𝑖  𝑛𝑟   𝑖=1 𝑗=1   1 ∑︁   𝑟    = 𝛀E 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )((𝑠 𝑗 − 𝑠0 )/ℎ) 𝑐 𝑟   𝑗=1  ∫ =𝛀 𝐾 ℎ (𝑢 − 𝑠0 )((𝑢 − 𝑠0 )/ℎ) 𝑐 𝑓 (𝑢)𝑑𝑢 ∫ =𝛀 𝐾 (𝑢)𝑢 𝑐 𝑓 (𝑠0 + ℎ𝑢)𝑑𝑢 ∫ 𝐾 (𝑢)𝑢 𝑐 𝑓 (𝑠0 ) + ℎ𝑢 𝑓 ′ (𝑠0 ) + 0.5ℎ2 𝑢 2 𝑓 ′′ (𝑠0 ) + · · · 𝑑𝑢  =𝛀 91  𝑐 = 0, provided 𝜈21 < ∞, 𝑓 ′′ exists and finite  𝑓 (𝑠0 ) + 𝑂 (ℎ2 )        𝑐 = 1, provided 𝜈21 < ∞, 𝑓 ′ exists and finite  𝑂 (ℎ)    =𝛀 (3.29) 𝑓 (𝑠0 )𝜈21 + 𝑂 (ℎ2 ) 𝑓 ′′      𝑐 = 2, provided 𝜈41 < ∞, exists and finite    𝑐 = 3, provided 𝜈41 < ∞, 𝑓 ′ exists and finite  𝑂 (ℎ)    Moreover, under the condition (C3), we have E∥X∥ 𝑎 is finite for some 𝑎 > 2 and can define, 𝑏 𝑛 = ℎ2 + ℎ/𝑟 where ℎ → 0 such that 𝑏 −1 𝑛 (log 𝑛/𝑛) 1−2/𝑎 = 𝑜(1). Thus, 𝛿 (ℎ) = {𝑏 log 𝑛/𝑛ℎ2 } 1/2 . 𝑛1 𝑛 Now to establish the uniform bound for I(𝑠0 ), by using Lemma 2 in Li and Hsing (2010) for each of I𝑎,𝑏 (𝑠0 ) for 𝑎, 𝑏 = 1, 2, we have I(𝑠0 ) = 𝑓 (𝑠0 )(diag(1, 𝜈21 )) ⊗ 𝛀 + 𝑂 (ℎ + 𝛿𝑛1 (ℎ)) almost surely (3.30) 1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 )X𝑖T 𝜷0 (𝑠 𝑗 ). Thus, under the condi- Í𝑛 Í𝑟 Lemma 3.8.4. Define, J(𝑠0 ) = 𝑛𝑟 𝑖=1 𝑗=1 tions (C1), (C2), (C4), (C3) and (C9), J(𝑠0 ) − I(𝑠0 )𝜸 0 (𝑠0 ) = 𝑓 (𝑠0 )(𝜈21 , 0) ⊗ 𝛀 + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) almost surely, where 𝜸 0 (𝑠0 ) = ( 𝜷0 (𝑠0 ) T , ℎ 𝜷¤ 0 (𝑠0 ) T ) T . Moreover, T(𝑠0 ) = 𝑛𝑟1 𝑖=1 Í𝑛 Í𝑟 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 )𝑈𝑖 𝑗 = 𝑂 (𝛿𝑛1 (ℎ)) almost surely. Proof. 
Observe that, because of condition (C4), using Taylor’s series expansion, 𝑛 𝑟 1 ∑︁ ∑︁ J(𝑠0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 )X𝑖T 𝜷0 (𝑠0 ) 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 ) X𝑖T 𝜷0 (𝑠0 ) + (𝑠 𝑗 − 𝑠0 )X𝑖T 𝜷¤ 0 (𝑠0 )  = 𝑛𝑟 𝑖=1 𝑗=1 +0.5(𝑠 𝑗 − 𝑠0 ) 2 X𝑖T 𝜷¥ 0 (𝑠0 ) + 𝑜(ℎ2 ) = I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ2 I21 (𝑠0 ) + 𝑜(ℎ2 ) (3.31) Using similar arguments, due to Lemma 2 in (Li and Hsing, 2010), under the conditions (C1), (C2) 92 and (C3), with 𝜈41 being finite, we have 𝑛 1 ∑︁ ∑︁ 𝑟 © ((𝑠 𝑗 − 𝑠0 )/ℎ) 2 ª I21 (𝑠0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) ­­ ® 𝔐(X𝑖 )XT 𝑖 𝑛𝑟 𝑖=1 𝑗=1 ((𝑠 𝑗 − 𝑠0 )/ℎ) 3 ® « ¬ = 𝑓 (𝑠0 )(𝜈21 , 0) ⊗ 𝛀 + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) almost surely (3.32) and 1 Í𝑛 Í𝑟 © 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝔐(X𝑖 )𝑈𝑖 𝑗 ª T(𝑠0 ) = ­­ Í Í ® 1 𝑛 𝑟 ® 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )((𝑠 𝑗 − 𝑠0 )/ℎ)𝔐(X𝑖 )𝑈𝑖𝑟 « ¬ = 𝑂 (𝛿𝑛1 (ℎ)) almost surely (3.33) Lemma 3.8.5. Under conditions (C1),(C2), (C3), (C5), (C9), √ 𝑑 𝑛T(𝑠0 )(1 + 𝑜 𝑎.𝑠. (1)) → − 𝑁 (0, 𝑓 2 (𝑠0 )Σ(𝑠0 , 𝑠0 ) ⊗ 𝛀) (3.34) where T(𝑠0 ) is defined in Lemma 3.8.4. Proof. Note that 𝑛 𝑟 √ 1 ∑︁ ∑︁   𝑛T(𝑠0 ) = √ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 ) 𝑈𝑖 𝑗 (3.35) 𝑛𝑟 𝑖=1 𝑗=1 Therefore, the variance of the above quantity is √ Var{ 𝑛T(𝑠0 )} 𝑛 ∑︁ 𝑟 ∑︁ 𝑟 1    ∑︁ h i  2   = E 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝐾 ℎ (𝑠 𝑗 ′ − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 ′ − 𝑠0 ) T ⊗ 𝔐(X𝑖 ) ⊗ 𝑈𝑖 𝑗 𝑈𝑖 𝑗 ′ 𝑛  𝑖=1 𝑗=1 𝑗 ′=1    𝑛 𝑟 𝑟 1 ∑︁   1 ∑︁ ∑︁  h i  ⊗2   = E 2 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 ) ′ ′ Σ(𝑠 𝑗 , 𝑠 𝑗 ) ′ 𝑛 𝑖=1  𝑟 𝑗=1 𝑗 ′=1     1 ∑︁   𝑟 ∑︁𝑟    =E 2 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝐾 ℎ (𝑠 𝑗 ′ − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) T Σ(𝑠 𝑗 , 𝑠 𝑗 ′ ) ⊗ 𝛀 𝑟  𝑗=1 𝑗 ′=1   = [E{D1 (𝑠0 )} + E{D2 (𝑠0 )}] ⊗ 𝛀 (3.36) 93 1 2 1 Í𝑛 Í𝑟 𝐾 ℎ2 (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ Σ(𝑠 𝑗 , 𝑠 𝑗 ) and D2 (𝑠0 ) = Í𝑛 where D1 (𝑠0 ) = 𝑟2 𝑗=1 𝑟 (𝑟−1) 𝑗=1 𝑗 ′ =1 𝐾 ℎ (𝑠 𝑗 − 𝑗≠ 𝑗 ′ 𝑠0 )𝐾 ℎ (𝑠 𝑗 ′ − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 ′ − 𝑠0 ) T Σ(𝑠 𝑗 , 𝑠 𝑗 ′ ). Note that  1 ∑︁  𝑟   2   E{D2 (𝑠0 )} = E 2 𝐾 2 (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ Σ(𝑠 𝑗 , 𝑠 𝑗 ) 𝑟   𝑗=1  ∫ 1 2 = 𝐾 ℎ2 (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) ⊗ Σ(𝑡, 𝑡) 𝑓 (𝑡)𝑑𝑡 𝑟 ©1 𝑡 ª ∫ 1 = 𝐾 2 (𝑡) ­­ ® Σ(𝑠0 + ℎ𝑢) 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ℎ𝑟 𝑡 𝑡 2 ® « ¬ 1 = { 𝑓 (𝑠0 )diag(𝜈02 , 𝜈22 )Σ(𝑠0 , 𝑠0 ) + 𝑂 (ℎ)} (3.37) ℎ𝑟 ∫ Now assume that Θ(𝑠0 ) = E{D2 (𝑠0 )} with (𝑙, 𝑙 ′)-th entry 𝜃 𝑙,𝑙 ′ and P(𝑡) = S 𝐾 ℎ (𝑡 − 𝑠0 )𝐾 ℎ (𝑡 ′ − 𝑠0 )zℎ (𝑡 − 𝑠0 )zℎ (𝑡 ′ − 𝑠0 )Σ(𝑡, 𝑡 ′) 𝑓 (𝑡 ′)𝑑𝑡 ′ with (𝑙, 𝑙 ′)-the element P𝑙,𝑙 ′ . Therefore, using Hájek pro- jection (Vaart and Wellner, 1996), we have 𝑟 2 ∑︁  D1,𝑙.𝑙 ′ (𝑠0 ) = 𝜃 𝑙,𝑙 ′ (𝑠0 ) + P𝑙,𝑙 ′ (𝑠 𝑗 ) − 𝜃 𝑙,𝑙 ′ (𝑠0 ) + 𝜖˜𝑙,𝑙 ′ (𝑠0 ) (3.38) 𝑟 𝑗=1 2 Í𝑟  where 𝑟 𝑗=1 P𝑙,𝑙 ′ (𝑠 𝑗 ) − 𝜃 𝑙,𝑙 ′ (𝑠0 ) is the projection on D2, ′,𝑙 ′ (𝑠0 ) − 𝜃 𝑙,𝑙 ′ (𝑠0 ) onto the set of all statistics of the linear order form. Thus, it is easy to see Var{𝜖˜} = 𝑂 (1/(𝑟 ℎ) 2 ) (Zhu et al., 2012). Since the Taylor series expansion for small ℎ → 0, we have 𝜃 𝑙,𝑙 ′ (𝑠0 ) = 𝑓 (𝑠0 ) 2 𝜈𝑙−1, 𝜈𝑙 ′−1,1 Σ(𝑠0 , 𝑠0 ). √ Therefore, in summery, we have Var{ 𝑛T(𝑠0 )} = 𝑓 2 (𝑠0 )UΣ(𝑠0 , 𝑠0 ). where the element (𝑙, 𝑙 ′) of the matrix U is 𝜈𝑙−1 𝜈𝑙 ′−1 . √ To hold the above asymptotic results, we need to show that 𝑛T(𝑠0 ) be tight asymptotically. 
Therefore, consider the following, for suitable choice of 𝑙 < 𝑙 after change of variables, √ 𝑛T(𝑠0 ) 𝑛 𝑟 1 ∑︁ ∑︁ =√ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) [zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 )]𝑈𝑖 𝑗 𝑛𝑟 𝑖=1 𝑟=1 𝑛  𝑟 ∫ 1 1 ∑︁   1 ∑︁    =√ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 )𝑈𝑖 𝑗 − 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 )𝑈𝑖 (𝑡) 𝑓 (𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1  𝑟 0   𝑗=1  𝑛 ∫ 𝑙 1 ∑︁ +√ 𝑈𝑖 (𝑠0 ) 𝐾 (𝑡)(1, 𝑡) T 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1 𝑙 94 𝑛 ∫ 𝑙 1 ∑︁ +√ 𝐾 (𝑡)(1, 𝑡) T {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1 𝑙 := T1 (𝑠0 ) + T2 (𝑠0 ) + T3 (𝑠0 ) (3.39) Note that, T1 (𝑠0 ) 𝑟 ( 𝑛 𝑛 ) 1 ∑︁ 1 ∑︁ 1 ∑︁ = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) √ 𝑈𝑖 𝑗 ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑡) ⊗ 𝔐(X𝑖 ) 𝑟 𝑟=1 𝑛 𝑖=1 𝑛 𝑖=1 𝑟 ∫ 𝑙 ( 𝑛 )  1 ∑︁    1 ∑︁   + 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) − 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) 𝑓 (𝑡)𝑑𝑡 √ 𝑈𝑖 (𝑡) ⊗ 𝔐(X𝑖 ) 𝑟 𝑙  𝑛  𝑗=1  𝑖=1 ∫ 𝑙 𝑛 1 ∑︁ + 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) √ {𝑈𝑖 (𝑠0 ) − 𝑈𝑖 (𝑡)} 𝑓 (𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑙 𝑛 𝑖=1 := T11 (𝑠0 ) + T12 (𝑠0 ) + T13 (𝑠0 ) (3.40) Due to the Donsker Theorem, we have √1𝑛 𝑖=1 Í𝑛 𝔐(X𝑖 )𝑈𝑖 (𝑠) weekly converges to a centered Gaussian process and sup𝑠∈[0,1] | √1𝑛 𝑖=1 Í𝑛 𝔐(X𝑖 )𝑈𝑖 (𝑠)| = 𝑂 𝑝 (1) (Vaart and Wellner, 1996). There- fore, |T11 (𝑠0 )| 𝑟 𝑛 𝑛 1 ∑︁ 1 ∑︁ 1 ∑︁ ≤ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )∥zℎ (𝑠 𝑗 − 𝑠0 )∥ 2 √ 𝑈𝑖 𝑗 ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑠) ⊗ 𝔐(X𝑖 ) 𝑟 𝑗=1 𝑛 𝑖=1 𝑛 𝑖=1 𝑟 𝑛 𝑛 1 ∑︁ 1 ∑︁ 1 ∑︁ ≤ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )∥zℎ (𝑠 𝑗 − 𝑠0 )∥ 2 sup √ 𝑈𝑖 (𝑠) ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑠0 ) ⊗ 𝔐(X𝑖 ) 𝑟 𝑗=1 |𝑠−𝑠0 |≤ℎ 𝑛 𝑖=1 𝑛 𝑖=1 = 𝑜 𝑃 (1) (3.41) |T12 (𝑠0 )| 𝑟 ∫ ( 𝑛 ) 𝑙 1 ∑︁ 1 ∑︁ ≤ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) − 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) 𝑓 (𝑡)𝑑𝑡 sup √ 𝑈𝑖 (𝑡) ⊗ 𝔐(X𝑖 ) 𝑟 𝑗=1 𝑙 𝑡∈[0,1] 𝑛 𝑖=1 √ = 𝑂 𝑃 (1/ 𝑟 ℎ)𝑂 𝑃 (1) = 𝑜 𝑃 (1) (3.42) 95 The above bound holds for Lemma 3.8.2 and Condition (C9) so that 𝑚ℎ → ∞. |T13 (𝑠0 )| 𝑛 𝑛 ∫ 𝑙 1 ∑︁ 1 ∑︁ ≤ sup √ 𝑈𝑖 (𝑠) ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑠0 ) ⊗ 𝔐(X𝑖 ) 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )∥zℎ (𝑠 𝑗 − 𝑠0 )∥ 2 𝑓 (𝑠)𝑑𝑠 |𝑠−𝑠0 |≤ℎ 𝑛 𝑖=1 𝑛 𝑖=1 𝑙 = 𝑂 𝑃 (1) (3.43) By combining the above three bounds, due to conditions (C1),(C2), (C3), (C5), (C9), we obtain T1 (𝑠0 ) = 𝑜 𝑃 (1). Now, rewrite T3 (𝑠0 ) as 𝑛 ∫ 1 ∑︁ 𝑙 T3 (𝑠) = √ 𝐾 (𝑡)(1, 𝑡) T {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1 𝑙 ∫ 𝑙 = 𝐾 (𝑡)(1, 𝑡) T ⊗ {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} 𝔐(X𝑖 ) 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 (3.44) 𝑙 √1 Í𝑛 Since, 𝑛 𝑖=1 𝔐(X𝑖 )𝑈𝑖 (𝑠0 )is asymptotically tight, for any ℎ → 0, we have the following (Vaart and Wellner, 1996). 𝑛 1 ∑︁ sup √ 𝔐(X𝑖 ) {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} = 𝑜 𝑃 (1) (3.45) 𝑠0 ∈[0,1]:|𝑡|≤1 𝑛 𝑖=1 Now it is enough to show that T2 (𝑠0 ) is tight. First, observe that ∫ 𝑙 −1 (1, 0) 𝐾 (𝑡)diag(1, 𝜈21 )(1, 𝑡) T 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 𝑙 ∫ 𝑙 = 𝐾 (𝑡) 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 𝑙 ∫ 𝑙 = 𝐾 (𝑡) { 𝑓 (𝑠0 ) + ℎ𝑡 𝑓 ′ (𝑠0 ) + · · · } 𝑙 = 𝑓 (𝑠0 ) + 𝑜(ℎ) (3.46) √1 Í𝑛 Therefore, T2 (𝑠0 )(1 + 𝑜 𝑃 (ℎ)) = 𝑛 𝑖=1 𝑈𝑖 (𝑠0 ) ⊗ 𝔐(X𝑖 ). By assumption (C5), T2 (𝑠0 ) is tight. 3.8.2 Proof of Theorem 3.4.1 Under the initial estimates, by considering 𝔐(X) = X, 𝛀 can be replaced by 𝛀x in Equation (3.47) and inverse of 𝛀x exits. Therefore, by using Lemma 3.8.3, it is easy to observe that, almost surely I(𝑠0 ) −1 = 𝑓 (𝑠0 ) −1 (diag(1, 𝜈21 ) −1 ) ⊗ 𝛀−1 x + 𝑂 (ℎ + 𝛿 𝑛1 (ℎ)) (3.47) 96 Similarly, for the numerator, we have the following. 
𝑛 𝑟 1 ∑︁ ∑︁  𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ X𝑖 𝑌𝑖 𝑗 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ X𝑖 X𝑖T 𝜷0 (𝑠 𝑗 ) + 𝑈𝑖 𝑗   = 𝑛𝑟 𝑖=1 𝑗=1 = I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ2 I21 (𝑠0 ) 𝜷¥ 0 (𝑠0 ) + T(𝑠0 ) + 𝑜(ℎ2 ) (3.48) Thus, using Equation (3.47) and (3.48), we can derive, ˘ 0 ) = [(1, 0) ⊗ I 𝑝 ]I(𝑠0 ) −1 I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ2 I21 (𝑠0 ) 𝜷¥ 0 (𝑠0 ) + T(𝑠0 ) + 𝑜(ℎ2 )  𝜷(𝑠 = 𝜷0 (𝑠0 ) + [(1, 0) ⊗ I 𝑝 ] 𝑓 (𝑠0 ) −1 diag(1, 𝜈21 ) −1 ⊗ 𝛀−1 { 𝑓 (𝑠0 )(𝜈21 , 0) ⊗ 𝛀x } 0.5ℎ2 𝜷¥ 0 (𝑠0 )  x + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) = 𝜷0 (𝑠0 ) + 0.5ℎ2 𝜈21 𝜷¥ 0 (𝑠0 ) + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) = 𝜷0 (𝑠0 ) + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) almost surely (3.49) Therefore, sup𝑠0 ∈S 𝜷(𝑠 ˘ 0 ) − 𝜷0 (𝑠0 ) = 𝑂 (𝛿𝑛1 + ℎ) almost surely. Furthermore, observe that the bias of the initial estimator is ˘ 0 )} − 𝜷0 (𝑠0 ) = 0.5ℎ2 𝜈21 𝜷¥ 0 (𝑠0 ) {1 + 𝑂 𝑃 (𝛿𝑛1 (ℎ) + ℎ)} E{ 𝜷(𝑠 (3.50) Now, to calculate the variance, note that √ ˘ 0 ) − 𝜷(𝑠0 ) − 0.5ℎ2 𝜈21 𝛽¥0 (𝑠0 )}(1 + 𝑜 𝑎.𝑠. (1)) 𝑛{ 𝜷(𝑠 √ = [(1, 0) ⊗ I 𝑝 ] 𝑓 (𝑠0 ) diag(1, 𝜈21 ) −1 ⊗ 𝛀−1  x 𝑛T(𝑠0 ) (3.51) By Lemma 3.8.5, we have the variance of the above quantity Σ(𝑠0 , 𝑠0 )𝛀−1 x . 3.8.3 Proof of Theorem 3.4.3 Define C𝜅0 (𝑠, 𝑠′) = ′) T and hence, we can define C−1 ′ Í𝜅0 𝑘=1 𝜆 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 𝜅 0 (𝑠, 𝑠 ) with possible block matrix ∑︁𝜅0 ©C−1 ′ 𝜅 0 ,1,1 (𝑠, 𝑠 ) 0 C−1 ′ 𝜆−1 ′ T ª 𝜅 0 (𝑠, 𝑠 ) = 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 ) =­ (3.52) ­ ® −1 ′ ® 𝑘=1 0 C𝜅0 ,2,2 (𝑠, 𝑠 ) « ¬ 97 Also define, ( 𝜅0 ) −1 ( 𝜅0 ) ∑︁ 𝜆𝑘 ∑︁ 𝜆𝑘 𝜸 𝜅0 (𝑠0 ) = b 2 X𝑘 (𝑠0 ; 𝜅0 )X𝑘 (𝑠0 ; 𝜅0 ) T 2 X𝑘 (𝑠0 ; 𝜅 0 )y𝑘 (𝑠0 ; 𝜅0 ) (3.53) 𝑘=1 𝜆𝑘 + 𝛼 𝑘=1 𝜆𝑘 + 𝛼 where 𝑛 𝑟 1 ∑︁ ∑︁ X𝑘 (𝑠0 ; 𝜅0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) (3.54) 𝑛𝑟 𝑗=1 𝑗=1 and 𝑛 𝑟 1 ∑︁ ∑︁ y𝑘 (𝑠0 ; 𝜅 0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )𝑌𝑖 𝑗 (3.55) 𝑛𝑟 𝑗=1 𝑗=1 Therefore, we have the following. 𝜅0 ∑︁ 𝜆𝑘 2 X𝑘 (𝑠0 ; 𝜅0 )X𝑘 (𝑠0 ; 𝜅0 ) T 𝑘=1 𝜆𝑘 + 𝛼  1 ∑︁   𝑛 ∑︁ 𝑟    = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 ) T  𝑛𝑟   𝑖=1 𝑗=1  𝜅0 𝑛 ∑︁ 𝑟 T ∑︁ −1 T  1 ∑︁   T    × 𝜆 𝑘 𝝓 𝑘 (𝑠0 )𝝓 𝑘 (𝑠0 ) 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 )  𝑛𝑟  𝑘=1  𝑖=1 𝑗=1  = I(𝑠0 )C−1 𝜅0 (𝑠0 , 𝑠0 )I(𝑠0 ) T = 𝑓 2 (𝑠0 ) [diag(1, 𝜈21 ) ⊗ 𝛀] C−1 𝜅 0 (𝑠0 , 𝑠0 ) [diag(1, 𝜈21 ) ⊗ 𝛀] + 𝑂 (𝛿(ℎ)) T = V(𝑠0 ) + 𝑂 (𝛿(ℎ)) (3.56)   T 2 T where we define V(𝑠0 ) = 𝑓 2 (𝑠0 )diag 𝛀C−1 (𝑠 𝜅 0 ,1,1, 0 0, 𝑠 )𝛀 , 𝜈 21 𝛀C −1 (𝑠 𝜅 0 ,2,2, 0 0, 𝑠 )𝛀 and 𝜅0 ∑︁ 𝜆𝑘 X𝑘 (𝑠0 ; 𝜅 0 )y𝑘 (𝑠0 ; 𝜅0 ) 𝑘=1 𝑘 𝜆2 +𝛼  1 ∑︁   𝑛 ∑︁ 𝑟    T = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 )  𝑛𝑟   𝑖=1 𝑗=1  𝜅 0 𝑛 ∑︁ 𝑟 ∑︁  1 ∑︁      𝜆−1 𝑘 𝝓 (𝑠 )𝝓 𝑘 0 𝑘 0 (𝑠 ) T 𝐾 (𝑠 ℎ 𝑗 − 𝑠 0 )Q (𝑠 𝑖𝑗 0 𝑖𝑗 )𝑌  𝑛𝑟  𝑘=1  𝑖=1 𝑗=1  = I(𝑠0 )C−1 2 ¥ 2  𝜅 0 (𝑠0 , 𝑠0 ) I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ I21 (𝑠0 ) 𝜷(𝑠0 ) + T(𝑠0 ) + 𝑜(ℎ ) (3.57) 98 Therefore, 𝜷(𝑠0 ) − 𝜷0 (𝑠0 ) b = 0.5ℎ2 [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 I(𝑠0 )C−1 ¥ 𝜅 0 (𝑠0 , 𝑠0 )I21 (𝑠0 ) 𝜷(𝑠0 ) + [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 I(𝑠0 )C−1 𝜅 0 (𝑠0 , 𝑠0 )T(𝑠0 ) + 𝑂 (𝛿(ℎ)) T¥ = 0.5ℎ2 𝑓 2 (𝑠0 ) [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 [diag(1, 𝜈21 ) ⊗ 𝛀]C−1 𝜅 0 (𝑠0 , 𝑠0 ) [(𝜈21 , 0) ⊗ 𝛀] 𝜷(𝑠0 ) + 𝑓 2 (𝑠0 ) [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 [diag(1, 𝜈21 ) ⊗ 𝛀]C−1 𝜅 0 (𝑠0 , 𝑠0 )T(𝑠0 ) + 𝑂 (𝛿(ℎ)) ¥ 0 ) + T(𝑠0 ) + 𝑂 (𝛿(ℎ)) = 0.5ℎ2 𝜈21 𝜷(𝑠 (3.58) In order to obtain the asymptotic variance, consider, using Lemmas 3.8.3, 3.8.4 and 3.8.5, we have √ b ¥ 0 )} →𝑑 𝑛{ 𝜷(𝑠0 )−𝜷0 (𝑠0 )−0.5ℎ2 𝜈21 𝜷(𝑠 − 𝑁 (0, A(𝑠0 , 𝑠0 )) where A(𝑠0 , 𝑠0 ) is the asymptotic variance √ ˜ 0 , 𝑠0 )V −1 (𝑠0 ) [(1, 0) ⊗ I 𝑝 ] for of 𝑛T(𝑠0 ), where we derive, A(𝑠0 , 𝑠0 ) = [(1, 0) ⊗ I 𝑝 ]V −1 (𝑠0 ) A(𝑠 ˜ 0 , 𝑠0 ) = [diag(1, 𝜈21 ) ⊗ 𝛀]C−1 A(𝑠 2 −1 𝜅 0 (𝑠0 , 𝑠0 )diag(Σ(𝑠0 , 𝑠0 ), 𝜈11 Σ(𝑠0 , 𝑠0 ))C𝜅 0 (𝑠0 , 𝑠0 ) [diag(1, 𝜈21 ) ⊗ 𝛀] T . 
By simple calculation, it can be shown that A(𝑠0 , 𝑠0 ) = (𝛀C−1 T −1 −1 −1 −1 𝜅 0 ,11 (𝑠0 , 𝑠0 )𝛀 ) 𝛀C𝜅 0 ,11 (𝑠0 , 𝑠0 )𝚺(𝑠0 , 𝑠0 )C𝜅 0 ,11 (𝑠0 , 𝑠0 )𝛀(𝛀C𝜅 0 ,11 (𝑠0 , 𝑠0 )𝛀 ) T −1 (3.59) 99 CHAPTER 4 TENSOR BASED SPATIO-TEMPORAL MODELS FOR ANALYSIS OF FUNCTIONAL NEUROIMAGING DATA 4.1 Introduction Recent years have seen an explosive growth in the number of neuroimaging studies performed. Popular imaging modalities include functional magnetic resonance imaging (fMRI), electroen- cephalography (EEG), diffusion tensor imaging (DTI), positron emission tomography (PET), and single-photon emission computed tomography (SPECT). Each of these techniques has its own limitations and strengths. Therefore, a current trend is toward interdisciplinary approaches that use multiple imaging techniques to overcome limitations of each method in isolation. As an example, Figure 4.1 illustrates the combination of fMRI and EEG data. At the same time, neuroimaging data is increasingly being combined with non-imaging modalities, such as behavioral and genetic data. Prepossessed EEG data Electrode Time Prepossessed fMRI data Voxel Time Figure 4.1 Multi-modal-data: An example of multi-modal data analysis which seeks to explore the relationship between EEG and fMRI data. 100 Multi-modal analysis is an increasingly important topic of research, and to fully realize its promise, novel statistical techniques are needed. Here, we present a new approach towards performing such analysis. It is common for the data generated from neuroimaging studies to consist of time-varying signal measured over a large three-dimensional (3D) domain (Lindquist, 2008; Ombao et al., 2016). Hence, the data are inherently spatio-temporal in nature. Due to the massive size of the data along with its complex anatomical structure, classical vector-based spatio-temporal statistical methods are often deemed unrealistic and inadequate. It is becoming increasingly clear that any new model and methodology should address three fundamental concerns. First, standard spatio-temporal co- variance modelling techniques are based on many parametric assumptions, which are often hard to validate in large high-dimensional data such as fMRI. Second, modelling of spatio-temporal interactions often produces large covariance matrices containing millions of elements that are hard to estimate properly. Third, storage of these large data-sets while performing analysis is nearly impossible. The current research is motivated by the experiment studyforrest (http://studyforrest.org/) which investigates high-level cognition in the human brain using complex natural stimulation, namely watching the Hollywood movie Forrest Gump (1994). The data consist of several hours of fMRI scans, structural brain images, eye-tracking data, and extensive annotations of the movie. Details of this experiment are presented in Section 4.6. In our motivating example, we focus on data consisting of voxel-wise fMRI images, measured over a large number of spatial locations (voxels) at 451 time- points. The goal of our analysis is to use the multivariate eye-tracking data, measured while the participants watch the movie, as covariates in a model that explains changes in the multivariate brain data. The vast size and scale of these data call for well-equipped statistical techniques to find the association between brain regions and other covariates over time-varying activities. It is useful to consider this as a regression problem with a multi-dimensional array of outcomes and predictors. 
These multi-dimensional arrays are popularly known as tensors. Figure 4.2 illustrates the reason for considering a time-varying multi-dimensional array for analysis. Although the signals in both modalities (in this case fMRI and eye-tracking) are measured discretely over time, we consider 101 Figure 4.2 ForrestGump-data: (Top panel) BOLD fMRI for an example subject during their first run (see Section 4.6 for details). 35 axial slices (thickness 3.0 mm) represents the third mode of the tensor with 80 × 80 voxels (3.0 × 3.0 mm) in-plate resolution measured at every repetition time (TR) of 2 seconds. (Bottom panel) fMRI data-set consists of a time series of 3D images (tensors) at each TR (source: Wager and Lindquist (2015)). them to be discrete measures of a smooth underlying function over time in a certain interval. This assumption is reasonable in the context of both brain activity and eye movement, as they can potentially change at any moment. There are two main advantages to taking a tensor-based approach (Guo et al., 2012) towards modeling this data-set. First, we can represent the unknown parameters to be estimated as a linear combination of rank-1 components, where the latter are expressed as the outer product of low-dimensional vectors. This allows the estimation of fewer parameters, which is consistent with variable selection or dimension reduction problems in statistics. Second, because of the need to estimate fewer parameters, the computational complexity is significantly reduced. In an exploratory analysis of multi-dimensional data, principal component analysis (PCA) is one of the most common tools for reducing dimensionality. Its use in tensor data has been studied in various articles; for example, Liu et al. (2017) provided a generalized classical PCA that can 102 deal with data matrices and tensors and can explore the spatial and temporal dependencies of data simultaneously. (Allen et al., 2014) proposed generalization of singular value decomposition (SVD) to quantify two-way regularization of PCA. This generalization involves a class of penalty functions that can be used to regularize the matrix factors. In recent years, the methodology for modelling tensor data has developed considerably with interesting applications. Hoff et al. (2011) proposed a class of multi-dimensional normal distributions by applying multi-linear transformation to an array of independently and identically distributed (i.i.d.) 𝑁 (0, 1) items and hence studied the maximum likelihood estimator of separable complex covariance structures. Hoff (2011) discussed a model-based version of low rank decomposition. In a previous work, Zhou et al. (2013) formulated a regression framework that considers clinical outcomes as response and images as covariates. Their method efficiently explored the spatial dependence of images in the form of a multi-dimensional array structure. By extending the generalized linear regression to a multi-way parameter corresponding to the tensor-structured predictor, they proposed a penalized likelihood approach with adaptive lasso penalties, which are imposed on the individual margins of PARAFAC decomposition. Guhaniyogi et al. (2017) later proposed a Bayesian approach using a similar setup as Zhou et al. (2013), but with a novel multi-way shrinkage prior, which can identify important cells in the tensor predictor appropriately. 
However, a shortcoming of both these approaches is that they are unable to handle settings in which the responses are multi-dimensional images and the covariates are also multi-dimensional variables (e.g., clinical data or data from another imaging modality), as in the case of multi-modal analysis and our motivating example. This necessitates the use of a tensor-on-tensor type model. The recent work of Zhang et al. (2014) presents a tensor generalized estimating equation (GEE) for longitudinal data analysis using a low-rank CANDECOMP/PARAFAC (CP) decomposition of the coefficient array in the GEE; this decomposition accommodates the longitudinal correlation of the data. Hoff (2015) proposed a multi-linear regression model for longitudinal data using the least squares method. In practice, there may also be effects arising from the relationships between different pairs of modes. The general multi-linear regression model can address this via separable, Kronecker-structured regression parameters along with a separable covariance structure. A tensor-on-tensor regression approach was proposed by Lock (2018), followed by a general multiple tensor-on-tensor regression in Gahrooei et al. (2018). Furthermore, Guhaniyogi and Spencer (2018) discussed a tensor response regression in the Bayesian framework, where the coefficients corresponding to each vector covariate are assumed to be tensors. Recently, Liu et al. (2020) represented a generalized multi-linear tensor-on-tensor ridge regression model via a tensor-train representation. Melzer et al. (2019) proposed a joint tensor regression weighted at expectile levels; their estimation technique is based on low-rank factorization combined with regularization, using the smooth fast iterative shrinkage-thresholding algorithm (Beck and Teboulle, 2009). An adaptive tensor-based SVD estimation has also been discussed in light of the Riemannian trust-region method of Conn et al. (2000). A varying-coefficient model in the functional data analysis (FDA) literature allows the regression coefficient to vary over some predictors of interest (say, 𝑇). In some cases, these predictors are confounded with the covariates X or with special variables such as time. This kind of model was introduced and discussed by Hastie and Tibshirani (1993) and has since been widely studied. The non-constant relationship between a functional response and predictors has been described in Fan et al. (1999); Ramsay and Silverman (2005); Ferraty and Vieu (2006); Horváth and Kokoszka (2012); Bongiorno et al. (2014); Hsing and Eubank (2015), which are among the standard references in FDA. The current chapter provides the following contributions to this literature. First, we propose a method for modelling image data that can efficiently process large amounts of information and identify associations while preserving the structure of the 3D images and the multi-layer covariates. Second, we consider the time-varying function-on-function concurrent linear model (Hastie and Tibshirani, 1993) and generalize it to the tensor-on-tensor regression case, thus moving a step further than Lock (2018), which did not consider time-varying coefficients. Consequently, our generalization provides an extension of classical functional concurrent regression to tensor predictors and tensor covariates. To the best of our knowledge, such an approach has not yet been proposed in the statistics literature.
Here, we express the regression coefficients using the B-spline technique, and the coefficients of the basis functions are estimated using CP-decomposition, thereby reducing computational complexity. Furthermore, our model requires minimum assumptions compared to those in the existing literature. Our approach does not require the estimation of covariance separately. Thus, our proposal offers an important addition to the literature on functional and imaging data analysis. Our methods are flexible and general; therefore, they are applicable using data from different domains such as multi-phenotype analysis and imaging genetics (Casey et al., 2010). This makes it an ideal approach for modeling multi-modal data of the type described in our motivating example. The rest of this chapter is organized as follows. The proposed tensor-on-tensor functional regression models are described in Section 4.2. Section 4.3 provides the theoretical properties of the proposed estimator. Section 4.4 presents the algorithm and implementation of the method. The simulation results are presented in Section 4.5 and real data examples are shown in Section 4.6. Section 4.7 concludes with a discussion of future extensions. The technical proofs are presented in Section 4.8. 4.2 Tensor-on-tensor functional regression Recall the notations and definition in matrix algebra from Section 1.3 in Chapter 1. A 𝐷-dimensional tensor is denoted by Sans-serif upper-face letters A ∈ R 𝐼1 ×···×𝐼𝐷 where the size 𝐼 𝑑 along each mode Î𝐷 or dimension 𝑑 for 𝑑 = 1, · · · , 𝐷. Therefore, the number of elements in tensor A is 𝐼 = 𝑑=1 𝐼𝑑 and the order of the tensor is the number of dimensions. Here and henceforth, matrices are denoted by bold-face capital letters (examples: A, B · · · ), vectors are written as bold-face lower-case letters (examples: a, b, · · · ) and scalars are presented as Latin alphabets (𝑎, 𝑏, · · · ). In this section, we discuss tensor-on-tensor functional regression with time-varying coefficients. Let Y (𝑡) ∈ R𝑄 1 ×···×𝑄 𝑀 with (𝑞 1 , · · · , 𝑞 𝑀 )-th element 𝑦 𝑞1 ,··· ,𝑞 𝑀 for all possible indices, be a set of time-varying response variables observed at time 𝑡 and {Y (𝑡) : 𝑡 ∈ T} be the underlying continuous 105 stochastic process defined on a compact interval T. Without loss of generality, we assume T = [0, 𝑇] , 𝑇 > 0. Suppose there are 𝑁 individuals/trajectories on T. Observations are taken at 𝐽 distinct points for each individual. Collection of points for the 𝑖-th individual is denoted as T𝑖 † = {0 ≤ 𝑡𝑖1 < · · · < 𝑡𝑖𝐽 ≤ 𝑇 }. Therefore, for 𝑖-th individual at a set of discrete time-points T𝑖 † , we observe the responses Y𝑖 (𝑡𝑖 ) = ( Y𝑖 (𝑡𝑖1 ), · · · , Y𝑖 (𝑡𝑖𝐽 )) ∈ R𝐽×𝑄 1 ×···×𝑄 𝑀 which are distinct realizations of the corresponding stochastic process. The covariate X (𝑡) ∈ R𝑃1 ×···×𝑃 𝐿 with ( 𝑝 1 , · · · , 𝑝 𝐿 )-th element 𝑥 𝑝1 ,··· ,𝑝 𝐿 (𝑡) for all indices, observed at T𝑖 † is denoted as X𝑖 (𝑡𝑖 ) = ( X𝑖 (𝑡𝑖1 ), · · · , X𝑖 (𝑡𝑖𝐽 )) ∈ R𝐽×𝑃1 ×···×𝑃 𝐿 . The time-varying tensor coefficient 𝜷(𝑡) ∈ R𝑃1 ×···×𝑃 𝐿 ×𝑄 1 ×···×𝑄 𝑀 is assumed to vary smoothly over time. Therefore, we can apply local polynomial smoothing (Hardle, 1990; Wahba, 1990; Wand and Jones, 1995; Fan and Gijbels, 1996; Eubank, 1999), smoothing spline (Wahba, 1990; Green and Silverman, 1993; Eubank, 1999), regression spline (Eubank, 1999), P-spline Ruppert et al. (2003). In this chapter, we use B-spline bases which are very popular in mathematics, computer science and statistics (De Boor et al., 1978). 
Suppose, {𝜏ℎ } 𝐾ℎ=1 𝑁 be 𝐾 𝑁 interior knots within the compact interval [0, 𝑇] and the partition of the interval [0, 𝑇] at these knots be denoted  as K = 0 = 𝜏0 < 𝜏1 < · · · < 𝜏𝐾 𝑁 < 𝜏𝐾 𝑁 +1 = 𝑇 . The polynomial spline of order 𝑣 + 1 is a function   of polynomials with degree 𝑣 on the intervals [𝜏ℎ−1 , 𝜏ℎ ) for ℎ = 1, · · · , 𝐾 𝑁 and 𝜏𝐾 𝑁 , 𝜏𝐾 𝑁 +1 and 𝑣 − 1 continuous derivatives globally. Let S𝐾𝑣 𝑁 (𝑡) denotes a set of such spline functions, i.e., 𝑠(𝑡) belongs to S𝐾𝑣 𝑁 (𝑡) if and only if 𝑠(𝑡) belongs to 𝐶 𝑣−1 [0, 𝑇] and its restriction to each intervals [𝜏ℎ−1 , 𝜏ℎ ) is a polynomial of degree atleast 𝑣. Define for ℎ = 1, · · · , 𝐻 𝑁 Bℎ (𝑡) = (𝜏ℎ − 𝜏ℎ−𝑣−1 ) [𝜏ℎ−𝑣−1 , · · · , 𝜏ℎ ] (𝑧 − 𝑡)+𝑣 (4.1) where 𝐻 := 𝐻 𝑁 = 𝐾 𝑁 + 𝑣 + 1, [𝜏ℎ−𝑣−1 , · · · , 𝜏ℎ ] 𝑓 denotes the (𝑣 + 1)-st order divided difference of the function 𝑓 and 𝜏ℎ = 𝜏0 for ℎ = −𝑣, · · · , −1 and 𝜏ℎ = 𝜏𝐾 𝑁 +1 for ℎ = 𝐾 𝑁 + 2, · · · , 𝐻 𝑁 . Therefore, {Bℎ } 𝐻 ℎ=1 forms the basis for S𝐾 𝑁 (𝑡) (Schumaker, 2007). 𝑁 𝑣 Now, for 1 ≤ 𝑝 𝑙 ≤ 𝑃𝑙 , 1 ≤ 𝑞 𝑚 ≤ 𝑄 𝑚 , 1 ≤ 𝑙 ≤ 𝐿, 1 ≤ 𝑚 ≤ 𝑀, each function 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) can be approximated by 𝐻 ∑︁ 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 Bℎ (𝑡) = bT𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 B(𝑡) (4.2) ℎ=1 106 where b 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 = (𝑏 1,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 , · · · , 𝑏 𝐻,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 ) T is the collection of basis coefficients and B(𝑡) = (B1 (𝑡), · · · , B𝐻 (𝑡)) T is a vector of known B-spline bases. In practice, we can use different basis functions in mode to approximate 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡). However, for convenience, we use the same set of bases in this chapter. Instead of the B-spline, one can use other basis functions to approximate the coefficient functions. We use the B-spline base for its simplicity and numerical tractability. Although this method does not produce a desirable approximation for discontinuous functions, in this chapter we restrict ourselves to smooth continuous coefficients. We propose a general time-varying tensor-on-tensor regression model, Y𝑖 (𝑡) = ⟨X𝑖 (𝑡), 𝜷(𝑡)⟩ 𝐿 + E𝑖 (𝑡) (4.3) which can be reduced into the following mode-wise time-varying coefficient model. 𝑃1 ∑︁ ∑︁𝑃𝐿 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = ··· 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡) 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) + 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) (4.4) 𝑝 1 =1 𝑝 𝐿 =1 where 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) is a random error with mean zero. Errors can be correlated over time and modes, but are independent over the trajectories. After plugging-in the approximate expression of 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) at each mode, the model now boils down to 𝑃1 ∑︁ 𝑃 𝐿 ∑︁ ∑︁ 𝐻 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = ··· 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)Bℎ (𝑡) + 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) (4.5) 𝑝 1 =1 𝑝 𝐿 =1 ℎ=1  The multi-dimensional basis coefficients B0 = 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 : 1 ≤ ℎ ≤ 𝐻, 1 ≤ 𝑝 𝑙 ≤ 𝑃𝑙 , 1 ≤ 𝑞 𝑚 ≤ 𝑄 𝑚 , 1 ≤ 𝑙 ≤ 𝐿, 1 ≤ 𝑚 ≤ 𝑀 } can be estimated by minimizing the mode-wise penalized integrated sum of square errors with respect to B0 . 
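As an illustration of the basis expansion, the following R sketch evaluates a cubic B-spline basis at one subject's observation times and forms the products 𝑥_{𝑖,𝑝}(𝑡)B_ℎ(𝑡) appearing in Equation (4.5). The dimensions (𝐻 = 10 basis functions, 𝑃 = 4 covariates, 𝐽 = 50 time-points, 𝐿 = 1) are purely illustrative, and these products are what get collected into the augmented covariate tensor introduced below.

```r
# Evaluate B-spline basis values and form z_{i,h,p}(t) = x_{i,p}(t) * B_h(t)
# for one subject (sketch with illustrative dimensions).
library(splines)
t_obs <- seq(0, 1, length.out = 50)                        # observation times for one subject
B     <- bs(t_obs, df = 10, degree = 3, intercept = TRUE)  # J x H matrix of basis values B_h(t_j)
x     <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)        # J x P covariate trajectories x_{i,p}(t_j)
Z_i   <- array(NA, dim = c(50, ncol(B), 4))                # this subject's J x H x P slice
for (h in seq_len(ncol(B))) Z_i[, h, ] <- x * B[, h]       # multiply each basis function into x
```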
Let us denote the smoothness penalty by Ω𝑠𝑚 where Ω𝑠𝑚 ( B0 ) 𝑃1 𝑃 𝐿 ∑︁ 𝑄1 𝑄𝑀 ∫ 2 ∑︁ ∑︁ ∑︁ 𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝛽′′𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡)  = ··· ··· 𝑑𝑡 𝑝 1 =1 𝑝 𝐿 =1 𝑞 1 =1 𝑞 𝑀 =1 107 ∑︁𝑃1 𝑃 𝐿 ∑︁ ∑︁ 𝑄1 𝑄𝑀 ∑︁ ∫ = ··· ··· 𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 bT𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡b 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝑝 1 =1 𝑝 𝐿 =1 𝑞 1 =1 𝑞 𝑀 =1 (4.6) Hence, the loss function turns out to be L( B0 ) ∫ ∑︁ 𝑁 ∑︁ 𝑄1 𝑄𝑀  𝑃1 𝑃 𝐿 ∑︁ 𝐻 2 1 ∑︁ ∑︁ ∑︁ = ··· 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) − ··· 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)𝐵 ℎ (𝑡) 𝑑𝑡 𝑁 T 𝑖=1 𝑞 =1 𝑞 =1 𝑝 =1 𝑝 =1 ℎ=1 1 𝑀 1 𝐿 + Ω𝑠𝑚 ( B0 ) (4.7) In Equation (4.6), {𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 } 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 are the tuning parameters for smoothness. Penalty due to smoothness is widespread in the literature of functional data analysis (Ramsay and Silverman (2005) among many others). In practice, it is unrealistic to find these large num- bers of pre-assigned tuning parameters. By considering 𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 = 𝜃, for all possible 𝑝 1 , · · · , 𝑝 𝐿 , 𝑞 1 , · · · , 𝑞 𝑀 , the simplest version of smoothness penalty would be, ∫ T Ω𝑠𝑚 ( B0 ) = 𝜃 vec( B0 ) (I𝑄 ⊗ I𝑃 ⊗ B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡) vec( B0 ) (4.8) Note that, vec( B0 ) = (b11 , · · · , b𝑃1 , b12 , · · · , b𝑃2 , · · · , b1𝑄 , · · · , b𝑃𝑄 ) T . Therefore, the penalized likelihood estimating equation for functional tensor-on-tensor regression problem is ∫ 𝑁 1 ∑︁ L( B0 ) = ∥ Y𝑖 (𝑡) − ⟨Z𝑖 (𝑡), B0 ⟩ 𝐿+1 ∥ 2F 𝑑𝑡 + Ω𝑠𝑚 ( B0 ) (4.9) T 𝑁 𝑖=1 where ⟨·, ·⟩ 𝐿+1 is the contracted tensor product defined in Section 1.3.1 and ∥ · ∥ F is the Frobenius norm. The first term of the Equation (4.9) is integrated sum of squares and the second term is the smoothness penalty. Let the response tensor for time 𝑡, Y (𝑡) ∈ R𝑁×𝑄 1 ×···×𝑄 𝑀 with its (𝑖, 𝑞 1 , · · · , 𝑞 𝑀 )-th element be 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) for all 𝑖 = 1, · · · , 𝑁; 𝑞 𝑚 = 1, · · · , 𝑄 𝑚 ; 𝑚 = 1, · · · , 𝑀. Similarly, we define an updated covariate tensor contaminated with B-spline bases Z (𝑡) ∈ R𝑁×𝐻×𝑃1 ×···×𝑃 𝐿 where (𝑖, ℎ, 𝑝 1 , · · · , 𝑝 𝐿 )- th element of the tensor is defined as 𝑧𝑖,ℎ,𝑝1 ,··· ,𝑝 𝐿 (𝑡) = 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)Bℎ (𝑡). Therefore, the corre- sponding penalized loss function in Equation (4.9) is equivalent to ∫ L( B0 ) = ∥ Y (𝑡) − ⟨Z (𝑡), B0 ⟩ 𝐿+1 ∥ 2F 𝑑𝑡 + Ω𝑠𝑚 ( B0 ) (4.10) T 108 Remark 4.2.1. For 𝑄 = 0, the proposed model reduces to the classical concurrent linear model Ramsay and Silverman (2005). For 𝑄 = 1 and 𝑃 = 1, the time-varying network model (Xue et al., 2018) is a special case of our proposed model for a specific choice of covariates. For 𝑄 = 2, 𝑦𝑖,𝑞1 ,𝑞2 (𝑡) is the observation of the quantity of interest at time 𝑡 for sub-unit 𝑞 2 from unit 𝑞 1 of a treatment group 𝑖 in a hierarchical model (Zhou et al., 2010). Î𝐿 Î𝑀 Let 𝑃 = 𝑙=1 𝑃𝑙 be the total number of predictors for each observation and 𝑄 = 𝑚=1 𝑄 𝑚 be the total number of outcomes for each predictor over time. To minimize the penalized integrated sum of squared residuals described in Equation (4.10), the solution for B0 could be inconsistent. Since Î𝐿 Î𝑀 the unknown coefficient tensor B0 has 𝐻 𝑙=1 𝑃𝑙 𝑚=1 𝑄 𝑚 many parameters, we need to adopt the dimension reduction technique. Inspired by the novel idea discussed in Lock (2018), we consider rank 𝑅 decomposition of B0 as B0 = [[U0 , U1 , · · · , U 𝐿 , V1 , · · · , V 𝑀 ]] where U0 , U𝑙 and V𝑚 are matrices with dimensions 𝐻×𝑅, 𝑃𝑙 ×𝑅 and 𝑄 𝑚 ×𝑅 respectively for all 1 ≤ 𝑙 ≤ 𝐿, 1 ≤ 𝑚 ≤ 𝑀. After Í𝐿 Í𝑀 dimension reduction, the number of unknown parameters reduces to 𝑅(𝐻 + 𝑙=1 𝑃𝑙 + 𝑚=1 𝑄 𝑚 ). 
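The dimension reduction achieved by the rank-𝑅 CP representation can be illustrated with a small sketch (illustrative dimensions with 𝐿 = 𝑀 = 1): the full coefficient tensor is rebuilt from its factor matrices, while only 𝑅(𝐻 + 𝑃1 + 𝑄1) parameters are free.

```r
# Rebuild a rank-R coefficient tensor B0 from CP factor matrices (sketch).
H <- 10; P1 <- 4; Q1 <- 3; R <- 2
U0 <- matrix(rnorm(H * R), H, R)        # H  x R factor for the basis mode
U1 <- matrix(rnorm(P1 * R), P1, R)      # P1 x R factor for the covariate mode
V1 <- matrix(rnorm(Q1 * R), Q1, R)      # Q1 x R factor for the response mode
B0 <- array(0, dim = c(H, P1, Q1))
for (r in seq_len(R)) B0 <- B0 + outer(outer(U0[, r], U1[, r]), V1[, r])
# Free parameters: R * (H + P1 + Q1) = 34, versus H * P1 * Q1 = 120 unconstrained.
```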
Therefore, the estimate of the coefficient tensor is as follows:
$$\widetilde{\mathcal{B}}_0 = \underset{\mathrm{rank}(\mathcal{B}_0) \le R}{\arg\min}\ \mathcal{L}(\mathcal{B}_0) \qquad (4.11)$$
However, this estimated coefficient tensor suffers from over-fitting and instability due to multi-collinearity of $\mathcal{Z}$ and/or the large number of observed outcomes. Thus, we obtain an alternative estimate of the coefficient tensor $\mathcal{B}_0$,
$$\widehat{\mathcal{B}}_0 = \underset{\mathrm{rank}(\mathcal{B}_0) \le R}{\arg\min}\ \mathcal{Q}(\mathcal{B}_0) \qquad (4.12)$$
which is based on the modified loss function $\mathcal{Q}$ defined by
$$\mathcal{Q}(\mathcal{B}_0) = \frac{1}{N} \int_{\mathcal{T}} \| \mathcal{Y}(t) - \langle \mathcal{Z}(t), \mathcal{B}_0 \rangle_{L+1} \|_{\mathrm{F}}^2\, dt + \Omega(\mathcal{B}_0) \qquad (4.13)$$
where
$$\Omega(\mathcal{B}_0) = \theta\, \mathrm{vec}(\mathcal{B}_0)^{\mathrm{T}} \Big( \mathbf{I}_Q \otimes \mathbf{I}_P \otimes \int \mathbf{B}''(t) \mathbf{B}''(t)^{\mathrm{T}}\, dt \Big) \mathrm{vec}(\mathcal{B}_0) + \phi\, \mathrm{vec}(\mathcal{B}_0)^{\mathrm{T}} \mathrm{vec}(\mathcal{B}_0) \qquad (4.14)$$
Equation (4.14) penalizes the smoothness and the sparsity of the coefficient functions simultaneously.
Remark 4.2.2. Tuning parameter selection: The number of knots and the tuning parameters $\theta$ and $\phi$ are unknown, and we need to select them using Mallows's $C_p$ (Mallows, 1973), generalized cross-validation (Craven and Wahba, 1978), or the leave-one-out cross-validation method (Stone, 1974). Also, the selection of the rank is a separate problem altogether; a rank selection method can be proposed based on Chen et al. (2013).
4.3 Asymptotic properties
In this section, we study the identifiability of the model and the consistency of the parameter estimates under the proposed model as the number of subjects $N$ goes to infinity, assuming that the rank of the basis coefficient tensor is known and fixed.
4.3.1 Identifiability
Identifiability plays an important role in tensor regression (Lock, 2018; Zhou et al., 2013; Guhaniyogi et al., 2017). The model discussed in Section 4.2 is identifiable for $\boldsymbol{\beta}(t)$ if $\boldsymbol{\beta}(t) \neq \boldsymbol{\beta}^*(t)$ implies $\langle \mathcal{X}(t), \boldsymbol{\beta}(t) \rangle_L \neq \langle \mathcal{X}(t), \boldsymbol{\beta}^*(t) \rangle_L$ for some $t \in \mathcal{T}$ and some $\mathcal{X}(t) \in \mathbb{R}^{P_1 \times \cdots \times P_L}$. Using the basis expansion in Equation (4.2), $\mathcal{B}_0$ is identifiable if and only if $\boldsymbol{\beta}(t)$ is identifiable for all $t \in \mathcal{T}$. Therefore, the reduced model is identifiable if $\mathcal{B}_0 \neq \mathcal{B}_0^*$ implies $\langle \mathcal{Z}(t), \mathcal{B}_0 \rangle_{L+1} \neq \langle \mathcal{Z}(t), \mathcal{B}_0^* \rangle_{L+1}$ for some $t \in \mathcal{T}$ and some $\mathcal{Z}(t) \in \mathbb{R}^{H \times P_1 \times \cdots \times P_L}$. To see this, assume that for $t = t_0$ the entry $z_{h, p_{k_1}, \cdots, p_{k_L}}(t_0)$ equals $1$ at $k_1 = 1, \cdots, k_L = L$ and $0$ otherwise; then the contracted product reduces to $b_{h, p_1, \cdots, p_L, q_1, \cdots, q_M}$. Furthermore, $\mathbf{U}_0, \mathbf{U}_1, \cdots, \mathbf{U}_L, \mathbf{V}_1, \cdots, \mathbf{V}_M$ in the CP-decomposition are not identifiable on their own. Therefore, identifiability conditions can be imposed in the following way (Sidiropoulos and Bro, 2000).
1. Restriction for scale non-uniqueness: $\mathcal{B}_0$ remains the same after replacing $\mathbf{U}_0$, $\mathbf{U}_l$ and $\mathbf{V}_m$ by $c_s \mathbf{U}_0$, $c_{u_l} \mathbf{U}_l$ and $c_{v_m} \mathbf{V}_m$ respectively, where $\{c_s, c_{u_l}, c_{v_m}\}$ is a set of constants with $c_s \prod_{l=1}^{L} c_{u_l} \prod_{m=1}^{M} c_{v_m} = 1$. This problem can be resolved by imposing the condition that the norm of each of $\mathbf{u}_{rl}$ and $\mathbf{v}_{rm}$ is set to be $1$, for $1 \le r \le R$, $1 \le l \le L$, $1 \le m \le M$, so that the scales are absorbed into $\mathbf{U}_0$.
2. Restriction for permutation: For any permutation $\pi(\cdot)$ of $\{1, \cdots, R\}$, $\sum_{r=1}^{R} \mathbf{u}_{r0} \circ \mathbf{u}_{r1} \circ \cdots \circ \mathbf{u}_{rL} \circ \mathbf{v}_{r1} \circ \cdots \circ \mathbf{v}_{rM}$ is the same as $\sum_{r=1}^{R} \mathbf{u}_{\pi(r)0} \circ \mathbf{u}_{\pi(r)1} \circ \cdots \circ \mathbf{u}_{\pi(r)L} \circ \mathbf{v}_{\pi(r)1} \circ \cdots \circ \mathbf{v}_{\pi(r)M}$. Therefore, we impose the restriction $\|\mathbf{u}_{01}\| \ge \cdots \ge \|\mathbf{u}_{0R}\|$.
These restrictions are sufficient to ensure identifiability for $L + M \ge 2$. Therefore, we do not need the additional orthogonality condition that appears in Lock (2018); Zhou et al. (2013); Guhaniyogi et al. (2017).
4.3.2 Convergence rate
In this subsection, we study the asymptotic properties of the estimate of the time-varying tensor regression parameter $\boldsymbol{\beta}(t)$ based on the polynomial spline approximation and the CP decomposition.
To proceed further, we introduce some regularity conditions which are required to establish the asymptotic properties. (C1) Without loss of generality, assume T = [0, 1]. The observation times 𝑡𝑖 𝑗 for 𝑖 = 1, · · · , 𝑁; 𝑗 = 1, · · · , 𝐽 are independent and follow a distribution 𝑓𝑇 (𝑡) over the support T. the density function 𝑓𝑇 (𝑡) is assumed to be absolutely continuous and bounded by a nonzero and finite constant. (C2) {𝜏ℎ } 𝐾ℎ=1 𝑛 be 𝐾𝑛 interior knots within the compact interval K = [0, 1] and the partition of the  interval [0, 𝑇] with 𝐾 𝑁 knots can be denoted as I = 0 = 𝜏0 < 𝜏1 < · · · < 𝜏𝐾 𝑁 < 𝜏𝐾 𝑁 +1 = 1 . (C3) The polynomial spline of order 𝑣 + 1 are the function with degree 𝑣 of polynomials on   the interval [𝜏ℎ−1 , 𝜏ℎ ) for ℎ = 1, · · · , 𝐾 and 𝜏𝐾 𝑁 , 𝜏𝐾 𝑁 +1 and 𝑣 − 1 continuous derivatives globally. (C4) For 𝑡 ∈ T, 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡)’s are i.i.d. copies with mean zero and finite second order moment over 𝑖. Moreover, for each 𝑖 and the coordinates 𝑞 1 , · · · , 𝑞 𝑀 , 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡𝑖 𝑗 ) are locally stationary time series of the form given in appendix. Assume physical dependence measure Δ(𝑘, 𝑎) is upper bounded by 𝑘 −𝜅0 for some positive 𝜅0 and for all 𝑗 ≥ 1. 111 (C5) The covariate 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)’s are i.i.d. for index 𝑖 and it is bounded almost everywhere. Remark 4.3.1. Conditions (C1), (C2), (C3) are standard conditions in the context of polynomial spline regression and require the consistency of spline estimation of varying-coefficient models. Condition (C3) provides the degree of smoothness on the time-varying coefficients. We assume Condition (C4) to represent a wide class of stationary, locally stationary, and non-linear processes. Similar conditions can be found in Ding and Zhou (2020); Ding et al. (2021). This is a natural assumption of temporal short-range dependent process where temporal correlation decays in poly- nomial order. This phenomenon can also be observed in well-known Ornstein–Uhlenbeck process and the linear process with the standard basis expansion 𝜖𝑖,• (𝑡) = ∞ Í 𝑘=1 𝑎 𝑖𝑘,• 𝜙 𝑘 (𝑡) where 𝑎 𝑖𝑘,• is an uncorrelated mean zero, finite variance random variables over (𝑖, 𝑘) and sup𝑡 𝜙 𝑘 (𝑡) ≤ 𝐶 𝑘 −𝑎 for some positive constants 𝐶 and 𝑎 . Since the number of modes is fixed, we reduce the objective function by following the notation Y ∈ R𝑁 𝐽×𝑄 and Z ∈ R𝑁 𝐽×𝐻 𝑁 ×𝑃 1 Q(B0 ) = ∥Y − ⟨Z, B0 ⟩ 2 ∥ 2F + ∥ B0 ∥ 2F,W 𝜔 (4.15) 𝑁𝐽 √︁ where ∥ B0 ∥ F,W 𝜔 be the weighted Frobenius norm, defined as ∥ B0 ∥ F,W 𝜔 = vec( B0 ) T W𝜔 vec(B0 ) where 𝜔 is a set of all tuning parameters. Moreover, assume that rank( B0 ) = 𝑅0 which is assumed to be known and fixed. Further assume that   (C6) 𝜆min ZT(1) Z (1) = 𝜎min ( Z (1) ) 2 ≥ 𝜆 min (BT B)𝜆 min (XT X) > 𝜆 where 𝜆𝑖 (A) and 𝜎𝑖 (A) denotes 𝑖-th eigen-value and singular value respective for a matrix A. Define, ∫ the constants C(𝛿) = 1 + 2/𝛿 such that C(𝛿) ≤ 𝜆2 /2𝜇 where 𝜇 = (𝑁 𝐽)(𝜃𝜆max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡) + √ 𝜙) 2𝑅0 . Further, define 𝜉 = sup1≤ℎ≤𝐻 sup𝑡∈[0,1] |Bℎ (𝑡)| which is typically bounded. Additionally, define 𝜎1 ( C) = max{𝜎1 ( C (1) ), 𝜎1 ( C (2) ), 𝜎1 ( C (3) )}. Therefore, we propose the following theorem for the estimation and prediction performance of the coefficient tensor. Theorem 4.3.1. Under assumptions (C4) and (C6), when both the number of time-points and trajectories are large enough, there exists a constant 𝐶𝑎 , with probability atleast 1 − 𝐶𝑎 𝑁 −𝑎𝜏 , such 112 that we have the following. 
D E   −1  ∥ Z, (b B0 − B0 ) ∥ 2F ≤ 𝜆−1 C(𝛿) −1 − 2𝜇𝜆−2 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.16) for any 𝐻 𝑁 × 𝑃 × 𝑄 matrix C with rank( C) ≤ 𝑅0 , By choosing C = B0 , a simplified prediction error could be obtained. Under the same set of assumptions, the estimation error of the matrix B0 is   −1  ∥Bb0 − B0 ∥ 2 ≤ 𝜆−1 C(𝛿) −1 − 2𝜇𝜆−2 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.17) F Additionally, we introduce the following theorem which states the consistency result for the coefficient tensor function. Theorem 4.3.2. Under assumptions (C1)-(C6), with probability, we have the following with prob- ability 1 − 𝐶𝑎 𝑁 −𝑎𝜏 , √ o ∫    −1 n | 𝛽b• (𝑡) − 𝛽• (𝑡)| 𝑓𝑇 (𝑡)𝑑𝑡 = 𝑂 𝜆−1 C(𝛿) −1 − 2𝜇𝜆−2 2 4𝜇𝜎12 ( C (1) ) + 2𝑅0 (1 + 𝛿)𝑄𝜉 𝑁 𝜏+1 𝐽 T o +𝐾 𝑁−2(𝑣+1) 4.4 Algorithm and implementation In this section, we propose a general algorithm to estimate the basis coefficient tensor using the objective function described in Section 4.2. For given time-points 𝑡 1 , · · · , 𝑡 𝐽 , define Z and Y as the combined tensor after staking over all time-points. Therefore, Z and Y are the tensors of order 𝑁 𝐽 × 𝐻 × 𝑃1 × · · · 𝑃 𝐿 × 𝑄 1 × · · · 𝑄 𝑀 and 𝑁 𝐽 × 𝑄 1 × · · · 𝑄 𝑀 respectively. Moreover, define B̆0 as the matrix of coefficient of order 𝐻𝑃 × 𝑄, where columns and rows of B0 are obtained by vectorizing first (𝐿 + 1) and last 𝑀 modes of B0 respectively. For the alternative expression of the penalty term in Equation (4.13), observe the following. T 1. ∥ B0 ∥ 2 = ∥ B˘0 ∥ 2 = vec( B0 ) T vec( B0 ) = trace( B˘0 B˘0 ), where trace(A) denotes the trace of a square matrix A.   ∫  1/2  ′′ ′′ 2. I𝑄 ⊗ I𝑃 ⊗ 𝜃 B (𝑡)B (𝑡) 𝑑𝑡 + 𝜙I𝐻 T vec( B0 ) 113   ∫  1/2  ′′ = vec (I𝑃 ⊗ 𝜃 B (𝑡)B (𝑡) 𝑑𝑡 + 𝜙I𝐻 ′′ T ˘ ) B0 I𝑄 Therefore, equivalently, the optimization problem reduces to an unregulated least squares problem with modified predictor and outcome variables for estimating B0 . ∫ 1 e B0 ⟩ 𝐿+1 ∥ 2 𝑑𝑡 B0 = arg min b ∥Y e − ⟨Z, (4.18) rank( B0 )≤𝑅 𝑁 𝐽 T where Z e ∈ R (𝑁 𝐽+𝐻𝑃)×𝐻×𝑃1 ×···×𝑃 𝐿 ×𝑄 1 ×···𝑄 𝑀 be the contamination of Z (𝑡) along with smoothing term and sparsity. Y e ∈ R (𝑁 𝐽+𝐻𝑃)×𝑄 1 ×···×𝑄 𝑀 which is a contamination of Y (𝑡) and zero tensor function. The unfolding of Z e and Y e along the first dimension produce the following matrices:      Z(1)   Y(1)  Z = and Y = (4.19) e(1)   e(1)    ∫  1/2    (I ⊗ 𝜃 B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡 + 𝜙I )  0   𝑃 𝐻  𝐻𝑃×𝑄      Thus, applying the following Algorithm 4.1, we obtain an estimate of coefficient tensor for the known rank of the coefficient array and hence the coefficient function 𝜷(𝑡). The above algorithm is similar to function “rrr” available in the package MultiwayRegression in R. The selection of the adjustment parameters 𝜃, 𝜙 and the rank 𝑅 of the coefficient tensor is crucial. It can be done using integrated predictive accuracy in a training and test set. K-fold cross- validation can be used to obtain these tuning parameters; however, it is computationally expensive. Fortunately, our estimate is robust for the selection of 𝜃 and 𝜙. The rank of CP decomposition of the coefficient tensor is the number of rank-1 terms that are necessary to represent the coefficient tensor. For large 𝑅, every B0 can be represented by the CP decomposition. Therefore, it determines the complexity of the model. We leave the optimal determination of the rank for future research. 4.5 Simulation studies In this section, we conduct numerical studies to compare the finite sample performance to estimate four-way time-varying tensor coefficient 𝜷(𝑡). 
Data are generated from the following model for each mode 𝑝 1 , 𝑝 2 , 𝑞 1 , 𝑞 2 𝑃1 ∑︁ ∑︁ 𝑃2 𝑦𝑖,𝑞1 ,𝑞2 (𝑡) = 𝑥𝑖,𝑝1 ,𝑝2 (𝑡) 𝛽 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) + 𝜖𝑖,𝑞1 ,𝑞2 (𝑡), 𝑖 = 1, · · · , 𝑁; 𝑡 ∈ [0, 1] (4.20) 𝑝 1 =1 𝑝 2 =1 114 Algorithm 4.1 Estimation of 𝜷(𝑡) : 𝑡 ∈ [0, 𝑇] for tensor based function-on-function regression method. Data: X (𝑡), Y (𝑡) for 𝑡 ∈ [0, 𝑇] , 𝑇 > 0 observed on a grid in [0, 𝑇] Result: Estimate 𝜷(𝑠) using proposed method Tuning parameters: {𝜃, 𝜙}, rank 𝑅 ∈ N, number of knots 𝐾 𝑁 , a vector of known B-spline bases B(𝑡) = (B1 (𝑡), · · · , B𝐻 (𝑡)) T Stopping parameter: 𝜖0 > 0 Create: Z and Y as mentioned in Equation (4.19) Initialize: U0 , U1 , · · · , U 𝐿 , V1 , · · · , V 𝑀 be randomly chosen matrices of specific order 1: while Error > 𝜖 0 do 2: for 𝑙 ← 1 to #{𝐻, 𝑃1 , · · · , 𝑃 𝐿 } do 3: Set 𝑑 (𝑙) be the 𝑙-th entry of {𝐻, 𝑃1 , · · · , 𝑃 𝐿 } 4: for 𝑟 = 1, · · · , 𝑅 do 5: C𝑟 ← ⟨Z, e u𝑟0 ◦ · · · ◦ u𝑟,𝑘−1 ◦ u𝑟,𝑘+1 ◦ · · · ◦ u𝑟 𝐿 ◦ v𝑟1 ◦ · · · ◦ v𝑟 𝑀 ⟩ 𝐿 which is a tensor of dimension (𝑁 𝐽 + 𝐻𝑃) × 𝑑 (𝑙) × 𝑄 1 × · · · × 𝑄 𝑀 6: Unfolding C𝑟 along with dimension corresponding to 𝑑 (𝑙) 7: Obtain a (𝑁 𝐽 + 𝐻𝑃)𝑄 × 𝑑 (𝑙) dimension matrix C𝑟 end (𝑙) 8: C ← [C1 , · · · , C 𝑅 ] ∈ R (𝑁 𝐽+𝐻𝑃)𝑄×𝑅𝑑 9: vec(U𝑙 ) ← (CT C) −1 CT vec(Y) e end 10: for 𝑚 ← 1 to #{𝑄 1 , · · · , 𝑄 𝑀 } do 11: Set 𝑑 (𝑚) be the 𝑚-th entry of {𝑄 1 , · · · , 𝑄 𝐿 } 12: Ye𝑑 (𝑚) is unfolded along the mode corresponding to 𝑑 (𝑚) and obtain a 𝑑 (𝑚) × (𝑁 𝐽 + 𝐻𝑃) 𝑚≠𝑘 𝑄 𝑚 Î 13: for 𝑟 = 1, · · · , 𝑅 do 14: 𝐷 𝑟 ← vec(⟨Z, e u𝑟0 ◦ u𝑟1 ◦ · · · ◦ u𝑟 𝐿 ◦ v𝑟1 ◦ · · · v𝑟,𝑘−1 ◦ v𝑟,𝑘+1 ◦ · · · ◦ v𝑟 𝑀 ⟩ 𝐿+1 ) end Î 15: D ← [𝐷 1 , · · · , 𝐷 𝑅 ] ∈ R (𝑁 𝐽+𝐻𝑃) 𝑚≠𝑘 𝑄 𝑚 ×𝑅 16: V𝑚 ← Y e𝑑 (𝑚) D(DT D) −1 end 17: Compute B = [[U0 , U1 ,D· · · E, U 𝐿 , V1 , · · · , V 𝑀 ]] ∥Y−e Z,eBb ∥2 𝐿+1 F 18: Calculate: Error = b 2 ∥Y∥ F end 19: Compute 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = bT𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 B(𝑡) using Equation (4.2) for each node Regression functions are given by 𝛽 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) = 𝑝 1 cos (2𝜋𝑡) + 𝑞 1 sin (2𝜋𝑡) + 𝑝 2 sin (4𝜋𝑡) + 𝑞 2 cos (4𝜋𝑡) (4.21) 115 Here, changes in one unit of the index of each mode produce a change in one unit of the coefficient when the time is fixed. The covarites are generated in the following way, (1) (2) (3) 𝑥𝑖,𝑝1 ,𝑝2 (𝑡) = 𝜒𝑖,𝑝 1 ,𝑝 2 + 𝜒𝑖,𝑝 1 ,𝑝 2 sin (𝜋𝑡) + 𝜒𝑖,𝑝 1 ,𝑝 2 cos (𝜋𝑡) (4.22) and errors are generated as follows. (1) √ (2) √ 𝜖𝑖,𝑞1 ,𝑞2 (𝑡) = 𝜂𝑖,𝑞 ,𝑞 1 2 2 cos (𝜋𝑡) + 𝜂 𝑖,𝑞 ,𝑞 1 2 2 sin (𝜋𝑡) for all 𝑝 1 = 1, · · · , 𝑃1 , 𝑝 2 = 1, · · · , 𝑃2 , 𝑞 1 = 1, · · · , 𝑄 1 and 𝑞 2 = 1, · · · , 𝑄 2 . Moreover, we assume that 𝑥𝑖,𝑝1 ,𝑝2 (𝑡) are observed with measurement error, i.e., 𝑢𝑖,𝑝1 ,𝑝2 (𝑡) = 𝑥𝑖,𝑝1 ,𝑝2 + 𝛿 𝑝1 ,𝑝2 (𝑙) where 𝛿 𝑝1 ,𝑝2 ∼ 𝑁 (0, 0.62 ). Assume that the set of random variables { 𝜒𝑖,𝑝 1 ,𝑝 2 : 𝑙 = 1, 2, 3} and (𝑙) {𝜂𝑖,𝑞 1 ,𝑞 2 : 𝑙 = 1, 2} are mutually independent. The data generation process is influenced by Kim et al. (2018) although in an entirely different situation. We observe the data at 81 equidistant time-points in [0, 1] with 𝑡 𝑗 = ( 𝑗 − 0.5)/𝐽 for all 𝑗 = 1, · · · , 𝐽. We also fix 𝑃1 × 𝑃2 = 5 × 2 and 𝑄 1 × 𝑄 2 as either 5 × 2 or 15 × 12. Set, number of subjects, 𝑁 ∈ {30, 100}. We consider the following scenarios. (1) (2) (3) (Situation1) We choose 𝜒𝑖,𝑝 1 ,𝑝 2 ∼ 𝑁 (0, 12 ), 𝜒𝑖,𝑝 1 ,𝑝 2 ∼ 𝑁 (0, 0.852 ), 𝜒𝑖,𝑝 1 ,𝑝 2 ∼ 𝑁 (0, 0.72 ) and they (1) (2) are mutually independent. 𝜂𝑖,𝑞 1 ,𝑞 2 ∼ 𝑁 (0, 22 ), 𝜂𝑖,𝑞 1 ,𝑞 2 ∼ 𝑁 (0, 0.752 ) and they are mutually independent. Here, the covariates do not depend on the modes of the data structure. 
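Before turning to the spatially dependent settings in (Situation2) below, the following is a minimal R sketch of how one replication under (Situation1) could be generated from Equations (4.20)–(4.22); the sample sizes and random seed are illustrative, and the measurement-error contamination of the covariates is omitted for brevity.

# Minimal sketch (illustrative, not the exact simulation code): one data set under
# (Situation1), with independent Gaussian mode-wise coefficients.
set.seed(1)
N <- 30; J <- 81; P1 <- 5; P2 <- 2; Q1 <- 5; Q2 <- 2
tgrid <- (seq_len(J) - 0.5) / J
# true coefficient functions beta_{p1,p2,q1,q2}(t) from Equation (4.21)
beta_fun <- function(p1, p2, q1, q2, tt)
  p1 * cos(2 * pi * tt) + q1 * sin(2 * pi * tt) + p2 * sin(4 * pi * tt) + q2 * cos(4 * pi * tt)
X <- array(0, dim = c(N, P1, P2, J))   # covariate trajectories x_{i,p1,p2}(t_j)
Y <- array(0, dim = c(N, Q1, Q2, J))   # responses y_{i,q1,q2}(t_j)
for (i in 1:N) {
  chi1 <- matrix(rnorm(P1 * P2, sd = 1.00), P1, P2)
  chi2 <- matrix(rnorm(P1 * P2, sd = 0.85), P1, P2)
  chi3 <- matrix(rnorm(P1 * P2, sd = 0.70), P1, P2)
  eta1 <- matrix(rnorm(Q1 * Q2, sd = 2.00), Q1, Q2)
  eta2 <- matrix(rnorm(Q1 * Q2, sd = 0.75), Q1, Q2)
  for (j in 1:J) {
    tt <- tgrid[j]
    X[i, , , j] <- chi1 + chi2 * sin(pi * tt) + chi3 * cos(pi * tt)          # Equation (4.22)
    eps <- eta1 * sqrt(2) * cos(pi * tt) + eta2 * sqrt(2) * sin(pi * tt)     # error process
    for (q1 in 1:Q1) for (q2 in 1:Q2) {
      bt <- outer(1:P1, 1:P2, function(p1, p2) beta_fun(p1, p2, q1, q2, tt))
      Y[i, q1, q2, j] <- sum(X[i, , , j] * bt) + eps[q1, q2]                 # Equation (4.20)
    }
  }
}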
(Situation2) In addition, with the assumption of the coefficients of covariates, impose the spa- tial correlation structure to address the mode-wise dependencies. We consider the following two cases. (𝑙) a) 𝜒𝑖,𝑝 1 ,𝑝 2 at mode ( 𝑝 1 , 𝑝 2 ) is 𝜌 𝑠 (ED 𝑝1 ,𝑝2 ; 𝜃), where 𝜌 𝑠 is the exponential cor- relation function, ED 𝑝1 ,𝑝2 is defined as scaled Euclidean distance between two modes, having scaled by a constant 𝜃, therefore, 𝜃 defines an isotropic covariance function. In this simulation setup, 𝜃 is taken as 8. 116 (𝑙) b) 𝜒𝑖,𝑝 1 ,𝑝 2 at mode ( 𝑝 1 , 𝑝 2 ) is 𝜌 𝑀 (𝑑 𝑝1 ,𝑝2 ; 𝜅, 𝜈), where 𝑑 𝑞1 ,𝑞2 denotes the Euclidean distance between two different modes and 𝜌 𝑀 is the correlation function, belongs to Matérn family. The Matérn isotropic auto-correlation function has a specific form  √ 𝜈  √  21−𝜈 2𝑑 𝜈 2𝑑 𝜈 𝜌 𝑀 (𝑑; 𝜅, 𝜈) = 𝐾𝜈 , 𝜅, 𝜈 > 0 (4.23) Γ(𝜈) 𝜅 𝜅 Here, 𝐾𝜈 (·) is termed as Bessel function of order 𝜈. The positive range parameter 𝜅 controls the decay of the correlation between the observations at a large distance 𝑑. The order 𝜈 controls the behavior of the auto-correlation function for observations that are separated by a small distance. For our numerical example, we set the scale 𝜅 = 0.55 and the smoothness parameter 𝜈 = 1. The above mentioned situations have been implemented using “stationary.image.cov” and “matern.image.cov” functions respectively available in fields package in R (Dou- glas Nychka et al., 2017). We run the simulation 100 times for each scenario for the evaluation of our method. For each of the simulation setups, we take the number of knots as [𝐽/4], where [𝑎] denotes the integer part of 𝑎. We compare the overall performance of the models to estimate the parameter curves for different choices of ranks by studying several error rates based on different norms. We choose the smoothing parameters 𝜃 from the set {0, 0.001, 0.005, 0.01, 0.05, 0.1}, on the other hand, 𝜙 is chosen from a grid from {0, 0.5, 3, 10} and allow the different values from 1 to 5 for the choice of rank 𝑅. In the following tables, we denote FToTM𝑟 as proposed functional tensor-on-tensor model with rank 𝑟. To compare with the existing literature, we apply the concurrent linear model (Ramsay and Silverman, 2005) (CLM) for mode-wise analysis and implement this method using “pffr” function available in refund (Goldsmith et al., 2020) package in R, with the penalized concurrent effect of functional covariates (Ivanescu et al., 2015). Tables 4.1, 4.2 and 4.3 show the results of integrated mean square errors and absolute errors ∫ ∫ 𝜷(𝑡) − 𝜷(𝑡)∥ 2F 𝑑𝑡 and IMAE = 𝑡∈T 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 𝛽b𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) − 𝛽 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) 𝑑𝑡 Í IMSE = 𝑡∈T ∥ b 117 Table 4.1 Results of simulation situations (Situation1) where each modes are assumed to be independent for X (𝑡) and E (𝑡) for fixed time-points. Here we assume each of { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 and (𝑘) {𝜂 𝑞1 ,𝑞2 } 𝑞1 ,𝑞2 are independent for ( 𝑝 1 , 𝑝 2 ) and (𝑞 1 , 𝑞 2 ) respectively. 
Method IMSE (SD) RIMSE (SD) IMAE (SD) RIMAE (SD) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 0.14294 (0.02046) 0.01059 (0.00152) 0.28311 (0.02027) 0.09244 (0.00662) FToTM 1 1.48469 (0.05628) 0.10998 (0.00417) 0.96636 (0.01626) 0.31552 (0.00531) FToTM 2 0.45773 (0.02218) 0.03391 (0.00164) 0.53786 (0.01068) 0.17561 (0.00349) FToTM 3 0.15078 (0.01316) 0.01117 (0.00097) 0.29482 (0.01452) 0.09626 (0.00474) FToTM 4 0.01065 (0.00383) 0.00079 (0.00028) 0.07871 (0.01367) 0.0257 (0.00446) FToTM 5 0.01558 (0.00582) 0.00115 (0.00043) 0.09412 (0.01695) 0.03073 (0.00553) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.1448 (0.01339) 0.00193 (0.00018) 0.28468 (0.0132) 0.04054 (0.00188) FToTM 1 9.24824 (0.06732) 0.12304 (0.0009) 2.27313 (0.01304) 0.32372 (0.00186) FToTM 2 1.79804 (0.06786) 0.02392 (0.0009) 1.02121 (0.01836) 0.14543 (0.00261) FToTM 3 0.23289 (0.02089) 0.0031 (0.00028) 0.36104 (0.01293) 0.05142 (0.00184) FToTM 4 0.06108 (0.06808) 0.00081 (0.00091) 0.15243 (0.13744) 0.02171 (0.01957) FToTM 5 0.00195 (0.00053) 0.00003 (0.00001) 0.03348 (0.00451) 0.00477 (0.00064) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 0.03087 (0.00348) 0.00229 (0.00026) 0.13236 (0.00731) 0.04322 (0.00239) FToTM 1 1.46268 (0.04068) 0.10835 (0.00301) 0.95921 (0.01095) 0.31319 (0.00358) FToTM 2 0.43737 (0.01418) 0.0324 (0.00105) 0.52551 (0.00725) 0.17158 (0.00237) FToTM 3 0.13651 (0.00541) 0.01011 (0.0004) 0.27253 (0.01099) 0.08898 (0.00359) FToTM 4 0.00303 (0.00091) 0.00022 (0.00007) 0.04222 (0.00632) 0.01379 (0.00206) FToTM 5 0.0037 (0.00115) 0.00027 (0.00008) 0.04663 (0.00696) 0.01523 (0.00227) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.03082 (0.00163) 0.00041 (0.00002) 0.1328 (0.00357) 0.01891 (0.00051) FToTM 1 9.21298 (0.04487) 0.12257 (0.0006) 2.26689 (0.01132) 0.32283 (0.00161) FToTM 2 1.76018 (0.04482) 0.02342 (0.0006) 1.00917 (0.01218) 0.14372 (0.00173) FToTM 3 0.22276 (0.03467) 0.00296 (0.00046) 0.35168 (0.02647) 0.05008 (0.00377) FToTM 4 0.05837 (0.06468) 0.00078 (0.00086) 0.14918 (0.14726) 0.02124 (0.02097) FToTM 5 0.00085 (0.00033) 0.00001 (<0.00001) 0.02197 (0.00403) 0.00313 (0.00057) respectively. Similarly, we report the relative mean and absolute errors, which are defined as ∫ ∫ Í 𝜷(𝑡)−𝜷(𝑡) ∥ 2F 𝑑𝑡 ∥b 𝑡 ∈T 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 𝛽b𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 (𝑡)−𝛽 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 (𝑡) 𝑑𝑡 𝑡 ∈T RIMSE = ∫ and RIMAE = ∫ respectively. ∥ 𝜷(𝑡) ∥ 2F 𝑑𝑡 | 𝛽 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 (𝑡) | 𝑑𝑡 Í 𝑡 ∈T 𝑡 ∈T 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 The advantages of these simulation situations are that these models are not based on the reduced- rank model. Here, we observe the curves in the presence of errors. All integrals are approximated 118 Table 4.2 Results of simulation situations (Situation2)a where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with exponential covariance function. 
Method IMSE (SD) RIMSE (SD) IMAE (SD) RIMAE (SD) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 11.03513 (2.27364) 0.81742 (0.16842) 2.45079 (0.24082) 0.8002 (0.07863) FToTM 1 1.46631 (0.0141) 0.10862 (0.00104) 0.96402 (0.00583) 0.31476 (0.0019) FToTM 2 0.60273 (0.01917) 0.04465 (0.00142) 0.60152 (0.01318) 0.1964 (0.0043) FToTM 3 0.32753 (0.01741) 0.02426 (0.00129) 0.42707 (0.01962) 0.13944 (0.00641) FToTM 4 0.21328 (0.21078) 0.0158 (0.01561) 0.35394 (0.13306) 0.11556 (0.04344) FToTM 5 0.13694 (0.02654) 0.01014 (0.00197) 0.30854 (0.0384) 0.10074 (0.01254) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 11.36335 (1.34533) 0.15118 (0.0179) 2.49712 (0.14845) 0.35562 (0.02114) FToTM 1 9.21977 (0.02778) 0.12266 (0.00037) 2.27091 (0.0079) 0.32341 (0.00112) FToTM 2 1.76995 (0.02734) 0.02355 (0.00036) 1.01769 (0.01081) 0.14493 (0.00154) FToTM 3 0.41264 (0.16057) 0.00549 (0.00214) 0.48365 (0.08798) 0.06888 (0.01253) FToTM 4 0.18218 (0.21293) 0.00242 (0.00283) 0.32063 (0.13906) 0.04566 (0.0198) FToTM 5 0.06936 (0.05182) 0.00092 (0.00069) 0.19811 (0.08864) 0.02821 (0.01262) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 2.55232 (0.45649) 0.18906 (0.03381) 1.19172 (0.10708) 0.38911 (0.03496) FToTM 1 1.45974 (0.00711) 0.10813 (0.00053) 0.96178 (0.00323) 0.31403 (0.00105) FToTM 2 0.58776 (0.01049) 0.04354 (0.00078) 0.59246 (0.00766) 0.19344 (0.0025) FToTM 3 0.31275 (0.00961) 0.02317 (0.00071) 0.411 (0.01063) 0.1342 (0.00347) FToTM 4 0.18492 (0.20235) 0.0137 (0.01499) 0.32604 (0.13409) 0.10646 (0.04378) FToTM 5 0.11149 (0.03648) 0.00826 (0.0027) 0.27665 (0.06128) 0.09033 (0.02001) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 2.5259 (0.21061) 0.0336 (0.0028) 1.18808 (0.05122) 0.1692 (0.00729) FToTM 1 9.26995 (0.13929) 0.12333 (0.00185) 2.28525 (0.03385) 0.32545 (0.00482) FToTM 2 1.74798 (0.01575) 0.02325 (0.00021) 1.00948 (0.00691) 0.14376 (0.00098) FToTM 3 0.61359 (0.30173) 0.00816 (0.00401) 0.58812 (0.16308) 0.08376 (0.02322) FToTM 4 0.66733 (0.41716) 0.00888 (0.00555) 0.596 (0.24026) 0.08488 (0.03422) FToTM 5 0.0914 (0.04684) 0.00122 (0.00062) 0.23906 (0.07987) 0.03405 (0.01137) using the Riemann sum. Since our proposed method involves an iterative procedure, which depends on the initial estimates, the computational time is therefore not comparable to that of the classical CLM, which is not an iterative method. For all situations, our proposed method does a much better job in terms of low error rates in estimating the parameter 𝜷(𝑡). 119 Table 4.3 Results of simulation situations (Situation2)b where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with Matérn covariance function. 
𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 Method IMSE (SD) RIMSE (SD) IMAE (SD) RIMAE (SD) CLM 0.26393 (0.04919) 0.01955 (0.00364) 0.38374 (0.03318) 0.12529 (0.01083) FToTM 1 1.45885 (0.02061) 0.10806 (0.00153) 0.9599 (0.00731) 0.31342 (0.00239) FToTM 2 0.46879 (0.02445) 0.03473 (0.00181) 0.54097 (0.01118) 0.17663 (0.00365) FToTM 3 0.16291 (0.01629) 0.01207 (0.00121) 0.30998 (0.01506) 0.10121 (0.00492) FToTM 4 0.0087 (0.01146) 0.00064 (0.00085) 0.06782 (0.0274) 0.02214 (0.00895) FToTM 5 0.0111 (0.00525) 0.00082 (0.00039) 0.07909 (0.01855) 0.02582 (0.00606) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.26313 (0.02958) 0.0035 (0.00039) 0.3835 (0.02167) 0.05462 (0.00309) FToTM 1 9.22145 (0.02791) 0.12268 (0.00037) 2.27063 (0.00894) 0.32337 (0.00127) FToTM 2 1.77848 (0.02878) 0.02366 (0.00038) 1.02026 (0.01052) 0.1453 (0.0015) FToTM 3 0.23293 (0.01206) 0.0031 (0.00016) 0.36047 (0.00952) 0.05134 (0.00136) FToTM 4 0.05929 (0.06315) 0.00079 (0.00084) 0.15872 (0.13817) 0.0226 (0.01968) FToTM 5 0.00175 (0.00133) 0.00002 (0.00002) 0.03081 (0.00931) 0.00439 (0.00133) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 0.05833 (0.00912) 0.00432 (0.00068) 0.18217 (0.01463) 0.05948 (0.00478) FToTM 1 1.44275 (0.00963) 0.10687 (0.00071) 0.95499 (0.00374) 0.31181 (0.00122) FToTM 2 0.44676 (0.01346) 0.03309 (0.001) 0.52798 (0.00559) 0.17239 (0.00183) FToTM 3 0.14999 (0.00779) 0.01111 (0.00058) 0.29657 (0.00869) 0.09683 (0.00284) FToTM 4 0.00231 (0.00143) 0.00017 (0.00011) 0.03593 (0.00956) 0.01173 (0.00312) FToTM 5 0.00284 (0.00125) 0.00021 (0.00009) 0.04026 (0.00816) 0.01314 (0.00266) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.05746 (0.00427) 0.00076 (0.00006) 0.18093 (0.00695) 0.02577 (0.00099) FToTM 1 9.18754 (0.00744) 0.12223 (0.0001) 2.26337 (0.00385) 0.32233 (0.00055) FToTM 2 1.73663 (0.00773) 0.0231 (0.0001) 1.00481 (0.0038) 0.1431 (0.00054) FToTM 3 0.2181 (0.00522) 0.0029 (0.00007) 0.34535 (0.00306) 0.04918 (0.00044) FToTM 4 0.05167 (0.05987) 0.00069 (0.0008) 0.13999 (0.14339) 0.01994 (0.02042) FToTM 5 0.00081 (0.00061) 0.00001 (0.00001) 0.02055 (0.00654) 0.00293 (0.00093) 120 4.6 Application to ForrestGump-data 4.6.1 Details about the data-set The studyforrest (website: https://www.studyforrest.org/) describes a publicly available data-set for the study of neural language and story processing. The imaging data analyzed here are publicly avail- able through OpenfMRI (https://openneuro.org/datasets/ds000113/versions/1.3.0) (Hanke et al., 2014; Sengupta et al., 2016). In total 15 right-handed participants (mean age 29.4 years, range 21–39, 40% females, native German speaker) volunteered for a series of studies including eye- tracking experiments using natural signal stimulation with a motion picture. Volunteers have no known hearing problem without permanent or current temporary impairments, and no neurological disorder. Participants viewed a feature film “ Forrest Gump” (Robert Zemeckis, Paramount Pic- tures, 1994 with German audio track) in eight back-to-back 15-minute movie sessions, which were presented chronologically in two back-to-back sessions on the same day. Each session contained four segments, each approximately 15 minutes long. The eye tracking camera was fitted just outside the scanner bore, approximately centered and viewing the left eye of the participant at a distance of 100 cm through a small gap between the top of the back projection screen and the scanner bore ceiling. Participants were allowed to perform free eye movements without having to fixate or keep the eye open. 
The eye-gaze recording started as soon as the computer received the first fMRI trigger signal. For the audio-visual movie, the video track was extracted and encoded as H.264 (1280 × 720 at 25 fps). The movie was shown in 720p resolution on a 1280 × 1024 pixel screen at a 63 cm viewing distance. The temporal resolution of the participants' eye-gaze recording was 1000 Hz. All fMRI acquisitions had the following parameters: T2*-weighted echo-planar images with 2-second repetition time (TR), 30 ms echo time, and 90-degree flip angle were acquired during stimulation using a 3 Tesla MRI scanner. The dimension of the images at each time-point was 80 × 80 × 35 (with voxel dimension 3 × 3 × 3.3 mm³). The number of volumes acquired for the selected session was 451.
All files related to data acquisition for a particular subject are available in the sub-/ses-movie/ directory, where ID is the numeric subject ID. fMRI data files are available under the file name ses-movie_task-movie__bold. The in-scanner normalized eye-gaze coordinate time series are located at sub-/ses-movie/func/sub-_recording-eyegaze_physio.tsv.gz, which contains the X and Y coordinates of the eye-gaze, pupil area measurements, and the numerical ID of the movie frame presented at the time of measurement. Since the sampling rate is uniformly 1000 Hz, there are 1000 lines per second, with the first line corresponding to the onset of the movie stimulus. Here, the coordinates (0, 0) are located at the top-left corner of the movie frame, and the lower-right corner is located at (1280, 546); both measurements exclude the gray bars of the frame. All in-scanner recordings were temporally normalized by shifting the time series by the minimal video onset. Stimulus timing information was recorded in events.tsv files, which contain the onset and duration of each movie frame. In the eye-gaze data, a substantial amount of information is lost due to eye blinks; these samples are marked as nan in the data-set, and we fill them by spline interpolation. We used 14 individuals and removed Subject 5 due to excessive missing data.
We use the eye position in angular units (i.e., polar coordinates) instead of Cartesian coordinates, reporting magnitude changes of eye position in the screen reference system. Moreover, the X and Y coordinates, the related polar coordinates, and the pupil area were down-sampled to match the fMRI sampling frequency. Table 4.4 reports, for each participant, the correlation between angle and distance, the ratio of the standard deviations of distance and angle, and the ratio of the standard deviations in the horizontal and vertical directions of the frame. Fransson et al. (2014) investigates the relationship between spontaneous changes in eye position during passive fixation and intrinsic brain activity in a block-related task, using resting-state fMRI with concurrent eye-gaze recordings.
Figure 4.3 ForrestGump-data: Summary statistics for the parameter estimates of head motion correction across TRs and participants. (Left panel) Magnitude of three rotational parameters (in radians) and (Right panel) magnitude of three translation parameters (in millimeters) for each individual on each of the 451 TRs.
In each plot, the solid black line indicates the mean over the individuals through TRs, and the black dotted lines indicate the mean ± 2 sd over the individuals through TRs.
Figure 4.4 ForrestGump-data: Covariates of interest. Cartesian and polar coordinates and pupil area are shown across TRs and participants.
ID   sd(X)/sd(Y)   sd(dist)/sd(angle)   corr(dist, angle)
1    1.7044        922.2200             -0.3640
2    1.5247        1004.1233            -0.2552
3    1.6271        919.6505             -0.3124
4    1.6990        839.1719             -0.4135
5    1.1656        1156.9217            -0.4048
6    1.3943        772.5980             -0.2928
9    1.4403        762.4111             -0.3457
10   1.1282        821.5585             0.0177
14   1.9538        967.5292             -0.4522
15   1.1085        753.6706             -0.0512
16   1.7194        899.7864             -0.4071
17   1.9600        904.5869             -0.4612
18   1.6047        890.4765             -0.3568
19   1.3867        1069.9573            -0.2195
20   1.4631        944.8355             -0.3688
Table 4.4 ForrestGump-data: Summary statistics across participants.
4.6.2 Analysis
To analyze the data on a local computer, we only used the first run of the experiment for each individual and down-sampled the images to 64 × 64 × 64 via nearest-neighbor interpolation using the "resize" function in Matlab; the number of time-points was 451. Details of the pre-processing steps performed, along with further information on data acquisition, are described in Appendix A. Our scientific question of interest is to understand the association between brain image patterns and audio-visual inputs. This is the first approach to statistically analyze such a study by exploiting the complex structure of the data. We fit a time-varying tensor regression coefficient model as described in Section 4.2. Our covariate is a 3-mode tensor representing the normalized eye-gaze coordinate time-series, with the modes representing the scaled polar coordinates of the eye-gaze and the pupil area measurements, respectively. The response of the model is the pre-processed fMRI data. Response and covariates are collected simultaneously. The coefficient functions 𝜷1, 𝜷2 and 𝜷3 are amplitudes over time associated with the distance, the angle of the eye-gaze, and the pupil area, respectively; they are included to detect the effect of the movie's visual features on changes in the BOLD response. We choose the rank for the reduced-rank extraction to be 3 since it has the lowest prediction error. For interpretation purposes, we evaluate the estimates 𝜷̂(𝑡) by taking average values over eight different functional networks in the brain. This was achieved by first parcellating the brain into the 268 regions of the Shen atlas (Shen et al., 2013). These regions were thereafter combined into eight functional networks (Finn et al., 2015): medial frontal, frontoparietal, default mode, subcortical-cerebellum, motor, visual I, visual II, and visual association. Figures 4.5, 4.6 and 4.7 show the average estimated coefficient functions corresponding to the three visual features (distance, angle of eye-gaze, and pupil area) over all time-points for each network. Vertical lines represent scene changes in the movie. The first segment, consisting of approximately 84 time-points, corresponds to the opening sequence, which shows a feather floating through the sky as the credits are shown. The second segment consists of the famous scene where the protagonist of the movie sits on a bench at a bus stop and begins discussing the story of his life.
During this scene, there is heightened activation in several brain networks in reaction to different visual features. Throughout the time course, the changes in visual features have the greatest impact on activation in "visual I", which is depicted using purple lines and is consistent with what we should expect. Moreover, subsequent segments represent scene changes alternating between interior and exterior settings; see Häusler and Hanke (2016) for more details.
Figure 4.5 ForrestGump-data results: Estimate of the coefficient 𝜷1(𝑡) corresponding to the visual feature distance of eye-gaze for different locations. Legends for the different parcellations as mentioned in Shen et al. (2013) are also provided.
Figure 4.6 ForrestGump-data results: Estimate of the coefficient 𝜷2(𝑡) corresponding to the visual feature angle of eye-gaze for different locations. Legends for the different parcellations as mentioned in Shen et al. (2013) are also provided.
Figure 4.7 ForrestGump-data results: Estimate of the coefficient 𝜷3(𝑡) corresponding to the visual feature pupil area for different locations. Legends for the different parcellations as mentioned in Shen et al. (2013) are also provided.
4.7 Discussion
In this chapter, we have proposed a time-varying tensor-on-tensor regression model and a method to estimate the coefficient tensors, which belong to an infinite-dimensional space. We believe that the method provides an efficient approach to multi-modal data analysis using neuroimaging data. Regression coefficients are expressed using the B-spline technique, and the coefficients of the B-spline bases are estimated using a low-rank tensor decomposition. This reduces the vast number of parameters of interest as well as the computational complexity. We have provided a meaningful simulation study and a real data analysis combining fMRI and eye-tracking data. The results of our data analysis suggest that the approach has promise for identifying brain regions responding to an external stimulus, which in this case is movie-watching. Although our tensor data can be compactly represented by a CP model, it is NP-hard to determine the rank of the low-rank decomposition (Johan, 1990). To determine the tuning parameters, one can perform cross-validation. However, our main objective is not to choose the optimal rank of the low-rank decomposition in the algorithm, and we leave this for future research. Furthermore, the tensor train representation (Liu et al., 2020) could be an alternative representation of the multi-dimensional array. In conclusion, our work provides an important direction for dealing with massive structured data, such as time-varying tensors, in multi-modal neuroimaging studies.
4.8 Technical details
4.8.1 Technical lemmas
Lemma 4.8.1. For any positive definite matrices A and B we have
𝜆min(A) trace{B} ≤ trace{AB} ≤ 𝜆max(A) trace{B} (4.24)
where 𝜆max(A) is the largest eigen-value of A and 𝜆min(A) is the smallest eigen-value of A.
Proof. See Fang et al.
(1994) for the proof in detail. 130 Before introducing the next lemma, define a 𝑃-dimensional vector u = (𝑢 1 , · · · 𝑢 𝑃 ) T which is sub-Gaussian with some parameters 𝜎, then for all 𝜶 ∈ R𝑃 , E{exp 𝜶T u} ≤ exp(∥𝜶∥ 2 𝜎 2 /2) (4.25) Define the locally stationary time series 𝑢 𝑗 = G( 𝑗/𝐽, F𝑗 ) where F𝑗 = (· · · , 𝜂 𝑗−1 , 𝜂 𝑗 , · · · ), 𝜂 𝑗 s are i.i.d. random variables and G : [0, 1]×R∞ → R is a measurable function such that 𝜉 𝑗 (𝑡) = G(𝑡, F𝑗 ). Let {𝜂′ } be i.i.d. copies of 𝜂 and assume that for some 𝑎 > 0, define the 𝐿 𝑎 -norm ∥𝜂∥ 𝑎 = {E|𝜂| 𝑎 }1/𝑎 . Then for 𝑘 ≥ 0 define the physical dependence measure Δ(𝑘, 𝑎) = sup𝑡∈[0,1] max 𝑗 ∥G(𝑡, F𝑗 ) − G(𝑡, F𝑗,𝑘 ) ∥ 𝑎 where F𝑗,𝑘 = (F𝑗−𝑘−1 , 𝜂′𝑗−𝑘 , 𝜂 𝑗−𝑘+1 , · · · , 𝜂 𝑗 ). Moreover, recall the condition (C4) where for some large 𝑎, 𝜅0 > 0, there exists a universal constant 𝐶 > 0 such that Δ(𝑘, 𝑎) ≤ 𝐶 𝑘 −𝜅0 for 𝑘 ≥ 1. Furthermore, let ∥𝜂∥ 𝑎 be finite for some 𝑎 > 1. Lemma 4.8.2. Under condition (C4), with the above explanation, for some constant 𝐶𝑎 > 0,   1 𝑄𝜉 𝑁 𝜏 P 𝜎1 (PE) ≤ √ ≥ 1 − 𝐶𝑎 𝑁 −𝑎𝜏 (4.26) 𝑁𝐽 𝐽 where 𝜏 is some small positive real number and 𝜉 = sup1≤ℎ≤𝐻 sup𝑡∈[0,1] |Bℎ (𝑡)| Proof. See Ding et al. (2021) and the references herein for the proof in detail. Í𝐾 𝑁 +𝑣+1 Lemma 4.8.3. Define S𝑛 be a collection of spline such that the function 𝑔• (𝑡) = ℎ=1 𝑏 ℎ,• 𝐵 ℎ (𝑡), where {𝐵 ℎ , ℎ = 1, · · · , (𝐾 𝑁 + 𝑣 + 1)} is a set of B-spline bases in 𝑆𝑛 . Under conditions (C2) and (C3), there exists a spline function 𝑔• (𝑡) ∈ 𝑆𝑛 such that ! 1 sup |𝛽• (𝑡) − 𝑔• (𝑡)| = 𝑂 (4.27) 𝑡∈T 𝐾 𝑁𝑣+1 Proof. This proof follows from De Boor et al. (1978). 4.8.2 Proof of Theorem 4.3.1 For simplicity, assume Y ∈ R𝑁 𝐽×𝑄 and Z ∈ R𝑁 𝐽×𝐻×𝑃 , thus B ∈ R𝐻×𝑃×𝑄 . The contracted inner product in this proof is of order 2, i.e., < ·, · >2 , for simplicity, we drop subscript 2 from the inner 131 product. By the definition of b B0 , for all matrices C of rank 𝑅0 with order 𝐻 𝑁 × 𝑃 × 𝑄, we have D E ∥Y − Z, b B0 ∥ 2F + (𝑁 𝐽)∥b B0 ∥ 2F,W 𝜔 ≤ ∥Y − ⟨Z, C⟩ ∥ 2F + (𝑁 𝐽)∥ C ∥ 2F,W 𝜔 (4.28) In addition, the following two equations hold for any tensor C, ∥Y − ⟨Z, C⟩ ∥ 2F = ∥Y − ⟨Z, B0 ⟩ ∥ 2F + ∥ ⟨Z, ( B0 − C)⟩ ∥ 2F + 2 ⟨E, ⟨Z, ( B0 − C)⟩⟩ F D E D E D D EE ∥Y − Z, b B0 ∥ 2F = ∥Y − ⟨Z, B0 ⟩ ∥ 2F + ∥ Z, ( B0 − b B0 ) ∥ 2F + 2 E, Z, ( B0 − b B0 ) F (4.29) with ⟨A, B⟩ F = trace{AT B} for any matrices A and B such that the matrix product of AT B is permissible. Define, P = Z (1) ( ZT(1) Z (1) ) −1 ZT(1) , then by the definition of Frobenius inner product, D D EE D D EE E, Z, (bB0 − B) = PE, Z, (b B0 − C) . Moreover, the inner product norm ⟨·, ·⟩ F , operator F F Í norm ∥ · ∥ 2 = 𝜎1 (·) and nuclear norm ∥ · ∥ ∗ = 𝑖 𝜎𝑖 (·) are related using the inequalities ⟨A, B⟩ F ≤ √ ∥A∥ 2 ∥B∥ ∗ and ∥B∥ ∗ ≤ 𝑟 ∥B∥ F where 𝑟 be the rank of the matrix B and 𝜎𝑖 (·) represents the 𝑖 th largest singular value of a matrix. By subtracting the two Equations in 4.29 and exercising the properties of different norms mentioned above, we get the following inequalities. D E D D EE n o 2 2 2 2 ∥ Z, ( B0 − B0 ) ∥ F ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ F + 2 E, Z, ( B0 − C) b b + (𝑁 𝐽) ∥ C ∥ F,W 𝜔 − ∥ B0 ∥ F,W 𝜔 b F D D EE = ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 2 PE, Z, (b B0 − C) F n o + (𝑁 𝐽) ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 √︁ D E ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 2𝜎1 (PE) 2𝑅0 ∥ Z, (b B0 − C) ∥ F n o + (𝑁 𝐽) ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 (4.30) ∫ ∫ Define, P = I𝑄 ⊗I𝑃 ⊗ B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡 and observe the fact that 𝜆 max (P) = 𝜆 max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡). 
Now consider, for any tensor with C, using Lemma 4.8.1, vec( C) T P vec( C) − vec(b B0 ) T P vec(bB0 ) = trace{P(vec( C) vec( C) T − vec(b B0 ) vec(bB0 ) T )} ≤ 𝜆 max (P)trace{vec( C) vec( C) T − vec(b B0 ) vec(b B0 ) T } 132 ∫ = 𝜆 max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡){∥ C ∥ 2F − ∥b B0 ∥ 2F } (4.31) As a consequence of the above inequality, ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 = vec( C) T W𝜔 vec( C) − vec(b B0 ) T W𝜔 vec(b B0 )   = 𝜃 vec( C) T P vec( C)} − vec(b B0 ) T P vec(b B0 )}   + 𝜙 vec( C) T vec( C)} − vec(b B0 ) T vec(b B0 )} n o ≤ (𝜃𝜆max (P) + 𝜙) ∥ C ∥ 2F − ∥b B0 ∥ 2F ∫ n o ′′ ′′ T 2 2 = (𝜃𝜆max ( B (𝑡)B (𝑡) 𝑑𝑡) + 𝜙) ∥ C ∥ F − ∥ B0 ∥ F b (4.32) Then for tensor C with rank( C) ≤ 𝑅0 and 𝐼 = min(𝐻, 𝑃𝑄), we have the following inequalities. ∥ C ∥ 2F − ∥b B0 ∥ 2F ∑︁ 𝐼 𝐼 ∑︁ = 𝜎𝑖2 ( C (1) ) − 𝜎𝑖2 (b B0(1) ) 𝑖=1 𝑖=1 ( 𝐼 ) n o ∑︁   ≤ 𝜎1 ( C (1) ) + 𝜎1 (b B0(1) ) 𝜎𝑖 ( C (1) ) − 𝜎𝑖 (b B0(1) ) 𝑖=1 ( 𝐼 ) (𝑖) n o ∑︁ ≤ 2𝜎1 ( C (1) ) + 𝜎1 (b B0(1) − C (1) ) 𝜎𝑖 (bB0(1) − C (1) ) 𝑖=1 ( 𝑅0 ) n o ∑︁ = 2𝜎1 ( C (1) ) + 𝜎1 (b B0(1) − C (1) ) 𝜎𝑖 (b B0(1) − C (1) ) 𝑖=1 (𝑖𝑖) n o n√︁ o ≤ 2𝜎1 ( C (1) ) + ∥b B0 − C ∥ F 2𝑅0 ∥b B0 − C ∥ F √︁ n o2 ≤ 2𝑅0 2𝜎1 ( C (1) ) + ∥ B0 − C ∥ F b (4.33) where inequality (i) follows since 𝜎𝑖+ 𝑗−1 (A + B) ≤ 𝜎𝑖 (A) + 𝜎 𝑗 (B), or in other words due to Weyl’s additive perturbation theory which states that 𝜎𝑖+ 𝑗−1 (A) ≤ 𝜎𝑖 (B) + 𝜎 𝑗 (A − B). Inequal- ity (ii) holds since by definition 𝜎1 (A) = ∥A∥ 2 , operator norm. Moreover, ∥A∥ 2 ≤ ∥A∥ F and due to Cauchy-Schwarz inequality along with the fact that rank(A + B) ≤ 𝑟𝑎𝑛𝑘 (A) + D E D E 𝑟𝑎𝑛𝑘 (B). Also, ∥ Z, (b B0 − C) ∥ 2F = ∥ Z (1) , (b B0 − C) (3) ∥ 2F = ∥ ZT(1) (b B0 − C) (3) ∥ 2F ≥ ∥b B0 − C ∥ 2F 𝜆 min (ZT(1) Z (1) ) due to 4.8.1. Therefore, using the inequality (𝑥 + 𝑦) 2 ≤ 2(𝑥 2 + 𝑦 2 ) we have for 133 ∫ √ 𝜇 = (𝑁 𝐽) (𝜃𝜆max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡) + 𝜙) 2𝑅0 n o n D E o2 (𝑁 𝐽) ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 ≤ 𝜇 2𝜎1 ( C (1) ) + 𝜆−1 ( Z T min (1) (1) Z )∥ Z , ( B b 0 − C ) ∥F n D E o −2 ≤ 𝜇 4𝜎1 ( C (1) ) + 𝜆 min ( Z (1) Z (1) )∥ Z, ( B0 − C) ∥ 2F 2 T b D E −2 ≤ 4𝜇𝜎1 ( C (1) ) + 2𝜇𝜆min ( Z (1) Z (1) )∥ Z, ( B0 − B0 ) ∥ 2F 2 T b + 2𝜇𝜆−2 T min ( Z (1) Z (1) )∥ ⟨Z, ( C − B0 )⟩ ∥ F 2 (4.34) Therefore, we obtain the bound for the prediction error as the following way using the assumption that 𝜆 min ( ZT(1) Z (1) ) is bounded below by 𝜆 with high probability and by inequality 2𝑥𝑦 ≤ 𝑥 2 /𝑎 + 𝑎𝑦 2 in (★), consider the following from Equation (4.30), D E √︁ D E 2 2 ∥ Z, ( B0 − B0 ) ∥ F ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ F + 2𝜎1 (PE) 2𝑅0 ∥ Z, ( B0 − C) ∥ F b b D E + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F + 2𝜇𝜆−2 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 4𝜇𝜎12 ( C (1) ) −2 b √︁ D E ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 2𝜎1 (PE) 2𝑅0 ∥ Z, (b B0 − B0 ) ∥ F √︁ + 2𝜎1 (PE) 2𝑅0 ∥ ⟨Z, ( C − B0 )⟩ ∥ F D E + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F + 2𝜇𝜆−2 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 4𝜇𝜎12 ( C (1) ) −2 b (★) ≤ 4𝜇𝜎12 ( C (1) ) + ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F D E D E −2 + 2𝑅0 𝑎𝜎1 (PE) + ∥ Z, ( B0 − B0 ) ∥ F /𝑎 + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F 2 b 2 b + 2𝑅0 𝑏𝜎12 (PE) + ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F /𝑏 + 2𝜇𝜆−2 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F ≤ 4𝜇𝜎12 ( C (1) ) + 2(𝑎 + 𝑏)𝑅0 𝜎12 (PE)     D 𝑏+1 −2 2 1 −2 E + + 2𝜇𝜆 ∥ ⟨Z, ( C − B0 )⟩ ∥ F + + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F b 𝑏 𝑎 (4.35) Therefore, by doing some algebra, we have the following.   
D 𝑎−1 −2 E − 2𝜇𝜆 ∥ Z, ( B0 − B0 ∥ 2F ≤ 4𝜇𝜎12 ( C (1) ) + 2(𝑎 + 𝑏)𝑅0 𝜎12 (PE) b 𝑎   𝑏+1 −2 + + 2𝜇𝜆 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F 𝑏 134 D E   −1  2 −1 −2 ∥ Z, ( B0 − B0 ) ∥ F ≤ C(𝛿) − 2𝜇𝜆 b 4𝜇𝜎12 ( C) + 2(1 + 𝛿)𝑅0 𝜎12 (PE) C(𝛿) + 2𝜇𝜆−2   + ∥ ⟨Z, ( C − B)⟩ ∥ 2F (4.36) C(𝛿) −1 − 2𝜇𝜆−2 where C(𝛿) = 1 + 2/𝛿 and 𝜎1 ( C) = max{𝜎1 ( C (1) ), 𝜎1 ( C (2) ), 𝜎1 ( C (3) )}. The last inequality holds after choosing 𝑎 = 1 + 𝛿/2 and 𝑏 = 𝛿/2. Now, it is enough to provide an upper bound of the largest singular value of PE. For some positive constant 𝐶0 , with high probability 1 − 𝐶0 𝑁 −𝑎𝜏 , by Lemma 4.8.2, D E   −1  2 −1 −2 ∥ Z, ( B0 − B0 ) ∥ F ≤ C(𝛿) − 2𝜇𝜆 b 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 C(𝛿) + 2𝜇𝜆−2   + ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F (4.37) C(𝛿) −1 − 2𝜇𝜆−2 Since C is an arbitrary matrix with rank( B) ≤ 𝑅0 , the choosing C = B0 , we have, D E   −1  ∥ Z, (bB0 − B0 ) ∥ 2F ≤ C(𝛿) −1 − 2𝜇𝜆−2 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.38) The estimation bound can be derived from the above expression under condition 𝜆min ( ZT(1) Z (1) ) ≥ 𝜆, from inequality 4.38, we have   −1  2 −1 −1 −2 ∥ B0 − B0 ∥ F ≤ 𝜆 b C(𝛿) − 2𝜇𝜆 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.39) 4.8.3 Proof of Theorem 4.3.2 Observe that, due to Lemma 4.8.3 and the fact that, ∫ ∫  T b 2 T T b0 −B0 ∥ 2 = 𝑂 𝑃 (𝑎 𝑁 ) [B(𝑡) ( B0 −B0 )] 𝑓𝑇 (𝑡)𝑑𝑡 = (b B0 −B0 ) Bℎ (𝑡)Bℎ (𝑡) 𝑓𝑇 (𝑡)𝑑𝑡 (b B0 −B0 ) ∝ ∥ B T T (4.40) Therefore, ∫ ( 𝛽b• (𝑡) − 𝛽• (𝑡)) 2 𝑓𝑇 (𝑡)𝑑𝑡 T ∫  2 T b T = B(𝑡) ( B0 − B0 ) + B(𝑡) B0 − 𝛽(𝑡) 𝑓𝑇 (𝑡)𝑑𝑡 T ∫ 2 ≤ 2∥b B0 − B0 ∥ F + 2 [B(𝑡) T B0 − 𝛽(𝑡)] 2 𝑓𝑇 (𝑡)𝑑𝑡 T ≤ 𝑂 𝑃 (𝑎 𝑁 ) + 𝑂 (𝐾 𝑁−2(𝜈+1) ) (4.41) 135 CHAPTER 5 EPILOGUE 5.1 Conclusions In this dissertation, we develop new methods and theories for functional data analysis with depen- dence and complex structures that are often produced in neuroimaging studies. Classical statistical approaches either do not adequately take advantage of dependence or are not applicable to data with complex structures. The first part of this dissertation (Chapters 2 and 3) focuses on improving existing non-parametric methods via incorporating dependence in functional data. In Chapter 2 we develop an efficient and robust estimation technique based on quadratic inference for a coefficient vector in a constant linear effects model with dense functional responses. The proposed method uses a data-driven approach to construct bases, which avoids the possible mis-specification and improves estimation efficiency. Then, in Chapter 3 we develop a multi-step estimation procedure for estimating non-parametric coefficient function in a varying-coefficient linear model with het- eroskedastic errors. This method incorporates the dependence via a local linear generalized method of moments based on continuous moment conditions. In the second part of the dissertation (Chapter 4) we develop a new approach for studying neural correlates in the presence of tensor-valued brain images and tensor-valued predictors. We consider a time-varying tensor regression model where the inherent structural composition of responses and covariates is preserved. Extensive simulation studies and real data analyses are conducted to justify the efficacy of the proposed methods. 5.2 Future directions We would like to provide some possible directions with feasible applications to work on in the future, which directly follow from this dissertation. 1. Can quadratic inference approach improve the efficiency of parameter estimation for longi- tudinal tensor data? 136 - In many applications, we are convinced that brain images appear as structured data. 
Therefore, scientists are often interested in analyzing such complex structured longitudinal data which are observed at finitely many time-points or clusters due to the number of visits of patients in the clinic. Zhang et al. (2014) proposed a unifying regression framework with tensor-variate image predictor with tensor-GEE for longitudinal imaging analysis. The proposed GEE approach takes into account the intra-subject correlation responses where a low-rank tensor decomposition of the coefficient array becomes effective during estimation and prediction. Similarly to the existing problem of GEE in classical longitudinal analysis, the estimation of the parameter tensor becomes inconsistent when the working correlation structure is miss- specified. Therefore, a quadratic inference-based method can be developed for tensor-GEE to improve the efficiency of the tensor-variate regression model. 2. Can a time-varying tensor approach improve FDR control for fMRI data? - In the statistics and neuroimaging literature, false discovery rate (FDR) is commonly used for inference. A straightforward application of FDR violates the complex data features, thereby producing unsatisfactory performance. Brown and Behrmann (2017) correctly observed that the overstatement of Eklund et al. (2016) regarding FDR cast doubt on fMRI technique for studying brain function and caused damage to the field of cognitive neuroscience. Subse- quently, Cox et al. (2017); Kessler et al. (2017) offered several clarifications. Among other issues, these PNAS papers recognized that accounting for the spatio-temporal aspects is ut- terly important for fMRI and is a remarkable methodology for understanding brain function and its relationship to behavioral characteristics. Therefore, based on the model proposed in Chapter 4, one can propose a new thresholding technique for the multiple hypothesis testing problem for spatially dependent tests over continuum null hypotheses to identify activated voxels over time based on a tensor-structured “statistical parametric map”, which can be an interesting development. Apart from the above two immediate future directions, the methodologies presented in this 137 dissertation will provide essential tools for analysis of such complex structured data not only in neuroimaging studies, but also in other scientific areas such as network analysis, recommendation systems and statistical genetics. 138 APPENDIX 139 APPENDIX PRE-PROCESSING OF IMAGING DATA USING R The main goal of studying fMRI data is to identify brain areas activated by a task. Scanner drift may occur when the strength of the magnetic field inside the bore changes slowly over time during a scan session. Therefore, it is not recommended to conduct any statistical analyses using the raw data coming from the scanner. As a result, an important role is played by the pre-processing steps of fMRI data, which consist of all required transformations needed to prepare the data for analysis. In statistics, noise is the fundamental uncertainty, but sometimes it consists of systematic variability and it is possible to remove the noise from the data. Hence, the main goal of these pre-processing steps is to remove the systematic variability that can arise due to movement of the head during an experiment, size of the brain, etc. which are mainly sources of variability due to artifacts. 
Pre-processing of the fMRI data is almost similar for all kinds of experiment and typically involves a number of steps such as aligning the functional and structural scans, correcting the possible head movements, skull stripping, registration to the template, and smoothing to reduce noise; although, the type of smoothing depends on the objective of the statistical analysis. In most of the cases, the fMRIs are in NIfTI (Neuroimaging Informatics Technology Initiative) format with extension of the file “*.nii” and this can be read as a multi-dimensional array. We read the data in R and perform all required pre-processing steps using some tools and packages in R such as the fslr (Jenkinson et al., 2012; Muschelli et al., 2015). Interested readers are encouraged to study Ashby (2011); Wager and Lindquist (2015) for further details in the pre-processing steps. The six commonly used pre-processing steps are performed in the order as listed below. 1. Slice timing correction: This method corrects the variability in the BOLD responses that is due to the fact that data in different voxels are acquired at different times. This step is performed using the function “slicetimer” where indexing is from top, the order of acquisition is continuous, and the interpolation using this function is done using “sinc” filter. For example, 140 in ForrestGump-data, the brain image consists of 35 slices and the replication time is 2 seconds. Therefore, the time between acquisition of the first and last slices should be almost the same as 2 seconds. Moreover, the variability in the data due to differences in the times of slice acquirement can be reduced by this pre-processing step using interpolation and/or by analyzing task-related activation using the flexible hrf model. Later, the bias_correct function is used for bias field corrections. 2. Motion correction: This step is performed to correct for variability due to head movement. Motion correction is a special case of image registration where a series of images are aligned by considering mean image over all time-points as the target image for each individuals. This is one of the most important steps of pre-processing. When a subject moves their head, a specific brain region either moves to a different region or out of the scanning area. This correction procedure is based on the assumption that the shape and size of the brain remain intact irrespective of the subject moving their head. Therefore, the rigid body registration (Ashburner and Friston, 2007) method can be applied. It is easy to visualize that any rigid body movement can be described by six parameters. When a subject lies inside the scanner, the center of any voxel in their head occupies a point in space that can be characterized by the triplet (𝑥, 𝑦, 𝑧). By convention, the 𝑍-axis is parallel to the bore of the magnet, the 𝑋-axis passes through the subject’s ears from left to right side and the 𝑌 -axis is a pole that enters through the back of the head and exits in the forehead. Based on this coordinate system, possible rigid body movements are translations and rotations along the 𝑋, 𝑌 and 𝑍 axes. BOLD responses at the mean over different TR is taken as the standard and then rigid body transformation is performed for all TRs until each of the data-sets agree as closely as possible with the mean data. Motion-corrected images have the same dimension, voxel spacing, origin, and direction as those of images collected from the scanner. 
Here we use “antsrMotionCalculation” function which provides an R-wrapper around the Insight Segmentation and Registration Toolkit (ITK). A rigid body transformation for the calculation of frame-wise motion parameters is illustrated in Figure 4.3 where first three parameters 141 contain the rotation matrix (rotation along the 𝑋, 𝑌 and 𝑍 axes, respectively) and the other three parameters are translation vectors (translation along the 𝑋, 𝑌 and 𝑍 axes, respectively) at each TR. 3. Co-registration: Since the functional images are collected with low spatial resolution, and the voxel size is much smaller than the structural images, in this step we align the functional and structural images of the pre-processing steps. We perform brain extraction of T1-weighted images (after skull stripping and bias-correction since brain activity is restricted to brain tissues only; therefore, brain extraction of the anatomical image and inhomogeneity correction must be performed to remove artifacts). The method of co-registration is similar to motion correction, rather simpler, since it involves only two images to be aligned, but challenge is that voxels are no longer one-to-one between two images, and the functional and structural images are run with different imaging parameters (or modalities), as a result of which their contrasts are different. In this step, only the structural image (mostly the mean image) and any one of the functional images must align, since all functional images are already in alignment after the head movement corrections. We use the function “registration” to register the average fMRI with spatial resolution of 3.0 × 3.0 × 3.0𝑚𝑚 3 to an 0.7 × 0.67 × 0.67𝑚𝑚 3 T1-weighted image by applying affine transformation and non-linear registration using the symmetric normalization (SyN) algorithm in Advance Normalizing Tools (ANTs). Note that the resulting images have the same dimension, voxel spacing, origin, and direction as those of the anatomical coordinate systems. 4. Normalization: This step is performed to wrap the structural images in the standard brain atlas. There exist huge disparities between two brains in terms of size and shape across the subjects. Therefore, it is difficult to make decisions regarding task-related activation in order to figure which voxel/region of the brain is activated in a subject. The common practice is to map the data onto a “standard brain” for which coordinates of all major brain areas have already been discovered and henceforth determine the activated region in a brain 142 atlas. Here we use an atlas in Section 4.6 proposed by Montreal Neurological Institute (MNI) where the MNI-atlas was created by averaging the results from high-resolution structural images taken over 152 different brains with dimension 182 × 218 × 182 with pixel dimension 1 × 1 × 1mm3 and it is also provided in FSL as MNI152_T1_1mm_brain. Furthermore, the functional brain atlas provides information about the location of the functional brain region, obtaining knowledge on the brain functionality. The steps of the normalization process are the following. • Co-registering the T1-weighted image into the coordinate system of the MNI space. • Co-registering functional and T1-weighted structural images. • Applying the calculated non-linear forward transformation from previous steps to project fMRI time-series to MNI space. Here we use “registration” and “antsApplyTransforms” accordingly. 5. 
BIBLIOGRAPHY

Ai, C. and Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843.

Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47.

Allen, G. I., Grosenick, L., and Taylor, J. (2014). A generalized least-square matrix decomposition. Journal of the American Statistical Association, 109(505):145–159.

Anderson, T. W. (1958). An introduction to multivariate statistical analysis, volume 2. Wiley, New York.

Ashburner, J. and Friston, K. J. (2007). Rigid body registration.
Statistical parametric mapping: The analysis of functional brain images, pages 49–62. Ashby, F. G. (2011). Statistical analysis of fMRI data. MIT press. Aue, A., Norinho, D. D., and Hörmann, S. (2015). On the prediction of stationary functional time series. Journal of the American Statistical Association, 110(509):378–392. Bai, Y., Zhu, Z., and Fung, W. K. (2008). Partial linear models for longitudinal data based on quadratic inference functions. Scandinavian Journal of Statistics, 35(1):104–118. Balan, R. M., Schiopu-Kratina, I., et al. (2005). Asymptotic results with generalized estimating equations for longitudinal data. The Annals of Statistics, 33(2):522–541. Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202. Behzadi, Y., Restom, K., Liau, J., and Liu, T. T. (2007). A component based noise correction method (compcor) for bold and perfusion based fmri. Neuroimage, 37(1):90–101. Berrendero, J. R., Justel, A., and Svarc, M. (2011). Principal components for multivariate functional data. Computational Statistics & Data Analysis, 55(9):2619–2634. Bi, X., Tang, X., Yuan, Y., Zhang, Y., and Qu, A. (2021). Tensors in statistics. Annual review of statistics and its application, 8:345–368. Bongiorno, E. G., Salinelli, E., Goia, A., and Vieu, P. (2014). Contributions in infinite-dimensional statistics and related topics. Societa Editrice Esculapio. 146 Bravo, F. (2021). Second order expansions of estimators in nonparametric moment conditions models with weakly dependent data. Econometric Reviews, pages 1–24. Brown, E. N. and Behrmann, M. (2017). Controversy in statistical analysis of functional magnetic resonance imaging data. Proceedings of the National Academy of Sciences, 114(17):E3368– E3369. Cai, T. T. and Hall, P. (2006). Prediction in functional linear regression. The Annals of Statistics, 34(5):2159–2179. Cai, Z., Das, M., Xiong, H., and Wu, X. (2006). Functional coefficient instrumental variables models. Journal of Econometrics, 133(1):207–241. Cai, Z. and Li, Q. (2008). Nonparametric estimation of varying coefficient dynamic panel data models. Econometric Theory, 24(5):1321–1342. Cardot, H., Degras, D., and Josserand, E. (2013). Confidence bands for horvitz–thompson estima- tors using sampled noisy functional data. Bernoulli, 19(5A):2067–2097. Cardot, H., Ferraty, F., and Sarda, P. (1999). Functional linear model. Statistics & Probability Letters, 45(1):11–22. Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, pages 571–591. Cardot, H. and Sarda, P. (2008). Varying-coefficient functional linear regression models. Commu- nications in statistics—theory and methods, 37(20):3186–3203. Carroll, J. D. and Chang, J.-J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika, 35(3):283–319. Casey, B., Soliman, F., Bath, K. G., and Glatt, C. E. (2010). Imaging genetics and development: challenges and promises. Human Brain Mapping, 31(6):838–851. Castro, P. E., Lawton, W. H., and Sylvestre, E. (1986). Principal modes of variation for processes with continuous sample curves. Technometrics, 28(4):329–337. Chen, K., Dong, H., and Chan, K.-S. (2013). Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901–920. Chen, K. and Müller, H.-G. (2012). Modeling repeated functional observations. 
Journal of the American Statistical Association, 107(500):1599–1609. Chen, X., Li, H., Liang, H., and Lin, H. (2019). Functional response regression analysis. Journal of Multivariate Analysis, 169:218–233. 147 Chiou, J.-M., Chen, Y.-T., and Yang, Y.-F. (2014). Multivariate functional principal component analysis: A normalization approach. Statistica Sinica, pages 1571–1596. Chiou, J.-M., Müller, H.-G., and Wang, J.-L. (2003). Functional quasi-likelihood regression models with smooth random effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):405–423. Chiou, J.-M., Müller, H.-G., and Wang, J.-L. (2004). Functional response models. Statistica Sinica, pages 675–693. Conn, A. R., Gould, N. I., and Toint, P. L. (2000). Trust region methods, volume 1. Siam. Cox, R. W., Chen, G., Glen, D. R., Reynolds, R. C., and Taylor, P. A. (2017). fmri clustering and false-positive rates. Proceedings of the National Academy of Sciences, 114(17):E3370–E3371. Cragg, J. G. (1983). More efficient estimation in the presence of heteroscedasticity of unknown form. Econometrica: Journal of the Econometric Society, pages 751–763. Crainiceanu, C. M., Staicu, A.-M., and Di, C.-Z. (2009). Generalized multilevel functional regres- sion. Journal of the American Statistical Association, 104(488):1550–1561. Craven, P. and Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische mathematik, 31(4):377–403. Dauxois, J., Pousse, A., and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis, 12(1):136–154. De Boor, C., De Boor, C., Mathématicien, E.-U., De Boor, C., and De Boor, C. (1978). A practical guide to splines, volume 27. springer-verlag New York. Delaigle, A. and Hall, P. (2010). Defining probability density for a distribution of random functions. The Annals of Statistics, pages 1171–1193. Diggle, P., Diggle, P. J., Heagerty, P., Liang, K.-Y., Heagerty, P. J., Zeger, S., et al. (2002). Analysis of longitudinal data. Oxford University Press. Ding, X., Yu, D., Zhang, Z., and Kong, D. (2021). Multivariate functional response low-rank regression with an application to brain imaging data. Canadian Journal of Statistics, 49(1):150– 181. Ding, X. and Zhou, Z. (2020). Estimation and inference for precision matrices of nonstationary time series. The Annals of Statistics, 48(4):2455–2477. Douglas Nychka, Reinhard Furrer, John Paige, and Stephan Sain (2017). fields: Tools for spatial 148 data. R package version 12.3. Dunford, N. and Schwartz, J. T. (1988). Linear operators, part 1: general theory, volume 10. John Wiley & Sons. Eklund, A., Nichols, T. E., and Knutsson, H. (2016). Cluster failure: Why fmri inferences for spatial extent have inflated false-positive rates. Proceedings of the national academy of sciences, 113(28):7900–7905. Eubank, R. L. (1999). Nonparametric regression and spline smoothing. CRC press. Fan, J. and Gijbels, I. (1996). Local polynomial modelling and its applications. Chapman & Hall/CRC. Fan, J., Yao, Q., and Cai, Z. (2003). Adaptive varying-coefficient linear models. Journal of the Royal Statistical Society: series B (statistical methodology), 65(1):57–80. Fan, J. and Zhang, W. (2008). Statistical methods with varying coefficient models. Statistics and its Interface, 1(1):179. Fan, J., Zhang, W., et al. (1999). Statistical estimation in varying coefficient models. The annals of Statistics, 27(5):1491–1518. Fang, E. 
X., Ning, Y., Li, R., et al. (2020). Test of significance for high-dimensional longitudinal data. Annals of Statistics, 48(5):2622–2645. Fang, Y., Loparo, K. A., and Feng, X. (1994). Inequalities for the trace of matrix product. IEEE Transactions on Automatic Control, 39(12):2489–2490. Faraway, J. J. (1997). Regression analysis for a functional response. Technometrics, 39(3):254–261. Ferraty, F. and Vieu, P. (2006). Nonparametric functional data analysis: theory and practice. Springer Science & Business Media. Ferré, L. and Yao, A.-F. (2005). Smoothed functional inverse regression. Statistica Sinica, pages 665–683. Finn, E. S., Shen, X., Scheinost, D., Rosenberg, M. D., Huang, J., Chun, M. M., Papademetris, X., and Constable, R. T. (2015). Functional connectome fingerprinting: identifying individuals using patterns of brain connectivity. Nature neuroscience, 18(11):1664–1671. Fransson, P., Flodin, P., Seimyr, G. Ö., and Pansell, T. (2014). Slow fluctuations in eye position and resting-state functional magnetic resonance imaging brain activity during visual fixation. European Journal of Neuroscience, 40(12):3828–3835. 149 Gahrooei, M. R., Yan, H., Paynabar, K., and Shi, J. (2018). A novel approach for fusion of heterogeneous sources of data. arXiv preprint arXiv:1803.00138. Gajardo, A., Carroll, C., Chen, Y., Dai, X., Fan, J., Hadjipantelis, P. Z., Han, K., Ji, H., Mueller, H.-G., and Wang, J.-L. (2021). fdapace: Functional Data Analysis and Empirical Dynamics. R package version 0.5.7. Givens, G. H. and Hoeting, J. A. (2012). Computational statistics, volume 703. John Wiley & Sons. Goldsmith, J., Bobb, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2011). Penalized functional regression. Journal of computational and graphical statistics, 20(4):830–851. Goldsmith, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2012). Longitudinal penalized functional regression for cognitive outcomes on neuronal tract measurements. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):453–469. Goldsmith, J., Scheipl, F., Huang, L., Wrobel, J., Di, C., Gellar, J., Harezlak, J., McLean, M. W., Swihart, B., Xiao, L., Crainiceanu, C., and Reiss, P. T. (2020). refund: Regression with Functional Data. R package version 0.1-23. Goldsmith, J., Zipunnikov, V., and Schrack, J. (2015). Generalized multilevel function-on-scalar regression and principal component analysis. Biometrics, 71(2):344–353. Green, P. J. and Silverman, B. W. (1993). Nonparametric regression and generalized linear models: a roughness penalty approach. CRC Press. Grenander, U. (1950). Stochastic processes and statistical inference. Arkiv för matematik, 1(3):195– 277. Greven, S. and Scheipl, F. (2017). A general framework for functional regression modelling. Statistical Modelling, 17(1-2):1–35. Guhaniyogi, R., Qamar, S., and Dunson, D. B. (2017). Bayesian tensor regression. The Journal of Machine Learning Research, 18(1):2733–2763. Guhaniyogi, R. and Spencer, D. (2018). Bayesian tensor response regression with an application to brain activation studies. Technical report, Technical report, UCSC. 2, 13. Guo, W., Kotsia, I., and Patras, I. (2012). Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827. Hadinejad-Mahram, H., Dahlhaus, D., and Blomker, D. (2002). Karhunen-loéve expansion of vector random processes. Technical report, Technical Report No. IKTNT 1019, Communications Technological Laboratory. 150 Hall, A. R. (2004). Generalized method of moments. OUP Oxford. Hall, P., Horowitz, J. 
L., et al. (2007). Methodology and convergence rates for functional linear regression. The Annals of Statistics, 35(1):70–91. Hall, P. and Hosseini-Nasab, M. (2006). On properties of functional principal components analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):109–126. Hall, P. and Hosseini-Nasab, M. (2009). Theory for high-order bounds in functional principal components analysis. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 146, page 225. Cambridge University Press. Hall, P., Müller, H.-G., and Wang, J.-L. (2006). Properties of principal component methods for functional and longitudinal data analysis. The annals of statistics, pages 1493–1517. Hand, D. and Crowder, M. (2017). Practical longitudinal data analysis. Routledge. Hanke, M., Baumgartner, F. J., Ibe, P., Kaule, F. R., Pollmann, S., Speck, O., Zinke, W., and Stadler, J. (2014). A high-resolution 7-tesla fmri dataset from complex natural stimulation with an audio movie. Scientific data, 1:140003. Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054. Happ, C. and Greven, S. (2018). Multivariate functional principal component analysis for data observed on different (dimensional) domains. Journal of the American Statistical Association, 113(522):649–659. Hardle, W. (1990). Applied Nonparametric Regression. Cambridge University Press. Harshman, R. A. et al. (1970). Foundations of the PARAFAC procedure: Models and conditions for an “explanatory" multimodal factor analysis. Technical report, University of California at Los Angeles Los Angeles, CA. Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):757–779. Hastie, T. J. and Tibshirani, R. J. (2017). Generalized additive models. Routledge. Häusler, C. O. and Hanke, M. (2016). An annotation of cuts, depicted locations, and temporal progression in the motion picture “forrest gump”. F1000Research, 5. He, G., Muller, H., and Wang, J.-L. (2018). Extending correlation and regression from multivariate to functional data. In Asymptotics in statistics and probability, pages 197–210. De Gruyter. 151 He, G., Müller, H.-G., and Wang, J.-L. (2003). Functional canonical analysis for square integrable stochastic processes. Journal of Multivariate Analysis, 85(1):54–77. Hedeker, D. and Gibbons, R. D. (2006). Longitudinal data analysis, volume 451. John Wiley & Sons. Hitchcock, F. L. (1928). Multiple invariants and generalized rank of a p-way matrix or tensor. Journal of Mathematics and Physics, 7(1-4):39–79. Hoff, P. D. (2011). Hierarchical multilinear models for multiway data. Computational Statistics & Data Analysis, 55(1):530–543. Hoff, P. D. (2015). Multilinear tensor regression for longitudinal relational data. The annals of applied statistics, 9(3):1169. Hoff, P. D. et al. (2011). Separable covariance arrays via the tucker product, with applications to multivariate relational data. Bayesian Analysis, 6(2):179–196. Hoover, D. R., Rice, J. A., Wu, C. O., and Yang, L.-P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85(4):809–822. Hörmann, S. and Kokoszka, P. (2010). Weakly dependent functional data. The Annals of Statistics, 38(3):1845–1884. Horváth, L. and Kokoszka, P. (2012). Inference for functional data with applications, volume 200. 
Springer Science & Business Media. Hsing, T. and Eubank, R. (2015). Theoretical foundations of functional data analysis, with an introduction to linear operators, volume 997. John Wiley & Sons. Huang, H., Li, Y., and Guan, Y. (2014). Joint modeling and clustering paired generalized longi- tudinal trajectories with application to cocaine abuse treatment data. Journal of the American Statistical Association, 109(508):1412–1424. Huang, J. Z., Wu, C. O., and Zhou, L. (2002). Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika, 89(1):111–128. Huang, J. Z., Wu, C. O., and Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, pages 763–788. Huettel, S. A., Song, A. W., McCarthy, G., et al. (2004). Functional magnetic resonance imaging, volume 1. Sinauer Associates Sunderland, MA. Ivanescu, A. E., Staicu, A.-M., Scheipl, F., and Greven, S. (2015). Penalized function-on-function regression. Computational Statistics, 30(2):539–568. 152 J Mercer, B. (1909). Xvi. functions of positive and negative type, and their connection the theory of integral equations. Phil. Trans. R. Soc. Lond. A, 209(441-458):415–446. James, G. M., Hastie, T. J., and Sugar, C. A. (2000). Principal component models for sparse functional data. Biometrika, 87(3):587–602. Jenkinson, M., Beckmann, C. F., Behrens, T. E., Woolrich, M. W., and Smith, S. M. (2012). Fsl. Neuroimage, 62(2):782–790. Ji, Y., Wang, Q., Li, X., and Liu, J. (2019). A survey on tensor techniques and applications in machine learning. IEEE Access, 7:162950–162990. Johan, H. (1990). Tensor rank is np-complete. Journal of Algorithms, 4(11):644–654. Karhunen, K. (1946). Zur spektraltheorie stochastischer prozesse. Ann. Acad. Sci. Fennicae, AI, 34. Kato, K. (2012). Estimation in functional linear quantile regression. The Annals of Statistics, 40(6):3108–3136. Kessler, D., Angstadt, M., and Sripada, C. S. (2017). Reevaluating “cluster failure” in fmri using nonparametric control of the false discovery rate. Proceedings of the National Academy of Sciences, 114(17):E3372–E3373. Kiers, H. A. (2000). Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics: A Journal of the Chemometrics Society, 14(3):105–122. Kim, J. S., Maity, A., and Staicu, A.-M. (2018). Additive nonlinear functional concurrent model. Statistics and its interface, 11(4):669–685. Kinson, C., Tang, X., Zuo, Z., and Qu, A. (2020). Longitudinal principal component analysis with an application to marketing data. Journal of Computational and Graphical Statistics, 29(2):335–350. Kokoszka, P. and Reimherr, M. (2017). Introduction to functional data analysis. Chapman and Hall/CRC. Kolda, T. and Bader, B. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455– 500. Kowal, D. R., Matteson, D. S., and Ruppert, D. (2017). A bayesian multivariate functional dynamic linear model. Journal of the American Statistical Association, 112(518):733–744. Kuenzer, T., Hörmann, S., and Kokoszka, P. (2021). Principal component analysis of spatially indexed functions. Journal of the American Statistical Association, 116(535):1444–1456. 153 Li, Y. and Guan, Y. (2014). Functional principal component analysis of spatiotemporal point pro- cesses with applications in disease surveillance. Journal of the American Statistical Association, 109(507):1205–1215. Li, Y. and Hsing, T. (2007). On rates of convergence in functional linear regression. 
Journal of Multivariate Analysis, 98(9):1782–1804. Li, Y. and Hsing, T. (2010). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics, 38(6):3321–3351. Li, Y., Qiu, Y., and Xu, Y. (2022). From multivariate to functional data analysis: Fundamentals, recent developments, and emerging areas. Journal of Multivariate Analysis, 188:104806. Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22. Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., and Chen, S. X. (2015). Assess- ing beijing’s pm2. 5 pollution: severity, weather impact, apec and winter heating. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2182):20150257. Lindquist, M. A. (2008). The statistical analysis of fmri data. Statistical science, 23(4):439–464. Liu, B. and Müller, H.-G. (2009). Estimating derivatives for samples of sparsely observed functions, with application to online auction dynamics. Journal of the American Statistical Association, 104(486):704–717. Liu, T., Yuan, M., and Zhao, H. (2017). Characterizing spatiotemporal transcriptome of human brain via low rank tensor decomposition. arXiv preprint arXiv:1702.07449. Liu, Y., Liu, J., and Zhu, C. (2020). Low-rank tensor train coefficient array estimation for tensor- on-tensor regression. IEEE Transactions on Neural Networks and Learning Systems. Lock, E. F. (2018). Tensor-on-tensor regression. Journal of Computational and Graphical Statistics, 27(3):638–647. Loève, M. (1946). Functions aleatoire de second ordre. Revue science, 84:195–206. Long, X., Liao, W., Jiang, C., Liang, D., Qiu, B., and Zhang, L. (2012). Healthy aging: an automatic analysis of global and regional morphological alterations of human brain. Academic radiology, 19(7):785–793. Lu, C. and Wooldridge, J. M. (2020). A gmm estimator asymptotically more efficient than ols and wls in the presence of heteroskedasticity of unknown form. Applied Economics Letters, 27(12):997–1001. 154 Mallows, C. L. (1973). Some comments on cp. Technometrics, 15(4):661–675. McCullagh, P. and Nelder, J. (1989). Generalized linear models ii. Melzer, A., Härdle, W. K., and Cabrera, B. L. (2019). Joint tensor expectile regression for electricity day-ahead price curves. Available at SSRN 3363167. Mori, S., Oishi, K., Jiang, H., Jiang, L., Li, X., Akhter, K., Hua, K., Faria, A. V., Mahmood, A., Woods, R., et al. (2008). Stereotaxic white matter atlas based on diffusion tensor imaging in an icbm template. Neuroimage, 40(2):570–582. Morris, J. S. (2015). Functional regression. Annual Review of Statistics and Its Application, 2:321–359. Müller, H.-G. (2016). Peter hall, functional data analysis and random objects. The Annals of Statistics, 44(5):1867–1887. Müller, H.-G. and Stadtmüller, U. (2005). Generalized functional linear models. the Annals of Statistics, 33(2):774–805. Müller, H.-G. and Yao, F. (2010). Empirical dynamics for longitudinal data. The Annals of Statistics, 38(6):3458–3486. Muschelli, J., Sweeney, E., Lindquist, M., and Crainiceanu, C. (2015). fslr: Connecting the fsl software with r. The R Journal, 7(1):163–175. Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Economet- rica: Journal of the Econometric Society, pages 809–837. Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of econometrics, 4:2111–2245. 
Ombao, H., Lindquist, M., Thompson, W., and Aston, J. (2016). Handbook of neuroimaging data analysis. Chapman and Hall/CRC. O’Donnell, L. J. and Westin, C.-F. (2011). An introduction to diffusion tensor image analysis. Neurosurgery Clinics, 22(2):185–196. Park, S. Y. and Staicu, A.-M. (2015). Longitudinal functional data analysis. Stat, 4(1):212–226. Porter, D., Stirling, D. S., and David, P. (1990). Integral equations: a practical treatment, from spectral theory to applications, volume 5. Cambridge university press. Qu, A. and Li, R. (2006). Quadratic inference functions for varying-coefficient models with longitudinal data. Biometrics, 62(2):379–391. 155 Qu, A., Lindsay, B. G., and Li, B. (2000). Improving generalised estimating equations using quadratic inference functions. Biometrika, 87(4):823–836. Ramsay, J. O. (1982). When the data are functions. Psychometrika, 47(4):379–396. Ramsay, J. O. and Dalzell, C. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):539–561. Ramsay, J. O. and Silverman, B. W. (2005). Functional data analysis. Springer series in statistics. Ramsay, J. O. and Silverman, B. W. (2007). Applied functional data analysis: methods and case studies. Springer. Rao, C. R. (1958). Some statistical methods for comparison of growth curves. Biometrics, 14(1):1– 17. Raskutti, G., Yuan, M., and Chen, H. (2019). Convex regularization for high-dimensional multire- sponse tensor regression. Ann. Statist., 47(3):1554–1584. Rice, J. A. and Silverman, B. W. (1991). Estimating the mean and covariance structure non- parametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological), 53(1):233–243. Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57(1):253–259. Riss, F. and Sz-Nagy, B. (1990). Functional analysis. Dover Publications Inc. Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Cambridge university press. Scheipl, F., Staicu, A.-M., and Greven, S. (2015). Functional additive mixed models. Journal of Computational and Graphical Statistics, 24(2):477–501. Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press. Sengupta, A., Kaule, F. R., Guntupalli, J. S., Hoffmann, M. B., Häusler, C., Stadler, J., and Hanke, M. (2016). A studyforrest extension, retinotopic mapping and localization of higher visual areas. Scientific data, 3:160093. Shen, X., Tokoglu, F., Papademetris, X., and Constable, R. T. (2013). Groupwise whole-brain parcellation from resting-state fmri data for network node identification. Neuroimage, 82:403– 415. Sidiropoulos, N. D. and Bro, R. (2000). On the uniqueness of multilinear decomposition of n-way 156 arrays. Journal of Chemometrics: A Journal of the Chemometrics Society, 14(3):229–239. Staicu, A.-M., Islam, M. N., Dumitru, R., and Heugten, E. v. (2020). Longitudinal dynamic functional regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(1):25–46. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133. Su, L., Murtazashvili, I., and Ullah, A. (2013). Local linear gmm estimation of functional coefficient iv models with an application to estimating the rate of return to schooling. Journal of Business & Economic Statistics, 31(2):184–207. Sun, Y. (2016). 
Functional-coefficient spatial autoregressive models with nonparametric spatial weights. Journal of Econometrics, 195(1):134–153. Taris, T. (2000). Longitudinal data analysis. Sage. Tian, R., Xue, L., and Liu, C. (2014). Penalized quadratic inference functions for semiparamet- ric varying coefficient partially linear models with longitudinal data. Journal of Multivariate Analysis, 132:94–110. Tran, K. C. and Tsionas, E. G. (2009). Local gmm estimation of semiparametric panel data with smooth coefficient models. Econometric Reviews, 29(1):39–61. Tucker, L. R. (1958). Determination of parameters of a functional relation by factor analysis. Psychometrika, 23(1):19–23. Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311. Turner, E. L., Li, F., Gallis, J. A., Prague, M., and Murray, D. M. (2017). Review of recent methodological developments in group-randomized trials: part 1—design. American journal of public health, 107(6):907–915. Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer. van Delft, A. and Eichler, M. (2018). Locally stationary functional time series. Electronic Journal of Statistics, 12(1):107–170. Van Hecke, W., Emsell, L., and Sunaert, S. (2016). Diffusion tensor imaging: a practical handbook. Springer. Viviani, R., Grön, G., and Spitzer, M. (2005). Functional principal component analysis of fmri 157 data. Human brain mapping, 24(2):109–129. Wager, T. D. and Lindquist, M. A. (2015). Principles of fmri. New York: Leanpub. Wahba, G. (1990). Spline models for observational data, volume 59. Siam. Wand, M. and Jones, M. (1995). Kernel smoothing chapman & hall. Monographs on Statistics and Applied Probability, London. Wang, J.-L., Chiou, J.-M., and Müller, H.-G. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 3:257–295. Wang, L. (2008). Karhunen-Loéve expansions and their applications. London School of Economics and Political Science (United Kingdom). Wang, L., Li, H., and Huang, J. Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association, 103(484):1556–1569. Wasserman, L. (2006). All of nonparametric statistics. Springer Science & Business Media. Wu, C. O. and Chiang, C.-T. (2000). Kernel smoothing on varying coefficient models with longitudinal dependent variable. Statistica Sinica, pages 433–456. Xiong, Y., Zhou, X. J., Nisi, R. A., Martin, K. R., Karaman, M. M., Cai, K., and Weaver, T. E. (2017). Brain white matter changes in cpap-treated obstructive sleep apnea patients with residual sleepiness. Journal of Magnetic Resonance Imaging, 45(5):1371–1378. Xu, Y., Li, Y., and Nettleton, D. (2018). Nested hierarchical functional data modeling and inference for the analysis of functional plant phenotypes. Journal of the American Statistical Association, 113(522):593–606. Xue, L., Shu, X., and Qu, A. (2018). Time-varying estimation and dynamic model selection with an application of network data. Statistica Sinica. Yao, F. and Lee, T. C. (2006). Penalized spline models for functional principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):3–25. Yao, F., Müller, H.-G., Clifford, A. J., Dueker, S. R., Follett, J., Lin, Y., Buchholz, B. A., and Vogel, J. S. (2003). 
Shrinkage estimation for functional principal component scores with application to the population kinetics of plasma folate. Biometrics, 59(3):676–685. Yao, F., Müller, H.-G., and Wang, J.-L. (2005). Functional linear regression analysis for longitudinal data. The Annals of Statistics, pages 2873–2903. 158 Yao, W. and Li, R. (2013). New local estimation procedure for a non-parametric regression function for longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):123–138. Yu, H., Tong, G., and Li, F. (2020). A note on the estimation and inference with quadratic inference functions for correlated outcomes. Communications in Statistics-Simulation and Computation, pages 1–12. Yu, K. and Jones, M. (2004). Likelihood-based local linear estimation of the conditional variance function. Journal of the American Statistical Association, 99(465):139–144. Zhang, C., Peng, H., and Zhang, J.-T. (2010). Two samples tests for functional data. Communications in Statistics—Theory and Methods, 39(4):559–578. Zhang, J.-T. (2013). Analysis of variance for functional data. Chapman and Hall/CRC. Zhang, J.-T. and Chen, J. (2007). Statistical inferences for functional data. The Annals of Statistics, 35(3):1052–1079. Zhang, L. and Banerjee, S. (2021). Spatial factor modeling: A bayesian matrix-normal approach for misaligned data. Biometrics. Zhang, X., Li, L., Zhou, H., Shen, D., et al. (2014). Tensor generalized estimating equations for longitudinal imaging analysis. arXiv preprint arXiv:1412.6592. Zhang, X. and Wang, J.-L. (2016). From sparse to dense functional data and beyond. The Annals of Statistics, 44(5):2281–2321. Zhao, M., Gao, Y., and Cui, Y. (2020). Variable selection for longitudinal varying coefficient errors-in-variables models. Communications in Statistics-Theory and Methods, pages 1–26. Zhao, M., Xu, X., Zhu, Y., Zhang, K., and Zhou, Y. (2021). Model estimation and selection for partial linear varying coefficient ev models with longitudinal data. Journal of Applied Statistics, pages 1–23. Zheng, X., Xue, L., and Qu, A. (2018). Time-varying correlation structure estimation and local- feature detection for spatio-temporal data. Journal of Multivariate Analysis, 168:221–239. Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552. PMID: 24791032. Zhou, J. and Qu, A. (2012). Informative estimation and selection of correlation structure for longitudinal data. Journal of the American Statistical Association, 107(498):701–710. Zhou, L., Huang, J. Z., and Carroll, R. J. (2008). Joint modelling of paired sparse functional data 159 using principal components. Biometrika, 95(3):601–619. Zhou, L., Huang, J. Z., Martinez, J. G., Maity, A., Baladandayuthapani, V., and Carroll, R. J. (2010). Reduced rank mixed effects models for spatially correlated hierarchical functional data. Journal of the American Statistical Association, 105(489):390–400. Zhu, H., Fan, J., and Kong, L. (2014). Spatially varying coefficient model for neuroimaging data with jump discontinuities. Journal of the American Statistical Association, 109(507):1084–1098. Zhu, H., Li, R., and Kong, L. (2012). Multivariate varying coefficient model for functional responses. Annals of statistics, 40(5):2634. 160