NOVEL METHODS FOR FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS TO NEUROIMAGING STUDIES

By

Pratim Guha Niyogi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2022

ABSTRACT

NOVEL METHODS FOR FUNCTIONAL DATA ANALYSIS WITH APPLICATIONS TO NEUROIMAGING STUDIES

By Pratim Guha Niyogi

In recent years, there has been explosive growth in different neuroimaging studies such as functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI). The data generated from such studies are often complex in structure, collected for different individuals, at various time-points, and across various modalities, thus paving the way for interesting problems in statistical methodology for the analysis of such data. In this dissertation, some efficient methodologies with considerable theoretical development are proposed; they have desirable statistical properties and can be useful not only in neuroimaging but also in other scientific domains. A brief overview of the dissertation is provided in Chapter 1; in particular, the different kinds of data structures that are commonly used in subsequent chapters are described. Some useful mathematical results frequently used in the theoretical derivations of various chapters are also provided. Moreover, we raise some fundamental questions that arise due to specific data structures with applications in neuroimaging and answer these questions in subsequent chapters. In Chapter 2, we consider the problem of estimation of coefficients in constant linear effect models for semi-parametric functional regression with functional response, where each response curve is decomposed into the overall mean function indexed by a covariate function with constant regression parameters and a random error process. We provide an alternative semi-parametric solution that estimates the parameters using a quadratic inference approach in which the basis functions are estimated non-parametrically. Therefore, the proposed method can be easily implemented without assuming any working correlation structure. Moreover, we achieve a parametric √n-convergence rate of the proposed estimator under a proper choice of bandwidth and establish its asymptotic normality. A multi-step estimation procedure to simultaneously estimate the varying-coefficient functions using a local linear generalized method of moments (GMM) based on continuous moment conditions is developed in Chapter 3 under heteroskedasticity of unknown form. To incorporate spatial dependence, the continuous moment conditions are first projected onto eigen-functions and then combined by weighted eigen-values. This approach solves the challenges of using an inverse covariance operator directly. We propose an optimal instrumental variable that minimizes the asymptotic variance function among the class of all local linear GMM estimators, and it is found to outperform the initial estimates that do not incorporate spatial dependence. Neuroimaging data are increasingly being combined with other non-imaging modalities, such as behavioral and genetic data. The data structure of many of these modalities can be expressed as time-varying multidimensional arrays (tensors), collected at different time-points on multiple subjects. In Chapter 4, we consider a new approach to study neural correlates in the presence of tensor-valued brain images and tensor-valued predictors, where both data types are collected over the same set of time-points.
We propose a time-varying tensor regression model with an inherent structural composition of responses and covariates. This development is a non-trivial extension of function-on-function concurrent linear models for complex and large structural data where the inherent structures are preserved. Through extensive simulation studies and real-data analyses, we demonstrate the opportunities and advantages of the proposed methods. Copyright by PRATIM GUHA NIYOGI 2022 To my mother Maya for her sacrifices. To my father Kajal for his support and protection. To my wife Debolina for her unconditional love and support. v ACKNOWLEDGEMENTS My journey till the completion of this dissertation has been a long one which has taught me many things. Now that I look back into all those moments, all I can feel is gratitude towards time, place, events and some very important people. With humbleness, I thank my academic advisors Dr. Ping- Shou Zhong and Dr. Tapabrata Maiti for providing invaluable technical support during my doctoral journey without which my journey would not have reached its goal. They trained me to pursue research independently, think out of the box, and deal with multiple problems simultaneously. Their inspiration boosted my confidence and persuaded me to contribute to statistical sciences. I am especially grateful to Dr. Zhong for continuing to mentor me even after moving to University of Illinois at Chicago. This long-distance mentoring was difficult, but he carried on with it effortlessly and never left me struggling. I thank Dr. Lyudmila Sakhanenko and Dr. Alla Sikorskii for serving in my dissertation committee and for providing helpful constructive suggestions to improve the quality of this dissertation. Reflecting upon the start of my journey, I would like to thank the Graduate Students’ Selection Committee of 2016 for offering me the opportunity to pursue the doctoral program here at Michigan State University (MSU) and for providing assistantship during my education. Dr. Tapabrata Maiti, the then graduate director, considered me worthy of the opportunity and expressed his belief in me when I was just another fresh Masters from India venturing into this foreign land for a new endeavour. His encouraging words have helped me believe in myself and seek excellence. I thank Dr. Alla Sikorskii for supporting me through research assistantship for more than three years from her grants and helping me grow in collaborative research. This opportunity gifted me the beneficial experience of interdisciplinary research and exposure to statistics in health sciences. I will never forget her immense help and support. I would like to thank Drs. Shlomo Levental, Ping-Shou Zhong and Lyudmila Sakhanenko for teaching four very important courses: probability theory, linear models, and asymptotics and non-parametric curve estimation, respectively, with care and depth. The lessons learned from those courses have helped a lot in my research. I am thankful to my vi seniors especially Rejaul Karim, who was my roommate for three years and I spent some of the best days at East Lansing in his company, and Atrayee Majumder for her valuable suggestions. I would like to make a special mention of Andy Hufford for being approachable and helping out with just about any technology-related issues. Also, I thank Tami Hankey for her assistance in completing numerous paper-works all along the way. 
A huge role on this dissertation was played by the high- performance computing technology of MSU without which running codes with voluminous data would not be possible. I am therefore thankful to computational resources and services provided by the Institute for Cyber-Enabled Research at MSU. There are also people outside the University who have been of immense help. The person to whom I am extremely grateful for providing me with the opportunity to spend a wonderful summer at the Biostatistics department of Johns Hopkins University is Dr. Martin Lindquist. I learnt many things from him and am honoured to have been able to collaborate with him in one of my projects. I would also like to thank Dr. Xiaohong Joe Zhou of University of Illinois at Chicago for providing me with valuable data that helped me validate my research findings. Going back to my roots, I would like to thank my high-school statistics teacher Ashoke Panda, faculty of Hare School, Kolkata who for the first time introduced me to this colorful subject and motivated me to pursue it. I thank Arindam Bhattacharyya, who has been a valuable guide all the way upto my Masters. I am thankful to my alma maters Presidency University and Indian Statistical Institute for their intellectual environment that shaped my critical thinking ability and propelled me to pursue research, and I will forever remain grateful to my roots. I am thankful to all professors who helped me, guided me, and motivated me during those times. I am especially thankful to Drs. Saurabh Ghosh and Ayanendranath Basu of Indian Statistical Institute and Dr. Subhra Sankar Dhar of Indian Institute of Technology, Kanpur who engaged me in research at various phases of my educational levels; their training and treatment were very beneficial to me. Nevertheless, I am indebted to Dr. Partha Sarathi Chakraborty for his excellent teaching, which created a strong foundation during my undergraduate days. Also, my sincere gratitude to the Ramakrishna Mission Institute of Culture for awarding me the Debesh-Kamal Scholarship for Higher Studies in 2016 vii which provided me the necessary funds to travel to USA. My family has been a pillar of strength throughout my life. I thank my parents Maya and Kajal Guha Niyogi for not stopping me from trying things out and letting me fail so I could learn the right way. I am grateful to them for never comparing me to anyone else, for always believing in me, and for teaching me humility and respect. I am thankful to my wife Debolina Chatterjee for her unconditional love and support and for being the first reader of all my writings. She is my “better half” in uncountable ways and I thank her for bringing out the best in me. Last but not least, I thank the Almighty for this life and for making things fall into place. Pratim Guha Niyogi June 29, 2022 East Lansing, MI viii TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii KEY TO ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii CHAPTER 1 PROLOGUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Big data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Functional data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
1 1.1.2 Tensor data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Some applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.1 (Functional) magnetic resonance imaging . . . . . . . . . . . . . . . . . . 6 1.2.2 Diffusion tensor imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3 Mathematical preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.1 Notations of tensor/matrix object . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.2 Different kinds of products . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.3 Tensor decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3.4 Some useful results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.4 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 CHAPTER 2 IMPROVING QUADRATIC INFERENCE APPROACH FOR FUNC- TIONAL RESPONSES . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Functional response model and estimation procedure . . . . . . . . . . . . . . . . 22 2.2.1 Basic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2.2 Incorporating eigen-functions in QIF . . . . . . . . . . . . . . . . . . . . . 24 2.2.3 Estimation of eigen-functions . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.1 Simulation set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.2 Comparison and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.5 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.5.1 Beijing’s PM pollution study . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.5.2 DTI study for sleep apnea patients . . . . . . . . . . . . . . . . . . . . . . 48 2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 2.7 Technical details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.7.1 Some preliminary definitions and concepts of operators . . . . . . . . . . . 51 2.7.2 Some useful lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.7.3 Proof of Theorem 2.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 ix 2.7.4 Proof of Theorem 2.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 CHAPTER 3 ESTIMATION FOR VARYING-COEFFICIENT MODEL IN FUNC- TIONAL DATA ANALYSIS UNDER UNKNOWN HETEROSKEDAS- TICITY: A GMM-BASED APPROACH . . . . . . . . . . . . . . . . . . 67 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.2 Varying-coefficient functional model and moment conditions . . . . . . . . . . . . 73 3.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.2.2 Local-linear mean-zero function . . . . . . . . . . . . . . . . . . . . . . . 74 3.3 Multi-step estimation procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.3.1 Step-I: Initial least squares estimates . . . . . . . . . . . . . . . . . . . . . 76 3.3.2 Step-II: Intermediate steps . . . . . . . . . . . . . . . . . . . 
. . . . . . . 77 3.3.3 Step-III: Final estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.4 Asymptotic results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.5 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.6 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.8 Technical details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.8.1 Some useful lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.8.2 Proof of Theorem 3.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3.8.3 Proof of Theorem 3.4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 CHAPTER 4 TENSOR BASED SPATIO-TEMPORAL MODELS FOR ANALY- SIS OF FUNCTIONAL NEUROIMAGING DATA . . . . . . . . . . . . 100 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.2 Tensor-on-tensor functional regression . . . . . . . . . . . . . . . . . . . . . . . . 105 4.3 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 4.3.1 Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 4.3.2 Convergence rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.4 Algorithm and implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.5 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 4.6 Application to ForrestGump-data . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.6.1 Details about the data-set . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.6.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.8 Technical details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.8.1 Technical lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.8.2 Proof of Theorem 4.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.8.3 Proof of Theorem 4.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 CHAPTER 5 EPILOGUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 x BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 xi LIST OF TABLES Table 2.1 Performance of the estimation procedure where the residuals are generated from Brownian motion (a). Mean of the estimated coefficients, standard de- viation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 34 Table 2.2 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 
35 Table 2.3 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 36 Table 2.4 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 37 Table 2.5 Performance of the estimation procedure where the residuals are generated from Ornstein-Uhlenbeck process (c) with 𝜇0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . 38 Table 2.6 Performance of the estimation procedure where the residuals are generated from Ornstein-Uhlenbeck process (c) with 𝜇0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . 39 Table 2.7 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 40 Table 2.8 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 41 xii Table 2.9 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 42 Table 2.10 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 43 Table 2.11 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 44 Table 2.12 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. 
Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. . . . . . . . . . . . . . . . . . . . . . . . 45 Table 2.13 Apnea-data results: Estimated values and associated standard errors for the regression coefficients are provided upto four decimal places based on the existing and proposed methods. First line corresponding to each ROI shows results based on initial estimates and the second line corresponds to that of proposed estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Table 3.1 Performance of the estimation procedure with SNR𝜃 = 0.5 . . . . . . . . . . . . 86 Table 3.2 Performance of the estimation procedure with SNR𝜃 = 1 . . . . . . . . . . . . . 87 Table 4.1 Results of simulation situations (Situation1) where each modes are assumed to be independent for X (𝑡) and E (𝑡) for fixed time-points. Here we assume each of { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 and {𝜂 𝑞1 ,𝑞2 } 𝑞(𝑘) 1 ,𝑞 2 are independent for ( 𝑝 1 , 𝑝 2 ) and (𝑞 1 , 𝑞 2 ) respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 Table 4.2 Results of simulation situations (Situation2)a where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with exponential covariance function. . . . . . . . . . . . . . . . . . . . . . . . 119 xiii Table 4.3 Results of simulation situations (Situation2)b where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with Matérn covariance function. . . . . . . . . . . . . . . . . . . . . . . . . . 120 Table 4.4 ForrestGump-data: Summary statistics across participants. . . . . . . . . . . . . 125 xiv LIST OF FIGURES Figure 1.1 A schematic diagram of an MRI scanner. . . . . . . . . . . . . . . . . . . . . . 8 Figure 1.2 A typical example of MRI scan of healthy human brain. Source: Long et al. (2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 2.1 Beijing2017-data: Reading of hourly PM2.5 measures for twelve different locations over 608 hourly time-points during January 2017. . . . . . . . . . . . 20 Figure 2.2 Beijing2017-data results: Scree plots of fraction of variance explained (FVE). . 48 Figure 3.1 Apnea-data: Smoothed average path length (APL) from 29 patients over different thresholds. Black solid line indicates the mean of APL over thresholds. 72 Figure 3.2 Apnea-data analysis: Plots of estimated coefficient functions of age (top panel) and number of lapses (bottom panel) for average path length associated with Fractional Anisotropy (FA) in DTI analysis. . . . . . . . . . . . . . . . . . . . 89 Figure 4.1 Multi-modal-data: An example of multi-modal data analysis which seeks to explore the relationship between EEG and fMRI data. . . . . . . . . . . . . . . 100 Figure 4.2 ForrestGump-data: (Top panel) BOLD fMRI for an example subject during their first run (see Section 4.6 for details). 35 axial slices (thickness 3.0 mm) represents the third mode of the tensor with 80 × 80 voxels (3.0 × 3.0 mm) in- plate resolution measured at every repetition time (TR) of 2 seconds. (Bottom panel) fMRI data-set consists of a time series of 3D images (tensors) at each TR (source: Wager and Lindquist (2015)). . . . . . . . . . . . . . . . . . 
. . . 102 Figure 4.3 ForrestGump-data: Summary statistics for the parameters estimates of head motion correction across TRs and participants. (Left panel) Magnitude of three rotational parameters (in radians) and (Right panel) Magnitude of three translation parameters (in millimeters) for each individual on each of the 451 TRs. In each plot, solid black line indicates mean over the individuals through TRs and black dotted lines indicate mean±2sd over the individuals through TRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Figure 4.4 ForrestGump-data: Covariates of interest. Cartesian and polar coordinates and pupil area are shown across TRs and participants. . . . . . . . . . . . . . . 124 Figure 4.5 ForrestGump-data results: Estimate of the coefficient 𝜷1 (𝑡) corresponding to visual feature distance of eye-gaze for different location. Legends for different parcellation as mentioned in Shen et al. (2013) are also provided. . . . . . . . . 127 xv Figure 4.6 ForrestGump-data results: Estimate of the coefficient 𝜷2 (𝑡) corresponding to visual feature angle of eye-gaze for different location. Legends for different parcellation as mentioned in Shen et al. (2013) are also provided. . . . . . . . . 128 Figure 4.7 ForrestGump-data results: Estimate of the coefficient 𝜷3 (𝑡) corresponding to visual feature pupil area for different location. Legends for different parcel- lation as mentioned in Shen et al. (2013) are also provided. . . . . . . . . . . . 129 xvi LIST OF ALGORITHMS Algorithm 2.1 Estimation of 𝜷 using the Quasi-Newton method with halving. . . . . . . 26 Algorithm 3.1 Estimation of 𝜷(𝑠) : 𝑠 ∈ S for the proposed local-GMM based estima- tion procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Algorithm 4.1 Estimation of 𝜷(𝑡) : 𝑡 ∈ [0, 𝑇] for tensor based function-on-function regression method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 xvii KEY TO ABBREVIATIONS ADC Apparent Diffusion Coefficient BOLD Blood Oxygen Level Dependent DTI Diffusion Tensor Imaging FA Fractional Anisotropy (f)MRI (Functional) Magnetic Resonance Imaging FDA Functional Data Analysis FPCA Functional Principal Component Analysis FVE Fraction of variance explained GEE Generalized Estimating Equations GLM Generalized Linear Model (G)MM (Generalized) Method of Moments i.i.d. Independent and Identically Distributed LDA Longitudinal Data Analysis MD Mean Diffusivity OSA Obstructive Sleep Apnea PCA Principal Component Analysis PM Particulate Matter QIF Quadratic Inference Functions ROI Region of Interest TR Repetition Time VCM Varying Coefficient Model xviii CHAPTER 1 PROLOGUE In this chapter, we provide a brief overview of the relevant topics and state the problems that we are going to present in the following chapters. 1.1 Big data analysis The advances in scientific research and technological developments have led to the collection and storage of huge amounts of data which are not only voluminous but also complex in structure. These are commonly called “Big data”. Big data give rise to statistical problems in natural science, engineering, social sciences, and humanities. The analysis of such data having massive volumes and complex structures for decision-making and scientific discovery is a challenge faced by statisticians and computer scientists, which requires innovative statistical and computational methods, sophisticated statistical modelling, and theoretical results. 
Collectively, this is known as “Data science” which has nowadays become a multi-disciplinary field involving knowledge from various disciplines for developing new methodologies for various kinds of data: low or high dimensional; structured, unstructured, or semi-structured. In recent decades, big data has become a significant part of scientific interest, where images, videos, texts, and other objects can be considered as a form of massive data. Therefore, the statistician plays an important role in proposing new methodologies for discovering information from available big data. In the following subsections, we discuss the different data structures that lead to different methodologies. In Sub-section 1.1.1 we discuss functional data analysis, and in Sub-section 1.1.2 we discuss analysis of a complex structured multidimensional array (also known as tensor data analysis). 1.1.1 Functional data analysis This subsection is dedicated to a discussion on functional data analysis (FDA) and its relevance in the dissertation. Some relevant review articles include Morris (2015); Müller (2016); Wang 1 et al. (2016). Owing to the types of data generated in various scientific research in the fields of biology, audiology, environmental sciences, earth sciences, and economics (to name a few), there was a need for a statistical methodology that could analyze data which are observed as functions varying over time, space, or other continuum domains. This led to the development of functional data analysis. Although the term “functional data analysis” was coined by Jim Ramsey in the famous 1982 paper (Ramsay, 1982), its origin dates back to the late 1940s in the Ph.D. theses of Kari Karhunen (Karhunen, 1946) and Ulf Grenander (Grenander, 1950). In their seminal works, Karhunen and Grenander respectively, discussed the decomposition of square integrable continuous-time stochastic processes into series expansions to obtain representations in a Hilbert space. The idea to expand random curves appeared in Rao (1958) and Tucker (1958) around the same time. In the last three decades, FDA gained considerable momentum in statistics literature, of which some significant works are Ramsay and Silverman (2005, 2007); Ferraty and Vieu (2006); Horváth and Kokoszka (2012); Zhang (2013); Hsing and Eubank (2015) and some notable survey articles are Morris (2015); Wang et al. (2016); Greven and Scheipl (2017); Li et al. (2022). The main feature that makes functional data distinct from other types of data, especially those having a large 𝑝 (number of parameters) and small 𝑛 (sample size) framework, is that the functional data are infinite-dimensional in nature, since the underlying statistical quantity of the measurement is a curve or a surface over a continuum domain. Thus, the commonly used classical multivariate statistical methods (Anderson et al., 1958) do not suffice for these types of analyses. Moreover, in asymptotic analysis, the space between the function arguments is assumed to approach zero, hence making the number of arguments tend to infinity. This is essentially the large 𝑝 (rather 𝑝 𝑛 , where the number of arguments of the function is 𝑝 𝑛 for the sample size 𝑛) problem in high-dimensional statistics. In fact, this dimensionality issue is a blessing in disguise because we end up with more data, with an extra cost being paid by the smoothness assumption on some standard spaces. 
The smoothness assumption tells us that the information from measurements at neighboring arguments can be pooled, thereby overcoming the curse of dimensionality. Some significant research maneuvering the use of FDA include 2 • Functional generalized linear models (Müller and Stadtmüller, 2005) • Functional sliced inverse regression (Ferré and Yao, 2005) • Multi-level functional data analysis (Crainiceanu et al., 2009; Huang et al., 2014; Xu et al., 2018) • Functional time series (Hörmann and Kokoszka, 2010; Aue et al., 2015; Kowal et al., 2017; van Delft and Eichler, 2018) • Spatially dependent functional data (Zhu et al., 2014; Kuenzer et al., 2021) • Spatio-temporal point process (Li and Guan, 2014; Goldsmith et al., 2015) • Longitudinal functional data analysis (Goldsmith et al., 2012; Chen and Müller, 2012; Park and Staicu, 2015; Staicu et al., 2020) In FDA, continuous functional data are available at every time-point or can at least be evaluated for some time-points. In practice, however, data are observed in discrete domains such as time- points, with or without measurement errors. For the demonstration of the theoretical results, without loss of generality, it suffices to assume that the functional data are observed continuously without measurement error. Note that the theory behind the estimation can be different for different measurement schedules/ sampling plans such as densely or sparsely observing data over time-points. In most cases, the analyses of sparse and dense functional data are different, although sparse and dense data are asymptotic concepts and are difficult to use in practice. While for dense functional data, one can smooth each of the curves separately and then proceed with further estimation and inference procedures based on pre-smoothed curves (Castro et al., 1986; Rice and Silverman, 1991; Zhang and Chen, 2007), for sparse functional data, the pre-smoothing step is not required (Yao et al., 2005). Various smoothing techniques are available in non-parametric literature to deal with functional data. The different types of non-parametric smoothing techniques commonly used in the literature are 3 • Spline smoothing (Rice and Silverman, 1991; Cai and Hall, 2006) • B-spline (Cardot et al., 1999; James et al., 2000; Rice and Wu, 2001) • Penalized splines (Ruppert et al., 2003; Yao and Lee, 2006) • Local polynomial smoothing (Fan and Gijbels, 1996; Zhang and Chen, 2007; Yao and Li, 2013) Principal component analysis (PCA) in FDA is a generalization of the classical high-dimensional statistics for finite-dimensional matrix-valued observations to the case of infinite-dimensional con- tinuum domain, and it is termed as functional principal component analysis (FPCA). The main objective of FPCA is to express the underlying stochastic processes as a truncated sum of a count- able sequence of uncorrelated random variables, thereby reducing the problem from infinite into that of finite dimension, so that the tools of multivariate data analysis can be applied to the resulting random vector of scores. FPCA based on spline smoothing was studied in James et al. (2000); Zhou et al. (2008), whereas, FPCA based on kernel were discussed in Hall et al. (2006); Müller and Yao (2010); Li and Hsing (2010). Asymptotic theories based on kernel smoothing (a.k.a. local polynomial smoothing) are more profound in the literature. 
For fully observed dense data, Hall and Hosseini-Nasab (2009) derived a stochastic expansion of estimators of eigen-values and eigen-functions based on the principles of operator theory, the statistical implementations of which were provided in Hall and Hosseini-Nasab (2006). For sparse functional data, FPCA approach was studied in Yao et al. (2005); Liu and Müller (2009). Hall et al. (2006) discussed the theoretical properties of FPCA based on local linear smoother. In one of the seminal works in Li and Hsing (2010), an estimation procedure was discussed for all types of sampling strategies. It was found that in some specific ranges for the rate of the number of functional points, for dense sampling strategy, pre-smoothing was found to be asymptotically negligible, and other important commonly used statistics such as mean, covariance, and eigen-components could be estimated using the parametric rate. On the other hand, for sparse functional data, those statistics could only be estimated with a non-parametric convergence rate. It was shown that the estimation of the eigen-values was not 4 as sensitive to the sampling design as the estimation of the eigen-functions. This was the first time where a phase transition was observed. Zhang and Wang (2016) investigated local linear estimation of mean and covariance functions with general weighting schemes, where equal weight per observation and equal weight per subject were two special cases. All works mentioned till now were based on univariate functional data. Multivariate FPCA was discussed in Viviani et al. (2005); Wang (2008); Berrendero et al. (2011); Chiou et al. (2014); Happ and Greven (2018) among many others. Regression analysis for FDA is one of the most active research domains for the analysis of functional data wherein the modelling of the data depends on the type of variables. For example, • when the response is functional, but the covariates are vectors, the approach is called function- on-scalar regression (Zhu et al., 2014; Chen et al., 2019). • when the response is vector valued, while the covariate is functional, this approach is called scalar-on-function regression (Cardot et al., 1999, 2003; Müller and Stadtmüller, 2005; Cai and Hall, 2006; Hall et al., 2007; Li and Hsing, 2007; Goldsmith et al., 2011; Kato, 2012). In this approach, the covariates and the varying coefficient are expressed as the same set of orthogonal functional bases. • when the response and covariates are both functional, the approach is called function-on- function regression. It was introduced by (Ramsay and Dalzell, 1991). In this regression set-up, a varying coefficient model (Hastie and Tibshirani, 1993) was implemented. These regression models are often referred to as concurrent linear models. Recent literature for functional concurrent linear models include Faraway (1997); Zhang and Chen (2007); Zhang et al. (2010); Wang et al. (2016); Fang et al. (2020). Other techniques to estimate the regression function can be found in Hoover et al. (1998); He et al. (2003); Yao et al. (2005); He et al. (2018). In this dissertation, in Chapter 3, we consider the first case where the response is functional but the covariates are vectors and we consider the third case where the covariate and response both are 5 functional in Chapters 2 and 4. 1.1.2 Tensor data analysis In many scientific researches, for instance, in areas of imaging studies, network sciences, economics, computer technologies, genetics, recommendation systems, etc. data appear structured. 
Such high- dimensional as well as multi-dimensional structures have raised various challenges to their analysis. Thus, multidimensional arrays, popularly known as “tensors”, came as a savior for understanding the structure of these complex data. The tensor as a generalization of matrices appeared for the first time in the literature during 1928 (Hitchcock, 1928) and was used to represent and store data efficiently. Ever since then, its use has seen a boom in the scientific community. Some significant research surveys can be found in Ji et al. (2019); Bi et al. (2021). Sub-section 1.3.1 discusses some basic notation and properties of the tensor. We will consider such kind of data in chapter 4 in more detail. 1.2 Some applications In December 2, 1956, eminent statistician Professor P. C. Mahalanobis emphasized that Statistics is the universal tool of inductive inference, research in natural and social sciences, and technological applications. Statistics, therefore, must have a clearly defined purpose, either in the pursuit of knowledge or in the promotion of human welfare. In this dissertation, some advanced methodologies driven by their applications are proposed for two types of neuroimaging studies. 1.2.1 (Functional) magnetic resonance imaging A remarkable research area developed in magnetic resonance imaging (MRI) for studying the structure and functioning of the human brain in the years following 1977 after the first MRI scanner 6 was developed. In these studies, the differences in magnetic properties of certain molecules (especially water molecules) in the brain are measured by using the fact that their density differs in different media like air, white matter, gray matter, blood vessels, and tumors. Functional MRI, also known as fMRI, has recently gained popularity as a pre-surgical procedure to map the functional architecture of a subject’s brain without exciting the tissues associated with some critical skills like vision, hearing, etc. Occurrence of neural activity in a certain portion of the brain results in increased metabolic activity, causing a rush of oxygen-carrying hemoglobin in that particular area, whereas immediately following the end of neural activity, the oxygen level drops. These changes in the oxygen levels give rise to a measure called the blood oxygen level dependent (BOLD) signal, which is the ratio between oxygenated and de-oxygenated hemoglobin in blood. The objective of fMRI studies is to observe the neural activity of the brain in instantaneous time with high spatial resolution by detecting changes in the BOLD signal. Typically, the BOLD signal happens to rise well above the baseline with a peak at around 6 seconds following a neural activity, and decays back to baseline over a period of 20 seconds. Due to the observable nature of neural activities, we can use fMRI data to make various inferences if we can assess the relationship between neural activity and the BOLD response. In fMRI data, images are collected over time; therefore, in order to maintain high temporal resolution, spatial resolution is sacrificed. High-resolution structural images are used to get back the spatial resolution from the fMRI data, and the spatial coordinates are used to identify the activation regions during the fMRI scans by examining the aligned structural coordinates. The times between two successive scans are called repetition time (TR). 
Subjects are aligned in the scanner, which is assumed to be a three-dimensional coordinate system with coordinates (X, Y, Z) referenced to the bore of the magnet, where the Z direction runs along the bore (from feet to head) and the X and Y directions refer to the plane perpendicular to the Z axis. A schematic diagram of the MRI scanner is provided in Figure 1.1. The brain is naturally a continuous medium due to the existence of neurons at almost all coordinates, but it can be made discrete by dividing the brain into a set of cubes. These cubes are commonly called voxels. A typical MRI scan of a healthy human brain is provided in Figure 1.2.

Figure 1.1 A schematic diagram of an MRI scanner (superconducting primary electromagnet coil, scanner bore, and scanner table).

fMRI data exhibit spatial and temporal correlations; therefore, we need sophisticated tools to analyze them. For example, brain tissue in neighboring voxels is supplied by the same kind of vasculature; as a result, a large response in one voxel in a neighborhood increases the probability that the neighboring voxels also show a large response (spatial dependence), and, under the same set of stimuli over time, the brain activation is expected to be similar. Moreover, fMRI data can often be corrupted with noise arising due to the thermal motion of electrons inside the bore of the magnet, the brain itself, and other physiological reasons. In order to reduce such inherent unaccounted and uncontrolled errors due to head motion and scanner drift, a series of preprocessing steps is performed (see Appendix A for more details). The main objective of fMRI data analysis is to identify regions of the brain that show task-related activity. For more information on fMRI data analysis, please refer to Huettel et al. (2004); Lindquist (2008); Ashby (2011); Wager and Lindquist (2015).

Figure 1.2 A typical example of an MRI scan of a healthy human brain. Source: Long et al. (2012)

1.2.2 Diffusion tensor imaging

Diffusion tensor imaging, popularly known as DTI, measures the restricted diffusion of water molecules in brain tissue in order to produce neural tract images. When water molecules are located in fiber tracts, their movement is restricted and they are more likely to be anisotropic, whereas those molecules in the rest of the brain are less restricted in their movement and are therefore isotropic. Diffusion causes water molecules to diverge from a central point and gradually reach the surface of an ellipsoid when the medium is anisotropic. In an isotropic medium, water molecules move out at the same rate in all directions. Thus, using the laws of physics (such as attenuation), the signal of an MRI voxel can be converted into numerical measures of diffusion, which are then interpreted by physicians. Thus, each brain voxel has one or more pairs of parameters, such as the rate of diffusion, the direction of diffusion, etc. The properties of each voxel of a single DTI image are usually represented by a vector or a higher-order multidimensional array. Consider an ellipsoidal tensor in a three-dimensional Cartesian grid, where there exist three projections of the given ellipsoid onto the three coordinate axes. These projections provide the apparent diffusion coefficients (ADC), denoted as ADC_x, ADC_y and ADC_z corresponding to the X, Y and Z axes, respectively. Therefore, the average diffusivity in a given voxel is defined as ADC = (ADC_x + ADC_y + ADC_z)/3. Note that the ellipsoid has three axes, one principal axis (the longest) and two minor axes passing through the center, where the directions and lengths of these axes are the eigen-vectors and eigen-values, respectively, in the context of tensor algebra. The diffusion along the principal axis is termed the axial diffusivity (denoted L_1), and the average diffusivity along the two other minor axes is termed the radial diffusivity (denoted L_23, where L_23 = (L_2 + L_3)/2 and L_2, L_3 are the eigen-values corresponding to the minor axes). The mean diffusivity is defined as MD = (L_1 + L_2 + L_3)/3. The degree of anisotropy of a diffusion process is termed the fractional anisotropy (FA), which is a scaled measure that belongs to the interval [0, 1]. The quantity FA takes the value zero when diffusion is isotropic (i.e., unrestricted in all directions) and takes the value one when diffusion occurs along only one axis and is fully restricted in the other directions. FA can be calculated using the following formula:

FA = \sqrt{\frac{3}{2}} \left[ \frac{(L_1 - MD)^2 + (L_2 - MD)^2 + (L_3 - MD)^2}{L_1^2 + L_2^2 + L_3^2} \right]^{1/2}   (1.1)

For more information on DTI, please refer to O'Donnell and Westin (2011); Van Hecke et al. (2016).
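The scalar measures above are simple functions of the three eigen-values of a voxel's diffusion tensor. As a minimal illustrative sketch (written in Python/NumPy purely for exposition; the 3 x 3 tensor below is hypothetical and not taken from any data set used in this dissertation), MD, axial and radial diffusivity, and the FA in (1.1) could be computed as follows.

```python
import numpy as np

def dti_scalar_measures(D):
    """MD, axial/radial diffusivity, and FA from a 3x3 symmetric diffusion tensor D."""
    # Eigen-values sorted so that L1 >= L2 >= L3 (principal axis first).
    L = np.sort(np.linalg.eigvalsh(D))[::-1]
    L1, L2, L3 = L
    MD = L.mean()                 # mean diffusivity, (L1 + L2 + L3) / 3
    AD = L1                       # axial diffusivity along the principal axis
    RD = (L2 + L3) / 2            # radial diffusivity, average over the minor axes
    # Fractional anisotropy, equation (1.1): 0 for isotropic diffusion,
    # close to 1 when diffusion is restricted to a single direction.
    FA = np.sqrt(1.5) * np.sqrt(((L - MD) ** 2).sum() / (L ** 2).sum())
    return MD, AD, RD, FA

# Hypothetical diffusion tensor for a single voxel (in units of 10^-3 mm^2/s).
D = np.array([[1.7, 0.1, 0.0],
              [0.1, 0.4, 0.0],
              [0.0, 0.0, 0.3]])
print(dti_scalar_measures(D))
```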
1.3 Mathematical preliminaries

In this section, we introduce some notation and basic mathematical principles that will be used frequently throughout the dissertation.

1.3.1 Notations of tensor/matrix object

A tensor is a multidimensional array indexed by D indices. A first-order tensor is a vector (D = 1), a second-order tensor is a matrix (D = 2), and for D > 2 we call such objects higher-order tensors. In the following paragraphs, we provide a brief summary of tensors and define important notation. Interested readers can refer to the survey article by Kolda and Bader (2009) for more details. A D-dimensional tensor is denoted by sans-serif upper-case letters, A ∈ R^{I_1 × ··· × I_D}, where I_d is the size along each mode (dimension) d, for d = 1, ..., D. Therefore, the number of elements in the tensor A is I = \prod_{d=1}^{D} I_d, and the order of the tensor is the number of dimensions. Here and henceforth, matrices are denoted by bold-face capital letters (examples: A, B, ...), vectors are written as bold-face lower-case letters (examples: a, b, ...), and scalars are presented as Latin letters (a, b, ...). The entry in the i-th row and j-th column of a matrix A is denoted by (A)_{i,j} = a_{ij}, and the (i_1, ..., i_D)-th entry of a D-dimensional tensor is denoted by (A)_{i_1,...,i_D} = a_{i_1,...,i_D}. Slices are two-dimensional sections of the tensor defined by fixing all but two indices, and thus form an I_d × I_{d'} matrix. For a D-way tensor A ∈ R^{I_1 × ··· × I_D} with element a_{i_1,...,i_D} at position i_d along mode d, d = 1, ..., D, the vectorization operator vec(·) produces a vector of length \prod_{d=1}^{D} I_d, formed by stacking the elements of A into a single column vector, i.e.,

\mathrm{vec}(A)\Big[\, i_1 + \sum_{d=2}^{D} (i_d - 1) \prod_{k=1}^{d-1} I_k \,\Big] = a_{i_1, \dots, i_D}   (1.2)

In particular, for a matrix A of order I × J, vec(A) = (a_{1,1}, ..., a_{I,1}, ..., a_{1,J}, ..., a_{I,J})^T. Similarly to the vectorization operator, one can unfold a D-way array A via d-mode matricization (unfolding) to form a matrix A_{(d)} with I_d rows and \prod_{d' \neq d} I_{d'} columns, where the element a_{i_1,...,i_D} is placed at row i_d and column

1 + \sum_{\substack{d_1 = 1 \\ d_1 \neq d}}^{D} (i_{d_1} - 1) \prod_{\substack{d_2 = 1 \\ d_2 \neq d}}^{d_1 - 1} I_{d_2}.
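The index bookkeeping in (1.2) and in the d-mode unfolding can be checked directly on a small array. The sketch below (Python/NumPy, used here only for illustration; the helper names vec and unfold are not from the dissertation) builds vec(A) and A_(d) from the definitions; the column-stacking order in (1.2) corresponds to Fortran-style ("F") ordering.

```python
import numpy as np

def vec(A):
    """vec(A) as in (1.2): stack entries with the first index varying fastest."""
    return A.reshape(-1, order="F")

def unfold(A, d):
    """d-mode matricization A_(d): I_d rows, product of the remaining sizes as columns."""
    return np.moveaxis(A, d, 0).reshape(A.shape[d], -1, order="F")

# Small 3-way example tensor of size I1 x I2 x I3 = 2 x 3 x 4.
A = np.arange(24).reshape(2, 3, 4, order="F")

# Sanity check of (1.2): entry (i1, i2, i3) lands at position
# i1 + (i2 - 1) * I1 + (i3 - 1) * I1 * I2 (written with 1-based indices).
i1, i2, i3 = 2, 3, 4
pos = (i1 - 1) + (i2 - 1) * 2 + (i3 - 1) * 2 * 3
assert vec(A)[pos] == A[i1 - 1, i2 - 1, i3 - 1]

# With this ordering, vec(A) coincides with vec(A_(1)).
assert np.array_equal(vec(A), vec(unfold(A, 0)))
print(unfold(A, 1).shape)   # (3, 8)
```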
1.3.2 Different kinds of products

Analogous to the Frobenius norm of a matrix, the norm of a tensor A is the square root of the sum of squares of its entries,

\|A\|_F = \Big( \sum_{i_1=1}^{I_1} \cdots \sum_{i_D=1}^{I_D} a_{i_1,\dots,i_D}^2 \Big)^{1/2}   (1.3)

The scalar product ⟨A, B⟩ of two D-dimensional tensors of the same size is defined as

⟨A, B⟩ = \sum_{i_1,\dots,i_D} a_{i_1,\dots,i_D}\, b_{i_1,\dots,i_D}   (1.4)

Thus, immediately, the Frobenius norm of the tensor A can be expressed as \|A\|_F = \sqrt{⟨A, A⟩}. Two tensors A and B are said to be orthogonal if ⟨A, B⟩ = 0. Furthermore, consider the contracted tensor product between two tensors with different mode dimensions. For two tensors A ∈ R^{I_1×···×I_K×P_1×···×P_L} and B ∈ R^{P_1×···×P_L×Q_1×···×Q_M}, the contracted tensor product (Lock, 2018; Raskutti et al., 2019) is defined as ⟨A, B⟩_L, whose (i_1, ..., i_K, q_1, ..., q_M)-th element is \sum_{p_1,\dots,p_L} a_{i_1,\dots,i_K,p_1,\dots,p_L}\, b_{p_1,\dots,p_L,q_1,\dots,q_M}. As a special case, ⟨A, B⟩_1 = AB, where A and B are I × P and P × Q matrices. A D-way tensor A has rank one when it can be written as the outer product of D vectors u^{(1)}, ..., u^{(D)} of lengths I_1, ..., I_D respectively, i.e.,

A = u^{(1)} ◦ ··· ◦ u^{(D)}   (1.5)

where the (i_1, ..., i_D)-th element of A is \prod_{d=1}^{D} u^{(d)}_{i_d}. The Kronecker product of the matrices A ∈ R^{I×J} and B ∈ R^{K×L}, denoted by A ⊗ B, is the (IK) × (JL) matrix defined by

A ⊗ B = (a_{ij} B)_{i,j} = [a_1 ⊗ b_1, a_1 ⊗ b_2, ···, a_J ⊗ b_{L-1}, a_J ⊗ b_L]   (1.6)

The Khatri-Rao product of matrices A ∈ R^{I×K} and B ∈ R^{J×K}, denoted by A ⊙ B, is defined by

A ⊙ B = [a_1 ⊗ b_1, ···, a_K ⊗ b_K]   (1.7)

which is an (IJ) × K matrix. The Hadamard product is the element-wise matrix product, denoted by A ∗ B, where A and B are both I × J.

1.3.3 Tensor decomposition

Now the question is how to represent a tensor as a sum of a finite number of rank-one tensors. The answer came from psychometrics in the form of the canonical decomposition or CANDECOMP (Carroll and Chang, 1970) and the parallel factors or PARAFAC (Harshman et al., 1970) decomposition; in the tensor decomposition literature it is now known as the CANDECOMP/PARAFAC (CP) decomposition, which is an extension of the matrix singular value decomposition (Tucker, 1966; Kiers, 2000). The CP decomposition factorizes a tensor into a sum of rank-one tensors, i.e., mathematically,

A = \sum_{r=1}^{R} u_r^{(1)} ◦ ··· ◦ u_r^{(D)}   (1.8)

where u_r^{(d)} ∈ R^{I_d}, d = 1, ..., D, r = 1, ..., R, are column vectors and A cannot be written as a sum of fewer than R outer products; the positive integer R is the rank of the tensor. Equation (1.8) is sometimes denoted as A = [[U_1, ..., U_D]], where U_1, ..., U_D have linearly independent columns, U_d = [u_1^{(d)}, ..., u_R^{(d)}] ∈ R^{I_d × R} for each d = 1, ..., D.
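The products defined above and the CP representation (1.8) can be verified numerically on small random factor matrices. The following sketch (Python/NumPy, shown only for illustration; the helper functions and dimensions are hypothetical) reconstructs a rank-R tensor from its factors and checks two identities that reappear in the next sub-section, namely vec(A) = (U_D ⊙ ··· ⊙ U_1) 1_R and (A ⊙ B)^T (A ⊙ B) = (A^T A) ∗ (B^T B).

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, R = 4, 3, 5, 2
U1 = rng.standard_normal((I1, R))
U2 = rng.standard_normal((I2, R))
U3 = rng.standard_normal((I3, R))

def khatri_rao(A, B):
    """Column-wise Kronecker product (1.7): an (I*J) x K matrix for A (I x K), B (J x K)."""
    return np.vstack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])]).T

def cp_reconstruct(factors):
    """Sum of rank-one tensors as in (1.8): sum_r u_r^(1) o ... o u_r^(D)."""
    shape = tuple(U.shape[0] for U in factors)
    A = np.zeros(shape)
    for r in range(factors[0].shape[1]):
        outer = factors[0][:, r]
        for d in range(1, len(factors)):
            outer = np.multiply.outer(outer, factors[d][:, r])
        A += outer
    return A

A = cp_reconstruct([U1, U2, U3])

# vec(A) = (U3 ⊙ U2 ⊙ U1) 1_R, with vec(.) taken in column-stacking ("F") order.
lhs = A.reshape(-1, order="F")
rhs = khatri_rao(khatri_rao(U3, U2), U1) @ np.ones(R)
assert np.allclose(lhs, rhs)

# Khatri-Rao / Hadamard identity: (A ⊙ B)^T (A ⊙ B) = (A^T A) * (B^T B).
KR = khatri_rao(U1, U2)
assert np.allclose(KR.T @ KR, (U1.T @ U1) * (U2.T @ U2))
```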
1.3.4 Some useful results

In this sub-section, we present some useful well-known results without proofs. We define 1_R as an R-dimensional vector with all elements equal to 1.

1. vec(u^{(1)} ◦ u^{(2)} ◦ ··· ◦ u^{(D)}) = u^{(D)} ⊗ ··· ⊗ u^{(1)}.

2. For two vectors a and b, a ⊗ b = a ⊙ b and a ◦ b = ab^T.

3. For any matrices A, B, C and D such that the required matrix multiplications are possible, we have the following:

a) (A ⊗ B)(C ⊗ D) = AC ⊗ BD.
b) (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}.
c) A ⊙ B ⊙ C = (A ⊙ B) ⊙ C = A ⊙ (B ⊙ C).
d) (A ⊙ B)^T (A ⊙ B) = (A^T A) ∗ (B^T B).
e) (A ⊙ B)^+ = {(A^T A) ∗ (B^T B)}^{-1} (A ⊙ B)^T, where ^+ denotes the Moore-Penrose pseudo-inverse.
f) vec(A ⊙ B) = ((I ⊙ A) ⊗ I) vec(B).
g) vec(B ⊙ A) = {I ⊙ (A(I ⊗ 1^T))} vec(B).
h) trace(AB) = trace(BA) = vec(A^T)^T vec(B).
i) vec(ABC) = (C^T ⊗ A) vec(B).
j) rank(A ⊙ B) ≤ rank(A ⊗ B) ≤ rank(A) rank(B).
k) If A is a matrix of order m_1 × m_2 with (i, j)-th element a_{ij}, then the Frobenius norm of A is defined as \|A\|_F = \big( \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} |a_{ij}|^2 \big)^{1/2} = \{trace(A^T A)\}^{1/2} = \big( \sum_{i=1}^{\min(m_1, m_2)} \sigma_i^2(A) \big)^{1/2}, where \sigma_i(A) is the i-th singular value of A and trace(A) is the trace of a square matrix A.

4. If the tensor A admits a rank-R decomposition (1.8), then A_{(d)} = U_d (U_D ⊙ ··· ⊙ U_{d+1} ⊙ U_{d-1} ⊙ ··· ⊙ U_1)^T and vec(A) = (U_D ⊙ ··· ⊙ U_1) 1_R.

5. \|A\|_F = \|A_{(d)}\|_F = \|vec(A_{(d)})\|_2 for d = 1, ..., D.

6. vec(A) = P^{(d)}_{I_1,...,I_D} vec(A_{(d)}), where P^{(d)}_{I_1,...,I_D} are permutation matrices such that (P^{(d)}_{I_1,...,I_D})^{-1} = (P^{(d)}_{I_1,...,I_D})^T.

1.4 Dissertation outline

The main objective of this dissertation is to answer some fundamental questions that appear in different domains of statistics due to real-life situations. Let us dive into these questions one by one and briefly introduce them.

Question 1. How should we handle dense functional responses in the quadratic inference method?

We consider the problem of estimation for constant linear effect models in semi-parametric functional regression with functional response, where each response curve is decomposed into the overall mean function indexed by a covariate function with constant regression parameters and a random error. In Chapter 2, we provide an alternative solution using a popular method for the analysis of correlated data, viz., the quadratic inference approach for such models. Here, we use basis functions that are estimated non-parametrically. Therefore, the proposed method can be easily implemented without assuming any working correlation structure. Moreover, we achieve a parametric √n-convergence rate under a proper choice of bandwidth when the number of repeated measurements per trajectory is larger than n^{a_0}, where n is the number of trajectories, and establish the asymptotic normality of the resulting estimator. The performance of the proposed method is compared with that of existing methods through extensive simulation studies. Real data analysis is also carried out to demonstrate the proposed method.

Question 2. How should heteroskedastic functional data be analyzed?

Motivated by recent work on diffusion tensor imaging, we propose a novel varying-coefficient model in Chapter 3. We develop a multi-step estimation procedure to simultaneously estimate the varying-coefficient functions using a local linear generalized method of moments (GMM) based on continuous moment conditions. To incorporate spatial dependence, the continuous moment conditions are first projected onto eigen-functions and then combined by weighted eigen-values. This approach solves the challenges of using an inverse covariance operator directly. We propose an optimal instrumental variable that minimizes the asymptotic variance function among the class of all local linear GMM estimators, and it outperforms the initial estimates which do not incorporate the spatial dependence. It is shown that with our proposed method, the accuracy of the estimation is significantly improved under heteroskedasticity.
We investigate the asymptotic properties of the initial and proposed estimators. Extensive simulation studies illustrate the finite-sample performance, and the analysis of real data confirms the efficacy of the proposed method.

Question 3. How should functional regression be performed for complex structured data such as tensors?

All neuroimaging modalities have their own strengths and limitations. A current trend is towards interdisciplinary approaches that use multiple imaging methods to overcome the limitations of each method in isolation. At the same time, neuroimaging data are increasingly being combined with other non-imaging modalities, such as behavioral and genetic data. The data structure of many of these modalities can be expressed as time-varying multidimensional arrays (tensors), collected at different time-points on multiple subjects. In Chapter 4, we consider a new approach for the study of neural correlates in the presence of tensor-valued brain images and tensor-valued predictors, where both data types are collected over the same set of time-points. We propose a time-varying tensor regression model with an inherent structural composition of responses and covariates. Regression coefficients are expressed using the B-spline technique, and basis function coefficients are estimated using CP-decomposition by minimizing a penalized loss function. We develop a varying-coefficient model for the tensor-valued regression model, where both predictors and responses are modeled as tensors. This development is a non-trivial extension of function-on-function concurrent linear models for complex and large structural data where the inherent structures are preserved. In addition to the methodological and theoretical development, the usefulness of the proposed method based on both simulated and real data analysis (e.g., the combination of eye-tracking data and functional magnetic resonance imaging (fMRI) data) is also discussed.

Putting it all together, in this chapter we have introduced the concepts of functional data, its computational framework, the required mathematical notation and definitions, and real-life applications, thereby establishing a foundation for the upcoming chapters of this dissertation.

CHAPTER 2
IMPROVING QUADRATIC INFERENCE APPROACH FOR FUNCTIONAL RESPONSES

2.1 Introduction

The key characteristic of longitudinal data analysis (LDA) is the collection of repeated measurements on the same set of individuals over multiple time-points, thus allowing the study of changes in responses over time and the identification of factors that influence those changes. Unlike cross-sectional studies, where one can estimate only "between-individual" responses since individuals are measured at a single time-point, in LDA it is possible to capture "within-individual" changes because repeated measurements on each individual are available. Moreover, longitudinal data are always observed as clusters, where each cluster pertains to the repeated measurements obtained from one individual. Although longitudinal studies are performed for data that are observed sparsely over irregular time-points, such studies do not suffice when voluminous data are observed over a continuum domain. As technologies advance, this type of data is being observed more often, so sophisticated methods are needed to handle it. Since functional data are natural generalizations of multivariate data from finite to infinite dimension, functional data analysis (FDA) has turned out to be an important methodological tool. In the following two paragraphs, we will present a brief review of some significant research in the past decades that led to the current research.
In the following two paragraphs, we present a brief review of some significant research in the past decades that led to the current work. In LDA, the data are generally observed with noise for measurements at each time-point (Taris, 2000; Diggle et al., 2002; Hedeker and Gibbons, 2006; Hand and Crowder, 2017). Moreover, only a few repeated measurements are required in LDA, and the data are observed sparsely with noise. On the other hand, in FDA, data are densely observed as a continuous-time stochastic process without noise (Zhang and Wang, 2016). Often, the sampling plan can have an effect on the performance of the estimation procedures and inference (Hall and Hosseini-Nasab, 2006). In some situations, data are typically functions by nature and are observed densely over time. Chiou et al. (2003) proposed a class of semi-parametric functional regression models to describe the influence of vector-valued covariates on a sample of response curves. When data collection leads to experimental error, smoothing is performed at closely spaced time-points in order to reduce the effect of noise. The current developments of functional regression techniques have been rigorously studied in Fan et al. (1999); Hall et al. (2007); Chen et al. (2019). The applicability of FDA spans various scientific domains such as medical imaging, speech recognition, growth curves, climatology, price index analysis, and many more. Some recent literature on applications of FDA includes Ramsay and Silverman (2005); Ferraty and Vieu (2006); Ramsay and Silverman (2007); Zhang (2013); Hsing and Eubank (2015); Morris (2015); Wang et al. (2016); Kokoszka and Reimherr (2017).

Methodologically, in LDA in the past few years, the generalized estimating equation (GEE) technique proposed by Liang and Zeger (1986) has been extensively used for the estimation of parameters. Although it is an efficient technique, the GEE is unable to estimate the parameters of interest efficiently when the working correlation matrix is not specified correctly. Hence, without requiring the estimation of the correlation parameters, the quadratic inference function (QIF) approach proposed by Qu et al. (2000) is useful for parameter estimation in longitudinal studies (Diggle et al., 2002) and cluster randomized trials (Turner et al., 2017). By representing the inverse of the working correlation matrix in terms of linear combinations of basis matrices and involving multiple sets of score functions, the QIF approach has improved efficiency over GEE when the working correlation matrix is not specified correctly. Although it maintains the same efficiency as in the situation where the working correlation matrix is specified correctly, the QIF method is not independent of the choice of the working correlation matrix. A QIF-based approach to varying-coefficient models for longitudinal data was proposed by Qu and Li (2006). The related work of Bai et al. (2008) is an extension of QIF for the partial linear model. An alternative method was presented in Yu et al. (2020), where each set of score equations was solved separately and their solutions were combined afterwards, thereby providing results on inference for an optimally weighted estimator and extending those insights to the general setting with over-identified estimating equations.
Zhao et al. (2020, 2021) proposed a variable selection method for the varying-coefficient model when some of the covariates are contaminated with additive errors, based on bias-corrected penalized QIFs that are defined by combining the basis function approximation to the coefficient functions and a bias-corrected QIF with shrinkage estimation. Zhou and Qu (2012) proposed a QIF-based strategy which minimizes the norm of the difference between two estimating functions based on empirical correlation information. Tian et al. (2014) focused on the selection of variables for the semi-parametric varying-coefficient model based on the combination of the approximated basis functions and the QIFs. A longitudinal principal component analysis was proposed in Kinson et al. (2020) based on the eigen-decomposition of random effects, while the correlation information of multivariate observations over time was decomposed by nonparametric splines. Zheng et al. (2018) proposed a method based on a time-varying linear representation of the inverse of the correlation matrix which is projected onto the span of basis matrices.

The fundamental limitations that all the above-mentioned powerful techniques suffer from are: (1) all the above methodologies require prior information on the working correlation structure; and (2) the performance of the classical QIF approach is unknown for dense functional data. Our study is motivated by problems from multiple real-data applications that involve dense functional data when information on the working correlation structure is lacking. Let us discuss two motivating examples that we will use to illustrate the proposed method in this chapter (see Section 2.5 for more details).

• Beijing2017-data example - At different locations in China, particulate matter (PM) with diameter less than 2.5 micrometers is collected over different time-points. Scientists are interested in knowing the linear dependence of the pollution factor PM2.5 on other atmospheric chemicals (Liang et al., 2015). Figure 2.1 pictorially demonstrates the readings of PM2.5 for the given locations over several hourly time-points; therefore, dense functional data analysis can be implemented.

Figure 2.1 Beijing2017-data: Reading of hourly PM2.5 measures for twelve different locations over 608 hourly time-points during January 2017.

• Apnea-data example - In neuroimaging data analysis, scientists are interested in modelling the change of responses among voxels in each region of interest (ROI) of the human brain. Therefore, we can fit a linear regression model and compare the estimated coefficients across ROIs. Needless to say, there exist a large number of voxels and the responses change smoothly across the voxels in each ROI; therefore, the data are functional and dense in nature. In recent literature, Xiong et al. (2017) investigated white matter structural alterations using diffusion tensor imaging (DTI) in obstructive sleep apnea (OSA) patients. Here, the change of DTI parameters such as fractional anisotropy (FA) with the interaction of the count of lapses obtained from the Psychomotor Vigilance Task and voxel locations is investigated and compared across ROIs.

We propose a data-driven way to select the working covariance matrix and express the inverse of the covariance function in terms of the empirical eigen-functions of the covariance operator.
The covariance operator can be estimated as in Hsing and Eubank (2015) and by other related methods based on functional principal component analysis (FPCA), as found in Dauxois et al. (1982); Yao et al. (2005); Hall and Hosseini-Nasab (2006); Hall et al. (2007); Li and Hsing (2010). Note that the estimation of the eigen-functions introduces some error into the proposed estimation method. In this chapter, we try to answer the following question: when we estimate the eigen-functions from the data, is the estimation of the coefficient vector in a semi-parametric problem $\sqrt{n}$-consistent in dense functional data, and can we achieve asymptotic normality?

The advantages of our proposed method are the following. First, our method preserves the good properties of the QIF method and is easier to implement, since the eigen-functions can be estimated using existing packages in statistical software such as R. Second, under some mild conditions, our proposed estimator attains the optimal convergence rate and is asymptotically normally distributed with smaller variance compared to the classical QIF methods. Third, the asymptotic results show the estimation accuracy of the coefficient in the semi-parametric functional model, therefore making the influence of the dimension-reduction step using FPCA asymptotically negligible. The error in the estimation of the eigen-functions contributes to the error in the estimation of the parameters. Under some mild bandwidth conditions, the above-mentioned error contribution is of the same order of magnitude as the error in parameter estimation when the eigen-functions are known in advance.

The rest of the chapter is organized as follows. In Section 2.2 we introduce the basic concept of QIF along with our proposed method. The asymptotic results for the proposed estimator are presented in Section 2.3. In Section 2.4, we demonstrate the finite-sample performance. We also apply the proposed method to real data-sets in Section 2.5. We conclude with some remarks in Section 2.6. All technical proofs are given in Section 2.7.

2.2 Functional response model and estimation procedure

2.2.1 Basic model

To analyze longitudinal data, a straightforward application of a generalized linear model (GLM) (McCullagh and Nelder, 1989) for single response variables is not applicable due to the lack of independence between repeated measures. To account for the high correlation in the longitudinal data, some special techniques are required. A seminal work by Liang and Zeger (1986) proposed the use of GLM for the analysis of longitudinal data. The model we consider in this chapter is commonly observed in spatial modeling, where associations among variables do not change over the functional domain (see Zhang and Banerjee (2021) and references therein); this is termed a constant linear effects model. In this chapter, the variable "time" is used as the functional domain variable.

Let $y(t)$ be the response variable at time-point $t$ and $\mathbf{x}(t)$ be $p$-dimensional covariates observed at time $t \in \mathcal{T}$, where $\mathcal{T} = [a, \bar{a}]$ with $-\infty < a < \bar{a} < \infty$ is the spectrum of the time-points. Without loss of generality, assume that $a = 0$ and $\bar{a} = 1$ in the rest of this chapter. The stochastic process $y(t)$ is square-integrable with marginal mean $\mathrm{E}\{y(t)|\mathbf{x}(t)\}$ and finite covariance function; the regression parameter $\boldsymbol{\beta}$ is unknown and is to be efficiently estimated. Thus, linear models with longitudinal data have the following expression.
$$y(t) = \mathbf{x}(t)^{T}\boldsymbol{\beta} + e(t) \qquad (2.1)$$

where the stochastic process $y(t)$ is decomposed into two parts: one is the mean function $\mu(t) = \mathbf{x}(t)^{T}\boldsymbol{\beta}$, which depends on the time-varying covariates and the coefficient vector $\boldsymbol{\beta}$, and the other is the random error part $e(t)$, where $\mathrm{E}\{e\} = 0$ and $e$ has finite second-order covariance. Let the $y_i$ be i.i.d. copies of the stochastic process, and for each individual let the measurements be taken at $m_i$ discrete time-points $T_{ij}$ for $j = 1, \cdots, m_i$; $i = 1, \cdots, n$. Therefore, at the times $T_{ij}$ we observe an $m_i \times 1$ response vector with entries $y_i(T_{ij})$ and corresponding covariates $\mathbf{x}_i(T_{ij})$ for the $i$-th subject. We assume that the $m_i$'s are all of the same order as $m = n^{a}$ for some $a \ge 0$, so that $m_i/m$ is bounded below and above by constants. Functional data are considered to be sparse or dense depending on the choice of $a$ (Hall and Hosseini-Nasab, 2006). Data with bounded $m$, or $a = 0$, are called sparse functional data, and data with $a \ge a_0$, where $a_0$ is a transition point, are called dense functional data. Moreover, the region $(0, a_0)$ is sometimes referred to as moderately dense. Furthermore, we write $y_{ij}$ and $\mathbf{x}_{ij}$ for $y_i(T_{ij})$ and $\mathbf{x}_i(T_{ij})$, respectively. The $m_i$-component vectors $(y_{i1}, \cdots, y_{im_i})^{T}$ and $(\mu_{i1}, \cdots, \mu_{im_i})^{T}$ are denoted as $\mathbf{y}_i$ and $\boldsymbol{\mu}_i$, respectively. The derivative of $\boldsymbol{\mu}$, denoted as $\dot{\boldsymbol{\mu}}$, is an $m_i \times p$ matrix. In the classical GEE problem, we estimate $\boldsymbol{\beta}$ by solving the quasi-likelihood equations (Liang and Zeger, 1986):

$$\sum_{i=1}^{n} \dot{\boldsymbol{\mu}}_i^{T} \mathbf{V}_i^{-1}(\mathbf{y}_i - \boldsymbol{\mu}_i) = \mathbf{0} \qquad (2.2)$$

We denote $\mathbf{V}_i = \nu \mathbf{A}_i^{1/2}\mathbf{R}_i(\rho)\mathbf{A}_i^{1/2}$, where $\mathbf{R}_i(\rho)$ is the working correlation matrix, $\nu$ is an over-dispersion parameter and $\mathbf{A}_i$ is a diagonal matrix whose entries are the marginal variances $\mathrm{Var}(y_{i1}), \cdots, \mathrm{Var}(y_{im_i})$. In this chapter, we simply set $\nu = 1$; the extension to a general $\nu$ is straightforward. The GEE approach is robust in the sense that it does not require true knowledge of the likelihood function. Note that, in practice, prior knowledge of the working correlation matrix is not available, and the estimation of the coefficient is influenced by its choice. Therefore, Qu et al. (2000) suggested an expansion of the inverse of the working correlation matrix as $\mathbf{R}(\rho)^{-1} = \sum_{k=1}^{\kappa_0} a_k(\rho)\mathbf{M}_k$, where the $\mathbf{M}_k$ are some basis matrices. Zhou and Qu (2012) modified this linear representation by grouping the basis matrices into an identity matrix and some symmetric basis matrices. For example, if the working correlation matrix is exchangeable (compound symmetric), $\mathbf{R}(\rho)^{-1} = c_1\mathbf{I}_m + c_2\mathbf{J}_m$, where $\mathbf{I}_m$ is the $m \times m$ identity matrix and $\mathbf{J}_m$ is the $m \times m$ matrix with 0 on the diagonal and 1 in the off-diagonal positions. On the other hand, for a first-order auto-regressive correlation matrix, $\mathbf{R}(\rho)^{-1} = c_1\mathbf{I}_m + c_2\mathbf{J}_m^{(1)} + c_3\mathbf{J}_m^{(2)}$, where $\mathbf{J}_m^{(1)}$ is the matrix with 1 on the two main off-diagonals and 0 otherwise, and $\mathbf{J}_m^{(2)}$ is the matrix with 1 in the corner positions, viz. $(1,1)$ and $(m,m)$, and 0 elsewhere. Here the $c_k$'s are real constants that depend on the nuisance parameter $\rho$. Therefore, Equation (2.2) reduces to a linear combination of the score vectors:

$$\bar{\mathbf{g}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{g}_i(\boldsymbol{\beta}) = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\mathbf{A}_i^{-1/2}\mathbf{M}_1\mathbf{A}_i^{-1/2}(\mathbf{y}_i - \boldsymbol{\mu}_i) \\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\mathbf{A}_i^{-1/2}\mathbf{M}_{\kappa_0}\mathbf{A}_i^{-1/2}(\mathbf{y}_i - \boldsymbol{\mu}_i) \end{pmatrix} \qquad (2.3)$$
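As a concrete illustration of these basis matrices, the following base-R sketch constructs $\mathbf{I}_m$ and $\mathbf{J}_m$ for the exchangeable case and $\mathbf{I}_m$, $\mathbf{J}_m^{(1)}$, $\mathbf{J}_m^{(2)}$ for the AR(1) case; the constants $c_k$ are absorbed into the quadratic inference step and need not be specified. The function names are illustrative only.

```r
# Basis matrices for the expansion of the inverse working correlation matrix.
basis_exchangeable <- function(m) {
  I_m <- diag(m)
  J_m <- matrix(1, m, m) - diag(m)        # 0 on the diagonal, 1 off the diagonal
  list(I = I_m, J = J_m)
}
basis_ar1 <- function(m) {
  I_m <- diag(m)
  J1  <- matrix(0, m, m)
  J1[abs(row(J1) - col(J1)) == 1] <- 1    # 1 on the two main off-diagonals
  J2  <- matrix(0, m, m)
  J2[1, 1] <- 1; J2[m, m] <- 1            # 1 in the (1,1) and (m,m) corners
  list(I = I_m, J1 = J1, J2 = J2)
}
```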
Due to the higher dimension of $\mathbf{g}$, Qu et al. (2000) used the generalized method of moments (GMM) (Hansen, 1982), for which the estimation boils down to minimization of the quadratic inference function $Q(\boldsymbol{\beta}) = n\,\bar{\mathbf{g}}(\boldsymbol{\beta})^{T}\widehat{\mathbf{C}}(\boldsymbol{\beta})^{-1}\bar{\mathbf{g}}(\boldsymbol{\beta})$, where $\widehat{\mathbf{C}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{g}_i(\boldsymbol{\beta})\mathbf{g}_i(\boldsymbol{\beta})^{T}$ is the sample covariance matrix of the scores in Equation (2.3). To obtain the solution for $\boldsymbol{\beta}$, the Newton-Raphson method is used, which iteratively updates the value of $\boldsymbol{\beta}$.

2.2.2 Incorporating eigen-functions in QIF

Now, by the standard Karhunen-Loève expansion of $e_i(t) = y_i(t) - \mu_i(t)$ (Karhunen, 1946; Loève, 1946),

$$e_i(t) = \sum_{r=1}^{\infty} \xi_{ir}\phi_r(t) \qquad (2.4)$$

where the independently distributed random variables $\xi_{ir} \sim (0, \lambda_r)$ for ordered eigen-values $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$, and the $\phi_r$'s are orthonormal eigen-functions such that $\int \phi_r(t)\phi_l(t)\,dt = \mathbb{1}(r = l)$. We extract the main directions of variation of the response variables using FPCA. In this situation, we take the first $\kappa_0$ terms, which provide a good approximation of the infinite sum in Equation (2.4), by considering that the majority of the variation in the data is contained in the subspace spanned by a few eigen-functions (Chen et al., 2019). For finite $\kappa_0 \ge 1$, we therefore consider the rank-$\kappa_0$ FPCA model,

$$\mathrm{E}\{y(t)|\mathbf{x}(t)\} = \mu(t) + \sum_{r=1}^{\kappa_0} \mathrm{E}\{\xi_r|\mathbf{x}(t)\}\phi_r(t) \qquad (2.5)$$

An analogue of the truncated empirical version of Equation (2.22), defined in Section 2.7, and of Equation (2.4) can be provided easily, and we discuss the proposed method based on this truncated version. Moreover, we discuss how to choose $\kappa_0$ in our situation in Section 2.3 in detail.

In this chapter, we propose a data-driven way to compute the basis matrices to obtain the approximate inverse of $\mathbf{V}$ as discussed earlier. In this approach, it is enough to find the eigen-functions to construct a GEE. Let us define

$$\bar{\mathbf{g}}(\boldsymbol{\beta}) = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\widehat{\boldsymbol{\Phi}}_{i1}(\mathbf{y}_i - \boldsymbol{\mu}_i) \\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n}\dot{\boldsymbol{\mu}}_i^{T}\widehat{\boldsymbol{\Phi}}_{i\kappa_0}(\mathbf{y}_i - \boldsymbol{\mu}_i) \end{pmatrix} \qquad (2.6)$$

where, for $k = 1, \cdots, \kappa_0$, we define $\widehat{\boldsymbol{\Phi}}_{ik} = m_i^{-2}\big(\widehat{\phi}_k(T_{ij})\widehat{\phi}_k(T_{ij'})\big)_{j,j'=1,\cdots,m_i}$. Since the dimension of $\mathbf{g}$ in Equation (2.6) is greater than the number of parameters to estimate, instead of setting $\mathbf{g}$ to zero we minimize the following quadratic function:

$$\widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} Q(\boldsymbol{\beta}), \quad \text{where } Q(\boldsymbol{\beta}) = n\,\bar{\mathbf{g}}(\boldsymbol{\beta})^{T}\widehat{\mathbf{C}}(\boldsymbol{\beta})^{-1}\bar{\mathbf{g}}(\boldsymbol{\beta}) \qquad (2.7)$$

and $\widehat{\mathbf{C}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{g}_i(\boldsymbol{\beta})\mathbf{g}_i(\boldsymbol{\beta})^{T}$. For the existence of $\widehat{\mathbf{C}}^{-1}$ we need the additional restriction $n \ge \dim(\mathbf{g}_i) = p \times \kappa_0$, where $\kappa_0$ is the number of eigen-functions. Under the given set-up, by Equation (8) in Qu et al. (2000) the estimating equation for $\boldsymbol{\beta}$ is

$$\dot{Q}(\boldsymbol{\beta}) \approx 2\,\dot{\bar{\mathbf{g}}}(\boldsymbol{\beta})^{T}\widehat{\mathbf{C}}(\boldsymbol{\beta})^{-1}\bar{\mathbf{g}}(\boldsymbol{\beta}) \qquad (2.8)$$

To obtain the solution of the above equation, we use a Newton-like method. In practice, the standard Newton method does not always lead to a decrease in the objective function; that is, at each step of the iteration there is no guarantee that $Q(\boldsymbol{\beta}_{s+1}) < Q(\boldsymbol{\beta}_s)$. Therefore, we use the following algorithm to estimate $\boldsymbol{\beta}$ via a quasi-Newton method with halving (Givens and Hoeting, 2012).

Algorithm 2.1 Estimation of $\boldsymbol{\beta}$ using the quasi-Newton method with halving.
Data: $\widehat{\boldsymbol{\beta}}_0$ (initial estimate), with $Q(\widehat{\boldsymbol{\beta}}_0)$, $\dot{Q}(\widehat{\boldsymbol{\beta}}_0)$ and $\ddot{Q}(\widehat{\boldsymbol{\beta}}_0)$ calculated; $\epsilon_0$ (threshold, a small number); max.count (maximum number of iterations).
Result: Estimate of $\boldsymbol{\beta}$ using the proposed method.
1: Calculate $\widehat{\boldsymbol{\beta}}_1 \leftarrow \widehat{\boldsymbol{\beta}}_0 - \ddot{Q}(\widehat{\boldsymbol{\beta}}_0)^{-1}\dot{Q}(\widehat{\boldsymbol{\beta}}_0)$.
2: while Error $> \epsilon_0$ do
3:   Calculate $\dot{Q}(\widehat{\boldsymbol{\beta}}_1)$ and $\ddot{Q}(\widehat{\boldsymbol{\beta}}_1)$ based on $\widehat{\boldsymbol{\beta}}_1$.
4:   Initialise $r_0 = 1$.
5:   $\widehat{\boldsymbol{\beta}}_2 \leftarrow \widehat{\boldsymbol{\beta}}_1 - r_0\,\ddot{Q}(\widehat{\boldsymbol{\beta}}_1)^{-1}\dot{Q}(\widehat{\boldsymbol{\beta}}_1)$.
6:   Calculate $Q(\widehat{\boldsymbol{\beta}}_1)$ and $Q(\widehat{\boldsymbol{\beta}}_2)$ based on $\widehat{\boldsymbol{\beta}}_1$ and $\widehat{\boldsymbol{\beta}}_2$, respectively, using the proposed method.
7:   while $Q(\widehat{\boldsymbol{\beta}}_2) > Q(\widehat{\boldsymbol{\beta}}_1)$ do
8:     $r_0 \leftarrow r_0/2$.
9:     $\widehat{\boldsymbol{\beta}}_2 \leftarrow \widehat{\boldsymbol{\beta}}_1 - r_0\,\ddot{Q}(\widehat{\boldsymbol{\beta}}_1)^{-1}\dot{Q}(\widehat{\boldsymbol{\beta}}_1)$.
10:    Calculate $Q(\widehat{\boldsymbol{\beta}}_1)$ and $Q(\widehat{\boldsymbol{\beta}}_2)$ based on $\widehat{\boldsymbol{\beta}}_1$ and $\widehat{\boldsymbol{\beta}}_2$, respectively, using the proposed method.
     end
11:  Calculate Error $= \|\widehat{\boldsymbol{\beta}}_2 - \widehat{\boldsymbol{\beta}}_1\|_2$.
12:  $\widehat{\boldsymbol{\beta}}_0 \leftarrow \widehat{\boldsymbol{\beta}}_1$.
13:  $\widehat{\boldsymbol{\beta}}_1 \leftarrow \widehat{\boldsymbol{\beta}}_2$.
end
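As a companion to Algorithm 2.1, a minimal base-R sketch of the quasi-Newton iteration with step halving is given below. It assumes that Q, Qdot and Qddot are user-supplied functions returning the quadratic inference function of Equation (2.7), its gradient and its Hessian at a given coefficient vector; the function and argument names are illustrative and this is not the exact implementation used in the numerical work.

```r
# Minimal sketch of Algorithm 2.1: quasi-Newton iteration with step halving.
# Q, Qdot and Qddot are assumed to be user-supplied functions of beta.
qif_quasi_newton <- function(beta0, Q, Qdot, Qddot, eps0 = 1e-10, max.count = 500) {
  beta1 <- beta0 - solve(Qddot(beta0), Qdot(beta0))      # initial Newton step
  for (iter in seq_len(max.count)) {
    r0 <- 1
    beta2 <- beta1 - r0 * solve(Qddot(beta1), Qdot(beta1))
    # halve the step size until the objective decreases
    while (Q(beta2) > Q(beta1) && r0 > .Machine$double.eps) {
      r0 <- r0 / 2
      beta2 <- beta1 - r0 * solve(Qddot(beta1), Qdot(beta1))
    }
    if (sum((beta2 - beta1)^2) < eps0) return(beta2)     # squared-difference stopping rule
    beta1 <- beta2
  }
  beta1
}
```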
2.2.3 Estimation of eigen-functions

Estimation of the eigen-functions is an important step in our proposed quadratic inference technique. In general, FPCA plays an important role as a dimension reduction technique in functional data analysis. Some important theories on FPCA have been developed in recent years. In particular, Hall and Hosseini-Nasab (2006) proved various asymptotic expansions for FPCA for densely observed functional data. Later, Hall and Hosseini-Nasab (2009) gave more general theoretical arguments, including the effect of the gap between eigen-values (a.k.a. spacings) on the properties of the eigen-value estimators. In Li and Hsing (2010), uniform rates of convergence of the mean and covariance functions are given, which cover all possible choices/scenarios of the $m_i$'s. In this section, we adopt the estimation of covariance functions mostly from Li and Hsing (2010).

Note that the error process $e(t)$ has mean zero, is defined on the compact set $\mathcal{T} = [0,1]$ and satisfies $\int_{\mathcal{T}} \mathrm{E}\{e^2\} < \infty$. The functional principal components can be constructed via the covariance function $R(s,t)$ defined as

$$R(s,t) = \mathrm{E}\{e(s)e(t)\} \qquad (2.9)$$

which is assumed to be square-integrable. This function $R$ induces the kernel operator $\mathcal{F}$ as defined in Sub-section 2.7.1. An empirical analogue of the spectral decomposition of $R$ can be obtained as

$$\widehat{R}(s,t) = \sum_{r=1}^{\infty} \widehat{\lambda}_r \widehat{\phi}_r(s)\widehat{\phi}_r(t) \qquad (2.10)$$

where the random variables $\widehat{\lambda}_1 \ge \widehat{\lambda}_2 \ge \cdots \ge 0$ are the eigen-values of the estimated operator $\widehat{\mathcal{F}}$ and the corresponding sequence of eigen-functions is $\widehat{\phi}_1, \widehat{\phi}_2, \cdots$. Further, assume that $\int_{\mathcal{T}} \phi_r\widehat{\phi}_r \ge 0$ to avoid the issue regarding the change of sign (Hall and Hosseini-Nasab, 2006) in practical comparisons of eigen-functions; otherwise there is no impact on the convergence rate of the eigen-functions and hence of the proposed estimators. Our proposed method can be generalized to finitely many ties among the true eigen-values $\lambda_r$, but to avoid further technicalities we assume that the eigen-values are distinct.

Suppose that the $T_{ij}$ are observational points with a positive density function $f_T(\cdot)$. Assume $m_i \ge 2$ and define $N = \sum_{i=1}^{n} N_i$, where $N_i = m_i(m_i - 1)$. This approach is based on the local linear smoother, which is popular in functional data analysis, including Fan and Gijbels (1996); Li and Hsing (2010) among many others. Let $K(\cdot)$ be a symmetric probability density function on $[-1,1]$, which is used as the kernel, and let $h > 0$ be the bandwidth; the re-scaled kernel function is defined as $K_h(\cdot) = h^{-1}K(\cdot/h)$. Therefore, for given $s, t \in \mathcal{T}$, choose $(\widehat{a}_0, \widehat{b}_1, \widehat{b}_2)$ to be the minimizer of

$$\frac{1}{n}\sum_{i=1}^{n}\frac{1}{N_i}\sum_{j_1=1}^{m_i}\sum_{\substack{j_2=1 \\ j_2 \ne j_1}}^{m_i}\{e_i(T_{ij_1})e_i(T_{ij_2}) - a_0 - b_1(T_{ij_1}-s) - b_2(T_{ij_2}-t)\}^2 K_h(T_{ij_1}-s)K_h(T_{ij_2}-t) \qquad (2.11)$$

Thus, we estimate $R(s,t) = \mathrm{E}\{e(s)e(t)\}$ by the quantity $\widehat{a}_0$, viz., $\widehat{R}(s,t) = \widehat{a}_0$. The operator $\widehat{\mathcal{F}}$ is in general positive semi-definite and the estimated eigen-values $\widehat{\lambda}_r$ are non-negative; indeed, $\widehat{R}$ is symmetric. Along the lines of the existing literature, we define the following.
• $S_{a,b}(s,t) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{N_i}\sum_{j_1=1}^{m_i}\sum_{\substack{j_2=1 \\ j_2\ne j_1}}^{m_i}\left(\frac{T_{ij_1}-s}{h}\right)^{a}\left(\frac{T_{ij_2}-t}{h}\right)^{b}K_h(T_{ij_1}-s)K_h(T_{ij_2}-t)$

• $R_{a,b}(s,t) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{N_i}\sum_{j_1=1}^{m_i}\sum_{\substack{j_2=1 \\ j_2\ne j_1}}^{m_i}\left(\frac{T_{ij_1}-s}{h}\right)^{a}\left(\frac{T_{ij_2}-t}{h}\right)^{b}e_i(T_{ij_1})e_i(T_{ij_2})K_h(T_{ij_1}-s)K_h(T_{ij_2}-t)$

• $A_1 = S_{20}S_{02} - S_{11}^2$, $A_2 = S_{10}S_{02} - S_{01}S_{11}$, and $A_3 = S_{01}S_{20} - S_{10}S_{11}$

• $B = A_1 S_{00} - A_2 S_{10} - A_3 S_{01}$

Therefore, $\widehat{R}(s,t) = (A_1 R_{00} - A_2 R_{10} - A_3 R_{01})B^{-1}$.

2.3 Asymptotic properties

In this section, we study the asymptotic properties of the proposed estimator. Let us introduce some notation. Assume that the $m_i$'s are all of the same order, i.e., $m \equiv m(n) = n^{a}$ for some $a \ge 0$. Define $d_{n1}(h) = h^2 + h\bar{m}/m$ and $d_{n2}(h) = h^4 + h^3\bar{m}/m + h^2\tilde{m}/m^2$, where $\bar{m} = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} m/m_i$ and $\tilde{m} = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}(m/m_i)^2$. Denote $\delta_{n1}(h) = \{d_{n1}(h)\log n/(nh^2)\}^{1/2}$ and $\delta_{n2}(h) = \{d_{n2}(h)\log n/(nh^4)\}^{1/2}$. Further, $\nu_{a,b} = \int t^{a}K^{b}(t)\,dt$. Define $\mathbf{W} = (\boldsymbol{\phi}(t_1)^{T}, \cdots, \boldsymbol{\phi}(t_m)^{T})^{T}$, a matrix of order $m \times \kappa_0$ obtained by stacking all the $\boldsymbol{\phi}_k$'s, and the random components $\boldsymbol{\xi}_i = (\xi_{i1}, \cdots, \xi_{i\kappa_0})^{T}$. Further, $\boldsymbol{\xi}$ has mean zero and variance $\boldsymbol{\Lambda}$, which is a diagonal matrix with components $\lambda_1, \cdots, \lambda_{\kappa_0}$. The sign "$\lesssim$" indicates that the left-hand side of the inequality is bounded by the right-hand side up to a multiplicative positive constant; i.e., for two positive variables $f_1$ and $f_2$ we define $f_1 \lesssim f_2$ as $f_1 \le C f_2$, where $C$ is a positive constant not involving $n$.

The following conditions are needed for further discussion of the asymptotic properties.

(C1) The kernel function $K(\cdot)$ is a symmetric density function defined on the bounded support $[-1,1]$.

(C2) The density function $f_T$ of $T$ is bounded above and away from infinity, and bounded below away from zero. Moreover, $f_T$ is differentiable and the derivative is continuous.

(C3) $R(s,t)$ is twice differentiable and all second-order partial derivatives are bounded on $[0,1]^2$.

(C4) $\mathrm{E}\{\sup_{t\in[0,1]}|e(t)|^{\gamma}\} < \infty$ and $\mathrm{E}\{\sup_{t\in[0,1]}|\mathbf{x}_i(t)|^{2\gamma}\} < \infty$ for some $\gamma \in (4,\infty)$.

(C5) $h \to 0$ as $n \to \infty$ such that $d_{n1}^{-1}(\log n/n)^{1-2/\gamma} \to 0$ and $d_{n2}^{-1}(\log n/n)^{1-4/\gamma} \to 0$ for $\gamma \in (4,\infty)$.

(C6) Conditions on the eigen-components.
a) For each $1 \le k < r < \infty$ and a non-zero finite generic constant $C_0$,
$$\frac{\max\{\lambda_k, \lambda_r\}}{|\lambda_k - \lambda_r|} \le C_0\,\frac{\max\{k, r\}}{|k - r|} \qquad (2.12)$$
b) For some $\alpha > 0$, $V_r\lambda_r^{-2}r^{1+\alpha} \to 0$ as $r \to \infty$, where $V_r = \mathrm{E}\{\int \dot{\mu}(t)\phi_r(t)\,dt\}^2$. The above two conditions hold if $\lambda_r = r^{-\tau_1}\Lambda(r)$ and $V_r = r^{-\tau_2}\Gamma(r)$ for slowly varying functions $\Lambda$ and $\Gamma$ with $\tau_2 > 1 + 2\tau_1 > 3$.
c) $\int \phi_k^4(t)\,dt$ and $\int \dot{\mu}^2(t)\phi_k^2(t)\,dt$ are finite for all $k \ge 1$.

(C7) $\widehat{\mathbf{C}}(\boldsymbol{\beta})$ converges almost surely to an invertible matrix $\mathbf{C}_0 = \mathrm{E}\{\mathbf{g}(\boldsymbol{\beta}_0)\mathbf{g}(\boldsymbol{\beta}_0)^{T}\}$.

(C8) Conditions for $h$ and $\kappa_0$. For $\tau = \alpha + \tau_1$,
a) if $a > 1/4$, $\kappa_0 = O(n^{1/(3-\tau)})$ and $n^{-1/4} \lesssim h \lesssim n^{-(a+1)/5}$;
b) if $a \le 1/4$, $\kappa_0 = O(n^{4(1+a)/\{5(3-\tau)\}})$ and $h \lesssim n^{-1/4}$.

Remark 2.3.1. Condition (C1) is commonly used in non-parametric regression. The boundedness condition on the density of the time-points in Condition (C2) is standard for a random design. Similar results can be obtained for a fixed design where the grid-points are pre-fixed according to the design density, $\int_0^{T_j} f(t)\,dt = j/m$ for $j = 1, \cdots, m$, for $m \ge 1$. Furthermore, it is important to note that this approach does not require sample-path differentiability when we invoke the estimation of the eigen-functions from Li and Hsing (2010). Therefore, the method is applicable to Brownian motion, which has continuous but non-differentiable sample paths.
Condition (C3) is required for Taylor series expansions and is also common in non-parametric regression. Condition (C4) is required for a uniform bound on certain higher-order expectations to show uniform convergence; a similar condition is adopted in Li and Hsing (2010). The smoothness conditions in (C5) and (C8) are common in kernel smoothing and functional data analysis to control bias and variance. The first condition for tuning the parameters mentioned in (C5) is similar to Li and Hsing (2010). The required spacing assumptions on the eigen-values in Conditions (C6)a and (C6)b are similar to those in Hall and Hosseini-Nasab (2009). Condition (C6)c is a mild assumption that frequently arises in the functional data analysis literature; in most situations, this condition automatically holds. By the weak law of large numbers, Condition (C7) holds for large $n$. Similar kinds of conditions can be invoked, such as the convexity assumption, i.e., $\lambda_r - \lambda_{r+1} \le \lambda_{r-1} - \lambda_r$ for all $r \ge 2$. Condition (C8) controls the rate of the number of repeated measurements.

Now, the following theorem provides the asymptotic expansion and consistency of the proposed estimator $\widehat{\boldsymbol{\beta}}$.

Theorem 2.3.1. Let $\boldsymbol{\beta}_0$ be the true value of $\boldsymbol{\beta}$. Under Conditions (C1), (C2), (C3), (C4), (C5), (C6)a, (C6)b and (C6)c, for $k = 1, \cdots, \kappa_0$, the asymptotic mean square error of $\mathbf{g}_k(\boldsymbol{\beta}_0)$ satisfies

$$\mathrm{AMSE}\{\mathbf{g}_k(\boldsymbol{\beta}_0)\} = O\left(n^{-1} + n^{-1}\kappa_0^{3-\tau}R_n(h)\right) \quad \text{almost surely} \qquad (2.13)$$

where $R_n(h) = h^4 + \frac{1}{n} + \frac{1}{nmh} + \frac{1}{n^2m^2h^2} + \frac{1}{n^2m^4h^4} + \frac{1}{n^2mh} + \frac{1}{n^2m^3h^3}$. Moreover, under Condition (C8), $\mathrm{AMSE}\{\widehat{\mathbf{g}}_k(\boldsymbol{\beta}_0)\} = O(n^{-1})$. Therefore, if in addition Condition (C7) holds, as $n \to \infty$, $\|\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\| = O(n^{-1/2})$ in probability.

The following theorem states the asymptotic normality result.

Theorem 2.3.2. Define $\mathbf{C}_i = \sum_{k_1=1}^{\kappa_0}\sum_{k_2=1}^{\kappa_0}\boldsymbol{\Phi}_{k_1}\mathbf{X}_i\mathbf{C}^{-1}_{k_1,k_2}\mathbf{X}_i^{T}\boldsymbol{\Phi}_{k_2}$, where $\mathbf{C}^{-1}_{k_1,k_2}$ is the $(k_1,k_2)$ block of $\mathbf{C}_0^{-1}$ with $\mathbf{C}_0 = \mathrm{E}\{\mathbf{g}_i(\boldsymbol{\beta}_0)\mathbf{g}_i^{T}(\boldsymbol{\beta}_0)\}$. Assume that the conditions of Theorem 2.3.1 hold. Then $\sqrt{n}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \xrightarrow{d} N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma} = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}^{-1}$. The quantities $\mathbf{A}$ and $\mathbf{B}$ are, respectively, the limits of $\widehat{\mathbf{A}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i^{T}\mathbf{C}_i\widehat{\mathbf{e}}_i\widehat{\mathbf{e}}_i^{T}\mathbf{C}_i\mathbf{X}_i$ and $\widehat{\mathbf{B}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_i^{T}\mathbf{C}_i\mathbf{X}_i$, and "$\xrightarrow{d}$" denotes convergence in distribution.

Remark 2.3.2. Here, the selection of the bandwidth only affects the second-order term of the MSE of $\widehat{\boldsymbol{\beta}}$ and has no effect on the asymptotic normality result, as long as $h$ satisfies Conditions (C5) and (C8) along with some restrictions on $\kappa_0$. All proofs with the relevant technical details are given in Section 2.7.

2.4 Simulation studies

We conduct numerical studies to compare the finite sample performance with that of the corresponding longitudinal quadratic inference approach proposed in Qu et al. (2000) under different correlation structures.

2.4.1 Simulation set-up

Consider the normal response model

$$y_i(T_{ij}) = \mathbf{x}_i(T_{ij})^{T}\boldsymbol{\beta} + e_i(T_{ij}) \qquad (2.14)$$

For $p = 2$, we set the coefficient vector $\boldsymbol{\beta} = (\beta_1, \beta_2)^{T}$ with $\beta_1 = 1$ and $\beta_2 = 0.5$. The covariates are generated in the following way:

$$x_{ik}(t) = \chi_{i1}^{(k)} + \chi_{i2}^{(k)}\sqrt{2}\sin(\pi t) + \chi_{i3}^{(k)}\sqrt{2}\cos(\pi t) \qquad (2.15)$$

The coefficients $\chi_{i1}^{(k)} \sim N(0, (2^{-0.5(k-1)})^2)$, $\chi_{i2}^{(k)} \sim N(0, (0.85 \times 2^{-0.5(k-1)})^2)$ and $\chi_{i3}^{(k)} \sim N(0, (0.7 \times 2^{-0.5(k-1)})^2)$, and the $\chi_{ij}$'s are mutually independent across trajectories $i$ and indices $j$. Consider the following simulation design.

• Observational time-points. In a fixed-design situation, the associated observational times are fixed.
Sample trajectories are observed at $m = 100$ equidistant time-points $\{t_1, \cdots, t_m\}$ on $[0,1]$.

• Choice of residuals. The residual process $e_i(t)$ is a smooth function with mean zero and unknown covariance function, where each $e_i$ is distributed as $e_i = \sum_{k \ge 1}\xi_{ik}\phi_k$ and the $\xi_{ik}$'s are independent normal random variables with mean zero and respective variances $\lambda_k$. For numerical computation, we truncate the series at $k = 3$ in the Karhunen-Loève expansion for Situations (a), (b) and (c) described below. In Situations (d) and (e), the error process is generated from the given covariance functions. A data-generation sketch for Situation (a) is given after this list.

(a) Brownian motion. The covariance function of Brownian motion is $\min(s,t)$, with $\lambda_k = \frac{4}{\pi^2(2k-1)^2}$ and $\phi_k(t) = \sqrt{2}\sin(t/\sqrt{\lambda_k})$.

(b) Linear process. Consider the eigen-values $\lambda_k = k^{-2l_0}$ and $\phi_k(t) = \sqrt{2}\cos(k\pi t)$. We fix $l_0 \in \{1, 2, 3\}$.

(c) Ornstein-Uhlenbeck (OU) process. For positive constants $\mu_0$ and $\rho_0$, the stochastic differential equation for $e(t)$ is $\partial e(t) = -\mu_0 e(t)\partial t + \rho_0\partial w(t)$ for a Brownian motion $w(t)$. It can be shown that $\mathrm{cov}\{e(t), e(s)\} = c\exp\{-\mu_0|t-s|\}$, where $c = \rho_0^2/2\mu_0$. Here we assume $c = 1$. Thus, by solving the integral equation we have $\phi_k(t) = A_k\cos(\omega_k t) + B_k\sin(\omega_k t)$ and $\lambda_k = \frac{2\mu_0}{\omega_k^2 + \mu_0^2}$, where the $\omega_k$ are solutions of $\cot(\omega) = \frac{\omega^2 - \mu_0^2}{2\mu_0\omega}$. The constants $A_k$ and $B_k$ are defined as $B_k = \mu_0 A_k/\omega_k$, where $A_k = \left(\frac{2\omega_k^2}{2\mu_0 + \mu_0^2 + \omega_k^2}\right)^{1/2}$. Here $\mu_0$ is chosen to be 1 or 3.

(d) Power exponential. $R(s,t) = \exp\{-(|s-t|/a_0)^{b_0}\}$, where the scale parameter is $a_0 = 1$ and the shape parameter is $b_0 \in \{1, 2, 5\}$.

(e) Rational quadratic. $R(s,t) = \left\{1 + \left(\frac{s-t}{a_0}\right)^2\right\}^{-b_0}$, where the scale parameter is $a_0 = 1$ and the shape parameter is $b_0 \in \{1, 2, 5\}$.

• Sample size parameter. Number of individuals, $n \in \{100, 300, 500\}$.

2.4.2 Comparison and evaluation

For each of the situations, we perform 500 simulation replicates. To execute Qu et al. (2000)'s approach, we construct the scores using the basis matrices described in Example 1 (approximation of the compound symmetric correlation structure, denoted as ldaCS in the tables) and Example 2 (the first-order autoregressive correlation structure, denoted as ldaAR in the tables) of their paper. The ordinary least squares estimate (denoted as init in the tables) is taken as the initial estimate of $\boldsymbol{\beta}$ both for our method (denoted as fda-k for specific $k$ in the tables) and for Qu et al. (2000)'s approach. The iterative estimation procedure is declared to have converged when the squared difference between the estimated values at two consecutive steps is bounded by the small number $10^{-10}$, or when the maximum number of steps exceeds 500, whichever happens earlier. To make the theoretical results and numerical examples consistent, we use the "FPCA" function in R, which is available in the fdapace package (Gajardo et al., 2021), or the MATLAB (MATLAB, 2014) package PACE available at http://www.stat.ucdavis.edu/PACE/, to estimate the eigen-functions. The key references for the PACE approach and associated works include Yao et al. (2003, 2005); Müller and Yao (2010); Li and Hsing (2010). Bandwidths are selected using generalized cross-validation, and the Epanechnikov kernel $K(x) = 0.75(1-x^2)_{+}$ is used for estimation, where $(a)_{+} = \max(a, 0)$. The means and standard deviations (SD) of the regression coefficients based on 500 simulations are given as summary measures. The standard deviation reported in the tables is calculated from the 500 estimates over the 500 replications and can be viewed as the true standard error.
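To make the set-up concrete, the following base-R sketch generates a single replicate under Situation (a): covariates follow Equation (2.15) and the residuals are Brownian-motion errors truncated at the first three Karhunen-Loève terms. It covers data generation only, and all object names are illustrative.

```r
# One simulated data set under Situation (a); base R only.
set.seed(1)
n <- 100; m <- 100; p <- 2
tt <- seq(0, 1, length.out = m)                      # equidistant time-points
beta <- c(1, 0.5)
gen_x <- function(k) {                               # covariate x_{ik}(t), Equation (2.15)
  s <- 2^(-0.5 * (k - 1))
  rnorm(1, 0, s) + rnorm(1, 0, 0.85 * s) * sqrt(2) * sin(pi * tt) +
    rnorm(1, 0, 0.7 * s) * sqrt(2) * cos(pi * tt)
}
lambda <- 4 / (pi^2 * (2 * (1:3) - 1)^2)             # Brownian-motion eigen-values
phi <- sapply(1:3, function(k) sqrt(2) * sin(tt / sqrt(lambda[k])))
Y <- matrix(0, n, m); X <- array(0, c(n, m, p))
for (i in 1:n) {
  X[i, , ] <- sapply(1:p, gen_x)
  e_i <- phi %*% rnorm(3, 0, sqrt(lambda))           # truncated KL expansion of e_i(t)
  Y[i, ] <- X[i, , ] %*% beta + e_i
}
```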
Moreover, we also compute the following statistics to compare the performance of estimation, where for 𝑏-th 𝜷 𝑏 be the estimated value for 𝜷, replication b • Absolute bias, AB = 1 Í500 b 500 𝑏=1 | 𝜷 𝑏 − 𝜷| • Mean square error, MSE = 1 Í500 b 2 500 𝑏=1 ( 𝜷 𝑏 − 𝜷) MSEs are reported in the order of 10−2 . The number of selected eigen-functions plays a critical role in our proposed method. We choose 𝜅0 based on a scree plot where the elbow of the graph is found and the components to the left are considered as significant. 33 Table 2.1 Performance of the estimation procedure where the residuals are generated from Brownian motion (a). Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9999 0.0373 0.0297 0.1391 0.4995 0.0486 0.0384 0.2354 ldaAR 0.9991 0.0331 0.0265 0.1095 0.5004 0.0445 0.0353 0.1972 ldaCS 0.9987 0.0316 0.0253 0.1000 0.4997 0.0411 0.0322 0.1685 fda-1 0.9998 0.0564 0.0447 0.3180 0.5006 0.0743 0.0587 0.5516 86.2672 fda-2 1.0001 0.0269 0.0213 0.0723 0.4971 0.0362 0.0290 0.1314 96.3746 fda-3 0.9998 0.0231 0.0181 0.0532 0.4978 0.0317 0.0251 0.1010 99.9220 fda-4 1.0004 0.0052 0.0014 0.0028 0.4994 0.0092 0.0022 0.0085 99.9979 fda-5 0.9999 0.0021 0.0008 0.0004 0.4999 0.0051 0.0012 0.0026 100.0000 fda-6 0.9999 0.0021 0.0008 0.0004 0.4999 0.0051 0.0012 0.0026 100.0000 fda-7 0.9999 0.0021 0.0008 0.0004 0.4999 0.0051 0.0012 0.0026 100.0000 𝑛 = 300 init 1.0002 0.0200 0.0162 0.0401 0.5003 0.0288 0.0226 0.0825 ldaAR 1.0003 0.0184 0.0147 0.0336 0.5000 0.0259 0.0203 0.0670 ldaCS 1.0007 0.0170 0.0134 0.0288 0.4995 0.0242 0.0190 0.0583 fda-1 1.0002 0.0309 0.0251 0.0955 0.5008 0.0443 0.0350 0.1962 86.7578 fda-2 1.0002 0.0142 0.0114 0.0202 0.4998 0.0213 0.0169 0.0451 96.4747 fda-3 1.0002 0.0122 0.0098 0.0150 0.4992 0.0179 0.0144 0.0321 99.9745 fda-4 1.0002 0.0021 0.0003 0.0004 0.4998 0.0032 0.0005 0.0010 99.9993 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 𝑛 = 500 init 1.0002 0.0148 0.0117 0.0219 0.5006 0.0223 0.0177 0.0497 ldaAR 1.0005 0.0138 0.0111 0.0189 0.5000 0.0206 0.0162 0.0422 ldaCS 1.0000 0.0128 0.0102 0.0164 0.4992 0.0184 0.0146 0.0340 fda-1 1.0007 0.0234 0.0185 0.0545 0.5012 0.0348 0.0277 0.1213 86.7520 fda-2 0.9996 0.0105 0.0083 0.0110 0.5002 0.0157 0.0126 0.0247 96.5174 fda-3 0.9991 0.0091 0.0074 0.0084 0.4999 0.0133 0.0107 0.0177 99.9851 fda-4 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 99.9996 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 34 Table 2.2 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0019 0.0340 0.0267 0.1155 0.5001 0.0456 0.0356 0.2078 ldaAR 1.0007 0.0230 0.0187 0.0528 0.5010 0.0361 0.0285 0.1303 ldaCS 1.0000 0.0006 0.0005 0.0000 0.4999 0.0008 0.0006 0.0001 fda-1 1.0093 0.1436 0.1150 2.0675 0.5020 0.2010 0.1570 4.0311 73.0607 fda-2 1.0036 0.1055 0.0828 1.1123 0.5052 0.1337 0.1070 1.7869 91.6726 fda-3 1.0038 0.1024 0.0804 1.0473 0.5056 0.1303 0.1020 1.6969 99.7657 fda-4 1.0000 0.0092 0.0021 0.0084 0.5006 0.0096 0.0024 0.0093 99.9993 fda-5 1.0000 0.0011 0.0007 0.0001 0.5000 0.0017 0.0010 0.0003 100.0000 fda-6 1.0000 0.0011 0.0007 0.0001 0.5000 0.0017 0.0010 0.0003 100.0000 fda-7 1.0000 0.0011 0.0007 0.0001 0.5000 0.0017 0.0010 0.0003 100.0000 𝑛 = 300 init 0.9991 0.0181 0.0144 0.0326 0.5006 0.0268 0.0212 0.0715 ldaAR 0.9995 0.0133 0.0105 0.0178 0.5000 0.0212 0.0169 0.0447 ldaCS 1.0000 0.0003 0.0002 0.0000 0.5000 0.0005 0.0004 0.0000 fda-1 0.9958 0.0767 0.0616 0.5888 0.5037 0.1163 0.0918 1.3523 73.4907 fda-2 0.9974 0.0567 0.0458 0.3220 0.5011 0.0804 0.0648 0.6460 91.7757 fda-3 0.9974 0.0564 0.0455 0.3182 0.5007 0.0800 0.0643 0.6391 99.9225 fda-4 1.0000 0.0005 0.0001 0.0000 0.5000 0.0009 0.0002 0.0001 99.9998 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0003 0.0002 0.0000 100.0000 𝑛 = 500 init 0.9999 0.0152 0.0121 0.0230 0.5027 0.0207 0.0163 0.0436 ldaAR 1.0008 0.0100 0.0079 0.0100 0.5019 0.0161 0.0129 0.0263 ldaCS 1.0000 0.0003 0.0002 0.0000 0.5000 0.0004 0.0003 0.0000 fda-1 0.9990 0.0657 0.0523 0.4303 0.5113 0.0883 0.0698 0.7913 73.4098 fda-2 1.0013 0.0468 0.0371 0.2185 0.5078 0.0651 0.0520 0.4292 91.8501 fda-3 1.0014 0.0459 0.0364 0.2107 0.5074 0.0650 0.0516 0.4274 99.9490 fda-4 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 99.9999 fda-5 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-6 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-7 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 35 Table 2.3 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0020 0.0331 0.0261 0.1094 0.4999 0.0451 0.0349 0.2028 ldaAR 1.0002 0.0167 0.0133 0.0278 0.5014 0.0232 0.0183 0.0538 ldaCS 1.0000 0.0003 0.0002 0.0000 0.5000 0.0004 0.0003 0.0000 fda-1 1.0096 0.1431 0.1144 2.0520 0.5022 0.2003 0.1561 4.0029 92.5933 fda-2 1.0014 0.0648 0.0508 0.4188 0.5037 0.0842 0.0667 0.7094 98.5532 fda-3 1.0009 0.0535 0.0406 0.2860 0.5018 0.0744 0.0570 0.5524 99.7251 fda-4 1.0000 0.0074 0.0017 0.0055 0.4999 0.0103 0.0024 0.0106 99.9991 fda-5 1.0000 0.0045 0.0009 0.0021 0.5000 0.0021 0.0009 0.0004 100.0000 fda-6 1.0000 0.0045 0.0009 0.0021 0.5000 0.0021 0.0009 0.0004 100.0000 fda-7 1.0000 0.0045 0.0009 0.0021 0.5000 0.0021 0.0009 0.0004 100.0000 𝑛 = 300 init 0.9991 0.0175 0.0140 0.0308 0.5006 0.0263 0.0208 0.0689 ldaAR 0.9999 0.0091 0.0071 0.0083 0.4997 0.0127 0.0102 0.0161 ldaCS 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0002 0.0000 fda-1 0.9957 0.0767 0.0616 0.5893 0.5038 0.1164 0.0919 1.3541 92.9525 fda-2 0.9989 0.0365 0.0295 0.1331 0.4998 0.0511 0.0410 0.2608 98.7539 fda-3 0.9992 0.0349 0.0282 0.1218 0.4991 0.0483 0.0385 0.2332 99.9061 fda-4 0.9999 0.0017 0.0002 0.0003 0.4999 0.0033 0.0004 0.0011 99.9997 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 𝑛 = 500 init 0.9998 0.0148 0.0118 0.0219 0.5026 0.0201 0.0158 0.0409 ldaAR 1.0006 0.0073 0.0059 0.0054 0.5008 0.0098 0.0079 0.0097 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 fda-1 0.9990 0.0656 0.0523 0.4295 0.5114 0.0882 0.0698 0.7897 92.9459 fda-2 1.0013 0.0296 0.0233 0.0874 0.5039 0.0415 0.0329 0.1735 98.7957 fda-3 1.0012 0.0285 0.0224 0.0812 0.5036 0.0407 0.0322 0.1670 99.9389 fda-4 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 99.9998 fda-5 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-6 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-7 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 36 Table 2.4 Performance of the estimation procedure where the residuals are generated from linear process (b) with 𝑙0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0020 0.0328 0.0260 0.1077 0.4998 0.0451 0.0349 0.2027 ldaAR 1.0000 0.0087 0.0069 0.0076 0.5009 0.0122 0.0096 0.0149 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0002 0.0002 0.0000 fda-1 1.0096 0.1430 0.1143 2.0496 0.5023 0.2001 0.1559 3.9978 97.9586 fda-2 1.0000 0.0326 0.0250 0.1058 0.5023 0.0440 0.0344 0.1941 99.5699 fda-3 0.9998 0.0144 0.0072 0.0208 0.4986 0.0209 0.0103 0.0436 99.8941 fda-4 1.0002 0.0053 0.0013 0.0028 0.5002 0.0067 0.0018 0.0044 99.9991 fda-5 1.0003 0.0037 0.0008 0.0014 0.4998 0.0042 0.0011 0.0018 100.0000 fda-6 1.0003 0.0037 0.0008 0.0014 0.4998 0.0042 0.0011 0.0018 100.0000 fda-7 1.0003 0.0037 0.0008 0.0014 0.4998 0.0042 0.0011 0.0018 100.0000 𝑛 = 300 init 0.9991 0.0174 0.0139 0.0303 0.5006 0.0262 0.0207 0.0684 ldaAR 1.0000 0.0047 0.0037 0.0022 0.4997 0.0066 0.0052 0.0044 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 fda-1 0.9957 0.0767 0.0616 0.5896 0.5038 0.1164 0.0919 1.3543 98.2262 fda-2 0.9996 0.0197 0.0158 0.0386 0.4996 0.0274 0.0219 0.0750 99.7663 fda-3 1.0002 0.0151 0.0101 0.0227 0.4990 0.0186 0.0125 0.0347 99.9255 fda-4 1.0001 0.0022 0.0003 0.0005 0.5001 0.0017 0.0003 0.0003 99.9997 fda-5 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-6 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 fda-7 1.0000 0.0002 0.0001 0.0000 0.5000 0.0002 0.0001 0.0000 100.0000 𝑛 = 500 init 0.9998 0.0147 0.0117 0.0216 0.5025 0.0199 0.0157 0.0400 ldaAR 1.0003 0.0038 0.0031 0.0015 0.5003 0.0052 0.0041 0.0027 ldaCS 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 fda-1 0.9990 0.0656 0.0523 0.4293 0.5114 0.0882 0.0698 0.7895 98.2518 fda-2 1.0008 0.0159 0.0126 0.0253 0.5015 0.0222 0.0175 0.0495 99.8021 fda-3 1.0001 0.0126 0.0091 0.0158 0.5002 0.0179 0.0130 0.0320 99.9449 fda-4 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 99.9998 fda-5 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-6 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 fda-7 1.0000 0.0001 0.0001 0.0000 0.5000 0.0001 0.0001 0.0000 100.0000 37 Table 2.5 Performance of the estimation procedure where the residuals are generated from Ornstein- Uhlenbeck process (c) with 𝜇0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0003 0.0541 0.0434 0.2922 0.4994 0.0711 0.0563 0.5048 ldaAR 1.0001 0.0476 0.0383 0.2261 0.4984 0.0650 0.0513 0.4214 ldaCS 0.9994 0.0398 0.0316 0.1581 0.4978 0.0534 0.0421 0.2849 fda-1 1.0009 0.0705 0.0563 0.4964 0.5006 0.0947 0.0747 0.8954 79.4156 fda-2 1.0001 0.0453 0.0358 0.2044 0.4982 0.0608 0.0482 0.3690 94.9669 fda-3 0.9993 0.0386 0.0307 0.1491 0.4978 0.0511 0.0405 0.2613 99.9949 fda-4 1.0003 0.0127 0.0081 0.0162 0.4991 0.0242 0.0146 0.0583 99.9992 fda-5 0.9997 0.0084 0.0071 0.0071 0.4991 0.0159 0.0124 0.0254 100.0000 fda-6 0.9997 0.0084 0.0071 0.0071 0.4991 0.0159 0.0124 0.0254 100.0000 fda-7 0.9997 0.0084 0.0071 0.0071 0.4991 0.0159 0.0124 0.0254 100.0000 𝑛 = 300 init 1.0000 0.0288 0.0233 0.0829 0.5003 0.0416 0.0329 0.1728 ldaAR 1.0000 0.0258 0.0206 0.0662 0.4996 0.0368 0.0294 0.1350 ldaCS 1.0002 0.0212 0.0170 0.0449 0.4987 0.0314 0.0254 0.0983 fda-1 0.9997 0.0388 0.0316 0.1503 0.5014 0.0560 0.0440 0.3130 79.9233 fda-2 1.0005 0.0240 0.0193 0.0576 0.4983 0.0352 0.0284 0.1242 95.0283 fda-3 0.9999 0.0202 0.0161 0.0409 0.4994 0.0298 0.0240 0.0884 99.9984 fda-4 0.9999 0.0072 0.0064 0.0051 0.4997 0.0129 0.0111 0.0166 99.9998 fda-5 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-6 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-7 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 𝑛 = 500 init 1.0000 0.0288 0.0233 0.0829 0.5003 0.0416 0.0329 0.1728 ldaAR 1.0000 0.0258 0.0206 0.0662 0.4996 0.0368 0.0294 0.1350 ldaCS 1.0002 0.0212 0.0170 0.0449 0.4987 0.0314 0.0254 0.0983 fda-1 0.9997 0.0388 0.0316 0.1503 0.5014 0.0560 0.0440 0.3130 79.9233 fda-2 1.0005 0.0240 0.0193 0.0576 0.4983 0.0352 0.0284 0.1242 95.0283 fda-3 0.9999 0.0202 0.0161 0.0409 0.4994 0.0298 0.0240 0.0884 99.9984 fda-4 0.9999 0.0072 0.0064 0.0051 0.4997 0.0129 0.0111 0.0166 99.9998 fda-5 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-6 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 fda-7 1.0000 0.0072 0.0064 0.0051 0.4997 0.0128 0.0111 0.0165 100.0000 38 Table 2.6 Performance of the estimation procedure where the residuals are generated from Ornstein- Uhlenbeck process (c) with 𝜇0 = 3. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 1.0001 0.0454 0.0363 0.2056 0.4990 0.0591 0.0469 0.3487 ldaAR 0.9996 0.0390 0.0312 0.1521 0.4979 0.0521 0.0414 0.2717 ldaCS 0.9999 0.0429 0.0341 0.1841 0.4981 0.0564 0.0448 0.3176 fda-1 1.0005 0.0557 0.0445 0.3100 0.5003 0.0743 0.0588 0.5511 59.1692 fda-2 1.0005 0.0459 0.0362 0.2107 0.4981 0.0604 0.0480 0.3639 87.0120 fda-3 1.0000 0.0436 0.0348 0.1894 0.4975 0.0568 0.0455 0.3221 99.9908 fda-4 1.0004 0.0136 0.0058 0.0185 0.4989 0.0237 0.0106 0.0562 99.9960 fda-5 1.0001 0.0049 0.0033 0.0024 0.4997 0.0099 0.0061 0.0098 100.0000 fda-6 1.0001 0.0049 0.0033 0.0024 0.4997 0.0099 0.0061 0.0098 100.0000 fda-7 1.0001 0.0049 0.0033 0.0024 0.4997 0.0099 0.0061 0.0098 100.0000 𝑛 = 300 init 1.0002 0.0239 0.0191 0.0568 0.4998 0.0345 0.0274 0.1190 ldaAR 1.0003 0.0209 0.0167 0.0435 0.4990 0.0303 0.0244 0.0916 ldaCS 1.0003 0.0224 0.0178 0.0500 0.4992 0.0328 0.0265 0.1075 fda-1 0.9998 0.0305 0.0247 0.0928 0.5011 0.0443 0.0349 0.1957 59.4418 fda-2 1.0004 0.0240 0.0192 0.0574 0.4991 0.0347 0.0279 0.1201 86.9290 fda-3 1.0003 0.0222 0.0177 0.0492 0.4992 0.0327 0.0264 0.1070 99.9972 fda-4 1.0002 0.0043 0.0039 0.0018 0.4998 0.0075 0.0065 0.0057 99.9987 fda-5 1.0001 0.0038 0.0035 0.0014 0.4998 0.0068 0.0059 0.0046 100.0000 fda-6 1.0001 0.0038 0.0035 0.0014 0.4998 0.0068 0.0059 0.0046 100.0000 fda-7 1.0001 0.0038 0.0035 0.0014 0.4998 0.0068 0.0059 0.0046 100.0000 𝑛 = 500 init 1.0003 0.0180 0.0141 0.0323 0.5017 0.0268 0.0213 0.0720 ldaAR 1.0001 0.0155 0.0123 0.0241 0.5016 0.0232 0.0186 0.0541 ldaCS 1.0002 0.0171 0.0134 0.0291 0.5013 0.0251 0.0202 0.0629 fda-1 1.0005 0.0229 0.0180 0.0523 0.5021 0.0346 0.0276 0.1201 59.3104 fda-2 1.0004 0.0177 0.0138 0.0314 0.5022 0.0268 0.0217 0.0724 87.0183 fda-3 1.0001 0.0173 0.0136 0.0300 0.5010 0.0249 0.0197 0.0620 99.9983 fda-4 1.0000 0.0042 0.0038 0.0017 0.5005 0.0075 0.0065 0.0056 99.9992 fda-5 1.0000 0.0038 0.0035 0.0015 0.5004 0.0066 0.0058 0.0044 100.0000 fda-6 1.0000 0.0038 0.0035 0.0015 0.5004 0.0066 0.0058 0.0044 100.0000 fda-7 1.0000 0.0038 0.0035 0.0015 0.5004 0.0066 0.0058 0.0044 100.0000 39 Table 2.7 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9968 0.0525 0.0423 0.2758 0.4961 0.0755 0.0603 0.5705 ldaAR 0.9985 0.0486 0.0387 0.2361 0.4962 0.0702 0.0562 0.4938 ldaCS 0.9978 0.0389 0.0309 0.1514 0.4986 0.0549 0.0438 0.3013 fda-1 0.9960 0.0708 0.0564 0.5018 0.4951 0.1010 0.0813 1.0195 73.1399 fda-2 0.9975 0.0439 0.0347 0.1929 0.4980 0.0629 0.0496 0.3948 87.5389 fda-3 0.9975 0.0381 0.0305 0.1453 0.4987 0.0531 0.0425 0.2811 92.2253 fda-4 0.9978 0.0392 0.0311 0.1540 0.4977 0.0540 0.0434 0.2914 94.4211 fda-5 0.9978 0.0388 0.0302 0.1508 0.4975 0.0527 0.0420 0.2773 95.6881 fda-6 0.9979 0.0393 0.0306 0.1546 0.4977 0.0534 0.0424 0.2852 96.5184 fda-7 0.9990 0.0396 0.0308 0.1570 0.4986 0.0535 0.0423 0.2857 97.0882 fda-8 0.9985 0.0405 0.0317 0.1637 0.4981 0.0540 0.0429 0.2915 97.5117 𝑛 = 300 init 0.9975 0.0297 0.0236 0.0885 0.4989 0.0428 0.0352 0.1832 ldaAR 0.9972 0.0281 0.0224 0.0793 0.4992 0.0389 0.0316 0.1509 ldaCS 0.9988 0.0226 0.0180 0.0513 0.4995 0.0300 0.0238 0.0899 fda-1 0.9967 0.0391 0.0310 0.1539 0.4986 0.0588 0.0482 0.3456 73.4971 fda-2 0.9986 0.0254 0.0203 0.0648 0.4990 0.0335 0.0266 0.1122 87.5190 fda-3 0.9987 0.0221 0.0173 0.0490 0.4996 0.0293 0.0233 0.0859 92.1425 fda-4 0.9986 0.0219 0.0171 0.0479 0.4997 0.0294 0.0233 0.0862 94.3152 fda-5 0.9988 0.0213 0.0167 0.0456 0.5000 0.0287 0.0232 0.0823 95.5616 fda-6 0.9989 0.0213 0.0166 0.0454 0.5001 0.0289 0.0232 0.0832 96.3760 fda-7 0.9988 0.0212 0.0166 0.0449 0.5003 0.0291 0.0234 0.0843 96.9441 fda-8 0.9988 0.0212 0.0166 0.0449 0.5001 0.0292 0.0236 0.0852 97.3641 𝑛 = 500 init 0.9996 0.0212 0.0169 0.0450 0.4989 0.0316 0.0252 0.0999 ldaAR 0.9993 0.0189 0.0151 0.0356 0.4983 0.0294 0.0237 0.0868 ldaCS 0.9996 0.0170 0.0135 0.0288 0.4998 0.0235 0.0193 0.0553 fda-1 0.9996 0.0278 0.0222 0.0773 0.4984 0.0422 0.0333 0.1777 73.6172 fda-2 0.9992 0.0186 0.0150 0.0347 0.4995 0.0264 0.0216 0.0694 87.5609 fda-3 1.0002 0.0164 0.0130 0.0268 0.5001 0.0225 0.0183 0.0505 92.1497 fda-4 1.0001 0.0163 0.0129 0.0264 0.4999 0.0226 0.0184 0.0508 94.3084 fda-5 1.0000 0.0160 0.0127 0.0255 0.4996 0.0223 0.0182 0.0496 95.5492 fda-6 1.0000 0.0161 0.0128 0.0260 0.4995 0.0222 0.0182 0.0493 96.3576 fda-7 0.9999 0.0161 0.0128 0.0258 0.4994 0.0219 0.0179 0.0481 96.9249 fda-8 0.9999 0.0161 0.0128 0.0258 0.4994 0.0219 0.0177 0.0478 97.3423 40 Table 2.8 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9967 0.0563 0.0454 0.3170 0.4957 0.0811 0.0649 0.6591 ldaAR 0.9980 0.0506 0.0399 0.2562 0.4970 0.0724 0.0578 0.5234 ldaCS 0.9982 0.0373 0.0294 0.1389 0.4983 0.0531 0.0425 0.2814 fda-1 0.9957 0.0766 0.0611 0.5872 0.4947 0.1094 0.0881 1.1964 85.6976 fda-2 0.9979 0.0440 0.0348 0.1934 0.4975 0.0633 0.0503 0.4010 98.9964 fda-3 0.9993 0.0239 0.0187 0.0569 0.4988 0.0343 0.0273 0.1177 99.9585 fda-4 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 99.9987 fda-5 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 fda-6 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 fda-7 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 fda-8 0.9984 0.0224 0.0176 0.0505 0.5004 0.0305 0.0242 0.0931 100.0000 𝑛 = 300 init 0.9973 0.0318 0.0253 0.1014 0.4987 0.0461 0.0378 0.2118 ldaAR 0.9972 0.0297 0.0239 0.0888 0.4992 0.0400 0.0323 0.1597 ldaCS 0.9990 0.0216 0.0172 0.0468 0.4995 0.0289 0.0229 0.0835 fda-1 0.9964 0.0424 0.0336 0.1804 0.4984 0.0637 0.0521 0.4047 86.0991 fda-2 0.9987 0.0253 0.0201 0.0642 0.4990 0.0336 0.0266 0.1129 99.0443 fda-3 0.9992 0.0146 0.0114 0.0213 0.5003 0.0197 0.0159 0.0388 99.9593 fda-4 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 99.9987 fda-5 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 100.0000 fda-6 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 100.0000 fda-7 0.9993 0.0128 0.0100 0.0165 0.5000 0.0176 0.0139 0.0309 100.0000 𝑛 = 500 init 0.9994 0.0226 0.0180 0.0512 0.4987 0.0340 0.0270 0.1153 ldaAR 0.9993 0.0201 0.0161 0.0403 0.4983 0.0309 0.0250 0.0954 ldaCS 0.9993 0.0163 0.0130 0.0266 0.4995 0.0227 0.0186 0.0517 fda-1 0.9995 0.0301 0.0240 0.0906 0.4983 0.0456 0.0361 0.2081 86.1915 fda-2 0.9990 0.0186 0.0150 0.0346 0.4992 0.0265 0.0217 0.0699 99.0585 fda-3 1.0004 0.0111 0.0087 0.0122 0.5002 0.0148 0.0121 0.0219 99.9596 fda-4 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 99.9987 fda-5 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 fda-6 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 fda-7 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 fda-8 1.0007 0.0100 0.0080 0.0100 0.5007 0.0129 0.0106 0.0168 100.0000 41 Table 2.9 Performance of the estimation procedure where the residuals are generated with power exponential covariance function (d) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9970 0.0582 0.0468 0.3390 0.4952 0.0841 0.0676 0.7089 ldaAR 0.9977 0.0528 0.0415 0.2789 0.4961 0.0772 0.0621 0.5956 ldaCS 0.9996 0.0274 0.0218 0.0747 0.4979 0.0396 0.0321 0.1572 fda-1 0.9956 0.0810 0.0646 0.6564 0.4943 0.1157 0.0933 1.3399 92.2949 fda-2 0.9995 0.0375 0.0296 0.1400 0.4966 0.0548 0.0441 0.3013 99.7252 fda-3 0.9993 0.0283 0.0203 0.0801 0.5004 0.0383 0.0274 0.1462 99.8787 fda-4 0.9991 0.0121 0.0053 0.0147 0.4995 0.0165 0.0073 0.0273 99.9461 fda-5 1.0001 0.0116 0.0022 0.0134 0.5005 0.0146 0.0030 0.0212 99.9842 fda-6 1.0000 0.0133 0.0029 0.0177 0.5004 0.0184 0.0041 0.0336 99.9979 fda-7 1.0000 0.0133 0.0028 0.0176 0.5004 0.0183 0.0040 0.0334 99.9990 fda-8 1.0000 0.0133 0.0028 0.0176 0.5004 0.0183 0.0040 0.0334 99.9998 𝑛 = 300 init 0.9972 0.0328 0.0260 0.1082 0.4986 0.0482 0.0395 0.2317 ldaAR 0.9971 0.0310 0.0251 0.0968 0.4994 0.0426 0.0344 0.1810 ldaCS 0.9995 0.0156 0.0123 0.0244 0.4996 0.0216 0.0171 0.0467 fda-1 0.9962 0.0449 0.0356 0.2027 0.4982 0.0673 0.0552 0.4524 92.6741 fda-2 0.9990 0.0213 0.0168 0.0454 0.4994 0.0291 0.0229 0.0846 99.8076 fda-3 0.9995 0.0196 0.0153 0.0384 0.4992 0.0271 0.0213 0.0736 99.9391 fda-4 1.0005 0.0075 0.0044 0.0056 0.4994 0.0105 0.0064 0.0111 99.9716 fda-5 0.9999 0.0023 0.0006 0.0005 0.5000 0.0037 0.0010 0.0014 99.9889 fda-6 1.0000 0.0025 0.0006 0.0006 0.4999 0.0049 0.0011 0.0024 99.9979 fda-7 1.0000 0.0012 0.0003 0.0001 0.5002 0.0039 0.0006 0.0015 99.9989 fda-8 1.0000 0.0012 0.0003 0.0001 0.5002 0.0039 0.0006 0.0015 99.9997 𝑛 = 500 init 0.9994 0.0232 0.0185 0.0539 0.4985 0.0352 0.0280 0.1242 ldaAR 0.9989 0.0209 0.0169 0.0437 0.4981 0.0327 0.0264 0.1073 ldaCS 0.9991 0.0119 0.0095 0.0142 0.4992 0.0168 0.0134 0.0281 fda-1 0.9995 0.0318 0.0254 0.1011 0.4981 0.0483 0.0382 0.2328 92.7586 fda-2 0.9988 0.0158 0.0127 0.0252 0.4988 0.0227 0.0183 0.0517 99.8272 fda-3 0.9989 0.0153 0.0123 0.0235 0.4989 0.0218 0.0176 0.0478 99.9574 fda-4 0.9997 0.0070 0.0050 0.0049 0.5002 0.0105 0.0072 0.0110 99.9811 fda-5 1.0000 0.0024 0.0007 0.0006 0.5001 0.0035 0.0011 0.0012 99.9923 fda-6 1.0000 0.0026 0.0006 0.0007 0.5003 0.0045 0.0011 0.0021 99.9979 fda-7 1.0001 0.0017 0.0004 0.0003 0.5000 0.0038 0.0007 0.0014 99.9989 fda-8 1.0001 0.0015 0.0003 0.0002 0.5001 0.0035 0.0007 0.0012 99.9997 42 Table 2.10 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 1. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9967 0.0565 0.0455 0.3193 0.4957 0.0814 0.0651 0.6624 ldaAR 0.9982 0.0507 0.0399 0.2567 0.4968 0.0725 0.0580 0.5257 ldaCS 0.9983 0.0349 0.0275 0.1215 0.4986 0.0495 0.0396 0.2444 fda-1 0.9957 0.0774 0.0617 0.5994 0.4946 0.1105 0.0890 1.2213 87.2296 fda-2 0.9980 0.0413 0.0326 0.1706 0.4979 0.0594 0.0471 0.3521 98.4366 fda-3 0.9989 0.0265 0.0210 0.0704 0.4988 0.0379 0.0302 0.1434 99.8285 fda-4 0.9985 0.0271 0.0212 0.0733 0.4991 0.0372 0.0296 0.1383 99.9800 fda-5 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 99.9977 fda-6 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 99.9997 fda-7 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 100.0000 fda-8 0.9987 0.0259 0.0203 0.0671 0.4987 0.0350 0.0276 0.1224 100.0000 𝑛 = 300 init 0.9973 0.0319 0.0253 0.1021 0.4987 0.0463 0.0381 0.2144 ldaAR 0.9971 0.0297 0.0239 0.0889 0.4992 0.0404 0.0327 0.1630 ldaCS 0.9991 0.0203 0.0161 0.0412 0.4995 0.0271 0.0215 0.0736 fda-1 0.9964 0.0429 0.0339 0.1846 0.4984 0.0643 0.0527 0.4132 87.6273 fda-2 0.9988 0.0238 0.0190 0.0568 0.4991 0.0317 0.0251 0.1002 98.4851 fda-3 0.9991 0.0159 0.0125 0.0255 0.5002 0.0215 0.0173 0.0462 99.8287 fda-4 0.9991 0.0157 0.0121 0.0248 0.5001 0.0214 0.0171 0.0457 99.9800 fda-5 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 99.9977 fda-6 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 99.9997 fda-7 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 100.0000 fda-8 0.9993 0.0144 0.0113 0.0209 0.5005 0.0202 0.0164 0.0408 100.0000 𝑛 = 500 init 0.9995 0.0227 0.0181 0.0514 0.4987 0.0341 0.0271 0.1160 ldaAR 0.9993 0.0200 0.0160 0.0399 0.4982 0.0310 0.0250 0.0962 ldaCS 0.9994 0.0153 0.0122 0.0235 0.4996 0.0213 0.0174 0.0454 fda-1 0.9995 0.0304 0.0243 0.0924 0.4982 0.0461 0.0365 0.2125 87.7225 fda-2 0.9991 0.0176 0.0142 0.0310 0.4993 0.0249 0.0204 0.0620 98.5022 fda-3 1.0003 0.0121 0.0096 0.0146 0.5002 0.0163 0.0133 0.0265 99.8293 fda-4 1.0005 0.0120 0.0096 0.0145 0.5004 0.0160 0.0131 0.0256 99.9801 fda-5 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 99.9977 fda-6 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 99.9997 fda-7 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 100.0000 fda-8 1.0002 0.0114 0.0090 0.0129 0.5000 0.0151 0.0121 0.0227 100.0000 43 Table 2.11 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 2. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method 𝛽1 𝛽2 FVE Mean SD AB MSE Mean SD AB MSE %-age 𝑛 = 100 init 0.9967 0.0547 0.0441 0.2993 0.4960 0.0788 0.0628 0.6206 ldaAR 0.9983 0.0498 0.0393 0.2473 0.4968 0.0714 0.0571 0.5091 ldaCS 0.9977 0.0425 0.0337 0.1805 0.4981 0.0602 0.0481 0.3618 fda-1 0.9957 0.0730 0.0583 0.5343 0.4951 0.1042 0.0839 1.0868 78.3365 fda-2 0.9972 0.0479 0.0381 0.2297 0.4975 0.0687 0.0544 0.4721 96.4030 fda-3 0.9981 0.0368 0.0293 0.1358 0.4982 0.0520 0.0418 0.2699 99.4758 fda-4 0.9982 0.0377 0.0299 0.1424 0.4979 0.0528 0.0421 0.2791 99.9264 fda-5 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 99.9902 fda-6 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 99.9987 fda-7 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 99.9998 fda-8 0.9986 0.0356 0.0280 0.1266 0.4975 0.0483 0.0384 0.2335 100.0000 𝑛 = 300 init 0.9974 0.0309 0.0246 0.0959 0.4988 0.0445 0.0365 0.1977 ldaAR 0.9972 0.0290 0.0233 0.0848 0.4991 0.0393 0.0317 0.1542 ldaCS 0.9987 0.0246 0.0195 0.0606 0.4994 0.0327 0.0259 0.1066 fda-1 0.9966 0.0403 0.0320 0.1633 0.4986 0.0607 0.0497 0.3682 78.7132 fda-2 0.9985 0.0276 0.0220 0.0764 0.4989 0.0364 0.0290 0.1327 96.4278 fda-3 0.9985 0.0218 0.0171 0.0477 0.4999 0.0290 0.0232 0.0842 99.4738 fda-4 0.9985 0.0218 0.0171 0.0478 0.4998 0.0293 0.0235 0.0858 99.9259 fda-5 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 99.9900 fda-6 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 99.9987 fda-7 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 99.9998 fda-8 0.9988 0.0198 0.0157 0.0394 0.5004 0.0271 0.0224 0.0733 100.0000 𝑛 = 500 init 0.9995 0.0221 0.0176 0.0490 0.4989 0.0330 0.0263 0.1086 ldaAR 0.9993 0.0197 0.0158 0.0389 0.4983 0.0302 0.0245 0.0910 ldaCS 0.9994 0.0184 0.0146 0.0339 0.4996 0.0257 0.0211 0.0658 fda-1 0.9996 0.0288 0.0229 0.0826 0.4984 0.0435 0.0344 0.1892 78.8101 fda-2 0.9991 0.0201 0.0162 0.0405 0.4993 0.0286 0.0235 0.0819 96.4486 fda-3 1.0003 0.0162 0.0128 0.0262 0.5000 0.0222 0.0181 0.0491 99.4752 fda-4 1.0003 0.0162 0.0129 0.0263 0.4999 0.0224 0.0182 0.0501 99.9261 fda-5 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 99.9900 fda-6 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 99.9987 fda-7 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 99.9998 fda-8 0.9999 0.0151 0.0120 0.0227 0.4994 0.0208 0.0170 0.0434 100.0000 44 Table 2.12 Performance of the estimation procedure where the residuals are generated with rational quadratic covariance function (e) where scale parameter 𝑎 0 = 1 and shape parameter 𝑏 0 = 5. Mean of the estimated coefficients, standard deviation, absolute bias, mean square error (×100) and FVE in percentage are summarized upto four decimal places. 
Method   𝛽1 Mean   SD       AB       MSE      𝛽2 Mean   SD       AB       MSE      FVE %
n = 100
init     0.9967    0.0504   0.0407   0.2545   0.4965    0.0725   0.0577   0.5251
ldaAR    0.9986    0.0466   0.0369   0.2173   0.4965    0.0671   0.0539   0.4508
ldaCS    0.9970    0.0473   0.0377   0.2238   0.4978    0.0668   0.0531   0.4458
fda-1    0.9959    0.0648   0.0517   0.4205   0.4960    0.0922   0.0742   0.8505   62.1253
fda-2    0.9964    0.0505   0.0405   0.2561   0.4976    0.0721   0.0572   0.5199   89.2976
fda-3    0.9970    0.0462   0.0368   0.2136   0.4974    0.0644   0.0513   0.4142   97.4903
fda-4    0.9981    0.0464   0.0363   0.2149   0.4957    0.0647   0.0520   0.4201   99.4795
fda-5    0.9986    0.0441   0.0345   0.1941   0.4952    0.0621   0.0497   0.3872   99.9012
fda-6    0.9987    0.0446   0.0350   0.1986   0.4958    0.0622   0.0500   0.3885   99.9827
fda-7    0.9986    0.0440   0.0343   0.1934   0.4959    0.0625   0.0504   0.3921   99.9971
fda-8    0.9986    0.0440   0.0343   0.1934   0.4959    0.0625   0.0504   0.3921   99.9995
n = 300
init     0.9977    0.0286   0.0229   0.0820   0.4991    0.0406   0.0332   0.1646
ldaAR    0.9975    0.0271   0.0215   0.0737   0.4991    0.0367   0.0296   0.1346
ldaCS    0.9983    0.0272   0.0216   0.0739   0.4994    0.0362   0.0288   0.1309
fda-1    0.9972    0.0355   0.0283   0.1264   0.4991    0.0539   0.0441   0.2896   62.2661
fda-2    0.9983    0.0289   0.0229   0.0835   0.4990    0.0384   0.0308   0.1473   89.2161
fda-3    0.9981    0.0266   0.0209   0.0712   0.4992    0.0355   0.0282   0.1257   97.4664
fda-4    0.9981    0.0260   0.0205   0.0677   0.4992    0.0351   0.0278   0.1227   99.4727
fda-5    0.9984    0.0244   0.0192   0.0594   0.4996    0.0332   0.0267   0.1097   99.8990
fda-6    0.9985    0.0243   0.0191   0.0589   0.4997    0.0334   0.0269   0.1115   99.9820
fda-7    0.9985    0.0237   0.0189   0.0562   0.4998    0.0334   0.0270   0.1112   99.9969
fda-8    0.9985    0.0237   0.0189   0.0562   0.4998    0.0334   0.0270   0.1112   99.9995
n = 500
init     0.9996    0.0207   0.0165   0.0430   0.4992    0.0304   0.0245   0.0921
ldaAR    0.9993    0.0187   0.0151   0.0351   0.4985    0.0281   0.0229   0.0792
ldaCS    0.9996    0.0201   0.0160   0.0404   0.4997    0.0282   0.0230   0.0792
fda-1    0.9997    0.0256   0.0204   0.0654   0.4988    0.0385   0.0304   0.1484   62.3377
fda-2    0.9992    0.0210   0.0168   0.0440   0.4995    0.0299   0.0244   0.0892   89.2460
fda-3    1.0001    0.0193   0.0153   0.0372   0.4999    0.0270   0.0219   0.0728   97.4696
fda-4    0.9997    0.0188   0.0150   0.0355   0.4993    0.0269   0.0219   0.0724   99.4726
fda-5    0.9994    0.0180   0.0142   0.0322   0.4989    0.0260   0.0211   0.0674   99.8988
fda-6    0.9993    0.0180   0.0142   0.0323   0.4988    0.0258   0.0210   0.0668   99.9818
fda-7    0.9992    0.0176   0.0140   0.0309   0.4988    0.0258   0.0209   0.0663   99.9969
fda-8    0.9992    0.0176   0.0140   0.0309   0.4988    0.0258   0.0209   0.0663   99.9995

2.4.3 Simulation results
Simulation results for the Brownian-motion case are shown in Table 2.1; in this setting, our approach produces better results in terms of the dispersion measures. Tables 2.2, 2.3 and 2.4 show results for linear processes; here the proposed method performs better when the working correlation structure is AR and is comparable for an exchangeable structure with 𝑙0 = 1, 2, 3. Moreover, for the proposed method, all dispersion measures, such as the MSE, decrease as 𝑙 increases. Results based on the OU process are reported in Tables 2.5 and 2.6; our method outperforms the existing methods in both situations and, as 𝜇0 increases, the MSE decreases. For three different parameter choices of the power exponential and rational quadratic covariance structures, numerical results are presented in Tables 2.7, 2.8, 2.9 and 2.10, 2.11, 2.12, respectively. As before, the proposed method outperforms the existing ones in all sub-cases; interestingly, as 𝑏0 increases, the MSE decreases for the power exponential structure, whereas it increases for the rational quadratic structure, as expected from the form of the covariance function.
Overall, for all the above situations, the dispersion measures such as SD and MSE decrease as the sample size increases, indicating that the parameter estimates approach the true parameters as the sample size grows. Empirically, we also observed that if the estimated number of principal components 𝜅̂ exceeds 𝜅0, then Q(𝜷̂) may not be continuous at 𝜷̂. In each of the above situations, the SDs of the proposed estimates decrease as 𝜅0 increases and stabilize beyond the value of 𝜅0 at which the fraction of variance explained (FVE) is approximately 100%.

2.5 Real data analysis
In this section, we apply our proposed method to motivating examples in two different data-sets.

2.5.1 Beijing's PM2.5 pollution study
In the atmosphere, suspended microscopic particles of solid and liquid matter are commonly known as particulates or particulate matter (PM). Such particulates often have a strongly noxious impact on human health, climate, and visibility. One common fine type of atmospheric particle is PM2.5, with a diameter of less than 2.5 micrometers. Many developed and developing cities across the world experience chronic air pollution with PM2.5 as the major pollutant; Beijing and a substantial part of China are among such places. Some studies show that there are many non-ignorable sources of variability in the distribution and transmission pattern of PM2.5, which are confounded with secondary chemical generation. The atmospheric PM2.5 data used in our analysis were collected from the UCI machine learning repository https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data (Liang et al., 2015). The data-set includes measurements of PM2.5 and associated covariates at twelve different locations in China, viz., Aotizhongxin, Changping, Dingling, Dongsi, Guanyuan, Gucheng, Huairou, Nongzhanguan, Shunyi, Tiantan, Wanliu, and Wanshouxigong, during January 2017. After excluding missing data, there were 608 hourly data points in Beijing2017-data. We assume that the measurements from the different locations are independent since the monitoring sites are located far apart. The objective of our analysis is to describe the trend of the functional response PM2.5 (as shown in Figure 2.1) and to evaluate the effect over time of covariates including the chemical compounds sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO) and ozone (O3). We smoothed the covariates and responses to reduce variability and centered them. Subsequently, we consider the following model:

Y_i(t) = 𝛽0 + SO2(t) 𝛽1 + NO2(t) 𝛽2 + CO(t) 𝛽3 + O3(t) 𝛽4 + e_i(t)    (2.16)

We use Algorithm 2.1 to estimate the coefficients of the regression model above. Consistent with the simulation results, we observe that the standard deviations of the coefficient estimates decrease as 𝜅0 increases. For a small FVE such as 50%, the corresponding 𝜅0 = 1 and the estimation procedure performs poorly, whereas for large FVE percentages the estimation procedure improves considerably in terms of standard error. The estimated values of 𝛽0, 𝛽1, 𝛽2, 𝛽3, and 𝛽4 are similar across different choices of 𝜅0. Based on the estimated standard errors and the scree plot, we conclude that a suitable choice of 𝜅0 is approximately 10. The estimated scaled eigen-values are provided in Figure 2.2, which clearly shows their decay rate. The estimated coefficients with standard errors are 0.0009 (1.1644), 0.0829 (0.2584), 0.9503 (0.1586), 0.0196 (0.0037) and 1.1523 (0.1198), respectively.
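Since the choice of 𝜅0 above is driven by the fraction of variance explained, the following is a minimal R sketch of that selection rule only; it is not Algorithm 2.1 itself, and the vector `lambda_hat` of estimated eigen-values and the 99.9% target are illustrative assumptions.

```r
# Minimal sketch (not Algorithm 2.1): choose kappa0 from estimated FPCA
# eigen-values via the fraction of variance explained (FVE).
choose_kappa0 <- function(lambda_hat, fve_target = 0.999) {
  fve <- cumsum(lambda_hat) / sum(lambda_hat)      # cumulative FVE
  list(kappa0 = which(fve >= fve_target)[1], fve = fve)
}

lambda_hat <- (1:25)^(-2)                          # made-up decaying spectrum
sel <- choose_kappa0(lambda_hat)
plot(sel$fve, type = "b",
     xlab = "number of eigen-functions", ylab = "FVE")  # scree-type plot of FVE
abline(v = sel$kappa0, lty = 2)                    # chosen kappa0
```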
Figure 2.2 Beijing2017-data results: Scree plots of fraction of variance explained (FVE).

2.5.2 DTI study for sleep apnea patients
MRI is a powerful technique for investigating structural and functional changes in the brain during pathological and neuro-psychological processes. Due to advances in diffusion tensor imaging (DTI), several studies on white matter alterations associated with clinical variables can be found in the recent literature. For our analysis, we use Apnea-data obtained from one such study on obstructive sleep apnea (OSA) patients (Xiong et al., 2017). The data consist of 29 male patients between the ages of 30 and 55 years who underwent a study for the diagnosis of continuous positive airway pressure (CPAP) therapy. Patients with sleep disorders other than OSA, night-shift workers, and patients with psychiatric disorders, hypertension, diabetes, or other neurological disorders were excluded. In this study, the psychomotor vigilance task (PVT) was performed, in which a light was randomly switched on on a screen for several seconds within a certain interval of time and subjects were asked to press a button as soon as they saw the light appear on the screen; such an experiment provides a numerical measure of sleepiness by counting the number of "lapses" for each individual. DTI was performed on a 3T MRI scanner using a commercial 32-channel head coil, followed by analysis using tract-based spatial statistics to investigate differences in fractional anisotropy (FA) and other DTI parameters. Image acquisition was as follows. An axial T1-weighted image of the brain (3D-BRAVO) was collected with repetition time (TR) = 12 ms, echo time (TE) = 5.2 ms, flip angle = 13°, inversion time = 450 ms, matrix = 384 × 256, voxel size = 1.2 × 0.57 × 0.69 mm, and scan time = 2 min 54 s. DTI data were obtained in the axial plane using a spin-echo echo planar imaging sequence with TR = 4500 ms, TE = 89.4 ms, field of view = 20 × 20 cm², matrix size = 160 × 132, slice thickness = 3 mm, slice spacing = 1 mm, and b-values = 0, 1000 s/mm².
Our objective is to investigate the structural alteration of white matter using DTI in patients with OSA over each voxel in various regions of the brain (ROIs). Thus, our response variable is one of the DTI parameters, fractional anisotropy (FA), and we are interested in studying the effect on FA, over the continuous domain of voxels, of the interaction between lapses and voxel location in each ROI. We consider the following model for each ROI:

FA_i(s) = 𝛽0 + 𝛽1 lapses_i × s + e_i(s)    (2.17)

where s ∈ S, a set of voxels in the considered ROI. Using Algorithm 2.1, we estimate the coefficients 𝛽0 and 𝛽1 in model (2.17), and the results are presented in Table 2.13. We find that the coefficient estimates are close to their initial estimates and that the estimated standard errors are smaller for the coefficients based on the proposed method. Here 𝜅0 (i.e., the number of eigen-functions) is determined by FVE for simplicity.

2.6 Discussion
In this chapter, we propose an estimation procedure for the constant linear effects model, which is commonly used in statistics (Zhang and Banerjee, 2021), especially in spatial modelling.

Table 2.13 Apnea-data results: Estimated values and associated standard errors for the regression coefficients are provided up to four decimal places based on the existing and proposed methods. The first line for each ROI shows results based on the initial estimates and the second line corresponds to the proposed estimates.
ROI      # functional points   𝛽0 Estimate   Std. Error (×100)   𝛽1 Estimate (×100)   Std. Error (×100)
ROI.6    659                   0.4512        0.1343              -0.0606              0.0130
                               0.4512        0.0983              -0.0605              0.0028
ROI.7    1362                  0.5048        0.0628               0.0309              0.0061
                               0.5050        0.0681               0.0342              0.0007
ROI.8    1370                  0.5256        0.0586              -0.0667              0.0057
                               0.5271        0.0346              -0.0733              0.0006
ROI.9    690                   0.4951        0.0910               0.2904              0.0088
                               0.5443        0.0874               0.1660              0.0014
ROI.10   699                   0.4951        0.0892               0.3314              0.0086
                               0.5262        0.1398               0.4231              0.0014
ROI.11   968                   0.4372        0.0979               0.1323              0.0095
                               0.4380        0.0637               0.1311              0.0009
ROI.12   968                   0.4529        0.0948               0.0965              0.0092
                               0.4664        0.0750               0.0504              0.0013
ROI.13   992                   0.5448        0.1060               0.3453              0.0103
                               0.5449        0.0856               0.3559              0.0011
ROI.14   992                   0.5435        0.1068               0.3432              0.0104
                               0.5436        0.0754               0.3437              0.0003
ROI.37   1236                  0.3695        0.0779              -0.1126              0.0076
                               0.3713        0.0669              -0.1175              0.0017
ROI.38   1155                  0.3564        0.0819              -0.1356              0.0079
                               0.3578        0.0420              -0.1356              0.0009
ROI.39   1124                  0.4618        0.0760               0.1972              0.0074
                               0.4621        0.0615               0.1996              0.0007
ROI.40   1125                  0.4786        0.0658               0.0953              0.0064
                               0.4780        0.0369               0.1016              0.0005
ROI.45   380                   0.4189        0.1071               0.1647              0.0104
                               0.4190        0.0175               0.1648              0.0001
ROI.46   376                   0.4074        0.1033               0.1988              0.0100
                               0.4074        0.0159               0.1994              0.0002
ROI.47   596                   0.4596        0.0932               0.1304              0.0090
                               0.4594        0.0191               0.1349              0.0001
ROI.48   600                   0.4045        0.0868               0.1100              0.0084
                               0.4036        0.0644               0.1067              0.0006

One of the key features of this estimation procedure is that it is based on the quadratic inference methodology, which has played a major role in the analysis of correlated data since it was introduced by Qu et al. (2000). In contrast to the existing method, our approach allows the number of repeated measurements to grow with the sample size; therefore, the trajectories of individuals can be observed on a dense grid of a continuum domain. Instead of assuming a working correlation structure, we propose a data-driven approach in which the eigen-functions are estimated by functional principal component analysis. We achieve √n-consistency of the parametric estimates in the regression model, even though the eigen-functions are estimated non-parametrically. Additionally, our method is easy to implement in a wide range of applications. The applicability of the proposed method is illustrated by extensive simulation studies. Moreover, two real-data applications in different scientific domains confirm the efficacy of the proposed method.

2.7 Technical details
2.7.1 Some preliminary definitions and concepts of operators
In this section, we review some basic concepts of operators and some of their useful properties. These can be found in Dunford and Schwartz (1988) and Riesz and Sz.-Nagy (1990), along with many other textbooks on functional analysis. Since FDA deals with continuous-time stochastic processes, we need tools for handling random functions, and hence an overview of function spaces is required. Perturbation theory of compact operators is required to discuss functional principal component analysis; it is presented here together with some results that are useful in our context. In the next subsection, we briefly discuss functional principal component analysis and show how the eigen-values and eigen-functions are estimated from the data at hand; this plays a fundamental role in our proposed method. Consider the standard L²[0, 1] space, the set of square-integrable real-valued functions defined on the closed interval [0, 1].
The space L²[0, 1] is equipped with the inner product

⟨f, g⟩ = ∫₀¹ f(t) g(t) dt    (2.18)

for f and g in that space, and forms a Hilbert space. Moreover, we denote by ∥·∥₂ the norm in L²[0, 1], defined as ∥f∥₂ = {∫ f²(u) du}^{1/2}. Let F be an operator that assigns to an element f in L²[0, 1] a new element F f in L²[0, 1]. The operator is linear if F(αf + βg) = αF f + βF g for any scalars α and β. It is said to be bounded if there exists a positive constant M such that ∥F f∥ ≤ M∥f∥ for all f ∈ L²[0, 1]; the smallest such bound is called the norm of the operator F, denoted by ∥F∥, and it satisfies ∥F∥ = sup_{∥f∥≤1} ∥F f∥. An operator is bounded if and only if it is continuous. F is said to be self-adjoint if ⟨F f, g⟩ = ⟨f, F g⟩ and non-negative definite if ⟨F f, f⟩ ≥ 0 for all f ∈ L²[0, 1].

Consider the linear mapping F f(·) = ∫ R(·, u) f(u) du for any function f ∈ L²[0, 1] and some integrable function R(·, ·) on [0, 1] × [0, 1]. Such a mapping is known as an integral operator, and the bivariate function R is known as its kernel in the statistics and functional analysis literature. This linear mapping is bounded since, by the Cauchy–Schwarz inequality,

|F f(t)|² ≤ ∫ R²(t, u) du ∫ f²(u) du = ∥f∥₂² ∫ R²(t, u) du    (2.19)

Furthermore, ∥F f∥₂² ≤ ∥f∥₂² ∫∫ R²(s, t) ds dt under the assumption that ∫∫ R²(u, v) du dv < ∞. It is easy to see that F f(·) is uniformly continuous, and that F is compact for a non-negative definite symmetric kernel R. If, for some λ, the Fredholm integral equation Fφ = λφ has a non-zero solution φ, then λ is called an eigen-value of F and the solution of the eigen-equation is called an eigen-function; together, the pair (λ, φ) is called the eigen-elements. Let f₁ and f₂ be elements of a Hilbert space H; the tensor product operator (f₁ ⊗ f₂) : H → H is defined by (f₁ ⊗ f₂)(g) = ⟨f₁, g⟩ f₂ for g ∈ H. For a compact self-adjoint operator on L²[0, 1], let {(λ_k, φ_k) : k ≥ 1} be the set of eigen-components such that the φ_k are orthogonal. Then any function f ∈ L²[0, 1] can be represented as

f = f₀ + Σ_{r=1}^∞ P_r f    (2.20)

where P_r is the projection operator onto the eigen-space of λ_r; in our particular situation, P_r = φ_r ⊗ φ_r. Moreover, for f₀ such that F f₀ = 0, it can be shown that

F = Σ_{r=1}^∞ λ_r P_r    (2.21)

where each λ_r is repeated according to its multiplicity. Due to the non-negative definiteness of F, the eigen-values are ordered as λ₁ ≥ λ₂ ≥ ⋯ ≥ 0. We now state Mercer's theorem (Mercer, 1909) for a symmetric continuous non-negative definite kernel function R. It states that, for the set of eigen-components {(λ_r, φ_r)}_{r≥1}, the kernel R has the representation

R(s, t) = Σ_{r=1}^∞ λ_r φ_r(s) φ_r(t)    (2.22)

where the sum converges absolutely and uniformly. In this chapter, we assume that the eigen-values are distinct for mathematical simplicity. We conclude this subsection with the perturbation theory of compact operators (compact in the sense that, for every bounded sequence {f_n}, the sequence {F f_n} has a convergent subsequence). In the statistical literature, this material can be found in Hall and Hosseini-Nasab (2006, 2009), and it is useful for bounding eigen-components in different applications. For self-adjoint compact operators F and G on a Hilbert space H, define the perturbation operator Δ = G − F, so that G = F + Δ, where G is an approximation to F and Δ is the resulting error.
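Mercer's expansion (2.22) can be checked numerically by discretizing a kernel on a grid and eigen-decomposing the resulting matrix. The following R sketch is only an illustration under an assumed Brownian-motion kernel R(s, t) = min(s, t); it is not part of the proposed estimation procedure.

```r
# Numerical illustration of Mercer's expansion: discretize R(s,t) on a grid,
# eigen-decompose the kernel matrix (scaled by the grid spacing), and rebuild
# R from a few leading components.
tt <- seq(0, 1, length.out = 101)
dt <- diff(tt)[1]
R  <- outer(tt, tt, function(s, t) pmin(s, t))    # Brownian-motion kernel (illustrative)
eig <- eigen(R * dt, symmetric = TRUE)            # approximate lambda_r and phi_r
phi <- eig$vectors / sqrt(dt)                     # rescale so that int phi_r^2 ds = 1
K   <- 10
R_trunc <- phi[, 1:K] %*% diag(eig$values[1:K]) %*% t(phi[, 1:K])
max(abs(R - R_trunc))                             # truncation error shrinks as K grows
```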
Let F and G have kernels 𝐹 and 𝐺 respectively with eigen-elements (𝜃 𝑟 , 𝜓𝑟 ) and (𝜆𝑟 , 𝜙𝑟 ). For simplicity, we assume that the eigen-values are distinct. Then the following Lemma provides perturbation of the eigen-functions. Lemma 2.7.1 (Theorem 5.1.8 in Hsing and Eubank (2015)). Let (𝜆, 𝜙) be the eigen-components of F and (𝜃, 𝜓) be that of G with multiplicity of all eigen-values are restricted to be 1. Define 𝜂 𝑘 = min𝑟≠𝑘 |𝜆𝑟 − 𝜆 𝑘 |. Assume ⟨𝜙𝑟 , 𝜓𝑟 ⟩ ≥ 0 and 𝜂 𝑘 > 0. Then ∞ ∑︁ 𝜓𝑘 − 𝜙𝑘 = (𝜃 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ𝜓 𝑘 + P𝑘 (𝜓 𝑘 − 𝜙 𝑘 ) (2.23) 𝑟=1 𝑟≠𝑘 53 The above equation follows ∞ ∑︁ 𝜓𝑘 − 𝜙𝑘 = (𝜃 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ𝜓 𝑘 + 𝑂 (∥Δ∥ 2 ) (2.24) 𝑟=1 𝑟≠𝑘 Remark 2.7.1. Equation (2.24) plays an important role in finding the bound of the proposed estimator introduced in Section 2.2.2. Note that sup𝑟 ≥1 |𝜃 𝑟 − 𝜆𝑟 | ≤ ∥Δ∥ ≤ inf 𝑟≠𝑘 |𝜆 𝑘 − 𝜆𝑟 | (see Theorem 4.2.8 in Hsing and Eubank (2015) for proof). Thus, it is easy to see, |𝜃 𝑟 − 𝜆𝑟 | ≤ |𝜆 𝑘 − 𝜆𝑟 | which implies from Equation (2.23) ∞ ∞  𝑠 ∑︁ −1 ∑︁ 𝜆 𝑘 − 𝜃𝑟 𝜓𝑘 − 𝜙𝑘 = (𝜆 𝑘 − 𝜆𝑟 ) P𝑟 Δ{𝜙 𝑘 + (𝜓 𝑘 − 𝜙 𝑘 )} + P𝑘 (𝜓 𝑘 − 𝜙 𝑘 ) 𝑟=1 𝑠=0 𝜆 𝑘 − 𝜆 𝑟 𝑟≠𝑘 ∞ ∑︁ ∞ ∑︁ = (𝜆 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ𝜙 𝑘 + (𝜆 𝑘 − 𝜆𝑟 ) −1 P𝑟 Δ(𝜓 𝑘 − 𝜙 𝑘 ) 𝑟=1 𝑟=1 𝑟≠𝑘 𝑟≠𝑘 ∞ ∑︁ ∞ ∑︁ (𝜆 𝑘 − 𝜆 𝑠 ) 𝑠 + 𝑠+1 𝑟 P Δ𝜓 𝑘 + P𝑘 (𝜓 𝑘 − 𝜙 𝑘 ) (2.25) 𝑟=1 𝑠=1 (𝜆 𝑘 − 𝜆 𝑟 ) 𝑟≠𝑘 Moreover, using Bessel’s inequality 1 , we can bound last three terms in the above equation by ∥Δ∥ 2 . Another useful tool in functional data analysis is Karhunen-Loève expansions Karhunen (1946); Loève (1946) for random function 𝑒(𝑡) which is mean zero, second order process with kernel 𝑅 is defined in Sub-section 2.2.3. It states that, with probability 1, the random function can be expressed as ∞ ∑︁ 𝑒𝑖 = 𝜉𝑖𝑟 𝜙𝑟 , where 𝜉𝑖𝑟 := ⟨𝑒𝑖 , 𝜙𝑟 ⟩ (2.26) 𝑟=1 The random variables 𝜉𝑖𝑟 are uncorrelated with mean zero and variance 𝜆𝑟 . This provides a sufficient and necessary condition for the decomposition of a random process. 2.7.2 Some useful lemmas In this section, we represent some useful lemmas. For convenience, let us recall the notation. Assume that 𝑚𝑖 s are all of the same order, viz, 𝑚 ≡ 𝑚(𝑛). Define, 𝑑𝑛1 (ℎ) = ℎ2 +ℎ𝑚/𝑚 and 𝑑𝑛2 (ℎ) = 1 Bessel’s Í∞ inequality: for any 𝑓 ∈ H, 𝑘=1 | ⟨ 𝑓 , 𝜙 𝑘 ⟩ |2 ≤ ∥ 𝑓 ∥2 54 ℎ4 + ℎ3 𝑚/𝑚 + ℎ2 𝑚/𝑚 2 where 𝑚 = lim𝑛→∞ 𝑛−1 𝑖=1 𝑚/𝑚𝑖 and 𝑚 = lim𝑛→∞ 𝑛−1 𝑖=1 (𝑚/𝑚𝑖 ) 2 . Í𝑛 Í𝑛 1/2 1/2 Denote 𝛿𝑛1 (ℎ) = 𝑑𝑛1 (ℎ) log 𝑛/(𝑛ℎ2 ) , 𝛿𝑛2 (ℎ) = 𝑑𝑛2 (ℎ) log 𝑛/(𝑛ℎ4 ) and 𝛿 𝑛 (ℎ) = ℎ2 +   ∫ 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ). Further, 𝜈 𝑡 𝑎 𝐾 𝑏 (𝑡)𝑑𝑡. Define, W = (𝝓(𝑡1 ) T , · · · , 𝝓(𝑡 𝑚 ) T ) T be matrix 𝑎,𝑏 = of order 𝑚 × 𝜅 0 obtained after stacking all 𝝓 𝑘 s and random components 𝝃 𝑖 = (𝜉𝑖1 , · · · , 𝜉𝑖𝜅0 ) T . Furthermore, 𝝃 has mean zero and variance 𝚲 which is a diagonal matrix with components 𝜆1 , · · · , 𝜆 𝑘 0 . Define the indicator function 1(𝑎 = 𝑏) = 1 if 𝑎 = 𝑏 and zero otherwise. Lemma 2.7.2. Consider 𝑍1 , · · · , 𝑍𝑛 be independent and identically distributed random variables with mean zero and finite variance. Suppose that there exists an 𝑀 such that 𝑃(|𝑍𝑖 | ≤ 𝑀) = 1 for all 𝑖 = 1, · · · , 𝑛. Let 𝑇𝑛 = 𝑛1 𝑖=1 𝑍𝑖 . then, 𝑛1 𝑖=1 𝑍𝑖 = 𝑂 ((log 𝑛/𝑛) 1/2 almost surely. If Í𝑛 Í𝑛 √︁ 𝑉 𝑎𝑟 (𝑇𝑛 ) = 𝑂 ((log 𝑛/𝑛) 1/2 ) then 𝑇𝑛 = 𝑂 (log 𝑛/𝑛) almost surely. Proof. Bernstein’s inequality states that if 𝑍1 , · · · , 𝑍𝑛 be centered independent bounded random variables with probability 1. Let 𝑇𝑛 = 𝑛1 𝑖=1 𝑍𝑖 , then let Var{𝑇𝑛 } = 𝜎𝑛2 . Then for any positive Í𝑛 𝑛𝑢 2 real number 𝑢, we have 𝑃(|𝑇𝑛 | ≥ 𝑢) ≤ exp{− 2𝜎2 +2𝑀𝑢/3 } where 𝑀 is such that 𝑃(|𝑍𝑖 | ≤ 𝑀) = 1. 
𝑛 Moreover, if 𝑇𝑛 converges to its limit in probability fast enough, then it converge almost surely in the limit, i.e., if for any 𝑢 > 0, ∞ Í 𝑛=1 𝑃(|𝑇𝑛 | ≥ 𝑢) < ∞ them 𝑇𝑛 converges to zero almost √︃ 4𝜎𝑛2 log 𝑛 4𝑀 log 𝑛 + 3𝑛 . Thus, ∞ Í Í∞ 2 surely. Now, choose 𝑢 = 𝑛 𝑛=1 𝑃(|𝑇𝑛 | ≥ 𝑢) < 𝑛=1 1/𝑛 which is finite. √︁ Therefore, 𝑇𝑛 = 𝑂 (𝑢) almost surely. Now let 𝜎𝑛 ≤ 4𝑀 2 log 𝑛/9𝑛, we have, 𝑇𝑛 = 𝑂 (log 𝑛/𝑛) and if 𝜎𝑛 = 𝑂 (1) then 𝑇𝑛 = 𝑂 ((log 𝑛/𝑛) 1/2 ) almost surely. Lemma 2.7.3. Suppose 𝑇𝑖 𝑗 are i.i.d. with density 𝑓𝑇 . Then for fixed 𝑖 = 1, · · · , 𝑛, any 𝑘 and 𝑙 ≥ 1, under assumptions (C2) and (C6)c, the following holds. 𝑚 1 ∑︁𝑖 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) = 1(𝑘 = 𝑙) + 𝑂 ((log 𝑚𝑖 /𝑚𝑖 ) 1/2 ) almost surely (2.27) 𝑚𝑖 𝑗=1 Proof. Note that,  1 ∑︁   𝑚𝑖  ∫   E 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) = 𝜙 𝑘 (𝑡)𝜙𝑙 (𝑡)𝑑𝑡 = 1(𝑘 = 𝑙) (2.28)  𝑚𝑖   𝑗=1  55 and, 𝑚𝑖 𝑚 2  1 ∑︁ 1 𝑖       ∑︁      Var 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) = E 𝜙 𝑘 (𝑇𝑖 𝑗 )𝜙𝑙 (𝑇𝑖 𝑗 ) − 1(𝑘 = 1)  𝑚𝑖   𝑚𝑖   𝑗=1   𝑗=1  𝑚𝑖 𝑚 𝑖 ∑︁𝑚𝑖 1 ∑︁ 1 ∑︁ = 2 E{𝜙2𝑘 (𝑇𝑖 𝑗 )𝜙2𝑙 (𝑇𝑖 𝑗 )} + 2 E{𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗1 )𝜙𝑙 (𝑇𝑖 𝑗2 )} − 1(𝑘 = 𝑙) 𝑚𝑖 𝑗=1 𝑚𝑖 𝑗1 =1 𝑗2 =1 𝑗1 ≠ 𝑗2 𝑚𝑖 𝑚 ∑︁𝑖 ∑︁ 𝑚𝑖 1 ∑︁ 1 = E{𝜙2𝑘 (𝑇𝑖 𝑗 )𝜙2𝑙 (𝑇𝑖 𝑗 )} + E{𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑙 (𝑇𝑖 𝑗1 )}E{𝜙 𝑘 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )} − 1(𝑘 = 𝑙) 𝑚𝑖2 𝑗=1 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑗1 ≠ 𝑗2  𝑚 −1 ∫ ∫  𝑚1𝑖 𝜙4𝑘 (𝑡)𝑑𝑡 + 𝑚𝑖 𝑖 ( 𝜙2𝑘 (𝑡)𝑑𝑡) 2 − 1  if 𝑘 = 𝑙    = 𝑚 −1 1 ∫ ∫ ∫  𝑚𝑖 𝜙2𝑘 (𝑡)𝜙2𝑙 (𝑡)𝑑𝑡 + 𝑚𝑖 𝑖 ( 𝜙 𝑘 (𝑡)𝜙𝑙 (𝑡)𝜙 𝑘 (𝑡 ′)𝜙𝑙 (𝑡 ′)𝑑𝑡𝑑𝑡 ′)    if 𝑘 ≠ 𝑙  = 𝑂 (1/𝑚𝑖 ) (2.29) Therefore, by applying Lemma 2.7.2, we get Equation (2.27). Lemma 2.7.4. Suppose 𝑇𝑖 𝑗 are i.i.d with density 𝑓𝑇 . Then for fixed 𝑖 = 1, · · · , 𝑛, for any 𝑘 ≥ 1, under assumptions (C2), (C6)c, the following holds. 𝑚 ∫ 1 ∑︁𝑖   𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) = 𝜇¤𝑖 (𝑡)𝜙 𝑘 (𝑡)𝑑𝑡 + 𝑂 (log 𝑚𝑖 /𝑚𝑖 ) 1/2 almost surely (2.30) 𝑚𝑖 𝑗=1 Proof. Note that,  1 ∑︁   𝑚𝑖  ∫   E 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) = 𝜇¤𝑖 (𝑡)𝜙 𝑘 (𝑡)𝑑𝑡 (2.31)  𝑚𝑖   𝑗=1  and, 𝑚𝑖 𝑚𝑖 2  1 ∑︁       1 ∑︁      Var 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) ≤ E 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 )  𝑚𝑖   𝑚𝑖   𝑗=1   𝑗=1  𝑚𝑖 𝑚 𝑖 ∑︁ 𝑚𝑖 1 ∑︁ 1 ∑︁ E 𝜇¤𝑖2 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) 2 + 2  = 2 E{ 𝜇¤ 𝑖 (𝑇𝑖 𝑗1 ) 𝜇¤𝑖 (𝑇𝑖 𝑗2 )𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )} 𝑚𝑖 𝑗=1 𝑚𝑖 𝑗1 =1 𝑗2 =1 𝑗1 ≠ 𝑗2 ∫ = 𝑂 (1/𝑚𝑖 ) since 𝜇¤ 2 (𝑡)𝜙2𝑘 (𝑡)𝑑𝑡 < ∞ (2.32) Therefore, applying the Lemma 2.7.2, we get Equation (2.30). 56 ∫ ∫ 𝜇¤𝑖 (𝑡)𝜙𝑟 (𝑡)𝑑𝑡 and 𝑉𝑟 = E{ 𝜇(𝑡)𝜙 ¤ 2 Lemma 2.7.5. Define, M𝑖𝑟 = 𝑟 (𝑡)𝑑𝑡} for 𝑟 ≥ 2. Then under Conditions (C6)a, (C6)b, for some 𝛼 > 0 such that 𝑉𝑟 𝜆𝑟−2𝑟 1+𝛼 → 0 as 𝑟 → ∞ (due to Condition (C6)b) ∞ 𝑛 ∑︁ 1 ∑︁   (𝜆 𝑘 − 𝜆𝑟 ) −1 M𝑖𝑟 𝜉𝑖𝑘 = 𝑂 (log 𝑛/𝑛) 1/2𝜆1/2 𝑘 𝑘 (1−𝛼)/2 almost surely (2.33) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 Proof. It is easy to see that,      ∞ 𝑛  1 ∑︁   ∑︁     E (𝜆 𝑘 − 𝜆𝑟 ) −1 M𝑖𝑟 𝜉𝑖𝑘 = 0 (2.34)   𝑟=1 𝑛 𝑖=1     𝑟≠𝑘     Using the spacing condition among the eigen-values in (C6)a, for each 1 ≤ 𝑘 < 𝑟 < ∞ and for non-zero finite generic constant 𝐶0 , max {𝜆 𝑘 , 𝜆𝑟 } max {𝑘, 𝑟} ≤ 𝐶0 (2.35) |𝜆 𝑘 − 𝜆𝑟 | |𝑘 − 𝑟 | Similar kind of conditions can be invoked such as the convexity assumption, i.e. 𝜆𝑟 −𝜆𝑟+1 ≤ 𝜆𝑟−1 −𝜆𝑟 for all 𝑟 ≥ 2. Thus, using Inequality (2.35), for some 𝛼 > 0, with condition 𝑉𝑟 𝜆𝑟−2𝑟 1+𝛼 → 0 as 𝑟 → ∞ we can write ∞ ∞  2 ∑︁ −2 ∑︁ max(𝑘, 𝑟) 𝑉𝑟 (𝜆 𝑘 − 𝜆𝑟 ) ≲ 𝑉𝑟 𝑟=1 𝑟=1 |𝑘 − 𝑟 | max(𝜆 𝑘 , 𝜆𝑟 ) 𝑟≠𝑘 𝑟≠𝑘 ∑︁ 𝑘2 ∑︁ 𝑟2 ∑︁ 𝑘2 ∑︁ 𝑟2 = 𝑉𝑟 𝜆𝑟−2 + 𝑉 𝑟 𝜆 −2 + 𝑉𝑟 𝜆 −2 𝑟 + 𝑉𝑟 𝜆 −2 𝑟 ≤𝑘/2 (𝑘 − 𝑟) 2 𝑟>2𝑘 𝑘 (𝑘 − 𝑟) 2 𝑘/2<𝑟<𝑘 (𝑘 − 𝑟) 2 𝑘 <𝑟<2𝑘 𝑘 (𝑘 − 𝑟) 2 ∑︁ ∑︁ ≲ 𝑉𝑟 𝜆𝑟−2 + 𝑘 2 𝑉𝑟 𝜆𝑟−2 (𝑘 − 𝑟) −2 𝑟 ≤𝑘/2,𝑟>2𝑘 𝑘/2<𝑟<2𝑘 ∑︁ ≲ 1 + 𝑘 1−𝛼 (𝑘 − 𝑟) −2 ≲ 𝑘 1−𝛼 . 
(2.36) 𝑘/2<𝑟<2𝑘 This follows the line of proofs in Hall and Hosseini-Nasab (2009) in different contexts. Thus, using the Inequality (2.36), it follows that      ∞ 𝑛   1 ∑︁ ∞ 1   ∑︁  ∑︁     −1 Var (𝜆 𝑘 − 𝜆𝑟 ) M𝑖𝑟 𝜉𝑖𝑘 = 𝜆 𝑘 𝑉𝑟 (𝜆 𝑘 − 𝜆𝑟 ) −2 = 𝑂 𝑛−1𝜆 𝑘 𝑘 (1−𝛼) (2.37)   𝑟=1 𝑛 𝑖=1   𝑛 𝑟=1   𝑟≠𝑘     𝑟≠𝑘 Therefore, applying Lemma 2.7.2, we get Equation (2.33). 57 ∫ Lemma 2.7.6. For, M𝑖𝑟 = ¤ 𝜇(𝑡)𝜙 𝑟 (𝑡)𝑑𝑡 and 𝜂 𝑘 = min𝑟≠𝑘 |𝜆 𝑘 − 𝜆𝑟 | > 0, under conditions (C6)a and (C6)b, we almost surely have the following. ∞ ∑︁ 𝜅0 𝑛 (𝜅 ) 1/2 0 ∑︁ 1 ∑︁ ∑︁ (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 M𝑖𝑟1 𝜉𝑖𝑟2 = 𝑂 ­ (log 𝑛/𝑛) 1/2 𝜅0(3−𝛼)/2𝜆−1 ® (2.38) © ª 𝜅0 𝜆𝑟 𝑟 1 ≠1 𝑟 2 ≠1 𝑛 𝑖=1 𝑟=1 « ¬ 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 Proof. It is easy to see that,      ∞ ∑︁ 𝜅0 𝑛  1 ∑︁   ∑︁     E (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 M𝑖𝑟1 𝜉𝑖𝑟2 = 0 (2.39)  𝑛   𝑟1 =1 𝑟2 =1 𝑖=1      𝑟1 ≠𝑘 𝑟2 ≠𝑘  Moreover, using the spacing condition mentioned in (C6)a, one can derive the upper bound of 𝜂−1 𝑘 by   −1 𝜂−1 𝑘 = min |𝜆 𝑘 − 𝜆𝑟 | = max |𝜆 𝑘 − 𝜆𝑟 | −1 𝑟≠𝑘 𝑟≠𝑘   max(𝑘, 𝑟) ≲ max ≤ 𝜆−1 𝑘 𝑘 (2.40) 𝑟≠𝑘 |𝑘 − 𝑟 | max(𝜆 𝑘 , 𝜆𝑟 ) Due to the monotonic decreasing property of eigen-values, for fixed 𝑘 = 1, · · · , 𝜅0 , we have 𝜅0 ∑︁ 𝜅0 ∑︁ 𝜅0 ∑︁ 𝜅0 ∑︁ −2 −2 −2 2 −2 2 𝜆𝑟 (𝜆 𝑘 − 𝜆𝑟 ) ≲ 𝜂 𝑘 𝜆𝑟 ≲ 𝜆 𝑘 𝑘 𝜆𝑟 ≲ 𝜆 𝜅0 𝜅 0 𝜆𝑟 (2.41) 𝑟=1 𝑟=1 𝑟=1 𝑟=1 𝑟≠𝑘 Therefore, the following holds under similar conditions to obtain the Inequality (2.36),   !2  ∞ ∑︁ 𝜅0 𝑛  1 ∑︁   ∑︁     E (𝜆 𝑘 − 𝜆𝑟1 ) −2 (𝜆 𝑘 − 𝜆𝑟2 ) −2 M𝑖𝑟1 𝜉𝑖𝑟2  𝑛   𝑟1 =1 𝑟2 =1 𝑖=1      𝑟1 ≠𝑘 𝑟2 ≠𝑘  ∞ 𝜅0 1 ∑︁ ∑︁ = (𝜆 𝑘 − 𝜆𝑟1 ) −2 (𝜆 𝑘 − 𝜆𝑟2 ) −2𝑉𝑟1 𝜆𝑟2 𝑛 𝑟 =1 𝑟 =1 1 1 𝑟 1 ≠𝑘 𝑟 1 ≠𝑘 𝜅0 𝜅0 1 1−𝛼 ∑︁ −2 1 3−𝛼 −2 ∑︁ ≲ 𝑘 𝜆𝑟 (𝜆 𝑘 − 𝜆𝑟 ) ≲ 𝜅 0 𝜆 𝜅0 𝜆𝑟 (2.42) 𝑛 𝑟=1 𝑛 𝑟=1 𝑟≠𝑘 Therefore, applying the Lemma 2.7.2, we get Equation (2.38). 58 2.7.3 Proof of Theorem 2.3.1 For the 𝑘−th element of g𝑛 ( 𝜷0 ) (1 ≤ 𝑘 ≤ 𝜅0 ), 𝑛 1 ∑︁ 1 T b g𝑛,𝑘 ( 𝜷0 ) = 𝝁¤ 𝚽 𝑘 (y𝑖 − 𝝁𝑖 ) 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 1 ∑︁ 1 T b = 𝝁¤ 𝚽 𝑘 W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑛 1 ∑︁ 1 T 1 ∑︁ 1 T b = 𝝁¤ 𝚽 𝑘 W𝝃 𝑖 + 𝝁¤ ( 𝚽 𝑘 − 𝚽 𝑘 )W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 := 𝐽 𝑘𝑛1 + 𝐽 𝑘𝑛2 (2.43) Now, using Lemmas 2.7.3 and 2.7.4, the first part of the expression of g𝑛,𝑘 ( 𝜷0 ) becomes 𝑛 1 ∑︁ 1 T 𝐽 𝑘𝑛1 = 𝝁¤ 𝚽 𝑘 W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑚 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 (𝜅 ) 0 1 ∑︁ 1 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 ) [1(𝑘 = 𝑙) + 𝑂 ((log 𝑚/𝑚) 1/2 )] 𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑙=1 1 𝑛 ∫ 1 ∑︁ n 1/2 on 1/2 o ≲ M𝑖𝑘 + 𝑂 ((log 𝑚/𝑚) ) 1 + 𝑂 ((log 𝑚/𝑚) ) 𝜉𝑖𝑘 where M𝑖𝑘 = 𝜇¤𝑖 (𝑡)𝜙 𝑘 (𝑡)𝑑𝑡 𝑛 𝑖=1 𝑛 1 ∑︁ n  o = M𝑖𝑘 𝜉𝑖𝑘 1 + 𝑂 (log 𝑚/𝑚) 1/2 𝑛 𝑖=1  n o = 𝑂 (log 𝑛/𝑛) 1/2 1 + (log 𝑚/𝑚) 1/2 almost surely (2.44) On the other hand, the last part of g𝑛,𝑘 ( 𝜷) can be expressed as 𝑛 1 ∑︁ 1 T b 𝐽 𝑘𝑛2 = 𝝁¤ ( 𝚽 𝑘 − 𝚽 𝑘 )W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑚 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 (2.45) 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 where 𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) := 𝜙b𝑘 (𝑇𝑖 𝑗1 ) 𝜙b𝑘 (𝑇𝑖 𝑗2 ) − 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ). Now replace G, 𝜃 and 𝜓 by F, b 𝜆 b and 𝜙b respectively since F b be the approximation of F and Δ is the corresponding perturbation operator 59 in Equation (2.23). Therefore, Lemma 2.7.1 immediately implies the following expansion, which is the key fact to represent the objective function in QIF. ∑︁∞ 𝜙b𝑘 − 𝜙 𝑘 = (𝜆 𝑘 − 𝜆𝑟 ) −1 ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 + 𝑂 (∥Δ∥ 2 ) almost surely (2.46) 𝑟=1 𝑟≠𝑘 where Δ be the integral operator with kernel 𝑅 b − 𝑅. 
Therefore, almost surely, we have, 𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) := 𝜙b𝑘 (𝑇𝑖 𝑗1 ) 𝜙b𝑘 (𝑇𝑖 𝑗2 ) − 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )         ∞ ∑︁     = 𝜙 𝑘 (𝑇𝑖 𝑗1 ) + (𝜆 𝑘 − 𝜆𝑟 ) −1 ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 (𝑇𝑖 𝑗1 ) + 𝑂 (∥Δ∥ 2 )      𝑟=1     𝑟≠𝑘          ∑︁∞     −1 2 × 𝜙 𝑘 (𝑇𝑖 𝑗2 ) + (𝜆 𝑘 − 𝜆𝑟 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 (𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ ) − 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 )      𝑟=1     𝑟≠𝑘  ∑︁∞ n o = (𝜆 𝑘 − 𝜆𝑟 ) −1 ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜙𝑟 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ) + 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗2 ) 𝑟=1 𝑟≠𝑘 ∑︁ ∑︁ + (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝜙𝑟1 (𝑇𝑖 𝑗1 )𝜙𝑟2 (𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 ) 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 𝑛1 := 𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝐼𝑖𝑘𝑛2 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 ) almost surely (2.47) Therefore, by placing the expression of 𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) in Equation (2.45), we have the following. 𝑛 1 ∑︁ 1 T b 𝐽 𝑘𝑛2 = 𝝁¤ ( 𝚽 𝑘 − 𝚽 𝑘 )W𝝃 𝑖 𝑛 𝑖=1 𝑚𝑖2 𝑖 𝑛 𝑚 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝑑𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 𝑖 ∑︁ 𝑚 𝑖 ∑︁𝜅0 1 ∑︁ 1 ∑︁ n o = 𝑛1 𝜇¤𝑖 (𝑇𝑖 𝑗1 ) 𝐼𝑖𝑘 𝑛2 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) 𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 + 𝑂 (∥Δ∥ 2 ) 𝑛 𝑖=1 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑙=1 := 𝐽 𝑘1 𝑛2 𝑛2 + 𝐽 𝑘2 + 𝑂 (∥Δ∥ 2 ) almost surely (2.48) Under assumptions (C1), (C2), (C3), (C4) and (C5), by using Theorem 3.3 in Li and Hsing (2010), we have the following. ∥Δ∥ 2 = 𝑂 (ℎ4 + 𝛿𝑛2 2 (ℎ)) almost surely (2.49) 60 Now observe that 𝑛 𝑚 𝑚 𝜅0 𝑛2 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ 𝑛1 𝐽 𝑘1 = 2 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝐼𝑖𝑘 (𝑇𝑖 𝑗1 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 𝑖 ∑︁ 𝜅 0 ∑︁ 𝑚 𝑖 ∑︁ ∞ 1 ∑︁ 1 ∑︁ −1 n o = (𝜆 𝑘 − 𝜆𝑟 ) 𝜇¤𝑖 (𝑇𝑖 𝑗1 ) 𝜙𝑟 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ) + 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗2 ) 𝜙𝑙 (𝑇𝑖 𝑗2 ) 𝑛 𝑖=1 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑙=1 𝑟=1 𝑟≠𝑘 × ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑙 𝑛 𝑚 ∞ 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ ∑︁ n o ≲ (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗1 ) 1(𝑙 = 𝑘) + 𝑂 ((log 𝑚/𝑚) 1/2 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 𝑙=1 1 𝑟≠𝑘 𝑛 𝑚 ∞ 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ ∑︁ n o + (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 ) 1(𝑟 = 𝑙) + 𝑂 ((log 𝑚/𝑚) 1/2 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 𝑙=1 1 𝑟≠𝑘 𝑛 𝑚 ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁ n o ≲ (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗1 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑘 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 1 𝑟≠𝑘 𝑛 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ n o + (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗1 ) ⟨𝜙𝑟 , Δ𝜙 𝑘 ⟩ 𝜉𝑖𝑟 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟=1 1 𝑟≠𝑘 n o 𝑛 𝑛 1/2 := (𝑈 𝑘1 + 𝑈 𝑘2 ) 1 + 𝑂 ((log 𝑚/𝑚) ) almost surely (2.50) Then applying the triangle inequality, we have the following. 𝑛 𝑚 ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁ 𝑈 𝑘1𝑛 = (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙𝑟 (𝑇𝑖 𝑗 ) ⟨Δ𝜙 𝑘 , 𝜙𝑟 ⟩ 𝜉𝑖𝑘 𝑛 𝑖=1 𝑚𝑖 𝑗=1 𝑟=1 𝑟≠𝑘 ∞ 𝑛 n −1 1 ∑︁ ∑︁ o ≲ ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟 ) M𝑖𝑟 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝜉𝑖𝑘 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 ∞ 𝑛 −1 1 ∑︁ ∑︁ n o = ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟 ) M𝑖𝑟 𝜉𝑖𝑘 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) (2.51) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 ∫ where M𝑖𝑟 = 𝜇¤𝑖 (𝑡)𝜙𝑟 (𝑡)𝑑𝑡. By Lemma 6 of Li and Hsing (2010), under conditions (C1), (C2), (C3), (C4), (C5), for any measurable bounded function 𝑒 on [0, 1], one can show the following. ∥Δ𝜙 𝑘 ∥ = 𝑂 (ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ)) ≡ 𝑂 (𝛿 𝑛 (ℎ)) almost surely (2.52) 61 where 𝛿 𝑛 (ℎ) = ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ). Thus, combining the Inequalities (2.37) in Lemma 2.7.5 and   Equation (2.52), we obtain 𝑈 𝑘1 𝑛 = 𝑂 𝛿 (ℎ)(log 𝑛/𝑛) 1/2 𝜆 1/2 𝑘 (1−𝛼)/2 {1 + (log 𝑚/𝑚) 1/2 } almost 𝑛 𝑘 surely. Next, under the spacing condition mentioned earlier and in assumption (C6)a, using the Inequality (2.40) recall 𝜂−1 −1 𝑘 ≲ 𝜆 𝑘 𝑘. 
Thus, observe that 𝑛 𝑚 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ 𝑛 𝑈 𝑘2 = (𝜆 𝑘 − 𝜆𝑟 ) −1 𝜇¤𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) ⟨Δ𝜙 𝑘 , 𝜙𝑟 ⟩ 𝜉𝑖𝑟 𝑛 𝑖=1 𝑚𝑖 𝑗=1 𝑟=1 𝑟≠𝑘 𝜅0 ( 𝑛 ) ∑︁ 1 ∑︁ n o ≲ ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟 ) −1 M𝑖𝑘 𝜉𝑖𝑟 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 𝜅0 ( 𝑛 ) −1 ∑︁ 1 ∑︁ n 1/2 o ≲ ∥Δ𝜙 𝑘 ∥𝜂 𝑘 M𝑖𝑘 𝜉𝑖𝑟 1 + 𝑂 ((log 𝑚/𝑚) ) (2.53) 𝑟=1 𝑛 𝑖=1 𝑟≠𝑘 Using Condition (C6)b we also have 𝑉𝑘1/2 𝜂−1 1/2 −1 𝑘 ≲ 𝑉𝑘 𝜆 𝑘 𝑘 = 𝑂 (𝑘 (1−𝛼)/2 ). Finally, combining with the bounds for 𝑈 𝑘1 𝑛 , 𝑈 𝑛 , we have, almost surely, 𝑘2     ©   𝜅0   ª 𝑛2 1/2   1/2 (1−𝛼)/2 −1 1/2 ∑︁ 1/2   1/2 ® = 𝑂 ­ (log 𝑛/𝑛) 𝛿 𝑛 (ℎ) 𝜆 𝑘 𝑘 + 𝜂 𝑘 𝑉𝑘 1 + (log 𝑚/𝑚) ­ 𝐽 𝑘1 𝜆𝑟 ® ­   ®    𝑟=1    «  𝑟≠𝑘  ! ¬ 𝜅0 ∑︁ n o = 𝑂 (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝑘 (1−𝛼)/2 𝜆𝑟1/2 1 + (log 𝑚/𝑚) 1/2 𝑘=1 := 𝑂 (𝜔 𝑘1 (𝑛, ℎ)) (2.54) Í 0 1/2  where 𝜔 𝑘1 (𝑛, ℎ) = (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝑘 (1−𝛼)/2 𝜅𝑘=1 𝜆𝑟 1 + (log 𝑚/𝑚) 1/2 . It is easy to see that 𝜏 Í𝜅0 1/2 − 21 +1 1/2 𝛿 (ℎ)𝜅 (3−𝜏)/2 1 + (log 𝑚/𝑚) 1/2 where  𝑟=1 𝜆 𝑟 ∼ 𝜅 0 . Therefore, 𝜔 𝑘1 (𝑛, ℎ) ∼ (log 𝑛/𝑛) 𝑛 0 𝜏 = 𝛼 + 𝜏1 . Similarly to the derivation of the bound for 𝐽 𝑘1 𝑛2 , we can write 𝑛 𝑚 𝑚 𝜅0 𝑛2 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ 𝑛2 𝐽 𝑘2 = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 1 2 𝑛 𝑚 𝑚 𝜅 0 ∑︁ ∞ ∑︁ ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ = (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝑛 𝑖=1 𝑚𝑖2 𝑗 =1 𝑗 =1 𝑙=1 𝑟 =1 𝑟 =1 1 2 1 1 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 × 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟1 (𝑇𝑖 𝑗1 )𝜙𝑟2 (𝑇𝑖 𝑗2 )𝜙𝑙 (𝑇𝑖 𝑗2 )𝜉𝑖𝑙 62 𝑛 𝑚 𝜅 0 ∑︁ ∞ ∑︁ ∞ 1 ∑︁ 1 ∑︁𝑖 ∑︁ ≲ (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑙=1 𝑟 =1 𝑟 =1 1 1 2 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 n o × 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟1 (𝑇𝑖 𝑗1 ) 1(𝑟 2 = 𝑙) + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝜉𝑖𝑙 𝑛 𝑚 ∞ 𝜅0 1 ∑︁ 1 ∑︁𝑖 ∑︁ ∑︁ = (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 𝜙𝑟1 , Δ𝜙 𝑘 𝜙𝑟2 , Δ𝜙 𝑘 𝑛 𝑖=1 𝑚𝑖 𝑗 =1 𝑟 =1 𝑟 =1 1 1 2 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 n o × 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝜙𝑟1 (𝑇𝑖 𝑗1 )𝜉𝑖𝑟2 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) ∞ ∑︁ 𝜅0 𝑛 ! ∑︁ 1 ∑︁ n o ≲ ∥Δ𝜙 𝑘 ∥ 2 (𝜆 𝑘 − 𝜆𝑟1 ) −1 (𝜆 𝑘 − 𝜆𝑟2 ) −1 M𝑖𝑟1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) 𝜉𝑖𝑟2 𝑟 1 =1 𝑟 2 =1 𝑛 𝑖=1 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 n o × 1 + 𝑂 ((log 𝑚/𝑚) 1/2 ) ∞ ∑︁ 𝜅0 ( 𝑛 ) 2 ∑︁ −1 −1 1 ∑︁ n 1/2 o = ∥Δ𝜙 𝑘 ∥ (𝜆 𝑘 − 𝜆𝑟1 ) (𝜆 𝑘 − 𝜆𝑟2 ) × M𝑖𝑟1 𝜉𝑖𝑟2 1 + 𝑂 ((log 𝑚/𝑚) ) 𝑟 1 =1 𝑟 2 =1 𝑛 𝑖=1 𝑟 1 ≠𝑘 𝑟 2 ≠𝑘 (2.55) Therefore, Inequality (2.55) immediately follows using Lemma 2.7.6, 𝜅0 ! 2 ∑︁ n o 𝑛2 𝐽 𝑘2 =𝑂 (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝜅0(3−𝛼)/2𝜆−1 𝜅0 { 𝜆𝑟 } 1/2 1 + (log 𝑚/𝑚) 1/2 := 𝑂 (𝜔 𝑘2 (𝑛, ℎ)) (2.56) 𝑟=1 2 almost surely where 𝜔 𝑘2 (𝑛, ℎ) = (log 𝑛/𝑛) 1/2 𝛿 𝑛 (ℎ)𝜅0(3−𝛼)/2𝜆−1 1/2 . It is Í𝜅0  𝜅0 𝑟=1 𝜆𝑟 1 + (log 𝑚/𝑚) 2 𝜆𝑟 ∼ 𝜅0−𝜏1 +1 under assumption (C6)b. Thus, 𝜔 𝑘2 (𝑛, ℎ) ∼ (log 𝑛/𝑛)𝛿 (ℎ)𝜅0(4−𝜏)/2 {1+ Í𝜅0 easy to see 𝑟=1 (log 𝑚/𝑚) 1/2 }. 2 Since 𝜔𝑛2 = 𝑂 (𝜔𝑛1 ) and 𝛿 𝑛 (ℎ) = 𝑂 (𝜔𝑛1 ), in summary, for each 𝑘 = 1, · · · , 𝜅0 , the following holds almost surely.  n o  g 𝑘 ( 𝜷) = 𝑂 (log 𝑛/𝑛) 1/2 1 + (log 𝑚/𝑚) 1/2 + 𝜔 𝑘1 (𝑛, ℎ) (2.57) Therefore, AMSE of g 𝑘 ( 𝜷) is the following. AMSE{g 𝑘 ( 𝜷0 )}       1 1 3−𝜏 4 1 1 1 1 1 1 =𝑂 1+ 1 + 𝜅0 ℎ + + + + + + 𝑛 𝑚 𝑛 𝑛𝑚ℎ 𝑛2 𝑚 2 ℎ2 𝑛2 𝑚 4 ℎ4 𝑛2 𝑚ℎ 𝑛2 𝑚 3 ℎ3 63     1 1  3−𝜏 =𝑂 1+ 1 + 𝜅0 𝑅𝑛 (ℎ) 𝑛 𝑚   = 𝑂 𝑛−1 + 𝑛−1 𝜅03−𝜏 𝑅𝑛 (ℎ) since (1/𝑛𝑚) = 𝑂 (1/𝑛) (2.58) n o 1 1 1 1 1 1 where 𝑅𝑛 (ℎ) = ℎ4 + 𝑛 + 𝑛𝑚ℎ + 𝑛2 𝑚 2 ℎ 2 + 𝑛2 𝑚 4 ℎ 4 + 𝑛2 𝑚ℎ + 𝑛2 𝑚 3 ℎ 3 . Combining the above condi- tions, we find that if 𝑎 > 1/4, 𝜅0 = 𝑂 (𝑛1/(3−𝜏) ) and 𝑛−1/4 ≲ ℎ ≲ 𝑛−(𝑎+1)/5 then AMSE{g 𝑘 ( 𝜷0 )} = 𝑂 (1/𝑛). On the other hand, if 𝑎 ≤ 1/4, 𝜅 0 = 𝑂 (𝑛4(1+𝑎)/5(3−𝜏) ) and ℎ ≲ 𝑛−1/4 AMSE{g 𝑘 ( 𝜷0 )} = 𝑂 (1/𝑛). Note that for a three-dimensional array (𝜕𝐶/𝜕 𝛽1 , · · · , 𝜕𝐶/𝜕 𝛽 𝑝 ) such that the following is a 𝑝 × 1 vector. 
g( 𝜷0 ) T C−1 ( 𝜷0 ) C( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) (2.59) Therefore, 𝑛−1Q(¤ 𝜷0 ) = 2g(¤ 𝜷 ) T C−1 ( 𝜷 )g( 𝜷 ) − g( 𝜷 ) T C−1 ( 𝜷 ) C( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) (2.60) 0 0 0 0 0 And 𝑛−1Q(¥ 𝜷0 ) = 2g( ¤ 𝜷 ) T C−1 ( 𝜷 ) g( ¤ 𝜷 ) + 𝑟 𝑛1 + 𝑟 𝑛2 + 𝑟 𝑛3 + 𝑟 𝑛4 (2.61) 0 0 0 where 𝑟 𝑛1 = 2g(¥ 𝜷 )C−1 ( 𝜷 )g( 𝜷 ) 0 0 0 𝑟 𝑛2 = −4g( ¤ 𝜷 ) T C−1 ( 𝜷 ) C( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) 0 0 (2.62) 𝑟 𝑛3 ¤ 𝜷 )C−1 ( 𝜷 )C−1 ( 𝜷 ) C( = 2g( ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) 0 0 0 𝑟 𝑛4 = −g( 𝜷0 ) T C−1 ( 𝜷0 ) C( ¥ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) Since g( 𝜷0 ) = 𝑂 𝑃 (𝑛−1/2 ) and the weight matrix converges almost surely to an invertible matrix, ¤ 𝜷0 )C−1 ( 𝜷0 )g( 𝜷0 ) = 𝑜(𝑛−1 ) almost surely. Furthermore, 𝑟 𝑛1 = 𝑂 (𝑛−1/2 ), 𝑟 𝑛2 = g( 𝜷0 ) T C−1 ( 𝜷0 ) C( 𝑜(𝑛−1/2 ), 𝑟 𝑛3 = 𝑂 (𝑛−1/2 ), and 𝑟 𝑛4 = 𝑂 (𝑛−1 ) almost surely. Combining these bounds, we have 𝑟 𝑛 = 𝑜(1) almost surely. Therefore, ∥𝑛−1Q¤ − 2g( ¤ 𝜷 ) T C−1 ( 𝜷 )g( 𝜷 )∥ = 𝑜 𝑃 (𝑛−1 ) and ∥𝑛−1Q¥ − 0 0 0 ¤ 𝜷 ) T C−1 ( 𝜷 ) g( 2g( ¤ 𝜷 )∥ = 𝑜 𝑃 (1). 0 0 0 The following lines are based on common steps in the GEE literature that includes McCullagh and Nelder (1989); Balan et al. (2005); Tian et al. (2014) among many others. Let 𝜷𝑛 = 𝜷0 + 𝛿d 64 where set 𝛿 = 𝑛−1/2 . We have to show that for any 𝜖 > 0 there exists a large constant 𝑐 such that 𝑃{ inf Q( 𝜷𝑛 ) ≥ Q( 𝜷0 )} > 1 − 𝜖 (2.63) ∥𝑑 ∥=𝑐 Note that the above statement is always true if 𝜖 ≥ 1. Thus, we assume that 𝜖 ∈ (0, 1). Due to Taylor series expansion, Q( 𝜷𝑛 ) = Q( 𝜷0 + 𝛿d) = Q( 𝜷0 ) + 𝛿dTQ( ¤ 𝜷0 ) + 1 𝛿dTQ( ¥ 𝜷0 )d + ∥d∥ 2 𝑜 𝑃 (1) (2.64) 2 Now, observe that, using Equation (2.60), √ 𝛿dTQ( ¤ 𝜷0 ) = ∥d∥𝑂 𝑃 ( 𝑛𝛿) + ∥d∥𝑂 𝑃 (𝛿) (2.65) and 1 2 T¥ ∗ ¤ 𝜷 )C−1 ( 𝜷 ) g¤ ( 𝜷 )d + 𝑛𝛿2 ∥d∥ 2 𝑜 𝑃 (1) 𝛿 d Q( 𝜷 )d = 𝑛𝛿2 dT g( 0 0 0 (2.66) 2 Therefore, for given 𝜖 > 0, there exists a large enough 𝑐 such that the above equation (2.63) holds. This implies that there exists a 𝜷ˆ that satisfies ∥ b 𝜷 − 𝜷0 ∥ = 𝑂 𝑃 (𝛿). Thus, for large 𝑛, with probability 1, Q( 𝜷) attains the minimal value at 𝜷ˆ and therefore, Q¤ = 0. 2.7.4 Proof of Theorem 2.3.2 Í𝜅0 Í𝜅0 −1 T where C−1 −1 Recall, C𝑖 = 𝑘 1 =1 𝑘 2 =1 𝚽 𝑘 1 X𝑖 C 𝑘 1 ,𝑘 2 X𝑖 𝚽 𝑘 2 , 𝑘 1 ,𝑘 2 is the (𝑘 1 , 𝑘 2 ) block of C0 . Sim- b𝑖 = Í𝜅0 Í𝜅0 𝚽 −1 Tb ilarly, we can define C 𝑘 1 =1 𝑘 2 =1 𝑘 1 X𝑖 C 𝑘 1 ,𝑘 2 X𝑖 𝚽 𝑘 2 . It is easy to observe that C𝑖 = b b C𝑖 + 𝜅𝑘01 =1 𝜅𝑘02 =1 ( 𝚽 Í Í b 𝑘 1 − 𝚽 𝑘 1 )X𝑖 C−1 XT ( 𝚽 b 𝑘 2 − 𝚽 𝑘 2 ) + 2 Í𝜅0 Í𝜅0 𝚽 𝑘 1 X𝑖 C−1 XT ( 𝚽 b 𝑘 2 − 𝚽 𝑘 2 ). 𝑘 1 ,𝑘 2 𝑖 𝑘 1 =1 𝑘 2 =1 𝑘 1 ,𝑘 2 𝑖 Therefore, 𝑛1 𝑖=1 Í𝑛 b𝑖 − C)X𝑖 = 1 Í𝑛 Í𝜅0 Í𝜅0 P𝑖𝑘 1 C−1 P𝑖𝑘 2 and 1 Í𝑛 𝝁¤ T ( C 𝝁¤ 𝑖T ( C b𝑖 − C)y𝑖 = 𝑛 𝑖=1 𝑘 1 =1 𝑘 2 =1 𝑘 1 ,𝑘 2 𝑛 𝑖=1 𝑖 1 Í𝑛 Í 𝜅 0 Í 𝜅 0 −1 𝑛 𝑖=1 𝑘 1 =1 𝑘 2 =1 P𝑖𝑘 1 C 𝑘 1 ,𝑘 2 Q𝑖𝑘 2 where P𝑖,𝑘 = 𝝁 ¤ 𝑖T D𝑖𝑘 X𝑖 and Q𝑖𝑘 = 𝝁¤ 𝑖T D𝑖𝑘 y𝑖 with D𝑖𝑘 be the dif- ference matrix with ( 𝑗1 , 𝑗2 )-th element is 𝑑𝑖 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ). Thus, note that, almsot surely we have the following relation, 𝑚 𝑚 1 ∑︁𝑖 ∑︁𝑖 P𝑖𝑘 = 𝜇¤𝑖 (𝑇𝑖 𝑗1 )𝑑𝑖 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 )𝑥𝑖 (𝑇𝑖 𝑗2 ) 𝑚𝑖2 𝑗1 =1 𝑗2 =1 𝑚 𝑚 1 ∑︁𝑖 ∑︁𝑖 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 ) 𝑥𝑖 (𝑇𝑖 𝑗2 )  𝑛1 𝑛2 = 2 𝜇¤𝑖 (𝑇𝑖 𝑗1 ) 𝐼𝑖𝑘 (𝑇𝑖 𝑗1 , 𝑇𝑖 𝑗2 ) + 𝐼𝑖𝑘 𝑚𝑖 𝑗1 =1 𝑗2 =1 65 𝑚 𝑚 ∞ 1 ∑︁𝑖 ∑︁𝑖 ∑︁ −1 𝜙𝑟 (𝑇𝑖 𝑗1 )𝜙 𝑘 (𝑇𝑖 𝑗2 ) + 𝜙 𝑘 (𝑇𝑖 𝑗1 )𝜙𝑟 (𝑇𝑖 𝑗2 ) + 𝑂 (∥Δ∥ 2 )  ≲ 2 ¤ 𝜇 𝑖 (𝑇𝑖 𝑗 1 ) (𝜆 𝑘 − 𝜆 𝑟 ) ⟨𝜙 𝑟 , Δ𝜙 𝑘 ⟩ 𝑚𝑖 𝑗1 =1 𝑗2 =1 𝑟=1 𝑟≠𝑘 ∑︁∞ ≲ (𝜆 𝑘 − 𝜆𝑟 ) −1 ∥Δ𝜙 𝑘 ∥ + 𝑂 (∥Δ∥ 2 ) 𝑟=1 𝑟≠𝑘 𝑚 𝑚 1 ∑︁𝑖 1 ∑︁𝑖 since ¤ 𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) and 𝜇(𝑇 𝑥𝑖 (𝑇𝑖 𝑗 )𝜙 𝑘 (𝑇𝑖 𝑗 ) are finite 𝑚𝑖 𝑗=1 𝑚𝑖 𝑗=1 = 𝑂 (𝜛) (2.67) Í∞ −1 𝛿 2 +𝛿 2 (ℎ). where 𝜛 = 𝑟=1 (𝜆 𝑘 −𝜆𝑟 ) 𝑛 (ℎ) +ℎ 𝑛2 A similar result can be obtained for Q𝑖𝑘 . 
Combining 𝑟≠𝑘 2 Í𝑛 such results, in summary, we have − 𝑛2 𝑖=1 X𝑖T C b𝑖 (y𝑖 − XT b T Tb Í𝑛 𝑖 𝜷) = − 𝑛 𝑖=1 X𝑖 C𝑖 (y𝑖 − X𝑖 𝜷) + 𝑂 (𝜛𝑛 ). Since, for 𝑛 → ∞, Q( 𝜷) attains a minimal value at 𝜷 = b 𝜷, we therefore have Q( ¤ b𝜷) = 0. Thus, 𝑛 ¤ b 2 ∑︁ Q( 𝜷) = − X𝑖T C b𝑖 (y𝑖 − XT b 𝑖 𝜷) 𝑛 𝑖=1 𝑛 𝑛 2 ∑︁ b𝑖 − C𝑖 )(y𝑖 − XT b 2 ∑︁ T =− X𝑖T ( C 𝑖 𝜷) − X C𝑖 (y𝑖 − X𝑖T b 𝜷) = 0 (2.68) 𝑛 𝑖=1 𝑛 𝑖=1 𝑖 Therefore, almost surely, we have, 𝑛 2 ∑︁ T − X C𝑖 (y𝑖 − X𝑖T b 𝜷) + 𝑂 (𝜛𝑛 ) = 0 𝑛 𝑖=1 𝑖 𝑛 2 ∑︁ T − X C𝑖 (X𝑖T 𝜷0 + e𝑖 − X𝑖T b 𝜷) + 𝑂 (𝜛𝑛 ) = 0 𝑛 𝑖=1 𝑖 ( 𝑛 ) 𝑛 √ 1 ∑︁ T 1 ∑︁ T 𝑛( 𝜷 − 𝜷0 ) b X C𝑖 X𝑖 = √ X C𝑖 e𝑖 (2.69) 𝑛 𝑖=1 𝑖 𝑛 𝑖=1 𝑖 Now, using the central limit theorem, we can obtain the following. 𝑛 1 ∑︁ T 𝑑 √ X𝑖 C𝑖 e𝑖 → − 𝑁 (0, A) (2.70) 𝑛 𝑖=1 1 Í𝑛 T In addition, by the law of large numbers 𝑛 𝑖=1 X𝑖 C𝑖 X𝑖 → B in probability. Therefore, using the Slutsky theorem, we complete the proof of Theorem 2.3.2. 66 CHAPTER 3 ESTIMATION FOR VARYING-COEFFICIENT MODEL IN FUNCTIONAL DATA ANALYSIS UNDER UNKNOWN HETEROSKEDASTICITY: A GMM-BASED APPROACH 3.1 Introduction Due to modern advancements of technology, varying-coefficient models in functional data have become popular to analyze data coming from several imaging technologies such as magnetic reso- nance imaging (MRI), diffusion tensor imaging (DTI) etc. We consider the problem of estimating non-parametric coefficient function 𝜷(𝑠) which is defined on functional domain (for example, space, time) S to understand the relationship between the functional response 𝑌 (𝑠) and the real-valued covariates denoted by X = (𝑋1 , · · · , 𝑋 𝑝 ) T , which takes the following form. 𝑌 (𝑠) = XT 𝜷(𝑠) + 𝑈 (𝑠) (3.1) where 𝜷(𝑠) = (𝛽1 (𝑠), · · · , 𝛽 𝑝 (𝑠)) T is a 𝑝-dimensional vector of unknown smooth functions, and it is assumed that 𝜷(·) is twice-differentiable with continuous second-order derivatives. The random error {𝑈𝑖 (𝑠) : 𝑠 ∈ S} is assumed to be a stochastic process indexed by 𝑠 ∈ S and it characterizes the within curve dependence with mean zero and an unknown covariance function Σ(𝑠, 𝑠′) = cov{𝑈 (𝑠), 𝑈 (𝑠′)|X}. The varying-coefficient model (VCM) in Equation (3.1) allows its regression coefficient to vary over some predictors of interest. It was introduced in the literature by Hastie and Tibshirani (1993) and systematically studied in Hoover et al. (1998); Fan et al. (1999); Wu and Chiang (2000); Huang et al. (2002); Fan et al. (2003); Huang et al. (2004); Chiou et al. (2004); Ramsay and Silverman (2005); Zhang and Chen (2007); Cardot and Sarda (2008); Fan and Zhang (2008); Wang et al. (2008); Zhu et al. (2014); Kokoszka and Reimherr (2017). The notion of density is not well defined for functional responses (in general for any random function) (Delaigle and Hall, 2010), as a result of which it is difficult to take advantage of likelihood- based inference; therefore, we need to rely on the moment conditions. Typically, we assume that 67 the error term 𝑈 (𝑠) satisfies the conditional mean-zero assumption, such as E{𝑈 (𝑠)|X} = 0. By the iterated law of expectation, it is easy to see that, for a given point 𝑠 ∈ S, we can define least-square estimates as solution of the sample version of E{X[𝑌 (𝑠) − XT 𝜷(𝑠)]} = 0. Equivalently, we can obtain these estimates by a minimizer of the sample version of E{[𝑌 (𝑠) − XT 𝜷(𝑠)] 2 } which is termed as non-parametric local linear estimates (Fan and Gijbels, 1996). Since the above estimates rely only on the conditional mean-zero assumption, they become inefficient in the presence of heteroskedasticity. 
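As a point of reference for the discussion above, the following R sketch computes the pointwise least-squares estimate of 𝜷(s), i.e., the sample version of E{X[Y(s) − Xᵀ𝜷(s)]} = 0 solved separately at each grid point. The simulated design, coefficient functions, and homoskedastic errors are illustrative assumptions only; this is not the estimator developed in this chapter.

```r
# Minimal sketch (simulated data): pointwise OLS estimates of beta(s),
# obtained by solving the sample moment condition separately at each s_j.
set.seed(1)
n <- 100; r <- 50
s  <- seq(0, 1, length.out = r)
X  <- cbind(1, rnorm(n))                              # intercept + one covariate
beta_true <- rbind(sin(2 * pi * s), cos(2 * pi * s))  # p x r matrix of beta(s_j)
U  <- matrix(rnorm(n * r, sd = 0.5), n, r)            # homoskedastic errors (illustrative)
Y  <- X %*% beta_true + U                             # n x r response curves
beta_hat <- sapply(seq_len(r), function(j) coef(lm(Y[, j] ~ X - 1)))
matplot(s, t(beta_hat), type = "l", lty = 1,
        xlab = "s", ylab = expression(hat(beta)(s)))  # noisy pointwise estimates
```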
To analyze such data, there is a need for a robust estimation procedure which does not require distributional assumptions and can accommodate heteroskedasticity of unknown form. We therefore introduce a functional generalized method of moments (GMM) estimation procedure for such VCMs.

In classical statistics, the method of moments (MM) estimator solves the sample moment conditions corresponding to the population moment conditions to obtain solutions for the unknown parameters. For example, if the data Y_i are independently and identically distributed with mean 𝜇, then E{Y_i − 𝜇} = 0, and the MM estimate of 𝜇 is simply 𝜇̂ = Ȳ, the sample mean. Now consider the linear regression model Y_i = X_i^T 𝜷 + U_i, where Y_i is the response, X_i and 𝜷 are, respectively, the 𝑝-dimensional covariate vector and the unknown regression coefficient, with the simple moment restriction E{U_i | X_i} = 0. By applying the law of iterated expectation, we obtain the unconditional moment condition E{X_i U_i} = 0 for the random error U_i. It is easy to see that the MM estimates coincide with the ordinary least squares (OLS) estimates for the simple linear regression model. For a non-linear regression model with additive error, Y_i = 𝔐(X_i, 𝜷) + U_i, the moment condition is similar to that of the classical linear model, namely E{g(X_i) U_i} = 0 for any function g(·) of X. An obvious choice is g(X) = 𝜕𝔐(x, 𝜷)/𝜕𝜷, the first-order condition of the non-linear least squares estimator. In simple linear regression, suppose the covariates are decomposed into 𝑝1 and 𝑝2 components such that X_i = (X_i1, X_i2)^T with 𝑝1 + 𝑝2 = 𝑝. It is well known that, without any further assumptions, an asymptotically efficient estimator of 𝜷 is the OLS estimator. Assume that 𝜷 = (𝜷1, 𝜷2)^T, where 𝜷2 is known to be 0. Then we can rewrite the above model as Y_i = X_i1^T 𝜷1 + U_i with the analogous restriction E{X_i1 U_i} = 0. For estimating 𝜷1 based on the data {Y_i, X_i1 : i = 1, ⋯, n}, the MM estimates become inefficient since the number of moment conditions (𝑝) is larger than the number of parameters to estimate (𝑝1). This is called an "over-identified" situation, whereas when 𝑝 = 𝑝1 it is referred to as a "just-identified" situation. In general, for the over-identified situation, let g(Y, X; 𝜷) be a 𝑑-dimensional function of 𝜷 ∈ ℝ^𝑝 with 𝑑 ≥ 𝑝 such that

E g(Y_i, X_i; 𝜷0) = 0    (3.2)

where 𝜷0 is the vector of true parameters. The set of restrictions in Equation (3.2) is often called the "estimating equations". In a seminal paper, Hansen (1982) proposed an extended version of the MM approach for over-identified models. Let W_1, ⋯, W_n be a set of random variables indexed by a 𝑝-dimensional parameter vector 𝜷 with moment conditions E{g(W, 𝜷0)} = 0, where g(W, 𝜷) is a 𝑑-dimensional vector-valued function of W. The estimator is formed by choosing 𝜷 such that the sample average of the g_i is close to zero. In the over-identified problem it is not possible to satisfy all the moment conditions simultaneously; the GMM approach therefore defines an estimator that brings the sample mean of g, viz. g(W, 𝜷), close to zero. GMM estimates minimize the following objective function:

J(𝜷) = g(W; 𝜷)^T W(𝜷) g(W; 𝜷)    (3.3)

Note that Equation (3.3) is a well-defined norm as long as the weight matrix W(𝜷) is symmetric positive definite with dimension 𝑑 × 𝑑.
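A minimal R sketch of GMM estimation based on an objective of the form (3.3) is given below for a linear model with instrument vector Z_i and moment function g_i(𝜷) = Z_i(Y_i − X_i^T 𝜷), using the common two-step weight-matrix update. The data-generating design and the extra instrument are illustrative assumptions, not the functional estimator developed in this chapter.

```r
# Two-step GMM for a linear model with moments g_i(beta) = Z_i (Y_i - X_i' beta),
# Z_i a d-dimensional instrument with d >= p.
# Step 1 uses W = I; step 2 uses W = Omega_hat^{-1} built from step-1 residuals.
gmm_two_step <- function(Y, X, Z) {
  n   <- length(Y)
  Szx <- crossprod(Z, X) / n                 # d x p
  Szy <- crossprod(Z, Y) / n                 # d x 1
  b1  <- solve(t(Szx) %*% Szx, t(Szx) %*% Szy)      # step 1: identity weight
  u   <- drop(Y - X %*% b1)
  Omega <- crossprod(Z * u) / n              # (1/n) sum_i u_i^2 Z_i Z_i'
  W   <- solve(Omega)
  b2  <- solve(t(Szx) %*% W %*% Szx, t(Szx) %*% W %*% Szy)  # step 2: optimal weight
  drop(b2)
}

# Illustration with heteroskedastic errors and an extra instrument beyond the regressors
set.seed(2)
n <- 500; x <- cbind(1, rnorm(n))
u <- rnorm(n, sd = exp(x[, 2]))              # heteroskedasticity of "unknown" form
y <- x %*% c(1, 0.5) + u
z <- cbind(x, x[, 2]^2)                      # Cragg-type additional moment condition
gmm_two_step(y, x, z)                        # should be close to (1, 0.5)
```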
When the likelihood is unspecified or misspecified, GMM is an alternative to likelihood-based estimation and has become quite popular in statistics and econometrics over the last few decades due to its intuitive idea and broad applicability; its properties are also well understood. In the original version described in Hansen (1982), a two-step GMM proceeds as follows: first, compute 𝜷̆ ∈ arg min_𝜷 g(𝜷)^T g(𝜷); then estimate the weight matrix W based on the first-step estimates, W(𝜷̆). The two-step GMM estimates are then 𝜷̂ ∈ arg min_𝜷 g(𝜷)^T W(𝜷̆) g(𝜷). Under standard regularity conditions, it is well known that the GMM estimator is consistent and asymptotically normally distributed with asymptotic covariance matrix n⁻¹ {G(𝜷0)^T W G(𝜷0)}⁻¹ G(𝜷0)^T W 𝛀 W G(𝜷0) {G(𝜷0)^T W G(𝜷0)}⁻¹, where G(𝜷0) = E{∇_𝜷 g(W; 𝜷0)} and 𝛀 = E{g(W; 𝜷0) g(W; 𝜷0)^T} for the true 𝜷0 (Newey and McFadden, 1994). Note that the variance of the GMM estimator depends on the weight matrix and is optimal if W = 𝛀⁻¹, in which case the asymptotic variance reduces to n⁻¹ {G(𝜷0)^T 𝛀⁻¹ G(𝜷0)}⁻¹.

A crucial issue in the above estimation procedure is choosing the number of moment conditions. Increasing the number of moment conditions may lead to finite-sample bias; depending on the situation, however, additional moment conditions may be beneficial. Adding moment conditions leads to a decrease (or at least no increase) in the asymptotic variance of the estimator, since the optimal weight matrix for a subset of moment conditions is not optimal for the full set. In the presence of heteroskedasticity, the set of moment conditions g(W; 𝜷) = (XU, 𝔐(X)U)^T produces an estimator more efficient than the least squares estimates for the simple linear regression model when 𝔐 is chosen appropriately. It is of interest to determine how many such functions should be used to obtain a better estimator. Moreover, some choices of g are better than others depending on additional assumptions. For example, in a linear regression model, choosing g = XU yields the OLS estimator, whereas choosing g = XU/Var{U|X} leads to a generalized least squares estimator under heteroskedasticity. Var{U|X} can be modelled parametrically and substituted into the above mean-zero function. If the complete likelihood is known, one can choose g = ∇_𝜷 ln{P(U|X)}, where P(·) is the density of U. When explanatory variables are endogenous, additional moment conditions may create bias and, as a result, increase the small-sample variance. However, this issue does not arise when the explanatory variables are exogenous, and the presence of heteroskedasticity does not cause OLS inconsistency in classical linear regression. Lu and Wooldridge (2020) obtained an asymptotically efficient estimator building on Cragg (1983), who showed the existence of GMM estimators that are more efficient than OLS in the presence of heteroskedasticity of unknown form. It is well known that constant- or varying-coefficient models may often be misspecified, which can lead to inconsistent estimation. In non-parametric varying-coefficient models, mostly the exogenous regression situation has been considered so far (Hastie and Tibshirani, 1993; Fan et al., 1999).
In the last two decades, some semi-parametric models have been considered with endogenous variables, using non-parametric or semi-parametric GMM approaches with instrumental variables. For example, Cai and Li (2008) proposed a one-step local linear GMM estimator that corresponds to the local linear GMM discussed in Su et al. (2013) with an identity weight matrix. Tran and Tsionas (2009) provided a local constant two-step GMM estimator with the weight matrix specified by minimizing the asymptotic variance. Su et al. (2013) developed a local linear GMM estimation procedure for functional coefficient instrumental variable models with a general weight matrix under exogenous conditions. Cai et al. (2006) proposed a two-step local linear estimation procedure for the functional coefficients, which first estimates a high-dimensional non-parametric model and then estimates the functional coefficients using the first-step non-parametric estimates as generated regressors. In contrast to classical GMM, for the non-parametric local linear GMM estimator the integrated mean squared error increases as the number of instrumental variables increases for an arbitrary choice of instruments (Bravo, 2021).

The current work is motivated by a problem encountered in diffusion tensor imaging (DTI), where multiple diffusion properties are measured along common major white matter fiber tracts across multiple individuals to characterize the structure and orientation of white matter in the human brain. Recently, a study was performed to understand white matter structural alteration using DTI in obstructive sleep apnea patients (Xiong et al., 2017). As an illustration, we present smoothed functional data to analyze the efficiency properties of the network generated by the diffusion properties of water molecules. We plot the graphical characteristics of one of the diffusion properties, fractional anisotropy (FA), over different significance levels to obtain the graphical connectivity from 29 patients in Figure 3.1. Scientists are often interested in knowing the individual association of the average path length (APL) of the network generated from FA with a set of covariates of interest, such as age and lapses score (see Section 2.5). Moreover, in these data there is clear evidence of heteroskedasticity in the covariates. Details about the data-set and associated variables are given in Section 3.6. We therefore need an estimation procedure which (1) does not need knowledge of the distribution, (2) can handle heteroskedasticity in the covariates, (3) can estimate non-parametric coefficient functions from varying-coefficient models, and (4) has a systematic technique for computing an efficient estimator.

Figure 3.1 Apnea-data: Smoothed average path length (APL) from 29 patients over different thresholds. The black solid line indicates the mean of APL over thresholds.

In this chapter, we develop a local linear GMM estimation procedure for the varying-coefficient model. For given instrumental variables, we propose an optimal local linear GMM estimator motivated by Lu and Wooldridge (2020). However, the key difference of our approach from the latter is that we model the variance of the integrated squared error as a non-parametric function of the covariates, whereas they assume a parametric form in the classical regression setting.
Therefore, we can ensure that the proposed estimator is at least as efficient as the local linear estimates (the initial estimator), and more efficient in the presence of heteroskedasticity.

This chapter is organized as follows. In Section 3.2, we introduce our varying-coefficient model and propose a local linear GMM estimator. In Section 3.3, we present a multi-step estimation procedure. We establish asymptotic results in Section 3.4. We perform a set of simulation studies to understand the finite-sample performance of the proposed estimator and present them in Section 3.5. In Section 3.6, we apply the proposed method to a real imaging data-set on obstructive sleep apnea (OSA). In Section 3.7, we conclude this chapter with some discussion. All technical details are provided in Section 3.8.

3.2 Varying-coefficient functional model and moment conditions
In this section, we first introduce heteroskedastic conditions for the SVC model and thereafter propose a heuristic method to construct a mean-zero function.

3.2.1 Model
Let {Y_i(s), X_i} for i = 1, ⋯, n be independent copies of {Y(s), X}. Instead of observing the entire functional trajectory, one observes Y(s) only on the discrete spatial grid {s_1, ⋯, s_r} on the functional domain S. The data can be Gaussian or non-Gaussian and homoskedastic or heteroskedastic, depending on the application. Therefore, the observed data for the i-th individual are {s_j, Y_i(s_j), X_i : j = 1, ⋯, r}. To simplify notation, define Y_ij = Y_i(s_j) and U_ij = U_i(s_j). Considering the functional principal component analysis (FPCA) model for U_i(s), we assume that U_i(s) is square-integrable and admits the Karhunen-Loève expansion (Karhunen, 1946; Loève, 1946). Let 𝜔1(X) ≥ 𝜔2(X) ≥ ⋯ ≥ 0 be the ordered eigen-values of the linear operator determined by Σ, with Σ_{k=1}^∞ 𝜔_k(X) finite and the 𝜓_k(s) being the corresponding orthonormal eigen-functions or principal components. Thus, the spectral decomposition (Mercer, 1909) is given by

Σ(s, s′) = Σ_{k=1}^∞ 𝜔_k(X) 𝜓_k(s) 𝜓_k(s′)    (3.4)

Therefore, U_i(s) admits the Karhunen-Loève expansion

U_i(s) = Σ_{k=1}^∞ 𝜉_k(X_i) 𝜓_k(s)    (3.5)

where 𝜉_k(X_i) = ∫_S U_i(s) 𝜓_k(s) ds is termed the k-th functional principal component score for the i-th individual. The 𝜉_k(X_i) are uncorrelated over k with E{𝜉_k(X_i) | X_i} = 0 and Var{𝜉_k(X_i) | X_i} = 𝜔_k(X_i), k ≥ 1. Furthermore, we assume that the eigen-values vary with X_i such that 𝜔_k(X_i) = 𝜃_k 𝜎²(X_i) for some unknown function 𝜎(X) ≥ 0 and 𝜃1 ≥ 𝜃2 ≥ ⋯ ≥ 0. For identifiability, we need some restriction on the 𝜃_k, such as a known or fixed 𝜃1. This assumption on the eigen-values of the spectral decomposition allows us to incorporate heteroskedasticity into the model. To the best of our knowledge, this is the first attempt to model the SVC model with unknown heteroskedasticity.
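Before constructing the moment conditions, the following R sketch simulates error curves from the heteroskedastic model of this subsection, U_i(s) = Σ_k 𝜉_k(X_i) 𝜓_k(s) with Var{𝜉_k | X_i} = 𝜃_k 𝜎²(X_i). The particular 𝜎(·), 𝜃_k and 𝜓_k used below are illustrative assumptions only.

```r
# Simulation sketch of the heteroskedastic FPCA error model of Section 3.2.1.
set.seed(3)
n <- 50; r <- 100; K <- 4
s     <- seq(0, 1, length.out = r)
X     <- runif(n, -1, 1)                        # one scalar covariate, for illustration
sig2  <- function(x) exp(x)                     # assumed variance function sigma^2(X)
theta <- 0.5^(1:K)                              # decaying eigen-value weights theta_k
psi   <- sqrt(2) * sapply(1:K, function(k) sin(k * pi * s))  # orthonormal psi_k on [0,1]
xi    <- sapply(1:K, function(k) rnorm(n, sd = sqrt(theta[k] * sig2(X))))  # scores xi_k(X_i)
U     <- xi %*% t(psi)                          # n x r matrix of error curves U_i(s_j)
matplot(s, t(U), type = "l", lty = 1, xlab = "s", ylab = "U(s)")
```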
3.2.2 Local-linear mean-zero function
Let us reiterate our main objective: we want to efficiently estimate the varying-coefficient functions based on GMM in the case of continuum moment conditions together with infinite-dimensional parameters. Therefore, we need to construct a mean-zero function, which is described in this sub-section.

Since 𝜷(·) in model (3.1) is twice continuously differentiable, we can apply a Taylor series expansion to 𝜷(s_j) around an interior point s_0 and get

𝜷(s_j) = 𝜷(s_0) + 𝜷̇(s_0)(s_j − s_0) + 𝜷̈(s*)(s_j − s_0)²/2    (3.6)

where s* lies between s_j and s_0 for all j = 1, ⋯, r, and 𝜷̇ and 𝜷̈ denote the gradients of 𝜷 and 𝜷̇ with respect to s. Thus, 𝜷(s_j) can be approximated componentwise as 𝛽_k(s_j) ≈ 𝛽_k(s_0) + {𝜕𝛽_k(s_0)/𝜕s}(s_j − s_0). In matrix notation, the first-order Taylor series expansion of the coefficient functions becomes

𝜷(s_j) ≈ A(s_0) z_h(s_j − s_0)    (3.7)

where z_h(s_j − s_0) = (1, (s_j − s_0)/h)^T and A(s_0) = [𝜷(s_0), h𝜷̇(s_0)], a p × 2 matrix. Hence, applying the approximation in Equation (3.7), we can rewrite model (3.1) as

Y_ij ≈ X_i^T A(s_0) z_h(s_j − s_0) + U_ij = [z_h(s_j − s_0) ⊗ X_i]^T vec{A(s_0)} + U_ij = W_ij(s_0)^T 𝜸(s_0) + U_ij    (3.8)

for s_j sufficiently close to s_0, where W_ij(s_0) = z_h(s_j − s_0) ⊗ X_i and 𝜸(s_0) = (𝜷(s_0)^T, h𝜷̇(s_0)^T)^T, both of which are vectors of length 2p.

Let K(·) be a symmetric probability density function used as a kernel and h > 0 be the bandwidth; the re-scaled kernel function is K_h(·) = K(·/h)/h. For a given location s_0 ∈ S, we can construct a least squares estimator of 𝜸(s_0) defined in Equation (3.8) by minimizing the sample version of the mean squared error E{[Y_ij − W_ij(s_0)^T 𝜸(s_0)]² | X_i}. Let 𝔐(X) be a q-dimensional instrumental variable with q ≥ p; the moment condition can be written as E{(1/r) Σ_{j=1}^r K_h(s_j − s_0) 𝚫_ij(s_0)} = 0, where 𝚫_ij(s_0) = 𝔐(X_i){Y_ij − W_ij(s_0)^T 𝜸(s_0)} is a zero-mean stochastic process. Two popular approaches to construct optimal instrumental variables were proposed by Newey (1990) and Ai and Chen (2003); due to the functional dependence and the presence of heteroskedasticity of unknown form, these approaches cannot be applied to model (3.8). Since the Taylor series expansion provides only a local approximation of the function, we need to incorporate this into the instrumental variables when constructing the mean-zero function. Motivated by the idea of the local linear estimator (Fan and Gijbels, 1996), consider the local linear instrumental variables Q_ij(s_0) = (𝔐(X_i), 𝔐(X_i)(s_j − s_0)/h)^T. We therefore consider the following non-parametrically localized augmented orthogonal moment conditions for estimating 𝜷(s):

g_i{𝜸(s_0)} = (1/r) Σ_{j=1}^r K_h(s_j − s_0) Q_ij(s_0) {Y_ij − W_ij(s_0)^T 𝜸(s_0)}
           = ( (1/r) Σ_{j=1}^r K_h(s_j − s_0) 𝚫_ij(s_0) ,  (1/r) Σ_{j=1}^r K_h(s_j − s_0) {(s_j − s_0)/h} 𝚫_ij(s_0) )^T
           = (1/r) Σ_{j=1}^r K_h(s_j − s_0) z_h(s_j − s_0) ⊗ 𝚫_ij(s_0)    (3.9)

Note that {g_i(𝜸(s)) : i = 1, ⋯, n} are independent and E{g_i(𝜸(s))} = 0_{2q×1} for s ∈ S. Most varying-coefficient models in the literature assume homoskedasticity in the covariates and are limited to weakly dependent non-parametric models (Su et al., 2013; Sun, 2016), which differs significantly from our model assumptions. In contrast, we assume a varying-coefficient model under heteroskedasticity of unknown form.
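To make the construction concrete, the following R sketch evaluates the localized moment condition (3.9) for one subject at a fixed s_0, with a Gaussian kernel and the simplest instrument 𝔐(X) = X. All inputs are illustrative assumptions.

```r
# Minimal sketch of the localized moment condition (3.9) at a fixed s0:
# g_i(gamma(s0)) = (1/r) sum_j K_h(s_j - s0) z_h(s_j - s0) %x% M(X_i){Y_ij - W_ij(s0)' gamma(s0)}.
local_moment <- function(y_i, X_i, M_i, s, s0, h, gamma0) {
  r   <- length(s)
  Kh  <- dnorm((s - s0) / h) / h                   # Gaussian kernel K_h(s_j - s0)
  out <- 0
  for (j in seq_len(r)) {
    zh    <- c(1, (s[j] - s0) / h)                 # z_h(s_j - s0)
    Wij   <- kronecker(zh, X_i)                    # W_ij(s0) = z_h %x% X_i (length 2p)
    resid <- y_i[j] - sum(Wij * gamma0)            # Y_ij - W_ij(s0)' gamma(s0)
    out   <- out + Kh[j] * kronecker(zh, M_i * resid)   # z_h %x% Delta_ij(s0)
  }
  out / r                                          # a (2q)-vector
}

# Tiny usage example with p = 2, q = 2 and gamma evaluated at the truth
s   <- seq(0, 1, length.out = 100); s0 <- 0.5; h <- 0.1
X_i <- c(1, 0.7); M_i <- X_i                       # simplest instrument: M(X) = X
beta_s <- function(t) c(sin(2 * pi * t), cos(2 * pi * t))
y_i <- sapply(s, function(t) sum(X_i * beta_s(t))) + rnorm(length(s), sd = 0.1)
gamma0 <- c(beta_s(s0), h * c(2 * pi * cos(2 * pi * s0), -2 * pi * sin(2 * pi * s0)))
local_moment(y_i, X_i, M_i, s, s0, h, gamma0)      # close to zero up to noise and bias
```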
The key ideas of each step are described below. Step-I. Calculate the least squares estimates of 𝜷(𝑠) as initial estimates, denoted by 𝜷(𝑠) ˘ across all 𝑠 ∈ S. Step-II. Estimate the conditional variance of integrated square residuals non-parametrically, subse- quently estimate the covariance of mean-zero function. Estimate the eigen-components using multivariate FPCA. Step-III. Project the continuous moment conditions onto eigen-functions and then combine them by weighted eigenvalues to incorporate spatial dependence and thus obtain the updated estimate of 𝜷(𝑠), denoted by b 𝜷(𝑠) across all 𝑠 ∈ S. 3.3.1 Step-I: Initial least squares estimates We consider a local linear smoother (Fan and Gijbels, 1996) to obtain an initial estimator of 𝜷(·) ignoring functional dependencies. In this case, the non-linear least squares function of the model 3.1 can be defined as an objective function given by J𝑖𝑛𝑖𝑡 {𝜷(·)} = 𝑛𝑟1 𝑖=1 T 2 Í𝑛 Í𝑟 𝑗=1 {𝑌𝑖 𝑗 − X𝑖 𝜷(𝑠 𝑗 )} . By the local linear smoothing method we estimate 𝜸 at functional point 𝑠0 , by minimizing 𝑛 𝑟 1 ∑︁ ∑︁ 2 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) 𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠0 ) T 𝜸(𝑠0 )  J𝑖𝑛𝑖𝑡 {𝜸(𝑠0 )} = (3.10) 𝑛𝑟 𝑖=1 𝑗=1 The solution of the above least-squares problem can be expressed as 𝑛 ∑︁ 𝑟 −1  𝑛 ∑︁ 𝑟  1 ∑︁   T     1 ∑︁     𝜸(𝑠 ˘ 0) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )𝑌𝑖 𝑗  𝑛𝑟   𝑛𝑟   𝑖=1 𝑗=1   𝑖=1 𝑗=1  (3.11) 76 Consequently, the estimator of the coefficient function vector 𝜷(𝑠) at 𝑠0 is ˘ 0 ) = [(1, 0) ⊗ I 𝑝 ] 𝜸(𝑠 𝜷(𝑠 ˘ 0) (3.12) We determine the tuning parameter ℎ by using some data-driven techniques such as cross-validation and generalized cross-validation (Hastie and Tibshirani, 2017). 3.3.2 Step-II: Intermediate steps Step-II consists of two important steps in determining the class of GMM estimator. First in Step-II.A, we propose a method to obtain optimal instrument variables and therefore estimate the eigen-components which are used in local linear GMM objective function in Step-III. To estimate eigen-components, we essentially need to use a multivariate version of FPCA which is quite uncommon in the literature. We borrow the method proposed by Wang (2008). Step-II.A: Choice of instrument variables The choice of instrument variables is critical and the required identification condition is 𝑞 ≥ 𝑝, which ensures that the dimension of Q𝑖 𝑗 (𝑠0 ) is at least equal to the dimension of 𝜸(𝑠0 ). In our model, as discussed in Section 3.2, the error term has a potential heteroskedasticity of unknown form. We define a set of independent and identically distributed random variables 𝑅1 , 𝑅2 , · · · , 𝑅𝑛 ∫ for 𝑛 individuals, where 𝑅𝑖 = 𝑈𝑖2 (𝑠)𝑑𝑠 for each 𝑖, termed as integrated square of residuals, and E{𝑅𝑖 |X𝑖 } = 𝜎 2 (X𝑖 ) ∞ Í 𝑘=1 𝜃 𝑘 . Therefore, consider the following non-parametric regression problem. log 𝑅𝑖 = log 𝜎 2 (X𝑖 ) + 𝜖𝑖 (3.13) where 𝜖𝑖 is the mean zero random variable with constant variance. The above model in Equation (3.13) boils down to the problem of estimation of log 𝜎 2 (X𝑖 ) by regressing the logarithmic value of the integrated squared residual variables on the covariates X𝑖 . This approach is inspired by Yu and Jones (2004); Wasserman (2006) although used in a different context. Since 𝑈𝑖 s are not observable, ˘ we replace 𝑈𝑖 by an efficient estimate that is obtained from Step-I, viz., 𝑈˘ 𝑖 (𝑠) = 𝑌𝑖 𝑗 − X𝑖T 𝜷(𝑠) for 77 all 𝑠 ∈ S. 
This step can easily be implemented using “gam" function available in mgcv package in R to get an estimate of the non-parametric mean function, denoted by 𝜇 b(X) and therefore 𝜎 2 (X) = exp{b b 𝜇 (X)}. Given the estimate of 𝜎(·), we can, therefore, choose instrument variables T as 𝔐(X𝑖 ) = X𝑖 , X𝑖 /b 𝜎 2 (X𝑖 ) . Step-II.B: Estimation of eigen-components Without loss of generality, assume that the spectrum of functional domain S = [0, 1] and the dimension of mean-zero function g(𝑠) = (𝑔1 (𝑠), · · · , 𝑔𝑑 (𝑠)) T is 𝑑 (in our problem, it equals 2𝑞) for Í𝑑 ∫ simplicity. Note that g(𝑠) is defined on an interval [0, 1] such that 𝑙=1 E{𝑔𝑙2 (𝑠)}𝑑𝑠 is finite and the covariance function C(𝑠, 𝑠′) = E g{𝜸(𝑠)}g{𝜸(𝑠)}T . Under the condition (C6) mentioned in  Section 3.4, using the lining-up method in Wang (2008), define a new stochastic process 𝑒(𝑠) on the interval [0, 𝑑] with eigen-function 𝜙 𝑒 such that,     𝑔1 (𝑠) 0≤𝑠<1 𝜙1 (𝑠) 0≤𝑠<1                 𝑔2 (𝑠 − 1) 1≤𝑠<2 𝜙2 (𝑠 − 1) 1≤𝑠<2                 · · · · · ·       𝑒(𝑠) = 𝜙 𝑒 (𝑠) =       𝑔𝑙 (𝑠 − (𝑙 − 1)) 𝑙−1 ≤ 𝑠 < 𝑙     𝜙𝑙 (𝑠 − (𝑙 − 1)) 𝑙−1 ≤ 𝑠 < 𝑙         · · ·   · · ·              𝑔𝑑 (𝑠 − 𝑑 + 1) 𝑑−1 ≤ 𝑠 < 𝑑  𝜙 𝑑 (𝑠 − 𝑑 + 1) 𝑑−1 ≤ 𝑠 < 𝑑       where we define the eigen-function for each 𝑔𝑙 as 𝜙𝑙 for 𝑙 = 1, · · · , 𝑑. Therefore, the covariance function between 𝑔𝑙 (𝑠) and 𝑔𝑙 ′ (𝑠′) can be expressed as 𝐶𝑙,𝑙 ′ (𝑠, 𝑠′) = cov{𝑔𝑙 (𝑠 − (𝑙 − 1)), 𝑔𝑙 ′ (𝑠′ − (𝑙 ′ − 1))} for 𝑙 − 1 ≤ 𝑠 < 𝑙 and 𝑙 ′ − 1 ≤ 𝑠′ < 𝑙 ′; 𝑙, 𝑙 ′ = 1, · · · , 𝑑. Note that for 𝑑-dimensional processes, the Fredholm integral equation (Porter et al., 1990) is equivalent to 𝑑-simultaneous integral equations where each of them corresponds to a specific functional interval of 𝑒(𝑠). For 𝑙 − 1 ≤ 𝑠 < 𝑙; 𝑙 = 1, · · · , 𝑑, the Fredholm integral equation yields ∫ 𝑑 cov{𝑒(𝑠), 𝑒(𝑠′)}𝜙 𝑒 (𝑠)𝑑𝑠 = 𝜆𝜙 𝑒 (𝑠) (3.14) 0 78 Now observe that for (𝑙 − 1) ≤ 𝑠 < 𝑙; 𝑙 = 1, · · · , 𝑑, the above relation is equivalent to the following. ∑︁ 𝑑 ∫ 𝑙′ cov{𝑔𝑙 (𝑠 − (𝑙 − 1)), 𝑔𝑙 ′ (𝑠′ − (𝑙 ′ − 1))𝜙𝑙 ′ (𝑠′)𝑑𝑠′ = 𝜆𝜙𝑙 (𝑠 − (𝑙 − 1)) 𝑙 ′ =1 𝑙 ′ −1 ∑︁ 𝑑 ∫ 1 cov{𝑔𝑙 (𝑠 − (𝑙 − 1)), 𝑔𝑙 ′ (𝑠′)}𝜙𝑙 ′ (𝑠′)𝑑𝑠′ = 𝜆𝜙𝑙 (𝑠 − (𝑙 − 1)) 𝑙 ′ =1 0 ∑︁ 𝑑 ∫ 1 cov{𝑔𝑙 (𝑠), 𝑔𝑙 ′ (𝑠′)}𝜙𝑙 ′ (𝑠′)𝑑𝑠′ = 𝜆𝜙𝑙 (𝑠) (3.15) 𝑙 ′ =1 0 In a multivariate setting, the orthogonality condition is ∫ 𝑑 ∑︁𝑑 ∫ 𝑘 ′ 𝜙 𝑒,𝑙 (𝑠)𝜙 𝑒,𝑙 ′ (𝑠)𝑑𝑠 = 1(𝑙 = 𝑙 ) = 𝜙 𝑘,𝑙 (𝑠 − (𝑘 − 1))𝜙 𝑘,𝑙 ′ (𝑠 − (𝑘 − 1))𝑑𝑠 0 𝑘=1 𝑘−1 ∑︁𝑑 ∫ 1 = 𝜙 𝑘,𝑙 (𝑠)𝜙 𝑘,𝑙 ′ (𝑠)𝑑𝑠 (3.16) 𝑘=1 0 Using the generalized Mercer’s theorem (J Mercer, 1909), the results for the covariance function can be briefly shown using the lining-up method. Assume that the covariance function is continuous after the lining-up processes, so for (𝑙 1 − 1) ≤ 𝑠 < 𝑙1 and (𝑙2 − 1) ≤ 𝑠 < 𝑙2 ; 𝑙1 , 𝑙2 = 1, · · · , 𝑑, the covariance function between 𝑔𝑙1 (𝑠) and 𝑔𝑙2 (𝑠′) can be expressed as ∞ ∑︁ ′ 𝐶𝑙,𝑙 ′ (𝑠, 𝑠 ) = cov{𝑔𝑙 (𝑠), 𝑔𝑙 ′ (𝑠)} = 𝜆 𝑘 𝜙 𝑘,𝑙 (𝑠 − (𝑙 − 1))𝜙 𝑘,𝑙 ′ (𝑠 − (𝑙 ′ − 1)) 𝑘=1 ∞ ∑︁ = 𝜆 𝑘 𝜙 𝑘,𝑙 (𝑠)𝜙 𝑘,𝑙 ′ (𝑠′) (3.17) 𝑘=1 Therefore, using the above argument, we can define the multivariate spectral decomposition ∞ ∑︁ C(𝑠, 𝑠′) = 𝜆 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠′) T (3.18) 𝑘=1 with the orthogonality condition 3.16. Since the lining-up data are univariate, we can adopt the existing techniques of estimating functional eigen-values and eigen-function in the literature (Yao et al., 2003, 2005; Müller and Yao, 2010; Li and Hsing, 2010) to estimate 𝜆 and 𝜙 𝑒 (𝑠), and hence can estimate 𝝓(𝑠) by stacking all components for aliened eigen-functions 𝜙 𝑒 (𝑠). 
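For concreteness, the two intermediate steps can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the exact implementation: the object names (U_hat for the matrix of Step-I residual curves, X for a scalar covariate, g_mat for the lined-up mean-zero curves) and the placeholder values are purely illustrative.

```r
# Placeholders standing in for Step-I output (illustrative only).
n <- 100; r <- 200; d <- 4
s_grid <- (seq_len(r) - 0.5) / r
X      <- rnorm(n, mean = 1, sd = 1)                   # scalar covariate
U_hat  <- matrix(rnorm(n * r), n, r)                   # Step-I residual curves U_i(s_j), n x r

# Step-II.A: variance function and instrument variables.
library(mgcv)
R_int      <- rowMeans(U_hat^2) * diff(range(s_grid))  # integrated squared residuals R_i
fit        <- gam(log(R_int) ~ s(X))                   # non-parametric regression of log R_i on X
sigma2_hat <- exp(predict(fit))                        # sigma^2-hat(X_i) = exp{mu-hat(X_i)}
M_X        <- cbind(X, X / sigma2_hat)                 # instruments M(X_i) = (X_i, X_i / sigma^2-hat_i)

# Step-II.B: line up the d components of the mean-zero function on [0, d] and
# run univariate FPCA on the lined-up curves (g_mat: n x (d * r), the d component
# curves concatenated subject by subject; placeholder values here).
library(fdapace)
g_mat   <- matrix(rnorm(n * d * r), n, d * r)
s_lined <- rep(0:(d - 1), each = r) + rep(s_grid, d)   # lined-up grid on [0, d]
fpca    <- FPCA(Ly = lapply(seq_len(n), function(i) g_mat[i, ]),
                Lt = rep(list(s_lined), n))
```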
79 3.3.3 Step-III: Final estimates Finally, we demonstrate our proposed estimator based on local-linear GMM where the proposed mean-zero function can be projected onto eigen-function and then combined by the weighted eigen-values. Then, for any positive 𝛼, the objective function is given by ∞ n o2 ∑︁ 𝜆 b𝑘 J{𝜸(𝑠0 )} = g(𝜸(𝑠0 )) Tb 𝝓 𝑘 (𝑠0 ) 𝑘=1 b2 + 𝛼 𝜆 𝑘 ∞ 𝑛 ∑︁ 𝑟 2 ∑︁ 𝜆 b𝑘  1 ∑︁   T  T   = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) 𝝓 𝑘 (𝑠0 ) Q𝑖 𝑗 (𝑠0 ) 𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠0 ) 𝜸(𝑠0 ) b 𝑘=1 b2 + 𝛼  𝑛𝑟 𝑖=1 𝑗=1 𝜆  𝑘   (3.19) By minimizing the above objective function, we obtain the following. ∞  1 ∑︁  𝑛 ∑︁ 𝑟  ∑︁ 𝜆 b𝑘  T   𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) 𝝓 𝑘 (𝑠0 ) Q𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) b b2 𝑘=1 𝜆 𝑘 + 𝛼   𝑛𝑟 𝑖=1 𝑗=1   1 𝑛 𝑟   ∑︁ ∑︁    𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 ) 𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠0 ) T 𝜸(𝑠0 )  × 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) b  𝑛𝑟   𝑖=1 𝑗=1  ∞ ∑︁ 𝜆 b𝑘 X𝑘 (𝑠0 ) y𝑘 (𝑠0 ) − X𝑘 (𝑠0 ) T 𝜸(𝑠0 )  := (3.20) b2 𝑘=1 𝜆 + 𝛼 𝑘 1 1 𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) and y𝑘 (𝑠0 ) = Í𝑛 Í𝑟 Í𝑛 Í𝑟 where X𝑘 (𝑠0 ) = 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 −𝑠0 ) b 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )𝑌𝑖 𝑗 . Therefore, the final estimate of the coherent function is b 𝑠0 ) b 𝜷(𝑠0 ) = [(1, 0) ⊗ 𝜸 (𝑠0 ) where I 𝑝 ]b ( ∞ ) −1 ( ∞ ) ∑︁ 𝜆 b𝑘 ∑︁ 𝜆 b𝑘 𝜸 (𝑠0 ) = b X𝑘 (𝑠0 )X𝑘 (𝑠0 ) T X𝑘 (𝑠0 )y𝑘 (𝑠0 ) (3.21) b2 𝑘=1 𝜆 𝑘 +𝛼 𝑘=1 b2 + 𝛼 𝜆 𝑘 The algorithm 3.1 summarizes the proposed method. For demonstration purposes, we choose the tuning parameters using cross-validation as discussed in the algorithm. In the proposed algorithm, 𝛼 controls the number of eigen-values, and can be chosen so that the condition (C8) defined in Section 3.4 is valid. Moreover, a continuity condition for lining-up is required for theoretical justification, by empirical studies, in the present of discontinuity in 𝜙 𝑒 , the end results are still adequate to use in practice. 80 Algorithm 3.1 Estimation of 𝜷(𝑠) : 𝑠 ∈ S for the proposed local-GMM based estimation procedure. Data: {𝑌𝑖 (𝑠 𝑗 ), 𝑋𝑖 , 𝑠 𝑗 }, for 𝑗 = 1, · · · , 𝑟; 𝑖 = 1, · · · , 𝑛 Result: Estimate 𝜷(𝑠) using proposed method Í Í𝑟 n −𝑖 o2 1: Calculate optimal ℎ: b ℎ𝑖𝑛𝑖𝑡 ← arg minℎ∈H 1 𝑛 𝑛𝑟 𝑖=1 𝑟=1 𝑌𝑖 𝑗 − XT 𝜷˘ (𝑠𝑟 ; ℎ) 𝑖 2: Calculate 𝜸(𝑠; ˘ b ℎ𝑖𝑛𝑖𝑡 ) 1 Í𝑟 3: g𝑖 {𝜸(𝑠)} = 𝑟 𝑗=1 𝐾bℎ𝑖𝑛𝑖𝑡 (𝑠 𝑗 − 𝑠)Q𝑖 𝑗 (𝑠; b ℎ𝑖𝑛𝑖𝑡 ){𝑌𝑖 𝑗 − W𝑖 𝑗 (𝑠; bℎ𝑖𝑛𝑖𝑡 ) T 𝜸(𝑠; ˘ b ℎ𝑖𝑛𝑖𝑡 )} 4: Determine the instrument variables 𝔐(X) 5: Compute eigen-components 𝜆 𝝓 𝑘 (𝑠) and get the value of 𝛼 using the condition (C8). b𝑘 , b n −𝑖 o2 1 Í𝑛 Í𝑟 T 6: Calculate optimal ℎ: ℎ 𝑜 𝑝𝑡 ← arg minℎ∈H 𝑛𝑟 𝑖=1 𝑟=1 𝑌𝑖 𝑗 − X𝑖 𝜷 (𝑠𝑟 ; ℎ) b b 7: Calculate b 𝜷(𝑠; ℎ 𝑜 𝑝𝑡 ) 3.4 Asymptotic results In this section, we provide some assumptions and then study the asymptotic properties of the local linear GMM estimator. Here, we allow the sample size 𝑛 and the number of functional domains 𝑟 to grow to infinity. Detailed technical proofs are provided in Section 3.8. Let 𝜷0 (𝑠0 ) be the true value of 𝜷(𝑠0 ) at the location 𝑠0 . For simplicity, define 𝛿𝑛1 (ℎ) = 1/2 1/2 ∫ (1 + (ℎ𝑟) −1 ) log 𝑛/𝑛 and 𝛿𝑛2 (ℎ) = (1 + (ℎ𝑟) −1 + (ℎ𝑟) −2 ) log 𝑛/𝑛   . 𝜈𝑎,𝑏 = 𝑡 𝑎 𝐾 𝑏 (𝑡)𝑑𝑡. Consider the following conditions that will be useful in asymptotic results. (C1) Kernel function 𝐾 (·) is a symmetric density function defined on the bounded support [−1, 1] and is Lipschitz continuous. (C2) Density function 𝑓𝑇 of 𝑇 is bounded above and away from infinity, and also below and away from zero. Moreover, 𝑓 is differentiable, and the derivative is continuous. (C3) E{∥ 𝑋 ∥ 𝑎 } < ∞ and E{sup𝑠∈S |𝑈 (𝑠)| 𝑎 } < ∞ for some positive 𝑎 > 1. Define E{𝔐(X)X} = 𝛀 with rank 𝑝. 
(C4) The true coefficient function 𝜷0 (𝑠) is three-times continuously differentiable and Σ(𝑠, 𝑠′) are twice continuously differentiable. 81 (C5) {𝑈 (𝑠) : 𝑠 ∈ [0, 1]} and {𝔐(X)𝑈 (𝑠) : 𝑠 ∈ [0, 1]} are Donsker class, where X ⊂ 𝔐(X) (C6) a) lim𝑠↘1 E{|𝑔𝑙 (𝑠 − 1) − 𝑔𝑙 (0)| 2 } = 0 for 𝑙 = 1, · · · , 𝑑 b) lim𝑠↗1 E{|𝑔𝑙−1 (𝑠) − 𝑔𝑙 (0)| 2 } = 0 for 𝑙 = 2, · · · , 𝑑 (C7) All second order partial derivatives of C(𝑠, 𝑠′) exist and are bounded on the support of the functional domain. Í  𝛼−1 𝜅0 −1 Í∞ (C8) For some 𝜅 0 ≥ 1 and =𝑜 𝑘=1 𝜆 𝑘 / 𝑘=𝜅 0 +1 𝜆 𝑘 (C9) The numbers of individuals and functional grid-points are growing to infinity such that ℎ → 0 and 𝑟 ℎ → ∞. For some positive number 𝑎 ∈ (2, 4), | log ℎ| 1−2/𝑎 /ℎ ≤ 𝑟 1−2/𝑎 . For 𝑎 > 2, (ℎ4 + ℎ3 /𝑟 + ℎ2 /𝑟 2 ) −1 (log 𝑛/𝑛) 1−2/𝑎 → 0 as 𝑛 → ∞. Remark 3.4.1. Conditions (C1) and (C2) are commonly used in the literature of non-parametric regression. The bound condition for the density function in (C2) of the functional points is standard for random design. Similar results can be obtained for fixed design where the grid-points are ∫𝑠 pre-fixed according to the design density 0 𝑗 𝑓 (𝑠)𝑑𝑡 = 𝑗/𝑟 for 𝑗 = 1, · · · , 𝑟, for 𝑟 ≥ 1. Condition (C3) is similar to that in Li and Hsing (2010) which requires the bound on the higher order moment of X. Moreover, the rank condition of 𝛀 is required for the identification of the functional coefficient and its first-order derivatives (Su et al., 2013). Condition (C4) is also common in functional data analysis literature (Wang et al., 2016). This condition allows us to perform the Taylor series expansion. Condition in (C5) avoids the smoothness condition of the sample path (Zhu et al., 2012, 2014) which is commonly assumed in Hall and Hosseini-Nasab (2006); Zhang and Chen (2007); Cardot et al. (2013). Conditions (C6) are required to check the continuity in the mean-zero function, which is equivalent to checking the mean square continuity of the process after lining-up (Hadinejad-Mahram et al., 2002). Here, the first condition shows the limits from right and remain always right; therefore, it involves only one process. A similar but opposite phenomenon occurs in the second condition. Moreover, if the vector process g(𝑠) is mean-square continuous, then both approaches are equivalent, as a result, the covariance function is continuous after lining-up the 82 process. To obtain the asymptotic expression of b 𝜷(𝑠), observed for a fixed sample size, there exists 𝜅0 such that 𝑘 ≤ 𝜅0 , 𝜆2𝑘 is much larger than 𝛼, thus the ratio 𝜆 𝑘 /(𝜆2𝑘 + 𝛼) ≈ 𝜆−1 𝑘 . On the other hand, if 𝑘 > 𝜅0 , 𝜆2𝑘 << 𝛼, as a result, the fraction 𝜆 𝑘 /(𝜆2𝑘 + 𝛼) can be approximately written as 𝜆 𝑘 /𝛼. Therefore, by the assumption that we make in (C8), we can write for 𝑠 ∈ S, 𝜅0 ∑︁ ∞ ∑︁ 𝜅0 ∑︁ 𝜆−1 ′ T 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 ) + 𝛼 −1 𝜆 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠′) T = 𝜆−1 ′ T 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 ) {1 + 𝑜(1)} (3.22) 𝑘=1 𝑘=𝜅 0 +1 𝑘=1 Condition (C9) provide the range of bandwidth. Under the fixed sampling design, this condition can be weakened; see Zhu et al. (2012). The following result provide the asymptotic properties of the initial estimates mentioned in Step-I. Theorem 3.4.1. Under conditions (C1), (C2), (C3), (C4), (C5), and (C9) n√   o ˘ 2 ¥ 𝑛 𝜷(𝑠0 ) − 𝜷0 (𝑠0 ) − 0.5ℎ 𝜈21 𝜷0 (𝑠0 ) × (1 + 𝑜 𝑎.𝑠. (1)) : 𝑠0 ∈ S weakly converges to a mean zero Gaussian process with a covariance matrix Σ(𝑠0 , 𝑠0 )𝛀−1 x where 𝛀x = E{XX }. T Next, we study the convergence rates of the estimated eigen-components based on the proposed lining-up method. 
The following lemma is the output of the asymptotic expansion of eigen- components of an estimated covariance function developed by Li and Hsing (2010). Lemma 3.4.2. Under assumptions (C1), (C2), (C3), (C6), (C7), (C8), and (C9) the following convergence holds almost surely for any finite-dimensional mean-zero function g(𝑠). 1. 𝜆 b𝑘 − 𝜆 𝑘 = 𝑂 (ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 (ℎ)) 2. sup𝑠0 ∈S b 𝝓 𝑘 (𝑠0 ) − 𝝓 𝑘 (𝑠0 ) = 𝑂 (ℎ2 + 𝛿𝑛1 (ℎ) + 𝛿𝑛2 2 (ℎ)) for all 𝑘 = 1, · · · , 𝜅0 . We skip the proof of the above lemma, as it is well developed in the literature of functional data analysis including Hall (2004); Hall and Hosseini-Nasab (2006); Li and Hsing (2010). Next, we show the asymptotic results of the proposed estimation. 83 Theorem 3.4.3. Suppose the conditions (C1), (C2), (C3), (C4), (C5), (C6), (C7), (C8), and (C9) hold, then for the proposed local linear GMM estimator b 𝜷(𝑠), have the following results hold. n√   o 𝑛 b𝜷(𝑠) − 𝜷0 (𝑠) − 0.5ℎ2 𝜈21 𝜷(𝑠))¥ (1 + 𝑜 𝑎.𝑠. (1)) : 𝑠 ∈ S weakly converges to a mean zero Gaussian process with a covariance matrix   −1   −1 A(𝑠0 , 𝑠0 ) = 𝛀C−1 (𝑠 , 𝜅 0 ,11 0 0 𝑠 )𝛀T 𝛀C −1 (𝑠 , 𝜅 0 ,11 0 0 𝑠 )𝚺(𝑠 , 0 0𝑠 )C −1 (𝑠 , 𝜅 0 ,11 0 0 𝑠 )𝛀 𝛀C (𝑠 , 𝜅 0 ,11 0 0 𝑠 ) −1 T 𝛀 . The proofs of Theorems 3.4.1 and 3.4.3 are provided in Section 3.8. 3.5 Simulation studies We conduct numerical studies to compare finite sample performance under different correlation structures and heterogeneity conditions. Data are generated from the following model. 𝑌𝑖 (𝑠) = 𝑋𝑖 𝜷(𝑠) + 𝑈𝑖 (𝑠) (3.23) where we generate trajectories observed at 𝑟 spatial locations for 𝑖-th curve, 𝑖 = 1, · · · , 𝑛. Assume that the functional fixed effect be 𝛽(𝑠) = cos(2𝜋𝑠) and corresponding fixed effect covariate is generated from the normal distribution with unit mean and variance. The error process is generated as 𝑈𝑖 (𝑠) = 𝜉1 (𝑋𝑖 )𝜓1 (𝑠) + 𝜉2 (𝑋𝑖 )𝜓2 (𝑠) (3.24) where 𝜉1 (𝑋𝑖 ) and 𝜉2 (𝑋𝑖 ) are independent central normal random variables with variance 3𝜎 2 (𝑋𝑖 )𝜃 02 and 1.5𝜎 2 (𝑋𝑖 )𝜃 02 where 𝜃 0 is determined by the relative importance of random error signal-to-noise ratio, denoted as SNR𝜃 which is interpreted as the ratio of standard deviation of the additive prediction without noise divided by the standard error of the random noise function. For example, SNR𝜃 = 2 means that the contribution of each functional random noise to the variability in 𝑌 (𝑠) is about double that of the fixed effect (Scheipl et al., 2015). Here, we use scaled orthonormal functions 𝜓1 (𝑠) ∝ (1.5 − sin(2𝜋𝑠) − cos(2𝜋𝑠)) and 𝜓2 (𝑠) ∝ sin(4𝜋𝑠); due to orthonormality, the proportionality constant can be easily determined. Contributions to the conditional variances in 𝜉 𝑘 (𝑋) are specified below. 84 S.0 𝜎 2 (𝑥) = 1 (homoskedastic) S.1 𝜎 2 (𝑥) = (1 + 𝑥 2 /2) 2 S.2 𝜎 2 (𝑥) = exp(1 + 𝑥 2 /2) S.3 𝜎 2 (𝑥) = exp(1 + |𝑥| + 𝑥 2 ) S.4 𝜎 2 (𝑥) = (1 + |𝑥|/2) 2 The following parameters are considered for each of the above scenarios. 1. Observational spatial points. We sample the trajectories at 𝑟 equidistant spatial points {𝑠1 , · · · , 𝑠𝑟 } on [0, 1]. Let 𝑠𝑖 = ( 𝑗 − 0.5)/𝑟 for 𝑗 = 1, · · · , 𝑟 for the 𝑖-th curve. The number of spatial points is assumed to be 200 for each case. 2. Sample size. Number of trajectories 𝑛 ∈ {50, 100, 200, 500}. 3. Signal to noise ratio. The controlling parameter 𝜃 0 is determined using signal-to-noise ratio, SNR𝜃 which is assumed to be either 0.5 or 1. Here, for each of the above situations, we perform 500 simulation replicates. 
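To make the data-generating mechanism concrete, the following R sketch simulates one replicate under scenario S.1. It is a minimal sketch: the constant theta0 is supplied directly here (in the study it is calibrated from SNR𝜃), and the orthonormalization of 𝜓1 and 𝜓2 is carried out empirically on the observation grid.

```r
# One simulated replicate under scenario S.1 (sketch).
simulate_svc <- function(n, r = 200, theta0 = 1) {
  s_grid <- (seq_len(r) - 0.5) / r                       # equispaced design points on [0, 1]
  beta   <- cos(2 * pi * s_grid)                         # true coefficient function
  psi1   <- 1.5 - sin(2 * pi * s_grid) - cos(2 * pi * s_grid)
  psi2   <- sin(4 * pi * s_grid)
  psi1   <- psi1 / sqrt(mean(psi1^2))                    # empirical normalisation so the
  psi2   <- psi2 / sqrt(mean(psi2^2))                    # basis functions have unit L2 norm
  X      <- rnorm(n, mean = 1, sd = 1)                   # scalar fixed-effect covariate
  sigma2 <- (1 + X^2 / 2)^2                              # scenario S.1 variance function
  xi1    <- rnorm(n, sd = sqrt(3.0 * sigma2) * theta0)   # heteroskedastic FPC scores
  xi2    <- rnorm(n, sd = sqrt(1.5 * sigma2) * theta0)
  U      <- xi1 %o% psi1 + xi2 %o% psi2                  # error process on the grid
  Y      <- X %o% beta + U                               # n x r matrix of response curves
  list(Y = Y, X = X, s = s_grid)
}
dat <- simulate_svc(n = 100)
```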
To make the implementation consistent with the theoretical results and numerical examples, we use the "FPCA" function in R, available in the fdapace package (Gajardo et al., 2021), or the MATLAB (MATLAB, 2014) package PACE available at http://www.stat.ucdavis.edu/PACE/ to estimate the eigen-functions. Bandwidths are selected using five-fold generalized cross-validation in all situations, and for estimation the Epanechnikov kernel 𝐾(𝑥) = 0.75(1 − 𝑥²)₊ is used, where (𝑎)₊ = max(𝑎, 0). Accuracy of the parameter estimation is assessed using the integrated mean squared error (IMSE) and the integrated mean absolute error (IMAE), which for the 𝑏-th replication are defined as
IMSE_𝑏 = Σ_{𝑗=1}^{𝑟} { 𝜷̂_𝑏(𝑠_𝑗) − 𝜷(𝑠_𝑗) }² Δ(𝑠_𝑗)   (3.25)
and
IMAE_𝑏 = Σ_{𝑗=1}^{𝑟} | 𝜷̂_𝑏(𝑠_𝑗) − 𝜷(𝑠_𝑗) | Δ(𝑠_𝑗)   (3.26)
with Δ(𝑠_𝑗) = 𝑠_𝑗 − 𝑠_{𝑗−1}, 𝑠_0 = 0, and 𝑠_1 < · · · < 𝑠_𝑟 the observed points over the support of the observational domain. We have noticed that the results can be improved by multiplying ℎ* by a constant in a certain range, where ℎ* is the optimal bandwidth obtained from cross-validation; we use the estimate 𝜷̂ corresponding to the bandwidth 0.75ℎ* in our numerical studies. We present Tables 3.1 and 3.2, where IMSEs and IMAEs are averaged over 500 replications for each situation. We denote by LLE, LLGMM and LLGMM-opt the local linear smoothing estimator described in Step-I, the local linear GMM estimator without the weight matrix, and the local linear GMM estimator with the weight matrix proposed in Step-III of Section 3.3, respectively. As expected, for all situations, IMSE and IMAE are substantially reduced as the sample size and/or SNR𝜃 increases. For the homoskedastic case, the error rates of LLE and LLGMM are similar, but in the presence of heteroskedasticity of unknown form, the proposed estimators clearly outperform LLE.
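For reference, the error measures in Equations (3.25) and (3.26) can be computed for a single replicate by the short R function below; beta_hat and beta_true are illustrative names for the estimated and true coefficient functions evaluated on the observed grid.

```r
# IMSE and IMAE for one replicate, following Equations (3.25)-(3.26).
imse_imae <- function(beta_hat, beta_true, s_grid) {
  delta <- diff(c(0, s_grid))                 # Delta(s_j) = s_j - s_{j-1}, with s_0 = 0
  c(IMSE = sum((beta_hat - beta_true)^2 * delta),
    IMAE = sum(abs(beta_hat - beta_true) * delta))
}
```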
Table 3.1 Performance of the estimation procedure with SNR𝜃 = 0.5 n = 50 n = 100 n = 200 n = 500 Case Method IMSE IMAE IMSE IMAE IMSE IMAE IMSE IMAE S.0 LLE 0.0372 0.1429 0.0189 0.1016 0.0099 0.0737 0.0041 0.0472 LLGMM 0.0375 0.1435 0.0191 0.1022 0.0099 0.0737 0.0041 0.0471 LLGMM-opt 0.0388 0.1460 0.0198 0.1038 0.0100 0.0740 0.0042 0.0474 S.1 LLE 0.0939 0.2271 0.0516 0.1679 0.0261 0.1189 0.0106 0.0766 LLGMM 0.0816 0.2123 0.0443 0.1556 0.0227 0.1109 0.0091 0.0708 LLGMM-opt 0.0585 0.1820 0.0292 0.1262 0.0135 0.0867 0.0050 0.0517 S.2 LLE 0.1381 0.2810 0.0812 0.2169 0.0468 0.1632 0.0209 0.1094 LLGMM 0.0804 0.2105 0.0372 0.1409 0.0164 0.0902 0.0048 0.0486 LLGMM-opt 0.0557 0.1462 0.0134 0.0817 0.0045 0.0471 0.0015 0.0262 S.3 LLE 0.1762 0.3330 0.1018 0.2538 0.0581 0.1913 0.0265 0.1291 LLGMM 0.0328 0.1069 0.0126 0.0619 0.0055 0.0376 0.0023 0.0243 LLGMM-opt 0.1067 0.0600 0.0021 0.0201 0.0004 0.0093 0.0003 0.0064 S.4 LLE 0.0588 0.1798 0.0309 0.1298 0.0158 0.0928 0.0065 0.0596 LLGMM 0.0577 0.1782 0.0303 0.1285 0.0155 0.0920 0.0063 0.0589 LLGMM-opt 0.0576 0.1792 0.0303 0.1287 0.0153 0.0914 0.0063 0.0585 86 Table 3.2 Performance of the estimation procedure with SNR𝜃 = 1 n = 50 n = 100 n = 200 n = 500 Case Method IMSE IMAE IMSE IMAE IMSE IMAE IMSE IMAE S.0 LLE 0.0097 0.0729 0.0048 0.0515 0.0025 0.0372 0.0010 0.0237 LLGMM 0.0099 0.0738 0.0051 0.0526 0.0025 0.0373 0.0010 0.0238 LLGMM-opt 0.0101 0.0741 0.0052 0.0532 0.0025 0.0374 0.0010 0.0238 S.1 LLE 0.0248 0.1169 0.0135 0.0860 0.0068 0.0608 0.0027 0.0387 LLGMM 0.0142 0.0887 0.0070 0.0617 0.0034 0.0430 0.0013 0.0269 LLGMM-opt 0.0126 0.0836 0.0062 0.0573 0.0029 0.0403 0.0012 0.0253 S.2 LLE 0.0363 0.1440 0.0215 0.1117 0.0124 0.0842 0.0055 0.0560 LLGMM 0.0103 0.0734 0.0046 0.0480 0.0019 0.0304 0.0006 0.0172 LLGMM-opt 0.0069 0.0589 0.0029 0.0376 0.0012 0.0239 0.0004 0.0142 S.3 LLE 0.0466 0.1705 0.0274 0.1314 0.0157 0.0994 0.0071 0.0669 LLGMM 0.0050 0.0398 0.0025 0.0273 0.0011 0.0168 0.0004 0.0101 LLGMM-opt 0.0006 0.0133 0.0003 0.0078 0.0002 0.0069 0.0001 0.0052 S.4 LLE 0.0155 0.0924 0.0080 0.0662 0.0041 0.0472 0.0016 0.0300 LLGMM 0.0141 0.0881 0.0073 0.0631 0.0036 0.0447 0.0015 0.0285 LLGMM-opt 0.0142 0.0883 0.0074 0.0634 0.0037 0.0449 0.0015 0.0285 3.6 Real data analysis For illustrating the application of our proposed method and the estimation procedure therein, we use Apnea-data to understand white matter structural alterations using diffusion tensor imaging (DTI) in obstructive sleep apnea (OSA) patients (Xiong et al., 2017). The details of this data-set have already been discussed in Chapter 2, Subsection 2.5.2. FA varies systematically along the trajectory of each white matter fascicle. Several pre- and post-processing steps were performed by the FSL software. The brain was extracted using brain segmentation tools. After generating FA maps using the FMRIB diffusion toolbox, images from all individuals were aligned to an FA standard template through non-linear co-registration. The Johns Hopkins University (JHU) white matter tractography atlas was used as a standard template for white matter parcellation with 50 regions of interest (ROIs). All imaging parameters were calculated by averaging the voxel values in each ROI. For each subject, we calculate the similarity matrix C with dimension 50 × 50. The (𝑘, 𝑙)-th 87 element of the matrix C is defined as 𝑐 𝑘𝑙 = |𝑦 𝑘 − 𝑦 𝑙 | where 𝑦 𝑘 is the measure of FA associated with 𝑘-th ROI. For simplicity, we scale the similarity matrix such that the range of elements of the matrix is [0, 1]. 
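The subject-level network summary used here can be sketched in R as follows; y is an illustrative placeholder for the length-50 vector of ROI-averaged FA values of one subject, and the thresholding and average-path-length steps are the ones described in the next paragraph.

```r
# Similarity matrix, thresholded networks, and average path length (sketch).
library(igraph)
y <- runif(50)                                   # placeholder for ROI-averaged FA values
C <- abs(outer(y, y, "-"))                       # c_kl = |y_k - y_l|
C <- (C - min(C)) / (max(C) - min(C))            # rescale the similarities to [0, 1]
thresholds <- seq(0.01, 0.99, by = 0.01)
apl <- sapply(thresholds, function(thr) {
  G <- (C >= thr) * 1; diag(G) <- 0              # adjacency matrix at this threshold
  g <- graph_from_adjacency_matrix(G, mode = "undirected")
  mean_distance(g)                               # average path length of the network
})
```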
To create the network, we threshold each similarity matrix to build an adjacency matrix G with elements {1, 0}, depending on whether the similarity values exceed the threshold or not. Since this threshold controls the topology of the data, we construct the adjacency matrix over a set of threshold parameters ranging from 0.01 to 0.99; this set is denoted by S and has cardinality 99. A popular measure of connectivity is the average path length (APL), defined as the average number of steps along the shortest path over all possible pairs of network nodes. It therefore measures the efficiency of information transfer on a network (Albert and Barabási, 2002). For the series of threshold parameters (𝑠), we observe the APL for FA as shown in Figure 3.1. Scientists are often interested in the association between the APL of the network generated from FA and a set of covariates of interest such as age and the lapse score. We fit the model 3.1 to the APLs collected over the continuous spatial domain (viz., the thresholds) from all individuals, where X𝑖 includes clinical variables such as the number of lapses and age. We discard subjects with missing clinical variables from the analysis, which leaves a sample size of 𝑛 = 27. Here we used three-fold cross-validation to obtain the tuning parameters, and the fraction of variance explained (FVE) is set to 0.99. In Figure 3.2, we present the estimated coefficient functions of age and number of lapses associated with APL, where it can be observed that the coefficient of the network property is negative for age but positive for the lapse counts. Moreover, one estimated effect on the APL is increasing when the threshold is small to moderate and decreasing at moderate to large thresholds, whereas the other is more-or-less stable up to the larger threshold values and decreases markedly thereafter. Here, small values of the threshold produce sparsely connected graphs in which true connections might be eliminated by the thresholding; on the other hand, for large threshold values, the generated graphs are densely connected.
Figure 3.2 Apnea-data analysis: Plots of estimated coefficient functions of age (top panel) and number of lapses (bottom panel) for average path length associated with Fractional Anisotropy (FA) in DTI analysis.
3.7 Discussion
In this chapter, we propose an efficient estimation procedure for the varying-coefficient model, which is commonly used in neuroimaging and econometrics. This procedure offers an efficient way to incorporate heteroskedasticity into the analysis of functional data; to the best of our knowledge, this is the first attempt to incorporate such a condition into the model. The model therefore accommodates a more complex relationship between the functional response and real-valued covariates. Additionally, our method is easy to implement in a wide range of applications owing to the multi-stage structure of the algorithm. The applicability of the proposed method is illustrated by simulation studies and a real data analysis. We leave the testing of hypotheses for linear constraints on 𝜷0(·) for future studies.
3.8 Technical details
In this section, we provide technical details for the theorems proposed in Section 3.4. We prove Theorems 3.4.1 and 3.4.3 via the following lemmas.
3.8.1 Some useful lemmas
Lemma 3.8.1.
Under the conditions (C5) 𝑛 𝑖=1 𝑈𝑖 (𝑠0 )𝔐(X𝑖 ) is tight. Proof. Consider the class of function C = {𝑈 (𝑠0 )𝔐(X𝑖 ) : 𝑠0 ∈ [0, 1]}. Therefore, due the assumption (C5), C is a P-Donsker class. Therefore, √1𝑛 𝑖=1 Í𝑛 𝑈𝑖 (𝑠0 )𝔐(X𝑖 ) is tight. Lemma 3.8.2. Under the assumptions (C1), (C2) and (C9), the following holds for any power 𝑐 ≥ 0. ∫ sup 𝐾 ℎ (𝑡 − 𝑠) {(𝑡 − 𝑠)/ℎ} 𝑐 𝑑Π(𝑡) − Π(𝑡) = 𝑂 (1/(𝑟 ℎ) −1/2 ) (3.27) 𝑠∈[0,1] The above bound can be replaced by 𝑂 (1/𝑟 ℎ) for fixed design case. Proof. This can be proved by using the empirical process techniques by observing that the class {𝐾 ((· − 𝑠/ℎ))((· − 𝑠/ℎ)) 𝑐 : 𝑠 ∈ [0, 1]} is a P-Donsker class (Zhu et al., 2012). For the balanced case, the results can be shown using Tayler’s series expansion. 90 1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 ) T . Under the conditions Í𝑛 Í𝑟 Lemma 3.8.3. Define I(𝑠0 ) = 𝑛𝑟 𝑖=1 𝑗=1 (C1), (C2), (C3) and (C9) I(𝑠0 ) = 𝑓 (𝑠0 )diag(1, 𝜈21 ) ⊗ 𝛀 + 𝑂 (ℎ + 𝛿𝑛1 (ℎ)) almost surely, where 𝛀 = E{𝔐(X)XT }. Proof. Observe the following. 𝑛 𝑟 1 ∑︁ ∑︁ I(𝑠0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 ) T 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁   T = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ X𝑖 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁ n 2 o = 𝐾 ℎ (𝑠 𝑗 − 𝑠) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ ⊗ 𝔐(X𝑖 )X𝑖T 𝑛𝑅 𝑖=1 𝑗=1 𝑛 1 ∑︁ ∑︁ 𝑟 © 𝔐(X𝑖 )X𝑖T 𝔐(X𝑖 )X𝑖T (𝑠 𝑗 − 𝑠0 )/ℎ ª = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) ­­ ® 𝑛𝑟 𝑖=1 𝑗=1 𝔐(X𝑖 )X𝑖T (𝑠 𝑗 − 𝑠0 )/ℎ 𝔐(X𝑖 )X𝑖T ((𝑠 𝑗 − 𝑠0 )/ℎ) 2 ® « ¬ ©I11 (𝑠0 ) I12 (𝑠0 ) ª := ­­ ® ® (3.28) I (𝑠 ) I22 (𝑠0 ) « 21 0 ¬ Let us define I𝑎,𝑏 = 𝑛𝑟1 𝑖=1 𝑎+𝑏 𝔐(X )XT . Assume that 𝜈 is finite and Í𝑛 Í𝑟 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )(𝑠 𝑗 − 𝑠0 ) 𝑖 𝑖 41 due to condition (C2),for general index 𝑐, we can derive the uniform bound of for all 𝑠0 ∈ S.  1 ∑︁   𝑛 ∑︁ 𝑟    𝑐 T E{I𝑎,𝑏 (𝑠0 )} = E 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )((𝑠 𝑗 − 𝑠0 )/ℎ) 𝔐(X𝑖 )X𝑖  𝑛𝑟   𝑖=1 𝑗=1   1 ∑︁   𝑟    = 𝛀E 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )((𝑠 𝑗 − 𝑠0 )/ℎ) 𝑐 𝑟   𝑗=1  ∫ =𝛀 𝐾 ℎ (𝑢 − 𝑠0 )((𝑢 − 𝑠0 )/ℎ) 𝑐 𝑓 (𝑢)𝑑𝑢 ∫ =𝛀 𝐾 (𝑢)𝑢 𝑐 𝑓 (𝑠0 + ℎ𝑢)𝑑𝑢 ∫ 𝐾 (𝑢)𝑢 𝑐 𝑓 (𝑠0 ) + ℎ𝑢 𝑓 ′ (𝑠0 ) + 0.5ℎ2 𝑢 2 𝑓 ′′ (𝑠0 ) + · · · 𝑑𝑢  =𝛀 91  𝑐 = 0, provided 𝜈21 < ∞, 𝑓 ′′ exists and finite  𝑓 (𝑠0 ) + 𝑂 (ℎ2 )        𝑐 = 1, provided 𝜈21 < ∞, 𝑓 ′ exists and finite  𝑂 (ℎ)    =𝛀 (3.29) 𝑓 (𝑠0 )𝜈21 + 𝑂 (ℎ2 ) 𝑓 ′′      𝑐 = 2, provided 𝜈41 < ∞, exists and finite    𝑐 = 3, provided 𝜈41 < ∞, 𝑓 ′ exists and finite  𝑂 (ℎ)    Moreover, under the condition (C3), we have E∥X∥ 𝑎 is finite for some 𝑎 > 2 and can define, 𝑏 𝑛 = ℎ2 + ℎ/𝑟 where ℎ → 0 such that 𝑏 −1 𝑛 (log 𝑛/𝑛) 1−2/𝑎 = 𝑜(1). Thus, 𝛿 (ℎ) = {𝑏 log 𝑛/𝑛ℎ2 } 1/2 . 𝑛1 𝑛 Now to establish the uniform bound for I(𝑠0 ), by using Lemma 2 in Li and Hsing (2010) for each of I𝑎,𝑏 (𝑠0 ) for 𝑎, 𝑏 = 1, 2, we have I(𝑠0 ) = 𝑓 (𝑠0 )(diag(1, 𝜈21 )) ⊗ 𝛀 + 𝑂 (ℎ + 𝛿𝑛1 (ℎ)) almost surely (3.30) 1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 )X𝑖T 𝜷0 (𝑠 𝑗 ). Thus, under the condi- Í𝑛 Í𝑟 Lemma 3.8.4. Define, J(𝑠0 ) = 𝑛𝑟 𝑖=1 𝑗=1 tions (C1), (C2), (C4), (C3) and (C9), J(𝑠0 ) − I(𝑠0 )𝜸 0 (𝑠0 ) = 𝑓 (𝑠0 )(𝜈21 , 0) ⊗ 𝛀 + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) almost surely, where 𝜸 0 (𝑠0 ) = ( 𝜷0 (𝑠0 ) T , ℎ 𝜷¤ 0 (𝑠0 ) T ) T . Moreover, T(𝑠0 ) = 𝑛𝑟1 𝑖=1 Í𝑛 Í𝑟 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 )𝑈𝑖 𝑗 = 𝑂 (𝛿𝑛1 (ℎ)) almost surely. Proof. 
Observe that, because of condition (C4), using Taylor’s series expansion, 𝑛 𝑟 1 ∑︁ ∑︁ J(𝑠0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 )X𝑖T 𝜷0 (𝑠0 ) 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )Q𝑖 𝑗 (𝑠0 ) X𝑖T 𝜷0 (𝑠0 ) + (𝑠 𝑗 − 𝑠0 )X𝑖T 𝜷¤ 0 (𝑠0 )  = 𝑛𝑟 𝑖=1 𝑗=1 +0.5(𝑠 𝑗 − 𝑠0 ) 2 X𝑖T 𝜷¥ 0 (𝑠0 ) + 𝑜(ℎ2 ) = I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ2 I21 (𝑠0 ) + 𝑜(ℎ2 ) (3.31) Using similar arguments, due to Lemma 2 in (Li and Hsing, 2010), under the conditions (C1), (C2) 92 and (C3), with 𝜈41 being finite, we have 𝑛 1 ∑︁ ∑︁ 𝑟 © ((𝑠 𝑗 − 𝑠0 )/ℎ) 2 ª I21 (𝑠0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) ­­ ® 𝔐(X𝑖 )XT 𝑖 𝑛𝑟 𝑖=1 𝑗=1 ((𝑠 𝑗 − 𝑠0 )/ℎ) 3 ® « ¬ = 𝑓 (𝑠0 )(𝜈21 , 0) ⊗ 𝛀 + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) almost surely (3.32) and 1 Í𝑛 Í𝑟 © 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝔐(X𝑖 )𝑈𝑖 𝑗 ª T(𝑠0 ) = ­­ Í Í ® 1 𝑛 𝑟 ® 𝑛𝑟 𝑖=1 𝑗=1 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )((𝑠 𝑗 − 𝑠0 )/ℎ)𝔐(X𝑖 )𝑈𝑖𝑟 « ¬ = 𝑂 (𝛿𝑛1 (ℎ)) almost surely (3.33) Lemma 3.8.5. Under conditions (C1),(C2), (C3), (C5), (C9), √ 𝑑 𝑛T(𝑠0 )(1 + 𝑜 𝑎.𝑠. (1)) → − 𝑁 (0, 𝑓 2 (𝑠0 )Σ(𝑠0 , 𝑠0 ) ⊗ 𝛀) (3.34) where T(𝑠0 ) is defined in Lemma 3.8.4. Proof. Note that 𝑛 𝑟 √ 1 ∑︁ ∑︁   𝑛T(𝑠0 ) = √ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 ) 𝑈𝑖 𝑗 (3.35) 𝑛𝑟 𝑖=1 𝑗=1 Therefore, the variance of the above quantity is √ Var{ 𝑛T(𝑠0 )} 𝑛 ∑︁ 𝑟 ∑︁ 𝑟 1    ∑︁ h i  2   = E 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝐾 ℎ (𝑠 𝑗 ′ − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 ′ − 𝑠0 ) T ⊗ 𝔐(X𝑖 ) ⊗ 𝑈𝑖 𝑗 𝑈𝑖 𝑗 ′ 𝑛  𝑖=1 𝑗=1 𝑗 ′=1    𝑛 𝑟 𝑟 1 ∑︁   1 ∑︁ ∑︁  h i  ⊗2   = E 2 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 ) ′ ′ Σ(𝑠 𝑗 , 𝑠 𝑗 ) ′ 𝑛 𝑖=1  𝑟 𝑗=1 𝑗 ′=1     1 ∑︁   𝑟 ∑︁𝑟    =E 2 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝐾 ℎ (𝑠 𝑗 ′ − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) T Σ(𝑠 𝑗 , 𝑠 𝑗 ′ ) ⊗ 𝛀 𝑟  𝑗=1 𝑗 ′=1   = [E{D1 (𝑠0 )} + E{D2 (𝑠0 )}] ⊗ 𝛀 (3.36) 93 1 2 1 Í𝑛 Í𝑟 𝐾 ℎ2 (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ Σ(𝑠 𝑗 , 𝑠 𝑗 ) and D2 (𝑠0 ) = Í𝑛 where D1 (𝑠0 ) = 𝑟2 𝑗=1 𝑟 (𝑟−1) 𝑗=1 𝑗 ′ =1 𝐾 ℎ (𝑠 𝑗 − 𝑗≠ 𝑗 ′ 𝑠0 )𝐾 ℎ (𝑠 𝑗 ′ − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 ′ − 𝑠0 ) T Σ(𝑠 𝑗 , 𝑠 𝑗 ′ ). Note that  1 ∑︁  𝑟   2   E{D2 (𝑠0 )} = E 2 𝐾 2 (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ Σ(𝑠 𝑗 , 𝑠 𝑗 ) 𝑟   𝑗=1  ∫ 1 2 = 𝐾 ℎ2 (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) ⊗ Σ(𝑡, 𝑡) 𝑓 (𝑡)𝑑𝑡 𝑟 ©1 𝑡 ª ∫ 1 = 𝐾 2 (𝑡) ­­ ® Σ(𝑠0 + ℎ𝑢) 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ℎ𝑟 𝑡 𝑡 2 ® « ¬ 1 = { 𝑓 (𝑠0 )diag(𝜈02 , 𝜈22 )Σ(𝑠0 , 𝑠0 ) + 𝑂 (ℎ)} (3.37) ℎ𝑟 ∫ Now assume that Θ(𝑠0 ) = E{D2 (𝑠0 )} with (𝑙, 𝑙 ′)-th entry 𝜃 𝑙,𝑙 ′ and P(𝑡) = S 𝐾 ℎ (𝑡 − 𝑠0 )𝐾 ℎ (𝑡 ′ − 𝑠0 )zℎ (𝑡 − 𝑠0 )zℎ (𝑡 ′ − 𝑠0 )Σ(𝑡, 𝑡 ′) 𝑓 (𝑡 ′)𝑑𝑡 ′ with (𝑙, 𝑙 ′)-the element P𝑙,𝑙 ′ . Therefore, using Hájek pro- jection (Vaart and Wellner, 1996), we have 𝑟 2 ∑︁  D1,𝑙.𝑙 ′ (𝑠0 ) = 𝜃 𝑙,𝑙 ′ (𝑠0 ) + P𝑙,𝑙 ′ (𝑠 𝑗 ) − 𝜃 𝑙,𝑙 ′ (𝑠0 ) + 𝜖˜𝑙,𝑙 ′ (𝑠0 ) (3.38) 𝑟 𝑗=1 2 Í𝑟  where 𝑟 𝑗=1 P𝑙,𝑙 ′ (𝑠 𝑗 ) − 𝜃 𝑙,𝑙 ′ (𝑠0 ) is the projection on D2, ′,𝑙 ′ (𝑠0 ) − 𝜃 𝑙,𝑙 ′ (𝑠0 ) onto the set of all statistics of the linear order form. Thus, it is easy to see Var{𝜖˜} = 𝑂 (1/(𝑟 ℎ) 2 ) (Zhu et al., 2012). Since the Taylor series expansion for small ℎ → 0, we have 𝜃 𝑙,𝑙 ′ (𝑠0 ) = 𝑓 (𝑠0 ) 2 𝜈𝑙−1, 𝜈𝑙 ′−1,1 Σ(𝑠0 , 𝑠0 ). √ Therefore, in summery, we have Var{ 𝑛T(𝑠0 )} = 𝑓 2 (𝑠0 )UΣ(𝑠0 , 𝑠0 ). where the element (𝑙, 𝑙 ′) of the matrix U is 𝜈𝑙−1 𝜈𝑙 ′−1 . √ To hold the above asymptotic results, we need to show that 𝑛T(𝑠0 ) be tight asymptotically. 
Therefore, consider the following, for suitable choice of 𝑙 < 𝑙 after change of variables, √ 𝑛T(𝑠0 ) 𝑛 𝑟 1 ∑︁ ∑︁ =√ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) [zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ 𝔐(X𝑖 )]𝑈𝑖 𝑗 𝑛𝑟 𝑖=1 𝑟=1 𝑛  𝑟 ∫ 1 1 ∑︁   1 ∑︁    =√ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 )𝑈𝑖 𝑗 − 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 )𝑈𝑖 (𝑡) 𝑓 (𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1  𝑟 0   𝑗=1  𝑛 ∫ 𝑙 1 ∑︁ +√ 𝑈𝑖 (𝑠0 ) 𝐾 (𝑡)(1, 𝑡) T 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1 𝑙 94 𝑛 ∫ 𝑙 1 ∑︁ +√ 𝐾 (𝑡)(1, 𝑡) T {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1 𝑙 := T1 (𝑠0 ) + T2 (𝑠0 ) + T3 (𝑠0 ) (3.39) Note that, T1 (𝑠0 ) 𝑟 ( 𝑛 𝑛 ) 1 ∑︁ 1 ∑︁ 1 ∑︁ = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) √ 𝑈𝑖 𝑗 ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑡) ⊗ 𝔐(X𝑖 ) 𝑟 𝑟=1 𝑛 𝑖=1 𝑛 𝑖=1 𝑟 ∫ 𝑙 ( 𝑛 )  1 ∑︁    1 ∑︁   + 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) − 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) 𝑓 (𝑡)𝑑𝑡 √ 𝑈𝑖 (𝑡) ⊗ 𝔐(X𝑖 ) 𝑟 𝑙  𝑛  𝑗=1  𝑖=1 ∫ 𝑙 𝑛 1 ∑︁ + 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) √ {𝑈𝑖 (𝑠0 ) − 𝑈𝑖 (𝑡)} 𝑓 (𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑙 𝑛 𝑖=1 := T11 (𝑠0 ) + T12 (𝑠0 ) + T13 (𝑠0 ) (3.40) Due to the Donsker Theorem, we have √1𝑛 𝑖=1 Í𝑛 𝔐(X𝑖 )𝑈𝑖 (𝑠) weekly converges to a centered Gaussian process and sup𝑠∈[0,1] | √1𝑛 𝑖=1 Í𝑛 𝔐(X𝑖 )𝑈𝑖 (𝑠)| = 𝑂 𝑝 (1) (Vaart and Wellner, 1996). There- fore, |T11 (𝑠0 )| 𝑟 𝑛 𝑛 1 ∑︁ 1 ∑︁ 1 ∑︁ ≤ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )∥zℎ (𝑠 𝑗 − 𝑠0 )∥ 2 √ 𝑈𝑖 𝑗 ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑠) ⊗ 𝔐(X𝑖 ) 𝑟 𝑗=1 𝑛 𝑖=1 𝑛 𝑖=1 𝑟 𝑛 𝑛 1 ∑︁ 1 ∑︁ 1 ∑︁ ≤ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )∥zℎ (𝑠 𝑗 − 𝑠0 )∥ 2 sup √ 𝑈𝑖 (𝑠) ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑠0 ) ⊗ 𝔐(X𝑖 ) 𝑟 𝑗=1 |𝑠−𝑠0 |≤ℎ 𝑛 𝑖=1 𝑛 𝑖=1 = 𝑜 𝑃 (1) (3.41) |T12 (𝑠0 )| 𝑟 ∫ ( 𝑛 ) 𝑙 1 ∑︁ 1 ∑︁ ≤ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )zℎ (𝑠 𝑗 − 𝑠0 ) − 𝐾 ℎ (𝑡 − 𝑠0 )zℎ (𝑡 − 𝑠0 ) 𝑓 (𝑡)𝑑𝑡 sup √ 𝑈𝑖 (𝑡) ⊗ 𝔐(X𝑖 ) 𝑟 𝑗=1 𝑙 𝑡∈[0,1] 𝑛 𝑖=1 √ = 𝑂 𝑃 (1/ 𝑟 ℎ)𝑂 𝑃 (1) = 𝑜 𝑃 (1) (3.42) 95 The above bound holds for Lemma 3.8.2 and Condition (C9) so that 𝑚ℎ → ∞. |T13 (𝑠0 )| 𝑛 𝑛 ∫ 𝑙 1 ∑︁ 1 ∑︁ ≤ sup √ 𝑈𝑖 (𝑠) ⊗ 𝔐(X𝑖 ) − √ 𝑈𝑖 (𝑠0 ) ⊗ 𝔐(X𝑖 ) 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )∥zℎ (𝑠 𝑗 − 𝑠0 )∥ 2 𝑓 (𝑠)𝑑𝑠 |𝑠−𝑠0 |≤ℎ 𝑛 𝑖=1 𝑛 𝑖=1 𝑙 = 𝑂 𝑃 (1) (3.43) By combining the above three bounds, due to conditions (C1),(C2), (C3), (C5), (C9), we obtain T1 (𝑠0 ) = 𝑜 𝑃 (1). Now, rewrite T3 (𝑠0 ) as 𝑛 ∫ 1 ∑︁ 𝑙 T3 (𝑠) = √ 𝐾 (𝑡)(1, 𝑡) T {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 ⊗ 𝔐(X𝑖 ) 𝑛 𝑖=1 𝑙 ∫ 𝑙 = 𝐾 (𝑡)(1, 𝑡) T ⊗ {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} 𝔐(X𝑖 ) 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 (3.44) 𝑙 √1 Í𝑛 Since, 𝑛 𝑖=1 𝔐(X𝑖 )𝑈𝑖 (𝑠0 )is asymptotically tight, for any ℎ → 0, we have the following (Vaart and Wellner, 1996). 𝑛 1 ∑︁ sup √ 𝔐(X𝑖 ) {𝑈𝑖 (𝑠0 + ℎ𝑡) − 𝑈𝑖 (𝑠0 )} = 𝑜 𝑃 (1) (3.45) 𝑠0 ∈[0,1]:|𝑡|≤1 𝑛 𝑖=1 Now it is enough to show that T2 (𝑠0 ) is tight. First, observe that ∫ 𝑙 −1 (1, 0) 𝐾 (𝑡)diag(1, 𝜈21 )(1, 𝑡) T 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 𝑙 ∫ 𝑙 = 𝐾 (𝑡) 𝑓 (𝑠0 + ℎ𝑡)𝑑𝑡 𝑙 ∫ 𝑙 = 𝐾 (𝑡) { 𝑓 (𝑠0 ) + ℎ𝑡 𝑓 ′ (𝑠0 ) + · · · } 𝑙 = 𝑓 (𝑠0 ) + 𝑜(ℎ) (3.46) √1 Í𝑛 Therefore, T2 (𝑠0 )(1 + 𝑜 𝑃 (ℎ)) = 𝑛 𝑖=1 𝑈𝑖 (𝑠0 ) ⊗ 𝔐(X𝑖 ). By assumption (C5), T2 (𝑠0 ) is tight. 3.8.2 Proof of Theorem 3.4.1 Under the initial estimates, by considering 𝔐(X) = X, 𝛀 can be replaced by 𝛀x in Equation (3.47) and inverse of 𝛀x exits. Therefore, by using Lemma 3.8.3, it is easy to observe that, almost surely I(𝑠0 ) −1 = 𝑓 (𝑠0 ) −1 (diag(1, 𝜈21 ) −1 ) ⊗ 𝛀−1 x + 𝑂 (ℎ + 𝛿 𝑛1 (ℎ)) (3.47) 96 Similarly, for the numerator, we have the following. 
𝑛 𝑟 1 ∑︁ ∑︁  𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ X𝑖 𝑌𝑖 𝑗 𝑛𝑟 𝑖=1 𝑗=1 𝑛 𝑟 1 ∑︁ ∑︁ 𝐾 ℎ (𝑠 𝑗 − 𝑠0 ) zℎ (𝑠 𝑗 − 𝑠0 ) ⊗ X𝑖 X𝑖T 𝜷0 (𝑠 𝑗 ) + 𝑈𝑖 𝑗   = 𝑛𝑟 𝑖=1 𝑗=1 = I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ2 I21 (𝑠0 ) 𝜷¥ 0 (𝑠0 ) + T(𝑠0 ) + 𝑜(ℎ2 ) (3.48) Thus, using Equation (3.47) and (3.48), we can derive, ˘ 0 ) = [(1, 0) ⊗ I 𝑝 ]I(𝑠0 ) −1 I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ2 I21 (𝑠0 ) 𝜷¥ 0 (𝑠0 ) + T(𝑠0 ) + 𝑜(ℎ2 )  𝜷(𝑠 = 𝜷0 (𝑠0 ) + [(1, 0) ⊗ I 𝑝 ] 𝑓 (𝑠0 ) −1 diag(1, 𝜈21 ) −1 ⊗ 𝛀−1 { 𝑓 (𝑠0 )(𝜈21 , 0) ⊗ 𝛀x } 0.5ℎ2 𝜷¥ 0 (𝑠0 )  x + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) = 𝜷0 (𝑠0 ) + 0.5ℎ2 𝜈21 𝜷¥ 0 (𝑠0 ) + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) = 𝜷0 (𝑠0 ) + 𝑂 (𝛿𝑛1 (ℎ) + ℎ) almost surely (3.49) Therefore, sup𝑠0 ∈S 𝜷(𝑠 ˘ 0 ) − 𝜷0 (𝑠0 ) = 𝑂 (𝛿𝑛1 + ℎ) almost surely. Furthermore, observe that the bias of the initial estimator is ˘ 0 )} − 𝜷0 (𝑠0 ) = 0.5ℎ2 𝜈21 𝜷¥ 0 (𝑠0 ) {1 + 𝑂 𝑃 (𝛿𝑛1 (ℎ) + ℎ)} E{ 𝜷(𝑠 (3.50) Now, to calculate the variance, note that √ ˘ 0 ) − 𝜷(𝑠0 ) − 0.5ℎ2 𝜈21 𝛽¥0 (𝑠0 )}(1 + 𝑜 𝑎.𝑠. (1)) 𝑛{ 𝜷(𝑠 √ = [(1, 0) ⊗ I 𝑝 ] 𝑓 (𝑠0 ) diag(1, 𝜈21 ) −1 ⊗ 𝛀−1  x 𝑛T(𝑠0 ) (3.51) By Lemma 3.8.5, we have the variance of the above quantity Σ(𝑠0 , 𝑠0 )𝛀−1 x . 3.8.3 Proof of Theorem 3.4.3 Define C𝜅0 (𝑠, 𝑠′) = ′) T and hence, we can define C−1 ′ Í𝜅0 𝑘=1 𝜆 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 𝜅 0 (𝑠, 𝑠 ) with possible block matrix ∑︁𝜅0 ©C−1 ′ 𝜅 0 ,1,1 (𝑠, 𝑠 ) 0 C−1 ′ 𝜆−1 ′ T ª 𝜅 0 (𝑠, 𝑠 ) = 𝑘 𝝓 𝑘 (𝑠)𝝓 𝑘 (𝑠 ) =­ (3.52) ­ ® −1 ′ ® 𝑘=1 0 C𝜅0 ,2,2 (𝑠, 𝑠 ) « ¬ 97 Also define, ( 𝜅0 ) −1 ( 𝜅0 ) ∑︁ 𝜆𝑘 ∑︁ 𝜆𝑘 𝜸 𝜅0 (𝑠0 ) = b 2 X𝑘 (𝑠0 ; 𝜅0 )X𝑘 (𝑠0 ; 𝜅0 ) T 2 X𝑘 (𝑠0 ; 𝜅 0 )y𝑘 (𝑠0 ; 𝜅0 ) (3.53) 𝑘=1 𝜆𝑘 + 𝛼 𝑘=1 𝜆𝑘 + 𝛼 where 𝑛 𝑟 1 ∑︁ ∑︁ X𝑘 (𝑠0 ; 𝜅0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )W𝑖 𝑗 (𝑠0 ) (3.54) 𝑛𝑟 𝑗=1 𝑗=1 and 𝑛 𝑟 1 ∑︁ ∑︁ y𝑘 (𝑠0 ; 𝜅 0 ) = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )𝝓 𝑘 (𝑠0 ) T Q𝑖 𝑗 (𝑠0 )𝑌𝑖 𝑗 (3.55) 𝑛𝑟 𝑗=1 𝑗=1 Therefore, we have the following. 𝜅0 ∑︁ 𝜆𝑘 2 X𝑘 (𝑠0 ; 𝜅0 )X𝑘 (𝑠0 ; 𝜅0 ) T 𝑘=1 𝜆𝑘 + 𝛼  1 ∑︁   𝑛 ∑︁ 𝑟    = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 ) T  𝑛𝑟   𝑖=1 𝑗=1  𝜅0 𝑛 ∑︁ 𝑟 T ∑︁ −1 T  1 ∑︁   T    × 𝜆 𝑘 𝝓 𝑘 (𝑠0 )𝝓 𝑘 (𝑠0 ) 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 )  𝑛𝑟  𝑘=1  𝑖=1 𝑗=1  = I(𝑠0 )C−1 𝜅0 (𝑠0 , 𝑠0 )I(𝑠0 ) T = 𝑓 2 (𝑠0 ) [diag(1, 𝜈21 ) ⊗ 𝛀] C−1 𝜅 0 (𝑠0 , 𝑠0 ) [diag(1, 𝜈21 ) ⊗ 𝛀] + 𝑂 (𝛿(ℎ)) T = V(𝑠0 ) + 𝑂 (𝛿(ℎ)) (3.56)   T 2 T where we define V(𝑠0 ) = 𝑓 2 (𝑠0 )diag 𝛀C−1 (𝑠 𝜅 0 ,1,1, 0 0, 𝑠 )𝛀 , 𝜈 21 𝛀C −1 (𝑠 𝜅 0 ,2,2, 0 0, 𝑠 )𝛀 and 𝜅0 ∑︁ 𝜆𝑘 X𝑘 (𝑠0 ; 𝜅 0 )y𝑘 (𝑠0 ; 𝜅0 ) 𝑘=1 𝑘 𝜆2 +𝛼  1 ∑︁   𝑛 ∑︁ 𝑟    T = 𝐾 ℎ (𝑠 𝑗 − 𝑠0 )W𝑖 𝑗 (𝑠0 )Q𝑖 𝑗 (𝑠0 )  𝑛𝑟   𝑖=1 𝑗=1  𝜅 0 𝑛 ∑︁ 𝑟 ∑︁  1 ∑︁      𝜆−1 𝑘 𝝓 (𝑠 )𝝓 𝑘 0 𝑘 0 (𝑠 ) T 𝐾 (𝑠 ℎ 𝑗 − 𝑠 0 )Q (𝑠 𝑖𝑗 0 𝑖𝑗 )𝑌  𝑛𝑟  𝑘=1  𝑖=1 𝑗=1  = I(𝑠0 )C−1 2 ¥ 2  𝜅 0 (𝑠0 , 𝑠0 ) I(𝑠0 )𝜸 0 (𝑠0 ) + 0.5ℎ I21 (𝑠0 ) 𝜷(𝑠0 ) + T(𝑠0 ) + 𝑜(ℎ ) (3.57) 98 Therefore, 𝜷(𝑠0 ) − 𝜷0 (𝑠0 ) b = 0.5ℎ2 [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 I(𝑠0 )C−1 ¥ 𝜅 0 (𝑠0 , 𝑠0 )I21 (𝑠0 ) 𝜷(𝑠0 ) + [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 I(𝑠0 )C−1 𝜅 0 (𝑠0 , 𝑠0 )T(𝑠0 ) + 𝑂 (𝛿(ℎ)) T¥ = 0.5ℎ2 𝑓 2 (𝑠0 ) [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 [diag(1, 𝜈21 ) ⊗ 𝛀]C−1 𝜅 0 (𝑠0 , 𝑠0 ) [(𝜈21 , 0) ⊗ 𝛀] 𝜷(𝑠0 ) + 𝑓 2 (𝑠0 ) [diag(1, 0) ⊗ I 𝑝 ]V(𝑠0 ) −1 [diag(1, 𝜈21 ) ⊗ 𝛀]C−1 𝜅 0 (𝑠0 , 𝑠0 )T(𝑠0 ) + 𝑂 (𝛿(ℎ)) ¥ 0 ) + T(𝑠0 ) + 𝑂 (𝛿(ℎ)) = 0.5ℎ2 𝜈21 𝜷(𝑠 (3.58) In order to obtain the asymptotic variance, consider, using Lemmas 3.8.3, 3.8.4 and 3.8.5, we have √ b ¥ 0 )} →𝑑 𝑛{ 𝜷(𝑠0 )−𝜷0 (𝑠0 )−0.5ℎ2 𝜈21 𝜷(𝑠 − 𝑁 (0, A(𝑠0 , 𝑠0 )) where A(𝑠0 , 𝑠0 ) is the asymptotic variance √ ˜ 0 , 𝑠0 )V −1 (𝑠0 ) [(1, 0) ⊗ I 𝑝 ] for of 𝑛T(𝑠0 ), where we derive, A(𝑠0 , 𝑠0 ) = [(1, 0) ⊗ I 𝑝 ]V −1 (𝑠0 ) A(𝑠 ˜ 0 , 𝑠0 ) = [diag(1, 𝜈21 ) ⊗ 𝛀]C−1 A(𝑠 2 −1 𝜅 0 (𝑠0 , 𝑠0 )diag(Σ(𝑠0 , 𝑠0 ), 𝜈11 Σ(𝑠0 , 𝑠0 ))C𝜅 0 (𝑠0 , 𝑠0 ) [diag(1, 𝜈21 ) ⊗ 𝛀] T . 
By simple calculation, it can be shown that A(𝑠0 , 𝑠0 ) = (𝛀C−1 T −1 −1 −1 −1 𝜅 0 ,11 (𝑠0 , 𝑠0 )𝛀 ) 𝛀C𝜅 0 ,11 (𝑠0 , 𝑠0 )𝚺(𝑠0 , 𝑠0 )C𝜅 0 ,11 (𝑠0 , 𝑠0 )𝛀(𝛀C𝜅 0 ,11 (𝑠0 , 𝑠0 )𝛀 ) T −1 (3.59) 99 CHAPTER 4 TENSOR BASED SPATIO-TEMPORAL MODELS FOR ANALYSIS OF FUNCTIONAL NEUROIMAGING DATA 4.1 Introduction Recent years have seen an explosive growth in the number of neuroimaging studies performed. Popular imaging modalities include functional magnetic resonance imaging (fMRI), electroen- cephalography (EEG), diffusion tensor imaging (DTI), positron emission tomography (PET), and single-photon emission computed tomography (SPECT). Each of these techniques has its own limitations and strengths. Therefore, a current trend is toward interdisciplinary approaches that use multiple imaging techniques to overcome limitations of each method in isolation. As an example, Figure 4.1 illustrates the combination of fMRI and EEG data. At the same time, neuroimaging data is increasingly being combined with non-imaging modalities, such as behavioral and genetic data. Prepossessed EEG data Electrode Time Prepossessed fMRI data Voxel Time Figure 4.1 Multi-modal-data: An example of multi-modal data analysis which seeks to explore the relationship between EEG and fMRI data. 100 Multi-modal analysis is an increasingly important topic of research, and to fully realize its promise, novel statistical techniques are needed. Here, we present a new approach towards performing such analysis. It is common for the data generated from neuroimaging studies to consist of time-varying signal measured over a large three-dimensional (3D) domain (Lindquist, 2008; Ombao et al., 2016). Hence, the data are inherently spatio-temporal in nature. Due to the massive size of the data along with its complex anatomical structure, classical vector-based spatio-temporal statistical methods are often deemed unrealistic and inadequate. It is becoming increasingly clear that any new model and methodology should address three fundamental concerns. First, standard spatio-temporal co- variance modelling techniques are based on many parametric assumptions, which are often hard to validate in large high-dimensional data such as fMRI. Second, modelling of spatio-temporal interactions often produces large covariance matrices containing millions of elements that are hard to estimate properly. Third, storage of these large data-sets while performing analysis is nearly impossible. The current research is motivated by the experiment studyforrest (http://studyforrest.org/) which investigates high-level cognition in the human brain using complex natural stimulation, namely watching the Hollywood movie Forrest Gump (1994). The data consist of several hours of fMRI scans, structural brain images, eye-tracking data, and extensive annotations of the movie. Details of this experiment are presented in Section 4.6. In our motivating example, we focus on data consisting of voxel-wise fMRI images, measured over a large number of spatial locations (voxels) at 451 time- points. The goal of our analysis is to use the multivariate eye-tracking data, measured while the participants watch the movie, as covariates in a model that explains changes in the multivariate brain data. The vast size and scale of these data call for well-equipped statistical techniques to find the association between brain regions and other covariates over time-varying activities. It is useful to consider this as a regression problem with a multi-dimensional array of outcomes and predictors. 
These multi-dimensional arrays are popularly known as tensors. Figure 4.2 illustrates the reason for considering a time-varying multi-dimensional array for analysis. Although the signals in both modalities (in this case fMRI and eye-tracking) are measured discretely over time, we consider 101 Figure 4.2 ForrestGump-data: (Top panel) BOLD fMRI for an example subject during their first run (see Section 4.6 for details). 35 axial slices (thickness 3.0 mm) represents the third mode of the tensor with 80 × 80 voxels (3.0 × 3.0 mm) in-plate resolution measured at every repetition time (TR) of 2 seconds. (Bottom panel) fMRI data-set consists of a time series of 3D images (tensors) at each TR (source: Wager and Lindquist (2015)). them to be discrete measures of a smooth underlying function over time in a certain interval. This assumption is reasonable in the context of both brain activity and eye movement, as they can potentially change at any moment. There are two main advantages to taking a tensor-based approach (Guo et al., 2012) towards modeling this data-set. First, we can represent the unknown parameters to be estimated as a linear combination of rank-1 components, where the latter are expressed as the outer product of low-dimensional vectors. This allows the estimation of fewer parameters, which is consistent with variable selection or dimension reduction problems in statistics. Second, because of the need to estimate fewer parameters, the computational complexity is significantly reduced. In an exploratory analysis of multi-dimensional data, principal component analysis (PCA) is one of the most common tools for reducing dimensionality. Its use in tensor data has been studied in various articles; for example, Liu et al. (2017) provided a generalized classical PCA that can 102 deal with data matrices and tensors and can explore the spatial and temporal dependencies of data simultaneously. (Allen et al., 2014) proposed generalization of singular value decomposition (SVD) to quantify two-way regularization of PCA. This generalization involves a class of penalty functions that can be used to regularize the matrix factors. In recent years, the methodology for modelling tensor data has developed considerably with interesting applications. Hoff et al. (2011) proposed a class of multi-dimensional normal distributions by applying multi-linear transformation to an array of independently and identically distributed (i.i.d.) 𝑁 (0, 1) items and hence studied the maximum likelihood estimator of separable complex covariance structures. Hoff (2011) discussed a model-based version of low rank decomposition. In a previous work, Zhou et al. (2013) formulated a regression framework that considers clinical outcomes as response and images as covariates. Their method efficiently explored the spatial dependence of images in the form of a multi-dimensional array structure. By extending the generalized linear regression to a multi-way parameter corresponding to the tensor-structured predictor, they proposed a penalized likelihood approach with adaptive lasso penalties, which are imposed on the individual margins of PARAFAC decomposition. Guhaniyogi et al. (2017) later proposed a Bayesian approach using a similar setup as Zhou et al. (2013), but with a novel multi-way shrinkage prior, which can identify important cells in the tensor predictor appropriately. 
However, a shortcoming of both these approaches is that they are unable to handle settings in which the responses are multi-dimensional images and the covariates are also multi-dimensional variables (e.g., clinical data or data from another imaging modality), as in the case of multi-modal analysis and our motivating example. This necessitates the use of a tensor-on-tensor type model. The recent work of Zhang et al. (2014) presents a tensor generalized estimating equation (GEE) for longitudinal data analysis using a low-rank CANDECOMP/PARAFAC (CP) decomposition of the coefficient array in the GEE; this decomposition accommodates the longitudinal correlation of the data. Hoff (2015) proposed a multi-linear regression model for longitudinal data using the least squares method. In practice, there may also be effects arising from the relationships between different pairs of modes. The general multi-linear regression model can address this via separable, Kronecker-structured regression parameters along with a separable covariance structure. A tensor-on-tensor regression approach was proposed by Lock (2018), followed by a general multiple tensor-on-tensor regression in Gahrooei et al. (2018). Furthermore, Guhaniyogi and Spencer (2018) discussed a tensor response regression in the Bayesian framework, where the coefficients corresponding to each vector covariate are assumed to be tensors. Recently, Liu et al. (2020) represented a generalized multi-linear tensor-on-tensor ridge regression model via a tensor-train representation. Melzer et al. (2019) proposed a joint tensor regression weighted at expectile levels; their estimation technique is based on low-rank factorization combined with regularization, using the smooth fast iterative shrinkage-thresholding algorithm (Beck and Teboulle, 2009). An adaptive tensor-based SVD estimation has also been discussed in light of the Riemannian trust-region method of Conn et al. (2000). A varying-coefficient model in the functional data analysis (FDA) literature allows the regression coefficient to vary over some predictors of interest (say, 𝑇). In some cases, these predictors are confounded with the covariates X or with special variables such as time. This kind of model was introduced and discussed by Hastie and Tibshirani (1993) and has since been widely studied. The non-constant relationship between a functional response and predictors has been described in Fan et al. (1999); Ramsay and Silverman (2005); Ferraty and Vieu (2006); Horváth and Kokoszka (2012); Bongiorno et al. (2014); Hsing and Eubank (2015), which are among the standard references in FDA. The current chapter provides the following contributions to this literature. First, we propose a method for modelling image data that can efficiently process large amounts of information and identify associations while preserving the structure of the 3D images and the multi-layer covariates. Second, we consider the time-varying function-on-function concurrent linear model (Hastie and Tibshirani, 1993) and generalize it to the tensor-on-tensor regression case, thus moving a step further than Lock (2018), which did not consider time-varying coefficients. Consequently, our generalization provides an extension of classical functional concurrent regression to tensor predictors and tensor covariates. To the best of our knowledge, such an approach has not yet been proposed in the statistics literature.
Here, we express the regression coefficients using the B-spline technique, and the coefficients of the basis functions are estimated using CP-decomposition, thereby reducing computational complexity. Furthermore, our model requires minimum assumptions compared to those in the existing literature. Our approach does not require the estimation of covariance separately. Thus, our proposal offers an important addition to the literature on functional and imaging data analysis. Our methods are flexible and general; therefore, they are applicable using data from different domains such as multi-phenotype analysis and imaging genetics (Casey et al., 2010). This makes it an ideal approach for modeling multi-modal data of the type described in our motivating example. The rest of this chapter is organized as follows. The proposed tensor-on-tensor functional regression models are described in Section 4.2. Section 4.3 provides the theoretical properties of the proposed estimator. Section 4.4 presents the algorithm and implementation of the method. The simulation results are presented in Section 4.5 and real data examples are shown in Section 4.6. Section 4.7 concludes with a discussion of future extensions. The technical proofs are presented in Section 4.8. 4.2 Tensor-on-tensor functional regression Recall the notations and definition in matrix algebra from Section 1.3 in Chapter 1. A 𝐷-dimensional tensor is denoted by Sans-serif upper-face letters A ∈ R 𝐼1 ×···×𝐼𝐷 where the size 𝐼 𝑑 along each mode Î𝐷 or dimension 𝑑 for 𝑑 = 1, · · · , 𝐷. Therefore, the number of elements in tensor A is 𝐼 = 𝑑=1 𝐼𝑑 and the order of the tensor is the number of dimensions. Here and henceforth, matrices are denoted by bold-face capital letters (examples: A, B · · · ), vectors are written as bold-face lower-case letters (examples: a, b, · · · ) and scalars are presented as Latin alphabets (𝑎, 𝑏, · · · ). In this section, we discuss tensor-on-tensor functional regression with time-varying coefficients. Let Y (𝑡) ∈ R𝑄 1 ×···×𝑄 𝑀 with (𝑞 1 , · · · , 𝑞 𝑀 )-th element 𝑦 𝑞1 ,··· ,𝑞 𝑀 for all possible indices, be a set of time-varying response variables observed at time 𝑡 and {Y (𝑡) : 𝑡 ∈ T} be the underlying continuous 105 stochastic process defined on a compact interval T. Without loss of generality, we assume T = [0, 𝑇] , 𝑇 > 0. Suppose there are 𝑁 individuals/trajectories on T. Observations are taken at 𝐽 distinct points for each individual. Collection of points for the 𝑖-th individual is denoted as T𝑖 † = {0 ≤ 𝑡𝑖1 < · · · < 𝑡𝑖𝐽 ≤ 𝑇 }. Therefore, for 𝑖-th individual at a set of discrete time-points T𝑖 † , we observe the responses Y𝑖 (𝑡𝑖 ) = ( Y𝑖 (𝑡𝑖1 ), · · · , Y𝑖 (𝑡𝑖𝐽 )) ∈ R𝐽×𝑄 1 ×···×𝑄 𝑀 which are distinct realizations of the corresponding stochastic process. The covariate X (𝑡) ∈ R𝑃1 ×···×𝑃 𝐿 with ( 𝑝 1 , · · · , 𝑝 𝐿 )-th element 𝑥 𝑝1 ,··· ,𝑝 𝐿 (𝑡) for all indices, observed at T𝑖 † is denoted as X𝑖 (𝑡𝑖 ) = ( X𝑖 (𝑡𝑖1 ), · · · , X𝑖 (𝑡𝑖𝐽 )) ∈ R𝐽×𝑃1 ×···×𝑃 𝐿 . The time-varying tensor coefficient 𝜷(𝑡) ∈ R𝑃1 ×···×𝑃 𝐿 ×𝑄 1 ×···×𝑄 𝑀 is assumed to vary smoothly over time. Therefore, we can apply local polynomial smoothing (Hardle, 1990; Wahba, 1990; Wand and Jones, 1995; Fan and Gijbels, 1996; Eubank, 1999), smoothing spline (Wahba, 1990; Green and Silverman, 1993; Eubank, 1999), regression spline (Eubank, 1999), P-spline Ruppert et al. (2003). In this chapter, we use B-spline bases which are very popular in mathematics, computer science and statistics (De Boor et al., 1978). 
Suppose, {𝜏ℎ } 𝐾ℎ=1 𝑁 be 𝐾 𝑁 interior knots within the compact interval [0, 𝑇] and the partition of the interval [0, 𝑇] at these knots be denoted  as K = 0 = 𝜏0 < 𝜏1 < · · · < 𝜏𝐾 𝑁 < 𝜏𝐾 𝑁 +1 = 𝑇 . The polynomial spline of order 𝑣 + 1 is a function   of polynomials with degree 𝑣 on the intervals [𝜏ℎ−1 , 𝜏ℎ ) for ℎ = 1, · · · , 𝐾 𝑁 and 𝜏𝐾 𝑁 , 𝜏𝐾 𝑁 +1 and 𝑣 − 1 continuous derivatives globally. Let S𝐾𝑣 𝑁 (𝑡) denotes a set of such spline functions, i.e., 𝑠(𝑡) belongs to S𝐾𝑣 𝑁 (𝑡) if and only if 𝑠(𝑡) belongs to 𝐶 𝑣−1 [0, 𝑇] and its restriction to each intervals [𝜏ℎ−1 , 𝜏ℎ ) is a polynomial of degree atleast 𝑣. Define for ℎ = 1, · · · , 𝐻 𝑁 Bℎ (𝑡) = (𝜏ℎ − 𝜏ℎ−𝑣−1 ) [𝜏ℎ−𝑣−1 , · · · , 𝜏ℎ ] (𝑧 − 𝑡)+𝑣 (4.1) where 𝐻 := 𝐻 𝑁 = 𝐾 𝑁 + 𝑣 + 1, [𝜏ℎ−𝑣−1 , · · · , 𝜏ℎ ] 𝑓 denotes the (𝑣 + 1)-st order divided difference of the function 𝑓 and 𝜏ℎ = 𝜏0 for ℎ = −𝑣, · · · , −1 and 𝜏ℎ = 𝜏𝐾 𝑁 +1 for ℎ = 𝐾 𝑁 + 2, · · · , 𝐻 𝑁 . Therefore, {Bℎ } 𝐻 ℎ=1 forms the basis for S𝐾 𝑁 (𝑡) (Schumaker, 2007). 𝑁 𝑣 Now, for 1 ≤ 𝑝 𝑙 ≤ 𝑃𝑙 , 1 ≤ 𝑞 𝑚 ≤ 𝑄 𝑚 , 1 ≤ 𝑙 ≤ 𝐿, 1 ≤ 𝑚 ≤ 𝑀, each function 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) can be approximated by 𝐻 ∑︁ 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 Bℎ (𝑡) = bT𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 B(𝑡) (4.2) ℎ=1 106 where b 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 = (𝑏 1,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 , · · · , 𝑏 𝐻,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 ) T is the collection of basis coefficients and B(𝑡) = (B1 (𝑡), · · · , B𝐻 (𝑡)) T is a vector of known B-spline bases. In practice, we can use different basis functions in mode to approximate 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡). However, for convenience, we use the same set of bases in this chapter. Instead of the B-spline, one can use other basis functions to approximate the coefficient functions. We use the B-spline base for its simplicity and numerical tractability. Although this method does not produce a desirable approximation for discontinuous functions, in this chapter we restrict ourselves to smooth continuous coefficients. We propose a general time-varying tensor-on-tensor regression model, Y𝑖 (𝑡) = ⟨X𝑖 (𝑡), 𝜷(𝑡)⟩ 𝐿 + E𝑖 (𝑡) (4.3) which can be reduced into the following mode-wise time-varying coefficient model. 𝑃1 ∑︁ ∑︁𝑃𝐿 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = ··· 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡) 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) + 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) (4.4) 𝑝 1 =1 𝑝 𝐿 =1 where 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) is a random error with mean zero. Errors can be correlated over time and modes, but are independent over the trajectories. After plugging-in the approximate expression of 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) at each mode, the model now boils down to 𝑃1 ∑︁ 𝑃 𝐿 ∑︁ ∑︁ 𝐻 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = ··· 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)Bℎ (𝑡) + 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) (4.5) 𝑝 1 =1 𝑝 𝐿 =1 ℎ=1  The multi-dimensional basis coefficients B0 = 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 : 1 ≤ ℎ ≤ 𝐻, 1 ≤ 𝑝 𝑙 ≤ 𝑃𝑙 , 1 ≤ 𝑞 𝑚 ≤ 𝑄 𝑚 , 1 ≤ 𝑙 ≤ 𝐿, 1 ≤ 𝑚 ≤ 𝑀 } can be estimated by minimizing the mode-wise penalized integrated sum of square errors with respect to B0 . 
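As an illustration of the basis expansion, the following R sketch evaluates a cubic B-spline basis at one subject's observation times and forms the products 𝑥_{𝑖,𝑝}(𝑡)B_ℎ(𝑡) appearing in Equation (4.5). The dimensions (𝐻 = 10 basis functions, 𝑃 = 4 covariates, 𝐽 = 50 time-points, 𝐿 = 1) are purely illustrative, and these products are what get collected into the augmented covariate tensor introduced below.

```r
# Evaluate B-spline basis values and form z_{i,h,p}(t) = x_{i,p}(t) * B_h(t)
# for one subject (sketch with illustrative dimensions).
library(splines)
t_obs <- seq(0, 1, length.out = 50)                        # observation times for one subject
B     <- bs(t_obs, df = 10, degree = 3, intercept = TRUE)  # J x H matrix of basis values B_h(t_j)
x     <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)        # J x P covariate trajectories x_{i,p}(t_j)
Z_i   <- array(NA, dim = c(50, ncol(B), 4))                # this subject's J x H x P slice
for (h in seq_len(ncol(B))) Z_i[, h, ] <- x * B[, h]       # multiply each basis function into x
```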
Let us denote the smoothness penalty by Ω𝑠𝑚 where Ω𝑠𝑚 ( B0 ) 𝑃1 𝑃 𝐿 ∑︁ 𝑄1 𝑄𝑀 ∫ 2 ∑︁ ∑︁ ∑︁ 𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝛽′′𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡)  = ··· ··· 𝑑𝑡 𝑝 1 =1 𝑝 𝐿 =1 𝑞 1 =1 𝑞 𝑀 =1 107 ∑︁𝑃1 𝑃 𝐿 ∑︁ ∑︁ 𝑄1 𝑄𝑀 ∑︁ ∫ = ··· ··· 𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 bT𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡b 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝑝 1 =1 𝑝 𝐿 =1 𝑞 1 =1 𝑞 𝑀 =1 (4.6) Hence, the loss function turns out to be L( B0 ) ∫ ∑︁ 𝑁 ∑︁ 𝑄1 𝑄𝑀  𝑃1 𝑃 𝐿 ∑︁ 𝐻 2 1 ∑︁ ∑︁ ∑︁ = ··· 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) − ··· 𝑏 ℎ,𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)𝐵 ℎ (𝑡) 𝑑𝑡 𝑁 T 𝑖=1 𝑞 =1 𝑞 =1 𝑝 =1 𝑝 =1 ℎ=1 1 𝑀 1 𝐿 + Ω𝑠𝑚 ( B0 ) (4.7) In Equation (4.6), {𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 } 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 are the tuning parameters for smoothness. Penalty due to smoothness is widespread in the literature of functional data analysis (Ramsay and Silverman (2005) among many others). In practice, it is unrealistic to find these large num- bers of pre-assigned tuning parameters. By considering 𝜃 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 = 𝜃, for all possible 𝑝 1 , · · · , 𝑝 𝐿 , 𝑞 1 , · · · , 𝑞 𝑀 , the simplest version of smoothness penalty would be, ∫ T Ω𝑠𝑚 ( B0 ) = 𝜃 vec( B0 ) (I𝑄 ⊗ I𝑃 ⊗ B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡) vec( B0 ) (4.8) Note that, vec( B0 ) = (b11 , · · · , b𝑃1 , b12 , · · · , b𝑃2 , · · · , b1𝑄 , · · · , b𝑃𝑄 ) T . Therefore, the penalized likelihood estimating equation for functional tensor-on-tensor regression problem is ∫ 𝑁 1 ∑︁ L( B0 ) = ∥ Y𝑖 (𝑡) − ⟨Z𝑖 (𝑡), B0 ⟩ 𝐿+1 ∥ 2F 𝑑𝑡 + Ω𝑠𝑚 ( B0 ) (4.9) T 𝑁 𝑖=1 where ⟨·, ·⟩ 𝐿+1 is the contracted tensor product defined in Section 1.3.1 and ∥ · ∥ F is the Frobenius norm. The first term of the Equation (4.9) is integrated sum of squares and the second term is the smoothness penalty. Let the response tensor for time 𝑡, Y (𝑡) ∈ R𝑁×𝑄 1 ×···×𝑄 𝑀 with its (𝑖, 𝑞 1 , · · · , 𝑞 𝑀 )-th element be 𝑦𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡) for all 𝑖 = 1, · · · , 𝑁; 𝑞 𝑚 = 1, · · · , 𝑄 𝑚 ; 𝑚 = 1, · · · , 𝑀. Similarly, we define an updated covariate tensor contaminated with B-spline bases Z (𝑡) ∈ R𝑁×𝐻×𝑃1 ×···×𝑃 𝐿 where (𝑖, ℎ, 𝑝 1 , · · · , 𝑝 𝐿 )- th element of the tensor is defined as 𝑧𝑖,ℎ,𝑝1 ,··· ,𝑝 𝐿 (𝑡) = 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)Bℎ (𝑡). Therefore, the corre- sponding penalized loss function in Equation (4.9) is equivalent to ∫ L( B0 ) = ∥ Y (𝑡) − ⟨Z (𝑡), B0 ⟩ 𝐿+1 ∥ 2F 𝑑𝑡 + Ω𝑠𝑚 ( B0 ) (4.10) T 108 Remark 4.2.1. For 𝑄 = 0, the proposed model reduces to the classical concurrent linear model Ramsay and Silverman (2005). For 𝑄 = 1 and 𝑃 = 1, the time-varying network model (Xue et al., 2018) is a special case of our proposed model for a specific choice of covariates. For 𝑄 = 2, 𝑦𝑖,𝑞1 ,𝑞2 (𝑡) is the observation of the quantity of interest at time 𝑡 for sub-unit 𝑞 2 from unit 𝑞 1 of a treatment group 𝑖 in a hierarchical model (Zhou et al., 2010). Î𝐿 Î𝑀 Let 𝑃 = 𝑙=1 𝑃𝑙 be the total number of predictors for each observation and 𝑄 = 𝑚=1 𝑄 𝑚 be the total number of outcomes for each predictor over time. To minimize the penalized integrated sum of squared residuals described in Equation (4.10), the solution for B0 could be inconsistent. Since Î𝐿 Î𝑀 the unknown coefficient tensor B0 has 𝐻 𝑙=1 𝑃𝑙 𝑚=1 𝑄 𝑚 many parameters, we need to adopt the dimension reduction technique. Inspired by the novel idea discussed in Lock (2018), we consider rank 𝑅 decomposition of B0 as B0 = [[U0 , U1 , · · · , U 𝐿 , V1 , · · · , V 𝑀 ]] where U0 , U𝑙 and V𝑚 are matrices with dimensions 𝐻×𝑅, 𝑃𝑙 ×𝑅 and 𝑄 𝑚 ×𝑅 respectively for all 1 ≤ 𝑙 ≤ 𝐿, 1 ≤ 𝑚 ≤ 𝑀. After Í𝐿 Í𝑀 dimension reduction, the number of unknown parameters reduces to 𝑅(𝐻 + 𝑙=1 𝑃𝑙 + 𝑚=1 𝑄 𝑚 ). 
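The dimension reduction achieved by the rank-𝑅 CP representation can be illustrated with a small sketch (illustrative dimensions with 𝐿 = 𝑀 = 1): the full coefficient tensor is rebuilt from its factor matrices, while only 𝑅(𝐻 + 𝑃1 + 𝑄1) parameters are free.

```r
# Rebuild a rank-R coefficient tensor B0 from CP factor matrices (sketch).
H <- 10; P1 <- 4; Q1 <- 3; R <- 2
U0 <- matrix(rnorm(H * R), H, R)        # H  x R factor for the basis mode
U1 <- matrix(rnorm(P1 * R), P1, R)      # P1 x R factor for the covariate mode
V1 <- matrix(rnorm(Q1 * R), Q1, R)      # Q1 x R factor for the response mode
B0 <- array(0, dim = c(H, P1, Q1))
for (r in seq_len(R)) B0 <- B0 + outer(outer(U0[, r], U1[, r]), V1[, r])
# Free parameters: R * (H + P1 + Q1) = 34, versus H * P1 * Q1 = 120 unconstrained.
```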
Therefore, the estimate of the coefficient tensor is as follows:
$$\widetilde{\mathcal{B}}_0 = \underset{\mathrm{rank}(\mathcal{B}_0) \le R}{\arg\min}\ \mathcal{L}(\mathcal{B}_0) \qquad (4.11)$$
However, this estimated coefficient tensor suffers from over-fitting and instability due to multi-collinearity of $\mathcal{Z}$ and/or the large number of observed outcomes. Thus, we obtain an alternative estimate of the coefficient tensor $\mathcal{B}_0$,
$$\widehat{\mathcal{B}}_0 = \underset{\mathrm{rank}(\mathcal{B}_0) \le R}{\arg\min}\ \mathcal{Q}(\mathcal{B}_0) \qquad (4.12)$$
which is based on the modified loss function $\mathcal{Q}$ defined by
$$\mathcal{Q}(\mathcal{B}_0) = \frac{1}{N} \int_{\mathcal{T}} \| \mathcal{Y}(t) - \langle \mathcal{Z}(t), \mathcal{B}_0 \rangle_{L+1} \|_{\mathrm{F}}^2\, dt + \Omega(\mathcal{B}_0) \qquad (4.13)$$
where
$$\Omega(\mathcal{B}_0) = \theta\, \mathrm{vec}(\mathcal{B}_0)^{\mathrm{T}} \Big( \mathbf{I}_Q \otimes \mathbf{I}_P \otimes \int \mathbf{B}''(t) \mathbf{B}''(t)^{\mathrm{T}}\, dt \Big) \mathrm{vec}(\mathcal{B}_0) + \phi\, \mathrm{vec}(\mathcal{B}_0)^{\mathrm{T}} \mathrm{vec}(\mathcal{B}_0) \qquad (4.14)$$
Equation (4.14) penalizes the smoothness and the sparsity of the coefficient functions simultaneously.
Remark 4.2.2. Tuning parameter selection: The number of knots and the tuning parameters $\theta$ and $\phi$ are unknown, and we need to select them using Mallows's $C_p$ (Mallows, 1973), generalized cross-validation (Craven and Wahba, 1978), or the leave-one-out cross-validation method (Stone, 1974). Also, the selection of the rank is a separate problem altogether; a rank selection method can be proposed based on Chen et al. (2013).
4.3 Asymptotic properties
In this section, we study the identifiability of the model and the consistency of the parameter estimates under the proposed model as the number of subjects $N$ goes to infinity, assuming that the rank of the basis coefficient tensor is known and fixed.
4.3.1 Identifiability
Identifiability plays an important role in tensor regression (Lock, 2018; Zhou et al., 2013; Guhaniyogi et al., 2017). The model discussed in Section 4.2 is identifiable for $\boldsymbol{\beta}(t)$ if $\boldsymbol{\beta}(t) \neq \boldsymbol{\beta}^*(t)$ implies $\langle \mathcal{X}(t), \boldsymbol{\beta}(t) \rangle_L \neq \langle \mathcal{X}(t), \boldsymbol{\beta}^*(t) \rangle_L$ for some $t \in \mathcal{T}$ and some $\mathcal{X}(t) \in \mathbb{R}^{P_1 \times \cdots \times P_L}$. Using the basis expansion in Equation (4.2), $\mathcal{B}_0$ is identifiable if and only if $\boldsymbol{\beta}(t)$ is identifiable for all $t \in \mathcal{T}$. Therefore, the reduced model is identifiable if $\mathcal{B}_0 \neq \mathcal{B}_0^*$ implies $\langle \mathcal{Z}(t), \mathcal{B}_0 \rangle_{L+1} \neq \langle \mathcal{Z}(t), \mathcal{B}_0^* \rangle_{L+1}$ for some $t \in \mathcal{T}$ and some $\mathcal{Z}(t) \in \mathbb{R}^{H \times P_1 \times \cdots \times P_L}$. To see this, assume that for $t = t_0$ the entry $z_{h, p_{k_1}, \cdots, p_{k_L}}(t_0)$ equals $1$ at $k_1 = 1, \cdots, k_L = L$ and $0$ otherwise; then the contracted product reduces to $b_{h, p_1, \cdots, p_L, q_1, \cdots, q_M}$. Furthermore, $\mathbf{U}_0, \mathbf{U}_1, \cdots, \mathbf{U}_L, \mathbf{V}_1, \cdots, \mathbf{V}_M$ in the CP-decomposition are not identifiable on their own. Therefore, identifiability conditions can be imposed in the following way (Sidiropoulos and Bro, 2000).
1. Restriction for scale non-uniqueness: $\mathcal{B}_0$ remains the same after replacing $\mathbf{U}_0$, $\mathbf{U}_l$ and $\mathbf{V}_m$ by $c_s \mathbf{U}_0$, $c_{u_l} \mathbf{U}_l$ and $c_{v_m} \mathbf{V}_m$ respectively, where $\{c_s, c_{u_l}, c_{v_m}\}$ is a set of constants with $c_s \prod_{l=1}^{L} c_{u_l} \prod_{m=1}^{M} c_{v_m} = 1$. This problem can be resolved by imposing the condition that the norm of each of $\mathbf{u}_{rl}$ and $\mathbf{v}_{rm}$ is set to be $1$, for $1 \le r \le R$, $1 \le l \le L$, $1 \le m \le M$, so that the scales are absorbed into $\mathbf{U}_0$.
2. Restriction for permutation: For any permutation $\pi(\cdot)$ of $\{1, \cdots, R\}$, $\sum_{r=1}^{R} \mathbf{u}_{r0} \circ \mathbf{u}_{r1} \circ \cdots \circ \mathbf{u}_{rL} \circ \mathbf{v}_{r1} \circ \cdots \circ \mathbf{v}_{rM}$ is the same as $\sum_{r=1}^{R} \mathbf{u}_{\pi(r)0} \circ \mathbf{u}_{\pi(r)1} \circ \cdots \circ \mathbf{u}_{\pi(r)L} \circ \mathbf{v}_{\pi(r)1} \circ \cdots \circ \mathbf{v}_{\pi(r)M}$. Therefore, we impose the restriction $\|\mathbf{u}_{01}\| \ge \cdots \ge \|\mathbf{u}_{0R}\|$.
These restrictions are sufficient to ensure identifiability for $L + M \ge 2$. Therefore, we do not need the additional orthogonality condition that appears in Lock (2018); Zhou et al. (2013); Guhaniyogi et al. (2017).
4.3.2 Convergence rate
In this subsection, we study the asymptotic properties of the estimate of the time-varying tensor regression parameter $\boldsymbol{\beta}(t)$ based on the polynomial spline approximation and the CP decomposition.
To proceed further, we introduce some regularity conditions which are required to establish the asymptotic properties. (C1) Without loss of generality, assume T = [0, 1]. The observation times 𝑡𝑖 𝑗 for 𝑖 = 1, · · · , 𝑁; 𝑗 = 1, · · · , 𝐽 are independent and follow a distribution 𝑓𝑇 (𝑡) over the support T. the density function 𝑓𝑇 (𝑡) is assumed to be absolutely continuous and bounded by a nonzero and finite constant. (C2) {𝜏ℎ } 𝐾ℎ=1 𝑛 be 𝐾𝑛 interior knots within the compact interval K = [0, 1] and the partition of the  interval [0, 𝑇] with 𝐾 𝑁 knots can be denoted as I = 0 = 𝜏0 < 𝜏1 < · · · < 𝜏𝐾 𝑁 < 𝜏𝐾 𝑁 +1 = 1 . (C3) The polynomial spline of order 𝑣 + 1 are the function with degree 𝑣 of polynomials on   the interval [𝜏ℎ−1 , 𝜏ℎ ) for ℎ = 1, · · · , 𝐾 and 𝜏𝐾 𝑁 , 𝜏𝐾 𝑁 +1 and 𝑣 − 1 continuous derivatives globally. (C4) For 𝑡 ∈ T, 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡)’s are i.i.d. copies with mean zero and finite second order moment over 𝑖. Moreover, for each 𝑖 and the coordinates 𝑞 1 , · · · , 𝑞 𝑀 , 𝜖𝑖,𝑞1 ,··· ,𝑞 𝑀 (𝑡𝑖 𝑗 ) are locally stationary time series of the form given in appendix. Assume physical dependence measure Δ(𝑘, 𝑎) is upper bounded by 𝑘 −𝜅0 for some positive 𝜅0 and for all 𝑗 ≥ 1. 111 (C5) The covariate 𝑥𝑖,𝑝1 ,··· ,𝑝 𝐿 (𝑡)’s are i.i.d. for index 𝑖 and it is bounded almost everywhere. Remark 4.3.1. Conditions (C1), (C2), (C3) are standard conditions in the context of polynomial spline regression and require the consistency of spline estimation of varying-coefficient models. Condition (C3) provides the degree of smoothness on the time-varying coefficients. We assume Condition (C4) to represent a wide class of stationary, locally stationary, and non-linear processes. Similar conditions can be found in Ding and Zhou (2020); Ding et al. (2021). This is a natural assumption of temporal short-range dependent process where temporal correlation decays in poly- nomial order. This phenomenon can also be observed in well-known Ornstein–Uhlenbeck process and the linear process with the standard basis expansion 𝜖𝑖,• (𝑡) = ∞ Í 𝑘=1 𝑎 𝑖𝑘,• 𝜙 𝑘 (𝑡) where 𝑎 𝑖𝑘,• is an uncorrelated mean zero, finite variance random variables over (𝑖, 𝑘) and sup𝑡 𝜙 𝑘 (𝑡) ≤ 𝐶 𝑘 −𝑎 for some positive constants 𝐶 and 𝑎 . Since the number of modes is fixed, we reduce the objective function by following the notation Y ∈ R𝑁 𝐽×𝑄 and Z ∈ R𝑁 𝐽×𝐻 𝑁 ×𝑃 1 Q(B0 ) = ∥Y − ⟨Z, B0 ⟩ 2 ∥ 2F + ∥ B0 ∥ 2F,W 𝜔 (4.15) 𝑁𝐽 √︁ where ∥ B0 ∥ F,W 𝜔 be the weighted Frobenius norm, defined as ∥ B0 ∥ F,W 𝜔 = vec( B0 ) T W𝜔 vec(B0 ) where 𝜔 is a set of all tuning parameters. Moreover, assume that rank( B0 ) = 𝑅0 which is assumed to be known and fixed. Further assume that   (C6) 𝜆min ZT(1) Z (1) = 𝜎min ( Z (1) ) 2 ≥ 𝜆 min (BT B)𝜆 min (XT X) > 𝜆 where 𝜆𝑖 (A) and 𝜎𝑖 (A) denotes 𝑖-th eigen-value and singular value respective for a matrix A. Define, ∫ the constants C(𝛿) = 1 + 2/𝛿 such that C(𝛿) ≤ 𝜆2 /2𝜇 where 𝜇 = (𝑁 𝐽)(𝜃𝜆max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡) + √ 𝜙) 2𝑅0 . Further, define 𝜉 = sup1≤ℎ≤𝐻 sup𝑡∈[0,1] |Bℎ (𝑡)| which is typically bounded. Additionally, define 𝜎1 ( C) = max{𝜎1 ( C (1) ), 𝜎1 ( C (2) ), 𝜎1 ( C (3) )}. Therefore, we propose the following theorem for the estimation and prediction performance of the coefficient tensor. Theorem 4.3.1. Under assumptions (C4) and (C6), when both the number of time-points and trajectories are large enough, there exists a constant 𝐶𝑎 , with probability atleast 1 − 𝐶𝑎 𝑁 −𝑎𝜏 , such 112 that we have the following. 
D E   −1  ∥ Z, (b B0 − B0 ) ∥ 2F ≤ 𝜆−1 C(𝛿) −1 − 2𝜇𝜆−2 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.16) for any 𝐻 𝑁 × 𝑃 × 𝑄 matrix C with rank( C) ≤ 𝑅0 , By choosing C = B0 , a simplified prediction error could be obtained. Under the same set of assumptions, the estimation error of the matrix B0 is   −1  ∥Bb0 − B0 ∥ 2 ≤ 𝜆−1 C(𝛿) −1 − 2𝜇𝜆−2 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.17) F Additionally, we introduce the following theorem which states the consistency result for the coefficient tensor function. Theorem 4.3.2. Under assumptions (C1)-(C6), with probability, we have the following with prob- ability 1 − 𝐶𝑎 𝑁 −𝑎𝜏 , √ o ∫    −1 n | 𝛽b• (𝑡) − 𝛽• (𝑡)| 𝑓𝑇 (𝑡)𝑑𝑡 = 𝑂 𝜆−1 C(𝛿) −1 − 2𝜇𝜆−2 2 4𝜇𝜎12 ( C (1) ) + 2𝑅0 (1 + 𝛿)𝑄𝜉 𝑁 𝜏+1 𝐽 T o +𝐾 𝑁−2(𝑣+1) 4.4 Algorithm and implementation In this section, we propose a general algorithm to estimate the basis coefficient tensor using the objective function described in Section 4.2. For given time-points 𝑡 1 , · · · , 𝑡 𝐽 , define Z and Y as the combined tensor after staking over all time-points. Therefore, Z and Y are the tensors of order 𝑁 𝐽 × 𝐻 × 𝑃1 × · · · 𝑃 𝐿 × 𝑄 1 × · · · 𝑄 𝑀 and 𝑁 𝐽 × 𝑄 1 × · · · 𝑄 𝑀 respectively. Moreover, define B̆0 as the matrix of coefficient of order 𝐻𝑃 × 𝑄, where columns and rows of B0 are obtained by vectorizing first (𝐿 + 1) and last 𝑀 modes of B0 respectively. For the alternative expression of the penalty term in Equation (4.13), observe the following. T 1. ∥ B0 ∥ 2 = ∥ B˘0 ∥ 2 = vec( B0 ) T vec( B0 ) = trace( B˘0 B˘0 ), where trace(A) denotes the trace of a square matrix A.   ∫  1/2  ′′ ′′ 2. I𝑄 ⊗ I𝑃 ⊗ 𝜃 B (𝑡)B (𝑡) 𝑑𝑡 + 𝜙I𝐻 T vec( B0 ) 113   ∫  1/2  ′′ = vec (I𝑃 ⊗ 𝜃 B (𝑡)B (𝑡) 𝑑𝑡 + 𝜙I𝐻 ′′ T ˘ ) B0 I𝑄 Therefore, equivalently, the optimization problem reduces to an unregulated least squares problem with modified predictor and outcome variables for estimating B0 . ∫ 1 e B0 ⟩ 𝐿+1 ∥ 2 𝑑𝑡 B0 = arg min b ∥Y e − ⟨Z, (4.18) rank( B0 )≤𝑅 𝑁 𝐽 T where Z e ∈ R (𝑁 𝐽+𝐻𝑃)×𝐻×𝑃1 ×···×𝑃 𝐿 ×𝑄 1 ×···𝑄 𝑀 be the contamination of Z (𝑡) along with smoothing term and sparsity. Y e ∈ R (𝑁 𝐽+𝐻𝑃)×𝑄 1 ×···×𝑄 𝑀 which is a contamination of Y (𝑡) and zero tensor function. The unfolding of Z e and Y e along the first dimension produce the following matrices:      Z(1)   Y(1)  Z = and Y = (4.19) e(1)   e(1)    ∫  1/2    (I ⊗ 𝜃 B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡 + 𝜙I )  0   𝑃 𝐻  𝐻𝑃×𝑄      Thus, applying the following Algorithm 4.1, we obtain an estimate of coefficient tensor for the known rank of the coefficient array and hence the coefficient function 𝜷(𝑡). The above algorithm is similar to function “rrr” available in the package MultiwayRegression in R. The selection of the adjustment parameters 𝜃, 𝜙 and the rank 𝑅 of the coefficient tensor is crucial. It can be done using integrated predictive accuracy in a training and test set. K-fold cross- validation can be used to obtain these tuning parameters; however, it is computationally expensive. Fortunately, our estimate is robust for the selection of 𝜃 and 𝜙. The rank of CP decomposition of the coefficient tensor is the number of rank-1 terms that are necessary to represent the coefficient tensor. For large 𝑅, every B0 can be represented by the CP decomposition. Therefore, it determines the complexity of the model. We leave the optimal determination of the rank for future research. 4.5 Simulation studies In this section, we conduct numerical studies to compare the finite sample performance to estimate four-way time-varying tensor coefficient 𝜷(𝑡). 
Data are generated from the following model for each mode 𝑝 1 , 𝑝 2 , 𝑞 1 , 𝑞 2 𝑃1 ∑︁ ∑︁ 𝑃2 𝑦𝑖,𝑞1 ,𝑞2 (𝑡) = 𝑥𝑖,𝑝1 ,𝑝2 (𝑡) 𝛽 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) + 𝜖𝑖,𝑞1 ,𝑞2 (𝑡), 𝑖 = 1, · · · , 𝑁; 𝑡 ∈ [0, 1] (4.20) 𝑝 1 =1 𝑝 2 =1 114 Algorithm 4.1 Estimation of 𝜷(𝑡) : 𝑡 ∈ [0, 𝑇] for tensor based function-on-function regression method. Data: X (𝑡), Y (𝑡) for 𝑡 ∈ [0, 𝑇] , 𝑇 > 0 observed on a grid in [0, 𝑇] Result: Estimate 𝜷(𝑠) using proposed method Tuning parameters: {𝜃, 𝜙}, rank 𝑅 ∈ N, number of knots 𝐾 𝑁 , a vector of known B-spline bases B(𝑡) = (B1 (𝑡), · · · , B𝐻 (𝑡)) T Stopping parameter: 𝜖0 > 0 Create: Z and Y as mentioned in Equation (4.19) Initialize: U0 , U1 , · · · , U 𝐿 , V1 , · · · , V 𝑀 be randomly chosen matrices of specific order 1: while Error > 𝜖 0 do 2: for 𝑙 ← 1 to #{𝐻, 𝑃1 , · · · , 𝑃 𝐿 } do 3: Set 𝑑 (𝑙) be the 𝑙-th entry of {𝐻, 𝑃1 , · · · , 𝑃 𝐿 } 4: for 𝑟 = 1, · · · , 𝑅 do 5: C𝑟 ← ⟨Z, e u𝑟0 ◦ · · · ◦ u𝑟,𝑘−1 ◦ u𝑟,𝑘+1 ◦ · · · ◦ u𝑟 𝐿 ◦ v𝑟1 ◦ · · · ◦ v𝑟 𝑀 ⟩ 𝐿 which is a tensor of dimension (𝑁 𝐽 + 𝐻𝑃) × 𝑑 (𝑙) × 𝑄 1 × · · · × 𝑄 𝑀 6: Unfolding C𝑟 along with dimension corresponding to 𝑑 (𝑙) 7: Obtain a (𝑁 𝐽 + 𝐻𝑃)𝑄 × 𝑑 (𝑙) dimension matrix C𝑟 end (𝑙) 8: C ← [C1 , · · · , C 𝑅 ] ∈ R (𝑁 𝐽+𝐻𝑃)𝑄×𝑅𝑑 9: vec(U𝑙 ) ← (CT C) −1 CT vec(Y) e end 10: for 𝑚 ← 1 to #{𝑄 1 , · · · , 𝑄 𝑀 } do 11: Set 𝑑 (𝑚) be the 𝑚-th entry of {𝑄 1 , · · · , 𝑄 𝐿 } 12: Ye𝑑 (𝑚) is unfolded along the mode corresponding to 𝑑 (𝑚) and obtain a 𝑑 (𝑚) × (𝑁 𝐽 + 𝐻𝑃) 𝑚≠𝑘 𝑄 𝑚 Î 13: for 𝑟 = 1, · · · , 𝑅 do 14: 𝐷 𝑟 ← vec(⟨Z, e u𝑟0 ◦ u𝑟1 ◦ · · · ◦ u𝑟 𝐿 ◦ v𝑟1 ◦ · · · v𝑟,𝑘−1 ◦ v𝑟,𝑘+1 ◦ · · · ◦ v𝑟 𝑀 ⟩ 𝐿+1 ) end Î 15: D ← [𝐷 1 , · · · , 𝐷 𝑅 ] ∈ R (𝑁 𝐽+𝐻𝑃) 𝑚≠𝑘 𝑄 𝑚 ×𝑅 16: V𝑚 ← Y e𝑑 (𝑚) D(DT D) −1 end 17: Compute B = [[U0 , U1 ,D· · · E, U 𝐿 , V1 , · · · , V 𝑀 ]] ∥Y−e Z,eBb ∥2 𝐿+1 F 18: Calculate: Error = b 2 ∥Y∥ F end 19: Compute 𝛽 𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 (𝑡) = bT𝑝1 ,··· ,𝑝 𝐿 ,𝑞1 ,··· ,𝑞 𝑀 B(𝑡) using Equation (4.2) for each node Regression functions are given by 𝛽 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) = 𝑝 1 cos (2𝜋𝑡) + 𝑞 1 sin (2𝜋𝑡) + 𝑝 2 sin (4𝜋𝑡) + 𝑞 2 cos (4𝜋𝑡) (4.21) 115 Here, changes in one unit of the index of each mode produce a change in one unit of the coefficient when the time is fixed. The covarites are generated in the following way, (1) (2) (3) 𝑥𝑖,𝑝1 ,𝑝2 (𝑡) = 𝜒𝑖,𝑝 1 ,𝑝 2 + 𝜒𝑖,𝑝 1 ,𝑝 2 sin (𝜋𝑡) + 𝜒𝑖,𝑝 1 ,𝑝 2 cos (𝜋𝑡) (4.22) and errors are generated as follows. (1) √ (2) √ 𝜖𝑖,𝑞1 ,𝑞2 (𝑡) = 𝜂𝑖,𝑞 ,𝑞 1 2 2 cos (𝜋𝑡) + 𝜂 𝑖,𝑞 ,𝑞 1 2 2 sin (𝜋𝑡) for all 𝑝 1 = 1, · · · , 𝑃1 , 𝑝 2 = 1, · · · , 𝑃2 , 𝑞 1 = 1, · · · , 𝑄 1 and 𝑞 2 = 1, · · · , 𝑄 2 . Moreover, we assume that 𝑥𝑖,𝑝1 ,𝑝2 (𝑡) are observed with measurement error, i.e., 𝑢𝑖,𝑝1 ,𝑝2 (𝑡) = 𝑥𝑖,𝑝1 ,𝑝2 + 𝛿 𝑝1 ,𝑝2 (𝑙) where 𝛿 𝑝1 ,𝑝2 ∼ 𝑁 (0, 0.62 ). Assume that the set of random variables { 𝜒𝑖,𝑝 1 ,𝑝 2 : 𝑙 = 1, 2, 3} and (𝑙) {𝜂𝑖,𝑞 1 ,𝑞 2 : 𝑙 = 1, 2} are mutually independent. The data generation process is influenced by Kim et al. (2018) although in an entirely different situation. We observe the data at 81 equidistant time-points in [0, 1] with 𝑡 𝑗 = ( 𝑗 − 0.5)/𝐽 for all 𝑗 = 1, · · · , 𝐽. We also fix 𝑃1 × 𝑃2 = 5 × 2 and 𝑄 1 × 𝑄 2 as either 5 × 2 or 15 × 12. Set, number of subjects, 𝑁 ∈ {30, 100}. We consider the following scenarios. (1) (2) (3) (Situation1) We choose 𝜒𝑖,𝑝 1 ,𝑝 2 ∼ 𝑁 (0, 12 ), 𝜒𝑖,𝑝 1 ,𝑝 2 ∼ 𝑁 (0, 0.852 ), 𝜒𝑖,𝑝 1 ,𝑝 2 ∼ 𝑁 (0, 0.72 ) and they (1) (2) are mutually independent. 𝜂𝑖,𝑞 1 ,𝑞 2 ∼ 𝑁 (0, 22 ), 𝜂𝑖,𝑞 1 ,𝑞 2 ∼ 𝑁 (0, 0.752 ) and they are mutually independent. Here, the covariates do not depend on the modes of the data structure. 
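Before turning to the spatially dependent settings in (Situation2) below, the following is a minimal R sketch of how one replication under (Situation1) could be generated from Equations (4.20)–(4.22); the sample sizes and random seed are illustrative, and the measurement-error contamination of the covariates is omitted for brevity.

# Minimal sketch (illustrative, not the exact simulation code): one data set under
# (Situation1), with independent Gaussian mode-wise coefficients.
set.seed(1)
N <- 30; J <- 81; P1 <- 5; P2 <- 2; Q1 <- 5; Q2 <- 2
tgrid <- (seq_len(J) - 0.5) / J
# true coefficient functions beta_{p1,p2,q1,q2}(t) from Equation (4.21)
beta_fun <- function(p1, p2, q1, q2, tt)
  p1 * cos(2 * pi * tt) + q1 * sin(2 * pi * tt) + p2 * sin(4 * pi * tt) + q2 * cos(4 * pi * tt)
X <- array(0, dim = c(N, P1, P2, J))   # covariate trajectories x_{i,p1,p2}(t_j)
Y <- array(0, dim = c(N, Q1, Q2, J))   # responses y_{i,q1,q2}(t_j)
for (i in 1:N) {
  chi1 <- matrix(rnorm(P1 * P2, sd = 1.00), P1, P2)
  chi2 <- matrix(rnorm(P1 * P2, sd = 0.85), P1, P2)
  chi3 <- matrix(rnorm(P1 * P2, sd = 0.70), P1, P2)
  eta1 <- matrix(rnorm(Q1 * Q2, sd = 2.00), Q1, Q2)
  eta2 <- matrix(rnorm(Q1 * Q2, sd = 0.75), Q1, Q2)
  for (j in 1:J) {
    tt <- tgrid[j]
    X[i, , , j] <- chi1 + chi2 * sin(pi * tt) + chi3 * cos(pi * tt)          # Equation (4.22)
    eps <- eta1 * sqrt(2) * cos(pi * tt) + eta2 * sqrt(2) * sin(pi * tt)     # error process
    for (q1 in 1:Q1) for (q2 in 1:Q2) {
      bt <- outer(1:P1, 1:P2, function(p1, p2) beta_fun(p1, p2, q1, q2, tt))
      Y[i, q1, q2, j] <- sum(X[i, , , j] * bt) + eps[q1, q2]                 # Equation (4.20)
    }
  }
}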
(Situation2) In addition, with the assumption of the coefficients of covariates, impose the spa- tial correlation structure to address the mode-wise dependencies. We consider the following two cases. (𝑙) a) 𝜒𝑖,𝑝 1 ,𝑝 2 at mode ( 𝑝 1 , 𝑝 2 ) is 𝜌 𝑠 (ED 𝑝1 ,𝑝2 ; 𝜃), where 𝜌 𝑠 is the exponential cor- relation function, ED 𝑝1 ,𝑝2 is defined as scaled Euclidean distance between two modes, having scaled by a constant 𝜃, therefore, 𝜃 defines an isotropic covariance function. In this simulation setup, 𝜃 is taken as 8. 116 (𝑙) b) 𝜒𝑖,𝑝 1 ,𝑝 2 at mode ( 𝑝 1 , 𝑝 2 ) is 𝜌 𝑀 (𝑑 𝑝1 ,𝑝2 ; 𝜅, 𝜈), where 𝑑 𝑞1 ,𝑞2 denotes the Euclidean distance between two different modes and 𝜌 𝑀 is the correlation function, belongs to Matérn family. The Matérn isotropic auto-correlation function has a specific form  √ 𝜈  √  21−𝜈 2𝑑 𝜈 2𝑑 𝜈 𝜌 𝑀 (𝑑; 𝜅, 𝜈) = 𝐾𝜈 , 𝜅, 𝜈 > 0 (4.23) Γ(𝜈) 𝜅 𝜅 Here, 𝐾𝜈 (·) is termed as Bessel function of order 𝜈. The positive range parameter 𝜅 controls the decay of the correlation between the observations at a large distance 𝑑. The order 𝜈 controls the behavior of the auto-correlation function for observations that are separated by a small distance. For our numerical example, we set the scale 𝜅 = 0.55 and the smoothness parameter 𝜈 = 1. The above mentioned situations have been implemented using “stationary.image.cov” and “matern.image.cov” functions respectively available in fields package in R (Dou- glas Nychka et al., 2017). We run the simulation 100 times for each scenario for the evaluation of our method. For each of the simulation setups, we take the number of knots as [𝐽/4], where [𝑎] denotes the integer part of 𝑎. We compare the overall performance of the models to estimate the parameter curves for different choices of ranks by studying several error rates based on different norms. We choose the smoothing parameters 𝜃 from the set {0, 0.001, 0.005, 0.01, 0.05, 0.1}, on the other hand, 𝜙 is chosen from a grid from {0, 0.5, 3, 10} and allow the different values from 1 to 5 for the choice of rank 𝑅. In the following tables, we denote FToTM𝑟 as proposed functional tensor-on-tensor model with rank 𝑟. To compare with the existing literature, we apply the concurrent linear model (Ramsay and Silverman, 2005) (CLM) for mode-wise analysis and implement this method using “pffr” function available in refund (Goldsmith et al., 2020) package in R, with the penalized concurrent effect of functional covariates (Ivanescu et al., 2015). Tables 4.1, 4.2 and 4.3 show the results of integrated mean square errors and absolute errors ∫ ∫ 𝜷(𝑡) − 𝜷(𝑡)∥ 2F 𝑑𝑡 and IMAE = 𝑡∈T 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 𝛽b𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) − 𝛽 𝑝1 ,𝑝2 ,𝑞1 ,𝑞2 (𝑡) 𝑑𝑡 Í IMSE = 𝑡∈T ∥ b 117 Table 4.1 Results of simulation situations (Situation1) where each modes are assumed to be independent for X (𝑡) and E (𝑡) for fixed time-points. Here we assume each of { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 and (𝑘) {𝜂 𝑞1 ,𝑞2 } 𝑞1 ,𝑞2 are independent for ( 𝑝 1 , 𝑝 2 ) and (𝑞 1 , 𝑞 2 ) respectively. 
Method IMSE (SD) RIMSE (SD) IMAE (SD) RIMAE (SD) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 0.14294 (0.02046) 0.01059 (0.00152) 0.28311 (0.02027) 0.09244 (0.00662) FToTM 1 1.48469 (0.05628) 0.10998 (0.00417) 0.96636 (0.01626) 0.31552 (0.00531) FToTM 2 0.45773 (0.02218) 0.03391 (0.00164) 0.53786 (0.01068) 0.17561 (0.00349) FToTM 3 0.15078 (0.01316) 0.01117 (0.00097) 0.29482 (0.01452) 0.09626 (0.00474) FToTM 4 0.01065 (0.00383) 0.00079 (0.00028) 0.07871 (0.01367) 0.0257 (0.00446) FToTM 5 0.01558 (0.00582) 0.00115 (0.00043) 0.09412 (0.01695) 0.03073 (0.00553) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.1448 (0.01339) 0.00193 (0.00018) 0.28468 (0.0132) 0.04054 (0.00188) FToTM 1 9.24824 (0.06732) 0.12304 (0.0009) 2.27313 (0.01304) 0.32372 (0.00186) FToTM 2 1.79804 (0.06786) 0.02392 (0.0009) 1.02121 (0.01836) 0.14543 (0.00261) FToTM 3 0.23289 (0.02089) 0.0031 (0.00028) 0.36104 (0.01293) 0.05142 (0.00184) FToTM 4 0.06108 (0.06808) 0.00081 (0.00091) 0.15243 (0.13744) 0.02171 (0.01957) FToTM 5 0.00195 (0.00053) 0.00003 (0.00001) 0.03348 (0.00451) 0.00477 (0.00064) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 0.03087 (0.00348) 0.00229 (0.00026) 0.13236 (0.00731) 0.04322 (0.00239) FToTM 1 1.46268 (0.04068) 0.10835 (0.00301) 0.95921 (0.01095) 0.31319 (0.00358) FToTM 2 0.43737 (0.01418) 0.0324 (0.00105) 0.52551 (0.00725) 0.17158 (0.00237) FToTM 3 0.13651 (0.00541) 0.01011 (0.0004) 0.27253 (0.01099) 0.08898 (0.00359) FToTM 4 0.00303 (0.00091) 0.00022 (0.00007) 0.04222 (0.00632) 0.01379 (0.00206) FToTM 5 0.0037 (0.00115) 0.00027 (0.00008) 0.04663 (0.00696) 0.01523 (0.00227) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.03082 (0.00163) 0.00041 (0.00002) 0.1328 (0.00357) 0.01891 (0.00051) FToTM 1 9.21298 (0.04487) 0.12257 (0.0006) 2.26689 (0.01132) 0.32283 (0.00161) FToTM 2 1.76018 (0.04482) 0.02342 (0.0006) 1.00917 (0.01218) 0.14372 (0.00173) FToTM 3 0.22276 (0.03467) 0.00296 (0.00046) 0.35168 (0.02647) 0.05008 (0.00377) FToTM 4 0.05837 (0.06468) 0.00078 (0.00086) 0.14918 (0.14726) 0.02124 (0.02097) FToTM 5 0.00085 (0.00033) 0.00001 (<0.00001) 0.02197 (0.00403) 0.00313 (0.00057) respectively. Similarly, we report the relative mean and absolute errors, which are defined as ∫ ∫ Í 𝜷(𝑡)−𝜷(𝑡) ∥ 2F 𝑑𝑡 ∥b 𝑡 ∈T 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 𝛽b𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 (𝑡)−𝛽 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 (𝑡) 𝑑𝑡 𝑡 ∈T RIMSE = ∫ and RIMAE = ∫ respectively. ∥ 𝜷(𝑡) ∥ 2F 𝑑𝑡 | 𝛽 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 (𝑡) | 𝑑𝑡 Í 𝑡 ∈T 𝑡 ∈T 𝑝1 , 𝑝2 ,𝑞1 ,𝑞2 The advantages of these simulation situations are that these models are not based on the reduced- rank model. Here, we observe the curves in the presence of errors. All integrals are approximated 118 Table 4.2 Results of simulation situations (Situation2)a where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with exponential covariance function. 
Method IMSE (SD) RIMSE (SD) IMAE (SD) RIMAE (SD) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 11.03513 (2.27364) 0.81742 (0.16842) 2.45079 (0.24082) 0.8002 (0.07863) FToTM 1 1.46631 (0.0141) 0.10862 (0.00104) 0.96402 (0.00583) 0.31476 (0.0019) FToTM 2 0.60273 (0.01917) 0.04465 (0.00142) 0.60152 (0.01318) 0.1964 (0.0043) FToTM 3 0.32753 (0.01741) 0.02426 (0.00129) 0.42707 (0.01962) 0.13944 (0.00641) FToTM 4 0.21328 (0.21078) 0.0158 (0.01561) 0.35394 (0.13306) 0.11556 (0.04344) FToTM 5 0.13694 (0.02654) 0.01014 (0.00197) 0.30854 (0.0384) 0.10074 (0.01254) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 11.36335 (1.34533) 0.15118 (0.0179) 2.49712 (0.14845) 0.35562 (0.02114) FToTM 1 9.21977 (0.02778) 0.12266 (0.00037) 2.27091 (0.0079) 0.32341 (0.00112) FToTM 2 1.76995 (0.02734) 0.02355 (0.00036) 1.01769 (0.01081) 0.14493 (0.00154) FToTM 3 0.41264 (0.16057) 0.00549 (0.00214) 0.48365 (0.08798) 0.06888 (0.01253) FToTM 4 0.18218 (0.21293) 0.00242 (0.00283) 0.32063 (0.13906) 0.04566 (0.0198) FToTM 5 0.06936 (0.05182) 0.00092 (0.00069) 0.19811 (0.08864) 0.02821 (0.01262) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 2.55232 (0.45649) 0.18906 (0.03381) 1.19172 (0.10708) 0.38911 (0.03496) FToTM 1 1.45974 (0.00711) 0.10813 (0.00053) 0.96178 (0.00323) 0.31403 (0.00105) FToTM 2 0.58776 (0.01049) 0.04354 (0.00078) 0.59246 (0.00766) 0.19344 (0.0025) FToTM 3 0.31275 (0.00961) 0.02317 (0.00071) 0.411 (0.01063) 0.1342 (0.00347) FToTM 4 0.18492 (0.20235) 0.0137 (0.01499) 0.32604 (0.13409) 0.10646 (0.04378) FToTM 5 0.11149 (0.03648) 0.00826 (0.0027) 0.27665 (0.06128) 0.09033 (0.02001) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 2.5259 (0.21061) 0.0336 (0.0028) 1.18808 (0.05122) 0.1692 (0.00729) FToTM 1 9.26995 (0.13929) 0.12333 (0.00185) 2.28525 (0.03385) 0.32545 (0.00482) FToTM 2 1.74798 (0.01575) 0.02325 (0.00021) 1.00948 (0.00691) 0.14376 (0.00098) FToTM 3 0.61359 (0.30173) 0.00816 (0.00401) 0.58812 (0.16308) 0.08376 (0.02322) FToTM 4 0.66733 (0.41716) 0.00888 (0.00555) 0.596 (0.24026) 0.08488 (0.03422) FToTM 5 0.0914 (0.04684) 0.00122 (0.00062) 0.23906 (0.07987) 0.03405 (0.01137) using the Riemann sum. Since our proposed method involves an iterative procedure, which depends on the initial estimates, the computational time is therefore not comparable to that of the classical CLM, which is not an iterative method. For all situations, our proposed method does a much better job in terms of low error rates in estimating the parameter 𝜷(𝑡). 119 Table 4.3 Results of simulation situations (Situation2)b where each modes are assumed to be independent for E (𝑡) for fixed time-points whereas modes for X (𝑡) are assumed to be dependent. Here we assume { 𝜒 𝑝(𝑘) } 1 ,𝑝 2 𝑝 1 ,𝑝 2 is spatially dependent with Matérn covariance function. 
𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 Method IMSE (SD) RIMSE (SD) IMAE (SD) RIMAE (SD) CLM 0.26393 (0.04919) 0.01955 (0.00364) 0.38374 (0.03318) 0.12529 (0.01083) FToTM 1 1.45885 (0.02061) 0.10806 (0.00153) 0.9599 (0.00731) 0.31342 (0.00239) FToTM 2 0.46879 (0.02445) 0.03473 (0.00181) 0.54097 (0.01118) 0.17663 (0.00365) FToTM 3 0.16291 (0.01629) 0.01207 (0.00121) 0.30998 (0.01506) 0.10121 (0.00492) FToTM 4 0.0087 (0.01146) 0.00064 (0.00085) 0.06782 (0.0274) 0.02214 (0.00895) FToTM 5 0.0111 (0.00525) 0.00082 (0.00039) 0.07909 (0.01855) 0.02582 (0.00606) 𝑁 = 30, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.26313 (0.02958) 0.0035 (0.00039) 0.3835 (0.02167) 0.05462 (0.00309) FToTM 1 9.22145 (0.02791) 0.12268 (0.00037) 2.27063 (0.00894) 0.32337 (0.00127) FToTM 2 1.77848 (0.02878) 0.02366 (0.00038) 1.02026 (0.01052) 0.1453 (0.0015) FToTM 3 0.23293 (0.01206) 0.0031 (0.00016) 0.36047 (0.00952) 0.05134 (0.00136) FToTM 4 0.05929 (0.06315) 0.00079 (0.00084) 0.15872 (0.13817) 0.0226 (0.01968) FToTM 5 0.00175 (0.00133) 0.00002 (0.00002) 0.03081 (0.00931) 0.00439 (0.00133) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 5 × 2 CLM 0.05833 (0.00912) 0.00432 (0.00068) 0.18217 (0.01463) 0.05948 (0.00478) FToTM 1 1.44275 (0.00963) 0.10687 (0.00071) 0.95499 (0.00374) 0.31181 (0.00122) FToTM 2 0.44676 (0.01346) 0.03309 (0.001) 0.52798 (0.00559) 0.17239 (0.00183) FToTM 3 0.14999 (0.00779) 0.01111 (0.00058) 0.29657 (0.00869) 0.09683 (0.00284) FToTM 4 0.00231 (0.00143) 0.00017 (0.00011) 0.03593 (0.00956) 0.01173 (0.00312) FToTM 5 0.00284 (0.00125) 0.00021 (0.00009) 0.04026 (0.00816) 0.01314 (0.00266) 𝑁 = 100, 𝑃1 × 𝑃2 = 5 × 2, 𝑄 1 × 𝑄 2 = 15 × 12 CLM 0.05746 (0.00427) 0.00076 (0.00006) 0.18093 (0.00695) 0.02577 (0.00099) FToTM 1 9.18754 (0.00744) 0.12223 (0.0001) 2.26337 (0.00385) 0.32233 (0.00055) FToTM 2 1.73663 (0.00773) 0.0231 (0.0001) 1.00481 (0.0038) 0.1431 (0.00054) FToTM 3 0.2181 (0.00522) 0.0029 (0.00007) 0.34535 (0.00306) 0.04918 (0.00044) FToTM 4 0.05167 (0.05987) 0.00069 (0.0008) 0.13999 (0.14339) 0.01994 (0.02042) FToTM 5 0.00081 (0.00061) 0.00001 (0.00001) 0.02055 (0.00654) 0.00293 (0.00093) 120 4.6 Application to ForrestGump-data 4.6.1 Details about the data-set The studyforrest (website: https://www.studyforrest.org/) describes a publicly available data-set for the study of neural language and story processing. The imaging data analyzed here are publicly avail- able through OpenfMRI (https://openneuro.org/datasets/ds000113/versions/1.3.0) (Hanke et al., 2014; Sengupta et al., 2016). In total 15 right-handed participants (mean age 29.4 years, range 21–39, 40% females, native German speaker) volunteered for a series of studies including eye- tracking experiments using natural signal stimulation with a motion picture. Volunteers have no known hearing problem without permanent or current temporary impairments, and no neurological disorder. Participants viewed a feature film “ Forrest Gump” (Robert Zemeckis, Paramount Pic- tures, 1994 with German audio track) in eight back-to-back 15-minute movie sessions, which were presented chronologically in two back-to-back sessions on the same day. Each session contained four segments, each approximately 15 minutes long. The eye tracking camera was fitted just outside the scanner bore, approximately centered and viewing the left eye of the participant at a distance of 100 cm through a small gap between the top of the back projection screen and the scanner bore ceiling. Participants were allowed to perform free eye movements without having to fixate or keep the eye open. 
The eye-gaze recording started as soon as the computer received the first fMRI trigger signal. For the audio-visual movie, the video track was extracted and encoded as H.264 (1280 × 720 at 25 fps). The movie was shown in 720p resolution on a 1280 × 1024 pixel screen at a 63 cm viewing distance. The temporal resolution of the participants' eye-gaze recording was 1000 Hz. All fMRI acquisitions had the following parameters: T2*-weighted echo-planar images with 2-second repetition time (TR), 30 ms echo time, and 90-degree flip angle were acquired during stimulation using a 3 Tesla MRI scanner. The dimension of the images at each time-point was 80 × 80 × 35 (with voxel dimension 3 × 3 × 3.3 mm³). The number of volumes acquired for the selected session was 451.
All files related to data acquisition for a particular subject are available in the sub-/ses-movie/ directory, where ID is the numeric subject ID. fMRI data files are available under the file name ses-movie_task-movie__bold. The in-scanner normalized eye-gaze coordinate time series are located at sub-/ses-movie/func/sub-_recording-eyegaze_physio.tsv.gz, which contains the X and Y coordinates of the eye-gaze, pupil area measurements, and the numerical ID of the movie frame presented at the time of measurement. Since the sampling rate is uniformly 1000 Hz, there are 1000 lines per second, with the first line corresponding to the onset of the movie stimulus. Here, the coordinates (0, 0) are located at the top-left corner of the movie frame, and the lower-right corner is located at (1280, 546); both measurements exclude the gray bars of the frame. All in-scanner recordings were temporally normalized by shifting the time series by the minimal video onset. Stimulus timing information was recorded in events.tsv files, which contain the onset and duration of each movie frame. In the eye-gaze data, a substantial amount of information is lost due to eye blinks; these samples are marked as nan in the data-set, and we fill them by spline interpolation. We used 14 individuals and removed Subject 5 due to excessive missing data.
We use the eye position in angular units (i.e., polar coordinates) instead of Cartesian coordinates, reporting magnitude changes of eye position in the screen reference system. Moreover, the X and Y coordinates, the related polar coordinates, and the pupil area were down-sampled to match the fMRI sampling frequency. Table 4.4 reports, for each participant, the correlation between angle and distance, the ratio of the standard deviations of distance and angle, and the ratio of the standard deviations in the horizontal and vertical directions of the frame. Fransson et al. (2014) investigates the relationship between spontaneous changes in eye position during passive fixation and intrinsic brain activity in a block-related task, using resting-state fMRI with concurrent eye-gaze recordings.
Figure 4.3 ForrestGump-data: Summary statistics for the parameter estimates of head motion correction across TRs and participants. (Left panel) Magnitude of three rotational parameters (in radians) and (Right panel) magnitude of three translation parameters (in millimeters) for each individual on each of the 451 TRs.
In each plot, the solid black line indicates the mean over the individuals through TRs, and the black dotted lines indicate the mean ± 2 sd over the individuals through TRs.
Figure 4.4 ForrestGump-data: Covariates of interest. Cartesian and polar coordinates and pupil area are shown across TRs and participants.
ID   sd(X)/sd(Y)   sd(dist)/sd(angle)   corr(dist, angle)
1    1.7044        922.2200             -0.3640
2    1.5247        1004.1233            -0.2552
3    1.6271        919.6505             -0.3124
4    1.6990        839.1719             -0.4135
5    1.1656        1156.9217            -0.4048
6    1.3943        772.5980             -0.2928
9    1.4403        762.4111             -0.3457
10   1.1282        821.5585             0.0177
14   1.9538        967.5292             -0.4522
15   1.1085        753.6706             -0.0512
16   1.7194        899.7864             -0.4071
17   1.9600        904.5869             -0.4612
18   1.6047        890.4765             -0.3568
19   1.3867        1069.9573            -0.2195
20   1.4631        944.8355             -0.3688
Table 4.4 ForrestGump-data: Summary statistics across participants.
4.6.2 Analysis
To analyze the data on a local computer, we only used the first run of the experiment for each individual and down-sampled the images to 64 × 64 × 64 via nearest-neighbor interpolation using the "resize" function in Matlab; the number of time-points was 451. Details of the pre-processing steps performed, along with further information on data acquisition, are described in Appendix A. Our scientific question of interest is to understand the association between brain image patterns and audio-visual inputs. This is the first approach to statistically analyze such a study by exploiting the complex structure of the data. We fit a time-varying tensor regression coefficient model as described in Section 4.2. Our covariate is a 3-mode tensor representing the normalized eye-gaze coordinate time-series, with the modes representing the scaled polar coordinates of the eye-gaze and the pupil area measurements, respectively. The response of the model is the pre-processed fMRI data. Response and covariates are collected simultaneously. The coefficient functions 𝜷1, 𝜷2 and 𝜷3 are amplitudes over time associated with the distance, the angle of the eye-gaze, and the pupil area, respectively; they are included to detect the effect of the movie's visual features on changes in the BOLD response. We choose the rank for the reduced-rank extraction to be 3 since it has the lowest prediction error. For interpretation purposes, we evaluate the estimates 𝜷̂(𝑡) by taking average values over eight different functional networks in the brain. This was achieved by first parcellating the brain into the 268 regions of the Shen atlas (Shen et al., 2013). These regions were thereafter combined into eight functional networks (Finn et al., 2015): medial frontal, frontoparietal, default mode, subcortical-cerebellum, motor, visual I, visual II, and visual association. Figures 4.5, 4.6 and 4.7 show the average estimated coefficient functions corresponding to the three visual features (distance, angle of eye-gaze, and pupil area) over all time-points for each network. Vertical lines represent scene changes in the movie. The first segment, consisting of approximately 84 time-points, corresponds to the opening sequence, which shows a feather floating through the sky as the credits are shown. The second segment consists of the famous scene where the protagonist of the movie sits on a bench at a bus stop and begins discussing the story of his life.
During this scene, there is heightened activation in several brain networks in reaction to different visual features. Throughout the time course, the changes in visual features have the greatest impact on activation in "visual I", which is depicted using purple lines and is consistent with what we should expect. Moreover, subsequent segments represent scene changes alternating between interior and exterior settings; see Häusler and Hanke (2016) for more details.
Figure 4.5 ForrestGump-data results: Estimate of the coefficient 𝜷1(𝑡) corresponding to the visual feature distance of eye-gaze for different locations. Legends for the different parcellations as mentioned in Shen et al. (2013) are also provided.
Figure 4.6 ForrestGump-data results: Estimate of the coefficient 𝜷2(𝑡) corresponding to the visual feature angle of eye-gaze for different locations. Legends for the different parcellations as mentioned in Shen et al. (2013) are also provided.
Figure 4.7 ForrestGump-data results: Estimate of the coefficient 𝜷3(𝑡) corresponding to the visual feature pupil area for different locations. Legends for the different parcellations as mentioned in Shen et al. (2013) are also provided.
4.7 Discussion
In this chapter, we have proposed a time-varying tensor-on-tensor regression model and a method to estimate the coefficient tensors, which belong to an infinite-dimensional space. We believe that the method provides an efficient approach to multi-modal data analysis using neuroimaging data. Regression coefficients are expressed using the B-spline technique, and the coefficients of the B-spline bases are estimated using a low-rank tensor decomposition. This reduces the vast number of parameters of interest as well as the computational complexity. We have provided a meaningful simulation study and a real data analysis combining fMRI and eye-tracking data. The results of our data analysis suggest that the approach has promise for identifying brain regions responding to an external stimulus, which in this case is movie-watching. Although our tensor data can be compactly represented by a CP model, it is NP-hard to determine the rank of the low-rank decomposition (Johan, 1990). To determine the tuning parameters, one can perform cross-validation. However, our main objective is not to choose the optimal rank of the low-rank decomposition in the algorithm, and we leave this for future research. Furthermore, the tensor train representation (Liu et al., 2020) could be an alternative representation of the multi-dimensional array. In conclusion, our work provides an important direction for dealing with massive structured data, such as time-varying tensors, in multi-modal neuroimaging studies.
4.8 Technical details
4.8.1 Technical lemmas
Lemma 4.8.1. For any positive definite matrices A and B we have
𝜆min(A) trace{B} ≤ trace{AB} ≤ 𝜆max(A) trace{B} (4.24)
where 𝜆max(A) is the largest eigen-value of A and 𝜆min(A) is the smallest eigen-value of A.
Proof. See Fang et al.
(1994) for the proof in detail. 130 Before introducing the next lemma, define a 𝑃-dimensional vector u = (𝑢 1 , · · · 𝑢 𝑃 ) T which is sub-Gaussian with some parameters 𝜎, then for all 𝜶 ∈ R𝑃 , E{exp 𝜶T u} ≤ exp(∥𝜶∥ 2 𝜎 2 /2) (4.25) Define the locally stationary time series 𝑢 𝑗 = G( 𝑗/𝐽, F𝑗 ) where F𝑗 = (· · · , 𝜂 𝑗−1 , 𝜂 𝑗 , · · · ), 𝜂 𝑗 s are i.i.d. random variables and G : [0, 1]×R∞ → R is a measurable function such that 𝜉 𝑗 (𝑡) = G(𝑡, F𝑗 ). Let {𝜂′ } be i.i.d. copies of 𝜂 and assume that for some 𝑎 > 0, define the 𝐿 𝑎 -norm ∥𝜂∥ 𝑎 = {E|𝜂| 𝑎 }1/𝑎 . Then for 𝑘 ≥ 0 define the physical dependence measure Δ(𝑘, 𝑎) = sup𝑡∈[0,1] max 𝑗 ∥G(𝑡, F𝑗 ) − G(𝑡, F𝑗,𝑘 ) ∥ 𝑎 where F𝑗,𝑘 = (F𝑗−𝑘−1 , 𝜂′𝑗−𝑘 , 𝜂 𝑗−𝑘+1 , · · · , 𝜂 𝑗 ). Moreover, recall the condition (C4) where for some large 𝑎, 𝜅0 > 0, there exists a universal constant 𝐶 > 0 such that Δ(𝑘, 𝑎) ≤ 𝐶 𝑘 −𝜅0 for 𝑘 ≥ 1. Furthermore, let ∥𝜂∥ 𝑎 be finite for some 𝑎 > 1. Lemma 4.8.2. Under condition (C4), with the above explanation, for some constant 𝐶𝑎 > 0,   1 𝑄𝜉 𝑁 𝜏 P 𝜎1 (PE) ≤ √ ≥ 1 − 𝐶𝑎 𝑁 −𝑎𝜏 (4.26) 𝑁𝐽 𝐽 where 𝜏 is some small positive real number and 𝜉 = sup1≤ℎ≤𝐻 sup𝑡∈[0,1] |Bℎ (𝑡)| Proof. See Ding et al. (2021) and the references herein for the proof in detail. Í𝐾 𝑁 +𝑣+1 Lemma 4.8.3. Define S𝑛 be a collection of spline such that the function 𝑔• (𝑡) = ℎ=1 𝑏 ℎ,• 𝐵 ℎ (𝑡), where {𝐵 ℎ , ℎ = 1, · · · , (𝐾 𝑁 + 𝑣 + 1)} is a set of B-spline bases in 𝑆𝑛 . Under conditions (C2) and (C3), there exists a spline function 𝑔• (𝑡) ∈ 𝑆𝑛 such that ! 1 sup |𝛽• (𝑡) − 𝑔• (𝑡)| = 𝑂 (4.27) 𝑡∈T 𝐾 𝑁𝑣+1 Proof. This proof follows from De Boor et al. (1978). 4.8.2 Proof of Theorem 4.3.1 For simplicity, assume Y ∈ R𝑁 𝐽×𝑄 and Z ∈ R𝑁 𝐽×𝐻×𝑃 , thus B ∈ R𝐻×𝑃×𝑄 . The contracted inner product in this proof is of order 2, i.e., < ·, · >2 , for simplicity, we drop subscript 2 from the inner 131 product. By the definition of b B0 , for all matrices C of rank 𝑅0 with order 𝐻 𝑁 × 𝑃 × 𝑄, we have D E ∥Y − Z, b B0 ∥ 2F + (𝑁 𝐽)∥b B0 ∥ 2F,W 𝜔 ≤ ∥Y − ⟨Z, C⟩ ∥ 2F + (𝑁 𝐽)∥ C ∥ 2F,W 𝜔 (4.28) In addition, the following two equations hold for any tensor C, ∥Y − ⟨Z, C⟩ ∥ 2F = ∥Y − ⟨Z, B0 ⟩ ∥ 2F + ∥ ⟨Z, ( B0 − C)⟩ ∥ 2F + 2 ⟨E, ⟨Z, ( B0 − C)⟩⟩ F D E D E D D EE ∥Y − Z, b B0 ∥ 2F = ∥Y − ⟨Z, B0 ⟩ ∥ 2F + ∥ Z, ( B0 − b B0 ) ∥ 2F + 2 E, Z, ( B0 − b B0 ) F (4.29) with ⟨A, B⟩ F = trace{AT B} for any matrices A and B such that the matrix product of AT B is permissible. Define, P = Z (1) ( ZT(1) Z (1) ) −1 ZT(1) , then by the definition of Frobenius inner product, D D EE D D EE E, Z, (bB0 − B) = PE, Z, (b B0 − C) . Moreover, the inner product norm ⟨·, ·⟩ F , operator F F Í norm ∥ · ∥ 2 = 𝜎1 (·) and nuclear norm ∥ · ∥ ∗ = 𝑖 𝜎𝑖 (·) are related using the inequalities ⟨A, B⟩ F ≤ √ ∥A∥ 2 ∥B∥ ∗ and ∥B∥ ∗ ≤ 𝑟 ∥B∥ F where 𝑟 be the rank of the matrix B and 𝜎𝑖 (·) represents the 𝑖 th largest singular value of a matrix. By subtracting the two Equations in 4.29 and exercising the properties of different norms mentioned above, we get the following inequalities. D E D D EE n o 2 2 2 2 ∥ Z, ( B0 − B0 ) ∥ F ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ F + 2 E, Z, ( B0 − C) b b + (𝑁 𝐽) ∥ C ∥ F,W 𝜔 − ∥ B0 ∥ F,W 𝜔 b F D D EE = ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 2 PE, Z, (b B0 − C) F n o + (𝑁 𝐽) ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 √︁ D E ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 2𝜎1 (PE) 2𝑅0 ∥ Z, (b B0 − C) ∥ F n o + (𝑁 𝐽) ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 (4.30) ∫ ∫ Define, P = I𝑄 ⊗I𝑃 ⊗ B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡 and observe the fact that 𝜆 max (P) = 𝜆 max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡). 
Now consider, for any tensor with C, using Lemma 4.8.1, vec( C) T P vec( C) − vec(b B0 ) T P vec(bB0 ) = trace{P(vec( C) vec( C) T − vec(b B0 ) vec(bB0 ) T )} ≤ 𝜆 max (P)trace{vec( C) vec( C) T − vec(b B0 ) vec(b B0 ) T } 132 ∫ = 𝜆 max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡){∥ C ∥ 2F − ∥b B0 ∥ 2F } (4.31) As a consequence of the above inequality, ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 = vec( C) T W𝜔 vec( C) − vec(b B0 ) T W𝜔 vec(b B0 )   = 𝜃 vec( C) T P vec( C)} − vec(b B0 ) T P vec(b B0 )}   + 𝜙 vec( C) T vec( C)} − vec(b B0 ) T vec(b B0 )} n o ≤ (𝜃𝜆max (P) + 𝜙) ∥ C ∥ 2F − ∥b B0 ∥ 2F ∫ n o ′′ ′′ T 2 2 = (𝜃𝜆max ( B (𝑡)B (𝑡) 𝑑𝑡) + 𝜙) ∥ C ∥ F − ∥ B0 ∥ F b (4.32) Then for tensor C with rank( C) ≤ 𝑅0 and 𝐼 = min(𝐻, 𝑃𝑄), we have the following inequalities. ∥ C ∥ 2F − ∥b B0 ∥ 2F ∑︁ 𝐼 𝐼 ∑︁ = 𝜎𝑖2 ( C (1) ) − 𝜎𝑖2 (b B0(1) ) 𝑖=1 𝑖=1 ( 𝐼 ) n o ∑︁   ≤ 𝜎1 ( C (1) ) + 𝜎1 (b B0(1) ) 𝜎𝑖 ( C (1) ) − 𝜎𝑖 (b B0(1) ) 𝑖=1 ( 𝐼 ) (𝑖) n o ∑︁ ≤ 2𝜎1 ( C (1) ) + 𝜎1 (b B0(1) − C (1) ) 𝜎𝑖 (bB0(1) − C (1) ) 𝑖=1 ( 𝑅0 ) n o ∑︁ = 2𝜎1 ( C (1) ) + 𝜎1 (b B0(1) − C (1) ) 𝜎𝑖 (b B0(1) − C (1) ) 𝑖=1 (𝑖𝑖) n o n√︁ o ≤ 2𝜎1 ( C (1) ) + ∥b B0 − C ∥ F 2𝑅0 ∥b B0 − C ∥ F √︁ n o2 ≤ 2𝑅0 2𝜎1 ( C (1) ) + ∥ B0 − C ∥ F b (4.33) where inequality (i) follows since 𝜎𝑖+ 𝑗−1 (A + B) ≤ 𝜎𝑖 (A) + 𝜎 𝑗 (B), or in other words due to Weyl’s additive perturbation theory which states that 𝜎𝑖+ 𝑗−1 (A) ≤ 𝜎𝑖 (B) + 𝜎 𝑗 (A − B). Inequal- ity (ii) holds since by definition 𝜎1 (A) = ∥A∥ 2 , operator norm. Moreover, ∥A∥ 2 ≤ ∥A∥ F and due to Cauchy-Schwarz inequality along with the fact that rank(A + B) ≤ 𝑟𝑎𝑛𝑘 (A) + D E D E 𝑟𝑎𝑛𝑘 (B). Also, ∥ Z, (b B0 − C) ∥ 2F = ∥ Z (1) , (b B0 − C) (3) ∥ 2F = ∥ ZT(1) (b B0 − C) (3) ∥ 2F ≥ ∥b B0 − C ∥ 2F 𝜆 min (ZT(1) Z (1) ) due to 4.8.1. Therefore, using the inequality (𝑥 + 𝑦) 2 ≤ 2(𝑥 2 + 𝑦 2 ) we have for 133 ∫ √ 𝜇 = (𝑁 𝐽) (𝜃𝜆max ( B′′ (𝑡)B′′ (𝑡) T 𝑑𝑡) + 𝜙) 2𝑅0 n o n D E o2 (𝑁 𝐽) ∥ C ∥ 2F,W 𝜔 − ∥b B0 ∥ 2F,W 𝜔 ≤ 𝜇 2𝜎1 ( C (1) ) + 𝜆−1 ( Z T min (1) (1) Z )∥ Z , ( B b 0 − C ) ∥F n D E o −2 ≤ 𝜇 4𝜎1 ( C (1) ) + 𝜆 min ( Z (1) Z (1) )∥ Z, ( B0 − C) ∥ 2F 2 T b D E −2 ≤ 4𝜇𝜎1 ( C (1) ) + 2𝜇𝜆min ( Z (1) Z (1) )∥ Z, ( B0 − B0 ) ∥ 2F 2 T b + 2𝜇𝜆−2 T min ( Z (1) Z (1) )∥ ⟨Z, ( C − B0 )⟩ ∥ F 2 (4.34) Therefore, we obtain the bound for the prediction error as the following way using the assumption that 𝜆 min ( ZT(1) Z (1) ) is bounded below by 𝜆 with high probability and by inequality 2𝑥𝑦 ≤ 𝑥 2 /𝑎 + 𝑎𝑦 2 in (★), consider the following from Equation (4.30), D E √︁ D E 2 2 ∥ Z, ( B0 − B0 ) ∥ F ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ F + 2𝜎1 (PE) 2𝑅0 ∥ Z, ( B0 − C) ∥ F b b D E + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F + 2𝜇𝜆−2 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 4𝜇𝜎12 ( C (1) ) −2 b √︁ D E ≤ ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 2𝜎1 (PE) 2𝑅0 ∥ Z, (b B0 − B0 ) ∥ F √︁ + 2𝜎1 (PE) 2𝑅0 ∥ ⟨Z, ( C − B0 )⟩ ∥ F D E + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F + 2𝜇𝜆−2 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F + 4𝜇𝜎12 ( C (1) ) −2 b (★) ≤ 4𝜇𝜎12 ( C (1) ) + ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F D E D E −2 + 2𝑅0 𝑎𝜎1 (PE) + ∥ Z, ( B0 − B0 ) ∥ F /𝑎 + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F 2 b 2 b + 2𝑅0 𝑏𝜎12 (PE) + ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F /𝑏 + 2𝜇𝜆−2 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F ≤ 4𝜇𝜎12 ( C (1) ) + 2(𝑎 + 𝑏)𝑅0 𝜎12 (PE)     D 𝑏+1 −2 2 1 −2 E + + 2𝜇𝜆 ∥ ⟨Z, ( C − B0 )⟩ ∥ F + + 2𝜇𝜆 ∥ Z, ( B0 − B0 ) ∥ 2F b 𝑏 𝑎 (4.35) Therefore, by doing some algebra, we have the following.   
D 𝑎−1 −2 E − 2𝜇𝜆 ∥ Z, ( B0 − B0 ∥ 2F ≤ 4𝜇𝜎12 ( C (1) ) + 2(𝑎 + 𝑏)𝑅0 𝜎12 (PE) b 𝑎   𝑏+1 −2 + + 2𝜇𝜆 ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F 𝑏 134 D E   −1  2 −1 −2 ∥ Z, ( B0 − B0 ) ∥ F ≤ C(𝛿) − 2𝜇𝜆 b 4𝜇𝜎12 ( C) + 2(1 + 𝛿)𝑅0 𝜎12 (PE) C(𝛿) + 2𝜇𝜆−2   + ∥ ⟨Z, ( C − B)⟩ ∥ 2F (4.36) C(𝛿) −1 − 2𝜇𝜆−2 where C(𝛿) = 1 + 2/𝛿 and 𝜎1 ( C) = max{𝜎1 ( C (1) ), 𝜎1 ( C (2) ), 𝜎1 ( C (3) )}. The last inequality holds after choosing 𝑎 = 1 + 𝛿/2 and 𝑏 = 𝛿/2. Now, it is enough to provide an upper bound of the largest singular value of PE. For some positive constant 𝐶0 , with high probability 1 − 𝐶0 𝑁 −𝑎𝜏 , by Lemma 4.8.2, D E   −1  2 −1 −2 ∥ Z, ( B0 − B0 ) ∥ F ≤ C(𝛿) − 2𝜇𝜆 b 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 C(𝛿) + 2𝜇𝜆−2   + ∥ ⟨Z, ( C − B0 )⟩ ∥ 2F (4.37) C(𝛿) −1 − 2𝜇𝜆−2 Since C is an arbitrary matrix with rank( B) ≤ 𝑅0 , the choosing C = B0 , we have, D E   −1  ∥ Z, (bB0 − B0 ) ∥ 2F ≤ C(𝛿) −1 − 2𝜇𝜆−2 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.38) The estimation bound can be derived from the above expression under condition 𝜆min ( ZT(1) Z (1) ) ≥ 𝜆, from inequality 4.38, we have   −1  2 −1 −1 −2 ∥ B0 − B0 ∥ F ≤ 𝜆 b C(𝛿) − 2𝜇𝜆 4𝜇𝜎12 ( C) + 2𝑅0 (1 + 𝛿)𝑄 2 𝜉 2 𝑁 2𝜏+2 𝐽 (4.39) 4.8.3 Proof of Theorem 4.3.2 Observe that, due to Lemma 4.8.3 and the fact that, ∫ ∫  T b 2 T T b0 −B0 ∥ 2 = 𝑂 𝑃 (𝑎 𝑁 ) [B(𝑡) ( B0 −B0 )] 𝑓𝑇 (𝑡)𝑑𝑡 = (b B0 −B0 ) Bℎ (𝑡)Bℎ (𝑡) 𝑓𝑇 (𝑡)𝑑𝑡 (b B0 −B0 ) ∝ ∥ B T T (4.40) Therefore, ∫ ( 𝛽b• (𝑡) − 𝛽• (𝑡)) 2 𝑓𝑇 (𝑡)𝑑𝑡 T ∫  2 T b T = B(𝑡) ( B0 − B0 ) + B(𝑡) B0 − 𝛽(𝑡) 𝑓𝑇 (𝑡)𝑑𝑡 T ∫ 2 ≤ 2∥b B0 − B0 ∥ F + 2 [B(𝑡) T B0 − 𝛽(𝑡)] 2 𝑓𝑇 (𝑡)𝑑𝑡 T ≤ 𝑂 𝑃 (𝑎 𝑁 ) + 𝑂 (𝐾 𝑁−2(𝜈+1) ) (4.41) 135 CHAPTER 5 EPILOGUE 5.1 Conclusions In this dissertation, we develop new methods and theories for functional data analysis with depen- dence and complex structures that are often produced in neuroimaging studies. Classical statistical approaches either do not adequately take advantage of dependence or are not applicable to data with complex structures. The first part of this dissertation (Chapters 2 and 3) focuses on improving existing non-parametric methods via incorporating dependence in functional data. In Chapter 2 we develop an efficient and robust estimation technique based on quadratic inference for a coefficient vector in a constant linear effects model with dense functional responses. The proposed method uses a data-driven approach to construct bases, which avoids the possible mis-specification and improves estimation efficiency. Then, in Chapter 3 we develop a multi-step estimation procedure for estimating non-parametric coefficient function in a varying-coefficient linear model with het- eroskedastic errors. This method incorporates the dependence via a local linear generalized method of moments based on continuous moment conditions. In the second part of the dissertation (Chapter 4) we develop a new approach for studying neural correlates in the presence of tensor-valued brain images and tensor-valued predictors. We consider a time-varying tensor regression model where the inherent structural composition of responses and covariates is preserved. Extensive simulation studies and real data analyses are conducted to justify the efficacy of the proposed methods. 5.2 Future directions We would like to provide some possible directions with feasible applications to work on in the future, which directly follow from this dissertation. 1. Can quadratic inference approach improve the efficiency of parameter estimation for longi- tudinal tensor data? 136 - In many applications, we are convinced that brain images appear as structured data. 
Therefore, scientists are often interested in analyzing such complex structured longitudinal data which are observed at finitely many time-points or clusters due to the number of visits of patients in the clinic. Zhang et al. (2014) proposed a unifying regression framework with tensor-variate image predictor with tensor-GEE for longitudinal imaging analysis. The proposed GEE approach takes into account the intra-subject correlation responses where a low-rank tensor decomposition of the coefficient array becomes effective during estimation and prediction. Similarly to the existing problem of GEE in classical longitudinal analysis, the estimation of the parameter tensor becomes inconsistent when the working correlation structure is miss- specified. Therefore, a quadratic inference-based method can be developed for tensor-GEE to improve the efficiency of the tensor-variate regression model. 2. Can a time-varying tensor approach improve FDR control for fMRI data? - In the statistics and neuroimaging literature, false discovery rate (FDR) is commonly used for inference. A straightforward application of FDR violates the complex data features, thereby producing unsatisfactory performance. Brown and Behrmann (2017) correctly observed that the overstatement of Eklund et al. (2016) regarding FDR cast doubt on fMRI technique for studying brain function and caused damage to the field of cognitive neuroscience. Subse- quently, Cox et al. (2017); Kessler et al. (2017) offered several clarifications. Among other issues, these PNAS papers recognized that accounting for the spatio-temporal aspects is ut- terly important for fMRI and is a remarkable methodology for understanding brain function and its relationship to behavioral characteristics. Therefore, based on the model proposed in Chapter 4, one can propose a new thresholding technique for the multiple hypothesis testing problem for spatially dependent tests over continuum null hypotheses to identify activated voxels over time based on a tensor-structured “statistical parametric map”, which can be an interesting development. Apart from the above two immediate future directions, the methodologies presented in this 137 dissertation will provide essential tools for analysis of such complex structured data not only in neuroimaging studies, but also in other scientific areas such as network analysis, recommendation systems and statistical genetics. 138 APPENDIX 139 APPENDIX PRE-PROCESSING OF IMAGING DATA USING R The main goal of studying fMRI data is to identify brain areas activated by a task. Scanner drift may occur when the strength of the magnetic field inside the bore changes slowly over time during a scan session. Therefore, it is not recommended to conduct any statistical analyses using the raw data coming from the scanner. As a result, an important role is played by the pre-processing steps of fMRI data, which consist of all required transformations needed to prepare the data for analysis. In statistics, noise is the fundamental uncertainty, but sometimes it consists of systematic variability and it is possible to remove the noise from the data. Hence, the main goal of these pre-processing steps is to remove the systematic variability that can arise due to movement of the head during an experiment, size of the brain, etc. which are mainly sources of variability due to artifacts. 
Pre-processing of the fMRI data is almost similar for all kinds of experiment and typically involves a number of steps such as aligning the functional and structural scans, correcting the possible head movements, skull stripping, registration to the template, and smoothing to reduce noise; although, the type of smoothing depends on the objective of the statistical analysis. In most of the cases, the fMRIs are in NIfTI (Neuroimaging Informatics Technology Initiative) format with extension of the file “*.nii” and this can be read as a multi-dimensional array. We read the data in R and perform all required pre-processing steps using some tools and packages in R such as the fslr (Jenkinson et al., 2012; Muschelli et al., 2015). Interested readers are encouraged to study Ashby (2011); Wager and Lindquist (2015) for further details in the pre-processing steps. The six commonly used pre-processing steps are performed in the order as listed below. 1. Slice timing correction: This method corrects the variability in the BOLD responses that is due to the fact that data in different voxels are acquired at different times. This step is performed using the function “slicetimer” where indexing is from top, the order of acquisition is continuous, and the interpolation using this function is done using “sinc” filter. For example, 140 in ForrestGump-data, the brain image consists of 35 slices and the replication time is 2 seconds. Therefore, the time between acquisition of the first and last slices should be almost the same as 2 seconds. Moreover, the variability in the data due to differences in the times of slice acquirement can be reduced by this pre-processing step using interpolation and/or by analyzing task-related activation using the flexible hrf model. Later, the bias_correct function is used for bias field corrections. 2. Motion correction: This step is performed to correct for variability due to head movement. Motion correction is a special case of image registration where a series of images are aligned by considering mean image over all time-points as the target image for each individuals. This is one of the most important steps of pre-processing. When a subject moves their head, a specific brain region either moves to a different region or out of the scanning area. This correction procedure is based on the assumption that the shape and size of the brain remain intact irrespective of the subject moving their head. Therefore, the rigid body registration (Ashburner and Friston, 2007) method can be applied. It is easy to visualize that any rigid body movement can be described by six parameters. When a subject lies inside the scanner, the center of any voxel in their head occupies a point in space that can be characterized by the triplet (𝑥, 𝑦, 𝑧). By convention, the 𝑍-axis is parallel to the bore of the magnet, the 𝑋-axis passes through the subject’s ears from left to right side and the 𝑌 -axis is a pole that enters through the back of the head and exits in the forehead. Based on this coordinate system, possible rigid body movements are translations and rotations along the 𝑋, 𝑌 and 𝑍 axes. BOLD responses at the mean over different TR is taken as the standard and then rigid body transformation is performed for all TRs until each of the data-sets agree as closely as possible with the mean data. Motion-corrected images have the same dimension, voxel spacing, origin, and direction as those of images collected from the scanner. 
Here we use “antsrMotionCalculation” function which provides an R-wrapper around the Insight Segmentation and Registration Toolkit (ITK). A rigid body transformation for the calculation of frame-wise motion parameters is illustrated in Figure 4.3 where first three parameters 141 contain the rotation matrix (rotation along the 𝑋, 𝑌 and 𝑍 axes, respectively) and the other three parameters are translation vectors (translation along the 𝑋, 𝑌 and 𝑍 axes, respectively) at each TR. 3. Co-registration: Since the functional images are collected with low spatial resolution, and the voxel size is much smaller than the structural images, in this step we align the functional and structural images of the pre-processing steps. We perform brain extraction of T1-weighted images (after skull stripping and bias-correction since brain activity is restricted to brain tissues only; therefore, brain extraction of the anatomical image and inhomogeneity correction must be performed to remove artifacts). The method of co-registration is similar to motion correction, rather simpler, since it involves only two images to be aligned, but challenge is that voxels are no longer one-to-one between two images, and the functional and structural images are run with different imaging parameters (or modalities), as a result of which their contrasts are different. In this step, only the structural image (mostly the mean image) and any one of the functional images must align, since all functional images are already in alignment after the head movement corrections. We use the function “registration” to register the average fMRI with spatial resolution of 3.0 × 3.0 × 3.0𝑚𝑚 3 to an 0.7 × 0.67 × 0.67𝑚𝑚 3 T1-weighted image by applying affine transformation and non-linear registration using the symmetric normalization (SyN) algorithm in Advance Normalizing Tools (ANTs). Note that the resulting images have the same dimension, voxel spacing, origin, and direction as those of the anatomical coordinate systems. 4. Normalization: This step is performed to wrap the structural images in the standard brain atlas. There exist huge disparities between two brains in terms of size and shape across the subjects. Therefore, it is difficult to make decisions regarding task-related activation in order to figure which voxel/region of the brain is activated in a subject. The common practice is to map the data onto a “standard brain” for which coordinates of all major brain areas have already been discovered and henceforth determine the activated region in a brain 142 atlas. Here we use an atlas in Section 4.6 proposed by Montreal Neurological Institute (MNI) where the MNI-atlas was created by averaging the results from high-resolution structural images taken over 152 different brains with dimension 182 × 218 × 182 with pixel dimension 1 × 1 × 1mm3 and it is also provided in FSL as MNI152_T1_1mm_brain. Furthermore, the functional brain atlas provides information about the location of the functional brain region, obtaining knowledge on the brain functionality. The steps of the normalization process are the following. • Co-registering the T1-weighted image into the coordinate system of the MNI space. • Co-registering functional and T1-weighted structural images. • Applying the calculated non-linear forward transformation from previous steps to project fMRI time-series to MNI space. Here we use “registration” and “antsApplyTransforms” accordingly. 5. 
BIBLIOGRAPHY

Ai, C. and Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843.

Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47.

Allen, G. I., Grosenick, L., and Taylor, J. (2014). A generalized least-square matrix decomposition. Journal of the American Statistical Association, 109(505):145–159.

Anderson, T. W. (1958). An introduction to multivariate statistical analysis, volume 2. Wiley, New York.

Ashburner, J. and Friston, K. J. (2007). Rigid body registration.
Statistical parametric mapping: The analysis of functional brain images, pages 49–62. Ashby, F. G. (2011). Statistical analysis of fMRI data. MIT press. Aue, A., Norinho, D. D., and Hörmann, S. (2015). On the prediction of stationary functional time series. Journal of the American Statistical Association, 110(509):378–392. Bai, Y., Zhu, Z., and Fung, W. K. (2008). Partial linear models for longitudinal data based on quadratic inference functions. Scandinavian Journal of Statistics, 35(1):104–118. Balan, R. M., Schiopu-Kratina, I., et al. (2005). Asymptotic results with generalized estimating equations for longitudinal data. The Annals of Statistics, 33(2):522–541. Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202. Behzadi, Y., Restom, K., Liau, J., and Liu, T. T. (2007). A component based noise correction method (compcor) for bold and perfusion based fmri. Neuroimage, 37(1):90–101. Berrendero, J. R., Justel, A., and Svarc, M. (2011). Principal components for multivariate functional data. Computational Statistics & Data Analysis, 55(9):2619–2634. Bi, X., Tang, X., Yuan, Y., Zhang, Y., and Qu, A. (2021). Tensors in statistics. Annual review of statistics and its application, 8:345–368. Bongiorno, E. G., Salinelli, E., Goia, A., and Vieu, P. (2014). Contributions in infinite-dimensional statistics and related topics. Societa Editrice Esculapio. 146 Bravo, F. (2021). Second order expansions of estimators in nonparametric moment conditions models with weakly dependent data. Econometric Reviews, pages 1–24. Brown, E. N. and Behrmann, M. (2017). Controversy in statistical analysis of functional magnetic resonance imaging data. Proceedings of the National Academy of Sciences, 114(17):E3368– E3369. Cai, T. T. and Hall, P. (2006). Prediction in functional linear regression. The Annals of Statistics, 34(5):2159–2179. Cai, Z., Das, M., Xiong, H., and Wu, X. (2006). Functional coefficient instrumental variables models. Journal of Econometrics, 133(1):207–241. Cai, Z. and Li, Q. (2008). Nonparametric estimation of varying coefficient dynamic panel data models. Econometric Theory, 24(5):1321–1342. Cardot, H., Degras, D., and Josserand, E. (2013). Confidence bands for horvitz–thompson estima- tors using sampled noisy functional data. Bernoulli, 19(5A):2067–2097. Cardot, H., Ferraty, F., and Sarda, P. (1999). Functional linear model. Statistics & Probability Letters, 45(1):11–22. Cardot, H., Ferraty, F., and Sarda, P. (2003). Spline estimators for the functional linear model. Statistica Sinica, pages 571–591. Cardot, H. and Sarda, P. (2008). Varying-coefficient functional linear regression models. Commu- nications in statistics—theory and methods, 37(20):3186–3203. Carroll, J. D. and Chang, J.-J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika, 35(3):283–319. Casey, B., Soliman, F., Bath, K. G., and Glatt, C. E. (2010). Imaging genetics and development: challenges and promises. Human Brain Mapping, 31(6):838–851. Castro, P. E., Lawton, W. H., and Sylvestre, E. (1986). Principal modes of variation for processes with continuous sample curves. Technometrics, 28(4):329–337. Chen, K., Dong, H., and Chan, K.-S. (2013). Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901–920. Chen, K. and Müller, H.-G. (2012). Modeling repeated functional observations. 
Journal of the American Statistical Association, 107(500):1599–1609. Chen, X., Li, H., Liang, H., and Lin, H. (2019). Functional response regression analysis. Journal of Multivariate Analysis, 169:218–233. 147 Chiou, J.-M., Chen, Y.-T., and Yang, Y.-F. (2014). Multivariate functional principal component analysis: A normalization approach. Statistica Sinica, pages 1571–1596. Chiou, J.-M., Müller, H.-G., and Wang, J.-L. (2003). Functional quasi-likelihood regression models with smooth random effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):405–423. Chiou, J.-M., Müller, H.-G., and Wang, J.-L. (2004). Functional response models. Statistica Sinica, pages 675–693. Conn, A. R., Gould, N. I., and Toint, P. L. (2000). Trust region methods, volume 1. Siam. Cox, R. W., Chen, G., Glen, D. R., Reynolds, R. C., and Taylor, P. A. (2017). fmri clustering and false-positive rates. Proceedings of the National Academy of Sciences, 114(17):E3370–E3371. Cragg, J. G. (1983). More efficient estimation in the presence of heteroscedasticity of unknown form. Econometrica: Journal of the Econometric Society, pages 751–763. Crainiceanu, C. M., Staicu, A.-M., and Di, C.-Z. (2009). Generalized multilevel functional regres- sion. Journal of the American Statistical Association, 104(488):1550–1561. Craven, P. and Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische mathematik, 31(4):377–403. Dauxois, J., Pousse, A., and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. Journal of multivariate analysis, 12(1):136–154. De Boor, C., De Boor, C., Mathématicien, E.-U., De Boor, C., and De Boor, C. (1978). A practical guide to splines, volume 27. springer-verlag New York. Delaigle, A. and Hall, P. (2010). Defining probability density for a distribution of random functions. The Annals of Statistics, pages 1171–1193. Diggle, P., Diggle, P. J., Heagerty, P., Liang, K.-Y., Heagerty, P. J., Zeger, S., et al. (2002). Analysis of longitudinal data. Oxford University Press. Ding, X., Yu, D., Zhang, Z., and Kong, D. (2021). Multivariate functional response low-rank regression with an application to brain imaging data. Canadian Journal of Statistics, 49(1):150– 181. Ding, X. and Zhou, Z. (2020). Estimation and inference for precision matrices of nonstationary time series. The Annals of Statistics, 48(4):2455–2477. Douglas Nychka, Reinhard Furrer, John Paige, and Stephan Sain (2017). fields: Tools for spatial 148 data. R package version 12.3. Dunford, N. and Schwartz, J. T. (1988). Linear operators, part 1: general theory, volume 10. John Wiley & Sons. Eklund, A., Nichols, T. E., and Knutsson, H. (2016). Cluster failure: Why fmri inferences for spatial extent have inflated false-positive rates. Proceedings of the national academy of sciences, 113(28):7900–7905. Eubank, R. L. (1999). Nonparametric regression and spline smoothing. CRC press. Fan, J. and Gijbels, I. (1996). Local polynomial modelling and its applications. Chapman & Hall/CRC. Fan, J., Yao, Q., and Cai, Z. (2003). Adaptive varying-coefficient linear models. Journal of the Royal Statistical Society: series B (statistical methodology), 65(1):57–80. Fan, J. and Zhang, W. (2008). Statistical methods with varying coefficient models. Statistics and its Interface, 1(1):179. Fan, J., Zhang, W., et al. (1999). Statistical estimation in varying coefficient models. The annals of Statistics, 27(5):1491–1518. Fang, E. 
X., Ning, Y., Li, R., et al. (2020). Test of significance for high-dimensional longitudinal data. Annals of Statistics, 48(5):2622–2645. Fang, Y., Loparo, K. A., and Feng, X. (1994). Inequalities for the trace of matrix product. IEEE Transactions on Automatic Control, 39(12):2489–2490. Faraway, J. J. (1997). Regression analysis for a functional response. Technometrics, 39(3):254–261. Ferraty, F. and Vieu, P. (2006). Nonparametric functional data analysis: theory and practice. Springer Science & Business Media. Ferré, L. and Yao, A.-F. (2005). Smoothed functional inverse regression. Statistica Sinica, pages 665–683. Finn, E. S., Shen, X., Scheinost, D., Rosenberg, M. D., Huang, J., Chun, M. M., Papademetris, X., and Constable, R. T. (2015). Functional connectome fingerprinting: identifying individuals using patterns of brain connectivity. Nature neuroscience, 18(11):1664–1671. Fransson, P., Flodin, P., Seimyr, G. Ö., and Pansell, T. (2014). Slow fluctuations in eye position and resting-state functional magnetic resonance imaging brain activity during visual fixation. European Journal of Neuroscience, 40(12):3828–3835. 149 Gahrooei, M. R., Yan, H., Paynabar, K., and Shi, J. (2018). A novel approach for fusion of heterogeneous sources of data. arXiv preprint arXiv:1803.00138. Gajardo, A., Carroll, C., Chen, Y., Dai, X., Fan, J., Hadjipantelis, P. Z., Han, K., Ji, H., Mueller, H.-G., and Wang, J.-L. (2021). fdapace: Functional Data Analysis and Empirical Dynamics. R package version 0.5.7. Givens, G. H. and Hoeting, J. A. (2012). Computational statistics, volume 703. John Wiley & Sons. Goldsmith, J., Bobb, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2011). Penalized functional regression. Journal of computational and graphical statistics, 20(4):830–851. Goldsmith, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2012). Longitudinal penalized functional regression for cognitive outcomes on neuronal tract measurements. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3):453–469. Goldsmith, J., Scheipl, F., Huang, L., Wrobel, J., Di, C., Gellar, J., Harezlak, J., McLean, M. W., Swihart, B., Xiao, L., Crainiceanu, C., and Reiss, P. T. (2020). refund: Regression with Functional Data. R package version 0.1-23. Goldsmith, J., Zipunnikov, V., and Schrack, J. (2015). Generalized multilevel function-on-scalar regression and principal component analysis. Biometrics, 71(2):344–353. Green, P. J. and Silverman, B. W. (1993). Nonparametric regression and generalized linear models: a roughness penalty approach. CRC Press. Grenander, U. (1950). Stochastic processes and statistical inference. Arkiv för matematik, 1(3):195– 277. Greven, S. and Scheipl, F. (2017). A general framework for functional regression modelling. Statistical Modelling, 17(1-2):1–35. Guhaniyogi, R., Qamar, S., and Dunson, D. B. (2017). Bayesian tensor regression. The Journal of Machine Learning Research, 18(1):2733–2763. Guhaniyogi, R. and Spencer, D. (2018). Bayesian tensor response regression with an application to brain activation studies. Technical report, Technical report, UCSC. 2, 13. Guo, W., Kotsia, I., and Patras, I. (2012). Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827. Hadinejad-Mahram, H., Dahlhaus, D., and Blomker, D. (2002). Karhunen-loéve expansion of vector random processes. Technical report, Technical Report No. IKTNT 1019, Communications Technological Laboratory. 150 Hall, A. R. (2004). Generalized method of moments. OUP Oxford. Hall, P., Horowitz, J. 
L., et al. (2007). Methodology and convergence rates for functional linear regression. The Annals of Statistics, 35(1):70–91. Hall, P. and Hosseini-Nasab, M. (2006). On properties of functional principal components analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):109–126. Hall, P. and Hosseini-Nasab, M. (2009). Theory for high-order bounds in functional principal components analysis. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 146, page 225. Cambridge University Press. Hall, P., Müller, H.-G., and Wang, J.-L. (2006). Properties of principal component methods for functional and longitudinal data analysis. The annals of statistics, pages 1493–1517. Hand, D. and Crowder, M. (2017). Practical longitudinal data analysis. Routledge. Hanke, M., Baumgartner, F. J., Ibe, P., Kaule, F. R., Pollmann, S., Speck, O., Zinke, W., and Stadler, J. (2014). A high-resolution 7-tesla fmri dataset from complex natural stimulation with an audio movie. Scientific data, 1:140003. Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054. Happ, C. and Greven, S. (2018). Multivariate functional principal component analysis for data observed on different (dimensional) domains. Journal of the American Statistical Association, 113(522):649–659. Hardle, W. (1990). Applied Nonparametric Regression. Cambridge University Press. Harshman, R. A. et al. (1970). Foundations of the PARAFAC procedure: Models and conditions for an “explanatory" multimodal factor analysis. Technical report, University of California at Los Angeles Los Angeles, CA. Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):757–779. Hastie, T. J. and Tibshirani, R. J. (2017). Generalized additive models. Routledge. Häusler, C. O. and Hanke, M. (2016). An annotation of cuts, depicted locations, and temporal progression in the motion picture “forrest gump”. F1000Research, 5. He, G., Muller, H., and Wang, J.-L. (2018). Extending correlation and regression from multivariate to functional data. In Asymptotics in statistics and probability, pages 197–210. De Gruyter. 151 He, G., Müller, H.-G., and Wang, J.-L. (2003). Functional canonical analysis for square integrable stochastic processes. Journal of Multivariate Analysis, 85(1):54–77. Hedeker, D. and Gibbons, R. D. (2006). Longitudinal data analysis, volume 451. John Wiley & Sons. Hitchcock, F. L. (1928). Multiple invariants and generalized rank of a p-way matrix or tensor. Journal of Mathematics and Physics, 7(1-4):39–79. Hoff, P. D. (2011). Hierarchical multilinear models for multiway data. Computational Statistics & Data Analysis, 55(1):530–543. Hoff, P. D. (2015). Multilinear tensor regression for longitudinal relational data. The annals of applied statistics, 9(3):1169. Hoff, P. D. et al. (2011). Separable covariance arrays via the tucker product, with applications to multivariate relational data. Bayesian Analysis, 6(2):179–196. Hoover, D. R., Rice, J. A., Wu, C. O., and Yang, L.-P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85(4):809–822. Hörmann, S. and Kokoszka, P. (2010). Weakly dependent functional data. The Annals of Statistics, 38(3):1845–1884. Horváth, L. and Kokoszka, P. (2012). Inference for functional data with applications, volume 200. 
Springer Science & Business Media. Hsing, T. and Eubank, R. (2015). Theoretical foundations of functional data analysis, with an introduction to linear operators, volume 997. John Wiley & Sons. Huang, H., Li, Y., and Guan, Y. (2014). Joint modeling and clustering paired generalized longi- tudinal trajectories with application to cocaine abuse treatment data. Journal of the American Statistical Association, 109(508):1412–1424. Huang, J. Z., Wu, C. O., and Zhou, L. (2002). Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika, 89(1):111–128. Huang, J. Z., Wu, C. O., and Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, pages 763–788. Huettel, S. A., Song, A. W., McCarthy, G., et al. (2004). Functional magnetic resonance imaging, volume 1. Sinauer Associates Sunderland, MA. Ivanescu, A. E., Staicu, A.-M., Scheipl, F., and Greven, S. (2015). Penalized function-on-function regression. Computational Statistics, 30(2):539–568. 152 J Mercer, B. (1909). Xvi. functions of positive and negative type, and their connection the theory of integral equations. Phil. Trans. R. Soc. Lond. A, 209(441-458):415–446. James, G. M., Hastie, T. J., and Sugar, C. A. (2000). Principal component models for sparse functional data. Biometrika, 87(3):587–602. Jenkinson, M., Beckmann, C. F., Behrens, T. E., Woolrich, M. W., and Smith, S. M. (2012). Fsl. Neuroimage, 62(2):782–790. Ji, Y., Wang, Q., Li, X., and Liu, J. (2019). A survey on tensor techniques and applications in machine learning. IEEE Access, 7:162950–162990. Johan, H. (1990). Tensor rank is np-complete. Journal of Algorithms, 4(11):644–654. Karhunen, K. (1946). Zur spektraltheorie stochastischer prozesse. Ann. Acad. Sci. Fennicae, AI, 34. Kato, K. (2012). Estimation in functional linear quantile regression. The Annals of Statistics, 40(6):3108–3136. Kessler, D., Angstadt, M., and Sripada, C. S. (2017). Reevaluating “cluster failure” in fmri using nonparametric control of the false discovery rate. Proceedings of the National Academy of Sciences, 114(17):E3372–E3373. Kiers, H. A. (2000). Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics: A Journal of the Chemometrics Society, 14(3):105–122. Kim, J. S., Maity, A., and Staicu, A.-M. (2018). Additive nonlinear functional concurrent model. Statistics and its interface, 11(4):669–685. Kinson, C., Tang, X., Zuo, Z., and Qu, A. (2020). Longitudinal principal component analysis with an application to marketing data. Journal of Computational and Graphical Statistics, 29(2):335–350. Kokoszka, P. and Reimherr, M. (2017). Introduction to functional data analysis. Chapman and Hall/CRC. Kolda, T. and Bader, B. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455– 500. Kowal, D. R., Matteson, D. S., and Ruppert, D. (2017). A bayesian multivariate functional dynamic linear model. Journal of the American Statistical Association, 112(518):733–744. Kuenzer, T., Hörmann, S., and Kokoszka, P. (2021). Principal component analysis of spatially indexed functions. Journal of the American Statistical Association, 116(535):1444–1456. 153 Li, Y. and Guan, Y. (2014). Functional principal component analysis of spatiotemporal point pro- cesses with applications in disease surveillance. Journal of the American Statistical Association, 109(507):1205–1215. Li, Y. and Hsing, T. (2007). On rates of convergence in functional linear regression. 
Journal of Multivariate Analysis, 98(9):1782–1804. Li, Y. and Hsing, T. (2010). Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics, 38(6):3321–3351. Li, Y., Qiu, Y., and Xu, Y. (2022). From multivariate to functional data analysis: Fundamentals, recent developments, and emerging areas. Journal of Multivariate Analysis, 188:104806. Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22. Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H., and Chen, S. X. (2015). Assess- ing beijing’s pm2. 5 pollution: severity, weather impact, apec and winter heating. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2182):20150257. Lindquist, M. A. (2008). The statistical analysis of fmri data. Statistical science, 23(4):439–464. Liu, B. and Müller, H.-G. (2009). Estimating derivatives for samples of sparsely observed functions, with application to online auction dynamics. Journal of the American Statistical Association, 104(486):704–717. Liu, T., Yuan, M., and Zhao, H. (2017). Characterizing spatiotemporal transcriptome of human brain via low rank tensor decomposition. arXiv preprint arXiv:1702.07449. Liu, Y., Liu, J., and Zhu, C. (2020). Low-rank tensor train coefficient array estimation for tensor- on-tensor regression. IEEE Transactions on Neural Networks and Learning Systems. Lock, E. F. (2018). Tensor-on-tensor regression. Journal of Computational and Graphical Statistics, 27(3):638–647. Loève, M. (1946). Functions aleatoire de second ordre. Revue science, 84:195–206. Long, X., Liao, W., Jiang, C., Liang, D., Qiu, B., and Zhang, L. (2012). Healthy aging: an automatic analysis of global and regional morphological alterations of human brain. Academic radiology, 19(7):785–793. Lu, C. and Wooldridge, J. M. (2020). A gmm estimator asymptotically more efficient than ols and wls in the presence of heteroskedasticity of unknown form. Applied Economics Letters, 27(12):997–1001. 154 Mallows, C. L. (1973). Some comments on cp. Technometrics, 15(4):661–675. McCullagh, P. and Nelder, J. (1989). Generalized linear models ii. Melzer, A., Härdle, W. K., and Cabrera, B. L. (2019). Joint tensor expectile regression for electricity day-ahead price curves. Available at SSRN 3363167. Mori, S., Oishi, K., Jiang, H., Jiang, L., Li, X., Akhter, K., Hua, K., Faria, A. V., Mahmood, A., Woods, R., et al. (2008). Stereotaxic white matter atlas based on diffusion tensor imaging in an icbm template. Neuroimage, 40(2):570–582. Morris, J. S. (2015). Functional regression. Annual Review of Statistics and Its Application, 2:321–359. Müller, H.-G. (2016). Peter hall, functional data analysis and random objects. The Annals of Statistics, 44(5):1867–1887. Müller, H.-G. and Stadtmüller, U. (2005). Generalized functional linear models. the Annals of Statistics, 33(2):774–805. Müller, H.-G. and Yao, F. (2010). Empirical dynamics for longitudinal data. The Annals of Statistics, 38(6):3458–3486. Muschelli, J., Sweeney, E., Lindquist, M., and Crainiceanu, C. (2015). fslr: Connecting the fsl software with r. The R Journal, 7(1):163–175. Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Economet- rica: Journal of the Econometric Society, pages 809–837. Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of econometrics, 4:2111–2245. 
Ombao, H., Lindquist, M., Thompson, W., and Aston, J. (2016). Handbook of neuroimaging data analysis. Chapman and Hall/CRC. O’Donnell, L. J. and Westin, C.-F. (2011). An introduction to diffusion tensor image analysis. Neurosurgery Clinics, 22(2):185–196. Park, S. Y. and Staicu, A.-M. (2015). Longitudinal functional data analysis. Stat, 4(1):212–226. Porter, D., Stirling, D. S., and David, P. (1990). Integral equations: a practical treatment, from spectral theory to applications, volume 5. Cambridge university press. Qu, A. and Li, R. (2006). Quadratic inference functions for varying-coefficient models with longitudinal data. Biometrics, 62(2):379–391. 155 Qu, A., Lindsay, B. G., and Li, B. (2000). Improving generalised estimating equations using quadratic inference functions. Biometrika, 87(4):823–836. Ramsay, J. O. (1982). When the data are functions. Psychometrika, 47(4):379–396. Ramsay, J. O. and Dalzell, C. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):539–561. Ramsay, J. O. and Silverman, B. W. (2005). Functional data analysis. Springer series in statistics. Ramsay, J. O. and Silverman, B. W. (2007). Applied functional data analysis: methods and case studies. Springer. Rao, C. R. (1958). Some statistical methods for comparison of growth curves. Biometrics, 14(1):1– 17. Raskutti, G., Yuan, M., and Chen, H. (2019). Convex regularization for high-dimensional multire- sponse tensor regression. Ann. Statist., 47(3):1554–1584. Rice, J. A. and Silverman, B. W. (1991). Estimating the mean and covariance structure non- parametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological), 53(1):233–243. Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57(1):253–259. Riss, F. and Sz-Nagy, B. (1990). Functional analysis. Dover Publications Inc. Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Cambridge university press. Scheipl, F., Staicu, A.-M., and Greven, S. (2015). Functional additive mixed models. Journal of Computational and Graphical Statistics, 24(2):477–501. Schumaker, L. (2007). Spline functions: basic theory. Cambridge University Press. Sengupta, A., Kaule, F. R., Guntupalli, J. S., Hoffmann, M. B., Häusler, C., Stadler, J., and Hanke, M. (2016). A studyforrest extension, retinotopic mapping and localization of higher visual areas. Scientific data, 3:160093. Shen, X., Tokoglu, F., Papademetris, X., and Constable, R. T. (2013). Groupwise whole-brain parcellation from resting-state fmri data for network node identification. Neuroimage, 82:403– 415. Sidiropoulos, N. D. and Bro, R. (2000). On the uniqueness of multilinear decomposition of n-way 156 arrays. Journal of Chemometrics: A Journal of the Chemometrics Society, 14(3):229–239. Staicu, A.-M., Islam, M. N., Dumitru, R., and Heugten, E. v. (2020). Longitudinal dynamic functional regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(1):25–46. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133. Su, L., Murtazashvili, I., and Ullah, A. (2013). Local linear gmm estimation of functional coefficient iv models with an application to estimating the rate of return to schooling. Journal of Business & Economic Statistics, 31(2):184–207. Sun, Y. (2016). 
Functional-coefficient spatial autoregressive models with nonparametric spatial weights. Journal of Econometrics, 195(1):134–153. Taris, T. (2000). Longitudinal data analysis. Sage. Tian, R., Xue, L., and Liu, C. (2014). Penalized quadratic inference functions for semiparamet- ric varying coefficient partially linear models with longitudinal data. Journal of Multivariate Analysis, 132:94–110. Tran, K. C. and Tsionas, E. G. (2009). Local gmm estimation of semiparametric panel data with smooth coefficient models. Econometric Reviews, 29(1):39–61. Tucker, L. R. (1958). Determination of parameters of a functional relation by factor analysis. Psychometrika, 23(1):19–23. Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311. Turner, E. L., Li, F., Gallis, J. A., Prague, M., and Murray, D. M. (2017). Review of recent methodological developments in group-randomized trials: part 1—design. American journal of public health, 107(6):907–915. Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer. van Delft, A. and Eichler, M. (2018). Locally stationary functional time series. Electronic Journal of Statistics, 12(1):107–170. Van Hecke, W., Emsell, L., and Sunaert, S. (2016). Diffusion tensor imaging: a practical handbook. Springer. Viviani, R., Grön, G., and Spitzer, M. (2005). Functional principal component analysis of fmri 157 data. Human brain mapping, 24(2):109–129. Wager, T. D. and Lindquist, M. A. (2015). Principles of fmri. New York: Leanpub. Wahba, G. (1990). Spline models for observational data, volume 59. Siam. Wand, M. and Jones, M. (1995). Kernel smoothing chapman & hall. Monographs on Statistics and Applied Probability, London. Wang, J.-L., Chiou, J.-M., and Müller, H.-G. (2016). Functional data analysis. Annual Review of Statistics and Its Application, 3:257–295. Wang, L. (2008). Karhunen-Loéve expansions and their applications. London School of Economics and Political Science (United Kingdom). Wang, L., Li, H., and Huang, J. Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association, 103(484):1556–1569. Wasserman, L. (2006). All of nonparametric statistics. Springer Science & Business Media. Wu, C. O. and Chiang, C.-T. (2000). Kernel smoothing on varying coefficient models with longitudinal dependent variable. Statistica Sinica, pages 433–456. Xiong, Y., Zhou, X. J., Nisi, R. A., Martin, K. R., Karaman, M. M., Cai, K., and Weaver, T. E. (2017). Brain white matter changes in cpap-treated obstructive sleep apnea patients with residual sleepiness. Journal of Magnetic Resonance Imaging, 45(5):1371–1378. Xu, Y., Li, Y., and Nettleton, D. (2018). Nested hierarchical functional data modeling and inference for the analysis of functional plant phenotypes. Journal of the American Statistical Association, 113(522):593–606. Xue, L., Shu, X., and Qu, A. (2018). Time-varying estimation and dynamic model selection with an application of network data. Statistica Sinica. Yao, F. and Lee, T. C. (2006). Penalized spline models for functional principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):3–25. Yao, F., Müller, H.-G., Clifford, A. J., Dueker, S. R., Follett, J., Lin, Y., Buchholz, B. A., and Vogel, J. S. (2003). 
Shrinkage estimation for functional principal component scores with application to the population kinetics of plasma folate. Biometrics, 59(3):676–685. Yao, F., Müller, H.-G., and Wang, J.-L. (2005). Functional linear regression analysis for longitudinal data. The Annals of Statistics, pages 2873–2903. 158 Yao, W. and Li, R. (2013). New local estimation procedure for a non-parametric regression function for longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):123–138. Yu, H., Tong, G., and Li, F. (2020). A note on the estimation and inference with quadratic inference functions for correlated outcomes. Communications in Statistics-Simulation and Computation, pages 1–12. Yu, K. and Jones, M. (2004). Likelihood-based local linear estimation of the conditional variance function. Journal of the American Statistical Association, 99(465):139–144. Zhang, C., Peng, H., and Zhang, J.-T. (2010). Two samples tests for functional data. Communications in Statistics—Theory and Methods, 39(4):559–578. Zhang, J.-T. (2013). Analysis of variance for functional data. Chapman and Hall/CRC. Zhang, J.-T. and Chen, J. (2007). Statistical inferences for functional data. The Annals of Statistics, 35(3):1052–1079. Zhang, L. and Banerjee, S. (2021). Spatial factor modeling: A bayesian matrix-normal approach for misaligned data. Biometrics. Zhang, X., Li, L., Zhou, H., Shen, D., et al. (2014). Tensor generalized estimating equations for longitudinal imaging analysis. arXiv preprint arXiv:1412.6592. Zhang, X. and Wang, J.-L. (2016). From sparse to dense functional data and beyond. The Annals of Statistics, 44(5):2281–2321. Zhao, M., Gao, Y., and Cui, Y. (2020). Variable selection for longitudinal varying coefficient errors-in-variables models. Communications in Statistics-Theory and Methods, pages 1–26. Zhao, M., Xu, X., Zhu, Y., Zhang, K., and Zhou, Y. (2021). Model estimation and selection for partial linear varying coefficient ev models with longitudinal data. Journal of Applied Statistics, pages 1–23. Zheng, X., Xue, L., and Qu, A. (2018). Time-varying correlation structure estimation and local- feature detection for spatio-temporal data. Journal of Multivariate Analysis, 168:221–239. Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552. PMID: 24791032. Zhou, J. and Qu, A. (2012). Informative estimation and selection of correlation structure for longitudinal data. Journal of the American Statistical Association, 107(498):701–710. Zhou, L., Huang, J. Z., and Carroll, R. J. (2008). Joint modelling of paired sparse functional data 159 using principal components. Biometrika, 95(3):601–619. Zhou, L., Huang, J. Z., Martinez, J. G., Maity, A., Baladandayuthapani, V., and Carroll, R. J. (2010). Reduced rank mixed effects models for spatially correlated hierarchical functional data. Journal of the American Statistical Association, 105(489):390–400. Zhu, H., Fan, J., and Kong, L. (2014). Spatially varying coefficient model for neuroimaging data with jump discontinuities. Journal of the American Statistical Association, 109(507):1084–1098. Zhu, H., Li, R., and Kong, L. (2012). Multivariate varying coefficient model for functional responses. Annals of statistics, 40(5):2634. 160