SEMI-PARAMETRIC ESTIMATION OF BIVARIATE DEPENDENCE UNDER MIXED MARGINALS

By

Wenmei Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

2011

ABSTRACT

SEMI-PARAMETRIC ESTIMATION OF BIVARIATE DEPENDENCE UNDER MIXED MARGINALS

By Wenmei Huang

Copulas are a flexible way of modeling dependence in a set of variables where the association between variables can be elicited separately from the marginal distributions. A semi-parametric approach for estimating the dependence structure of bivariate distributions derived from copulas is investigated when the associated marginals are mixed, that is, consisting of both discrete and continuous components. A semi-parametric likelihood approach is proposed for obtaining the estimator of the dependence parameter under unknown marginals. The consistency and asymptotic normality of the estimator are established as the sample size tends to infinity. For constructing confidence intervals in practice, an estimator of the asymptotic variance is proposed and its properties are investigated via simulation. Extensions to higher dimensions are discussed. Several simulation studies and real data examples are provided to investigate the application of the developed methodology of inference. This work generalizes the results of Genest et al. [11] on the estimation of dependence when the marginals are continuous.

To my professors and my loving family.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my dissertation advisor, Prof. Sarat C. Dass, for his patient guidance, encouragement and support during my graduate study at Michigan State University, and for his precious advice for my professional career. He helped me to learn not only how to do research in statistics, but also how to write statistical papers in proper English. The education I received under his guidance will be the most valuable in my future career.

I am grateful to Prof. Tapabrata Maiti and Prof. Chae Young Lim of the Department of Statistics and Probability, and Prof. Zhengfang Zhou of the Department of Mathematics, for serving on my thesis committee.

I accumulated practical experience from the consulting work at the Center for Statistical Training and Consulting. Prof. Connie Page, Prof. Dennis Gilliland, and Prof. Sandra Herman shared with me their experience and knowledge in dealing with practical problems, and guided me through my projects.

I would like to thank Prof. James Stapleton for his help and encouragement with respect to my teaching and study. He has been very supportive of me, as well as of all his students. I thank him for offering to review my thesis.

I would like to thank Prof. Anil K. Jain of the Department of Computer Science and Engineering, Michigan State University, for letting me use the biometrics data of the PRIP lab in my research. I would like to thank Dr. Yongfang Zhu for discussions on course work and research. I also would like to thank all the professors of the Department of Statistics and Probability and all my great friends at Michigan State University for making my life there so unique and wonderful.

Finally, I especially thank my loving family for their encouragement and support.

TABLE OF CONTENTS
List of Tables
List of Figures

1 INTRODUCTION

2 A MODEL FOR BIVARIATE DISTRIBUTIONS
  2.1 Joint Distributions with Mixed Marginals
  2.2 Semi-Parametric Estimation of $\theta$

3 ASYMPTOTIC PROPERTIES OF SEMI-PARAMETRIC ESTIMATOR IN BIVARIATE DISTRIBUTIONS
  3.1 Statement of the Main Theorem
  3.2 Consistency of $\hat{\theta}_n$
  3.3 Asymptotic Behavior of $B_n$
  3.4 Asymptotic Behavior of $A_n$
    3.4.1 Asymptotic normality of $R_n^{cc}$
    3.4.2 Asymptotic normality of $R_n^{cd}$ and $R_n^{dc}$
    3.4.3 Asymptotic normality of $R_n^{dd}$
    3.4.4 Asymptotic normality of $R_n$
  3.5 Proof of the Main Result

4 A VARIANCE ESTIMATOR OF BIVARIATE DISTRIBUTIONS

5 EXTENSIONS TO HIGHER DIMENSIONS

6 JOINT DISTRIBUTIONS USING t-COPULA
  6.1 Joint Distributions Using t-copula
  6.2 Estimation of R and $\nu$
    6.2.1 Estimation of R for fixed $\nu$
    6.2.2 Selection of the Degrees of Freedom, $\nu$
  6.3 Summary

7 SIMULATION AND REAL DATA RESULTS
  7.1 Simulation Results
  7.2 Application to Multimodal Fusion in Biometric Recognition
    7.2.1 Introduction
    7.2.2 Application in Biometric Fusion

8 SUMMARY AND CONCLUSION

Bibliography

LIST OF TABLES

4.1 Relationship among $(X_j, Y_j)$, $R_{X(j)}$, and $R_{Y(j)}$.
7.1 Simulation results for Experiment 1 with $\nu_0 = \infty$ and $\rho = 0.75$. The absolute bias and relative absolute bias of the estimator $\hat{\rho}_n$ are provided, together with the empirical coverage of the approximate 95% confidence interval for $\rho$ based on the asymptotic normality of $\hat{\rho}_n$.
7.2 $L_1$ distances for paired values of $\nu_0$ corresponding to two values of $\rho_0$, 0.20 and 0.75.
7.3 Simulation results for Experiment 2 with $\nu_0 = \infty$ and $\rho_0 = 0.75$.
7.4 Simulation results for Experiment 3 with $\nu_0 = 10$. The two correlation values considered are $\rho_0 = 0.2$ and $\rho_0 = 0.75$.
7.5 Simulation results for Experiment 4.
7.6 Results of the estimation procedure for R and $\nu$ based on the NIST database.

LIST OF FIGURES

7.1 Density curves for the t distribution with degrees of freedom $\nu = 3, 5, 10, 15, 20, 25$ and the normal distribution. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
7.2 Some examples of biometric traits: (a) fingerprint, (b) iris scan, (c) face scan, (d) signature, (e) voice, (f) hand geometry, (g) retina, and (h) ear.
7.3 Histograms of matching scores corresponding to genuine scores for Matcher 1. Continuous (respectively, generalized) density estimators are given by the dashed (respectively, solid) lines.
7.4 Histograms of matching scores corresponding to genuine scores for Matcher 2. Continuous (respectively, generalized) density estimators are given by the dashed (respectively, solid) lines.
7.5 Histograms of matching scores corresponding to impostor scores for Matcher 1. Continuous (respectively, generalized) density estimators are given by the dashed (respectively, solid) lines. The spike corresponds to discrete components. Note how the generalized density estimator performs better compared to the continuous estimator (which assumes no discrete components).
7.6 Histograms of matching scores corresponding to impostor scores for Matcher 2. Continuous (respectively, generalized) density estimators are given by the dashed (respectively, solid) lines.
7.7 Performance of copula fusion on the MSU-Multimodal database.
7.8 Performance of copula fusion on the NIST database.

CHAPTER 1

INTRODUCTION

Copulas are functions that join, or "couple", multivariate distribution functions to their one-dimensional marginal distribution functions. Essentially, a copula is a function $C$ from $[0,1]^D$ ($D \ge 2$) to $[0,1]$ with the following properties:

1. For every $(u_1, \ldots, u_D) \in [0,1]^D$,
$$C(u_1, \ldots, u_D) = 0 \quad \text{if at least one } u_k = 0, \ k = 1, \ldots, D, \qquad (1.1)$$
and
$$C(u_1, \ldots, u_D) = u_k \quad \text{if } u_j = 1 \text{ for all } j \ne k, \ k = 1, \ldots, D. \qquad (1.2)$$

2. For every $(u_1, \ldots, u_D)$ and $(v_1, \ldots, v_D) \in [0,1]^D$ such that $u_k \le v_k$ for $k = 1, \ldots, D$, which define the $D$-box $V = [u_1, v_1] \times [u_2, v_2] \times \cdots \times [u_D, v_D] \subset [0,1]^D$,
$$\sum_{c} \mathrm{sgn}(c)\, C(c) \ge 0, \qquad (1.3)$$
where the sum is taken over all vertices $c$ of $V$ and $\mathrm{sgn}(c)$ is given by
$$\mathrm{sgn}(c) = \begin{cases} 1, & \text{if } c_k = u_k \text{ for an even number of } k\text{'s}, \\ -1, & \text{if } c_k = u_k \text{ for an odd number of } k\text{'s}. \end{cases}$$

Alternatively, copulas are multivariate distribution functions whose one-dimensional marginals are uniform on the interval $[0,1]$.

Since their introduction by Sklar [46], copulas have proved to be a useful tool for analyzing multivariate dependence structures due to many unique and interesting features. Let $F_k(x_k) = P(X_k \le x_k)$ for $k = 1, 2, \ldots, D$ denote $D$ continuous distribution functions on the real line, and let $F$ denote a joint distribution function on $\mathbb{R}^D$ whose $k$-th marginal distribution corresponds to $F_k$. According to Sklar's Theorem [46], there exists a unique function $C(u_1, \ldots, u_D)$ from $[0,1]^D$ to $[0,1]$ satisfying
$$F(x_1, \ldots, x_D) = C(F_1(x_1), \ldots, F_D(x_D)), \qquad (1.4)$$
where $x_1, \ldots, x_D$ are $D$ real numbers. The function $C$ is known as a $D$-copula function that couples the one-dimensional distribution functions $F_k$, $k = 1, 2, \ldots, D$ to obtain $F$. If not all marginals are continuous, $C$ is uniquely determined only on $\mathrm{Ran}F_1 \times \cdots \times \mathrm{Ran}F_D$, where $\mathrm{Ran}F_k$ is the range of $F_k$.
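To make the construction in (1.4) concrete, the following minimal sketch (in Python, assuming scipy is available; the helper names are ours, not part of the dissertation) builds a bivariate distribution function from a normal and an exponential marginal coupled by a Gaussian copula, and checks numerically that the resulting function is grounded and recovers its marginals:

```python
import numpy as np
from scipy.stats import norm, expon, multivariate_normal

def joint_cdf(x1, x2, rho, F1=norm.cdf, F2=expon.cdf):
    """F(x1, x2) = C_rho(F1(x1), F2(x2)), with C_rho the Gaussian copula."""
    u, v = F1(x1), F2(x2)
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(
        [norm.ppf(u), norm.ppf(v)])

# Marginals are recovered by sending the other argument toward +infinity:
print(joint_cdf(10.0, 1.0, rho=0.5), expon.cdf(1.0))   # nearly equal
print(joint_cdf(-10.0, 1.0, rho=0.5))                   # nearly 0 (grounded)
```

Any copula family could be substituted for the Gaussian one here; only the mapping $(u, v) \mapsto C_\rho(u, v)$ changes.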
Equation (1.4) can also be used to construct $D$-dimensional distribution functions $F$ whose marginals are the pre-specified distributions $F_k$, $k = 1, \ldots, D$: choose a copula function $C$ and define the distribution function $F$ as in (1.4). It follows that $F$ is a $D$-dimensional distribution function with marginals $F_k$, $k = 1, \ldots, D$.

Investigating dependence structures of multivariate distributions has always been an important area for researchers; see, for example, [26], [27], and [35]. For example, one of the central problems in statistics concerns testing the hypothesis that random variables are independent. Prior to the very recent explosion of copula theory and application, the only models available in many fields to represent the dependence structure were the classical multivariate models, such as the multivariate Gaussian model. These models entail rigid assumptions on the marginal and joint behaviors of the variables, and are therefore of limited usefulness.

Though the term "copula" was first used by Sklar [46] in 1959 and traces of copula theory can be found in Hoeffding's work during the 1940s, the study of copulas and their application in statistics is a rather modern phenomenon. Earlier efforts have addressed different statistical aspects of specialized as well as general copula-based joint distributions. Inference procedures for bivariate Archimedean copulas have been developed and discussed by Genest et al. [12]. Demarta et al. [8] studied properties related to t-copulas, whereas Genest et al. [11] and [13] gave goodness-of-fit procedures for general copulas. Estimation techniques and properties of the estimators have been well studied, with applications ranging from statistics to mathematical finance and financial risk management; see, for example, Shih et al. [44], Embrechts et al. [9], Cherubini et al. [5], Frey et al. [10] and Chen et al. [4].

Many multivariate models for dependence can be generated by parametric families of copulas, $\{C_\theta : \theta \in \Theta\}$, typically indexed by a real or vector-valued parameter $\theta$. Examples of such systems are given in [23], [30], [22], and others. The recent monograph by Hutchinson and Lai [17], which includes an extensive bibliography, constitutes a handy reference to this expanding literature.

Copula-based models are natural in situations where learning about the association between the variables is important, since the effect of the dependence structure is easily separated from that of the marginals. In such situations, there is typically enough data to obtain nonparametric estimators of the marginal distributions, but insufficient information to afford nonparametric estimation of the structure of the association. In such cases, it is convenient to adopt a parametric form for the dependence function while keeping the marginals unspecified. To estimate the dependence parameter $\theta$, two strategies could be adopted depending on the circumstances: (i) if valid parametric models are already available for the marginals, then it is straightforward in principle to write down a likelihood function for the data; this makes the estimation of $\theta$ margin-dependent, because the estimators of the parameters involved in the marginals would be indirectly affected by the copula. This is the parametric approach. (ii) When nonparametric estimators are contemplated for the marginals, however, inference about the dependence parameter $\theta$ will be margin-free. This is the semi-parametric approach.
Clayton [6], Hougaard et al. [15] and Oakes [36] have pointed out that the margin-free requirement is sensible in applications where the focus of the analysis is on the dependence structure.

Given a sample of $n$ observations $\mathbf{X}_j = (X_{1j}, X_{2j}, \ldots, X_{Dj})$, $j = 1, 2, \ldots, n$ from the joint distribution $F(x_1, x_2, \ldots, x_D) = C_\theta(F_1(x_1), \ldots, F_D(x_D))$, the estimation procedure involves selecting $\hat{\theta}_n$ to maximize the semi-parametric likelihood
$$\ell(\theta) = \sum_{j=1}^{n} \log\left[ c_\theta(F_{1n}(X_{1j}), \ldots, F_{Dn}(X_{Dj})) \right], \qquad (1.5)$$
where $F_{kn}$ denotes $n/(n+1)$ times the empirical distribution function of the $k$-th component observations $X_{kj}$, $j = 1, 2, \ldots, n$, and $c_\theta$ is the density of $C_\theta$ with respect to Lebesgue measure on $[0,1]^D$. The use of $F_{kn}$ instead of the empirical distribution itself avoids difficulties arising from the potential unboundedness of $\log c_\theta(u_1, \ldots, u_D)$ as some of the $u_k$'s tend to 1.

A central assumption made in the earlier studies is that the marginals associated with the joint distribution $F$ are continuous. In reality, there are many situations where this assumption is not satisfied and the marginals can be mixed, that is, they contain both discrete and continuous components (see, for example, Kohn et al. [37]). Based on the continuity assumption and regularity conditions, Genest et al. [11] have shown that $n^{1/2}(\hat{\theta}_n - \theta)$ is asymptotically normal, where $\hat{\theta}_n$ is the semi-parametric estimator of $\theta$ in (1.5). Our work extends the previous methodology by accounting for mixed marginals for $D = 2$ and $\Theta \subset \mathbb{R}$.

Uniqueness is an important consideration when estimating the unknown parameters corresponding to a copula. Since our mixed marginals have discrete components, one way to achieve uniqueness is to restrict $C$ to belong to a particular parametric family. Under this assumption, we develop an estimation technique for finding the semi-parametric maximum likelihood estimator, $\hat{\theta}_n$, for the parametric family $\{C_\theta : \theta \in \Theta\}$ in the bivariate situation in Chapter 2. The estimation methodology involves integrals corresponding to the discrete components and is therefore non-standard. Consistency and asymptotic normality of the semi-parametric maximum likelihood estimator $\hat{\theta}_n$ are developed in Chapter 3. A variance estimator of $\hat{\theta}_n$ is developed in Chapter 4. Note that for multivariate observations, copulas allow flexible modeling of the joint distribution via its marginal distributions as well as the correlation between pairwise components of the vector of observations. Therefore, the theory presented in Chapters 2 and 3 can be extended to higher dimensions; this is given in Chapter 5. In Chapter 6, $C$ is restricted to a particular parametric family of copulas, the t-copula, and findings related to the t-copula are provided. Numerical simulations and application of the methodology to real biometric data are presented in Chapter 7. We summarize our work and discuss future research directions in Chapter 8.

CHAPTER 2

A MODEL FOR BIVARIATE DISTRIBUTIONS

2.1 Joint Distributions with Mixed Marginals

Let $F$ and $G$ be two distributions on the real line that are mixed, that is, both $F$ and $G$ consist of a mixture of discrete as well as continuous components. The general form for $F$ is given by
$$F(x) = \sum_{h=1}^{d_F} p_{Fh}\, I_{\{D_{Fh} \le x\}} + \Big(1 - \sum_{h=1}^{d_F} p_{Fh}\Big) \int_{-\infty}^{x} f(w)\, dw, \qquad (2.1)$$
where $I_{\{A\}}$ is the indicator function of the set $A$ (that is, $I_{\{A\}} = 1$ if $A$ is true, and 0 otherwise); in (2.1), the distribution function $F$ consists of $d_F$ discrete components denoted by $D_{Fh}$ with $F(D_{Fh}) - F(D_{Fh}^-) = p_{Fh}$ for $h = 1, 2, \ldots, d_F$, where $x^-$ denotes the left limit at $x$, and $f$ is the density of the continuous component of $F$.
The discrete components $D_{Fh}$, $h = 1, 2, \ldots, d_F$ correspond to jump points in the distribution function $F$, and are called the jump points of $F$. All other points on $\mathbb{R}$ correspond to points of continuity of $F$, that is, $F(x) - F(x^-) = 0$ if $x \ne D_{Fh}$. The sets of jump and continuity points of $F$ are denoted by $\mathcal{J}(F)$ and $\mathcal{C}(F)$, respectively. In the same way, writing
$$G(y) = \sum_{l=1}^{d_G} p_{Gl}\, I_{\{D_{Gl} \le y\}} + \Big(1 - \sum_{l=1}^{d_G} p_{Gl}\Big) \int_{-\infty}^{y} g(w)\, dw, \qquad (2.2)$$
similar definitions can be given for the quantities $d_G$, $p_{Gl}$, $D_{Gl}$ and $g$. We denote the sets of jump and continuity points of $G$ by $\mathcal{J}(G)$ and $\mathcal{C}(G)$, respectively.

Let $H_\theta$ be the bivariate function defined by
$$H_\theta(x, y) = C_\theta(F(x), G(y)) \qquad (2.3)$$
for $\theta \in \Theta$, with $F$ and $G$ as in (2.1) and (2.2), respectively. It follows from the properties of a copula function that

Theorem 2.1.1. The function $H_\theta$, as defined in (2.3), is a valid bivariate distribution function on $\mathbb{R}^2$ with marginals $F$ and $G$.

Proof: For $H_\theta(x,y)$ to be a valid distribution function on $\mathbb{R}^2$, the following four conditions should be satisfied: (1) $0 \le H_\theta(x,y) \le 1$ for all $(x,y)$, (2) $H_\theta(x,y) \to 0$ as $\max(x,y) \to -\infty$, (3) $H_\theta(x,y) \to 1$ as $\min(x,y) \to +\infty$, and (4) for every $x_1 \le x_2$ and $y_1 \le y_2$,
$$\Delta H \equiv H_\theta(x_2,y_2) - H_\theta(x_1,y_2) - H_\theta(x_2,y_1) + H_\theta(x_1,y_1) \ge 0.$$
Note that since $H_\theta(x,y) = C_\theta(F(x), G(y)) = P(U \le F(x), V \le G(y))$ for $(U,V)$ distributed according to $C_\theta$, (1) follows. To prove (2), note that $F(x) \to 0$ and $G(y) \to 0$ when $\max(x,y) \to -\infty$. Hence $H_\theta(x,y) \to 0$ from the property of a cumulative distribution function. (3) follows similarly. To prove (4), we use the facts that $F(x_1) \le F(x_2)$ for $x_1 \le x_2$ and $G(y_1) \le G(y_2)$ for $y_1 \le y_2$, and
$$\Delta H = \int_{[F(x_1), F(x_2)] \times [G(y_1), G(y_2)]} c_\theta(u,v)\, du\, dv. \qquad (2.4)$$
Since $c_\theta(u,v) \ge 0$, $\Delta H \ge 0$ follows. Proof completed.

We turn now to a characterization of the density, $h_\theta(x,y)$, of $H_\theta(x,y)$. For a fixed $(x,y) \in \mathbb{R}^2$, the two components of $(x,y)$ correspond to either a point of continuity or a jump point of the corresponding marginal distribution function. For $(X,Y) \sim H_\theta$, $h_\theta(x,y)$ has four different expressions, namely,
$$h_\theta(x,y) = \begin{cases} \lim_{a,b \to 0} \dfrac{P\{X \in (x, x+a],\ Y \in (y, y+b]\}}{ab} & \text{if } x \in \mathcal{C}(F),\ y \in \mathcal{C}(G), \\[1ex] \lim_{b \to 0} \dfrac{P\{X = x,\ Y \in (y, y+b]\}}{b} & \text{if } x \in \mathcal{J}(F),\ y \in \mathcal{C}(G), \\[1ex] \lim_{a \to 0} \dfrac{P\{X \in (x, x+a],\ Y = y\}}{a} & \text{if } x \in \mathcal{C}(F),\ y \in \mathcal{J}(G), \\[1ex] P\{X = x,\ Y = y\} & \text{if } x \in \mathcal{J}(F),\ y \in \mathcal{J}(G). \end{cases} \qquad (2.5)$$

The following theorem gives workable expressions for $h_\theta(x,y)$ in terms of the copula:

Theorem 2.1.2. For each $(x,y) \in \mathbb{R}^2$,
$$h_\theta(x,y) = \begin{cases} f(x)\, g(y)\, c_\theta(F(x), G(y)) & \text{if } x \in \mathcal{C}(F),\ y \in \mathcal{C}(G), \\[1ex] g(y) \displaystyle\int_{[F(x^-), F(x)]} c_\theta(u, G(y))\, du & \text{if } x \in \mathcal{J}(F),\ y \in \mathcal{C}(G), \\[1ex] f(x) \displaystyle\int_{[G(y^-), G(y)]} c_\theta(F(x), v)\, dv & \text{if } x \in \mathcal{C}(F),\ y \in \mathcal{J}(G), \\[1ex] \displaystyle\int_{[F(x^-), F(x)] \times [G(y^-), G(y)]} c_\theta(u, v)\, du\, dv & \text{if } x \in \mathcal{J}(F),\ y \in \mathcal{J}(G). \end{cases} \qquad (2.6)$$

Proof: When $x \in \mathcal{C}(F)$ and $y \in \mathcal{C}(G)$,
$$P\{X \in (x, x+a],\ Y \in (y, y+b]\} = \int_{[F(x), F(x+a)] \times [G(y), G(y+b)]} c_\theta(u,v)\, du\, dv$$
from (2.4). This is approximately $a\, b\, f(x)\, g(y)\, c_\theta(F(x), G(y))$ for small $a$ and $b$. When $x \in \mathcal{J}(F)$ and $y \in \mathcal{C}(G)$, note that
$$P\{X = x,\ Y \in (y, y+b]\} = \int_{[F(x^-), F(x)] \times [G(y), G(y+b)]} c_\theta(u,v)\, du\, dv,$$
again from (2.4), which is approximately $b\, g(y) \int_{[F(x^-), F(x)]} c_\theta(u, G(y))\, du$ for small $b$. The third case follows similarly and the fourth expression is straightforward. Proof completed.
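Theorem 2.1.2 is easy to make operational for a specific family. The sketch below (a hypothetical illustration assuming a Gaussian copula with correlation $\rho$ as the parametric family; it is not the dissertation's code) evaluates the copula factor of $h_\theta$ in all four cases: the two single-jump cases reduce to differences of conditional normal cdfs, and the double-jump case is a rectangle probability of $C_\theta$.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gauss_copula_cdf(u, v, rho):
    """C_theta(u, v) = Phi_2(Phi^{-1}(u), Phi^{-1}(v); rho)."""
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(
        [norm.ppf(u), norm.ppf(v)])

def gauss_copula_density(u, v, rho):
    """c_theta(u, v) for the Gaussian copula."""
    x, y = norm.ppf(u), norm.ppf(v)
    det = 1.0 - rho**2
    return np.exp(-(rho**2 * (x**2 + y**2) - 2*rho*x*y) / (2*det)) / np.sqrt(det)

def c_star(u1, u2, v1, v2, rho):
    """Copula factor of (2.6): u1 = F(x-), u2 = F(x), v1 = G(y-), v2 = G(y)."""
    x_jump, y_jump = u1 < u2, v1 < v2
    s = np.sqrt(1.0 - rho**2)
    if not x_jump and not y_jump:              # both continuity points
        return gauss_copula_density(u2, v2, rho)
    if x_jump and not y_jump:                  # integral over [u1,u2] at v = v2
        z = norm.ppf(v2)                       # = conditional copula difference
        return norm.cdf((norm.ppf(u2) - rho*z)/s) - norm.cdf((norm.ppf(u1) - rho*z)/s)
    if not x_jump and y_jump:                  # symmetric single-jump case
        z = norm.ppf(u2)
        return norm.cdf((norm.ppf(v2) - rho*z)/s) - norm.cdf((norm.ppf(v1) - rho*z)/s)
    # both jump points: rectangle mass of C_theta over [u1,u2] x [v1,v2]
    return (gauss_copula_cdf(u2, v2, rho) - gauss_copula_cdf(u1, v2, rho)
            - gauss_copula_cdf(u2, v1, rho) + gauss_copula_cdf(u1, v1, rho))
```

For copula families without closed-form conditional distributions, the single-jump integrals would instead be computed by numerical quadrature.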
The terms in $h_\theta(x,y)$ that involve the densities $f$ and $g$ do not depend on $\theta$ and can, hence, be ignored for the estimation of $\theta$. Subsequently, we focus on the function $c^*_\theta \equiv c^*_\theta(F(x^-), F(x), G(y^-), G(y))$ defined by
$$c^*_\theta = \begin{cases} c_\theta(F(x), G(y)) & \text{if } x \in \mathcal{C}(F),\ y \in \mathcal{C}(G), \\[1ex] \displaystyle\int_{[F(x^-), F(x)]} c_\theta(u, G(y))\, du & \text{if } x \in \mathcal{J}(F),\ y \in \mathcal{C}(G), \\[1ex] \displaystyle\int_{[G(y^-), G(y)]} c_\theta(F(x), v)\, dv & \text{if } x \in \mathcal{C}(F),\ y \in \mathcal{J}(G), \\[1ex] \displaystyle\int_{[F(x^-), F(x)] \times [G(y^-), G(y)]} c_\theta(u, v)\, du\, dv & \text{if } x \in \mathcal{J}(F),\ y \in \mathcal{J}(G). \end{cases} \qquad (2.7)$$

2.2 Semi-Parametric Estimation of θ

If not all marginals are continuous, $C$ in (1.4) is no longer unique. Uniqueness is an important consideration when estimating the unknown parameters corresponding to the copula function. Since our marginals have discrete components, one way to achieve uniqueness is to restrict $C$ to belong to a particular parametric family. From now on, let $\{C_\theta : \theta \in \Theta\}$ be a restricted parametric family of copulas.

Suppose $\{(X_j, Y_j)^T,\ j = 1, 2, \ldots, n\}$ is a set of $n$ independent and identically distributed bivariate random vectors arising from the joint distribution $H_\theta(x,y)$ in (2.3). The parameter of interest is the bivariate dependence parameter $\theta$, and the log-likelihood function corresponding to the $n$ observations is $\sum_{j=1}^n \log h_\theta(x_j, y_j)$. The estimation of $\theta$ is complicated by the presence of nuisance parameters consisting of the unknown distributions $F$ and $G$ (only the number $d_F$ and jump points $D_{Fh}$, $h = 1, 2, \ldots, d_F$ of $F$, and correspondingly $d_G$ and the points $D_{Gl}$ of $G$, are known). An objective function that has $\theta$ as the only unknown parameter can be obtained by replacing $F$ and $G$ by $F_n$ and $G_n$ in the log-likelihood function, where $F_n$ (respectively, $G_n$) is $n/(n+1)$ times the empirical cdf of $F$ (respectively, $G$). The rescaling by $n/(n+1)$ avoids difficulties arising from the unboundedness of the log-likelihood function as the empirical cdfs tend to 0 or 1, and has been employed by Genest et al. in [11]. Thus, the unique semi-parametric maximum likelihood estimator of $\theta$, $\hat{\theta}_n$, is the maximizer of
$$L(\theta) = \sum_{j=1}^{n} \log\{c^*_{\theta,j}\} \qquad (2.8)$$
(that is, $\hat{\theta}_n = \arg\max_{\theta \in \Theta} L(\theta)$), where
$$c^*_{\theta,j} \equiv c^*_\theta(F_n(x_j^-), F_n(x_j), G_n(y_j^-), G_n(y_j)),$$
ignoring terms that do not depend on $\theta$ in $\sum_{j=1}^n \log h_\theta(x_j, y_j)$. In the special case when no discrete components are present, Genest et al. [11] developed a semi-parametric approach to estimate the unknown parameters based on the semi-parametric likelihood. Subsequently, it was shown that the resulting estimators were consistent and asymptotically normally distributed; see [11] for details. Expressions (2.7) and (2.8), respectively, are generalizations of the methodology of Genest et al. [11] to the case where $F$ and $G$ contain discrete components. Note that the challenge in maximizing the semi-parametric log-likelihood in (2.8) is that it involves several integrals corresponding to discrete components in $(x_j, y_j)$, $j = 1, 2, \ldots, n$.
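As a hedged illustration of maximizing (2.8), the following sketch reuses the hypothetical c_star helper above (Gaussian-copula family assumed). The pseudo-margins are $n/(n+1)$ times the empirical cdfs, with left limits computed only at the known jump points (the x_atoms and y_atoms arguments), so the appropriate case of $c^*_\theta$ is selected automatically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pseudo_margins(z, atoms=()):
    """(F_n(z_j^-), F_n(z_j)) for each z_j, with F_n = n/(n+1) times the ECDF.
    Left limits differ from right limits only at the known jump points."""
    z = np.asarray(z)
    n = len(z)
    atoms = set(atoms)
    right = np.array([(z <= zj).sum() for zj in z]) / (n + 1.0)
    left = np.array([(z < zj).sum() if zj in atoms else (z <= zj).sum()
                     for zj in z]) / (n + 1.0)
    return left, right

def neg_log_pseudo_likelihood(rho, x, y, x_atoms=(), y_atoms=()):
    """Minus the semi-parametric log-likelihood L(theta) of (2.8)."""
    u1, u2 = pseudo_margins(x, x_atoms)
    v1, v2 = pseudo_margins(y, y_atoms)
    return -sum(np.log(c_star(a, b, c, d, rho))
                for a, b, c, d in zip(u1, u2, v1, v2))

def fit_rho(x, y, x_atoms=(), y_atoms=()):
    """theta_hat_n = argmax_theta L(theta) over the Gaussian-copula family."""
    res = minimize_scalar(neg_log_pseudo_likelihood,
                          args=(x, y, x_atoms, y_atoms),
                          bounds=(-0.99, 0.99), method="bounded")
    return res.x
```

Passing the known jump points explicitly mirrors the model assumption above that $d_F$, $D_{Fh}$, $d_G$ and $D_{Gl}$ are known while $F$ and $G$ are otherwise unspecified.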
CHAPTER 3

ASYMPTOTIC PROPERTIES OF SEMI-PARAMETRIC ESTIMATOR IN BIVARIATE DISTRIBUTIONS

3.1 Statement of the Main Theorem

In view of its similarity with the semi-parametric maximum likelihood estimator for continuous marginals, we expect $\hat{\theta}_n$ in the case of mixed marginals to be consistent and asymptotically normal. To prove this, we denote $l(\theta, u_1, u_2, v_1, v_2) = \log\{c^*_\theta(u_1, u_2, v_1, v_2)\}$ and use the notation $l_\theta$ and $l_{\theta,\theta}$ for the first and second derivatives of $l$ with respect to $\theta$. The estimator $\hat{\theta}_n$ satisfies
$$\frac{1}{n} \frac{\partial L(\theta)}{\partial \theta} = \frac{1}{n} \sum_{j=1}^{n} l_\theta(\theta, F_n(x_j^-), F_n(x_j), G_n(y_j^-), G_n(y_j)) = 0. \qquad (3.1)$$

Here are some heuristics for the proof of asymptotic normality. Expanding in a Taylor series, one obtains
$$\frac{1}{n} \left.\frac{\partial L(\theta)}{\partial \theta}\right|_{\theta = \hat{\theta}_n} = 0 \approx A_n - (\hat{\theta}_n - \theta) B_n, \qquad (3.2)$$
where
$$A_n = \frac{1}{n} \sum_{j=1}^{n} l_\theta(\theta, F_n(x_j^-), F_n(x_j), G_n(y_j^-), G_n(y_j))$$
and
$$B_n = -\frac{1}{n} \sum_{j=1}^{n} l_{\theta,\theta}(\theta, F_n(x_j^-), F_n(x_j), G_n(y_j^-), G_n(y_j)).$$
It follows from equation (3.2) that $n^{1/2}(\hat{\theta}_n - \theta) \approx n^{1/2} A_n / B_n$. From the theory of multivariate rank statistics, one can obtain the following limiting behaviors:
$$B_n \to \beta = -E\big(l_{\theta,\theta}\{\theta, F(X^-), F(X), G(Y^-), G(Y)\}\big) \qquad (3.3)$$
almost surely, and $n^{1/2} A_n$ is asymptotically normal with zero mean and variance $\sigma^2$, whose explicit form will be provided in Section 3.4. Thus, we have

Theorem 3.1.1. Under suitable regularity conditions, the semi-parametric estimator $\hat{\theta}_n$ is consistent, and $n^{1/2}(\hat{\theta}_n - \theta)$ is asymptotically normal with mean 0 and variance $\Sigma = \sigma^2/\beta^2$.

We start with the proof of consistency in Section 3.2, develop the asymptotic properties of $B_n$ and $A_n$, on which Theorem 3.1.1 is based, in Sections 3.3 and 3.4, respectively, and complete this chapter by proving Theorem 3.1.1 in Section 3.5.

Here is some notation that will be used later in this chapter. Let $S_{Fh} = [F(D_{Fh}^-), F(D_{Fh})]$ for $h = 1, 2, \ldots, d_F$, and $S_{Gl} = [G(D_{Gl}^-), G(D_{Gl})]$ for $l = 1, 2, \ldots, d_G$. In the situation where both $F$ and $G$ have only one jump point, these notations simplify to $D_F$, $D_G$, $S_F = [F(D_F^-), F(D_F)]$, and $S_G = [G(D_G^-), G(D_G)]$, respectively.

3.2 Consistency of $\hat{\theta}_n$

Noting that $\hat{\theta}_n$ satisfies (3.1), by letting $J = l_\theta$, which is a function on $(0,1)^2$, we only need to show that
$$R_n = \frac{1}{n} \sum_{j=1}^{n} J(F_n(X_j), G_n(Y_j)) \to 0 \quad \text{a.s.} \qquad (3.4)$$
Since $H_\theta$ involves one or more jump points, $R_n$ can be written as
$$R_n = \frac{1}{n} \sum_{j=1}^{n} J(F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j)). \qquad (3.5)$$

We introduce some notation for the subsequent presentation. Let $\{cc\}$, $\{cd\}$, $\{dc\}$ and $\{dd\}$ denote the events $\{(X,Y) : X \in \mathcal{C}(F), Y \in \mathcal{C}(G)\}$, $\{(X,Y) : X \in \mathcal{C}(F), Y \in \mathcal{J}(G)\}$, $\{(X,Y) : X \in \mathcal{J}(F), Y \in \mathcal{C}(G)\}$ and $\{(X,Y) : X \in \mathcal{J}(F), Y \in \mathcal{J}(G)\}$, respectively. Further, let $A_{cc} = \{j : (X_j, Y_j) \in \{cc\}\}$; similarly, define $A_{cd}$, $A_{dc}$ and $A_{dd}$. Note that the sets $A_S$ for $S = \{cc\}, \{cd\}, \{dc\}, \{dd\}$ form a partition of the set of integers $\{1, 2, \ldots, n\}$. We consider the decomposition of $R_n$ based on these four partition sets, namely, $R_n = R_n^{cc} + R_n^{cd} + R_n^{dc} + R_n^{dd}$, where
$$R_n^S = \frac{1}{n} \sum_{j \in A_S} J^S(F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j))$$
for $S = \{cc\}, \{cd\}, \{dc\}, \{dd\}$, and $J^S$ is given by
$$J^S(u_1, u_2, v_1, v_2) = \begin{cases} J^{cc}(u_2, v_2) & \text{if } S = \{cc\}, \\ J^{cd}(u_2, v_1, v_2) & \text{if } S = \{cd\}, \\ J^{dc}(u_1, u_2, v_2) & \text{if } S = \{dc\}, \\ J^{dd}(u_1, u_2, v_1, v_2) & \text{if } S = \{dd\}. \end{cases} \qquad (3.6)$$

The functions $J^S$, $S = \{cc\}, \{cd\}, \{dc\}, \{dd\}$, are assumed to be continuous on their respective domains. For example, $J^{cc}(u_2, v_2)$ is assumed to be continuous on $[0,1]^2$; this is a reasonable assumption to make since candidates for $J^{cc}$ in this chapter will be either $\frac{\partial}{\partial\theta} \log c_\theta(u_2, v_2)$ or $\frac{\partial^2}{\partial\theta^2} \log c_\theta(u_2, v_2)$, which are continuous functions of $u_2$ and $v_2$. $J^{cd}(u_2, v_1, v_2)$ is assumed to be continuous on $[0,1]^3$; a particular candidate for $J^{cd}$ is
$$J^{cd}(u_2, v_1, v_2) = \int_{[v_1, v_2]} \frac{\partial}{\partial\theta} \log c_\theta(u_2, v)\, dv,$$
which is continuous in $u_2$, $v_1$ and $v_2$. Similarly for $J^{dc}$ and $J^{dd}$. Almost sure convergence of $R_n$ is established in the following theorem:

Theorem 3.2.1.
Let $J = l_\theta$, $r(u) = u(1-u)$, $\delta > 0$, and let $p$ and $q$ be positive numbers satisfying $1/p + 1/q = 1$. Let $a$ and $b$ be given by $a = (-1+\delta)/p$ and $b = (-1+\delta)/q$. Consider the conditions:

(C1) $|J^{cc}(u_2, v_2)| \le M_1\, r(u_2)^a\, r(v_2)^b$;
(C2) $|J^{cd}(u_2, v_1, v_2)| \le M_2\, r(u_2)^a$ (independent of $v_1$ and $v_2$);
(C3) $|J^{dc}(u_1, u_2, v_2)| \le M_3\, r(v_2)^b$ (independent of $u_1$ and $u_2$);
(C4) in a small neighborhood of each $(F(D_{Fh}^-), F(D_{Fh}), G(D_{Gl}^-), G(D_{Gl})) \in [0,1]^4$, for $h = 1, \ldots, d_F$ and $l = 1, \ldots, d_G$, $J^{dd}(u_1, u_2, v_1, v_2)$ is finite.

Under conditions (C1)-(C4), $R_n \to \kappa\ (= 0)$ almost surely, where
$$\kappa = E[N^{cc}(X,Y) + N^{cd}(X,Y) + N^{dc}(X,Y) + N^{dd}(X,Y)], \qquad (3.7)$$
with $E[N^{cc}(X,Y)]$, $E[N^{cd}(X,Y)]$, $E[N^{dc}(X,Y)]$ and $E[N^{dd}(X,Y)]$ given, respectively, by
$$E[N^{cc}(X,Y)] = \int_{\{(0,1) \cap (\cup_h S_{Fh})^c\} \times \{(0,1) \cap (\cup_l S_{Gl})^c\}} J^{cc}(u,v)\, c_\theta(u,v)\, du\, dv,$$
$$E[N^{cd}(X,Y)] = \sum_{l=1}^{d_G} \int_{(0,1) \cap (\cup_h S_{Fh})^c} J^{cd}(u, G(D_{Gl}^-), G(D_{Gl})) \Big[\int_{S_{Gl}} c_\theta(u,v)\, dv\Big]\, du,$$
$$E[N^{dc}(X,Y)] = \sum_{h=1}^{d_F} \int_{(0,1) \cap (\cup_l S_{Gl})^c} J^{dc}(F(D_{Fh}^-), F(D_{Fh}), v) \Big[\int_{S_{Fh}} c_\theta(u,v)\, du\Big]\, dv,$$
$$E[N^{dd}(X,Y)] = \sum_{h,l} J^{dd}(F(D_{Fh}^-), F(D_{Fh}), G(D_{Gl}^-), G(D_{Gl})) \int_{S_{Fh} \times S_{Gl}} c_\theta(u,v)\, du\, dv.$$

Remark: In the case of t-copulas (as defined in (6.1)), $a$ and $b$ can be chosen such that $a \le -2/\nu$ and $b \le -2/\nu$, with $a = (-1+\delta)/p$ and $b = (-1+\delta)/q$ for some $p$, $q$ and $\delta$.

Proof: Without loss of generality, we consider the case where there is only one point of discontinuity in both $F$ and $G$, i.e., $D_F$ and $D_G$. Let $C_n(u,v)$ be the empirical copula defined by the sample,
$$C_n(u,v) = \frac{1}{n} \sum_{j=1}^{n} I_{\{F_n(X_j) \le u,\ G_n(Y_j) \le v\}}.$$
We consider the decomposition of the empirical copula measure into four sub-measures $C_n^{cc}$, $C_n^{cd}$, $C_n^{dc}$ and $C_n^{dd}$, where
$$C_n^S(u,v) = \frac{1}{n} \sum_{j \in A_S} I_{\{F_n(X_j) \le u,\ G_n(Y_j) \le v\}}$$
for $S = \{cc\}, \{cd\}, \{dc\}, \{dd\}$, and $C_n(u,v) = \sum_{S \in \{cc,cd,dc,dd\}} C_n^S(u,v)$.

Then, by the Glivenko-Cantelli Theorem, $C_n^{cc}(u,v)$, $C_n^{cd}(u,v)$, $C_n^{dc}(u,v)$ and $C_n^{dd}(u,v)$ converge uniformly to $C^1(u,v)$, $C^2(u,v)$, $C^3(u,v)$ and $C^4(u,v)$, respectively, where the $C^i(u,v)$, $i = 1, \ldots, 4$, are given by
$$C^1(u,v) = \begin{cases} P(F(X) \le u \text{ and } G(Y) \le v) & \text{if } (u,v) \in (0, F(D_F^-)) \times (0, G(D_G^-)), \\ P(F(X) \in (0,u) \cap S_F^c \text{ and } G(Y) \le v) & \text{if } (u,v) \in [F(D_F^-), 1) \times (0, G(D_G^-)), \\ P(F(X) \le u \text{ and } G(Y) \in (0,v) \cap S_G^c) & \text{if } (u,v) \in (0, F(D_F^-)) \times [G(D_G^-), 1), \\ P(F(X) \in (0,u) \cap S_F^c \text{ and } G(Y) \in (0,v) \cap S_G^c) & \text{if } (u,v) \in [F(D_F^-), 1) \times [G(D_G^-), 1), \end{cases}$$
$$C^2(u,v) = \begin{cases} 0 & \text{if } v \in (0, G(D_G^-)), \\ P(F(X) \le u \text{ and } Y = D_G) & \text{if } (u,v) \in (0, F(D_F^-)) \times [G(D_G^-), 1), \\ P(F(X) \in (0,u) \cap S_F^c \text{ and } Y = D_G) & \text{if } (u,v) \in [F(D_F^-), 1) \times [G(D_G^-), 1), \end{cases}$$
$$C^3(u,v) = \begin{cases} 0 & \text{if } u \in (0, F(D_F^-)), \\ P(X = D_F \text{ and } G(Y) \le v) & \text{if } (u,v) \in [F(D_F^-), 1) \times (0, G(D_G^-)), \\ P(X = D_F \text{ and } G(Y) \in (0,v) \cap S_G^c) & \text{if } (u,v) \in [F(D_F^-), 1) \times [G(D_G^-), 1), \end{cases}$$
and $C^4(u,v)$ puts mass $P(X = D_F \text{ and } Y = D_G)$ on the single point $\{F(D_F)\} \times \{G(D_G)\}$. Note that the $C^i(u,v)$, $i = 1, \ldots, 4$, are not probability measures, and
$$C_\theta(u,v) = \sum_{i=1}^{4} C^i(u,v).$$

The almost sure convergence is obtained in four steps.

Step 1. Let
$$R_n^{cc} = \int_{(0,1)^2} J^{cc}(u,v)\, dC_n^{cc}(u,v).$$
We prove that $R_n^{cc} \to \mu^{cc} \equiv E[N^{cc}(X,Y)]$ almost surely. We show that $J^{cc}$ is uniformly integrable with respect to the measures $dC_n^{cc}(u,v)$ by showing that $\int_{(0,1)^2} |J^{cc}(u,v)|^{1+\epsilon}\, dC_n^{cc}(u,v)$ is bounded for some $\epsilon > 0$.
Using Hölder's inequality and assumption (C1), and noting that $dC_n^{cc}(u,v)$ puts mass $1/n$ on each continuity point, one can derive the following chain of inequalities:
$$\begin{aligned} \int_{(0,1)^2} |J^{cc}(u,v)|^{1+\epsilon}\, dC_n^{cc}(u,v) &\le M_1 \int_{(0,1)^2} r(u)^{a(1+\epsilon)}\, r(v)^{b(1+\epsilon)}\, dC_n^{cc}(u,v) \\ &\le M_1 \Big[\int_{(0,1)^2} r(u)^{a(1+\epsilon)p}\, dC_n^{cc}\Big]^{1/p} \Big[\int_{(0,1)^2} r(v)^{b(1+\epsilon)q}\, dC_n^{cc}\Big]^{1/q} \\ &\le M_1 \Big[\frac{1}{n} \sum_{j \in A_{cc}} r\Big(\frac{j}{n+1}\Big)^{a(1+\epsilon)p}\Big]^{1/p} \Big[\frac{1}{n} \sum_{j \in A_{cc}} r\Big(\frac{j}{n+1}\Big)^{b(1+\epsilon)q}\Big]^{1/q} \\ &= M_1\, \frac{1}{n} \sum_{j \in A_{cc}} r\Big(\frac{j}{n+1}\Big)^{(-1+\delta)(1+\epsilon)} \\ &\le M_1 \int_0^1 \frac{du}{\{u(1-u)\}^{(1-\delta)(1+\epsilon)}} < \infty \end{aligned}$$
for $\epsilon < \delta$. Combining this with the fact that $C_n^{cc}(u,v)$ converges uniformly to $C^1(u,v)$, we have
$$R_n^{cc} \to \int_{(0,1)^2} J^{cc}(u,v)\, dC^1(u,v) = \int_{((0,1) \cap S_F^c) \times ((0,1) \cap S_G^c)} J^{cc}(u,v)\, c_\theta(u,v)\, du\, dv$$
almost surely, which completes Step 1.

Step 2. Let
$$R_n^{cd} = \int_{(0,1)^2} J^{cd}(u, v_1, v_2)\, dC_n^{cd}(u,v).$$
We prove that $R_n^{cd} \to \mu^{cd} \equiv E[N^{cd}(X,Y)]$ almost surely, by showing that $J^{cd}$ is uniformly integrable with respect to $dC_n^{cd}(u,v)$, i.e., that $\int_{(0,1)^2} |J^{cd}(u,v_1,v_2)|^{1+\epsilon}\, dC_n^{cd}(u,v)$ is bounded for some $\epsilon > 0$. Using Hölder's inequality and assumption (C2), and noting that $dC_n^{cd}(u,v)$ puts mass $1/n$ on each point of $((0,1) \cap S_F^c) \times G_n(D_G)$:
$$\begin{aligned} \int_{(0,1)^2} |J^{cd}(u,v_1,v_2)|^{1+\epsilon}\, dC_n^{cd}(u,v) &\le M_2 \int_{(0,1)^2} r(u)^{a(1+\epsilon)}\, dC_n^{cd}(u,v) \\ &\le M_2 \Big[\int_{(0,1)^2} r(u)^{a(1+\epsilon)p}\, dC_n^{cd}\Big]^{1/p} \Big[\int_{(0,1)^2} 1^{(1+\epsilon)q}\, dC_n^{cd}\Big]^{1/q} \\ &\le M_2 \Big[\frac{1}{n} \sum_{j \in A_{cd}} r\Big(\frac{j}{n+1}\Big)^{(-1+\delta)(1+\epsilon)}\Big]^{1/p} \\ &\le M_2 \Big[\int_0^1 \frac{du}{\{u(1-u)\}^{(1-\delta)(1+\epsilon)}}\Big]^{1/p} < \infty \end{aligned}$$
if $\epsilon < \delta$. Combining this with the fact that $C_n^{cd}(u,v)$ converges uniformly to $C^2(u,v)$, we have
$$R_n^{cd} \to \int_{(0,1)^2} J^{cd}(u, G(D_G^-), G(D_G))\, dC^2(u,v) = \int_{(0,1) \cap S_F^c} J^{cd}(u, G(D_G^-), G(D_G)) \Big[\int_{S_G} c_\theta(u,v)\, dv\Big]\, du$$
almost surely, which completes Step 2.

Step 3. Under assumption (C3), the proof of $R_n^{dc} \to \mu^{dc} \equiv E[N^{dc}(X,Y)]$ is similar to that of Step 2, so it is omitted.

Step 4. The convergence of $R_n^{dd}$ can be established using the SLLN. Let
$$R_n^{dd} = \int_{(0,1)^2} J^{dd}(u_1, u_2, v_1, v_2)\, dC_n^{dd}(u,v).$$
We prove that $R_n^{dd} \to \mu^{dd} \equiv E[N^{dd}(X,Y)]$ almost surely. We show that $J^{dd}$ is uniformly integrable with respect to $dC_n^{dd}(u,v)$ by showing that $\int_{(0,1)^2} |J^{dd}|^{1+\epsilon}\, dC_n^{dd}(u,v)$ is bounded for some $\epsilon > 0$. Noting that $dC_n^{dd}(u,v)$ puts mass $n^*_{dd}/n$ on the single point $(F_n(D_F), G_n(D_G))$, where $n^*_{dd}$ is the number of observations $(X_j, Y_j)$ such that $X_j = D_F$ and $Y_j = D_G$:
$$\int_{(0,1)^2} |J^{dd}(u_1,u_2,v_1,v_2)|^{1+\epsilon}\, dC_n^{dd}(u,v) = \frac{n^*_{dd}}{n}\, |J^{dd}(u_1,u_2,v_1,v_2)|^{1+\epsilon} < \infty,$$
as long as $J^{dd}(u_1,u_2,v_1,v_2)$ is finite, which is assumption (C4). Combining this with the fact that $C_n^{dd}(u,v)$ converges uniformly to $C^4(u,v)$, we have
$$R_n^{dd} \to \int_{(0,1)^2} J^{dd}(u_1,u_2,v_1,v_2)\, dC^4(u,v) = J^{dd}(F(D_F^-), F(D_F), G(D_G^-), G(D_G)) \int_{S_F \times S_G} c_\theta(u,v)\, du\, dv$$
almost surely, which completes Step 4.

In summary, noting that $R_n = R_n^{cc} + R_n^{cd} + R_n^{dc} + R_n^{dd}$, Theorem 3.2.1 holds.
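The empirical copula and its four sub-measures used in the proof are straightforward to compute from the ranks. The following small sketch (hypothetical helper, not the dissertation's code; the jump points are passed in, as the theory assumes them known) evaluates $C_n(u,v)$ together with $C_n^{cc}$, $C_n^{cd}$, $C_n^{dc}$ and $C_n^{dd}$ at a point:

```python
import numpy as np

def empirical_copula(x, y, u, v, x_atoms=(), y_atoms=()):
    """Evaluate C_n(u, v) and its four sub-measures at a single point (u, v)."""
    n = len(x)
    x, y = np.asarray(x), np.asarray(y)
    fn = np.array([(x <= xi).sum() for xi in x]) / (n + 1.0)   # F_n(X_j)
    gn = np.array([(y <= yi).sum() for yi in y]) / (n + 1.0)   # G_n(Y_j)
    inside = (fn <= u) & (gn <= v)
    x_disc = np.isin(x, np.asarray(x_atoms))   # is X_j a jump point of F?
    y_disc = np.isin(y, np.asarray(y_atoms))
    parts = {
        "cc": np.mean(inside & ~x_disc & ~y_disc),
        "cd": np.mean(inside & ~x_disc &  y_disc),
        "dc": np.mean(inside &  x_disc & ~y_disc),
        "dd": np.mean(inside &  x_disc &  y_disc),
    }
    parts["C_n"] = sum(parts.values())   # C_n = C_n^cc + C_n^cd + C_n^dc + C_n^dd
    return parts
```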
3.3 Asymptotic Behavior of $B_n$

In the case when $H_\theta$ and $J$ are both continuous, by letting $J = l_{\theta,\theta}$ in [11], the quantity of interest is the asymptotic behavior of the statistic $R_n$ defined by
$$R_n = \frac{1}{n} \sum_{j=1}^{n} J(F_n(X_j), G_n(Y_j)), \qquad (3.8)$$
where $J(\cdot,\cdot)$ is a function on $(0,1)^2$. Genest et al. [11] have shown that $B_n \to -E(l_{\theta,\theta}\{\theta, F(X), G(Y)\})$ as $n \to \infty$ almost surely. In the present case, the appropriate statistic is similar in form to $R_n$ but with a significant difference: $H_\theta$ involves one or more jump points. Therefore, $R_n$ can be written in the form (3.5), with $J = l_{\theta,\theta}$. Almost sure convergence of $R_n$ is established in the following theorem:

Theorem 3.3.1. Let $J = l_{\theta,\theta}$, $r(u) = u(1-u)$, $\delta > 0$, and let $p$ and $q$ be positive numbers satisfying $1/p + 1/q = 1$. Let $a = (-1+\delta)/p$ and $b = (-1+\delta)/q$. Consider the conditions:

(C1) $|J^{cc}(u_2, v_2)| \le M_1\, r(u_2)^a\, r(v_2)^b$;
(C2) $|J^{cd}(u_2, v_1, v_2)| \le M_2\, r(u_2)^a$ (independent of $v_1$ and $v_2$);
(C3) $|J^{dc}(u_1, u_2, v_2)| \le M_3\, r(v_2)^b$ (independent of $u_1$ and $u_2$);
(C4) in a small neighborhood of each $(F(D_{Fh}^-), F(D_{Fh}), G(D_{Gl}^-), G(D_{Gl})) \in [0,1]^4$, for $h = 1, \ldots, d_F$ and $l = 1, \ldots, d_G$, $J^{dd}(u_1, u_2, v_1, v_2)$ is finite.

Under conditions (C1)-(C4), $R_n \to \beta$ almost surely, where
$$\beta = E[N^{cc}(X,Y) + N^{cd}(X,Y) + N^{dc}(X,Y) + N^{dd}(X,Y)], \qquad (3.9)$$
with $E[N^S(X,Y)]$ defined as in (3.7) for $S = \{cc\}, \{cd\}, \{dc\}, \{dd\}$. Since the proof is similar to that of Theorem 3.2.1, it is omitted here.

3.4 Asymptotic Behavior of $A_n$

By taking $J = l_\theta$ in (3.5), we have the following theorem:

Theorem 3.4.1. Let $r(u) = u(1-u)$, $\delta > 0$, and let $p$ and $q$ be as in Theorem 3.3.1. Let $J^S_u$ and $J^S_v$ denote the partial derivatives of $J^S$ with respect to $u$ and $v$, respectively, for $S = \{cc\}, \{cd\}, \{dc\}, \{dd\}$. Also, let $a = (-0.5+\delta)/p$ and $b = (-0.5+\delta)/q$. Consider the conditions:

(D1) $|J^{cc}(u_2, v_2)| \le M_1\, r(u_2)^a\, r(v_2)^b$, with partial derivatives satisfying $|J^{cc}_{u_2}(u_2, v_2)| \le M_2\, r(u_2)^{a-1}\, r(v_2)^b$ and $|J^{cc}_{v_2}(u_2, v_2)| \le M_3\, r(u_2)^a\, r(v_2)^{b-1}$;

(D2) $|J^{cd}(u_2, v_1, v_2)| \le M_4\, r(u_2)^a$ with $|J^{cd}_{u_2}(u_2, v_1, v_2)| \le M_5\, r(u_2)^{a-1}$;
$$\int_{v_1}^{v_2} c_\theta(u_2, v)\, dv \le M_6\, r(u_2)^a;$$
$$\sum_l \int_{(0,1) \cap \{\cup_h S_{Fh}\}^c} |J^{cd}_\eta(u_2, G(D_{Gl}^-), G(D_{Gl}))| \Big[\int_{S_{Gl}} c_\theta(u_2, v)\, dv\Big]\, du_2 < \infty$$
for $\eta = v_1$ or $v_2$; $J^{cd}_{u_2}(u_2, v_1, v_2)$ is continuous with respect to $u_2$ on $\mathcal{C}(F)$ almost surely; and $J^{cd}_{v_1}(u_2, v_1, v_2)$ and $J^{cd}_{v_2}(u_2, v_1, v_2)$ are continuous with respect to $v_1$ and $v_2$, respectively, in small neighborhoods of the rectangles $\mathcal{C}(F) \times (G(D_{Gl}^-), G(D_{Gl})]$ almost surely, for $l = 1, \ldots, d_G$;

(D3) $|J^{dc}(u_1, u_2, v_2)| \le M_7\, r(v_2)^b$ with $|J^{dc}_{v_2}(u_1, u_2, v_2)| \le M_8\, r(v_2)^{b-1}$;
$$\int_{u_1}^{u_2} c_\theta(u, v_2)\, du \le M_9\, r(v_2)^b;$$
$$\sum_h \int_{(0,1) \cap \{\cup_l S_{Gl}\}^c} |J^{dc}_\eta(F(D_{Fh}^-), F(D_{Fh}), v_2)| \Big[\int_{S_{Fh}} c_\theta(u, v_2)\, du\Big]\, dv_2 < \infty$$
for $\eta = u_1$ or $u_2$; $J^{dc}_{v_2}(u_1, u_2, v_2)$ is continuous with respect to $v_2$ on $\mathcal{C}(G)$ almost surely; and $J^{dc}_{u_1}(u_1, u_2, v_2)$ and $J^{dc}_{u_2}(u_1, u_2, v_2)$ are continuous with respect to $u_1$ and $u_2$, respectively, in small neighborhoods of the rectangles $(F(D_{Fh}^-), F(D_{Fh})] \times \mathcal{C}(G)$ almost surely, for $h = 1, \ldots, d_F$;

(D4) $J^{dd}_{u_1}(u_1, \cdot, \cdot, \cdot)$, $J^{dd}_{u_2}(\cdot, u_2, \cdot, \cdot)$, $J^{dd}_{v_1}(\cdot, \cdot, v_1, \cdot)$, and $J^{dd}_{v_2}(\cdot, \cdot, \cdot, v_2)$ are continuous with respect to $u_1$, $u_2$, $v_1$ and $v_2$, respectively, in a small neighborhood of $(F(D_{Fh}^-), F(D_{Fh}), G(D_{Gl}^-), G(D_{Gl}))$ for any $h = 1, \ldots, d_F$ and $l = 1, \ldots, d_G$.
Under conditions (D1)-(D4), $n^{1/2}(R_n - \kappa) \to N(0, \sigma^2)$ in distribution as $n \to \infty$, where $\kappa$ is defined in (3.7) and
$$\sigma^2 = \mathrm{var}\big[M^{cc}(X,Y) + M^{cd}(X,Y) + M^{dc}(X,Y) + M^{dd}(X,Y)\big], \qquad (3.10)$$
with $M^{cc}(x,y)$ and $M^{cd}(x,y)$ given, respectively, by
$$M^{cc}(x,y) = J^{cc}(F(x), G(y))\, I_{\{x \in \mathcal{C}(F),\, y \in \mathcal{C}(G)\}} + \int_{\{\mathbb{R} \setminus \{\cup_h D_{Fh}\}\} \times \{\mathbb{R} \setminus \{\cup_l D_{Gl}\}\}} I_{\{x \le x'\}}\, J^{cc}_{u_2}(F(x'), G(y'))\, dH_\theta(x', y') + \int_{\{\mathbb{R} \setminus \{\cup_h D_{Fh}\}\} \times \{\mathbb{R} \setminus \{\cup_l D_{Gl}\}\}} I_{\{y \le y'\}}\, J^{cc}_{v_2}(F(x'), G(y'))\, dH_\theta(x', y'), \qquad (3.11)$$
$$\begin{aligned} M^{cd}(x,y) = {} & \sum_{l=1}^{d_G} J^{cd}(F(x), G(D_{Gl}^-), G(D_{Gl}))\, I_{\{x \in \mathcal{C}(F),\, y = D_{Gl}\}} \\ & + \sum_{l=1}^{d_G} \int_{\mathbb{R} \setminus \{\cup_h D_{Fh}\}} I_{\{x \le x'\}}\, J^{cd}_{u_2}(F(x'), G(D_{Gl}^-), G(D_{Gl}))\, dH_\theta(x', D_{Gl}) \\ & + \sum_{l=1}^{d_G} \int_{\mathbb{R} \setminus \{\cup_h D_{Fh}\}} I_{\{y < D_{Gl}\}}\, J^{cd}_{v_1}(F(x'), G(D_{Gl}^-), G(D_{Gl}))\, dH_\theta(x', D_{Gl}) \\ & + \sum_{l=1}^{d_G} \int_{\mathbb{R} \setminus \{\cup_h D_{Fh}\}} I_{\{y \le D_{Gl}\}}\, J^{cd}_{v_2}(F(x'), G(D_{Gl}^-), G(D_{Gl}))\, dH_\theta(x', D_{Gl}); \end{aligned}$$
$M^{dc}(x,y)$ is defined analogously to $M^{cd}(x,y)$ with the roles of the two coordinates interchanged, and $M^{dd}(x,y)$ collects the corresponding terms attached to the joint jump points $(D_{Fh}, D_{Gl})$.

The proof of Theorem 3.4.1 treats $R_n^{cc}$, $R_n^{cd}$, $R_n^{dc}$ and $R_n^{dd}$ separately in Sections 3.4.1-3.4.4. For $R_n^{cc}$, the quantity $n^{1/2}(R_n^{cc} - \mu^{cc})$ is decomposed into leading terms $A_{1n}$, $A_{2n}$, $A_{3n}$ and remainder terms $B_{\gamma in}$ and $C_{\gamma in}$ obtained from a Taylor expansion of $J^{cc}$ about $(F(x), G(y))$; the $A$-terms are shown to be asymptotically normal and the remainder terms negligible. The following lemmas are needed.

Lemma. For any $\epsilon > 0$, there exists a constant $\beta = \beta_\epsilon \in (0,1)$ such that
$$P(\Omega_n) = P(\{\omega : \beta F \le F_n \le 1 - \beta(1-F) \text{ for all } x(\omega) \text{ on } \Delta_{n1}\}) > 1 - \epsilon \qquad (3.14)$$
for all $n$ and $F$.

Proof: Let $A = \{\omega : \beta F \le F_n \le 1 - \beta(1-F) \text{ for all } x(\omega) \in \Delta_{n1} \cap \mathcal{C}(F)\}$ and $B = \{\omega : \beta F \le F_n \le 1 - \beta(1-F) \text{ for all } x(\omega) \in \Delta_{n1} \cap \mathcal{J}(F)\}$. Note that in our case,
$$P(\{\omega : \beta F \le F_n \le 1 - \beta(1-F) \text{ for all } x(\omega) \text{ on } \Delta_{n1}\}) = P(A \cup B)$$
$$= P(\{x(\omega) \in \Delta_{n1} \cap \mathcal{C}(F)\}) \cdot P(\{\beta F \le F_n \le 1 - \beta(1-F) \mid x(\omega) \in \Delta_{n1} \cap \mathcal{C}(F)\}) + P(\{x(\omega) \in \Delta_{n1} \cap \mathcal{J}(F)\}) \cdot P(\{\beta F \le F_n \le 1 - \beta(1-F) \mid x(\omega) \in \Delta_{n1} \cap \mathcal{J}(F)\}).$$
In Lemma 6.1 of [41], (3.14) has been verified for all continuous $F$ for a constant $\beta = \beta'_\epsilon$, which gives
$$P(\{\beta F \le F_n \le 1 - \beta(1-F) \mid x(\omega) \in \mathcal{C}(F) \cap \Delta_{n1}\}) > 1 - \epsilon.$$
Noting that $P(\{x(\omega) \in \Delta_{n1} \cap \mathcal{C}(F)\}) + P(\{x(\omega) \in \Delta_{n1} \cap \mathcal{J}(F)\}) = 1$, to prove (3.14) we only need to show that
$$P(\{\beta F \le F_n \le 1 - \beta(1-F) \mid x(\omega) \in \Delta_{n1} \cap \mathcal{J}(F)\}) > 1 - \epsilon$$
holds for some constant $\beta = \beta''_\epsilon$. This is a direct result of the Glivenko-Cantelli Theorem, which completes the proof.

Lemma 3.3.3. Uniformly in all $F$, we have
(i) $\sup_{\Delta_{n1}} |U^*_n(F) - U_n(F)|\, r(F)^{\rho - 1/2} \to_p 0$ as $n \to \infty$, for each $\rho \ge 0$;
(ii) $\sup_{(-\infty,\infty) \setminus \{\cup_h D_{Fh}\}} |U_n(F)|\, r(F)^{\rho - 1/2} = O_p(1)$ as $n \to \infty$, for each $\rho \ge 0$;
where $U^*_n(F(x)) = n^{1/2}(\hat{F}_n(x) - F(x))$ and $\hat{F}_n$ is the empirical distribution function. Recall that $F_n$ was defined to be $\frac{n}{n+1} \hat{F}_n$.

Proof: (i) Note that
$$|U^*_n(F) - U_n(F)|\, r(F)^{\rho-1/2} = n^{1/2}\, \frac{\hat{F}_n}{n+1}\, r(F)^{\rho-1/2},$$
and that for any fixed $\beta \in (0,1)$ we have
$$r\Big(\frac{\beta}{n}\Big)^{\rho-1/2} = r\Big(1 - \frac{\beta}{n}\Big)^{\rho-1/2} = O(n^{-\rho+1/2}).$$
Since the $F(X_i)$ are i.i.d. uniform random variables, given any arbitrary $\epsilon > 0$ we can choose $\beta = \beta_\epsilon \in (0,1)$ such that
$$P\Big(\frac{\beta}{n} \le F(X_{1n}) \le F(X_{nn}) \le 1 - \frac{\beta}{n}\Big) > 1 - \epsilon$$
for all $n$ and all $F$, which is obvious since
$$P\Big(\frac{\beta}{n} \le F(X_{1n}) \le F(X_{nn}) \le 1 - \frac{\beta}{n}\Big) = \Big(1 - \frac{2\beta}{n}\Big)^n \to e^{-2\beta} > 1 - \epsilon$$
for sufficiently small $\beta$. Combining the above results proves (i). (ii) follows from (i) and (iii) of Lemma 4.2 in [41]. Proof completed.

Proof of asymptotic normality of $R_n^{cc}$.

(1) Note that $A_{1n}$ can be written as $A_{1n} = n^{-1/2} \sum_{j=1}^n A_{1jn}$, where $A_{1jn} = J^{cc}(F(X_j), G(Y_j)) \cdot I_{\{j \in A_{cc}\}} - \mu^{cc}$. The $A_{1jn}$ are i.i.d. with mean zero. By assumption (D1) and Hölder's inequality, we have
$$\begin{aligned} \int_{\{\mathbb{R} \setminus D_F\} \times \{\mathbb{R} \setminus D_G\}} |J^{cc}(F(x), G(y))|^{2+\delta_0}\, dH_\theta(x,y) &\le M_1^{2+\delta_0} \int r(F(x))^{a(2+\delta_0)}\, r(G(y))^{b(2+\delta_0)}\, dH_\theta(x,y) \\ &\le M_1^{2+\delta_0} \Big[\int_{(0,1)} r(u)^{a(2+\delta_0)p_0}\, du\Big]^{1/p_0} \Big[\int_{(0,1)} r(v)^{b(2+\delta_0)q_0}\, dv\Big]^{1/q_0} \\ &= M_1^{2+\delta_0} \Big[\int_{(0,1)} r(u)^{(-\frac12+\delta)(2+\delta_0)}\, du\Big]^{1/p_0} \Big[\int_{(0,1)} r(v)^{(-\frac12+\delta)(2+\delta_0)}\, dv\Big]^{1/q_0} < \infty \end{aligned}$$
for some selected $p_0$, $q_0$ and $\delta$ satisfying $(\tfrac12 - \delta)(2+\delta_0) < 1$, which means that $A_{1jn}$ has a finite absolute moment of order $2+\delta_0$ for some $\delta_0 > 0$. Moreover, the term on the left-hand side is uniformly bounded. Using the CLT, we get the asymptotic normality of $A_{1n}$.
(2) Note that $A_{2n}$ can be written as
$$A_{2n} = n^{1/2} \int_{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}} (\hat{F}_n(x) - F(x))\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y) + n^{1/2} \int_{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}} (F_n(x) - \hat{F}_n(x))\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y) = A^*_{2n1} + A^*_{2n2},$$
where $\hat{F}_n(x)$ is the empirical distribution function of $X$. Let
$$\phi_{X_j}(x) = \begin{cases} 0 & \text{if } x < X_j, \\ 1 & \text{if } x \ge X_j. \end{cases}$$
Then $A^*_{2n1}$ can be written as $n^{-1/2} \sum_{j=1}^n A_{2jn}$, where
$$A_{2jn} = \int_{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}} (\phi_{X_j}(x) - F(x))\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y)$$
are i.i.d. with mean zero. Note that $|\phi_{X_j}(x) - F(x)| \le 1$. Under assumption (D1), we have
$$|A_{2jn}| \le M_2 \int_{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}} r^{a-1}(F(x))\, r^b(G(y))\, dH_\theta(x,y),$$
and the integral on the right-hand side is uniformly finite as long as we can find $p_1$ and $q_1$ satisfying $1/p_1 + 1/q_1 = 1$, $(a-1)p_1 > -1$ and $b q_1 > -1$. Thus we have shown that $A^*_{2n1}$ has an absolute moment of order $2+\delta_1$, which leads to the asymptotic normality of $A^*_{2n1}$. Note also that, with the same $p_1$ and $q_1$,
$$|A^*_{2n2}| = \frac{n^{1/2}}{n+1} \Big| \int_{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}} \hat{F}_n(x)\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y) \Big| \le \frac{n^{1/2}}{n+1} \sup_x |\hat{F}_n(x)| \int_{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}} |J^{cc}_{u_2}(F(x), G(y))|\, dH_\theta(x,y) = o_p(1).$$
Therefore, we have the asymptotic normality of $A_{2n}$.

(3) Similarly, note that $A_{3n}$ can be written as
$$A_{3n} = n^{1/2} \int (\hat{G}_n(y) - G(y))\, J^{cc}_{v_2}(F(x), G(y))\, dH_\theta(x,y) + n^{1/2} \int (G_n(y) - \hat{G}_n(y))\, J^{cc}_{v_2}(F(x), G(y))\, dH_\theta(x,y) = A^*_{3n1} + A^*_{3n2},$$
where $\hat{G}_n(y)$ is the empirical distribution function of $Y$. Similarly, $A^*_{3n1}$ can be written as $n^{-1/2} \sum_{j=1}^n A_{3jn}$, where $A_{3jn} = \int (\phi_{Y_j}(y) - G(y))\, J^{cc}_{v_2}(F(x), G(y))\, dH_\theta(x,y)$ are i.i.d. with mean zero. The same asymptotic conclusion can be drawn for $A_{3n}$.

(4) Note that $A_{1n} + A_{2n} + A_{3n}$ can be written as $n^{-1/2} \sum_{j=1}^n (A_{1jn} + A_{2jn} + A_{3jn}) + A^*_{2n2} + A^*_{3n2} \equiv n^{-1/2} \sum_{j=1}^n A^{**}_{1jn} + A^*_{2n2} + A^*_{3n2}$, where $A^{**}_{1jn} = A_{1jn} + A_{2jn} + A_{3jn}$ depends on $(X_j, Y_j)$ only, hence the $A^{**}_{1jn}$ are i.i.d. with mean zero, and $A^*_{2n2} \to_p 0$ and $A^*_{3n2} \to_p 0$ as $n \to \infty$. Using the CLT, we have that $A_{1n} + A_{2n} + A_{3n}$ is asymptotically normally distributed with mean 0.

The asymptotic negligibility of the $B$- and $C$-terms is stated in the following corollaries.

Corollary 1. For fixed $\gamma$, $B_{\gamma 1n} \to_p 0$ and $C_{\gamma 1n} \to_p 0$ as $n \to \infty$.

Proof: Note that $P(\Omega^c_{\gamma n}) \to 0$ in (3.13) for any $H_\theta$ by the Glivenko-Cantelli Theorem, because the distribution of $\sup_x |F_n(x) - F(x)|$ does not depend on $H_\theta$. This completes the proof.

Corollary 2. For fixed $\gamma$, $B_{\gamma 2n} \to_p 0$ and $C_{\gamma 2n} \to_p 0$ as $n \to \infty$.

Proof: According to Lemma 3.3.3 (ii) with $\rho = \tfrac12$, for given $\epsilon > 0$ there exists a constant $M$ such that
$$P(\Omega_{1n}) = P(\omega : \{\sup_x |U_n(F)| \le M \text{ and } x \in \mathcal{C}(F)\}) > 1 - \epsilon \qquad (3.15)$$
for all $n$ and $H_\theta$, and a constant $M'$ such that
$$P(\Omega_{2n}) = P(\omega : \{\sup_{\Delta_{n1}} r^{-\xi}(\Phi_n)\, r^{\xi}(F) \le M'\}) > 1 - \epsilon,$$
which gives $P(\Omega_{1n} \cap \Omega_{2n}) > 1 - 2\epsilon$. Also,
$$\chi(\Omega_{1n} \cap \Omega_{2n}) |B_{\gamma 2n}| \le M \sup_{\Delta_F \times \Delta_G} \big| J^{cc}_{u_2}(\Phi_n(x), G(y)) - J^{cc}_{u_2}(F(x), G(y)) \big| + \chi(\Omega_{1n} \cap \Omega_{2n}) \Big| \int_{\{(-\infty, X_{1n}) \cup (X_{nn}, \infty)\} \times \{(-\infty, Y_{1n}) \cup (Y_{nn}, \infty)\}} U_n(F(x)) \big[ J^{cc}_{u_2}(\Phi_n(x), G(y)) - J^{cc}_{u_2}(F(x), G(y)) \big]\, dH_n(x,y) \Big|.$$
By the Glivenko-Cantelli Theorem, $P(\{(-\infty, X_{1n}) \cup (X_{nn}, \infty)\} \times \{(-\infty, Y_{1n}) \cup (Y_{nn}, \infty)\}) \to 0$ for any $H_\theta$, so the second term on the right side of the inequality above converges to 0 as $n \to \infty$. The function $J^{cc}_{u_2}(u_2, v_2)$ is uniformly continuous on $(0,1)^2$. Since $|\Phi_n - F| \le |F_n - F|$, where $\Phi_n$ is as defined, the Glivenko-Cantelli Theorem yields $\sup_{\Delta_F \times \Delta_G} |J^{cc}_{u_2}(\Phi_n(x), G(y)) - J^{cc}_{u_2}(F(x), G(y))| \to_p 0$ uniformly for $H_\theta$. Therefore, we have shown that $B_{\gamma 2n} \to_p 0$ as $n \to \infty$. A similar argument may be used for $|C_{\gamma 2n}|$. Proof completed.
Corollary 3. For fixed $\gamma$, $B_{\gamma 3n} \to_p 0$ as $n \to \infty$.

Proof: Noting that $\{\Delta_F \times \Delta_G\} \cap \{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}\} = \Delta_F \times \Delta_G$,
$$B_{\gamma 3n} = \chi(\Omega_{\gamma n}) \int_{\Delta_F \times \Delta_G} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, d[H_n(x,y) - H_\theta(x,y)]$$
$$= \chi(\Omega_{\gamma n} \cap \Omega_{1n}) \int_{\Delta_F \times \Delta_G} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, d[H_n - H_\theta] + \chi(\Omega_{\gamma n} \cap \Omega^c_{1n}) \int_{\Delta_F \times \Delta_G} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, d[H_n - H_\theta], \qquad (3.16)$$
where $\Omega_{1n}$ is as defined in (3.15). Note that the second term in (3.16) converges to 0 in probability, and
$$\chi(\Omega_{\gamma n} \cap \Omega_{1n}) \Big| \int_{\Delta_F \times \Delta_G} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, d[H_n - H_\theta] \Big| \le \chi(\Omega_{\gamma n} \cap \Omega_{1n})\, M \Big| \int_{\Delta_F \times \Delta_G} J^{cc}_{u_2}(F(x), G(y))\, d[H_n - H_\theta] \Big| \le \chi(\Omega_{\gamma n} \cap \Omega_{1n})\, M M'' \Big| \int_{\Delta_F \times \Delta_G} d[H_n(x,y) - H_\theta(x,y)] \Big|,$$
where $M'' = \sup_{\Delta_F \times \Delta_G} |J^{cc}_{u_2}(F(x), G(y))|$. From Theorem 1.9 (i) of [43], the integral above converges to 0 in probability as $n \to \infty$. Proof completed.

Corollary 4. For fixed $\gamma$, $B_{\gamma 4n} \to_p 0$ as $n \to \infty$.

Proof: Note that
$$B_{\gamma 4n} = \chi(\Omega_{\gamma n} \cap \Omega_{1n}) \Big[\int_{\Delta_F \times \Delta_G} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y) - A_{2n}\Big] + \chi(\Omega_{\gamma n} \cap \Omega^c_{1n}) \Big[\int_{\Delta_F \times \Delta_G} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y) - A_{2n}\Big]$$
$$= -\chi(\Omega_{\gamma n} \cap \Omega_{1n}) \int_{\{(-\infty, X_{1n}) \cup (X_{nn}, \infty)\} \times \{(-\infty, Y_{1n}) \cup (Y_{nn}, \infty)\}} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y) + \chi(\Omega_{\gamma n} \cap \Omega^c_{1n}) \Big[\int_{\Delta_F \times \Delta_G} U_n(F(x))\, J^{cc}_{u_2}(F(x), G(y))\, dH_\theta(x,y) - A_{2n}\Big],$$
where $\Omega_{1n}$ is as defined in (3.15). By the Glivenko-Cantelli Theorem, $P(\{(-\infty, X_{1n}) \cup (X_{nn}, \infty)\} \times \{(-\infty, Y_{1n}) \cup (Y_{nn}, \infty)\}) \to 0$ for any $H_\theta$. Combining this with the fact that $J^{cc}_{u_2}$ is uniformly continuous on $(0,1)^2$, the right-hand side converges to 0 as $n \to \infty$.

To see that $B_{5n}$ converges to 0 in probability as $n \to \infty$, note that
$$\sum_{i=1}^{4} B_{\gamma in} = n^{1/2} \int_{\{\mathbb{R}\setminus D_F\} \times \{\mathbb{R}\setminus D_G\}} \big[J^{cc}(F(x), G_n(y)) - J^{cc}(F(x), G(y))\big]\, dH_n(x,y) - A_{2n}.$$
In summary, we have shown that all the $B$-terms and the $C$-term are negligible. Combining this with the results for the $A$-terms, we have established the asymptotic normality of $R_n^{cc}$.

3.4.2 Asymptotic normality of $R_n^{cd}$ and $R_n^{dc}$

It suffices to show the asymptotic normality of $R_n^{cd}$. Note that
$$n^{1/2}(R_n^{cd} - \mu^{cd}) = \sum_{i=1}^{4} A_{in} + \sum_{i=1}^{6} B_{in},$$
where $\mu^{cd} = E[N^{cd}(X,Y)]$ and
$$A_{1n} = n^{1/2} \int_{\mathbb{R}\setminus D_F} J^{cd}(F(x), G(D_G^-), G(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)];$$
$$A_{2n} = \int_{\mathbb{R}\setminus D_F} U_n(F(x))\, J^{cd}_{u_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G);$$
$$A_{3n} = \int_{\mathbb{R}\setminus D_F} V_n(G(D_G^-))\, J^{cd}_{v_1}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G);$$
$$A_{4n} = \int_{\mathbb{R}\setminus D_F} V_n(G(D_G))\, J^{cd}_{v_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G);$$
$$B_{1n} = \int_{\mathbb{R}\setminus D_F} U_n(F(x))\, J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)];$$
$$B_{2n} = \int_{\mathbb{R}\setminus D_F} V_n(G(D_G^-))\, J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)];$$
$$B_{3n} = \int_{\mathbb{R}\setminus D_F} V_n(G(D_G))\, J^{cd}_{v_2}(F(x), G(D_G^-), \Psi_n(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)];$$
$$B_{4n} = \int_{\mathbb{R}\setminus D_F} U_n(F(x)) \big[J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G(D_G^-), G(D_G))\big]\, dH_\theta(x, D_G);$$
$$B_{5n} = \int_{\mathbb{R}\setminus D_F} V_n(G(D_G^-)) \big[J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G)) - J^{cd}_{v_1}(F(x), G(D_G^-), G(D_G))\big]\, dH_\theta(x, D_G);$$
$$B_{6n} = \int_{\mathbb{R}\setminus D_F} V_n(G(D_G)) \big[J^{cd}_{v_2}(F(x), G(D_G^-), \Psi_n(D_G)) - J^{cd}_{v_2}(F(x), G(D_G^-), G(D_G))\big]\, dH_\theta(x, D_G);$$
where $\Phi_n(\cdot)$, $\Theta_n(\cdot)$, and $\Psi_n(\cdot)$ are defined by the Mean Value Theorem. Next, we show that the $A$-terms are asymptotically normal and the $B$-terms converge to 0 in probability.

(1) Note that $A_{1n}$ can be written as $A_{1n} = n^{-1/2} \sum_{j=1}^n A_{1jn}$, where $A_{1jn} = J^{cd}(F(X_j), G(D_G^-), G(D_G)) \cdot I_{\{j \in A_{cd}\}} - \mu^{cd}$. Note that the $A_{1jn}$ are i.i.d. with mean zero.
Since
$$J^{cd}(u, v_1, v_2) = \int_{v_1}^{v_2} \frac{\partial}{\partial\theta} \log c_\theta(u, v)\, dv,$$
using assumption (D2) we have
$$\int_{\mathbb{R}\setminus D_F} |J^{cd}(F(x), G(D_G^-), G(D_G))|^{2+\delta_0}\, dH_\theta(x, D_G) \le \Big[\int_{\mathbb{R}\setminus D_F} |J^{cd}(F(x), G(D_G^-), G(D_G))|^{(2+\delta_0)p_0}\, dH_\theta(x, D_G)\Big]^{1/p_0} \Big[\int_{\mathbb{R}\setminus D_F} 1^{(2+\delta_0)q_0}\, dH_\theta(x, D_G)\Big]^{1/q_0}$$
$$\le \Big[\int_{(0,1) \cap S_F^c} |J^{cd}(u, G(D_G^-), G(D_G))|^{(2+\delta_0)p_0} \Big(\int_{S_G} c_\theta(u,v)\, dv\Big)\, du\Big]^{1/p_0}.$$
Under assumption (D2),
$$\int_{(0,1) \cap S_F^c} |J^{cd}(u, v_1, v_2)|^{(2+\delta_0)p_0} \Big(\int_{S_G} c_\theta(u,v)\, dv\Big)\, du \le M_4 \int_{(0,1) \cap S_F^c} r(u)^{a(2+\delta_0)p_0} \Big(\int_{S_G} c_\theta(u,v)\, dv\Big)\, du \le M_4 M_6 \int_{(0,1) \cap S_F^c} r(u)^{a(2+\delta_0)p_0 + a}\, du < \infty,$$
if $a(2+\delta_0)p_0 + a > -1$. By the CLT, we have the asymptotic normality of $A_{1n}$.

(2) Note that $A_{2n}$ can be written as
$$A_{2n} = n^{1/2} \int_{\mathbb{R}\setminus D_F} (\hat{F}_n(x) - F(x))\, J^{cd}_{u_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G) + n^{1/2} \int_{\mathbb{R}\setminus D_F} (F_n(x) - \hat{F}_n(x))\, J^{cd}_{u_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G) = A^*_{2n1} + A^*_{2n2}.$$
Furthermore, $A^*_{2n1}$ can be written as $n^{-1/2} \sum_{j=1}^n A_{2jn}$, where $A_{2jn} = \int_{\mathbb{R}\setminus D_F} (\phi_{X_j}(x) - F(x))\, J^{cd}_{u_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G)$ are i.i.d. with mean 0. Using assumption (D2), we have
$$|A_{2jn}| \le M_5 \int_{(0,1) \cap S_F^c} r^{a-1}(u) \Big(\int_{S_G} c_\theta(u,v)\, dv\Big)\, du \le M_5 M_6 \int_{(0,1) \cap S_F^c} r^{2a-1}(u)\, du < \infty$$
as long as $2a - 1 > -1$. Thus we have the asymptotic normality of $A^*_{2n1}$. Using assumption (D2) again,
$$|A^*_{2n2}| = \frac{\sqrt{n}}{n+1} \Big|\int_{\mathbb{R}\setminus D_F} \hat{F}_n(x)\, J^{cd}_{u_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G)\Big| \le \frac{\sqrt{n}}{n+1} \sup_x |\hat{F}_n(x)| \int_{\mathbb{R}\setminus D_F} |J^{cd}_{u_2}(F(x), G(D_G^-), G(D_G))|\, dH_\theta(x, D_G) = o_p(1).$$
Therefore, the asymptotic normality of $A_{2n}$ has been established.

(3) Note that $A_{3n}$ can be written as
$$A_{3n} = n^{1/2} \int_{\mathbb{R}\setminus D_F} [\hat{G}_n(D_G^-) - G(D_G^-)]\, J^{cd}_{v_1}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G) + n^{1/2} \int_{\mathbb{R}\setminus D_F} [G_n(D_G^-) - \hat{G}_n(D_G^-)]\, J^{cd}_{v_1}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G) = A^*_{3n1} + A^*_{3n2}.$$
Similarly, $A^*_{3n1}$ can be written as $n^{-1/2} \sum_{j=1}^n A_{3jn}$, where $A_{3jn} = \int_{\mathbb{R}\setminus D_F} (\psi_{Y_j}(D_G) - G(D_G^-))\, J^{cd}_{v_1}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G)$ are i.i.d. with mean zero, and
$$\psi_{Y_j}(y) = \begin{cases} 1 & \text{if } Y_j < y, \\ 0 & \text{if } Y_j \ge y. \end{cases}$$
Noting that the absolute value of the random part $|\psi_{Y_j}(D_G) - G(D_G^-)|$ in $A_{3jn}$ is bounded above by 1, we have
$$|A_{3jn}| \le \int_{(0,1) \cap S_F^c} |J^{cd}_{v_1}(u, G(D_G^-), G(D_G))| \Big(\int_{S_G} c_\theta(u,v)\, dv\Big)\, du.$$
By assumption (D2), the integral above is finite. Therefore $A_{3jn}$ has an absolute moment of order $2+\delta_1$ for some $\delta_1 > 0$, which leads to the asymptotic normality of $A^*_{3n1}$. Using an argument similar to that used for $A^*_{2n2}$ in (2), one can show that $A^*_{3n2} = o_p(1)$. In summary, we have the asymptotic normality of $A_{3n}$.

(4) The result for $A_{4n}$ can be obtained in a similar way by noting that $A_{4n}$ can be written as
$$A_{4n} = n^{1/2} \int_{\mathbb{R}\setminus D_F} [\hat{G}_n(D_G) - G(D_G)]\, J^{cd}_{v_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G) + n^{1/2} \int_{\mathbb{R}\setminus D_F} [G_n(D_G) - \hat{G}_n(D_G)]\, J^{cd}_{v_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G) = A^*_{4n1} + A^*_{4n2},$$
and $A^*_{4n1}$ can be written as $n^{-1/2} \sum_{j=1}^n A_{4jn}$, where $A_{4jn} = \int_{\mathbb{R}\setminus D_F} (\phi_{Y_j}(D_G) - G(D_G))\, J^{cd}_{v_2}(F(x), G(D_G^-), G(D_G))\, dH_\theta(x, D_G)$ are i.i.d. with mean zero. Using arguments similar to those in (3), we get that $A_{4n}$ is asymptotically normally distributed with mean 0.

(5) Finally, we show that the sum of the $A_{in}$, $i = 1, \ldots, 4$, converges to a normal random variable with mean 0. Note that the sum can be written as $n^{-1/2} \sum_{j=1}^n (A_{1jn} + A_{2jn} + A_{3jn} + A_{4jn}) + A^*_{2n2} + A^*_{3n2} + A^*_{4n2} \equiv n^{-1/2} \sum_{j=1}^n A^{**}_{2jn} + A^*_{2n2} + A^*_{3n2} + A^*_{4n2}$, where $A^{**}_{2jn} = A_{1jn} + A_{2jn} + A_{3jn} + A_{4jn}$ depends on $(X_j, Y_j)$ only, hence the $A^{**}_{2jn}$ are i.i.d. with mean 0 as shown before (similarly, we can define $A^{**}_{3jn}$ for $R_n^{dc}$), and $A^*_{2n2}$, $A^*_{3n2}$ and $A^*_{4n2}$ are negligible.
Using the CLT, we have that the sum of the $A_{in}$, $i = 1, \ldots, 4$, is asymptotically normally distributed with mean 0. We finish the proof for $R_n^{cd}$ with a few corollaries.

Corollary 5. $B_{1n} \to_p 0$ as $n \to \infty$.

Proof: Note that
$$B_{1n} = \chi(\Omega_{1n}) \int_{\mathbb{R}\setminus D_F} U_n(F(x))\, J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)] + \chi(\Omega^c_{1n}) \int_{\mathbb{R}\setminus D_F} U_n(F(x))\, J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)],$$
where $\Omega_{1n}$ is defined in (3.15), which tells us that the second term on the right-hand side of the equality above goes to 0 in probability as $n \to \infty$. Also,
$$\chi(\Omega_{1n}) \Big| \int_{\mathbb{R}\setminus D_F} U_n(F(x))\, J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))\, d[H_n - H_\theta] \Big| \le \chi(\Omega_{1n})\, M \Big| \int_{\mathbb{R}\setminus D_F} J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))\, d[H_n - H_\theta] \Big|$$
$$= \chi(\Omega_{1n})\, M \Big| \int_{\Delta_F} J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))\, d[H_n - H_\theta] \Big| + \chi(\Omega_{1n})\, M \Big| \int_{(-\infty, X_{1n}) \cup (X_{nn}, \infty)} J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))\, d[H_n - H_\theta] \Big|;$$
the second term on the right-hand side of the inequality goes to 0 in probability as $n \to \infty$ by the Glivenko-Cantelli Theorem and assumption (D2). Since $J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G))$ is bounded above almost surely for $x \in \Delta_F$, and $H_n(x,y)$ converges to $H_\theta(x,y)$ in distribution as $n \to \infty$, using Theorem 1.9 (i) of [43] once again, the first term on the right-hand side converges to 0. In summary, we have shown that $B_{1n} \to_p 0$ as $n \to \infty$.

Corollary 6. $B_{2n} \to_p 0$ and $B_{3n} \to_p 0$ as $n \to \infty$.

Proof: It suffices to show the result for $B_{2n}$. By the CLT, $V_n(G(D_G^-))$ converges to a normal random variable as $n \to \infty$, so we can define a set $\Omega_{2n}$ such that
$$P(\Omega_{2n}) = P(\{\omega : \sup_\omega |V_n(G(D_G^-))| \le M\}) > 1 - \epsilon.$$
Therefore,
$$B_{2n} = \chi(\Omega_{2n}) \int_{\mathbb{R}\setminus D_F} V_n(G(D_G^-))\, J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)] + \chi(\Omega^c_{2n}) \int_{\mathbb{R}\setminus D_F} V_n(G(D_G^-))\, J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G))\, d[H_n(x, D_G) - H_\theta(x, D_G)],$$
and the second term on the right-hand side of the equality above goes to 0 in probability as $n \to \infty$. Also,
$$\chi(\Omega_{2n}) \Big| \int_{\mathbb{R}\setminus D_F} V_n(G(D_G^-))\, J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G))\, d[H_n - H_\theta] \Big| \le \chi(\Omega_{2n})\, M \Big| \int_{\Delta_F} J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G))\, d[H_n - H_\theta] \Big| + \chi(\Omega_{2n})\, M \Big| \int_{(-\infty, X_{1n}) \cup (X_{nn}, \infty)} J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G))\, d[H_n - H_\theta] \Big|;$$
the second term on the right-hand side of the inequality goes to 0 in probability as $n \to \infty$ by the Glivenko-Cantelli Theorem and assumption (D2). From assumption (D2), combined with the definition of $\Theta_n(\cdot)$ and the fact that $J^{cd}_{v_1}(u_2, v_1, v_2)$ is continuous with respect to $u_2$, $v_1$, and $v_2$ on $(0,1)^3$, we know that $J^{cd}_{v_1}(F(x), \Theta_n(D_G), G_n(D_G))$ is bounded above almost surely for $x \in \Delta_F$. Noting that $H_n(x,y)$ converges to $H_\theta(x,y)$ in distribution as $n \to \infty$ and using Theorem 1.9 (i) of [43] once again, the first term on the right-hand side converges to 0. In summary, we have shown that $B_{2n} \to_p 0$ as $n \to \infty$.

Corollary 7. $B_{4n} \to_p 0$ as $n \to \infty$.

Proof: Note that
$$B_{4n} = \chi(\Omega_{1n}) \int_{\mathbb{R}\setminus D_F} U_n(F(x)) \big[J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G_n(D_G^-), G_n(D_G))\big]\, dH_\theta(x, D_G) + \chi(\Omega^c_{1n}) \int_{\mathbb{R}\setminus D_F} U_n(F(x)) \big[J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G_n(D_G^-), G_n(D_G))\big]\, dH_\theta(x, D_G),$$
where $\Omega_{1n}$ is defined in (3.15). Therefore, the second term on the right-hand side of the equality above goes to 0 in probability as $n \to \infty$.
Also,
$$\chi(\Omega_{1n}) \Big| \int_{\mathbb{R}\setminus D_F} U_n(F(x)) \big[J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G_n(D_G^-), G_n(D_G))\big]\, dH_\theta(x, D_G) \Big|$$
$$\le \chi(\Omega_{1n})\, M \int_{\mathbb{R}\setminus D_F} \big|J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G_n(D_G^-), G_n(D_G))\big|\, dH_\theta(x, D_G)$$
$$= \chi(\Omega_{1n})\, M \int_{\Delta_F} \big|J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G_n(D_G^-), G_n(D_G))\big|\, dH_\theta(x, D_G) + \chi(\Omega_{1n})\, M \int_{(-\infty, X_{1n}) \cup (X_{nn}, \infty)} \big|J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G_n(D_G^-), G_n(D_G))\big|\, dH_\theta(x, D_G);$$
the second term on the right-hand side of the inequality goes to 0 in probability as $n \to \infty$ by the Glivenko-Cantelli Theorem and assumption (D2). Since $|J^{cd}_{u_2}(\Phi_n(x), G_n(D_G^-), G_n(D_G)) - J^{cd}_{u_2}(F(x), G_n(D_G^-), G_n(D_G))|$ is bounded above almost surely for $x \in \Delta_F$, and $H_n(x,y)$ converges to $H_\theta(x,y)$ in distribution as $n \to \infty$, using Theorem 1.9 (i) of [43] once again, the first term on the right-hand side converges to 0. In summary, we have shown that $B_{4n} \to_p 0$ as $n \to \infty$.

Corollary 8. $B_{5n} \to_p 0$ and $B_{6n} \to_p 0$ as $n \to \infty$.

Proof: The result follows from arguments similar to those used in the proofs of Corollaries 6 and 7, so the proof is omitted.

In summary, we have the asymptotic normality of $R_n^{cd}$.

3.4.3 Asymptotic normality of $R_n^{dd}$

Note that
$$n^{1/2}(R_n^{dd} - \mu^{dd}) = \sum_{k=1}^{5} A_{kn} + \sum_{k=1}^{4} B_k,$$
where $\mu^{dd} = E[N^{dd}(X,Y)]$,
$$A_{1n} = n^{1/2}\, J^{dd}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, [dH_n^{dd} - dH^{dd}],$$
$$A_{2n} = V_n(G(D_G))\, J^{dd}_{v_2}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd},$$
$$A_{3n} = U_n(F(D_F))\, J^{dd}_{u_2}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd},$$
$$A_{4n} = V_n(G(D_G^-))\, J^{dd}_{v_1}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd},$$
$$A_{5n} = U_n(F(D_F^-))\, J^{dd}_{u_1}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd},$$
$$B_1 = V_n(G(D_G)) \big[J^{dd}_{v_2}(F(D_F^-), F(D_F), G(D_G^-), \Psi_n)\, dH_n^{dd} - J^{dd}_{v_2}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd}\big],$$
$$B_2 = U_n(F(D_F)) \big[J^{dd}_{u_2}(F(D_F^-), \Phi_n, G(D_G^-), G(D_G))\, dH_n^{dd} - J^{dd}_{u_2}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd}\big],$$
$$B_3 = V_n(G(D_G^-)) \big[J^{dd}_{v_1}(F(D_F^-), F(D_F), \Theta_n, G(D_G))\, dH_n^{dd} - J^{dd}_{v_1}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd}\big],$$
$$B_4 = U_n(F(D_F^-)) \big[J^{dd}_{u_1}(\Gamma_n, F(D_F), G(D_G^-), G(D_G))\, dH_n^{dd} - J^{dd}_{u_1}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))\, dH^{dd}\big],$$
with $n^*_{dd}$ = the number of observations such that $X_j = D_F$ and $Y_j = D_G$, $dH_n^{dd} = n^*_{dd}/n$, and $dH^{dd} = P(X = D_F \text{ and } Y = D_G)$.

We need to show that the $A$-terms are asymptotically normal and that the $B$-terms converge to 0 in probability as $n \to \infty$. By the SLLN, it is obvious that $dH_n^{dd} \to dH^{dd}$ almost surely as $n \to \infty$, and by the CLT, in distribution,
$$U_n(F(D_F^-)) \to N(0,\, F(D_F^-)(1 - F(D_F^-))), \qquad U_n(F(D_F)) \to N(0,\, F(D_F)(1 - F(D_F))),$$
$$V_n(G(D_G^-)) \to N(0,\, G(D_G^-)(1 - G(D_G^-))), \qquad V_n(G(D_G)) \to N(0,\, G(D_G)(1 - G(D_G))),$$
$$n^{1/2}[dH_n^{dd} - dH^{dd}] \to N(0,\, dH^{dd}(1 - dH^{dd}))$$
as $n \to \infty$. Note that the quantities $J^{dd}(F(D_F^-), F(D_F), G(D_G^-), G(D_G))$, $J^{dd}_{v_2}(\cdots)$, $J^{dd}_{u_2}(\cdots)$, $J^{dd}_{v_1}(\cdots)$ and $J^{dd}_{u_1}(\cdots)$ appearing in $A_{1n}$, $A_{2n}$, $A_{3n}$, $A_{4n}$ and $A_{5n}$, respectively, are fixed numbers. Therefore, the $A$-terms are asymptotically normally distributed with mean 0.
To show that the sum of the A-terms is still normal, one only needs to note that
$$U_n(F(D_F^-)) = n^{1/2}(F_n(D_F^-) - F(D_F^-)) + n^{1/2}(\hat F_n(D_F^-) - F_n(D_F^-)) = n^{-1/2}\sum_{j=1}^n \bigl(\psi_{X_j}(D_F^-) - F(D_F^-)\bigr) + o_p(1) \equiv n^{-1/2}\sum_{j=1}^n H^*_{1jn} + o_p(1),$$
$$U_n(F(D_F)) = n^{1/2}(F_n(D_F) - F(D_F)) + n^{1/2}(\hat F_n(D_F) - F_n(D_F)) = n^{-1/2}\sum_{j=1}^n \bigl(\phi_{X_j}(D_F) - F(D_F)\bigr) + o_p(1) \equiv n^{-1/2}\sum_{j=1}^n H^*_{2jn} + o_p(1),$$
$$V_n(G(D_G^-)) = n^{1/2}(G_n(D_G^-) - G(D_G^-)) + n^{1/2}(\hat G_n(D_G^-) - G_n(D_G^-)) = n^{-1/2}\sum_{j=1}^n \bigl(\psi_{Y_j}(D_G^-) - G(D_G^-)\bigr) + o_p(1) \equiv n^{-1/2}\sum_{j=1}^n H^*_{3jn} + o_p(1),$$
$$V_n(G(D_G)) = n^{1/2}(G_n(D_G) - G(D_G)) + n^{1/2}(\hat G_n(D_G) - G_n(D_G)) = n^{-1/2}\sum_{j=1}^n \bigl(\phi_{Y_j}(D_G) - G(D_G)\bigr) + o_p(1) \equiv n^{-1/2}\sum_{j=1}^n H^*_{4jn} + o_p(1),$$
and
$$n^{1/2}[dH_n^{dd} - dH^{dd}] = n^{-1/2}\sum_{j=1}^n \bigl(I_{\{j \in A_{dd}\}} - dH^{dd}\bigr) \equiv n^{-1/2}\sum_{j=1}^n H^*_{5jn},$$
where the $H^*_{ijn}$ are i.i.d. across $j$ and depend on $(X_j, Y_j)$ only, for $i = 1, \ldots, 5$. The sum $A_{1n} + A_{2n} + A_{3n} + A_{4n} + A_{5n}$ can therefore be written as $n^{-1/2}\sum_{j=1}^n A^{**}_{4jn}$ plus four negligible terms, where the $A^{**}_{4jn}$ are i.i.d. sums of products of the $H^*_{ijn}$ and certain fixed numbers. Using the CLT one more time, we have that the sum of the A-terms is asymptotically normally distributed with mean 0.

Under assumption (D4), $B_{in}$ converges to 0 in probability as $n \to \infty$, for $i = 1, \ldots, 4$. We have thus shown that $n^{1/2}(R_n^{dd} - \mu^{dd})$ is asymptotically normally distributed with mean 0.

3.4.4 Asymptotic normality of $R_n$

To show that $n^{1/2}R_n = n^{1/2}(R_n^{cc} + R_n^{cd} + R_n^{dc} + R_n^{dd})$ is asymptotically normal, we use the following notation:
$$n^{1/2}(R_n^{cc} - \mu^{cc}) \Rightarrow R^{cc} \sim N(0, \mathrm{var}(M^{cc}(X, Y))), \qquad n^{1/2}(R_n^{cd} - \mu^{cd}) \Rightarrow R^{cd} \sim N(0, \mathrm{var}(M^{cd}(X, Y))),$$
$$n^{1/2}(R_n^{dc} - \mu^{dc}) \Rightarrow R^{dc} \sim N(0, \mathrm{var}(M^{dc}(X, Y))), \qquad n^{1/2}(R_n^{dd} - \mu^{dd}) \Rightarrow R^{dd} \sim N(0, \mathrm{var}(M^{dd}(X, Y))).$$
Noting that
$$n^{1/2}(R_n^{cc} - \mu^{cc}) = n^{-1/2}\sum_{j=1}^n A^{**}_{1jn} + \text{a few negligible terms}, \qquad n^{1/2}(R_n^{cd} - \mu^{cd}) = n^{-1/2}\sum_{j=1}^n A^{**}_{2jn} + \text{a few negligible terms},$$
$$n^{1/2}(R_n^{dc} - \mu^{dc}) = n^{-1/2}\sum_{j=1}^n A^{**}_{3jn} + \text{a few negligible terms}, \qquad n^{1/2}(R_n^{dd} - \mu^{dd}) = n^{-1/2}\sum_{j=1}^n A^{**}_{4jn} + \text{a few negligible terms},$$
by the CLT we have
$$n^{1/2}(R_n - \mu) = n^{-1/2}\sum_{j=1}^n \bigl(A^{**}_{1jn} + A^{**}_{2jn} + A^{**}_{3jn} + A^{**}_{4jn}\bigr) + \text{a few negligible terms} \;\Rightarrow\; N(0, \sigma^2),$$
where $\mu = \mu^{cc} + \mu^{cd} + \mu^{dc} + \mu^{dd}$ and $\sigma^2 = \mathrm{var}\bigl[M^{cc}(X, Y) + M^{cd}(X, Y) + M^{dc}(X, Y) + M^{dd}(X, Y)\bigr]$.

3.5 Proof of the Main Result

We finish this chapter by providing the proof of Theorem 3.1.1.

Proof of Theorem 3.1.1. As shown in Section 3.3, $B_n \to \beta$ almost surely, and as shown in Section 3.4, $A_n$ is asymptotically normal with mean 0 and variance $\sigma^2$ as $n \to \infty$. Using Slutsky's theorem, $n^{1/2}(\hat\theta_n - \theta) = n^{1/2}A_n/B_n$ is asymptotically normal with mean 0 and variance $v^2 = \sigma^2/\beta^2$. In summary, we have shown the consistency and asymptotic normality of the semi-parametric estimator $\hat\theta_n$.

CHAPTER 4

A VARIANCE ESTIMATOR OF BIVARIATE DISTRIBUTIONS

In Chapter 3, we provided an explicit formula for the asymptotic variance $v^2 = \sigma^2/\beta^2$. Suppose estimators $\hat\sigma^2$ and $\hat\beta^2$ could be found for $\sigma^2$ and $\beta^2$, respectively. A rough-and-ready estimator of the asymptotic variance would then be given by $\hat v^2 = \hat\sigma^2/\hat\beta^2$. If the variables given in (3.9) and (3.10) could be observed, one could simply estimate $\sigma^2$ and $\beta$ by the respective sample variance and sample mean. As this is not possible, the corresponding pseudo-observations can be used instead; these are defined in terms of $C_n$, the re-scaled empirical copula function of the bivariate sample, namely
$$C_n(u, v) = \frac{1}{n}\sum_{j=1}^n I_{\{F_n(X_j) \le u,\; G_n(Y_j) \le v\}}.$$
Let $l^{cc}_{\theta,\theta}$, $l^{cd}_{\theta,\theta}$, $l^{dc}_{\theta,\theta}$, and $l^{dd}_{\theta,\theta}$ be the decomposed components of $l_{\theta,\theta}$ with respect to $c^*_\theta$ in (2.7) under the different situations.
Similarly, one can define $l^{cc}_\theta$, $l^{cd}_\theta$, $l^{dc}_\theta$, and $l^{dd}_\theta$ for $l_\theta$. From (3.9), we have the following:
$$\hat\beta = \frac{1}{n}\sum_{j=1}^n \bigl[\hat N_j^{cc} + \hat N_j^{cd} + \hat N_j^{dc} + \hat N_j^{dd}\bigr], \qquad (4.1)$$
where
$$\hat N_j^{cc} = I_{\{j \in A_{cc}\}}\; l^{cc}_{\theta,\theta}(\hat\theta_n, F_n(X_j), G_n(Y_j)),$$
$$\hat N_j^{cd} = I_{\{j \in A_{cd}\}}\; l^{cd}_{\theta,\theta}(\hat\theta_n, F_n(X_j), G_n(Y_j^-), G_n(Y_j)),$$
$$\hat N_j^{dc} = I_{\{j \in A_{dc}\}}\; l^{dc}_{\theta,\theta}(\hat\theta_n, F_n(X_j^-), F_n(X_j), G_n(Y_j)),$$
$$\hat N_j^{dd} = I_{\{j \in A_{dd}\}}\; l^{dd}_{\theta,\theta}(\hat\theta_n, F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j)).$$

To get $\hat\sigma^2$ for $\sigma^2$ in (3.11), we need more notation. Note that rearranging $\{(X_j, Y_j)\}$, $j = 1, \ldots, n$, does not change the value of $\hat\sigma^2$; we may therefore assume the sample is ordered so that the $X_j$'s are non-decreasing. Let $R_{X(j)} = \sum_i I_{\{X_i \le X_j\}}$ be the rank of $X_j$ in the sequence of $X$'s, and let $R^-_{X(j)} = \sum_i I_{\{X_i < X_j\}}$; the quantities $R_{Y(j)}$ and $R^-_{Y(j)}$ are defined similarly for the $Y$'s. Table 4.1 illustrates these quantities on a small sample.

 j   (X_j, Y_j)    R_X(j)   R_Y(j)
 1   (2.1, 1.0)       1        3
 2   (2.5, 0.5)       2        1
 3   (3.2, 1.0)       5        3
 4   (3.2, 1.5)       5        4
 5   (3.2, 2.0)       5        5
 6   (3.5, 2.5)       6        7
 7   (3.7, 2.5)       7        7

Table 4.1. Relationship among (X_j, Y_j), R_X(j), and R_Y(j).

The required pseudo-observations are $\hat M_j = \hat M_j^{cc} + \hat M_j^{cd} + \hat M_j^{dc} + \hat M_j^{dd}$, where, for instance,
$$\hat M_j^{dc} = \sum_{h=1}^{d_F} l^{dc}_\theta(\hat\theta_n, F_n(X_j^-), F_n(X_j), G_n(Y_j))\, I_{\{j \in A_{dc},\, X_j = D_{Fh}\}}$$
$$\quad + \frac{1}{n}\sum_{h=1}^{d_F}\Biggl[\sum_{k:\, R_{Y(k)} \ge R_{Y(j)}} l^{dc}_{\theta,v_2}\Bigl(\hat\theta_n, \frac{R^-_{X(k)}}{n+1}, \frac{R_{X(k)}}{n+1}, \frac{R_{Y(k)}}{n+1}\Bigr) I_{\{k \in A_{dc},\, X_k = D_{Fh}\}}$$
$$\quad + \sum_{k:\, R_{X(k)} > R_{X(j)}} l^{dc}_{\theta,u_1}\Bigl(\hat\theta_n, \frac{R^-_{X(k)}}{n+1}, \frac{R_{X(k)}}{n+1}, \frac{R_{Y(k)}}{n+1}\Bigr) I_{\{k \in A_{dc},\, X_k = D_{Fh}\}} + \sum_{k:\, R_{X(k)} \ge R_{X(j)}} l^{dc}_{\theta,u_2}\Bigl(\hat\theta_n, \frac{R^-_{X(k)}}{n+1}, \frac{R_{X(k)}}{n+1}, \frac{R_{Y(k)}}{n+1}\Bigr) I_{\{k \in A_{dc},\, X_k = D_{Fh}\}}\Biggr],$$
and
$$\hat M_j^{dd} = \sum_{h=1}^{d_F}\sum_{l=1}^{d_G} l^{dd}_\theta(\hat\theta_n, F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j))\, I_{\{X_j = D_{Fh},\, Y_j = D_{Gl}\}}$$
$$\quad + \frac{1}{n}\sum_{h=1}^{d_F}\sum_{l=1}^{d_G}\Biggl[\sum_{k:\, R_{Y(k)} > R_{Y(j)}} l^{dd}_{\theta,v_1}\Bigl(\hat\theta_n, \frac{R^-_{X(k)}}{n+1}, \frac{R_{X(k)}}{n+1}, \frac{R^-_{Y(k)}}{n+1}, \frac{R_{Y(k)}}{n+1}\Bigr) I_{\{X_k = D_{Fh},\, Y_k = D_{Gl}\}}$$
$$\quad + \sum_{k:\, R_{Y(k)} \ge R_{Y(j)}} l^{dd}_{\theta,v_2}(\cdots)\, I_{\{X_k = D_{Fh},\, Y_k = D_{Gl}\}} + \sum_{k:\, R_{X(k)} > R_{X(j)}} l^{dd}_{\theta,u_1}(\cdots)\, I_{\{X_k = D_{Fh},\, Y_k = D_{Gl}\}} + \sum_{k \ne j} l^{dd}_{\theta,u_2}(\cdots)\, I_{\{X_k = D_{Fh},\, Y_k = D_{Gl}\}}\cdot\frac{n^*_{hl}}{n}\Biggr],$$
where each $l^{dd}_{\theta,\cdot}(\cdots)$ is evaluated at the same rank-based pseudo-observations $\bigl(R^-_{X(k)}/(n+1),\, R_{X(k)}/(n+1),\, R^-_{Y(k)}/(n+1),\, R_{Y(k)}/(n+1)\bigr)$, and $n^*_{hl}$ is the number of observations $(X_j, Y_j)$ with $X_j = D_{Fh}$ and $Y_j = D_{Gl}$. The pseudo-observations $\hat M_j^{cc}$ and $\hat M_j^{cd}$ are defined analogously, with the roles of the discrete and continuous components interchanged as appropriate; for instance, the correction terms of $\hat M_j^{cd}$ involve $l^{cd}_{\theta,v_2}\bigl(\hat\theta_n, R_{X(k)}/(n+1), R^-_{Y(k)}/(n+1), R_{Y(k)}/(n+1)\bigr)$ summed over $\{k \in A_{cd} : Y_k = D_{Gl},\, R_{Y(k)} \ge R_{Y(j)}\}$. The estimator $\hat\sigma^2$ is then the sample variance of $\hat M_1, \ldots, \hat M_n$.

Next, to establish the consistency of $\hat v_n^2$, it suffices to show that $\hat\sigma_n^2$ and $\hat\beta_n$ are themselves consistent. To prove that $\hat\beta_n - \beta$ converges to zero almost surely, for example, express the difference as
$$\frac{1}{n}\sum_{j=1}^n J_{\hat\theta_n}(F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j)) - E\bigl[J_\theta(F(X^-), F(X), G(Y^-), G(Y))\bigr]$$
in terms of $J_\theta = l_{\theta,\theta}$. By the triangle inequality, this quantity is no greater than
$$\frac{1}{n}\sum_{j=1}^n \Bigl| J_{\hat\theta_n}(F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j)) - J_\theta(F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j)) \Bigr|$$
$$+ \Biggl| \frac{1}{n}\sum_{j=1}^n J_\theta(F_n(X_j^-), F_n(X_j), G_n(Y_j^-), G_n(Y_j)) - E\bigl[J_\theta(F(X^-), F(X), G(Y^-), G(Y))\bigr] \Biggr|.$$
Assuming that $J_\theta$ is bounded by an integrable function in a neighborhood of the true value of $\theta$, by the convergence of maximum likelihood estimators, the first summand converges to zero by the Dominated Convergence Theorem. Since the second term vanishes asymptotically by an argument similar to that used in Chapter 3, it follows that $\hat\beta_n \to \beta$ almost surely. The argument for $\hat\sigma_n$ is similar, though somewhat more involved; it is not presented here.

In summary, we provided a variance estimator $\hat v^2$ of $v^2$ and illustrated its consistency.
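To make the rank-based quantities above concrete, here is a minimal Python sketch (our own illustration, not part of the thesis; all helper names are hypothetical) that reproduces the ranks of Table 4.1 and evaluates the re-scaled empirical copula $C_n$:

import numpy as np

xy = np.array([(2.1, 1.0), (2.5, 0.5), (3.2, 1.0), (3.2, 1.5),
               (3.2, 2.0), (3.5, 2.5), (3.7, 2.5)])
x, y = xy[:, 0], xy[:, 1]
n = len(x)

# R_X(j) = #{i : X_i <= X_j}; tied values share the largest index, as in Table 4.1.
R_x = np.array([np.sum(x <= xj) for xj in x])
R_x_minus = np.array([np.sum(x < xj) for xj in x])   # R^-_X(j) = #{i : X_i < X_j}
R_y = np.array([np.sum(y <= yj) for yj in y])

# Re-scaled pseudo-observations R/(n+1) used in the variance estimator.
u = R_x / (n + 1.0)
v = R_y / (n + 1.0)

def C_n(u0, v0):
    """Re-scaled empirical copula C_n(u, v) of the bivariate sample."""
    return np.mean((u <= u0) & (v <= v0))

print(R_x)          # [1 2 5 5 5 6 7], matching Table 4.1
print(R_x_minus)    # [0 1 2 2 2 5 6]
print(C_n(0.5, 0.5))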
CHAPTER 5

EXTENSIONS TO HIGHER DIMENSIONS

The previous developments extend more or less automatically to situations where one wishes to estimate a multidimensional dependence structure semi-parametrically. Let a boldfaced letter such as $\mathbf{x}$ denote a $D$-tuple ($D \ge 2$) of real numbers, that is, $\mathbf{x} \equiv (x_1, \ldots, x_D)^T$ where $x_k \in \mathbb{R}$ for $k = 1, \ldots, D$, and let $F_1, \ldots, F_D$ denote $D$ univariate mixed marginals on the real line given by
$$F_k(x_k) = \sum_{h=1}^{d_k} p_{kh}\, I_{\{D_{kh} \le x_k\}} + \Bigl(1 - \sum_{h=1}^{d_k} p_{kh}\Bigr)\int_{w \le x_k} f_k(w)\,dw,$$
where $D_{kh}$ is the $h$-th jump point of $F_k$ with $P(X_k = D_{kh}) = p_{kh}$, for $h = 1, \ldots, d_k$, and $f_k(\cdot)$ is a continuous density function with support on the real line. When restricted to a particular copula family, say $\{C_\theta : \theta = (\theta_1, \ldots, \theta_q)^T \in \mathcal{A} \subseteq \mathbb{R}^q\}$, a multivariate joint distribution function for $(x_1, \ldots, x_D)^T$ can be defined as follows:
$$F^D(x_1, \ldots, x_D) = C_\theta(F_1(x_1), \ldots, F_D(x_D)). \qquad (5.1)$$

Result 1: $F^D(x_1, \ldots, x_D)$ defined in (5.1) is a valid joint distribution function on $\mathbb{R}^D$.

Proof: For $F^D(x_1, \ldots, x_D)$ to be a valid distribution function on $\mathbb{R}^D$, the following four conditions should be satisfied: (1) $0 \le F^D(x_1, \ldots, x_D) \le 1$ for all $(x_1, \ldots, x_D)$; (2) $F^D(x_1, \ldots, x_D) \to 0$ as $\max(x_1, \ldots, x_D) \to -\infty$; (3) $F^D(x_1, \ldots, x_D) \to 1$ as $\min(x_1, \ldots, x_D) \to +\infty$; and (4) every pair $(x_1, \ldots, x_D)$ and $(y_1, \ldots, y_D) \in \mathbb{R}^D$ with $x_k \le y_k$ for $k = 1, \ldots, D$ defines a $D$-box $V = [x_1, y_1]\times[x_2, y_2]\times\cdots\times[x_D, y_D]$, and we require $\sum_c \mathrm{sgn}(c)\,F^D(c) \ge 0$, where the sum is taken over all vertices $c$ of $V$ and $\mathrm{sgn}(\cdot)$ is defined as in (1.3).

Since $F^D(x_1, \ldots, x_D) = C_\theta(F_1(x_1), \ldots, F_D(x_D)) = P(U_1 \le F_1(x_1), \ldots, U_D \le F_D(x_D))$, (1) follows. To prove (2), note that $F_k(x_k) \to 0$ for all $k = 1, \ldots, D$ when $\max(x_1, \ldots, x_D) \to -\infty$; hence $F^D(x_1, \ldots, x_D) \to 0$ by the properties of a cumulative distribution function. (3) follows similarly. (4) is a direct consequence of (1.3) and (5.1). This completes the proof.

Define $\mathcal{C}(F_k)$ to be the collection of all points of continuity of $F_k$, and $\mathcal{J}(F_k) = \{D_{k1}, \ldots, D_{kd_k}\}$ to be the collection of all jump points of $F_k$. For a fixed $\mathbf{x} \in \mathbb{R}^D$, every component of $\mathbf{x}$ corresponds to either a point of continuity or a point of discontinuity of the corresponding marginal distribution. Suppose the indexes $c_1, c_2, \ldots, c_H \in \{1, 2, \ldots, D\}$ correspond to $x_{c_h} \in \mathcal{C}(F_{c_h})$ (that is, $x_{c_h}$ belongs to the set of continuity points of $F_{c_h}$), and the remaining indexes, say $d_1, d_2, \ldots, d_L$, with $H + L = D$, are the indexes of the marginals such that $x_{d_l} = D_{d_l, i_{d_l}}$, $l = 1, 2, \ldots, L$.

When $\{c_1, c_2, \ldots, c_H\} = \{1, 2, \ldots, D\}$, the density of $F^D$ is given by
$$dF^D = c_\theta(F_1(x_1), \ldots, F_D(x_D)) \prod_{k=1}^D f_k(x_k),$$
and if $\{c_1, c_2, \ldots, c_H\} \subset \{1, 2, \ldots, D\}$, the density of $F^D$ is given by
$$dF^D = \int\cdots\int_{\prod_{l=1}^L [F_{d_l}(D^-_{d_l, i_{d_l}}),\, F_{d_l}(D_{d_l, i_{d_l}})]} c_\theta(u_1, \ldots, u_D)\, du_{d_1}\cdots du_{d_L} \cdot \prod_{h=1}^H f_{c_h}(x_{c_h}),$$
where $u_{c_h} = F_{c_h}(x_{c_h})$. Note that the above can be written as $c^*_\theta(F_1, \ldots, F_D)(\mathbf{x}) \cdot \prod_{h=1}^H f_{c_h}(x_{c_h})$.

Letting $\mathbf{X}_j = (X_{1j}, X_{2j}, \ldots, X_{Dj})^T$, $j = 1, \ldots, n$, represent a random sample from $F^D$, the semi-parametric estimator $\hat\theta_n$ of $\theta$ would then be obtained as a solution of the system
$$\frac{1}{n}\sum_{j=1}^n \frac{\partial}{\partial\theta_i}\log\bigl[dF^D(X_{1j}, \ldots, X_{Dj})\bigr] = 0, \quad (1 \le i \le q). \qquad (5.2)$$
Since $\theta$ is the only vector of parameters of interest, the component of the likelihood that matters is
$$\frac{1}{n}\sum_{j=1}^n \frac{\partial}{\partial\theta_i}\log\bigl[c^*_\theta(F_1, \ldots, F_D)(X_{1j}, \ldots, X_{Dj})\bigr] = 0, \quad (1 \le i \le q). \qquad (5.3)$$
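As an illustration of the mixed-marginal model at the start of this chapter, the following Python sketch (an illustration under made-up values, not thesis code; the jump locations, masses, and the standard normal continuous part are our assumptions) evaluates one marginal $F_k$ and its left limit $F_k(x^-)$:

import numpy as np
from scipy.stats import norm

# One mixed marginal F_k: jump points D_kh with masses p_kh plus a
# standard normal continuous part (illustrative values only).
jumps = np.array([0.2, 0.7])     # D_k1, D_k2
masses = np.array([0.3, 0.1])    # p_k1, p_k2
cont_w = 1.0 - masses.sum()      # weight of the continuous component

def F(x):
    """Mixed CDF: F_k(x) = sum_h p_kh 1{D_kh <= x} + (1 - sum_h p_kh) Phi(x)."""
    return masses[jumps <= x].sum() + cont_w * norm.cdf(x)

def F_left(x):
    """Left limit F_k(x-), needed wherever x is a jump point."""
    return masses[jumps < x].sum() + cont_w * norm.cdf(x)

# At a jump point, the interval [F(x-), F(x)] has positive length p_kh;
# these intervals are exactly the integration ranges in the density dF^D.
print(F_left(0.2), F(0.2))   # the two values differ by 0.3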
Using the same techniques as in Chapter 3, Theorem 3.1.1 can be extended to the $D$-dimensional case. Furthermore, the limiting variance-covariance matrix of $n^{1/2}(\hat\theta_n - \theta)$ is then $B^{-1}\Sigma B^{-1}$, where $B$ is the information matrix associated with $C_\theta$. To give the explicit form of $\Sigma$, we need some notation. Let
$$F^p_k(x_k) = \begin{cases} F_k(x_k) \equiv u_{k2}, & \text{if } x_k \in \mathcal{C}(F_k), \\ (F_k(x_k^-), F_k(x_k)) \equiv (u_{k1}, u_{k2}), & \text{if } x_k \in \mathcal{J}(F_k). \end{cases}$$
Note that in the present case, the relevant statistic in (5.3) can be written as, for $1 \le i \le q$,
$$R_{i,n} = \frac{1}{n}\sum_{j=1}^n J_i\bigl(F_1(X_{1j}^-), F_1(X_{1j}), \ldots, F_D(X_{Dj}^-), F_D(X_{Dj})\bigr) = \frac{1}{n}\sum_{j=1}^n \sum_Q J^Q_i\bigl(F^p_1(X_{1j}), \ldots, F^p_D(X_{Dj})\bigr),$$
where $Q = \{Q_1, \ldots, Q_D\}$ is a $D$-sequence consisting of the letters 'c' and 'd', with the following properties: (i) when $Q$ is combined with a point $\mathbf{x}$ and the marginals $F_1, \ldots, F_D$, say $\mathbf{x} \in Q(F_1, \ldots, F_D)$, a 'c' at the $k$-th component of $Q$ refers to $x_k$ belonging to $\mathcal{C}(F_k)$, and a 'd' at the $k$-th component refers to $x_k$ belonging to $\mathcal{J}(F_k)$, for $k = 1, \ldots, D$; (ii) when $Q$ is combined with $J_i$ or a derivative of $J_i$ with respect to $u_{k1}$ or $u_{k2}$, a 'c' at the $k$-th component refers to $F^p_k(X_{kj}) = F_k(X_{kj})$, and a 'd' at the $k$-th component refers to $F^p_k(X_{kj}) = (F_k(X_{kj}^-), F_k(X_{kj}))$, for $k = 1, \ldots, D$.

Using the same techniques as those used in Chapter 3, one can see that $\Sigma$ is the variance-covariance matrix of the $q$-dimensional random vector whose $i$-th component is $\sum_Q M^Q_i(X_1, \ldots, X_D)$. The following is a general form of $M^Q_i(X_1, \ldots, X_D)$. Assume $Q$ corresponds to the index partition $\{c_1, c_2, \ldots, c_H\} \cup \{d_1, d_2, \ldots, d_L\}$, where $L$ and $H$ can take any values in $\{0, \ldots, D\}$ such that $H + L = D$. Then
$$M^Q_i(x_1, \ldots, x_D) = \sum_{i_{d_1}}\cdots\sum_{i_{d_L}} \Biggl[ J^Q_i\bigl(F^p_1(x_1), \ldots, F^p_D(x_D)\bigr)\prod_{h=1}^H I_{\{x_{c_h} \in \mathcal{C}(F_{c_h})\}}\prod_{l=1}^L I_{\{x_{d_l} = D_{d_l i_{d_l}}\}} + \sum_{k=1}^D L_{ik} \Biggr],$$
where, in each summand, $x_{d_l} = D_{d_l i_{d_l}}$ and the set of jump points of $F_{d_l}$ is $\{D_{d_l 1}, \ldots, D_{d_l d_{d_l}}\}$. The term $L_{ik}$ is given as follows:

(i) if the $k$-th component of $Q$ is 'c',
$$L_{ik} = \int_{\mathbb{R}\setminus\{\cup_m D_{c_1 m}\}}\cdots\int_{\mathbb{R}\setminus\{\cup_m D_{c_H m}\}} I(x_k \le x'_k)\, J^Q_{i,u_{k2}}\bigl(F^p_1(x'_1), \ldots, F^p_D(x'_D)\bigr)\,dF(x'_1, \ldots, x'_D),$$
with $x'_{d_l} = D_{d_l i_{d_l}}$;

(ii) if the $k$-th component of $Q$ is 'd',
$$L_{ik} = \int_{\mathbb{R}\setminus\{\cup_m D_{c_1 m}\}}\cdots\int_{\mathbb{R}\setminus\{\cup_m D_{c_H m}\}} I(x_k < x'_k)\, J^Q_{i,u_{k1}}\bigl(F^p_1(x'_1), \ldots, F^p_D(x'_D)\bigr)\,dF(x'_1, \ldots, x'_D)$$
$$\qquad + \int_{\mathbb{R}\setminus\{\cup_m D_{c_1 m}\}}\cdots\int_{\mathbb{R}\setminus\{\cup_m D_{c_H m}\}} I(x_k \le x'_k)\, J^Q_{i,u_{k2}}\bigl(F^p_1(x'_1), \ldots, F^p_D(x'_D)\bigr)\,dF(x'_1, \ldots, x'_D),$$
with $x'_{d_l} = D_{d_l i_{d_l}}$.

Paralleling the case $q = 1$ and $D = 2$ described earlier in Chapters 2 and 3, an estimator of the variance-covariance matrix of $\hat\theta_n$ can be obtained by repeating the procedure described in Chapter 4.

CHAPTER 6

JOINT DISTRIBUTIONS USING t-COPULA

6.1 Joint Distributions Using t-copula

In this chapter, we develop joint distributions based on the family of t-copulas; see [34] and [8]. We choose t-copulas not only because they generalize the Gaussian copulas, but also because they possess tail dependence. This property allows us to study the limiting association between random variables X and Y as both x and y approach their boundaries. Our finding is that the limiting association is governed by the correlation coefficient $\rho$ together with the degrees of freedom $\nu$, as given in Lemma 6.1.3. In practice, there are situations where random variables remain associated to some degree even in the tails.
For instance, in biometric recognition, the genuine and impostor distributions generated from the same biometric trait, or from different biometric traits in a multimodal biometric system, tend to have similar (though not identical) tail behavior, as shown in Chapter 7, Section 7.2.2. The t-copula models fit such cases more appropriately than a copula family without tail dependence.

Before defining the t-copula, we develop some notation. Let $\Sigma$ denote a positive definite matrix of dimension $D \times D$. For such a $\Sigma$, the $D$-dimensional t density with $\nu$ degrees of freedom will be denoted by
$$f^D_{\nu,\Sigma}(\mathbf{x}) \equiv \frac{\Gamma\bigl(\frac{\nu+D}{2}\bigr)}{(\pi\nu)^{D/2}\,\Gamma\bigl(\frac{\nu}{2}\bigr)\,|\Sigma|^{1/2}}\left(1 + \frac{\mathbf{x}^T\Sigma^{-1}\mathbf{x}}{\nu}\right)^{-\frac{\nu+D}{2}},$$
with corresponding cumulative distribution function
$$t^D_{\nu,\Sigma}(\mathbf{x}) \equiv \int_{\mathbf{w} \le \mathbf{x}} f^D_{\nu,\Sigma}(\mathbf{w})\,d\mathbf{w}.$$
A matrix $\Sigma$ with unit diagonal entries corresponds to a correlation matrix and will be denoted by $R$. The $D$-dimensional t-copula function is given by
$$C_{\nu,R}(\mathbf{u}) \equiv \int_{\mathbf{w} \le t_\nu^{-1}(\mathbf{u})} f^D_{\nu,R}(\mathbf{w})\,d\mathbf{w}, \qquad (6.1)$$
where $t_\nu^{-1}(\mathbf{u}) \equiv (t_\nu^{-1}(u_1), \ldots, t_\nu^{-1}(u_D))^T$, $t_\nu^{-1}$ is the inverse of the cumulative distribution function of the univariate t with $\nu$ degrees of freedom, and $R$ is a $D \times D$ correlation matrix. Note that $C_{\nu,R}(\mathbf{u}) = P\{\mathbf{X} \le t_\nu^{-1}(\mathbf{u})\}$ for $\mathbf{X} = (X_1, \ldots, X_D)^T$ distributed as $t^D_{\nu,R}$, demonstrating that $C_{\nu,R}(\mathbf{u})$ is a distribution function on $(0,1)^D$ with uniform marginals. The density corresponding to $C_{\nu,R}(\mathbf{u})$ is given by
$$c_{\nu,R}(\mathbf{u}) = \frac{\partial^D C_{\nu,R}(\mathbf{u})}{\partial u_1\,\partial u_2\cdots\partial u_D} = \frac{f^D_{\nu,R}\{t_\nu^{-1}(\mathbf{u})\}}{\prod_{k=1}^D f_\nu\{t_\nu^{-1}(u_k)\}}, \qquad (6.2)$$
where $f_\nu$ in (6.2) is the density of the univariate t distribution with $\nu$ degrees of freedom. We consider joint distributions of the form
$$F^D_{\nu,R}(\mathbf{x}) = C_{\nu,R}\{F_1(x_1), \ldots, F_D(x_D)\} \qquad (6.3)$$
with $C_{\nu,R}$ defined as in (6.1). It follows from the properties of a copula function that $F^D_{\nu,R}$ is a valid multivariate distribution function on $\mathbb{R}^D$. The identifiability of the marginal distributions $F_k$, $k = 1, \ldots, D$, the correlation matrix $R$, and the degrees of freedom parameter $\nu$ is established in the following theorem.

Theorem 6.1.1. Let $F^D_{\nu_1,R_1}$ and $G^D_{\nu_2,R_2}$ denote two distribution functions on $\mathbb{R}^D$ obtained from equation (6.3) with marginal distributions $F_k$, $k = 1, \ldots, D$, and $G_k$, $k = 1, \ldots, D$, respectively. Suppose we have $F^D_{\nu_1,R_1}(\mathbf{x}) = G^D_{\nu_2,R_2}(\mathbf{x})$ for all $\mathbf{x}$. Then $F_k(x) = G_k(x)$ for all $k$, $\nu_1 = \nu_2$, and $R_1 = R_2$.

In order to prove the identifiability of $(\nu, R)$ in Theorem 6.1.1, we first state and prove several lemmas.

Lemma 6.1.1. Fix $\nu$. Let $F^D_{\nu,R_1}$ and $G^D_{\nu,R_2}$ denote two cumulative distribution functions on $\mathbb{R}^D$ obtained from equation (6.3) with marginal distributions $F_k$, $k = 1, \ldots, D$, and $G_k$, $k = 1, \ldots, D$, respectively. Suppose we have
$$F^D_{\nu,R_1}(\mathbf{x}) = G^D_{\nu,R_2}(\mathbf{x}) \qquad (6.4)$$
for all $\mathbf{x}$. Then $F_k(x) = G_k(x)$ for all $k$ and $R_1 = R_2$.

Proof: By taking $x_i$, $i \ne k$, tending to $\infty$, we get that $F^D_{\nu,R_1}(\mathbf{x}) \to F_k(x_k)$ and $G^D_{\nu,R_2}(\mathbf{x}) \to G_k(x_k)$. It follows that $F_k(x) = G_k(x)$ for all $k$. Next we show that the correlation matrices $R_1$ and $R_2$ are equal. We first prove this result for $D = 2$. When $D = 2$, $R_1$ and $R_2$ are each determined by a single correlation parameter, namely $\rho_1$ and $\rho_2$, respectively, so we have
$$F^2_{\nu,\rho_1}(\mathbf{x}) = C_{\nu,\rho_1}\{F_1(x_1), F_2(x_2)\} \equiv \int_{\mathbf{w} \le t_\nu^{-1}(\mathbf{v})} f^2_{\nu,\rho_1}(\mathbf{w})\,d\mathbf{w} \qquad (6.5)$$
and
$$G^2_{\nu,\rho_2}(\mathbf{x}) = C_{\nu,\rho_2}\{F_1(x_1), F_2(x_2)\} \equiv \int_{\mathbf{w} \le t_\nu^{-1}(\mathbf{v})} f^2_{\nu,\rho_2}(\mathbf{w})\,d\mathbf{w}, \qquad (6.6)$$
where $\mathbf{v} = (F_1(x_1), F_2(x_2))^T$; so one only needs to prove $\rho_1 = \rho_2$.
Now, for any real numbers $a$ and $b$, the bivariate t cumulative distribution function $t^2_{\nu,\rho}(a,b)$, with $\nu$ degrees of freedom and correlation parameter $\rho$, is a strictly increasing function of $\rho$. Note that
$$t^2_{\nu,\rho}(a,b) = \int_{-\infty}^a\int_{-\infty}^b f^2_{\nu,\rho}(m,n)\,dm\,dn = \int_{-\infty}^a\int_{-\infty}^b\int_0^\infty \phi(m,n,\rho,w)\,g_\nu(w)\,dw\,dm\,dn,$$
where
$$\phi(m,n,\rho,w) = \frac{1}{2\pi w(1-\rho^2)^{1/2}}\exp\left\{-\frac{m^2 - 2\rho mn + n^2}{2w(1-\rho^2)}\right\}$$
and $g_\nu(w)$ is the probability density function of the inverse Gamma mixing distribution defined by
$$g_\nu(w^2) \sim IG\left(\frac{\nu}{2}, \frac{\nu}{2}\right). \qquad (6.7)$$
Differentiating $t^2_{\nu,\rho}(a,b)$ with respect to $\rho$, we get
$$\frac{\partial t^2_{\nu,\rho}(a,b)}{\partial\rho} = \int_{-\infty}^a\int_{-\infty}^b\int_0^\infty \frac{\partial\phi(m,n,\rho,w)}{\partial\rho}\,g_\nu(w)\,dw\,dm\,dn = \int_{-\infty}^a\int_{-\infty}^b\int_0^\infty w\,\frac{\partial^2\phi(m,n,\rho,w)}{\partial m\,\partial n}\,g_\nu(w)\,dw\,dm\,dn = \int_0^\infty w\,\phi(a,b,\rho,w)\,g_\nu(w)\,dw > 0,$$
using the fact that $\partial\phi(m,n,\rho,w)/\partial\rho = w\,\partial^2\phi(m,n,\rho,w)/\partial m\,\partial n$.

Note that (6.5) and (6.6) can be rewritten as $t^2_{\nu,\rho_j}(a,b)$ for $j = 1, 2$, with $a = t_\nu^{-1}(F_1(x_1))$ and $b = t_\nu^{-1}(F_2(x_2))$. So $F^2_{\nu,\rho_1}(\mathbf{x}) = F^2_{\nu,\rho_2}(\mathbf{x})$ gives
$$t^2_{\nu,\rho_1}(a,b) = t^2_{\nu,\rho_2}(a,b).$$
Since, for fixed $\nu$ and any $a$ and $b$, $t^2_{\nu,\rho}(a,b)$ is a strictly increasing function of $\rho$, the equality above implies that $\rho_1 = \rho_2$ must hold. This proves the case $D = 2$.

Now we prove Lemma 6.1.1 for general $D$. Let $R_1 = ((\rho_{kk',1}))$ and $R_2 = ((\rho_{kk',2}))$ be the representations of the correlation matrices $R_1$ and $R_2$ in terms of their entries. Note that condition (6.4) can be reduced to the case $D = 2$ by taking $x_i$, $i \ne k, k'$, tending to infinity. Thus, we have $F^2_{\nu,\rho_{kk',1}}(x_k, x_{k'}) = G^2_{\nu,\rho_{kk',2}}(x_k, x_{k'})$, which implies $\rho_{kk',1} = \rho_{kk',2}$ from the case $D = 2$.

Next, we state and prove two more relevant lemmas.

Lemma 6.1.2. For $a > 0$, $t_\nu\{-(a\nu)^{1/2}\}$ is a decreasing function of $\nu$.

Proof: Let $Y \sim t_\nu(\cdot)$. Then $Z = Y/\nu^{1/2}$ has pdf $f_\nu(z) = [\Gamma\{(\nu+1)/2\}/\Gamma(\nu/2)]\,\pi^{-1/2}\,(1+z^2)^{-(\nu+1)/2}$. One can see that if $\nu_1 \le \nu_2$, then $f_{\nu_2}(|z|)/f_{\nu_1}(|z|)$ is a strictly decreasing function of $|z|$. Since $I_{\{|z| \ge a^{1/2}\}}$ is a nondecreasing function of $|z|$, mimicking the proof of Lemma 2 in [28], one can prove that $E_\nu(I_{\{|Z| \ge a^{1/2}\}})$ is a decreasing function of $\nu$. It follows that
$$t_\nu\{-(a\nu)^{1/2}\} = P\{Y \le -(a\nu)^{1/2}\} = P\{Z \le -a^{1/2}\} = \tfrac{1}{2}\,E_\nu\bigl(I_{\{|Z| \ge a^{1/2}\}}\bigr)$$
is decreasing in $\nu$.

Lemma 6.1.3. For the bivariate t-copula $C_{\nu,\rho}(u,v)$, let
$$\lambda^* = \lim_{v\to 0^+}\left.\frac{\partial C_{\nu,\rho}(u,v)}{\partial v}\right|_{u=v} \qquad\text{and}\qquad \mu^* = \lim_{v\to 0^+}\left.\frac{\partial C_{\nu,\rho}(u,v)}{\partial v}\right|_{u=1-v}.$$
Then it follows that
$$\lambda^* = t_{\nu+1}\left\{-\left((\nu+1)\,\frac{1-\rho}{1+\rho}\right)^{1/2}\right\} \qquad\text{and}\qquad \mu^* = t_{\nu+1}\left\{-\left((\nu+1)\,\frac{1+\rho}{1-\rho}\right)^{1/2}\right\}.$$

Proof: It is not hard to see that if $(X, Y)^T \sim t^2_{\nu,\rho}$, then given $Y = y$,
$$\left(\frac{\nu+1}{\nu+y^2}\right)^{1/2}\frac{X - \rho y}{(1-\rho^2)^{1/2}} \sim t_{\nu+1}. \qquad (6.8)$$
Therefore,
$$C_{\nu,\rho}(u,v) = P\bigl(X \le t_\nu^{-1}(u),\, Y \le t_\nu^{-1}(v)\bigr) = \int_{-\infty}^{t_\nu^{-1}(v)} t_{\nu+1}\left\{\left(\frac{\nu+1}{\nu+y^2}\right)^{1/2}\frac{t_\nu^{-1}(u) - \rho y}{(1-\rho^2)^{1/2}}\right\} f_\nu(y)\,dy.$$
Taking the derivative with respect to $v$, we get
$$\frac{\partial C_{\nu,\rho}(u,v)}{\partial v} = t_{\nu+1}\left\{\left(\frac{\nu+1}{\nu + t_\nu^{-1}(v)^2}\right)^{1/2}\frac{t_\nu^{-1}(u) - \rho\, t_\nu^{-1}(v)}{(1-\rho^2)^{1/2}}\right\} = t_{\nu+1}\left\{\left(\frac{\nu+1}{\nu + t_\nu^{-1}(v)^2}\right)^{1/2}\frac{t_\nu^{-1}(v)\,(1-\rho)}{(1-\rho^2)^{1/2}}\right\} \quad\text{(putting } u = v\text{)}$$
$$\to t_{\nu+1}\left\{-\left((\nu+1)\,\frac{1-\rho}{1+\rho}\right)^{1/2}\right\}$$
as $v \to 0^+$. The other expression is derived similarly.

Remark: Lemma 6.1.3 describes the tail dependence property of t-copulas.

Proof of Theorem 6.1.1. Using similar arguments as before, we only need to prove Theorem 6.1.1 for $D = 2$. It easily follows, by taking $x_i \to \infty$ for $i \ne k$, that $F_k(x_k) = G_k(x_k)$ for $k = 1, 2$. It then follows that
$$C_{\nu_1,\rho_1}(u,v) = C_{\nu_2,\rho_2}(u,v) \qquad (6.9)$$
for all pairs $(u,v)$ of the form $(F_1(x_1), F_2(x_2))$. Thus, (6.9) need not hold for all values of $(u,v)$ in $(0,1)^2$, as it would in the case when both marginals are continuous.
Nevertheless, since both marginals have only a finite number of discontinuities, there are infinitely many values of $(u,v)$ at which the limits and derivatives in Lemma 6.1.3 can be applied to obtain $\lambda^*$ and $\mu^*$. Since (6.9) holds, we must have $\lambda^*_1 = \lambda^*_2$ and $\mu^*_1 = \mu^*_2$. Now, without loss of generality, assume $\rho_1 \ge \rho_2$. From the equality $\lambda^*_1 = \lambda^*_2$ and Lemma 6.1.2, we must have $\nu_1 \ge \nu_2$; on the other hand, the equality $\mu^*_1 = \mu^*_2$ gives $\nu_1 \le \nu_2$. Hence, $\rho_1 \ge \rho_2$ implies that $\nu_1 = \nu_2 = \nu$, say. Now, using Lemma 6.1.1, we get $\rho_1 = \rho_2$. This completes the proof of the theorem.

Remark: Note that as the degrees of freedom $\nu \to \infty$, the t distribution converges to a Gaussian distribution, and so does the t-copula. For the Gaussian copulas, the counterpart of Theorem 6.1.1 also holds. Therefore, we have the following lemma.

Lemma 6.1.4. Let $F^D_{R_1}$ and $G^D_{R_2}$ denote two distribution functions on $\mathbb{R}^D$ in the Gaussian copula family with marginal distributions $F_k$, $k = 1, \ldots, D$, and $G_k$, $k = 1, \ldots, D$, respectively. Suppose we have $F^D_{R_1}(\mathbf{x}) = G^D_{R_2}(\mathbf{x})$ for all $\mathbf{x}$. Then $F_k(x) = G_k(x)$ for all $k$, and $R_1 = R_2$.

Proof: Without loss of generality, we prove the lemma for the case $D = 2$; that is, we show that $\rho_1 = \rho_2$. Notice that it suffices to prove that for any fixed $(a,b) \in \mathbb{R}^2$, $P_\rho(Z_1 \le a, Z_2 \le b) = \Phi_{2,\rho}(a,b)$ is increasing in $\rho$, where $Z_i \sim N(0,1)$, $i = 1, 2$, with correlation coefficient $\rho$. Note that
$$\Phi_{2,\rho}(a,b) = \int_{-\infty}^a\int_{-\infty}^b \phi_\rho(z_1, z_2)\,dz_1\,dz_2,$$
which gives us
$$\frac{\partial\Phi_{2,\rho}}{\partial\rho}(a,b) = \int_{-\infty}^a\int_{-\infty}^b \frac{\partial\phi_\rho(z_1,z_2)}{\partial\rho}\,dz_1\,dz_2 = \int_{-\infty}^a\int_{-\infty}^b \left[\frac{(z_1\rho - z_2)(z_2\rho - z_1)}{(1-\rho^2)^2} + \frac{\rho}{1-\rho^2}\right]\frac{1}{2\pi(1-\rho^2)^{1/2}}\exp\left\{-\frac{z_1^2 - 2\rho z_1 z_2 + z_2^2}{2(1-\rho^2)}\right\}dz_1\,dz_2$$
$$= \int_{-\infty}^a\int_{-\infty}^b \frac{\partial^2\phi_\rho(z_1,z_2)}{\partial z_1\,\partial z_2}\,dz_1\,dz_2 = \phi_\rho(a,b),$$
which is always positive. Therefore, we have proved the uniqueness of $\rho$, and the proof of the lemma is complete.

Remark: It is not hard to see that for the Gaussian copulas, tail dependence between the random variables no longer exists.

Using the notation defined in Chapter 5, the density of $F^D_{\nu,R}$ at a fixed point $\mathbf{x} \in \mathbb{R}^D$, $dF^D_{\nu,R}(\mathbf{x})$, is a function on $\mathbb{R}^D$ that satisfies
$$F^D_{\nu,R}(\mathbf{x}) = \int_{\mathbf{w} \le \mathbf{x}} dF^D_{\nu,R}(\mathbf{w}), \qquad (6.10)$$
where $dF^D_{\nu,R}(\mathbf{x})$ is given by
$$\int\cdots\int_{\prod_{l=1}^L [F_{d_l}(D^-_{d_l,i_{d_l}}),\, F_{d_l}(D_{d_l,i_{d_l}})]} c_{\nu,R}(u_1, \ldots, u_D)\,du_{d_1}\cdots du_{d_L} \cdot \prod_{h=1}^H f_{c_h}(x_{c_h}),$$
where $u_{c_h} = F_{c_h}(x_{c_h})$, and $H$ and $L$ can take any values in $\{0, 1, \ldots, D\}$ such that $H + L = D$. Note that the above density can be written generally as
$$c^*(\nu, R, F_1, \ldots, F_D)(\mathbf{x}) \cdot \prod_{h=1}^H f_{c_h}(x_{c_h}). \qquad (6.11)$$
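As a concrete check on the copula density (6.2) and the tail quantity of Lemma 6.1.3, the following Python sketch evaluates both for the bivariate case; it is an illustration only (assuming SciPy 1.6 or later for multivariate_t), not part of the estimation procedure:

import numpy as np
from scipy.stats import t as t_dist, multivariate_t

def t_copula_density(u, rho, nu):
    """Bivariate t-copula density (6.2):
    f^2_{nu,R}(t_nu^{-1}(u)) / prod_k f_nu(t_nu^{-1}(u_k))."""
    z = t_dist.ppf(u, df=nu)                       # componentwise t_nu^{-1}(u_k)
    R = np.array([[1.0, rho], [rho, 1.0]])
    num = multivariate_t(loc=[0.0, 0.0], shape=R, df=nu).pdf(z)
    den = np.prod(t_dist.pdf(z, df=nu))
    return num / den

def lambda_star(rho, nu):
    """lambda* from Lemma 6.1.3: t_{nu+1}{ -sqrt((nu+1)(1-rho)/(1+rho)) }."""
    return t_dist.cdf(-np.sqrt((nu + 1) * (1 - rho) / (1 + rho)), df=nu + 1)

print(t_copula_density(np.array([0.3, 0.7]), rho=0.75, nu=10))
print(lambda_star(0.75, 10))     # strictly positive for finite nu
print(lambda_star(0.75, 1e6))    # essentially 0 in the Gaussian limit

The last two lines illustrate the remark above: tail dependence is present for every finite $\nu$ but vanishes in the Gaussian ($\nu \to \infty$) limit.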
6.2 Estimation of R and ν

6.2.1 Estimation of R for fixed ν

Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be $n$ independent and identically distributed $D$-dimensional random vectors arising from the joint distribution $F^D_{\nu,R}$ in (6.3), let $\hat F_{kn}$ denote the empirical distribution function of $F_k$, and let $F_{kn} = n\hat F_{kn}/(n+1)$. From (6.11), the log-likelihood function corresponding to the $n$ i.i.d. observations $\mathbf{X}_j$, $j = 1, \ldots, n$, is given by
$$\hat\tau(\nu, R) = \sum_{j=1}^n \log dF^D_{\nu,R}(\mathbf{X}_j) = \sum_{j=1}^n \log c^*(\nu, R, F_{1n}, \ldots, F_{Dn})(\mathbf{X}_j). \qquad (6.12)$$
For fixed $\nu$, our estimator of $R$ is taken to be the maximizer of $\hat\tau(\nu, R)$, that is,
$$\hat R(\nu) = \arg\max_R \hat\tau(\nu, R). \qquad (6.13)$$
Note that there are two main challenges in maximizing the likelihood $\hat\tau(\nu, R)$. First, $\hat\tau(\nu, R)$ involves several integrals corresponding to the discrete components in $\mathbf{X}_j$, $j = 1, \ldots, n$; $c^*$ is not available in closed form. Second, $\hat\tau(\nu, R)$ needs to be maximized over all $D \times D$ correlation matrices. The space of all correlation matrices is a compact subspace of $[-1,1]^{D^2}$, since all entries of a correlation matrix lie between $-1$ and $1$; however, this space is not easy to maximize over, due to the constraint of positive definiteness placed on correlation matrices.

The EM Algorithm

The difficulties mentioned can be overcome with the use of the EM algorithm; see, for example, [31]. The EM algorithm is a well-known algorithm for finding the maximum likelihood estimator (MLE) of a parameter $\theta$ based on data $y$ distributed according to the likelihood $\ell_{obs}(y;\theta)$. In many situations, obtaining the MLE,
$$\hat\theta_{MLE} = \arg\max_\theta \ell_{obs}(y;\theta),$$
via direct maximization of $\ell_{obs}(y;\theta)$ over $\theta$ turns out to be difficult. In such cases, the observed likelihood can usually be expressed as an integral over missing components of a complete likelihood; in other words, if $z$ and $\ell_{com}(y, z;\theta)$, respectively, denote the missing observations and the complete likelihood corresponding to $(y, z)$, it follows that
$$\ell_{obs}(y;\theta) = \int_{Z(y)} \ell_{com}(y, z;\theta)\,dz,$$
where $Z(y)$ is the range of integration of $z$ subject to the observed data being $y$. The EM algorithm is an iterative procedure formulated in two steps. First, (i) the E-step, in which the quantity
$$Q(\theta^{(N)}, \theta) = E_{\pi(z\,|\,\theta^{(N)},\,y)}\bigl\{\log \ell_{com}(y, z;\theta)\bigr\} \qquad (6.14)$$
is formed; the expectation in (6.14) is taken with respect to the conditional distribution of $z$ given $y$, evaluated at a particular value $\theta^{(N)}$, where this conditional distribution is given by
$$\pi(z\,|\,\theta, y) = \frac{\ell_{com}(y, z;\theta)}{\ell_{obs}(y;\theta)}. \qquad (6.15)$$
Second, (ii) the M-step, in which $Q(\theta^{(N)}, \theta)$ is maximized with respect to $\theta$ to obtain $\theta^{(N+1)}$, that is,
$$\theta^{(N+1)} = \arg\max_\theta Q(\theta^{(N)}, \theta).$$
Starting from an initial value $\theta^{(0)}$, the sequence $\theta^{(N)}$, $N \ge 0$, converges to $\hat\theta_{MLE}$ under suitable regularity conditions.

The integrals in $\hat\tau(\nu, R)$ corresponding to the discrete components can be formulated as missing components of a complete likelihood. Recall that the notations $c_h$ and $d_l$ were used to denote the continuous and discrete components of the vector $\mathbf{x}$. We now extend this notation to represent the discrete and continuous components of the $j$-th observation vector $\mathbf{X}_j$, $j = 1, \ldots, n$. For $j = 1, \ldots, n$, let $d_{lj}$, $l = 1, \ldots, L_j$, and $c_{hj}$, $h = 1, \ldots, H_j$, respectively, denote the discrete and continuous components of $\mathbf{X}_j$ with respect to the corresponding marginal distributions. Next, define the vector $\mathbf{u}_j$ in the following way: the $d_{lj}$-th component of $\mathbf{u}_j$, $u_{d_{lj}}$, is a number allowed to vary in the interval $[F_{d_{lj}}(x_{d_{lj}}^-),\, F_{d_{lj}}(x_{d_{lj}})] \equiv S_{lj}$, say, for $l = 1, \ldots, L_j$; the $c_{hj}$-th component of $\mathbf{u}_j$, $u_{c_{hj}}$, is taken to be $u_{c_{hj}} = F_{c_{hj}}(x_{c_{hj}})$ for $h = 1, \ldots, H_j$. In that case, we have
$$c^*(\nu, R, F_1, \ldots, F_D)(\mathbf{X}_j) = \int_{S_{1j}}\cdots\int_{S_{L_j j}} c^D_{\nu,R}(\mathbf{u}_j)\,du_{d_{1j}}\cdots du_{d_{L_j j}}.$$
Making the transformation $\mathbf{z}_j = t_\nu^{-1}(\mathbf{u}_j)$, the likelihood corresponding to the $j$-th observation in (6.12) can be written as
$$c^*(\nu, R, F_1, \ldots, F_D)(\mathbf{X}_j) = NUM_j / DENOM_j,$$
where
$$NUM_j = \int_{t_\nu^{-1}(S_{1j})}\cdots\int_{t_\nu^{-1}(S_{L_j j})} f^D_{\nu,R}(\mathbf{z}_j)\,dz_{d_{1j}}\cdots dz_{d_{L_j j}} \qquad\text{and}\qquad DENOM_j = \prod_{h=1}^{H_j} f_\nu(z_{c_{hj}}),$$
where $z_{c_{hj}} = t_\nu^{-1}\{F_{c_{hj}}(x_{c_{hj}})\}$ and $t_\nu^{-1}(S_{lj}) = [\,t_\nu^{-1}\{F_{d_{lj}}(x_{d_{lj}}^-)\},\; t_\nu^{-1}\{F_{d_{lj}}(x_{d_{lj}})\}\,]$. We make several important observations. First, where the maximization over $R$ is concerned for fixed $\nu$, it is enough to consider the $NUM_j$ terms. Thus, in the EM framework, we define $NUM_j$ to be the "observed likelihood" corresponding to $\mathbf{x}_j$:
$$\ell_{obs}(\mathbf{x}_j; \nu, R) = \int_{t_\nu^{-1}(S_{1j})}\cdots\int_{t_\nu^{-1}(S_{L_j j})} f^D_{\nu,R}(\mathbf{z}_j)\,dz_{d_{1j}}\cdots dz_{d_{L_j j}}. \qquad (6.16)$$
If the variables $z_{d_{lj}}$, $l = 1, \ldots, L_j$, in (6.16) are treated as missing, the "complete likelihood" corresponding to the $j$-th observation becomes
$$\ell_{com}(\mathbf{x}_j, \mathbf{Z}_j; \nu, R) = f^D_{\nu,R}(\mathbf{z}_j)\prod_{l=1}^{L_j} I_{\{z_{d_{lj}} \in t_\nu^{-1}(S_{lj})\}},$$
where $\mathbf{Z}_j = \{z_{d_{lj}},\, l = 1, \ldots, L_j\}$. Next, we note that the t density $f^D_{\nu,R}$ is an infinite scale mixture of Gaussian densities, namely,
$$f^D_{\nu,R}(\mathbf{z}_j) = \int_{\sigma_j^2 = 0}^\infty \phi^D_{R,\sigma_j}(\mathbf{z}_j)\,g_\nu(\sigma_j^2)\,d\sigma_j^2,$$
where
$$\phi^D_{R,\sigma_j}(\mathbf{z}) = \frac{1}{(2\pi)^{D/2}\,\sigma_j^D\,|R|^{1/2}}\exp\left\{-\frac{\mathbf{z}^T R^{-1}\mathbf{z}}{2\sigma_j^2}\right\}$$
and the mixing distribution on $\sigma_j^2$ is the inverse Gamma distribution defined in (6.7). In other words, an extra missing component can be added into the E-step, namely the mixing parameter $\sigma_j$. The complete likelihood specification for the $j$-th observation now becomes
$$\ell_{com}(\mathbf{x}_j, \mathbf{Z}_j, \sigma_j; \nu, R) = \phi^D_{R,\sigma_j}(\mathbf{z}_j)\cdot g_\nu(\sigma_j^2)\prod_{l=1}^{L_j} I_{\{z_{d_{lj}} \in t_\nu^{-1}(S_{lj})\}}. \qquad (6.17)$$
The conditional distribution $\pi$ required to obtain $Q$ in (6.14) is defined as
$$\pi(\mathbf{Z}_j, \sigma_j,\, j = 1, \ldots, n;\, \nu, R) = \prod_{j=1}^n \pi_j(\mathbf{Z}_j, \sigma_j; \nu, R),$$
where
$$\pi_j(\mathbf{Z}_j, \sigma_j; \nu, R) = \frac{\ell_{com}(\mathbf{x}_j, \mathbf{Z}_j, \sigma_j; \nu, R)}{\ell_{obs}(\mathbf{x}_j; \nu, R)}; \qquad (6.18)$$
see (6.15). Two distributions derived from (6.18) will be used subsequently: (i) the conditional distribution of $\sigma_j$ given $\mathbf{Z}_j$, given by
$$\pi_j(\sigma_j^2\,|\,\mathbf{Z}_j, \nu, R) = IG\left(\frac{\nu + D}{2},\; \left\{\frac{\nu + \mathbf{z}_j^T R^{-1}\mathbf{z}_j}{2}\right\}^{-1}\right),$$
and (ii) the marginal of $\mathbf{Z}_j$ (after integrating out $\sigma_j$), given by
$$\pi_j(\mathbf{Z}_j; \nu, R) = \frac{f^D_{\nu,R}(\mathbf{z}_j)\prod_{l=1}^{L_j} I_{\{z_{d_{lj}} \in t_\nu^{-1}(S_{lj})\}}}{\int_{t_\nu^{-1}(S_{1j})}\cdots\int_{t_\nu^{-1}(S_{L_j j})} f^D_{\nu,R}(\mathbf{z}_j)\,dz_{d_{1j}}\cdots dz_{d_{L_j j}}}.$$
The E-step entails taking the expected value of the logarithm of the complete likelihood (6.17) with respect to the missing components, $\mathbf{Z}_j$ and $\sigma_j$, at the $N$-th iterate of $R$, $R^{(N)}$:
$$E_{\pi(\mathbf{Z}_j, \sigma_j;\, \nu, R^{(N)})}\bigl\{\log \ell_{com}(\mathbf{x}_j, \mathbf{Z}_j, \sigma_j; \nu, R)\bigr\} \equiv E_j\bigl\{\log \ell_{com}(\mathbf{x}_j, \mathbf{Z}_j, \sigma_j; \nu, R)\bigr\},$$
where $\pi_j(\mathbf{Z}_j, \sigma_j; \nu, R)$ is as defined in (6.18). The expected value can be simplified to
$$E_j\left\{-\frac{\mathbf{z}_j^T R^{-1}\mathbf{z}_j}{2\sigma_j^2}\right\} - \frac{1}{2}\log|R| \qquad (6.19)$$
plus other terms that do not involve the correlation matrix $R$, and hence are irrelevant for the subsequent M-step. The expectation in (6.19) is taken in two steps, namely $E_j = E_{j1}E_{j2}$, where $E_{j2}$ is the conditional expectation of $\sigma_j$ given $\mathbf{Z}_j$, and $E_{j1}$ is the expectation with respect to the marginal of $\mathbf{Z}_j$. On taking $E_{j2}$, the expression in (6.19) simplifies to
$$-\frac{\nu+D}{2}\,E_{j1}\left\{\frac{\mathbf{z}_j^T R^{-1}\mathbf{z}_j}{\nu + \mathbf{z}_j^T (R^{(N)})^{-1}\mathbf{z}_j}\right\} = -\frac{\nu+D}{2}\,\mathrm{tr}\bigl(R^{-1}W^{(N)}\bigr),$$
where $W^{(N)} = ((w^{(N)}_{kk'}))$ is a $D \times D$ matrix with entries
$$w^{(N)}_{kk'} = \int \frac{z_{kj}\,z_{k'j}}{\nu + \mathbf{z}_j^T (R^{(N)})^{-1}\mathbf{z}_j}\;\pi_j(\mathbf{Z}_j; \nu, R^{(N)})\,d\mathbf{Z}_j$$
for $\mathbf{z}_j = (z_{1j}, \ldots, z_{Dj})^T$. The last integral is approximated by numerical integration on a grid: we partition each interval $t_\nu^{-1}(S_{lj})$, $l = 1, \ldots, L_j$, into a large number of subintervals, evaluate the integrand at the partition points, and obtain a Riemann sum as an approximation to the integral.

The M-step consists of maximizing the expected complete log-likelihood (6.19),
$$E_j\left\{-\frac{\mathbf{z}_j^T R^{-1}\mathbf{z}_j}{2\sigma_j^2}\right\} - \frac{1}{2}\log|R| = -\frac{\nu+D}{2}\,\mathrm{tr}\bigl(R^{-1}W^{(N)}\bigr) - \frac{1}{2}\log|R|, \qquad (6.20)$$
with respect to the correlation matrix $R$. Since we have to maximize the objective function over the space of all correlation matrices, this is a somewhat difficult task. We adopt the methodology presented by Barnard et al. [1], in which an iterative procedure maximizes (6.20) over one element of $R$, say $\rho$, at a time. In order to preserve the positive definiteness of $R$, one can show (see Barnard et al. [1] for details) that $\rho$ should lie in an interval $[\rho_l, \rho_u]$. The lower and upper limits of this interval derive from the fact that $R$ is positive definite if and only if all principal submatrices of $R$ are positive definite; this is equivalent to a non-negativity condition on the determinant of each principal submatrix, which translates into an interval of admissible values for $\rho$. A numerical sketch of this interval computation is given below.
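The following Python sketch is our own illustration of the idea in Barnard et al. [1], not their code: since $\det R(\rho)$ is quadratic in a single off-diagonal entry $\rho$, the admissible interval $[\rho_l, \rho_u]$ is bracketed by its two real roots, provided the remaining entries already form a valid correlation matrix.

import numpy as np

def rho_interval(R, i, j, eps=1e-9):
    """Admissible interval [rho_l, rho_u] for entry (i, j) of R.
    det(R(rho)) is a quadratic in rho; its two real roots bracket the
    region where R stays positive definite (assuming the other entries
    already form a valid correlation matrix)."""
    def det_at(r):
        Rr = R.copy()
        Rr[i, j] = Rr[j, i] = r
        return np.linalg.det(Rr)
    # Recover the quadratic a*rho^2 + b*rho + c from three evaluations.
    d0, dp, dm = det_at(0.0), det_at(1.0), det_at(-1.0)
    a = (dp + dm) / 2.0 - d0
    b = (dp - dm) / 2.0
    c = d0
    roots = np.sort(np.roots([a, b, c]).real)
    return roots[0] + eps, roots[1] - eps

R = np.array([[1.0, 0.2, 0.3],
              [0.2, 1.0, 0.2],
              [0.3, 0.2, 1.0]])
lo, hi = rho_interval(R, 0, 1)
print(lo, hi)   # the current value 0.2 lies strictly inside (rho_l, rho_u)

In the M-step, the objective (6.20) is then maximized over a grid of values of $\rho$ inside this interval.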
This procedure is repeated for the other elements of $R$ and cycled until convergence before moving to the $(N+1)$-st iteration of the EM algorithm.

6.2.2 Selection of the Degrees of Freedom, ν

The above procedure is carried out for a collection of degrees of freedom $\nu \in \mathcal{A}$, where $\mathcal{A}$ is finite. For each fixed $\nu$, we obtain the estimator $\hat R(\nu)$ of the correlation matrix based on the EM algorithm above. We then select the degrees of freedom $\hat\nu$ such that
$$\hat\nu = \arg\max_{\nu \in \mathcal{A}} \frac{1}{n}\,\hat\tau\{\nu, \hat R(\nu)\}. \qquad (6.21)$$

6.3 Summary

In this chapter, we introduced the t-copula family in $\mathbb{R}^D$ and showed that, within this family, for any joint distribution $F^D$ on $\mathbb{R}^D$ there exists a unique pair $(\nu, R)$ such that $F^D_{\nu,R}(\mathbf{x}) = C_{\nu,R}(F_1(x_1), \ldots, F_D(x_D))$. Furthermore, we developed a semi-parametric technique to estimate the unknown pair $(\nu, R)$ using the EM algorithm. The application of this approach is presented in the next chapter.

CHAPTER 7

SIMULATION AND REAL DATA RESULTS

7.1 Simulation Results

The results in this section are based on simulated data from four experiments. The first three concern the bivariate case, whereas the fourth concerns a trivariate distribution. The true distributions for the observations $\mathbf{X}_j$, $j = 1, \ldots, n$, in all experiments are of the type defined in (6.3).

In Experiment 1, $\nu_0 = \infty$ in (6.3), corresponding to the Gaussian copula function, and we choose $\rho = 0.75$. The two mixed marginals, $F_1$ and $F_2$, have the following cumulative distribution functions:
$$F_1(x_1) = 0.3\,I_{\{x_1 \ge 0.2\}} + 0.7\,\Phi(x_1) \qquad\text{and}\qquad F_2(x_2) = 0.2\,I_{\{x_2 \ge 0.1\}} + 0.8\,\Phi(x_2),$$
where $\Phi(\cdot)$ is the standard normal distribution function. Thus, we have $d_1 = d_2 = 1$, with $D_{11} = 0.2$ and $D_{21} = 0.1$ with probabilities $p_{11} = 0.3$ and $p_{21} = 0.2$, respectively. The sets of jump points are $\mathcal{J}(F_1) = \{0.2\}$ and $\mathcal{J}(F_2) = \{0.1\}$, with the continuous components of $F_1$ and $F_2$ corresponding to $f_1(x) = f_2(x) = \phi(x)$, the standard normal density. The sample size $n$ is taken from $n = 500$ to $n = 2500$ in increments of 500. In each trial, we simulate $n$ observations from $F^2_{\infty,0.75}(x_1, x_2)$ and estimate $\rho$ within the Gaussian copulas using the methodology presented in Section 6.2. Table 7.1 contains the simulation results from Experiment 1. For each sample size, the experiment was repeated 1,000 times. The average absolute bias and relative average absolute bias of the estimator $\hat\rho_n$ are reported, together with the empirical coverage of the approximate 95% confidence interval for $\rho$ based on the asymptotic normality of $\hat\rho_n$.

 Sample size, n                   500      1000     1500     2000     2500
 Average Absolute Bias            0.0168   0.0135   0.0116   0.0113   0.0087
 Relative Average Absolute Bias   2.2%     1.8%     1.5%     1.5%     1.2%
 Coverage                         92.4%    91.6%    92.3%    93.2%    94.7%

Table 7.1. Simulation results for Experiment 1 with ν0 = ∞ and ρ = 0.75. The average absolute bias and relative average absolute bias of the estimator ρ̂n are provided, together with the empirical coverage of the approximate 95% confidence interval for ρ based on the asymptotic normality of ρ̂n.

Experiment 2 consists of the following choices: $\nu_0 = \infty$ in (6.3), corresponding to the Gaussian copula function, and $\rho_0 = 0.75$.
The two mixed marginals, $F_1$ and $F_2$, have the following cumulative distribution functions:
$$F_1(x_1) = 0.25\,I_{\{x_1 \ge 0.2\}} + 0.6\,T_{10}(x_1) + 0.15\,I_{\{x_1 \ge 0.7\}}$$
and
$$F_2(x_2) = 0.2\,I_{\{x_2 \ge 0.3\}} + 0.5\,T_{10}(x_2) + 0.3\,I_{\{x_2 \ge 0.6\}},$$
where $T_{10}$ denotes the t distribution function with 10 degrees of freedom. Thus, we have $d_1 = d_2 = 2$, with $D_{11} = 0.2$, $D_{12} = 0.7$, $D_{21} = 0.3$, and $D_{22} = 0.6$, with probabilities $p_{11} = 0.25$, $p_{12} = 0.15$, $p_{21} = 0.2$, and $p_{22} = 0.3$, respectively. The sets of jump points are $\mathcal{J}(F_1) = \{0.2, 0.7\}$ and $\mathcal{J}(F_2) = \{0.3, 0.6\}$, with the continuous components of $F_1$ and $F_2$ corresponding to $f_1(x) = f_2(x) = t_{10}(x)$.

Figure 7.1. Density curves for the t distribution with degrees of freedom ν = 3, 5, 10, 15, 20, 25 and the normal distribution. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

For now, we select the set of degrees of freedom of the t-copula to be $\mathcal{A} = \{3, 5, 10, \infty\}$ for illustrative purposes. Note that we choose $\nu_0$ values in $\mathcal{A}$ so that the corresponding t distributions are significantly different from one another. Figure 7.1 gives the t densities for several $\nu_0$ values, including the values in $\mathcal{A}$. There are significant gaps between the t-density curves corresponding to $\nu_0 \in \{3, 5, 10\}$, but relatively smaller gaps for $\nu_0 \in \{15, 20, 25, \infty\}$. Table 7.2 provides the $L_1$ distances between some of the t distributions in Figure 7.1.

             ρ0 = 0.20                          ρ0 = 0.75
 ν0     3       5       10      ∞          3       5       10      ∞
 3      0       0.0874  0.1705  0.2690     0       0.0896  0.1735  0.2721
 5      0.0874  0       0.0837  0.1835     0.0896  0       0.0845  0.1844
 10     0.1705  0.0837  0       0.1005     0.1735  0.0845  0       0.1006
 ∞      0.2690  0.1835  0.1005  0          0.2721  0.1844  0.1006  0

Table 7.2. L1 distances for paired values of ν0 corresponding to two values of ρ0, 0.20 and 0.75.

The sample size $n$ is taken from $n = 500$ to $n = 2500$ in increments of 500. In each trial, we simulate $n$ observations from $F^2_{\infty,0.75}(x_1, x_2)$ and estimate $\rho_0$ and $\nu_0$ using the methodology presented in Section 6.2. The trial is repeated 50 times. The simulation results, including the percentage of times (out of 50) that the true value of $\nu_0$ is chosen, the mean of $\hat\rho(\hat\nu)$, and the MSE of $\hat\rho(\hat\nu)$, are presented in Table 7.3.

 Sample size, n   Percentage of times ν̂ = ν0   Mean ρ̂(ν̂)   MSE(ρ̂(ν̂))
 500               86%                           0.7361       0.6312 × 10^-3
 1000              94%                           0.7392       0.3869 × 10^-3
 1500             100%                           0.7413       0.3808 × 10^-3
 2000             100%                           0.7429       0.3035 × 10^-3
 2500             100%                           0.7450       0.2861 × 10^-3

Table 7.3. Simulation results for Experiment 2 with ν0 = ∞ and ρ0 = 0.75.

In Experiment 3, we took $\nu_0 = 10$. The two generalized marginal distributions are the same as in Experiment 2. The correlation parameter $\rho_0$ was selected to be 0.20 and 0.75, respectively. The results are presented in Table 7.4.

 ρ0 = 0.20
 Sample size, n                500      1000     1500     2000     2500
 Percentage of times ν̂ = ν0    84%      88%      92%      94%      100%
 Mean ρ̂(ν̂)                    0.1861   0.1877   0.1889   0.1923   0.1944
 MSE(ρ̂(ν̂)) × 10^-3            0.7532   0.6871   0.6095   0.4173   0.3664

 ρ0 = 0.75
 Sample size, n                500      1000     1500     2000     2500
 Percentage of times ν̂ = ν0    82%      88%      96%      98%      100%
 Mean ρ̂(ν̂)                    0.7398   0.7401   0.7429   0.7523   0.7481
 MSE(ρ̂(ν̂)) × 10^-3            0.7235   0.6789   0.6565   0.4546   0.3648

Table 7.4. Simulation results for Experiment 3 with ν0 = 10. The two correlation values considered are ρ0 = 0.2 and ρ0 = 0.75.

From the entries of Table 7.3 and Table 7.4, we see that the estimation procedure is more effective at selecting the true degrees of freedom when $\nu_0 = \infty$ than when $\nu_0 = 10$; the reason is that the distribution corresponding to $\nu_0 = \infty$ is farther away from all the other candidate distributions in $\mathcal{A}$. Also, the estimation procedure is relatively unaffected by the value of $\rho_0$, as illustrated by the "percentage of times $\hat\nu = \nu_0$" column in Table 7.4.
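For reference, the following sketch shows one way observations of this type can be simulated for the Experiment 1 setup (Gaussian copula with $\rho = 0.75$ and one jump per marginal); the helper names are ours, and the generalized inverse of the mixed CDF handles the jump interval explicitly:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def mixed_ppf(u, D, p):
    """Generalized inverse of F(x) = p*1{x >= D} + (1 - p)*Phi(x):
    copula uniforms falling in [F(D-), F(D)] map to the jump point D."""
    lo = (1 - p) * norm.cdf(D)           # F(D-)
    hi = lo + p                          # F(D)
    below = norm.ppf(np.clip(u / (1 - p), 1e-12, 1 - 1e-12))
    above = norm.ppf(np.clip((u - p) / (1 - p), 1e-12, 1 - 1e-12))
    return np.where(u < lo, below, np.where(u > hi, above, D))

def simulate(n, rho=0.75):
    """Draw n observations from the Experiment-1-type model (Gaussian copula)."""
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    z = rng.standard_normal((n, 2)) @ L.T
    u = norm.cdf(z)                      # copula uniforms with correlation rho
    x1 = mixed_ppf(u[:, 0], D=0.2, p=0.3)
    x2 = mixed_ppf(u[:, 1], D=0.1, p=0.2)
    return np.column_stack([x1, x2])

xy = simulate(2000)
print(np.mean(xy[:, 0] == 0.2), np.mean(xy[:, 1] == 0.1))  # approx. 0.3 and 0.2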
In Experiment 4, we took $D = 3$ and $\nu_0 = 10$. The first two marginal distributions are the same as before; the third marginal distribution is taken to be the t distribution with 10 degrees of freedom (thus having no points of discontinuity). We took the correlation matrix $R_0$ to be
$$R_0 = \begin{pmatrix} 1 & 0.2 & 0.3 \\ 0.2 & 1 & 0.2 \\ 0.3 & 0.2 & 1 \end{pmatrix}_{3\times 3} = \begin{pmatrix} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 & \rho_3 \\ \rho_2 & \rho_3 & 1 \end{pmatrix}_{3\times 3}, \text{ say.} \qquad (7.1)$$
For the different sample sizes, the experiment was repeated 20 times (instead of 50) to reduce computational time. The estimators $\hat\rho_i$ of $\rho_i$, $i = 1, 2, 3$, were obtained from the iterative procedure outlined in Section 6.2. Since the maximization step involves another loop within the M-step, the objective function was maximized over the $\rho$-intervals in steps of 0.01 to reduce computational time. (The iterative procedure within the M-step is not required when $D = 2$, which enabled us to maximize the objective function over a finer grid, in steps of 0.0001.) The results are given in Table 7.5; note that (i) the estimators converge and (ii) the MSE decreases as $n$ increases.

 Sample size, n                 500      1000     1500     2000     2500
 Percentage of getting true ν   85%      90%      95%      100%     100%
 Mean ρ̂1(ν̂)                    0.1990   0.1830   0.1795   0.1820   0.1955
 Mean ρ̂2(ν̂)                    0.2760   0.2805   0.2845   0.2880   0.2920
 Mean ρ̂3(ν̂)                    0.1885   0.2005   0.1970   0.1880   0.1930
 Total MSE × 10^-3              0.7143   0.6732   0.6714   0.6126   0.1335

Table 7.5. Simulation results for Experiment 4.

7.2 Application to Multimodal Fusion in Biometric Recognition

7.2.1 Introduction

Biometric recognition refers to the automatic identification of individuals based on their biological or behavioral characteristics [20]. In recent years, recognition of an individual based on his/her biometric traits has become an increasingly important method of testing "you are who you claim you are"; see, for example, [18] and [29]. Biometric recognition is more reliable than traditional approaches, such as password-based or token-based approaches, because biometric traits cannot be easily stolen or forgotten. Some examples of biometric traits include fingerprint, face, signature, voice, and hand geometry (see Figure 7.2).

Figure 7.2. Some examples of biometric traits: (a) fingerprint, (b) iris scan, (c) face scan, (d) signature, (e) voice, (f) hand geometry, (g) retina, and (h) ear.

A number of commercial recognition systems based on these traits have been deployed and are currently in use. Biometric technology has now become a viable alternative in traditional government applications (e.g., the US-VISIT program [48] and the proposed biometric passport, which is capable of storing the owner's biometric information in a chip inside the passport). With increasing applications involving human-computer interaction, there is a growing need for recognition techniques that are reliable and secure. Recognition of an individual can be viewed as a test of statistical hypotheses.
Based on the biometric input $Q$ and a claimed identity $I_c$, we would like to test
$$H_0: I_t = I_c \quad\text{vs.}\quad H_1: I_t \ne I_c, \qquad (7.2)$$
where $I_t$ is the true identity of the user. The testing in (7.2) is performed by a matcher, which computes a similarity measure $S(Q, T)$ based on $Q$ and the stored template $T$; large (respectively, small) values of $S$ indicate that $T$ and $Q$ are close to (far from) each other. (A matcher can also compute a distance measure between $Q$ and $T$, in which case similar $Q$ and $T$ produce distance values close to zero, and vice versa.) The distribution of $S(Q, T)$ is called genuine (respectively, impostor) when $I_t = I_c$ ($I_t \ne I_c$), that is, under $H_0$ ($H_1$). We denote the genuine (impostor) matching score distribution function by $F_{gen}$ ($F_{imp}$), and assume that $F_{gen}(x)$ and $F_{imp}(x)$ have densities $f_{gen}(x)$ and $f_{imp}(x)$, respectively. The Neyman-Pearson theorem states that the optimal ROC curve is the one corresponding to the likelihood ratio statistic $NP(x) = f_{gen}(x)/f_{imp}(x)$ [14]. The ROC curve corresponding to $NP(x)$ has the highest genuine accept rate (GAR) at every given value of the false accept rate (FAR) compared to any other statistic $U(x) \ne NP(x)$. However, both $f_{gen}(x)$ and $f_{imp}(x)$ are unknown and must be estimated from the observed matching scores. The ROC corresponding to an estimate of $NP(x)$ may turn out to be suboptimal, mainly due to large errors in the estimation of $f_{gen}(x)$ and $f_{imp}(x)$. Thus, for a set of genuine and impostor matching scores, it is important to be able to estimate $f_{gen}(x)$ and $f_{imp}(x)$ reliably and accurately.

The articles [14] and [38] assume that the distribution function $F$ has a continuous density with no discrete components. In reality, most matching algorithms apply thresholds at various stages of the matching process; when the required threshold conditions are not met, specific matching scores are output by the matcher. For example, some fingerprint matchers produce a score of zero if the number of extracted minutiae is below a threshold. This leads to discrete components in the matching score distributions that cannot be modelled accurately using a continuous density function (see Figures 7.3, 7.4, 7.5, and 7.6). Thus, discrete components need to be detected and modelled separately to avoid large errors in estimating $f_{gen}(x)$ and $f_{imp}(x)$.

Another issue is that biometric systems based on a single source of information suffer from limitations such as lack of uniqueness, non-universality, and noisy data [21], and hence may not be able to achieve the desired performance requirements of real-world applications.

Figure 7.3. Histograms of matching scores corresponding to genuine scores for Matcher 1. The continuous (respectively, generalized) density estimator is given by the dashed lines (solid lines).

Figure 7.4. Histograms of matching scores corresponding to genuine scores for Matcher 2. The continuous (respectively, generalized) density estimator is given by the dashed lines (solid lines).

Figure 7.5. Histograms of matching scores corresponding to impostor scores for Matcher 1. The continuous (respectively, generalized) density estimator is given by the dashed lines (solid lines). The spike corresponds to discrete components. Note how the generalized density estimator performs better compared to the continuous estimator (which assumes no discrete components).
Figure 7.6. Histograms of matching scores corresponding to impostor scores for Matcher 2. The continuous (respectively, generalized) density estimator is given by the dashed lines (solid lines).

In contrast, some of the limitations imposed by unimodal biometric systems (that is, biometric systems that rely on the evidence of a single biometric trait) can be overcome by using multiple biometric modalities [2], [24], [3], [25], [49], [47]. Such systems, known as multibiometric systems, are expected to be more reliable due to the presence of multiple pieces of evidence. In a multimodal biometric system, fusion can be done at (i) the feature level, (ii) the matching score level, or (iii) the decision level. Matching score level fusion is commonly preferred because matching scores are easily available and contain sufficient information to distinguish between genuine and impostor cases. Dass et al. [7] proposed biometric fusion using generalized densities, in which a Gaussian copula model is chosen to estimate the correlation structure. In reality, the joint distribution can sometimes be fitted better by a t-copula model than by a Gaussian copula model, owing to the nature of the data set; we therefore consider the Gaussian copula and t-copula models together and choose the more appropriate model by model selection based on the BIC criterion (the publication for this research is [16]).

7.2.2 Application in Biometric Fusion

In biometric fusion, a natural first question is how to obtain the joint distribution of the multiple modalities. Previously, independence between the individual modalities was simply assumed, but this assumption is not always true, especially when fusion is performed on the same biometric trait with different matchers. Here we model the correlation structure via semi-parametric copula models.

Based on fingerprint images in the MSU-Multimodal database (see [19]) corresponding to 100 users, genuine and impostor similarity scores were obtained for two matchers: a correlation matcher, $S_1$ (see [32]), and a minutiae-based matcher, $S_2$ (see [39]). A total of 2,800 and 4,950 vectors of similarity scores, $\mathbf{X}_j \equiv (S_1(Q,T), S_2(Q,T))^T$, were obtained for the genuine and impostor cases, respectively. The histogram of each marginal (both genuine and impostor) gives a strong indication of non-Gaussianity, justifying the need for the methodology developed in this thesis. The match scores are highly correlated, since $S_1$ and $S_2$ are applied to the same fingerprint images. Further, both $S_1$ and $S_2$ output the discrete score '0' if certain "initial conditions" are not met, resulting in a spike at 0 in the corresponding marginal distributions. For both the genuine and impostor distributions, the set of degrees of freedom $\nu$ considered is $\mathcal{A} = \{1, 2, 3, \ldots, 25, \infty\}$. We obtained $\hat\nu = 3$ and $\hat\rho(\hat\nu) = 0.4178$ for the genuine scores, and $\hat\nu = 3$ and $\hat\rho(\hat\nu) = 0.1563$ for the impostor scores. For the reasons mentioned above, joint distribution functions of the form (6.3) with $D = 2$ are appropriate for such strongly correlated biometric data.
The second experiment was carried out on the first partition of the Biometric Scores Set Release I (BSSR1) released by NIST [33], consisting of face and fingerprint images from 517 users. As in the previous case, the marginal distributions have discontinuities: the first matcher for the face biometric, $S_1$, outputs the value -1 if certain alignment conditions do not hold for the query and template face pair; the second face matcher, $S_2$, outputs continuous values and therefore has no discrete components; and the second fingerprint matcher of the MSU-Multimodal database, renamed $S_3$ here, is used for the fingerprint images, resulting in a discrete score of '0'. We obtained 517 and 266,772 vectors of similarity scores, $\mathbf{X}_j \equiv (S_1(Q,T), S_2(Q,T), S_3(Q,T))^T$, corresponding to the genuine and impostor scores, respectively. In this case, $D = 3$ and the correlation matrix $R_0$ can be written as in (7.1); $\mathcal{A}$ is taken as before. Table 7.6 gives the results of the estimation procedure.

 Matching score type    ν̂     (ρ̂1, ρ̂2, ρ̂3)
 Genuine                14    (0.76, -0.11, -0.14)
 Impostor               25    (0.30, 0.04, 0.02)

Table 7.6. Results of the estimation procedure for R and ν based on the NIST database.

We investigate the performance of fusion obtained by combining the $D$ similarity scores from the $D$ different modalities. Since we assume that the genuine and impostor distributions are of the form (6.3), the test of hypotheses (7.2) can be re-stated in terms of the score distributions under $H_0$ and $H_1$, namely,
$$H_0: F_Q(\mathbf{x}) = F^D_{\nu_0,R_0}(\mathbf{x}) \quad\text{vs.}\quad H_1: F_Q(\mathbf{x}) = F^D_{\nu_1,R_1}(\mathbf{x})$$
for some $(\nu_0, R_0)$ and $(\nu_1, R_1)$. The optimal decision rule (fusion rule) then turns out to be the likelihood ratio $LR(\mathbf{x}) = dF^D_{\nu_0,R_0}(\mathbf{x})/dF^D_{\nu_1,R_1}(\mathbf{x})$ from the Neyman-Pearson lemma, in a similar fashion to the single-modality case explained earlier. However, the LR rule cannot be used in this form, since the parameters $\nu_0$, $R_0$, $\nu_1$, and $R_1$ are unknown. The methodology developed here can be used to obtain estimators of all of these parameters, thus yielding the estimated likelihood ratio statistic
$$\widehat{LR}(\mathbf{x}) = dF^D_{\hat\nu_0,\hat R_0}(\mathbf{x})\,/\,dF^D_{\hat\nu_1,\hat R_1}(\mathbf{x}).$$
The effectiveness of the (estimated) LR fusion rule can be evaluated based on a $K$-fold cross-validation procedure. In the $k$-th iteration, $k = 1, \ldots, K$, a random subset, say $S_0$, of $n_0$ genuine scores is selected from the total of $n_{gen}$ scores for estimating the parameters $\nu_0$ and $R_0$. Similarly, for estimating $\nu_1$ and $R_1$, a random subset of $n_1$ impostor scores, say $S_1$, is selected from a total of $n_{imp}$ scores. The remaining genuine and impostor scores are used to obtain estimators of the false accept and genuine accept rates (FAR and GAR, respectively) for each threshold $\lambda$. The relevant formulas are
$$\widehat{FAR}(\lambda) = \frac{\sum_{j \in S_1^c} I_{\{\widehat{LR}(\mathbf{x}_j) > \lambda\}}}{n_{imp} - n_1} \qquad\text{and}\qquad \widehat{GAR}(\lambda) = \frac{\sum_{j \in S_0^c} I_{\{\widehat{LR}(\mathbf{x}_j) > \lambda\}}}{n_{gen} - n_0},$$
where $S_0^c$ and $S_1^c$ are the complements of $S_0$ and $S_1$, respectively. The ROC (Receiver Operating Characteristic) curve is the plot of $\widehat{FAR}(\lambda)$ versus $\widehat{GAR}(\lambda)$, with higher ROC values indicating better recognition performance. Our experiments on the MSU-Multimodal and NIST databases were based on the following choices: $K = 10$ and $n_0/n_{gen} = n_1/n_{imp} \approx 0.8$.
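Given held-out estimated likelihood-ratio values, $\widehat{FAR}$ and $\widehat{GAR}$ above reduce to threshold exceedance proportions. The sketch below illustrates the computation with synthetic stand-in scores (in practice, the LR values come from the fitted copula models; the names and the stand-in distributions are ours):

import numpy as np

def roc_points(lr_gen, lr_imp, thresholds):
    """Empirical FAR(lambda) and GAR(lambda) from estimated likelihood-ratio
    values of held-out genuine (lr_gen) and impostor (lr_imp) scores."""
    far = np.array([np.mean(lr_imp > lam) for lam in thresholds])
    gar = np.array([np.mean(lr_gen > lam) for lam in thresholds])
    return far, gar

# Illustrative stand-in data: genuine LR values tend to exceed impostor ones.
rng = np.random.default_rng(1)
lr_gen = np.exp(rng.normal(1.0, 1.0, 2000))    # hypothetical LR-hat values
lr_imp = np.exp(rng.normal(-1.0, 1.0, 5000))
thresholds = np.quantile(np.concatenate([lr_gen, lr_imp]),
                         np.linspace(0.01, 0.99, 99))
far, gar = roc_points(lr_gen, lr_imp, thresholds)
print(far[:3], gar[:3])   # the ROC curve is the plot of gar against far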
The fusion results are presented in Figure 7.7 and Figure 7.8. Note that there is a dramatic overall improvement in performance, indicating that the elicited joint distributions are good models for the distribution of similarity scores.

Figure 7.7. Performance of copula fusion on the MSU-Multimodal database. (Curves shown: copula fusion with generalized pdf, copula fusion with continuous pdf, finger matcher 1, and finger matcher 2; Genuine Accept Rate (%) versus False Accept Rate (%).)

Figure 7.8. Performance of copula fusion on the NIST database. (Curves shown: copula fusion with generalized pdf, copula fusion with continuous pdf, face matcher 1, and face matcher 2; Genuine Accept Rate (%) versus False Accept Rate (%).)

CHAPTER 8

SUMMARY AND CONCLUSION

Investigating the dependence structure of multivariate distributions has always been an interesting area for researchers. Copulas have proved to be a useful tool for analyzing multivariate dependence structures, providing more flexibility than the classical parametric approach because they can easily separate the effect of the dependence structure from that of the marginals, especially in situations where the marginals contain discrete points. This thesis developed a semi-parametric approach to estimating the dependence structure of bivariate distributions with mixed marginals. The semi-parametric estimator established in this thesis has been shown to be consistent and asymptotically normal, and a variance estimator has been provided as well. The estimation methodology involves integrals corresponding to the discrete components and is, therefore, non-standard. Furthermore, our approach was generalized to the higher-dimensional case under similar assumptions and using the same arguments; the higher-dimensional case requires more computing time.

The semi-parametric approach developed here was applied to the t-copula family and to the Gaussian copula family, which is the limiting case of the t-copula family. Estimation of the correlation matrix as well as the degrees of freedom of the t-copula was carried out using the EM algorithm; the same approach was used to estimate the correlation matrix of the Gaussian copula. We have shown large-sample consistency of our estimates and demonstrated it in several simulation examples. Finally, the methodology was applied to real data consisting of matching scores from various biometric sources; fusion based on the generalized distributions gave improved performance compared to the individual systems. As future work, we will consider extensions to copula functions derived from general elliptical distributions.

BIBLIOGRAPHY

[1] Barnard, J., McCulloch, R., and Meng, X. L., Modeling Covariance Matrices in Terms of Standard Deviations and Correlations, with Application to Shrinkage, Statistica Sinica, 10 (2000), 1281–1311.

[2] Bigun, E. S., Bigun, J., Duc, B., and Fischer, S., Expert Conciliation for Multimodal Person Authentication Systems Using Bayesian Statistics, in: Proceedings of the First International Conference on AVBPA, Crans-Montana, Switzerland (1997), 291–300.

[3] Brunelli, R., and Falavigna, D., Person Identification Using Multiple Cues, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10) (1995), 955–966.

[4] Chen, X., and Fan, Y., Estimation of Copula-Based Semiparametric Time Series Models, Journal of Econometrics, 130 (2006), 307–335.

[5] Cherubini, U., Luciano, E., and Vecchiato, W., Copula Methods in Finance, Wiley, 2004.

[6] Clayton, D. G., A Model for Association in Bivariate Life Tables and Its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence, Biometrika, 65 (1978), 141–151.

[7] Dass, S., Nandakumar, K., and Jain, A. K., A Principled Approach to Score Level Fusion in Multimodal Biometric Systems, in: Proceedings of AVBPA (2005).

[8] Demarta, S., and McNeil, A. J., The t-Copula and Related Copulas, International Statistical Review, 73 (2005), no. 1, 111–129.
[9] Embrechts, P., Lindskog, F., and McNeil, A., Modelling Dependence with Copulas and Applications to Risk Management, in: Handbook of Heavy Tailed Distributions in Finance, Elsevier (2003), 329–384.

[10] Frey, R., and McNeil, A. J., Copulas and Credit Models, RISK (2001), 111–114.

[11] Genest, C., Ghoudi, K., and Rivest, L.-P., A Semiparametric Estimation Procedure of Dependence Parameters in Multivariate Families of Distributions, Biometrika, 82 (1995), no. 3, 543–552.

[12] Genest, C., and Rivest, L.-P., Statistical Inference Procedures for Bivariate Archimedean Copulas, Journal of the American Statistical Association, 88 (1993), 1034–1043.

[13] Genest, C., Quessy, J. F., and Remillard, B., Goodness-of-Fit Procedures for Copula Models Based on the Probability Integral Transform, Scandinavian Journal of Statistics (2006), 337–366.

[14] Griffin, P., Optimal Biometric Fusion for Identity Verification, Identix Corporate Research Center Preprint RDNJ-03-0064 (2004).

[15] Hougaard, P., Harvald, B., and Holm, N. V., Measuring the Similarities Between the Lifetimes of Adult Danish Twins Born Between 1881 and 1930, Journal of the American Statistical Association, 87 (1992), 17–24.

[16] Huang, W., and Dass, S. C., Generalized t-Copula and Its Application on Biometrics, JSM Proceedings (2007).

[17] Hutchinson, T. P., and Lai, C. D., Continuous Bivariate Distributions, Emphasising Applications, Adelaide: Rumsby Scientific.

[18] Jain, A. K., Bolle, R., and Pankanti, S., BIOMETRICS: Personal Identification in Networked Society, Kluwer Academic Publishers, Boston, 1999.

[19] Jain, A. K., Prabhakar, S., and Ross, A., Fingerprint Matching: Data Acquisition and Performance Evaluation, MSU Technical Report TR99-14 (1999).

[20] Jain, A. K., Ross, A., and Prabhakar, S., An Introduction to Biometric Recognition, IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Image- and Video-Based Biometrics, 14 (2004), 4–20.

[21] Jain, A. K., and Ross, A., Multibiometric Systems, Communications of the ACM, Special Issue on Multimodal Interfaces, 47 (2004), 34–40.

[22] Joe, H., Parametric Families of Multivariate Distributions with Given Marginals, Journal of Multivariate Analysis, 46, 262–282.

[23] Kimeldorf, G., and Sampson, A. R., One-Parameter Families of Bivariate Distributions with Fixed Marginals, Communications in Statistics, 4, 293–301.

[24] Kittler, J., Hatef, M., Duin, R. P., and Matas, J. G., On Combining Classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3) (1998), 226–239.

[25] Lam, L., and Suen, C. Y., Optimal Combination of Pattern Classifiers, Pattern Recognition Letters, 16 (1995), 945–954.

[26] Lancaster, H., The Structure of Bivariate Distributions, Annals of Mathematical Statistics, 29 (1958), 719–736.

[27] Lehmann, E., Some Concepts of Dependence, Annals of Mathematical Statistics, 37 (1966), 1137–1153.

[28] Lehmann, E. L., Testing Statistical Hypotheses, Chapman & Hall, NY, 1994.

[29] Maltoni, D., Maio, D., Jain, A. K., and Prabhakar, S., Handbook of Fingerprint Recognition, Springer-Verlag, 2003.

[30] Marshall, A. W., and Olkin, I., Families of Multivariate Distributions, Journal of the American Statistical Association, 83 (1988), 834–841.

[31] McLachlan, G. J., and Krishnan, T., The EM Algorithm and Extensions, John Wiley & Sons, 1997.

[32] Nandakumar, K., and Jain, A. K., Local Correlation-Based Fingerprint Matching, in: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Kolkata (2004), 503–508.

[33] National Institute of Standards and Technology: NIST Biometric Scores Set.
Available at http://www.itl.nist.gov/iad/894.03/biometricscores (2004).

[34] Nelsen, R., An Introduction to Copulas, Springer, 1998.

[35] Oakes, D., Semiparametric Inference in a Model for Association in Bivariate Survival Data, Biometrika, 73 (1986), 353–361.

[36] Oakes, D., Multivariate Survival Distributions, Journal of Nonparametric Statistics, 3 (1994), 343–354.

[37] Pitt, M., Chan, D., and Kohn, R., Efficient Bayesian Inference for Gaussian Copula Regression Models, Biometrika, 93 (2006), 537–554.

[38] Prabhakar, S., and Jain, A. K., Decision-Level Fusion in Fingerprint Verification, Pattern Recognition, 35 (2002), 861–874.

[39] Ratha, N. K., Chen, S., Karu, K., and Jain, A. K., A Real-Time Matching System for Large Fingerprint Databases, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 (1996), 799–813.

[40] Rüschendorf, L., Asymptotic Distribution of Multivariate Rank Order Statistics, The Annals of Statistics, 4 (1976), 912–923.

[41] Ruymgaart, F. H., Shorack, G. R., and van Zwet, W. R., Asymptotic Normality of Nonparametric Tests for Independence, Annals of Mathematical Statistics, 43 (1972), 1122–1135.

[42] Ruymgaart, F. H., Asymptotic Normality of Nonparametric Tests for Independence, Annals of Statistics, 2 (1974), 892–910.

[43] Shao, J., Mathematical Statistics, Springer, 2003.

[44] Shih, J., and Louis, T., Inferences on the Association Parameter in Copula Models for Bivariate Survival Data, Biometrics, 51 (1995), 1384–1399.

[45] Shorack, G. R., Functions of Order Statistics, Annals of Mathematical Statistics, 43 (1972), 412–427.

[46] Sklar, A., Fonctions de répartition à n dimensions et leurs marges, Publications de l'Institut de Statistique de l'Université de Paris, 8, 229–231.

[47] Toh, K. A., Jiang, X., and Yau, W. Y., Exploiting Global and Local Decisions for Multimodal Biometrics Verification, IEEE Transactions on Signal Processing, 52 (2004), 3059–3072.

[48] US-VISIT, U.S. Department of Homeland Security. Online: http://www.dhs.gov/dhspublic/display?theme=91.

[49] Wang, Y., Tan, T., and Jain, A. K., Combining Face and Iris Biometrics for Identity Verification, in: Proceedings of the Fourth International Conference on AVBPA, Guildford, U.K. (2003), 805–813.