A DEPENDENCE MODEL FOR UNBALANCED CROP INSURANCE INDEMNITY AMOUNTS By Yi-Hsuen Lin A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics—Master of Science 2023 ABSTRACT In this thesis, a copula model is constructed to estimate dependency and calculate the Value at Risk for insurance coverage under the dependence of the covered losses. The dependence model is illustrated in a three-dimensional setting to simplify the complex theoretical functions and provide an accessible introduction to the copula model. The modeling uses the U.S. crop insurance dataset aggregated by each county level and commodity type. The composite likelihood approach helps to simplify the computation in high-dimensional problems by approximating the negative log-likelihood using bivariate components. In this study, the majorization-minimization principle is employed to estimate the parameters of the normal copula by minimizing the composite likelihood iteratively. To avoid overfitting and result in a valid correlation matrix, the L1 penalty is applied to induce sparsity and shrink irrelevant parameters toward zero. The optimal tuning parameter is selected based on the BIC score to generate a positive semi-definite correlation matrix for the result. In the Appendix of the thesis, the dependence model is extended to a high dimension. The Value at Risk computed for the fictional insurance contract in the data analysis results in a higher value when considering dependence between variables. ACKNOWLEDGEMENTS I would like to express my sincere gratitude to my advisor, Dr. Gee Lee, for his guidance and offering me the opportunity to participate in this research. This thesis would not have reached its completion without his constant support throughout the research process. His belief in my capabilities has always motivated me, driving me to achieve more than I thought possible. Words cannot adequately express how lucky I am to have learned and grown under his mentorship. I extend my deep appreciation to my thesis committee members, Dr. Haiyan Liu and Dr. Haolei Weng, who provide their invaluable time and constructive suggestions that have enriched the content and quality of my work. Most importantly, none of this would have been possible without my family and friends. I am profoundly grateful to my parents, San-Ching Lin and Mei-Ling Chen, for their unconditional love and support. Being your daughter is an immeasurable blessing I deeply treasure. Your constant encouragement and understanding have given me the freedom to explore my passions, pursue my dreams without hesitation, and evolve into the person I am today. I am also thankful to my lovely furry friend, Bella, who accompanied me during late-night paper writing and through frustrated moments, bringing joy even on the most challenging days. This thesis is a culmination of the support and encouragement I’ve received from each one of you. Thank you all for being an integral part of this remarkable journey! iii TABLE OF CONTENTS LIST OF FIGURES . LIST OF TABLES . INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v vi 1 CHAPTER 1: COPULA MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 1.1 Assumptions . 1.2 Estimation of Copula Parameters . 
. . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 The Distribution of Correlation Matrices . . . . . . . . . . . . . . . . . . . . . . . 13 . . . . . . . . . . . . . . . . . . . . CHAPTER 2: SIMULATION . 2.1 . 2.2 C and IC Analysis . 2.3 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 . 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Influence of Spurious Correlations . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 3: RESULTS . . . 3.1 Data Description . . 3.2 Marginal Models . 3.3 Dependence Model . 3.4 Value at Risk (VaR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 CHAPTER 4: CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 BIBLIOGRAPHY . APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 iv LIST OF FIGURES Figure 1.1: Graph of ρjk as a function of ψjk . . . . . . . . . . . . . . . . . . . . . . . . . 6 Figure 2.1: Influence of spurious correlations . . . . . . . . . . . . . . . . . . . . . . . . . 16 Figure 3.1: Scatter plot of liabilities and indemnities at the county level . . . . . . . . . . . 23 Figure 3.2: Solution path and optimal tuning parameter for selected commodities . . . . . . 24 Figure 3.3: Simulated density of indemnities for a fictional insurance coverage . . . . . . . 26 v LIST OF TABLES Table 2.1: C and IC analysis result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Table 3.1: Dataset of liabilities and indemnities . . . . . . . . . . . . . . . . . . . . . . . . 22 Table 3.2: Coefficient estimates for the marginal gamma model . . . . . . . . . . . . . . . 23 Table 3.3: Correlation of selected commodities . . . . . . . . . . . . . . . . . . . . . . . . 23 vi INTRODUCTION Dependency refers to the relationship between variables, and the analysis of dependency is crucial since it helps building accurate models and managing risk. Suppose we are given n sample points for a p-dimensional multivariate response, where each of the p marginal response has a different distribution. In this case, the copula approach can be employed to model the dependence between random variables. The classic textbook Nelsen (1999) provides an introduction to the copula model. The literature on copula modeling has developed significantly in recent years and provides numerous extended introduction to the copula models, such as Hofert et al. (2018) and Czado (2019) including examples in the R programming language, or Frees and Valdez (1998) exploring the modeling in the actuarial science context. For the use of the copula approach for count data, the reader may see Nikoloulopoulos and Karlis (2010), or Shi and Emiliano (2014). Sklar’s Theorem allows us to derive the multivariate distribution function of a random vector through the use of copula functions, with its corresponding marginal distributions. 
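For reference, Sklar's theorem states that any joint distribution function F with marginals F1, . . . , Fp can be written as

F(y1, . . . , yp) = C(F1(y1), . . . , Fp(yp)),

for some copula C, and the copula is unique when the marginals are continuous. Modeling the marginals and the copula separately therefore recovers the full joint distribution.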
If we attempt to estimate the normal copula parameters with a p-dimensional correlation matrix, where the number of dependence parameters exceeds the sample size n (that is, $\binom{p}{2} > n$), the estimation becomes a high-dimensional problem. In this situation, using maximum likelihood to estimate the parameters may give rise to several problems. One problem is potential overfitting through spurious correlations: when all off-diagonal entries of the correlation matrix are non-zero, the model may be fitting noise that is not truly meaningful. Another problem is the difficulty of ensuring that the estimated correlation matrix satisfies the positive semi-definite property. Due to these problems, we hope to automatically set some correlation entries to zero during the estimation process, so that we can introduce sparsity that simplifies the model, avoid overfitting, and obtain a valid correlation matrix. There are various ways to achieve this, such as the lasso and ridge regression approaches. In this thesis, the lasso penalty (L1 penalty) is used to select and shrink irrelevant correlation parameters toward zero. Nevertheless, while reducing the number of non-zero entries, the procedure cannot guarantee that the resulting matrix is positive semi-definite; it can only increase the likelihood of obtaining a valid correlation matrix. More relevant details will be discussed in the following chapter. The lasso is a variable selection method that solves an optimization problem with a constraint, and it was initially introduced by Tibshirani (1996). In this thesis, the lasso approach is used to add a penalty term to the negative log-likelihood function, with a tuning parameter that controls the severity of the penalty. Due to the availability of efficient algorithms for solving the estimation problem, the lasso method has become increasingly popular and has been extended in several directions, such as shrinking grouped variables in Yuan and Lin (2006), handling non-normal responses in Yang and Zou (2015), and the Tweedie compound distribution in Qian et al. (2016). The Tweedie compound distribution is commonly used in actuarial science, insurance, and finance to model data with both continuous and discrete characteristics. The distribution can be applied in frequency-severity regression as a distribution for modeling insurance losses, the severity component of the regression. Frequency-severity regression is commonly used in actuarial science to analyze insurance claim occurrence (frequency) and claim amounts (severity). One can refer to Frees et al. (2011) and Frees (2014) for an introduction to frequency-severity regression and its applications. In this thesis, the approach used for indemnity modeling has a close connection with the frequency-severity regression approach, as both model responses consisting of continuous and discrete components. Moreover, the gamma distribution is mainly used to model the crop insurance indemnities in this research. This research demonstrates the use of the lasso technique on the multivariate normal copula, which helps to simplify the copula parameters by inducing sparsity. The approach takes advantage of the composite likelihood approach for estimation. The composite likelihood approach, explained in Zhao and Joe (2005), Song et al. (2005), and Joe (2014), constructs an approximation to the log-likelihood function using bivariate components.
It is used for estimating the dependence parameters in the copula model since the log-likelihood may be hard to compute. It is a straightforward method that may be adaptable to other copula families. The main intent of this research is to explore the dependence model and underlying theoretical 2 functions that enable the selection and estimation of the normal copula parameters in a three-dimensional setting. By concentrating on three dimensions, we can simplify the explanation of the core concepts and provide a more approachable introduction to the copula model. Although we restrict our focus to three dimensions, the approach explained in this thesis can be extended to higher dimensions. And extensions to other copula families might be possible using the methodologies and insights gained from the research. The rest of the thesis is organized into four parts. Chapter 1 illustrates the assumptions and approach utilized in the dependence model in three-dimensional setting. The section also introduces the distribution of correlation matrices that is useful while generating valid correlation matrices and may be used in future work. Chapter 2 conducts some simulation studies using the copula estimation routine to evaluate the performance and properties of the model. Chapter 3 shows the results of the estimation using the copula model and the aggregated U.S. crop insurance dataset and calculates Value at Risk for the insurance portfolio losses. Moreover, the Appendix shows the extensions of the normal copula model in Chapter 1 to high dimensions. 3 CHAPTER 1: COPULA MODEL 1.1 Assumptions Suppose we are given 3-dimensional multivariate responses of n samples, denoted by (yi1, yi2, yi3) for i = 1, ..., n, and the marginal distributions Fi1(yi1), Fi2(yi2), Fi3(yi3) are known. Then, the task is to use these to derive the joint density among the variables, f (yi1, yi2, yi3). Various copula models could fulfill the task, and an introduction to the copulas is explained in Nelsen (1999). Here, we assume using a multivariate normal copula C that links the univariate marginals to their joint distribution and can be written as C(u1, u2, u3; Σ) = P r(U1 < u1, U2 < u2, U3 < u3) = Φ3(Φ−1(u1), Φ−1(u2), Φ−1(u3); Σ), (1.1) where Ui is uniformly distributed on the interval [0, 1], Φ3 is the cumulative distribution function of a 3-dimensional multivariate normal distribution with zero means and correlation matrix Σ, and Φ−1 is the inverse cumulative distribution function of standard normal distribution. In this case, the density of the copula C is c(u1, u2, u3; Σ) = ∂3 ∂u1∂u2∂u3 C(u1, u2, u3; Σ). (1.2) Using the probability integral transform (u1, u2, u3) = (F1(y1), F2(y2), F3(y3)), the normal copula C can also be written as C(Fi1(yi1), Fi2(yi2), Fi3(yi3); Σ) = F (yi1, yi2, yi3). (1.3) By the chain rule, the joint density of the multivariate response is f (yi1, yi2, yi3) = ∂3 ∂yi1∂yi2∂yi3 C(Fi1(yi1), Fi2(yi2), Fi3(yi3)) (1.4) = c(Fi1(yi1), Fi2(yi2), Fi3(yi3); Σ) · fi1(yi1) · fi2(yi2) · fi3(yi3). 4 1.2 Estimation of Copula Parameters The correlation matrix Σ is symmetric and displays the dependence parameter between the i-th and j-th variables in (i, j) entry. In the three-dimensional setting,       1 ρ12 ρ13 Σ = ρ21 1 ρ23 ρ31 ρ32 1       . (1.5) In the research, the pairwise composite likelihood method is used to estimate the dependence parameters ρij. 
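Before describing the composite likelihood estimation in detail, the following sketch makes Equations (1.1)–(1.4) concrete: it simulates from a three-dimensional normal copula and evaluates the joint density as the copula density multiplied by the marginal densities. The correlation matrix and the gamma marginals below are placeholder choices for illustration, not quantities estimated in the thesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder correlation matrix Sigma for the 3-dimensional normal copula.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])

# Placeholder gamma marginals F_1, F_2, F_3.
marginals = [stats.gamma(a=2, scale=100),
             stats.gamma(a=2, scale=150),
             stats.gamma(a=3, scale=80)]

# Simulate (u1, u2, u3) from the normal copula in Equation (1.1):
# draw a multivariate normal vector, then apply the standard normal cdf.
z = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=5)
u = stats.norm.cdf(z)

# Transform the uniforms to the marginal scale via the inverse cdf.
y = np.column_stack([m.ppf(u[:, j]) for j, m in enumerate(marginals)])

def normal_copula_density(u_row, Sigma):
    """Density c(u1, u2, u3; Sigma) of the normal copula, Equation (1.2)."""
    q = stats.norm.ppf(u_row)  # (Phi^{-1}(u1), Phi^{-1}(u2), Phi^{-1}(u3))
    mvn = stats.multivariate_normal(mean=np.zeros(len(q)), cov=Sigma)
    return mvn.pdf(q) / np.prod(stats.norm.pdf(q))

# Joint density f(y1, y2, y3) = c(F1(y1), F2(y2), F3(y3); Sigma) f1(y1) f2(y2) f3(y3),
# as in Equation (1.4), evaluated at the first simulated observation.
u_row = np.array([m.cdf(y[0, j]) for j, m in enumerate(marginals)])
joint_density = normal_copula_density(u_row, Sigma) * np.prod(
    [m.pdf(y[0, j]) for j, m in enumerate(marginals)])
print(joint_density)
```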
The method is well explained in a multivariate setting by Zhao and Joe (2005), and it is helpful to lower the computational complexity by reducing the dimensionality of the data, especially in high-dimensional problems with complex dependencies among the variables. Assuming all of the three-dimensional multivariate responses for each sample are fully defined, the composite negative log-likelihood function will be ℓ(Σ) = − n (cid:88) i=1 (cid:20) 1 3 − 1 log c (Fi1(yi1), Fi2(yi2); ρ12) + log f1(yi1) + log f2(yi2) + log c (Fi1(yi1), Fi3(yi3); ρ13) + log f1(yi1) + log f3(yi3) (cid:21) + log c (Fi2(yi2), Fi2(yi3); ρ23) + log f2(yi2) + log f3(yi3) , (1.6) where 1/(3 − 1) is the weight to match the log-likelihood when independent case and c is a density of bivariate normal copula. However, in general, one or more responses for some of the samples may be undefined, then the weight will become 1/(mi − 1) where mi is the number of defined responses in observation i. Note that we know the marginal distributions Fi1(yi1), Fi2(yi2), Fi3(yi3), as well as the densities fi1(yi1), fi2(yi2), fi3(yi3) for i = 1, 2, ..., n, the negative log-likelihood ℓ(Σ) is a function in the dependence parameters only. The optimal values of the parameters ρjk that minimize the negative log-likelihood function ℓ(Σ) would fall between −∞ 5 and ∞. Nonetheless, in order to ensure the dependence parameters satisfy −1 < ρjk < 1, we set ρjk = tan−1(ψjk) π/2 , −∞ < ψjk < ∞, (1.7) Figure 1.1 shows that the function is a one-to-one mapping, and for any ψjk, the corresponding dependence parameters ρjk will fall within the range of -1 and 1. Figure 1.1: Graph of ρjk as a function of ψjk Then, we use the L1 penalty method to reduce the number of non-zero parameters and avoid overfitting, which is helpful in high-dimensional problems which have a large number of parameters to estimate. Now, the penalized negative log-likelihood is ℓP (Ψ) = ℓ(Σ) + λ(|ψ12| + |ψ13| + |ψ23|) (1.8) where λ is tuning parameter that control the severity of the penalty and       1 ψ12 ψ13 Ψ = ψ21 1 ψ23 ψ31 ψ32 1       . (1.9) Then, the problem now is to estimate the alternative parameters ψij by minimizing the penalized 6 −10−50510−1.0−0.50.00.51.0Graph of r as a function of yyr negative log-likelihood function ℓP (Ψ). Some of the estimates may end up shrunk to zero due to the lasso penalty. The tuning parameter λ controls the number of zero entries in the estimated matrix Ψ. When λ increases, more and more coefficients are set to zero and eliminated. There are different ways to determine λ, such as cross-validation and minimizing AIC (or BIC). In the thesis, the tuning parameter is optimized by minimizing the BIC score. Since we assume that c is the density of a bivariate normal copula, the negative log-likelihood function can be written as ℓ(Σ) = − n (cid:88) i=1 (cid:20) 1 3 − 1 (cid:32) log 1 (cid:112)1 − ρ2 12 (cid:33) (cid:32) + log 1 (cid:112)1 − ρ2 13 (cid:33) (cid:32) + log (cid:33) 1 (cid:112)1 − ρ2 23 − (a2 i + b2 12 − 2aibiρ12 i )ρ2 2(1 − ρ12)2 − (a2 i + c2 13 − 2aiciρ13 i )ρ2 2(1 − ρ13)2 − (b2 i + c2 23 − 2biciρ23 i )ρ2 2(1 − ρ23)2 + log f1(yi1) + log f2(yi2) + log f1(yi1) + log f3(yi3) + log f2(yi2) + log f3(yi3) (cid:21) , where with √ 2 erf−1(2ui1 − 1), ui1 = Fi1(yi1) √ 2 erf−1(2ui2 − 1), ui2 = Fi2(yi2) √ 2 erf−1(2ui3 − 1), ui3 = Fi3(yi3), ai = bi = ci = erf(z) = 2 √ π (cid:90) z 0 exp(−t2)dt. 
(1.10) (1.11) (1.12) 7 In order to estimate the dependent parameters, let Φi12 = − (cid:34) (cid:32) log 1 3 − 1 1 (cid:112)1 − ρ2 12 (cid:33) − Φi13 = − (cid:34) (cid:32) log 1 3 − 1 1 (cid:112)1 − ρ2 13 (cid:33) − Φi23 = − (cid:34) (cid:32) log 1 3 − 1 1 (cid:112)1 − ρ2 23 (cid:33) − (a2 i + b2 12 − 2aibiρ12 i )ρ2 2(1 − ρ12)2 (a2 i + c2 13 − 2aiciρ13 i )ρ2 2(1 − ρ13)2 (b2 i + c2 23 − 2biciρ23 i )ρ2 2(1 − ρ23)2 so that ℓ(Σ) = n (cid:88) i=1 Φi12 + Φi13 + Φi23. + log fi1(yi1) + log fi2(yi2) (cid:35) (cid:35) + log fi1(yi1) + log fi3(yi3) + log fi2(yi2) + log fi3(yi3) (cid:35) (1.13) (1.14) Remember that we map ρjk to ψjk by ρjk = 2 tan−1(ψjk)/π for ensuring the dependence parameters satisfy the condition of between -1 and 1. To obtain the Hessian and gradient of log-likelihood ℓ(Σ), the log-likelihood ℓ(Σ) should be differentiated in terms of the parameters ψjk using the chain rule. Then consider ∂Φi12 ∂ψ12 = ∂Φi12 ∂ρ12 ∂ρ12 ∂ψ12 , ∂Φi13 ∂ψ13 = ∂Φi13 ∂ρ13 ∂ρ13 ∂ψ13 , ∂Φi23 ∂ψ23 = ∂Φi23 ∂ρ23 ∂ρ23 ∂ψ23 , (1.15) where ∂Φi12 ∂ρ12 ∂Φi13 ∂ρ13 ∂Φi23 ∂ρ23 = − = − = − 1 3 − 1 1 3 − 1 1 3 − 1 (cid:20) ρ12 1 − ρ2 12 (cid:20) ρ13 1 − ρ2 13 (cid:20) ρ23 1 − ρ2 23 + + + aibiρ2 12 − (a2 aiciρ2 13 − (a2 biciρ2 23 − (b2 (cid:21) (cid:21) i + b2 i )ρ12 + aibi (1 − ρ2 12)2 i + c2 i )ρ13 + aici (1 − ρ2 13)2 i + c2 i )ρ23 + bici (1 − ρ2 23)2 (cid:21) , (1.16) 8 and ∂ρ12 ∂ψ12 = 2 π(1 + ψ2 12) , ∂ρ13 ∂ψ13 = 2 π(1 + ψ2 13) , ∂ρ23 ∂ψ23 = 2 π(1 + ψ2 23) . (1.17) For the Hessian, we have ∂2Φi12 ∂ψ2 12 ∂2Φi13 ∂ψ2 13 ∂2Φi23 ∂ψ2 23 = = = ∂2Φi12 ∂ρ2 12 ∂2Φi13 ∂ρ2 13 ∂2Φi23 ∂ρ2 23 (cid:18) ∂ρ12 ∂ψ12 (cid:18) ∂ρ13 ∂ψ13 (cid:18) ∂ρ23 ∂ψ23 (cid:19)2 (cid:19)2 (cid:19)2 + + + ∂Φi12 ∂ρ12 ∂Φi13 ∂ρ13 ∂Φi23 ∂ρ23 (cid:18) ∂2ρ12 ∂ψ2 12 (cid:18) ∂2ρ13 ∂ψ2 13 (cid:18) ∂2ρ23 ∂ψ2 23 (cid:19) (cid:19) (cid:19) , (1.18) where ∂2Φi12 ∂ρ2 12 ∂2Φi13 ∂ρ2 13 ∂2Φi23 ∂ρ2 23 = − = − = − (cid:34) (cid:34) (cid:34) 1 2 1 2 1 2 1 + ρ2 12 (1 − ρ2 12)2 + 1 + ρ2 13 (1 − ρ2 13)2 + 1 + ρ2 23 (1 − ρ2 23)2 + and 2aibiρ12 − (a2 12)2 (1 − ρ2 i + b2 i ) 2aiciρ13 − (a2 13)2 (1 − ρ2 i + c2 i ) 2biciρ23 − (b2 23)2 (1 − ρ2 i + c2 i ) + + + 4 {aibiρ2 12 − (a2 i + b2 i )ρ12 + aibi} (1 − ρ2 12)ρ12 (1 − ρ2 12)4 4 {aiciρ2 13 − (a2 i + c2 i )ρ13 + aici} (1 − ρ2 13)ρ13 (cid:35) (cid:35) (1 − ρ2 13)4 4 {biciρ2 23 − (b2 i + c2 i )ρ23 + bici} (1 − ρ2 23)ρ23 (1 − ρ2 23)4 (1.19) (cid:35) , ∂2ρ12 ∂ψ2 12 = − 4ψ12 π(1 + ψ2 12)2 , ∂2ρ13 ∂ψ2 13 = − 4ψ13 π(1 + ψ2 13)2 , ∂2ρ23 ∂ψ2 23 = − 4ψ23 π(1 + ψ2 23)2 . (1.20) The (cid:0)3 2 (cid:1) dimensional vector of parameters is ψ = (ψ12, ψ13, ψ23)T . (1.21) 9 The (cid:0)3 2 (cid:1) × (cid:0)3 2 (cid:1) Hessian matrix Hψ is diagonal and the off-diagonal elements are zero Hψ = Hψ(1, 2)       0 0 0 Hψ(1, 3) 0 0 0 Hψ(2, 3)       . (1.22) where and similarly Hψ(1, 2) = ∂2 ∂ψ2 12 ℓ(Σ) = n (cid:88) i=1 ∂2 ∂ψ2 12 [Φi12 + Φi13 + Φi23] = n (cid:88) i=1 ∂2Φi12 ∂ψ2 12 (1.23) Hψ(1, 3) = n (cid:88) i=1 ∂2Φi13 ∂ψ2 13 , Hψ(2, 3) = n (cid:88) i=1 ∂2Φi23 ∂ψ2 23 . And the (cid:0)3 (cid:1) gradient vector Uψ is 2 Uψ = (Uψ(1, 2), Uψ(1, 3), Uψ(2, 3))T n (cid:88) = ( i=1 ∂Φi12 ∂ψ12 , n (cid:88) i=1 ∂Φi13 ∂ψ13 , n (cid:88) i=1 ∂Φi23 ∂ψ23 )T (1.24) (1.25) Note ˜ψ denotes the parameters updated in the outer loop of Algorithm 1, and ˘ψ for the inner loop. 
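To make the chain rule above concrete, the short sketch below evaluates the derivative of the copula part of one pairwise term with respect to ψjk using Equations (1.16) and (1.17), and compares it with a finite-difference approximation. The values of ai, bi, and ψjk are placeholders, and the marginal log-density terms are omitted since they do not depend on the correlation parameter.

```python
import numpy as np

def neg_log_pair_copula(rho, a, b, m=3):
    """Copula part of one pairwise term: -1/(m-1) * log c(u_ij, u_ik; rho)."""
    quad = (rho**2 * (a**2 + b**2) - 2.0 * a * b * rho) / (2.0 * (1.0 - rho**2))
    return -(1.0 / (m - 1)) * (-0.5 * np.log(1.0 - rho**2) - quad)

def d_pair_d_psi(psi, a, b, m=3):
    """Chain rule dPhi/dpsi = (dPhi/drho) * (drho/dpsi), Equations (1.16)-(1.17)."""
    rho = np.arctan(psi) / (np.pi / 2.0)
    dphi_drho = -(1.0 / (m - 1)) * (
        rho / (1.0 - rho**2)
        + (a * b * rho**2 - (a**2 + b**2) * rho + a * b) / (1.0 - rho**2) ** 2
    )
    drho_dpsi = 2.0 / (np.pi * (1.0 + psi**2))
    return dphi_drho * drho_dpsi

# Placeholder values for one observation and one pair (j, k).
a, b, psi = 0.7, -0.4, 0.9

# Analytic derivative versus a central finite difference in psi.
eps = 1e-6
rho_plus = np.arctan(psi + eps) / (np.pi / 2.0)
rho_minus = np.arctan(psi - eps) / (np.pi / 2.0)
numeric = (neg_log_pair_copula(rho_plus, a, b)
           - neg_log_pair_copula(rho_minus, a, b)) / (2.0 * eps)
print(d_pair_d_psi(psi, a, b), numeric)  # the two values should agree closely
```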
Then, the second-order (quadratic) Taylor series approximation of the unpenalized negative log-likelihood function ℓ(Σ) around a given point ˜ψ is ℓQ(ψ) = ℓ( ˜ψ) + ˜U T ψ (ψ − ˜ψ) + 1 2 (ψ − ˜ψ)T ˜Hψ(ψ − ˜ψ), (1.26) where ˜Uψ is the gradient vector and ˜Hψ is the Hessian matrix obtained at the given point ˜ψ = ( ˜ψ12, ˜ψ13, ˜ψ23)T . Now, the problem is to minimize the following penalized objective function: PQ(ψ) = ℓQ(ψ) + λ (|ψ12| + |ψ13| + |ψ23|) . (1.27) 10 However, the log-likelihood ℓQ(ψ) is complicated and the penalized objective function PQ(Ψ) may be non-smooth, so the problem cannot be directly handled by first and second-order conditions for optimization. Thus, the majorization-minimization algorithm is used to accomplish the task, and it is well explained in Yang and Zou (2015) and Qian et al. (2016). The key step of applying the majorization-minimization is constructing a surrogate function that majorizes the objective function. The algorithm estimates the parameters by iteratively minimizing a sequence of surrogate functions. In this paper, we construct the surrogate function by second-order Taylor expansion. First, define UQ = ∇ℓQ(ψ) HQ = ∇2ℓQ(ψ), (1.28) (1.29) where the entries of UQ are given by UQ(1, 2) = ˜Uψ(1, 2) + ˜Hψ(1, 2)(ψ12 − ˜ψ12) (cid:33) ∂ ˜Φi12 ∂ψ12 ∂2 ˜Φi12 ∂ψ2 12 (cid:32) n (cid:32) n (cid:88) (cid:88) (cid:33) = + i=1 i=1 (ψ12 − ˜ψ12), UQ(1, 3) = ˜Uψ(1, 3) + ˜Hψ(1, 3)(ψ13 − ˜ψ13) (cid:33) ∂ ˜Φi13 ∂ψ13 ∂2 ˜Φi13 ∂ψ2 13 (cid:32) n (cid:32) n (cid:88) (cid:88) (cid:33) = + i=1 i=1 (ψ13 − ˜ψ13), UQ(2, 3) = (cid:32) n (cid:88) i=1 ∂ ˜Φi23 ∂ψ23 (cid:33) + (cid:32) n (cid:88) i=1 (cid:33) ∂2 ˜Φi23 ∂ψ2 23 (ψ23 − ˜ψ23), (1.30) 11 and the diagonal entries of HQ are given by HQ(1, 2) = ˜Hψ(1, 2) = HQ(1, 3) = ˜Hψ(1, 3) = HQ(2, 3) = ˜Hψ(2, 3) = n (cid:88) i=1 n (cid:88) i=1 n (cid:88) i=1 ∂2 ˜Φi12 ∂ψ2 12 ∂2 ˜Φi13 ∂ψ2 13 ∂2 ˜Φi23 ∂ψ2 23 , , , (1.31) where HQ is also diagonal. Then, to ensure the surrogate function majorizes the objective function PQ(ψ), we substitute the Hessian matrix HQ with its largest eigenvalue in the second-order Taylor expansion of the unpenalized negative log-likelihood function ℓQ(ψ). Given ℓQ(ψ), consider updating current estimates ˘ψ. The new parameters ˘ψ(new) are given by solving ˘ψ(new) = arg min ψ ℓQ( ˘ψ)+ ˜U T Q (ψ − ˘ψ)+(ψ − ˘ψ)T ˜γψ(ψ − ˘ψ)+λ(|ψ12|+|ψ13|+|ψ23|). (1.32) The solution for this problem is given by ˘ψjk(new) = (cid:16) ˜γψ ˘ψjk − ˜UQ(j, k) (cid:17) (cid:16) 1 − ˜γψ λ ˘ψjk− ˜UQ(j,k)| |˜γψ (cid:17) + , (1.33) where ˜γψ is the largest eigenvalue of ˜HQ. The entire procedure is summarized in Algorithm 1. 12 Algorithm 1: Algorithm for penalized estimation of the copula parameters. 1. Initialize ˜ψ. 2. (Outer Loop) Update the penalized objective function in Equation (1.27). (a) Compute ˜UQ, and ˜HQ. (b) Compute ˜γψ, the largest eigenvalue of ˜HQ. (c) (Inner Loop) Obtain the minimizer of the objective function in (1.27). • Initialize ˘ψ = ˜ψ. • Repeat the following until ˘ψ converges. – Compute ˘ψ(new) by Equation (4.38). – Set ˘ψ = ˘ψ(new). • Set ˜ψ = ˘ψ. 3. Repeat until ˜ψ converges. 1.3 The Distribution of Correlation Matrices The authors Pourahmadi and Wang (2015) have shown any p dimensional correlation matrix Σ can be Cholesky decomposed into Σ = BBT , where B is a lower triangular matrix with entries bjk, and bjk =    1 for j = k = 1, cos θj1 k−1 (cid:89) sin θjh h=1 cos θjk · k−1 (cid:89) h=1 for j = 2, · · · , p, for j = k > 1, (1.34) sin θjh for 2 ≤ k ≤ j − 1, 13 where θjk, j > k are some angles. 
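A minimal sketch of this construction for a general dimension p (written with zero-based indices) is given below. Because Σ = BB^T with unit-norm rows of B, the result has unit diagonal and is positive semi-definite by construction; the three-dimensional example that follows is a special case.

```python
import numpy as np

def corr_from_angles(theta):
    """Build the Cholesky factor B from a lower-triangular matrix of angles,
    following Equation (1.34), and return the correlation matrix Sigma = B B^T."""
    p = theta.shape[0]
    B = np.zeros((p, p))
    B[0, 0] = 1.0
    for j in range(1, p):
        B[j, 0] = np.cos(theta[j, 0])
        for k in range(1, j):
            B[j, k] = np.cos(theta[j, k]) * np.prod(np.sin(theta[j, :k]))
        B[j, j] = np.prod(np.sin(theta[j, :j]))
    return B @ B.T

# Placeholder angles in (0, pi); only the strictly lower-triangular entries are used.
rng = np.random.default_rng(1)
p = 4
theta = np.zeros((p, p))
low = np.tril_indices(p, k=-1)
theta[low] = rng.uniform(0.0, np.pi, size=len(low[0]))

Sigma = corr_from_angles(theta)
print(np.allclose(np.diag(Sigma), 1.0))   # unit diagonal
print(np.linalg.eigvalsh(Sigma).min())    # smallest eigenvalue is non-negative (up to rounding)
```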
For example, in a three dimensional setting, 1 0 B = cos θ21 sin θ21       0 0       . (1.35) cos θ31 cos θ32 sin θ31 sin θ32 sin θ31 The matrix B is unique if its diagonal entries are positive, or equivalently if the angles are restricted (cid:1) matrix Θ = (θjk) is to the range (0, π). Furthermore, the transformation from Σ to a (cid:0)p (cid:1) × (cid:0)p 2 2 one-to-one, where θjk = 0 for j ≤ k and θjk are angles for j > k. Thus, this helps us to generate valid high-dimensional correlation matrices using (cid:0)p (cid:1) parameters in the matrix Θ. 2 Now, we may wonder if the resulting matrix from the estimating procedure shown in Section 1.2 would be a valid correlation matrix that is positive semi-definite. However, it would not be an issue for practical application since the estimated correlations will be close to the true underlying correlations of the data when the sample size is large enough. Practically, the resulting correlation matrix in Section 3 is positive semi-definite since the smallest eigenvalue is positive. Figure 3.2 shows that the smallest eigenvalue decreases when the tuning parameter gets smaller, which means too many non-zero parameters may result in an invalid correlation matrix. Future work may use the (cid:0)p (cid:1) parameters in matrix Θ as alternatives to correlations ρjk in the copula model, to avoid any 2 possible theoretical problems in the algorithm. For this, the following compact expression from Pourahmadi and Wang (2015) will be useful to transform correlations to thetas: ρjk = cj1ck1 + j−1 (cid:88) h=2 cjhckh h−1 (cid:89) l=1 sjlskl + ckj j−1 (cid:89) l=1 sjlskl, 1 ≤ j < k ≤ p, (1.36) where cjk = cos θjk and sjk = sin θjk. However, applying the expression to the copula model may increase the computational complexity, especially in the differentiation parts. In the thesis, we rely on the ρij = tan−1(ψij)/(π/2) mapping instead of the alternative parameterization using Θ for its simplicity. 14 CHAPTER 2: SIMULATION 2.1 Influence of Spurious Correlations The purpose of this simulation is to analyze the influence of spurious correlations and further demonstrate the advantages of setting certain correlations to zero. Assume we have the following correlation matrices with their corresponding situation: 1 0.9 0.1 0.9 0 1 0.9 −0.1       A = 0.9 1 0.1 0.1 0.1 1       , B =       1 0.9 0       , C =  0      1 1 0 −0.1 −0.1 1       0.9 1 −0.1 , (2.1) • A: Positive spurious correlations in the other off-diagonal entries • B: True underlying correlation matrix without spurious correlations • C: Negative spurious correlations in the other off-diagonal entries The task is to analyze the variance and the 95th percentile of insurance portfolio losses with each dependence matrix from the simulation. To do so, we simulate B samples from copulas with each correlation matrix and denote the randomly generated uniform variables by uA,ij, uB,ij, uC,ij, for i = 1, · · · , B and j = 1, 2, 3. Assume the distribution of each response is known, which is a gamma distribution with shape 2 and scale 100. The realized losses can be simulated using inverse transform sampling: yA,ij = F −1(uA,ij), yB,ij = F −1(uB,ij), yC,ij = F −1(uC,ij). 
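The sampling step just described can be sketched as follows, shown for the matrix A of Equation (2.1) and the gamma marginal with shape 2 and scale 100 assumed above; the seed and the number of samples are arbitrary choices, and the same steps with B and C give the other two scenarios.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Correlation matrix A from Equation (2.1): positive spurious correlations.
A = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

n_sim = 100_000

# Draw from the normal copula: multivariate normal, then component-wise standard normal cdf.
z = rng.multivariate_normal(np.zeros(3), A, size=n_sim)
u_A = stats.norm.cdf(z)

# Inverse transform sampling to the gamma(shape=2, scale=100) marginal.
y_A = stats.gamma(a=2, scale=100).ppf(u_A)
print(y_A.shape)  # (100000, 3); repeating with B and C gives y_B and y_C
```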
Since we are interested in the simulated insurance portfolio losses with the dependence structure A, B, and C, the losses should be aggregated for analysis TA,i = (yA,i,1 + yA,i,2 + yA,i,3), i = 1, · · · , B TB,i = (yB,i,1 + yB,i,2 + yB,i,3), i = 1, · · · , B (2.2) TC,i = (yC,i,1 + yC,i,2 + yC,i,3), i = 1, · · · , B, 15 The variance and the 95th percentiles are plotted with increasing sample sizes B and shown in Figure 2.1. Figure 2.1: Influence of spurious correlations In Figure 2.1, it is obvious that spurious correlations may affect the estimations of portfolio risk. Positive spurious correlations may overestimate the portfolio variance and 95th percentile, while negative spurious correlations resulted in underestimation. This indicates the usefulness of inducing sparsity into the resulting correlation matrix that may reduce the spurious correlations. 16 0e+002e+054e+056e+058e+051e+068500095000105000VarianceBVariancePositive spurious correlationsWithout spurious correlationsNegative spurious correlations0e+002e+054e+056e+058e+051e+06115011701190121095th PercentileB95th PercentilePositive spurious correlationsWithout spurious correlationsNegative spurious correlations 2.2 C and IC Analysis The goal of this simulation is to know how the sample size n and the dimension p impact the algorithm on identifying zero and non-zero correlation coefficients. In order to reach the goal, we plan to randomly generate a p-dimensional correlation matrix with some zero entries and simulate n observations from a copula using this matrix. Note that the distribution of correlation matrices in Section 1.3 is useful when generating a correlation matrix with a particular dependence structure, (cid:1) parameters in the lower triangular matrix and the inputs of the generation approach are the (cid:0)p (cid:1) parameters in Θ to π/2 and the rest 10% to π/4. It results in a Θ. Here, we set 90% of the (cid:0)p 2 2 correlation matrix with 90% of the off-diagonal entries being zero, since bjk = 0 for k < j when cos(θjk) = cos(π/2) = 0. Comparing the generated matrix, which is the true correlation matrix that produces the data, and the estimated correlation matrix from the algorithm, we classify the entries of the correlation matrix into the following cases: • CZ (Correctly identified zero): This occurs when the true coefficient is zero, and the algorithm correctly identifies it as zero. • ICZ (Incorrectly identified zero): This occurs when the true underlying coefficient is zero, but the algorithm thinks it is non-zero. • CNZ (Correctly identified non-zero): This occurs when the true coefficient is non-zero, and the algorithm thinks it is non-zero. • ICNZ (Incorrectly identified non-zero): This occurs when the true coefficient is non-zero, but the algorithm identifies it as zero. We duplicate the simulation 100 times, and the average number of CZ, ICZ, CNZ, and ICNZ cases with different sample size (n) and dimension (p) are listed in Table 2.1. Note that the Oracle is the idealized situation, the HDcop is the high-dimensional algorithm of our research, and the MLE is the maximum likelihood estimation approach. In the first panel of Table 2.1, the number of responses is fixed to p = 10, the algorithm has a high CZ value for all sample sizes, indicating that it was successful at identifying the zero 17 coefficients. The ICZ value is higher when the sample size is small, but it decreases quickly as the sample size increases. On the contrary, the CNZ value increases as the sample size increases. 
On the other hand, MLE seems to be unsuccessful at identifying the true zero coefficients, but it identifies the true non-zero well. In the second panel of Table 2.1, we fix the sample size to n = 400, and the number of correctly identified coefficient (CZ and CNZ) increases when dimension p increases. However, the MLE seems poor at identifying any true zero and non-zero coefficients. Oracle Sample size (n) 100 CZ 40 ICZ 0 CNZ 5 ICNZ 0 CZ 39.90 ICZ 3.49 CNZ 1.51 ICNZ 0.10 CZ 0.00 ICZ 0.00 5.00 CNZ ICNZ 40.00 HDcop MLE Oracle Dimension (p) CZ ICZ CNZ ICNZ CZ ICZ CNZ ICNZ CZ ICZ CNZ ICNZ MLE HDcop 3 2.68 0.00 0.32 0.00 2.61 0.00 0.32 0.07 0.00 0.00 0.00 2.68 Table 2.1: C and IC analysis result 200 40 0 5 0 39.54 0.00 5.00 0.46 0.00 0.00 5.00 40.00 4 5.26 0.00 0.74 0.00 5.25 0.00 0.59 0.16 0.00 0.00 0.00 5.41 300 40 0 5 0 39.48 0.00 5.00 0.52 0.00 0.00 5.00 40.00 5 8.95 0.00 1.05 0.00 8.82 0.00 1.05 0.13 0.00 0.00 0.00 8.95 400 40 0 5 0 39.49 0.00 5.00 0.51 0.00 0.00 5.00 40.00 6 13.53 0.00 1.47 0.00 13.05 0.00 1.86 0.09 0.00 0.00 0.00 13.14 500 40 0 5 0 39.44 0.00 5.00 0.56 0.00 0.00 5.00 40.00 7 18.38 0.00 2.62 0.00 18.41 0.00 2.32 0.27 0.00 0.00 0.00 16.68 600 40 0 5 0 39.49 0.00 5.00 0.51 0.00 0.00 5.00 40.00 8 24.51 0.00 3.49 0.00 24.29 0.00 3.42 0.29 0.00 0.00 0.00 24.58 700 40 0 5 0 39.37 0.00 5.00 0.63 0.00 0.00 5.00 40.00 9 31.41 0.00 4.59 0.00 31.27 0.00 4.43 0.30 0.00 0.00 0.00 31.57 800 40 0 5 0 39.46 0.00 5.00 0.54 0.00 0.00 5.00 40.00 10 39.61 0.00 5.39 0.00 38.70 0.00 5.75 0.56 0.00 0.00 0.00 39.26 900 40 0 5 0 39.38 0.00 5.00 0.62 0.00 0.00 5.00 40.00 11 47.85 0.00 7.15 0.00 47.61 0.00 6.87 0.52 0.00 0.00 0.00 48.13 1,000 40 0 5 0 39.40 0.00 5.00 0.60 0.00 0.00 5.00 40.00 12 57.73 0.00 8.27 0.00 56.68 0.02 8.55 0.75 0.00 0.00 0.00 57.43 2.3 Model Performance The goal of this simulation is to test the performance and precision of the dependence model described in Section 1.2. In other words, we want to compare the underlying correlation matrix, which produced the data, and the resulting matrix from the estimating procedure. First, according to the generation approach in Section 1.3, we randomly generate a lower triangular matrix Θ with 18 (cid:0)5 2 (cid:1) angles in range (0,π) as input. For example, (θ21, θ31, θ41, θ51, θ32, θ42, θ52, θ43, θ53, θ54) = (1.570, 1.568, 0.964, 1.213, 1.561, 2.108, 1.743, 2.415, 1.690, 1.201) Then, by Equation 1.34, the lower triangular matrix B is as follow 0 1 0 0 0.0098 0.9999 0 0 0  1 0.0008            B = 0.0028 0.5702 −0.4204 −0.5275 0.4689 0.3502 −0.1605 −0.1097 0.3312 0.8543 0 0 0 0             So that, our underlying correlation matrix Σ will be  1 0.001 0.003 0.57 0.35             0.001 1 0.01 −0.42 −0.16 Σ = BBT = 0.003 0.01 1 −0.53 −0.11 0.57 −0.42 −0.53 1 0.48 0.35 −0.16 −0.11 0.48 1 After simulating 2000 observations from copula with dependent matrix Σ and applying the copula estimation routine shown in Section 1.2, we obtain the estimated dependence parameters and show them as the following correlation matrix A. 
            A = 1 0 0 0 1 0 0 0 1 0.5475 0.2977  −0.3848 −0.1583 −0.4822 −0.0705 0.5475 −0.3848 −0.4822 1 0.4540 0.2977 −0.1583 −0.0705 0.4540 1 19 (2.3) (2.4) (2.5) (2.6)                       Comparing the resulting matrix A with the underlying correlation matrix Σ, all of the estimates seem close to the true value, and the underlying coefficients that are close to zero, such as 0.001, 0.003, and 0.01 in Σ, end up being shrunk to exactly zero in the resulting matrix A. It indicates that the dependent model estimates the parameters well and successfully eliminates the irrelevant ones. 20 CHAPTER 3: RESULTS In this section, the method in Chapter 1 is applied to estimate the dependence of indemnity amounts among the selected commodities in the U.S. federal crop insurance data. Here, the selected commodities are the top 3 prevalent crops: corn, soybean and wheat. 3.1 Data Description The raw dataset is the U.S. crop insurance dataset on the Risk Management Agency website (RMA) 1which is publicly available and includes the liability amounts and indemnity amounts for each farm. The samples are composed of 368,720 farms, located in 2356 different counties in the United States. Each farm grows only one of the crop commodities and the commodity types are 188 in total. Crops grown in similar regions shared environmental conditions and may be exposed to common hazards, leading to the correlation in the indemnity amounts. The dependency among commodity types in the same region may be important for the insurance/reinsurance company interested in the Value at Risk (VaR) of specific insurance contracts covering any subset of the commodities. For the analysis in this thesis, the crop indemnity amounts and liability amounts for each farm are further aggregated by county level, resulting in 2356 observations. Each observation (county) has the total indemnity and liability for each commodity type. In the aggregated dataset, we are given the observed indemnity Zij and liability Lij amount. Table 3.1 shows the partial dataset of the aggregated liabilities amount and indemnities amount by the top 3 prevalent crops (corn, soybean, and wheat). According to the dataset, we can see that observed indemnity Zij must be zero when liability Lij is zero, and Zij may be zero or positive number when Lij is positive. 3.2 Marginal Models Suppose the random variable Yij represents the realization of the indemnity amount for the j-th commodity type in county i. As the assumption of Yij is gamma distributed, we select the positive observed indemnity amounts Zij as Yij (response) with their corresponding liability amounts Lij 1 https://www.rma.usda.gov 21 Table 3.1: Dataset of liabilities and indemnities County (i) 1 2 3 4 5 6 7 8 9 10 Liab. Liab. Liab. Indem. Indem. Indem. Corn(Li1) Corn(Zi1) Soy Bean(Li2) Soy Bean(Zi2) Wheat(Li3) Wheat(Zi3) 15482 9605 33660 11957 0 0 96086 0 451107 0 119156 1564468 118968 552515 97843 43851 556543 0 1563561 72413 109918 1432266 329826 188870 351199 320717 297396 59827 647271 85054 10957 297804 0 9069 51486 0 66829 0 153198 31204 60485 126290 37598 25364 0 0 239153 0 682013 5738 0 79946 3909 0 0 4261 1084 0 17755 7163 (exposure) from the aggregate dataset. Then, Yij > 0 for each i and j allow us to fully define the density and distribution of Yij for the copula model. 
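A minimal sketch of this kind of marginal fit is shown below, using the gamma GLM with log link that is specified in Equation (3.1) later in this section. The data frame, column names, and simulated values are hypothetical (the coefficients used to generate them are loosely based on the corn column of Table 3.2), and statsmodels is just one possible way to fit such a model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical county-level data for one commodity.
rng = np.random.default_rng(3)
liability = rng.uniform(1e4, 1e6, size=200)
mean_indemnity = np.exp(0.8 + 0.77 * np.log(liability))       # log link structure
indemnity = rng.gamma(shape=2.0, scale=mean_indemnity / 2.0)
df = pd.DataFrame({"indemnity": indemnity, "liability": liability})

# Keep the positive indemnities and use log liability as the exposure covariate.
pos = df[df["indemnity"] > 0].assign(log_liab=lambda d: np.log(d["liability"]))

# Gamma GLM with log link: log E(Y) = beta0 + beta1 * log(L).
fit = smf.glm("indemnity ~ log_liab", data=pos,
              family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print(fit.params)  # intercept and slope, analogous to the estimates in Table 3.2
```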
The gamma distribution is commonly used in actuarial science since it is only defined for positive values, and this is appropriate for modeling the variables that represent quantities in the insurance field that cannot be negative, such as claim sizes and loss amounts. Although heavy-tail distributions are recommended for individual loss severity modeling in the recent literature, we still use the gamma distribution instead of heavy-tailed loss distributions in this research for simplicity and practical purpose. Moreover, according to the central limit theorem, the skewness in the distribution of aggregated indemnity amount will reduce while the loss amounts for more and more farms are aggregated, thus gamma distribution may be a more suitable choice for the model. Since the indemnity amount Yij follows gamma distribution and the expected indemnity amount can be exposed using the corresponding liability amount Lij, the marginal model is defined by GLM with gamma distributed dependent variable. The right panel in Figure 3.1 shows that the liability amounts and the indemnity amounts have a positive relationship at the county level after log transformation. Therefore, we assume a log link, and the marginal model is defined as log (E(Yij)) = β0 + β1 log(Lij). (3.1) 22 Figure 3.1: Scatter plot of liabilities and indemnities at the county level And the estimated coefficients of the model for each commodity type are shown in Table 3.2. Table 3.2: Coefficient estimates for the marginal gamma model (Intercept) LogExposurePos Corn Soybean Estimate Std. Error Estimate Std. Error Estimate Std. Error 0.8038 0.7738 0.2531 0.0161 0.2860 0.0213 0.1038 0.8445 2.1346 0.6753 0.3750 0.0235 Wheat 3.3 Dependence Model Using the marginals and the dependence model described in Chapter 1, we obtain estimates for the dependence parameters of the selected commodities shown in Table 3.3. These estimates are obtained using the tuning parameter that minimizes the BIC score, as shown in Figure 3.2. Table 3.3: Correlation of selected commodities Corn Soybean Wheat Corn 1 0.4660 0.1821 Soybean Wheat 1 0.1923 1 In Figure 3.2, the correlation estimates are shown for different log lambda in the left penal. Although there are only 3 correlations, we can still observe that more coefficients are shrunk to zero when lambda increases. The BIC scores corresponding to different log lambda values are 23 0.0e+005.0e+081.0e+091.5e+090.0e+005.0e+081.0e+091.5e+09LiabilitiesIndemnities0510152005101520Log LiabilitiesLog Indemnities displayed in the upper right panel, and the vertical line indicates the value of log lambda that minimized the BIC score. Here, log(λ) = −9.21034 leads to the minimum BIC and will be set as the optimal tuning parameter to estimate coefficients. The smallest eigenvalue of the correlation matrix Σ with the corresponding log lambda is shown in the bottom right panel, where the panel indicates that the correlation matrix for the optimal tuning parameter is positive semi-definite since all of its eigenvalues are non-negative. Figure 3.2: Solution path and optimal tuning parameter for selected commodities In the 3-dimensional copula case, with only three dependent parameters, there is sufficient data to estimate the parameters accurately. Thus, the lasso penalty might not be as crucial, and the optimal tuning parameter may not shrink any coefficient to exactly zero, as depicted in Figure 3.2. 
However, in higher-dimensional copula cases, the model may heavily rely on regularization (L1 penalty) to deal with sparse data and prevent overfitting, leading to more coefficients being shrunk toward zero, ultimately resulting in a positive semi-definite correlation matrix for the optimal tuning parameter. 3.4 Value at Risk (VaR) In order to demonstrate the practical application of the estimated model, let’s consider a fictional insurance contract that covers a selection of commodities, including corn, soybean, and wheat. 24 Suppose the correlations among the commodity type are shown in Table 3.3, and the marginal models are defined as equation (3.1) with coefficients in Table 3.2. Furthermore, the log liability amounts for each commodity are 10. Using the marginal models and the given log liability amounts, the distribution of the indemnity amount for each commodity may be known. And further, the indemnity amount can be randomly sampled using the simulation from copula with correlations shown in Table 3.3 and inverse cdf transformation. In other words, random indemnity amounts can be simulated using the copula and marginal models based on the dependence structure for the selected commodities. The simulation is repeated B = 100,000 times, and the indemnity amounts are aggregated each run to calculate the Value at Risk for the insurance portfolio losses. To compare the Value at Risk for the insurance coverage under the dependence and independence model, this simulation process may be applied to an independence model using an identity matrix as the correlation matrix. Once the simulations are complete, the resulting aggregate indemnity amounts from the two models can be compared using histograms. Figure 3.3 displays the histograms for the dependence model and the independence model, and the vertical line indicates the 95th percentile for each case. It can be observed that the dependence model results in a more skewed distribution of aggregate indemnity amounts and a higher 95th percentile, indicating that the dependence among commodity types can impact the Value at Risk for insurance coverage. The 95th percentile is 42,255 for the dependent case, which corresponds to the Value at Risk at a 95% confidence level, indicating there is a 5% chance that the losses will exceed 42,255. In contrast, the 95th percentile is 37,683 under independent circumstances. To gain a more comprehensive understanding of the insurance portfolio’s risk, we calculate the Conditional Value at Risk (CVaR), also known as Tail Value at Risk. CVaR is the expected loss given that the loss exceeds the VaR level, it offers more information about the potential severity of extreme events. The CVaR at a 95% confidence level is 53,220 for the dependent case, and 46,086 for the independent case. For this fictional insurance contract, both the Conditional Value at Risk (CVaR) and Value at Risk (VaR) turn out to be larger under dependence. This observation 25 reminds us that when variables are dependent, we might be especially cautious about the potential for substantial losses. Figure 3.3: Simulated density of indemnities for a fictional insurance coverage 26 CHAPTER 4: CONCLUSION In this thesis, the algorithm estimates the normal copula parameters of the composite pairwise likelihood with the L1 penalty, which is used to induce sparsity into the resulting correlation matrix and avoid overfitting. 
Inducing sparsity allows the dependence parameters to be estimated even in high-dimensional problems, thus it is useful when the number of dependence parameters among the p commodity types is larger than the sample size n. In the estimation approach, the majorization-minimization is used to minimize the composite likelihood function iteratively since the function with penalty has complex constraints and is difficult to optimize directly. In practical application, the resulting dependence model can be used to calculate Value at Risk for an insurance contract covering any subset of commodities. The Value at Risk is a crucial measure for insurance companies to facilitate risk assessment and ensure sufficient capital reserves for potential losses. With the estimated dependence model, the distribution of aggregated indemnity amount for the subset of commodities can be simulated using marginal and copula models, and Value at Risk of indemnity amount covered by contract can be further computed. According to result in Section 3.4, insurance contract with positive dependencies among commodities may have higher Value at Risk than one with independence model. The thesis focuses on using 3 dimensions as an example to illustrate the dependence model and the extension to the high-dimensional model is shown in the Appendix. With the three-dimensional setting, the complex theoretical functions and core concepts may be simplified and easily understood. There are some limitations that may be addressed in future work. One is the approach does not use the true distribution of correlation matrices in Section 1.3, thus it is possible to result in an invalid correlation matrix with non-positive semi-definiteness. Another is the approach is restricted in normal copula cases. For future research, apply the alternative parameterization using Θ to the algorithm. Moreover, the approach explained in this thesis may be extended to other copula families. Last, we only set liability amounts as exposure for modeling the expected indemnity amount in the thesis, however, if there are more explanatory variables available, including them as exposures may help improve the marginal model. 27 BIBLIOGRAPHY Nelsen, R.B. An Introduction to Copulas; Vol. 139, Springer Science & Business Media, 1999. Hofert, M.; Kojadinovic, I.; M¨achler, M.; Yan, J. Elements of Copula Modeling with R; Springer, 2018. Czado, C. Analyzing Dependent Data with Vine Copulas: A Practical Guide With R; Springer, 2019. Frees, E.W.; Valdez, E.A. Understanding relationships using copulas. North American Actuarial Journal 1998, 2, 1–25. Nikoloulopoulos, A.K.; Karlis, D. Regression in a copula model for bivariate count data. Journal of Applied Statistics 2010, 37, 1555–1568. Shi, P.; Emiliano, V. Longitudinal modeling of insurance claim counts using jitters. Scandinavian Actuarial Journal 2014, 2014, 159–179. Tibshirani, R. Regression shrinkage and selection via the lasso. Statistics and Computing 1996, 58, 267–288. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B (Methodological) 2006, 68, 49–67. Yang, Y.; Zou, H. A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and Computing 2015, 25, 1129–1141. Qian, W.; Yang, Y.; Zou, H. Tweedie’s compound poisson model with grouped elastic net. Journal of Computational and Graphical Statistics 2016, 25, 606–625. Frees, E.W.; Gao, J.; Rosenberg, M.A. 
Predicting the frequency and amount of health care expenditures. North American Actuarial Journal 2011, 15, 377–392. Frees, E.W. Frequency and severity models. In Predictive Modeling Applications in Actuarial Science; Frees, E.W.; Meyers, G.; Derrig, R.A., Eds.; Cambridge University Press: Cambridge, 2014. Zhao, Y.; Joe, H. Composite likelihood estimation in multivariate data analysis. The Canadian Journal of Statistics 2005, 33, 335–356. Song, P.X.K.; Fan, Y.; Kalbfleisch, J.D. Maximization by parts in likelihood inference. Journal of the American Statistical Association 2005, 100, 1145–1158. Joe, H. Dependence Modeling with Copulas; CRC Press, 2014. Pourahmadi, M.; Wang, X. Distribution of random correlation matrices: Hyperspherical parameterization of the Cholesky factor. Statistics and Probability Letters 2015, 106, 5–12. Frees, E.W.; Lee, G.Y.; Yang, L. Multivariate frequency-severity regression models in insurance. Risks 2016, 2016. 28 Assumptions of General Case (p dimensions) APPENDIX Consider the general problem, where we are given observations of a continuous multivariate response Zij, and liability (exposure) amounts Lij. The observations are given in two matrices: z11 z12 · · · z21 ... z22 ... · · · . . . z1p z2p ...                   and          L11 L12 · · · L1p L21 L22 ... ... · · · L2p ... . . . zn1 zn2 · · · znp Ln1 Ln2 · · · Lnp          . (4.1) In practice, we may observe a matrix of explanatory variables in addition to the liabilities. Yet, for simplicity of the illustration, we assume in this paper that there are no additional explanatory variables observed. The principle may be easily extended to the case with more covariates (explanatory variables). We assume that Zij holds the value of zero when Lij is zero. Yet, this is not the only case when Zij is zero. Typically, the value Zij may be either zero or a positive number with some probability, say P r(Rij = 1|Lij > 0) = E[Rij|Lij > 0], (4.2) where Rij may be thought of as an indicator of the response being positive, given a postive liability. This is a typical situation arising in insurance claims modeling, where the data are unbalanced due to the response being undefined when the liability amount Lij is zero, and given a positive Lij the response is observed. In practice, the frequency of insurance claims may also be observed, resulting in the modeling of average severities, as opposed to the total severity. An exampe of where the frequencies are observed is the article by Frees et al. (2016). The method proposed in this paper can be extended to this situation as well. Yet, for now let’s assume we are interested in modeling the total severity Zij directly. We hope to understand the dependence structure of the observations of the Zij response. For this, we decompose the response, so that Zij = I(Lij > 0) · Rij · Yij. (4.3) 29 Here, we assume that the indemnity amount Yij satisfies Yij > 0 for each i and j. This allows the density of Yij to be well defined. We imagine that there is a binary random variable Rij for each indemnity amount Yij, that holds the value 1 when the observed indemnity amount is positive. We also assume that Rij and Yij are independent. We may define the observed indemnity amount as the product of Yij and Rij, so that Rij · Yij are the observed indemnity amounts. Notice that the following relationship holds: Zij =   Yij if Rij = 1 and Lij > 0  0 otherwise. (4.4) The liability amount Lij may also be zero or a positive number. 
For this reason, the observed response Zij is positive only if the liability is positive, and zero (in fact undefined) otherwise. We consider the general situation, where the marginal models are either given, or may be estimated from data separately using the Inference for Margins (IFM) approach, described in books and articles on copula modeling, such as Joe (2014). Thus, we assume that we know the parametric form and parameters of the distributions Fi1(yi1), Fi2(yi2), · · · , Fip(yip), as well as the densities fi1(yi1), fi2(yi2), · · · , fip(yip). In other words, we are given marginal models dependent on the liability amounts Lij so that Fij(y) = F (y; Lij), i = 1, · · · , n, j = 1, · · · , p. (4.5) Using these, we hope to find a joint distribution among p variables, f (yi1, yi2, · · · , yip). This task may be accomplished using a copula. Here, we make an assumption that there exists a multivariate normal copula C of p-dimension, linking the marginal distributions, so that C(u1, · · · , up) = Φp(Φ−1(u1), · · · , Φ−1(up); Σ), (4.6) where Φp is the cumulative distribution function for a p-dimensional multivariate normal distribution 30 with zero means and correlation matrix Σ, and Φ−1 are the inverse cumulative distribution functions for standard normal distributions. The density for the copula is c(u1, · · · , up; Σ) = ∂p ∂u1 · · · ∂up C(u1, · · · , up; Σ). (4.7) The joint density is then given by f (yi1, yi2, · · · , yip) = c (Fi1(yi1), Fi2(yi2), · · · , Fip(yip); Σ) · fi1(yi1) · fi2(yi2) · · · fip(yip). (4.8) Note that by separating Yij from Zij by factoring out Rij and Lij, we may assume for every observation i, there exists such a density. Because we are assuming the marginal models are given, we may assume the input for the copula estimation routine is a matrix of uniformly distributed values uij, and indicators r∗ ij = rij · I(Lij > 0) where and uij =   Fi(zij) Rij = 1, and Lij > 0  0 Rij = 0, or Lij = 0, r∗ ij =    1 Rij = 1, and Lij > 0 0 Rij = 0, or Lij = 0. (4.9) (4.10) Shrinkage Estimation of Copula Parameters In order to estimate the dependence parameters in the matrix Σ, we use the composite likelihood approach. Specifically, we minimize the composite negative log-likelihood function ℓ(Σ) = − 1 mi − 1 n (cid:88) (cid:88) i=1 j>k r∗ ij =1 r∗ ik=1 (cid:20) (cid:21) log c (Fij(yij), Fik(yik); ρjk) + log fj(yij) + log fk(yik) , (4.11) 31 where c is a bivariate normal copula, and the weight for observation i, is mi and is obtained by mi = p (cid:88) j=1 r∗ ij, (4.12) or in other words, the number of non-zero responses for observation i. Note that given the marginal distributions Fij for each i = 1, · · · , n, j = 1, · · · , p, the likelihood in Equation (4.11) is a function in the dependence parameters only. Note that Σ is symmetric. Σ =          1 ρ12 · · · ρ1p ρ21 ... 1 ... ρp1 ρp2 · · · ρ2p ... . . . · · · 1          . (4.13) If p is large, then the problem is high-dimensional, since the number of dependence parameters in the matrix Σ is exactly (cid:0)p (cid:1). In order to automatically reduce the number of non-zero dependence 2 parameters, we use a shrinkage approach. For this, let ρjk = tan−1(ψjk) π/2 , −∞ < ψjk < ∞, (4.14) which ensures that for any ψjk, the resulting dependence parameter satisfies −1 < ρjk < 1. We then attempt to minimize the penalized negative log likelihood ℓP (Ψ) = ℓ(Σ) + λ |ψjk|, (cid:88) j>k r∗ ij =1 r∗ ik=1 (4.15) 32 where  Ψ = 1 ψ12 · · · ψ1p         ψ21 ... 1 ... ψp1 ψp2 · · · ψ2p ... . . . 
· · · 1          . (4.16) The problem boils down to estimating the parameters in the Ψ matrix, some of which may end up being zero due to the L1 penalty on the likelihood. This may be done using coordinate descent. In this thesis, the tuning parameter λ is to be determined by minimizing the BIC score after solving the problem for a set of λ values. In order to estimate the dependence parameters, let Φijk = − 1 mi − 1 (cid:20) (cid:21) log c (Fij(yij), Fik(yik); ρjk) + log fij(yij) + log fik(yik) , (4.17) so that ℓ(Σ) = Φijk. n (cid:88) (cid:88) i=1 j>k r∗ ij =1 r∗ ik=1 (4.18) Here, we assume that c is the density of a bivariate normal copula. This assumption allows us to write the following: Φijk = −   log  (cid:113) 1 mi − 1   − 1 1 − ρ2 jk where (a2 i + b2 jk − 2aibiρ i )ρ2 2(1 − ρjk)2 + log fij(yij) + log fik(yik)  ,  (4.19) (4.20) √ ai = 2 erf−1(2uij − 1), uij = Fij(yij) √ 2 erf−1(2uik − 1), bi = uik = Fik(yik), 33 with erf(z) = 2 √ π (cid:90) z 0 exp(−t2)dt. (4.21) Furthermore, remember that we have parametrized the dependence parameters so that we have ρjk = 2 tan−1(ψjk)/π. In order to obtain the gradient and hessian of the likelihood in terms of the parameters ψjk, consider ∂Φijk ∂ψjk = ∂Φijk ∂ρjk ∂ρjk ∂ψjk . The first and second terms in the product may be obtained by differentiation. ∂Φijk ∂ρjk = − 1 mi − 1 (cid:34) ρjk 1 − ρ2 jk + aibiρ2 jk − (a2 i + b2 i )ρjk + aibi (1 − ρ2 jk)2 ∂ρjk ∂ψjk = 2 π(1 + ψ2 jk) . Furthermore, we have ∂2Φijk ∂ψ2 jk = ∂2Φijk ∂ρ2 jk (cid:18) ∂ρjk ∂ψjk (cid:19)2 + ∂Φijk ∂ρjk (cid:32) (cid:33) , ∂2ρjk ∂ψ2 jk (cid:35) (4.22) (4.23) (4.24) (4.25) where and (cid:34) ∂2Φijk ∂ρ2 jk = − 1 mi − 1 1 + ρ2 jk (1 − ρ2 4 (cid:8)aibiρ2 jk)2 + + {2aibiρjk − (a2 jk)2 i )} (1 − ρ2 i + b2 (1 − ρ2 jk)4 (cid:9) (1 − ρ2 jk − (a2 i + b2 i )ρjk + aibi jk)4 (1 − ρ2 jk)ρjk (cid:35) , (4.26) ∂2ρjk ∂ψ2 jk = − 4ψjk π(1 + ψ2 jk)2 . (4.27) 34 Denote the flattened vector of parameters by a (cid:0)p 2 (cid:1) dimensional vector ψ = (ψ21, ψ31, · · · , ψp1, ψ32, ψ42, · · · ψp2, · · · , ψp,p−1)T . (4.28) The Hessian matrix has dimension (cid:0)p (cid:1) × (cid:0)p (cid:1), and is diagonal. For simplicity, we denote the entries 2 2 of the Hessian matrix using j and k, the Hessian matrix Hψ has entries Hψ(j, k) = n (cid:88) i=1 ∂2Φijk ∂ψ2 jk , j > k, (4.29) where in the three dimensional example, there are only three possible values for the j and k indices: (2, 1), (3, 1), and (3, 2). Denote the gradient vector by Uψ(j, k) = n (cid:88) i=1 ∂Φijk ∂ψjk , j > k. (4.30) Remember that j and k are indices for the dependence parameter matrix, not the Hessian matrix itself. Also remember that the off diagonal elements of Hψ are zero. We take a second order Taylor series approximation to the unpenalized negative log-likelihood function in Equation (4.18) around a given point ˜ψ. ℓQ(ψ) = ℓ( ˜ψ) + ˜U T ψ (ψ − ˜ψ) + 1 2 (ψ − ˜ψ)T ˜Hψ(ψ − ˜ψ), (4.31) Then, we would like to minimize the following penalized objective function: PQ(ψ) = ℓQ(ψ) + λ |ψjk|. (cid:88) j>k Rij =1 Rik=1 (4.32) 35 For this, we use the majorization minimization principle. First, define UQ = ∇ℓQ(ψ) HQ = ∇2ℓQ(ψ), where the entries of UQ are given by UQ(j, k) = ˜Uψ(j, k) + ˜Hψ(j, k)(ψjk − ˜ψjk) (cid:33) ∂ ˜Φijk ∂ψjk ∂2 ˜Φijk ∂ψ2 jk (cid:32) m (cid:88) (cid:32) m (cid:88) (cid:33) = + i=1 i=1 (ψjk − ˜ψjk), and HQ(j, k) = ˜Hψ(j, k) = n (cid:88) i=1 ∂2 ˜Φijk ∂ψ2 jk , (4.33) (4.34) (4.35) (4.36) where the simplicity is due to the fact that Hψ is diagonal. Notice that because of the simplicity, HQ is also diagonal. 
Given ℓQ(ψ), consider updating the current estimates ˘ψ. The new parameters ˘ψ(new) are given by solving

$$\breve{\psi}^{(\mathrm{new})} = \arg\min_{\psi}\; \ell_Q(\breve{\psi}) + \tilde{U}_Q^{T}(\psi - \breve{\psi}) + (\psi - \breve{\psi})^{T}\tilde{H}_Q(\psi - \breve{\psi}) + \lambda \sum_{\substack{j>k \\ r^{*}_{ij}=1,\; r^{*}_{ik}=1}} |\psi_{jk}|. \tag{4.37}$$

The solution to this problem is given by

$$\breve{\psi}_{jk}^{(\mathrm{new})} = \frac{\left(\tilde{\gamma}_{\psi}\breve{\psi}_{jk} - \tilde{U}_Q(j,k)\right)\left(1 - \dfrac{\lambda}{\left|\tilde{\gamma}_{\psi}\breve{\psi}_{jk} - \tilde{U}_Q(j,k)\right|}\right)_{+}}{\tilde{\gamma}_{\psi}}, \tag{4.38}$$

where $\tilde{\gamma}_{\psi}$ is the largest eigenvalue of $\tilde{H}_Q$.
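As a compact illustration of the update in Equation (4.38), the sketch below applies the coordinate-wise soft-thresholding step to a flattened vector of parameters, given a gradient vector, the curvature bound γ̃ψ, and a tuning parameter; the numerical inputs are placeholders.

```python
import numpy as np

def mm_update(psi, U_Q, gamma, lam):
    """One inner-loop update of all psi_jk, Equation (4.38):
    soft-threshold gamma * psi - U_Q at level lam, then rescale by gamma."""
    z = gamma * psi - U_Q
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0) / gamma

# Placeholder current estimates (flattened as in Equation (4.28)), gradient,
# largest Hessian eigenvalue, and tuning parameter.
psi = np.array([0.30, -0.10, 0.05])
U_Q = np.array([-0.40, 0.02, 0.20])
gamma = 2.5
lam = 0.15

print(mm_update(psi, U_Q, gamma, lam))  # entries with |gamma*psi - U_Q| <= lam shrink to zero
```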