PREDICTIVE MODEL BUILDING FOR UTILIZING WORD EMBEDDING MODELS: APPLICATIONS IN INSURANCE DATA By Scott Manski A DISSERTATION Michigan State University in partial fulfillment of the requirements Submitted to for the degree of Statistics – Doctor of Philosophy 2020 ABSTRACT PREDICTIVE MODEL BUILDING FOR UTILIZING WORD EMBEDDING MODELS: APPLICATIONS IN INSURANCE DATA By Scott Manski Textual data contains a vast amount of information, yet for many researchers it has not been clear how the information could be used for an empirical analysis. Often times textual data are ignored or discarded in statistical analyses because regression and other statistical methods require numeric covariates. This dissertation will demonstrate how cutting-edge text mining technologies can improve empirical analyses by transforming textual data into numeric explanatory variables, thus allowing textual data to be incorporated into a statistical analysis. By transforming the textual data, the number of explanatory variables often becomes larger than the number of observations. For this reason, we explore the application of generalized additive models in tandem with adaptive lasso. In addition, we construct an algorithm for fitting a Gamma double generalized linear model with a group lasso penalty. Through this, we show how useful information can be extracted from textual data. We show how our methods can be applied through several insurance claims examples. We believe that our work can be widely used for other observational researchers in economics, business, statistical science, and social science. Copyright by SCOTT MANSKI 2020 ACKNOWLEDGEMENTS I would like to express the utmost gratitude to Dr. Taps Maiti and Dr. Gee Lee, for their assistance, support, guidance, and encouragement throughout my time at MSU. I would also like to thank my entire graduate committee, including Dr. Hyokyoung Hong and Dr. Chenhui Guo. Their service is greatly appreciated. Furthermore, I would like to thank the Department of Statistics and Probability for the resources and opportunities provided to me during my Ph.D. career. iv TABLE OF CONTENTS . . . . . . 1.1 Using Text as Data LIST OF TABLES . . . . . LIST OF FIGURES . . . . . LIST OF ALGORITHMS . . . KEY TO ABBREVIATIONS . CHAPTER 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii . . x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Word Occurrence Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2.1 Word2vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2.2 Global Vectors for Word Representation (GloVe) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 4 4 6 7 1.1.3 Word similarities . 1.1.4 Data Size Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.2 Lasso and Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.2.1 Adaptive Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.2 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.3 Adaptive Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 xi . . . 
CHAPTER 2 CLAIMS CLASSIFICATION AND RISK MITIGATION USING SHORT 2.1 2.2 Data . 2.3 Applications . . . . . . . . . . . . . . . . . . . . . Introduction . TEXTUAL DESCRIPTIONS . . . . . . . . . . . . . . . . . . . . . . . . . 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.2 Textual Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1 Claims Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1.2 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 Risk Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.2.2 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.2.3 Model diagnostic . . . . . . . . . . . . . . . . . . . . . . . . . . 32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 . . 2.4 Concluding Remarks . Introduction . CHAPTER 3 A THREE STAGE APPROACH TO MODEL BUILDING . . . . . . . . . . 35 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2 Data and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 The Model and Basis Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4 The 3-Stage Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 . . . . . . . . . v 3.5 Algorithm . 3.4.1 3.4.2 3.4.3 3.4.4 Tuning Parameters . Stage 1 – Group lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 . 44 Stage 2 – Adaptive group lasso . . . . . . . . . . . . . . . . . . . . . . . . 45 Stage 3 – The smoothness penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Stage 1 – Group lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 . 46 Stage 2 – Adaptive group lasso . . . . . . . . . . . . . . . . . . . . . . . Stage 3 – The smoothness penalty . . . . . . . . . . . . . . . . . . . . . . 46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 . 3.6 Results . . 3.7 . . 3.8 Concluding Remarks Implications . . 3.5.1 3.5.2 3.5.3 . . . . . . . . . . . . . . . . . . CHAPTER 4 GAMMA DOUBLE GENERALIZED LINEAR MODEL WITH GROUP . . . . . . . . . 4.1 . . . . . . . . 4.2 Model 4.3 Convexity . LASSO . . . . . . Introduction . 4.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2.1 One Dimensional Unpenalized Problem . . . . . . . . . . . . . . . . . . . 52 4.2.2 One Dimensional Penalized Problem . . . . . . . . . . . . . . . . . . . . . 54 4.2.3 Gamma Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . . 55 4.2.4 Gamma Double Generalized Linear Model . . . . . . . . . . . . . . . . . 57 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
62 4.3.1 The Asymptotic Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 . 4.4.1 Example 1 . 4.4.2 Example 2 . 4.5 Empirical Analysis 4.5.1 Data . . . 4.5.2 Methods . 4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Concluding Remarks . . 4.4 Simulation . . . . . . . . . . . CHAPTER 5 CONCLUSION . APPENDICES . . . APPENDIX A APPENDIX B APPENDIX C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 . THREE STAGE APPROACH THEORETICAL RESULTS . . . . . 79 THREE STAGE APPROACH SIMULATION RESULTS . . . . . . 85 GAMMA DOUBLE GLM WITH LASSO ALGORITHM RESULTS 93 . . . . . . . . . BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 vi LIST OF TABLES Table 2.1: Summary statistics of losses by claim category . . . . . . . . . . . . . . . . . . 21 Table 2.2: Claim Categories for Training and Validation Datasets . . . . . . . . . . . . . . 23 Table 2.3: Words used for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Table 2.4: Confusion matrix for multinomial model . . . . . . . . . . . . . . . . . . . . . 25 Table 2.5: Summary statistics of explanatory variables (vandalism model) . . . . . . . . . . 29 Table 2.6: GAM model summary (vandalism model) . . . . . . . . . . . . . . . . . . . . . 33 Table 3.1: Summary statistics for the log(loss) for the training and validation datasets. . . . 38 Table 3.2: Summary statistics for the final model. The Residual DF comes from the estimated degrees of freedom from the GAM, and the MSPE is the out of sample mean squared prediction error. . . . . . . . . . . . . . . . . . . . . . . 47 Table 4.1: Proportion of randomly sampled pairs of points that violate the convexity definition for different sample sizes, each over 5 repetitions. The standard error of the 5 repetitions are given in parenthesis. . . . . . . . . . . . . . . . . . 65 Table 4.2: Average simulation results for 100 runs for the Lasso and Double Lasso models. The standard error is provided in the parentheses. . . . . . . . . . . . . . . . . . 67 Table 4.3: Average simulation results for 100 runs for the Grouped Lasso and Double Grouped Lasso models. The standard error is provided in the parentheses. . . . . 67 Table 4.4: Proportion of replications that the coefficient is nonzero in the Double Lasso . . . . . . . model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 4.5: Summary of loss amounts by event type . . . . . . . . . . . . . . . . . . . . . . 70 Table 4.6: Spearman correlation of cosine similarities and loss amounts . . . . . . . . . . . 72 vii LIST OF FIGURES Figure 1.1: Plot of () for different values of , with  = 100 . . . . . . . . . . . . . . Figure 1.2: Example of a two dimensional projection of the word embeddings . . . . . . . . 7 9 Figure 1.3: The number of unique words as a function of the number of documents in the NOAA dataset. . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 11 Figure 2.1: Two dimensional projection of the word embeddings for common words . . . . 18 Figure 2.2: Distribution of common words . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 2.3: Average accuracy, error rate, and precision in log scale . . . . . . . . . . . . . . 26 Figure 2.4: GAM model plots for vandalism . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 2.5: Words that relate with high vandalism losses . . . . . . . . . . . . . . . . . . . 32 Figure 3.1: Frequency for the most common words. . . . . . . . . . . . . . . . . . . . . . . 38 Figure 3.2: Cosine similarity against property loss for house and thunderstorm. . . . . . . 39 Figure 3.3: Function estimates for several covariates. . . . . . . . . . . . . . . . . . . . . 47 Figure 3.4: Words with the highest cosine similarity with house. . . . . . . . . . . . . . . 48 Figure 3.5: Predicted property loss amounts against the true property loss amounts for the validation sample. The Spearman correlation is 76.06%. . . . . . . . . . . . 49 Figure 4.1: 95% VaR for Double Lasso model . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 4.2: 95% VaR for Lasso model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 4.3: Frequency of words in descriptions of Vandalism events . . . . . . . . . . . . . 70 Figure 4.4: Cosine similarities of selected words with log loss amounts for Vandalism . . . 71 Figure 4.5: Effect of word indicator on loss amount . . . . . . . . . . . . . . . . . . . . . . 72 Figure 4.6: Solution path of the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 4.7:  curve estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 viii Figure 4.8:  curve estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Figure 4.9: Prediction versus out-of-sample claims (log scale) . . . . . . . . . . . . . . . . 75 Figure B.1: Mean squared prediction error for each method when  has 800 observations and 200 covariates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Figure B.2: Mean squared prediction error from the 3 step approach against other methods, at various dimensions of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Figure B.3: Misclassification error rate for step 2, step 3, and gamsel model, for each misclassification type. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Figure B.4: 3 step approach function estimates for representative covariates from the model corresponding to figure B.1. . . . . . . . . . . . . . . . . . . . . . . . . 89 Figure B.5: mgcv function estimates for representative covariates from the model corre- sponding to figure B.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Figure B.6: Runtime, in seconds, for each method when  has 800 observations and 100 covariates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Figure B.7: Runtime, in seconds, from the 3 step approach the mgcv model, at various dimensions of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 . . . . ix LIST OF ALGORITHMS Algorithm 4.1: Algorithm for solving the Gamma Double GLM . . . . . . . . . . . . . . . 
61 x KEY TO ABBREVIATIONS CBOW Continuous bag-of-words CNN Convolutional Neural Networks GAM Generalized Additive Model GLM Generalized Linear Model GloVe Global Vectors for Word Representation IBNR Incurred but not Reported LGPIF Local Government Property Insurance Fund NOAA National Oceanic and Atmospheric Administration RNN Recurrent Neural Networks xi CHAPTER 1 INTRODUCTION 1.1 Using Text as Data With the advance in technology, data are being collected on a much larger scale. While traditional data types such as numerical and categorical data are still being collected, there has been an increase in the collection of digital text data. In many cases, text data can provide more information than a series of traditional variables, but extracting this vital information is not trivial. Text mining has been utilized in a variety of contexts, including but not limited to spam filtering, sentiment analysis, customer churn, stock returns, and politics. Sentiment analysis is the process of categorizing texts based on the message of the text, usually positive or negative. Pang et al. (2002) explored the use of Naive Bayes, maximum entropy classification, and support vector machines to predict the overall opinion, positive or negative, of a movie review. Similar work includes using discriminant analysis to categorize texts into pre-determined genres by Karlgren & Cutting (1994), and finding features indicated by the use of subjective language by Hatzivassiloglou & Wiebe (2000). Text mining has also been used substantially in the prediction of stock prices. Antweiler & Frank (2004) analyzed internet stock message boards and found that the messages help to predict market volatility. Moreover, Tetlock (2007) found that high media pessimism predicts downward pressure on the market, and unusually high or low pessimism predicts high trading volume. Each of the aforementioned examples are primarily concerned with prediction as opposed to inference of relationships between the response and the explanatory variables. Some authors have found text analysis to be helpful in telling a story using the relationship between the response and the explanatory variables. Stephens-Davidowitz (2014) explored racially charged language in Google search queries to explain how much racial animus costs a black presidential candidate. Gentzkow et al. (2016) studied congressional speeches and found that partisanship is far greater in recent years, with a sharp increase in the early 1990s. 1 In almost all situations where text data are utilized, the analysis can be summarized in the following way: 1. Represent the raw text D as some numeric vector or array  2. Map  to predicted values ˆ of unknown response  3. Use ˆ in subsequent analysis In the following sections, we will discuss a variety of methods used to construct . We start by exploring the use of indicator variables and count data to represent text in Section 1.1.1. In Section 1.1.2, we introduce word embeddings along with several methods for obtaining word embedding matrices. Finally, in Section 1.1.3 we discuss how we will be using cosine similarities to measure the similarity between words. 1.1.1 Word Occurrence Methods The most common and simplest way of constructing  is by using indicator variables for the occurrence of words. We will define D as the collection of documents typically in the form of a column vector. Furthermore, we will denote the set of words in the th document as , for  = 1, ..., |D|.. Finally, let  be the set of all unique words found in D. 
Then we can define the entries of  as   = for  = 1, ..., |D|,  = 1, .., ||. Consider the following example where you have textual descriptions for 3 observations, 0 otherwise 1 if   ∈   2 Lightning struck a building D = A lightning strike hit a building Lightning struck a tree  The response in our example could be the amount of damage, in dollars, caused by the event. By our definition, 1 = {lighning, struck, a, building}, for example. Furthermore, Then  = {lighning, struck, a, building, strike, hit, tree}.   = Ò  1 1 1 1 0 1  1 1 1   1 1 0 0 1 0 Ò 0 1 0  0 0 1  An extension of the indicator method is to count the number of occurrences of word   in each textual description. Using the count method, since the word a is the only word that occurs more than once in a description, we have  =  Ò  1 1 1 1 0 1  1 2 1   1 1 0 0 1 0 Ò 0 1 0  0 0 1  If we conduct a lexical analysis of the first and second textual descriptions, we see Lightning noun struck verb a building noun A lightning adj. strike noun hit verb a building. noun While the sentence structure of the first and second textual descriptions are different, it is clear that their meanings are essentially equivalent. Since the sentence structure is different,  and  are different and they do not appropriately capture this similarity in meaning. Additionally, the only difference between  and  in the example is the third column, the column associated with the word a. While the use of the word a is necessary in creating grammatically sound sentences in our example, it does not add any additional information in regards to our response. Common words 3 that typically add little information to the meaning of a sentence are called stop words. Examples of stop words include a, the, and, at, and for. In practice, stop words are typically removed from  and  ( = 1, ..., ) to reduce the number of columns in  and to reduce the number of computations when constructing . 1.1.2 Word Embeddings So far, we have only represented text through indicator variables or counts. While this is easy to understand and easy to work with, it is a very basic way of summarizing text. Instead, word embedding representations map words into some vector space. The vector space consists of a word embedding for each word in the vocabulary. The word embeddings are determined by optimizing some objective function, such as a likelihood for word occurrences, on a library of texts. There are several algorithms for determining the word embeddings. We will be introducing two of these algorithms, word2vec and Global Vectors for Word Representation (GloVe). 1.1.2.1 Word2vec One approach to obtaining the word embeddings is to use the word2vec algorithm. The word2vec algorithm can be illustrated using a simplified example. Suppose we use the following seven words as our simplified vocabulary list:  = {lighning, struck, a, building, strike, hit, tree} = {1, 2, . . . , 7} Consider the sentence: lightning struck a building with center word 2, and context words  = { ˜1, ˜3}. What we want is some vector representation  of center words and ˜ of context words. These word embedding matrices are obtained by letting an algorithm read through billions of sentences, maximizing a log likelihood, treating  and ˜ as parameters to be estimated. 
For this, we can specify the probability of observing a context word 4 ˜  given a center word 2 by (cid:0) ˜ |2(cid:1) = exp(cid:0) ˜  · 2(cid:1) || =1 exp ( ˜ · 2) Going back to the sentence, lightning struck a building, using a naive Bayes assumption (where we assume the conditional independence of the events of observing context words given a center word), the negative log likelihood becomes:  = − log  ( ˜1|2) − log  ( ˜3|2) = −  ˜  · 2 + || log  ˜ ∈ exp ( ˜ · 2) ˜∈ The negative log likelihood is minimized by gradient descent. Note that analytical formulas for the gradient can be obtained based on the likelihood. In practice using the analytical forms of the gradient speeds up the convergence, and makes the algorithm more stable. Extensions of the gradient descent algorithm such as Adam optimizers may be used as well. The  and ˜ matrices start from a random initial matrix, and is updated each step of the gradient descent iteration. Repeating this process for millions and billions of center words and context words, an input word matrix  and a context word matrix ˜ is learned. Word2vec may use either the continuous bag-of-words (CBOW) model or the continuous skip- gram model, depending on how the log likelihood is defined. In the CBOW model, the log likelihood represents the probability of observing the center word within a window of context words. In the continuous skip-gram model, the probability of observing the context word given a center word is used. Hence, the example shown above would correspond to a continuous skip-gram model. Nowadays, word embedding matrices trained by word2vec can be downloaded from Google’s website (https://code.google.com/archive/p/word2vec/). An extension of word2vec, which uses character n-grams, is fastText, which can be downloaded from (https://fasttext. cc). FastText is a library for learning word embeddings, created by Facebook’s AI Research (FAIR) lab. 5 1.1.2.2 Global Vectors for Word Representation (GloVe) In this paper, we use the pre-trained word embeddings obtained via an algorithm called GloVe, developed by Pennington et al. (2014). We note that other methods to create word embedding matrices exist. From our perspective, the end result of word2vec and GloVe is similar. We chose GloVe because it is a straightforward algorithm based on word counts, and the approach is well documented with an emphasis on reproducibility. In practice, the GloVe algorithm has additional benefits over word2vec in that the algorithm is more easily parallelized. GloVe word embedding matrices can be downloaded from https://nlp.stanford.edu/projects/glove/. From this website, the 300 dimension word vectors containing 400 thousand vocabularies, trained over 6 billion tokens appearing in the Wikipedia corpus, has been downloaded. The algorithm used for this particular word embedding is to minimize a cost function, which has the form || || (cid:0),(cid:1)(cid:0) · ˜ +  + ˜ − log(, + 1)(cid:1)2  = =1 =1 where || is the size of the vocabulary, , ˜ are bias terms, , are entries of the co-occurrence matrix for all the words found in the corpus over which the algorithm is being applied,  and ˜ are the word embeddings corresponding to the position in the co-occurrence matrix ,, and  is a weighting function:  () = (/) if  <  1 otherwise with  = 100 and  = 3/4. The motivation for  = 3/4 is empirical, and with this choice of the parameter, the performance of the model happens to improve when compared to  = 1. In practice, we may have any 0 <  < 1. 
The reader may understand this is a way to give more weight to rare word combinations found in the corpus. Intuitively, the co-occurrence matrix , records how often two words occur together in the training corpus. The algorithm attempts to find a word embedding for each word that gives a dot product that is close to the transformed co-occurrence matrix entry of the word combination. The resulting word embedding gives a large dot product for those combinations of words that have a high co-occurrence 6 Figure 1.1: Plot of () for different values of , with  = 100 matrix entry, and a low dot product for those combinations of words that correspond to zero matrix entries. Note that most of the entries of the matrix would be zeros. For additional details on the GloVe algorithm, we refer to the original paper by Pennington et al. (2014). Word embedding matrices obtained this way have numerous applications. In the natural language processing literature, word embeddings are used to construct neural networks that can predict missing words in sentences, or translate sentences in one language to another. Neural network based text models are powerful in terms of their prediction. Meanwhile, we are interested in models that allow us to interpret the relationship between explanatory variables and response variables. One way this can be achieved is to consider projecting the vector representation of words onto a set of axes, which we are able to understand. Given a vector representation of phrases, instead of using the vectors directly inside a prediction algorithm, one may consider projecting the vectors onto axes defined by a set of keywords. 1.1.3 Word similarities Once the word vectors are obtained using approaches outlined in Section 1.1.2, explanatory variables can be formed using cosine similarity of words. The cosine similarity between two words with 7 0501001502000.00.20.40.60.81.0x= 1/4xY(x)0501001502000.00.20.40.60.81.0x= 3/4xY(x)0501001502000.00.20.40.60.81.0x= 1xY(x) nonzero vector representations  and  is given by simcos(, ) =  ·  ||||2 · ||||2 (1.1) One may think geometrically, and interpret the cosine between two unit vectors as the dot product of the two vectors. The dot product ranges between -1 and 1, with 1 indicating identical vectors, and -1 indicating two vectors that point in the opposite direction. For example, the cosine similarity between the words building and buildings is 0.7946, while the cosine similarity between building and hello is −0.02810. Figure 1.2 shows a projection of the word vectors in a two-dimensional space for an example set of words. In the figure, notice that library and museum appear close to each other, since they have similar functions, and hence may appear in similar contexts. Also, graffiti, vandalism, theft, stolen all appear at a similar location on the plot. Imagine drawing an arrow from the point (0, 0) to the word, and the reader may see that the vector corresponding to each of these words are very similar to one another. The angle between the words is small, and hence the cosine of the angle between the words would be large (close to 1). Another way to say this is that the dot product between the words is large. Now consider the word hail and its corresponding vector, and compare it with the vector corresponding to graffiti. The two words are somewhat unrelated, and hence the angle between these two words is large. 
Another way to say this is that the cosine between the unit vectors corresponding to the two vectors is negative, or in other words the dot product is negative. Now consider two documents 1 = {1, 2, ..., } and 2 = {1, 2, ..., }. Let  ( = 1, ..., ) be the vector representation for word , and let   (  = 1, ..., ) be the vector representation for word  . Goldberg (2017) suggests using the similarity metric sim∗cos(1, 2) = simcos (, ) .   =1 =1 Essentially this metric assumes the vectors within a document can be added to form a vector representing the entire document. We tried using this metric, and discovered that the results could 8 Figure 1.2: Example of a two dimensional projection of the word embeddings be improved using a different metric. Hence, in this paper, we use the similarity metric (cid:18) (cid:19) . simcos(1, 2) = max =1,..., max =1,..., (simcos(, )) Thus the cosine similarity between two documents corresponds to the maximum cosine similarity between any two words found within the documents. In particular, the similarity between a single word  and  = {1, 2, ..., } is given by simcos(, ) = max =1,..., (simcos(, )) . (1.2) Defining the features this way is equivalent to max-pooling the features of a one-dimensional convolutional neural network. Max-pooling tends to work better than alternatives such as average pooling, as explained in page 120 of Chollet & Allaire (2018). When used with a single word , the similarity will be 1 if the particular word appears in the claim description. Hence, in some sense, the similarity can be thought of as a detector of whether a particular word or concept appears in the claim description. This provides more interpretability, as it is a generalization of indicators for word appearance. Continuing the example described in Section 1.1.1, we can calculate  using equation 1.2, the formula for cosine similarities, 9 −100−50050100−150−100−50050100150V1V2vandalismgraffititheftstolenairporttrafficmuseumzoobuildingsbldgfenceshingleshailstormlightningwaterfloodlibraryglasswindowtruck ˜ =  Ò 1.000 1.000 1.000  1.000 0.684 1.000  1.000 1.000 1.000  1.000 1.000 0.356  0.507 1.000 0.507 Ò 0.684 1.000 0.684  0.230 0.230 1.000  Notice that the cosine similarities for the words lightning and a are 1 for all three cases because the words appear in each of the three documents. In addition, recall that 1 = {lightning, struck, a, building}. The cosine similarity between the word tree and 1 is 0.230. The word within 1 that achieves this maximum cosine similarity is the word a. This further emphasizes the importance of removing stop words from the textual descriptions. By removing stop words from each description and from , we construct  as   = Ò 1.000 1.000 1.000  1.000 0.684 1.000  1.000 1.000 0.238  0.507 1.000 0.507 Ò 0.684 1.000 0.684  0.228 0.228 1.000  When using cosine similarities, we are often more interested in the similarities closer to 1, while smaller similarities provide less information about the word of interest. For this reason, we can calculate the components of  as  () = simcos(,  ) (cid:0)simcos(,  ) ≥ (cid:1) for  = 1, ..., ,  = 1, ..., || (1.3) where 0 <  ≤ 1. Continuing our example, (0.3) is (0.3) = Ò 1.000 1.000 1.000  1.000 0.684 1.000  1.000 1.000 0  0.507 1.000 0.507 Ò 0.684 1.000 0.684   0 0 1.000  Moreover, note that as  increases to 1, () becomes closer to . Specifically, (1) = . 
10 1.1.4 Data Size Considerations The accuracy and efficiency of any statistical method applied to  is heavily influenced by the choice of . Recall that the number of words in  is the number of columns that will be in . As we did with the example in Section 1.1.1, we could let  be the set of all words that appear in D. Using the National Oceanic and Atmospheric Administration (NOAA) dataset, Figure 1.3 illustrates how the number of unique words is a function of the number of documents, or observations. It is clear that the number of unique words will plateau at some number of documents, which follows our intuition. Eventually the addition of another document in the dataset will not contain any words that have not appeared in the previous documents. Figure 1.3: The number of unique words as a function of the number of documents in the NOAA dataset. It is clear that choosing  to be the set of all words that appear in D will be difficult in any statistical analysis due to the large number of covariates. For this reason, we explore a variety of other options for choosing . The first method for determining  is having a subject matter expert manually choose words they believe to be the most related to the variable of interest. This method has the advantage of allowing the expert to choose as many or as few important words as they see fit. However, several possible issues may arise. To start, the subject matter expert will need to be familiar enough with cosine similarities to make appropriate choices. In addition, the expert will also have their own 11 biases that may influence the analysis. In Chapter 2 we will show how this method can be effective in claims classification and risk mitigation. We could also consider using a data driven method for choosing . Since the majority of words will most likely not be strong predictors of the response variable, we can assume that the true model will be sparse. Therefore, we choose to use Lasso to select our set of words . In Section 1.2 we discuss the Lasso estimator along with several generalizations. In Chapters 3 and 4 we show how Lasso type penalties can be used to select . 1.2 Lasso and Generalizations In this section we describe the Lasso estimator along with generalizations of the Lasso method. Tibshirani (1996) originally proposed the lasso, a method combining the least squares loss with an ࡁ1-constraint. Suppose we have a response vector  of length , an  ×  covariate matrix , unknown intercept 0, and unknown parameter vector  of length . Then the lasso estimator is expressed || − 01 −  ||2 2 subject to || ||1 ≤ , (1.4) (cid:26) 1 2 arg min 0, (cid:27) where 1 is a vector of  ones. The tuning parameter  determines the total size of the coefficients, and therefore shrinks the coefficients to zero as  increases. Often, after standardizing the covariates, the lasso problem is rewritten in the Lagrangian form (cid:26) 1 2 arg min 0, (cid:27) || −  ||2 2 + || ||1 , (1.5) for some tuning parameter  ≥ 0. The tuning parameter  is directly related to the tuning parameter  from Equation 1.4. Now that we have introduced the lasso estimator, we will discuss several generalizations. While variable selection by lasso is consistent under some conditions, the lasso method was extended to improve the consistency of variable selection. 12 1.2.1 Adaptive Lasso Zou (2006) established a necessary condition for variable selection to be consistent. By doing so, they showed the existence of certain scenarios where the lasso estimator is inconsistent. 
For this reason, the adaptive lasso was proposed. Choosing  > 0, the adaptive lasso solves  1 arg min   =1  , 2 (cid:107)  −  (cid:107)2 2 +   | | (1.6) where   = 1/| ˜ |, and ˜ is some initial estimate. Given the initial estimates, it can be shown that equation 1.6 is convex in . When  < , the least squares estimates can be used as the initial estimates. When  ≥ , the least squares estimates are not defined, however Huang, Ma, and Zhang (2008) showed that, under some general conditions, the marginal least squares estimates can be used as initial estimates. Zhang, Jeng, and Liu (2008) showed that using the lasso estimates as the initial estimates in adaptive lasso improves the sparsity recovery rate. In addition, Furthermore, Zou (2006) showed that the adaptive lasso recovers the true model under more general conditions than the lasso when the initial estimates are √  consistent. 1.2.2 Group Lasso In many applications of regression, the covariates may have some natural group or block structure, and it may be sensible to have all the coefficients within the group be estimated as zero or nonzero together. The most common example of this is the use of categorical variables. In practice, a categorical variable is typically coded as a series of dummy variables, and we want to either include or exclude the entire set of dummy variables from the model. This scenario provides motivation for the group lasso. Consider a linear regression model with  groups of covariates, and each group contains   covariates for  = 1, ..., . That is, we have design matrix  = ( 1 2 · · · ) where  is an =1   matrix, and  = ( 1 2 · · · ) where   is a vector of length   for each  = 1, ..., .  × 13 (cid:112)  (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2  =1  , Yuan & Lin (2006) proposed the group lasso which solves arg min  2 (cid:107)  − 01 −  (cid:107)2 2 +  (1.7) for some tuning parameter  ≥ 0. In addition to categorical data, group lasso can be used in the additive model framework. If we apply some basis expansion to a numeric explanatory variable, then the expansion could now be thought of having a group-like structure. This idea will be explored in Chapters 3 and 4. 1.2.3 Adaptive Group Lasso Wang & Leng (2008) combined these extensions to formulate the adaptive group lasso, and showed the ability of the method to identify the true model consistently. For some tuning parameter  ≥ 0, the adaptive group lasso solves where   = 1/(cid:13)(cid:13) ˜  (cid:13)(cid:13) arg min  2 (cid:107)  − 01 −  (cid:107)2 2 +    (1.8) 2, and ˜ is some initial estimate. In addition, Wang & Leng (2008) proved estimation consistency and selection consistency for the adaptive group lasso estimator. We will illustrate how adaptive group lasso can be utilized in Chapter 3.  =1 (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2  ,  1  1 14 CHAPTER 2 CLAIMS CLASSIFICATION AND RISK MITIGATION USING SHORT TEXTUAL DESCRIPTIONS 2.1 Introduction When an insurance claim arises, one of the first things that is reported to a claims manager of an insurance company is a short textual description of the insurance claim, along with demographic information regarding the policyholder. Losses may be reported via a notice form, where one of the common forms used is the Association for Cooperative Operations Research and Development (ACORD) form. See page 7.16 of Kearney (2010). Upon this initial report, the claims department is responsible for a number of important tasks. 
The claims department would identify the policy and set adequate reserves, contact the insured, investigate the claim, document the claim, determine the precise cause of loss, liability, and the loss amount. In this process, the claims manager would coordinate with the actuarial department in order to predict the amount of insurance claim and set the adequate case reserve for each claim. In practice, parametric loss models are fit to data, and the resulting distribution is used for the prediction of the ultimate claim amount. In this process, regression analysis helps us understand the relationship among variables. The goal of a standard regression model is to understand the relationship between the response variable , and explanatory variables , and predict future responses under a set of assumptions. The relationship  {E[]} =  , where typically a distributional assumption is imposed on the error, allows us to interpret the relationship between  and  in a systematic way. In case  is a binary response, logistic regression can be used. For a general overview of regression, see Frees (2009). When forming the  matrix for empirical research using traditional approaches, useful information is discarded from the analysis, because traditional regression analysis requires numeric descriptor variables. Textual information has been one example. In this chapter, we demonstrate how we can implement the tools discussed in Section 1.1 so 15 that traditional analysis can be performed. The extracted information is used to build a regression model for the response variable of interest. Through this demonstration, we show how information can be extracted from textual data. We demonstrate two illustrative case studies in insurance claims classification, and insurance risk mitigation. We believe these approaches demonstrate how text processing methods can improve insurance analytics in the actuarial practice. In addition, understanding the factors that relate with large insurance losses may help us mitigate future insurance losses. Generalized additive models (GAM) extend linear models to contain smooth functions for each of the covariates, while retaining inference about the functions. Applications of GAMs have been discussed in Hastie & Tibshirani (1990). These applications include studying kyphosis in laminectomy patients, atmospheric ozone concentration, and the intensity of ischaemic heart disease risk factors, among others. Moreover, Hastie et al. (2009) describes an example of utilizing GAMs to classify emails as spam. While this example does analyze text, the method has several significant differences from that of our analysis. The spam example observes the number of occurrences of certain words, and fits a GAM with each word as a covariate. While the spam example in Hastie et al. (2009) also discusses the interpretability of the model, the primary goal is to predict the probability of an email being spam. In our analysis, we start by quantifying the similarity between a description and a series of selected words. The main interest in this analysis is the interpretation of the smoothing functions. This provides a much more complete explanation of the various factors that lead to high losses, and therefore, may provide for a more improved strategy of risk mitigation. A recent work, Wood (2017), provides a comprehensive overview of GAMs. There is a vast literature in insurance claims modeling, where parametric models are employed to understand insurance claims distributions. 
To the best of our knowledge, combining text mining approaches with loss modeling is a new approach, which hasn’t been attempted in the past. In particular, there seems to be no prior work utilizing text mining approaches to empirically under- stand and model insurance claims data. The rest of the Chapter proceeds in the following order: 16 In Section 2.2, the Wisconsin Local Government Property Insurance Fund (LGPIF) data are intro- duced. In Section 2.3, two applications of word similarity models are presented: insurance claims categorization, and risk mitigation. In Section 2.4, concluding remarks are provided. 2.2 Data In this section we provide some summary statistics for the dataset. For our case study, we utilize a unique dataset of claim descriptions and loss amounts from the Wisconsin Local Government Property Insurance Fund (LGPIF). The dataset is obtained from the Office of the Commissioner of Insurance (OCI) of Wisconsin. The Wisconsin LGPIF has been established to make property insurance available for local government units. The property fund essentially acts as an insurance company in the area, providing property coverage for thousands of government entities. In Table 2.1, it is interesting to observe that claim categories with high frequency tend to have low severity, while claim categories with low frequency tend to have high severity. In order to illustrate this effect, the table has been sorted in decreasing order of the number of observations . Modelers have studied insurance claim frequency and severity models, and empirically it has been discovered that claim frequencies and severities are often correlated; see Frees et al. (2016). 2.2.1 Data Generation The number of claims observations is 4991 in the training sample, and 1039 in the validation sample, which totals to 6030 observations. The data used for this chapter is already in tabular form corresponding to the data generating processes in Section 2.3, and we consider the data cleaning process, including the metadata analysis, as a black-box process that has already been performed by the provider of the data. We assume all claims are closed, and the claim amounts are fixed. Descriptions for the observed insurance claims are recorded in the dataset. These claim descriptions are human generated, and there are 2797 unique words found in the training sample and validation sample all together. Figure 2.1 shows a projection of the word vectors in a two- dimensional space, for common words found in the dataset. Word vectors are explained in Section 17 1.1.2. For now, imagine that there exists a framework, where every word corresponds to some two- dimensional vector, with related words having similar vector representations. A plot of common words found in the claim descriptions file may look like Figure 2.1. Figure 2.1: Two dimensional projection of the word embeddings for common words In Figure 2.1, notice that library and museum appear close to each other, since they have similar functions, and hence may appear in similar contexts. Also, graffiti, vandalism, theft, stolen all appear at a similar location on the plot. Imagine drawing an arrow from the point (0, 0) to the word, and the reader may see that the vector corresponding to each of these words are very similar to one another. The angle between the words is small, and hence the cosine of the angle between the words would be large (close to 1). Another way to say this is that the dot product between the words is large. 
Now consider the word hail and its corresponding vector, and compare it with the vector corresponding to graffiti. The two words are somewhat unrelated, and hence the angle between these two words is large. Another way to say this is that the cosine between the unit vectors corresponding to the two vectors is negative, or in other words the dot product is negative. 18 −100−50050100−150−100−50050100150V1V2vandalismgraffititheftstolenairporttrafficmuseumzoobuildingsbldgfenceshingleshailstormlightningwaterfloodlibraryglasswindowtruck 2.2.2 Textual Data Preprocessing Figure 2.2 shows the most common words in the dataset. Stop-words such as a, the, and, etc. have been removed. No other pre-processing of the words have been performed. Notice that the word damage is most frequent in the descriptions. The word vandalism is also frequent, as vandalism turns out to be one of the most frequent claim causes in the dataset. Abbreviations such as hs (high school), ms (middle school), es (elementary school), dmg (damage), bldg (building) also appear in the dataset. Note that both bldg and building are valid words. Imagine we search for the word building in a search engine. In this case, phrases with the word bldg and building should both appear in the search. We also note that building and buildings both appear in the vocabulary. The reason why both bldg and buildings appear when building is already in the vocabulary list, is because they are distinct words. How much these words relate to claim occurrence is determined by the cosine similarities of the words with selected key words. The advantage of our approach is that minimal data preprocessing is needed, so that such similar words can all be kept in the data. Note, the claim descriptions are short phrases, such as lightning damage to building, or vandal- ism damage at recycle center. A total of 4991 such claim descriptions are in the training dataset. Each claim description is associated with the loss amount corresponding to the description. Each claim is categorized into one of the following claim categories: vandalism, fire, lightning, wind, hail, vehicle, water (weather related), water (non-weather related), and miscellaneous losses. Table 2.1 shows a summary of the loss amounts for each of the nine claim categories. According to the table, vandalism has the lowest average loss amount, while the frequency of vandalism is the highest. Hail damages are the largest in terms of median and mean value, while the frequency is the lowest. The largest hail damage with a loss of 6.6 million is a hail damage to multiple buildings insured by the property fund. It is interesting to observe that the maximum loss amount happened within the weather related water damages category, which is a loss of 12.9 million. This largest loss corresponds to a water damage to a school. The second most frequent claim category is the vehicles category. This category of claim happens when a vehicle (car, plow, truck, etc.) runs into 19 Figure 2.2: Distribution of common words a government property building or structure, such as a light pole. 
20 0500100015002000damagevandalismdamagedlightningwaterglassparkfirehswinddoorlightestheftpolevehiclepowermsgraffitisignalsurgehitbuildingtrafficschoolequipmentlaptophydrantbrokendmggarageroofairportbldgcentershelterstreetfencehailstationwindowwestradiosystemtowerstolenstormstruckwashingtonhalldeptn Table 2.1: Summary statistics of losses by claim category Validation sample Training sample Peril Vandalism Vehicle Lightning Water (weather) Miscellaneous Wind Water (non-weather) Fire Hail min median mean 6,190 1 5,662 1 500 11,623 51,608 55 9,723 70 325 46,304 60,538 544 83,767 125 7,886 103,674 500 3,000 5,000 19,337 3,025 9,010 6,739 11,355 49,184 max 981,599 135,268 88,603 411,641 242,918 1,048,683 2,672,184 964,150 332,412 N min median mean 2,084 310 3,905 227 123 11,087 80,432 38 29,150 103 107 18,125 23,974 67 81,762 46 18 145,488 587 2,500 4,431 8,898 3,929 4,960 6,306 8,964 17,819 6 37 1 1 1 1 1 100 124 max 207,565 111,740 655,092 12,922,218 2,633,822 492,478 1,114,595 1,570,619 6,615,117 N 1774 852 832 426 362 296 202 171 76 21 To conclude the data description, we provide some ground-truth information regarding the quality of the data. The data has no missing values, and the categories for each claim are all observed. We inspected the claim categorizations and believe that the data quality is mostly dependable, although some errors may exist because the categorization has been performed by human experts. 2.3 Applications In this section, we demonstrate how the explanatory variables extracted from textual data can be used in specific models. We provide two examples: an example in claims classification, and an example in risk mitigation. In both applications, the Generalized Additive Models (GAM) framework is used as an underlying theme. We selected the GAM framework for its flexibility in capturing potential nonlinear effects of the cosine similarities. Word vectors are used in order to improve the classification results in all examples. These applications specifically use short textual descriptions of insurance claims, which may be found in initial reports of the claims to an insurance claims department. For long textual descriptions or claim adjuster notes, other methods such as recurrent neural networks (RNNs) using LSTM cells, or convolutional neural networks (CNNs) with multiple layers, may be preferred. For more details regarding neural network approaches to textual data analyses, see Goodfellow et al. (2016) and Goldberg (2017). Yet, the simplicity of our approach may make it attractive for actuaries working with simple textual descriptions of claims. 2.3.1 Claims Classification 2.3.1.1 Model In practice, an insurance claims manager would have to classify given claims based on their properties. This task may be supported with a claims classification engine, which would take the claim description as its input, and output the correct claim category. This motivates our first application of word embedding models, which is the classification of insurance claims into discrete categories. 22 The engine would be trained over a training dataset, and validated over a test dataset. Within the training dataset, we assume category  and description  are observed for each claim . Thus, the sample consists of observations {(1, 1), . . . (, )}. Note that  = (1, 2, . . . , ()), a description consisting of () words, where () is the number of words consisting the -th description. We assume that  = 0 + 1 + . . . +  , thus we imagine that nine samples each of size 0, 1, . . . 
,   are stacked to form the given sample of size . In this paper,  = 8. The claim categories for the training sample and test sample are shown in Table 2.2. Thus, the response variable  takes on the values 0, . . . , . Table 2.2: Claim Categories for Training and Validation Datasets Misc. ( = 0) 362 103 Vandalism ( = 1) 1774 310 Training Test Fire ( = 2) 171 46 Lightning ( = 3) 832 123 Wind ( = 4) 296 107 Hail ( = 5) 76 18 Vehicle Water(NW) Water(W) ( = 6) ( = 8) ( = 7) 852 227 202 67 426 38 Total 4991 1039 A generalized additive model (GAM) framework can be thought of as a generalized linear model with a linear predictor involving smooth functions of covariates. See Hastie & Tibshirani (1990) and Wood (2017). For the analysis, we construct design matrix using Equation 1.3 as described in Section 1.1.3. That is, the explanatory variables,  are defined by , = simcos( , ) · (simcos( , ) ≥ ),  = 1, . . . ,  where  = 0.2 is used, and  is the -th word used in the model, and  is the description of the -th claim. For the model, we use a multinomial specification. We made this choice (as opposed to other classification methods), based on the fact that the multinomial model provides a framework that is easily generalizable to a GAM model. In this case, the probability of observing a specific peril type  is given by  exp(cid:16)  1(cid:14)(cid:169)(cid:173)(cid:171)1 +  exp(cid:0) ,(cid:1)(cid:14)(cid:169)(cid:173)(cid:171)1 + (cid:48)=0 (cid:48)=0  (cid:48), (cid:17)(cid:170)(cid:174)(cid:172) exp(cid:16) f( ) = for base peril type  = 0, for peril type 1 ≤  ≤  (cid:17)(cid:170)(cid:174)(cid:172)  (cid:48), 23 with  , =   +  ,(,)  =1 where   is an intercept, and  ,( = 1, . . . ) may be smooth functions of the covariate, and  is the number of words used in the model. We denote the base peril type as miscellaneous claims, and call it  = 0. If we assume the functions are linear so that estimation time could be saved, then we have  ,(,) = ,  , ,  = 1, . . . ,  and hence f( ) =   1(cid:14)(cid:169)(cid:173)(cid:171)1 + exp(cid:16) (cid:48)=0   + (cid:48)     (cid:48) + (cid:48) exp(cid:16) (cid:17)(cid:14)(cid:169)(cid:173)(cid:171)1 +   (cid:48)(cid:17)(cid:170)(cid:174)(cid:172) exp(cid:16)  (cid:48)=0   (cid:48)(cid:17)(cid:170)(cid:174)(cid:172)  (cid:48) + (cid:48) for base peril type  = 0, for peril type 1 ≤  ≤  Here,   are  dimensional coefficients. Note that with this choice of  ,, the GAM model becomes the GLM model. This simplification is useful for reducing the computation time, when a large number of explanatory variables are included in the model. Table 2.3 shows  = 7 words defining the set  used in the model. The reader may imagine projecting each claim description  onto a space represented by  = 7 axes. In this paper, the feature words have been selected by a human expert with a good understanding of the dataset. In practice, the most frequent words found in the claim descriptions may be used as the key words. Abbreviated words are valid choices. In the claims classification problem, our goal is to construct an engine that gives the best possible classification result, and the focus is less on the interpretability of the coefficients resulting from the estimation. For applications such as that found in Section 2.3.2, the interpretability is more important, and hence the feature words should be selected more carefully by a human expert using the engine. 
Table 2.3: Words used for Classification vandalism fire lightning wind hail vehicle water 24 The reader may be curious why a censoring of the cosine similarities is needed. The reason is that cosine similarities smaller than the threshold is basically noise. One way to understand this phenomenon is to imagine a search engine returning results on the internet. The first few results are highly related to the search string that has been entered, but as one goes down the list, more and more irrelevant results may be observed. These junk results tend to add noise to the regression result, and hence we censor the cosine similarities with a threshold  = 0.2, where the choice of  is an empirical question. Figure 2.3 in Section 2.3.1.2 shows the classification accuracy as a function of the threshold . 2.3.1.2 Result Table 2.4: Confusion matrix for multinomial model Predicted Category Lightning Wind Hail Vehicle Water(NW) Water(W) Actual Category Misc. Vandalism Fire Lightning Wind Hail Vehicle Water(NW) Water(W) Total Misc. Vandalism Fire 1 24 17 0 18 0 0 3 2 4 0 0 31 4 0 2 0 5 86 25 33 267 2 1 4 0 5 4 1 317 1 0 3 114 3 0 0 0 0 121 0 2 0 0 88 0 0 0 4 94 0 0 0 1 2 17 0 0 0 20 39 23 20 3 1 1 182 5 0 274 0 0 2 0 0 0 2 4 1 9 5 1 1 1 3 0 3 52 27 93 Total 103 310 46 123 107 18 227 67 38 1039 The analysis of classification accuracy is often performed by the Receiver Operating Charac- teristic (ROC) curve in the binary classification problem. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various thresholds. The Area Under the Curve (AUC) is often used as a measure of how well the classification engine is performing. In the multiple class problem, numeric measures such as the average accuracy, error rate, and precision are better suited for analyzing the classification accuracy. These quantities are defined by Average Accuracy =    +      +    +    +   , Error Rate = 1  + 1    +       +    +    +   , 1  + 1   =0 =0 25 Precision = 1  + 1  =0       +    , where    is the number of true positives,   is the number of true negatives,    is the number of false positives,    is the number of false negatives; see Sokolova & Lapalme (2009). Figure 2.3 shows the average accuracy, error rate, and precision. With  = 0.2, the average accuracy is 93.62%, the error rate is 6.37%, and the precision is 63.9%. We tested if these numbers changed as the threshold  changed. Figure 2.3: Average accuracy, error rate, and precision in log scale Figure 2.3 shows that the average accuracy, error rate, and precision changes as  is altered by 0.005. The solid lines show degree 10 splines fit to the experiment data. Notice that when  = 1 (or in other words when log  = 0) the accuracy and precision reduces, and the error rate increases significantly. This is precisely the case when the cosine similarities boil down to indicator variables of whether the particular words are found in the claim descriptions (in other words, when there exists an embedding for at least one word in the description that is a constant multiple of the keyword). Figure 2.3 is evidence that the model matrix based on cosine similarities is outperforming that based on indicators. Our choice  = 0.2 is shown as a vertical dotted line at log(0.2) = −1.609. We emphasize once more that the choice of  is an empirical question, as the classification accuracy is not influenced much by the threshold. 
Moreover, since the graph is stable on the left-hand side, 26 −5−4−3−2−10−0.10−0.09−0.08−0.07−0.06Average AccuracyLog EpsilonLog Average Accuracy−5−4−3−2−10−2.9−2.8−2.7−2.6−2.5−2.4Error RateLog EpsilonLog Error Rate−5−4−3−2−10−0.7−0.6−0.5−0.4PrecisionLog EpsilonLog Precision  = 0 may also be a reasonable choice, if prediction alone is the concern. Yet, for interpretation of the coefficients, we have chosen a non-zero threshold. 2.3.2 Risk Mitigation 2.3.2.1 Model In order to understand the factors that relate with high losses, we assume a sampling frame in the following form: We assume that the loss amount, the category of the loss, and the description of the loss are observed. While  was a random variable in section 2.3.1, here we assume  =  is fixed. Also, we assume for each category,   losses are observed. Let  be the category of the loss, and let ∗ , be the underlying loss amount, and  , the description of the -th loss in the -th category. Thus, the dataset consists of distinct samples with observations of the form {(∗ 0,1, 0,1), . . . , (∗ 0,0 , 0,0)} ... ,1,  ,1), . . . , (∗ {(∗ ,  ,  , )} For a given 0 <  < 1, responses  , are formed by  , = (∗ , >  ())  () = inf{ : (∗  ≤ ) ≥ } where  = 0.5 is used to obtain the median for each category. For our analysis, the empirical quantile has been used for  (). In other words, for any given loss,  , is an indicator of whether the observed loss ∗ , is above the 50th percentile of losses in that category of loss. Any other quantile could have been used. For instance, the 95th percentile could have been used. Different quantiles would give different stories, because the definition of a large claim would be different for each case. The 50th percentile has been arbitrarily selected for demonstration. 27 We analyzed the losses for vandalism, fire, wind, vehicle, and the two water damage categories, omitting lightning, hail, and miscellaneous losses from the analysis because for the latter three it was difficult to identify keywords that correspond to large claims. Detailed results are shown for the vandalism peril type only, in order to limit the number of pages of the paper. The question is, whether we can understand the factors that are related to response values  , = 1, corresponding to high losses, through text analysis. Using the word similarity metric described in Section 1.1.3, we can create variables for risk measures of interest. Consider a specific description of a claim in a given category. Suppose a modeler is interested in creating a risk metric corresponding to a word  ,,  = 1, . . . ,   ( : number of risk metrics in category ). Suppose an insurance claim is described by the phrase , for the -th observation,  = 1, . . . , , where  is the number of claims found in the dataset. We create a variable by  ,, = simcos( , , ) · (simcos( , , ) ≥  ) for a threshold  . In this paper,   = 0.2 is chosen for each . Thus,  ,, is the cosine similarity between a search word  , and the description of the claim , with a censoring below a certain similarity level. For example, the modeler may be interested in the risk metric graffiti. The modeler may be interested in the relationship between this risk metric, and a response variable of interest, say the magnitude of loss. If the metric has a high correlation with large losses, then attention should be given to the particular risk metric in order to mitigate the risk inherent in this metric. In this case, the modeler may create a column vector for graffiti. 
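A short R sketch of this construction, with illustrative names only: dat is assumed to hold the loss amount, the loss category, and the tokenized description of each claim, and sim_doc is the document-similarity helper sketched earlier in Section 2.3.1.1.

library(dplyr)
dat <- dat %>%
  group_by(category) %>%
  mutate(y = as.integer(loss > quantile(loss, probs = 0.5))) %>%  # 1 if above the median
  ungroup()

s <- vapply(dat$description, function(d) sim_doc("graffiti", d, emb), numeric(1))
dat$graffiti <- s * (s >= 0.2)        # censored similarity to the risk metric "graffiti"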
For a particular response, say the likelihood of a high vandalism claim, a modeler may have a set of risk metrics of interest, say: graffiti, laptop, window, shelter, pool (In this case,   = 5). This gives a matrix of   explanatory variables, which can be used in standard regression models. Table 2.5 summarizes the explanatory variables used for the vandalism model. For category  (this section will focus on the vandalism category in particular), given a claim 28 Table 2.5: Summary statistics of explanatory variables (vandalism model) Min. Median Mean Max. 1.000 0.000 laptop 0.000 1.000 graffiti 1.000 window 0.000 1.000 0.000 shelter pool 0.000 1.000 0.000 0.520 0.246 0.221 0.260 0.130 0.424 0.348 0.187 0.209 , the GAM model for each peril type can be specified as (cid:8)E( ,)(cid:9) =   +  ,( ,,)   (cid:18)  , =1 (cid:19) ( ,) = log 1 −  , , where subject to the constraint such that and   is the coefficient for the intercept, and  , are smooth functions of the covariates  ,,, =1  ,( ,,) = 0. In other words, a logit link is used since  , is a binary variable taking on the value of 1 or 0. We have ,  ) ,  ) , ( ,) =  ,(cid:0)1 −  ,(cid:1) exp( (cid:48) 1 + exp( (cid:48)  , = Let the matrix   include transformed columns representing the spline bases for the  ,. For this study, a second order B-spline basis with three terms is used. In this case, the design matrix   can be constructed by first forming the matrix  1( ,1,) 1 1 1( ,2,) 1 1 2( ,1,) 1 2( ,2,) 1 3( ,1,) 3( ,2,)  , = ... ... 1( ,,) 1 1 2( ,,) 1 3( ,,) ,  = 1, . . .    () are the B-spline basis functions, presented in Wood (2017). Then the columns are where  transformed using QR factorization, in order to impose the identifiability constraint. This involves decomposing each vector  ,1 into the form ...  (cid:105) , 0  (cid:104)  ,1 =  ,,1  ,,2 29 and then taking  ,,2 for  = 1, . . . ,   to form the design matrix (cid:20)   = 1;  ,1 ,1,2;  ,2 ,2,2; . . .  ,   ,  ,2 satisfy This ensures that an intercept is included in the design, and also allows the basis functions  , to =1  ,( ,,) = 0. The idea of P-IRLS is that a weight matrix is adjusted each time the algorithm iterates until convergence. The algorithm follows the following steps. 1. Given the current    [Ò]  , calculate the diagonal matrix  (cid:21) . and where  [Ò]  [Ò]  =    −  is a diagonal matrix satisfying [Ò] [Ò]   [Ò]  (cid:17), and [Ò] , = −1(cid:16) (cid:17). [Ò]     [Ò] , = 2 [Ò] , (cid:20) (cid:16) (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2 [Ò]   , (cid:16) [Ò] (cid:17)(cid:21)−1 (cid:17) +    , = (cid:48)(cid:16)   [Ò] [Ò]    ,    , + =1  ,,2 ,  ,,2   2. Then minimize(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:114) (cid:16) [Ò]  [Ò]  −       with respect to  . 3. Repeat steps 1 and 2 until convergence. Here, Ò is the iteration index, and  , are matrices designed to penalize the roughness of the smooth functions. In this paper, we used the difference penalty  , =   , where  = −1 1 0 0 −1 1  The  , are selected by generalized cross validation. Cross validation is applied to a subset of the training sample corresponding to each peril type. This involves minimizing the generalized cross validation score V = =1( , − ˆ ,)2 [  − (  )]2    30 where   is the influence matrix for the -th category. For details of the theory behind cross validation, we reference Wood (2017). 
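For readers who prefer not to code the P-IRLS iteration directly, the same type of fit can be sketched with the mgcv package, which is used for the estimates reported below. The call is illustrative: mgcv's default basis dimension and penalty order are used rather than the exact three-term second order B-spline basis and first order difference penalty described above, the covariate names follow Table 2.5, and the data frame name is an assumption.

library(mgcv)
fit_vand <- gam(y ~ s(laptop, bs = "ps") + s(graffiti, bs = "ps") +
                  s(window, bs = "ps") + s(shelter, bs = "ps") + s(pool, bs = "ps"),
                family = binomial(link = "logit"),
                data = vandalism, method = "GCV.Cp")
summary(fit_vand)                        # approximate significance tests for the smooths
plot(fit_vand, pages = 1, shade = TRUE)  # smooth functions with pointwise intervals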
The above procedure is performed for each , where in our work each of the following peril types are considered: vandalism, fire, wind, vehicle, water (non- weather), water (weather). Generalized additive models are implemented in the R programming language via packages such as gam and mgcv. In this paper, parameters are estimated using the mgcv R package, implemented by the author of Wood (2017). We chose this package because it provides a convenient interface regarding the choice of basis functions and graphical outputs. 2.3.2.2 Result Figure 2.4: GAM model plots for vandalism In this section, we present the analysis results for the vandalism category. For the vandalism category, explanatory variables laptop, graffiti, window, shelter, pool were included in the GAM model. Among these, laptop turns out to have a positive relationship with high losses. Figure 2.4 shows the shape of the smooth functions  , ,  = 1,  = 1, . . . , 5, as the explanatory variable 31 0.00.20.40.60.81.0−3−2−10123laptopCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123graffitiCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123windowCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123shelterCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123poolCosine SimilaritySmooth Function varies from 0 to 1. The shaded regions in Figure 2.4 illustrate the 95% credible intervals of the smooth functions at each point on the curve. Notice that data for large values of cosines are scarce, hence the credible interval widens for large values of the explanatory variables. Figure 2.5 shows the words with highest cosine similarity with laptop, which turns out to be positively related with high losses. Presumably, vandalisms and thefts to laptops, computers, portables turn out to result in relatively high losses within the vandalism category. Note that vandalisms are small frequent losses. Although the loss amounts in this category are small, the frequent nature of the losses may make it worthwhile to mitigate thefts to laptops. The words in Figure 2.5 are not necessarily found in the training data. They are words found in the word embedding matrices. What we are saying here is that since the word laptop has a high correlation with large losses, related words such as laptops, computers, and phones are also suspects for potential high losses, due to their relationship to the word laptop. This type of chart helps an insurance claims department to learn which type of property to focus on, when mitigating risk. Figure 2.5: Words that relate with high vandalism losses 2.3.2.3 Model diagnostic Table 2.6 provides a summary of the GAM models. The 2 statistic reported for each smooth function is based on Wood (2013), or page 305–308 of Wood (2017). Essentially, this tests the hypothesis 0 :  ,() = 0 32 0.000.250.500.751.00laptoplaptopscomputerscomputerportablephonecamcorderprintercamerascameraradiosscreenssmartmobilescreenhardwaregpswiredelectronicstolencosinelaptop for all  in the range of the cosine similarities for category . A high p-value would indicate the function  , is not needed in the model. The smooth functions for vandalism all turn out to be significant, according to Table 2.6. 
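Referring back to Figure 2.5, the list of related words for a keyword is obtained directly from the embedding matrix rather than from the training data. A minimal sketch, reusing the embedding matrix emb assumed earlier:

nearest_words <- function(word, emb, n = 20) {
  v    <- emb[word, ]
  sims <- drop(emb %*% v) / (sqrt(rowSums(emb^2)) * sqrt(sum(v^2)))
  sort(sims, decreasing = TRUE)[seq_len(n)]
}
nearest_words("laptop", emb)             # the kind of list behind a chart like Figure 2.5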
Table 2.6: GAM model summary (vandalism model) s(laptop) s(graffiti) s(window) s(shelter) s(pool) R-sq.(adj) Chi.sq 32.560 39.250 47.430 132.930 9.020 0.206 p.value 0.000 0.000 0.000 0.000 0.001 2.4 Concluding Remarks In this chapter, we introduced a framework for incorporating textual data into insurance claims modeling, and considered its applications in claims management processes. An insurance claim representative is responsible for investigating the claim, in order to determine the handling process. In this chapter, we explored the use of word similarities as a tool for modeling insurance claims and mitigating insurance risks. Our results demonstrate how text mining technology can be incorporated into a traditional regression analysis. The methodology is applicable in many different areas of applications, where textual data arises. Possible applications of our approach for an insurance risk manager may include: • Classification of claims based on textual descriptions of the claims • Classification of policyholders based on textual descriptions of the policyholders • Prediction of insurance claims at the claim level • Prediction of insurance claims at the policyholder level • Analysis of insurance claims and risk mitigation We explored the LGPIF data in the form of case studies to understand the factors that relate with high insurance losses, classify insurance claims, and model the loss amounts using parametric 33 distributions involving covariates derived from textual information. We make some remarks on the current limitations of our framework, where potential improvements can be made. • Under the current framework, words not found in the word embedding matrix cannot be used in the modeling. • The threshold  is selected using heuristics by a human expert, under the current framework. • Because pre-determined word embedding matrices are limited to one-grams (single words) at the time the paper is being written, the incorporation of -grams (use of phrases longer than one word as a search key) remains an open question. • Further linguistic barriers may exist, if the textual descriptions are longer than those appearing in the dataset used for this paper. Examples may be polysemy, false friends, compound words. • In order to use the proposed method, insurers that focus on specific insurance segments may be constrained to build its own word embedding matrices, as the terms appearing in the claim descriptions may be specific to the field. For example, a medical insurer may find GloVe insufficient, and may need a word embedding matrix trained on medical terms in order to use our proposed approach. Economic losses due to property damage caused by perils including fire, lightning, wind, hail, or vandalism have vast implications to our society. Understanding the nature of property damage can improve our readiness and contribute to minimizing the losses. We have illustrated a way to help realize this goal using a new analysis method, which, to the best of our knowledge, has not been attempted before in the actuarial literature. We believe our methodology may help broaden the horizon of empirical research, and contribute to the advancement of the understanding of our world and the risks residing within it. In addition, we believe that our approach will improve the claim handling procedures of insurance claims departments. 
34 CHAPTER 3 A THREE STAGE APPROACH TO MODEL BUILDING 3.1 Introduction In practice, an important task of an insurance analyst is to assist the claims manager in setting the case reserves for reported claims. The case reserve for a given claim can be understood as the difference between the reported claim amount and the paid amount for an individual claim. Note that the case reserve excludes incurred but unreported claims, for which a separate incurred but not reported (IBNR) reserve should be prepared. For the purpose of this chapter, the reader should understand that the case reserve is an approximation to the ultimate amount of the claim, given information available at the time of the report of the claim. Sometimes the case reserve is set by the claims department of an insurance company, while in other cases the task is outsourced to an outside adjustor. Part of the information available to the claims department at the report time of the claim is a textual description of the claim. In this chapter, we are interested in approaches that use the textual information regarding an insurance claim to predict the ultimate loss amount, by regressing the loss amount on a set of covariates derived from the textual description. Given a dataset of historic loss descriptions and ultimate loss amounts, an actuary may use the approach to improve the case reserving procedure. Part of the problem in this prediction task is that if we use a large number of keywords in forming the design matrix extracted from the textual descriptions of claims, the resulting problem is high-dimensional in nature. In this case study, we use the framework of Lee et al. (2019) and analyze a dataset of loss descriptions and amounts, downloaded from the National Oceanic and Atmospheric Administration (NOAA). For this analysis, a generalized linear model (GLM) may not be appropriate because the linearity assumption may not appropriately fit the data. To solve such a problem, we may consider using a nonparametric regression technique. A variety of nonparametric regression techniques have been 35 developed, including but not limited to regression splines, kernel smoothing, neural networks, and generalized additive models (GAM). Nonparametric regression has been applied in many areas, from modeling daily pollution in the U.K. (Wood et al., 2017), to estimating relative risk for disease mapping of lung cancer (Dreassi et al., 2014). See Simonoff (1996) for more details and examples of nonparametric regression. In this chapter, we consider the GAM. Hastie & Tibshirani (1986) proposed the generalized additive model that consists of the sum- mation of smooth functions, allowing for the ability to capture the true, not necessarily, nonlinear relationship. In the generalized additive model setup, more information is needed to estimate each function as compared to the generalized linear model setup. Therefore, the data must have many more observations than the number of covariates. In addition, when working with high dimensional data, the scalability of the algorithm is also extremely important when considering a method. Our approach is motivated by these characteristics. Considerable work has been done in efficiently estimating larger datasets using generalized additive models. Most recently, Wood et al. (2017) developed a method for estimating GAM with the number of coefficients of order 104, and observations up to 108. 
This method reduces the number of matrix operations, utilizes parallelization, and reduces the memory necessary by marginal discretization of the model covariates. Li & Wood (2019) extended this work by proposing an alternative method of calculating X(cid:48)WX where X is a model matrix and W a diagonal or tri- diagonal matrix, which results in a 30 fold reduction in computational time. Previous works include Marra & Wood (2011) and Wood et al. (2015). Code for these methods are found in the R package mgcv, Wood (2019). While the aforementioned GAM results provide for a scalable algorithm, a hindrance of GAM is the restriction on the number of covariates. Considering the GLM, there are several methods for combating the high dimensionality issue, with most notably being lasso by Tibshirani (1996). Similar to the GLM, the lasso maximizes the likelihood, but instead has an additional 1 penalty term. This term is typically referred to as the shrinkage term. Extensions of the lasso have also been developed, including but certainly not limited to, group lasso from Yuan & Lin (2006) and 36 adaptive lasso from Zou (2006). The group lasso is applied to variables with group-like structure, and it uses a slightly altered penalty term where each variable in a group is penalized equally. This is particularly important due to the group-like structure induced by the basis expansion used in the estimation of the generalized additive model. The adaptive lasso simply applies a weight to each coefficient in the penalty term, with these weights typically estimated through ordinary least squares or lasso. Wang & Leng (2008) combined these extensions to formulate the adaptive group lasso, and showed the ability of the method to identify the true model consistently. Our proposed method is a three step approach consisting of (1) weight calculation by group lasso, (2) the shrinkage step by adaptive group lasso, and (3) the smoothing step. This approach combines the adaptive group lasso dimension reduction technique with the scalable GAM algorithm. 3.2 Data and Preprocessing For our analysis, we utilize the publicly available NOAA Storm Events Database. The analysis is performed on property loss amounts at the event level, using storm event observations involving textual descriptions of the events. The data are collected over time, however we use a cross-sectional model in this paper, in order to focus on the relationship between the textual information and the response. Only Thunderstorm Wind events taking place in Michigan and from 2000 to 2018 are considered for the analysis. For validations, the dataset is divided into training and validation datasets. The reason we use this dataset is because it contains relatively clean, lengthy descriptions of losses from storm events in the United States each year, along with the property and crop damage amount estimates. These damage amounts are initial estimates of the losses, and hence are different from the ultimate loss amounts. Yet, the structure of the data is identical to that available to a claims adjuster, and hence is a good test dataset for the analytical framework explained in this chapter. Another advantage of this dataset is that it is publicly available, allowing dissemination and reproducibility to be easy. Each event is recorded with an event narrative. 
An example of an observation with an estimated property damage of $10,000 has an event narrative that reads: Roof damage was incurred to a barn 37 Table 3.1: Summary statistics for the log(loss) for the training and validation datasets. Training Validation N Min Mean 8.97 2353 126 8.78 2.30 6.21 SD Max 17.03 1.44 1.56 14.00 six miles northwest of Mason due to a severe thunderstorm wind gust and a large tree limb was blown down in South Lansing. Figure 3.1: Frequency for the most common words. Figure 3.1 shows the most common words in the descriptions of the losses. Stop-words such as a, the, and, etc. have been removed. Notice that the word trees is most frequent in the descriptions. A few of the most common words are typically used to describe what is happening to trees, such as blown and wind. In addition, several of the other most common words like power, lines, damage, and outages are used to describe the results of downed trees. There are a total of 2, 353 observations in the training set, with 126 observations in the validation set. As previously mentioned, the claim descriptions are quite lengthy, with an average of 16.8 words per description. There are a total of 2, 642 unique words used in the dataset. To capture only relevant words, stop-words, numbers, and words that only occurred once were removed, resulting in 1, 998 words. Table 3.1 provides summary statistics for the log(loss) for the training and validation datasets. In order to better understand the relationship between the words in the claim description and the property loss amount, each word is represented by a vector. Recent advancements in word 38 050010001500treesblownreportedpowertreelinesnumerouscountydamageroadwindestimatedlimbsmphdiameterthunderstormlakewindsmilesgustsfrequency embedding models have made it possible to obtain these representations easily. We utilize the 300 dimensional word embeddings developed by the authors of Pennington et al. (2014). To form the design matrix, we follow the framework described in Section 1.1.3. That is, we will choose our design matrix, ×, to be  where  is the set of all unique words found in D,  is the number of descriptions in D, and  is the number of words in . Each value in the matrix is now continuous and restricted to [−1, 1]. Figure 3.2 shows the relationship between cosine similarity and property loss for house and thunderstorm. From the figure, we see that the relationship between cosine similarity and property loss is nonlinear in nature and therefore a generalized additive model is appropriate. Figure 3.2: Cosine similarity against property loss for house and thunderstorm. 3.3 The Model and Basis Expansion In this section, we describe our methodology in specific terms. We consider the generalized additive model (3.1)  ( )(cid:170)(cid:174)(cid:172) ,  = [| ] = −1(cid:169)(cid:173)(cid:171)  (cid:21) (cid:20)  − () =1 , where the link function corresponds to that of the corresponding exponential family distribution. For each of the  independent observations, the density function is given by  = () exp 1 ≤  ≤ ,  ∈ R (3.2)  39 510150.250.500.751.00Cosine Similaritylosshouse510150.000.250.500.751.00Cosine Similaritylossthunderstorm We assume that a matrix of explanatory variables is given. Let’s call it ×, and use the notation  = (  ). We have 1 ,  2 , . . . ,  (3.3) × = 11 12 . . . 1 21 22 ... ... . . . 2 . . . ... 1 2 . . .     =1 Thus,  is the number of observations, and  is the number of explanatory variables available. 
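For completeness, the preprocessing that yields these p candidate covariates can be sketched in R. The file name refers to the publicly distributed 300-dimensional GloVe vectors, narratives stands for the raw event narratives, and all object names are illustrative.

emb <- as.matrix(read.table("glove.6B.300d.txt", sep = " ", quote = "",
                            comment.char = "", row.names = 1))

tokens <- strsplit(tolower(narratives), "[^a-z]+")   # lowercase; drops numbers and punctuation
counts <- table(unlist(tokens))
vocab  <- names(counts)[counts > 1]                  # drop words occurring only once
vocab  <- setdiff(vocab[nzchar(vocab)], tm::stopwords("en"))
vocab  <- intersect(vocab, rownames(emb))            # keep only words with an embedding

The censored cosine-similarity design matrix is then built exactly as in the sketch of Section 2.3.1.1, with vocab in place of the hand-picked keywords.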
We assume that the parameter 0 <  < ∞ is known. Without loss of generality, let  = 1. Also, we assume that the density of  depends on  via the structure  =  ( ), (3.4) where  are defined in equation 3.2. Let the values   be defined over intervals [, ], and let  = 0 < 1 < . . . <  < +1 =  be a partition on [, ], where  =  = , with 0 <  < 0.5, such that | − −1| = (−). max 1≤≤+1 (3.5) For smooth functions 1, . . . , , there exist functions 1, . . . ,  ∈ S, such that 1, . . . ,  well approximate 1, . . . , . Here, S is the space of polynomial splines of degree  ≥ 1. In other words, S is the space of piecewise polynomials on each interval of the partition, which are (cid:48) ≤  − 2 times continuously differentiable for  ≥ 2. Smoothness is defined in the same way as Huang et al. (2010). According to Schumaker (1981), there exists a normalized B-spline basis { , 1 ≤  ≤ } for S, where  =  +  = 1. For a practical overview of the B-spline basis function, the reader may refer to Wood (2017), section 5.3.3, starting from page 204. We want to write  ( ) = [ ]      , (3.6)  =1 40 [ ] for some value   . For this, our next goal is to construct a design matrix (cid:16)1 : [1] : [2] : . . . : [ ](cid:17) (cid:16)1 : [1] : [2] [1] 2 [2] 2 × = = (cid:17) [ ] 2 (3.7) : . . . : [ ] from ×, where [ ] are matrices of basis functions. Denote  =  × . Here, [ ] are each  ×  matrices of transformed basis functions, where  is the number of parameters within each basis function consisting the design matrix. We consider the matrix of basis functions  [ ] = 1(1 ) 2(1 ) 1(2 ) 2(2 ) ... ... 1( ) 2( ) . . . (1 ) . . . (2 ) . . . . . . ( ) ...  . (3.8) From here on, we assume () are -parameter B-spline basis functions of order 3. In order to obtain the design matrix, we form the QR decomposition (cid:16) [ ](cid:17) 1 =  [ ] 1 [ ] +  [ ] 2 0 (3.9) (3.10) and take the  [ ] 2 matrix corresponding to the zero part of the decomposition, and define [ ] = [ ] [ ] 2 Finally, let the × matrix be given by expression (3.7). We call  our design matrix, and denote the elements of the design matrix , for  = 1, . . . ,  and  = 1, . . . . We also denote   = [ ] 1 ,  [ ] 2 , . . . ,  [ ]   , for  = 1, . . . , , and  = 1, . . . . (3.11) (cid:16) (cid:17) Under this framework, the response variable is related to the covariate   via  ( ) =      ,  = 1, . . . , ,  = 1, . . . , , (3.12) where   is the coefficient corresponding to the -th explanatory variable. We may see that   must be a length  vector, since the -th spline contains  parameters. The illustrated approach 41 to constructing the design matrix ensures that the basis functions satisfy  ( ) = 0 for each  = 1, . . . , . (3.13) In other words, selecting the model matrix this way imposes an identifiability constraint, which states that the smooth function defined by the basis functions must satisfy  ( ) = 1 [ ]   = 0 (3.14) Notice that equation (3.14) holds because  =1  =1 (cid:104) (cid:105)  [ ] 1 [ ] 2   1 [ ]   = 1 [ ] [ ] 2   = [ ] 0 [ ] 2   = [ ] [ ] 1  [ ] 2   = 0 (3.15) [ ] 1 [ ] 2 and  are orthogonal. Simply stated, if  () is a spline, the coefficients for the since  [ ] 2   instead of   after the transformation. Our methodology is related spline are now given by  to Chouldechova & Hastie (2015), yet we use a three step procedure. 
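A small R sketch of the block construction in (3.8)-(3.10): build a B-spline basis for one covariate, then use the QR step to absorb the sum-to-zero identifiability constraint. The basis dimension and degree shown are illustrative rather than the exact settings used for the reported results, and X denotes the cosine-similarity matrix of Section 3.2.

library(splines)
make_block <- function(x, df = 4, degree = 3) {
  B <- bs(x, df = df, degree = degree, intercept = TRUE)  # n x m basis matrix B^[j]
  C <- crossprod(B, rep(1, nrow(B)))                      # constraint vector (B^[j])' 1
  Q <- qr.Q(qr(C), complete = TRUE)                       # m x m orthogonal matrix
  B %*% Q[, -1, drop = FALSE]                             # Z^[j] = B^[j] Q_2
}
Z <- do.call(cbind, lapply(seq_len(ncol(X)), function(j) make_block(X[, j])))

By construction the columns of every block sum to zero, which is the identifiability constraint in (3.13)-(3.14).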
We are looking for the parameters for  {[| ]} = 0 +  ( ) = 0 +     = 0 +    (3.16)  =1  =1 where we have used the notation  to denote the -th row of , and  = (  ), where some of the  ’s are zero, while others are non-zero. The approach in Chouldechova & Hastie (2015) is to minimize the penalized negative log-likelihood 2 , . . . ,  1 ,  where ࡁ( ) is the log-likelihood for an exponential family distribution: − 1  ࡁ( ) + 2 (cid:113) =1  (cid:169)(cid:173)(cid:171)  (cid:16) (cid:104) =1          + 1 (cid:48) 2 =1    (cid:170)(cid:174)(cid:172) − (cid:169)(cid:173)(cid:171)  (cid:17)(cid:105) =1     (cid:17) −  =1  [ ]  (cid:16)   =1 =1 ࡁ( ) = = 42 3  (cid:48)   (3.17)  =1 [ ]      (cid:170)(cid:174)(cid:172) (3.18) The hope is that the second term in equation (3.17) induces zeros into groups of coefficients, while the last term imposes smoothness into the “surviving” coefficients. Here,   is an identity matrix of dimension , and   is a constraint matrix to impose smoothness into the estimated functions  . There are several practical difficulties with this approach: • When  is large, or in other words when the problem dimension is large, there are too many 3  tuning parameters to estimate. Wood (2017) discusses algorithms for large  cases, but does not talk about cases where  is large. • Theory behind selecting the tuning parameters 3 , discussed in Wood (2017) is no longer directly applicable, because of the extra group lasso type penalty term. • Implementing the coordinate descent algorithm, which brings in sparsity into , becomes tricky with the smoothing penalty. Usually fast algorithms for lasso type estimators with GLMs are implemented by locally approximating the likelihood with a Taylor’s approximation at each iterative step, yet the extra penalty term makes this tricky. • Estimating the coefficients may take a very long time, especially when the number of ex- planatory variables  is large, as in the application we consider in this paper. Hence, in order to keep the estimation procedure scalable for large  (and hence large ), we propose a three step approach to the estimation problem for the model (3.16). The first step of the approach is to perform a group lasso estimation with the first and second terms of equation (3.17). The second step uses the resulting coefficient estimates to perform an adaptive group lasso estimation of the parameters. The third and final step uses the nonzero coefficients obtained from the second step to induce smoothness into the implied spline function  (·), for each nonzero function  . These steps are formalized in the following section. Moreover, to provide a statistical validation, we present both the numerical results in section 3.6 and the theory for the estimated functions in Appendix A, which works as another support of our proposed 3-stage approach. We aim at validating two things: the variables selected are consistent; 43 the estimators are consistent with respect to the unknown true functions. Appendix B contains a short simulation study illustrating the effectiveness of the method on a simulated dataset. 
3.4 The 3-Stage Approach 3.4.1 Stage 1 – Group lasso Define the objective function to be ( ; 1) = −1  (cid:16) (cid:104)  =1     (cid:17) −  (cid:16)    (cid:17)(cid:105) + 1  =1 Let ˆ be the optimizer for (3.19), or in other words (cid:107)  (cid:107)2 (3.19) ˆ = arg min ∈R· ( ; 1) (3.20) 3.4.2 Stage 2 – Adaptive group lasso Define the objective function to be ( ; 2) = −1   (cid:104) (cid:16) (cid:17) −  (cid:16) (cid:17)(cid:105) + 2         =1 where the weights depend on the screening stage group lasso estimator =1  (cid:107)  (cid:107)2 (3.21)    = (cid:107) ˆ (cid:107)−1 2 ∞ if (cid:107) ˆ (cid:107)2 > 0 if (cid:107) ˆ (cid:107)2 = 0 (3.22) (3.23) Numerically, the weights are set to a large number, for the case when (cid:107) ˆ (cid:107)2= 0. Let ˆ  be the optimizer for (3.21). In other words, ˆ  = arg min ∈R· ( ; 2) Let ˆ be the subset of {1, . . . , }, such that the th coefficient of   with  ∈ ˆ are nonzero. Thus, the second stage estimates are sparse, meaning that the coefficients are zero for some . This reduces the coefficient size in the third stage. 44 3.4.3 Stage 3 – The smoothness penalty ˆ be the matrix consisting of columns from  corresponding to the set ˆ. Let  ˆ Let  Rˆ·, where ˆ = | ˆ|. Define the objective function to be be in ( ; 3) = −1  ˆ      ˆ      3        (3.24) (cid:16) (cid:104)  =1 (cid:17) −  (cid:16) (cid:17)(cid:105) + 1 2  ∈ ˆ where 3 = (31, 32, . . . , 3). Let ˆ be the optimizer for (3.24). In other words ˆ = arg min ∈R ˆ ( ; 3) (3.25) Since the problem of dimension has been reduced, the third step estimation may be performed using existing generalized additive models routines, using ˆ  as the initial guess for the P-IRLS procedure. The tuning parameters 3 may be obtained by generalized cross validation or REML as described in Wood (2017). 3.4.4 Tuning Parameters Each stage has a tuning parameter, 1, 2, and 3, respectively. The selection of 1 and 2 can greatly influence the performance of the model and the efficiency of the algorithm. Larger values of 1 and 2 will lead to an over-simplified model with faster computation time, while smaller values will lead to an over-fitted model with slower computation time. To find the “sweet spot”, cross validation is used to determine 1 and 2. The tuning parameters 3 is obtained by generalized cross validation or REML as described in Wood (2017). 3.5 Algorithm We now discuss the implementation of the method using R. For stage 1 and stage 2, we utilize functions from the gglasso package (Yang & Zou, 2017) and for stage 3 we utilize functions from the mgcv package (Wood, 2019). The code is provided in an R package at github.com/ scottmanski/TAGAM. 45 3.5.1 Stage 1 – Group lasso The gglasso function is modified such that we loop through the grid of 1 values, but once the number of nonzero coefficients is greater than , the algorithm is stopped. By doing so, we ensure that we will be able to execute stage 3. 3.5.2 Stage 2 – Adaptive group lasso The implementation of stage 2 is very similar to that of stage 1, except for the addition of the (cid:48)  =     for each  ∈ {1, ..., }. Then weights. equation 3.21 can be written as In order to incorporate the weights, let   + 2 (cid:48) (cid:170)(cid:174)(cid:172)  =1 (cid:107)  (cid:48) (cid:107)2 (3.26) (  (cid:48); 2) = −1  1   [ ]    1   [ ]     =1 (cid:169)(cid:173)(cid:171)  =1 (cid:170)(cid:174)(cid:172) − (cid:169)(cid:173)(cid:171)  =1 (cid:48) 3.5.3 Stage 3 – The smoothness penalty The mgcv package is used to implement stage 3. 
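Putting the stages together, the following R sketch uses gglasso for stages 1 and 2 and mgcv for stage 3; the mgcv call is discussed further in the next subsection. All names and function arguments are illustrative, and the working implementation, including the modified lambda path, is in the TAGAM package referenced above. Here y is log(loss), X is the n x p cosine-similarity matrix with syntactic column names, Z is the expanded basis matrix of Section 3.3, and grp gives the group (word) index of every column of Z.

library(gglasso)
library(mgcv)

## Stage 1: group lasso screen
cv1  <- cv.gglasso(Z, y, group = grp, loss = "ls", pred.loss = "L2")
b1   <- as.vector(coef(cv1, s = "lambda.min"))[-1]       # drop the intercept
nrm1 <- tapply(b1, grp, function(b) sqrt(sum(b^2)))      # ||beta_j||_2 for each group

## Stage 2: adaptive group lasso; dividing Z_j by w_j = ||beta_j||^{-1} (equation 3.26)
## amounts to multiplying its columns by the stage-1 group norm
keep <- which(nrm1 > 0)
cols <- which(grp %in% keep)
Zw   <- sweep(Z[, cols], 2, nrm1[as.character(grp[cols])], "*")
cv2  <- cv.gglasso(Zw, y, group = match(grp[cols], keep), loss = "ls", pred.loss = "L2")
b2   <- as.vector(coef(cv2, s = "lambda.min"))[-1]
sel  <- keep[tapply(b2, match(grp[cols], keep), function(b) any(b != 0))]

## Stage 3: smoothness penalty on the surviving covariates
form <- reformulate(sprintf("s(%s, bs = 'ps')", colnames(X)[sel]), response = "y")
fit  <- bam(form, data = data.frame(y = y, X), discrete = TRUE)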
In the mgcv package, there is a gam function and a bam function, with the former designed for smaller datasets and the latter designed for much larger datasets. In this analysis, we utilize bam. To increase the computational efficiency we also choose to have the function discretize the data following the method described in Wood et al. (2017). 3.6 Results In this section, we discuss the results of our model. Table 3.2 provides information for the final model. As previously mentioned, 1,998 words appeared in the dataset, and were considered as possible covariates. For the model, we chose to use the penalized regression spline. Stage 1 effectively reduced the number of covariates to 261, and stage 2 further reduced the number of words to 149. While the number of functions to interpret may seem cumbersome, the final model is relatively simple compared to the number of possible covariates that could have been in the model. Figure 3.3 shows the estimated functions for several covariates. All of the function estimates have a few characteristics in common. For smaller cosine similarity values, the estimated functions 46 Table 3.2: Summary statistics for the final model. The Residual DF comes from the estimated degrees of freedom from the GAM, and the MSPE is the out of sample mean squared prediction error.  4  2 1 0.0005255074 2 0.0001063902 ˆ 149 Residual DF Deviance MSPE 1.016 2167.387 70.7% Figure 3.3: Function estimates for several covariates. are approximately zero. We expect this because smaller cosine similarities between a word and a phrase indicates that the word has very little meaning in common with the phrase. For large cosine similarity values, the 95% credible interval for the functions becomes wider as compared to cosine similarity values around 0.2. This is also expected simply due to the lack of observations for higher cosine similarities. The function estimates help us understand the relationship between a word, its related words, and the property loss amount. Many of these estimated functions seem to follow our intuition. For example, house, losing, widespread, gusts, and tree are all words that would typically be associated with property loss. Words with the highest cosine similarity to house are shown in figure 3.4. Most of these related words are types of homes. From the function estimate, we see that an incident involving a house results in higher property loss than that of an incident involving offices or apartments. 47 −0.40.00.40.80.250.500.751.00houses−0.50.00.51.01.50.000.250.500.751.00losings−1.0−0.50.00.51.00.000.250.500.751.00minors0.00.40.80.250.500.751.00dozenss−0.50.00.51.00.250.500.751.00trees0.00.40.80.250.500.751.00multiples0.00.40.80.000.250.500.751.00widespreads0.00.40.81.20.000.250.500.751.00quarterss0.00.51.01.52.00.000.250.500.751.00wireds01230.000.250.500.751.00shuttings0.00.51.00.000.250.500.751.00gustss01230.000.250.500.751.00orchardss Figure 3.4: Words with the highest cosine similarity with house. While many function estimates obviously follow our intuition, there are some that seem harder to interpret. Words like quarters, shutting, and orchards all seem unrelated to property loss. To shed some light on this issue, we look at a sentence from a description that includes quarters; Two eyewitnesses in Covington reported hail greater than the size of quarters during the peak of the storm. The use of quarters here is related to the size of hail. It is expected that larger hail will lead to larger property loss. 
Words related to quarters include nickel and dime, which are also used to describe hail size. In a similar way, we find out that shutting is referring to the closure of major roadways. In the case of orchards, several observations involved damage to apple orchards. With Michigan producing the third most apples of any state, it is clear why damage to apple orchards results in large property loss. The model also performed well with out of sample prediction. Figure 3.5 shows the predicted property loss amounts against the true loss amounts for the validation sample. The Spearman correlation for the validation set is 76.06%, while the Spearman correlation for the training dataset is 80.30%. To measure the stability of the method, for a selected year, the model was trained using the previous years, and tested on data from the selected year. This was completed for each year from 2001 to 2018. This resulted in an average mean squared prediction error of 1.34 with a standard error of 0.123. Using a lasso model increases each of these values by about 8% respectively. The 48 0.000.250.500.751.00househousessenatecongressionalcongressrepublicansbuildingwhitemansioncapitolofficerepublicanclintonroomhomedemocratsbillparliamentfloorlawmakersresidencerepresentativesgopbushspeakerhillnextgingrichthecommitteeCosine Similarity Figure 3.5: Predicted property loss amounts against the true property loss amounts for the validation sample. The Spearman correlation is 76.06%. 3 stage method selected a more parsimonious model as compared to the single step lasso model, resulting in greater model stability. 3.7 Implications We have presented an analytical method for analyzing losses due to storm events in relation to their textual descriptions. The fact that losses may be predicted more accurately with textual information implies that the case reserving procedure may be improved significantly. The traditional approach to case reserving is to take the average amount of the reported losses, yet this does not take advantage of the heterogeneity of information contained within the initial report of a loss to an insurance company. The new method allows for a more accurate prediction of the ultimate loss to be indemnified for a specific reported loss. Being able to explain the factors that contribute to higher or lower severity of losses by selecting the relevant keywords from a set of words allows the actuarial analyst to avoid manually selecting the keywords needed for the textual risk analysis. This technique may be useful especially when the number of words describing the loss is large, or statistically the problem is high-dimensional. The analyst may also be able to understand the factors that relate to high losses using the selected covariates, and this may help mitigate future losses. In addition, these factors that contribute to higher severity of property loss can indicate areas 49 5.07.510.012.55.07.510.012.5PredictionTrue Property Loss needing improvement in the way they protect against various weather events. For example, events involving orchards resulted in high property loss, illustrating the need for additional preventative measures to protect the apple trees during a thunderstorm. The fact that a simple three-step approach allows for the regression selection problem to be solved easily using existing routines in the R programming language. 
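As a concrete illustration of the out-of-sample check reported in Section 3.6 (Figure 3.5), the validation summaries can be computed with existing routines, assuming fit is the stage-3 model and valid is a data frame holding the validation covariates and log losses y:

pred <- predict(fit, newdata = valid)
cor(pred, valid$y, method = "spearman")   # Spearman correlation, as plotted in Figure 3.5
mean((pred - valid$y)^2)                  # out-of-sample mean squared prediction error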
The approach may be applied in general to problems where nonlinear effects of a large number of continuous explanatory variables must be understood in relation to the response. We have focused on the log-normal case of the response, yet the method is general enough to be applied to non-normal responses, including responses following a gamma distribution, or Poisson distribution. Future work may focus on these specific cases. 3.8 Concluding Remarks In this chapter, we consider a general high dimensional text analysis problem and propose a 3 stage approach by adopting modern statistical methods. Stage 1 and 2 effectively reduced the high dimensional problem to one that mgcv can handle. The use of stage 1 and 2 to reduce the problem instead of utilizing a subject matter expert allows for simple replicability of the process. We showed how the use of cosine similarities from textual descriptions can provide interpretable results when predicting property loss. While there are many other possible applications in risk analysis, our framework could also be applied in: • classification of users on a social networking site based on their posts, • prediction of a company’s change in stock price from related articles, • caller scam classification based on call transcripts. 50 GAMMA DOUBLE GENERALIZED LINEAR MODEL WITH GROUP LASSO CHAPTER 4 4.1 Introduction 4.1.1 Overview The recent advancement of coordinate descent approaches to the Lasso estimation problem has allowed the variable selection problem to be solved in a systematic way. Lasso is an approach to induce sparsity into the coefficients estimated from a regression analysis, which has been introduced by Tibshirani (1996). Since its introduction, the method has gained popularity in the statistics literature due to its elegancy and ability to solve high dimensional regression problems. In a highly cited paper by Yuan & Lin (2006), the authors have introduced a method to group the coefficients in a regression for group-wise variable selection. In spite of the popularity and the extensive research performed on the Lasso method, applications in the actuarial science literature has been limited until now. The reason is in part because datasets found in the actuarial literature are typically not high dimensional in nature, and in part also because generalizing the fast coordinate descent methods to non-normal responses remained a challenging task. Although works by authors such as Friedman et al. (2010) have extended the Lasso method to logistic regression and multinomial regression, problems involving other non-normal responses such as the Tweedie distributed case were left as open problems. Yet, in a recent strand of works by Yang & Zou (2015) and Qian et al. (2016), an approach to solve the Tweedie distributed case with high speed has been introduced. The approach approximates the likelihood function with a second order Taylor series expansion around the current estimate of the coefficients, and uses the approximation to construct a closed form update to the coefficient estimates. This paper extends the works by Yang & Zou (2015) and Qian et al. (2016) and applies the technique to textual data analysis in the actuarial science literature using the gamma distribution, 51 which is a special case of the Tweedie distribution. Insurance claims prediction using textual data analysis methods is a natural application of the group Lasso technique for non-normal responses. 
The use of the technique in conjunction with dispersion modeling is new to the literature to the best of our knowledge. In the actuarial science literature, double GLM approaches have been used by authors such as Smyth (1989) and Smyth & Jørgensen (2002). In a dispersion modeling framework, one uses an additional link function for the dispersion. The gamma distribution double GLM model has been explored in Smyth (1989), and the Tweedie model has been explored in Smyth & Jørgensen (2002). Our idea in this paper is to incorporate a group Lasso type penalty in both the dispersion parameterization and the mean parameterization of a gamma model. The rest of the chapter proceeds in the following order: In Section 4.2 the model is explained along with a fast algorithm for obtaining the parameters. In Section 4.3 we discuss the asymptotic convexity of the negative log-likelihood. In Section 4.4 two simulations are conducted to investigate the performance of the models explained in Section 4.2. Section 4.5 describes the application of the method to insurance claim predictions using textual data analysis. Finally, Section 4.6 concludes the chapter with closing remarks. 4.2 Model 4.2.1 One Dimensional Unpenalized Problem In order to explain our approach, we demonstrate the algorithm in Yang & Zou (2015) and Qian et al. (2016) for a one-dimensional problem. For simplicity, suppose we have the one dimensional negative log-likelihood ࡁ() = − + , (4.1)  =1 52 which arises from the gamma regression model with only an intercept term . The Taylor series approximation around a point ˜ is given by ࡁ() = ࡁ( ˜) + − ˜( − ˜)2 (4.2) (cid:16)−− ˜ + 1(cid:17) ( − ˜) + 1  =1  =1 2 This can be simplified to where  =1 1 2 ࡁ() = ˜( ˜ − )2 + ( ˜) (4.3) ˜ = − ˜ ˜ = ˜ + (1 − −1   ˜) and (4.4) and ( ˜) is constant given ˜. Fixing the ࡁ() function, which is an approximation to the ࡁ() function, we try to minimize ࡁ() by solving the problem ࡁ( ˘) + ˘( − ˘) + 1 2 ˜( − ˘)2 arg min (4.5)  assuming a current estimate for ˘ is given. The idea is, we update ˘ until it converges to the minimum of a given ࡁ() fixed for a given ˜. Once this minimization is achieved, we update ˜ and continue. Here,  =1 ˘ = ࡁ   = − ( ˜ − ˘) ˜ and ˜ = 2ࡁ  2 =  =1 ˜ (4.6) The first order condition for equation (4.5) is given by ˘() = ˘ − ˜−1 ˘ The resulting ˘() satisfies ࡁ( ˘()) ≤ ࡁ( ˘), because ࡁ( ˘()) = = = 1 2 1 2 1 2    =1 =1 =1 ˜( ˜ − ˘())2 + ( ˜) ˜( ˜ − ˘ − ( ˘() − ˘))2 + ( ˜)  =1  =1 ˜( ˘() − ˘)2 + ( ˜) ˜( ˜ − ˘)2 −  ˜( ˜ − ˘)( ˘() − ˘) + 1 2  ˜( ˜ − ˘)( ˘() − ˘) + 1 2 =1 = ࡁ( ˘) − = ࡁ( ˘) + ˘( ˘() − ˘) + 1 2 ≤ ࡁ( ˘) + ˘( ˘ − ˘) + 1 2 =1 ˜( ˘() − ˘)2 = ࡁ( ˘) ˜( ˘ − ˘)2 ˜( ˘() − ˘)2 (4.7) 53 where, the inequality in equation (4.7) is due to the fact that ˘() is the solution for the mini- mization problem (4.5). Hence, the iteration scheme eventually converges to the solution ˆ = arg min  ࡁ() 4.2.2 One Dimensional Penalized Problem Now, consider the penalized problem ˆ∗ = arg min  ࡁ() + || (4.8) (4.9) where  is a tuning parameter. In this subsection, we denote the solution to the penalized problem with an upper asterisk (*). We try to minimize () = ࡁ() + || by solving the minimization problem arg min  ࡁ( ˘∗) + ˘( − ˘∗) + 1 2 ˜( − ˘∗)2 + || (4.10) at each iterative step, for a given current estimate ˘∗. 
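As a sanity check on the unpenalized iteration of Section 4.2.1, the updates can be written in a few lines of R. The sketch ignores observation weights for simplicity; its fixed point is the closed-form intercept estimate log(mean(y)).

beta <- 0
repeat {
  v     <- y * exp(-beta)            # weights v_i of the quadratic approximation
  z     <- beta + (1 - 1 / v)        # working responses z_i
  b_new <- sum(v * z) / sum(v)       # minimizer of the weighted least squares problem
  if (abs(b_new - beta) < 1e-10) break
  beta <- b_new
}
all.equal(beta, log(mean(y)))        # the gamma intercept-only maximum likelihood estimate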
Notice, we have ( ˘∗ ()) = ˜( ˜ − ˘∗ ())2 + ( ˜) + | ˘∗ ()|   =1 1 2 1 2 = ( ˘∗) − =1 = ˜( ˜ − ˘∗ − ( ˘∗ ()| () − ˘∗))2 + ( ˜) + | ˘∗ ˜( ˜ − ˘∗)( ˘() − ˘∗)   ˜( ˘() − ˘∗)2 + | ˘∗ () − ˘∗) + 1 ()| − | ˘∗| 2 ˜( ˘∗ − ˘∗)2 + | ˘∗| − | ˘∗| = ( ˘∗) ()| − | ˘∗| () − ˘∗)2 + | ˘∗ ˜( ˘∗ =1 + 1 2 =1 = ( ˘∗) + ˘∗( ˘∗ ≤ ( ˘∗) + ˘∗( ˘∗ − ˘∗) + 1 2 (4.11) Hence, the iteration scheme converges. Let the unpenalized solution be ˘(), in contrast to the penalized solution ˘∗ (). According to the first order condition without the penalty, as in section 4.2.1, we would have ˘() = ˘ − ˜−1 ˘∗. The solution for equation (4.10) can 54 The second case is when ˘() = ˘ − ˜−1 ˘∗ < − ˜−1 < 0. argument, the solution becomes ˘∗ − ˘∗ ˘ −  ˘ ˘∗ − ˘∗ ˘ +  ˘ (4.13) In this case, using a similar (4.14) be obtained using the first order conditions for three different cases. The first case is when ˘() = ˘ − ˜−1 ˘∗ > ˜−1 > 0. In this case, locally, the problem would be arg min  ࡁ( ˘∗) + ˘∗( − ˘∗) + 1 2 ˜( − ˘∗)2 +  (4.12) Differentiating this objective function with respect to  and setting the expression to zero results in the solution: ()| = | ˘∗ − ˜−1 ˘∗| ≤ ˜−1. In this case, we have ˘∗ The third case is when | ˘∗ () = 0. To see why, without loss of generality, assume 0 ≤ ˘∗ () = ˘∗ − ˜−1 ˘∗ ≤ ˜−1. The objective function in equation (4.12) relevant to the solution (excluding constant terms) can be simplified to obtain the following expression: ( ˘∗ − ˘∗ ˜ + )  + 1 2 ˜ 2 (4.15) which is minimized at zero, since the coefficient for  in the first term is positive, and we have assumed that ˘∗ () is positive. The three cases can be expressed compactly in one expression: (cid:18) (cid:19) () = ˜−1( ˜ ˘∗ − ˘∗) ˘∗ 1 −  | ˜ ˘∗ − ˘∗| + (4.16) Hence, we repeatedly update the estimate using expression (4.16), updating ˘∗ and ˜ along the way. This gives an iteration scheme, which ultimately converges to ˆ∗. 4.2.3 Gamma Generalized Linear Model The unpenalized negative log-likelihood for the gamma generalized linear model is  =1 ࡁ(0, ) =  (cid:16)  exp(cid:16)−0 −    (cid:17) + 0 +  (cid:17)   , (4.17) 55 (cid:16) where  are weights,  are responses, 0 is an intercept,  are coefficients excluding the intercept, and  are the explanatory variables for observation . We are interested in solutions of the form  (cid:16) ˆ0, ˆ (cid:17) (cid:17) treating the coefficients for each word  = 1, . . . ,  as groups. We = arg min (0,) ࡁ (0, ) +   (cid:107)  (cid:107)2 (4.18) =1  1 ,  2 , . . . ,   where  = use the algorithm presented in Yang & Zou (2015) and Qian et al. (2016) to estimate the gamma regression coefficients with a group lasso penalty. The solution for the Tweedie model with group lasso penalty is discussed in Qian et al. (2016) for the case when the power parameter is less than two. When the power parameter equals two, the model results in the gamma model, yet the derivation for this special case is not explicitly discussed in Qian et al. (2016), so we present it below. The algorithm iteratively updates the estimate of the coefficients given a current estimate. 
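Before presenting the group-wise version, note that the scalar update (4.16) derived above is simply a soft-threshold. A one-line R sketch, with illustrative names: grad and hess denote the current gradient and curvature quantities and lambda the tuning parameter.

soft_update <- function(beta, grad, hess, lambda) {
  u <- hess * beta - grad            # curvature-scaled unpenalized target
  (u / hess) * max(1 - lambda / abs(u), 0)
}

The group updates that follow replace the absolute value by the Euclidean norm of the group and the scalar curvature by the largest eigenvalue of the corresponding Hessian block.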
Given current estimates ˜0, ˜, the second order Taylor series approximation of the negative log-likelihood around the current estimate is    exp(cid:16)− ˜0 − ˜ (cid:16)− exp(cid:16)− ˜0 − ˜ (cid:17)(cid:16) 0 +  =1    (cid:17) + 1(cid:17)(cid:16)     − ˜0 − ˜   0 +   − ˜0 − ˜    (cid:17)2 (cid:17) ࡁ(0, ) = ࡁ( ˜0, ˜) +   =1 =1 + 1 2 1 2 = where (cid:17) (4.19) (4.20) (4.21)   () ˜( ˜ − 0 −   )2 + ( ˜0, ˜), ˜ =   exp(cid:16)− ˜0 − ˜ ˜ = ˜0 + ˜   +  ˜ 56 and ( ˜0, ˜) is a constant given the current estimate. Given ࡁ(0, ), consider updating the current estimates ˘0, ˘. Then (cid:16) ˜ − 0 −  (cid:16) ˜ − 0 −      (cid:17) (cid:17) ˜ ˜ =1     =1 =1 =1 ˜ ࡁ (0, )    ࡁ (0, )  0 = − = − 2ࡁ (0, )      2ࡁ(0, )  0  0 = =   (0, ) = 0 (0, ) =   (0, ) = 0 (0, ) = ˜        (4.22) (4.23) (4.24) (4.25) With these values, the expression for the new coefficients for the penalized problem in equation (4.18), according to the algorithm in Qian et al. (2016) is (cid:16) ˜  ˘  − ˘  (cid:17)(cid:18) (cid:19)   ˘ − ˘  (cid:107)2 (cid:107) ˜  1 − ˜  ˘ () = ˘0() = ˘0 − ˜−1 0 where ˜  is the largest eigenvalue of ˘ , and ˜0 = ˘0 =1 ˜. 4.2.4 Gamma Double Generalized Linear Model  ∼ (, ),  = 1, ...,  where  =   =   and  =  . + (4.26) (4.27) (4.28) (4.29) For simplicity of the derivation, in this section we assume the design matrix includes a column of ones, corresponding to the intercept. Then, the negative log-likelihood is  =1 (cid:26) (cid:16)  =1 ࡁ() = ࡁ( , ) = −  +  (cid:17) + log Ɖ(cid:0)(cid:1) −  +(cid:0)1 − (cid:1) log  (cid:27) (4.30) 57 where  =  , and  =  . Let  = (  , ). The goal is to solve the following minimization problem (cid:13)(cid:13)  (cid:13)(cid:13)2 +     =1 (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2    =1 ˆ = arg min ࡁ() +   where  > 0 and  > 0 are tuning parameters, and   are the positive group lasso weights for each group  = 1, 2, ..., . We have  /  = / = . Define  =  =  = 2ࡁ   2  ࡁ   2ࡁ 2   = ࡁ  =  −  = (cid:16)−−  + 1(cid:17) = (cid:16) = (cid:16) −  +  −  +  Then, by the chain rule, we have ∇ࡁ = (cid:17) + (1)(cid:0)(cid:1) 2 + (0)(cid:0)(cid:1)  − 2 −  −  log  (cid:17) + (0)(cid:0)(cid:1)  −  −  −  log             and ∇2ࡁ =   We define ࡁ() to be the second order Taylor series expansion of ࡁ() around the current estimate of the parameters ˜: ࡁ() = ࡁ( ˜) + ∇ࡁ( ˜)( − ˜) + 1 2( − ˜)∇2ࡁ( ˜)( − ˜) (4.37) Then plug in the expressions for ∇ࡁ( ˜) and ∇2ࡁ( ˜) in terms of , , , and  into the expression for ࡁ() and simplify. The result is (4.31) (4.32) (4.33) (4.34) (4.35) (4.36) ࡁ() = ( ˜) + 1 2 ˜(   )2 + 1 2 ˜(   )2  =1  =1   − ˜(  (cid:110) ˜ (cid:110) ˜   =1 =1 + + ˜)(   ) − ˜(   ˜)(   ˜)(    − ˜(   ˜)(   ) − ˜(   58   )(cid:111)  )(cid:111) +   =1 ˜(   )(   ) (4.38) where ( ˜) is a constant given ˜. Then we would like to minimize the following penalized objective function () := ࡁ() +    . (4.39) To minimize 4.39, we utilize a blockwise majorization descent method. 
Fixing the ࡁ() function at the point ˜, we have  =1 (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2  =1   ࡁ/ (cid:13)(cid:13)2 +  (cid:13)(cid:13)   ࡁ/      ˜( ˜ ˜ =1 =1 =1  ˜(   =1  ˜) +  =1  =1 ∇ࡁ() = =1     =1 =1 − ˜(   ) + ˜(   ˜) − ˜(   ) + ˜(   ˜) −   =1 =1 ˜ ˜     =1 =1 ˜ ˜   ∇2ࡁ() =  (cid:20)   (cid:20)   =1 − =1 − =1 ˜     =1 ˜ =1 ˜(   ) + =1 ˜) − ˜(   =1 ˜(   ) + =1  ˜) −  ˜( 59 (4.40) ˜(   ) (4.41) ˜) + ˜(   ) (4.42) (4.43) (cid:21) (cid:21) ˜(   ) ˜(   )   (4.44)   (4.45) ˜(   ˜) + ˜(   ˜) +   =1  =1 where ࡁ/  = − ࡁ/ = =1 Also, we have the hessian matrix   () = ࡁ/   =  () = ࡁ/  = For a specific group of the parameter, call it  , or  , we have and we have the following hessian matrices: ˜  = ∇2   ࡁ = ˜  = ∇2   ࡁ =   =1 =1 ˜      ˜      (4.46) (4.47) Given ࡁ(), consider updating the current estimates ˘  for a given group  (1 ≤  ≤ ). Suppose that ˜ and ˘ are the current estimates, and define ˘  = ˜ ( ˘). Then ˘ () is updated by solving (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)  . (cid:13)(cid:13)(cid:13)2 (cid:13)(cid:13)2 . (4.48) (4.49) (4.50) (4.51) ࡁ( ˘, ˜) + ˘  (   − ˘ ) + arg min   Similarly, ˘ () is updated by solving arg min   ࡁ( ˜, ˘) + ˘  (  − ˘ ) + The expression for the new coefficients for the penalized problem is ˜  2 (   − ˘ )(   − ˘ ) +   ˜  2 (  − ˘ )(  − ˘ ) +   (cid:33) (cid:19)   ˘ − ˘  (cid:107)2 (cid:17)(cid:32) (cid:17)(cid:18) ˘  − ˘  (cid:16) ˜  (cid:16) ˜  ˘  − ˘ 1 − (cid:107) ˜   +   (cid:107) ˜  ˘ − ˘  (cid:107)2 + ˘ () = ˘ () = ˜  1 − ˜  is the largest eigenvalue of ˘  where ˜  4.1 summarizes the process of estimating  and  by minimizing (). , and ˜  is the largest eigenvalue of ˘  . Algorithm 60 Algorithm 4.1: Algorithm for solving the Gamma Double GLM 1. Initialize ˜ and ˜. 2. (Outer Loop) Update the penalized objective function 4.39. a) (Inner Loop: ). Obtain the minimizer of the objective function 4.39. • Compute ˜, ˜, ˜, ˜ for each  = 1, 2, ..., . • Compute ˜  • Initialize ˘ = ˜. • Repeat the following until ˘ converges. and the maximum eigenvalue ˜  for  = 1, 2, ..., . – Update ˘. For  = 1, 2, ..., , ∗ Compute ˘  = ˜ ( ˘) by 4.44. ∗ Compute ˘ () by 4.50. ∗ Set ˘  = ˘ (). • Set ˜ = ˘. b) (Inner Loop: ). Obtain the minimizer of the objective function 4.39. • Compute ˜, ˜, ˜, ˜ for each  = 1, 2, ..., . • Compute ˜  and the maximum eigenvalue ˜  for  = 1, 2, ..., . • Initialize ˘ = ˜. • Repeat the following until ˘ converges. – Update ˘. For  = 1, 2, ..., , ∗ Compute ˘ = ˜( ˘) by 4.45. ∗ Compute ˘ () by 4.51. ∗ Set ˘  = ˘ (). • Set ˜ = ˘. 3. Repeat until ˜ and ˜ converges. 61 4.3 Convexity In this section, we show the convexity of the problem described in the previous section. We can use an alternative formulation to express the model in equation 4.29. Model the response as where and We have and thus i.e.,  ∼ (, ),  = 1, ...,  ˜  =    =  ˜ ˜ =     ˜ =    ˜+ ˜  =  ˜   =   =    = 1 0 (cid:110)log Ɖ( ˜) + ˜ ˜ − ( ˜ − 1) log  + − ˜(cid:111)   =  +    1 1   The negative log-likelihood function of the equivalent formulation is  =1 ࡁ() = 1  ( ˜, ˜) = 1   =1 (4.52) where  = ( , ). For the gradient descent algorithm to reach to the global optimum, the negative log-likelihood function needs to have a global minimum. Therefore, ࡁ() needs to be a convex function. This function may not be convex in general, because an extreme case of the observations may violate the definition of convexity. However, it is convex with high probability as we increase the sample size. 
4.3 Convexity

In this section, we show the convexity of the problem described in the previous section. We can use an alternative formulation to express the model in equation (4.29). Model the response as
\[
y_i \sim \mathrm{Gamma}(\tilde\nu_i,\tilde\lambda_i), \qquad i=1,\ldots,n,
\]
where $\tilde\nu_i$ is the shape parameter and $\tilde\lambda_i$ is the rate parameter, with
\[
\tilde\nu_i = e^{\boldsymbol{x}_i'\boldsymbol{a}}
\qquad\text{and}\qquad
\tilde\lambda_i = e^{\boldsymbol{x}_i'\boldsymbol{b}}.
\]
We have $\mu_i = \tilde\nu_i/\tilde\lambda_i$ and $\phi_i = 1/\tilde\nu_i$, and thus
\[
\boldsymbol{a} = -\gamma,
\qquad
\boldsymbol{b} = -(\beta+\gamma),
\]
i.e., the parameters of the two formulations are related by a linear transformation. The negative log-likelihood function of the equivalent formulation is
\[
\ell(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_i(\tilde\nu_i,\tilde\lambda_i)
= \frac{1}{n}\sum_{i=1}^n\Big\{\log\Gamma(\tilde\nu_i) - \tilde\nu_i\log\tilde\lambda_i - (\tilde\nu_i-1)\log y_i + \tilde\lambda_i y_i\Big\},
\tag{4.52}
\]
where $\theta=(\boldsymbol{a}',\boldsymbol{b}')'$. For the gradient descent algorithm to reach the global optimum, the negative log-likelihood function needs to have a global minimum. Therefore, $\ell(\theta)$ needs to be a convex function. This function may not be convex in general, because an extreme configuration of the observations may violate the definition of convexity. However, it is convex with high probability as the sample size increases.

Definition 4.1 (Asymptotically convex). We say that a sequence of random functions $f_n(x)$ is asymptotically convex if
\[
\lim_{n\to\infty} P\big(f_n(x)\ \text{is convex}\big) = 1.
\tag{4.53}
\]

We have, for the loss function above, the following theorem. The theorem guarantees that when we have a large sample size, it is very likely that we observe a convex loss function.

Theorem 4.1. The loss function (4.52) is asymptotically convex.

Proof. First, we quantify the variation of $y_i$ and $\log y_i$ using the law of large numbers. Observe that the gamma distribution has finite expectation and variance and that
\[
E(y_i) = \mu_i, \qquad i=1,\ldots,n.
\]
Therefore, the variables $y_i-\mu_i$ have mean zero. By the strong law of large numbers, we have
\[
\frac{1}{n}\sum_{i=1}^n (y_i-\mu_i)\xrightarrow{a.s.} 0 \qquad\text{as } n\to\infty.
\tag{4.54}
\]
Also observe that $E(\log y_i) = \psi^{(0)}(\tilde\nu_i) - \log\tilde\lambda_i$, where $\psi^{(0)}(\cdot)$ is the digamma function. Similarly, we have
\[
\frac{1}{n}\sum_{i=1}^n\big(\log y_i - \psi^{(0)}(\tilde\nu_i) + \log\tilde\lambda_i\big)\xrightarrow{a.s.} 0 \qquad\text{as } n\to\infty.
\tag{4.55}
\]
Equations (4.54) and (4.55) guarantee that the following quantities can be arbitrarily close when $n$ is large:
\[
\frac{1}{n}\sum_{i=1}^n y_i \sim \frac{1}{n}\sum_{i=1}^n \mu_i,
\qquad
\frac{1}{n}\sum_{i=1}^n \log y_i \sim \frac{1}{n}\sum_{i=1}^n\big(\psi^{(0)}(\tilde\nu_i)-\log\tilde\lambda_i\big).
\]
Therefore, when $n$ is large, we have
\[
\ell(\theta) \approx \frac{1}{n}\sum_{i=1}^n\Big\{\log\Gamma(\tilde\nu_i) - \psi^{(0)}(\tilde\nu_i)\,\tilde\nu_i + \log y_i + \tilde\lambda_i y_i\Big\}.
\tag{4.56}
\]
By definition, it is easy to show that
• the sum of two convex functions, and
• the composition of two convex functions
are both convex. By the first rule, it suffices to show that
\[
g_i(\theta) = \log\Gamma(\tilde\nu_i) - \psi^{(0)}(\tilde\nu_i)\,\tilde\nu_i + \log y_i + \tilde\lambda_i y_i
\tag{4.57}
\]
is convex. Again observe that $\log y_i$ is a constant, which is automatically convex. Then observe that $y_i>0$ since it follows a gamma distribution; thus $\tilde\lambda_i y_i = y_i e^{\boldsymbol{x}_i'\boldsymbol{b}}$ is also convex in $\theta$. Therefore, it suffices to show that
\[
h_i(\theta) = \log\Gamma(\tilde\nu_i) - \psi^{(0)}(\tilde\nu_i)\,\tilde\nu_i
\tag{4.58}
\]
is convex. By the second rule, the composition of convex functions is convex. Since $\tilde\nu_i=e^{\boldsymbol{x}_i'\boldsymbol{a}}$ is convex in $\theta$, we only need to show that the function
\[
h(s) = \log\Gamma(s) - s\,\psi^{(0)}(s)
\tag{4.59}
\]
is convex on $s\in(0,\infty)$, since $s=\tilde\nu_i>0$. Denote by $\psi^{(k)}(s)$ the $(k+1)$th derivative of $\log\Gamma(s)$, i.e. the polygamma function of order $k$. We have
\[
h'(s) = -s\,\psi^{(1)}(s)
\qquad\text{and}\qquad
h''(s) = -\psi^{(1)}(s) - s\,\psi^{(2)}(s).
\]
Observing that $h''(s)$ is continuous and decreasing, we have
\[
h''(s) \ge \lim_{s\to\infty} h''(s) = 0, \qquad \forall s\in(0,\infty).
\]
Therefore, the function $h(s)$ in equation (4.59) is convex. We conclude that, with probability converging to one as $n\to\infty$, the loss function is convex. $\Box$

Remark 4.1. Since the parameters in the classic formulation are a linear transformation of the parameters in the second formulation, the negative log-likelihood function of the classic formulation is also asymptotically convex. It does not matter which formulation we use; we may easily obtain the parameters of one formulation from the other by applying the linear transformation.

4.3.1 The Asymptotic Convexity

We draw a random design matrix $X$ of size $n\times p$ from a standard multivariate normal distribution with independent entries. The parameters $\boldsymbol{a}$ and $\boldsymbol{b}$ are also drawn from standard multivariate normal distributions, and the response $y$ is generated according to the gamma distribution. Then we randomly draw $N=10000$ pairs $\theta_1=(\boldsymbol{a}_1,\boldsymbol{b}_1)$ and $\theta_2=(\boldsymbol{a}_2,\boldsymbol{b}_2)$ and check the convexity by definition,
\[
t\,\ell(\theta_1) + (1-t)\,\ell(\theta_2) \ge \ell\big(t\,\theta_1+(1-t)\,\theta_2\big),
\tag{4.60}
\]
where $t$ is randomly drawn from the uniform distribution between 0 and 1. We fix $p=10$ and repeat the simulation for $n=1, 5, 10, 20, 50, 100$. The proportion of pairs of points that violate the convexity definition is given in Table 4.1. The simulation shows that as the sample size increases, there is less chance of encountering a non-convex region. Moreover, the sample size need not be very large to reach this good property.
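The check in (4.60) can be carried out directly. The following R sketch mirrors the experiment described above with an illustrative choice of $n$; it is a simplified stand-in for the dissertation's simulation code.

```r
# Empirical convexity check of the loss (4.52): draw random parameter pairs and
# test whether Jensen's inequality ever fails along the segment between them.
set.seed(1)
n <- 20; p <- 10
X  <- matrix(rnorm(n * p), n, p)
a0 <- rnorm(p); b0 <- rnorm(p)                       # true parameters
y  <- rgamma(n, shape = exp(drop(X %*% a0)), rate = exp(drop(X %*% b0)))

negll <- function(theta) {                           # loss (4.52), theta = c(a, b)
  nu  <- exp(drop(X %*% theta[1:p]))
  lam <- exp(drop(X %*% theta[-(1:p)]))
  mean(lgamma(nu) - nu * log(lam) - (nu - 1) * log(y) + lam * y)
}

violated <- replicate(10000, {
  th1 <- rnorm(2 * p); th2 <- rnorm(2 * p); t <- runif(1)
  negll(t * th1 + (1 - t) * th2) > t * negll(th1) + (1 - t) * negll(th2) + 1e-10
})
mean(violated)   # proportion of sampled pairs violating (4.60)
```

Running this for increasing $n$ reproduces the qualitative pattern summarized in Table 4.1: violations become rare quickly as the sample size grows.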
Table 4.1: Proportion of randomly sampled pairs of points that violate the convexity definition for different sample sizes, each over 5 repetitions. The standard errors over the 5 repetitions are given in parentheses.

    n            1          5          10         20         100
    Proportion   0.1210     0.0231     0.0048     0.0004     0.0000
                 (0.0012)   (0.0009)   (0.0003)   (0.0001)   (0.0000)

4.4 Simulation

In the simulation study, we investigate the performance of the method using two examples. The following models are investigated in the simulation study:

• Double Group Lasso - the mean and dispersion are both modeled with a group lasso penalty.
• Group Lasso - the mean is modeled with a group lasso penalty, with only an intercept term for modeling the dispersion. This is achieved by choosing $\lambda_\gamma$ to be large.
• Double Lasso - the mean and dispersion are both modeled with a lasso penalty; no group structure is assumed in the model.
• Lasso - the mean is modeled with a lasso penalty, with only an intercept term for modeling the dispersion; no group structure is assumed in the model.

All models assume the response follows a Gamma distribution. In both examples, $\lambda_\beta$ and $\lambda_\gamma$ are selected using the Bayesian information criterion (BIC).

4.4.1 Example 1

In each run of the simulation, 500 observations and 200 observations are used for the training and testing datasets, respectively. To create the design matrix, we construct four blocks, each with three covariates. The $k$th block, $S_k$ ($k=1,\ldots,4$), is sampled from a multivariate normal distribution with mean 0 and variance $\Sigma$. For $l,m=1,2,3$, we set $\Sigma_{lm}=1$ when $l=m$ and $\Sigma_{lm}=\rho$ when $l\neq m$, where $\rho=0$ or $0.5$. The design matrix is constructed as $S=(S_1,S_2,S_3,S_4)$. For $i=1,\ldots,n$, $y_i$ is sampled from a Gamma distribution with scale parameter $\theta_i$ and shape parameter $\nu_i$, where
\[
\log(\mu_i) = \log(\nu_i\theta_i) = 1 + \sum_{j=1}^3 (-1)^{j+1}\big(s_{ij}^{(1)} + s_{ij}^{(2)}\big),
\tag{4.61}
\]
\[
\log(\phi_i) = -1 + \sum_{j=1}^3 (-1)^{j}\big(0.5\, s_{ij}^{(1)} + 0.5\, s_{ij}^{(3)}\big),
\tag{4.62}
\]
where $s_{ij}^{(k)}$ is the $i$th observation of the $j$th covariate from the $k$th block. In the construction of the response, $S_1$ and $S_2$ are the only active blocks for the mean, and $S_1$ and $S_3$ are the only active blocks for the dispersion.
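A compact R sketch of this data-generating process is given below. It uses the Gamma parametrization with shape $1/\phi_i$ and scale $\mu_i\phi_i$, so that $E(y_i)=\mu_i$ with dispersion $\phi_i$; the helper name and return format are illustrative, not taken from the dissertation's code.

```r
library(MASS)  # for mvrnorm

# Sketch of the Example 1 generator in (4.61)-(4.62).
gen_example1 <- function(n, rho = 0.5) {
  Sigma  <- matrix(rho, 3, 3); diag(Sigma) <- 1
  blocks <- lapply(1:4, function(k) mvrnorm(n, mu = rep(0, 3), Sigma = Sigma))
  S      <- do.call(cbind, blocks)                      # S = (S1, S2, S3, S4)
  eta_mu  <-  1 + rowSums(sapply(1:3, function(j)
                 (-1)^(j + 1) * (blocks[[1]][, j] + blocks[[2]][, j])))
  eta_phi <- -1 + rowSums(sapply(1:3, function(j)
                 (-1)^j * 0.5 * (blocks[[1]][, j] + blocks[[3]][, j])))
  mu  <- exp(eta_mu)
  phi <- exp(eta_phi)
  y   <- rgamma(n, shape = 1 / phi, scale = mu * phi)   # E(y) = mu, dispersion phi
  list(S = S, y = y)
}

train <- gen_example1(500)   # 500 training observations
test  <- gen_example1(200)   # 200 testing observations
```

Only blocks S1 and S2 enter the mean and only S1 and S3 enter the dispersion, so a well-behaved selection procedure should recover exactly those blocks.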
In the simulation study, we are interested in how well the method selects the true nonzero blocks. We consider a block active in the model if at least one of the coefficients in the block is nonzero. We quantify the performance by counting the number of correctly active blocks (C) along with the number of incorrectly active blocks (IC).

Table 4.2 summarizes the average simulation results for the Lasso model and the Double Lasso model. The Double Lasso model performs better than the Lasso model with respect to the log-likelihood and the Gini index calculated on the testing dataset. While the Double Lasso model and the Lasso model both appropriately estimate $S_1$ and $S_2$ as nonzero when modeling the mean, the Double Lasso model incorrectly estimates nonzero blocks less often than the Lasso model.

Table 4.2: Average simulation results for 100 runs for the Lasso and Double Lasso models. The standard error is provided in the parentheses.

                                   Mean            Dispersion
                                   C      IC       C      IC       BIC                log-likelihood    Gini Index
    rho = 0     Double Lasso       2      1.25     2      1.17     -119.86 (28.92)    12.61 (6.12)      0.863 (0.005)
                Lasso              2      1.60     —      —        260.76 (26.96)     -28.34 (6.04)     0.852 (0.005)
    rho = 0.5   Double Lasso       2      1.44     2      1.52     398.54 (35.63)     -27.08 (8.46)     0.791 (0.009)
                Lasso              2      1.76     —      —        659.07 (32.29)     -57.77 (7.55)     0.787 (0.008)

Table 4.3 summarizes the average simulation results for the Grouped Lasso model and the Double Grouped Lasso model. The Double Grouped Lasso model performs better than the Grouped Lasso model with respect to the log-likelihood and the Gini index calculated on the testing dataset. While the Double Grouped Lasso model and the Grouped Lasso model both appropriately estimate $S_1$ and $S_2$ as nonzero when modeling the mean, the Double Grouped Lasso model incorrectly estimates nonzero blocks less often than the Grouped Lasso model.

Table 4.3: Average simulation results for 100 runs for the Grouped Lasso and Double Grouped Lasso models. The standard error is provided in the parentheses.

                                            Mean            Dispersion
                                            C      IC       C      IC       BIC                log-likelihood    Gini Index
    rho = 0     Double Grouped Lasso        2      0.18     2      0.58     -115.23 (28.84)    11.62 (6.15)      0.853 (0.005)
                Grouped Lasso               2      1.07     —      —        263.39 (27.06)     -28.10 (6.06)     0.854 (0.006)
    rho = 0.5   Double Grouped Lasso        2      1.04     2      0.80     400.07 (37.65)     -28.09 (8.50)     0.792 (0.009)
                Grouped Lasso               2      1.18     —      —        665.39 (32.67)     -58.40 (7.61)     0.781 (0.011)

In general, the Double Lasso and Double Grouped Lasso methods outperform the Lasso and Grouped Lasso methods, respectively. In addition, by modeling the dispersion, the model for the mean more accurately reflects the true relationship. That is, it is more likely that $S_3$ and $S_4$ will be incorrectly included in the model for the mean when the dispersion is not modeled.

4.4.2 Example 2

Consider a scenario where we have a binary variable $x_{i1}$, with 0 indicating a good driver and 1 indicating a bad driver. We want to estimate the premium for each driver $i=1,\ldots,1000$. Assume further that we have 9 additional highly correlated explanatory variables that do not affect our response. Formally,
\[
y_i \sim \mathrm{Gamma}(\mu_i,\phi_i), \qquad i=1,\ldots,n,
\]
where
\[
\mu_i = E(y_i) = \exp\{\log(100) + \log(100)\,x_{i1}\}
\qquad\text{and}\qquad
\phi_i = \exp\{\log(1) + \log(10)\,x_{i1}\}.
\]
We compare the results of two different models, Double Lasso and Lasso. The data are simulated 1000 times. Table 4.4 shows the proportion of replications where the $\beta$ and $\gamma$ coefficients related to $x_1$ and $x_2$ (noise) are nonzero. In general, the Double Lasso model appropriately shrinks the coefficients related to $x_2$ while accurately estimating the coefficients related to $x_1$.

Table 4.4: Proportion of replications that the coefficient is nonzero in the Double Lasso model.

                x1                  x2 (noise)
    beta        0.9941 (0.0042)     0.0445 (0.0113)
    gamma       0.9941 (0.0042)     0.0653 (0.0135)

The value at risk (VaR) is also estimated for each simulation, and the resulting distributions are shown in Figure 4.1 and Figure 4.2. The plots on the left represent the 95% VaR for good drivers ($x_1=0$), while the 95% VaR for bad drivers ($x_1=1$) is shown in the plots on the right. The vertical lines represent the true 95% VaR in each case. Since the Double Lasso method models the dispersion as a function of $x_1$, the estimated 95% VaR is much closer to the true value compared to the Lasso model. The Lasso model underestimates the 95% VaR for good drivers while overestimating the 95% VaR for bad drivers. Since the risk capital is underestimated for good drivers, there is a risk of insolvency. Since the risk capital is overestimated for bad drivers, it is likely that the company will lose business from bad drivers.

Figure 4.1: 95% VaR for Double Lasso model
Figure 4.2: 95% VaR for Lasso model
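As a point of reference for Figures 4.1 and 4.2, the true 95% VaR under the Gamma specification above can be computed in closed form from the quantile function. The following R lines are an illustrative check rather than part of the simulation code.

```r
# True 95% VaR for Example 2: a Gamma with mean mu and dispersion phi has
# shape 1/phi and scale mu*phi, so the VaR is a Gamma quantile.
var95 <- function(mu, phi, level = 0.95) qgamma(level, shape = 1 / phi, scale = mu * phi)

c(good = var95(mu = 100,       phi = 1),    # x1 = 0
  bad  = var95(mu = 100 * 100, phi = 10))   # x1 = 1
# A model with a single pooled dispersion cannot match both groups' tails at
# once, which is what the comparison of Figures 4.1 and 4.2 illustrates.
```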
4.5 Empirical Analysis

4.5.1 Data

For our case study, we obtained data from the Office of the Commissioner of Insurance (OCI) of Wisconsin. The dataset contains claim descriptions with loss amounts from the Wisconsin Local Government Property Insurance Fund (LGPIF). Table 4.5 is a summary of the frequency and severity of loss amounts for the 6030 events recorded in the LGPIF dataset. For the analysis, we focus only on claims of the Vandalism event type. The resulting 2084 events have an average property damage amount of $2,695. The maximum loss amount in the sample was caused by vandalism to a pool, which resulted in a loss of $981,598.

Table 4.5: Summary of loss amounts by event type

    Event type    Frequency    Severity
    Vandalism     2084         2,695
    Vehicle       1079         4,274
    Lightning     955          11,156
    Misc          465          24,846
    WaterW        464          78,071
    Wind          403          25,606
    WaterNW       269          33,081
    Fire          217          82,187
    Hail          94           137,480
    Total         6030         19,674

Figure 4.3 gives the reader a better idea of the distribution of words found in the event narratives. The count of each word is shown for the four most common event types. The reader may observe that the words vandalism and glass are common in descriptions of Vandalism events, whereas light, pole, and hit are common in descriptions of Lightning events.

Figure 4.3: Frequency of words in descriptions of Vandalism events

Given that different event types have different average severities, as implied by Table 4.5, we may infer that certain keywords may be indicative of the magnitude of property losses.

4.5.2 Methods

By using cosine similarities, we can extract features from textual descriptions of losses. For example, if we consider the word vandalism, the cosine similarity gives a value between -1 and 1, increasing in the word's relationship with the observed sentence. If we consider multiple words, then we may obtain multiple explanatory variables, which can be used in a multivariate regression model. The question is, which words should be used for extracting the explanatory variables? Previous work has focused on words carefully selected by a human expert. This approach worked for short textual descriptions of insurance losses. However, as the descriptions become longer, we may require a method that is more clever. One possible approach would be to try every possible word in the English language, and then use variable selection methods to reduce the model to one that includes only significant variables. Another possible approach would be to try only the words observed in the descriptions within the dataset. We use the latter approach in this case study.

Figure 4.4: Cosine similarities of selected words with log loss amounts for Vandalism

Figure 4.4 shows a plot of the cosine similarities for selected words against the corresponding loss amounts in log scale. The Spearman correlations between these similarities and the loss amounts for the nine words are shown in Table 4.6. Keywords shown in Table 4.6 may be words that are indicative of large losses. Similarly, words with small correlation may be indicative of small losses. Figure 4.5 illustrates the effect of the presence of selected words in the narrative. The plot illustrates that, for instance, the occurrence of the word caused increases the average loss amount of the event by a significant amount.
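To fix ideas, the sketch below shows one way to turn a claim description into a single similarity score against a keyword using pre-trained word vectors such as GloVe. Representing the description by the average of its word vectors is an illustrative choice here, and the object names are placeholders; the construction actually used follows the earlier chapters of the dissertation.

```r
# Cosine similarity between a claim description and a keyword, using rows of a
# pre-trained embedding matrix (one row per vocabulary word, e.g. GloVe).
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

description_similarity <- function(description, keyword, embeddings) {
  words <- strsplit(tolower(description), "\\s+")[[1]]
  words <- intersect(words, rownames(embeddings))
  if (length(words) == 0) return(NA_real_)
  doc_vec <- colMeans(embeddings[words, , drop = FALSE])  # average word vector
  cosine(doc_vec, embeddings[keyword, ])
}

# Example call (an embedding matrix named `embeddings` is assumed to exist):
# description_similarity("broken window glass at the park shelter",
#                        "vandalism", embeddings)
```

Repeating this for every unique word in the corpus yields one similarity covariate per word, which is the raw material for the design matrix constructed next.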
Table 4.6: Spearman correlation of cosine similarities and loss amounts

    Word          vehicle   computers   equipment   stolen   computer   missing   fire    lot     damaged
    Correlation   0.334     0.321       0.290       0.285    0.282      0.214     0.148   0.142   0.137

Figure 4.5: Effect of word indicator on loss amount

We construct the design matrix $B=(B_1,B_2,\ldots,B_p)$ using the cosine similarities. To do this, consider the matrix of basis functions $B_i=\big(B_{i,1}, B_{i,2},\ldots,B_{i,p}\big)$, where $B_{i,j}=\big(b_1(s_{i,j}), b_2(s_{i,j}), b_3(s_{i,j})\big)$ with $s_{i,j}=\mathrm{sim}_{\cos}(d_i,w_j)$ for description $d_i$ and unique word $w_j$. Here, the $b_k(\cdot)$ are cubic penalized regression spline (P-spline) basis functions with penalty order 1. For an overview of the P-spline basis functions, the reader may refer to Wood (2017). Then, from the QR decomposition $B_j'\mathbf{1}_n = Q_{j,1}R_j + Q_{j,2}\,0$, take $Q_{j,2}$ and form the design matrix $B=\big(B_1Q_{1,2}, B_2Q_{2,2},\ldots,B_pQ_{p,2}\big)$. Under this framework, the response variable is related to the covariate $s_{i,j}$ via a smooth function
\[
f_j(s_{i,j}) = B_{i,j}\,Q_{j,2}\,\beta_j,
\tag{4.63}
\]
where $\beta_j$ is the coefficient vector corresponding to the $j$th word describing observation $i$. This construction ensures that the basis functions satisfy $\sum_{i=1}^n f_j(s_{i,j})=0$ for each $j=1,\ldots,p$.

4.5.3 Results

Figure 4.6 shows the cross-sectional solution path of the algorithm for three selected words. The reader may verify that the group lasso is applied to both the mean and the dispersion of the gamma distribution. Notice that the three lines corresponding to a single word converge to zero at the same tuning parameter value.

Figure 4.6: Solution path of the algorithm

Ten-fold cross validation using the Spearman correlation of the predicted loss amounts is used to determine the tuning parameters for the model. The tuning parameters for the final model are $\lambda_\beta = 0.7040$ and $\lambda_\gamma = 1.0822$. The final model contained 172 words for modeling the mean and 130 words for modeling the dispersion. Figures 4.7 and 4.8 show the curve estimates for 9 selected words for the mean and the dispersion, respectively. Note that the curve estimates for the dispersion associated with computer and damaged are zero. Looking back at Figure 4.4, we can see how the model has captured the change in the mean and variance of the loss amount as the cosine similarity changes.

Figure 4.7: β curve estimates
Figure 4.8: γ curve estimates

The Spearman correlation between the validation sample loss amounts and the predicted loss amounts from the model turns out to be 48.68%, while the Spearman correlation for the training sample is 66.38%. Plots of the predicted losses versus the actual losses in the training and validation samples are shown in log scale in Figure 4.9.

Figure 4.9: Prediction versus out-of-sample claims (log scale)
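The tuning-parameter search described above can be sketched as follows. The fitting and prediction routines named here, fit_dglm() and predict_mu(), are hypothetical stand-ins for the group lasso Gamma double GLM fit and its mean prediction; only the cross-validation scaffolding is shown.

```r
# Ten-fold cross-validation over a grid of (lambda_beta, lambda_gamma),
# scored by the Spearman correlation between held-out losses and predictions.
cv_spearman <- function(B, y, lambda_b, lambda_g, K = 10) {
  folds  <- sample(rep(1:K, length.out = length(y)))
  scores <- sapply(1:K, function(k) {
    fit <- fit_dglm(B[folds != k, , drop = FALSE], y[folds != k], lambda_b, lambda_g)
    cor(y[folds == k],
        predict_mu(fit, B[folds == k, , drop = FALSE]),
        method = "spearman")
  })
  mean(scores)
}
# The (lambda_beta, lambda_gamma) pair maximizing cv_spearman over the grid is
# retained for the final model.
```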
4.6 Concluding Remarks

In this paper, we explored the use of the group lasso technique to predict non-normal loss amounts based on covariates derived from textual descriptions of the losses. Our contribution to the literature is the extension of the group lasso technique to the case where the mean and the dispersion are both modeled with a penalty term, as well as the application of the methodology in the textual data analysis context. The approach has applications in the insurance case reserving problem, and it shows promising results in the real data analysis performed using a training sample and a validation sample. A limitation of our approach is that a vector representation of words must be available for the described approach to work. The methodology may be vulnerable to spelling errors in the descriptions of the losses; however, for the presented dataset, spelling errors were minimal and did not turn out to be a problem in terms of the prediction results. The methodology may also suffer from descriptions with a large number of abbreviations, as is sometimes the case with insurance claim adjuster notes. Dealing with such messy textual descriptions is potential future work.

CHAPTER 5

CONCLUSION

In this dissertation, we introduced a framework for incorporating textual data into insurance claims modeling, and considered its applications in claims management processes. An insurance claim representative is responsible for investigating a claim in order to determine the handling process. We explored the use of word similarities as a tool for modeling insurance claims and mitigating insurance risks.

In Chapter 2, we illustrated how short textual descriptions can be leveraged in the GAM framework for insurance claim classification and risk mitigation. The method relies on a subject matter expert to choose the words most impactful to the response. In Chapter 3, we generalized the previous work to longer textual descriptions without the need for an expert. When we consider all unique words found in the dataset as explanatory variables and impose a GAM, the resulting design matrix is high-dimensional. For this reason, we used a group lasso penalty to reduce the number of coefficients in the model. The scalable, analytical framework proposed provides a parsimonious and interpretable model. We discussed the implications of the analysis, including how the framework may be used by an insurance company. In Chapter 4, we showed how a group lasso type penalty can be incorporated into both the mean and the dispersion parameterization of a Gamma model, and illustrated its use in a predictive analytics application in actuarial science. In particular, we applied the method to an insurance claim prediction problem involving textual data analysis methods. Simulations illustrated the variable selection and model fitting performance of our method.

Our results demonstrate how text mining technology can be incorporated into a traditional regression analysis. The methodology is applicable in many different areas where textual data arise. Possible applications of our approach for an insurance risk manager include:

• Classification of claims based on textual descriptions of the claims
• Classification of policyholders based on textual descriptions of the policyholders
• Prediction of insurance claims at the claim level
• Prediction of insurance claims at the policyholder level
• Analysis of insurance claims and risk mitigation

We make some remarks on the current limitations of our framework, where potential improvements can be made.

• Under the current framework, words not found in the word embedding matrix cannot be used in the modeling.
• The similarity threshold is selected using heuristics by a human expert, under the current framework.
• Because pre-determined word embedding matrices are limited to one-grams (single words) at the time of writing, the incorporation of n-grams (phrases longer than one word used as a search key) remains an open question.
• Further linguistic barriers may exist if the textual descriptions are longer than those appearing in the dataset used for this paper. Examples include polysemy, false friends, and compound words.
• In order to use the proposed method, insurers that focus on specific insurance segments may be constrained to build their own word embedding matrices, as the terms appearing in the claim descriptions may be specific to the field.
For example, a medical insurer may find GloVe insufficient, and may need a word embedding matrix trained on medical terms in order to use our proposed approach.

APPENDICES

APPENDIX A

THREE STAGE APPROACH THEORETICAL RESULTS

In this section, we provide the statistical foundation for the proposed three stage approach. For this reason, we derive the convergence rate of our third-stage estimator. This will establish the statistical consistency of our procedure. In Yang & Maiti (2018), the following result for the second stage estimator has been established.

Lemma A.1 (K. Yang and T. Maiti, 2018). The adaptive group lasso consistently selects the true active predictors in probability, i.e., the estimator $\hat\beta_{AGL}$ satisfies
\[
P\Big(\|\hat\beta_{AGL}(j)\|_2 > 0\ \ \forall j\in T,\ \text{ and }\ \|\hat\beta_{AGL}(j)\|_2 = 0\ \ \forall j\notin T\Big)\to 1,
\tag{A.1}
\]
where $T$ denotes the index set of the truly active predictors.

The result states that with proper choices of $\lambda_1$ and $\lambda_2$, the adaptive group lasso consistently selects the true nonzero predictors. This guarantees the selection consistency of the 3-stage algorithm, since the variable selection is done in the second stage and the third stage does not perform variable selection. It is important for an algorithm to select the correct subset of variables for the model built on them to work.

With similar assumptions, assume we have

Assumption 1. The true functions $f_1,\ldots,f_p$ have smoothness of order $c_n$, i.e.
\[
\int_a^b f_j''(x)^2\,dx \asymp c_n \quad\text{for each } j,
\]
where $a_n\asymp b_n$ means there exist constants $0<c\le C$ such that $c\le a_n/b_n\le C$.

Then, we have

Theorem A.1. Under Assumption 1 and the assumptions in Yang & Maiti (2018), for tuning parameters $\lambda_{31},\ldots,\lambda_{3p}$, we have
\[
\|\hat f - f_0\|_2^2
= O_p\!\left(\frac{m_n\log(m_n p_n)}{c_0\rho_1^2\, n}
+ \frac{m_n^2\,\rho_2^2\, m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\, c_n\right),
\tag{A.2}
\]
where, following the notation of Yang & Maiti (2018), $c_0$, $\rho_1$ and $\rho_2$ are the assumed bounds on the eigenvalues of $B$, $m_n$ is the number of spline basis functions, and $d$ is the order of the spline approximation.

Theorem A.1 shows the rate of convergence of the third stage estimator. There are three terms in the convergence rate: the estimation error, the spline approximation error, and the regularization error. The greater the smoothness functional $c_n$, the smaller the tuning parameters $\lambda_{3j}$ can be taken, so the product does not change. This theorem guarantees that with proper choices of the parameters, the estimated functions are consistent estimators of the true functions that describe the relationship between the variables and the response.

Proof. Consider the third step, where we have the smoothness penalty. Define the event
\[
\mathcal{S} = \{\hat T = T\}.
\]
The previous lemma showed that
\[
P(\mathcal{S})\to 1 \qquad\text{as } n\to\infty.
\]
From now on, we condition on the event $\mathcal{S}$. For convenience, we suppress the subscripts and write $\hat\beta$, $\beta^0$ and $B$ for $\hat\beta_{\hat T}$, $\beta^0_{T}$ and $B_{T}$. To study the characteristics of the smoothness term, note that
\[
\int_a^b f_j''(x)^2\,dx = \int_a^b \beta_j'\,B_j''(x)\,B_j''(x)'\,\beta_j\,dx = \beta_j' V_j \beta_j,
\qquad
V_j = \int_a^b B_j''(x)\,B_j''(x)'\,dx.
\]
Without loss of generality, consider the case where the knots are evenly distributed on the interval $[a,b]$, since changing the length of the intervals does not change the shape of the B-splines but only their span and height (Schumaker, 1981). In the following calculations, we normalize the interval $[a,b]$ so that each knot interval has length $h$. According to Huang et al. (2010), assume the constant length of the interval satisfies $h = O(n^{-v})$ with $0 < v < 0.5$.
The $k$th cubic B-spline basis function can be derived from the definition. Writing $t = x/h - k$, so that its support corresponds to $t\in[0,4]$, it is the piecewise cubic
\[
B_{k,4}(x) =
\begin{cases}
\dfrac{t^3}{6}, & 0\le t\le 1,\\[4pt]
\dfrac{-3t^3+12t^2-12t+4}{6}, & 1\le t\le 2,\\[4pt]
\dfrac{3t^3-24t^2+60t-44}{6}, & 2\le t\le 3,\\[4pt]
\dfrac{(4-t)^3}{6}, & 3\le t\le 4,\\[4pt]
0, & \text{otherwise},
\end{cases}
\]
and we have $B(x)=\{B_{k,4}(x),\ k=1,\ldots,m_n\}$. Taking derivatives, the second derivative of the basis function satisfies
\[
B_{k,4}''(x) = O(h^{-2}) = O(K^2).
\]
Therefore, the elements
\[
V_{k,l} = \int_a^b B_{k,4}''(x)\,B_{l,4}''(x)\,dx = O(K^3), \qquad k,l=1,\ldots,m_n,
\]
and $V_{k,l}$ equals exactly zero if $|k-l|>3$. As a direct result, the eigenvalues of the matrix $V_j$ are bounded from above by $O(K^3)$ and from below by some constant. Similarly, if we use a quadratic B-spline, the corresponding elements are bounded from above by $O(K^4)$ and from below by some constant.

Then we begin the convergence rate part. For a converging sequence $\alpha_n$, define
\[
t = \frac{\alpha_n}{\alpha_n + \|\hat\beta-\beta^0\|_2},
\]
and consider the convex combination $\beta^* = t\,\hat\beta + (1-t)\,\beta^0$. We have $\beta^*-\beta^0 = t(\hat\beta-\beta^0)$, which implies
\[
\|\beta^*-\beta^0\|_2 = t\,\|\hat\beta-\beta^0\|_2 = \frac{\alpha_n\,\|\hat\beta-\beta^0\|_2}{\alpha_n+\|\hat\beta-\beta^0\|_2}\le \alpha_n.
\tag{A.3}
\]
This means $\beta^*$ is within a small distance of $\beta^0$ if $\alpha_n$ is small, and we are safe to use a Taylor expansion. Moreover, if $\|\beta^*-\beta^0\|_2\le \alpha_n/2$, then
\[
\frac{\alpha_n\,\|\hat\beta-\beta^0\|_2}{\alpha_n+\|\hat\beta-\beta^0\|_2}\le\frac{\alpha_n}{2},
\]
which gives $\|\hat\beta-\beta^0\|_2\le\alpha_n$, and hence
\[
\|\hat\beta-\beta^0\|_2\le 2\,\|\beta^*-\beta^0\|_2.
\]
Therefore, it is sufficient to derive the convergence rate for $\beta^*$. Consider the Taylor expansion of the empirical loss $L_n(\beta)$ around $\beta^0$,
\[
L_n(\beta^*)-L_n(\beta^0)
= -\frac1n\,(y-\mu^0)'B\,(\beta^*-\beta^0)
+ \frac12\,(\beta^*-\beta^0)'\,\Sigma(\beta^{**})\,(\beta^*-\beta^0),
\]
where $\mu^0$ is the expectation of $y$ at $\beta^0$, and $\Sigma(\beta^{**})$ is the covariance matrix of $y$ evaluated at $\beta^{**}$, which is located on the line segment joining $\beta^0$ and $\beta^*$. By the definition of $\beta^*$ and the convexity of the stage-3 objective, we have
\[
L_n(\beta^*) + \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{*\prime}V_j\beta_j^*
\;\le\;
L_n(\beta^0) + \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{0\prime}V_j\beta_j^0 .
\]
Combining this with the Taylor expansion result, and using the norm inequality together with the Cauchy-Schwarz inequality to control the spline approximation error, we obtain
\[
\frac14(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0)
\le \frac1n\big|(y-\mu^0)'B(\beta^*-\beta^0)\big|
+ \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{0\prime}V_j\beta_j^0
+ O\big(m_n^2\rho_2^2 m_n^{-2d}\big).
\]
Applying the Cauchy-Schwarz inequality once more to the first term on the right-hand side gives
\[
\frac18(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0)
\le \frac{2}{n^2}\,\big\|\Sigma^{-1/2}(\beta^{**})\,B'(y-\mu^0)\big\|_2^2
+ \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{0\prime}V_j\beta_j^0
+ O\big(m_n^2\rho_2^2 m_n^{-2d}\big).
\]
Consider also the penalty matrix $D_j$, which has entries $D_{k,l}=1$ if $k=l=1$, $D_{k,l}=2$ if $k=l\neq1$, and $D_{k,l}=-1$ if $|k-l|=1$. The matrix $D_j$ is of order $O(K)$.
Since $\beta^0$ is a constant vector, each term $\beta_j^{0\prime}V_j\beta_j^0$ is of order $c_n$. Rearranging the terms and applying a concentration inequality, see for example Yang & Maiti (2018), we have
\[
\frac18(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0)
= O_p\!\left(\frac{m_n\log(m_np_n)}{n} + m_n^2\rho_2^2 m_n^{-2d}\right)
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n .
\]
By Remark 2.1 in Yang & Maiti (2018), we have
\[
\frac{c_0\rho_1^2}{8}\,\|\beta^*-\beta^0\|_2^2
\le \frac18(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0).
\]
Combining the above results with the conditioning event $\mathcal S$, we have, conditioning on $\mathcal S$,
\[
\|\beta^*-\beta^0\|_2^2
= O_p\!\left(\frac{m_n\log(m_np_n)}{c_0\rho_1^2\,n}
+ \frac{m_n^2\rho_2^2 m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n\right).
\]
By the inequality
\[
E(X) = E(X\mid \mathcal S)P(\mathcal S) + E(X\mid \mathcal S^c)P(\mathcal S^c)
\le E(X\mid\mathcal S) + o(1),
\]
applied with $P(\mathcal S^c)\to 0$, the same bound holds unconditionally:
\[
\|\beta^*-\beta^0\|_2^2
= O_p\!\left(\frac{m_n\log(m_np_n)}{c_0\rho_1^2\,n}
+ \frac{m_n^2\rho_2^2 m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n\right).
\]
Combining this with the argument at the beginning of the proof, we have
\[
\|\hat f - f_0\|_2^2
= O_p\!\left(\frac{m_n\log(m_np_n)}{c_0\rho_1^2\,n}
+ \frac{m_n^2\rho_2^2 m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n\right). \qquad\Box
\]

APPENDIX B

THREE STAGE APPROACH SIMULATION RESULTS

In this section, we outline the process of constructing the simulated datasets along with the results of our method.

B.1 Data

Several simulated datasets were constructed using various numbers of observations and covariates. For a prespecified number of observations $n$, the covariate value $x_{ij}$ is randomly generated from $N(0,1)$ for each $i=1,\ldots,n$. In addition, for each $j=1,\ldots,p$, the component function $f_j$ is selected from a series of function types: zero, linear, polynomial, exponential, logarithmic, and sinusoidal. The function types are selected randomly with probabilities 0.6, 0.05, 0.2, 0.05, 0.05, and 0.05, respectively. The coefficients used in each function are also randomly selected. The exact functions used in the simulation study can be found in the appendix.

B.2 Simulation Results

The performance of our method is summarized through three characteristics: out-of-sample prediction, function misclassification, and computational performance. In addition, performance results for other models are included for comparison. These include results from intermediate steps, such as Initial Group Lasso and Adaptive Group Lasso, as well as other models such as True Selected, the GAM fit to only the truly nonzero functions, and mgcv, the GAM fit to all functions using the mgcv package.

Out-of-sample prediction is measured by the mean squared prediction error (MSPE), with a training to testing set ratio of 80:20. Figure B.1 shows the comparison of the out-of-sample prediction performance for each method when $X$ has 800 observations and 200 covariates. In addition to the aforementioned comparison models, the Null Model and the true $\sigma^2$ (True Sigma) were added. This plot is fairly representative of the plots constructed for each simulated sample. The MSPE from Step 3 decreases as $\lambda$ decreases, but may increase slightly at the smallest values of $\lambda$. For the best $\lambda$, the MSPE of Step 3 is typically very similar to that of True Selected and True Sigma, with the MSPE of mgcv being relatively close as well. Note that the MSPE of Step 2 is higher than that of the null model, which is expected due to the overfitting nature of that model. This further emphasizes the importance of Step 3.

Figure B.1: Mean squared prediction error for each method when X has 800 observations and 200 covariates.

Figure B.2 shows the performance of the 3-step approach against the other methods, for varying numbers of observations and covariates. These plots simply omit Step 2 for increased readability.
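The data-generating process of Section B.1 can be sketched in a few lines of R. The exact functional forms and coefficient ranges below are illustrative stand-ins (the dissertation lists its exact functions separately); only the structure, covariates from $N(0,1)$ and component types drawn with the stated probabilities, is taken from the text.

```r
# Sketch of the Appendix B simulated-data construction: each covariate is
# N(0,1) and its component function is drawn from zero, linear, polynomial,
# exponential, logarithmic, or sinusoidal shapes with probabilities
# 0.6, 0.05, 0.2, 0.05, 0.05, 0.05.
gen_gam_data <- function(n, p, sigma = 1) {
  X <- matrix(rnorm(n * p), n, p)
  shapes <- sample(c("zero", "linear", "poly", "exp", "log", "sin"), p,
                   replace = TRUE, prob = c(0.6, 0.05, 0.2, 0.05, 0.05, 0.05))
  f <- function(x, shape, a = runif(1, 0.5, 2)) {   # random coefficient
    switch(shape,
           zero   = 0 * x,
           linear = a * x,
           poly   = a * (x^2 - 1),
           exp    = a * exp(x),
           log    = a * log(abs(x) + 1),
           sin    = a * sin(pi * x))
  }
  fx <- sapply(1:p, function(j) f(X[, j], shapes[j]))
  y  <- rowSums(fx) + rnorm(n, sd = sigma)
  list(X = X, y = y, shapes = shapes)
}
```

Datasets of different dimensions, such as the 800-by-200 case used in Figure B.1, are obtained by varying n and p in this construction.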
Figure B.2: Mean squared prediction error from the 3-step approach against other methods, at various dimensions of X.

We now consider the four misclassification types: zero, the percentage of truly zero functions estimated as nonzero; nonzero, the percentage of truly nonzero functions estimated as zero; linear, the percentage of truly linear functions estimated as nonlinear or zero; and nonlinear, the percentage of truly nonlinear functions estimated as linear or zero. Note that nonlinear functions do not include zero functions. Figure B.3 shows the misclassification rates for Step 2, Step 3, and the gamsel model.

Figure B.3: Misclassification error rate for Step 2, Step 3, and the gamsel model, for each misclassification type.

Moreover, Figures B.4 and B.5 show the estimated functions for the simulated dataset used in Figure B.1 from the 3-step approach and the mgcv model, respectively. The 3-step approach provides function estimates very similar to the true functions, with confidence bands that accurately capture the observations. The mgcv model does not perform as well, and has much wider confidence bands.

Figure B.4: 3-step approach function estimates for representative covariates from the model corresponding to Figure B.1.

Figure B.5: mgcv function estimates for representative covariates from the model corresponding to Figure B.1.

Finally, we consider method runtime as a measure of computational performance. Figure B.7 shows the runtime, in seconds, of each model for the various simulated datasets. The mgcv model has a shorter runtime than the 3-step method at the best lambda in subplots A, B, C, and D. However, as the dimension of the dataset becomes larger, the 3-step method at the best lambda runs faster. In Figure B.7, the mgcv model had a runtime of about 12.55 hours, compared to about 1.12 hours for the 3-step method at the best lambda.

The performance of the final model with respect to the three characteristics discussed is directly affected by the choice of $\lambda$ in the adaptive group lasso step. In Figure B.1, smaller values of lambda provide for reduced mean squared prediction error. However, in Figure B.7, smaller values of lambda generally lead to longer runtimes. These observations emphasize the importance of selecting an appropriate lambda in the adaptive lasso step.

Figure B.6: Runtime, in seconds, for each method when X has 800 observations and 100 covariates.

Figure B.7: Runtime, in seconds, from the 3-step approach and the mgcv model, at various dimensions of X.

APPENDIX C

GAMMA DOUBLE GLM WITH LASSO ALGORITHM RESULTS

We want to justify the update schemes in Equation (4.48) and Equation (4.49) by showing that the penalized objective function in Equation (4.39) decreases with each update.

Lemma C.1. Let $u$ be a column vector of length $p$, let $A$ be a positive definite $p\times p$ matrix, and let $\gamma$ be the largest eigenvalue of $A$. Then
\[
u'Au \le \gamma\,\|u\|_2^2 .
\]

Proof. Letting $a=u/\|u\|_2$, it is equivalent to show $a'Aa\le\gamma$. Consider the eigendecomposition $A=\Gamma\Lambda\Gamma'$, where $\Gamma=(\xi_1\cdots\xi_p)$ is orthonormal and $\Lambda$ is a diagonal matrix with $\lambda_1,\ldots,\lambda_p$ on the diagonal. The columns of $\Gamma$ form an orthonormal basis and are the eigenvectors corresponding to $\lambda_1,\ldots,\lambda_p$. Therefore, there exist $c_1,\ldots,c_p$ such that $a=\sum_{i=1}^p c_i\xi_i$. Since $\|a\|_2=1$, we have $\sum_{i=1}^p c_i^2=1$. Moreover,
\[
a'Aa = \langle a, Aa\rangle = \Big\langle \sum_{i=1}^p c_i\xi_i,\; A\sum_{i=1}^p c_i\xi_i\Big\rangle .
\]
By the definition of an eigenvalue, $A\xi_i=\lambda_i\xi_i$ for $i=1,\ldots,p$. Then
\[
a'Aa = \Big\langle\sum_{i=1}^p c_i\xi_i,\ \sum_{i=1}^p c_i\lambda_i\xi_i\Big\rangle
= \sum_{i=1}^p \lambda_i c_i^2
\le \Big(\max_{i=1,\ldots,p}\lambda_i\Big)\sum_{i=1}^p c_i^2
= \gamma .
\]
The inequality holds since $A$ is positive definite, so that all $\lambda_i$, $i=1,\ldots,p$, are positive. $\Box$

Result C.1. For the update of a fixed group $j$ of the $\gamma$ parameters,
\[
\ell_Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big)
\le \ell_Q\big(\breve\beta,\breve\gamma\big)
+ \breve U_j^{\gamma\prime}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)
+ \frac{\tilde\gamma_j^\gamma}{2}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)'\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big),
\]
and the analogous inequality holds for the $\beta$ update.

Proof. Write $\delta=\breve\gamma_j(\mathrm{new})-\breve\gamma_j$, so that $\breve\gamma(\mathrm{new})$ differs from $\breve\gamma$ only in the $j$th group. Substituting $\breve\gamma(\mathrm{new})$ into the quadratic expansion (4.38) and collecting the terms that involve $\delta$, the linear terms sum to $\breve U_j^{\gamma\prime}\delta$, where $\breve U_j^\gamma=U_j^\gamma(\breve\theta)$ is the group gradient in (4.45), and the quadratic term is $\frac12\,\delta'\breve H_j^\gamma\delta$ with $\breve H_j^\gamma=\sum_{i=1}^n\tilde t_i\,\boldsymbol{x}_{ij}\boldsymbol{x}_{ij}'$. Hence
\[
\ell_Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big)
= \ell_Q\big(\breve\beta,\breve\gamma\big) + \breve U_j^{\gamma\prime}\delta + \frac12\,\delta'\breve H_j^\gamma\delta .
\]
Applying Lemma C.1 with $A=\breve H_j^\gamma$, whose largest eigenvalue is $\tilde\gamma_j^\gamma$, gives $\frac12\,\delta'\breve H_j^\gamma\delta\le\frac{\tilde\gamma_j^\gamma}{2}\,\delta'\delta$, and the result follows. $\Box$

Lemma C.2. Each iteration of the inner loops in Algorithm 4.1 always decreases the penalized objective function in Equation (4.39).

Proof. It is sufficient to show that the update schemes in Equation (4.48) and Equation (4.49) always decrease the penalized objective function in Equation (4.39). Without loss of generality, fix $j\in\{1,\ldots,J\}$ and consider the update of $\gamma_j$. Then we simply need to show that
\[
Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big) - Q\big(\breve\beta,\breve\gamma\big)\le 0 .
\]
Then,
\[
\begin{aligned}
Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big) - Q\big(\breve\beta,\breve\gamma\big)
&= \ell_Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big) + \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j(\mathrm{new})\big\|_2
 - \ell_Q\big(\breve\beta,\breve\gamma\big) - \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j\big\|_2\\
&\le \breve U_j^{\gamma\prime}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)
 + \frac{\tilde\gamma_j^\gamma}{2}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)'\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)
 + \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j(\mathrm{new})\big\|_2 - \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j\big\|_2\\
&\le 0,
\end{aligned}
\]
where the penalty terms for groups other than $j$ cancel, the first inequality holds by Result C.1, and the final inequality holds because $\breve\gamma_j(\mathrm{new})$ minimizes the majorized objective in Equation (4.49), whose value at $\gamma_j=\breve\gamma_j$ reduces to $\ell_Q(\breve\beta,\breve\gamma)+\lambda_\gamma w_j^\gamma\|\breve\gamma_j\|_2$. The argument for the $\beta_j$ update in Equation (4.48) is identical. $\Box$

BIBLIOGRAPHY

Antweiler, Werner & Murray Frank. 2004. Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance 59.
Chollet, Francois & J. J. Allaire. 2018. Deep learning with R. Manning Publications Co.
Chouldechova, Alexandra & Trevor Hastie. 2015. Generalized additive model selection. arXiv preprint, arXiv:1506.03850v2 [stat.ML].
Dreassi, Emanuela, M. Giovanna Ranalli & Nicola Salvati. 2014. Semiparametric M-quantile regression for count data. Statistical Methods in Medical Research 23(6). 591–610. doi:10.1177/0962280214536636. PMID: 24847899.
Frees, Edward W. 2009. Regression modeling with actuarial and financial applications. Cambridge University Press.
Frees, Edward W., Gee Y. Lee & Lu Yang. 2016. Multivariate frequency-severity regression models in insurance. Risks 2016(4).
Friedman, Jerome, Trevor Hastie & Robert Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1). 1–22.
Gentzkow, Matthew, Jesse Shapiro & Matt Taddy. 2016. Measuring polarization in high-dimensional data: method and application to congressional speech. NBER Working Paper No. 22423.
Goldberg, Yoav. 2017. Neural network methods for natural language processing. Morgan & Claypool Publishers.
Goodfellow, Ian, Yoshua Bengio & Aaron Courville. 2016. Deep learning. MIT Press.
Hastie, T. & R. Tibshirani. 1986. Generalized additive models. Statistical Science 297–310.
Hastie, T. J. & R. J. Tibshirani. 1990. Generalized additive models. Chapman and Hall.
Hastie, Trevor, Robert Tibshirani & Jerome Friedman. 2009. The elements of statistical learning: Data mining, inference, and prediction, second edition. Springer Science & Business Media.
Hatzivassiloglou, Vasileios & Janyce Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. Proceedings of COLING.
Huang, J., L. Horowitz & F. Wei. 2010. Variable selection in nonparametric additive models. Annals of Statistics 38(4). 2282–2313.
Karlgren, Jussi & Douglass Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. Proceedings of COLING.
Kearney, Susan. 2010. Insurance operations. The Institutes.
Lee, Gee Y., Scott Manski & Tapabrata Maiti. 2019. Actuarial applications of word embedding models. ASTIN Bulletin. doi:10.1017/asb.2019.28.
Li, Zheyuan & Simon N. Wood. 2019. Faster model matrix crossproducts for large generalized linear models with discretized covariates. Statistics and Computing. doi:10.1007/s11222-019-09864-2.
Marra, Giampiero & Simon Wood. 2011. Practical variable selection for generalized additive models. Computational Statistics and Data Analysis 55. 2372–2387.
Pang, Bo, Lillian Lee & Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP).
Pennington, Jeffrey, Richard Socher & Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. http://www.aclweb.org/anthology/D14-1162.
Qian, Wei, Yi Yang & Hui Zou. 2016. Tweedie's compound Poisson model with grouped elastic net. Journal of Computational and Graphical Statistics 25(2). 606–625.
Schumaker, L. 1981. Spline functions: Basic theory. John Wiley & Sons, New York.
Simonoff, Jeffrey S. 1996. Smoothing methods in statistics. Springer, New York, NY.
Smyth, Gordon K. 1989. Generalized linear models with varying dispersion. Journal of the Royal Statistical Society, Series B (Methodological) 51(1). 47–60.
Smyth, Gordon K. & Bent Jørgensen. 2002. Fitting Tweedie's compound Poisson model to insurance claims data: dispersion modelling. ASTIN Bulletin 32(1). 143–157.
Sokolova, Marina & Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information Processing and Management 45. 427–437.
Stephens-Davidowitz, Seth. 2014. The cost of racial animus on a black candidate: Evidence using Google search data. Journal of Public Economics 118. 26–40.
Tetlock, Paul. 2007. Giving content to investor sentiment: the role of media in the stock market. The Journal of Finance 62.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 267–288.
Wang, Hansheng & Chenlei Leng. 2008. A note on adaptive group lasso. Computational Statistics & Data Analysis 52(12). 5277–5286.
Wood, Simon. 2013. On p-values for smooth components of an extended generalized additive model. Biometrika 100. 221–228.
Wood, Simon. 2019. mgcv: Mixed GAM computation vehicle with automatic smoothness estimation. https://CRAN.R-project.org/package=mgcv. R package version 1.8-31.
Wood, Simon N. 2017. Generalized additive models: An introduction with R, second edition. CRC Press.
Wood, Simon N., Yannig Goude & Simon Shaw. 2015. Generalized additive models for large data sets. Journal of the Royal Statistical Society, Series C 64(1). 139–155.
Wood, Simon N., Zheyuan Li, Gavin Shaddick & Nicole H. Augustin. 2017. Generalized additive models for gigadata: Modeling the U.K. black smoke network daily data. Journal of the American Statistical Association 112(519). 1199–1210.
Yang, Kaixu & Taps Maiti. 2018. Ultra high dimensional classification consistency with generalized additive model. Technical Report, Michigan State University.
Yang, Yi & Hui Zou. 2015. A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing 25(6). 1129–1141.
Yang, Yi & Hui Zou. 2017. gglasso: Group lasso penalized learning using a unified BMD algorithm. https://CRAN.R-project.org/package=gglasso. R package version 1.4.
Yuan, Ming & Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68. 49–67.
Zou, Hui. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476). 1418–1429.