PREDICTIVE MODEL BUILDING FOR UTILIZING WORD EMBEDDING MODELS: APPLICATIONS IN INSURANCE DATA By Scott Manski A DISSERTATION Michigan State University in partial fulfillment of the requirements Submitted to for the degree of Statistics – Doctor of Philosophy 2020 ABSTRACT PREDICTIVE MODEL BUILDING FOR UTILIZING WORD EMBEDDING MODELS: APPLICATIONS IN INSURANCE DATA By Scott Manski Textual data contains a vast amount of information, yet for many researchers it has not been clear how the information could be used for an empirical analysis. Often times textual data are ignored or discarded in statistical analyses because regression and other statistical methods require numeric covariates. This dissertation will demonstrate how cutting-edge text mining technologies can improve empirical analyses by transforming textual data into numeric explanatory variables, thus allowing textual data to be incorporated into a statistical analysis. By transforming the textual data, the number of explanatory variables often becomes larger than the number of observations. For this reason, we explore the application of generalized additive models in tandem with adaptive lasso. In addition, we construct an algorithm for fitting a Gamma double generalized linear model with a group lasso penalty. Through this, we show how useful information can be extracted from textual data. We show how our methods can be applied through several insurance claims examples. We believe that our work can be widely used for other observational researchers in economics, business, statistical science, and social science. Copyright by SCOTT MANSKI 2020 ACKNOWLEDGEMENTS I would like to express the utmost gratitude to Dr. Taps Maiti and Dr. Gee Lee, for their assistance, support, guidance, and encouragement throughout my time at MSU. I would also like to thank my entire graduate committee, including Dr. Hyokyoung Hong and Dr. Chenhui Guo. Their service is greatly appreciated. Furthermore, I would like to thank the Department of Statistics and Probability for the resources and opportunities provided to me during my Ph.D. career. iv TABLE OF CONTENTS . . . . . . 1.1 Using Text as Data LIST OF TABLES . . . . . LIST OF FIGURES . . . . . LIST OF ALGORITHMS . . . KEY TO ABBREVIATIONS . CHAPTER 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii . . x . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Word Occurrence Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2.1 Word2vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2.2 Global Vectors for Word Representation (GloVe) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 4 4 6 7 1.1.3 Word similarities . 1.1.4 Data Size Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.2 Lasso and Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.2.1 Adaptive Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.2 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2.3 Adaptive Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 xi . . . 
CHAPTER 2 CLAIMS CLASSIFICATION AND RISK MITIGATION USING SHORT 2.1 2.2 Data . 2.3 Applications . . . . . . . . . . . . . . . . . . . . . Introduction . TEXTUAL DESCRIPTIONS . . . . . . . . . . . . . . . . . . . . . . . . . 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.2 Textual Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1 Claims Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1.2 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 Risk Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.2.2 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.2.3 Model diagnostic . . . . . . . . . . . . . . . . . . . . . . . . . . 32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 . . 2.4 Concluding Remarks . Introduction . CHAPTER 3 A THREE STAGE APPROACH TO MODEL BUILDING . . . . . . . . . . 35 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2 Data and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 The Model and Basis Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4 The 3-Stage Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 . . . . . . . . . v 3.5 Algorithm . 3.4.1 3.4.2 3.4.3 3.4.4 Tuning Parameters . Stage 1 – Group lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 . 44 Stage 2 – Adaptive group lasso . . . . . . . . . . . . . . . . . . . . . . . . 45 Stage 3 – The smoothness penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Stage 1 – Group lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 . 46 Stage 2 – Adaptive group lasso . . . . . . . . . . . . . . . . . . . . . . . Stage 3 – The smoothness penalty . . . . . . . . . . . . . . . . . . . . . . 46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 . 3.6 Results . . 3.7 . . 3.8 Concluding Remarks Implications . . 3.5.1 3.5.2 3.5.3 . . . . . . . . . . . . . . . . . . CHAPTER 4 GAMMA DOUBLE GENERALIZED LINEAR MODEL WITH GROUP . . . . . . . . . 4.1 . . . . . . . . 4.2 Model 4.3 Convexity . LASSO . . . . . . Introduction . 4.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2.1 One Dimensional Unpenalized Problem . . . . . . . . . . . . . . . . . . . 52 4.2.2 One Dimensional Penalized Problem . . . . . . . . . . . . . . . . . . . . . 54 4.2.3 Gamma Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . . 55 4.2.4 Gamma Double Generalized Linear Model . . . . . . . . . . . . . . . . . 57 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
62 4.3.1 The Asymptotic Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 . 4.4.1 Example 1 . 4.4.2 Example 2 . 4.5 Empirical Analysis 4.5.1 Data . . . 4.5.2 Methods . 4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Concluding Remarks . . 4.4 Simulation . . . . . . . . . . . CHAPTER 5 CONCLUSION . APPENDICES . . . APPENDIX A APPENDIX B APPENDIX C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 . THREE STAGE APPROACH THEORETICAL RESULTS . . . . . 79 THREE STAGE APPROACH SIMULATION RESULTS . . . . . . 85 GAMMA DOUBLE GLM WITH LASSO ALGORITHM RESULTS 93 . . . . . . . . . BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 vi LIST OF TABLES Table 2.1: Summary statistics of losses by claim category . . . . . . . . . . . . . . . . . . 21 Table 2.2: Claim Categories for Training and Validation Datasets . . . . . . . . . . . . . . 23 Table 2.3: Words used for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Table 2.4: Confusion matrix for multinomial model . . . . . . . . . . . . . . . . . . . . . 25 Table 2.5: Summary statistics of explanatory variables (vandalism model) . . . . . . . . . . 29 Table 2.6: GAM model summary (vandalism model) . . . . . . . . . . . . . . . . . . . . . 33 Table 3.1: Summary statistics for the log(loss) for the training and validation datasets. . . . 38 Table 3.2: Summary statistics for the final model. The Residual DF comes from the estimated degrees of freedom from the GAM, and the MSPE is the out of sample mean squared prediction error. . . . . . . . . . . . . . . . . . . . . . . 47 Table 4.1: Proportion of randomly sampled pairs of points that violate the convexity definition for different sample sizes, each over 5 repetitions. The standard error of the 5 repetitions are given in parenthesis. . . . . . . . . . . . . . . . . . 65 Table 4.2: Average simulation results for 100 runs for the Lasso and Double Lasso models. The standard error is provided in the parentheses. . . . . . . . . . . . . . . . . . 67 Table 4.3: Average simulation results for 100 runs for the Grouped Lasso and Double Grouped Lasso models. The standard error is provided in the parentheses. . . . . 67 Table 4.4: Proportion of replications that the coefficient is nonzero in the Double Lasso . . . . . . . model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Table 4.5: Summary of loss amounts by event type . . . . . . . . . . . . . . . . . . . . . . 70 Table 4.6: Spearman correlation of cosine similarities and loss amounts . . . . . . . . . . . 72 vii LIST OF FIGURES Figure 1.1: Plot of () for different values of , with  = 100 . . . . . . . . . . . . . . Figure 1.2: Example of a two dimensional projection of the word embeddings . . . . . . . . 7 9 Figure 1.3: The number of unique words as a function of the number of documents in the NOAA dataset. . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 11 Figure 2.1: Two dimensional projection of the word embeddings for common words . . . . 18 Figure 2.2: Distribution of common words . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 2.3: Average accuracy, error rate, and precision in log scale . . . . . . . . . . . . . . 26 Figure 2.4: GAM model plots for vandalism . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 2.5: Words that relate with high vandalism losses . . . . . . . . . . . . . . . . . . . 32 Figure 3.1: Frequency for the most common words. . . . . . . . . . . . . . . . . . . . . . . 38 Figure 3.2: Cosine similarity against property loss for house and thunderstorm. . . . . . . 39 Figure 3.3: Function estimates for several covariates. . . . . . . . . . . . . . . . . . . . . 47 Figure 3.4: Words with the highest cosine similarity with house. . . . . . . . . . . . . . . 48 Figure 3.5: Predicted property loss amounts against the true property loss amounts for the validation sample. The Spearman correlation is 76.06%. . . . . . . . . . . . 49 Figure 4.1: 95% VaR for Double Lasso model . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 4.2: 95% VaR for Lasso model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Figure 4.3: Frequency of words in descriptions of Vandalism events . . . . . . . . . . . . . 70 Figure 4.4: Cosine similarities of selected words with log loss amounts for Vandalism . . . 71 Figure 4.5: Effect of word indicator on loss amount . . . . . . . . . . . . . . . . . . . . . . 72 Figure 4.6: Solution path of the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 4.7:  curve estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 viii Figure 4.8:  curve estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Figure 4.9: Prediction versus out-of-sample claims (log scale) . . . . . . . . . . . . . . . . 75 Figure B.1: Mean squared prediction error for each method when  has 800 observations and 200 covariates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Figure B.2: Mean squared prediction error from the 3 step approach against other methods, at various dimensions of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 Figure B.3: Misclassification error rate for step 2, step 3, and gamsel model, for each misclassification type. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Figure B.4: 3 step approach function estimates for representative covariates from the model corresponding to figure B.1. . . . . . . . . . . . . . . . . . . . . . . . . 89 Figure B.5: mgcv function estimates for representative covariates from the model corre- sponding to figure B.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Figure B.6: Runtime, in seconds, for each method when  has 800 observations and 100 covariates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Figure B.7: Runtime, in seconds, from the 3 step approach the mgcv model, at various dimensions of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 . . . . ix LIST OF ALGORITHMS Algorithm 4.1: Algorithm for solving the Gamma Double GLM . . . . . . . . . . . . . . . 
61 x KEY TO ABBREVIATIONS CBOW Continuous bag-of-words CNN Convolutional Neural Networks GAM Generalized Additive Model GLM Generalized Linear Model GloVe Global Vectors for Word Representation IBNR Incurred but not Reported LGPIF Local Government Property Insurance Fund NOAA National Oceanic and Atmospheric Administration RNN Recurrent Neural Networks xi CHAPTER 1 INTRODUCTION 1.1 Using Text as Data With the advance in technology, data are being collected on a much larger scale. While traditional data types such as numerical and categorical data are still being collected, there has been an increase in the collection of digital text data. In many cases, text data can provide more information than a series of traditional variables, but extracting this vital information is not trivial. Text mining has been utilized in a variety of contexts, including but not limited to spam filtering, sentiment analysis, customer churn, stock returns, and politics. Sentiment analysis is the process of categorizing texts based on the message of the text, usually positive or negative. Pang et al. (2002) explored the use of Naive Bayes, maximum entropy classification, and support vector machines to predict the overall opinion, positive or negative, of a movie review. Similar work includes using discriminant analysis to categorize texts into pre-determined genres by Karlgren & Cutting (1994), and finding features indicated by the use of subjective language by Hatzivassiloglou & Wiebe (2000). Text mining has also been used substantially in the prediction of stock prices. Antweiler & Frank (2004) analyzed internet stock message boards and found that the messages help to predict market volatility. Moreover, Tetlock (2007) found that high media pessimism predicts downward pressure on the market, and unusually high or low pessimism predicts high trading volume. Each of the aforementioned examples are primarily concerned with prediction as opposed to inference of relationships between the response and the explanatory variables. Some authors have found text analysis to be helpful in telling a story using the relationship between the response and the explanatory variables. Stephens-Davidowitz (2014) explored racially charged language in Google search queries to explain how much racial animus costs a black presidential candidate. Gentzkow et al. (2016) studied congressional speeches and found that partisanship is far greater in recent years, with a sharp increase in the early 1990s. 1 In almost all situations where text data are utilized, the analysis can be summarized in the following way: 1. Represent the raw text D as some numeric vector or array  2. Map  to predicted values ˆ of unknown response  3. Use ˆ in subsequent analysis In the following sections, we will discuss a variety of methods used to construct . We start by exploring the use of indicator variables and count data to represent text in Section 1.1.1. In Section 1.1.2, we introduce word embeddings along with several methods for obtaining word embedding matrices. Finally, in Section 1.1.3 we discuss how we will be using cosine similarities to measure the similarity between words. 1.1.1 Word Occurrence Methods The most common and simplest way of constructing  is by using indicator variables for the occurrence of words. We will define D as the collection of documents typically in the form of a column vector. Furthermore, we will denote the set of words in the th document as , for  = 1, ..., |D|.. Finally, let  be the set of all unique words found in D. 
Then we can define the entries of  as   = for  = 1, ..., |D|,  = 1, .., ||. Consider the following example where you have textual descriptions for 3 observations, 0 otherwise 1 if   ∈   2 Lightning struck a building D = A lightning strike hit a building Lightning struck a tree  The response in our example could be the amount of damage, in dollars, caused by the event. By our definition, 1 = {lighning, struck, a, building}, for example. Furthermore, Then  = {lighning, struck, a, building, strike, hit, tree}.   = Ò  1 1 1 1 0 1  1 1 1   1 1 0 0 1 0 Ò 0 1 0  0 0 1  An extension of the indicator method is to count the number of occurrences of word   in each textual description. Using the count method, since the word a is the only word that occurs more than once in a description, we have  =  Ò  1 1 1 1 0 1  1 2 1   1 1 0 0 1 0 Ò 0 1 0  0 0 1  If we conduct a lexical analysis of the first and second textual descriptions, we see Lightning noun struck verb a building noun A lightning adj. strike noun hit verb a building. noun While the sentence structure of the first and second textual descriptions are different, it is clear that their meanings are essentially equivalent. Since the sentence structure is different,  and  are different and they do not appropriately capture this similarity in meaning. Additionally, the only difference between  and  in the example is the third column, the column associated with the word a. While the use of the word a is necessary in creating grammatically sound sentences in our example, it does not add any additional information in regards to our response. Common words 3 that typically add little information to the meaning of a sentence are called stop words. Examples of stop words include a, the, and, at, and for. In practice, stop words are typically removed from  and  ( = 1, ..., ) to reduce the number of columns in  and to reduce the number of computations when constructing . 1.1.2 Word Embeddings So far, we have only represented text through indicator variables or counts. While this is easy to understand and easy to work with, it is a very basic way of summarizing text. Instead, word embedding representations map words into some vector space. The vector space consists of a word embedding for each word in the vocabulary. The word embeddings are determined by optimizing some objective function, such as a likelihood for word occurrences, on a library of texts. There are several algorithms for determining the word embeddings. We will be introducing two of these algorithms, word2vec and Global Vectors for Word Representation (GloVe). 1.1.2.1 Word2vec One approach to obtaining the word embeddings is to use the word2vec algorithm. The word2vec algorithm can be illustrated using a simplified example. Suppose we use the following seven words as our simplified vocabulary list:  = {lighning, struck, a, building, strike, hit, tree} = {1, 2, . . . , 7} Consider the sentence: lightning struck a building with center word 2, and context words  = { ˜1, ˜3}. What we want is some vector representation  of center words and ˜ of context words. These word embedding matrices are obtained by letting an algorithm read through billions of sentences, maximizing a log likelihood, treating  and ˜ as parameters to be estimated. 
For this, we can specify the probability of observing a context word 4 ˜  given a center word 2 by (cid:0) ˜ |2(cid:1) = exp(cid:0) ˜  · 2(cid:1) || =1 exp ( ˜ · 2) Going back to the sentence, lightning struck a building, using a naive Bayes assumption (where we assume the conditional independence of the events of observing context words given a center word), the negative log likelihood becomes:  = − log  ( ˜1|2) − log  ( ˜3|2) = −  ˜  · 2 + || log  ˜ ∈ exp ( ˜ · 2) ˜∈ The negative log likelihood is minimized by gradient descent. Note that analytical formulas for the gradient can be obtained based on the likelihood. In practice using the analytical forms of the gradient speeds up the convergence, and makes the algorithm more stable. Extensions of the gradient descent algorithm such as Adam optimizers may be used as well. The  and ˜ matrices start from a random initial matrix, and is updated each step of the gradient descent iteration. Repeating this process for millions and billions of center words and context words, an input word matrix  and a context word matrix ˜ is learned. Word2vec may use either the continuous bag-of-words (CBOW) model or the continuous skip- gram model, depending on how the log likelihood is defined. In the CBOW model, the log likelihood represents the probability of observing the center word within a window of context words. In the continuous skip-gram model, the probability of observing the context word given a center word is used. Hence, the example shown above would correspond to a continuous skip-gram model. Nowadays, word embedding matrices trained by word2vec can be downloaded from Google’s website (https://code.google.com/archive/p/word2vec/). An extension of word2vec, which uses character n-grams, is fastText, which can be downloaded from (https://fasttext. cc). FastText is a library for learning word embeddings, created by Facebook’s AI Research (FAIR) lab. 5 1.1.2.2 Global Vectors for Word Representation (GloVe) In this paper, we use the pre-trained word embeddings obtained via an algorithm called GloVe, developed by Pennington et al. (2014). We note that other methods to create word embedding matrices exist. From our perspective, the end result of word2vec and GloVe is similar. We chose GloVe because it is a straightforward algorithm based on word counts, and the approach is well documented with an emphasis on reproducibility. In practice, the GloVe algorithm has additional benefits over word2vec in that the algorithm is more easily parallelized. GloVe word embedding matrices can be downloaded from https://nlp.stanford.edu/projects/glove/. From this website, the 300 dimension word vectors containing 400 thousand vocabularies, trained over 6 billion tokens appearing in the Wikipedia corpus, has been downloaded. The algorithm used for this particular word embedding is to minimize a cost function, which has the form || || (cid:0),(cid:1)(cid:0) · ˜ +  + ˜ − log(, + 1)(cid:1)2  = =1 =1 where || is the size of the vocabulary, , ˜ are bias terms, , are entries of the co-occurrence matrix for all the words found in the corpus over which the algorithm is being applied,  and ˜ are the word embeddings corresponding to the position in the co-occurrence matrix ,, and  is a weighting function:  () = (/) if  <  1 otherwise with  = 100 and  = 3/4. The motivation for  = 3/4 is empirical, and with this choice of the parameter, the performance of the model happens to improve when compared to  = 1. In practice, we may have any 0 <  < 1. 
The reader may understand this is a way to give more weight to rare word combinations found in the corpus. Intuitively, the co-occurrence matrix , records how often two words occur together in the training corpus. The algorithm attempts to find a word embedding for each word that gives a dot product that is close to the transformed co-occurrence matrix entry of the word combination. The resulting word embedding gives a large dot product for those combinations of words that have a high co-occurrence 6 Figure 1.1: Plot of () for different values of , with  = 100 matrix entry, and a low dot product for those combinations of words that correspond to zero matrix entries. Note that most of the entries of the matrix would be zeros. For additional details on the GloVe algorithm, we refer to the original paper by Pennington et al. (2014). Word embedding matrices obtained this way have numerous applications. In the natural language processing literature, word embeddings are used to construct neural networks that can predict missing words in sentences, or translate sentences in one language to another. Neural network based text models are powerful in terms of their prediction. Meanwhile, we are interested in models that allow us to interpret the relationship between explanatory variables and response variables. One way this can be achieved is to consider projecting the vector representation of words onto a set of axes, which we are able to understand. Given a vector representation of phrases, instead of using the vectors directly inside a prediction algorithm, one may consider projecting the vectors onto axes defined by a set of keywords. 1.1.3 Word similarities Once the word vectors are obtained using approaches outlined in Section 1.1.2, explanatory variables can be formed using cosine similarity of words. The cosine similarity between two words with 7 0501001502000.00.20.40.60.81.0x= 1/4xY(x)0501001502000.00.20.40.60.81.0x= 3/4xY(x)0501001502000.00.20.40.60.81.0x= 1xY(x) nonzero vector representations  and  is given by simcos(, ) =  ·  ||||2 · ||||2 (1.1) One may think geometrically, and interpret the cosine between two unit vectors as the dot product of the two vectors. The dot product ranges between -1 and 1, with 1 indicating identical vectors, and -1 indicating two vectors that point in the opposite direction. For example, the cosine similarity between the words building and buildings is 0.7946, while the cosine similarity between building and hello is −0.02810. Figure 1.2 shows a projection of the word vectors in a two-dimensional space for an example set of words. In the figure, notice that library and museum appear close to each other, since they have similar functions, and hence may appear in similar contexts. Also, graffiti, vandalism, theft, stolen all appear at a similar location on the plot. Imagine drawing an arrow from the point (0, 0) to the word, and the reader may see that the vector corresponding to each of these words are very similar to one another. The angle between the words is small, and hence the cosine of the angle between the words would be large (close to 1). Another way to say this is that the dot product between the words is large. Now consider the word hail and its corresponding vector, and compare it with the vector corresponding to graffiti. The two words are somewhat unrelated, and hence the angle between these two words is large. 
Another way to say this is that the cosine between the unit vectors corresponding to the two vectors is negative, or in other words the dot product is negative. Now consider two documents 1 = {1, 2, ..., } and 2 = {1, 2, ..., }. Let  ( = 1, ..., ) be the vector representation for word , and let   (  = 1, ..., ) be the vector representation for word  . Goldberg (2017) suggests using the similarity metric sim∗cos(1, 2) = simcos (, ) .   =1 =1 Essentially this metric assumes the vectors within a document can be added to form a vector representing the entire document. We tried using this metric, and discovered that the results could 8 Figure 1.2: Example of a two dimensional projection of the word embeddings be improved using a different metric. Hence, in this paper, we use the similarity metric (cid:18) (cid:19) . simcos(1, 2) = max =1,..., max =1,..., (simcos(, )) Thus the cosine similarity between two documents corresponds to the maximum cosine similarity between any two words found within the documents. In particular, the similarity between a single word  and  = {1, 2, ..., } is given by simcos(, ) = max =1,..., (simcos(, )) . (1.2) Defining the features this way is equivalent to max-pooling the features of a one-dimensional convolutional neural network. Max-pooling tends to work better than alternatives such as average pooling, as explained in page 120 of Chollet & Allaire (2018). When used with a single word , the similarity will be 1 if the particular word appears in the claim description. Hence, in some sense, the similarity can be thought of as a detector of whether a particular word or concept appears in the claim description. This provides more interpretability, as it is a generalization of indicators for word appearance. Continuing the example described in Section 1.1.1, we can calculate  using equation 1.2, the formula for cosine similarities, 9 −100−50050100−150−100−50050100150V1V2vandalismgraffititheftstolenairporttrafficmuseumzoobuildingsbldgfenceshingleshailstormlightningwaterfloodlibraryglasswindowtruck ˜ =  Ò 1.000 1.000 1.000  1.000 0.684 1.000  1.000 1.000 1.000  1.000 1.000 0.356  0.507 1.000 0.507 Ò 0.684 1.000 0.684  0.230 0.230 1.000  Notice that the cosine similarities for the words lightning and a are 1 for all three cases because the words appear in each of the three documents. In addition, recall that 1 = {lightning, struck, a, building}. The cosine similarity between the word tree and 1 is 0.230. The word within 1 that achieves this maximum cosine similarity is the word a. This further emphasizes the importance of removing stop words from the textual descriptions. By removing stop words from each description and from , we construct  as   = Ò 1.000 1.000 1.000  1.000 0.684 1.000  1.000 1.000 0.238  0.507 1.000 0.507 Ò 0.684 1.000 0.684  0.228 0.228 1.000  When using cosine similarities, we are often more interested in the similarities closer to 1, while smaller similarities provide less information about the word of interest. For this reason, we can calculate the components of  as  () = simcos(,  ) (cid:0)simcos(,  ) ≥ (cid:1) for  = 1, ..., ,  = 1, ..., || (1.3) where 0 <  ≤ 1. Continuing our example, (0.3) is (0.3) = Ò 1.000 1.000 1.000  1.000 0.684 1.000  1.000 1.000 0  0.507 1.000 0.507 Ò 0.684 1.000 0.684   0 0 1.000  Moreover, note that as  increases to 1, () becomes closer to . Specifically, (1) = . 
10 1.1.4 Data Size Considerations The accuracy and efficiency of any statistical method applied to  is heavily influenced by the choice of . Recall that the number of words in  is the number of columns that will be in . As we did with the example in Section 1.1.1, we could let  be the set of all words that appear in D. Using the National Oceanic and Atmospheric Administration (NOAA) dataset, Figure 1.3 illustrates how the number of unique words is a function of the number of documents, or observations. It is clear that the number of unique words will plateau at some number of documents, which follows our intuition. Eventually the addition of another document in the dataset will not contain any words that have not appeared in the previous documents. Figure 1.3: The number of unique words as a function of the number of documents in the NOAA dataset. It is clear that choosing  to be the set of all words that appear in D will be difficult in any statistical analysis due to the large number of covariates. For this reason, we explore a variety of other options for choosing . The first method for determining  is having a subject matter expert manually choose words they believe to be the most related to the variable of interest. This method has the advantage of allowing the expert to choose as many or as few important words as they see fit. However, several possible issues may arise. To start, the subject matter expert will need to be familiar enough with cosine similarities to make appropriate choices. In addition, the expert will also have their own 11 biases that may influence the analysis. In Chapter 2 we will show how this method can be effective in claims classification and risk mitigation. We could also consider using a data driven method for choosing . Since the majority of words will most likely not be strong predictors of the response variable, we can assume that the true model will be sparse. Therefore, we choose to use Lasso to select our set of words . In Section 1.2 we discuss the Lasso estimator along with several generalizations. In Chapters 3 and 4 we show how Lasso type penalties can be used to select . 1.2 Lasso and Generalizations In this section we describe the Lasso estimator along with generalizations of the Lasso method. Tibshirani (1996) originally proposed the lasso, a method combining the least squares loss with an ࡁ1-constraint. Suppose we have a response vector  of length , an  ×  covariate matrix , unknown intercept 0, and unknown parameter vector  of length . Then the lasso estimator is expressed || − 01 −  ||2 2 subject to || ||1 ≤ , (1.4) (cid:26) 1 2 arg min 0, (cid:27) where 1 is a vector of  ones. The tuning parameter  determines the total size of the coefficients, and therefore shrinks the coefficients to zero as  increases. Often, after standardizing the covariates, the lasso problem is rewritten in the Lagrangian form (cid:26) 1 2 arg min 0, (cid:27) || −  ||2 2 + || ||1 , (1.5) for some tuning parameter  ≥ 0. The tuning parameter  is directly related to the tuning parameter  from Equation 1.4. Now that we have introduced the lasso estimator, we will discuss several generalizations. While variable selection by lasso is consistent under some conditions, the lasso method was extended to improve the consistency of variable selection. 12 1.2.1 Adaptive Lasso Zou (2006) established a necessary condition for variable selection to be consistent. By doing so, they showed the existence of certain scenarios where the lasso estimator is inconsistent. 
For this reason, the adaptive lasso was proposed. Choosing  > 0, the adaptive lasso solves  1 arg min   =1  , 2 (cid:107)  −  (cid:107)2 2 +   | | (1.6) where   = 1/| ˜ |, and ˜ is some initial estimate. Given the initial estimates, it can be shown that equation 1.6 is convex in . When  < , the least squares estimates can be used as the initial estimates. When  ≥ , the least squares estimates are not defined, however Huang, Ma, and Zhang (2008) showed that, under some general conditions, the marginal least squares estimates can be used as initial estimates. Zhang, Jeng, and Liu (2008) showed that using the lasso estimates as the initial estimates in adaptive lasso improves the sparsity recovery rate. In addition, Furthermore, Zou (2006) showed that the adaptive lasso recovers the true model under more general conditions than the lasso when the initial estimates are √  consistent. 1.2.2 Group Lasso In many applications of regression, the covariates may have some natural group or block structure, and it may be sensible to have all the coefficients within the group be estimated as zero or nonzero together. The most common example of this is the use of categorical variables. In practice, a categorical variable is typically coded as a series of dummy variables, and we want to either include or exclude the entire set of dummy variables from the model. This scenario provides motivation for the group lasso. Consider a linear regression model with  groups of covariates, and each group contains   covariates for  = 1, ..., . That is, we have design matrix  = ( 1 2 · · · ) where  is an =1   matrix, and  = ( 1 2 · · · ) where   is a vector of length   for each  = 1, ..., .  × 13 (cid:112)  (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2  =1  , Yuan & Lin (2006) proposed the group lasso which solves arg min  2 (cid:107)  − 01 −  (cid:107)2 2 +  (1.7) for some tuning parameter  ≥ 0. In addition to categorical data, group lasso can be used in the additive model framework. If we apply some basis expansion to a numeric explanatory variable, then the expansion could now be thought of having a group-like structure. This idea will be explored in Chapters 3 and 4. 1.2.3 Adaptive Group Lasso Wang & Leng (2008) combined these extensions to formulate the adaptive group lasso, and showed the ability of the method to identify the true model consistently. For some tuning parameter  ≥ 0, the adaptive group lasso solves where   = 1/(cid:13)(cid:13) ˜  (cid:13)(cid:13) arg min  2 (cid:107)  − 01 −  (cid:107)2 2 +    (1.8) 2, and ˜ is some initial estimate. In addition, Wang & Leng (2008) proved estimation consistency and selection consistency for the adaptive group lasso estimator. We will illustrate how adaptive group lasso can be utilized in Chapter 3.  =1 (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2  ,  1  1 14 CHAPTER 2 CLAIMS CLASSIFICATION AND RISK MITIGATION USING SHORT TEXTUAL DESCRIPTIONS 2.1 Introduction When an insurance claim arises, one of the first things that is reported to a claims manager of an insurance company is a short textual description of the insurance claim, along with demographic information regarding the policyholder. Losses may be reported via a notice form, where one of the common forms used is the Association for Cooperative Operations Research and Development (ACORD) form. See page 7.16 of Kearney (2010). Upon this initial report, the claims department is responsible for a number of important tasks. 
The claims department would identify the policy and set adequate reserves, contact the insured, investigate the claim, document the claim, determine the precise cause of loss, liability, and the loss amount. In this process, the claims manager would coordinate with the actuarial department in order to predict the amount of insurance claim and set the adequate case reserve for each claim. In practice, parametric loss models are fit to data, and the resulting distribution is used for the prediction of the ultimate claim amount. In this process, regression analysis helps us understand the relationship among variables. The goal of a standard regression model is to understand the relationship between the response variable , and explanatory variables , and predict future responses under a set of assumptions. The relationship  {E[]} =  , where typically a distributional assumption is imposed on the error, allows us to interpret the relationship between  and  in a systematic way. In case  is a binary response, logistic regression can be used. For a general overview of regression, see Frees (2009). When forming the  matrix for empirical research using traditional approaches, useful information is discarded from the analysis, because traditional regression analysis requires numeric descriptor variables. Textual information has been one example. In this chapter, we demonstrate how we can implement the tools discussed in Section 1.1 so 15 that traditional analysis can be performed. The extracted information is used to build a regression model for the response variable of interest. Through this demonstration, we show how information can be extracted from textual data. We demonstrate two illustrative case studies in insurance claims classification, and insurance risk mitigation. We believe these approaches demonstrate how text processing methods can improve insurance analytics in the actuarial practice. In addition, understanding the factors that relate with large insurance losses may help us mitigate future insurance losses. Generalized additive models (GAM) extend linear models to contain smooth functions for each of the covariates, while retaining inference about the functions. Applications of GAMs have been discussed in Hastie & Tibshirani (1990). These applications include studying kyphosis in laminectomy patients, atmospheric ozone concentration, and the intensity of ischaemic heart disease risk factors, among others. Moreover, Hastie et al. (2009) describes an example of utilizing GAMs to classify emails as spam. While this example does analyze text, the method has several significant differences from that of our analysis. The spam example observes the number of occurrences of certain words, and fits a GAM with each word as a covariate. While the spam example in Hastie et al. (2009) also discusses the interpretability of the model, the primary goal is to predict the probability of an email being spam. In our analysis, we start by quantifying the similarity between a description and a series of selected words. The main interest in this analysis is the interpretation of the smoothing functions. This provides a much more complete explanation of the various factors that lead to high losses, and therefore, may provide for a more improved strategy of risk mitigation. A recent work, Wood (2017), provides a comprehensive overview of GAMs. There is a vast literature in insurance claims modeling, where parametric models are employed to understand insurance claims distributions. 
To the best of our knowledge, combining text mining approaches with loss modeling is a new approach, which hasn’t been attempted in the past. In particular, there seems to be no prior work utilizing text mining approaches to empirically under- stand and model insurance claims data. The rest of the Chapter proceeds in the following order: 16 In Section 2.2, the Wisconsin Local Government Property Insurance Fund (LGPIF) data are intro- duced. In Section 2.3, two applications of word similarity models are presented: insurance claims categorization, and risk mitigation. In Section 2.4, concluding remarks are provided. 2.2 Data In this section we provide some summary statistics for the dataset. For our case study, we utilize a unique dataset of claim descriptions and loss amounts from the Wisconsin Local Government Property Insurance Fund (LGPIF). The dataset is obtained from the Office of the Commissioner of Insurance (OCI) of Wisconsin. The Wisconsin LGPIF has been established to make property insurance available for local government units. The property fund essentially acts as an insurance company in the area, providing property coverage for thousands of government entities. In Table 2.1, it is interesting to observe that claim categories with high frequency tend to have low severity, while claim categories with low frequency tend to have high severity. In order to illustrate this effect, the table has been sorted in decreasing order of the number of observations . Modelers have studied insurance claim frequency and severity models, and empirically it has been discovered that claim frequencies and severities are often correlated; see Frees et al. (2016). 2.2.1 Data Generation The number of claims observations is 4991 in the training sample, and 1039 in the validation sample, which totals to 6030 observations. The data used for this chapter is already in tabular form corresponding to the data generating processes in Section 2.3, and we consider the data cleaning process, including the metadata analysis, as a black-box process that has already been performed by the provider of the data. We assume all claims are closed, and the claim amounts are fixed. Descriptions for the observed insurance claims are recorded in the dataset. These claim descriptions are human generated, and there are 2797 unique words found in the training sample and validation sample all together. Figure 2.1 shows a projection of the word vectors in a two- dimensional space, for common words found in the dataset. Word vectors are explained in Section 17 1.1.2. For now, imagine that there exists a framework, where every word corresponds to some two- dimensional vector, with related words having similar vector representations. A plot of common words found in the claim descriptions file may look like Figure 2.1. Figure 2.1: Two dimensional projection of the word embeddings for common words In Figure 2.1, notice that library and museum appear close to each other, since they have similar functions, and hence may appear in similar contexts. Also, graffiti, vandalism, theft, stolen all appear at a similar location on the plot. Imagine drawing an arrow from the point (0, 0) to the word, and the reader may see that the vector corresponding to each of these words are very similar to one another. The angle between the words is small, and hence the cosine of the angle between the words would be large (close to 1). Another way to say this is that the dot product between the words is large. 
Now consider the word hail and its corresponding vector, and compare it with the vector corresponding to graffiti. The two words are somewhat unrelated, and hence the angle between these two words is large. Another way to say this is that the cosine between the unit vectors corresponding to the two vectors is negative, or in other words the dot product is negative. 18 −100−50050100−150−100−50050100150V1V2vandalismgraffititheftstolenairporttrafficmuseumzoobuildingsbldgfenceshingleshailstormlightningwaterfloodlibraryglasswindowtruck 2.2.2 Textual Data Preprocessing Figure 2.2 shows the most common words in the dataset. Stop-words such as a, the, and, etc. have been removed. No other pre-processing of the words have been performed. Notice that the word damage is most frequent in the descriptions. The word vandalism is also frequent, as vandalism turns out to be one of the most frequent claim causes in the dataset. Abbreviations such as hs (high school), ms (middle school), es (elementary school), dmg (damage), bldg (building) also appear in the dataset. Note that both bldg and building are valid words. Imagine we search for the word building in a search engine. In this case, phrases with the word bldg and building should both appear in the search. We also note that building and buildings both appear in the vocabulary. The reason why both bldg and buildings appear when building is already in the vocabulary list, is because they are distinct words. How much these words relate to claim occurrence is determined by the cosine similarities of the words with selected key words. The advantage of our approach is that minimal data preprocessing is needed, so that such similar words can all be kept in the data. Note, the claim descriptions are short phrases, such as lightning damage to building, or vandal- ism damage at recycle center. A total of 4991 such claim descriptions are in the training dataset. Each claim description is associated with the loss amount corresponding to the description. Each claim is categorized into one of the following claim categories: vandalism, fire, lightning, wind, hail, vehicle, water (weather related), water (non-weather related), and miscellaneous losses. Table 2.1 shows a summary of the loss amounts for each of the nine claim categories. According to the table, vandalism has the lowest average loss amount, while the frequency of vandalism is the highest. Hail damages are the largest in terms of median and mean value, while the frequency is the lowest. The largest hail damage with a loss of 6.6 million is a hail damage to multiple buildings insured by the property fund. It is interesting to observe that the maximum loss amount happened within the weather related water damages category, which is a loss of 12.9 million. This largest loss corresponds to a water damage to a school. The second most frequent claim category is the vehicles category. This category of claim happens when a vehicle (car, plow, truck, etc.) runs into 19 Figure 2.2: Distribution of common words a government property building or structure, such as a light pole. 
20 0500100015002000damagevandalismdamagedlightningwaterglassparkfirehswinddoorlightestheftpolevehiclepowermsgraffitisignalsurgehitbuildingtrafficschoolequipmentlaptophydrantbrokendmggarageroofairportbldgcentershelterstreetfencehailstationwindowwestradiosystemtowerstolenstormstruckwashingtonhalldeptn Table 2.1: Summary statistics of losses by claim category Validation sample Training sample Peril Vandalism Vehicle Lightning Water (weather) Miscellaneous Wind Water (non-weather) Fire Hail min median mean 6,190 1 5,662 1 500 11,623 51,608 55 9,723 70 325 46,304 60,538 544 83,767 125 7,886 103,674 500 3,000 5,000 19,337 3,025 9,010 6,739 11,355 49,184 max 981,599 135,268 88,603 411,641 242,918 1,048,683 2,672,184 964,150 332,412 N min median mean 2,084 310 3,905 227 123 11,087 80,432 38 29,150 103 107 18,125 23,974 67 81,762 46 18 145,488 587 2,500 4,431 8,898 3,929 4,960 6,306 8,964 17,819 6 37 1 1 1 1 1 100 124 max 207,565 111,740 655,092 12,922,218 2,633,822 492,478 1,114,595 1,570,619 6,615,117 N 1774 852 832 426 362 296 202 171 76 21 To conclude the data description, we provide some ground-truth information regarding the quality of the data. The data has no missing values, and the categories for each claim are all observed. We inspected the claim categorizations and believe that the data quality is mostly dependable, although some errors may exist because the categorization has been performed by human experts. 2.3 Applications In this section, we demonstrate how the explanatory variables extracted from textual data can be used in specific models. We provide two examples: an example in claims classification, and an example in risk mitigation. In both applications, the Generalized Additive Models (GAM) framework is used as an underlying theme. We selected the GAM framework for its flexibility in capturing potential nonlinear effects of the cosine similarities. Word vectors are used in order to improve the classification results in all examples. These applications specifically use short textual descriptions of insurance claims, which may be found in initial reports of the claims to an insurance claims department. For long textual descriptions or claim adjuster notes, other methods such as recurrent neural networks (RNNs) using LSTM cells, or convolutional neural networks (CNNs) with multiple layers, may be preferred. For more details regarding neural network approaches to textual data analyses, see Goodfellow et al. (2016) and Goldberg (2017). Yet, the simplicity of our approach may make it attractive for actuaries working with simple textual descriptions of claims. 2.3.1 Claims Classification 2.3.1.1 Model In practice, an insurance claims manager would have to classify given claims based on their properties. This task may be supported with a claims classification engine, which would take the claim description as its input, and output the correct claim category. This motivates our first application of word embedding models, which is the classification of insurance claims into discrete categories. 22 The engine would be trained over a training dataset, and validated over a test dataset. Within the training dataset, we assume category  and description  are observed for each claim . Thus, the sample consists of observations {(1, 1), . . . (, )}. Note that  = (1, 2, . . . , ()), a description consisting of () words, where () is the number of words consisting the -th description. We assume that  = 0 + 1 + . . . +  , thus we imagine that nine samples each of size 0, 1, . . . 
,   are stacked to form the given sample of size . In this paper,  = 8. The claim categories for the training sample and test sample are shown in Table 2.2. Thus, the response variable  takes on the values 0, . . . , . Table 2.2: Claim Categories for Training and Validation Datasets Misc. ( = 0) 362 103 Vandalism ( = 1) 1774 310 Training Test Fire ( = 2) 171 46 Lightning ( = 3) 832 123 Wind ( = 4) 296 107 Hail ( = 5) 76 18 Vehicle Water(NW) Water(W) ( = 6) ( = 8) ( = 7) 852 227 202 67 426 38 Total 4991 1039 A generalized additive model (GAM) framework can be thought of as a generalized linear model with a linear predictor involving smooth functions of covariates. See Hastie & Tibshirani (1990) and Wood (2017). For the analysis, we construct design matrix using Equation 1.3 as described in Section 1.1.3. That is, the explanatory variables,  are defined by , = simcos( , ) · (simcos( , ) ≥ ),  = 1, . . . ,  where  = 0.2 is used, and  is the -th word used in the model, and  is the description of the -th claim. For the model, we use a multinomial specification. We made this choice (as opposed to other classification methods), based on the fact that the multinomial model provides a framework that is easily generalizable to a GAM model. In this case, the probability of observing a specific peril type  is given by  exp(cid:16)  1(cid:14)(cid:169)(cid:173)(cid:171)1 +  exp(cid:0) ,(cid:1)(cid:14)(cid:169)(cid:173)(cid:171)1 + (cid:48)=0 (cid:48)=0  (cid:48), (cid:17)(cid:170)(cid:174)(cid:172) exp(cid:16) f( ) = for base peril type  = 0, for peril type 1 ≤  ≤  (cid:17)(cid:170)(cid:174)(cid:172)  (cid:48), 23 with  , =   +  ,(,)  =1 where   is an intercept, and  ,( = 1, . . . ) may be smooth functions of the covariate, and  is the number of words used in the model. We denote the base peril type as miscellaneous claims, and call it  = 0. If we assume the functions are linear so that estimation time could be saved, then we have  ,(,) = ,  , ,  = 1, . . . ,  and hence f( ) =   1(cid:14)(cid:169)(cid:173)(cid:171)1 + exp(cid:16) (cid:48)=0   + (cid:48)     (cid:48) + (cid:48) exp(cid:16) (cid:17)(cid:14)(cid:169)(cid:173)(cid:171)1 +   (cid:48)(cid:17)(cid:170)(cid:174)(cid:172) exp(cid:16)  (cid:48)=0   (cid:48)(cid:17)(cid:170)(cid:174)(cid:172)  (cid:48) + (cid:48) for base peril type  = 0, for peril type 1 ≤  ≤  Here,   are  dimensional coefficients. Note that with this choice of  ,, the GAM model becomes the GLM model. This simplification is useful for reducing the computation time, when a large number of explanatory variables are included in the model. Table 2.3 shows  = 7 words defining the set  used in the model. The reader may imagine projecting each claim description  onto a space represented by  = 7 axes. In this paper, the feature words have been selected by a human expert with a good understanding of the dataset. In practice, the most frequent words found in the claim descriptions may be used as the key words. Abbreviated words are valid choices. In the claims classification problem, our goal is to construct an engine that gives the best possible classification result, and the focus is less on the interpretability of the coefficients resulting from the estimation. For applications such as that found in Section 2.3.2, the interpretability is more important, and hence the feature words should be selected more carefully by a human expert using the engine. 
Table 2.3: Words used for Classification vandalism fire lightning wind hail vehicle water 24 The reader may be curious why a censoring of the cosine similarities is needed. The reason is that cosine similarities smaller than the threshold is basically noise. One way to understand this phenomenon is to imagine a search engine returning results on the internet. The first few results are highly related to the search string that has been entered, but as one goes down the list, more and more irrelevant results may be observed. These junk results tend to add noise to the regression result, and hence we censor the cosine similarities with a threshold  = 0.2, where the choice of  is an empirical question. Figure 2.3 in Section 2.3.1.2 shows the classification accuracy as a function of the threshold . 2.3.1.2 Result Table 2.4: Confusion matrix for multinomial model Predicted Category Lightning Wind Hail Vehicle Water(NW) Water(W) Actual Category Misc. Vandalism Fire Lightning Wind Hail Vehicle Water(NW) Water(W) Total Misc. Vandalism Fire 1 24 17 0 18 0 0 3 2 4 0 0 31 4 0 2 0 5 86 25 33 267 2 1 4 0 5 4 1 317 1 0 3 114 3 0 0 0 0 121 0 2 0 0 88 0 0 0 4 94 0 0 0 1 2 17 0 0 0 20 39 23 20 3 1 1 182 5 0 274 0 0 2 0 0 0 2 4 1 9 5 1 1 1 3 0 3 52 27 93 Total 103 310 46 123 107 18 227 67 38 1039 The analysis of classification accuracy is often performed by the Receiver Operating Charac- teristic (ROC) curve in the binary classification problem. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various thresholds. The Area Under the Curve (AUC) is often used as a measure of how well the classification engine is performing. In the multiple class problem, numeric measures such as the average accuracy, error rate, and precision are better suited for analyzing the classification accuracy. These quantities are defined by Average Accuracy =    +      +    +    +   , Error Rate = 1  + 1    +       +    +    +   , 1  + 1   =0 =0 25 Precision = 1  + 1  =0       +    , where    is the number of true positives,   is the number of true negatives,    is the number of false positives,    is the number of false negatives; see Sokolova & Lapalme (2009). Figure 2.3 shows the average accuracy, error rate, and precision. With  = 0.2, the average accuracy is 93.62%, the error rate is 6.37%, and the precision is 63.9%. We tested if these numbers changed as the threshold  changed. Figure 2.3: Average accuracy, error rate, and precision in log scale Figure 2.3 shows that the average accuracy, error rate, and precision changes as  is altered by 0.005. The solid lines show degree 10 splines fit to the experiment data. Notice that when  = 1 (or in other words when log  = 0) the accuracy and precision reduces, and the error rate increases significantly. This is precisely the case when the cosine similarities boil down to indicator variables of whether the particular words are found in the claim descriptions (in other words, when there exists an embedding for at least one word in the description that is a constant multiple of the keyword). Figure 2.3 is evidence that the model matrix based on cosine similarities is outperforming that based on indicators. Our choice  = 0.2 is shown as a vertical dotted line at log(0.2) = −1.609. We emphasize once more that the choice of  is an empirical question, as the classification accuracy is not influenced much by the threshold. 
Moreover, since the graph is stable on the left-hand side, 26 −5−4−3−2−10−0.10−0.09−0.08−0.07−0.06Average AccuracyLog EpsilonLog Average Accuracy−5−4−3−2−10−2.9−2.8−2.7−2.6−2.5−2.4Error RateLog EpsilonLog Error Rate−5−4−3−2−10−0.7−0.6−0.5−0.4PrecisionLog EpsilonLog Precision  = 0 may also be a reasonable choice, if prediction alone is the concern. Yet, for interpretation of the coefficients, we have chosen a non-zero threshold. 2.3.2 Risk Mitigation 2.3.2.1 Model In order to understand the factors that relate with high losses, we assume a sampling frame in the following form: We assume that the loss amount, the category of the loss, and the description of the loss are observed. While  was a random variable in section 2.3.1, here we assume  =  is fixed. Also, we assume for each category,   losses are observed. Let  be the category of the loss, and let ∗ , be the underlying loss amount, and  , the description of the -th loss in the -th category. Thus, the dataset consists of distinct samples with observations of the form {(∗ 0,1, 0,1), . . . , (∗ 0,0 , 0,0)} ... ,1,  ,1), . . . , (∗ {(∗ ,  ,  , )} For a given 0 <  < 1, responses  , are formed by  , = (∗ , >  ())  () = inf{ : (∗  ≤ ) ≥ } where  = 0.5 is used to obtain the median for each category. For our analysis, the empirical quantile has been used for  (). In other words, for any given loss,  , is an indicator of whether the observed loss ∗ , is above the 50th percentile of losses in that category of loss. Any other quantile could have been used. For instance, the 95th percentile could have been used. Different quantiles would give different stories, because the definition of a large claim would be different for each case. The 50th percentile has been arbitrarily selected for demonstration. 27 We analyzed the losses for vandalism, fire, wind, vehicle, and the two water damage categories, omitting lightning, hail, and miscellaneous losses from the analysis because for the latter three it was difficult to identify keywords that correspond to large claims. Detailed results are shown for the vandalism peril type only, in order to limit the number of pages of the paper. The question is, whether we can understand the factors that are related to response values  , = 1, corresponding to high losses, through text analysis. Using the word similarity metric described in Section 1.1.3, we can create variables for risk measures of interest. Consider a specific description of a claim in a given category. Suppose a modeler is interested in creating a risk metric corresponding to a word  ,,  = 1, . . . ,   ( : number of risk metrics in category ). Suppose an insurance claim is described by the phrase , for the -th observation,  = 1, . . . , , where  is the number of claims found in the dataset. We create a variable by  ,, = simcos( , , ) · (simcos( , , ) ≥  ) for a threshold  . In this paper,   = 0.2 is chosen for each . Thus,  ,, is the cosine similarity between a search word  , and the description of the claim , with a censoring below a certain similarity level. For example, the modeler may be interested in the risk metric graffiti. The modeler may be interested in the relationship between this risk metric, and a response variable of interest, say the magnitude of loss. If the metric has a high correlation with large losses, then attention should be given to the particular risk metric in order to mitigate the risk inherent in this metric. In this case, the modeler may create a column vector for graffiti. 
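A short R sketch of this construction, with illustrative names only: dat is assumed to hold the loss amount, the loss category, and the tokenized description of each claim, and sim_doc is the document-similarity helper sketched earlier in Section 2.3.1.1.

library(dplyr)
dat <- dat %>%
  group_by(category) %>%
  mutate(y = as.integer(loss > quantile(loss, probs = 0.5))) %>%  # 1 if above the median
  ungroup()

s <- vapply(dat$description, function(d) sim_doc("graffiti", d, emb), numeric(1))
dat$graffiti <- s * (s >= 0.2)        # censored similarity to the risk metric "graffiti"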
For a particular response, say the likelihood of a high vandalism claim, a modeler may have a set of risk metrics of interest, say: graffiti, laptop, window, shelter, pool (In this case,   = 5). This gives a matrix of   explanatory variables, which can be used in standard regression models. Table 2.5 summarizes the explanatory variables used for the vandalism model. For category  (this section will focus on the vandalism category in particular), given a claim 28 Table 2.5: Summary statistics of explanatory variables (vandalism model) Min. Median Mean Max. 1.000 0.000 laptop 0.000 1.000 graffiti 1.000 window 0.000 1.000 0.000 shelter pool 0.000 1.000 0.000 0.520 0.246 0.221 0.260 0.130 0.424 0.348 0.187 0.209 , the GAM model for each peril type can be specified as (cid:8)E( ,)(cid:9) =   +  ,( ,,)   (cid:18)  , =1 (cid:19) ( ,) = log 1 −  , , where subject to the constraint such that and   is the coefficient for the intercept, and  , are smooth functions of the covariates  ,,, =1  ,( ,,) = 0. In other words, a logit link is used since  , is a binary variable taking on the value of 1 or 0. We have ,  ) ,  ) , ( ,) =  ,(cid:0)1 −  ,(cid:1) exp( (cid:48) 1 + exp( (cid:48)  , = Let the matrix   include transformed columns representing the spline bases for the  ,. For this study, a second order B-spline basis with three terms is used. In this case, the design matrix   can be constructed by first forming the matrix  1( ,1,) 1 1 1( ,2,) 1 1 2( ,1,) 1 2( ,2,) 1 3( ,1,) 3( ,2,)  , = ... ... 1( ,,) 1 1 2( ,,) 1 3( ,,) ,  = 1, . . .    () are the B-spline basis functions, presented in Wood (2017). Then the columns are where  transformed using QR factorization, in order to impose the identifiability constraint. This involves decomposing each vector  ,1 into the form ...  (cid:105) , 0  (cid:104)  ,1 =  ,,1  ,,2 29 and then taking  ,,2 for  = 1, . . . ,   to form the design matrix (cid:20)   = 1;  ,1 ,1,2;  ,2 ,2,2; . . .  ,   ,  ,2 satisfy This ensures that an intercept is included in the design, and also allows the basis functions  , to =1  ,( ,,) = 0. The idea of P-IRLS is that a weight matrix is adjusted each time the algorithm iterates until convergence. The algorithm follows the following steps. 1. Given the current    [Ò]  , calculate the diagonal matrix  (cid:21) . and where  [Ò]  [Ò]  =    −  is a diagonal matrix satisfying [Ò] [Ò]   [Ò]  (cid:17), and [Ò] , = −1(cid:16) (cid:17). [Ò]     [Ò] , = 2 [Ò] , (cid:20) (cid:16) (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2 [Ò]   , (cid:16) [Ò] (cid:17)(cid:21)−1 (cid:17) +    , = (cid:48)(cid:16)   [Ò] [Ò]    ,    , + =1  ,,2 ,  ,,2   2. Then minimize(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:114) (cid:16) [Ò]  [Ò]  −       with respect to  . 3. Repeat steps 1 and 2 until convergence. Here, Ò is the iteration index, and  , are matrices designed to penalize the roughness of the smooth functions. In this paper, we used the difference penalty  , =   , where  = −1 1 0 0 −1 1  The  , are selected by generalized cross validation. Cross validation is applied to a subset of the training sample corresponding to each peril type. This involves minimizing the generalized cross validation score V = =1( , − ˆ ,)2 [  − (  )]2    30 where   is the influence matrix for the -th category. For details of the theory behind cross validation, we reference Wood (2017). 
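For readers who prefer not to code the P-IRLS iteration directly, the same type of fit can be sketched with the mgcv package, which is used for the estimates reported below. The call is illustrative: mgcv's default basis dimension and penalty order are used rather than the exact three-term second order B-spline basis and first order difference penalty described above, the covariate names follow Table 2.5, and the data frame name is an assumption.

library(mgcv)
fit_vand <- gam(y ~ s(laptop, bs = "ps") + s(graffiti, bs = "ps") +
                  s(window, bs = "ps") + s(shelter, bs = "ps") + s(pool, bs = "ps"),
                family = binomial(link = "logit"),
                data = vandalism, method = "GCV.Cp")
summary(fit_vand)                        # approximate significance tests for the smooths
plot(fit_vand, pages = 1, shade = TRUE)  # smooth functions with pointwise intervals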
The above procedure is performed for each , where in our work each of the following peril types are considered: vandalism, fire, wind, vehicle, water (non- weather), water (weather). Generalized additive models are implemented in the R programming language via packages such as gam and mgcv. In this paper, parameters are estimated using the mgcv R package, implemented by the author of Wood (2017). We chose this package because it provides a convenient interface regarding the choice of basis functions and graphical outputs. 2.3.2.2 Result Figure 2.4: GAM model plots for vandalism In this section, we present the analysis results for the vandalism category. For the vandalism category, explanatory variables laptop, graffiti, window, shelter, pool were included in the GAM model. Among these, laptop turns out to have a positive relationship with high losses. Figure 2.4 shows the shape of the smooth functions  , ,  = 1,  = 1, . . . , 5, as the explanatory variable 31 0.00.20.40.60.81.0−3−2−10123laptopCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123graffitiCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123windowCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123shelterCosine SimilaritySmooth Function0.00.20.40.60.81.0−3−2−10123poolCosine SimilaritySmooth Function varies from 0 to 1. The shaded regions in Figure 2.4 illustrate the 95% credible intervals of the smooth functions at each point on the curve. Notice that data for large values of cosines are scarce, hence the credible interval widens for large values of the explanatory variables. Figure 2.5 shows the words with highest cosine similarity with laptop, which turns out to be positively related with high losses. Presumably, vandalisms and thefts to laptops, computers, portables turn out to result in relatively high losses within the vandalism category. Note that vandalisms are small frequent losses. Although the loss amounts in this category are small, the frequent nature of the losses may make it worthwhile to mitigate thefts to laptops. The words in Figure 2.5 are not necessarily found in the training data. They are words found in the word embedding matrices. What we are saying here is that since the word laptop has a high correlation with large losses, related words such as laptops, computers, and phones are also suspects for potential high losses, due to their relationship to the word laptop. This type of chart helps an insurance claims department to learn which type of property to focus on, when mitigating risk. Figure 2.5: Words that relate with high vandalism losses 2.3.2.3 Model diagnostic Table 2.6 provides a summary of the GAM models. The 2 statistic reported for each smooth function is based on Wood (2013), or page 305–308 of Wood (2017). Essentially, this tests the hypothesis 0 :  ,() = 0 32 0.000.250.500.751.00laptoplaptopscomputerscomputerportablephonecamcorderprintercamerascameraradiosscreenssmartmobilescreenhardwaregpswiredelectronicstolencosinelaptop for all  in the range of the cosine similarities for category . A high p-value would indicate the function  , is not needed in the model. The smooth functions for vandalism all turn out to be significant, according to Table 2.6. 
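Referring back to Figure 2.5, the list of related words for a keyword is obtained directly from the embedding matrix rather than from the training data. A minimal sketch, reusing the embedding matrix emb assumed earlier:

nearest_words <- function(word, emb, n = 20) {
  v    <- emb[word, ]
  sims <- drop(emb %*% v) / (sqrt(rowSums(emb^2)) * sqrt(sum(v^2)))
  sort(sims, decreasing = TRUE)[seq_len(n)]
}
nearest_words("laptop", emb)             # the kind of list behind a chart like Figure 2.5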
Table 2.6: GAM model summary (vandalism model) s(laptop) s(graffiti) s(window) s(shelter) s(pool) R-sq.(adj) Chi.sq 32.560 39.250 47.430 132.930 9.020 0.206 p.value 0.000 0.000 0.000 0.000 0.001 2.4 Concluding Remarks In this chapter, we introduced a framework for incorporating textual data into insurance claims modeling, and considered its applications in claims management processes. An insurance claim representative is responsible for investigating the claim, in order to determine the handling process. In this chapter, we explored the use of word similarities as a tool for modeling insurance claims and mitigating insurance risks. Our results demonstrate how text mining technology can be incorporated into a traditional regression analysis. The methodology is applicable in many different areas of applications, where textual data arises. Possible applications of our approach for an insurance risk manager may include: • Classification of claims based on textual descriptions of the claims • Classification of policyholders based on textual descriptions of the policyholders • Prediction of insurance claims at the claim level • Prediction of insurance claims at the policyholder level • Analysis of insurance claims and risk mitigation We explored the LGPIF data in the form of case studies to understand the factors that relate with high insurance losses, classify insurance claims, and model the loss amounts using parametric 33 distributions involving covariates derived from textual information. We make some remarks on the current limitations of our framework, where potential improvements can be made. • Under the current framework, words not found in the word embedding matrix cannot be used in the modeling. • The threshold  is selected using heuristics by a human expert, under the current framework. • Because pre-determined word embedding matrices are limited to one-grams (single words) at the time the paper is being written, the incorporation of -grams (use of phrases longer than one word as a search key) remains an open question. • Further linguistic barriers may exist, if the textual descriptions are longer than those appearing in the dataset used for this paper. Examples may be polysemy, false friends, compound words. • In order to use the proposed method, insurers that focus on specific insurance segments may be constrained to build its own word embedding matrices, as the terms appearing in the claim descriptions may be specific to the field. For example, a medical insurer may find GloVe insufficient, and may need a word embedding matrix trained on medical terms in order to use our proposed approach. Economic losses due to property damage caused by perils including fire, lightning, wind, hail, or vandalism have vast implications to our society. Understanding the nature of property damage can improve our readiness and contribute to minimizing the losses. We have illustrated a way to help realize this goal using a new analysis method, which, to the best of our knowledge, has not been attempted before in the actuarial literature. We believe our methodology may help broaden the horizon of empirical research, and contribute to the advancement of the understanding of our world and the risks residing within it. In addition, we believe that our approach will improve the claim handling procedures of insurance claims departments. 
34 CHAPTER 3 A THREE STAGE APPROACH TO MODEL BUILDING 3.1 Introduction In practice, an important task of an insurance analyst is to assist the claims manager in setting the case reserves for reported claims. The case reserve for a given claim can be understood as the difference between the reported claim amount and the paid amount for an individual claim. Note that the case reserve excludes incurred but unreported claims, for which a separate incurred but not reported (IBNR) reserve should be prepared. For the purpose of this chapter, the reader should understand that the case reserve is an approximation to the ultimate amount of the claim, given information available at the time of the report of the claim. Sometimes the case reserve is set by the claims department of an insurance company, while in other cases the task is outsourced to an outside adjustor. Part of the information available to the claims department at the report time of the claim is a textual description of the claim. In this chapter, we are interested in approaches that use the textual information regarding an insurance claim to predict the ultimate loss amount, by regressing the loss amount on a set of covariates derived from the textual description. Given a dataset of historic loss descriptions and ultimate loss amounts, an actuary may use the approach to improve the case reserving procedure. Part of the problem in this prediction task is that if we use a large number of keywords in forming the design matrix extracted from the textual descriptions of claims, the resulting problem is high-dimensional in nature. In this case study, we use the framework of Lee et al. (2019) and analyze a dataset of loss descriptions and amounts, downloaded from the National Oceanic and Atmospheric Administration (NOAA). For this analysis, a generalized linear model (GLM) may not be appropriate because the linearity assumption may not appropriately fit the data. To solve such a problem, we may consider using a nonparametric regression technique. A variety of nonparametric regression techniques have been 35 developed, including but not limited to regression splines, kernel smoothing, neural networks, and generalized additive models (GAM). Nonparametric regression has been applied in many areas, from modeling daily pollution in the U.K. (Wood et al., 2017), to estimating relative risk for disease mapping of lung cancer (Dreassi et al., 2014). See Simonoff (1996) for more details and examples of nonparametric regression. In this chapter, we consider the GAM. Hastie & Tibshirani (1986) proposed the generalized additive model that consists of the sum- mation of smooth functions, allowing for the ability to capture the true, not necessarily, nonlinear relationship. In the generalized additive model setup, more information is needed to estimate each function as compared to the generalized linear model setup. Therefore, the data must have many more observations than the number of covariates. In addition, when working with high dimensional data, the scalability of the algorithm is also extremely important when considering a method. Our approach is motivated by these characteristics. Considerable work has been done in efficiently estimating larger datasets using generalized additive models. Most recently, Wood et al. (2017) developed a method for estimating GAM with the number of coefficients of order 104, and observations up to 108. 
This method reduces the number of matrix operations, utilizes parallelization, and reduces the memory necessary by marginal discretization of the model covariates. Li & Wood (2019) extended this work by proposing an alternative method of calculating X(cid:48)WX where X is a model matrix and W a diagonal or tri- diagonal matrix, which results in a 30 fold reduction in computational time. Previous works include Marra & Wood (2011) and Wood et al. (2015). Code for these methods are found in the R package mgcv, Wood (2019). While the aforementioned GAM results provide for a scalable algorithm, a hindrance of GAM is the restriction on the number of covariates. Considering the GLM, there are several methods for combating the high dimensionality issue, with most notably being lasso by Tibshirani (1996). Similar to the GLM, the lasso maximizes the likelihood, but instead has an additional 1 penalty term. This term is typically referred to as the shrinkage term. Extensions of the lasso have also been developed, including but certainly not limited to, group lasso from Yuan & Lin (2006) and 36 adaptive lasso from Zou (2006). The group lasso is applied to variables with group-like structure, and it uses a slightly altered penalty term where each variable in a group is penalized equally. This is particularly important due to the group-like structure induced by the basis expansion used in the estimation of the generalized additive model. The adaptive lasso simply applies a weight to each coefficient in the penalty term, with these weights typically estimated through ordinary least squares or lasso. Wang & Leng (2008) combined these extensions to formulate the adaptive group lasso, and showed the ability of the method to identify the true model consistently. Our proposed method is a three step approach consisting of (1) weight calculation by group lasso, (2) the shrinkage step by adaptive group lasso, and (3) the smoothing step. This approach combines the adaptive group lasso dimension reduction technique with the scalable GAM algorithm. 3.2 Data and Preprocessing For our analysis, we utilize the publicly available NOAA Storm Events Database. The analysis is performed on property loss amounts at the event level, using storm event observations involving textual descriptions of the events. The data are collected over time, however we use a cross-sectional model in this paper, in order to focus on the relationship between the textual information and the response. Only Thunderstorm Wind events taking place in Michigan and from 2000 to 2018 are considered for the analysis. For validations, the dataset is divided into training and validation datasets. The reason we use this dataset is because it contains relatively clean, lengthy descriptions of losses from storm events in the United States each year, along with the property and crop damage amount estimates. These damage amounts are initial estimates of the losses, and hence are different from the ultimate loss amounts. Yet, the structure of the data is identical to that available to a claims adjuster, and hence is a good test dataset for the analytical framework explained in this chapter. Another advantage of this dataset is that it is publicly available, allowing dissemination and reproducibility to be easy. Each event is recorded with an event narrative. 
An example of an observation with an estimated property damage of $10,000 has an event narrative that reads: Roof damage was incurred to a barn 37 Table 3.1: Summary statistics for the log(loss) for the training and validation datasets. Training Validation N Min Mean 8.97 2353 126 8.78 2.30 6.21 SD Max 17.03 1.44 1.56 14.00 six miles northwest of Mason due to a severe thunderstorm wind gust and a large tree limb was blown down in South Lansing. Figure 3.1: Frequency for the most common words. Figure 3.1 shows the most common words in the descriptions of the losses. Stop-words such as a, the, and, etc. have been removed. Notice that the word trees is most frequent in the descriptions. A few of the most common words are typically used to describe what is happening to trees, such as blown and wind. In addition, several of the other most common words like power, lines, damage, and outages are used to describe the results of downed trees. There are a total of 2, 353 observations in the training set, with 126 observations in the validation set. As previously mentioned, the claim descriptions are quite lengthy, with an average of 16.8 words per description. There are a total of 2, 642 unique words used in the dataset. To capture only relevant words, stop-words, numbers, and words that only occurred once were removed, resulting in 1, 998 words. Table 3.1 provides summary statistics for the log(loss) for the training and validation datasets. In order to better understand the relationship between the words in the claim description and the property loss amount, each word is represented by a vector. Recent advancements in word 38 050010001500treesblownreportedpowertreelinesnumerouscountydamageroadwindestimatedlimbsmphdiameterthunderstormlakewindsmilesgustsfrequency embedding models have made it possible to obtain these representations easily. We utilize the 300 dimensional word embeddings developed by the authors of Pennington et al. (2014). To form the design matrix, we follow the framework described in Section 1.1.3. That is, we will choose our design matrix, ×, to be  where  is the set of all unique words found in D,  is the number of descriptions in D, and  is the number of words in . Each value in the matrix is now continuous and restricted to [−1, 1]. Figure 3.2 shows the relationship between cosine similarity and property loss for house and thunderstorm. From the figure, we see that the relationship between cosine similarity and property loss is nonlinear in nature and therefore a generalized additive model is appropriate. Figure 3.2: Cosine similarity against property loss for house and thunderstorm. 3.3 The Model and Basis Expansion In this section, we describe our methodology in specific terms. We consider the generalized additive model (3.1)  ( )(cid:170)(cid:174)(cid:172) ,  = [| ] = −1(cid:169)(cid:173)(cid:171)  (cid:21) (cid:20)  − () =1 , where the link function corresponds to that of the corresponding exponential family distribution. For each of the  independent observations, the density function is given by  = () exp 1 ≤  ≤ ,  ∈ R (3.2)  39 510150.250.500.751.00Cosine Similaritylosshouse510150.000.250.500.751.00Cosine Similaritylossthunderstorm We assume that a matrix of explanatory variables is given. Let’s call it ×, and use the notation  = (  ). We have 1 ,  2 , . . . ,  (3.3) × = 11 12 . . . 1 21 22 ... ... . . . 2 . . . ... 1 2 . . .     =1 Thus,  is the number of observations, and  is the number of explanatory variables available. 
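For completeness, the preprocessing that yields these p candidate covariates can be sketched in R. The file name refers to the publicly distributed 300-dimensional GloVe vectors, narratives stands for the raw event narratives, and all object names are illustrative.

emb <- as.matrix(read.table("glove.6B.300d.txt", sep = " ", quote = "",
                            comment.char = "", row.names = 1))

tokens <- strsplit(tolower(narratives), "[^a-z]+")   # lowercase; drops numbers and punctuation
counts <- table(unlist(tokens))
vocab  <- names(counts)[counts > 1]                  # drop words occurring only once
vocab  <- setdiff(vocab[nzchar(vocab)], tm::stopwords("en"))
vocab  <- intersect(vocab, rownames(emb))            # keep only words with an embedding

The censored cosine-similarity design matrix is then built exactly as in the sketch of Section 2.3.1.1, with vocab in place of the hand-picked keywords.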
We assume that the parameter 0 <  < ∞ is known. Without loss of generality, let  = 1. Also, we assume that the density of  depends on  via the structure  =  ( ), (3.4) where  are defined in equation 3.2. Let the values   be defined over intervals [, ], and let  = 0 < 1 < . . . <  < +1 =  be a partition on [, ], where  =  = , with 0 <  < 0.5, such that | − −1| = (−). max 1≤≤+1 (3.5) For smooth functions 1, . . . , , there exist functions 1, . . . ,  ∈ S, such that 1, . . . ,  well approximate 1, . . . , . Here, S is the space of polynomial splines of degree  ≥ 1. In other words, S is the space of piecewise polynomials on each interval of the partition, which are (cid:48) ≤  − 2 times continuously differentiable for  ≥ 2. Smoothness is defined in the same way as Huang et al. (2010). According to Schumaker (1981), there exists a normalized B-spline basis { , 1 ≤  ≤ } for S, where  =  +  = 1. For a practical overview of the B-spline basis function, the reader may refer to Wood (2017), section 5.3.3, starting from page 204. We want to write  ( ) = [ ]      , (3.6)  =1 40 [ ] for some value   . For this, our next goal is to construct a design matrix (cid:16)1 : [1] : [2] : . . . : [ ](cid:17) (cid:16)1 : [1] : [2] [1] 2 [2] 2 × = = (cid:17) [ ] 2 (3.7) : . . . : [ ] from ×, where [ ] are matrices of basis functions. Denote  =  × . Here, [ ] are each  ×  matrices of transformed basis functions, where  is the number of parameters within each basis function consisting the design matrix. We consider the matrix of basis functions  [ ] = 1(1 ) 2(1 ) 1(2 ) 2(2 ) ... ... 1( ) 2( ) . . . (1 ) . . . (2 ) . . . . . . ( ) ...  . (3.8) From here on, we assume () are -parameter B-spline basis functions of order 3. In order to obtain the design matrix, we form the QR decomposition (cid:16) [ ](cid:17) 1 =  [ ] 1 [ ] +  [ ] 2 0 (3.9) (3.10) and take the  [ ] 2 matrix corresponding to the zero part of the decomposition, and define [ ] = [ ] [ ] 2 Finally, let the × matrix be given by expression (3.7). We call  our design matrix, and denote the elements of the design matrix , for  = 1, . . . ,  and  = 1, . . . . We also denote   = [ ] 1 ,  [ ] 2 , . . . ,  [ ]   , for  = 1, . . . , , and  = 1, . . . . (3.11) (cid:16) (cid:17) Under this framework, the response variable is related to the covariate   via  ( ) =      ,  = 1, . . . , ,  = 1, . . . , , (3.12) where   is the coefficient corresponding to the -th explanatory variable. We may see that   must be a length  vector, since the -th spline contains  parameters. The illustrated approach 41 to constructing the design matrix ensures that the basis functions satisfy  ( ) = 0 for each  = 1, . . . , . (3.13) In other words, selecting the model matrix this way imposes an identifiability constraint, which states that the smooth function defined by the basis functions must satisfy  ( ) = 1 [ ]   = 0 (3.14) Notice that equation (3.14) holds because  =1  =1 (cid:104) (cid:105)  [ ] 1 [ ] 2   1 [ ]   = 1 [ ] [ ] 2   = [ ] 0 [ ] 2   = [ ] [ ] 1  [ ] 2   = 0 (3.15) [ ] 1 [ ] 2 and  are orthogonal. Simply stated, if  () is a spline, the coefficients for the since  [ ] 2   instead of   after the transformation. Our methodology is related spline are now given by  to Chouldechova & Hastie (2015), yet we use a three step procedure. 
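A small R sketch of the block construction in (3.8)-(3.10): build a B-spline basis for one covariate, then use the QR step to absorb the sum-to-zero identifiability constraint. The basis dimension and degree shown are illustrative rather than the exact settings used for the reported results, and X denotes the cosine-similarity matrix of Section 3.2.

library(splines)
make_block <- function(x, df = 4, degree = 3) {
  B <- bs(x, df = df, degree = degree, intercept = TRUE)  # n x m basis matrix B^[j]
  C <- crossprod(B, rep(1, nrow(B)))                      # constraint vector (B^[j])' 1
  Q <- qr.Q(qr(C), complete = TRUE)                       # m x m orthogonal matrix
  B %*% Q[, -1, drop = FALSE]                             # Z^[j] = B^[j] Q_2
}
Z <- do.call(cbind, lapply(seq_len(ncol(X)), function(j) make_block(X[, j])))

By construction the columns of every block sum to zero, which is the identifiability constraint in (3.13)-(3.14).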
We are looking for the parameters for  {[| ]} = 0 +  ( ) = 0 +     = 0 +    (3.16)  =1  =1 where we have used the notation  to denote the -th row of , and  = (  ), where some of the  ’s are zero, while others are non-zero. The approach in Chouldechova & Hastie (2015) is to minimize the penalized negative log-likelihood 2 , . . . ,  1 ,  where ࡁ( ) is the log-likelihood for an exponential family distribution: − 1  ࡁ( ) + 2 (cid:113) =1  (cid:169)(cid:173)(cid:171)  (cid:16) (cid:104) =1          + 1 (cid:48) 2 =1    (cid:170)(cid:174)(cid:172) − (cid:169)(cid:173)(cid:171)  (cid:17)(cid:105) =1     (cid:17) −  =1  [ ]  (cid:16)   =1 =1 ࡁ( ) = = 42 3  (cid:48)   (3.17)  =1 [ ]      (cid:170)(cid:174)(cid:172) (3.18) The hope is that the second term in equation (3.17) induces zeros into groups of coefficients, while the last term imposes smoothness into the “surviving” coefficients. Here,   is an identity matrix of dimension , and   is a constraint matrix to impose smoothness into the estimated functions  . There are several practical difficulties with this approach: • When  is large, or in other words when the problem dimension is large, there are too many 3  tuning parameters to estimate. Wood (2017) discusses algorithms for large  cases, but does not talk about cases where  is large. • Theory behind selecting the tuning parameters 3 , discussed in Wood (2017) is no longer directly applicable, because of the extra group lasso type penalty term. • Implementing the coordinate descent algorithm, which brings in sparsity into , becomes tricky with the smoothing penalty. Usually fast algorithms for lasso type estimators with GLMs are implemented by locally approximating the likelihood with a Taylor’s approximation at each iterative step, yet the extra penalty term makes this tricky. • Estimating the coefficients may take a very long time, especially when the number of ex- planatory variables  is large, as in the application we consider in this paper. Hence, in order to keep the estimation procedure scalable for large  (and hence large ), we propose a three step approach to the estimation problem for the model (3.16). The first step of the approach is to perform a group lasso estimation with the first and second terms of equation (3.17). The second step uses the resulting coefficient estimates to perform an adaptive group lasso estimation of the parameters. The third and final step uses the nonzero coefficients obtained from the second step to induce smoothness into the implied spline function  (·), for each nonzero function  . These steps are formalized in the following section. Moreover, to provide a statistical validation, we present both the numerical results in section 3.6 and the theory for the estimated functions in Appendix A, which works as another support of our proposed 3-stage approach. We aim at validating two things: the variables selected are consistent; 43 the estimators are consistent with respect to the unknown true functions. Appendix B contains a short simulation study illustrating the effectiveness of the method on a simulated dataset. 
3.4 The 3-Stage Approach 3.4.1 Stage 1 – Group lasso Define the objective function to be ( ; 1) = −1  (cid:16) (cid:104)  =1     (cid:17) −  (cid:16)    (cid:17)(cid:105) + 1  =1 Let ˆ be the optimizer for (3.19), or in other words (cid:107)  (cid:107)2 (3.19) ˆ = arg min ∈R· ( ; 1) (3.20) 3.4.2 Stage 2 – Adaptive group lasso Define the objective function to be ( ; 2) = −1   (cid:104) (cid:16) (cid:17) −  (cid:16) (cid:17)(cid:105) + 2         =1 where the weights depend on the screening stage group lasso estimator =1  (cid:107)  (cid:107)2 (3.21)    = (cid:107) ˆ (cid:107)−1 2 ∞ if (cid:107) ˆ (cid:107)2 > 0 if (cid:107) ˆ (cid:107)2 = 0 (3.22) (3.23) Numerically, the weights are set to a large number, for the case when (cid:107) ˆ (cid:107)2= 0. Let ˆ  be the optimizer for (3.21). In other words, ˆ  = arg min ∈R· ( ; 2) Let ˆ be the subset of {1, . . . , }, such that the th coefficient of   with  ∈ ˆ are nonzero. Thus, the second stage estimates are sparse, meaning that the coefficients are zero for some . This reduces the coefficient size in the third stage. 44 3.4.3 Stage 3 – The smoothness penalty ˆ be the matrix consisting of columns from  corresponding to the set ˆ. Let  ˆ Let  Rˆ·, where ˆ = | ˆ|. Define the objective function to be be in ( ; 3) = −1  ˆ      ˆ      3        (3.24) (cid:16) (cid:104)  =1 (cid:17) −  (cid:16) (cid:17)(cid:105) + 1 2  ∈ ˆ where 3 = (31, 32, . . . , 3). Let ˆ be the optimizer for (3.24). In other words ˆ = arg min ∈R ˆ ( ; 3) (3.25) Since the problem of dimension has been reduced, the third step estimation may be performed using existing generalized additive models routines, using ˆ  as the initial guess for the P-IRLS procedure. The tuning parameters 3 may be obtained by generalized cross validation or REML as described in Wood (2017). 3.4.4 Tuning Parameters Each stage has a tuning parameter, 1, 2, and 3, respectively. The selection of 1 and 2 can greatly influence the performance of the model and the efficiency of the algorithm. Larger values of 1 and 2 will lead to an over-simplified model with faster computation time, while smaller values will lead to an over-fitted model with slower computation time. To find the “sweet spot”, cross validation is used to determine 1 and 2. The tuning parameters 3 is obtained by generalized cross validation or REML as described in Wood (2017). 3.5 Algorithm We now discuss the implementation of the method using R. For stage 1 and stage 2, we utilize functions from the gglasso package (Yang & Zou, 2017) and for stage 3 we utilize functions from the mgcv package (Wood, 2019). The code is provided in an R package at github.com/ scottmanski/TAGAM. 45 3.5.1 Stage 1 – Group lasso The gglasso function is modified such that we loop through the grid of 1 values, but once the number of nonzero coefficients is greater than , the algorithm is stopped. By doing so, we ensure that we will be able to execute stage 3. 3.5.2 Stage 2 – Adaptive group lasso The implementation of stage 2 is very similar to that of stage 1, except for the addition of the (cid:48)  =     for each  ∈ {1, ..., }. Then weights. equation 3.21 can be written as In order to incorporate the weights, let   + 2 (cid:48) (cid:170)(cid:174)(cid:172)  =1 (cid:107)  (cid:48) (cid:107)2 (3.26) (  (cid:48); 2) = −1  1   [ ]    1   [ ]     =1 (cid:169)(cid:173)(cid:171)  =1 (cid:170)(cid:174)(cid:172) − (cid:169)(cid:173)(cid:171)  =1 (cid:48) 3.5.3 Stage 3 – The smoothness penalty The mgcv package is used to implement stage 3. 
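Putting the stages together, the following R sketch uses gglasso for stages 1 and 2 and mgcv for stage 3; the mgcv call is discussed further in the next subsection. All names and function arguments are illustrative, and the working implementation, including the modified lambda path, is in the TAGAM package referenced above. Here y is log(loss), X is the n x p cosine-similarity matrix with syntactic column names, Z is the expanded basis matrix of Section 3.3, and grp gives the group (word) index of every column of Z.

library(gglasso)
library(mgcv)

## Stage 1: group lasso screen
cv1  <- cv.gglasso(Z, y, group = grp, loss = "ls", pred.loss = "L2")
b1   <- as.vector(coef(cv1, s = "lambda.min"))[-1]       # drop the intercept
nrm1 <- tapply(b1, grp, function(b) sqrt(sum(b^2)))      # ||beta_j||_2 for each group

## Stage 2: adaptive group lasso; dividing Z_j by w_j = ||beta_j||^{-1} (equation 3.26)
## amounts to multiplying its columns by the stage-1 group norm
keep <- which(nrm1 > 0)
cols <- which(grp %in% keep)
Zw   <- sweep(Z[, cols], 2, nrm1[as.character(grp[cols])], "*")
cv2  <- cv.gglasso(Zw, y, group = match(grp[cols], keep), loss = "ls", pred.loss = "L2")
b2   <- as.vector(coef(cv2, s = "lambda.min"))[-1]
sel  <- keep[tapply(b2, match(grp[cols], keep), function(b) any(b != 0))]

## Stage 3: smoothness penalty on the surviving covariates
form <- reformulate(sprintf("s(%s, bs = 'ps')", colnames(X)[sel]), response = "y")
fit  <- bam(form, data = data.frame(y = y, X), discrete = TRUE)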
In the mgcv package, there is a gam function and a bam function, with the former designed for smaller datasets and the latter designed for much larger datasets. In this analysis, we utilize bam. To increase the computational efficiency we also choose to have the function discretize the data following the method described in Wood et al. (2017). 3.6 Results In this section, we discuss the results of our model. Table 3.2 provides information for the final model. As previously mentioned, 1,998 words appeared in the dataset, and were considered as possible covariates. For the model, we chose to use the penalized regression spline. Stage 1 effectively reduced the number of covariates to 261, and stage 2 further reduced the number of words to 149. While the number of functions to interpret may seem cumbersome, the final model is relatively simple compared to the number of possible covariates that could have been in the model. Figure 3.3 shows the estimated functions for several covariates. All of the function estimates have a few characteristics in common. For smaller cosine similarity values, the estimated functions 46 Table 3.2: Summary statistics for the final model. The Residual DF comes from the estimated degrees of freedom from the GAM, and the MSPE is the out of sample mean squared prediction error.  4  2 1 0.0005255074 2 0.0001063902 ˆ 149 Residual DF Deviance MSPE 1.016 2167.387 70.7% Figure 3.3: Function estimates for several covariates. are approximately zero. We expect this because smaller cosine similarities between a word and a phrase indicates that the word has very little meaning in common with the phrase. For large cosine similarity values, the 95% credible interval for the functions becomes wider as compared to cosine similarity values around 0.2. This is also expected simply due to the lack of observations for higher cosine similarities. The function estimates help us understand the relationship between a word, its related words, and the property loss amount. Many of these estimated functions seem to follow our intuition. For example, house, losing, widespread, gusts, and tree are all words that would typically be associated with property loss. Words with the highest cosine similarity to house are shown in figure 3.4. Most of these related words are types of homes. From the function estimate, we see that an incident involving a house results in higher property loss than that of an incident involving offices or apartments. 47 −0.40.00.40.80.250.500.751.00houses−0.50.00.51.01.50.000.250.500.751.00losings−1.0−0.50.00.51.00.000.250.500.751.00minors0.00.40.80.250.500.751.00dozenss−0.50.00.51.00.250.500.751.00trees0.00.40.80.250.500.751.00multiples0.00.40.80.000.250.500.751.00widespreads0.00.40.81.20.000.250.500.751.00quarterss0.00.51.01.52.00.000.250.500.751.00wireds01230.000.250.500.751.00shuttings0.00.51.00.000.250.500.751.00gustss01230.000.250.500.751.00orchardss Figure 3.4: Words with the highest cosine similarity with house. While many function estimates obviously follow our intuition, there are some that seem harder to interpret. Words like quarters, shutting, and orchards all seem unrelated to property loss. To shed some light on this issue, we look at a sentence from a description that includes quarters; Two eyewitnesses in Covington reported hail greater than the size of quarters during the peak of the storm. The use of quarters here is related to the size of hail. It is expected that larger hail will lead to larger property loss. 
Words related to quarters include nickel and dime, which are also used to describe hail size. In a similar way, we find out that shutting is referring to the closure of major roadways. In the case of orchards, several observations involved damage to apple orchards. With Michigan producing the third most apples of any state, it is clear why damage to apple orchards results in large property loss. The model also performed well with out of sample prediction. Figure 3.5 shows the predicted property loss amounts against the true loss amounts for the validation sample. The Spearman correlation for the validation set is 76.06%, while the Spearman correlation for the training dataset is 80.30%. To measure the stability of the method, for a selected year, the model was trained using the previous years, and tested on data from the selected year. This was completed for each year from 2001 to 2018. This resulted in an average mean squared prediction error of 1.34 with a standard error of 0.123. Using a lasso model increases each of these values by about 8% respectively. The 48 0.000.250.500.751.00househousessenatecongressionalcongressrepublicansbuildingwhitemansioncapitolofficerepublicanclintonroomhomedemocratsbillparliamentfloorlawmakersresidencerepresentativesgopbushspeakerhillnextgingrichthecommitteeCosine Similarity Figure 3.5: Predicted property loss amounts against the true property loss amounts for the validation sample. The Spearman correlation is 76.06%. 3 stage method selected a more parsimonious model as compared to the single step lasso model, resulting in greater model stability. 3.7 Implications We have presented an analytical method for analyzing losses due to storm events in relation to their textual descriptions. The fact that losses may be predicted more accurately with textual information implies that the case reserving procedure may be improved significantly. The traditional approach to case reserving is to take the average amount of the reported losses, yet this does not take advantage of the heterogeneity of information contained within the initial report of a loss to an insurance company. The new method allows for a more accurate prediction of the ultimate loss to be indemnified for a specific reported loss. Being able to explain the factors that contribute to higher or lower severity of losses by selecting the relevant keywords from a set of words allows the actuarial analyst to avoid manually selecting the keywords needed for the textual risk analysis. This technique may be useful especially when the number of words describing the loss is large, or statistically the problem is high-dimensional. The analyst may also be able to understand the factors that relate to high losses using the selected covariates, and this may help mitigate future losses. In addition, these factors that contribute to higher severity of property loss can indicate areas 49 5.07.510.012.55.07.510.012.5PredictionTrue Property Loss needing improvement in the way they protect against various weather events. For example, events involving orchards resulted in high property loss, illustrating the need for additional preventative measures to protect the apple trees during a thunderstorm. The fact that a simple three-step approach allows for the regression selection problem to be solved easily using existing routines in the R programming language. 
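As a concrete illustration of the out-of-sample check reported in Section 3.6 (Figure 3.5), the validation summaries can be computed with existing routines, assuming fit is the stage-3 model and valid is a data frame holding the validation covariates and log losses y:

pred <- predict(fit, newdata = valid)
cor(pred, valid$y, method = "spearman")   # Spearman correlation, as plotted in Figure 3.5
mean((pred - valid$y)^2)                  # out-of-sample mean squared prediction error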
The approach may be applied in general to problems where nonlinear effects of a large number of continuous explanatory variables must be understood in relation to the response. We have focused on the log-normal case of the response, yet the method is general enough to be applied to non-normal responses, including responses following a gamma distribution, or Poisson distribution. Future work may focus on these specific cases. 3.8 Concluding Remarks In this chapter, we consider a general high dimensional text analysis problem and propose a 3 stage approach by adopting modern statistical methods. Stage 1 and 2 effectively reduced the high dimensional problem to one that mgcv can handle. The use of stage 1 and 2 to reduce the problem instead of utilizing a subject matter expert allows for simple replicability of the process. We showed how the use of cosine similarities from textual descriptions can provide interpretable results when predicting property loss. While there are many other possible applications in risk analysis, our framework could also be applied in: • classification of users on a social networking site based on their posts, • prediction of a company’s change in stock price from related articles, • caller scam classification based on call transcripts. 50 GAMMA DOUBLE GENERALIZED LINEAR MODEL WITH GROUP LASSO CHAPTER 4 4.1 Introduction 4.1.1 Overview The recent advancement of coordinate descent approaches to the Lasso estimation problem has allowed the variable selection problem to be solved in a systematic way. Lasso is an approach to induce sparsity into the coefficients estimated from a regression analysis, which has been introduced by Tibshirani (1996). Since its introduction, the method has gained popularity in the statistics literature due to its elegancy and ability to solve high dimensional regression problems. In a highly cited paper by Yuan & Lin (2006), the authors have introduced a method to group the coefficients in a regression for group-wise variable selection. In spite of the popularity and the extensive research performed on the Lasso method, applications in the actuarial science literature has been limited until now. The reason is in part because datasets found in the actuarial literature are typically not high dimensional in nature, and in part also because generalizing the fast coordinate descent methods to non-normal responses remained a challenging task. Although works by authors such as Friedman et al. (2010) have extended the Lasso method to logistic regression and multinomial regression, problems involving other non-normal responses such as the Tweedie distributed case were left as open problems. Yet, in a recent strand of works by Yang & Zou (2015) and Qian et al. (2016), an approach to solve the Tweedie distributed case with high speed has been introduced. The approach approximates the likelihood function with a second order Taylor series expansion around the current estimate of the coefficients, and uses the approximation to construct a closed form update to the coefficient estimates. This paper extends the works by Yang & Zou (2015) and Qian et al. (2016) and applies the technique to textual data analysis in the actuarial science literature using the gamma distribution, 51 which is a special case of the Tweedie distribution. Insurance claims prediction using textual data analysis methods is a natural application of the group Lasso technique for non-normal responses. 
The use of the technique in conjunction with dispersion modeling is new to the literature to the best of our knowledge. In the actuarial science literature, double GLM approaches have been used by authors such as Smyth (1989) and Smyth & Jørgensen (2002). In a dispersion modeling framework, one uses an additional link function for the dispersion. The gamma distribution double GLM model has been explored in Smyth (1989), and the Tweedie model has been explored in Smyth & Jørgensen (2002). Our idea in this paper is to incorporate a group Lasso type penalty in both the dispersion parameterization and the mean parameterization of a gamma model. The rest of the chapter proceeds in the following order: In Section 4.2 the model is explained along with a fast algorithm for obtaining the parameters. In Section 4.3 we discuss the asymptotic convexity of the negative log-likelihood. In Section 4.4 two simulations are conducted to investigate the performance of the models explained in Section 4.2. Section 4.5 describes the application of the method to insurance claim predictions using textual data analysis. Finally, Section 4.6 concludes the chapter with closing remarks. 4.2 Model 4.2.1 One Dimensional Unpenalized Problem In order to explain our approach, we demonstrate the algorithm in Yang & Zou (2015) and Qian et al. (2016) for a one-dimensional problem. For simplicity, suppose we have the one dimensional negative log-likelihood ࡁ() = − + , (4.1)  =1 52 which arises from the gamma regression model with only an intercept term . The Taylor series approximation around a point ˜ is given by ࡁ() = ࡁ( ˜) + − ˜( − ˜)2 (4.2) (cid:16)−− ˜ + 1(cid:17) ( − ˜) + 1  =1  =1 2 This can be simplified to where  =1 1 2 ࡁ() = ˜( ˜ − )2 + ( ˜) (4.3) ˜ = − ˜ ˜ = ˜ + (1 − −1   ˜) and (4.4) and ( ˜) is constant given ˜. Fixing the ࡁ() function, which is an approximation to the ࡁ() function, we try to minimize ࡁ() by solving the problem ࡁ( ˘) + ˘( − ˘) + 1 2 ˜( − ˘)2 arg min (4.5)  assuming a current estimate for ˘ is given. The idea is, we update ˘ until it converges to the minimum of a given ࡁ() fixed for a given ˜. Once this minimization is achieved, we update ˜ and continue. Here,  =1 ˘ = ࡁ   = − ( ˜ − ˘) ˜ and ˜ = 2ࡁ  2 =  =1 ˜ (4.6) The first order condition for equation (4.5) is given by ˘() = ˘ − ˜−1 ˘ The resulting ˘() satisfies ࡁ( ˘()) ≤ ࡁ( ˘), because ࡁ( ˘()) = = = 1 2 1 2 1 2    =1 =1 =1 ˜( ˜ − ˘())2 + ( ˜) ˜( ˜ − ˘ − ( ˘() − ˘))2 + ( ˜)  =1  =1 ˜( ˘() − ˘)2 + ( ˜) ˜( ˜ − ˘)2 −  ˜( ˜ − ˘)( ˘() − ˘) + 1 2  ˜( ˜ − ˘)( ˘() − ˘) + 1 2 =1 = ࡁ( ˘) − = ࡁ( ˘) + ˘( ˘() − ˘) + 1 2 ≤ ࡁ( ˘) + ˘( ˘ − ˘) + 1 2 =1 ˜( ˘() − ˘)2 = ࡁ( ˘) ˜( ˘ − ˘)2 ˜( ˘() − ˘)2 (4.7) 53 where, the inequality in equation (4.7) is due to the fact that ˘() is the solution for the mini- mization problem (4.5). Hence, the iteration scheme eventually converges to the solution ˆ = arg min  ࡁ() 4.2.2 One Dimensional Penalized Problem Now, consider the penalized problem ˆ∗ = arg min  ࡁ() + || (4.8) (4.9) where  is a tuning parameter. In this subsection, we denote the solution to the penalized problem with an upper asterisk (*). We try to minimize () = ࡁ() + || by solving the minimization problem arg min  ࡁ( ˘∗) + ˘( − ˘∗) + 1 2 ˜( − ˘∗)2 + || (4.10) at each iterative step, for a given current estimate ˘∗. 
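As a sanity check on the unpenalized iteration of Section 4.2.1, the updates can be written in a few lines of R. The sketch ignores observation weights for simplicity; its fixed point is the closed-form intercept estimate log(mean(y)).

beta <- 0
repeat {
  v     <- y * exp(-beta)            # weights v_i of the quadratic approximation
  z     <- beta + (1 - 1 / v)        # working responses z_i
  b_new <- sum(v * z) / sum(v)       # minimizer of the weighted least squares problem
  if (abs(b_new - beta) < 1e-10) break
  beta <- b_new
}
all.equal(beta, log(mean(y)))        # the gamma intercept-only maximum likelihood estimate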
Notice, we have ( ˘∗ ()) = ˜( ˜ − ˘∗ ())2 + ( ˜) + | ˘∗ ()|   =1 1 2 1 2 = ( ˘∗) − =1 = ˜( ˜ − ˘∗ − ( ˘∗ ()| () − ˘∗))2 + ( ˜) + | ˘∗ ˜( ˜ − ˘∗)( ˘() − ˘∗)   ˜( ˘() − ˘∗)2 + | ˘∗ () − ˘∗) + 1 ()| − | ˘∗| 2 ˜( ˘∗ − ˘∗)2 + | ˘∗| − | ˘∗| = ( ˘∗) ()| − | ˘∗| () − ˘∗)2 + | ˘∗ ˜( ˘∗ =1 + 1 2 =1 = ( ˘∗) + ˘∗( ˘∗ ≤ ( ˘∗) + ˘∗( ˘∗ − ˘∗) + 1 2 (4.11) Hence, the iteration scheme converges. Let the unpenalized solution be ˘(), in contrast to the penalized solution ˘∗ (). According to the first order condition without the penalty, as in section 4.2.1, we would have ˘() = ˘ − ˜−1 ˘∗. The solution for equation (4.10) can 54 The second case is when ˘() = ˘ − ˜−1 ˘∗ < − ˜−1 < 0. argument, the solution becomes ˘∗ − ˘∗ ˘ −  ˘ ˘∗ − ˘∗ ˘ +  ˘ (4.13) In this case, using a similar (4.14) be obtained using the first order conditions for three different cases. The first case is when ˘() = ˘ − ˜−1 ˘∗ > ˜−1 > 0. In this case, locally, the problem would be arg min  ࡁ( ˘∗) + ˘∗( − ˘∗) + 1 2 ˜( − ˘∗)2 +  (4.12) Differentiating this objective function with respect to  and setting the expression to zero results in the solution: ()| = | ˘∗ − ˜−1 ˘∗| ≤ ˜−1. In this case, we have ˘∗ The third case is when | ˘∗ () = 0. To see why, without loss of generality, assume 0 ≤ ˘∗ () = ˘∗ − ˜−1 ˘∗ ≤ ˜−1. The objective function in equation (4.12) relevant to the solution (excluding constant terms) can be simplified to obtain the following expression: ( ˘∗ − ˘∗ ˜ + )  + 1 2 ˜ 2 (4.15) which is minimized at zero, since the coefficient for  in the first term is positive, and we have assumed that ˘∗ () is positive. The three cases can be expressed compactly in one expression: (cid:18) (cid:19) () = ˜−1( ˜ ˘∗ − ˘∗) ˘∗ 1 −  | ˜ ˘∗ − ˘∗| + (4.16) Hence, we repeatedly update the estimate using expression (4.16), updating ˘∗ and ˜ along the way. This gives an iteration scheme, which ultimately converges to ˆ∗. 4.2.3 Gamma Generalized Linear Model The unpenalized negative log-likelihood for the gamma generalized linear model is  =1 ࡁ(0, ) =  (cid:16)  exp(cid:16)−0 −    (cid:17) + 0 +  (cid:17)   , (4.17) 55 (cid:16) where  are weights,  are responses, 0 is an intercept,  are coefficients excluding the intercept, and  are the explanatory variables for observation . We are interested in solutions of the form  (cid:16) ˆ0, ˆ (cid:17) (cid:17) treating the coefficients for each word  = 1, . . . ,  as groups. We = arg min (0,) ࡁ (0, ) +   (cid:107)  (cid:107)2 (4.18) =1  1 ,  2 , . . . ,   where  = use the algorithm presented in Yang & Zou (2015) and Qian et al. (2016) to estimate the gamma regression coefficients with a group lasso penalty. The solution for the Tweedie model with group lasso penalty is discussed in Qian et al. (2016) for the case when the power parameter is less than two. When the power parameter equals two, the model results in the gamma model, yet the derivation for this special case is not explicitly discussed in Qian et al. (2016), so we present it below. The algorithm iteratively updates the estimate of the coefficients given a current estimate. 
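Before presenting the group-wise version, note that the scalar update (4.16) derived above is simply a soft-threshold. A one-line R sketch, with illustrative names: grad and hess denote the current gradient and curvature quantities and lambda the tuning parameter.

soft_update <- function(beta, grad, hess, lambda) {
  u <- hess * beta - grad            # curvature-scaled unpenalized target
  (u / hess) * max(1 - lambda / abs(u), 0)
}

The group updates that follow replace the absolute value by the Euclidean norm of the group and the scalar curvature by the largest eigenvalue of the corresponding Hessian block.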
Given current estimates ˜0, ˜, the second order Taylor series approximation of the negative log-likelihood around the current estimate is    exp(cid:16)− ˜0 − ˜ (cid:16)− exp(cid:16)− ˜0 − ˜ (cid:17)(cid:16) 0 +  =1    (cid:17) + 1(cid:17)(cid:16)     − ˜0 − ˜   0 +   − ˜0 − ˜    (cid:17)2 (cid:17) ࡁ(0, ) = ࡁ( ˜0, ˜) +   =1 =1 + 1 2 1 2 = where (cid:17) (4.19) (4.20) (4.21)   () ˜( ˜ − 0 −   )2 + ( ˜0, ˜), ˜ =   exp(cid:16)− ˜0 − ˜ ˜ = ˜0 + ˜   +  ˜ 56 and ( ˜0, ˜) is a constant given the current estimate. Given ࡁ(0, ), consider updating the current estimates ˘0, ˘. Then (cid:16) ˜ − 0 −  (cid:16) ˜ − 0 −      (cid:17) (cid:17) ˜ ˜ =1     =1 =1 =1 ˜ ࡁ (0, )    ࡁ (0, )  0 = − = − 2ࡁ (0, )      2ࡁ(0, )  0  0 = =   (0, ) = 0 (0, ) =   (0, ) = 0 (0, ) = ˜        (4.22) (4.23) (4.24) (4.25) With these values, the expression for the new coefficients for the penalized problem in equation (4.18), according to the algorithm in Qian et al. (2016) is (cid:16) ˜  ˘  − ˘  (cid:17)(cid:18) (cid:19)   ˘ − ˘  (cid:107)2 (cid:107) ˜  1 − ˜  ˘ () = ˘0() = ˘0 − ˜−1 0 where ˜  is the largest eigenvalue of ˘ , and ˜0 = ˘0 =1 ˜. 4.2.4 Gamma Double Generalized Linear Model  ∼ (, ),  = 1, ...,  where  =   =   and  =  . + (4.26) (4.27) (4.28) (4.29) For simplicity of the derivation, in this section we assume the design matrix includes a column of ones, corresponding to the intercept. Then, the negative log-likelihood is  =1 (cid:26) (cid:16)  =1 ࡁ() = ࡁ( , ) = −  +  (cid:17) + log Ɖ(cid:0)(cid:1) −  +(cid:0)1 − (cid:1) log  (cid:27) (4.30) 57 where  =  , and  =  . Let  = (  , ). The goal is to solve the following minimization problem (cid:13)(cid:13)  (cid:13)(cid:13)2 +     =1 (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2    =1 ˆ = arg min ࡁ() +   where  > 0 and  > 0 are tuning parameters, and   are the positive group lasso weights for each group  = 1, 2, ..., . We have  /  = / = . Define  =  =  = 2ࡁ   2  ࡁ   2ࡁ 2   = ࡁ  =  −  = (cid:16)−−  + 1(cid:17) = (cid:16) = (cid:16) −  +  −  +  Then, by the chain rule, we have ∇ࡁ = (cid:17) + (1)(cid:0)(cid:1) 2 + (0)(cid:0)(cid:1)  − 2 −  −  log  (cid:17) + (0)(cid:0)(cid:1)  −  −  −  log             and ∇2ࡁ =   We define ࡁ() to be the second order Taylor series expansion of ࡁ() around the current estimate of the parameters ˜: ࡁ() = ࡁ( ˜) + ∇ࡁ( ˜)( − ˜) + 1 2( − ˜)∇2ࡁ( ˜)( − ˜) (4.37) Then plug in the expressions for ∇ࡁ( ˜) and ∇2ࡁ( ˜) in terms of , , , and  into the expression for ࡁ() and simplify. The result is (4.31) (4.32) (4.33) (4.34) (4.35) (4.36) ࡁ() = ( ˜) + 1 2 ˜(   )2 + 1 2 ˜(   )2  =1  =1   − ˜(  (cid:110) ˜ (cid:110) ˜   =1 =1 + + ˜)(   ) − ˜(   ˜)(   ˜)(    − ˜(   ˜)(   ) − ˜(   58   )(cid:111)  )(cid:111) +   =1 ˜(   )(   ) (4.38) where ( ˜) is a constant given ˜. Then we would like to minimize the following penalized objective function () := ࡁ() +    . (4.39) To minimize 4.39, we utilize a blockwise majorization descent method. 
Fixing the ࡁ() function at the point ˜, we have  =1 (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)(cid:13)2  =1   ࡁ/ (cid:13)(cid:13)2 +  (cid:13)(cid:13)   ࡁ/      ˜( ˜ ˜ =1 =1 =1  ˜(   =1  ˜) +  =1  =1 ∇ࡁ() = =1     =1 =1 − ˜(   ) + ˜(   ˜) − ˜(   ) + ˜(   ˜) −   =1 =1 ˜ ˜     =1 =1 ˜ ˜   ∇2ࡁ() =  (cid:20)   (cid:20)   =1 − =1 − =1 ˜     =1 ˜ =1 ˜(   ) + =1 ˜) − ˜(   =1 ˜(   ) + =1  ˜) −  ˜( 59 (4.40) ˜(   ) (4.41) ˜) + ˜(   ) (4.42) (4.43) (cid:21) (cid:21) ˜(   ) ˜(   )   (4.44)   (4.45) ˜(   ˜) + ˜(   ˜) +   =1  =1 where ࡁ/  = − ࡁ/ = =1 Also, we have the hessian matrix   () = ࡁ/   =  () = ࡁ/  = For a specific group of the parameter, call it  , or  , we have and we have the following hessian matrices: ˜  = ∇2   ࡁ = ˜  = ∇2   ࡁ =   =1 =1 ˜      ˜      (4.46) (4.47) Given ࡁ(), consider updating the current estimates ˘  for a given group  (1 ≤  ≤ ). Suppose that ˜ and ˘ are the current estimates, and define ˘  = ˜ ( ˘). Then ˘ () is updated by solving (cid:13)(cid:13)(cid:13)  (cid:13)(cid:13)  . (cid:13)(cid:13)(cid:13)2 (cid:13)(cid:13)2 . (4.48) (4.49) (4.50) (4.51) ࡁ( ˘, ˜) + ˘  (   − ˘ ) + arg min   Similarly, ˘ () is updated by solving arg min   ࡁ( ˜, ˘) + ˘  (  − ˘ ) + The expression for the new coefficients for the penalized problem is ˜  2 (   − ˘ )(   − ˘ ) +   ˜  2 (  − ˘ )(  − ˘ ) +   (cid:33) (cid:19)   ˘ − ˘  (cid:107)2 (cid:17)(cid:32) (cid:17)(cid:18) ˘  − ˘  (cid:16) ˜  (cid:16) ˜  ˘  − ˘ 1 − (cid:107) ˜   +   (cid:107) ˜  ˘ − ˘  (cid:107)2 + ˘ () = ˘ () = ˜  1 − ˜  is the largest eigenvalue of ˘  where ˜  4.1 summarizes the process of estimating  and  by minimizing (). , and ˜  is the largest eigenvalue of ˘  . Algorithm 60 Algorithm 4.1: Algorithm for solving the Gamma Double GLM 1. Initialize ˜ and ˜. 2. (Outer Loop) Update the penalized objective function 4.39. a) (Inner Loop: ). Obtain the minimizer of the objective function 4.39. • Compute ˜, ˜, ˜, ˜ for each  = 1, 2, ..., . • Compute ˜  • Initialize ˘ = ˜. • Repeat the following until ˘ converges. and the maximum eigenvalue ˜  for  = 1, 2, ..., . – Update ˘. For  = 1, 2, ..., , ∗ Compute ˘  = ˜ ( ˘) by 4.44. ∗ Compute ˘ () by 4.50. ∗ Set ˘  = ˘ (). • Set ˜ = ˘. b) (Inner Loop: ). Obtain the minimizer of the objective function 4.39. • Compute ˜, ˜, ˜, ˜ for each  = 1, 2, ..., . • Compute ˜  and the maximum eigenvalue ˜  for  = 1, 2, ..., . • Initialize ˘ = ˜. • Repeat the following until ˘ converges. – Update ˘. For  = 1, 2, ..., , ∗ Compute ˘ = ˜( ˘) by 4.45. ∗ Compute ˘ () by 4.51. ∗ Set ˘  = ˘ (). • Set ˜ = ˘. 3. Repeat until ˜ and ˜ converges. 61 4.3 Convexity In this section, we show the convexity of the problem described in the previous section. We can use an alternative formulation to express the model in equation 4.29. Model the response as where and We have and thus i.e.,  ∼ (, ),  = 1, ...,  ˜  =    =  ˜ ˜ =     ˜ =    ˜+ ˜  =  ˜   =   =    = 1 0 (cid:110)log Ɖ( ˜) + ˜ ˜ − ( ˜ − 1) log  + − ˜(cid:111)   =  +    1 1   The negative log-likelihood function of the equivalent formulation is  =1 ࡁ() = 1  ( ˜, ˜) = 1   =1 (4.52) where  = ( , ). For the gradient descent algorithm to reach to the global optimum, the negative log-likelihood function needs to have a global minimum. Therefore, ࡁ() needs to be a convex function. This function may not be convex in general, because an extreme case of the observations may violate the definition of convexity. However, it is convex with high probability as we increase the sample size. 
4.3 Convexity

In this section, we show the convexity of the problem described in the previous section. We can use an alternative formulation to express the model in equation (4.29). Model the response as
\[
y_i \sim \mathrm{Gamma}(\tilde\nu_i,\tilde\lambda_i), \qquad i=1,\ldots,n,
\]
where $\tilde\nu_i$ is the shape parameter and $\tilde\lambda_i$ is the rate parameter, with
\[
\tilde\nu_i = e^{\boldsymbol{x}_i'\boldsymbol{a}}
\qquad\text{and}\qquad
\tilde\lambda_i = e^{\boldsymbol{x}_i'\boldsymbol{b}}.
\]
We have $\mu_i = \tilde\nu_i/\tilde\lambda_i$ and $\phi_i = 1/\tilde\nu_i$, and thus
\[
\boldsymbol{a} = -\gamma,
\qquad
\boldsymbol{b} = -(\beta+\gamma),
\]
i.e., the parameters of the two formulations are related by a linear transformation. The negative log-likelihood function of the equivalent formulation is
\[
\ell(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_i(\tilde\nu_i,\tilde\lambda_i)
= \frac{1}{n}\sum_{i=1}^n\Big\{\log\Gamma(\tilde\nu_i) - \tilde\nu_i\log\tilde\lambda_i - (\tilde\nu_i-1)\log y_i + \tilde\lambda_i y_i\Big\},
\tag{4.52}
\]
where $\theta=(\boldsymbol{a}',\boldsymbol{b}')'$. For the gradient descent algorithm to reach the global optimum, the negative log-likelihood function needs to have a global minimum. Therefore, $\ell(\theta)$ needs to be a convex function. This function may not be convex in general, because an extreme configuration of the observations may violate the definition of convexity. However, it is convex with high probability as the sample size increases.

Definition 4.1 (Asymptotically convex). We say that a sequence of random functions $f_n(x)$ is asymptotically convex if
\[
\lim_{n\to\infty} P\big(f_n(x)\ \text{is convex}\big) = 1.
\tag{4.53}
\]

We have, for the loss function above, the following theorem. The theorem guarantees that when we have a large sample size, it is very likely that we observe a convex loss function.

Theorem 4.1. The loss function (4.52) is asymptotically convex.

Proof. First, we quantify the variation of $y_i$ and $\log y_i$ using the law of large numbers. Observe that the gamma distribution has finite expectation and variance and that
\[
E(y_i) = \mu_i, \qquad i=1,\ldots,n.
\]
Therefore, the variables $y_i-\mu_i$ have mean zero. By the strong law of large numbers, we have
\[
\frac{1}{n}\sum_{i=1}^n (y_i-\mu_i)\xrightarrow{a.s.} 0 \qquad\text{as } n\to\infty.
\tag{4.54}
\]
Also observe that $E(\log y_i) = \psi^{(0)}(\tilde\nu_i) - \log\tilde\lambda_i$, where $\psi^{(0)}(\cdot)$ is the digamma function. Similarly, we have
\[
\frac{1}{n}\sum_{i=1}^n\big(\log y_i - \psi^{(0)}(\tilde\nu_i) + \log\tilde\lambda_i\big)\xrightarrow{a.s.} 0 \qquad\text{as } n\to\infty.
\tag{4.55}
\]
Equations (4.54) and (4.55) guarantee that the following quantities can be arbitrarily close when $n$ is large:
\[
\frac{1}{n}\sum_{i=1}^n y_i \sim \frac{1}{n}\sum_{i=1}^n \mu_i,
\qquad
\frac{1}{n}\sum_{i=1}^n \log y_i \sim \frac{1}{n}\sum_{i=1}^n\big(\psi^{(0)}(\tilde\nu_i)-\log\tilde\lambda_i\big).
\]
Therefore, when $n$ is large, we have
\[
\ell(\theta) \approx \frac{1}{n}\sum_{i=1}^n\Big\{\log\Gamma(\tilde\nu_i) - \psi^{(0)}(\tilde\nu_i)\,\tilde\nu_i + \log y_i + \tilde\lambda_i y_i\Big\}.
\tag{4.56}
\]
By definition, it is easy to show that
• the sum of two convex functions, and
• the composition of two convex functions
are both convex. By the first rule, it suffices to show that
\[
g_i(\theta) = \log\Gamma(\tilde\nu_i) - \psi^{(0)}(\tilde\nu_i)\,\tilde\nu_i + \log y_i + \tilde\lambda_i y_i
\tag{4.57}
\]
is convex. Again observe that $\log y_i$ is a constant, which is automatically convex. Then observe that $y_i>0$ since it follows a gamma distribution; thus $\tilde\lambda_i y_i = y_i e^{\boldsymbol{x}_i'\boldsymbol{b}}$ is also convex in $\theta$. Therefore, it suffices to show that
\[
h_i(\theta) = \log\Gamma(\tilde\nu_i) - \psi^{(0)}(\tilde\nu_i)\,\tilde\nu_i
\tag{4.58}
\]
is convex. By the second rule, the composition of convex functions is convex. Since $\tilde\nu_i=e^{\boldsymbol{x}_i'\boldsymbol{a}}$ is convex in $\theta$, we only need to show that the function
\[
h(s) = \log\Gamma(s) - s\,\psi^{(0)}(s)
\tag{4.59}
\]
is convex on $s\in(0,\infty)$, since $s=\tilde\nu_i>0$. Denote by $\psi^{(k)}(s)$ the $(k+1)$th derivative of $\log\Gamma(s)$, i.e. the polygamma function of order $k$. We have
\[
h'(s) = -s\,\psi^{(1)}(s)
\qquad\text{and}\qquad
h''(s) = -\psi^{(1)}(s) - s\,\psi^{(2)}(s).
\]
Observing that $h''(s)$ is continuous and decreasing, we have
\[
h''(s) \ge \lim_{s\to\infty} h''(s) = 0, \qquad \forall s\in(0,\infty).
\]
Therefore, the function $h(s)$ in equation (4.59) is convex. We conclude that, with probability converging to one as $n\to\infty$, the loss function is convex. $\Box$

Remark 4.1. Since the parameters in the classic formulation are a linear transformation of the parameters in the second formulation, the negative log-likelihood function of the classic formulation is also asymptotically convex. It does not matter which formulation we use; we may easily obtain the parameters of one formulation from the other by applying the linear transformation.

4.3.1 The Asymptotic Convexity

We draw a random design matrix $X$ of size $n\times p$ from a standard multivariate normal distribution with independent entries. The parameters $\boldsymbol{a}$ and $\boldsymbol{b}$ are also drawn from standard multivariate normal distributions, and the response $y$ is generated according to the gamma distribution. Then we randomly draw $N=10000$ pairs $\theta_1=(\boldsymbol{a}_1,\boldsymbol{b}_1)$ and $\theta_2=(\boldsymbol{a}_2,\boldsymbol{b}_2)$ and check the convexity by definition,
\[
t\,\ell(\theta_1) + (1-t)\,\ell(\theta_2) \ge \ell\big(t\,\theta_1+(1-t)\,\theta_2\big),
\tag{4.60}
\]
where $t$ is randomly drawn from the uniform distribution between 0 and 1. We fix $p=10$ and repeat the simulation for $n=1, 5, 10, 20, 50, 100$. The proportion of pairs of points that violate the convexity definition is given in Table 4.1. The simulation shows that as the sample size increases, there is less chance of encountering a non-convex region. Moreover, the sample size need not be very large to reach this good property.
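The check in (4.60) can be carried out directly. The following R sketch mirrors the experiment described above with an illustrative choice of $n$; it is a simplified stand-in for the dissertation's simulation code.

```r
# Empirical convexity check of the loss (4.52): draw random parameter pairs and
# test whether Jensen's inequality ever fails along the segment between them.
set.seed(1)
n <- 20; p <- 10
X  <- matrix(rnorm(n * p), n, p)
a0 <- rnorm(p); b0 <- rnorm(p)                       # true parameters
y  <- rgamma(n, shape = exp(drop(X %*% a0)), rate = exp(drop(X %*% b0)))

negll <- function(theta) {                           # loss (4.52), theta = c(a, b)
  nu  <- exp(drop(X %*% theta[1:p]))
  lam <- exp(drop(X %*% theta[-(1:p)]))
  mean(lgamma(nu) - nu * log(lam) - (nu - 1) * log(y) + lam * y)
}

violated <- replicate(10000, {
  th1 <- rnorm(2 * p); th2 <- rnorm(2 * p); t <- runif(1)
  negll(t * th1 + (1 - t) * th2) > t * negll(th1) + (1 - t) * negll(th2) + 1e-10
})
mean(violated)   # proportion of sampled pairs violating (4.60)
```

Running this for increasing $n$ reproduces the qualitative pattern summarized in Table 4.1: violations become rare quickly as the sample size grows.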
Table 4.1: Proportion of randomly sampled pairs of points that violate the convexity definition for different sample sizes, each over 5 repetitions. The standard errors over the 5 repetitions are given in parentheses.

    n            1          5          10         20         100
    Proportion   0.1210     0.0231     0.0048     0.0004     0.0000
                 (0.0012)   (0.0009)   (0.0003)   (0.0001)   (0.0000)

4.4 Simulation

In the simulation study, we investigate the performance of the method using two examples. The following models are investigated in the simulation study:

• Double Group Lasso - the mean and dispersion are both modeled with a group lasso penalty.
• Group Lasso - the mean is modeled with a group lasso penalty, with only an intercept term for modeling the dispersion. This is achieved by choosing $\lambda_\gamma$ to be large.
• Double Lasso - the mean and dispersion are both modeled with a lasso penalty; no group structure is assumed in the model.
• Lasso - the mean is modeled with a lasso penalty, with only an intercept term for modeling the dispersion; no group structure is assumed in the model.

All models assume the response follows a Gamma distribution. In both examples, $\lambda_\beta$ and $\lambda_\gamma$ are selected using the Bayesian information criterion (BIC).

4.4.1 Example 1

In each run of the simulation, 500 observations and 200 observations are used for the training and testing datasets, respectively. To create the design matrix, we construct four blocks, each with three covariates. The $k$th block, $S_k$ ($k=1,\ldots,4$), is sampled from a multivariate normal distribution with mean 0 and variance $\Sigma$. For $l,m=1,2,3$, we set $\Sigma_{lm}=1$ when $l=m$ and $\Sigma_{lm}=\rho$ when $l\neq m$, where $\rho=0$ or $0.5$. The design matrix is constructed as $S=(S_1,S_2,S_3,S_4)$. For $i=1,\ldots,n$, $y_i$ is sampled from a Gamma distribution with scale parameter $\theta_i$ and shape parameter $\nu_i$, where
\[
\log(\mu_i) = \log(\nu_i\theta_i) = 1 + \sum_{j=1}^3 (-1)^{j+1}\big(s_{ij}^{(1)} + s_{ij}^{(2)}\big),
\tag{4.61}
\]
\[
\log(\phi_i) = -1 + \sum_{j=1}^3 (-1)^{j}\big(0.5\, s_{ij}^{(1)} + 0.5\, s_{ij}^{(3)}\big),
\tag{4.62}
\]
where $s_{ij}^{(k)}$ is the $i$th observation of the $j$th covariate from the $k$th block. In the construction of the response, $S_1$ and $S_2$ are the only active blocks for the mean, and $S_1$ and $S_3$ are the only active blocks for the dispersion.
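A compact R sketch of this data-generating process is given below. It uses the Gamma parametrization with shape $1/\phi_i$ and scale $\mu_i\phi_i$, so that $E(y_i)=\mu_i$ with dispersion $\phi_i$; the helper name and return format are illustrative, not taken from the dissertation's code.

```r
library(MASS)  # for mvrnorm

# Sketch of the Example 1 generator in (4.61)-(4.62).
gen_example1 <- function(n, rho = 0.5) {
  Sigma  <- matrix(rho, 3, 3); diag(Sigma) <- 1
  blocks <- lapply(1:4, function(k) mvrnorm(n, mu = rep(0, 3), Sigma = Sigma))
  S      <- do.call(cbind, blocks)                      # S = (S1, S2, S3, S4)
  eta_mu  <-  1 + rowSums(sapply(1:3, function(j)
                 (-1)^(j + 1) * (blocks[[1]][, j] + blocks[[2]][, j])))
  eta_phi <- -1 + rowSums(sapply(1:3, function(j)
                 (-1)^j * 0.5 * (blocks[[1]][, j] + blocks[[3]][, j])))
  mu  <- exp(eta_mu)
  phi <- exp(eta_phi)
  y   <- rgamma(n, shape = 1 / phi, scale = mu * phi)   # E(y) = mu, dispersion phi
  list(S = S, y = y)
}

train <- gen_example1(500)   # 500 training observations
test  <- gen_example1(200)   # 200 testing observations
```

Only blocks S1 and S2 enter the mean and only S1 and S3 enter the dispersion, so a well-behaved selection procedure should recover exactly those blocks.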
In the simulation study, we are interested in how well the method selects the true nonzero blocks. We consider a block active in the model if at least one of the coefficients in the block is nonzero. We quantify the performance by counting the number of correctly active blocks (C) along with the number of incorrectly active blocks (IC).

Table 4.2 summarizes the average simulation results for the Lasso model and the Double Lasso model. The Double Lasso model performs better than the Lasso model with respect to the log-likelihood and the Gini index calculated on the testing dataset. While the Double Lasso model and the Lasso model both appropriately estimate $S_1$ and $S_2$ as nonzero when modeling the mean, the Double Lasso model incorrectly estimates nonzero blocks less often than the Lasso model.

Table 4.2: Average simulation results for 100 runs for the Lasso and Double Lasso models. The standard error is provided in the parentheses.

                                   Mean            Dispersion
                                   C      IC       C      IC       BIC                log-likelihood    Gini Index
    rho = 0     Double Lasso       2      1.25     2      1.17     -119.86 (28.92)    12.61 (6.12)      0.863 (0.005)
                Lasso              2      1.60     —      —        260.76 (26.96)     -28.34 (6.04)     0.852 (0.005)
    rho = 0.5   Double Lasso       2      1.44     2      1.52     398.54 (35.63)     -27.08 (8.46)     0.791 (0.009)
                Lasso              2      1.76     —      —        659.07 (32.29)     -57.77 (7.55)     0.787 (0.008)

Table 4.3 summarizes the average simulation results for the Grouped Lasso model and the Double Grouped Lasso model. The Double Grouped Lasso model performs better than the Grouped Lasso model with respect to the log-likelihood and the Gini index calculated on the testing dataset. While the Double Grouped Lasso model and the Grouped Lasso model both appropriately estimate $S_1$ and $S_2$ as nonzero when modeling the mean, the Double Grouped Lasso model incorrectly estimates nonzero blocks less often than the Grouped Lasso model.

Table 4.3: Average simulation results for 100 runs for the Grouped Lasso and Double Grouped Lasso models. The standard error is provided in the parentheses.

                                            Mean            Dispersion
                                            C      IC       C      IC       BIC                log-likelihood    Gini Index
    rho = 0     Double Grouped Lasso        2      0.18     2      0.58     -115.23 (28.84)    11.62 (6.15)      0.853 (0.005)
                Grouped Lasso               2      1.07     —      —        263.39 (27.06)     -28.10 (6.06)     0.854 (0.006)
    rho = 0.5   Double Grouped Lasso        2      1.04     2      0.80     400.07 (37.65)     -28.09 (8.50)     0.792 (0.009)
                Grouped Lasso               2      1.18     —      —        665.39 (32.67)     -58.40 (7.61)     0.781 (0.011)

In general, the Double Lasso and Double Grouped Lasso methods outperform the Lasso and Grouped Lasso methods, respectively. In addition, by modeling the dispersion, the model for the mean more accurately reflects the true relationship. That is, it is more likely that $S_3$ and $S_4$ will be incorrectly included in the model for the mean when the dispersion is not modeled.

4.4.2 Example 2

Consider a scenario where we have a binary variable $x_{i1}$, with 0 indicating a good driver and 1 indicating a bad driver. We want to estimate the premium for each driver $i=1,\ldots,1000$. Assume further that we have 9 additional highly correlated explanatory variables that do not affect our response. Formally,
\[
y_i \sim \mathrm{Gamma}(\mu_i,\phi_i), \qquad i=1,\ldots,n,
\]
where
\[
\mu_i = E(y_i) = \exp\{\log(100) + \log(100)\,x_{i1}\}
\qquad\text{and}\qquad
\phi_i = \exp\{\log(1) + \log(10)\,x_{i1}\}.
\]
We compare the results of two different models, Double Lasso and Lasso. The data are simulated 1000 times. Table 4.4 shows the proportion of replications where the $\beta$ and $\gamma$ coefficients related to $x_1$ and $x_2$ (noise) are nonzero. In general, the Double Lasso model appropriately shrinks the coefficients related to $x_2$ while accurately estimating the coefficients related to $x_1$.

Table 4.4: Proportion of replications that the coefficient is nonzero in the Double Lasso model.

                x1                  x2 (noise)
    beta        0.9941 (0.0042)     0.0445 (0.0113)
    gamma       0.9941 (0.0042)     0.0653 (0.0135)

The value at risk (VaR) is also estimated for each simulation, and the resulting distributions are shown in Figure 4.1 and Figure 4.2. The plots on the left represent the 95% VaR for good drivers ($x_1=0$), while the 95% VaR for bad drivers ($x_1=1$) is shown in the plots on the right. The vertical lines represent the true 95% VaR in each case. Since the Double Lasso method models the dispersion as a function of $x_1$, the estimated 95% VaR is much closer to the true value compared to the Lasso model. The Lasso model underestimates the 95% VaR for good drivers while overestimating the 95% VaR for bad drivers. Since the risk capital is underestimated for good drivers, there is a risk of insolvency. Since the risk capital is overestimated for bad drivers, it is likely that the company will lose business from bad drivers.

Figure 4.1: 95% VaR for Double Lasso model
Figure 4.2: 95% VaR for Lasso model
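As a point of reference for Figures 4.1 and 4.2, the true 95% VaR under the Gamma specification above can be computed in closed form from the quantile function. The following R lines are an illustrative check rather than part of the simulation code.

```r
# True 95% VaR for Example 2: a Gamma with mean mu and dispersion phi has
# shape 1/phi and scale mu*phi, so the VaR is a Gamma quantile.
var95 <- function(mu, phi, level = 0.95) qgamma(level, shape = 1 / phi, scale = mu * phi)

c(good = var95(mu = 100,       phi = 1),    # x1 = 0
  bad  = var95(mu = 100 * 100, phi = 10))   # x1 = 1
# A model with a single pooled dispersion cannot match both groups' tails at
# once, which is what the comparison of Figures 4.1 and 4.2 illustrates.
```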
4.5 Empirical Analysis

4.5.1 Data

For our case study, we obtained data from the Office of the Commissioner of Insurance (OCI) of Wisconsin. The dataset contains claim descriptions with loss amounts from the Wisconsin Local Government Property Insurance Fund (LGPIF). Table 4.5 is a summary of the frequency and severity of loss amounts for the 6030 events recorded in the LGPIF dataset. For the analysis, we focus only on claims of the Vandalism event type. The resulting 2084 events have an average property damage amount of $2,695. The maximum loss amount in the sample was caused by vandalism to a pool, which resulted in a loss of $981,598.

Table 4.5: Summary of loss amounts by event type

    Event type    Frequency    Severity
    Vandalism     2084         2,695
    Vehicle       1079         4,274
    Lightning     955          11,156
    Misc          465          24,846
    WaterW        464          78,071
    Wind          403          25,606
    WaterNW       269          33,081
    Fire          217          82,187
    Hail          94           137,480
    Total         6030         19,674

Figure 4.3 gives the reader a better idea of the distribution of words found in the event narratives. The count of each word is shown for the four most common event types. The reader may observe that the words vandalism and glass are common in descriptions of Vandalism events, whereas light, pole, and hit are common in descriptions of Lightning events.

Figure 4.3: Frequency of words in descriptions of Vandalism events

Given that different event types have different average severities, as implied by Table 4.5, we may infer that certain keywords may be indicative of the magnitude of property losses.

4.5.2 Methods

By using cosine similarities, we can extract features from textual descriptions of losses. For example, if we consider the word vandalism, the cosine similarity gives a value between -1 and 1, increasing in the word's relationship with the observed sentence. If we consider multiple words, then we may obtain multiple explanatory variables, which can be used in a multivariate regression model. The question is, which words should be used for extracting the explanatory variables? Previous work has focused on words carefully selected by a human expert. This approach worked for short textual descriptions of insurance losses. However, as the descriptions become longer, we may require a method that is more clever. One possible approach would be to try every possible word in the English language, and then use variable selection methods to reduce the model to one that includes only significant variables. Another possible approach would be to try only the words observed in the descriptions within the dataset. We use the latter approach in this case study.

Figure 4.4: Cosine similarities of selected words with log loss amounts for Vandalism

Figure 4.4 shows a plot of the cosine similarities for selected words against the corresponding loss amounts in log scale. The Spearman correlations between these similarities and the loss amounts for the nine words are shown in Table 4.6. Keywords shown in Table 4.6 may be words that are indicative of large losses. Similarly, words with small correlation may be indicative of small losses. Figure 4.5 illustrates the effect of the presence of selected words in the narrative. The plot illustrates that, for instance, the occurrence of the word caused increases the average loss amount of the event by a significant amount.
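To fix ideas, the sketch below shows one way to turn a claim description into a single similarity score against a keyword using pre-trained word vectors such as GloVe. Representing the description by the average of its word vectors is an illustrative choice here, and the object names are placeholders; the construction actually used follows the earlier chapters of the dissertation.

```r
# Cosine similarity between a claim description and a keyword, using rows of a
# pre-trained embedding matrix (one row per vocabulary word, e.g. GloVe).
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

description_similarity <- function(description, keyword, embeddings) {
  words <- strsplit(tolower(description), "\\s+")[[1]]
  words <- intersect(words, rownames(embeddings))
  if (length(words) == 0) return(NA_real_)
  doc_vec <- colMeans(embeddings[words, , drop = FALSE])  # average word vector
  cosine(doc_vec, embeddings[keyword, ])
}

# Example call (an embedding matrix named `embeddings` is assumed to exist):
# description_similarity("broken window glass at the park shelter",
#                        "vandalism", embeddings)
```

Repeating this for every unique word in the corpus yields one similarity covariate per word, which is the raw material for the design matrix constructed next.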
Table 4.6: Spearman correlation of cosine similarities and loss amounts

    Word          vehicle   computers   equipment   stolen   computer   missing   fire    lot     damaged
    Correlation   0.334     0.321       0.290       0.285    0.282      0.214     0.148   0.142   0.137

Figure 4.5: Effect of word indicator on loss amount

We construct the design matrix $B=(B_1,B_2,\ldots,B_p)$ using the cosine similarities. To do this, consider the matrix of basis functions $B_i=\big(B_{i,1}, B_{i,2},\ldots,B_{i,p}\big)$, where $B_{i,j}=\big(b_1(s_{i,j}), b_2(s_{i,j}), b_3(s_{i,j})\big)$ with $s_{i,j}=\mathrm{sim}_{\cos}(d_i,w_j)$ for description $d_i$ and unique word $w_j$. Here, the $b_k(\cdot)$ are cubic penalized regression spline (P-spline) basis functions with penalty order 1. For an overview of the P-spline basis functions, the reader may refer to Wood (2017). Then, from the QR decomposition $B_j'\mathbf{1}_n = Q_{j,1}R_j + Q_{j,2}\,0$, take $Q_{j,2}$ and form the design matrix $B=\big(B_1Q_{1,2}, B_2Q_{2,2},\ldots,B_pQ_{p,2}\big)$. Under this framework, the response variable is related to the covariate $s_{i,j}$ via a smooth function
\[
f_j(s_{i,j}) = B_{i,j}\,Q_{j,2}\,\beta_j,
\tag{4.63}
\]
where $\beta_j$ is the coefficient vector corresponding to the $j$th word describing observation $i$. This construction ensures that the basis functions satisfy $\sum_{i=1}^n f_j(s_{i,j})=0$ for each $j=1,\ldots,p$.

4.5.3 Results

Figure 4.6 shows the cross-sectional solution path of the algorithm for three selected words. The reader may verify that the group lasso is applied to both the mean and the dispersion of the gamma distribution. Notice that the three lines corresponding to a single word converge to zero at the same tuning parameter value.

Figure 4.6: Solution path of the algorithm

Ten-fold cross validation using the Spearman correlation of the predicted loss amounts is used to determine the tuning parameters for the model. The tuning parameters for the final model are $\lambda_\beta = 0.7040$ and $\lambda_\gamma = 1.0822$. The final model contained 172 words for modeling the mean and 130 words for modeling the dispersion. Figures 4.7 and 4.8 show the curve estimates for 9 selected words for the mean and the dispersion, respectively. Note that the curve estimates for the dispersion associated with computer and damaged are zero. Looking back at Figure 4.4, we can see how the model has captured the change in the mean and variance of the loss amount as the cosine similarity changes.

Figure 4.7: β curve estimates
Figure 4.8: γ curve estimates

The Spearman correlation between the validation sample loss amounts and the predicted loss amounts from the model turns out to be 48.68%, while the Spearman correlation for the training sample is 66.38%. Plots of the predicted losses versus the actual losses in the training and validation samples are shown in log scale in Figure 4.9.

Figure 4.9: Prediction versus out-of-sample claims (log scale)
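The tuning-parameter search described above can be sketched as follows. The fitting and prediction routines named here, fit_dglm() and predict_mu(), are hypothetical stand-ins for the group lasso Gamma double GLM fit and its mean prediction; only the cross-validation scaffolding is shown.

```r
# Ten-fold cross-validation over a grid of (lambda_beta, lambda_gamma),
# scored by the Spearman correlation between held-out losses and predictions.
cv_spearman <- function(B, y, lambda_b, lambda_g, K = 10) {
  folds  <- sample(rep(1:K, length.out = length(y)))
  scores <- sapply(1:K, function(k) {
    fit <- fit_dglm(B[folds != k, , drop = FALSE], y[folds != k], lambda_b, lambda_g)
    cor(y[folds == k],
        predict_mu(fit, B[folds == k, , drop = FALSE]),
        method = "spearman")
  })
  mean(scores)
}
# The (lambda_beta, lambda_gamma) pair maximizing cv_spearman over the grid is
# retained for the final model.
```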
4.6 Concluding Remarks

In this paper, we explored the use of the group lasso technique to predict non-normal loss amounts based on covariates derived from textual descriptions of the losses. Our contribution to the literature is the extension of the group lasso technique to the case where the mean and the dispersion are both modeled with a penalty term, as well as the application of the methodology in the textual data analysis context. The approach has applications in the insurance case reserving problem, and it shows promising results in the real data analysis performed using a training sample and a validation sample. A limitation of our approach is that a vector representation of words must be available for the described approach to work. The methodology may be vulnerable to spelling errors in the descriptions of the losses; however, for the presented dataset, spelling errors were minimal and did not turn out to be a problem in terms of the prediction results. The methodology may also suffer from descriptions with a large number of abbreviations, as is sometimes the case with insurance claim adjuster notes. Dealing with such messy textual descriptions is potential future work.

CHAPTER 5

CONCLUSION

In this dissertation, we introduced a framework for incorporating textual data into insurance claims modeling, and considered its applications in claims management processes. An insurance claim representative is responsible for investigating a claim in order to determine the handling process. We explored the use of word similarities as a tool for modeling insurance claims and mitigating insurance risks.

In Chapter 2, we illustrated how short textual descriptions can be leveraged in the GAM framework for insurance claim classification and risk mitigation. The method relies on a subject matter expert to choose the words most impactful to the response. In Chapter 3, we generalized the previous work to longer textual descriptions without the need for an expert. When we consider all unique words found in the dataset as explanatory variables and impose a GAM, the resulting design matrix is high-dimensional. For this reason, we used a group lasso penalty to reduce the number of coefficients in the model. The scalable, analytical framework proposed provides a parsimonious and interpretable model. We discussed the implications of the analysis, including how the framework may be used by an insurance company. In Chapter 4, we showed how a group lasso type penalty can be incorporated into both the mean and the dispersion parameterization of a Gamma model, and illustrated its use in a predictive analytics application in actuarial science. In particular, we applied the method to an insurance claim prediction problem involving textual data analysis methods. Simulations illustrated the variable selection and model fitting performance of our method.

Our results demonstrate how text mining technology can be incorporated into a traditional regression analysis. The methodology is applicable in many different areas where textual data arise. Possible applications of our approach for an insurance risk manager include:

• Classification of claims based on textual descriptions of the claims
• Classification of policyholders based on textual descriptions of the policyholders
• Prediction of insurance claims at the claim level
• Prediction of insurance claims at the policyholder level
• Analysis of insurance claims and risk mitigation

We make some remarks on the current limitations of our framework, where potential improvements can be made.

• Under the current framework, words not found in the word embedding matrix cannot be used in the modeling.
• The similarity threshold is selected using heuristics by a human expert, under the current framework.
• Because pre-determined word embedding matrices are limited to one-grams (single words) at the time of writing, the incorporation of n-grams (phrases longer than one word used as a search key) remains an open question.
• Further linguistic barriers may exist if the textual descriptions are longer than those appearing in the dataset used for this paper. Examples include polysemy, false friends, and compound words.
• In order to use the proposed method, insurers that focus on specific insurance segments may be constrained to build their own word embedding matrices, as the terms appearing in the claim descriptions may be specific to the field.
For example, a medical insurer may find GloVe insufficient, and may need a word embedding matrix trained on medical terms in order to use our proposed approach.

APPENDICES

APPENDIX A

THREE STAGE APPROACH THEORETICAL RESULTS

In this section, we provide the statistical foundation for the proposed three stage approach. For this reason, we derive the convergence rate of our third-stage estimator. This will establish the statistical consistency of our procedure. In Yang & Maiti (2018), the following result for the second stage estimator has been established.

Lemma A.1 (K. Yang and T. Maiti, 2018). The adaptive group lasso consistently selects the true active predictors in probability, i.e., the estimator $\hat\beta_{AGL}$ satisfies
\[
P\Big(\|\hat\beta_{AGL}(j)\|_2 > 0\ \ \forall j\in T,\ \text{ and }\ \|\hat\beta_{AGL}(j)\|_2 = 0\ \ \forall j\notin T\Big)\to 1,
\tag{A.1}
\]
where $T$ denotes the index set of the truly active predictors.

The result states that with proper choices of $\lambda_1$ and $\lambda_2$, the adaptive group lasso consistently selects the true nonzero predictors. This guarantees the selection consistency of the 3-stage algorithm, since the variable selection is done in the second stage and the third stage does not perform variable selection. It is important for an algorithm to select the correct subset of variables for the model built on them to work.

With similar assumptions, assume we have

Assumption 1. The true functions $f_1,\ldots,f_p$ have smoothness of order $c_n$, i.e.
\[
\int_a^b f_j''(x)^2\,dx \asymp c_n \quad\text{for each } j,
\]
where $a_n\asymp b_n$ means there exist constants $0<c\le C$ such that $c\le a_n/b_n\le C$.

Then, we have

Theorem A.1. Under Assumption 1 and the assumptions in Yang & Maiti (2018), for tuning parameters $\lambda_{31},\ldots,\lambda_{3p}$, we have
\[
\|\hat f - f_0\|_2^2
= O_p\!\left(\frac{m_n\log(m_n p_n)}{c_0\rho_1^2\, n}
+ \frac{m_n^2\,\rho_2^2\, m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\, c_n\right),
\tag{A.2}
\]
where, following the notation of Yang & Maiti (2018), $c_0$, $\rho_1$ and $\rho_2$ are the assumed bounds on the eigenvalues of $B$, $m_n$ is the number of spline basis functions, and $d$ is the order of the spline approximation.

Theorem A.1 shows the rate of convergence of the third stage estimator. There are three terms in the convergence rate: the estimation error, the spline approximation error, and the regularization error. The greater the smoothness functional $c_n$, the smaller the tuning parameters $\lambda_{3j}$ can be taken, so the product does not change. This theorem guarantees that with proper choices of the parameters, the estimated functions are consistent estimators of the true functions that describe the relationship between the variables and the response.

Proof. Consider the third step, where we have the smoothness penalty. Define the event
\[
\mathcal{S} = \{\hat T = T\}.
\]
The previous lemma showed that
\[
P(\mathcal{S})\to 1 \qquad\text{as } n\to\infty.
\]
From now on, we condition on the event $\mathcal{S}$. For convenience, we suppress the subscripts and write $\hat\beta$, $\beta^0$ and $B$ for $\hat\beta_{\hat T}$, $\beta^0_{T}$ and $B_{T}$. To study the characteristics of the smoothness term, note that
\[
\int_a^b f_j''(x)^2\,dx = \int_a^b \beta_j'\,B_j''(x)\,B_j''(x)'\,\beta_j\,dx = \beta_j' V_j \beta_j,
\qquad
V_j = \int_a^b B_j''(x)\,B_j''(x)'\,dx.
\]
Without loss of generality, consider the case where the knots are evenly distributed on the interval $[a,b]$, since changing the length of the intervals does not change the shape of the B-splines but only their span and height (Schumaker, 1981). In the following calculations, we normalize the interval $[a,b]$ so that each knot interval has length $h$. According to Huang et al. (2010), assume the constant length of the interval satisfies $h = O(n^{-v})$ with $0 < v < 0.5$.
The $k$th cubic B-spline basis function can be derived from the definition. Writing $t = x/h - k$, so that its support corresponds to $t\in[0,4]$, it is the piecewise cubic
\[
B_{k,4}(x) =
\begin{cases}
\dfrac{t^3}{6}, & 0\le t\le 1,\\[4pt]
\dfrac{-3t^3+12t^2-12t+4}{6}, & 1\le t\le 2,\\[4pt]
\dfrac{3t^3-24t^2+60t-44}{6}, & 2\le t\le 3,\\[4pt]
\dfrac{(4-t)^3}{6}, & 3\le t\le 4,\\[4pt]
0, & \text{otherwise},
\end{cases}
\]
and we have $B(x)=\{B_{k,4}(x),\ k=1,\ldots,m_n\}$. Taking derivatives, the second derivative of the basis function satisfies
\[
B_{k,4}''(x) = O(h^{-2}) = O(K^2).
\]
Therefore, the elements
\[
V_{k,l} = \int_a^b B_{k,4}''(x)\,B_{l,4}''(x)\,dx = O(K^3), \qquad k,l=1,\ldots,m_n,
\]
and $V_{k,l}$ equals exactly zero if $|k-l|>3$. As a direct result, the eigenvalues of the matrix $V_j$ are bounded from above by $O(K^3)$ and from below by some constant. Similarly, if we use a quadratic B-spline, the corresponding elements are bounded from above by $O(K^4)$ and from below by some constant.

Then we begin the convergence rate part. For a converging sequence $\alpha_n$, define
\[
t = \frac{\alpha_n}{\alpha_n + \|\hat\beta-\beta^0\|_2},
\]
and consider the convex combination $\beta^* = t\,\hat\beta + (1-t)\,\beta^0$. We have $\beta^*-\beta^0 = t(\hat\beta-\beta^0)$, which implies
\[
\|\beta^*-\beta^0\|_2 = t\,\|\hat\beta-\beta^0\|_2 = \frac{\alpha_n\,\|\hat\beta-\beta^0\|_2}{\alpha_n+\|\hat\beta-\beta^0\|_2}\le \alpha_n.
\tag{A.3}
\]
This means $\beta^*$ is within a small distance of $\beta^0$ if $\alpha_n$ is small, and we are safe to use a Taylor expansion. Moreover, if $\|\beta^*-\beta^0\|_2\le \alpha_n/2$, then
\[
\frac{\alpha_n\,\|\hat\beta-\beta^0\|_2}{\alpha_n+\|\hat\beta-\beta^0\|_2}\le\frac{\alpha_n}{2},
\]
which gives $\|\hat\beta-\beta^0\|_2\le\alpha_n$, and hence
\[
\|\hat\beta-\beta^0\|_2\le 2\,\|\beta^*-\beta^0\|_2.
\]
Therefore, it is sufficient to derive the convergence rate for $\beta^*$. Consider the Taylor expansion of the empirical loss $L_n(\beta)$ around $\beta^0$,
\[
L_n(\beta^*)-L_n(\beta^0)
= -\frac1n\,(y-\mu^0)'B\,(\beta^*-\beta^0)
+ \frac12\,(\beta^*-\beta^0)'\,\Sigma(\beta^{**})\,(\beta^*-\beta^0),
\]
where $\mu^0$ is the expectation of $y$ at $\beta^0$, and $\Sigma(\beta^{**})$ is the covariance matrix of $y$ evaluated at $\beta^{**}$, which is located on the line segment joining $\beta^0$ and $\beta^*$. By the definition of $\beta^*$ and the convexity of the stage-3 objective, we have
\[
L_n(\beta^*) + \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{*\prime}V_j\beta_j^*
\;\le\;
L_n(\beta^0) + \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{0\prime}V_j\beta_j^0 .
\]
Combining this with the Taylor expansion result, and using the norm inequality together with the Cauchy-Schwarz inequality to control the spline approximation error, we obtain
\[
\frac14(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0)
\le \frac1n\big|(y-\mu^0)'B(\beta^*-\beta^0)\big|
+ \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{0\prime}V_j\beta_j^0
+ O\big(m_n^2\rho_2^2 m_n^{-2d}\big).
\]
Applying the Cauchy-Schwarz inequality once more to the first term on the right-hand side gives
\[
\frac18(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0)
\le \frac{2}{n^2}\,\big\|\Sigma^{-1/2}(\beta^{**})\,B'(y-\mu^0)\big\|_2^2
+ \sum_{j\in\hat T}\lambda_{3j}\,\beta_j^{0\prime}V_j\beta_j^0
+ O\big(m_n^2\rho_2^2 m_n^{-2d}\big).
\]
Consider also the penalty matrix $D_j$, which has entries $D_{k,l}=1$ if $k=l=1$, $D_{k,l}=2$ if $k=l\neq1$, and $D_{k,l}=-1$ if $|k-l|=1$. The matrix $D_j$ is of order $O(K)$.
Since $\beta^0$ is a constant vector, each term $\beta_j^{0\prime}V_j\beta_j^0$ is of order $c_n$. Rearranging the terms and applying a concentration inequality, see for example Yang & Maiti (2018), we have
\[
\frac18(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0)
= O_p\!\left(\frac{m_n\log(m_np_n)}{n} + m_n^2\rho_2^2 m_n^{-2d}\right)
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n .
\]
By Remark 2.1 in Yang & Maiti (2018), we have
\[
\frac{c_0\rho_1^2}{8}\,\|\beta^*-\beta^0\|_2^2
\le \frac18(\beta^*-\beta^0)'\Sigma(\beta^{**})(\beta^*-\beta^0).
\]
Combining the above results with the conditioning event $\mathcal S$, we have, conditioning on $\mathcal S$,
\[
\|\beta^*-\beta^0\|_2^2
= O_p\!\left(\frac{m_n\log(m_np_n)}{c_0\rho_1^2\,n}
+ \frac{m_n^2\rho_2^2 m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n\right).
\]
By the inequality
\[
E(X) = E(X\mid \mathcal S)P(\mathcal S) + E(X\mid \mathcal S^c)P(\mathcal S^c)
\le E(X\mid\mathcal S) + o(1),
\]
applied with $P(\mathcal S^c)\to 0$, the same bound holds unconditionally:
\[
\|\beta^*-\beta^0\|_2^2
= O_p\!\left(\frac{m_n\log(m_np_n)}{c_0\rho_1^2\,n}
+ \frac{m_n^2\rho_2^2 m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n\right).
\]
Combining this with the argument at the beginning of the proof, we have
\[
\|\hat f - f_0\|_2^2
= O_p\!\left(\frac{m_n\log(m_np_n)}{c_0\rho_1^2\,n}
+ \frac{m_n^2\rho_2^2 m_n^{-2d}}{c_0\rho_1^2}
+ \sum_{j\in\hat T}\lambda_{3j}\,c_n\right). \qquad\Box
\]

APPENDIX B

THREE STAGE APPROACH SIMULATION RESULTS

In this section, we outline the process of constructing the simulated datasets along with the results of our method.

B.1 Data

Several simulated datasets were constructed using various numbers of observations and covariates. For a prespecified number of observations $n$, the covariate value $x_{ij}$ is randomly generated from $N(0,1)$ for each $i=1,\ldots,n$. In addition, for each $j=1,\ldots,p$, the component function $f_j$ is selected from a series of function types: zero, linear, polynomial, exponential, logarithmic, and sinusoidal. The function types are selected randomly with probabilities 0.6, 0.05, 0.2, 0.05, 0.05, and 0.05, respectively. The coefficients used in each function are also randomly selected. The exact functions used in the simulation study can be found in the appendix.

B.2 Simulation Results

The performance of our method is summarized through three characteristics: out-of-sample prediction, function misclassification, and computational performance. In addition, performance results for other models are included for comparison. These include results from intermediate steps, such as Initial Group Lasso and Adaptive Group Lasso, as well as other models such as True Selected, the GAM fit to only the truly nonzero functions, and mgcv, the GAM fit to all functions using the mgcv package.

Out-of-sample prediction is measured by the mean squared prediction error (MSPE), with a training to testing set ratio of 80:20. Figure B.1 shows the comparison of the out-of-sample prediction performance for each method when $X$ has 800 observations and 200 covariates. In addition to the aforementioned comparison models, the Null Model and the true $\sigma^2$ (True Sigma) were added. This plot is fairly representative of the plots constructed for each simulated sample. The MSPE from Step 3 decreases as $\lambda$ decreases, but may increase slightly at the smallest values of $\lambda$. For the best $\lambda$, the MSPE of Step 3 is typically very similar to that of True Selected and True Sigma, with the MSPE of mgcv being relatively close as well. Note that the MSPE of Step 2 is higher than that of the null model, which is expected due to the overfitting nature of that model. This further emphasizes the importance of Step 3.

Figure B.1: Mean squared prediction error for each method when X has 800 observations and 200 covariates.

Figure B.2 shows the performance of the 3-step approach against the other methods, for varying numbers of observations and covariates. These plots simply omit Step 2 for increased readability.
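The data-generating process of Section B.1 can be sketched in a few lines of R. The exact functional forms and coefficient ranges below are illustrative stand-ins (the dissertation lists its exact functions separately); only the structure, covariates from $N(0,1)$ and component types drawn with the stated probabilities, is taken from the text.

```r
# Sketch of the Appendix B simulated-data construction: each covariate is
# N(0,1) and its component function is drawn from zero, linear, polynomial,
# exponential, logarithmic, or sinusoidal shapes with probabilities
# 0.6, 0.05, 0.2, 0.05, 0.05, 0.05.
gen_gam_data <- function(n, p, sigma = 1) {
  X <- matrix(rnorm(n * p), n, p)
  shapes <- sample(c("zero", "linear", "poly", "exp", "log", "sin"), p,
                   replace = TRUE, prob = c(0.6, 0.05, 0.2, 0.05, 0.05, 0.05))
  f <- function(x, shape, a = runif(1, 0.5, 2)) {   # random coefficient
    switch(shape,
           zero   = 0 * x,
           linear = a * x,
           poly   = a * (x^2 - 1),
           exp    = a * exp(x),
           log    = a * log(abs(x) + 1),
           sin    = a * sin(pi * x))
  }
  fx <- sapply(1:p, function(j) f(X[, j], shapes[j]))
  y  <- rowSums(fx) + rnorm(n, sd = sigma)
  list(X = X, y = y, shapes = shapes)
}
```

Datasets of different dimensions, such as the 800-by-200 case used in Figure B.1, are obtained by varying n and p in this construction.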
Figure B.2: Mean squared prediction error from the 3-step approach against other methods, at various dimensions of X.

We now consider the four misclassification types: zero, the percentage of truly zero functions estimated as nonzero; nonzero, the percentage of truly nonzero functions estimated as zero; linear, the percentage of truly linear functions estimated as nonlinear or zero; and nonlinear, the percentage of truly nonlinear functions estimated as linear or zero. Note that nonlinear functions do not include zero functions. Figure B.3 shows the misclassification rates for Step 2, Step 3, and the gamsel model.

Figure B.3: Misclassification error rate for Step 2, Step 3, and the gamsel model, for each misclassification type.

Moreover, Figures B.4 and B.5 show the estimated functions for the simulated dataset used in Figure B.1 from the 3-step approach and the mgcv model, respectively. The 3-step approach provides function estimates very similar to the true functions, with confidence bands that accurately capture the observations. The mgcv model does not perform as well, and has much wider confidence bands.

Figure B.4: 3-step approach function estimates for representative covariates from the model corresponding to Figure B.1.

Figure B.5: mgcv function estimates for representative covariates from the model corresponding to Figure B.1.

Finally, we consider method runtime as a measure of computational performance. Figure B.7 shows the runtime, in seconds, of each model for the various simulated datasets. The mgcv model has a shorter runtime than the 3-step method at the best lambda in subplots A, B, C, and D. However, as the dimension of the dataset becomes larger, the 3-step method at the best lambda runs faster. In Figure B.7, the mgcv model had a runtime of about 12.55 hours, compared to about 1.12 hours for the 3-step method at the best lambda.

The performance of the final model with respect to the three characteristics discussed is directly affected by the choice of $\lambda$ in the adaptive group lasso step. In Figure B.1, smaller values of lambda provide for reduced mean squared prediction error. However, in Figure B.7, smaller values of lambda generally lead to longer runtimes. These observations emphasize the importance of selecting an appropriate lambda in the adaptive lasso step.

Figure B.6: Runtime, in seconds, for each method when X has 800 observations and 100 covariates.

Figure B.7: Runtime, in seconds, from the 3-step approach and the mgcv model, at various dimensions of X.

APPENDIX C

GAMMA DOUBLE GLM WITH LASSO ALGORITHM RESULTS

We want to justify the update schemes in Equation (4.48) and Equation (4.49) by showing that the penalized objective function in Equation (4.39) decreases with each update.

Lemma C.1. Let $u$ be a column vector of length $p$, let $A$ be a positive definite $p\times p$ matrix, and let $\gamma$ be the largest eigenvalue of $A$. Then
\[
u'Au \le \gamma\,\|u\|_2^2 .
\]

Proof. Letting $a=u/\|u\|_2$, it is equivalent to show $a'Aa\le\gamma$. Consider the eigendecomposition $A=\Gamma\Lambda\Gamma'$, where $\Gamma=(\xi_1\cdots\xi_p)$ is orthonormal and $\Lambda$ is a diagonal matrix with $\lambda_1,\ldots,\lambda_p$ on the diagonal. The columns of $\Gamma$ form an orthonormal basis and are the eigenvectors corresponding to $\lambda_1,\ldots,\lambda_p$. Therefore, there exist $c_1,\ldots,c_p$ such that $a=\sum_{i=1}^p c_i\xi_i$. Since $\|a\|_2=1$, we have $\sum_{i=1}^p c_i^2=1$. Moreover,
\[
a'Aa = \langle a, Aa\rangle = \Big\langle \sum_{i=1}^p c_i\xi_i,\; A\sum_{i=1}^p c_i\xi_i\Big\rangle .
\]
By the definition of an eigenvalue, $A\xi_i=\lambda_i\xi_i$ for $i=1,\ldots,p$. Then
\[
a'Aa = \Big\langle\sum_{i=1}^p c_i\xi_i,\ \sum_{i=1}^p c_i\lambda_i\xi_i\Big\rangle
= \sum_{i=1}^p \lambda_i c_i^2
\le \Big(\max_{i=1,\ldots,p}\lambda_i\Big)\sum_{i=1}^p c_i^2
= \gamma .
\]
The inequality holds since $A$ is positive definite, so that all $\lambda_i$, $i=1,\ldots,p$, are positive. $\Box$

Result C.1. For the update of a fixed group $j$ of the $\gamma$ parameters,
\[
\ell_Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big)
\le \ell_Q\big(\breve\beta,\breve\gamma\big)
+ \breve U_j^{\gamma\prime}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)
+ \frac{\tilde\gamma_j^\gamma}{2}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)'\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big),
\]
and the analogous inequality holds for the $\beta$ update.

Proof. Write $\delta=\breve\gamma_j(\mathrm{new})-\breve\gamma_j$, so that $\breve\gamma(\mathrm{new})$ differs from $\breve\gamma$ only in the $j$th group. Substituting $\breve\gamma(\mathrm{new})$ into the quadratic expansion (4.38) and collecting the terms that involve $\delta$, the linear terms sum to $\breve U_j^{\gamma\prime}\delta$, where $\breve U_j^\gamma=U_j^\gamma(\breve\theta)$ is the group gradient in (4.45), and the quadratic term is $\frac12\,\delta'\breve H_j^\gamma\delta$ with $\breve H_j^\gamma=\sum_{i=1}^n\tilde t_i\,\boldsymbol{x}_{ij}\boldsymbol{x}_{ij}'$. Hence
\[
\ell_Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big)
= \ell_Q\big(\breve\beta,\breve\gamma\big) + \breve U_j^{\gamma\prime}\delta + \frac12\,\delta'\breve H_j^\gamma\delta .
\]
Applying Lemma C.1 with $A=\breve H_j^\gamma$, whose largest eigenvalue is $\tilde\gamma_j^\gamma$, gives $\frac12\,\delta'\breve H_j^\gamma\delta\le\frac{\tilde\gamma_j^\gamma}{2}\,\delta'\delta$, and the result follows. $\Box$

Lemma C.2. Each iteration of the inner loops in Algorithm 4.1 always decreases the penalized objective function in Equation (4.39).

Proof. It is sufficient to show that the update schemes in Equation (4.48) and Equation (4.49) always decrease the penalized objective function in Equation (4.39). Without loss of generality, fix $j\in\{1,\ldots,J\}$ and consider the update of $\gamma_j$. Then we simply need to show that
\[
Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big) - Q\big(\breve\beta,\breve\gamma\big)\le 0 .
\]
Then,
\[
\begin{aligned}
Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big) - Q\big(\breve\beta,\breve\gamma\big)
&= \ell_Q\big(\breve\beta,\breve\gamma(\mathrm{new})\big) + \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j(\mathrm{new})\big\|_2
 - \ell_Q\big(\breve\beta,\breve\gamma\big) - \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j\big\|_2\\
&\le \breve U_j^{\gamma\prime}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)
 + \frac{\tilde\gamma_j^\gamma}{2}\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)'\big(\breve\gamma_j(\mathrm{new})-\breve\gamma_j\big)
 + \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j(\mathrm{new})\big\|_2 - \lambda_\gamma w_j^\gamma\big\|\breve\gamma_j\big\|_2\\
&\le 0,
\end{aligned}
\]
where the penalty terms for groups other than $j$ cancel, the first inequality holds by Result C.1, and the final inequality holds because $\breve\gamma_j(\mathrm{new})$ minimizes the majorized objective in Equation (4.49), whose value at $\gamma_j=\breve\gamma_j$ reduces to $\ell_Q(\breve\beta,\breve\gamma)+\lambda_\gamma w_j^\gamma\|\breve\gamma_j\|_2$. The argument for the $\beta_j$ update in Equation (4.48) is identical. $\Box$

BIBLIOGRAPHY

Antweiler, Werner & Murray Frank. 2004. Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance 59.
Chollet, Francois & J. J. Allaire. 2018. Deep learning with R. Manning Publications Co.
Chouldechova, Alexandra & Trevor Hastie. 2015. Generalized additive model selection. arXiv preprint, arXiv:1506.03850v2 [stat.ML].
Dreassi, Emanuela, M. Giovanna Ranalli & Nicola Salvati. 2014. Semiparametric M-quantile regression for count data. Statistical Methods in Medical Research 23(6). 591–610. doi:10.1177/0962280214536636. PMID: 24847899.
Frees, Edward W. 2009. Regression modeling with actuarial and financial applications. Cambridge University Press.
Frees, Edward W., Gee Y. Lee & Lu Yang. 2016. Multivariate frequency-severity regression models in insurance. Risks 2016(4).
Friedman, Jerome, Trevor Hastie & Robert Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1). 1–22.
Gentzkow, Matthew, Jesse Shapiro & Matt Taddy. 2016. Measuring polarization in high-dimensional data: method and application to congressional speech. NBER Working Paper No. 22423.
Goldberg, Yoav. 2017. Neural network methods for natural language processing. Morgan & Claypool Publishers.
Goodfellow, Ian, Yoshua Bengio & Aaron Courville. 2016. Deep learning. MIT Press.
Hastie, T. & R. Tibshirani. 1986. Generalized additive models. Statistical Science 297–310.
Hastie, T. J. & R. J. Tibshirani. 1990. Generalized additive models. Chapman and Hall.
Hastie, Trevor, Robert Tibshirani & Jerome Friedman. 2009. The elements of statistical learning: Data mining, inference, and prediction, second edition. Springer Science & Business Media.
Hatzivassiloglou, Vasileios & Janyce Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. Proceedings of COLING.
Huang, J., L. Horowitz & F. Wei. 2010. Variable selection in nonparametric additive models. Annals of Statistics 38(4). 2282–2313.
Karlgren, Jussi & Douglass Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. Proceedings of COLING.
Kearney, Susan. 2010. Insurance operations. The Institutes.
Lee, Gee Y., Scott Manski & Tapabrata Maiti. 2019. Actuarial applications of word embedding models. ASTIN Bulletin. doi:10.1017/asb.2019.28.
Li, Zheyuan & Simon N. Wood. 2019. Faster model matrix crossproducts for large generalized linear models with discretized covariates. Statistics and Computing. doi:10.1007/s11222-019-09864-2.
Marra, Giampiero & Simon Wood. 2011. Practical variable selection for generalized additive models. Computational Statistics and Data Analysis 55. 2372–2387.
Pang, Bo, Lillian Lee & Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP).
Pennington, Jeffrey, Richard Socher & Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. http://www.aclweb.org/anthology/D14-1162.
Qian, Wei, Yi Yang & Hui Zou. 2016. Tweedie's compound Poisson model with grouped elastic net. Journal of Computational and Graphical Statistics 25(2). 606–625.
Schumaker, L. 1981. Spline functions: Basic theory. John Wiley & Sons, New York.
Simonoff, Jeffrey S. 1996. Smoothing methods in statistics. Springer, New York, NY.
Smyth, Gordon K. 1989. Generalized linear models with varying dispersion. Journal of the Royal Statistical Society, Series B (Methodological) 51(1). 47–60.
Smyth, Gordon K. & Bent Jørgensen. 2002. Fitting Tweedie's compound Poisson model to insurance claims data: dispersion modelling. ASTIN Bulletin 32(1). 143–157.
Sokolova, Marina & Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information Processing and Management 45. 427–437.
Stephens-Davidowitz, Seth. 2014. The cost of racial animus on a black candidate: Evidence using Google search data. Journal of Public Economics 118. 26–40.
Tetlock, Paul. 2007. Giving content to investor sentiment: the role of media in the stock market. The Journal of Finance 62.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 267–288.
Wang, Hansheng & Chenlei Leng. 2008. A note on adaptive group lasso. Computational Statistics & Data Analysis 52(12). 5277–5286.
Wood, Simon. 2013. On p-values for smooth components of an extended generalized additive model. Biometrika 100. 221–228.
Wood, Simon. 2019. mgcv: Mixed GAM computation vehicle with automatic smoothness estimation. https://CRAN.R-project.org/package=mgcv. R package version 1.8-31.
Wood, Simon N. 2017. Generalized additive models: An introduction with R, second edition. CRC Press.
Wood, Simon N., Yannig Goude & Simon Shaw. 2015. Generalized additive models for large data sets. Journal of the Royal Statistical Society, Series C 64(1). 139–155.
Wood, Simon N., Zheyuan Li, Gavin Shaddick & Nicole H. Augustin. 2017. Generalized additive models for gigadata: Modeling the U.K. black smoke network daily data. Journal of the American Statistical Association 112(519). 1199–1210.
Yang, Kaixu & Taps Maiti. 2018. Ultra high dimensional classification consistency with generalized additive model. Technical Report, Michigan State University.
Yang, Yi & Hui Zou. 2015. A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing 25(6). 1129–1141.
Yang, Yi & Hui Zou. 2017. gglasso: Group lasso penalized learning using a unified BMD algorithm. https://CRAN.R-project.org/package=gglasso. R package version 1.4.
Yuan, Ming & Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68. 49–67.
Zou, Hui. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476). 1418–1429.