. 3......”
GP»

. ~
3.5:)»
1.1:...-
4v:

a

; huzu t; 25.. ..

4.ny {.2 in...

. .. n 1.
I'

r:

5 will u

; ..;.... 33......

7.71.3... 52...»... ..
3......

. V

3
v:

.e a) 4.125:
E ‘2, .«tillti... a aux!
\ .‘.v\|..vﬁp
L'- z’: 3.0”
1.:- SSI It}.
(hat; L .a
b! \3 ..\». v
v3.3.ﬂ! aztk
I «f.

I: I... u. 11.
2431;...) liviltﬁ}
.3i&111t=.£|!1:
tun; .i a {Big}
W. i tin-:1! f.
5': Iii! y
11..

.rviiti $ .
J ..}..nv..v.

.
7

 

 

 

{WFHF

LIBRARY
Michigan State
University

This is to certify that the
dissertation entitled

Trimmed and Winsorized Estimators

presented by

Mingxin Wu

has been accepted towards fulﬁllment
of the requirements for the

PhD degree in Statistics

 

 

Yijun Zuo W

Major Professor’s Signaturev

 

Date 05/04/aé

MSU is an Afﬁrmative Action/Equal Opportunity Institution

- ---‘---n--.—.—‘— —.-.-.—.-.-.-.- — -

lC-t-l-I-0-o-l-6-U-o-O-a-O-O-u-t-|-o-o-b-n-aa. - 4

PLACE IN RETURN BOX to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

JUN 2 0 2007

 

,‘ -‘_‘~ f‘ -.-
_u n U , l!)

"-.

 

 

 

 

 

 

 

 

 

 

 

 

 

2/05 p:/CIRC/DateDue.indd-p.1

 

 

TRIMMED AND WINSORIZED ESTIMATORS

By

Mingxin Wu

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Probability and Statistics Department,

2006

ABSTRACT

TRIMMED AND WIN SORIZED ESTIMATORS

By

Mingxin Wu

The dissertation consists of three parts. The ﬁrst part studies trimmed and Winsorized
means based on a scaled deviation. The inﬂuence functions of the trimmed (and Winsorized)
means are derived and their limiting distributions are established via asymptotic representa-
tions. The performance of these estimators with respect to various robustness and efﬁciency
criteria is evaluated and compared with leading competitors including the ordinary Thkey
trimmed (and Winsorized) means. The resulting trimmed (and Winsorized) means are much
more robust than their predecessors. Indeed they can share the best breakdown point ro-
bustness of the sample median for any common trimming thresholds. Furthermore, for
appropriate trimming thresholds they are highly efficient for light-tailed symmetric mod-
els and more efﬁcient than their predecessors for heavy-tailed or contaminated symmetric
models.

The second part of the dissertation pertains to applying the same trimming scheme to the
scale setting. In this part, trimmed (and Winsorized) standard deviations based on a scaled
deviation are introduced and studied. The inﬂuence functions and the limiting distributions
are obtained. The performance of the estimators is evaluated and compared with respect to
high breakdown scale estimators. Unlike other high breakdown competitors which perform
poorly for light-tailed distributions and for contaminated symmetric distributions with con-
tamination near the center, the resulting trimmed (and Winsorized) standard deviations are
much more efficient than their predecessors for light-tailed distributions for suitably chosen

trimming parameters and highly efﬁcient for heavy-tailed and skewed distributions. At the

same time, they are sharing the best breakdown point robustness of the sample median
absolute deviation for any common trimming thresholds.

The third part is about the multiple least square estimator. In this part we introduce the
least trimmed squares estimator for multiple regression. A fast algorithm for its computation
is proposed. We prove Fisher consistency for the multiple regression model with symmetric
error distributions and derive the influence function. Simulation studies investigate the

ﬁnite—sample efﬁciency of the estimator.

Copyright by
Mingxin Wu
2006

ACKNOWLEDGMENTS

Foremost, I am deeply grateful to my advisor, Professor Yijun Zuo for introducing me to
this wonderful place—MSU. He is gratefully acknowledged for his years of encouragement,
his scientiﬁc inﬂuence on me, his inﬁnite patience, his insights in our numerous discussions,
his ﬁnancial support and his careful review of paper manuscripts. Without these generous
assistance, this dissertation could not have come into light. To be one of his students is my
great honor.

I also wish to express my gratitude to my dissertation committee, Professor Connie Page,
Professor Habib Salehi, Professor Lijian Yang, for sparing their precious time to serve on
my committee and giving valuable comments and suggestions.

I am grateful to Professor Connie Page and Professor Dennis Gilliland for accepting me
as one of the consultants at Cstat. It’s such a good experience to work with Cstat that I-
have had chance to encounter some very interesting topics such as survival analysis, machine
learning and application aspects of statistics.

My thanks also go to Professor James Stapleton for his numerous help and constant
support.

The support and encouragement of members of statistics department are greatly appre-

ciated and acknowledged.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction and Motivation

1.1 Location ......................................
1.2 Scale ........................................

2 Trimmed and Winsorized means based on a scaled deviation

2.1 Introduction ....................................
2.2 Scaled deviation trimmed and Winsorized means ................
2.3 Inﬂuence function .................................
2.4 Asymptotic representation and limiting distribution ..............
2.5 Performance comparison .............................
2.5.1 Breakdown point .............................
2.5.2 Inﬂuence function and gross error sensitivity ..............
2.5.3 Large sample relative efﬁciency .....................
2.5.4 Finite sample relative efﬁciency .....................
2.6 Remarks ......................................

viii

6

8
13
15
15
16
25
26
28

Trimmed and Winsorized standard deviations based on a scaled deviation 30

3.1 Introduction ....................................
3.2 Scaled-deviation trimmed and Winsorized standard deviation .........
3.3 Inﬂuence Function ................................
3.4 Asymptotic representation and limiting Distribution ..............

3.5 Comparison ....................................
3.5.1 Breakdown Point .............................
3.5.2 Inﬂuence Function and Gross Error Sensitivity .............
3.5.3 Large sample relative eﬂiciency .....................
3.5.4 Finite sample relative efﬁciency .....................

3.6 Concluding remarks ................................

The Multiple Least 'D'immed Squares Estimator

4.1 Introduction ....................................

4.2 Deﬁnition and properties .............................

4.3 The inﬂuence function and asymptotic variances ................

4.4 Finite-sample simulations ............................
4.4.1 Algorithm .................................

vi

30
31
34
39
42
42
43
45
48
51

64
64
67
68
75
75

4.4.2 F inite—sample performance ........................ 76

5 Selected proofs of main results and lemmas 79
5.1 Selected proofs for results of chapter 2 ..................... 79
5.2 Selected proofs for results of chapter 3 ..................... 87

BIBLIOGRAPHY 96

vii

2.1
2.2
2.3
2.4
2.5

2.6
2.7

3.1
3.2
3.3
3.4
3.5
3.6
3.7

3.8

4.1

4.2

LIST OF TABLES

Breakdown points of mean, trimmed (Winsorized) means, and median . . . . 16
CESs of mean, trimmed (Winsorized) means, and median at symmetric F . . 24
GESS of mean, trimmed (Winsorized) means, and median at asymmetric F . 24
ARES of T‘3 and T5 relative to the mean .................... 26
ARES of trimmed and Winsorized means and median with

a = 0.01 3 = ﬂ(F, a) ............................. 26
RES of trimmed and Winsorized means with B = 7 .............. 27
RES of trimmed and Winsorized means with a = 0.01 B = BOP, a) . . . . 28
Cross Error Sensitivity .............................. 44
ARES of S and Sw relative to the standard deviation ............. 46
ARE’S with respect to SD ............................ 46
6 Values for S having better ARE than Other Scales ............. 47
[3 Values for Sw having better ARE than Other Scales ............ 47
Stande variance of M AD", 3,130, Qn, Sn, Sum and SD" at normal model 49
RES of various robust scales (6 = 7 for scaled-deviation trimmed / Winsorized

scale) at (1 — €)N(0, 1) + €N(I,0.I) ....................... 50
RES of various robust scales (5 = 7 for scaled-deviation trimmed/Winsorized

scale) at (1 — €)N(0, 1) + 66“,} ......................... 50
Asymptotic relative efﬁciency of the LTS estimator w.r.t. the Least Squares

estimator at the normal distribution for several values of 2 ........... 75
ARES of LTS relative to the LS for p = 3 .................... 77

viii

LIST OF FIGURES

2.1 Inﬂuence function of T2 for N (0, 1) with [3 = 2 and a constant weight. ...... 11
2.2 Inﬂuence functions of T3, for N (0, 1) with B = 2 and a constant weight ....... 12
2.3 Asymptotic breakdown points of trimmed means. ................. 17
2.4 Gross error sensitivity of trimmed means. ..................... 18
2.5 Inﬂuence functions of the trimmed and Winsorized means at normal ...... 20

2.6 Inﬂuence functions of the trimmed and Winsorized means for t(3) with a = 0.1. 21
2.7 Inﬂuence functions of the trimmed and Winsorized means for 0.9N(0, 1)+0.1N (4, 9)

with a = 0.1 ..................................... 22
2.8 Inﬂuence functions of the trimmed and Winsorized means for 0.9N(0, 1) +

0.1N(4, 0.5) with a = 0.1. ............................. 23
3.1 Inﬂuence functions of S for N (0, 1) with a constant weight and ﬂ = 3 ..... 37
3.2 Inﬂuence functions of Sw for N(0, 1) with a constant weight and [3 = 3. . . . 53

3.3 Inﬂuence functions of various scales for normal distribution. (ﬂ = 4.5 for S and SW) 54
3.4 Inﬂuence functions of various scales for Cauchy distribution. ([3 = 4.5 for S and Sw) 55
3.5 Inﬂuence functions of various scales for exponential distribution. ([3 = 4.5 for S

and 5,”) ...................................... 56
3.6 Inﬂuence functions of various scales for 0.9N(0, 1) + 0.1N(1,0.1). (ﬂ = 4.5 for S

and S...) ...................................... 57
3.7 ARE of trimmed and Winsorized standard deviations for normal distribution . . . 58
3.8 ARE of trimmed and Winsorized standard deviations for Cauchy distribution . . . 59

3.9 ARE of trimmed and Winsorized standard deviations for exponential distribution . 60
3.10 GES of trimmed and Winsorized standard deviations for normal distribution . . . 61
3.11 GES of trimmed and Winsorized standard deviations for Cauchy distribution . . . 62
3.12 GES of trimmed and Winsorized standard deviations for exponential distribution . 63

4.1 MSR vs number of iterations with 100 arbitrary initial subsets .......... 66
4.2 LTS estimator v.s OLTS estimator ......................... 78

CHAPTER 1

Introduction and Motivation

1.1 Location

The sample mean is the most efﬁcient location estimator for normal models. It is, how-
ever, not robust. The sample median is the most robust location estimator with the best
breakdown point. It is, however, not eﬁcient for normal models. The trimmed mean is a
compromise between the two extremes. It is more robust than the mean and more efﬁcient
than the median for normal models. It also performs quite well for heavy-tailed non-normal
symmetric distributions. That’s one reason why we use it in Olympic rating system. We
know that for some sports in the Olympics, such as diving, gymnastics (Summer Olympics)
and several years ago, ﬁgure skating (Winter Olympics), the rating system is based on the
trimmed means. In Olympic rating, nine different judges from nine different countries give
nine scores for each athlete. They drop the highest and lowest scores, taking the average
of the rest seven scores as the ﬁnal score of each athlete. The Gold Medal in these games
is awarded to contestants with the highest trimmed scores. And many rating systems for
competitions in our life are also based on trimmed means. The high and low scores are
dropped and the rest are averaged. In every Olympics year, there are a lot of controversies
over the ordinary trimmed mean based scoring system used in competitions. In fact there
are a lot of problems with it. The argument I make in this dissertation is that the problems
can be avoided by trimmed mean /winsorized mean based on a scaled-deviation—indeed,

from a statistical point of view it is statistically superior to the alternatives that have been

proposed and used.
The ordinary trimmed mean associated with the Olympic rating system has the following

shortcomings which can be avoided by the trimmed mean based on a scaled-deviation

1 Ordinary trimmed mean (OTM) cannot to exclude all the outliers when outliers come
from one side instead of both sides. This time outliers are from lower side, it will
underestimate the athlete (center); next time when the outliers are from upper side,

it will overestimate the athlete (center).

2 Another problem involved in the Olympic rating system is: Arbitrarily throwing out
the high and low marks in Olympic rating is also a remarkably poor solution to the
problems of national bias or human error that arise from time to time on judging
panels, because if you throw away a ﬁxed fraction of data you automatically make the
assumption that two out of nine judges are necessarily mistaken and the other seven
are “correct”. This is an unreasonable assumption. so OTM is mechanical because
it always trims a ﬁxed fraction of data points at both ends of a data set no matter

whether these data points are “good” or “bad”.

These disadvantages of the Olympic rating system motivate us to consider trimmed means
based on a scaled deviation. It turns out that trimmed means based on a scaled deviation
is ﬂexible and random. They may trim some or no sample points and have the power
to distinguish outliers no matter which side they are coming from. Detailed comparisons
with leading competitors on various robustness and efﬁciency aspects reveal that the scaled
deviation trimmed means behave very well overall and consequently represent very favorable
alternatives to the ordinary trimmed means.

Besides trimming, winsorizing is another robust method to mitigate inordinate inﬂuence
of extreme values. Unlike the trimmed mean, the Winsorized mean replaces the outliers with
cutting-point values, rather than discarding them. The Winsorized mean based on scaled-
deviation has the highest breakdown point and is more efﬁcient than the corresponding
trimmed mean when the cutting parameter {3 (see section 2.2) is small. But when H is large,
scaled-deviation trimmed mean almost has the same efﬁciency as Winsorized mean. That is

because when [3 is large, there is not too much information contained in both tails. When

there are “bad” points presented from either end, the Winsorized mean is less efﬁcient than

trimmed mean.

1.2 Scale

A fundamental task in many statistical analyses is to characterize the spread, or variability,
of a data set. Measures of scale are simply attempting to estimate this variability.

When assessing the variability of a data set, there are two key components:
1 . How spread out are the data values near center?
2 . How spread out are the tails?

Different numerical summaries will give different weight to these two elements. The choice
of scale estimator is often driven by which of these components you want to emphasize.

The histogram is an effective graphical technique for showing both of these components of
the spread, however it is just descriptive. There are several common numerical measures of
the spread: the variance, standard deviation, average absolute deviation, median absolute
deviation, interquartile range and range. The variance, standard deviation, average absolute
deviation, and median absolute deviation measure both aspects of the variability, that is,
the variability near the center and the variability in the tails. They differ in that the
average absolute deviation and median absolute deviation do not give undue weight to the
tail behavior. On the other hand, the range only uses the two most extreme points and the
interquartile range only uses the middle portion of the data.

The standard deviation is an example of an estimator that is the best in terms of efﬁciency
if the underlying distribution is normal. However, it lacks robustness validity. That is,
conﬁdence intervals based on the standard deviation tend to lack precision if the underlying
distribution is in fact not normal. It has the lowest possible explosion breakdown point.

The median absolute deviation and the interquartile range are estimates of scale that have
robustness of validity. However, the median absolute deviation is not particularly strong for
efﬁciency. The median absolute deviation estimator (MAD) has a low efﬁciency for normal

distributions (36.75%), thereby leading to rather unsatisfactory results for normal models.

The interquartile range is not particularly strong for robustness, it can not reach the highest

possible breakdown point.
Rousseeuw and Croux (1993) introduced two alternative statistics more efﬁcient than the

MAD, which are deﬁned as

$7520 = Cs med,{medj|a:,- '— IUjI}, Qn = d‘Uxi — le; i< j}(k)

where c8, of are consistent coefﬁcients, [(1 = (9‘) R5 (3) / 4 with h = [n/2] + 1 and (k)

is the k-th ordered statistics.

The 5,130 has efﬁciency for normal distributions 58.23%, while Q. has 82.27%. They are
still not good. Especially at the situation that there are contaminating points presented
close to the center, MAD, 8'50 and Qn are quite inefﬁcient. Motivated by these facts, we
introduced scaled deviation trimmed and Winsorized standard deviations.

The resulting trimmed (and Winsorized) standard deviatons are much more efﬁcient than
their predecessors for light-tailed distributions by suitably choosing the cutting parameter
and highly efﬁcient for heavy-tailed and skewed distributions. At the same time, they are
I. sharing the best breakdown point robustness of the sample median absolute deviation for
any common trimming thresholds. Compared with their predecessors, they can achieve the
best eﬁciency when points around the center are contaminated. Indeed, the scaled devi-
ation trimmed (Winsorized) standard deviatons behave very well overall and consequently

represent very favorable alternatives to other types of scales.

CHAPTER 2

Trimmed and Winsorized means based

on a scaled deviation

2.1 Introduction

Thkey trimmed (and Winsorized) means are among the most popular estimators of location
parameter; see, e.g., Stigler (1977). They overcome the extreme sensitivity of the mean
while improving the efﬁciency of the median for light tailed distributions. The robustness
and efficiency are two fundamentally desirable properties of any statistical procedure. They,
however, do not work in tandem in general. The trimmed (and Winsorized) means somehow
can keep a quite good balance between the two. The 'Ihkey trimming scheme is a symmetric
one in the sense that it trims the same number of sample points at both ends of data and
hence is quite eﬁcient for symmetric distributions. It, however, becomes less efﬁcient when
there is even just a slight departure from symmetry, e.g., with one end containing outlying
points. Metrical trimming, introduced in Bickel (1965), trims points based on their distance
to the center - median and hence is more efﬁcient for contaminated symmetric models. Like
the ordinary trimming, it always trims a ﬁxed fraction of sample points, no matter those
points are “good” or “bad”. This raises a concern as to whether there is a trimming scheme
that only trims points that are “bad” , which motivates us to consider in this chapter the

so-called scaled deviation trimmed and Winsorized means.

The main idea behind the new trimming scheme is that sample points are trimmed based

on the magnitude of their scaled (standardized) deviations to a center (say median). Only
points with the scaled deviation beyond some ﬁxed threshold are trimmed. This new trim-
ming scheme can lead to the best possible breakdown point (see Section 5.1 for deﬁnition)
robustness. The resulting estimators are also highly efﬁcient at light-tailed symmetric mod-
els and much more eﬂicient than the Tukey trimmed and Winsorized means at models with
a slight departure from symmetry or with heavy tails. Hence they represent favorable al-

ternatives to their predecessors.

The rest of the chapter is organized as follows. Section 2 deﬁnes the scaled deviation
trimmed and Winsorized means and discusses some primary properties. Section 3 inves-
tigates the local robustness, the inﬂuence functions, of the estimators. The asymptotic
normality of the estimators is established via their asymptotic representations in Section
4. The performance comparison of the estimators with other leading trimmed means with
respect to various robustness and efﬁciency criteria is carried out in Section 5. Concluding
remarks in Section 6 end the main body of the chapter. Proofs of main results and auxiliary

lemmas are reserved for the Appendix.

2.2 Scaled deviation trimmed and Winsorized means

Let u(F) and 0(F) be some robust location and scale measures of a distribution F. For
simplicity, we consider ,u and 0' being the median (Med) and the median absolute deviations
(MAD) throughout the chapter. Assume 0(F) > 0, namely, F is not degenerate. For a
given point 2:, we deﬁne the scaled deviation (generalized standardized deviation) of a: to
the center F by

D(2:, F) = (a: — p(F))/0(F). (2.2.1)

Now we trim points based on the absolute value of this scaled deviation and deﬁne the ﬁ

scaled deviations trimmed mean at F as (c.f. Zuo (2003) for a multi-dimensional version)

I 1(|D($i Fll S H) W(D(~’ci F ))$dF (1')

5 __
T (F) ‘ fI(ID(:r,F)I s mw(D(x.F»dF(x) ’

 

(2.2.2)

where 0 < H S 00 and w is an even bounded weight function on [—oo,oo] so that the

denominator is positive. The heuristic idea behind this deﬁnition is that one trims points

that are far (60) away from the center and then one weights (not just simply average)
remaining points based on the robust scaled deviation with larger weights for points closer
to the center. When w is a non-zero constant, T3 becomes the plain average of points after
the trimming. To cover a broader class of the trimmed means, we consider general w in
our treatment. Note that in the extreme case 6 = 00 (w = c 75 0) Tﬂ becomes the usual
mean. A concern might be that T5 throws away useful information in the tails. A remedial
measure is the Winsorization. For the completeness of our discussion, we consider here the

£3 scaled deviations winsorized mean at F, deﬁned as

[($I(ID($rF)I S [3) + L(F)I($ < L(F)) + U (F )I($ > U (F )l) W(D(=€i F NF (2?)
f W(D(IiF )ldF (1')

 

Tim =

(2.2.3)
where L(F) = ”(F) — ﬂo(F) and U(F) = u(F) + 60(F). In the extreme case 6 = 0, T5
degenerates into the median. For a ﬁxed 6, we sometimes suppress 5 in Ta and T5 for

convenience.

Since both a and o are afﬁne equivariant, i.e., #(FaX-l-b) = ap(FX) + b, ”(Fax-H») =
|a|o(FX) for any scalars a and b, where FX is the distribution of X, it is readily seen that
|D(:r, F)| is afﬁne invaraint and T thus is afﬁne equivariant as well. For X ~ F symmetric
about 0 (i.e. d:(X — 0) have the same distribution), it is seen that T(F) = 9, i.e., T is
Fisher consistent. Without loss of generality, we can assume 6 = 0. Let Fn be the usual
empirical version of F based on a random sample. It is readily seen that T(Fn) is also afﬁne
equivariant. It is unbiased for 0 if F is symmetric about 0 and has an expectation. For
Tw(F) and Tw(Fn), all these properties hold.

Two popular trimmed means in the literature are: the ordinary trimmed mean ('Illkey
(1948)) and the metrically trimmed mean (Bickel (1965), Kim (1992)), deﬁned respectively

as

1 F‘1(1-a/2) 1 u(F)+V(F)
73(F)=— mm), T3<F>= [m m
p —u

1 _ a F—1(0/2) '1—:—a $dF(.’D), (2.24) l
where F-1(r) is the rth quantile of F and Fara?) + V(F)) — F(u(F) - V(F)) = 1— a. It is

readily seen that these trimmed means are also afﬁne equivariant and consequently Fisher

consistent for symmetric F. The two trimming schemes are probability content based. The

A former, however, trims equally (500%) of points at each tail. This is not always the case for
the latter (though total points trimmed are also 1000%). At the sample level, Tg'(Fn) trims
a ﬁxed (equal) number of sample points at each tail while T,‘,?,(Fn) trims sample points at
both tails or just one tail with the same total number of points trimmed as in the former

case. For performance evaluation and comparison of T5 and T3 in later sections, Tg‘ and

’13 will be used as benchmarks.

Note that the proportion of the trimmed points for a ﬁxed 3, P(|D(X,F)| > ,6), in
T3 (F) is not ﬁxed but F-dependent. In the sample case, the proportion of sample points
trimmed is random. T 8 (Fa) may trim some or no sample points. So Tﬂ(Fn) is also called a
randomly trimmed mean. The random trimming scheme here based on the scaled deviation is
interconnected to the usual trimming scheme based on the probability content, nevertheless.
Indeed, in the population case set 6 to be the (1-0)th quantile of the scaled centered variable
|X — p(F)|/o(F), then Tf3 is just a regular trimmed mean that trims 1000% of points at
tails for symmetric F. For example, if one wants to trim 0 = 10% points at tails, then
simply set 6 = (P'l(0.95)/<I>"1(0.75) = 2.4387 for normal F and 6 = 6.3138 for Cauchy F.
A large 6 corresponds to a small 0 and consequently is in favor of the efﬁciency of T‘6 (and

T3) at light-tailed F (see Sections 5.3 and 5.4).

2.3 Inﬂuence function

We ﬁrst investigate the local robustness of the functional TB(F) and T5 (F). Here F is the
assumed distribution. The actual distribution, however, may be (slightly) different from
F. A simple departure from F may be due to the point mass contamination of F that
results in the distribution F(5, 61) = (1 — e)F +653, where 63; is the point mass probability
distribution at a ﬁxed point a: E R. It is hoped that the effect of the slight deviation from F
on the underlying functional is small relative to e. The inﬂuence function (IF) of a statistical
functional M at a given point :r E R for a given F, defined as (see Hampel et al. (1996))

1F($; M(F)) = EEI51+(M(F(&62)) - M(F))/€i (2-3-1)

exactly measures the relative effect (inﬂuence) of an inﬁnitesimal point mass contamination
on M. It is desirable that this relative inﬂuence IF(:r; M (F)) be bounded. This indeed is
the case for 731(F) (see, e,g, Serﬂing (1980)), Tﬁ,(F) (Kim (1992)), T3 (Theorem 2.3.1) and
T5 (Theorem 2.3.3) but not for the mean functional with :c — E (X ) as its inﬂuence function
for r.v. X ~ F.

The integrands in Tﬂ (F) (Tg(F)) are complicated functions of F and the derivation of
the inﬂuence functions thus is a bit involved. We ﬁrst work out the inﬂuence functions of L
and U. Assume F’ = f exists at u and ad: a with f(p) and f(u +0) + f(p - a) positive,

where u and a stand for ”(F) and U(F). Invoicing the chain rule we have the following

preliminary results:
1F(:v; L(F)) = 1F($; u(F)) - [3 1F(x; U(F)), (2-3-2)
IF(:1:; U(F)) = IF(1:; nu?) + {3 IF(.'c; our», (2.3.3)
1F (9:; D(y. F l) = -(D(y.F)1F‘(I;0(F)) + IF(I; u(F)))/0 5 MI, 31), (23-4)
1F($;M(F)) = sign(x—u)/(2f(u)). (2.3.5)

 

, _ sienna: — ul - a) — 21F(x;lu(F))(f(u + a) - f(u - 0))
”(M(F)) ‘ 2W + a) + f(u — a» ’

Now assume that w is diﬂ'erentiable and f exists at L(F) and U (F). Write L and U for
L(F) and U(F) respectively and 6 for fI(|D(:r,F)| S 6) w(D(:c,F))dF(x) and deﬁne

(2.3.6)

61(2) -- %[(U — T) w(ﬁ)f(U) Hie; U(F)) - (L - T) wwmm 1m; L(F))] (2.3.7)
ens) = ﬂ [Luv — T) w(1)(D(y.F))IF(z;D(y.F))dF(y)] (2.3.8)
€3(z) = %[I(x e [L, U])(x - T) w(D(x, m] (2.3.9)
We then have the inﬂuence function of the scaled deviation trimmed mean T3(F) as follows.

Theorem 2.3.1. Assume that F’ = f exists at u, u :l: o, L(F), and U(F) with f(u) and
f(p + a) + f (p -— a) positive and is continuous in small neighborhoods of L(F) and U (F),

and that w(-) is continuously diﬂerentiable. Then for a given 0 < 6 < oo,

1F(:1:;T“(F)) = €1($) + ms) + 23(r). (2.3.10)

The proof is given in section 5.1, chapter 5.
Under the conditions of Theorem 2.3.1, IF($; T6 (F)) clearly is bounded and consequently
T’6 is locally robust. For symmetric F and w = c # 0, the inﬂuence function simpliﬁes

substantially.
Corollary 2.3.2. Let X ~ F be symmetric about the origin and w a non-zero constant.
Under the conditions of Theorem 2. 3.1, we have

. a _ dove [-ﬁa.ﬁo}) ﬂaf(ﬂa))sign(x)
mm! (F))" 2F(ﬁo)—1 +f(0)(2F()60)-1)'

 

 

(2.3.11)

A graph of this inﬂuence function is given in Figure 2.1. The boundedness is clearly
revealed.

To work out the inﬂuence function for T5 (F), we write 61 for f w(D(:l:,F))dF (1c) and
deﬁne

emu) ﬁll-ﬁgure s y s U) + no < L) + me > U) - To) w<1><D(y.F)>h(x,y>dF(y)

(2.3.12)
€w2(a:) =~61—1/[IF(3:; L)I(y < L) + IF(:I:; U)I(y > U)] w(D(y,F))dF(y) (2.3.13)
€w3(:l:) =(Sll-[xI(L S :1: S U) + LI(:c < L) + UI(3: > U) — Tw] w(D(:c,F)). (2.3.14)

We then have the inﬂuence function of the scaled deviation Winsorized mean T5 as follows.

Theorem 2.3.3. Assume that F’ = f exists at u, u i a, L(F), and U (F) with f(u) and
f (u + a) + f ([1 — a) positive and is continuous in small neighborhoods of L(F) and U (F),
and that w is continuously differentiable with rw(1)(r) being bounded for r e R. Then for a
given 0 < ,6 < oo,

1 F (:12; Tw(F )) = ew1(-T) + 131112073) + emu). (2-3-15)

Under the conditions of Theorem 2.3.3, IF(:c; Tw(F)) is readily seen to be bounded and

T3 thus is locally robust. For symmetric F and constant w, the inﬂuence function simpliﬁes

greatly.

10

 

 

IF
0
l

 

l

 

 

 

Figure 2.1. Inﬂuence function of T2 for N (0, 1) with [l = 2 and a constant weight.

11

 

 

0.5 1 .0 1 .5

IF
0.0

—1 .0

-—1.5

 

 

 

 

Figure 2.2. Inﬂuence functions of T3, for N (0, 1) with 6 = 2 and a constant weight.

12

Corollary 2.3.4. Let X ~ F be symmetric about the origin and w a non-zero constant.

Under the conditions of Theorem 2.3.3, we have

sign(x)

f (0)

 

IF(2:;Tw(F)) = F(—,60) + zI(-—ﬂo S :1; _<_ 60) — 601(2: < —ﬂa) + [30I(:c > 50)

(2.3.16)

The boundedness of this inﬂuence function is very clear and also shown in Figure 2.2.

In addition to being local robustness measures, the inﬂuence functions in this section are

useful for establishing the limiting distribution of T5(Fn) and Tg(Fn) in the next section.

2.4 Asymptotic representation and limiting distribu-
tion

Establishing the limiting distribution of the scaled deviation trimmed and Winsorized means
turns out to be a quite challenging task. One possible approach is to establish ﬁrst the
Hadamard differentiability of the functional involved under the supremum norm and then
to employ the inﬂuence function results. This is exactly what is done for the metrically
trimmed mean in Kim (1992). The treatment (proof) there, however, is not quite rigorous.
Here we combine an empirical process theory argument with the inﬂuence function results
obtained in the last section to fulﬁl the task. Asymptotic representations of the estimators
are established ﬁrst.

Theorem 2.4.1. Let F ' = f exist at u and be continuous in small neighborhoods of u i a,
L and U with ﬂy) and ﬂy - o) + f(u + a) positive. Let wm be continuous on R. Then

for0<ﬂ<oo

We.) — Tim = ﬁZIaXsTWF» + unﬁt (2.4.1)

i=1

where IF(:c; T3 (F )) is given in Theorem 2.3.1. Consequently
regime) - The) —» No, 62), (2.4.2)
with 62 = E(IF(X;TB(F)))2.

13

The distribution F is usually assumed to be symmetric about a point 0 in the location
setting. By afﬁne equivariance, we can let 0 = 0. In this case with a (non-zero) constant
weight, (32 takes a much simpler form and will be evaluated at a number of distributions in

the next section.

Corollary 2.4.2. Let X ~ F be symmetric about the origin and w a non-zero constant.

Under the conditions of Theorem 2.4.1, we have

52: ff; $2dF(:r)+szr-Hﬁaf 3" ff}; |$|dF(1:)+(—ﬂ%7-Hf 5" )2 (2.4.3)

(2F( ,30) - 1)2

For T 5 , we can establish results similar to Theorem 2.4.1 and Corollary 2.4.2.

Theorem 2. 4. 3. Assume that F’= f exists at u, u :1: o, L(F), and U(F) with f(u) and
f (p + a) + f (p — a) positive and is continuous in small neighborhoods of L(F) and U (F),
and that w is continuously diﬂerentiable with rw(1)(r) being bounded for r E R. Then for a
given 0 < 6 < 00,
n
5 1 1
Tom) — T3 (F) = ;,- 2:1 mm; T50?» + airﬁ). (244)

where IF(:I:; Tg (F )) is given in Theorem 2.3.3. Consequently
M7180") — Tim» —+ No. 63.). (2.4.5)
with 3,2,, = E(IF(X; T3(F)))2.
The proof is given in section 5.1, chapter 5.

For. F symmetric (about 0) and w non-zero constant, we have a simple speciﬁc form for

6,2”, which will also be evaluated at a variety of distributions in the next section.

Corollary 2.4.4. Let X ~ F be symmetric about the origin and w a non-zero constant.

Under the conditions of Theorem 2.4.3, we have

6?..=(-,s—1+4—‘3")F'~’(—rio)+2<ﬂa>2F(-ra)+/ﬂ

Fm ) m» _; (MI$I+$2)dF(2). (2.4.6)

f (0)

With the results obtained in this and last sections, we are in the position to evaluate the

performance of the scaled deviation trimmed and Winsorized means T5 and T3.

14

2.5 Performance comparison

We now compare the performance of the scaled deviation trimmed and Winsorized means
with the trimmed means in (2.2.4), the mean, and the median with respect to robustness
(breakdown point and inﬂuence function) as well as efﬁciency (asymptotic and ﬁnite sample

one) criteria.

2.5.1 Breakdown point

The ﬁnite sample breakdown point, a notion introduced by Donoho and Huber (1983), is
the most popular measure of the global robustness of an estimator. Roughly speaking, the
breakdown point of a location estimator is the minimum fraction of ‘bad’ (or contaminated)
data points in a data set that can render the estimator beyond any bound. More precisely,
the ﬁnite sample breakdown point of a location estimator T at a random sample X" =

{X1, . . . ,Xn} is deﬁned as
spar, X") = min{-:::- : sup |T(X,',’,) — T(X")|}, (2.5.1)
X3.

where X,'}, are contaminated data resulting from replacing m points of X n with arbitrary

m points. The asymptotic breakdown point (ABP) of T is deﬁned as limn_.oo BP(T , X n).

Since one bad point can ruin the sample mean (Xn), the breakdown point of Xn thus
is 1/n, the lowest possible value. On the other hand, to break down the sample median,
50% of original points must be contaminated (moved to 00). Thus the sample median has a
breakdown point [(n + 1) / 2] / n, the best among all afﬁne equivariant location estimators. It
is readily seen that the regular 0 trimmed mean in (2.2.4) has a breakdown point (L0n/2J +
1) /n whereas the 0 metrically trimmed mean in (2.2.4) has a breakdown point ([0nj + 1) / n
if 0 S 0 S 1/2 — 3/ (2n) or |_(n + 1)/2j /n otherwise. The breakdown point of the scaled
deviation trimmed mean is shown (see Zuo (2003)) to be the same as the median, as long
as w(r) is deﬁned on [0,6] and 1 5 6 < 00. Likewise, one can show the scaled deviation

Winsorized mean has the same breakdown point.

15

1____ ﬂ

Table 2.1. Breakdown points of mean, trimmed (Winsorized) means, and median

 

 

 

 

 

x, T: T5,: T5 T5 Med
1313 l 1552:3213 [an + 2J A [(n + 1)/2J [(n + 1)/21 [(n + 1)/21 [(n + 1)/21

71 Tl Tl Tl n n n
ABP 0 0/2 0A1/2 1/2 1/2 1/2

 

The regular trimmed mean Tg‘ thus has the lowest breakdown point among the three
trimmed means. The metrically trimmed mean T,‘,’, has a higher breakdown point (twice as
high as that of the regular one) when 0 S 1/2 — 3/ (2n) and can attain the best breakdown
value if 0 is higher. The scaled deviation trimmed mean Tﬂ always has the best breakdown
value as long as 1 S 6 < 00. The difference in the breakdown points of the trimmed
means is due to the difference in trimming schemes. The regular and metrically trimmed
means trim always a ﬁxed 1000% points with the former trimming based on the rank of
X,- and the latter based on the rank of IX,- — u(X")|. The scaled deviation trimming is
based on the value of IX,- — u(X")| and it trims only points with “large” deviations. The
breakdown points of the trimmed means are listed in Table 2.1 (0 S 0 < 1, 1 S )6 < 00).
The asymptotic ones are shown in Figure 2.3.

It is noteworthy that the scaled deviation trimming (or Winsorizing) can lead to the
best breakdown robustness, while metrically trimming gains breakdown robustness over the
ordinary trimming. All trimming schemes improve the breakdown robustness of the sample

mean.

‘ 2.5.2 Influence function and gross error sensitivity

The breakdown point measures only the global robustness while the inﬂuence function
can capture the local robustness of an estimator. The two together can provide a more
complete picture of robustness. We now look at the inﬂuence functions of the trimmed (and

Winsorized) means.

The boundedness of its inﬂuence function is the fundamental concern for a functional being
locally robust. The mean functional has an unbounded inﬂuence function. The ordinarily

and the metrically trimmed means are known to have bounded inﬂuence functions; see, e.g.,

16

0.4

0.3 -

0.2 ~

0.1 -

Figure 2.3. Asymptotic breakdown points of trimmed means.

 

 

. Door-H-l-O---tel—-- I-O-Ivv-I-l-O

 

 

 

 

./
/.
—ABP(a, To) .
._._ ABP“), Tm)
....... ABPWJ’)

 

 

0.6

17

0.8

 

I I I I

 

 

 

 

3.5? _GES(a, 10(0)
..... 3580. Tm(¢))(GES(a. T0»)
....... GES(a, Tw(<l>))
3 s

 

 

 

 

Figure 2.4. Gross error sensitivity of trimmed means.

18

Serﬂing (1980) and Kim (1992). In the light of Theorems 2.3.1 and 2.3.3, T5 and T5 have
bounded inﬂuence functions for suitable w and 6. Figure 2.8, which plots their inﬂuence
functions at normal and t (with 3 degrees of freedom) models with 0 = 0.1, conﬁrms this.
Here we set 6 = 6 (F, 0) so that 1000% of points are trimmed in each of the trimming cases
(see Section 2). For convenience, we also set w = c 76 0 in T3 and T5. Note that Tg’ and T5
and their inﬂuence functions are the same under this setting. Indeed, the inﬂuence function

in Theorem 2.3.1 becomes

Uf(U) lax; U*(F)) - Lf(L) 1am; U(F)) + 2:1(2 6 IL. U1) - T3
1 - 0

 

1F(2:;T"(F)) = .

(2.5.2)
where we have for A = ﬂ(F)o(F)

1F(2:; U*(F)) = IF($;#(F)) + IF($; /\(F)); IF(2:; U(F)) = IF(=I=;)U(F)) - 1F(:r; A(F))

(1 -a)-I(u-z\ S x S u+/\) - 1F($;u(F))(f(n+/\) - f(#- M)
f()u+/\)+f(u—>\) '

Thus IF(:c; T5 (F )) is the same as that of T3, in Kim (1992). Since a pure normal model

 

IF(:1:; A(F)) =

is rare in practice, we thus consider contaminated normal models. With the same 0 and
ﬂ = [3(F , 0) as above, the inﬂuence functions of T3, and T 3 in the contaminated normal
models, plotted also in Figure 2.8, become different (but all are still bounded). In terms of
the bounded inﬂuence function criterion, we conclude that all the trimmed and Winsorized

means are equally robust (locally).

Besides boundedness, one can also look at the magnitude of the supremum of
|IF(1:;T(F))|, the so—called the gross error sensitivity (GES) of T at F (Hampel et al.
(1986))

GES(T(F)) = S:%|IF($;T(F))I, (2.5.3)
I

which measures the worst case effect on T of an infinitesimal point mass contamination.
Generally speaking, a smaller CBS is more desirable. For Tﬂ and T5, it is readily seen that
their GES depends on the values of 6 (or 0 if B = 6 (F, 0)) and the weight function w. As a

19

 

IF

 

 

 

 

 

 

 

Figure 2.5. Inﬂuence functions of the trimmed and Winsorized means at normal.

20

 

IF

 

 

 

 

 

 

 

 

Figure 2.6. Inﬂuence functions of the trimmed and Winsorized means for t(3) with 0 = 0.1.

21

 

 

 

 

 

 

IF

 

 

 

 

Figure 2.7. Inﬂuence functions of the trimmed and Winsorized means for 0.9N(0, 1) + 0.1N (4, 9)
with 0 = 0.1.

22

 

 

 

 

    

 

-. OTM I I!
.... Med ’{,".
“' MET "
-- T
N -‘ - TW [’1
It .
-- canoes (I .IaoaoooocoQHoQofﬂoT

I

IF

V _
I

 

 

 

Figure 2.8. Inﬂuence functions of the trimmed and Winsorized means for 0.9N (O, 1) +0.1N (4, 0.5)
with 0 = 0.1.

23

Table 2.2. GESs of mean, trimmed (Winsorized) means, and median at symmetric F

 

Mean T3; T19 T3 T5 Med

F-1 1— 2CF,0 F‘l 1-02CF, F-1 1- 2 _
(810)” )) ( (110)“ all (105/) F 1(1‘0/2)+2—fa(0) ﬁll)

 

+oo

 

Table 2.3. GESs of mean, trimmed (Winsorized) means, and median at asymmetric F

 

Mean T3 T3, T5 T5 Med

F = .9N(0, 1) + .1N(4, 9)
GES +oo 4.2419 2.8563 2.7781 2.4422 1.3787

 

F = .9N(0, 1) + .1N(4, .5)
CBS +oo 3.9989 4.6025 3.5290 3.0995 1.4062

 

possible way to make a comparison, we again set 6 = 6 (F, 0) (w = c 51$ 0) in the following
discussion.

Now ﬁrst consider the case that F is symmetric about the origin and meets the conditions
in Corollary 2.3.2. Deﬁne C(F, 0) = 1+ f(F‘1(1 - 0/2))/f(0). The GES’s of the trimmed
(and Winsorized) means are listed in Table 2.2 for general F and illustrated in Figure 2.4
as functions of 0 at F = <I>, the most interesting and common normal distribution used in
practice.

It can be shown that for any 0 < 0 <1, the GES’s in Table 2.2 are increasingly small
with the median having the smallest one if F’ = f exists and is unimodal. This is also
conﬁrmed in Figure 2.3 for F = <1). On the other hand, it is noted from Figure 2.8 that the
inﬂuence function of T0“ at symmetric F has larger absolute values than those of T,‘,',, T5
and T5 at most values of a: E R, a result much favorable to the efﬁciency of the latter three
(see Section 5.3). Again in practice data follow more often than not an asymmetric model.
It is therefore sensible to consider the GESs of T5 and T5 at F that slightly deviates from
a symmetric model. Table 2.3 lists the CBS results of trimmed and Winsorized means at
such models with 0 = 0.1 and ﬂ = ﬂ(<I>, 0.1).

The relationship between the GES’s of the the trimmed (and Winsorized) means for sym-

24

metric F is altered under just slight contamination. Table 2.3 indicates that T3 can have
the smallest GES among three trimmed means at both contaminated models, whereas both
T3 and 73, can have the largest GES. On the other hand, T5 has smaller GES than three
trimmed means while the median enjoys the smallest GES under slight deviation from sym-
metry. The CBS advantage of T3 and T5 over the two competitors is due to the unique
trimming mechanism. With 5 = ﬂ(<I>, 0.1), T5 (T5) trims only the “bad” points that have .
large scaled deviations while T3 and T3, always trim a ﬁxed 10% percent of points even if

they are “good” points.

2.5.3 Large sample relative efﬁciency

Now we evaluate the performance of the trimmed (and Winsorized) means in terms of their
efﬁciency behavior (relative to the sample mean). First we examine the asymptotic relative
efﬁciency (ARE). Table 2.4 lists the ARE results of T3 and T5 at a number of light- and
heavy-tailed symmetric distributions with different 5 values. Here we again set w = c > 0.
The table reveals that (i) at normal model with larye ,6 both T5 and T5 can be highly
efﬁcient relative to the mean; (ii) their efficiency increases as the tail of the distribution
becomes heavier and exceeds 100% at heavy tailed distributions; and (iii) T5 is more efﬁcient
than T3 for small 5 or normal distribution but when )6 get larger and the tail gets heavier
T5 becomes more efﬁcient.

To compare the efﬁciency behavior of T16 (T5 ) with that of T3 and T3,, we again face
the issue of the choices of the values of 0 and 5. One possible choice again is 5 = ,6(F, 0)
as above. Such a choice is somewhat in favor of T3 and T3, since for very small 0 values
they become quite efﬁcient at normal and other light tailed distributions whereas their
breakdown points become very low at those values but those of T5 and T5 are always the
best. With the choice 5 = 5(F, 0) and 0 = 0.01 (and w = c > 0), the ARES of the trimmed
and Winsorized means and the median are listed Table 2.5. Note again that T3, = T3 under
this setting for symmetric F.

Examining Table 2.5 reveals that in terms of efﬁciency: (i) T3 performs better than T3
and T5 atDE, t3 and t4 (with T3 worst among the ﬁve) and best at t with degrees of
freedom (df) 4 (to 7) and (ii) T3 performs best at normal or very light-tailed F’s such as

25

 

Table 2.4. ARES of Tﬁ and T5 relative to the mean

 

 

 

[3 1 2 3 4 5 6 7
N(0,1) T5 0.7589 0.9262 0.9882 0.9987 0.9999 1.0000 1.0000
T3 0.4678 0.5630 0.7762 0.9377 0.9901 0.9991 0.9999
LG(0,1) T3 0.9644 1.0852 1.0754 1.0410 1.0190 1.0081 1.0033
T3 0.6346 0.7640 0.9231 1.0004 1.0167 1.0133 1.0077
DE(0, 1) T3 1.8924 1.6073 1.3656 1.2125 1.1221 1.0696 1.0394
T 1.7060 1.5493 1.4218 1.3090 1.2159 1.1452 1.0948
t3 T3 1.8649 1.9334 1.7709 1.6059 1.4844 1.3982 1.3358
T5 1.3202 1.5952 1.7613 1.7500 1.6700 1.5826 1.5061
Table 2.5. ARES of trimmed and Winsorized means and median with
a=0.013=ﬁ(F,a)
T3 T3, Tﬁ T5 Med
N(0,1) 0.9982 0.9179 0.9179 0.9981 0.6366
LG(0, 1) 1.0192 1.0161 1.0161 1.0220 0.8225
DE(0,1) 1.0383 1.1107 1.1107 1.0483 2.0000
t3 1.2953 1.4645 1.4645 1.3047 1.6211
t4 1.1168 1.2011 1.2011 1.1226 1.1250

 

logistic (or t with df 2 10) while the median is the best at DE(0, 1) or very heavy-tailed

F’s such as t with df S 4.
The results in Tables 2.4 and 2.5 are in the asymptotic sense and the F’s are symmetric.

This raises the concern as to whether these results are valid at ﬁnite sample practice and
for F’s with slight departure from symmetry. We answer this question via ﬁnite sample

simulation studies.

2.5.4 Finite sample relative efﬁciency

We now conduct Monte Carlo studies to investigate the efﬁciency behavior of Tﬁ and T5

at ﬁnite samples for normal and contaminated normal models. Here 0 = 0 is regarded as

26

Table 2.6. REs of trimmed and Winsorized means with H = 7

 

T3 T5 Mean T13 T5 Mean T5 T5 Mean
71 s=0% s=10% s=20%

20 EMSE 0.05 0.05 0.05 0.11 0.18 0.25 0.34 0.60 0.77
RE 0.99 1.00 1.00 2.31 1.40 1.00 2.30 1.29 1.00

40 EMSE 0.03 0.03 0.03 0.07 0.15 0.20 0.28 0.56 0.70
RE 1.00 1.00 1.00 2.99 1.40 1.00 2.51 1.26 1.00

60 EMSE 0.02 0.02 0.02 0.06 0.14 0.19 0.26 0.55 0.68
RE 1.00 1.00 1.00 3.41 1.40 1.00 2.61 1.24 1.00

80 EMSE 0.01 0.01 0.01 0.05 0.13 0.18 0.25 0.54 0.67
RE 1.00 1.00 1.00 3.70 1.40 1.00 2.67 1.24 1.00

100 EMSE 0.01 0.01 0.01 0.05 0.13 0.18 0.24 0.54 0.66
RE 1.00 1.00 1.00 3.95 1.40 1.00 2.73 1.24 1.00

 

the target parameter to be estimated. For an estimator T the empirical mean squared error
(EMSE) is: EMSE = 97-, 23:1 IT]- — 0|2, where m is the number of samples generated and T]-
is the estimate based on the jth sample. The relative efﬁciency (RE) of T is then obtained
by dividing the EMSE of the sample mean by that of T. We generated m = 50,000 samples
from (1 — s)N (0, 1) +€N (4, 9) with s = 0, .1 and .2 for different sample sizes 71. Some results
are listed in Table 6 with 5 = 7.

The RE results at s = 0 conﬁrm at the validity at ﬁnite samples of the asymptotic result
in Table 2.4 with B = 7. When there is just a 10% or 20% contamination, both T‘3 and T5
become overwhelmingly more efﬁcient than the mean with TB substantially more efﬁcient
than T5.

To compare the efficiency of the trimmed and Winsorized means, we now set 5 = B (F, 0)
again. Note that at this setting T3 and T3, have very low breakdown point for small 0
whereas T5 and T5 always enjoy the best breakdown point. We consider F = <1) and set
a = 0.01. Table 2.7 lists the relative efﬁciency results at (1 —s)N(0, 1)+sN (4, 9) for s = 0.1
and 0.2. Note that we set 71 = 200 or larger so that T3 trims at least one sample point at

each end of data.

27

Table 2.7. RES of trimmed and Winsorized means with a = 0.01 5 = [3(<I>, a)

 

mmTﬁTgMedTngTﬁTEMed
n =10% €=20%
200 1.158 1.579 19.70 2.996 7.988 1.075 1.269 21.79 2.384 9.015
400 1.167 1.615 30.35 3.059 9.595 1.076 1.278 25.71 2.394 9.566
600 1.171 1.630 38.34 3.102 10.48 1.079 1.281 27.33 2.397 9.780
800 1.172 1.636 43.39 3.114 10.90 1.079 1.282 28.39 2.403 9.914
1000 1.173 1.640 47.42 3.125 11.24 1.079 1.283 28.96 2.403 9.969

 

Our simulation results for s = 0.0 (not listed in the table) conﬁrms the validity of the
asymptotic ones in Table 2.5 for N (0, 1). On the other hand, when there is just a 10% or
20% contamination in the distribution, all the estimators become more eﬁcient than the
sample mean with T3 , T3,, T5, median, and T3 being increasingly more efﬁcient, reﬂecting
the robustness of these estimators. It is remarkable that T3 is overwhelmingly more efﬁcient

than all other estimators.

2.6 Remarks

Unlike the mean, all the trimmed and Winsorized means discussed in the chapter have
bounded inﬂuence functions for suitable distributions and weight functions and hence are
locally robust. In terms of the global robustness, the scaled deviation trimmed and win-
sorized means T5 and T5 are exceptional in the sense that they can enjoy the best possible
breakdown point robustness for any 5 2 1 whereas the ordinary and metrically trimmed
means Ta and T3, have much lower breakdown point for typical choices of a and the mean
has the worst.

Relative to the mean, T5 and T5 are highly efﬁcient for large [3’s at light-tailed symmetric
distributions and much more efﬁcient at heavy-tailed ones. When 5 is set to be 3(F, a)
so that 100a% points are trimmed, T5 and T5 are less efﬁcient than T“ at light-tailed

symmetric distributions but become much more efficient at heavy-tailed or contaminated

28

symmetric ones. The latter models seem more popular than the light-tailed symmetric ones
in practice.

The advantages of Tﬂ and T5 over Ta and T3, on robustness and efﬁciency are due to
the difference in the trimming schemes. The latter trim always a ﬁxed fraction of sample
points no matter they are “good” or “bad” whereas the former trim only when there are

“bad” points.

A very legitimate concern for T5 and T5 in practice is about the choice of [3 value. In
light of our simulation studies, we recommend a 5 value between 4 to 7 so that Ty and
T5 can be very efﬁcient at both light- and heavy-tailed distributions. Instead of a ﬁxed
value one might also adopt an adaptive data-driven approach to determine an appropriate
5 value. For a given data set, one determines a value for 5 based on the heaviness of the
tail. Generally speaking, a large value of ,6 is selected for a light-tailed data set while a

smaller value for a heavy-tailed one.

The basic idea of adaptive trimming above exists in the literature, see, e.g., Jaeckel
(1971), Hogg (1974), and Jureckova et al. (1994). Furthermore, a random trimming idea
also appeared in Shorack (1974), though the trimming proportion there is (asymptotically)
ﬁxed. Consequently the trimmed and Winsorized means in Shorack (1974) are different in
essence from the ones in this chapter. We note that Tﬁ(Fn) in (2.2.2) has some connection
with (however is very different from) the (scaled version of) Huber-type skipped mean (see,
e.g., Hampel et al. (1986)), the solution Tn of Z,- XiI(—;6 S .(Xi—Tn)/an S [3)/ Z,- I(—ﬂ 5
(Xi - Tnl/On S B) = Tn-

We remark that a general multi-dimensional version of (2.2.2) (but not (2.2.3)) has been
thoroughly studied in Zuo (2003). Here we focus on the performance evaluation of the
speciﬁc one-dimensional version TB and provide speciﬁc and concrete results for inﬂuence
function and limiting distribution as well. Multidimensional version of (2.2.3) is yet to be

studied.

29

 

CHAPTER 3

Trimmed and Winsorized standard

deviations based on a scaled deviation

3.1 Introduction

Scale is an important parameter of interest which can tell us how spread out the data is.
- Although there are many robust location estimators, robust scale estimators seems far much
' less. Despite this fact, ﬁnding robust scale statistics with a high level efﬁciency is always
—. (the goal of many statisticians. Two types of trimmed standard deviations are introduced
and discussed by Welsh and Morrison (1990). But unfortunately, these estimators can not
reach highest breakdown point while keeping satisfactory efﬁciency. So this chapter will not
discuss these versions of trimming. Another attempt was made by Rousseeuw and Croux
(1993), in which two types of high breakdown estimators are introduced. Those estimators
are built through recursive medians. Put aside computation time, the efﬁciency is not
very good at light-tailed distribution, though they are better than the widely used robust
estimator, the median absolute deviation (MAD). And these estimators perform poorly
when points in the neighborhood of the center are contaminated, because they only use
the middle half of the data points or “extended” data points. These situations motivate us
to consider in this chapter the so-called scaled-deviation trimmed and Winsorized standard

deviations.

The high breakdown scale estimators—scaled-deviation trimmed and Winsorized standard

30

deviations. 'ﬁimmed and Winsorized standard deviations enjoy the highest breakdown point
and bounded inﬂuence functions for a variety of distributions. They are also much more
efﬁcient at light-tailed symmetric models than their predecessors and highly efﬁcient for
heavy tailed or skewed distributions. They also have the best performance among high
breakdown estimators when the points somewhere around the center are contaminated.

Hence they represent favorable alternatives to their predecessors.

Section 3.2 introduces Scaled-Deviation trimmed / Winsorized standard deviations; Sec-
tion 3.3 is devoted to the study of the local robustness; Asymptotic representation and
asymptotic normality are treated in section 3.4. Comparisons of inﬂuence functions and
Gross Error Sensitivities of various high breakdown point scale estimators are undertaken

in section 3.5. Proofs of main results and auxiliary lemmas are reserved for Chapter 5.

3.2 Scaled-deviation trimmed and Winsorized stan-
dard deviation

Let M(F) and U(F) be some robust location and scale measures of a distribution F. For
simplicity, we consider p and a being the median (Med) and the median absolute deviation
(MAD) throughout the chapter. Assume U(F) > 0, namely, F is not degenerate. For a
given point 2:, we deﬁne the scaled deviation (generalized standardized deviation) of $ to
the center F by

D(:c, F) = (a: — M(F))/U(F). (3.2.1)
Now one trims points based on the absolute value of this scaled deviation and deﬁne the ﬁ

scaled deviations trimmed variance functional as

f1(|D($, F)l S ﬂ)w2(D(x, F ))(-'C - T1(F))2dF($)
f I(UN-1313 )l S ﬁ)w2(D($,F))dF($)

 

82(F) = c, (3.2.2)

where C; is the consistency coefﬁcient, T1 (F) the ﬁ scaled-deviations trimmed mean which
is deﬁned through

_ f1(|D($.F)l S ﬂ)wz'(D(z,F))xdF($)

Ti(F) " fI(|D($,F)| g ﬂ)w,-(D($,F))dF($)

 

= 1, 2. (3.2.3)

31

where 0 < H S 00 and mi (2' = 1, 2) is an even bounded weight function on [—00, 00] so that
the denominator is positive. The heuristic idea behind this location definition is that one
trims points that are a robust distance (50) away from the robust center )1 and then one
obtains a robust and eﬂicient location estimator by weighting (including simply averages)
left points, which integrates the robustness of a, p and efﬁciency of mean. When u),- (i=1,
2) is a non-zero constant, T,- (i=1, 2) and $2 is the plain average and plain variance of
points after trimming. We consider general wi (i=1, 2) in our treatment which will contain
a broader class of the trimmed means. Note that in the extreme case 5 = 00 (w = c 79 0)
T; and 32 become the usual mean and usual variance and ct becomes 1.

Another robust estimator of scale is the Winsorized standard deviation. Like the trimmed
standard deviation, the Winsorized standard deviation eliminates the outliers if they exist.
Unlike the trimmed scale, the Winsorized scale replaces the outliers with cutting-point values,
rather than discarding them. The deﬁnition of the 5 scaled deviation wisonsorized variance

functional is given by

$3.03) =0... [ [<02 - Tw1(F))2I(|D($,F)I .<_ m + (L(F) — 5.0219: < L(F))
+ (U(F) — Tw1)21(17 > U(F)» w2(D<x.F»dF(x>]/ / wane. F>)dF<x>
(3.2.4)

where cw is the consistency coefﬁcient, Tw1(F) the B scaled deviations Winsorized mean

which is deﬁned through

we =[ [(21005 F)! s m + 14me < L(F))
+ U(F)I(:r > U(F)))w,-(D(a:,F))dF(x)] / / w,(D(z,F))dF(a:) 7: 1,2,
(3.2.5)
where L(F) = u(F) — 50(F) and U(F) = p(F) + ﬂo(F). In the extreme case )6 = 0, Twi
degenerate into the median and Sm zero.
The ,6 scaled deviation trimmed /Wisonsorized standard deviation at F is the square root

of corresponding variance.

Two popular high breakdown scales are introduced by Rousseeuw and Croux (1993). we

32

denote them by S30 and Q, which are deﬁned as
3.50 = cemeddmedjlz. — avg-I}. 9.. = 40x.- — le; .- < an.) (32.6)

where c3, of are consistent coefficients and k = (’2’) z (’2’) /4 with h = [n/2] + 1. We
will compare S(F), Sw(F) with SR0, Q. It turned out that S(F) and Sw(F) are more
ﬂexible, and more eﬁcient than these two types of scales for light tailed distribution and
for the situation that “bad points” are coming from the neighborhood of the center. Those
estimators are built through recursive medians. For performance evaluation and comparison
of S and Sw in later sections, S30 and Q will be used as benchmarks.

The scaled-deviation trimmed / Winsorized means or variances are different from usual
ones. Note that the proportion of the trimmed points for a ﬁxed ,6, P(|D(X,F)| > 5),
in T(F) (or S(F)) is not ﬁxed but F-dependent. In the sample case, the proportion of
sample points trimmed is not ﬁxed but random. T(Fn) (or S (Fn)) may trim some or no
sample points. So T(Fn) (or S(Fn)) are ﬂexible rather than mechanical. On the other
hand, there is some connections between the estimators introduced above and the usual
trimming/winsorizing scheme based on the probability content. Indeed, set 5 to be the
(1 -—- a)th quantile of the scaled centered variable IX —— p(F)|/0(F), then T(F) (or S (F))
are just the regular trimmed mean and standard deviation after trimming 1000% of points
at tails for symmetric F. For example, if one wants to trim a = 10% points at tails, then
simply set 6 = <1>-1(0.95)/<I>-1(0.75) = 2.4387 for normal F and 5 = 6.3138 for Cauchy F.
A large 5 corresponds to a small trimmed proportion (a) and consequently is in favor of
the efﬁciency of the scaled deviation trimmed mean and standard deviation at light-tailed
F.

In order to keep clarity, we need to adopt some notations and write
8%) = ct(s(F) — MP». 5.3.03) = en(3w(F) - M(F)),

62': /I(ID($,F)I Sﬂ)wi(D($,F))dF($), 61ml: / wi(D(-751F ))dF (I) i=112-

33

 

with
s03) = [1005.101 s 5)$2W2(D(17,F))dF($)/52(F)

MF) =2T1(F)T2(F) — 71W

saw") = [ [521005. F)! s m + L(F)21(a= < L(F)) + U(F)2I<x > U(F)»

w2(D(z.F»dF(x)]/awz(m

AM) =2Tw1(F)Tw2(F) — Tam?-

T, Tw, S, and Sw are affine equivariant because both )4 and a are afﬁne equivariant, i.e.,
#(Fax+b) = ap(FX) + b, U(Fax+b) = |a|a(FX) for any scalars a and b, where FX is the
distribution of X. For X ~ F symmetric about 6(i.e. :t(X — 9)/n(n > 0) have the same
distribution F0), it is seen that T(F) = 6 and S(F) = 7) (ct = 1/S(F0)), i.e., T and 5'2 are
Fisher consistent. Without loss of generality, we can assume 6 = 0 and n = 1. Let Fn be
the usual empirical version of F based on a random sample. It is readily seen that T(Fn)
and 32(Fn) are also aﬂine equivariant. T(Fn) is unbiased for 0 if F is symmetric about 6
, and has an expectation and 52(Fn) (ct = 1/(s(F0) — /\(Fo))) is unbiased for n if F has a

, ,variance. All these properties also hold for Tu, and 5,2“.

3.3 Inﬂuence Function

We ﬁrst investigate the local robustness of the functional S (F) and Sw(F) through inﬂuence
functions. Here F is the assumed distribution. The actual distribution, however, may be
(slightly) different from F. A simple departure from F may be due to the point mass
contamination of F that results in the distribution F (22,635) = (1 — s)F + 569;, where 6,,- is
the point mass probability distribution at a ﬁxed point z E R. It is hoped that the effect of
the slight deviation from F on the underlying functional is small relative to s. The inﬂuence
function (IF) of a statistical functional M at a given point a: E R for a given F, deﬁned as
(see Hampel et al. (1996))

IF(2:; M(F)) = 1113+(M(F(e,5,)) - M(F))/e. (3.3.1)

34

 

exactly measures the relative effect (inﬂuence) of an inﬁnitesimal point mass contamination
on M. It is desirable that this relative inﬂuence I F(:r; M (F)) be bounded. This indeed
is the the case for S30 and Q (see, e.g. Rousseeuw and Croux (1993)), but not for the
standard deviation functional with (:1:2 — 1) / 2 as its inﬂuence function for r.v. X ~ N (0, 1).

Note that the integration interval in S2(F) (35(F)) are a functional of F. Hence an
inﬁnitesimal point-mass contamination affects this interval. Because of this, the derivation
of the inﬂuence function of the scaled-deviation trimmed/Winsorized scales becomes a little
bit challenging. The strategy to attack the problem is“divide then conquer”. One ﬁrst
works out the inﬂuence functions of L and U. Assume F’ = f exists at p and p :l: a with
f (a) and f (p+a)+ f (p-a) positive, where u and a stand for p(F) and U(F). As in chapter
2, we have preliminary results (2.3.2)-(2.3.6). Now assume that w,- (i=1, 2) is differentiable
and f exists at L(F) and U (F). Write L and U for L(F) and U (F) respectively and deﬁne

 

4.0:) 737.; [(U — T.)w.-<D(U. F))f(U)1F(r, U(F))

— (L-Tam-(Du.F))f<L)IF(x.L(F»] (3.3.2)
327(3) = ,1, (IF, [ [:0 — mwflwe. F))h<x.y)dF<y>] (3.3.3)
73,-(n) 741—13) [1(2: 6 [L, U])(:o — T,-)w,-(D(a:, F))] (3.3.4)

One can derive the inﬂuence function of the scaled deviation trimmed mean T(F) as follows.

The following result is in our location chapter.

Corollary 3.3.1. Assume that F’ = f exists at p, u :l: a, L(F) and U(F) with f(p) and
f (p + a) + f (p. — a) positive and is continuous in a small neighborhood of L(F) and U (F),

and that w,(-), (i = 1, 2.) are continuously diﬂerentiable. Then for a given 0 < )6 < oo
1F($; Ti(F)) = 317(13) + 527(17) + 534($)- (33-5)

Furthermore, if F is symmetric about the origin and w is a non-zero constant, one has

_ _ $107 ‘5 [-ﬁaaﬁal) ﬂ0f(ﬁ0)si8‘n($)
IF(:c,T(F))-— 2F(ﬁa)-1 +f(0)(2F(ﬁo)—1)° (3.3.6)

 

 

In order to express the inﬂuence function of S (F), we need to borrow the following

35

notations and write

.r)=(6i[v2 —s)w2( D,f(UF)) (U)IF(2:,U(F))
2
”(L2 sw) 20( (L F>>f(L>IF(z.L(F)>] (3.3.7)
2(a:)=61—2[/U (y -s>w§1(D(y. F))h(x.y)dF(y>] (3.3.8)
73(x)=-1—[I($ E (L, U]) (x2 - s)w2(D(:r, F))]. (3.3.9)

It is ready seen that the inﬂuence function of A(F) and Aw(F) are given by

1F($. A(F)) = 2(T1(F) - T2(F))1F($,T1(F)) - 2T1(Flu’(ﬁ=i’-’2(F));

117(3) Aw(F)) = 2(Tw1(F) _Tw2(F))1F($tTw1(F)) "2Tw1(F)IF($tTw2(F))-

The inﬂuence function of the scaled deviation trimmed variance S2(F) is given by the

following theorem.

Theorem 3.3.2. Assume that F' = f exists at u, pi a, L(F) and U(F) with ﬂy) and
f (u + o) + f (,u — a) positive and is continuous in a small neighborhood of L(F) and U (F),

and that m,(.), (t = 1, 2.) are continuously diﬁ’erentiable. Then for a given 0 < 5 < oo
IF(a:; 52(5)) = Ct [1175; S(F)) + IF(:r, A(F))] (3.3.10)
with
IF(:c; s(F)) = 71(21: )+ 72(x )+ 73(5) (3.3.11)

The proof of theorem 3.3.2 is given in section 5.2, chapter 5.

Under the condition of Theorem 3.3.2, IF (2:; S2(F )) clearly is bounded and consequently
52(F) is locally robust. Use the chain rule, one easily knows the inﬂuence function of the
scaled deviation trimmed standard deviation, i.e. IF(:c; S(F)) = IF(:c; 52(F))/(2,/§?TF)).

For symmetric F and w = c 79 0, the inﬂuence function simpliﬁes substantially.

Corollary 3.3.3. Let f be symmetric about the origin and w,- (i=1, 2) are nonzero con-

stants. Under the conditions of Theorem 3.3.2, we have

(x -s)1(=v€[—ﬂaﬁo]) ((ﬁ0)2-8)f(ﬂo)ﬂsign(l$|-0)
C‘l 2F(Bo)- 1 + 2f(e)(2F(5e)—1) l (3‘3'12)

 

IF(a= 82w F))=

36

 

IF

 

 

 

Figure 3.1. Inﬂuence functions of S for N(0, 1) with a constant weight and ﬂ = 3.

37

The proof of this corollary is omitted. A graph of this inﬂuence function I F(:r; S (F)) is

given in ﬁgure 3.1. Obviously, it is bounded.

To work out the inﬂuence function for 5,2,,(F) (or Sw(F)), we deﬁne

€w1z‘($)- 1 [Ii/KL s y s U) + LI(y < L) + UI(y > U)

5m

 

- a.)w§1’(D(y.F))h(x.y)dF(y) (3.3.13)

€w2i(1)=-1- l1F($.L)1(y < L) + 1F($rU)I(y > U )] Wt(D(y.F WP (9) (3.3-14)

6wi

tw3,(r)=6—L [rm 3 z s U) + LI(:r < L) + UI(:r > U) — Tm] w,(D(:c,F)). (3.3.15)
and

rare) =53; [021a s y s U) + L210 < L)
+ U21(y > U) — sw] wg1)(D(y, F))h(:r:,y)dF(y) (3.3.16)

rw2(:r) =33; /[IF(2:, L)LI(y < L) + IF(:r, U)UI(y > U)] w2(D(y,F))dF(y) (3.3.17)

rw3(:r) =32— [1521(L S :1: S U) + L21(:c < L) + U21(:r > U) — sw] W2(D(I,F)). (3.3.18)

w2
One then has the inﬂuence function of the scaled deviation Winsorized mean Tw(F) as

follows. The following result is in our location chapter.

Corollary 3.3.4. Assume that F' = f exists at p, uzl: a, L(F) and U (F) with f (u) and
f (u + a) + f (p - a) positive and is continuous in a small neighborhood of L(F) and U (F),
and that w,(-), (i = 1, 2.) are continuously differentiable with rw(1)(r) being bounded for

r 6 R. Then for a given 0 < 5 < 00,
1F (1‘; th(F )) = 13.51313) + 13.5241?) + €w3t($)- (3319)

Furthermore, if F is symmetric about the origin and w is a non-zero constant, one has

 

1F(r;T,,,(F)) = Sizgln—se) + TI(—[ia g r g [30) — some < —5a) + some > so).

(3.3.20)

One can have the inﬂuence function of the scaled deviation Winsorized variance SE,(F)

as follows.

38

Theorem 3.3.5. Assume that F’ = f exists at u, pd: a, L(F) and U(F) with f(u) and
f (p + a) + f (u — a) positive and is continuous in a small neighborhood of L(F) and U (F),
and that w,(-), (i = 1, 2.) are continuously differentiable with r2w(1)(r) being bounded for

rER. Thenforagiven0<ﬂ<oo
IF(:r; 33413)) = cw [IF(:r; sw(F)) — IF(2:, M(F))] (3.3.21)
with
IF(a:; sw(F)) = Tw1($) + 7102(1)) + rw3(:r) (3.3.22)

The proof is given in section 5.2, chapter 5.
Under the conditions of Theorem 3.3.5, I F(:r; 5,2,,(F)) is readily seen to be bounded and

Sw thus is locally robust. For symmetric F and constant w, the inﬂuence function simpliﬁes

greatly.

Corollary 3.3.6. Let f be symmetric about the origin and w a constant. Under the condi-

tions of Theorem 3. 3. 5, we have

52Usi9n(lxl - 0)
f(0)
+ ((35200: 3 —Ua) + I(x 2 W» — at] (3.323)

F(—ﬁa) + $21(-ﬁa S a: S 50)

 

mm; 3313)) =e...[

The boundedness of the inﬂuence function I F (:c; Sw(F )) is very clear and also shown in
ﬁgure 3.2.

Besides measuring local robustness measures, the inﬂuence functions also provides the
form of asymptotic variance of corresponding estimator. It is useful for establishing the

asymptotic theory for S (F) (Sw(F)) in the next section.

3.4 Asymptotic representation and limiting Distribu-
tion

It is challenging to establish the limiting distribution of the scaled deviation trimmed stan-

dard deviation and Winsorized standard deviation. Here we fulﬁll the task by combining

39

an empirical process theory argument with the inﬂuence function results obtained in the
last section. Once asymptotic representations of the estimators are established, the limiting

distributions easily follow.

Theorem 3.4.1. Let F’ = f exist at u and be continuous in small neighborhoods of u :l: a, p
L and U with f (u) and f (u — U) + f (14+ 0) positive. Let w“) be continuous on R. Then

for 1 S 5 < oo
1 " 1
2 _ 2 = _ ,, 2 _
3 (Fe) 8 (F) n gram (F)) + at «5)
where IF(:1:; 32(F)) is given by Theorem 3.3.2. Consequently
755203..) — 5203)) —» N(0, 62) (341)
provided 52 = E(IF(X1, sitar)»2 < +00.

The proof is given in section 5.2, chapter 5.

Suppose one want to estimate 7) in distribution F0(:r/r)) by using the scaled-deviation
trimmed standard deviation to estimate it. Since 82(F) = 172, invoking delta method, one
has

mar.) — n) —» N(0, 11762) (342)
So, if we set 1) = 1, then the asymptotic variance for S(Fn) is one quarter of 62.

The distribution F is usually assumed to be symmetric about a point 6. By afﬁne equiv-

ariance, we can let 0 = 0. In this case with a (non-zero) Constant weight 62 takes a much

simpler form and will be evaluated at a number of distributions in the next section.

Corollary 3.4.2. Let f be symmetric about the origin and w a constant. Under the condi-

tions of Theorem 3.4.1, we have

52 = Cal-:33 _ s)2dF(a:) + 2((30)2 - s)f(ﬂa))ﬁ(/'°($2 _ s)dF(a:)

 

 

f(0) -[30
0 ((Ua)2-s)f(La))U 2
- _,(xZ-s)dF(x))+( W, )]/(2F(ﬁU)-1)2

where C; = 1/s(F0).

The proof of this corollary is omitted.

For 3,3,, we can establish results similar to Theorem 3.4.1 and Corollary 3.4.2.

40

Theorem 3.4.3. Let F = f’ exist at u and be continuous in small neighborhoods of u i a,
L and U with f (u) and f (u — 0) + f (u + a) positive. Let w be continuously diﬂ’erentiable
with r2w(1)(r) being bounded for r E R. Then for 0 < 5 < oo

1
SEAFn) - “$2311“ (Xt;5i2v(F Fl) +OP(7—T;)

where I F(x; 5,2,,(F )) is given by Theorem 3.3. 5. Consequently

7555(5) — 33,05) —) N(0, 53,) (343)
provided 5?, = E(IF(X1,S§,(F)))2 < +oo.

The proof is given in section 5.2, chapter 5.
Similar to trimmed case, suppose one want to estimate 77 in distribution F0(x/n) by using

the scaled-deviation Winsorized standard deviation to estimate it. One has
1 -
ﬁ(5w(Fn) "' Tl) —’ N(0, @0232) (3-4-4)

So, if we set 77 = 1, then the asymptotic variance for Sw(Fn) is one quarter of 6,2”.
Let F be symmetric about a point 0. By afﬁne equivariance, one can let 0 = 0. In this
case with a (non-zero) constant weight 62 takes a relatively simple and explicit form and

will be evaluated at a number of distributions in the next section.

Corollary 3.4.4. Let f be symmetric about the origin and w a constant. Under the condi-

tions of Theorem 3.4.3, we have

20 0'
Um 2=cﬁi{(§_—-f(a, F( )5 ))2 +/$4I(-ﬁ0.<_x_<_ﬂ0)
+(La)4(1(:_ < -ﬁ0) + 1(e 2 sundae) + s3.
03 a 20 —- a '0 0
+454 (figﬂ ”2 +4ﬁ figﬁ )[Lﬂaxzdﬂaﬁ - [a xzdF(x)]

_ 2swﬁ2aF (-,60) (4F (—
f (0)

 

 

 

[kl-1) _ Ba 2 2 _ a
4sw[/O x dF(x)+(ﬂa) F( [3 )]}
where em = 1/Sw(F0)-

The proof of this corollary is omitted.

41

3.5 Comparison

In this section, we compare scaled-deviation trimmed and Winsorized standard deviation
with median absolute deviation (MAD), ordinary standard deviation (SD), and the scale
estimators S30 and Qn proposed by Rousseeuw and Croux. We will discuss robustness
through breakdown point, inﬂuence function and gross error sensitivity. We also will
consider eﬂiciency by asymptotic relative efﬁciency in section 3.5.3 and ﬁnite sample ef-
ﬁciency in section 3.5.4. For simplicity, let w, = 1, i = 1, 2. Suppose X ~ F(x) =
F0((xi— 0) /17) and F0 is the underlying model distribution, let ct = 1/(s(Fo) — A(F0)) and
0w =1/(5w(F0)— )‘w(F0))-

3.5.1 Breakdown Point

The ﬁnite sample breakdown point measures the global robustness of an estimator. Roughly
speaking, the breakdown point of a scale estimator is the minimum fraction of “bad” data
points that can render the estimator useless (0 or 00). Precisely, the ﬁnite sample breakdown

point of a scale estimator Sn at sample X n = {X1, . . . ,Xn} in R is deﬁned as
5:;(Sn, X”) = min{5;t (Sn, X71), 5,:(811, Xn)}

where

63(Sn,X") = min {-72 : sup Sn(X3,) = 00}
Tl X35;

and

€;(Sn,Xn) = min {El- 13:13.9"(X30 = 0}
m

where X3, denotes a contaminated data set resulting from replacing m original points of
X n with arbitrary m points. The quantities 6,"; and 5:, are called the explosion and the
implosion breakdown points. It is well known that the standard deviation has the lowest
breakdown point 1 / n.

Since u(Fn) and U(Fn) have the highest possible breakdown point, it is easy to see that
S (Fn)(S (Fm)) has the same breakdown point as U(Fn). So S (Fn)(S (F570), together with
S30, (2,, and M AD”, achieves the best possible breakdown point of any translation equiv-

alent scale estimators (see, e.g. Lopuhaii and Rousseuw (1991)).

42

Theorem 3.5.1. At any sample X n in which no two points coincide and for any 1 S 5 < 00,

we have
65(S(Fe).X")=[(n+1)/2]/n. and e;(S(Fa).X")=(n/2J/n

53(Sw(Fn)tan = [(n +1)/2]/n, and 55(Sw(Fn)tXn) = [n/2]/n

The breakdown point of the scale estimators S(Fn) and Sw(Fn) thus is given by
53(S(F).X") = [n/Ql/n. 63(Sw(F).X”) = [n/Ql/n,

which is the highest possible value for any afﬁne equivalent scale estimator. ( The proof of

this theorem is omitted.)

3.5.2 Inﬂuence Eduction and Gross Error Sensitivity

The breakdown point measures only the global robustness while the inﬂuence function can
capture the local robustness of an estimator. The two together can provide a more complete
picture of robustness. We now look at the inﬂuence functions of the trimmed and Winsorized
variances.

The boundedness of its inﬂuence function is the fundamental concern for a functional being
locally robust. The mean functional and standard deviation functional have an unbounded
inﬂuence function. The MAD and SRC(F), Q(F) are known to have bounded inﬂuence
functions; see, e.g., Serﬂing (1980) for MAD, Rousseeuw and Croux (1993) for SRC(F) and
Q(F)-

In the light of Theorems 3.3.2 and 3.3.5, S and Sw have bounded inﬂuence functions for
suitable w and ,6. Figure 3.3-3.6, which plots their inﬂuence functions at normal, Cauchy,
exponential and contaminated normal models with )6 = 4.5, conﬁrms this. For convenience,
we also set w,- = c 75 0 in S and Sw. Since a pure normal model is rare in practice, we thus
consider light-tail, heavy-tail, and skewed models. They become different (but all are still
bounded). In terms of the bounded inﬂuence function criterion, we conclude that all the
trimmed and Winsorized standard deviation are equally robust (locally).

Besides boundedness, one can also look at the magnitude of the supremum of

IIF(x;S(F))|, the so—called the gross error sensitivity (GES) of S at F (Hampel et al.

43

“mi

(1986))

GES(S(F)) = supzllﬂx; S(F))|, (3.5.1)
x6

which measures the worst case eﬂ'ect on S of an inﬁnitesimal point mass contamination.
Generally speaking, a smaller CBS is more desirable. For scaled-deviation trimmed standard
deviation S and Sw, it is readily seen that their GES depends on the values of ,6 and the
weight function w. As a possible way to make a comparison, we again set (w,- = c aé 0) in
the following discussion.

Table 3.1. Gross Error Sensitivity

 

 

 

 

 

 

 

 

 

 

 

 

SD Q SR0 5 (s = 4.5) s... (5 = 4.5) MAD

N(0, 1) +00 2.069 1.625 4.3503 4.1528 1.167
Cauchy 2.2214 1.8961 5.4185 2.5174 1.5708
Exponential +oo 2.3173 1.8447 5.2883 3.6954 1.8587
0.9N(0, 1) + 0.1N (4, 9) +oo 2.1007 1.6274 4.7827 4.6295 1.2457

 

 

'The inﬂuence functions of MAD, Q, SRO, S (5 < 00), are all bounded and has ﬁnite
Gross Error sensitivity. '

From those ﬁgure 3.3-3.6, and table 3.1 presented, it is easily seen that CBS of scaled-
"de’viation trimmed and Winsorized scale which usually attained at the cutting end points L
and U for most of ,6 values is slightly larger than the other four estimators.

At normal model, Sw has the smaller GES than Q for 0 < 3 S 3.13 and has the smaller
GES than 530 for 0 < s g 2.62. But 3 alway has larger GES than Q, SRO and MAD.

At Cauchy model, Sw has the smaller GES than Q for 0 < 5 S 3.66 and has the smaller
GES than 530 for 0 < )6 g 2.67. But 3 alway has larger GES than Q, 530 and MAD.

At exponential model, Sw has the smaller GES than Q for 0 < 5 S 3.12 and has the
smaller GES than SRO for 1.30 S 5 S 1.57. But S alway has larger GES than Q, SRC and
MAD.

Scaled-deviaton Winsorized standard deviation is overwhelmingly more robust than
trimmed standard deviation in terms of CBS as shown in ﬁgure 3.10—4.2.

Since different 5 can give us difference choices, so somehow, trimmed and Winsorized

standard deviation are more ﬂexible than the other three.

44

The picture of CBS versus the steps 5 of different distributions are presented by ﬁgure ?? .
For normal distribution, GES increases dramatically when the 5 increases. But for Cauchy
distribution, GES increases slowly; At the same time, since the inﬂuence function of the
classical standard deviation for normal distribution is given by IF (x, S D(<I>)) = (.7:2 - 1) / 2.
When ,6 takes large values, IF(x, S(<I>)), IF(x, Sw(<I>)) and IF(x, SD(<I>)) will be very close.
It implies the high efficiency of S (SW) at normal model when 5 is large.

3.5.3 Large sample relative efﬁciency

Now we evaluate the performance of the trimmed (and Winsorized) standard deviations
in terms of their efﬁciency behavior (relative to the sample SD). First we examine the
asymptotic relative eﬂiciency (ARE). Table 3.2 lists the ARE results of S and Sw at a
number of light- and heavy-tailed symmetric distributions with different 5 values. Here we
again set w = c > 0. The table reveals that (i) at normal model with large 5 both S and
Sw can be highly efﬁcient relative to the SD; (ii) their efﬁciency increases as the tail of the
distribution becomes heavier and exceeds 100% at heavy tailed distributions for large 5;
and (iii) Sw is more efﬁcient than S for small 5 or normal distribution.

For Cauchy distribution, when 0.954 S [3 S 1.015, S is more emcient than S").

The picture of ARE versus the steps 5 of different distributions are presented by ﬁgure 3.7—
3.9. For normal distribution, ARE increases gradually when the 3 increases. But for Cauchy
distribution, ARE decreases with the increasing of 5. When 5 goes to inﬁnity, ARE tends to
zero. For both CBS and ARE, S is unstable at the neighborhood of 5 = 1 mainly because the
denominator of S (Fn) might be 0. However, Sw is pretty stable when 5 changes. Table 3.3
gives the asymptotic relative eﬂiciencies (ARE) of various high breakdown scale estimators
with respect to the sample standard deviation for various distributions. Examining Table
3.3 reveals that in terms of efﬁciency: (i) Overall Sw performs better than S. (ii) S and Sw
perform better than other robust estimators S RC, Q and MAD at light tailed distribution
while have satisfactory efﬁciency for heavy tailed distributions.

The behaviors of ARE of scaled-deviation trimmed standard deviation for different values

of ,6 to standard deviation at normal, to theoretical lower bound (inverse of Fisher informa-

45

Table 3.2. ARES of S and Sw relative to the standard deviation

 

3
N (0,1) S
Sw
LG(0, 1) S
Sw
DE(0, 1) S
5w
E(0, 1) S
3w
t5 S
Sw
Cauchy*(O, 1) S
5w

1

0.3068
0.3717

0.4708
0.5535

0.5647
0.6191

0.7315
0.9163

0.7315
0.9163
0.8768

0.8579

2

0.2736
0.5383

0.4132
0.7368

0.4482
0.7413

0.7610
1.1138

0.7610
1.1138
0.7053

0.8900

3

0.4509
0.8261

0.5818
0.9892

0.5412
0.9318

0.8303
1 .2903

0.8303
1 .2903
0.7422

0.9080

4

0.7438
0.9713

0.7822
1.1069

0.6605
1.0915

0.9156
1.4456

0.9156
1.4456
0.7542

0.8969

5

0.9403
0.9973

0.9363
1.1045

0.7879
1.1758

1.0067
1.5420

1.0067
1.5420
0.7485

0.8716

6

0.9921
0.9998

1.0090
1.0691

0.9032
1 . 1894

1.0941
1.5653

1.0941
1.5653
0.7334

0.8403

7

0.9994
1.0000

1.0263
1.0390

0.9883
1.1638

1.1679
1.5291

1.1679
1.5291
0.7134

0.8071

 

* compared with the inverse of ﬁsher information 2.

Table 3.3. ARE’s with respect to SD

 

Q
N(0, 1) 0.8227
LG(O, 1) 1.0210
DE(0, 1) 1.5952
E(0, 1) 1.4897
t5 1.4066
Cauchy* 0.9784

SR0 5 (s = 4.5) s (s = 7) 3,, (p = 4.5) 5,, (o = 7) MAD

0.5823
0.8918
0.9206
1.0591
2.0026
0.9497

0.8658
0.8690
0.7244
0.9609
1.9388
0.7530

0.9994
1.0263
0.9883
1.1679
2.0732
0.7134

0.9906
1.1144
1.1439
1.5028
2.4113
0.8854

1.0000
1.0390
1.1638
1.5291
2.0145
0.8071

0.3675
0.5431
0.6006
0.9367
1.3332
0.8106

 

* compared with the inverse of ﬁsher information 2.

46

Table 3.4. 5 Values for S having better ARE than Other Scales

 

s a Q s : 330 s : MAD
N(0, 1) [4.31, oo) (3.47, oo) (0, oo)
LG(0, 1) [6.38, 8.19) [4.66, 00) (0, oo)

DE(0, 1) NA [6.18, oo) [0,oo)

E(0, 1) (5.59, 18.17) NA [4.24, oo)
t5 [2.99, 18.62) [4.81,7.89) [2.82,22.11)

Cauchy* NA NA [0.82, 1.06]

 

* compared with the inverse of ﬁsher information 2.

Table 3.5. )6 Values for Sw having better ARE than Other Scales

 

swto 's,,,>_-sRC SthAD swts
N (0, 1) [2.99, 00) [2.16, 00) (0, oo) (0, oo)
LG(0, 1) (3.17, 7.93) [258,00) (0, oo) (0, oo)
DE(0, 1) NA [2.94,oo) (0, oo) (0, oo) \ [11,18]
E(O, 1) (1.68,15.84) (4.38, 7.64) (0, oo) (0,1097)
t5 (1.26, 15.01) (241,700) (0.17.78) (0, 6.46)
Cauchy* NA NA (0, 6.89) (0, oo) \ [0.96, 1.01]

 

* compared with the inverse of ﬁsher information 2.

47

tion which is equal to 2 in this case) at Cauchy, to standard deviation at exponential are
shown in ﬁgure 3.7—3.9. For normal model, it is natural that ARES"), SD and ARES, SD
(5 > 1) increase with 5 and less than 1. For Cauchy model, note that ARES and ARES,”
roughly decrease to 0 with 5, while its standard deviation does not even exist. For exponen-
tial distribution, ARE 5,30 is increasing while GES (S) decreases when 5 increases while
ARESw,SD is J-shaped.

In practice, the sample size is often small and the distributions are not pure models.
Scaled-deviation trimmed and Winsorized scales perform quite well in the asymptotic sense
when F’s are from a pure model. This raises the concern as to whether these results are
valid at ﬁnite sample practice and for F’s with slight departure from a perfect model. We

answer this question in the next section via ﬁnite sample simulation studies.

3.5.4 Finite sample relative efﬁciency

To check whether the estimator S and Sw are approximate unbiased for ﬁnite samples,
we performed a modest simulation study. In Table 3.6, we calculated the average scale
estimate on 10,000 batches of normal, Cauchy, and exponential observations. We see that
Sn (Sum) behaves better than other scales at normal model and we carried out a simulation
to verify the eﬂiciency gain at ﬁnite samples. For each n in the table 3.6 , we computed the
variance varm(Sn) of the scale estimator Sn over m = 10, 000 samples. Table 3.6 lists the

standardized variances

nVarm(Sn)/(avem(Sn))2 (3.5.2)

where avem(Sn) is the average estimated value which is listed in the left half of the tables.
The results show that the asymptotic variance provides a good approximation for (not too
small) ﬁnite saniples, and that 8,; and Sum are more eﬂicient than S30, Qn and M AD” at
normal model even for small 71.

It is clearly reviewed that Sn and Snw are considerably more eﬂicient than S30, Q" and

M ADn even for small n.

48

Table 3.6. Standard variance of MAD”, S30, Qn, Sn, Sum and SDn at normal model

 

10

20

40

60

80

100
200

Average value

 

Qn

3,150

Sn

Snw

M AD"

S D"

 

0.899
0.957
0.978
0.986
0.990
0.993
0.997

1.286
1.166
1.087
1.059
1.046
1.037
1.020

0.913
0.961
0.979
0.986
0.990
0.993
0.997

0.905
0.959
0.979
0.984
0.990
0.991
0.996

0.905
0.959
0.980
0.987
0.992
0.993
0.997

0.969
0.987
0.992
0.994
0.997
0.998
0.999

 

10

20

40

60

80
100
200

Standard variance

 

Qn

83V

Sn

Snw

M ADn

S Dn

 

0.849
0.749
0.700
0.658
0.649
0.647
0.627

0.920
0.877
0.850
0.849
0.845
0.855
0.859

0.617
0.535
0.521
0.497
0.509
0.501
0.494

0.533
0.507
0.514
0.494
0.506
0.500
0.493

1.241
1.293
1.320
1.337
1.344
1.352
1.380

0.518
0.504
0.514
0.494
0.506
0.500
0.493

 

49

- m» 'n— n...

 

We also conducted a similar study on Cauchy, exponential distribution. The results at
these distributions conﬁrm the asymptotic results presented in last section so they are not
listed here. It is shown that Sn and Snw can achieve satisfactory eﬂiciency compared with
other robust estimators.

We now conduct Monte Carlo studies to investigate the efﬁciency behavior of S and Sw
at ﬁnite samples for normal and contaminated normal models. Here 7) = 1 is regarded
as the target parameter to be estimated. For an estimator S the empirical mean squared
error (EMSE) is: EMSE = % 23:11 lSj - nlz, where m is the number of samples generated
and T]- is the estimate based on the jth sample. The relative efﬁciency (RE) of S is then
obtained by dividing the EMSE of the sample standard deviation by that of S. We generated
m = 50, 000 samples from (1 —s)N(0, 1)+5N(1,0.1) and (1-€)N(0, 1)+€5{0} with e = 0, .1
and .2 for different sample sizes 72. Some results are listed in Table 3.7 and Table 3.8 with
5 = 7.

Table 3.7. RES of various robust scales ()6 = 7 for scaled-deviation trimmed/Winsorized
scale) at (1 — e)N(0, l) + eN(1,0.1)

 

Q SRO s s... MAD Q .930 s S... MAD
n e=10% e=20%
20 0.305 0.634 0.972 0.993 0.414 0.322 0.615 0.958 0.985 0.417
40 0.444 0.606 0.986 0.992 0.367 0.527 0.586 0.973 0.980 0.369
60 0.552 0.570 0.985 0.988 0.323 0.650 0.564 0.977 0.982 0.318
80 0.617 0.559 0.992 0.993 0.290 0.732 0.533 0.981 0.982 0.281
100 0.666 0.524 0.991 0.991 0.256 0.791 0.527 0.978 0.978 0.261

 

Table 3.8. RES of various robust scales (5 = 7 for scaled-deviation trimmed/Winsorized

scale) at (1 - €)N(0, 1) + 56m}

 

Q SEC 5 S... MAD Q SRC s s... MAD
n e=10% e=20%
20 0.502 0.494 0.825 0.908 0.361 0.604 0.410 0.758 0.875 0.328
40 0.721 0.413 0.872 0.924 0.319 0.711 0.309 0.823 0.903 0.278
60 0.779 0.342 0.897 0.930 0.295 0.633 0.248 0.856 0.917 0.252
80 0.746 0.296 0.908 0.935 0.283 0.562 0.216 0.873 0.927 0.241
100 0.699 0.263 0.916 0.940 0.270 0.502 0.195 0.888 0.934 0.229

 

For small 5, simulation results Show that scaled-deviation Winsorized scale is more efﬁcient

50

than scaled-deviation trimmed scale at the situation “bad” points from the areas around
the center. When 5 gets large, the difference becomes small and all most achieve the same
efﬁciency when [i > 7.

Our simulation results for e = 0.0 (not listed in the table) conﬁrms the validity of the
asymptotic ones in Table 3.2 for N (0, 1). On the other hand, when there is just a 10% or 20%
contamination in the distribution, all other estimators Q, SR0, MAD become less efficient
than S and Sw which are the most efﬁcient, reﬂecting the robustness of these estimators.
It is remarkable that S5 is overwhelmingly more efﬁcient than all other estimators if we
suitably choose [3.

The other three types of robust estimators that are built on (recursive) medians only use
50% of information around the center, so it is very robust and efﬁcient when contaminating
points are from either end. But its strength is also its weakness, when contaminating
points are close to the center, it will use the “bad” points and lose its eﬁiciency. The two
types of estimators S, Sw introduced in this chapter achieve satisfactory efficiency when
contaminating points are from either end, although it’s a little bit less efﬁcient than other
three(Q, SR0, MAD). But when outliers/contaminating points are from the areas near

the center, S and Sw are far more efﬁcient than them.

3.6 Concluding remarks

Unlike the standard deviation, all the trimmed and Winsorized standard deviation and robust
scales discussed in the chapter have bounded inﬂuence functions for suitable distributions
and weight functions and hence are locally robust. In terms of the global robustness, the
scaled deviation trimmed and Winsorized standard deviations S and Sm are exceptional in
the sense that they can enjoy the best possible breakdown point robustness for any )6 2 1
with other robust scales.

Relative to the standard deviation, the scaled deviation trimmed S and Winsorized stan-
dard deviations Sw are highly efficient for large B’s at light-tailed symmetric distributions
and much more efﬁcient at heavy-tailed ones for small 3. Three popular scale estimators Q,

SRC, MAD which are built on (recursive) medians are highly efﬁcient when “bad” points

51

are from either end. However they lose the capability to tell the truth when contaminating
points are presented around the center. At this time, S and Sw can make a diﬂ'erence.

S and Sw are more ﬂexible than other robust scale estimators since 5 can take different
values, which also raises a question on the choice of 5 value. In light of our simulation
studies, a ,6 value between 4 to 7 would be recommended so that S and Sw can be very
efﬁcient at both light- and heavy-tailed distributions. Instead of a ﬁxed value one might
also adopt an adaptive data-driven approach to determine an appropriate 5 value. For a
given data set, one determines a value for 5 based on the heaviness of the tail. Generally
speaking, a large value of 5 is selected for a light-tailed data set while a smaller value for a

heavy-tailed one.

52

 

IF

 

 

 

Figure 3.2. Inﬂuence functions of Sw for N (0, 1) with a constant weight and ﬂ = 3.

53

 

 

 

 

 

 

 

 

 

 

-2
1

 

 

 

Figure 3.3. Inﬂuence functions of various scales for normal distribution. (5 = 4.5 for S and Sw)

54

 

 

 

IF

 

 

 

 

 

 

 

 

o q alto o
'5, ,5
MAD
M """ SR0
N _ ":1“ ....... 0
' S
- Sw
l j l l I

Figure 3.4. Inﬂuence functions of various scales for Cauchy distribution. (5 = 4.5 for S and Sw)

55

.441 ”‘5‘

 

 

 

 

 

IF

 

 

 

 

 

 

 

 

 

Figure 3.5. Inﬂuence functions of various scales for exponential distribution. (3 = 4.5 for S and
Sw)

56

 

 

 

 

 

IF

 

 

 

 

 

 

 

 

l 1 l l l
A -2 0 2 4
x

Figure 3.6. Inﬂuence functions of various scales for 0.9N(0, 1) + 0.1N(1,0.1). (ﬂ = 4.5 for S and
8.”)

57

 

 

 

 

 

 

 

 

 

 

0. _
V" , -------
we -
O
‘0. -
O
u, s
a _
4 SW
‘7. .1
O
I 4’
I ‘ ’r
I v’l 0’
I 'l ‘ If
I 'l ‘\ 'a"
N I I -
. _ l I
O I .‘
I I
II
II
I)
If
V
O I
O' —(
I T I I I I I I
0 1 2 3 4 5 6 7

beta

Figure 3.7. ARE of trimmed and Winsorized standard deviations for normal distribution

58

 

 

 

 

 

 

 

 

0.
‘_...
I
Q_ /R/—\
O I I
, 1
i """"""""""""""""
"t K‘ dddddddddddddddd I‘.‘
' \.b-'
l
I
(O._ '1‘
° .’
[U I
05 7'
<1: 7
I
V- I
o 1'
I
I
1
I
I
7
N_ I
o I
: s
I — SW
I
Q-
C
I | I | I I I I
0 I 2 3 4 5 6 7

beta

Figure 3.8. ARE of trimmed and Winsorized standard deviations for Cauchy distribution

59

 

1.5

1.0

ARE

 

 

 

 

 

 

 

0.0

 

beta

Figure 3.9. ARE of trirmned and Winsorized stande deviations for exponential distribution

60

 

 

 

 

 

 

 

 

 

beta

Figure 3.10. GES of trimmed and Winsorized standard deviations for normal distribution

61

 

GES
4
l

 

 

 

 

 

 

 

 

beta

Figure 3.11. GES of trimmed and Winsorized standard deviations for Cauchy distribution

62

 

 

 

 

 

 

 

 

 

 

beta

Figure 3.12. GES of trimmed and Winsorized standard deviations for exponential distribution

63

CHAPTER 4

The Multiple Least Trimmed Squares

Estimator

4.1 Introduction
Consider the multiple regression model

y,-=th,-+€,-, i=1,...,n

with :13,- = (2:31, . . . , $5P)t 6 RP and yi E R . The matrix B 6 RP contains the regression
coefﬁcients. The error terms 51, . . . ,5" are i.i.d. with zero center and a positive scale 0.
Furthermore, we assume that the errors are independent of the carriers. Note that this
model generalizes the location model (x,- = 1). Denote the entire sample {Zn = (xi; y,);i =
1,... ,n} and write X = (x1,...,xn)t for the design matrix and Y = (y1,.. . ,yn)‘ for the
vector of responses. The classical estimator for B is the least-squares (LS) estimator B L S
which is given by

8L5 = (X‘X)'1X‘Y (4.1.1)

while 02 is unbiasedly estimated by

1
"-P

(Y -— x515)? (4.1.2)

 

.2 __
aLS“

Since the least squares estimator is extremely sensitive to outliers, we aim to construct a

robust alternative. An overview of strategies to robustify the multiple regression method

64

is given by Maronna and Yohai (1997) in the context of simultaneous equations models.
Koenker and Portnoy (1990) apply a regression M-estimator to each coordinate of the re-
sponses and Bai et al. (1990) minimize the sum of the euclidean norm of the residuals.
However, these two methods are not afﬁne equivariant. Our approach will be different from
the latter, since it will be afﬁne equivariant. Agullo, J ., and Croux, C., al.(2002) discussed
some pr0perties of multivariate trimmed least squares estimator. From computational as-
pect, Peter J. Rousseeuw, P. J. and Driessen, K. V. introduced an fast algorithm which
is based on “C—step”. “C—step” will gurantee the reduction of objective function during
iterations.

However, as in location and scale settings, the general trimmed regression estimator suffers
from very low efﬁciency while keeping the highest breakdown point. As shown in this
chapter, the efﬁciency for ordinary least squares estimator is only 7.1% which is far less
than satisfactory. Meanwhile, small proportion of trimming will cause low breakdown point
issue and, hence, low robustness. In this Chapter, we are trying to work out this dilemma
and ﬁnd an estimator as in location and scale setting that has the highest breakdown point
and can have high level efficiency. It turned out that this kind of estimator is hard to define
and hard to discuss, don’t not even mention to come up with an algorithm. The estimator
deﬁned in this chapter is expected to have highest breakdown point (strict proof is not
available due to technical reasons) and truly efﬁcient.

However we found that starting from any initial subset, keeping those points with residuals
close to zero and after several steps of iterating, we found that the subsequent subset is
relatively “stable”. From ﬁgure 4.1 we can see that the mean squre of residuals is stabilized
after a few number of iterations. Then we deﬁne an estimator with the minimum mean
square error in the collection of ”stable” sets, we obtain the estimator deﬁned in this chapter.
This definition is different from the general least squares estimator deﬁned by Agullo, J .,
and Croux, C., al.(2002).

In Section 4.2 we give a formal deﬁnition of the multiple least trimmed squares (LTS)
estimator. In Section 4.3, we derive the inﬂuence function and study the ARE. A time

efficient algorithm to compute the LTS is presented in Section 4.4.

65

 

1.5

MSR
1 0
L

 

 

 

 

 

0.5

 

 

 

 

0.0

 

number of iteration

Figure 4.1. MSR vs number of iterations with 100 arbitrary initial subsets

66

4.2 Deﬁnition and properties

Our approach consists of ﬁnding the subset H of observations having the property that
the Mean Square of its residuals from a LS—ﬁt 8w), solely based on this subset is minimal,
the subset H of observations satisﬁes H: {i : (1,-(531,5) < €d(h)(33ub)}, let h=
[(n+p+ 1)/2] or [( n +p+ 2) ),/2]d (3 sub): (Y — X38532 . When H is not unique, we
take the one that has Minimum Mean Square of its residuals. Denote the collection of H
by H.

The resulting estimator will then be simply the LS-estimator computed from the optimal
subset which deﬁned by its least squares estimator. When X = (1, . . . ,1)‘ 6 RP, it reduces
to a multiple regression model with only an intercept that is a location model.

When I.’ = 1, our approach is equivalent to the general least trimmed squares estimator
(see, e.g. Agullo, etc) , which is a generalization of the LTS estimator (Rousseeuw 1984)
for robust regression. Consider a dataset Zn = {(x,; y,); i = 1, . . . , n} E RP'I'1 and
for any 3 6 RP denote 13(3) = y,- — 8‘23,- the corresponding residuals.

Deﬁnition 4.2.1. With the notation above the multiple least trimmed squares estimator

(LTS) is deﬁned as
8111’sz )= .8501) where H e argminnenc‘riew) (4.2.1)

with (TAZLS(H) = 2161;?) ( )/(#(H )—p) for any H E 'H. The variance of the errors
can then be estimated by
2 ..
ULTS(Zn )_ — czasub(H) (4'22)
where Cg is a consistency factor.
Note that if the minimization problem has more than one solution, in which case we
. . .. 2
look at argmln H Sigma L S(H ) as a set, we arbitrarily select one of these solutions to

determine the LTS estimator. In Section 5 a consistency factor Cg will be proposed to

attain Fisher consistency at the speciﬁed model. Note that for Z = +00, we ﬁnd back the

67

classical least squares estimator. Throughout the text we will suppose that the data set
Z, = {(1,}; yi);z' = 1, . . . ,n} E R1”1 is in general position in the sense that no h
points of Z; are lying on the same hyperplane of RP“. Formally, this means that for all

be RP, 7 e R, it holds that

#{($jiyj)l:8t$j + 7% = 0} < h (4.2.3)

unless if 3 and 7 are both zero vectors.

4.3 The inﬂuence function and asymptotic variances

The functional form of the LTS estimator can be deﬁned as follows. Let K be an arbitrary
(p + 1) dimensional continuously distribution which represents the joint distribution of the
carriers and response variables. 3

Let us denote dike, y) = (y — B(A)‘x)2, then it follows that A =
{(12.11) 6 RP“ Id?4(x, y) S €q(A)} where (M) = (Di)'1(0-5) with Di“) =
PK(d?4($a y) S t).

Deﬁne
DK(€) = {AIA = {($,y)ld31($. y) S 34W} (4.3-1)
To deﬁne the LTS estimator at the distribution K we require that

PKw‘x = 0) < 1/2 for all e 6 RP \{0} (4.3.2)

For each A E D K (t’), the least squares solution over the set A is then given by

BA(K) = (AxxtdK(x,y))—l[Axy‘dK(x,y) (4.3.3)

andA

y - BA(K)t$)2dK($; y)

_ f (
200- A PK(A)

0A (4.3.4)

68

 

Furthermore, a set fl 6 DK(€) is called an LTS solution if 0%(K) S 0300 for any other

A E D K(€). The LTS functionals at the distribution K are then deﬁned as
BLTS(K) = BA(K) and UETS(K) = ce 0%(K) (4.3.5)

The constant Ce can be chosen such that consistency will be obtained at the speciﬁed model.

If the distribution K is not continuous, then the deﬁnition of DK(€) can be modiﬁed as in
Croux and Haesbroeck (1999) to ensure that the set DK (3) is non-empty. Now consider
the regression model

y=l3tx+e

where :1: = (x1, . . . , mp)t is the p-dimensional vector of explanatory variables, and e
is the error term. We suppose that e is independent of a: and has a distribution Fa with

density

f0(u) = g(u2/U2)/U

where o > 0. The function g is assumed to have a strictly negative derivative 9' such that
F0 is a unimodal elliptically symmetric distribution around the origin. The distribution of
z = (r,y) is denoted by H. A regularity condition (to avoid degenerate situations) on the

model distribution H is that

Pgwtas + yty = O) < 1/2 (4.3.6)

for all 6 6 RP and 'y E R not both equal to zero at the same time. This general posi-
tion condition says that the maximal amount of probability mass of H lying on the same

hyperplane must be lower than 1 / 2.

69

Theorem 4.3.1. Denote
F(tq) - 1

Where F is the symmetric error distribution and q = K "1(O.5) with K (t) = Pp0(ete _<_ t).

 

Cg:

Then the functionals BLTS and 0%TS are Fisher-consistent estimators for the parameters

8 and 02 at the model distribution K:

BLTS(K) = B and 0,235 = 02

Proof. First of all, due to equivariance, we may assume that B = O and 02 = 1, so y =
6 ~ F. It now sufﬁces to show that BLT5(K) = 0. Then we will have that 02(K) is that
LTS functional at the distribution of y ~— BLTS(K)‘:r = y = e, the consistent coefﬁcient
c can be easily derived. Since BLTS is deﬁned solely based on the set C = {(x, y) E
RP+1|(y - BETSzV _<_ liq}. Therefore

/C :r(y — BETSx)dK(a:, y) = 0 (4.3.7)

Now suppose that BLTS 79 0. From (4.3.7) it follows that

[G 82mm — Birszwnwdaa) = 0

Which can be rewritten as
[R BtLTszl(:1:)dG(y) = 0 (4.3.8)
with

d+ﬂ§
Ia) = [WE (y —- Birs$)dF(y),

where C1; = {y E R|(:c,y) E C}, Fix :1: and set d = 327151;. Since y is symmetric, we have

that

d+ﬂ§ t
Ice) = [My] (y - ramming)

m 2 2 -
= [0 t(g(<d+t) >—g<(d-—t> >>dt

If d > O we have (d + t)2 > (d - t)2 (for t > 0) and since 9 is strictly decreasing this implies

1(2) < 0. Similarly, we can show that d < 0 implies I(:r) > 0. Also, 8317.52: = 0 implies

70

I(:r) = 0. However, due to condition (4.3.6), the latter event occurs with probability less
than 0.5. Therefore, we obtain [0 :1:(y — 8213sz (1:, y) < O which contradicts (4.3.8), so

we conclude that BLTS = 0. D

The inﬂuence function of a functional T at the distribution K measures the effect on T
of adding a small mass at z = (:r, y). If we denote the point mass at 2 by A; and consider
the contaminated distribution KW = (1 — e)K + eAz then the inﬂuence function is given
by
T(K5,z) - T(K) 3

E = a—E'T(K5rz)|€=0'

 

IF(2;T,K)=5113%
(See Hampel et al. 1986.) It can easily be seen that the LTS is equivariant for afﬁne
transformations of the regressors and responses and for regression transformations which
add a linear function of the explanatory variables to the responses. Therefore, it sufﬁces
to derive the influence function at a model distribution K0 for which 8 = 0 and the error
distribution F = F0 with density f0(y) = g(y2). The following theorem gives the inﬂuence

function of the LTS regression functional at K0.

Theorem 4.3.2. With the notations from above, we have that

 

y 1(3/2 S 6(1)

1F(z;BLTs, K0) = dextrl (2F(\/€—Q) - 1) - mam/a)

(4.3.9)
Proof. Consider the contaminated distribution K5 = (1 -— E)Ko + EAZO with 20 =

(230,310) and denote BE := BLTS(K€) and 0'? := UiTSUQ). Then (4.3.3) results in

A -1
Be = ( / xx‘dKe(x; 31)) / xy’dKer; y)
A, A.
where fie 6 DK. (0) is a LTS solution. Differentiating w.r.t. e and evaluating at 0 yields

_1 a
IF(zo; BLTS, K0) =( [A xz‘dKo(a:; y)) 55' (A. xy‘dKea; y)|€=0

+ g; [( [is xz‘dK5(x;y))_l] [6:0 [4 xy‘dKo(x,y)

F isher-consistency yields that 14 = {(3:,y) : Rp+q; y2 S tq} where q = (D%.)’1(1/2) with
D%(t) = Pp(y2 _<_ t). Hence fl = R” x {y E R; y2 S Zq} =: Rp xA. This implies

71

Al'ytdKM-Tw) = [RP :rdG(:r:) [A de(y) = 0

by symmetry of F and

/,xx‘dKo(z)= / 2:3:th(32) / dF(y)= (2F(\/€q)—1)Eg[zxt]
A RP A
Therefore, we obtain

EG[XXt] a

 

 

 

IF(zo;BLTs.Ko> = (mm) _ n5; 4. xy‘dKe(x;y)|£=0 (4.3.10)
_ EGlXth 3 . *
- (2F(\/Z§) _ 1)gﬂl - 6) [As zy‘dKo($,y) + error/01% 6 A5) 5:0
(4.3.11)
_ Eg[XXt] 0 .
— ——-——(2 M «an (5; [A xy‘de. y)+$oy01(y2 s eq)|€=0 (4.3.12)

Let us denote dg(:c;y) = (y — 821:)2, then it follows that 145 = {(x,y) E Rp+1;dg(a:;y) g
€q(e)} where q(e) = (D%€)‘l(0.5) with 0&5 (t) = PKg (dg(a:;y) S t). For :1: ﬁxed we deﬁne
the set 85,3 := {y e Rldgcc; y) S £q(e)}. Then it follows that

~//i5 my‘dKoht; y) = [RP/56,17 zde(y)dG(I)
= [m [ems/gabdyxdcv) (4.3.13)

Using the transformation v = y — 822:, we obtain that

1(6) == / y9(y2)dy

!

_. t t 2
— [v 2$£q(€) (v + Ber)g((v + 85x) )dv

«41(6) 2
= f (v + ng)g((v + 822:) )dv
-lQ(€)

72

Note that

1(5) — 1(0) _1[( 39(6)

ﬂ-
6 _E (v + 82x)g((v + 821:)2)dv — f :(v + 823:)g((v + 82.x)2)dv)

4% -ﬂ'
m t t 2 ‘ﬂa t t 2
.+ (Ls/23(1) + 853:)g((v + 851:) )dv — [475(1) + B :L')g((v + B x) )dv)]
1 41(6) 2 - 134(6) 2
=;( L/Ia (v + egz)g((v + 35.2) )dv — f—m (v + Bgz)g((v + 32x) )dv)

ﬂ? 1
+ (1% E ((1) + 82$)g((v + 32x)2)dv — (U + th)g((v + Bt$)2))dv
=(61 + Bée>g((e. + 3:4)2) WW:- ﬂ'q' _ (_02
+ 82mm + 13,523?) (-\/e(1(€)€- (as?)

J?—
+ (1;; ”£1 ((1) + 82$)g((v + 8:1)2)dv —- (U + th)g((v + th)2))dv

m t 2 2 I 2 t
=/ﬂ_ (IF(zO;BLTs,Ko) 129(1) )+21} 9 (v )IF(ZO§BLTSaKO) $)dv+020(1)
- «1

So we have that
6
sgl(e)le=o = ((2F(\/€q) — 1) + 262)1F(zosBLTS’ Koltl‘

where e = 155;, g'(v2)v'~’dv = In“26 vdf(v) = JIM/76) - tam/26) — 1)
Cl

Note that the inﬂuence function is bounded in y but unbounded in 1:. Closer inspection
of (6.1) shows, however, that only good leverage points, which have outlying x but satisfy
the regression model, can have a high effect on the LTS estimator. Bad leverage points will

give a zero inﬂuence.
Remark: The inﬂuence function of the LTS location estimator T at a symmetric distri-

bution F0 can be obtained easily, it is given by

 

1F (31; T; Fo) = 1(y2 _<_ 6(1)

y
(WM/5) - 1) - Zth/Z’E)

Therefore, it follows that the inﬂuence function of B LTS can be rewritten as

IF(z; BLTS, K0) = Ea[XXt]'1$IF(y; T, F0) : (4.3.14)

73

The asymptotic variance-coVariance matrix of BLTS can now be computed by means of
ASV(BLTS, K0) = EK[IF(z; BLTS: K0)®IF(z; BLTS: K0)t] (see e.g. Hampel et al. 1986).
Here A ® B denotes the Kronecker product of a (p x 1) matrix A with a (l x p) matrix B,
which results in a (p x p) matrix with (i, j )-th block equals aibj, where aj are the elements
of the matrix A and bj are the elements of the matrix B. Let us denote 23; := Eg[X X t],

then expression (4.3.14) implies that
ASV(BLTS,K0) = ASV(T,F0))2;1) (4.3.15)

From (4.3.15) it follows that for every 1 _<_ i _<_ p the asymptotic variance of (81,715),-
equals

ASV((BLTS)2': K0) = EKl1F2(Z; (Bush, Koll = (251)4iASV(T, Fol)

For i 79 i' we obtain the asymptotic covariances

ASV((BLTS)ia (BLTSh'h K0) = EK[1F(Z; (BLTsli, K0)1F(Z; (BLTS)¢": Koll

= (2,:1),,.,ASV(T, F0)

and all other asymptotic covariances (for j’ aé 3') equal 0. Due to affine equivariance, we
may consider w.l.o.g. the case where o = 1. Then all asymptotic covariances are zero,
while ASV((BLTS),-, K0) = ASV(T, F0) for all 1 g i S p. The limit case Z = 00 yields the
asymptotic variance of the least squares estimator ASV((B L 3),, K0) = ASV (M; F0) where
M is the functional form of the sample mean. Therefore, we can compute the asymptotic
relative efficiency of the LTS estimator at the model distribution K0 with respect to the

least squares estimator as

ASV((BLS):'; K0) _ ASWM; F0)

ASV((BLTS)2'1K0) = ASV((BLTs)i; K0) " ASV(T;F0)

 

= ARE(T, F0)

for all 1 S i g p. Hence the asymptotic relative efﬁciency of the LTS estimator in p + 1
dimensions does not depend on the distribution of the carriers, but only on the distribution
of the errors and equals the asymptotic relative'efficiency of the LTS location estimator at

the error distribution F0. For the normal distribution these relative efﬁciencies are given in

74

 

 

 

Table 1. Note that the efﬁciency of LTS does not depend on p, the number of explanatory
variables, but only on the number of dependent variables.

Table 4.1. Asymptotic relative efﬁciency of the LTS estimator w.r.t. the Least Squares
estimator at the normal distribution for several values of K.

K 1 3 5 7 10 20 30
ARE 0.071 0.286 0.483 0.636 0.792 0.973 0.997

4.4 Finite-sample simulations

4.4.1 Algorithm

,In algorithmic terms, the procedure can be described as follows:

Step I

1. Create an initial subsets H0. Draw a random p-subset J, and compute do := the
coefficients of the hyperplane through J. If J does not deﬁne a unique hyperplane (i.e., when
- _ the rank of X J is less than p), redraw J random observations until it does. Then compute the
residuals To (i) := y,- — 651‘: for i = 1, . . . ,n. Sort the absolute values of these residuals,
which yields a permutation 7r for which rg(7r(1)) S r3(7r(2)) S . . - S ‘rg(7r(n)),
Ho = {ilr3(i) S 5T3(7r(h))}, h = [Tb/2] + [(P + 1)/2]-

3. Compute ,60 := LS regression estimator based on H0.

4. Iterate K (say 20) steps or until Hk = Hk-1 or Hk = Hk_2, record 6, H in the last
step and k(the actual number of iteration).

Step II

Repeat M (say 300) times step I by choose different initial subsets K0. For simplicity,
write 62-, H,- and ki, i=1,...,M

Step 111

If min(k,-) < K, then Choose argminlsisM,k,-<K 21'th 7‘12()6,)/(#(Hi) - p).

Else If mtn(k,-) = K , then Choose argminlgsM 23.61,, r§(ﬁ,)/(#(H,-) — p).

75

The key to the algorithm is to ﬁnd H -a subset having the property that the Mean Square
of its residuals from a LS-ﬁt 3w), solely based on this subset is minimal, the subset H of
observations satisﬁes H = {i : di(8wb) S €d(h)(83ub)}, where h = [(n +1) + 1)/2] or
[(n + p+ 2) / 2], (1,-(831,0 = (Y - X Baub)2- We call H “stable set” which won’t be changed
if we iterate by use H as an initial set. Sometimes, H doesn’t exist, however, we can ﬁnd H1
and H2 and they appear alternatively—If we choose H1 as an initial set and iterate one step

we reach H2 and if we iterate once more by choosing H2 as an initial set, we come back to

H1. H1 and H2 form ’closed sets’. This idea is revealed in Step I-4.

4.4.2 Finite-sample performance

In this section we investigate the ﬁnite-sample performance of the LTS estimator. Therefore,
we will compare the asymptotic efﬁciency obtained in the previous section with ﬁnite sample
efficiencies obtained by simulation. To this end, we performed the following simulations. For
various sample sizes 11, and for p = 3, we generated m = 10000 regression datasets of size n.
The response variables were generated from the stande normal distribution N (0, 1), and
w.l.o.g. we took 8 = 0 in the multiple regression model. We set the p-th regressor equal
to one, so we consider a regression model with intercept. The remaining p - 1 explanatory
variables were generated from the following distributions:

The multiple standard normal distribution N (0; I _1).

In this simulation setup, the last element of B is the intercept vector and the matrix
formed by the p — 1 ﬁrst rows of B, which we will denote by 80, is the slope matrix.
h = [(n + p + 2) /2]. For the parameter 3, we let it vary.

For each simulated dataset Z (I), l = 1,...,m we computed the regression vector
139.5. The Monte Carlo variance of a regression coefﬁcient (31,119)]- is measured as
Var((BLT5)j) = nvar¢((l3LTS)§) for j = 1,...,p — 1. The variance of the estimated
slope matrix BETS is then summarized by avej(Var((f3LT5)J-)) for 1 S j S p — 1 while
its inverse measures the ﬁnite-sample efﬁciency of the slope. Similarly we computed the
ﬁnite-sample efficiency of the intercept vector.

Table 4.2 shows the ﬁnite-sample efﬁciencies of the LTS estimator obtained by simulation

for sample size n equal to 100, 300, and 500. We see that the ﬁnite-sample efﬁciencies of the

76

LTS converge to the corresponding asymptotic efﬁciencies which are listed under 11 = 00

Table 4.2. ARES of LTS relative to the LS for p = 3

 

10

20

30

Slope

Intercept

Slope

Intercept

Slope
Intercept
Slope
Intercept

Slope

Intercept
Slope
Intercept
Slope

Intercept

100

0.047
0.068

0.193
0.254

0.409
0.480

0.589
0.662
0.748

0.706
0.944
0.946
0.941
0.918

300

0.020
0.030

0.110
0.156

0.341
0.395

0.575
0.592
0.758

0.714
0.935
0.874
0.974
0.952

500

0.013
0.019

0.077
0.112

0.247
0.377

0.549
0.647
0.767

0.805
0.952
0.998
1.000
1.000

00

0.071
0.071

0.286
0.286

0.483
0.483

0.636
0.636
0.792

0.792
0.973
0.973
0.997
0.997

 

Note that efﬁciencies for large I are always higher than the corresponding efﬁciency for
small t. We also did a study by choosing different p value. The results of them conﬁrm the
theoretical ARE.

To illustrate the difference between ordinary least trimmed squares estimator (OLTS) and
the trimmed squares estimator introduced in this chapter. We generated a regression data
set with n = 1000 observations of which 30% bad leverage points. The ﬁrst 700 observations

were generated by the formula

y,=12:,-+1+e,~ i=1,2,...,700,

where x, ~ N (0, 100) and 5,- ~ N (0, 1). The other 300 observations (23,-, 92') were
drawn from the bivariate normal distribution with p = (50,0) and 2 = 2512. The entire

data sets were shown in ﬁgure. We compute 75% OLTS and t = 30 LTS.

77

20 30 40

10

-20 -10

-30

 

 

 

— LTS
' - ' OLTS

 

 

 

d.
-‘
0"

 

 

Figure 4.2. LTS estimator v.s OLTS estimator

78

 

CHAPTER 5

Selected proofs of main results and

lemmas

5.1 Selected proofs for results of chapter 2

PROOF OF THEOREM 2.3.1. The proof follows the lines of that given in Zuo (2003). For
simplicity we sometimes write F5 for F (e, 61.) for a given :1: E R. Denote by 03(1) a quantity
that may depend on :1: but approaches 0 as e -> 0 for the given 2:. We need the following

results whose more general versions are given and treated in Zuo (2003).

Lemma 5.1.1. For ﬁxed :1: 6 R and suﬁ‘iciently small 8, We have for ﬁxed 0 < 6 < 00
(a) D(y, F) and D(y, F (5,633)) are Lipschitz continuous in y 6 R;

(b) supyeg |D(y,F(e,53)) — D(y, F)| = 03(1) for any bounded set S C R;

(C) |L(F(€,5x) - L(Fll = 022(1): |U(F(€.6z)) .. U(F)! = 03(1)-

First we write

13%;? (y — T(F)>w(D(y. when)
T(FE) — T(F) = U(FE)
L(FE) w(D(y. FelldFe(3/)

 

(5.1.1)

We focus on the numerator. The denominator can be treated in the same (but less involved)

79

manner. The numerator can clearly be decomposed into three terms

”(’75) U(F))
.=(/ -/m )-(y T( ()F)w(D(y.F.»eF.(y).
L(Fe)

) .
12. = 11w] (y - T(F))<w(D<y. F.» - wary, F»)dF.<y),
L(F)

U(F)
1.. = f (y - T(F))w(D(y.F»dF.(y).
L(F)

It follows immediately that
$13,. = I{xe[L(F),U(F)]}($ — T(F))w(D($, F», uniformly in y for the given 11:. (5.1.2)
In light of the continuity of w(1)(-) and Lemma 5.1.1, we have that

we F.» — w(D(y. F» = <w<1_><D<y. F» + o.(1))(D<y. F.» - 0(4). F». (5.1.3)

uniformly in y for y in a bounded set S for the given 2:. This, combining with the bound-

edness of L(F) and U (F) and Lemma 5.1.1, immediately gives

1 W” (1)
2125 =/ (y — T(F))w (D(th))-’F($;D(31,F))dF(3/) + 02(1) (51-4)
L(F)

In virtue of equation (5.1.3), boundedness of L(F) and U(F), Lemma 5.1.1, and the

argument used above, we have for suﬁciently small 6 > 0
$11. = i / A(y.e)(y — T)w(D(y.F»dF<y) — / A(y.e)(y — T)w(D(y.F»dF<y)
+ f 4(1). e)<y - T(F))w<1>(D<y.F))IF(e; my. mm.» + 0.0)

where A(y,e) = I{y€[L(F5),U(F5)]} - I{y€[L(F),U(F)]}' Call the three terms With integration
1151, 1152, I153, respectively. It’s obvious that 1152 and 1253 are 03(1) because of the bound-

edness of L(F) and U (F), Lemma 5.1.1, and Lebesgue’s dominated convergence theorem.

80

Conditions on f and w, the mean value theorem, and Lemma 5.1.1 imply that

11.1=§(/U(F€)(y-T TwD)( (aF ))dF(y) [if (31- TM D(y.F))dF(y))

U(F)
=(02e - T)w(D(92e, F))f(025)(1Fl$; U(F)) + 0x(1))
- (91: - T)w(D(91.:. F ))f (915)(1F (I; L(F )) + 017(1))
=(U - T)w(D(U, F ))f(U )1F($; U (F )l

- (L - T)w(D(L, F))f(L)1F(=r; L(F)) + 02(1)

where 025 is a point between U (F) and U (F5), and 015 between L(F) and L(FE). The

desired result now follows. C]

PROOF OF THEOREM 2.3.3. The proof is similar to that of Theorem 2.3.1 and we adopt
the same notation. Let w(D(y, F€)) — w(D(y, F)) = w(1)(6(y, F€))(D(y, F5) — D(y, F)). We

need the following lemma whose proof is omitted here.

Lemma 5.1.2. Under the conditions of Theorem 2.3.3, we have
(a) my... I when. F.» — w<1>(D(y. F))I = 0.0).-

(b) supyea Iyw<1><e(y. F.)» < oo for sumeeenuy emau . > o,-
(e) (D(y, F.) — D(y, F))/e = IF(e; D(y, F)) + 210.0) + 0.0).

First we write
U (Fe)

Tw(FE)-Tw(F) = fw(D( (y,F.))dF.(y)ULUF.)W

 

D(y, 2))(31’ Tw)dFe(y )

L(Fe)
+ / w(D(y.F.))(L(F.) — Tw»dF.(e)
-—oo
00
+ / w(D(y.F.»(U(F.) — Tw)dF.(y)]
U(Fe)
Lebesgue’s dominated convergence‘theorem implies immediately that

[w(D(y,F.))dF.(y) = [w(D(y.F))dF(y)+o.(1) (5.1.5)

We now focus on the numerator. Call the three terms Ii(F5,y), i = 1, 2, 3, respectively. By

81

the proof of Theorem 2.3.1, we see immediately that
$11052. y) =(U — T...) W(ﬁ)f(U)IF(L; U) — (L — T...) w(F)f(L)IF(e; L)

U
+/L (y -— Tw) w(1)(D(y, F))h(:r:,y)dF(y) + I(L S :1: S U)(:v — Tw) w(D(r, F))

+

 

_ U
1 E E f (y - Tw) W(D(y.F))dF(y) + 03(1). (5.1.6)
L

Now it suﬂices to treat 12(F5,y). Following the proof of Theorem 2.3.1 and employing
Lemmas 5.1.1 and 5.1.2, we have

L
:I2(F€,y) = [(2 < L) w(D($,F))(L — Tw) + IF(x;L) j... w(D(y,F))dF(y)

L
+(L — T...) l... w(1>(U(y. F))h(e. y)dF(y) + (L - T...) w(e)f(L)IF(e; L)

l—e
e

 

L
+ f...“ — T...) w(D(y, F))th» + o.(1). (5.1.7)

Likewise we have

$130241) = 1(1- > U)w(D(:1:,F))(U -— Tm) + IF(x; U) [U00 w(D(y.F))dF(.U)

+(U — T...) l: w<1>(D(y. F))h(e. y)eF(y) — (U — T...) w(e)f(U)IF(.e; U)

1—5
8

 

+ /°°(U — T...) w(D(y. F))th» + o.(1). (518)
U

Combining the last four displays, we have the desired result. El

PROOF OF THEOREM 2.4.1. For the sake of convenience, we deﬁne
Vn = ﬁ(Fn -F), Hn(‘) = ﬁ(D(°,Fn) _ D(',F)) (5.1.9)

The following result, a special version of a general result in Zuo (2003), is needed in the

‘ proof.

Lemma 5.1.3. Assume that F’ = f exists at u and continuous in small neighborhoods of
u i o with f (a) and f (a + o) + f (a - a) positive. Then for 0 < 6 < co and any numbers
L1 < U1

(0) SUP (1+I$|)|Hn($)| =0p(1); and
z€[L1,U1]

(b) Hn(a:) = f]F(y,D(:r, F))Vn(dy) + 010(1), uniformly over a: 6 [L1, U1].

82

PROOF: For :1: 6 [L1, U1], it is readily seen that
D($,Fn) - D($,F) = -(D($,F)(0n - 0)+(11en - [ID/0n-

(a) follows immediately since the given conditions allow asymptotic representations for both

an and on (see, e.g., page 92 of Serﬂing (1980)), which lead to (b). C]

The proof of the theorem follows the lines given in Zuo (2003). First, we can write

Un Un
\/"_1 (Tn - T) = ﬁ/Ln (y _ le(D(y) Fnllan(y)//Ln w(D(y, Fnllan(y) (5.1-10)

and the numerator then can be decomposed into three terms

(In U
Iln = ﬁ/Ln (y - T)w(D(y.Fn))Fn(dy) - x/T—z/L (y - T)w(D(y,Fn))Fn(dy);
U U
12.. = x/E/L (y - T)W(D(U,Fn))Fn(dy) — ﬁ/L (y - T)w(D(y. F ))Fn(dy);

U
13,, = «Ft/L (y — T)w(D(y,F))Fn(dy).

It follows immediately that
£311

6 = % gnu.) (5.1.11)
For 12", we note that
U U
12.. =¢F / (y — T)w(U(y.F.»F.(dy) — U5 f (y — T)w(U(y.F»F.(dy)
L L

U

= [L (y — T)w’(6n(y))Hn(U)Fn(dy)
U

= [L (y — T)w’(D(y.F»H.(y)Fn(dy)

U
+ [L (y — T)(w’(e.(y» - w'(D(y, F)))Hn(3/)Fn(dy)

A
=J1n + Jzn,

where 9n(y) is a point between D(y, F”) and D(y, F). For Jzn, by Lemma 5.1.3, we have
U

J... = [L (y - T)(w’(9n(y)) — w’(U(y. F)))H..(y)F.(dy)

U
S/L (|y|+|T|)|Hn(y)|(w'(9n(y))-w'(D(y.F)))Fn(dy) = 012(1)

83

On the other hand, by Lemma 5.1.3, continuity of w’ and boundedness of L and U, Fubini’s

Theorem and the central limit theorem, we Obtain
fLU(y — T)w’(D(y. F)))H..(y)(F. — F)(dy)
= LU“ - T)w’(D(y, F)))( / my; D(y, F))y.(de))(F. — F)(dy) + .,(1)
= £7 / [51(3) - T)w’(D(y. F ))1F(x; 0(ny ))Vn(dy)vn(d:r) + 012(1)

U
= -0—%( — [L (y — T)w’(D(y.F»D(y,F)u.(dy) / IF(e;e(F»u.(de)+
U
- [L (y-T)w’(D(y.F»un(ey) [IF(e;y(F»u.—.(dx)) +0.0) = 0.0).

which, in conjunction with Lemma 5.1.3 and the Ribini’s Theorem, yields
U I
= l. (y - T)w (D(y. F)))H.(y)F(dy) + 0.0)

U
= f ([1. (y_T)y'(n(y,F»»F(x;D(y.F))F(dy))u.(de)+oy(1)

= 5% £3209) + 0:20)-
Hence
ifg = 2}? gym) + 0.0). (51-12)

For [1", we note that

=f/Z" (y— T)w(D(y F..»F..(dy)— f/U (y— (Dy F.»F.(dy)
=./5/L:(y— D(yFn»F.(dy) )+f/U" (y— )wD( (y.F.»F.(dy)
=Vln+V2n

Now we deal with V1,, only since V2,, can be’ treated similarly. By mean value theorem,
L L
v... = ﬂ [L (y — T)w(D(y. F..»F(dy) + [1...” — T)w(D(y. F..»u.(dy)
n

L
= -(y.. — T)w(D(y.. Fn))f(nn) «F (L. — L) + [1...” — F)w(D(y. F..»u.(dy).

84

 

 

where r)" is a point between Ln and L. Note that by the conditions given, we have

(Tin _ T)w(D(llny Fn))f(77n)\/E(Ln - L)
= (L — T)w(D(L, F))f(L)—% Z IF(x,; L(F)) + 0.0).
i=1

Since P(X = L) = 0, it is readily seen that for large n and L" = —1 — |L|,

L
f (y — T)w(D(y. F.»u.(ey)
Ln
= “/lI(L*,Ln)(y) — I(1.=",L)(y)l(y- T)w(D(y,Fn))Vn(d1/) +0p(1) = 011(1),

by an empirical process theory argument; see Pollard (1984) or van der Vaart and Wellner
(1996). Thus

v... = -(L — T)w(D(L. omnﬁ )3 IF(x,; L(F)) + 0,,(1),

i=1

which, combining with a similar result from V2”, gives

5152 = % geld.) + 0.0). (51.13)

In the same but much less involved manner we can show that

U

U"
/ w(D(rc,Fn))Fn(dx)= / w(D(z,F))F(dr)+0p(1/\/r—1). (5.1.14)
Ln L

Now (5.1.11), (5.1.12), (5.1.13) and (5.1.14) give the desired result. [3

PROOF OF THEOREM 2.4.3. The proof is very similar to that, of Theorem 2.4.1.
We adopt the notation in the proof of Theorem 2.4.1. Let w(D(y,Fn)) — w(D(y,F)) =
w(1)(0(y,Fn))(D(y, F") — D(y, F)). We need the following lemma whose proof is skipped

here.

Lemma 5.1.4. Under the conditions of Theorem 2.4.3, we have

(a) sup(1+ 1y)» w<1>(e(y,F.» — w<l>(D(y.F»( = 0.0). and
yER

(b) Hue!) = 9011(1) + 011(1)-

85

We ﬁrst write

 

1 U(Fn)
Tw(Fn) -Tw(F) = fw(D(y,Fn))an(3/) [fur ) w(D(nynDtU-Tw)al1”‘n(y)

L(Fn)
+ / w(D(y,Fn))(L(Fn) — Tw))an(y)

1 L(Fn)w(p(y,Fn))(U(Fn) - Tw)dFe(y)l

The given conditions guarantee that w(D(yn,Fn)) —> w(D(y,F)) as. for every y E R
and every sequence yn -> y. Skorokhod representation theorem and Lebesgue’s dominated

convergence theorem imply immediately that

[w(D(y,Fn))an(y) = /w(D(y,F))dF(y)+o(1), a.s. (5.1.15)

We now focus on the numerator. Call the three terms L(Fn, y), i = 1, 2, 3, respectively. By

the proof of Theorem 2.4.1, we see immediately that
n

11(Fn. y) = £2 ((U — Ty) w(U)f(U)IF(X.-; U) + 1(L s X. s U)(X.- - in.) w(D(X.» F))

i=5
+L (y _ Tw) w(1)(D(y, F))h(Xi,y)dF(y) - (L - Tw) w(ﬁlﬂLUFin; Ll)
+0p(n_1/2). (5.1.16)

Now it suffices to treat 12(Fn,y). Following the proof of Theorem 2.4.1 and employing
Lemmas5.1...3and514, wehave

12-(Fn,y) — -:ZI(I(X.' <L)W(D(Xt,F))(L- Tw)+1F(X1;L)/_:OW(D(U,F))dF(U)

+(L — in.) [:0 w(1>(D(y. F))h(X.-, y)dF(y) + (L — T...) w(F)f(L)IF(X.-; L))
+op(n‘1/2). (5.1.17)

Likewise we have

13(-Fn.y) — - ;Z(I( I(.-X > U) (D(X..F»(U - T...) + IF(X.;U) [Um w(D(y.F))dF(y)
+(U— T...) [<va1 F))h(X.-.y)dF(y)- (U - T...) w(U)f(U)IF(X.-; U))
+o,,(n‘1/2). (5.1.18)
Combining the last four displays, we have the desired result. Cl

86

 

5.2 Selected proofs for results of chapter 3

Proof of Theorem 3.3.2. For simplicity we sometimes write F5 for F (e, 63) for a given a: 6 R.
Denote by 03(1) a quantity that may depend on a: but approaches 0 as e —1 0 for the given

2:. We need the following results whose proof is omitted.

Lemma 5.2.1. For ﬁxed a: E R and sufﬁciently small 6, we have

(a) D(y, F) and D(y, F (5,655)) are Lipschitz continuous in y E R;

(b) supzes |D(y,F(e,6,;)) — D(y,F)| = 035(1) for any bounded set S E R;
(C) |L(F5) - L(F)| = 01(1), lU(Fe) - U(F)| = 0,,(1).

Proof. We only need to show that
”(33; S(F)) = 71(1) + 72(3) + 7"3(13)

Step 1: First we write

 

F _ 1,3353%? - s)w2(D(y.F.»dF.(y)

S(Fg) — s( ) (5.2.1)
(£1553) w2(D(y, F.»dF. (y)

The continuity of w(1)(e) and Lemma 5.2.1 yields
w2(D(y, F5) = w2(D(y, F) + 03(1), unifomly in y for the given 2:.
Therefore, we have

U(Fe)
/ w2(D(y.F.»dF.-(y)
L(Fe)

U(Fe)
= (1 - 5) [L(F) W2(D(ny))dF(3/) + EI{xe[L(F€),U(F5)]} w(D(y, 175)) + 02(1).
6

U(F)

= / W2(D(y.F))dF(y)+0e(1)
L(F)

Step 2: For the numerator of equation (5.2.1), it can clearly be discomposed in three terms

U(F)

U(Fe)
11. = ( f (y2 — s) wy(D(y,F.»dF.(y) — f (y2 - s)w2(D(y.F.»dF.(y))

L(Fe) L(F)

U(F) U(F)
12.=(/ (ye-.)w,(p(y,p,))yy,(y)_/ (ye-.)w,(p(y,p))yy,(y))

L(F) L(F)

U(F)
1.. = / (y2 - s) W2(D(y, F))eF.(y)
L(F)

87

It follows immediately that
$13,. = I{x€[L(F),U(F)]}($2 — s)w2(D(:1:,F)), uniformly in y for the given :15. (5.2.2)
In light of the continuity of wgl)(-) and Lemma 5.2.1, we have that

w(D(y, Fell - w(D(y, F)) = (W§I)(D(y. F)) + 02(1))(D(y, 17.)) - D(y, F)). (533)

uniformly in y for y in a bounded set S for the given 1:. This, combining with the bound-
edness of L(F) and U (F) and Lemma 5.2.1, immediately gives

1 _ ”(Fl 2 (1) .
E12. — [L(F) (y —e)w2 (p(.,,p))yy(.,D(y,p))yp(y) +o.(1) (5.2.4)

In virtue of equation (5.2.3), boundedness of L(F) and U (F), Lemma 5.2.1, and the argu-

ment used above, we have for sufficiently small 6 > 0

1

211. = 2. / A(y.e)(y2 - .) w2(D(y.F))dF(y) — / A(y.e)(y2 - e) W2(D(y. F))dF(y)

+ / A(y.e)(y2 — s) w§1)(D(y, F))h(e. y)dF(y) + o.(1)

where A(y,e) = I(:r: E [L(FE),U(F€)] — I(:t‘ E [L(F),U(F)]). Call the three terms with
integration 1151., i = 1, 2, 3 respectively. It’s obvious that I152 and [253 are 03(1) because of
the boundedness of L(F)) and U (F), Lemma 5.2.1, and Lebesgue’s dominated convergence

theorem. Conditions on f and wg, the mean value theorem and Lemma 5.2.1 imply that

1
11., =2 /(I{ze[L(FE),U(F5)]} - I{z€[L(F),U(F)]})(y2 -.- 8) W2(D(y. F ))dF (y)

1 U(Fg) L(Fe)
=E( L(F) (y2 —s)w2(D(y,F))dF(y) - [L(F) (y2 —s)W2(D(.U.F))dF(y))

=(03. — s) wz(D(62.. F))f(62.)(IF(e. U(F)) + 0.0))
— (6%. — s) w2(D((h.. F))f(915)(1F($a L(F)) + o.(1))
=(U(F)2 - .) w2(D(U. F))f(U)IF(e. U(F))

- (L(F)2 - s) W2(D(L.F))f(L)1F(x. L(F)) + o.(1)

where 025 is a point between U (F) and U (F5), and 015 between L(F) and ME.) Therefore

the desired result now follows. E]

Proof of Theorem 3.3.5

88

 

 

Proof. The proof is very similar to that of Theorem 3.3.2. We adopt the notation in the proof
of Theorem 3.3.2. Let w2(D(y, Fn)) - w(D(y,F)) = w§1)(e(y,rn))(o(y, Fa) — D(y,F)).
We need the following facts whose proofs are omitted here.

Lemma 5.2.2. Under the conditions of Theorem 3.3. 5, we have

(a) supy.e(1 + y2)(w£1’(e(y. F.» - wé”(D(y. F)» = o.(1); and

(b) supyeRU + y2)ly2wgl)(9(y, F5))| < 00 for suﬂiciently small 6 > 0;

(C) (D(y, Fe) - D(y, F))/6 = IF(x, 0(3), F)) + y0x(1) + 01(1)-

First we write

 

F F— 1 ”up” D( F))(2— )dF()
3w( el-3w( )_fW2(D(y’F5))dFE(y)[./L(FE) w2( y: e y 3w 63/
L(Fe)
+ l... w2(D(y.F.»(L2(F.) — ew)>dFe(U)
+ [I]: )w2(D(y. F.))(U2(F.) — s...)dF.(y)] (5.2.5)

Lebesgue’s dominated convergence theorem implies immediately that

/W2(D(yaFe))dFe(3/l =/W2(D(U.F))dF(y)+0x(1) (52-6)

Step 2: We now focus on the numerator of equation (5.2.5), Call three terms I,(F5,y),

i = 1, 2,3, respectively. By the proof of Theorem 3.4.1, we see immediately that

in (F... y) =(U2 — sw)W2(ﬂ)f(U)1F(x; U) — (L2 — so) we(e)f(L)IF(e; L)

U
+/L (y2 - 8w)Wi1)(D(U,F))h(x.y)dF(y) + 1(L S 3 S UNI? - 8w) W2(D($,F))

l—e
e

+

 

U
[L (y2 - s...) W2(D(y.1'*‘))c11”(1/) +0.0). (527)

Now it sufﬁces to treat 12(F5,y). Following the proof of Theorem 3.4.1 and employing
Lemma 5.2.1 and 5.2.3, we have

Ergo}, y) = (L2 — sw)I(:c < L) w(P(:r, F)) + (L2 - sw) w2(D(L,F))1F($; L)

L(F)
+ (L2 — s...) / w§1’(D(y, F))h(e. y)dF(y)

-CD

L(F)
+2L(F)1F(y,L(F))/_00 w2(D(y,F))dF(y) (5.2.8)

89

_ U
1 . E f (L2 - s)w(D(y.F))eF(y) +o.(1) , (5.2.9)
L

 

+

Likewise we have

i13(F..y) = (U2 — so)1(e > U) w(F(e.F)) — (U2 — sw) w2(D(U. F))IF(e; U)

+ (U2 — so) [I]; w§1’(D(y. F))h(e.y)dF(y)

 

oo
+2U(F)IF(2:,U(F)) [U(F)w2(D(y,F))dF(y) (5.2.10)
1 - E U 2
+ 5 ll. (U - s)w(D(y,F))dF(y) + 03(1) (5.2.11)
Combining the last four displays, we have the desired result. El

Proof of Theorem 3.4.1

Proof. For sake of convenience, we deﬁne

Vn = \/T—1(Fn - F). Hn(') = J5(Dn(-.Fn) - D(nFll (52-12)

The following result, is needed in the proof.

Lemma 5.2.3. Assume that F’ = f exists at p and continuous in small neighborhoods of
a :1; o with ﬂy) and f(u + a) + f(/1 - a) positive. Then for 0 < B < 00, we have

(a) supxeleﬂ + x2)|Hn(:r:)| = 0,,(1); and

(b) Hn($) = f h(y.z)vn(dy) + 012(1), uniformly on L 6 MW ). U (F )1-

Proof. For a: E [L, U], it is readily seen that
D(Ian) _ D($,F) = "(D($1F)(aﬂ _ 0') + (”41 "' “ll/071'

(a) follows immediately since the given conditions allow asymptotic representations for both

an and on (see, e.g., page 92 of Serﬂing (1980)), which lead to (b). C]

90

Since S2(F) can be breaken into two parts (s(F), A(F)) and the proof of the two parts

are similar, we only prove

S(Fn) - S(F =%ZIF( X,;s(F)) +o,,(;_)

ni=1
=-—2:n( X.)+-,1,Ze2( X.)+-,’,2:jm(X- +019 jﬁ)

First, observe that

Un Un
Mew.) — e(F» = «a (I... (y2 - e)w2(D(y.F.»F.(ey)/ ft. w2(D(y.F..»F.(dy)
(5.2.13)

and the numerator then can be decomposed into three terms
Un U
111. = We [L (y2 - S)w2(D(y, Fn))Fn(dy) - ﬁ/L (312 - S)w2(D(y,Fn))Fn(dU)
3 y
12. = «a [L (y2 — s)w2(D(y,Fn»Fn(dy) — t/ﬁ/L (y2 — s)w2(D(y.F»F.(ey)
U
1y. = «a f). (y2 — s)we(D(y.F))Fn(ey)

It follows immediately that
13,, = % 27305). (5.2.14)
For 12”, we note that
U U
I... =./5 [L (y2 — s) we(D(y,F.»F..(dy) — «a [L (y2 — s) w2(D(y,F))F..(ey)
U
= [L (y2 — .) w§1’(e.(y))H..(y)F..(dy)
U 2 (1)
= f (y — e)w2 (D(y,F»H..(y)F.(dy)

+/U (y— ’(..e (y» — wE.”(D(y.F»)H.(y)F.(ey)

=J1n + 121.

where 6n(y) is a point between D(y, Fn) and D(y,F). For J2", by using Lemma 5.2.3, we

91

 

 

have

Jen: [U (y— 2.1’()-e..(y) wt 1’(D(y.F)))H..(y)F..(ey)
s [LU (y +s »(1H..(y)I(w§’(e..(y»—w§”(D(y.F»)F.(dy)
=0p(1)

On the other hand, by Lemma 6.3, continuity of my) and boundedness of L and U, Fubini’s

Theorem and the central limit theorem, we obtain
U 2 (1)
[L (y -s)w2 (D(y.F)»H..(y)(F.-F)(dy)
U
= / (yZ-s)w“’(D(y F» (/h(e.y)y..(de)+o.(1))(F..-F)(dy)
=71. / [U (y _.)w<1>(o D(y (F)))IF(e-.D(y.F»e..(ey)u..(de) +oy(1)
= m (ll-A (y2_3)w21)1)( (y,F))D(y,F)l/n(dy)/IF($,U(F))Vn(Cl$)
U
— [L (y2 — .)yg1>(o(y,p))y,,(yy) / IF(e.y(F))u.(de)) + 0.0)

which, in conjunction with Lemma 5.2.3 and the thini’s Theorem, yields
U 2 (1)
= J.../ (y — s)w2 (D(y.F»)H..(y)F(dy) + opu)
=([ [U (y —s)w§”(D(y.F)))h(e.y)F(dy))u..(ee) + 0,.(1)
= 62— ‘/— 2720“) + 011(1)

Hence
I_2_n
t( )+0 1 5.2.15
—,—2 =72: 2(X p() , ( )

For 11m note that
Un U
I... = «a f (y2 — s) we(D(y. F..»F..(dy) — (47/ (y2 - s) w2(D(y.F..))F..(dy)
Ln L
L Un ,
= ya / (y2 — s)w2(D(y,Fn))Fn(dy) + t/ﬁ / (1)2 - s)W2(D(y.Fn))Fn(dy)
Ln U

A
= Vln + V2n

92

Next we only deal with Vln since V2,, can be treated similarly. By mean value theorem,
L
v1. = «a (#2 — s) w2(D(y,Fn>)Fn(dy)
L 2 L 2
= J17 [L,.” - s) w2(D(y.Fn)>F(dy) + (1...“ — s) w2(D(y, Fn>>un(dy)

L
= —(7772z — 5) W2(D(77ny Fn))f(77n)\/7—1(Ln " L) +/L (3’2 — 5) w2(D(y, F"))V"(dy)

where 17,, is a point between Ln and L. Note that by the conditions given, we have
1 11
(17121-3) W2(D(77n. Fn))f(nn)\/5(Ln-L) = (142-8) W2(D(L, ”WU-J: Z IF(Xz', L(F))+0p(1),
i=1
Since P(X = L) = 0, it is readily seen that for large n and L“ = —1 — |L|, ,
L
f (y2 - s) W2(D(y, Fn>)u..(dy)
Ln
= /11{(L*,Ln)}(y) - I{(L*,L,,)}(y)](y2 — Slw2(D(yy Fn))Vn(d3/) + 010(1)

= 013(1):

by an empirical process theory argument; See Pollard (1984) or van der Vaart and Wellner

(1986). Thus
1 n
v1. = —(L2 — s)w2(D(L.F))f(L)T 231m.» L(F)) + 012(1), (5.2.16)
n i=1
which combining with a similar results from V2", gives
11" 1 iZ(X)+o(1) (5217)
— = — 1 ,- p . .
6 l/ﬁ i=1
In the same but much less involved manner, we can show that
Un U
(I... W2(D(D. F))F..(Jm) = [L w2(D(y.F>>F(dy) + ops/«5) (5.2.18)
Now (5.2.14), (5.2.15), (5.2.17) and (5.2.18) give the desired result. El

Proof of Theorem 3.4.3. The proof is very similar to that of Theorem 3.4.1. We
adopt the notation in the proof of Theorem 3.4.1. Let w(D(y,Fn» — w(D(y,F)) =
w(1)(0(y,Fn))(D(y,Fn) — D(y,F)). We need the following lemma whose proof is skipped

here.

93

Lemma 5.2.4. Under the conditions of Theorem 3.4.3, we have
(a) supyeyu + ,2). w§1’(a(y. F.» — wé”(w2(D(y. F)))! = 0pm,- and
(b) Hn(y) = y0p(1) + 012(1)-

We ﬁrst write

1
3”" =I W2(D(y,Fn))an

Ln
811,100+ w2(D(y,Fn))(L2(Fn) - 3112))an(1')

 

Un
(y) [[Ln" w2(D(y.F..)>(z -s..)dF..(x )

+ [U00 W2(D(nyn))(R2n “ SW)dF"($)]

The given conditions guarantee that w(D(yn,Fn)) -+ w(D(y,F)) a.s. for every 31 6 R
and every sequence 3],, —-+ y. Skorohod representation theorem and Lebesgue’s dominated

convergence theorem imply immediately that

/ w(D(y.F.>>dF..(y) = / w2(D(y,F))dF(:c> (52.19)

We now focus on the numerator. Call the three terms I,-(Fn,y), i = 1, 2, 3, respectively.

By the proof of Theorem 3.4.1, we see immediately that

11(Fn.y)=-,1;Zn( IE)(yz—sw)w§”(D(y.F))h(X.-,y)dF(y))+op(n'1/2) (5.2.20)
nizl

Now it suffices to treat 12(Fn, y). Following the proof of Theorem 3.4.1, and employing
Lemmas 5.2.3, 5.2.4; we have

12(=F..,y) ﬁznaz —sw)I(X-<L)w(F(X.-,F))

L( (F)
+ (L2 -— s...) f... w§1’(D(y,F))h(X.-,y)dF(y)

L(F) _1/2
+2L(F)IF(X.-.L(F))[_oo w2(D(y,F))JF(y>+op(n > (5221)

Likewise we have

13(Fn.y> =(U2 — 3w)I(Xi > U)w2(D(X.-, F)) (52.22)

94

+(U2- s ) / ‘1’D(y.)>h(x..y)dF(y)
+ 2U(F)IF(X., U(F)) [w w(D(y, F))dF(y) + own-V2)
U(F)

Combining the last four displays, we have the desired result.

95

(5.2.23)

“2‘

BIBLIOGRAPHY

[1] Agullo, J., Croux, C., and Van Aelst, S. (2002). The multivariate least trimmed
squares estimator. Submitted.

[2] Bai, Z.D., Chen, N.R., Miao, B.Q., and Rao, C.R. (1990), Asymptotic Theory of
Least Distance Estimate in Multivariate Linear Models, Statistics, 21, 503-519.

[3] Bickel, P. J. (1965). On some robust estimates of location. Ann. Math. Statist. 36
847-858.

[4] Donoho, D. L., and Huber, P. J. (1983). The notion of breakdown point. In A
Festschn'ft for Erich L. Lehmann (P. J. Bickel, K. A. Doksum and J. L. Hodges,
Jr., eds) 157-184. Wadsworth, Belmont, CA.

[5] Jaeckel, L. A. (1971). Some ﬂexible estimates of location. Ann. Math. Statis. 42
1540-1552.

[6] Jureckova, J., Koenker, R. and Welsh, A. H. (1994). Adaptive choice of trimming
proportions. Ann. Inst. Statist. Math. 46 737-755.

[7] Hampel, F. R., Ronchetti, E. Z., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust
Statistics: The approach based on inﬂuence function. Wiley, New York.

[8] Hogg, R. V. (1974). Adaptive robust Procedures: A Partial review and some sug-
gestions for future applications and theory. J. Amer. Statist. Assoc. 69 909-923.

[9] Kim, S. (1992). The metrically trimmed mean as a robust estimator of location. Ann.
Statist. 20 1534-1547.

[10] Koenker, R., Portnoy, S. (1990). M-estimation of multivariate regressions. Journal
of the American Statistical Association, 85, 1060-1068.

[11] Marrona, R.A., and Yohai, V.J. (1997), Robust Estimation in Simultaneous Equa-
tions Models, Journal of Statistical Planning and Inference, 57, 233—244.

[12] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

[13] Rousseeuw, P.J. (1984). Least median of squares regression. Journal of the American
Statistical Association, 79, 871—880.

[14] Rousseeuw, P.J., Van Aelst, S., Van Driessen, K., and Agullo, J. (2001). Robust
multivariate regression. submitted.

96

 

[15] Rousseeuw, P.J. and Van Driessen, K. (2002), Computing LTS Regression for Large
Data Sets, Estadistica, 54, 163-190.

[16] Serﬂing, R. (1980). Approximation Theorems of Mathematical Statistics, John Wiley
& Sons, New York.

[17] Shorack, G. R. (1974). Random Means. Ann. Statist. 2 661-675.

[18] Peter J. Rousseeuw and Christophe Croux. Alternative to the Median Absolute
Deviation. J. AM. Stat. Assoc. 88, 1993.

[19] Stigler, S. M. (1973). The asymptotic distribution of the trimmed mean. Ann. Statist.
1 472-477.

[20] Stigler, S. M. (1977). Do robust estimators work with real data? Ann. Statist. 5
1055-1077.

[21] Thkey, J. W. (1948). Some elementary problems of importance to small sample
practice. Human Biology 20 205-214.

[22] van der Vaart, A. W., and Wellner, J. A. (1996). Weak Convergence and Empirical
Processes With Applications to Statistics. Springer.

[23] Welsh, AH; Morrison, HL Robust L estimation of scale with an application in as-
tronomy. J. Amer. Statist. Assoc. 85 (1990), no. 411, 729—743. 62F35.

[24] Zuo, Y. (2003). Projection depth trimmed means for multivariate data: robustness
and efﬁciency (the latest version: Multi-dimensional trimming based on projection
depth was tentatively accepted by the Annals of Statistics in 2004).

97

 

   

I1111111111211I111]