LIBRARY
Michigan State University

This is to certify that the dissertation entitled

Trimmed and Winsorized Estimators

presented by Mingxin Wu has been accepted towards fulfillment of the requirements for the PhD degree in Statistics.

Yijun Zuo
Major Professor's Signature

Date: 05/04/06

MSU is an Affirmative Action/Equal Opportunity Institution

TRIMMED AND WINSORIZED ESTIMATORS

By

Mingxin Wu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Probability and Statistics Department, 2006

ABSTRACT

TRIMMED AND WINSORIZED ESTIMATORS

By Mingxin Wu

The dissertation consists of three parts. The first part studies trimmed and Winsorized means based on a scaled deviation. The influence functions of the trimmed (and Winsorized) means are derived and their limiting distributions are established via asymptotic representations. The performance of these estimators with respect to various robustness and efficiency criteria is evaluated and compared with leading competitors, including the ordinary Tukey trimmed (and Winsorized) means. The resulting trimmed (and Winsorized) means are much more robust than their predecessors.
Indeed, they can share the best breakdown point robustness of the sample median for any common trimming thresholds. Furthermore, for appropriate trimming thresholds they are highly efficient for light-tailed symmetric models and more efficient than their predecessors for heavy-tailed or contaminated symmetric models.

The second part of the dissertation applies the same trimming scheme to the scale setting. In this part, trimmed (and Winsorized) standard deviations based on a scaled deviation are introduced and studied. The influence functions and the limiting distributions are obtained. The performance of the estimators is evaluated and compared with that of high breakdown scale estimators. Unlike other high breakdown competitors, which perform poorly for light-tailed distributions and for contaminated symmetric distributions with contamination near the center, the resulting trimmed (and Winsorized) standard deviations are much more efficient than their predecessors for light-tailed distributions for suitably chosen trimming parameters and highly efficient for heavy-tailed and skewed distributions. At the same time, they share the best breakdown point robustness of the sample median absolute deviation for any common trimming thresholds.

The third part concerns the multiple least trimmed squares estimator. In this part we introduce the least trimmed squares estimator for multiple regression. A fast algorithm for its computation is proposed. We prove Fisher consistency for the multiple regression model with symmetric error distributions and derive the influence function. Simulation studies investigate the finite-sample efficiency of the estimator.

Copyright by Mingxin Wu 2006

ACKNOWLEDGMENTS

Foremost, I am deeply grateful to my advisor, Professor Yijun Zuo, for introducing me to this wonderful place, MSU.
He is gratefully acknowledged for his years of encouragement, his scientific influence on me, his infinite patience, his insights in our numerous discussions, his financial support, and his careful review of paper manuscripts. Without this generous assistance, this dissertation could not have come to light. To be one of his students is my great honor.

I also wish to express my gratitude to my dissertation committee, Professor Connie Page, Professor Habib Salehi, and Professor Lijian Yang, for sparing their precious time to serve on my committee and giving valuable comments and suggestions. I am grateful to Professor Connie Page and Professor Dennis Gilliland for accepting me as one of the consultants at CStat. Working at CStat has been such a good experience that I have had the chance to encounter some very interesting topics such as survival analysis, machine learning, and applied aspects of statistics. My thanks also go to Professor James Stapleton for his generous help and constant support.

The support and encouragement of the members of the Statistics Department are greatly appreciated and acknowledged.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction and Motivation
  1.1 Location
  1.2 Scale

2 Trimmed and Winsorized means based on a scaled deviation
  2.1 Introduction
  2.2 Scaled deviation trimmed and Winsorized means
  2.3 Influence function
  2.4 Asymptotic representation and limiting distribution
  2.5 Performance comparison
    2.5.1 Breakdown point
    2.5.2 Influence function and gross error sensitivity
    2.5.3 Large sample relative efficiency
    2.5.4 Finite sample relative efficiency
  2.6 Remarks
3 Trimmed and Winsorized standard deviations based on a scaled deviation
  3.1 Introduction
  3.2 Scaled-deviation trimmed and Winsorized standard deviation
  3.3 Influence function
  3.4 Asymptotic representation and limiting distribution
  3.5 Comparison
    3.5.1 Breakdown point
    3.5.2 Influence function and gross error sensitivity
    3.5.3 Large sample relative efficiency
    3.5.4 Finite sample relative efficiency
  3.6 Concluding remarks

4 The Multiple Least Trimmed Squares Estimator
  4.1 Introduction
  4.2 Definition and properties
  4.3 The influence function and asymptotic variances
  4.4 Finite-sample simulations
    4.4.1 Algorithm
    4.4.2 Finite-sample performance

5 Selected proofs of main results and lemmas
  5.1 Selected proofs for results of chapter 2
  5.2 Selected proofs for results of chapter 3

BIBLIOGRAPHY

LIST OF TABLES

2.1 Breakdown points of mean, trimmed (Winsorized) means, and median
2.2 GESs of mean, trimmed (Winsorized) means, and median at symmetric F
2.3 GESs of mean, trimmed (Winsorized) means, and median at asymmetric F
2.4 AREs of T^β and T_w^β relative to the mean
2.5 AREs of trimmed and Winsorized means and median with α = 0.01, β = β(F, α)
2.6 REs of trimmed and Winsorized means with β = 7
2.7 REs of trimmed and Winsorized means with α = 0.01, β = β(F, α)
3.1 Gross Error Sensitivity
3.2 AREs of S and S_w relative to the standard deviation
3.3 AREs with respect to SD
3.4 β values for S having better ARE than other scales
3.5 β values for S_w having better ARE than other scales
3.6 Standardized variance of MAD_n, S_n, Q_n, the scaled-deviation trimmed and Winsorized standard deviations, and SD_n at the normal model
3.7 REs of various robust scales (β = 7 for the scaled-deviation trimmed/Winsorized scale) at (1 − ε)N(0, 1) + εN(1, 0.1)
3.8 REs of various robust scales (β = 7 for the scaled-deviation trimmed/Winsorized scale) at (1 − ε)N(0, 1) + εδ_x
4.1 Asymptotic relative efficiency of the LTS estimator w.r.t. the Least Squares estimator at the normal distribution for several values of α
4.2 AREs of LTS relative to the LS for p = 3

LIST OF FIGURES

2.1 Influence function of T^β for N(0, 1) with β = 2 and a constant weight.
2.2 Influence functions of T_w^β for N(0, 1) with β = 2 and a constant weight.
2.3 Asymptotic breakdown points of trimmed means.
2.4 Gross error sensitivity of trimmed means.
2.5 Influence functions of the trimmed and Winsorized means at the normal model.
2.6 Influence functions of the trimmed and Winsorized means for t(3) with α = 0.1.
2.7 Influence functions of the trimmed and Winsorized means for 0.9N(0, 1) + 0.1N(4, 9) with α = 0.1.
2.8 Influence functions of the trimmed and Winsorized means for 0.9N(0, 1) + 0.1N(4, 0.5) with α = 0.1.
3.1 Influence functions of S for N(0, 1) with a constant weight and β = 3.
3.2 Influence functions of S_w for N(0, 1) with a constant weight and β = 3.
3.3 Influence functions of various scales for the normal distribution.
(β = 4.5 for S and S_w)
3.4 Influence functions of various scales for the Cauchy distribution. (β = 4.5 for S and S_w)
3.5 Influence functions of various scales for the exponential distribution. (β = 4.5 for S and S_w)
3.6 Influence functions of various scales for 0.9N(0, 1) + 0.1N(1, 0.1). (β = 4.5 for S and S_w)
3.7 ARE of trimmed and Winsorized standard deviations for the normal distribution
3.8 ARE of trimmed and Winsorized standard deviations for the Cauchy distribution
3.9 ARE of trimmed and Winsorized standard deviations for the exponential distribution
3.10 GES of trimmed and Winsorized standard deviations for the normal distribution
3.11 GES of trimmed and Winsorized standard deviations for the Cauchy distribution
3.12 GES of trimmed and Winsorized standard deviations for the exponential distribution
4.1 MSR vs. number of iterations with 100 arbitrary initial subsets
4.2 LTS estimator vs. OLTS estimator

CHAPTER 1

Introduction and Motivation

1.1 Location

The sample mean is the most efficient location estimator for normal models. It is, however, not robust. The sample median is the most robust location estimator, with the best breakdown point. It is, however, not efficient for normal models. The trimmed mean is a compromise between the two extremes. It is more robust than the mean and more efficient than the median for normal models. It also performs quite well for heavy-tailed non-normal symmetric distributions. That is one reason it is used in Olympic rating systems. For some sports in the Olympics, such as diving and gymnastics (Summer Olympics) and, several years ago, figure skating (Winter Olympics), the rating system is based on trimmed means. In Olympic rating, nine different judges from nine different countries give nine scores for each athlete.
They drop the highest and lowest scores and take the average of the remaining seven scores as the final score of each athlete. The Gold Medal in these games is awarded to the contestants with the highest trimmed scores. Many rating systems for competitions in everyday life are also based on trimmed means: the high and low scores are dropped and the rest are averaged. In every Olympic year there are many controversies over the ordinary trimmed mean based scoring system used in competitions, and in fact there are real problems with it. The argument I make in this dissertation is that the problems can be avoided by the trimmed/Winsorized mean based on a scaled deviation; indeed, from a statistical point of view it is superior to the alternatives that have been proposed and used. The ordinary trimmed mean associated with the Olympic rating system has the following shortcomings, which can be avoided by the trimmed mean based on a scaled deviation:

1. The ordinary trimmed mean (OTM) cannot exclude all the outliers when the outliers come from one side instead of both sides. When the outliers are from the lower side, it underestimates the athlete (the center); when the outliers are from the upper side, it overestimates the athlete (the center).

2. Arbitrarily throwing out the high and low marks in Olympic rating is also a remarkably poor solution to the problems of national bias or human error that arise from time to time on judging panels: if you throw away a fixed fraction of the data, you automatically make the assumption that two out of nine judges are necessarily mistaken and the other seven are "correct". This is an unreasonable assumption. In this sense the OTM is mechanical, because it always trims a fixed fraction of data points at both ends of a data set, no matter whether these data points are "good" or "bad".
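As a concrete illustration of the scheme and of its first shortcoming, the Olympic-style trimmed mean can be sketched in a few lines of Python (a hypothetical illustration, not code from the dissertation):

```python
def olympic_trimmed_mean(scores):
    """Ordinary trimmed mean as used in Olympic judging: drop the single
    highest and the single lowest score and average the remaining ones."""
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    trimmed = sorted(scores)[1:-1]   # one score trimmed at each end
    return sum(trimmed) / len(trimmed)

# Nine judges, one outlying low score: the outlier is removed as intended.
print(olympic_trimmed_mean([9.5, 9.4, 9.6, 9.5, 9.3, 9.4, 9.6, 9.5, 5.0]))

# With two low outliers, only one of them is trimmed (the other trimmed
# point is a "good" high score), so the final score is dragged down.
print(olympic_trimmed_mean([9.5] * 7 + [5.0, 5.1]))
```

The second call exhibits shortcoming 1: when both outliers sit on the same side, one of them survives the fixed symmetric trimming and biases the final score.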
These disadvantages of the Olympic rating system motivate us to consider trimmed means based on a scaled deviation. It turns out that trimmed means based on a scaled deviation are flexible and random: they may trim some or no sample points, and they have the power to distinguish outliers no matter which side they come from. Detailed comparisons with leading competitors on various robustness and efficiency aspects reveal that the scaled deviation trimmed means behave very well overall and consequently represent very favorable alternatives to the ordinary trimmed means.

Besides trimming, Winsorizing is another robust method to mitigate the inordinate influence of extreme values. Unlike the trimmed mean, the Winsorized mean replaces the outliers with cutting-point values rather than discarding them. The Winsorized mean based on a scaled deviation has the highest breakdown point and is more efficient than the corresponding trimmed mean when the cutting parameter β (see Section 2.2) is small. But when β is large, the scaled-deviation trimmed mean has almost the same efficiency as the Winsorized mean; that is because when β is large, there is not much information contained in the tails. When "bad" points are present at either end, the Winsorized mean is less efficient than the trimmed mean.

1.2 Scale

A fundamental task in many statistical analyses is to characterize the spread, or variability, of a data set. Measures of scale simply attempt to estimate this variability. When assessing the variability of a data set, there are two key components:

1. How spread out are the data values near the center?
2. How spread out are the tails?

Different numerical summaries give different weight to these two elements, and the choice of scale estimator is often driven by which of these components one wants to emphasize. The histogram is an effective graphical technique for showing both of these components of the spread; however, it is merely descriptive.
There are several common numerical measures of the spread: the variance, standard deviation, average absolute deviation, median absolute deviation, interquartile range, and range. The variance, standard deviation, average absolute deviation, and median absolute deviation measure both aspects of the variability, that is, the variability near the center and the variability in the tails. They differ in that the average absolute deviation and median absolute deviation do not give undue weight to the tail behavior. On the other hand, the range only uses the two most extreme points, and the interquartile range only uses the middle portion of the data.

The standard deviation is an example of an estimator that is the best in terms of efficiency if the underlying distribution is normal. However, it lacks robustness of validity. That is, confidence intervals based on the standard deviation tend to lack precision if the underlying distribution is in fact not normal, and it has the lowest possible explosion breakdown point. The median absolute deviation and the interquartile range are estimates of scale that have robustness of validity. However, the median absolute deviation is not particularly strong for efficiency: the median absolute deviation estimator (MAD) has a low efficiency for normal distributions (36.75%), thereby leading to rather unsatisfactory results for normal models. The interquartile range is not particularly strong for robustness; it cannot reach the highest possible breakdown point. Rousseeuw and Croux (1993) introduced two alternative statistics more efficient than the MAD, defined as

S_n = c med_i { med_j |x_i − x_j| },    Q_n = d {|x_i − x_j|; i < j}_(k),

where c and d are consistency coefficients, k = (h choose 2) ≈ (n choose 2)/4 with h = [n/2] + 1, and (k) denotes the k-th order statistic. S_n has an efficiency of 58.23% for normal distributions, while Q_n attains 82.27%. These are still not fully satisfactory.
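A direct, if naive, O(n^2) rendering of these two estimators may help fix the definitions (illustrative code only; the exact definitions use the low and high medians and finite-sample correction factors, which are simplified to the plain median here):

```python
from statistics import median

def Sn(x, c=1.1926):
    """Rousseeuw-Croux S_n, naive version: c * med_i med_j |x_i - x_j|.
    c = 1.1926 is the usual constant making S_n consistent for the
    standard deviation at normal models."""
    return c * median(median(abs(xi - xj) for xj in x) for xi in x)

def Qn(x, d=2.2219):
    """Rousseeuw-Croux Q_n, naive version: d times the k-th order
    statistic of the n(n-1)/2 pairwise distances |x_i - x_j|, i < j,
    with k = C(h, 2) and h = n//2 + 1."""
    n = len(x)
    h = n // 2 + 1
    k = h * (h - 1) // 2
    diffs = sorted(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    return d * diffs[k - 1]

print(Sn([1, 2, 3, 4, 5, 6, 7]), Qn([1, 2, 3, 4, 5, 6, 7]))
```

Efficient O(n log n) algorithms for both statistics exist; the quadratic version above is only meant to make the formulas concrete.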
In particular, in the situation where contaminating points are present close to the center, MAD, S_n and Q_n are quite inefficient. Motivated by these facts, we introduce scaled deviation trimmed and Winsorized standard deviations. The resulting trimmed (and Winsorized) standard deviations are much more efficient than their predecessors for light-tailed distributions, by suitably choosing the cutting parameter, and highly efficient for heavy-tailed and skewed distributions. At the same time, they share the best breakdown point robustness of the sample median absolute deviation for any common trimming thresholds. Compared with their predecessors, they can achieve the best efficiency when points around the center are contaminated. Indeed, the scaled deviation trimmed (Winsorized) standard deviations behave very well overall and consequently represent very favorable alternatives to other types of scales.

CHAPTER 2

Trimmed and Winsorized means based on a scaled deviation

2.1 Introduction

Tukey trimmed (and Winsorized) means are among the most popular estimators of a location parameter; see, e.g., Stigler (1977). They overcome the extreme sensitivity of the mean while improving the efficiency of the median for light-tailed distributions. Robustness and efficiency are two fundamentally desirable properties of any statistical procedure. They, however, do not work in tandem in general. The trimmed (and Winsorized) means can keep quite a good balance between the two. The Tukey trimming scheme is symmetric in the sense that it trims the same number of sample points at both ends of the data and hence is quite efficient for symmetric distributions. It, however, becomes less efficient when there is even just a slight departure from symmetry, e.g., with one end containing outlying points. Metrical trimming, introduced in Bickel (1965), trims points based on their distance to the center (the median) and hence is more efficient for contaminated symmetric models.
Like the ordinary trimming, it always trims a fixed fraction of sample points, no matter whether those points are "good" or "bad". This raises a concern as to whether there is a trimming scheme that trims only points that are "bad", which motivates us to consider in this chapter the so-called scaled deviation trimmed and Winsorized means. The main idea behind the new trimming scheme is that sample points are trimmed based on the magnitude of their scaled (standardized) deviations from a center (say, the median). Only points with a scaled deviation beyond some fixed threshold are trimmed. This new trimming scheme can lead to the best possible breakdown point (see Section 5.1 for the definition) robustness. The resulting estimators are also highly efficient at light-tailed symmetric models and much more efficient than the Tukey trimmed and Winsorized means at models with a slight departure from symmetry or with heavy tails. Hence they represent favorable alternatives to their predecessors.

The rest of the chapter is organized as follows. Section 2 defines the scaled deviation trimmed and Winsorized means and discusses some primary properties. Section 3 investigates the local robustness, the influence functions, of the estimators. The asymptotic normality of the estimators is established via their asymptotic representations in Section 4. The performance comparison of the estimators with other leading trimmed means with respect to various robustness and efficiency criteria is carried out in Section 5. Concluding remarks in Section 6 end the main body of the chapter. Proofs of main results and auxiliary lemmas are reserved for the Appendix.

2.2 Scaled deviation trimmed and Winsorized means

Let μ(F) and σ(F) be some robust location and scale measures of a distribution F. For simplicity, we take μ and σ to be the median (Med) and the median absolute deviation (MAD) throughout the chapter. Assume σ(F) > 0, namely, F is not degenerate.
For a given point x, we define the scaled deviation (generalized standardized deviation) of x to the center of F by

D(x, F) = (x − μ(F))/σ(F).    (2.2.1)

Now we trim points based on the absolute value of this scaled deviation and define the β scaled-deviation trimmed mean at F as (c.f. Zuo (2003) for a multi-dimensional version)

T^β(F) = ∫ I(|D(x, F)| ≤ β) w(D(x, F)) x dF(x) / ∫ I(|D(x, F)| ≤ β) w(D(x, F)) dF(x),    (2.2.2)

where 0 < β ≤ ∞ and w is an even bounded weight function on [−∞, ∞] such that the denominator is positive. The heuristic idea behind this definition is that one trims points that are far (βσ) away from the center and then weights (not just simply averages) the remaining points based on the robust scaled deviation, with larger weights for points closer to the center. When w is a non-zero constant, T^β becomes the plain average of the points after the trimming. To cover a broader class of trimmed means, we consider general w in our treatment. Note that in the extreme case β = ∞ (w = c ≠ 0), T^β becomes the usual mean.

A concern might be that T^β throws away useful information in the tails. A remedial measure is Winsorization. For the completeness of our discussion, we consider here the β scaled-deviation Winsorized mean at F, defined as

T_w^β(F) = ∫ [x I(|D(x, F)| ≤ β) + L(F) I(x < L(F)) + U(F) I(x > U(F))] w(D(x, F)) dF(x) / ∫ w(D(x, F)) dF(x),    (2.2.3)

where L(F) = μ(F) − βσ(F) and U(F) = μ(F) + βσ(F). In the extreme case β = 0, T_w^β degenerates into the median. For a fixed β, we sometimes suppress β in T^β and T_w^β for convenience.

Since both μ and σ are affine equivariant, i.e., μ(F_{aX+b}) = aμ(F_X) + b and σ(F_{aX+b}) = |a|σ(F_X) for any scalars a and b, where F_X is the distribution of X, it is readily seen that |D(x, F)| is affine invariant and T thus is affine equivariant as well. For X ~ F symmetric about θ (i.e., ±(X − θ) have the same distribution), it is seen that T(F) = θ, i.e., T is Fisher consistent. Without loss of generality, we can assume θ = 0.
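At the sample level, with μ = Med and σ = MAD, the two functionals (2.2.2) and (2.2.3) can be sketched as follows (an illustrative Python rendering under these choices, not the author's implementation; it assumes a non-degenerate sample so that MAD > 0):

```python
from statistics import median

def mad(x):
    """Median absolute deviation about the median."""
    m = median(x)
    return median(abs(xi - m) for xi in x)

def sd_trimmed_mean(x, beta, w=lambda r: 1.0):
    """Sample beta scaled-deviation trimmed mean: keep points whose
    scaled deviation |x_i - Med| / MAD is at most beta, then take a
    w-weighted average of what remains."""
    mu, sigma = median(x), mad(x)
    kept = [(xi, w((xi - mu) / sigma)) for xi in x
            if abs(xi - mu) <= beta * sigma]
    return sum(xi * wi for xi, wi in kept) / sum(wi for _, wi in kept)

def sd_winsorized_mean(x, beta, w=lambda r: 1.0):
    """Sample Winsorized counterpart: points beyond L = Med - beta*MAD
    or U = Med + beta*MAD are pulled back to L or U instead of dropped."""
    mu, sigma = median(x), mad(x)
    L, U = mu - beta * sigma, mu + beta * sigma
    num = den = 0.0
    for xi in x:
        wi = w((xi - mu) / sigma)
        num += min(max(xi, L), U) * wi
        den += wi
    return num / den

x = [0, 1, 2, 3, 4, 100]
print(sd_trimmed_mean(x, 3))     # the outlier 100 is trimmed
print(sd_winsorized_mean(x, 3))  # 100 is pulled back to U
```

Note that with this data the trimmed version drops only the one point whose scaled deviation exceeds β, in contrast to a scheme that always removes a fixed fraction.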
Let F_n be the usual empirical version of F based on a random sample. It is readily seen that T(F_n) is also affine equivariant. It is unbiased for θ if F is symmetric about θ and has an expectation. For T_w(F) and T_w(F_n), all these properties hold.

Two popular trimmed means in the literature are the ordinary trimmed mean (Tukey (1948)) and the metrically trimmed mean (Bickel (1965), Kim (1992)), defined respectively as

T_o^α(F) = (1/(1 − α)) ∫_{F^{-1}(α/2)}^{F^{-1}(1 − α/2)} x dF(x),    T_m^α(F) = (1/(1 − α)) ∫_{μ(F) − ν(F)}^{μ(F) + ν(F)} x dF(x),    (2.2.4)

where F^{-1}(r) is the r-th quantile of F and F(μ(F) + ν(F)) − F(μ(F) − ν(F)) = 1 − α. It is readily seen that these trimmed means are also affine equivariant and consequently Fisher consistent for symmetric F. The two trimming schemes are probability content based. The former, however, trims equally (50α% of points) at each tail. This is not always the case for the latter (though the total fraction trimmed is also 100α%). At the sample level, T_o^α(F_n) trims a fixed (equal) number of sample points at each tail, while T_m^α(F_n) trims sample points at both tails or just one tail, with the same total number of points trimmed as in the former case. For performance evaluation and comparison of T^β and T_w^β in later sections, T_o^α and T_m^α will be used as benchmarks.

Note that the proportion of trimmed points for a fixed β, P(|D(X, F)| > β), in T^β(F) is not fixed but F-dependent. In the sample case, the proportion of sample points trimmed is random; T^β(F_n) may trim some or no sample points, so T^β(F_n) is also called a randomly trimmed mean. The random trimming scheme here, based on the scaled deviation, is nevertheless interconnected with the usual trimming scheme based on probability content. Indeed, in the population case, set β to be the (1 − α)th quantile of the scaled centered variable |X − μ(F)|/σ(F); then T^β is just a regular trimmed mean that trims 100α% of points at the tails for symmetric F.
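This β-α correspondence is easy to check numerically. The following sketch (using Python's standard-library NormalDist; an illustration, not from the dissertation) recovers the normal-model threshold β = 2.4387 quoted for α = 10%:

```python
from statistics import NormalDist

# For F = N(0, 1): mu = 0 and sigma = MAD = Phi^{-1}(0.75), so the
# (1 - alpha)th quantile of |X - mu| / sigma is
# Phi^{-1}(1 - alpha/2) / Phi^{-1}(0.75).
Z = NormalDist()
alpha = 0.10
beta = Z.inv_cdf(1 - alpha / 2) / Z.inv_cdf(0.75)
print(round(beta, 4))   # 2.4387
```

The same recipe with the Cauchy quantile function in place of Phi^{-1} yields the heavier-tailed threshold for Cauchy F.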
For example, if one wants to trim α = 10% of points at the tails, then simply set β = Φ^{-1}(0.95)/Φ^{-1}(0.75) = 2.4387 for normal F and β = 6.3138 for Cauchy F. A large β corresponds to a small α and consequently favors the efficiency of T^β (and T_w^β) at light-tailed F (see Sections 5.3 and 5.4).

2.3 Influence function

We first investigate the local robustness of the functionals T^β(F) and T_w^β(F). Here F is the assumed distribution. The actual distribution, however, may be (slightly) different from F. A simple departure from F may be due to a point mass contamination of F that results in the distribution F(ε, δ_x) = (1 − ε)F + εδ_x, where δ_x is the point mass probability distribution at a fixed point x ∈ R. It is hoped that the effect of the slight deviation from F on the underlying functional is small relative to ε. The influence function (IF) of a statistical functional M at a given point x ∈ R for a given F, defined as (see Hampel et al. (1986))

IF(x; M(F)) = lim_{ε→0+} (M(F(ε, δ_x)) − M(F))/ε,    (2.3.1)

exactly measures the relative effect (influence) of an infinitesimal point mass contamination on M. It is desirable that this relative influence IF(x; M(F)) be bounded. This indeed is the case for T_o^α(F) (see, e.g., Serfling (1980)), T_m^α(F) (Kim (1992)), T^β (Theorem 2.3.1) and T_w^β (Theorem 2.3.3), but not for the mean functional, which has x − E(X) as its influence function for a r.v. X ~ F.

The integrands in T^β(F) (T_w^β(F)) are complicated functions of F, and the derivation of the influence functions thus is a bit involved. We first work out the influence functions of L and U. Assume F' = f exists at μ and μ ± σ with f(μ) and f(μ + σ) + f(μ − σ) positive, where μ and σ stand for μ(F) and σ(F). Invoking the chain rule, we have the following preliminary results:

IF(x; L(F)) = IF(x; μ(F)) − β IF(x; σ(F)),    (2.3.2)
IF(x; U(F)) = IF(x; μ(F)) + β IF(x; σ(F)),    (2.3.3)
IF(x; D(y, F)) = −(D(y, F) IF(x; σ(F)) + IF(x; μ(F)))/σ =: h(x, y),    (2.3.4)
IF(x; μ(F)) = sign(x − μ)/(2f(μ)).
(2.3.5)

IF(x; σ(F)) = [sign(|x − μ| − σ) − 2 IF(x; μ(F))(f(μ + σ) − f(μ − σ))] / [2(f(μ + σ) + f(μ − σ))].    (2.3.6)

Now assume that w is differentiable and f exists at L(F) and U(F). Write L and U for L(F) and U(F) respectively and δ for ∫ I(|D(x, F)| ≤ β) w(D(x, F)) dF(x), and define

e_1(x) = (1/δ)[(U − T) w(β) f(U) IF(x; U(F)) − (L − T) w(−β) f(L) IF(x; L(F))],    (2.3.7)
e_2(x) = (1/δ)[∫_L^U (y − T) w^(1)(D(y, F)) IF(x; D(y, F)) dF(y)],    (2.3.8)
e_3(x) = (1/δ)[I(x ∈ [L, U])(x − T) w(D(x, F))].    (2.3.9)

We then have the influence function of the scaled deviation trimmed mean T^β(F) as follows.

Theorem 2.3.1. Assume that F' = f exists at μ, μ ± σ, L(F), and U(F) with f(μ) and f(μ + σ) + f(μ − σ) positive and is continuous in small neighborhoods of L(F) and U(F), and that w(·) is continuously differentiable. Then for a given 0 < β < ∞,

IF(x; T^β(F)) = e_1(x) + e_2(x) + e_3(x).    (2.3.10)

The proof is given in Section 5.1, Chapter 5. Under the conditions of Theorem 2.3.1, IF(x; T^β(F)) clearly is bounded and consequently T^β is locally robust. For symmetric F and w = c ≠ 0, the influence function simplifies substantially.

Corollary 2.3.2. Let X ~ F be symmetric about the origin and w a non-zero constant. Under the conditions of Theorem 2.3.1, we have

IF(x; T^β(F)) = x I(x ∈ [−βσ, βσ])/(2F(βσ) − 1) + βσ f(βσ) sign(x)/(f(0)(2F(βσ) − 1)).    (2.3.11)

A graph of this influence function is given in Figure 2.1. The boundedness is clearly revealed.

To work out the influence function for T_w^β(F), we write δ_1 for ∫ w(D(x, F)) dF(x) and define

e_w1(x) = (1/δ_1) ∫ [y I(L ≤ y ≤ U) + L I(y < L) + U I(y > U) − T_w] w^(1)(D(y, F)) h(x, y) dF(y),    (2.3.12)
e_w2(x) = (1/δ_1) ∫ [IF(x; L) I(y < L) + IF(x; U) I(y > U)] w(D(y, F)) dF(y),    (2.3.13)
e_w3(x) = (1/δ_1)[x I(L ≤ x ≤ U) + L I(x < L) + U I(x > U) − T_w] w(D(x, F)).    (2.3.14)

We then have the influence function of the scaled deviation Winsorized mean T_w^β as follows.

Theorem 2.3.3.
Assume that F' = f exists at μ, μ ± σ, L(F), and U(F) with f(μ) and f(μ + σ) + f(μ − σ) positive and is continuous in small neighborhoods of L(F) and U(F), and that w is continuously differentiable with r w^(1)(r) bounded for r ∈ R. Then for a given 0 < β < ∞,

IF(x; T_w(F)) = e_w1(x) + e_w2(x) + e_w3(x).    (2.3.15)

Under the conditions of Theorem 2.3.3, IF(x; T_w(F)) is readily seen to be bounded, and T_w^β thus is locally robust. For symmetric F and constant w, the influence function simplifies greatly.

Figure 2.1. Influence function of T^β for N(0, 1) with β = 2 and a constant weight.

Figure 2.2. Influence functions of T_w^β for N(0, 1) with β = 2 and a constant weight.

Corollary 2.3.4. Let X ~ F be symmetric about the origin and w a non-zero constant. Under the conditions of Theorem 2.3.3, we have

IF(x; T_w(F)) = sign(x) F(−βσ)/f(0) + x I(−βσ ≤ x ≤ βσ) − βσ I(x < −βσ) + βσ I(x > βσ).    (2.3.16)

The boundedness of this influence function is very clear and is also shown in Figure 2.2. In addition to being local robustness measures, the influence functions in this section are useful for establishing the limiting distributions of T^β(F_n) and T_w^β(F_n) in the next section.

2.4 Asymptotic representation and limiting distribution

Establishing the limiting distribution of the scaled deviation trimmed and Winsorized means turns out to be a quite challenging task. One possible approach is to establish first the Hadamard differentiability of the functional involved under the supremum norm and then to employ the influence function results. This is exactly what is done for the metrically trimmed mean in Kim (1992). The treatment (proof) there, however, is not quite rigorous. Here we combine an empirical process theory argument with the influence function results obtained in the last section to fulfil the task. Asymptotic representations of the estimators are established first.

Theorem 2.4.1.
Let F' = f exist at μ and be continuous in small neighborhoods of μ ± σ, L and U with f(μ) and f(μ − σ) + f(μ + σ) positive. Let w^(1) be continuous on R. Then for 0 < β < ∞,

T^β(F_n) − T^β(F) = (1/n) Σ_{i=1}^n IF(X_i; T^β(F)) + o_p(n^{−1/2}),

and consequently √n (T^β(F_n) − T^β(F)) →_d N(0, E[IF(X; T^β(F))]^2). In particular, for F symmetric about 0 and a constant weight w, the asymptotic variance reduces, in view of Corollary 2.3.2, to

(2F(βσ) − 1)^{−2} [ ∫_{−βσ}^{βσ} (λ|x| + x^2) dF(x) + λ^2/4 ],    λ = 2βσ f(βσ)/f(0).

With the results obtained in this and the last sections, we are in a position to evaluate the performance of the scaled deviation trimmed and Winsorized means T^β and T_w^β.

2.5 Performance comparison

We now compare the performance of the scaled deviation trimmed and Winsorized means with the trimmed means in (2.2.4), the mean, and the median with respect to robustness (breakdown point and influence function) as well as efficiency (asymptotic and finite sample) criteria.

2.5.1 Breakdown point

The finite sample breakdown point, a notion introduced by Donoho and Huber (1983), is the most popular measure of the global robustness of an estimator. Roughly speaking, the breakdown point of a location estimator is the minimum fraction of "bad" (or contaminated) data points in a data set that can render the estimator beyond any bound. More precisely, the finite sample breakdown point of a location estimator T at a random sample X^n = {X_1, ..., X_n} is defined as

BP(T, X^n) = min{ m/n : sup_{X_m^n} |T(X_m^n) − T(X^n)| = ∞ },    (2.5.1)

where X_m^n denotes the contaminated data resulting from replacing m points of X^n with arbitrary m points. The asymptotic breakdown point (ABP) of T is defined as lim_{n→∞} BP(T, X^n). Since one bad point can ruin the sample mean X̄_n, the breakdown point of X̄_n thus is 1/n, the lowest possible value. On the other hand, to break down the sample median, 50% of the original points must be contaminated (moved to ∞). Thus the sample median has a breakdown point ⌊(n + 1)/2⌋/n, the best among all affine equivariant location estimators.
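Definition (2.5.1) can be illustrated empirically with a toy check (hypothetical code, not from the dissertation): push m sample points to an arbitrarily large value and watch when each estimator leaves any bound.

```python
from statistics import mean, median

def contaminate(x, m, bad=1e12):
    """Replace m points of the sample with an arbitrarily bad value
    (moving points to +infinity is the worst case for the median)."""
    return [bad] * m + x[m:]

x = list(range(1, 10))            # n = 9 "good" points

# A single bad point already ruins the sample mean ...
print(mean(contaminate(x, 1)))    # about 1.1e11

# ... while the median withstands floor((n-1)/2) = 4 replacements
print(median(contaminate(x, 4)))  # 9: still determined by good points
print(median(contaminate(x, 5)))  # 1e12: breaks at m = floor((n+1)/2)
```

The smallest breaking m/n values observed this way, 1/9 for the mean and 5/9 for the median, agree with the breakdown points 1/n and ⌊(n + 1)/2⌋/n stated above.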
It is readily seen that the regular α-trimmed mean in (2.2.4) has breakdown point (⌊αn/2⌋ + 1)/n, whereas the α metrically trimmed mean in (2.2.4) has breakdown point (⌊αn⌋ + 1)/n if 0 ≤ α ≤ 1/2 − 3/(2n), and ⌊(n + 1)/2⌋/n otherwise. The breakdown point of the scaled-deviation trimmed mean is shown (see Zuo (2003)) to be the same as that of the median, as long as w(r) is defined on [0, β] and 1 ≤ β < ∞. Likewise, one can show that the scaled-deviation Winsorized mean has the same breakdown point.

Table 2.1. Breakdown points of mean, trimmed (Winsorized) means, and median

            X̄_n    T^α              T_m^α                              T^β             T_w^β           Med
    BP      1/n    (⌊αn/2⌋+1)/n     ((⌊αn⌋+1)/n) ∧ (⌊(n+1)/2⌋/n)       ⌊(n+1)/2⌋/n     ⌊(n+1)/2⌋/n     ⌊(n+1)/2⌋/n
    ABP     0      α/2              α ∧ 1/2                            1/2             1/2             1/2

The regular trimmed mean T^α thus has the lowest breakdown point among the three trimmed means. The metrically trimmed mean T_m^α has a higher breakdown point (twice as high as that of the regular one) when α ≤ 1/2 − 3/(2n) and can attain the best breakdown value if α is higher. The scaled-deviation trimmed mean T^β always has the best breakdown value as long as 1 ≤ β < ∞. The difference in the breakdown points of the trimmed means is due to the difference in trimming schemes. The regular and metrically trimmed means always trim a fixed 100α% of the points, the former trimming based on the rank of X_i and the latter based on the rank of |X_i − μ(X^n)|. The scaled-deviation trimming is based on the value of |X_i − μ(X^n)| and it trims only points with "large" deviations. The breakdown points of the trimmed means are listed in Table 2.1 (0 ≤ α < 1, 1 ≤ β < ∞); the asymptotic ones are shown in Figure 2.3. It is noteworthy that scaled-deviation trimming (or Winsorizing) can lead to the best breakdown robustness, while metric trimming gains breakdown robustness over ordinary trimming. All trimming schemes improve the breakdown robustness of the sample mean.
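The "trims only when there are bad points" behavior can be seen directly in a sample version of the estimator. A minimal sketch with a constant weight, standardizing by the sample median and MAD as in the text (data values are arbitrary):

```python
import statistics

def scaled_dev_trimmed_mean(x, beta):
    """beta scaled-deviation trimmed mean with a constant weight:
    average the points whose scaled deviation |x_i - Med| / MAD is at most beta."""
    med = statistics.median(x)
    mad = statistics.median(abs(v - med) for v in x)
    kept = [v for v in x if abs(v - med) <= beta * mad]
    return sum(kept) / len(kept)

clean = [-1.8, -1.1, -0.6, -0.3, -0.1, 0.2, 0.4, 0.7, 1.2, 1.9]
dirty = clean[:7] + [500.0, 800.0, 1200.0]   # 30% of points replaced by outliers

# With clean data nothing exceeds the cutoff, so the estimate is the plain mean.
assert abs(scaled_dev_trimmed_mean(clean, beta=3) - statistics.mean(clean)) < 1e-12

# With 30% gross outliers all three are trimmed and the estimate stays at
# the average of the seven good points.
good_avg = sum(clean[:7]) / 7
assert abs(scaled_dev_trimmed_mean(dirty, beta=3) - good_avg) < 1e-12
```

A fixed-proportion trimmed mean with a small α would have kept most of these outliers, which is the breakdown difference discussed above.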
2.5.2 Influence function and gross error sensitivity

The breakdown point measures only the global robustness, while the influence function captures the local robustness of an estimator. The two together provide a more complete picture of robustness. We now look at the influence functions of the trimmed (and Winsorized) means. The boundedness of the influence function is the fundamental concern for a functional being locally robust. The mean functional has an unbounded influence function. The ordinarily and metrically trimmed means are known to have bounded influence functions; see, e.g., Serfling (1980) and Kim (1992).

Figure 2.3. Asymptotic breakdown points of trimmed means: ABP(α, T^α), ABP(α, T_m^α), and ABP(β, T^β).

Figure 2.4. Gross error sensitivities of trimmed means at Φ: GES(α, T^α(Φ)), GES(α, T_m^α(Φ)) (= GES(α, T^β(Φ))), and GES(α, T_w^β(Φ)).

In the light of Theorems 2.3.1 and 2.3.3, T^β and T_w^β have bounded influence functions for suitable w and β. Figures 2.5 and 2.6, which plot their influence functions at the normal and t (with 3 degrees of freedom) models with α = 0.1, confirm this. Here we set β = β(F, α) so that 100α% of the points are trimmed in each of the trimming cases (see Section 2.2). For convenience, we also set w = c ≠ 0 in T^β and T_w^β. Note that T_m^α and T^β and their influence functions are the same under this setting. Indeed, the influence function in Theorem 2.3.1 becomes

    IF(x; T^β(F)) = [ U f(U) IF(x; U(F)) − L f(L) IF(x; L(F)) + x I(x ∈ [L, U]) − (1 − α) T^β ] / (1 − α),    (2.5.2)

where, with λ = β(F)σ(F),

    IF(x; U(F)) = IF(x; μ(F)) + IF(x; λ(F)),
    IF(x; L(F)) = IF(x; μ(F)) − IF(x; λ(F)),
    IF(x; λ(F)) = [ (1 − α) − I(μ − λ ≤ x ≤ μ + λ) − IF(x; μ(F)) ( f(μ + λ) − f(μ − λ) ) ] / ( f(μ + λ) + f(μ − λ) ).

Thus IF(x; T^β(F)) is the same as that of T_m^α in Kim (1992). Since a pure normal model is rare in practice, we thus consider contaminated normal models.
With the same α and β = β(F, α) as above, the influence functions of T_m^α and T^β in the contaminated normal models, plotted in Figures 2.7 and 2.8, become different (but all remain bounded). In terms of the bounded influence function criterion, we conclude that all the trimmed and Winsorized means are equally robust (locally). Besides boundedness, one can also look at the magnitude of the supremum of |IF(x; T(F))|, the so-called gross error sensitivity (GES) of T at F (Hampel et al. (1986)),

    GES(T(F)) = sup_x |IF(x; T(F))|,    (2.5.3)

which measures the worst-case effect on T of an infinitesimal point-mass contamination. Generally speaking, a smaller GES is more desirable. For T^β and T_w^β it is readily seen that their GES depends on the values of β (or α if β = β(F, α)) and the weight function w.

Figure 2.5. Influence functions of the trimmed and Winsorized means at the normal model with α = 0.1.

Figure 2.6. Influence functions of the trimmed and Winsorized means for t(3) with α = 0.1.

Figure 2.7. Influence functions of the trimmed and Winsorized means for 0.9N(0, 1) + 0.1N(4, 9) with α = 0.1.

Figure 2.8. Influence functions of the trimmed and Winsorized means for 0.9N(0, 1) + 0.1N(4, 0.5) with α = 0.1 (curves: OTM, Med, MTM, T^β, T_w^β).

Table 2.2. GESs of mean, trimmed (Winsorized) means, and median at symmetric F

    Mean    T^α                   T_m^α                        T^β                          T_w^β                      Med
    +∞      F⁻¹(1−α/2)/(1−α)      F⁻¹(1−α/2)C(F,α)/(1−α)       F⁻¹(1−α/2)C(F,α)/(1−α)       F⁻¹(1−α/2) + α/(2f(0))     1/(2f(0))

Table 2.3. GESs of mean, trimmed (Winsorized) means, and median at asymmetric F

                                     Mean    T^α      T_m^α    T^β      T_w^β    Med
    F = .9N(0,1) + .1N(4,9)    GES   +∞      4.2419   2.8563   2.7781   2.4422   1.3787
    F = .9N(0,1) + .1N(4,.5)   GES   +∞      3.9989   4.6025   3.5290   3.0995   1.4062

As a possible way to make a comparison, we again set β = β(F, α) (and w = c ≠ 0) in the following discussion. First consider the case that F is symmetric about the origin and meets the conditions in Corollary 2.3.2.
Define C(F, α) = 1 + f(F⁻¹(1 − α/2))/f(0). The GESs of the trimmed (and Winsorized) means are listed in Table 2.2 for general F and illustrated in Figure 2.4 as functions of α at F = Φ, the most interesting and common normal distribution used in practice. It can be shown that, for any 0 < α < 1, the GESs in Table 2.2 become increasingly small, with the median having the smallest one, provided F′ = f exists and is unimodal. This is also confirmed in Figure 2.4 for F = Φ. On the other hand, it is noted from Figures 2.5 and 2.6 that the influence function of T^α at symmetric F has larger absolute values than those of T_m^α, T^β and T_w^β at most values of x ∈ R, a result much favorable to the efficiency of the latter three (see Section 2.5.3).

Again, in practice data follow more often than not an asymmetric model. It is therefore sensible to consider the GESs of T^β and T_w^β at F that deviates slightly from a symmetric model. Table 2.3 lists the GES results of the trimmed and Winsorized means at such models with α = 0.1 and β = β(Φ, 0.1). The relationship between the GESs of the trimmed (and Winsorized) means for symmetric F is altered under just slight contamination. Table 2.3 indicates that T^β can have the smallest GES among the three trimmed means at both contaminated models, whereas both T^α and T_m^α can have the largest. On the other hand, T_w^β has a smaller GES than the three trimmed means, while the median enjoys the smallest GES under slight deviations from symmetry. The GES advantage of T^β and T_w^β over the two competitors is due to their unique trimming mechanism. With β = β(Φ, 0.1), T^β (T_w^β) trims (Winsorizes) only the "bad" points that have large scaled deviations, while T^α and T_m^α always trim a fixed 10% of the points even if they are "good" points.

2.5.3 Large sample relative efficiency

Now we evaluate the performance of the trimmed (and Winsorized) means in terms of their efficiency behavior (relative to the sample mean). First we examine the asymptotic relative efficiency (ARE).
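The ARE at a symmetric model is obtained by integrating the squared influence function. As a numerical cross-check, a sketch for the Winsorized mean at N(0, 1), assuming a constant weight and μ = Med, σ = MAD as in the text (the helper names are ours); the closed form below follows from Corollary 2.3.4:

```python
from math import erf, exp, pi, sqrt

PHI0 = 1 / sqrt(2 * pi)          # standard normal density at 0
MAD_N = 0.6744897501960817       # MAD of N(0,1) = Phi^{-1}(3/4)

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def are_winsorized(beta):
    """ARE of the scaled-deviation Winsorized mean vs. the mean at N(0,1),
    from E[IF^2] with IF(x) = sign(x) F(-beta*sigma)/f(0) + x_w,
    where x_w is x Winsorized at +/- beta*sigma (Corollary 2.3.4)."""
    a = beta * MAD_N                 # cutoff beta * sigma
    q = Phi(-a)                      # F(-beta*sigma)
    c = q / PHI0
    e_abs = 2 * (PHI0 - phi(a)) + 2 * q * a              # E|x_w|
    e_sq = (2 * Phi(a) - 1) - 2 * a * phi(a) + 2 * q * a * a   # E[x_w^2]
    av = c * c + 2 * c * e_abs + e_sq   # asymptotic variance E[IF^2]
    return 1 / av                       # the mean has asymptotic variance 1

# reproduces the beta = 1 and beta = 2 Winsorized-mean entries of Table 2.4
assert abs(are_winsorized(1) - 0.7589) < 1e-3
assert abs(are_winsorized(2) - 0.9262) < 1e-3
```

That the computed values agree with Table 2.4 to four decimals supports the influence function derivation above.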
Table 2.4 lists the ARE results of T^β and T_w^β at a number of light- and heavy-tailed symmetric distributions for various β values. Here we again set w = c > 0. The table reveals that (i) at the normal model with large β both T^β and T_w^β can be highly efficient relative to the mean; (ii) their efficiency increases as the tail of the distribution becomes heavier and exceeds 100% at heavy-tailed distributions; and (iii) T_w^β is more efficient than T^β for small β or at the normal distribution, but as β gets larger and the tail heavier, T^β becomes the more efficient one.

Table 2.4. AREs of T^β and T_w^β relative to the mean

    β                  1        2        3        4        5        6        7
    N(0,1)   T_w^β     0.7589   0.9262   0.9882   0.9987   0.9999   1.0000   1.0000
             T^β       0.4678   0.5630   0.7762   0.9377   0.9901   0.9991   0.9999
    LG(0,1)  T_w^β     0.9644   1.0852   1.0754   1.0410   1.0190   1.0081   1.0033
             T^β       0.6346   0.7640   0.9231   1.0004   1.0167   1.0133   1.0077
    DE(0,1)  T_w^β     1.8924   1.6073   1.3656   1.2125   1.1221   1.0696   1.0394
             T^β       1.7060   1.5493   1.4218   1.3090   1.2159   1.1452   1.0948
    t3       T_w^β     1.8649   1.9334   1.7709   1.6059   1.4844   1.3982   1.3358
             T^β       1.3202   1.5952   1.7613   1.7500   1.6700   1.5826   1.5061

To compare the efficiency behavior of T^β (T_w^β) with that of T^α and T_m^α, we again face the issue of the choice of the values of α and β. One possible choice is again β = β(F, α) as above. Such a choice is somewhat in favor of T^α and T_m^α since for very small α values they become quite efficient at the normal and other light-tailed distributions, whereas their breakdown points become very low at those values while those of T^β and T_w^β are always the best. With the choice β = β(F, α) and α = 0.01 (and w = c > 0), the AREs of the trimmed and Winsorized means and the median are listed in Table 2.5. Note again that T_m^α = T^β under this setting for symmetric F. Examining Table 2.5 reveals that in terms of efficiency: (i) T_m^α performs better than T^α and T_w^β at DE, t3 and t4 (with T^α worst among the five) and best at t with degrees of freedom (df) 4 (to 7), and (ii) T^α performs best at the normal or very light-tailed F's such as

Table 2.5.
AREs of trimmed and Winsorized means and the median with α = 0.01 and β = β(F, α)

              T^α      T_m^α    T^β      T_w^β    Med
    N(0,1)    0.9982   0.9179   0.9179   0.9981   0.6366
    LG(0,1)   1.0192   1.0161   1.0161   1.0220   0.8225
    DE(0,1)   1.0383   1.1107   1.1107   1.0483   2.0000
    t3        1.2953   1.4645   1.4645   1.3047   1.6211
    t4        1.1168   1.2011   1.2011   1.1226   1.1250

the logistic (or t with df ≥ 10), while the median is best at DE(0, 1) or very heavy-tailed F's such as t with df ≤ 4.

The results in Tables 2.4 and 2.5 are asymptotic, and the F's there are symmetric. This raises the concern as to whether these results remain valid in finite-sample practice and for F's with a slight departure from symmetry. We answer this question via finite-sample simulation studies.

2.5.4 Finite sample relative efficiency

We now conduct Monte Carlo studies to investigate the efficiency behavior of T^β and T_w^β at finite samples for normal and contaminated normal models. Here θ = 0 is regarded as the target parameter to be estimated. For an estimator T the empirical mean squared error (EMSE) is

    EMSE = (1/m) Σ_{j=1}^{m} |T_j − θ|²,

where m is the number of samples generated and T_j is the estimate based on the jth sample. The relative efficiency (RE) of T is then obtained by dividing the EMSE of the sample mean by that of T. We generated m = 50,000 samples from (1 − ε)N(0, 1) + εN(4, 9) with ε = 0, 0.1 and 0.2 for different sample sizes n.

Table 2.6. REs of trimmed and Winsorized means with β = 7

                   ε = 0%                 ε = 10%                ε = 20%
    n              T^β   T_w^β  Mean      T^β   T_w^β  Mean      T^β   T_w^β  Mean
    20    EMSE     0.05  0.05   0.05      0.11  0.18   0.25      0.34  0.60   0.77
          RE       0.99  1.00   1.00      2.31  1.40   1.00      2.30  1.29   1.00
    40    EMSE     0.03  0.03   0.03      0.07  0.15   0.20      0.28  0.56   0.70
          RE       1.00  1.00   1.00      2.99  1.40   1.00      2.51  1.26   1.00
    60    EMSE     0.02  0.02   0.02      0.06  0.14   0.19      0.26  0.55   0.68
          RE       1.00  1.00   1.00      3.41  1.40   1.00      2.61  1.24   1.00
    80    EMSE     0.01  0.01   0.01      0.05  0.13   0.18      0.25  0.54   0.67
          RE       1.00  1.00   1.00      3.70  1.40   1.00      2.67  1.24   1.00
    100   EMSE     0.01  0.01   0.01      0.05  0.13   0.18      0.24  0.54   0.66
          RE       1.00  1.00   1.00      3.95  1.40   1.00      2.73  1.24   1.00
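The Monte Carlo scheme just described can be sketched as follows, a minimal version with a constant weight, Med/MAD standardization, and far fewer replications than the m = 50,000 used in the text:

```python
import random
import statistics

def trimmed_mean(x, beta=7.0):
    # beta scaled-deviation trimmed mean with a constant weight,
    # standardized by the sample median and MAD as in the text
    med = statistics.median(x)
    mad = statistics.median(abs(v - med) for v in x)
    kept = [v for v in x if abs(v - med) <= beta * mad]
    return sum(kept) / len(kept)

def emse(estimator, n, eps, m, rng):
    # empirical mean squared error about the target theta = 0 under
    # the contaminated normal (1 - eps) N(0,1) + eps N(4, 3^2)
    total = 0.0
    for _ in range(m):
        x = [rng.gauss(4, 3) if rng.random() < eps else rng.gauss(0, 1)
             for _ in range(n)]
        total += estimator(x) ** 2
    return total / m

rng = random.Random(0)
m, n = 2000, 40
re_clean = emse(statistics.mean, n, 0.0, m, rng) / emse(trimmed_mean, n, 0.0, m, rng)
re_dirty = emse(statistics.mean, n, 0.1, m, rng) / emse(trimmed_mean, n, 0.1, m, rng)
assert 0.7 < re_clean < 1.3   # no contamination: T^beta nearly as efficient as the mean
assert re_dirty > 1.5         # 10% contamination: T^beta clearly beats the mean
```

With the table's settings (n = 40, β = 7, ε = 10%) the RE comes out near 3, in line with the 2.99 reported in Table 2.6; the loose assertion bounds only allow for the Monte Carlo noise of this small m.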
Some results are listed in Table 2.6 with β = 7. The RE results at ε = 0 confirm at finite samples the validity of the asymptotic results in Table 2.4 with β = 7. When there is just 10% or 20% contamination, both T^β and T_w^β become overwhelmingly more efficient than the mean, with T^β substantially more efficient than T_w^β.

To compare the efficiency of the trimmed and Winsorized means, we now set β = β(F, α) again. Note that at this setting T^α and T_m^α have very low breakdown points for small α, whereas T^β and T_w^β always enjoy the best breakdown point. We consider F = Φ and set α = 0.01. Table 2.7 lists the relative efficiency results at (1 − ε)N(0, 1) + εN(4, 9) for ε = 0.1 and 0.2. Note that we set n = 200 or larger so that T^α trims at least one sample point at each end of the data.

Table 2.7. REs of trimmed and Winsorized means with α = 0.01 and β = β(Φ, α)

            ε = 10%                                  ε = 20%
    n       T^α     T_m^α    T^β     T_w^β   Med     T^α     T_m^α    T^β     T_w^β   Med
    200     1.158   1.579    19.70   2.996   7.988   1.075   1.269    21.79   2.384   9.015
    400     1.167   1.615    30.35   3.059   9.595   1.076   1.278    25.71   2.394   9.566
    600     1.171   1.630    38.34   3.102   10.48   1.079   1.281    27.33   2.397   9.780
    800     1.172   1.636    43.39   3.114   10.90   1.079   1.282    28.39   2.403   9.914
    1000    1.173   1.640    47.42   3.125   11.24   1.079   1.283    28.96   2.403   9.969

Our simulation results for ε = 0 (not listed in the table) confirm the validity of the asymptotic ones in Table 2.5 for N(0, 1). On the other hand, when there is just 10% or 20% contamination in the distribution, all the estimators become more efficient than the sample mean, with T^α, T_m^α, T_w^β, the median, and T^β being increasingly more efficient, reflecting the robustness of these estimators. It is remarkable that T^β is overwhelmingly more efficient than all the other estimators.

2.6 Remarks

Unlike the mean, all the trimmed and Winsorized means discussed in this chapter have bounded influence functions for suitable distributions and weight functions and hence are locally robust.
In terms of global robustness, the scaled-deviation trimmed and Winsorized means T^β and T_w^β are exceptional in the sense that they can enjoy the best possible breakdown point robustness for any β ≥ 1, whereas the ordinary and metrically trimmed means T^α and T_m^α have much lower breakdown points for typical choices of α, and the mean has the worst. Relative to the mean, T^β and T_w^β are highly efficient for large β's at light-tailed symmetric distributions and much more efficient at heavy-tailed ones. When β is set to be β(F, α) so that 100α% of points are trimmed, T^β and T_w^β are less efficient than T^α at light-tailed symmetric distributions but become much more efficient at heavy-tailed or contaminated symmetric ones. The latter models seem more common in practice than the light-tailed symmetric ones.

The advantages of T^β and T_w^β over T^α and T_m^α in robustness and efficiency are due to the difference in the trimming schemes. The latter always trim a fixed fraction of sample points, no matter whether they are "good" or "bad", whereas the former trim only when there are "bad" points.

A very legitimate practical concern for T^β and T_w^β is the choice of the β value. In light of our simulation studies, we recommend a β value between 4 and 7, so that T^β and T_w^β can be very efficient at both light- and heavy-tailed distributions. Instead of a fixed value one might also adopt an adaptive, data-driven approach to determine an appropriate β value. For a given data set, one determines a value for β based on the heaviness of the tail. Generally speaking, a large value of β is selected for a light-tailed data set and a smaller value for a heavy-tailed one. The basic idea of adaptive trimming exists in the literature; see, e.g., Jaeckel (1971), Hogg (1974), and Jureckova et al. (1994). Furthermore, a random trimming idea also appeared in Shorack (1974), though the trimming proportion there is (asymptotically) fixed.
Consequently the trimmed and Winsorized means in Shorack (1974) differ in essence from the ones in this chapter. We note that T^β(F_n) in (2.2.2) has some connection with (though it is very different from) the (scaled version of the) Huber-type skipped mean (see, e.g., Hampel et al. (1986)), the solution T_n of

    Σ_i X_i I(−β ≤ (X_i − T_n)/σ_n ≤ β) / Σ_i I(−β ≤ (X_i − T_n)/σ_n ≤ β) = T_n.

We remark that a general multi-dimensional version of (2.2.2) (but not (2.2.3)) has been thoroughly studied in Zuo (2003). Here we focus on the performance evaluation of the specific one-dimensional version T^β and provide specific and concrete results for the influence function and limiting distribution as well. A multidimensional version of (2.2.3) is yet to be studied.

CHAPTER 3

Trimmed and Winsorized standard deviations based on a scaled deviation

3.1 Introduction

Scale is an important parameter of interest which tells us how spread out the data are. Although there are many robust location estimators, robust scale estimators are far fewer. Despite this fact, finding robust scale statistics with a high level of efficiency remains the goal of many statisticians. Two types of trimmed standard deviations were introduced and discussed by Welsh and Morrison (1990). Unfortunately, these estimators cannot reach the highest breakdown point while keeping satisfactory efficiency, so this chapter will not discuss these versions of trimming. Another attempt was made by Rousseeuw and Croux (1993), in which two types of high breakdown estimators are introduced. Those estimators are built through recursive medians. Setting aside computation time, their efficiency is not very good at light-tailed distributions, though they are better than the widely used robust estimator, the median absolute deviation (MAD). And these estimators perform poorly when points in the neighborhood of the center are contaminated, because they only use the middle half of the data points or "extended" data points.
These situations motivate us to consider in this chapter the so-called scaled-deviation trimmed and Winsorized standard deviations. These high breakdown scale estimators enjoy the highest breakdown point and bounded influence functions for a variety of distributions. They are also much more efficient at light-tailed symmetric models than their predecessors and highly efficient for heavy-tailed or skewed distributions. They also have the best performance among high breakdown estimators when points somewhere around the center are contaminated. Hence they represent favorable alternatives to their predecessors.

Section 3.2 introduces the scaled-deviation trimmed/Winsorized standard deviations; Section 3.3 is devoted to the study of local robustness; asymptotic representation and asymptotic normality are treated in Section 3.4; comparisons of influence functions and gross error sensitivities of various high breakdown scale estimators are undertaken in Section 3.5. Proofs of the main results and auxiliary lemmas are reserved for Chapter 5.

3.2 Scaled-deviation trimmed and Winsorized standard deviation

Let μ(F) and σ(F) be some robust location and scale measures of a distribution F. For simplicity, we take μ and σ to be the median (Med) and the median absolute deviation (MAD) throughout the chapter. Assume σ(F) > 0, namely, F is not degenerate. For a given point x, we define the scaled deviation (generalized standardized deviation) of x from the center of F by

    D(x, F) = (x − μ(F))/σ(F).
(3.2.1)

Now one trims points based on the absolute value of this scaled deviation and defines the β scaled-deviation trimmed variance functional as

    S²(F) = c_t ∫ I(|D(x, F)| ≤ β) w₂(D(x, F)) (x − T₁(F))² dF(x) / ∫ I(|D(x, F)| ≤ β) w₂(D(x, F)) dF(x),    (3.2.2)

where c_t is the consistency coefficient and T₁(F) is the β scaled-deviation trimmed mean, defined through

    T_i(F) = ∫ I(|D(x, F)| ≤ β) w_i(D(x, F)) x dF(x) / ∫ I(|D(x, F)| ≤ β) w_i(D(x, F)) dF(x),  i = 1, 2,    (3.2.3)

where 0 < β ≤ ∞ and w_i (i = 1, 2) is an even bounded weight function on [−∞, ∞] such that the denominator is positive. The heuristic idea behind this location definition is that one trims points that are a robust distance (βσ) away from the robust center μ, and then obtains a robust and efficient location estimator by weighting (or simply averaging) the remaining points, which integrates the robustness of σ and μ with the efficiency of the mean. When w_i (i = 1, 2) is a non-zero constant, T_i (i = 1, 2) and S² are the plain average and plain variance of the points left after trimming. We consider general w_i (i = 1, 2) in our treatment, which covers a broader class of trimmed means. Note that in the extreme case β = ∞ (w = c ≠ 0), T_i and S² become the usual mean and variance, and c_t becomes 1.

Another robust estimator of scale is the Winsorized standard deviation. Like the trimmed standard deviation, the Winsorized standard deviation eliminates outliers if they exist. Unlike the trimmed scale, however, the Winsorized scale replaces the outliers with the cutoff values rather than discarding them. The β scaled-deviation Winsorized variance functional is defined by

    S_w²(F) = c_w ∫ [ (x − T_w1(F))² I(|D(x, F)| ≤ β) + (L(F) − T_w1(F))² I(x < L(F)) + (U(F) − T_w1(F))² I(x > U(F)) ] w₂(D(x, F)) dF(x) / ∫ w₂(D(x, F)) dF(x),    (3.2.4)

where c_w is the consistency coefficient and T_w1(F) is the β scaled-deviation Winsorized mean, defined through

    T_wi(F) = [ ∫ ( x I(|D(x, F)|
≤ β) + L(F) I(x < L(F)) + U(F) I(x > U(F)) ) w_i(D(x, F)) dF(x) ] / ∫ w_i(D(x, F)) dF(x),  i = 1, 2,    (3.2.5)

where L(F) = μ(F) − βσ(F) and U(F) = μ(F) + βσ(F). In the extreme case β = 0, T_wi degenerates to the median and S_w to zero. The β scaled-deviation trimmed/Winsorized standard deviation at F is the square root of the corresponding variance.

Two popular high breakdown scale estimators were introduced by Rousseeuw and Croux (1993); we denote them by S_RC and Q, defined as

    S_RC = c_s med_i { med_j |x_i − x_j| },    Q_n = c_q { |x_i − x_j| ; i < j }_(k),    (3.2.6)

where c_s, c_q are consistency coefficients and k = (h choose 2) ≈ (n choose 2)/4 with h = ⌊n/2⌋ + 1. We will compare S(F) and S_w(F) with S_RC and Q. It turns out that S(F) and S_w(F) are more flexible, and more efficient than these two types of scales for light-tailed distributions and for the situation where "bad points" come from the neighborhood of the center. For the performance evaluation and comparison of S and S_w in later sections, S_RC and Q will be used as benchmarks.

The scaled-deviation trimmed/Winsorized means and variances differ from the usual ones. Note that the proportion of trimmed points for a fixed β, P(|D(X, F)| > β), in T(F) (or S(F)) is not fixed but F-dependent. In the sample case, the proportion of sample points trimmed is not fixed but random; T(F_n) (or S(F_n)) may trim some or no sample points. So T(F_n) (or S(F_n)) is flexible rather than mechanical. On the other hand, there are connections between the estimators introduced above and the usual trimming/Winsorizing scheme based on probability content. Indeed, set β to be the (1 − α)th quantile of the scaled centered variable |X − μ(F)|/σ(F); then T(F) (or S(F)) is just the regular trimmed mean (standard deviation) after trimming 100α% of points in the tails for symmetric F.
For example, if one wants to trim α = 10% of points in the tails, then simply set β = Φ⁻¹(0.95)/Φ⁻¹(0.75) = 2.4387 for normal F and β = 6.3138 for Cauchy F. A large β corresponds to a small trimmed proportion α and consequently favors the efficiency of the scaled-deviation trimmed mean and standard deviation at light-tailed F.

For clarity we adopt some notation and write

    S²(F) = c_t ( s(F) − λ(F) ),    S_w²(F) = c_w ( s_w(F) − λ_w(F) ),
    δ_i = ∫ I(|D(x, F)| ≤ β) w_i(D(x, F)) dF(x),    δ_wi = ∫ w_i(D(x, F)) dF(x),  i = 1, 2,

with

    s(F) = ∫ I(|D(x, F)| ≤ β) x² w₂(D(x, F)) dF(x) / δ₂(F),
    λ(F) = 2 T₁(F) T₂(F) − T₁(F)²,
    s_w(F) = ∫ [ x² I(|D(x, F)| ≤ β) + L(F)² I(x < L(F)) + U(F)² I(x > U(F)) ] w₂(D(x, F)) dF(x) / δ_w2(F),
    λ_w(F) = 2 T_w1(F) T_w2(F) − T_w1(F)².

T, T_w, S, and S_w are affine equivariant because both μ and σ are affine equivariant, i.e., μ(F_{aX+b}) = aμ(F_X) + b and σ(F_{aX+b}) = |a|σ(F_X) for any scalars a and b, where F_X is the distribution of X. For X ~ F symmetric about θ (i.e., ±(X − θ)/η (η > 0) have the same distribution F₀), it is seen that T(F) = θ and S(F) = η (with c_t = 1/(s(F₀) − λ(F₀))), i.e., T and S² are Fisher consistent. Without loss of generality, we can assume θ = 0 and η = 1. Let F_n be the usual empirical version of F based on a random sample. It is readily seen that T(F_n) and S²(F_n) are also affine equivariant. T(F_n) is unbiased for θ if F is symmetric about θ and T(F_n) has an expectation, and S²(F_n) (with c_t = 1/(s(F₀) − λ(F₀))) is unbiased if F has a variance. All these properties also hold for T_w and S_w².

3.3 Influence Function

We first investigate the local robustness of the functionals S(F) and S_w(F) through influence functions. Here F is the assumed distribution. The actual distribution, however, may be (slightly) different from F. A simple departure from F may be due to point-mass contamination of F, resulting in the distribution F(ε, δ_x) = (1 − ε)F + εδ_x, where δ_x is the point-mass probability distribution at a fixed point x ∈ R.
It is hoped that the effect of a slight deviation from F on the underlying functional is small relative to ε. The influence function (IF) of a statistical functional M at a given point x ∈ R for a given F, defined as (see Hampel et al. (1986))

    IF(x; M(F)) = lim_{ε→0+} ( M(F(ε, δ_x)) − M(F) ) / ε,    (3.3.1)

exactly measures the relative effect (influence) of an infinitesimal point-mass contamination on M. It is desirable that this relative influence IF(x; M(F)) be bounded. This indeed is the case for S_RC and Q (see, e.g., Rousseeuw and Croux (1993)), but not for the standard deviation functional, which has influence function (x² − 1)/2 for X ~ N(0, 1).

Note that the integration intervals in S²(F) (S_w²(F)) are functionals of F. Hence an infinitesimal point-mass contamination affects these intervals. Because of this, the derivation of the influence function of the scaled-deviation trimmed/Winsorized scales becomes somewhat challenging. The strategy is "divide and conquer": one first works out the influence functions of L and U. Assume F′ = f exists at μ and μ ± σ with f(μ) and f(μ + σ) + f(μ − σ) positive, where μ and σ stand for μ(F) and σ(F). As in Chapter 2, we have the preliminary results (2.3.2)-(2.3.6). Now assume that w_i (i = 1, 2) is differentiable and f exists at L(F) and U(F). Write L and U for L(F) and U(F), respectively, and define

    ξ_{2i}(x) = (1/δ_i) [ (U − T_i) w_i(D(U, F)) f(U) IF(x; U(F)) − (L − T_i) w_i(D(L, F)) f(L) IF(x; L(F)) ],    (3.3.3)
    ξ_{3i}(x) = (1/δ_i) I(x ∈ [L, U]) (x − T_i) w_i(D(x, F)).    (3.3.4)

One can then derive the influence function of the scaled-deviation trimmed mean T(F) as follows; the result is from our location chapter.

Corollary 3.3.1. Assume that F′ = f exists at μ, μ ± σ, L(F) and U(F), with f(μ) and f(μ + σ) + f(μ − σ) positive and continuous in a small neighborhood of L(F) and U(F), and that w_i(·) (i = 1, 2) are continuously differentiable.
Then for a given 0 < β < ∞,

    IF(x; T_i(F)) = ξ_{1i}(x) + ξ_{2i}(x) + ξ_{3i}(x).    (3.3.5)

Furthermore, if F is symmetric about the origin and w is a non-zero constant, one has

    IF(x; T(F)) = x I(x ∈ [−βσ, βσ]) / (2F(βσ) − 1) + βσ f(βσ) sign(x) / ( f(0)(2F(βσ) − 1) ).    (3.3.6)

In order to express the influence function of S(F), we need the following notation:

    γ₁(x) = (1/δ₂) [ (U² − s) w₂(D(U, F)) f(U) IF(x; U(F)) − (L² − s) w₂(D(L, F)) f(L) IF(x; L(F)) ],    (3.3.7)
    γ₂(x) = (1/δ₂) ∫_L^U (y² − s) w₂^(1)(D(y, F)) h(x, y) dF(y),    (3.3.8)
    γ₃(x) = (1/δ₂) I(x ∈ [L, U]) (x² − s) w₂(D(x, F)).    (3.3.9)

It is readily seen that the influence functions of λ(F) and λ_w(F) are given by

    IF(x; λ(F)) = 2 (T₂(F) − T₁(F)) IF(x; T₁(F)) + 2 T₁(F) IF(x; T₂(F)),
    IF(x; λ_w(F)) = 2 (T_w2(F) − T_w1(F)) IF(x; T_w1(F)) + 2 T_w1(F) IF(x; T_w2(F)).

The influence function of the scaled-deviation trimmed variance S²(F) is given by the following theorem.

Theorem 3.3.2. Assume that F′ = f exists at μ, μ ± σ, L(F) and U(F), with f(μ) and f(μ + σ) + f(μ − σ) positive and continuous in a small neighborhood of L(F) and U(F), and that w_i(·) (i = 1, 2) are continuously differentiable. Then for a given 0 < β < ∞,

    IF(x; S²(F)) = c_t [ IF(x; s(F)) − IF(x; λ(F)) ],    (3.3.10)

with

    IF(x; s(F)) = γ₁(x) + γ₂(x) + γ₃(x).    (3.3.11)

The proof of Theorem 3.3.2 is given in Section 5.2 of Chapter 5. Under the conditions of Theorem 3.3.2, IF(x; S²(F)) clearly is bounded and consequently S²(F) is locally robust. Using the chain rule, one easily obtains the influence function of the scaled-deviation trimmed standard deviation, i.e., IF(x; S(F)) = IF(x; S²(F)) / (2 S(F)). For symmetric F and w = c ≠ 0, the influence function simplifies substantially.

Corollary 3.3.3. Let F be symmetric about the origin and w_i (i = 1, 2) nonzero constants. Under the conditions of Theorem 3.3.2, we have

    IF(x; S²(F)) = c_t [ (x² − s) I(x ∈ [−βσ, βσ]) / (2F(βσ) − 1) + ((βσ)² − s) f(βσ) β sign(|x| − σ) / ( 2 f(σ)(2F(βσ) − 1) ) ].    (3.3.12)

Figure 3.1.
Influence functions of S for N(0, 1) with a constant weight and β = 3.

The proof of this corollary is omitted. A graph of the influence function IF(x; S(F)) is given in Figure 3.1; obviously, it is bounded. To work out the influence function of S_w²(F) (or S_w(F)), we define

    ξ_{w1i}(x) = (1/δ_wi) ∫ [ y I(L ≤ y ≤ U) + L I(y < L) + U I(y > U) − T_wi ] w_i^(1)(D(y, F)) h(x, y) dF(y),    (3.3.13)
    ξ_{w2i}(x) = (1/δ_wi) ∫ [ IF(x; L) I(y < L) + IF(x; U) I(y > U) ] w_i(D(y, F)) dF(y),    (3.3.14)
    ξ_{w3i}(x) = (1/δ_wi) [ x I(L ≤ x ≤ U) + L I(x < L) + U I(x > U) − T_wi ] w_i(D(x, F)),    (3.3.15)

and

    γ_{w1}(x) = (1/δ_w2) ∫ [ y² I(L ≤ y ≤ U) + L² I(y < L) + U² I(y > U) − s_w ] w₂^(1)(D(y, F)) h(x, y) dF(y),    (3.3.16)
    γ_{w2}(x) = (2/δ_w2) ∫ [ IF(x; L) L I(y < L) + IF(x; U) U I(y > U) ] w₂(D(y, F)) dF(y),    (3.3.17)
    γ_{w3}(x) = (1/δ_w2) [ x² I(L ≤ x ≤ U) + L² I(x < L) + U² I(x > U) − s_w ] w₂(D(x, F)).    (3.3.18)

One then has the influence function of the scaled-deviation Winsorized mean T_w(F) as follows; the result is from our location chapter.

Corollary 3.3.4. Assume that F′ = f exists at μ, μ ± σ, L(F) and U(F), with f(μ) and f(μ + σ) + f(μ − σ) positive and continuous in a small neighborhood of L(F) and U(F), and that w_i(·) (i = 1, 2) are continuously differentiable with r w^(1)(r) bounded for r ∈ R. Then for a given 0 < β < ∞,

    IF(x; T_wi(F)) = ξ_{w1i}(x) + ξ_{w2i}(x) + ξ_{w3i}(x).    (3.3.19)

Furthermore, if F is symmetric about the origin and w is a non-zero constant, one has

    IF(x; T_w(F)) = sign(x) F(−βσ)/f(0) + x I(−βσ ≤ x ≤ βσ) − βσ I(x < −βσ) + βσ I(x > βσ).    (3.3.20)

One can obtain the influence function of the scaled-deviation Winsorized variance S_w²(F) as follows.

Theorem 3.3.5. Assume that F′ = f exists at μ, μ ± σ, L(F) and U(F), with f(μ) and f(μ + σ) + f(μ − σ) positive and continuous in a small neighborhood of L(F) and U(F), and that w_i(·) (i = 1, 2) are continuously differentiable with r² w^(1)(r) bounded for r ∈ R. Then for a given 0 < β < ∞,

    IF(x; S_w²(F)) = c_w [ IF(x; s_w(F)) − IF(x; λ_w(F)) ],    with    IF(x; s_w(F)) = γ_{w1}(x) + γ_{w2}(x) + γ_{w3}(x).

Recall that IF(x; SD(Φ)) = (x² − 1)/2. When β takes large values, IF(x; S(Φ)), IF(x; S_w(Φ)) and IF(x; SD(Φ)) are very close.
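This closeness can be checked numerically from Corollary 3.3.3. A sketch under the N(0, 1) model with a constant weight, μ = Med, σ = MAD (so σ = Φ⁻¹(3/4)), the Fisher-consistency constant c_t = 1/s(F₀) (λ(F₀) = 0 by symmetry), and S(F₀) = 1, so that IF(x; S) = IF(x; S²)/2:

```python
from math import erf, exp, pi, sqrt

SIGMA = 0.6744897501960817   # MAD of N(0,1) = Phi^{-1}(3/4)

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def if_trimmed_sd(x, beta):
    """IF of the scaled-deviation trimmed SD at N(0,1) with constant weight,
    evaluated from the closed form (3.3.12) of Corollary 3.3.3."""
    a = beta * SIGMA
    p = 2 * Phi(a) - 1                      # probability retained
    s = (p - 2 * a * phi(a)) / p            # s(F) at N(0,1)
    ct = 1 / s                              # consistency coefficient
    inside = (x * x - s) * (1 if abs(x) <= a else 0) / p
    edge = ((a * a - s) * phi(a) * beta *
            (1 if abs(x) > SIGMA else -1)) / (2 * phi(SIGMA) * p)
    return ct * (inside + edge) / 2         # IF(x; S) = IF(x; S^2) / 2

# for large beta the IF approaches that of the ordinary SD, (x^2 - 1)/2
for x in [-2.0, -1.0, 0.0, 0.5, 1.5, 2.0]:
    assert abs(if_trimmed_sd(x, beta=6) - (x * x - 1) / 2) < 0.05
```

The agreement holds in the central region; far in the tails the trimmed IF stays bounded while (x² − 1)/2 grows, which is exactly the robustness gain.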
It implies the high efficiency of S (S_w) at the normal model when β is large.

3.5.3 Large sample relative efficiency

Now we evaluate the performance of the trimmed (and Winsorized) standard deviations in terms of their efficiency behavior (relative to the sample SD). First we examine the asymptotic relative efficiency (ARE). Table 3.2 lists the ARE results of S and S_w at a number of light- and heavy-tailed symmetric distributions for various β values. Here we again set w = c > 0. The table reveals that (i) at the normal model with large β both S and S_w can be highly efficient relative to the SD; (ii) their efficiency increases as the tail of the distribution becomes heavier and exceeds 100% at heavy-tailed distributions for large β; and (iii) S_w is more efficient than S for small β or at the normal distribution. For the Cauchy distribution, when 0.954 ≤ β ≤ 1.015, S is more efficient than S_w.

The plots of ARE versus β for the different distributions are presented in Figures 3.7-3.9. For the normal distribution, the ARE increases gradually as β increases. For the Cauchy distribution, however, the ARE decreases as β increases; when β goes to infinity, the ARE tends to zero. For both GES and ARE, S is unstable in the neighborhood of β = 1, mainly because the denominator of S(F_n) might be 0. However, S_w is quite stable as β changes.

Table 3.3 gives the asymptotic relative efficiencies (ARE) of various high breakdown scale estimators with respect to the sample standard deviation for various distributions. Examining Table 3.3 reveals that in terms of efficiency: (i) overall S_w performs better than S; (ii) S and S_w perform better than the other robust estimators S_RC, Q and MAD at light-tailed distributions, while having satisfactory efficiency for heavy-tailed distributions.

Table 3.2.
ARE's of S and S_w relative to the standard deviation

 β    N(0,1)           LG(0,1)          DE(0,1)          E(0,1)           t5               Cauchy*(0,1)
      S       S_w      S       S_w      S       S_w      S       S_w      S       S_w      S       S_w
 1    0.3068  0.3717   0.4708  0.5535   0.5647  0.6191   0.7315  0.9163   0.7315  0.9163   0.8768  0.8579
 2    0.2736  0.5383   0.4132  0.7368   0.4482  0.7413   0.7610  1.1138   0.7610  1.1138   0.7053  0.8900
 3    0.4509  0.8261   0.5818  0.9892   0.5412  0.9318   0.8303  1.2903   0.8303  1.2903   0.7422  0.9080
 4    0.7438  0.9713   0.7822  1.1069   0.6605  1.0915   0.9156  1.4456   0.9156  1.4456   0.7542  0.8969
 5    0.9403  0.9973   0.9363  1.1045   0.7879  1.1758   1.0067  1.5420   1.0067  1.5420   0.7485  0.8716
 6    0.9921  0.9998   1.0090  1.0691   0.9032  1.1894   1.0941  1.5653   1.0941  1.5653   0.7334  0.8403
 7    0.9994  1.0000   1.0263  1.0390   0.9883  1.1638   1.1679  1.5291   1.1679  1.5291   0.7134  0.8071

* compared with the inverse of the Fisher information, 2.

Table 3.3. ARE's with respect to SD

           Q        S_RC     S (β=4.5)  S (β=7)  S_w (β=4.5)  S_w (β=7)  MAD
 N(0,1)    0.8227   0.5823   0.8658     0.9994   0.9906       1.0000     0.3675
 LG(0,1)   1.0210   0.8918   0.8690     1.0263   1.1144       1.0390     0.5431
 DE(0,1)   1.5952   0.9206   0.7244     0.9883   1.1439       1.1638     0.6006
 E(0,1)    1.4897   1.0591   0.9609     1.1679   1.5028       1.5291     0.9367
 t5        1.4066   2.0026   1.9388     2.0732   2.4113       2.0145     1.3332
 Cauchy*   0.9784   0.9497   0.7530     0.7134   0.8854       0.8071     0.8106

* compared with the inverse of the Fisher information, 2.

Table 3.4. β values for which S has better ARE than other scales

           S vs Q           S vs S_RC       S vs MAD
 N(0,1)    [4.31, ∞)        (3.47, ∞)       (0, ∞)
 LG(0,1)   [6.38, 8.19)     [4.66, ∞)       (0, ∞)
 DE(0,1)   NA               [6.18, ∞)       [0, ∞)
 E(0,1)    (5.59, 18.17)    NA              [4.24, ∞)
 t5        [2.99, 18.62)    [4.81, 7.89)    [2.82, 22.11)
 Cauchy*   NA               NA              [0.82, 1.06]

* compared with the inverse of the Fisher information, 2.

Table 3.5.
β values for which S_w has better ARE than other scales

           S_w vs Q          S_w vs S_RC     S_w vs MAD      S_w vs S
 N(0,1)    [2.99, ∞)         [2.16, ∞)       (0, ∞)          (0, ∞)
 LG(0,1)   (3.17, 7.93)      [2.58, ∞)       (0, ∞)          (0, ∞)
 DE(0,1)   NA                [2.94, ∞)       (0, ∞)          (0, ∞) \ [11, 18]
 E(0,1)    (1.68, 15.84)     (4.38, 7.64)    (0, ∞)          (0, 10.97)
 t5        (1.26, 15.01)     (2.41, 7.00)    (0, 17.78)      (0, 6.46)
 Cauchy*   NA                NA              (0, 6.89)       (0, ∞) \ [0.96, 1.01]

* compared with the inverse of the Fisher information, 2.

For the normal model, it is natural that ARE_{S_w,SD} and ARE_{S,SD} (β > 1) increase with β while staying below 1. For the Cauchy model, note that ARE_S and ARE_{S_w} roughly decrease to 0 with β, while the standard deviation does not even exist. For the exponential distribution, ARE_{S,SD} is increasing while GES(S) decreases as β increases, and ARE_{S_w,SD} is J-shaped.

In practice, the sample size is often small and the distributions are not pure models. Scaled-deviation trimmed and Winsorized scales perform quite well in the asymptotic sense when F is from a pure model. This raises the concern as to whether these results remain valid at finite sample sizes and for F with a slight departure from a perfect model. We answer this question in the next section via finite sample simulation studies.

3.5.4 Finite sample relative efficiency

To check whether the estimators S and S_w are approximately unbiased for finite samples, we performed a modest simulation study. In Table 3.6, we calculated the average scale estimate over 10,000 batches of normal, Cauchy, and exponential observations. We see that S_n (S_nw) behaves better than other scales at the normal model, and we carried out a simulation to verify the efficiency gain at finite samples. For each n in Table 3.6, we computed the variance var_m(S_n) of the scale estimator S_n over m = 10,000 samples.
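The variance computation just described, and the standardized variance n·var_m(S_n)/(ave_m(S_n))² reported in Table 3.6, can be mimicked with a small Monte Carlo sketch. Here only the MAD (normalized by Φ⁻¹(3/4)) and the sample SD are compared, since the definitions of S_n and S_nw are not reproduced in this excerpt; the classical asymptotic value 1/(16 q² φ(q)²) ≈ 1.36 for the normalized MAD is computed as a cross-check. This is our own illustration, not code from the dissertation:

```python
import math, random, statistics

Q = 0.6744897501960817  # Phi^{-1}(3/4)

def mad_scale(xs):
    """MAD normalized to be consistent for sigma at the normal model."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs]) / Q

def std_variance(estimator, n, m=4000):
    """n * var_m(S_n) / ave_m(S_n)^2 at the N(0,1) model, as in (3.5.2)."""
    rng = random.Random(42)
    vals = [estimator([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(m)]
    ave = sum(vals) / m
    var = sum((v - ave) ** 2 for v in vals) / m
    return n * var / ave ** 2

phi_q = math.exp(-Q * Q / 2) / math.sqrt(2 * math.pi)  # normal density at Q
print(1 / (16 * Q**2 * phi_q**2))             # asymptotic value for MAD, about 1.36
print(std_variance(mad_scale, n=100))         # finite-sample, near the MAD_n column
print(std_variance(statistics.stdev, n=100))  # near 0.5, the SD_n column
```

The simulated values land close to the MAD_n and SD_n columns of Table 3.6 (about 1.35 and 0.50 at n = 100), illustrating how the standardized variance approaches its asymptotic limit.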
Table 3.6 lists the standardized variances

n Var_m(S_n) / (ave_m(S_n))²   (3.5.2)

where ave_m(S_n) is the average estimated value, listed in the upper half of the table. The results show that the asymptotic variance provides a good approximation for (not too small) finite samples, and that S_n and S_nw are considerably more efficient than S_n^RC, Q_n and MAD_n at the normal model even for small n.

Table 3.6. Average value and standardized variance of MAD_n, S_n^RC, Q_n, S_n, S_nw and SD_n at the normal model

Average value
  n     Q_n     S_n^RC   S_n     S_nw    MAD_n   SD_n
  10    0.899   1.286    0.913   0.905   0.905   0.969
  20    0.957   1.166    0.961   0.959   0.959   0.987
  40    0.978   1.087    0.979   0.979   0.980   0.992
  60    0.986   1.059    0.986   0.984   0.987   0.994
  80    0.990   1.046    0.990   0.990   0.992   0.997
  100   0.993   1.037    0.993   0.991   0.993   0.998
  200   0.997   1.020    0.997   0.996   0.997   0.999

Standardized variance
  n     Q_n     S_n^RC   S_n     S_nw    MAD_n   SD_n
  10    0.849   0.920    0.617   0.533   1.241   0.518
  20    0.749   0.877    0.535   0.507   1.293   0.504
  40    0.700   0.850    0.521   0.514   1.320   0.514
  60    0.658   0.849    0.497   0.494   1.337   0.494
  80    0.649   0.845    0.509   0.506   1.344   0.506
  100   0.647   0.855    0.501   0.500   1.352   0.500
  200   0.627   0.859    0.494   0.493   1.380   0.493

We also conducted a similar study for the Cauchy and exponential distributions. The results at these distributions confirm the asymptotic results presented in the last section, so they are not listed here; they show that S_n and S_nw achieve satisfactory efficiency compared with other robust estimators.

We now conduct Monte Carlo studies to investigate the efficiency behavior of S and S_w at finite samples for normal and contaminated normal models. Here η = 1 is regarded as the target parameter to be estimated. For an estimator S the empirical mean squared error (EMSE) is EMSE = (1/m) Σ_{j=1}^{m} (η̂_j − η)², where m is the number of samples generated and η̂_j is the estimate based on the jth sample.
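The EMSE, and the resulting relative efficiency, can be computed with a short routine. As a stand-in for S and S_w (whose code is not part of this excerpt) we again use the normalized MAD against the sample SD, under the central point-mass contamination (1 − ε)N(0, 1) + εδ_{0} studied below. A hedged sketch, with function names of our own:

```python
import random, statistics

def mad_scale(xs):
    """MAD normalized to be consistent for sigma at the normal model."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs]) / 0.6744897501960817

def emse(estimator, m=2000, n=100, eps=0.2, seed=7):
    """(1/m) * sum_j (S_j - eta)^2 with eta = 1, sampling (1-eps)N(0,1) + eps*delta_0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        xs = [0.0 if rng.random() < eps else rng.gauss(0.0, 1.0) for _ in range(n)]
        total += (estimator(xs) - 1.0) ** 2
    return total / m

re_mad = emse(statistics.stdev) / emse(mad_scale)
print(re_mad)  # RE of MAD w.r.t. the SD: well below 1 under central contamination
```

The low RE of the MAD under inliers at the center mirrors the MAD columns of Table 3.8.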
The relative efficiency (RE) of S is then obtained by dividing the EMSE of the sample standard deviation by that of S. We generated m = 50,000 samples from (1 − ε)N(0, 1) + εN(1, 0.1) and (1 − ε)N(0, 1) + εδ_{0} with ε = 0, 0.1 and 0.2 for different sample sizes n. Some results are listed in Table 3.7 and Table 3.8 with β = 7.

Table 3.7. RE's of various robust scales (β = 7 for the scaled-deviation trimmed/Winsorized scale) at (1 − ε)N(0, 1) + εN(1, 0.1)

        ε = 10%                                 ε = 20%
  n     Q      S_RC   S      S_w    MAD         Q      S_RC   S      S_w    MAD
  20    0.305  0.634  0.972  0.993  0.414       0.322  0.615  0.958  0.985  0.417
  40    0.444  0.606  0.986  0.992  0.367       0.527  0.586  0.973  0.980  0.369
  60    0.552  0.570  0.985  0.988  0.323       0.650  0.564  0.977  0.982  0.318
  80    0.617  0.559  0.992  0.993  0.290       0.732  0.533  0.981  0.982  0.281
  100   0.666  0.524  0.991  0.991  0.256       0.791  0.527  0.978  0.978  0.261

Table 3.8. RE's of various robust scales (β = 7 for the scaled-deviation trimmed/Winsorized scale) at (1 − ε)N(0, 1) + εδ_{0}

        ε = 10%                                 ε = 20%
  n     Q      S_RC   S      S_w    MAD         Q      S_RC   S      S_w    MAD
  20    0.502  0.494  0.825  0.908  0.361       0.604  0.410  0.758  0.875  0.328
  40    0.721  0.413  0.872  0.924  0.319       0.711  0.309  0.823  0.903  0.278
  60    0.779  0.342  0.897  0.930  0.295       0.633  0.248  0.856  0.917  0.252
  80    0.746  0.296  0.908  0.935  0.283       0.562  0.216  0.873  0.927  0.241
  100   0.699  0.263  0.916  0.940  0.270       0.502  0.195  0.888  0.934  0.229

For small β, the simulation results show that the scaled-deviation Winsorized scale is more efficient than the scaled-deviation trimmed scale when the "bad" points come from the area around the center. When β gets large, the difference becomes small, and the two almost achieve the same efficiency when β > 7. Our simulation results for ε = 0.0 (not listed in the table) confirm the validity of the asymptotic values in Table 3.2 for N(0, 1). On the other hand, when there is just a 10% or 20% contamination in the distribution, the other estimators Q, S_RC and MAD all become less efficient than S and S_w, which are the most efficient, reflecting the robustness of these estimators.
It is remarkable that S_w is overwhelmingly more efficient than all other estimators if we suitably choose β. The other three types of robust estimators, built on (recursive) medians, use only the 50% of the data around the center, so they are very robust and efficient when contaminating points come from either end. But this strength is also a weakness: when contaminating points are close to the center, these estimators use the "bad" points and lose efficiency. The two estimators S and S_w introduced in this chapter achieve satisfactory efficiency when contaminating points come from either end, although they are a little less efficient than the other three (Q, S_RC, MAD) in that case. But when outliers or contaminating points come from the area near the center, S and S_w are far more efficient.

3.6 Concluding remarks

Unlike the standard deviation, all the trimmed and Winsorized standard deviations and robust scales discussed in this chapter have bounded influence functions for suitable distributions and weight functions and hence are locally robust. In terms of global robustness, the scaled-deviation trimmed and Winsorized standard deviations S and S_w are exceptional in the sense that, for any β ≥ 1, they can enjoy the best possible breakdown point robustness shared with the other high breakdown robust scales.

Relative to the standard deviation, the scaled-deviation trimmed standard deviation S and Winsorized standard deviation S_w are highly efficient for large β at light-tailed symmetric distributions and much more efficient at heavy-tailed ones for small β. The three popular scale estimators Q, S_RC and MAD, which are built on (recursive) medians, are highly efficient when "bad" points come from either end. However, they lose the capability to tell the truth when contaminating points are present around the center; in that case, S and S_w can make a difference. S and S_w are also more flexible than other robust scale estimators since β can take different values, which raises a question about the choice of the β value.
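The data-driven choice of β alluded to here is not spelled out in this excerpt. One plausible rule — purely illustrative, with cutoffs and the use of a tail-weight ratio that are our own assumptions, not the dissertation's — is to pick β from a tail-weight measure such as the ratio of the interdecile to the interquartile range:

```python
import math, random, statistics

def tail_weight(xs):
    """Tail-weight measure: interdecile range over interquartile range."""
    deciles = statistics.quantiles(xs, n=10)   # 9 cut points: q10 ... q90
    quartiles = statistics.quantiles(xs, n=4)  # 3 cut points: q25, q50, q75
    return (deciles[8] - deciles[0]) / (quartiles[2] - quartiles[0])

def choose_beta(xs, light=7.0, heavy=4.0, cutoff=2.0):
    """Hypothetical rule: large beta for light tails, smaller beta for heavy tails."""
    return light if tail_weight(xs) < cutoff else heavy

rng = random.Random(1)
normal_data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
cauchy_data = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(5000)]
print(choose_beta(normal_data), choose_beta(cauchy_data))  # 7.0 for light tails, 4.0 for heavy
```

At the normal the population ratio is about 1.90, and at the Cauchy about 3.08, so a cutoff of 2 separates the two regimes; any such rule would of course need calibration in practice.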
In light of our simulation studies, a β value between 4 and 7 is recommended, so that S and S_w can be very efficient at both light- and heavy-tailed distributions. Instead of a fixed value, one might also adopt an adaptive, data-driven approach to determine an appropriate β value: for a given data set, one determines a value of β based on the heaviness of the tail. Generally speaking, a large value of β is selected for a light-tailed data set, and a smaller value for a heavy-tailed one.

Figure 3.2. Influence functions of S_w for N(0, 1) with a constant weight and β = 3.

Figure 3.3. Influence functions of various scales for the normal distribution (β = 4.5 for S and S_w).

Figure 3.4. Influence functions of various scales for the Cauchy distribution (β = 4.5 for S and S_w).

Figure 3.5. Influence functions of various scales for the exponential distribution (β = 4.5 for S and S_w).

Figure 3.6. Influence functions of various scales for 0.9N(0, 1) + 0.1N(1, 0.1) (β = 4.5 for S and S_w).

Figure 3.7. ARE of trimmed and Winsorized standard deviations for the normal distribution.

Figure 3.8. ARE of trimmed and Winsorized standard deviations for the Cauchy distribution.

Figure 3.9. ARE of trimmed and Winsorized standard deviations for the exponential distribution.

Figure 3.10. GES of trimmed and Winsorized standard deviations for the normal distribution.

Figure 3.11.
GES of trimmed and Winsorized standard deviations for the Cauchy distribution.

Figure 3.12. GES of trimmed and Winsorized standard deviations for the exponential distribution.

CHAPTER 4

The Multiple Least Trimmed Squares Estimator

4.1 Introduction

Consider the multiple regression model

y_i = B^t x_i + e_i,   i = 1, ..., n,

with x_i = (x_{i1}, ..., x_{ip})^t ∈ R^p and y_i ∈ R. The vector B ∈ R^p contains the regression coefficients. The error terms e_1, ..., e_n are i.i.d. with zero center and a positive scale σ. Furthermore, we assume that the errors are independent of the carriers. Note that this model generalizes the location model (x_i = 1). Denote the entire sample by Z_n = {(x_i, y_i); i = 1, ..., n}, and write X = (x_1, ..., x_n)^t for the design matrix and Y = (y_1, ..., y_n)^t for the vector of responses. The classical estimator for B is the least squares (LS) estimator B_LS, which is given by

B_LS = (X^t X)^{-1} X^t Y   (4.1.1)

while σ² is unbiasedly estimated by

σ̂²_LS = (1/(n − p)) (Y − X B_LS)^t (Y − X B_LS).   (4.1.2)

Since the least squares estimator is extremely sensitive to outliers, we aim to construct a robust alternative. An overview of strategies to robustify the multiple regression method is given by Maronna and Yohai (1997) in the context of simultaneous equations models. Koenker and Portnoy (1990) apply a regression M-estimator to each coordinate of the responses, and Bai et al. (1990) minimize the sum of the Euclidean norms of the residuals. However, these two methods are not affine equivariant. Our approach will be different from the latter, since it will be affine equivariant. Agulló, Croux et al. (2002) discussed some properties of the multivariate trimmed least squares estimator. From the computational aspect, Rousseeuw and Van Driessen introduced a fast algorithm based on the "C-step"; the C-step guarantees the reduction of the objective function during iterations.
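The C-step monotonicity can be illustrated with a minimal sketch for simple regression with an intercept (p = 2): refit LS on the current h-subset, then take the h observations with smallest squared residuals under that fit; the trimmed objective never increases. This is our own toy illustration with made-up data, not the algorithm of Section 4.4:

```python
import random

def ls_fit(pts):
    """Closed-form LS for y = a + b*x on a list of (x, y) pairs."""
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b

def c_step(data, subset, h):
    """One C-step: fit on subset, keep the h smallest squared residuals."""
    a, b = ls_fit([data[i] for i in subset])
    order = sorted(range(len(data)), key=lambda i: (data[i][1] - a - b * data[i][0]) ** 2)
    return order[:h]

def objective(data, subset):
    """Mean squared residual of the subset under its own LS fit."""
    a, b = ls_fit([data[i] for i in subset])
    return sum((y - a - b * x) ** 2 for x, y in (data[i] for i in subset)) / len(subset)

rng = random.Random(0)
data = [(x, 2.0 + 0.5 * x + rng.gauss(0, 0.3)) for x in range(30)]
data += [(x, 20.0) for x in range(5)]   # a few gross outliers
h = (len(data) + 3) // 2                # h = [(n + p + 1)/2] with p = 2
subset = rng.sample(range(len(data)), h)
msr = [objective(data, subset)]
for _ in range(10):
    subset = c_step(data, subset, h)
    msr.append(objective(data, subset))
print([round(v, 3) for v in msr])       # non-increasing, stabilizes after a few steps
```

The monotone decrease follows because the new subset has a no-larger residual sum under the old fit, and refitting LS can only decrease it further.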
However, as in the location and scale settings, the general trimmed regression estimator suffers from very low efficiency while keeping the highest breakdown point. As shown in this chapter, the efficiency of the ordinary least trimmed squares estimator relative to least squares is only 7.1%, which is far from satisfactory. Meanwhile, a small proportion of trimming causes a low breakdown point and, hence, low robustness. In this chapter, we try to resolve this dilemma and find an estimator, as in the location and scale settings, that has the highest breakdown point and can attain a high level of efficiency. It turns out that this kind of estimator is hard to define and hard to analyze, let alone to compute with an algorithm. The estimator defined in this chapter is expected to have the highest breakdown point (a strict proof is not available for technical reasons) while being truly efficient. We found, however, that starting from any initial subset, keeping those points with residuals close to zero and iterating a few steps, the subsequent subset becomes relatively "stable". From Figure 4.1 we can see that the mean square of the residuals stabilizes after a small number of iterations. Defining an estimator by the minimum mean square error over the collection of "stable" sets, we obtain the estimator studied in this chapter. This definition is different from the generalized least trimmed squares estimator of Agulló, Croux et al. (2002). In Section 4.2 we give a formal definition of the multiple least trimmed squares (LTS) estimator. In Section 4.3 we derive the influence function and study the ARE. A time-efficient algorithm to compute the LTS is presented in Section 4.4.

Figure 4.1.
MSR vs number of iterations with 100 arbitrary initial subsets.

4.2 Definition and properties

Our approach consists of finding a subset H of observations with the property that the mean square of its residuals from an LS fit β_sub, based solely on this subset, is minimal; the subset H of observations satisfies

H = {i : d_i(β_sub) ≤ ℓ d_(h)(β_sub)},

where h = [(n + p + 1)/2] or [(n + p + 2)/2] and d(β_sub) = (Y − X β_sub)². When H is not unique, we take the one that has the minimum mean square of its residuals. Denote the collection of such H by ℋ. The resulting estimator is then simply the LS estimator computed from the optimal subset, which is defined by its least squares estimator. When X = (1, ..., 1)^t, the model reduces to a multiple regression model with only an intercept, that is, a location model. In the univariate response case, our approach is equivalent to the generalized least trimmed squares estimator (see, e.g., Agulló et al.), which is a generalization of the LTS estimator (Rousseeuw 1984) for robust regression.

Consider a dataset Z_n = {(x_i, y_i); i = 1, ..., n} ⊂ R^{p+1}, and for any β ∈ R^p denote by r_i(β) = y_i − β^t x_i the corresponding residuals.

Definition 4.2.1. With the notation above, the multiple least trimmed squares estimator (LTS) is defined as

B_LTS(Z_n) = β_LS(H), where H ∈ argmin_{H ∈ ℋ} σ̂²_LS(H)   (4.2.1)

with σ̂²_LS(H) = Σ_{i ∈ H} r_i²(H)/(#(H) − p) for any H ∈ ℋ. The variance of the errors can then be estimated by

σ²_LTS(Z_n) = c_ℓ σ̂²_sub(H)   (4.2.2)

where c_ℓ is a consistency factor. Note that if the minimization problem has more than one solution, in which case we view argmin_H σ̂²_LS(H) as a set, we arbitrarily select one of these solutions to determine the LTS estimator. A consistency factor c_ℓ will be proposed to attain Fisher consistency at the specified model. Note that for ℓ = +∞ we recover the classical least squares estimator. Throughout the text we will suppose that the data set Z_n = {(x_i, y_i); i = 1, ...
..., n} ⊂ R^{p+1} is in general position, in the sense that no h points of Z_n lie on the same hyperplane of R^{p+1}. Formally, this means that for all β ∈ R^p and γ ∈ R,

#{(x_j, y_j) : β^t x_j + γ y_j = 0} < h   (4.2.3)

unless β and γ are both zero.

4.3 The influence function and asymptotic variances

The functional form of the LTS estimator can be defined as follows. Let K be an arbitrary (p + 1)-dimensional continuous distribution which represents the joint distribution of the carriers and the response variable. Let us denote d_A²(x, y) = (y − B_A(K)^t x)²; then it follows that A = {(x, y) ∈ R^{p+1} : d_A²(x, y) ≤ ℓ q(A)}, where q(A) = (D_A)^{-1}(0.5) with D_A(t) = P_K(d_A²(x, y) ≤ t). Define

D_K(ℓ) = {A : A = {(x, y) : d_A²(x, y) ≤ ℓ q(A)}}.   (4.3.1)

To define the LTS estimator at the distribution K we require that

P_K(β^t x = 0) < 1/2 for all β ∈ R^p \ {0}.   (4.3.2)

For each A ∈ D_K(ℓ), the least squares solution over the set A is then given by

B_A(K) = (∫_A x x^t dK(x, y))^{-1} ∫_A x y dK(x, y)   (4.3.3)

and

σ_A²(K) = ∫_A (y − B_A(K)^t x)² dK(x, y) / P_K(A).   (4.3.4)

Furthermore, a set Â ∈ D_K(ℓ) is called an LTS solution if σ_Â²(K) ≤ σ_A²(K) for any other A ∈ D_K(ℓ). The LTS functionals at the distribution K are then defined as

B_LTS(K) = B_Â(K) and σ²_LTS(K) = c_ℓ σ_Â²(K).   (4.3.5)

The constant c_ℓ can be chosen such that consistency is obtained at the specified model. If the distribution K is not continuous, then the definition of D_K(ℓ) can be modified as in Croux and Haesbroeck (1999) to ensure that the set D_K(ℓ) is non-empty.

Now consider the regression model y = β^t x + e, where x = (x_1, ..., x_p)^t is the p-dimensional vector of explanatory variables and e is the error term. We suppose that e is independent of x and has a distribution F_σ with density f_0(u) = g(u²/σ²)/σ, where σ > 0. The function g is assumed to have a strictly negative derivative g', so that F_σ is a unimodal elliptically symmetric distribution around the origin. The distribution of z = (x, y) is denoted by H.
A regularity condition (to avoid degenerate situations) on the model distribution H is that

P_H(β^t x + γ y = 0) < 1/2   (4.3.6)

for all β ∈ R^p and γ ∈ R not both equal to zero at the same time. This general position condition says that the maximal amount of probability mass of H lying on the same hyperplane must be lower than 1/2.

Theorem 4.3.1. Let c_ℓ be the consistency factor determined by the symmetric error distribution F and q = K^{-1}(0.5), with K(t) = P_{F_0}(e² ≤ t). Then the functionals B_LTS and σ²_LTS are Fisher-consistent estimators for the parameters β and σ² at the model distribution K:

B_LTS(K) = β and σ²_LTS(K) = σ².

Proof. First of all, due to equivariance, we may assume that β = 0 and σ² = 1, so y = e ~ F. It now suffices to show that B_LTS(K) = 0. Then σ²(K) is the LTS functional at the distribution of y − B_LTS(K)^t x = y = e, and the consistency coefficient c_ℓ can be easily derived. B_LTS is defined solely on the set C = {(x, y) ∈ R^{p+1} : (y − B_LTS^t x)² ≤ ℓq}. Therefore

∫_C x (y − B_LTS^t x) dK(x, y) = 0.   (4.3.7)

Now suppose that B_LTS ≠ 0. From (4.3.7) it follows that

∫_C B_LTS^t x (y − B_LTS^t x) dK(x, y) = 0,

which can be rewritten as

∫_{R^p} B_LTS^t x I(x) dG(x) = 0   (4.3.8)

with

I(x) = ∫_{C_x} (y − B_LTS^t x) dF(y),

where C_x = {y ∈ R : (x, y) ∈ C}. Fix x and set d = B_LTS^t x. Since the distribution of y is symmetric, we have

I(x) = ∫_{d−√(ℓq)}^{d+√(ℓq)} (y − d) dF(y) = ∫_0^{√(ℓq)} t (g((d + t)²) − g((d − t)²)) dt.

If d > 0 we have (d + t)² > (d − t)² (for t > 0), and since g is strictly decreasing this implies I(x) < 0. Similarly, we can show that d < 0 implies I(x) > 0. Also, B_LTS^t x = 0 implies I(x) = 0. However, due to condition (4.3.6), the latter event occurs with probability less than 0.5. Therefore, we obtain

∫_C B_LTS^t x (y − B_LTS^t x) dK(x, y) < 0,

which contradicts (4.3.8); so we conclude that B_LTS = 0. □

The influence function of a functional T at the distribution K measures the effect on T of adding a small mass at z = (x, y).
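Before turning to the influence function, the key inequality in the Fisher-consistency proof — that I(x) = ∫₀^{√(ℓq)} t(g((d+t)²) − g((d−t)²)) dt has the sign opposite to d = B_LTS^t x — can be checked numerically for the standard normal, where g(u) = e^{−u/2}/√(2π). This is an illustrative check of ours with a crude midpoint quadrature, taking ℓ = 1 and q the median of e²:

```python
import math

Q = 0.6744897501960817 ** 2  # q: median of e^2 at N(0,1), i.e. (Phi^{-1}(3/4))^2

def g(u):
    """Density generator of N(0,1): f0(y) = g(y^2)."""
    return math.exp(-u / 2) / math.sqrt(2 * math.pi)

def I(d, ell=1.0, steps=20000):
    """Midpoint-rule approximation of the integral in the proof of Theorem 4.3.1."""
    c = math.sqrt(ell * Q)
    h = c / steps
    return sum(t * (g((d + t) ** 2) - g((d - t) ** 2)) * h
               for t in (h * (k + 0.5) for k in range(steps)))

print(I(0.5))   # negative for d > 0
print(I(-0.5))  # positive for d < 0
print(I(0.0))   # exactly zero for d = 0
```

The sign pattern is exactly what forces B_LTS = 0 in the proof: a non-zero d would make the score integral strictly negative, contradicting (4.3.8).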
If we denote the point mass at z by Δ_z and consider the contaminated distribution K_{ε,z} = (1 − ε)K + εΔ_z, then the influence function is given by

IF(z; T, K) = lim_{ε↓0} (T(K_{ε,z}) − T(K))/ε = (∂/∂ε) T(K_{ε,z})|_{ε=0}.

(See Hampel et al. 1986.) It can easily be seen that the LTS is equivariant for affine transformations of the regressors and responses, and for regression transformations which add a linear function of the explanatory variables to the responses. Therefore, it suffices to derive the influence function at a model distribution K_0 for which β = 0 and the error distribution F = F_0 with density f_0(y) = g(y²). The following theorem gives the influence function of the LTS regression functional at K_0.

Theorem 4.3.2. With the notations from above, we have that

IF(z; B_LTS, K_0) = [y I(y² ≤ ℓq) / ((2F(√(ℓq)) − 1) − 2√(ℓq) f(√(ℓq)))] E_G[x x^t]^{-1} x.   (4.3.9)

Proof. Consider the contaminated distribution K_ε = (1 − ε)K_0 + εΔ_{z_0} with z_0 = (x_0, y_0), and denote B_ε := B_LTS(K_ε) and σ_ε² := σ²_LTS(K_ε). Then (4.3.3) results in

B_ε = (∫_{Â_ε} x x^t dK_ε(x, y))^{-1} ∫_{Â_ε} x y dK_ε(x, y)

where Â_ε ∈ D_{K_ε}(ℓ) is an LTS solution. Differentiating with respect to ε and evaluating at 0 yields

IF(z_0; B_LTS, K_0) = (∫_Â x x^t dK_0(x, y))^{-1} (∂/∂ε) ∫_{Â_ε} x y dK_ε(x, y)|_{ε=0}
+ (∂/∂ε)[(∫_{Â_ε} x x^t dK_ε(x, y))^{-1}]|_{ε=0} ∫_Â x y dK_0(x, y).

Fisher consistency yields that Â = {(x, y) ∈ R^{p+1} : y² ≤ ℓq}, where q = (D^0_K)^{-1}(1/2) with D^0_K(t) = P_F(y² ≤ t). Hence Â = R^p × {y ∈ R : y² ≤ ℓq} =: R^p × A. This implies

∫_Â x y dK_0(x, y) = ∫_{R^p} x dG(x) ∫_A y dF(y) = 0

by symmetry of F, and

∫_Â x x^t dK_0(x, y) = ∫_{R^p} x x^t dG(x) ∫_A dF(y) = (2F(√(ℓq)) − 1) E_G[x x^t].

Therefore, we obtain

IF(z_0; B_LTS, K_0) = [E_G[x x^t]^{-1}/(2F(√(ℓq)) − 1)] (∂/∂ε) ∫_{Â_ε} x y dK_ε(x, y)|_{ε=0}   (4.3.10)

= [E_G[x x^t]^{-1}/(2F(√(ℓq)) − 1)] [(∂/∂ε)(1 − ε) ∫_{Â_ε} x y dK_0(x, y)|_{ε=0} + x_0 y_0 P(z_0 ∈ Â_ε)|_{ε=0}]   (4.3.11)

= [E_G[x x^t]^{-1}/(2F(√(ℓq)) − 1)] [(∂/∂ε) ∫_{Â_ε} x y dK_0(x, y)|_{ε=0} + x_0 y_0 I(y_0² ≤ ℓq)].   (4.3.12)

Let us denote d_ε²(x, y) = (y − B_ε^t x)²; then it follows that Â_ε = {(x, y) ∈ R^{p+1} : d_ε²(x, y) ≤ ℓ q(ε)}, where q(ε) = (D_{K_ε})^{-1}(0.5) with D_{K_ε}(t) = P_{K_ε}(d_ε²(x, y) ≤ t). For x fixed we define the set B_{ε,x} := {y ∈ R : d_ε²(x, y) ≤ ℓ q(ε)}. Then it follows that

∫_{Â_ε} x y dK_0(x, y) = ∫_{R^p} x (∫_{B_{ε,x}} y dF(y)) dG(x) = ∫_{R^p} x (∫_{B_{ε,x}} y g(y²) dy) dG(x).   (4.3.13)

Using the transformation v = y − B_ε^t x, we obtain that

I(ε) := ∫_{B_{ε,x}} y g(y²) dy = ∫_{−√(ℓq(ε))}^{√(ℓq(ε))} (v + B_ε^t x) g((v + B_ε^t x)²) dv.
For :1: fixed we define the set 85,3 := {y e Rldgcc; y) S £q(e)}. Then it follows that ~//i5 my‘dKoht; y) = [RP/56,17 zde(y)dG(I) = [m [ems/gabdyxdcv) (4.3.13) Using the transformation v = y — 822:, we obtain that 1(6) == / y9(y2)dy ! _. t t 2 — [v 2$£q(€) (v + Ber)g((v + 85x) )dv «41(6) 2 = f (v + ng)g((v + 822:) )dv -lQ(€) 72 Note that 1(5) — 1(0) _1[( 39(6) fl- 6 _E (v + 82x)g((v + 821:)2)dv — f :(v + 823:)g((v + 82.x)2)dv) 4% -fl' m t t 2 ‘fla t t 2 .+ (Ls/23(1) + 853:)g((v + 851:) )dv — [475(1) + B :L')g((v + B x) )dv)] 1 41(6) 2 - 134(6) 2 =;( L/Ia (v + egz)g((v + 35.2) )dv — f—m (v + Bgz)g((v + 32x) )dv) fl? 1 + (1% E ((1) + 82$)g((v + 32x)2)dv — (U + th)g((v + Bt$)2))dv =(61 + Bée>g((e. + 3:4)2) WW:- fl'q' _ (_02 + 82mm + 13,523?) (-\/e(1(€)€- (as?) J?— + (1;; ”£1 ((1) + 82$)g((v + 8:1)2)dv —- (U + th)g((v + th)2))dv m t 2 2 I 2 t =/fl_ (IF(zO;BLTs,Ko) 129(1) )+21} 9 (v )IF(ZO§BLTSaKO) $)dv+020(1) - «1 So we have that 6 sgl(e)le=o = ((2F(\/€q) — 1) + 262)1F(zosBLTS’ Koltl‘ where e = 155;, g'(v2)v'~’dv = In“26 vdf(v) = JIM/76) - tam/26) — 1) Cl Note that the influence function is bounded in y but unbounded in 1:. Closer inspection of (6.1) shows, however, that only good leverage points, which have outlying x but satisfy the regression model, can have a high effect on the LTS estimator. Bad leverage points will give a zero influence. Remark: The influence function of the LTS location estimator T at a symmetric distri- bution F0 can be obtained easily, it is given by 1F (31; T; Fo) = 1(y2 _<_ 6(1) y (WM/5) - 1) - Zth/Z’E) Therefore, it follows that the influence function of B LTS can be rewritten as IF(z; BLTS, K0) = Ea[XXt]'1$IF(y; T, F0) : (4.3.14) 73 The asymptotic variance-coVariance matrix of BLTS can now be computed by means of ASV(BLTS, K0) = EK[IF(z; BLTS: K0)®IF(z; BLTS: K0)t] (see e.g. Hampel et al. 1986). 
Here A ⊗ B denotes the Kronecker product of a (p × 1) matrix A with a (1 × p) matrix B, which results in a (p × p) matrix whose (i, j)-th entry equals a_i b_j, where a_i are the elements of A and b_j the elements of B. Let us denote Σ_x := E_G[x x^t]; then expression (4.3.14) implies that

ASV(B_LTS, K_0) = ASV(T, F_0) Σ_x^{-1}.   (4.3.15)

From (4.3.15) it follows that for every 1 ≤ i ≤ p the asymptotic variance of (B_LTS)_i equals

ASV((B_LTS)_i, K_0) = E_K[IF²(z; (B_LTS)_i, K_0)] = (Σ_x^{-1})_{ii} ASV(T, F_0).

For i ≠ i' we obtain the asymptotic covariances

ASV((B_LTS)_i, (B_LTS)_{i'}, K_0) = E_K[IF(z; (B_LTS)_i, K_0) IF(z; (B_LTS)_{i'}, K_0)] = (Σ_x^{-1})_{ii'} ASV(T, F_0),

and all other asymptotic covariances equal 0. Due to affine equivariance, we may consider w.l.o.g. the case where Σ_x = I. Then all asymptotic covariances are zero, while ASV((B_LTS)_i, K_0) = ASV(T, F_0) for all 1 ≤ i ≤ p. The limit case ℓ = ∞ yields the asymptotic variance of the least squares estimator, ASV((B_LS)_i, K_0) = ASV(M, F_0), where M is the functional form of the sample mean. Therefore, we can compute the asymptotic relative efficiency of the LTS estimator at the model distribution K_0 with respect to the least squares estimator as

ASV((B_LS)_i, K_0)/ASV((B_LTS)_i, K_0) = ASV(M, F_0)/ASV(T, F_0) = ARE(T, F_0)

for all 1 ≤ i ≤ p. Hence the asymptotic relative efficiency of the LTS estimator in p + 1 dimensions does not depend on the distribution of the carriers, but only on the distribution of the errors, and equals the asymptotic relative efficiency of the LTS location estimator at the error distribution F_0. For the normal distribution these relative efficiencies are given in Table 4.1. Note that the efficiency of LTS does not depend on p, the number of explanatory variables, but only on the number of dependent variables.

Table 4.1. Asymptotic relative efficiency of the LTS estimator w.r.t. the least squares estimator at the normal distribution for several values of ℓ.
 ℓ     1       3       5       7       10      20      30
 ARE   0.071   0.286   0.483   0.636   0.792   0.973   0.997

4.4 Finite-sample simulations

4.4.1 Algorithm

In algorithmic terms, the procedure can be described as follows.

Step I
1. Create an initial subset H_0: draw a random p-subset J and compute θ_0 := the coefficients of the hyperplane through J. If J does not define a unique hyperplane (i.e., when the rank of X_J is less than p), redraw J until it does.
2. Compute the residuals r_0(i) := y_i − θ_0^t x_i for i = 1, ..., n. Sort the absolute values of these residuals, which yields a permutation π for which r_0²(π(1)) ≤ r_0²(π(2)) ≤ ... ≤ r_0²(π(n)), and set H_0 = {i : r_0²(i) ≤ ℓ r_0²(π(h))}, h = [n/2] + [(p + 1)/2].
3. Compute β_0 := the LS regression estimator based on H_0.
4. Iterate K (say 20) steps, or until H_k = H_{k−1} or H_k = H_{k−2}; record β, H in the last step and k (the actual number of iterations).

Step II
Repeat Step I M (say 300) times, choosing different initial subsets H_0. For simplicity, write β_i, H_i and k_i, i = 1, ..., M.

Step III
If min(k_i) < K, then choose, among the runs with k_i < K, the one minimizing the trimmed objective; the corresponding β_i and H_i define the estimator.

We need the following results, whose more general versions are given and treated in Zuo (2003).

Lemma 5.1.1. For fixed x ∈ R and sufficiently small ε, we have for fixed 0 < β < ∞:
(a) D(y, F) and D(y, F(ε, δ_x)) are Lipschitz continuous in y ∈ R;
(b) sup_{y ∈ S} |D(y, F(ε, δ_x)) − D(y, F)| = O(ε) for any bounded set S ⊂ R;
(c) |L(F(ε, δ_x)) − L(F)| = O(ε), |U(F(ε, δ_x)) − U(F)| = O(ε).

First we write

T(F_ε) − T(F) = ∫_{L(F_ε)}^{U(F_ε)} (y − T(F)) w(D(y, F_ε)) dF_ε(y) / ∫_{L(F_ε)}^{U(F_ε)} w(D(y, F_ε)) dF_ε(y).   (5.1.1)

We focus on the numerator; the denominator can be treated in the same (but less involved) manner. The numerator can clearly be decomposed into three terms:

I_1ε = (∫_{L(F_ε)}^{U(F_ε)} − ∫_{L(F)}^{U(F)}) (y − T(F)) w(D(y, F_ε)) dF_ε(y);
I_2ε = ∫_{L(F)}^{U(F)} (y − T(F)) [w(D(y, F_ε)) − w(D(y, F))] dF_ε(y);
I_3ε = ∫_{L(F)}^{U(F)} (y − T(F)) w(D(y, F)) dF_ε(y).

In addition, sup over bounded y of |y w^(1)(D(y, F_ε)) − y w^(1)(D(y, F))| = o(1), and (D(y, F_ε) − D(y, F))/ε = IF(x; D(y, F)) + y O(ε) + O(ε).
First we write

T_w(F_ε) − T_w(F) = [1/∫ w(D(y, F_ε)) dF_ε(y)] · [∫_{L(F_ε)}^{U(F_ε)} w(D(y, F_ε))(y − T_w) dF_ε(y)
+ ∫_{−∞}^{L(F_ε)} w(D(y, F_ε))(L(F_ε) − T_w) dF_ε(y) + ∫_{U(F_ε)}^{∞} w(D(y, F_ε))(U(F_ε) − T_w) dF_ε(y)].

Lebesgue's dominated convergence theorem implies immediately that

∫ w(D(y, F_ε)) dF_ε(y) = ∫ w(D(y, F)) dF(y) + o(1).   (5.1.5)

We now focus on the numerator. Call the three terms I_i(F_ε, y), i = 1, 2, 3, respectively. By the proof of Theorem 2.3.1, we see immediately that

(1/ε) I_1(F_ε, y) = (U − T_w) w(β) f(U) IF(x; U) − (L − T_w) w(β) f(L) IF(x; L)
+ ∫_L^U (y − T_w) w^(1)(D(y, F)) h(x, y) dF(y) + I(L ≤ x ≤ U)(x − T_w) w(D(x, F))
− ∫_L^U (y − T_w) w(D(y, F)) dF(y) + o(1).   (5.1.6)

Now it suffices to treat I_2(F_ε, y) and I_3(F_ε, y). Following the proof of Theorem 2.3.1 and employing Lemmas 5.1.1 and 5.1.2, we have

(1/ε) I_2(F_ε, y) = I(x < L) w(D(x, F))(L − T_w) + IF(x; L) ∫_{−∞}^{L} w(D(y, F)) dF(y)
+ (L − T_w) ∫_{−∞}^{L} w^(1)(D(y, F)) h(x, y) dF(y) + (L − T_w) w(β) f(L) IF(x; L)
− ∫_{−∞}^{L} (L − T_w) w(D(y, F)) dF(y) + o(1).   (5.1.7)

Likewise we have

(1/ε) I_3(F_ε, y) = I(x > U) w(D(x, F))(U − T_w) + IF(x; U) ∫_{U}^{∞} w(D(y, F)) dF(y)
+ (U − T_w) ∫_{U}^{∞} w^(1)(D(y, F)) h(x, y) dF(y) − (U − T_w) w(β) f(U) IF(x; U)
− ∫_{U}^{∞} (U − T_w) w(D(y, F)) dF(y) + o(1).   (5.1.8)

Combining the last four displays, we have the desired result. □

PROOF OF THEOREM 2.4.1. For convenience, we define

V_n = √n (F_n − F),   H_n(·) = √n (D(·, F_n) − D(·, F)).   (5.1.9)

The following result, a special version of a general result in Zuo (2003), is needed in the proof.

Lemma 5.1.3. Assume that F' = f exists at μ and is continuous in small neighborhoods of μ ± σ, with f(μ) and f(μ + σ) + f(μ − σ) positive. Then for 0 < β < ∞ and any numbers L_1 < U_1,
(a) sup_{x ∈ [L_1, U_1]} (1 + |x|) |H_n(x)| = O_p(1); and
(b) H_n(x) = ∫ IF(y; D(x, F)) V_n(dy) + o_p(1), uniformly over x ∈ [L_1, U_1].
PROOF: For x ∈ [L_1, U_1], it is readily seen that

D(x, F_n) − D(x, F) = −(D(x, F)(σ_n − σ) + (μ_n − μ))/σ_n.

(a) follows immediately, since the given conditions allow asymptotic representations for both μ_n and σ_n (see, e.g., page 92 of Serfling (1980)), which lead to (b). □

The proof of the theorem follows the lines given in Zuo (2003). First, we can write

√n (T_n − T) = √n ∫_{L_n}^{U_n} (y − T) w(D(y, F_n)) F_n(dy) / ∫_{L_n}^{U_n} w(D(y, F_n)) F_n(dy),   (5.1.10)

and the numerator can be decomposed into three terms:

I_1n = √n ∫_{L_n}^{U_n} (y − T) w(D(y, F_n)) F_n(dy) − √n ∫_{L}^{U} (y − T) w(D(y, F_n)) F_n(dy);
I_2n = √n ∫_{L}^{U} (y − T) w(D(y, F_n)) F_n(dy) − √n ∫_{L}^{U} (y − T) w(D(y, F)) F_n(dy);
I_3n = √n ∫_{L}^{U} (y − T) w(D(y, F)) F_n(dy).

It follows immediately that

I_3n = (1/√n) Σ_{i=1}^{n} g_3(X_i).   (5.1.11)

For I_2n, we note that

I_2n = ∫_{L}^{U} (y − T) w'(θ_n(y)) H_n(y) F_n(dy)
= ∫_{L}^{U} (y − T) w'(D(y, F)) H_n(y) F_n(dy) + ∫_{L}^{U} (y − T)(w'(θ_n(y)) − w'(D(y, F))) H_n(y) F_n(dy)
=: J_1n + J_2n,

where θ_n(y) is a point between D(y, F_n) and D(y, F). For J_2n, by Lemma 5.1.3, we have

J_2n ≤ ∫_{L}^{U} (|y| + |T|) |H_n(y)| |w'(θ_n(y)) − w'(D(y, F))| F_n(dy) = o_p(1).

On the other hand, by Lemma 5.1.3, the continuity of w', the boundedness of L and U, Fubini's theorem and the central limit theorem (decomposing IF(x; D(y, F)) into its location and scale components), we obtain

∫_{L}^{U} (y − T) w'(D(y, F)) H_n(y)(F_n − F)(dy)
= ∫_{L}^{U} (y − T) w'(D(y, F)) (∫ IF(x; D(y, F)) V_n(dx)) (F_n − F)(dy) + o_p(1)
= (1/√n) ∫∫ (y − T) w'(D(y, F)) IF(x; D(y, F)) V_n(dy) V_n(dx) + o_p(1)
= o_p(1),

which, in conjunction with Lemma 5.1.3 and Fubini's theorem, yields

J_1n = ∫_{L}^{U} (y − T) w'(D(y, F)) H_n(y) F(dy) + o_p(1)
= ∫ (∫_{L}^{U} (y − T) w'(D(y, F)) IF(x; D(y, F)) F(dy)) V_n(dx) + o_p(1)
= (1/√n) Σ_{i=1}^{n} g_2(X_i) + o_p(1).

Hence I_2n = (1/√n) Σ_{i=1}^{n} g_2(X_i) + o_p(1).
(5.1.12)

For $I_{1n}$, we note that
\[
I_{1n} = \sqrt{n} \int_{L_n}^{U_n} (y - T)\, w(D(y,F_n))\,F_n(dy) - \sqrt{n} \int_{L}^{U} (y - T)\, w(D(y,F_n))\,F_n(dy)
= \sqrt{n} \int_{L_n}^{L} (y - T)\, w(D(y,F_n))\,F_n(dy) + \sqrt{n} \int_{U}^{U_n} (y - T)\, w(D(y,F_n))\,F_n(dy) =: V_{1n} + V_{2n}.
\]
Now we deal with $V_{1n}$ only since $V_{2n}$ can be treated similarly. By the mean value theorem,
\[
V_{1n} = \sqrt{n} \int_{L_n}^{L} (y - T)\, w(D(y,F_n))\,F(dy) + \int_{L_n}^{L} (y - T)\, w(D(y,F_n))\,\nu_n(dy)
= -(\eta_n - T)\, w(D(\eta_n, F_n))\, f(\eta_n)\,\sqrt{n}\,(L_n - L) + \int_{L_n}^{L} (y - T)\, w(D(y,F_n))\,\nu_n(dy),
\]
where $\eta_n$ is a point between $L_n$ and $L$. Note that by the conditions given, we have
\[
(\eta_n - T)\, w(D(\eta_n, F_n))\, f(\eta_n)\,\sqrt{n}\,(L_n - L) = (L - T)\, w(D(L,F))\, f(L)\,\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathrm{IF}(X_i; L(F)) + o_p(1).
\]
Since $P(X = L) = 0$, it is readily seen that for large $n$ and $L^* = -1 - |L|$,
\[
\int_{L_n}^{L} (y - T)\, w(D(y,F_n))\,\nu_n(dy)
= -\int \big[ I_{(L^*, L_n)}(y) - I_{(L^*, L)}(y) \big](y - T)\, w(D(y,F_n))\,\nu_n(dy) + o_p(1) = o_p(1),
\]
by an empirical process theory argument; see Pollard (1984) or van der Vaart and Wellner (1996). Thus
\[
V_{1n} = -(L - T)\, w(D(L,F))\, f(L)\,\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathrm{IF}(X_i; L(F)) + o_p(1),
\]
which, combined with a similar result for $V_{2n}$, gives
\[
I_{1n} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_1(X_i) + o_p(1). \tag{5.1.13}
\]
In the same but much less involved manner we can show that
\[
\int_{L_n}^{U_n} w(D(x,F_n))\,F_n(dx) = \int_{L}^{U} w(D(x,F))\,F(dx) + O_p(1/\sqrt{n}). \tag{5.1.14}
\]
Now (5.1.11), (5.1.12), (5.1.13) and (5.1.14) give the desired result. $\Box$

PROOF OF THEOREM 2.4.3. The proof is very similar to that of Theorem 2.4.1. We adopt the notation in the proof of Theorem 2.4.1. Let
\[
w(D(y,F_n)) - w(D(y,F)) = w^{(1)}(\theta(y,F_n))\big(D(y,F_n) - D(y,F)\big).
\]
We need the following lemma whose proof is skipped here.

Lemma 5.1.4. Under the conditions of Theorem 2.4.3, we have
(a) $\sup_{y \in \mathbb{R}} (1 + |y|)\,\big|w^{(1)}(\theta(y,F_n)) - w^{(1)}(D(y,F))\big| = o_p(1)$; and
(b) $H_n(y) = y\,O_p(1) + O_p(1)$.

We first write
\[
T_w(F_n) - T_w(F) = \frac{1}{\int w(D(y,F_n))\,dF_n(y)} \Big[ \int_{L(F_n)}^{U(F_n)} w(D(y,F_n))(y - T_w)\,dF_n(y)
+ \int_{-\infty}^{L(F_n)} w(D(y,F_n))\big(L(F_n) - T_w\big)\,dF_n(y)
+ \int_{U(F_n)}^{\infty} w(D(y,F_n))\big(U(F_n) - T_w\big)\,dF_n(y) \Big].
\]
The given conditions guarantee that $w(D(y_n, F_n)) \to w(D(y,F))$ a.s. for every $y \in \mathbb{R}$ and every sequence $y_n \to y$.
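The decomposition of $T_w(F_n) - T_w(F)$ makes explicit that the Winsorized mean replaces observations outside $[L(F_n), U(F_n)]$ by the bounds themselves. For concreteness, a minimal numerical sketch; the bounds $L(F_n) = \mathrm{Med}_n - \beta\,\mathrm{MAD}_n$, $U(F_n) = \mathrm{Med}_n + \beta\,\mathrm{MAD}_n$ and a constant weight $w \equiv 1$ on the retained range are assumed special cases, and the function name is illustrative:

```python
import numpy as np

def scaled_dev_winsorized_mean(x, beta=3.0):
    """Illustrative Winsorized mean based on a scaled deviation.

    Assumed choices: L(F_n) = Med_n - beta*MAD_n, U(F_n) = Med_n + beta*MAD_n,
    and weight w identically 1; values outside [L, U] are pulled to the bounds.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    lo, hi = med - beta * mad, med + beta * mad   # L(F_n), U(F_n)
    return np.clip(x, lo, hi).mean()              # average of Winsorized sample
```

Unlike trimming, the clipped outliers still contribute (through the bounds), so the Winsorized mean is typically shifted slightly more by one-sided contamination than the trimmed mean, while remaining bounded.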
Skorokhod's representation theorem and Lebesgue's dominated convergence theorem imply immediately that
\[
\int w(D(y,F_n))\,dF_n(y) = \int w(D(y,F))\,dF(y) + o(1), \quad \text{a.s.} \tag{5.1.15}
\]
We now focus on the numerator. Call the three terms $I_i(F_n)$, $i = 1, 2, 3$, respectively. By the proof of Theorem 2.4.1, we see immediately that
\[
I_1(F_n) = \frac{1}{n} \sum_{i=1}^{n} \Big( (U - T_w)\, w(D(U,F)) f(U)\, \mathrm{IF}(X_i; U) - (L - T_w)\, w(D(L,F)) f(L)\, \mathrm{IF}(X_i; L)
+ \int_L^U (y - T_w)\, w^{(1)}(D(y,F))\, h(X_i, y)\,dF(y) + I(L \le X_i \le U)(X_i - T_w)\, w(D(X_i,F)) \Big) + o_p(n^{-1/2}). \tag{5.1.16}
\]
Now it suffices to treat $I_2(F_n)$. Following the proof of Theorem 2.4.1 and employing Lemmas 5.1.3 and 5.1.4, we have
\[
I_2(F_n) = \frac{1}{n} \sum_{i=1}^{n} \Big( I(X_i < L)\, w(D(X_i,F))(L - T_w) + \mathrm{IF}(X_i; L) \int_{-\infty}^{L} w(D(y,F))\,dF(y)
+ (L - T_w) \int_{-\infty}^{L} w^{(1)}(D(y,F))\, h(X_i, y)\,dF(y) + (L - T_w)\, w(D(L,F)) f(L)\, \mathrm{IF}(X_i; L) \Big) + o_p(n^{-1/2}). \tag{5.1.17}
\]
Likewise we have
\[
I_3(F_n) = \frac{1}{n} \sum_{i=1}^{n} \Big( I(X_i > U)\, w(D(X_i,F))(U - T_w) + \mathrm{IF}(X_i; U) \int_{U}^{\infty} w(D(y,F))\,dF(y)
+ (U - T_w) \int_{U}^{\infty} w^{(1)}(D(y,F))\, h(X_i, y)\,dF(y) - (U - T_w)\, w(D(U,F)) f(U)\, \mathrm{IF}(X_i; U) \Big) + o_p(n^{-1/2}). \tag{5.1.18}
\]
Combining the last four displays, we have the desired result. $\Box$

Proof of Theorem 3.3.2

Proof. Write
\[
\frac{1}{\varepsilon} I_\varepsilon = \frac{1}{\varepsilon} \int A(y,\varepsilon)(y^2 - s)\, w_2(D(y,F))\,dF(y)
- \int A(y,\varepsilon)(y^2 - s)\, w_2(D(y,F))\,dF(y)
+ \int A(y,\varepsilon)(y^2 - s)\, w_2^{(1)}(D(y,F))\, h(x,y)\,dF(y) + o(1),
\]
where $A(y,\varepsilon) = I\big(y \in [L(F_\varepsilon), U(F_\varepsilon)]\big) - I\big(y \in [L(F), U(F)]\big)$. Call the three terms with integration $I_{\varepsilon i}$, $i = 1, 2, 3$, respectively. It is obvious that $I_{\varepsilon 2}$ and $I_{\varepsilon 3}$ are $o(1)$ because of the boundedness of $L(F)$ and $U(F)$, Lemma 5.2.1, and Lebesgue's dominated convergence theorem. Conditions on $f$ and $w_2$, the mean value theorem and Lemma 5.2.1 imply that
\[
I_{\varepsilon 1} = \frac{1}{\varepsilon} \int \big( I\{y \in [L(F_\varepsilon), U(F_\varepsilon)]\} - I\{y \in [L(F), U(F)]\} \big)(y^2 - s)\, w_2(D(y,F))\,dF(y)
\]
\[
= \frac{1}{\varepsilon} \Big( \int_{U(F)}^{U(F_\varepsilon)} (y^2 - s)\, w_2(D(y,F))\,dF(y) - \int_{L(F)}^{L(F_\varepsilon)} (y^2 - s)\, w_2(D(y,F))\,dF(y) \Big)
\]
\[
= (\theta_{2\varepsilon}^2 - s)\, w_2(D(\theta_{2\varepsilon}, F)) f(\theta_{2\varepsilon})\big(\mathrm{IF}(x; U(F)) + o(1)\big)
- (\theta_{1\varepsilon}^2 - s)\, w_2(D(\theta_{1\varepsilon}, F)) f(\theta_{1\varepsilon})\big(\mathrm{IF}(x; L(F)) + o(1)\big)
\]
\[
= \big(U(F)^2 - s\big)\, w_2(D(U,F)) f(U)\, \mathrm{IF}(x; U(F)) - \big(L(F)^2 - s\big)\, w_2(D(L,F)) f(L)\, \mathrm{IF}(x; L(F)) + o(1),
\]
where $\theta_{2\varepsilon}$ is a point between $U(F)$ and $U(F_\varepsilon)$, and $\theta_{1\varepsilon}$ between $L(F)$ and $L(F_\varepsilon)$. Therefore the desired result now follows. $\Box$

Proof of Theorem 3.3.5

Proof. The proof is very similar to that of Theorem 3.3.2. We adopt the notation in the proof of Theorem 3.3.2. Let
\[
w_2(D(y,F_\varepsilon)) - w_2(D(y,F)) = w_2^{(1)}(\theta(y,F_\varepsilon))\big(D(y,F_\varepsilon) - D(y,F)\big).
\]
We need the following facts, whose proofs are omitted here.

Lemma 5.2.2. Under the conditions of Theorem 3.3.5, we have
(a) $\sup_{y \in \mathbb{R}} (1 + y^2)\,\big|w_2^{(1)}(\theta(y,F_\varepsilon)) - w_2^{(1)}(D(y,F))\big| = o(1)$;
(b) $\sup_{y \in \mathbb{R}} (1 + y^2)\,\big|w_2^{(1)}(\theta(y,F_\varepsilon))\big| < \infty$ for sufficiently small $\varepsilon > 0$; and
(c) $\big(D(y,F_\varepsilon) - D(y,F)\big)/\varepsilon = \mathrm{IF}(x; D(y,F)) + y\,o(1) + o(1)$.

First we write
\[
s_w(F_\varepsilon) - s_w(F) = \frac{1}{\int w_2(D(y,F_\varepsilon))\,dF_\varepsilon(y)} \Big[ \int_{L(F_\varepsilon)}^{U(F_\varepsilon)} w_2(D(y,F_\varepsilon))(y^2 - s_w)\,dF_\varepsilon(y)
+ \int_{-\infty}^{L(F_\varepsilon)} w_2(D(y,F_\varepsilon))\big(L^2(F_\varepsilon) - s_w\big)\,dF_\varepsilon(y)
+ \int_{U(F_\varepsilon)}^{\infty} w_2(D(y,F_\varepsilon))\big(U^2(F_\varepsilon) - s_w\big)\,dF_\varepsilon(y) \Big]. \tag{5.2.5}
\]
Lebesgue's dominated convergence theorem implies immediately that
\[
\int w_2(D(y,F_\varepsilon))\,dF_\varepsilon(y) = \int w_2(D(y,F))\,dF(y) + o(1). \tag{5.2.6}
\]
We now focus on the numerator of (5.2.5). Call the three terms $I_i(F_\varepsilon)$, $i = 1, 2, 3$, respectively. By the proof of Theorem 3.3.2, we see immediately that
\[
\frac{1}{\varepsilon} I_1(F_\varepsilon) = (U^2 - s_w)\, w_2(D(U,F)) f(U)\, \mathrm{IF}(x; U) - (L^2 - s_w)\, w_2(D(L,F)) f(L)\, \mathrm{IF}(x; L)
+ \int_L^U (y^2 - s_w)\, w_2^{(1)}(D(y,F))\, h(x,y)\,dF(y) + I(L \le x \le U)(x^2 - s_w)\, w_2(D(x,F))
+ \frac{1-\varepsilon}{\varepsilon} \int_L^U (y^2 - s_w)\, w_2(D(y,F))\,dF(y) + o(1). \tag{5.2.7}
\]
Now it suffices to treat $I_2(F_\varepsilon)$. Following the proof of Theorem 3.3.2 and employing Lemmas 5.2.1 and 5.2.2, we have
\[
\frac{1}{\varepsilon} I_2(F_\varepsilon) = (L^2 - s_w)\, I(x < L)\, w_2(D(x,F)) + (L^2 - s_w)\, w_2(D(L,F)) f(L)\, \mathrm{IF}(x; L)
+ (L^2 - s_w) \int_{-\infty}^{L(F)} w_2^{(1)}(D(y,F))\, h(x,y)\,dF(y)
+ 2 L(F)\, \mathrm{IF}(x; L(F)) \int_{-\infty}^{L(F)} w_2(D(y,F))\,dF(y) \tag{5.2.8}
\]
\[
\qquad + \frac{1-\varepsilon}{\varepsilon} \int_{-\infty}^{L} (L^2 - s_w)\, w_2(D(y,F))\,dF(y) + o(1). \tag{5.2.9}
\]
Likewise we have
\[
\frac{1}{\varepsilon} I_3(F_\varepsilon) = (U^2 - s_w)\, I(x > U)\, w_2(D(x,F)) - (U^2 - s_w)\, w_2(D(U,F)) f(U)\, \mathrm{IF}(x; U)
+ (U^2 - s_w) \int_{U(F)}^{\infty} w_2^{(1)}(D(y,F))\, h(x,y)\,dF(y)
+ 2 U(F)\, \mathrm{IF}(x; U(F)) \int_{U(F)}^{\infty} w_2(D(y,F))\,dF(y) \tag{5.2.10}
\]
\[
\qquad + \frac{1-\varepsilon}{\varepsilon} \int_{U}^{\infty} (U^2 - s_w)\, w_2(D(y,F))\,dF(y) + o(1). \tag{5.2.11}
\]
Combining the last four displays, we have the desired result. $\Box$

Proof of Theorem 3.4.1

Proof. For the sake of convenience, we define
\[
\nu_n = \sqrt{n}\,(F_n - F), \qquad H_n(\cdot) = \sqrt{n}\,\big(D(\cdot, F_n) - D(\cdot, F)\big). \tag{5.2.12}
\]
The following result is needed in the proof.

Lemma 5.2.3. Assume that $F' = f$ exists at $\mu$ and is continuous in small neighborhoods of $\mu$ and $\mu \pm \sigma$, with $f(\mu)$ and $f(\mu+\sigma) + f(\mu-\sigma)$ positive.
Then for $0 < \beta < \infty$, we have
(a) $\sup_{x \in [L,U]} (1 + x^2)\,|H_n(x)| = O_p(1)$; and
(b) $H_n(x) = \int h(y,x)\,\nu_n(dy) + o_p(1)$, uniformly on $x \in [L(F), U(F)]$.

Proof. For $x \in [L, U]$, it is readily seen that
\[
D(x, F_n) - D(x, F) = -\big(D(x,F)(\sigma_n - \sigma) + (\mu_n - \mu)\big)/\sigma_n.
\]
(a) follows immediately since the given conditions allow asymptotic representations for both $\mu_n$ and $\sigma_n$ (see, e.g., page 92 of Serfling (1980)), which lead to (b). $\Box$

Since $S^2(F)$ can be broken into the two parts $s(F)$ and $\Delta(F)$, and the proofs for the two parts are similar, we only prove
\[
s(F_n) - s(F) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{IF}(X_i; s(F)) + o_p\Big(\frac{1}{\sqrt{n}}\Big)
= \frac{1}{n} \sum_{i=1}^{n} \xi_1(X_i) + \frac{1}{n} \sum_{i=1}^{n} \xi_2(X_i) + \frac{1}{n} \sum_{i=1}^{n} \xi_3(X_i) + o_p\Big(\frac{1}{\sqrt{n}}\Big).
\]
First, observe that
\[
\sqrt{n}\,\big(s(F_n) - s(F)\big) = \sqrt{n} \int_{L_n}^{U_n} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy) \Big/ \int_{L_n}^{U_n} w_2(D(y,F_n))\,F_n(dy) \tag{5.2.13}
\]
and the numerator then can be decomposed into three terms:
\[
I_{1n} = \sqrt{n} \int_{L_n}^{U_n} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy) - \sqrt{n} \int_{L}^{U} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy);
\]
\[
I_{2n} = \sqrt{n} \int_{L}^{U} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy) - \sqrt{n} \int_{L}^{U} (y^2 - s)\, w_2(D(y,F))\,F_n(dy);
\]
\[
I_{3n} = \sqrt{n} \int_{L}^{U} (y^2 - s)\, w_2(D(y,F))\,F_n(dy).
\]
It follows immediately that
\[
I_{3n} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_3(X_i). \tag{5.2.14}
\]
For $I_{2n}$, we note that
\[
I_{2n} = \sqrt{n} \int_{L}^{U} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy) - \sqrt{n} \int_{L}^{U} (y^2 - s)\, w_2(D(y,F))\,F_n(dy)
= \int_{L}^{U} (y^2 - s)\, w_2^{(1)}(\theta_n(y))\, H_n(y)\,F_n(dy)
\]
\[
= \int_{L}^{U} (y^2 - s)\, w_2^{(1)}(D(y,F))\, H_n(y)\,F_n(dy)
+ \int_{L}^{U} (y^2 - s)\big(w_2^{(1)}(\theta_n(y)) - w_2^{(1)}(D(y,F))\big) H_n(y)\,F_n(dy) =: J_{1n} + J_{2n},
\]
where $\theta_n(y)$ is a point between $D(y, F_n)$ and $D(y, F)$. For $J_{2n}$, by using Lemma 5.2.3, we have
\[
J_{2n} = \int_{L}^{U} (y^2 - s)\big(w_2^{(1)}(\theta_n(y)) - w_2^{(1)}(D(y,F))\big) H_n(y)\,F_n(dy)
\le \int_{L}^{U} (y^2 + s)\,|H_n(y)|\,\big|w_2^{(1)}(\theta_n(y)) - w_2^{(1)}(D(y,F))\big|\,F_n(dy) = o_p(1).
\]
On the other hand, by Lemma 5.2.3, continuity of $w_2^{(1)}$, boundedness of $L$ and $U$, Fubini's theorem and the central limit theorem, we obtain
\[
\int_L^U (y^2 - s)\, w_2^{(1)}(D(y,F))\, H_n(y)\,(F_n - F)(dy)
= \int_L^U (y^2 - s)\, w_2^{(1)}(D(y,F)) \Big( \int h(x,y)\,\nu_n(dx) + o_p(1) \Big)(F_n - F)(dy)
\]
\[
= \frac{1}{\sqrt{n}} \iint (y^2 - s)\, w_2^{(1)}(D(y,F))\, \mathrm{IF}(x; D(y,F))\,\nu_n(dy)\,\nu_n(dx) + o_p(1)
\]
\[
= -\frac{1}{\sigma\sqrt{n}} \Big( \int_L^U (y^2 - s)\, w_2^{(1)}(D(y,F))\, D(y,F)\,\nu_n(dy) \int \mathrm{IF}(x; \sigma(F))\,\nu_n(dx)
+ \int_L^U (y^2 - s)\, w_2^{(1)}(D(y,F))\,\nu_n(dy) \int \mathrm{IF}(x; \mu(F))\,\nu_n(dx) \Big) + o_p(1) = o_p(1),
\]
which, in conjunction with Lemma 5.2.3 and Fubini's theorem, yields
\[
J_{1n} = \int_L^U (y^2 - s)\, w_2^{(1)}(D(y,F))\, H_n(y)\,F(dy) + o_p(1)
= \int \Big( \int_L^U (y^2 - s)\, w_2^{(1)}(D(y,F))\, h(x,y)\,F(dy) \Big)\nu_n(dx) + o_p(1)
= \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_2(X_i) + o_p(1).
\]
Hence
\[
I_{2n} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_2(X_i) + o_p(1). \tag{5.2.15}
\]
For $I_{1n}$, note that
\[
I_{1n} = \sqrt{n} \int_{L_n}^{U_n} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy) - \sqrt{n} \int_{L}^{U} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy)
= \sqrt{n} \int_{L_n}^{L} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy) + \sqrt{n} \int_{U}^{U_n} (y^2 - s)\, w_2(D(y,F_n))\,F_n(dy) =: V_{1n} + V_{2n}.
\]
Next we only deal with $V_{1n}$ since $V_{2n}$ can be treated similarly. By the mean value theorem,
\[
V_{1n} = \sqrt{n} \int_{L_n}^{L} (y^2 - s)\, w_2(D(y,F_n))\,F(dy) + \int_{L_n}^{L} (y^2 - s)\, w_2(D(y,F_n))\,\nu_n(dy)
= -(\eta_n^2 - s)\, w_2(D(\eta_n, F_n))\, f(\eta_n)\,\sqrt{n}\,(L_n - L) + \int_{L_n}^{L} (y^2 - s)\, w_2(D(y,F_n))\,\nu_n(dy),
\]
where $\eta_n$ is a point between $L_n$ and $L$. Note that by the conditions given, we have
\[
(\eta_n^2 - s)\, w_2(D(\eta_n, F_n))\, f(\eta_n)\,\sqrt{n}\,(L_n - L) = (L^2 - s)\, w_2(D(L,F))\, f(L)\,\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathrm{IF}(X_i; L(F)) + o_p(1).
\]
Since $P(X = L) = 0$, it is readily seen that for large $n$ and $L^* = -1 - |L|$,
\[
\int_{L_n}^{L} (y^2 - s)\, w_2(D(y,F_n))\,\nu_n(dy)
= \int \big[ I_{(L^*, L_n)}(y) - I_{(L^*, L)}(y) \big](y^2 - s)\, w_2(D(y,F_n))\,\nu_n(dy) + o_p(1) = o_p(1),
\]
by an empirical process theory argument; see Pollard (1984) or van der Vaart and Wellner (1996). Thus
\[
V_{1n} = -(L^2 - s)\, w_2(D(L,F))\, f(L)\,\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathrm{IF}(X_i; L(F)) + o_p(1), \tag{5.2.16}
\]
which, combined with a similar result for $V_{2n}$, gives
\[
I_{1n} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_1(X_i) + o_p(1). \tag{5.2.17}
\]
In the same but much less involved manner, we can show that
\[
\int_{L_n}^{U_n} w_2(D(y,F_n))\,F_n(dy) = \int_{L}^{U} w_2(D(y,F))\,F(dy) + O_p(1/\sqrt{n}). \tag{5.2.18}
\]
Now (5.2.14), (5.2.15), (5.2.17) and (5.2.18) give the desired result. $\Box$

Proof of Theorem 3.4.3. The proof is very similar to that of Theorem 3.4.1. We adopt the notation in the proof of Theorem 3.4.1. Let
\[
w_2(D(y,F_n)) - w_2(D(y,F)) = w_2^{(1)}(\theta(y,F_n))\big(D(y,F_n) - D(y,F)\big).
\]
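As a computational aside, the trimmed scale functional analyzed in Theorem 3.4.1 can also be sketched numerically. In the sketch below, the scaled deviation $D(y,F_n) = |y - \mathrm{Med}_n|/\mathrm{MAD}_n$, the hard weight $w_2(d) = I(d \le \beta)$, and centering at $\mathrm{Med}_n$ are all assumed choices; no Fisher-consistency rescaling is applied, and the function name is ours:

```python
import numpy as np

def scaled_dev_trimmed_std(x, beta=3.0):
    """Illustrative trimmed standard deviation based on a scaled deviation.

    Assumed choices: D(y, F_n) = |y - Med_n| / MAD_n, w_2(d) = 1{d <= beta},
    centering at the median, and no consistency correction factor.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    keep = np.abs(x - med) / mad <= beta          # retained observations
    return np.sqrt(np.mean((x[keep] - med) ** 2)) # trimmed second moment, rooted
```

Under the same 10% contamination at +50 used earlier, this trimmed scale stays near the scale of the clean data while the classical standard deviation explodes, which is the qualitative behavior the abstract claims for the trimmed standard deviations.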
We need the following lemma whose proof is skipped here.

Lemma 5.2.4. Under the conditions of Theorem 3.4.3, we have
(a) $\sup_{y \in \mathbb{R}} (1 + y^2)\,\big|w_2^{(1)}(\theta(y,F_n)) - w_2^{(1)}(D(y,F))\big| = o_p(1)$; and
(b) $H_n(y) = y\,O_p(1) + O_p(1)$.

We first write
\[
s_w(F_n) - s_w(F) = \frac{1}{\int w_2(D(y,F_n))\,dF_n(y)} \Big[ \int_{L_n}^{U_n} w_2(D(y,F_n))(y^2 - s_w)\,dF_n(y)
+ \int_{-\infty}^{L(F_n)} w_2(D(y,F_n))\big(L^2(F_n) - s_w\big)\,dF_n(y)
+ \int_{U(F_n)}^{\infty} w_2(D(y,F_n))\big(U^2(F_n) - s_w\big)\,dF_n(y) \Big].
\]
The given conditions guarantee that $w_2(D(y_n, F_n)) \to w_2(D(y,F))$ a.s. for every $y \in \mathbb{R}$ and every sequence $y_n \to y$. Skorokhod's representation theorem and Lebesgue's dominated convergence theorem imply immediately that
\[
\int w_2(D(y,F_n))\,dF_n(y) = \int w_2(D(y,F))\,dF(y) + o(1), \quad \text{a.s.} \tag{5.2.19}
\]
We now focus on the numerator. Call the three terms $I_i(F_n)$, $i = 1, 2, 3$, respectively. By the proof of Theorem 3.4.1, we see immediately that
\[
I_1(F_n) = \frac{1}{n} \sum_{i=1}^{n} \Big( (U^2 - s_w)\, w_2(D(U,F)) f(U)\, \mathrm{IF}(X_i; U) - (L^2 - s_w)\, w_2(D(L,F)) f(L)\, \mathrm{IF}(X_i; L)
+ \int_L^U (y^2 - s_w)\, w_2^{(1)}(D(y,F))\, h(X_i, y)\,dF(y) + I(L \le X_i \le U)(X_i^2 - s_w)\, w_2(D(X_i,F)) \Big) + o_p(n^{-1/2}). \tag{5.2.20}
\]
Now it suffices to treat $I_2(F_n)$. Following the proof of Theorem 3.4.1 and employing Lemmas 5.2.3 and 5.2.4, we have
\[
I_2(F_n) = \frac{1}{n} \sum_{i=1}^{n} \Big( (L^2 - s_w)\, I(X_i < L)\, w_2(D(X_i,F)) + (L^2 - s_w)\, w_2(D(L,F)) f(L)\, \mathrm{IF}(X_i; L)
+ (L^2 - s_w) \int_{-\infty}^{L(F)} w_2^{(1)}(D(y,F))\, h(X_i, y)\,dF(y)
+ 2 L(F)\, \mathrm{IF}(X_i; L(F)) \int_{-\infty}^{L(F)} w_2(D(y,F))\,dF(y) \Big) + o_p(n^{-1/2}). \tag{5.2.21}
\]
Likewise we have
\[
I_3(F_n) = \frac{1}{n} \sum_{i=1}^{n} \Big( (U^2 - s_w)\, I(X_i > U)\, w_2(D(X_i,F)) - (U^2 - s_w)\, w_2(D(U,F)) f(U)\, \mathrm{IF}(X_i; U) \tag{5.2.22}
+ (U^2 - s_w) \int_{U(F)}^{\infty} w_2^{(1)}(D(y,F))\, h(X_i, y)\,dF(y)
+ 2 U(F)\, \mathrm{IF}(X_i; U(F)) \int_{U(F)}^{\infty} w_2(D(y,F))\,dF(y) \Big) + o_p(n^{-1/2}). \tag{5.2.23}
\]
Combining the last four displays, we have the desired result. $\Box$

BIBLIOGRAPHY

[1] Agullo, J., Croux, C., and Van Aelst, S. (2002). The multivariate least trimmed squares estimator. Submitted.

[2] Bai, Z.D., Chen, X.R., Miao, B.Q., and Rao, C.R. (1990). Asymptotic theory of least distance estimate in multivariate linear models. Statistics 21 503-519.

[3] Bickel, P. J. (1965). On some robust estimates of location. Ann. Math. Statist. 36 847-858.

[4] Donoho, D. L., and Huber, P. J. (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann (P. J. Bickel, K. A. Doksum and J. L. Hodges, Jr., eds.) 157-184. Wadsworth, Belmont, CA.

[5] Jaeckel, L. A. (1971). Some flexible estimates of location. Ann. Math. Statist. 42 1540-1552.

[6] Jureckova, J., Koenker, R. and Welsh, A. H. (1994).
Adaptive choice of trimming proportions. Ann. Inst. Statist. Math. 46 737-755.

[7] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.

[8] Hogg, R. V. (1974). Adaptive robust procedures: A partial review and some suggestions for future applications and theory. J. Amer. Statist. Assoc. 69 909-923.

[9] Kim, S. (1992). The metrically trimmed mean as a robust estimator of location. Ann. Statist. 20 1534-1547.

[10] Koenker, R. and Portnoy, S. (1990). M-estimation of multivariate regressions. J. Amer. Statist. Assoc. 85 1060-1068.

[11] Maronna, R.A. and Yohai, V.J. (1997). Robust estimation in simultaneous equations models. J. Statist. Plann. Inference 57 233-244.

[12] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

[13] Rousseeuw, P.J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79 871-880.

[14] Rousseeuw, P.J., Van Aelst, S., Van Driessen, K., and Agullo, J. (2001). Robust multivariate regression. Submitted.

[15] Rousseeuw, P.J. and Van Driessen, K. (2002). Computing LTS regression for large data sets. Estadistica 54 163-190.

[16] Serfling, R. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York.

[17] Shorack, G. R. (1974). Random means. Ann. Statist. 2 661-675.

[18] Rousseeuw, P.J. and Croux, C. (1993). Alternatives to the median absolute deviation. J. Amer. Statist. Assoc. 88 1273-1283.

[19] Stigler, S. M. (1973). The asymptotic distribution of the trimmed mean. Ann. Statist. 1 472-477.

[20] Stigler, S. M. (1977). Do robust estimators work with real data? Ann. Statist. 5 1055-1077.

[21] Tukey, J. W. (1948). Some elementary problems of importance to small sample practice. Human Biology 20 205-214.

[22] van der Vaart, A. W., and Wellner, J. A. (1996).
Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.

[23] Welsh, A.H. and Morrison, H.L. (1990). Robust L-estimation of scale with an application in astronomy. J. Amer. Statist. Assoc. 85 729-743.

[24] Zuo, Y. (2003). Projection depth trimmed means for multivariate data: robustness and efficiency. (The latest version, Multi-dimensional trimming based on projection depth, was tentatively accepted by the Annals of Statistics in 2004.)